Technical Insights: Azure, .NET, Dynamics 365 & EV Charging Architecture

Month: February 2025

Scraping JSON-LD from a Next.js Site with Crawl4AI: My Debugging Journey

Scraping data from modern websites can feel like a puzzle, especially when they’re built with Next.js and all that fancy JavaScript magic. Recently, I needed to pull some product info—like names, prices, and a few extra details—from an e-commerce page that was giving me a headache. The site (let’s just call it https://shop.example.com/products/[hidden-stuff]) used JSON-LD tucked inside a <script> tag, but my first attempts with Crawl4AI came up empty. Here’s how I cracked it, step by step, and got the data I wanted.

The Headache: Empty Results from a Next.js Page

I was trying to grab details from a product page—think stuff like the item name, description, member vs. non-member prices, and some category info. The JSON-LD looked something like this (I’ve swapped out the real details for a fake example):

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "provider": {
    "@type": "Organization",
    "name": "Bean Enthusiast Co."
  },
  "offers": [
    {"@type": "Offer", "price": 49.99, "priceCurrency": "USD"},
    {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"}
  ],
  "skillLevel": "Beginner",
  "hasWorkshop": [
    {
      "@type": "WorkshopInstance",
      "deliveryMethod": "Online",
      "workshopSchedule": {"startDate": "2024-08-15"}
    }
  ]
}

My goal was to extract this, label the cheaper price as “member” and the higher one as “non-member,” and snag extras like skillLevel and deliveryMethod. Simple, right? Nope. My first stab at it with Crawl4AI gave me nothing—just an empty [].

What Went Wrong: Next.js Threw Me a Curveball

Next.js loves doing things dynamically, which means the JSON-LD I saw in my browser’s dev tools wasn’t always in the raw HTML Crawl4AI fetched. I started with this basic setup:

import json  # needed for json.loads below

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Schema",
    "baseSelector": "script[type='application/ld+json']",
    "fields": [{"name": "json_ld_content", "selector": "script[type='application/ld+json']", "type": "text"}]
}

async def extract_data(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=JsonCssExtractionStrategy(schema))
        extracted_data = json.loads(result.extracted_content)
        print(extracted_data)

# Output: []

Empty. Zilch. I dug into the debug output and saw the JSON-LD was in result.html, but result.extracted_content was blank. Turns out, Next.js was injecting that <script> tag after the page loaded, and Crawl4AI wasn’t catching it without some extra nudging.

How I Fixed It: A Workaround That Worked

After banging my head against the wall, I figured out I needed to make Crawl4AI wait for the JavaScript to do its thing and then grab the JSON-LD myself from the HTML. Here’s the code that finally worked:

import json
import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_product_schema(url):
    async with AsyncWebCrawler(verbose=True, user_agent="Mozilla/5.0") as crawler:
        print(f"Checking out: {url}")
        result = await crawler.arun(
            url=url,
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",  # Wake up the page
                "await new Promise(resolve => setTimeout(resolve, 5000));"  # Give it 5 seconds
            ],
            bypass_cache=True,
            timeout=30
        )

        if not result.success:
            print(f"Oops, something broke: {result.error_message}")
            return None

        # Digging into the HTML myself
        html = result.html
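        start_marker = '<script type="application/ld+json">'
        end_marker = '</script>'

        # Check find() before adding the marker length; otherwise -1 is masked
        marker_pos = html.find(start_marker)
        if marker_pos == -1:
            print("Couldn’t find the JSON-LD.")
            return None

        start_idx = marker_pos + len(start_marker)
        end_idx = html.find(end_marker, start_idx)
        if end_idx == -1:
            print("Couldn’t find the end of the JSON-LD block.")
            return None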

        json_ld_raw = html[start_idx:end_idx].strip()
        json_ld = json.loads(json_ld_raw)

        # Sorting out the product details
        if json_ld.get("@type") == "Product":
            offers = sorted(
                [{"price": o.get("price"), "priceCurrency": o.get("priceCurrency")} for o in json_ld.get("offers", [])],
                key=lambda x: x["price"]
            )
            workshop_instances = json_ld.get("hasWorkshop", [])
            schedule = workshop_instances[0].get("workshopSchedule", {}) if workshop_instances else {}
            
            product_info = {
                "name": json_ld.get("name"),
                "description": json_ld.get("description"),
                "providerName": json_ld.get("provider", {}).get("name"),
                "memberPrice": offers[0] if offers else None,
                "nonMemberPrice": offers[-1] if offers else None,
                "skillLevel": json_ld.get("skillLevel"),
                "deliveryMethod": workshop_instances[0].get("deliveryMethod") if workshop_instances else None,
                "startDate": schedule.get("startDate")
            }
            return product_info
        print("No product data here.")
        return None

async def main():
    url = "https://shop.example.com/products/[hidden-stuff]"
    product_data = await extract_product_schema(url)
    if product_data:
        print("Here’s what I got:")
        print(json.dumps(product_data, indent=2))

if __name__ == "__main__":
    asyncio.run(main())

What I Got Out of It

{
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "providerName": "Bean Enthusiast Co.",
  "memberPrice": {
    "price": 49.99,
    "priceCurrency": "USD"
  },
  "nonMemberPrice": {
    "price": 59.99,
    "priceCurrency": "USD"
  },
  "skillLevel": "Beginner",
  "deliveryMethod": "Online",
  "startDate": "2024-08-15"
}

How I Made It Work

  • Waiting for JavaScript: I told Crawl4AI to scroll and hang out for 5 seconds with js_code. That gave Next.js time to load everything up.
  • DIY Parsing: The built-in extractor wasn’t cutting it, so I searched the HTML for the <script> tag and pulled the JSON-LD out myself.
  • Price Tags: Sorted the prices and called the lowest “member” and the highest “non-member”, which seemed like a safe bet for this site.
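The price-labeling rule can be pulled out into its own small function. This is only a sketch under a few assumptions that go beyond the original script: the helper names label_offers and _to_price are made up for illustration, prices may arrive as strings, and "offers" may be a single object instead of a list. Offers without a usable price are skipped so the sort cannot fail.

```python
def _to_price(value):
    """Return the price as a float, or None if it is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def label_offers(offers):
    """Label the cheapest offer as member and the most expensive as non-member.

    Hypothetical helper: the cheapest-equals-member rule is a guess that
    happened to fit this one site, not something schema.org defines.
    """
    # "offers" may be a single object rather than a list.
    if isinstance(offers, dict):
        offers = [offers]

    priced = []
    for offer in offers or []:
        price = _to_price(offer.get("price"))
        if price is not None:
            priced.append({"price": price, "priceCurrency": offer.get("priceCurrency")})

    priced.sort(key=lambda o: o["price"])
    if not priced:
        return {"memberPrice": None, "nonMemberPrice": None}
    return {"memberPrice": priced[0], "nonMemberPrice": priced[-1]}


if __name__ == "__main__":
    sample = [
        {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"},
        {"@type": "Offer", "price": "49.99", "priceCurrency": "USD"},
    ]
    print(label_offers(sample))
```

With only one offer on the page, both labels point at the same price, which is worth checking for before trusting the "member" label.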

What I Learned Along the Way

  • Next.js is Tricky: It’s not just about the HTML you get—it’s about what shows up after the JavaScript runs. Timing is everything.
  • Sometimes You Gotta Get Hands-On: When the fancy tools didn’t work, digging into the raw HTML saved me.
  • Debugging Pays Off: Printing out the HTML and extractor output showed me exactly where things were going wrong.
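For the hands-on parsing, a plain string search only works when the tag is written exactly as <script type="application/ld+json"> and there is a single block on the page. Here is a slightly sturdier sketch that uses Python's standard html.parser instead. It is not what the script in this post does; the names JsonLdCollector, extract_json_ld, and find_by_type are invented for this example, and it assumes you already have the rendered HTML as a string (for instance from result.html).

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            script_type = (dict(attrs).get("type") or "").strip().lower()
            self._in_json_ld = script_type == "application/ld+json"
            self._chunks = []

    def handle_data(self, data):
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self.blocks.append("".join(self._chunks))
            self._in_json_ld = False


def extract_json_ld(html):
    """Return every JSON-LD object found in the HTML, skipping invalid blocks."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()

    items = []
    for raw in collector.blocks:
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # not valid JSON, ignore this block
        # A block may hold one object or a list of objects.
        items.extend(parsed if isinstance(parsed, list) else [parsed])
    return items


def find_by_type(items, wanted_type):
    """Return the first JSON-LD object whose @type matches, or None."""
    for item in items:
        if isinstance(item, dict) and item.get("@type") == wanted_type:
            return item
    return None
```

Calling find_by_type(extract_json_ld(html), "Product") then gives the product object regardless of extra attributes on the tag or other JSON-LD blocks that appear before it.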

Azure Service Bus Peek-Lock: A Comprehensive Guide to Reliable Message Processing

Working with Peek-Lock in Azure Service Bus: A Practical Guide

In many distributed systems, reliable message handling is a top priority. When I first started building an order processing application, I learned very quickly that losing even one message could cause major headaches. That’s exactly where Azure Service Bus and its Peek-Lock mode came to the rescue. By using Peek-Lock, you don’t remove the message from the queue as soon as you receive it. Instead, you lock it for a certain period, process it, and then decide what to do next—complete, abandon, dead-letter, or defer. Here’s how it all fits together.

Why Peek-Lock Matters

Peek-Lock is one of the two receiving modes offered by Azure Service Bus. The other is Receive and Delete, which automatically removes messages from the queue upon receipt. While that might be fine for scenarios where occasional message loss is acceptable, many real-world applications need stronger guarantees.

  1. Reliability: With Peek-Lock, if processing fails, you can abandon the message. This makes it visible again for another attempt, reducing the risk of data loss.
  2. Explicit Control: You decide when a message is removed. After you successfully handle the message (e.g., update a database or complete a transaction), you explicitly mark it as complete.
  3. Error Handling: If the same message repeatedly fails, you can dead-letter it for investigation. This helps avoid getting stuck in an endless processing loop.
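To make those three points concrete, here is a toy in-memory model of the receive, complete, abandon, and dead-letter cycle. It is written in Python so it runs without an Azure subscription, and it is not the Service Bus SDK: ToyPeekLockQueue, process_all, and max_delivery_count are hypothetical names chosen for this sketch, and real queues also track lock timeouts, which this model leaves out.

```python
from collections import deque


class ToyPeekLockQueue:
    """In-memory stand-in for a queue with peek-lock semantics (not the Azure SDK)."""

    def __init__(self, max_delivery_count=3):
        self.max_delivery_count = max_delivery_count
        self._available = deque()
        self._locked = {}
        self.dead_letters = []
        self._next_id = 0

    def send(self, body):
        self._next_id += 1
        self._available.append({"id": self._next_id, "body": body, "delivery_count": 0})

    def receive(self):
        """Lock and return the next message without removing it, or None if empty."""
        if not self._available:
            return None
        message = self._available.popleft()
        message["delivery_count"] += 1
        self._locked[message["id"]] = message
        return message

    def complete(self, message):
        """Processing succeeded: remove the message for good."""
        del self._locked[message["id"]]

    def abandon(self, message):
        """Processing failed: release the lock, or dead-letter after too many tries."""
        del self._locked[message["id"]]
        if message["delivery_count"] >= self.max_delivery_count:
            self.dead_letters.append(message)
        else:
            self._available.appendleft(message)


def process_all(queue, handler):
    """Drain the queue, completing on success and abandoning on any exception."""
    while (message := queue.receive()) is not None:
        try:
            handler(message["body"])
        except Exception:
            queue.abandon(message)
        else:
            queue.complete(message)
```

A handler that always fails for one message ends up with that message in dead_letters after max_delivery_count attempts, while the other messages are completed normally.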

What Happens If the Lock Expires?

The lock is held for the lock duration configured on the queue or subscription (one minute by default according to the Service Bus documentation, and adjustable up to five minutes). If your code doesn’t complete or abandon the message before the lock expires, the message becomes visible to other receivers. To handle potentially lengthy processes, you can renew the lock programmatically, although that introduces additional complexity. The key takeaway is that you should design your service to either complete or abandon messages quickly, or renew the lock if more time is truly necessary.
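The timing logic behind lock renewal can be sketched with a simulated clock. This is a toy model in Python, not Service Bus code: LockedMessage, process_with_renewal, and the renew_margin parameter are assumptions made for illustration, and times are plain numbers of seconds rather than real timestamps. Renewing resets the lock to one full lock duration counted from the moment of renewal.

```python
class LockedMessage:
    """A message whose lock is valid until `locked_until` (seconds on a fake clock)."""

    def __init__(self, body, now, lock_duration):
        self.body = body
        self.lock_duration = lock_duration
        self.locked_until = now + lock_duration

    def is_lock_valid(self, now):
        return now < self.locked_until

    def renew(self, now):
        """Extend the lock by one full lock duration, if it has not expired yet."""
        if not self.is_lock_valid(now):
            raise RuntimeError("lock already expired; the message may be redelivered")
        self.locked_until = now + self.lock_duration


def process_with_renewal(message, step_durations, start=0.0, renew_margin=10.0):
    """Run processing steps on a simulated clock, renewing before the lock runs out.

    Returns (finished_at, renewals).
    """
    now = start
    renewals = 0
    for duration in step_durations:
        # Renew if the next step would end within the safety margin of expiry.
        if now + duration > message.locked_until - renew_margin:
            message.renew(now)
            renewals += 1
        now += duration
        if not message.is_lock_valid(now):
            raise RuntimeError("lock expired mid-step; the step is too long for one lock period")
    return now, renewals


if __name__ == "__main__":
    msg = LockedMessage("order-1", now=0.0, lock_duration=60.0)
    print(process_with_renewal(msg, [20, 20, 20, 20]))
```

The model also shows the limit of renewal: a single step that runs longer than one lock duration cannot be saved by renewing between steps, so long work needs to be split up or renewed from a background timer.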

Default Peek-Lock in Azure Functions

When you use Azure Service Bus triggers in Azure Functions, you generally don’t need to configure or manage the Peek-Lock behavior yourself. According to the official documentation, the default behavior in Azure Functions is already set to Peek-Lock. This means you can focus on your function’s core logic without explicitly dealing with message locking or completion in most scenarios.

Don’t Swallow Exceptions

One important detail to note is that in Azure Functions, any unhandled exceptions in your function code will signal to the runtime that message processing failed. This prevents the function from automatically completing the message, allowing the Service Bus to retry later. However, if you wrap your logic in a try/catch block and inadvertently swallow the exception—meaning you catch the error without rethrowing or handling it properly—you might unintentionally signal success. That would lead to the message being completed even though a downstream service might have failed.

Recommendation:

  • If you must use a try/catch, make sure errors are re-thrown or handled in a way that indicates failure if the message truly hasn’t been processed successfully. Otherwise, you’ll end up completing the message and losing valuable information about the error.
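The difference between swallowing and re-throwing is easy to see in a toy model. The sketch below is Python rather than a real Azure Function, and run_trigger is a hypothetical stand-in for the Functions runtime: it completes the message when the handler returns normally and abandons it when the handler raises. The handler and call_downstream names are also made up for the example.

```python
def run_trigger(handler, message):
    """Toy stand-in for a trigger runtime: complete on normal return, abandon on error."""
    try:
        handler(message)
    except Exception:
        return "abandoned"  # message stays on the queue for another attempt
    return "completed"      # message is removed


def call_downstream(message):
    """Simulated dependency that is currently failing."""
    raise ConnectionError("downstream service unavailable")


def swallowing_handler(message):
    try:
        call_downstream(message)
    except Exception as exc:
        # Logged but not re-raised: the runtime sees a normal return.
        print(f"error: {exc}")


def rethrowing_handler(message):
    try:
        call_downstream(message)
    except Exception as exc:
        print(f"error: {exc}")
        raise  # let the runtime know processing failed


if __name__ == "__main__":
    print(run_trigger(swallowing_handler, "order-1"))
    print(run_trigger(rethrowing_handler, "order-1"))
```

Both handlers log the same error, but only the second one leaves the message available for a retry.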

Typical Use Cases

  1. Financial Transactions: Losing a message that represents a monetary transaction is not an option. Peek-Lock keeps each message available until your code confirms that it was processed successfully.
  2. Critical Notifications: If you have an alerting system that notifies users about important events, you don’t want those notifications disappearing in case of a crash.
  3. Order Processing: In ecommerce or supply chain scenarios, every order message has to be accounted for. Peek-Lock helps avoid partial or lost orders due to transient errors.

Example in C#

Here’s a short snippet that demonstrates how you can receive messages in Peek-Lock mode using the Azure.Messaging.ServiceBus library:

using System;
using System.Threading.Tasks;
using Azure.Messaging.ServiceBus;

public class PeekLockExample
{
    private const string ConnectionString = "<YOUR_SERVICE_BUS_CONNECTION_STRING>";
    private const string QueueName = "<YOUR_QUEUE_NAME>";

    public async Task RunPeekLockSample()
    {
        // Create a Service Bus client
        var client = new ServiceBusClient(ConnectionString);

        // Create a receiver in Peek-Lock mode
        var receiver = client.CreateReceiver(
            QueueName, 
            new ServiceBusReceiverOptions 
            { 
                ReceiveMode = ServiceBusReceiveMode.PeekLock 
            }
        );

        try
        {
            // Attempt to receive a single message
            ServiceBusReceivedMessage message = await receiver.ReceiveMessageAsync(TimeSpan.FromSeconds(10));

            if (message != null)
            {
                // Process the message
                string body = message.Body.ToString();
                Console.WriteLine($"Processing message: {body}");

                // If processing is successful, complete the message
                await receiver.CompleteMessageAsync(message);
                Console.WriteLine("Message completed and removed from the queue.");
            }
            else
            {
                Console.WriteLine("No messages were available to receive.");
            }
        }
        catch (Exception ex)
        {
            Console.WriteLine($"An error occurred: {ex.Message}");
            // Optionally handle or log the exception
        }
        finally
        {
            // Clean up resources
            await receiver.CloseAsync();
            await client.DisposeAsync();
        }
    }
}

What’s Happening Here?

  • We create a ServiceBusClient to connect to Azure Service Bus.
  • We specify ServiceBusReceiveMode.PeekLock when creating the receiver.
  • The code then attempts to receive one message and processes it.
  • If everything goes smoothly, we call CompleteMessageAsync to remove it from the queue. If something goes wrong, the message remains locked until the lock expires or until we choose to abandon it.

Final Thoughts

Peek-Lock strikes a balance between reliability and performance. It ensures you won’t lose critical data while giving you the flexibility to handle errors gracefully. Whether you’re dealing with financial operations, critical user notifications, or any scenario where each message must be processed correctly, Peek-Lock is an indispensable tool in your Azure Service Bus arsenal.

In Azure Functions, you get this benefit without having to manage the locking details, so long as you don’t accidentally swallow your exceptions. For other applications, adopting Peek-Lock might demand a bit more coding, but it’s well worth it if you need guaranteed, at-least-once message delivery.

Whether you’re building a simple queue-based workflow or a complex event-driven system, the rule is the same: a message stays on the queue until your code says it was processed successfully. That makes Peek-Lock a must-know feature for developers relying on Azure Service Bus.
