Scraping JSON-LD from a Next.js Site with Crawl4AI: My Debugging Journey

Scraping data from modern websites can feel like a puzzle, especially when they’re built with Next.js and all that fancy JavaScript magic. Recently, I needed to pull some product info—like names, prices, and a few extra details—from an e-commerce page that was giving me a headache. The site (let’s just call it https://shop.example.com/products/[hidden-stuff]) used JSON-LD tucked inside a <script> tag, but my first attempts with Crawl4AI came up empty. Here’s how I cracked it, step by step, and got the data I wanted.

The Headache: Empty Results from a Next.js Page

I was trying to grab details from a product page—think stuff like the item name, description, member vs. non-member prices, and some category info. The JSON-LD looked something like this (I’ve swapped out the real details for a fake example):

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "provider": {
    "@type": "Organization",
    "name": "Bean Enthusiast Co."
  },
  "offers": [
    {"@type": "Offer", "price": 49.99, "priceCurrency": "USD"},
    {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"}
  ],
  "skillLevel": "Beginner",
  "hasWorkshop": [
    {
      "@type": "WorkshopInstance",
      "deliveryMethod": "Online",
      "workshopSchedule": {"startDate": "2024-08-15"}
    }
  ]
}

My goal was to extract this, label the cheaper price as “member” and the higher one as “non-member,” and snag extras like skillLevel and deliveryMethod. Simple, right? Nope. My first stab at it with Crawl4AI gave me nothing—just an empty [].

What Went Wrong: Next.js Threw Me a Curveball

Next.js loves doing things dynamically, which means the JSON-LD I saw in my browser’s dev tools wasn’t always in the raw HTML Crawl4AI fetched. I started with this basic setup:

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Schema",
    "baseSelector": "script[type='application/ld+json']",
    "fields": [{"name": "json_ld_content", "selector": "script[type='application/ld+json']", "type": "text"}]
}

async def extract_data(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=JsonCssExtractionStrategy(schema))
        extracted_data = json.loads(result.extracted_content)
        print(extracted_data)

# Output: []

Empty. Zilch. I dug into the debug output and saw the JSON-LD was in result.html, but result.extracted_content was blank. Turns out, Next.js was injecting that <script> tag after the page loaded, and Crawl4AI wasn’t catching it without some extra nudging.

How I Fixed It: A Workaround That Worked

After banging my head against the wall, I figured out I needed to make Crawl4AI wait for the JavaScript to do its thing and then grab the JSON-LD myself from the HTML. Here’s the code that finally worked:

import json
import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_product_schema(url):
    async with AsyncWebCrawler(verbose=True, user_agent="Mozilla/5.0") as crawler:
        print(f"Checking out: {url}")
        result = await crawler.arun(
            url=url,
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",  # Wake up the page
                "await new Promise(resolve => setTimeout(resolve, 5000));"  # Give it 5 seconds
            ],
            bypass_cache=True,
            timeout=30
        )

        if not result.success:
            print(f"Oops, something broke: {result.error_message}")
            return None

        # Digging into the HTML myself
        html = result.html
        start_marker = '<script type="application/ld+json">'
        end_marker = '</script>'
        start_idx = html.find(start_marker) + len(start_marker)
        end_idx = html.find(end_marker, start_idx)

        if start_idx == -1 or end_idx == -1:
            print("Couldn’t find the JSON-LD.")
            return None

        json_ld_raw = html[start_idx:end_idx].strip()
        json_ld = json.loads(json_ld_raw)

        # Sorting out the product details
        if json_ld.get("@type") == "Product":
            offers = sorted(
                [{"price": o.get("price"), "priceCurrency": o.get("priceCurrency")} for o in json_ld.get("offers", [])],
                key=lambda x: x["price"]
            )
            workshop_instances = json_ld.get("hasWorkshop", [])
            schedule = workshop_instances[0].get("workshopSchedule", {}) if workshop_instances else {}
            
            product_info = {
                "name": json_ld.get("name"),
                "description": json_ld.get("description"),
                "providerName": json_ld.get("provider", {}).get("name"),
                "memberPrice": offers[0] if offers else None,
                "nonMemberPrice": offers[-1] if offers else None,
                "skillLevel": json_ld.get("skillLevel"),
                "deliveryMethod": workshop_instances[0].get("deliveryMethod") if workshop_instances else None,
                "startDate": schedule.get("startDate")
            }
            return product_info
        print("No product data here.")
        return None

async def main():
    url = "https://shop.example.com/products/[hidden-stuff]"
    product_data = await extract_product_schema(url)
    if product_data:
        print("Here’s what I got:")
        print(json.dumps(product_data, indent=2))

if __name__ == "__main__":
    asyncio.run(main())

What I Got Out of It

{
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "providerName": "Bean Enthusiast Co.",
  "memberPrice": {
    "price": 49.99,
    "priceCurrency": "USD"
  },
  "nonMemberPrice": {
    "price": 59.99,
    "priceCurrency": "USD"
  },
  "skillLevel": "Beginner",
  "deliveryMethod": "Online",
  "startDate": "2024-08-15"
}

How I Made It Work

Waiting for JavaScript: I told Crawl4AI to scroll and hang out for 5 seconds with js_code. That gave Next.js time to load everything up.DIY Parsing: The built-in extractor wasn’t cutting it, so I searched the HTML for the <script> tag and pulled the JSON-LD out myself.Price Tags: Sorted the prices and called the lowest “member” and the highest “non-member”—seemed like a safe bet for this site.

What I Learned Along the Way

Next.js is Tricky: It’s not just about the HTML you get—it’s about what shows up after the JavaScript runs. Timing is everything.
Sometimes You Gotta Get Hands-On: When the fancy tools didn’t work, digging into the raw HTML saved me.
Debugging Pays Off: Printing out the HTML and extractor output showed me exactly where things were going wrong.

Technical Insights: Azure, .NET, Dynamics 365 & EV Charging Architecture

The Headache: Empty Results from a Next.js Page

What Went Wrong: Next.js Threw Me a Curveball

How I Fixed It: A Workaround That Worked

What I Got Out of It

How I Made It Work

What I Learned Along the Way

Leave a Reply Cancel reply

Scraping JSON-LD from a Next.js Site with Crawl4AI: My Debugging Journey

The Headache: Empty Results from a Next.js Page

What Went Wrong: Next.js Threw Me a Curveball

How I Fixed It: A Workaround That Worked

What I Got Out of It

How I Made It Work

What I Learned Along the Way

Azure Service Bus Peek-Lock: A Comprehensive Guide to Reliable Message Processing

What is OCPP? A Complete Guide to the EV Charging Communication Protocol

Leave a Reply Cancel reply