Scraping data from modern websites can feel like a puzzle, especially when they’re built with Next.js and all that fancy JavaScript magic. Recently, I needed to pull some product info—like names, prices, and a few extra details—from an e-commerce page that was giving me a headache. The site (let’s just call it https://shop.example.com/products/[hidden-stuff]) used JSON-LD tucked inside a <script> tag, but my first attempts with Crawl4AI came up empty. Here’s how I cracked it, step by step, and got the data I wanted.
The Headache: Empty Results from a Next.js Page
I was trying to grab details from a product page—think stuff like the item name, description, member vs. non-member prices, and some category info. The JSON-LD looked something like this (I’ve swapped out the real details for a fake example):
```json
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "provider": {
    "@type": "Organization",
    "name": "Bean Enthusiast Co."
  },
  "offers": [
    {"@type": "Offer", "price": 49.99, "priceCurrency": "USD"},
    {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"}
  ],
  "skillLevel": "Beginner",
  "hasWorkshop": [
    {
      "@type": "WorkshopInstance",
      "deliveryMethod": "Online",
      "workshopSchedule": {"startDate": "2024-08-15"}
    }
  ]
}
```
My goal was to extract this, label the cheaper price as “member” and the higher one as “non-member,” and snag extras like skillLevel and deliveryMethod. Simple, right? Nope. My first stab at it with Crawl4AI gave me nothing—just an empty [].
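Before getting into what went wrong, here’s the pricing rule I was aiming for, as a tiny self-contained sketch (the offer dicts are copied straight from the example above; the real page had exactly two offers):

```python
# Sketch of the labeling plan: sort offers by price, then label the extremes.
# These offers are copied from the JSON-LD example above.
offers = [
    {"@type": "Offer", "price": 49.99, "priceCurrency": "USD"},
    {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"},
]
by_price = sorted(offers, key=lambda o: o["price"])
member_price = by_price[0]       # cheapest -> "member"
non_member_price = by_price[-1]  # priciest -> "non-member"
```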
What Went Wrong: Next.js Threw Me a Curveball
Next.js loves doing things dynamically, which means the JSON-LD I saw in my browser’s dev tools wasn’t always in the raw HTML Crawl4AI fetched. I started with this basic setup:
```python
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Schema",
    "baseSelector": "script[type='application/ld+json']",
    "fields": [
        {
            "name": "json_ld_content",
            "selector": "script[type='application/ld+json']",
            "type": "text"
        }
    ]
}

async def extract_data(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(schema)
        )
        extracted_data = json.loads(result.extracted_content)
        print(extracted_data)  # Output: []
```
Empty. Zilch. I dug into the debug output and saw the JSON-LD was in result.html, but result.extracted_content was blank. Turns out, Next.js was injecting that <script> tag after the page loaded, and Crawl4AI wasn’t catching it without some extra nudging.
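This isn’t my exact debugging session, just a sketch of the comparison that mattered: is the tag in the raw HTML, and what does the extractor hand back? (It reuses the `schema` dict from the setup above.)

```python
import asyncio
from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

async def debug_extraction(url, schema):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url=url,
            extraction_strategy=JsonCssExtractionStrategy(schema)
        )
        # The tag showed up in the raw HTML...
        print("tag in raw HTML?", "application/ld+json" in result.html)
        # ...while the extractor came back with nothing.
        print("extractor output:", result.extracted_content)  # was "[]" for me

asyncio.run(debug_extraction("https://shop.example.com/products/[hidden-stuff]", schema))
```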
How I Fixed It: A Workaround That Worked
After banging my head against the wall, I figured out I needed to make Crawl4AI wait for the JavaScript to do its thing and then grab the JSON-LD myself from the HTML. Here’s the code that finally worked:
```python
import json
import asyncio

from crawl4ai import AsyncWebCrawler

async def extract_product_schema(url):
    async with AsyncWebCrawler(verbose=True, user_agent="Mozilla/5.0") as crawler:
        print(f"Checking out: {url}")
        result = await crawler.arun(
            url=url,
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",  # Wake up the page
                "await new Promise(resolve => setTimeout(resolve, 5000));"  # Give it 5 seconds
            ],
            bypass_cache=True,
            timeout=30
        )

        if not result.success:
            print(f"Oops, something broke: {result.error_message}")
            return None

        # Digging into the HTML myself
        html = result.html
        start_marker = '<script type="application/ld+json">'
        end_marker = '</script>'
        marker_idx = html.find(start_marker)
        if marker_idx == -1:
            print("Couldn’t find the JSON-LD.")
            return None
        start_idx = marker_idx + len(start_marker)
        end_idx = html.find(end_marker, start_idx)
        if end_idx == -1:
            print("Couldn’t find the JSON-LD.")
            return None

        json_ld_raw = html[start_idx:end_idx].strip()
        json_ld = json.loads(json_ld_raw)

        # Sorting out the product details
        if json_ld.get("@type") == "Product":
            offers = sorted(
                [{"price": o.get("price"), "priceCurrency": o.get("priceCurrency")}
                 for o in json_ld.get("offers", [])],
                key=lambda x: x["price"]
            )
            workshop_instances = json_ld.get("hasWorkshop", [])
            schedule = workshop_instances[0].get("workshopSchedule", {}) if workshop_instances else {}
            product_info = {
                "name": json_ld.get("name"),
                "description": json_ld.get("description"),
                "providerName": json_ld.get("provider", {}).get("name"),
                "memberPrice": offers[0] if offers else None,
                "nonMemberPrice": offers[-1] if offers else None,
                "skillLevel": json_ld.get("skillLevel"),
                "deliveryMethod": workshop_instances[0].get("deliveryMethod") if workshop_instances else None,
                "startDate": schedule.get("startDate")
            }
            return product_info

        print("No product data here.")
        return None

async def main():
    url = "https://shop.example.com/products/[hidden-stuff]"
    product_data = await extract_product_schema(url)
    if product_data:
        print("Here’s what I got:")
        print(json.dumps(product_data, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```
What I Got Out of It
```json
{
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "providerName": "Bean Enthusiast Co.",
  "memberPrice": {
    "price": 49.99,
    "priceCurrency": "USD"
  },
  "nonMemberPrice": {
    "price": 59.99,
    "priceCurrency": "USD"
  },
  "skillLevel": "Beginner",
  "deliveryMethod": "Online",
  "startDate": "2024-08-15"
}
```
How I Made It Work
- Waiting for JavaScript: I told Crawl4AI to scroll and hang out for 5 seconds with js_code. That gave Next.js time to load everything up.
- DIY Parsing: The built-in extractor wasn’t cutting it, so I searched the HTML for the <script> tag and pulled the JSON-LD out myself (there’s a sturdier take on this sketched below).
- Price Tags: Sorted the prices and called the lowest “member” and the highest “non-member”—seemed like a safe bet for this site.
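About that DIY parsing: string-slicing got me unblocked, but it only grabs the first <script> tag and breaks if the page quotes the attribute differently. If I were hardening this, I’d reach for a real HTML parser. Here’s a sketch with BeautifulSoup (an extra dependency, not what my script above uses):

```python
import json

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def find_product_json_ld(html):
    """Return the first JSON-LD block whose @type is Product, else None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing
        if isinstance(data, dict) and data.get("@type") == "Product":
            return data
    return None
```

This version also copes with pages that ship several JSON-LD blocks (breadcrumbs, organization info, and so on), since it skips anything that isn’t a Product.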
What I Learned Along the Way
- Next.js is Tricky: It’s not just about the HTML you get—it’s about what shows up after the JavaScript runs. Timing is everything (see the sketch after this list).
- Sometimes You Gotta Get Hands-On: When the fancy tools didn’t work, digging into the raw HTML saved me.
- Debugging Pays Off: Printing out the HTML and extractor output showed me exactly where things were going wrong.
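On that timing point: the fixed 5-second sleep worked, but it’s slow and a little flaky. Crawl4AI’s docs describe a wait_for argument that blocks until a selector appears; I haven’t swapped it into the script above, so treat this as a sketch and double-check that your installed version supports it:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def fetch_when_ready(url):
    async with AsyncWebCrawler() as crawler:
        # Wait for the JSON-LD tag itself instead of sleeping a fixed 5 seconds.
        # wait_for with a "css:" prefix comes from the Crawl4AI docs; verify it
        # against the version you have installed.
        return await crawler.arun(
            url=url,
            wait_for="css:script[type='application/ld+json']",
            bypass_cache=True
        )

result = asyncio.run(fetch_when_ready("https://shop.example.com/products/[hidden-stuff]"))
```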