Scraping data from modern websites can feel like a puzzle, especially when they’re built with Next.js and all that fancy JavaScript magic. Recently, I needed to pull some product info—like names, prices, and a few extra details—from an e-commerce page that was giving me a headache. The site (let’s just call it https://shop.example.com/products/[hidden-stuff]) used JSON-LD tucked inside a <script> tag, but my first attempts with Crawl4AI came up empty. Here’s how I cracked it, step by step, and got the data I wanted.
The Headache: Empty Results from a Next.js Page
I was trying to grab details from a product page—think stuff like the item name, description, member vs. non-member prices, and some category info. The JSON-LD looked something like this (I’ve swapped out the real details for a fake example):
{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "provider": {
    "@type": "Organization",
    "name": "Bean Enthusiast Co."
  },
  "offers": [
    {"@type": "Offer", "price": 49.99, "priceCurrency": "USD"},
    {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"}
  ],
  "skillLevel": "Beginner",
  "hasWorkshop": [
    {
      "@type": "WorkshopInstance",
      "deliveryMethod": "Online",
      "workshopSchedule": {"startDate": "2024-08-15"}
    }
  ]
}
My goal was to extract this, label the cheaper price as “member” and the higher one as “non-member,” and snag extras like skillLevel and deliveryMethod. Simple, right? Nope. My first stab at it with Crawl4AI gave me nothing—just an empty [].
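For the record, the labeling half is the easy part once the JSON is in hand: sort the offers by price and take the two ends. A minimal sketch, assuming sample_json_ld holds the JSON-LD string shown above:

import json

# Assume sample_json_ld is the JSON-LD string from the example above.
data = json.loads(sample_json_ld)

offers = sorted(data.get("offers", []), key=lambda o: o["price"])
member_price = offers[0] if offers else None       # cheapest offer -> "member"
non_member_price = offers[-1] if offers else None  # priciest offer -> "non-member"

The hard part, it turned out, was getting that JSON in hand at all.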
What Went Wrong: Next.js Threw Me a Curveball
Next.js loves doing things dynamically, which means the JSON-LD I saw in my browser’s dev tools wasn’t always in the raw HTML Crawl4AI fetched. I started with this basic setup:
import json

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Schema",
    "baseSelector": "script[type='application/ld+json']",
    "fields": [
        {"name": "json_ld_content", "selector": "script[type='application/ld+json']", "type": "text"}
    ]
}

async def extract_data(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=JsonCssExtractionStrategy(schema))
        extracted_data = json.loads(result.extracted_content)
        print(extracted_data)
        # Output: []
Empty. Zilch. I dug into the debug output and saw the JSON-LD was in result.html, but result.extracted_content was blank. Turns out, Next.js was injecting that <script> tag after the page loaded, and Crawl4AI wasn’t catching it without some extra nudging.
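If you want to run the same diagnosis on your own page, a couple of prints against the result object tell the story, using the same fields Crawl4AI already exposes:

# Sanity check: the JSON-LD made it into the rendered HTML...
print('application/ld+json' in result.html)   # True -> the tag is there
# ...but the CSS-based extraction still came back empty.
print(result.extracted_content)               # '[]'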
How I Fixed It: A Workaround That Worked
After banging my head against the wall, I figured out I needed to make Crawl4AI wait for the JavaScript to do its thing and then grab the JSON-LD myself from the HTML. Here’s the code that finally worked:
import json
import asyncio

from crawl4ai import AsyncWebCrawler

async def extract_product_schema(url):
    async with AsyncWebCrawler(verbose=True, user_agent="Mozilla/5.0") as crawler:
        print(f"Checking out: {url}")
        result = await crawler.arun(
            url=url,
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",  # Wake up the page
                "await new Promise(resolve => setTimeout(resolve, 5000));"  # Give it 5 seconds
            ],
            bypass_cache=True,
            timeout=30
        )

        if not result.success:
            print(f"Oops, something broke: {result.error_message}")
            return None

        # Digging into the HTML myself
        html = result.html
        start_marker = '<script type="application/ld+json">'
        end_marker = '</script>'

        # Check find() before adding the marker length, or a miss (-1) gets hidden
        marker_idx = html.find(start_marker)
        if marker_idx == -1:
            print("Couldn’t find the JSON-LD.")
            return None
        start_idx = marker_idx + len(start_marker)
        end_idx = html.find(end_marker, start_idx)
        if end_idx == -1:
            print("Couldn’t find the JSON-LD.")
            return None

        json_ld_raw = html[start_idx:end_idx].strip()
        json_ld = json.loads(json_ld_raw)

        # Sorting out the product details
        if json_ld.get("@type") == "Product":
            # Cheapest first, so offers[0] is "member" and offers[-1] is "non-member"
            offers = sorted(
                [{"price": o.get("price"), "priceCurrency": o.get("priceCurrency")}
                 for o in json_ld.get("offers", [])],
                key=lambda x: x["price"]
            )
            workshop_instances = json_ld.get("hasWorkshop", [])
            schedule = workshop_instances[0].get("workshopSchedule", {}) if workshop_instances else {}
            product_info = {
                "name": json_ld.get("name"),
                "description": json_ld.get("description"),
                "providerName": json_ld.get("provider", {}).get("name"),
                "memberPrice": offers[0] if offers else None,
                "nonMemberPrice": offers[-1] if offers else None,
                "skillLevel": json_ld.get("skillLevel"),
                "deliveryMethod": workshop_instances[0].get("deliveryMethod") if workshop_instances else None,
                "startDate": schedule.get("startDate")
            }
            return product_info

        print("No product data here.")
        return None

async def main():
    url = "https://shop.example.com/products/[hidden-stuff]"
    product_data = await extract_product_schema(url)
    if product_data:
        print("Here’s what I got:")
        print(json.dumps(product_data, indent=2))

if __name__ == "__main__":
    asyncio.run(main())
What I Got Out of It
{
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "providerName": "Bean Enthusiast Co.",
  "memberPrice": {
    "price": 49.99,
    "priceCurrency": "USD"
  },
  "nonMemberPrice": {
    "price": 59.99,
    "priceCurrency": "USD"
  },
  "skillLevel": "Beginner",
  "deliveryMethod": "Online",
  "startDate": "2024-08-15"
}
How I Made It Work
- Waiting for JavaScript: I told Crawl4AI to scroll and hang out for 5 seconds with js_code. That gave Next.js time to load everything up.
- DIY Parsing: The built-in extractor wasn’t cutting it, so I searched the HTML for the <script> tag and pulled the JSON-LD out myself (a sturdier take on this step is sketched below).
- Price Tags: Sorted the prices and called the lowest “member” and the highest “non-member”—seemed like a safe bet for this site.
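One honest caveat on the DIY parsing: string-searching for the first <script> tag breaks down if a page ships several JSON-LD blocks, or quotes the type attribute differently. If I were hardening this, I’d reach for a real HTML parser. Here’s a minimal sketch using BeautifulSoup (not what I actually ran, just the direction I’d take; assumes beautifulsoup4 is installed):

import json
from bs4 import BeautifulSoup

def find_product_json_ld(html):
    """Scan every JSON-LD block on the page; return the first Product, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of crashing
        # Some sites wrap several entities in a list; normalize to a list.
        candidates = data if isinstance(data, list) else [data]
        for item in candidates:
            if isinstance(item, dict) and item.get("@type") == "Product":
                return item
    return None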
What I Learned Along the Way
- Next.js is Tricky: It’s not just about the HTML you get—it’s about what shows up after the JavaScript runs. Timing is everything.
- Sometimes You Gotta Get Hands-On: When the fancy tools didn’t work, digging into the raw HTML saved me.
- Debugging Pays Off: Printing out the HTML and extractor output showed me exactly where things were going wrong.