Technical Insights: Azure, .NET, Dynamics 365 & EV Charging Architecture

Tag: Javascript

Scraping JSON-LD from a Next.js Site with Crawl4AI: My Debugging Journey

Scraping data from modern websites can feel like a puzzle, especially when they’re built with Next.js and all that fancy JavaScript magic. Recently, I needed to pull some product info—like names, prices, and a few extra details—from an e-commerce page that was giving me a headache. The site (let’s just call it https://shop.example.com/products/[hidden-stuff]) used JSON-LD tucked inside a <script> tag, but my first attempts with Crawl4AI came up empty. Here’s how I cracked it, step by step, and got the data I wanted.

The Headache: Empty Results from a Next.js Page

I was trying to grab details from a product page—think stuff like the item name, description, member vs. non-member prices, and some category info. The JSON-LD looked something like this (I’ve swapped out the real details for a fake example):

{
  "@context": "https://schema.org",
  "@type": "Product",
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "provider": {
    "@type": "Organization",
    "name": "Bean Enthusiast Co."
  },
  "offers": [
    {"@type": "Offer", "price": 49.99, "priceCurrency": "USD"},
    {"@type": "Offer", "price": 59.99, "priceCurrency": "USD"}
  ],
  "skillLevel": "Beginner",
  "hasWorkshop": [
    {
      "@type": "WorkshopInstance",
      "deliveryMethod": "Online",
      "workshopSchedule": {"startDate": "2024-08-15"}
    }
  ]
}

My goal was to extract this, label the cheaper price as “member” and the higher one as “non-member,” and snag extras like skillLevel and deliveryMethod. Simple, right? Nope. My first stab at it with Crawl4AI gave me nothing—just an empty [].

What Went Wrong: Next.js Threw Me a Curveball

Next.js loves doing things dynamically, which means the JSON-LD I saw in my browser’s dev tools wasn’t always in the raw HTML Crawl4AI fetched. I started with this basic setup:

from crawl4ai import AsyncWebCrawler
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

schema = {
    "name": "Product Schema",
    "baseSelector": "script[type='application/ld+json']",
    "fields": [{"name": "json_ld_content", "selector": "script[type='application/ld+json']", "type": "text"}]
}

async def extract_data(url):
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, extraction_strategy=JsonCssExtractionStrategy(schema))
        extracted_data = json.loads(result.extracted_content)
        print(extracted_data)

# Output: []

Empty. Zilch. I dug into the debug output and saw the JSON-LD was in result.html, but result.extracted_content was blank. Turns out, Next.js was injecting that <script> tag after the page loaded, and Crawl4AI wasn’t catching it without some extra nudging.

How I Fixed It: A Workaround That Worked

After banging my head against the wall, I figured out I needed to make Crawl4AI wait for the JavaScript to do its thing and then grab the JSON-LD myself from the HTML. Here’s the code that finally worked:

import json
import asyncio
from crawl4ai import AsyncWebCrawler

async def extract_product_schema(url):
    async with AsyncWebCrawler(verbose=True, user_agent="Mozilla/5.0") as crawler:
        print(f"Checking out: {url}")
        result = await crawler.arun(
            url=url,
            js_code=[
                "window.scrollTo(0, document.body.scrollHeight);",  # Wake up the page
                "await new Promise(resolve => setTimeout(resolve, 5000));"  # Give it 5 seconds
            ],
            bypass_cache=True,
            timeout=30
        )

        if not result.success:
            print(f"Oops, something broke: {result.error_message}")
            return None

        # Digging into the HTML myself
        html = result.html
        start_marker = '<script type="application/ld+json">'
        end_marker = '</script>'
        start_idx = html.find(start_marker) + len(start_marker)
        end_idx = html.find(end_marker, start_idx)

        if start_idx == -1 or end_idx == -1:
            print("Couldn’t find the JSON-LD.")
            return None

        json_ld_raw = html[start_idx:end_idx].strip()
        json_ld = json.loads(json_ld_raw)

        # Sorting out the product details
        if json_ld.get("@type") == "Product":
            offers = sorted(
                [{"price": o.get("price"), "priceCurrency": o.get("priceCurrency")} for o in json_ld.get("offers", [])],
                key=lambda x: x["price"]
            )
            workshop_instances = json_ld.get("hasWorkshop", [])
            schedule = workshop_instances[0].get("workshopSchedule", {}) if workshop_instances else {}
            
            product_info = {
                "name": json_ld.get("name"),
                "description": json_ld.get("description"),
                "providerName": json_ld.get("provider", {}).get("name"),
                "memberPrice": offers[0] if offers else None,
                "nonMemberPrice": offers[-1] if offers else None,
                "skillLevel": json_ld.get("skillLevel"),
                "deliveryMethod": workshop_instances[0].get("deliveryMethod") if workshop_instances else None,
                "startDate": schedule.get("startDate")
            }
            return product_info
        print("No product data here.")
        return None

async def main():
    url = "https://shop.example.com/products/[hidden-stuff]"
    product_data = await extract_product_schema(url)
    if product_data:
        print("Here’s what I got:")
        print(json.dumps(product_data, indent=2))

if __name__ == "__main__":
    asyncio.run(main())

What I Got Out of It

{
  "name": "Beginner’s Guide to Coffee Roasting",
  "description": "Learn the basics of roasting your own coffee beans at home. Recorded live last summer.",
  "providerName": "Bean Enthusiast Co.",
  "memberPrice": {
    "price": 49.99,
    "priceCurrency": "USD"
  },
  "nonMemberPrice": {
    "price": 59.99,
    "priceCurrency": "USD"
  },
  "skillLevel": "Beginner",
  "deliveryMethod": "Online",
  "startDate": "2024-08-15"
}

How I Made It Work

Waiting for JavaScript: I told Crawl4AI to scroll and hang out for 5 seconds with js_code. That gave Next.js time to load everything up.DIY Parsing: The built-in extractor wasn’t cutting it, so I searched the HTML for the <script> tag and pulled the JSON-LD out myself.Price Tags: Sorted the prices and called the lowest “member” and the highest “non-member”—seemed like a safe bet for this site.

What I Learned Along the Way

  • Next.js is Tricky: It’s not just about the HTML you get—it’s about what shows up after the JavaScript runs. Timing is everything.
  • Sometimes You Gotta Get Hands-On: When the fancy tools didn’t work, digging into the raw HTML saved me.
  • Debugging Pays Off: Printing out the HTML and extractor output showed me exactly where things were going wrong.

webform_dopostbackwithoptions is undefined

A few days back I’ve got the error message “webform_dopostbackwithoptions is undefined” on the project that i’m working on.
The strange thing it happened only when I activated the securepagemodule, when i deactivated the module it works perfectly. I tried to debug it with HTTP debugger (fiddler tool) and i found that on that particular page, there is a request from webresources.axd but the request is not into https but into http and what i believe since the page is on secure mode therefore it discards the “webresources.axd” since it’s not secure. The workaround for this issue is by adding entry of “webresources.axd” under securepage and the problem is solved.

This is the sample of web.config for it

 
        
         
   </secureWebPages

NOTE:This is resolved in the new version of securepage module

SYS is undefined in AJAX Update Panel

I tried to implement AJAX on my website by using AJAX extension on visual studio .NET. I tried to drag update panel to my page and I’ve put my button event handler to the trigger collection in update panel. I was expecting when i click that particular button then it will not refresh or load up the page, but in fact it reloads the page which showing me that the AJAX panel does not work, and my firebug detects that there are errors related with sys is undefined.

This happens because I haven’t AJAX config on my web.config. The config is related with HTTP handler, you can try to put this in your web.config. Please make sure it should be inside “”


    
      
    

Powered by WordPress & Theme by Anders Norén