Technique

Scrape without XPath — what a schema-first extractor actually looks like

XPath selectors break every time a site redesigns. Schema-first extraction with an AI selector doesn't. Here is why.

Stekpad Team · 6 min read

Every scraper built before 2023 has the same hidden tax. You write a selector. The selector is fragile. The site redesigns. The selector breaks. You patch it. Next month, the site A/B tests a new layout, and your CSS path matches the wrong element. You patch it again. A senior engineer once told me he spent forty percent of his time on a four-year-old scraping project maintaining selectors. Four out of ten working days, every week, patching strings like div.product > div:nth-child(3) > span.price.

This is the tax nobody prices into the build estimate. And it is what POST /v1/extract is designed to remove.

The problem, written out in code

Here is what a classic BeautifulSoup + XPath pipeline looks like for a single product page. It does exactly what you think it does, and it is going to break the next time the site ships a PR.

```python
import requests
from bs4 import BeautifulSoup

def scrape_product(url: str) -> dict:
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "lxml")

    name_el = soup.select_one("h1.product-title")
    price_el = soup.select_one("div.price-wrapper span.price")
    stock_el = soup.select_one("div.availability > span")
    sku_el = soup.select_one("span[data-testid='product-sku']")

    return {
        "name": name_el.text.strip() if name_el else None,
        "price": float(price_el.text.replace("$", "").replace(",", "")) if price_el else None,
        "stock": stock_el.text.strip() if stock_el else None,
        "sku": sku_el.text.strip() if sku_el else None,
    }
```

Count the assumptions. There are at least nine. h1.product-title assumes the title is in an H1 with that exact class. div.price-wrapper span.price assumes two nested elements, with two exact class names. The price-parsing line assumes dollars and commas and no currency prefix. The SKU line assumes a specific data-testid attribute on the SKU element. Every one of those assumptions is a string the site owner can change without telling you.

The worst part is the silent failure. When the site renames product-title to ProductTitle, name_el becomes None. Without the guard, the .text call blows up; with the guard, you get a row with name: null that quietly pollutes your dataset for a week before someone notices the dashboard chart went flat.
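One cheap mitigation, independent of any API, is to make selector drift fail loudly instead of silently. A minimal sketch (the helper and exception are ours, not part of the scraper above):

```python
class SelectorDrift(Exception):
    """A selector that used to match now matches nothing."""

def require(value, field: str, selector: str):
    """Raise immediately on a missed selector instead of letting None
    flow into the dataset as a quiet null."""
    if value is None:
        raise SelectorDrift(f"{field}: {selector!r} matched nothing; did the page change?")
    return value

# Inside the scraper above, soup.select_one returns None on a miss, so:
# name = require(soup.select_one("h1.product-title"), "name", "h1.product-title").text
```

You trade a null column for an alert the same hour the redesign ships. It does not fix the selector, but at least the breakage is loud.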

What extract does instead

Stekpad's extract verb takes a URL, a JSON Schema, and a natural-language prompt. It renders the page in a real browser, turns the DOM into a clean text representation, and asks a language model to fill in the schema. The model sees the page the way a human reader does — semantic text, headings, labels — not the way a CSS selector does.

Here is the same product scrape against POST /v1/extract:

```bash
curl -X POST https://api.stekpad.com/v1/extract \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget-42",
    "prompt": "Extract the product details from this page.",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number", "description": "Price in USD, numeric only" },
        "stock": { "type": "string", "enum": ["in_stock", "out_of_stock", "backorder"] },
        "sku": { "type": "string" }
      },
      "required": ["name", "price"]
    }
  }'
```

The response is a typed JSON object matching the schema, or a typed error if the schema cannot be filled. No selectors. No CSS paths. No "data-testid hunt" on a redesign.

```json
{
  "run_id": "run_01HZ8K...",
  "url": "https://example.com/products/widget-42",
  "status": "completed",
  "json": {
    "name": "Widget 42",
    "price": 129.00,
    "stock": "in_stock",
    "sku": "WDG-042"
  },
  "credits_charged": 5
}
```

Five credits. One call. Zero selectors.
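The same call is easy to wrap from Python. A minimal sketch using requests, with the endpoint and body fields taken from the curl example above; the helper names are ours:

```python
import requests

API_URL = "https://api.stekpad.com/v1/extract"  # endpoint from the curl example above

def build_payload(url: str, prompt: str, schema: dict) -> dict:
    """Assemble the extract request body: URL, prompt, and JSON Schema."""
    return {"url": url, "prompt": prompt, "schema": schema}

def extract(url: str, prompt: str, schema: dict, api_key: str) -> dict:
    """POST the payload and return the parsed response body."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_payload(url, prompt, schema),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```

Note there is nothing scraper-shaped in that function: no parser, no selectors, just a schema going up and typed JSON coming down.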

Why it survives a redesign

A CSS selector is coupled to the structure of the page. A schema is coupled to the meaning of the page. When the site renames a class from product-title to ProductTitle, the structure changes, but the meaning — "there is a product name, and it is the biggest heading at the top" — does not. The language model reads the page and finds the title the same way a human does: by looking at the text, not the class.

This is why a schema-based extractor survives a redesign that would break a BeautifulSoup script. The only thing that breaks an extract call is the site genuinely removing the data — if the page no longer lists a price, no selector on earth will find it.

Retries built in. extract validates the response against your schema before returning. If the first LLM pass produces a response that fails validation (say, price came back as the string "$129.00" instead of a number), Stekpad retries the extraction twice, each time with the schema error fed back to the model. You pay for the call once — the retries are on us.
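That validate-then-retry loop runs on Stekpad's side, but as a mental model, here is a stdlib-only sketch of the kind of check it implies, covering just the schema subset used in this article (required fields, primitive types, enums); a real validator handles far more:

```python
def validate(obj: dict, schema: dict) -> list[str]:
    """Check required fields, primitive types, and enums against a JSON
    Schema subset. Returns a list of error strings; empty means it passes."""
    errors = []
    types = {"string": str, "number": (int, float), "object": dict, "array": list}
    for field in schema.get("required", []):
        if field not in obj:
            errors.append(f"{field}: required but missing")
    for field, spec in schema.get("properties", {}).items():
        if field not in obj:
            continue
        expected = types.get(spec.get("type"))
        if expected and not isinstance(obj[field], expected):
            errors.append(f"{field}: expected {spec['type']}, got {type(obj[field]).__name__}")
        elif "enum" in spec and obj[field] not in spec["enum"]:
            errors.append(f"{field}: {obj[field]!r} is not one of {spec['enum']}")
    return errors
```

Feed the returned error strings back into the prompt and you have the retry loop in miniature: "price: expected number, got str" is exactly the hint a model needs to strip the dollar sign on the second pass.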

Before and after, on a real page

Let's be concrete. I pointed both pipelines at a Hacker News story page. The selector script used td.title > span.titleline > a for the headline and span.age > a for the timestamp. It worked.

Two weeks later, Hacker News — which famously does not redesign — ships one of its very rare tweaks. titleline becomes titleLine (hypothetical, but realistic). The selector script returns None for every title on every page it scrapes, for fourteen hours, until the on-call notices the dashboard.

The extract call, with a schema of { title: string, points: number, author: string, age: string }, keeps working. The model still sees a headline near the top of each row, a score next to it, an author below. The JSON keeps coming. The dashboard keeps drawing. Your on-call keeps sleeping.
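Written out in full, that shorthand is the same JSON Schema shape as the product example; the descriptions here are our own guesses at useful hints, not required:

```python
# Schema for one Hacker News story row, expanded from the shorthand above.
hn_story_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "points": {"type": "number", "description": "Score shown next to the title, numeric only"},
        "author": {"type": "string"},
        "age": {"type": "string", "description": "Relative timestamp as displayed, e.g. '3 hours ago'"},
    },
    "required": ["title"],
}
```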

Multiply that across a scraper that covers ten sites. Or a hundred. Or the long tail of sites your growth team wants to pull data from once and never again. The selector tax grows linearly with coverage. The schema-first cost is flat.

When you still want raw selectors

Schema-first is not the answer to everything. There are three cases where writing a selector yourself is still the right call.

  1. You have a genuine structured API. If the site exposes a JSON endpoint — a Shopify product API, a GitHub repo API, an RSS feed — hit it directly. You do not need an LLM to parse JSON that is already JSON.
  2. The page is stable and high-volume. If you are hitting the same page a million times a day and it has not changed in three years, an extract call at 5 credits apiece is more expensive than a scrape call at 1 credit plus your own parser. Use POST /v1/scrape with formats: ["html"] and run your selectors locally.
  3. The value is a pixel-exact coordinate. If you want to know whether a banner is visually above the fold, or whether a specific DOM node has a specific inline style, a language model is the wrong tool. Use scrape with formats: ["screenshot"] and a visual check.

For everything else — product pages, company pages, article pages, job listings, real estate listings, event pages, restaurant menus — schema-first is the cheaper lifetime cost. The per-call price is 5 credits. The avoided-patch cost is dozens of engineering hours per year.

What the schema actually buys you

A JSON Schema is not just a prompt wrapper. It is a contract between the page and your pipeline. Every field you declare has three properties the language model has to respect: a type, an optional description, and a required/optional flag. The extractor validates the model's response against the schema before it hands anything back to you.

That validation is where the shape gets useful. A selector-based scraper can return name: null and you would not know until your dashboard chart goes flat. A schema-based extractor either returns a valid object, or it returns a typed error saying which field failed and what the model produced. You can branch on that error. You can retry with a better description. You can flag the URL for review. None of that is possible with a quiet None from a BeautifulSoup call.

The descriptions are load-bearing too. {"type": "number"} on a price field is fine, but {"type": "number", "description": "Price in USD, numeric only, no currency symbol"} is better — the model reads the description, strips the dollar sign, and gives you 129.00 instead of "$129.00". Good schema descriptions replace five lines of regex post-processing per field.

An under-appreciated property: enum fields. If the product page can be in one of three stock states, declare them as an enum. The model will either return one of your enum values or raise a validation error. You never get a rogue stock: "In stock — 3 left" string in the middle of a column that should only have three possible values.

```json
{
  "stock": {
    "type": "string",
    "enum": ["in_stock", "out_of_stock", "backorder"],
    "description": "Availability state of the product"
  }
}
```

That is three extra lines of config in exchange for a column that validates itself forever. There is no equivalent in the CSS-selector world. The closest you get is a post-processing step that does if "in stock" in text.lower() and prays the site does not change the wording.

A useful pattern: schema in version control

Because the schema is a JSON object, you can put it in your repo next to your code. One file per dataset you care about. Our own team keeps schemas under schemas/*.json, imports them into the extract call, and versions them with git. When the site changes meaningfully and you want to capture a new field, you edit the schema, bump the version, and the old dataset still validates.
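In code, that pattern is a one-line loader. A sketch with a hypothetical schemas/product.json; the directory layout and file names are conventions, not anything the API requires:

```python
import json
from pathlib import Path

def load_schema(name: str, schema_dir: Path = Path("schemas")) -> dict:
    """Load a reviewed, versioned JSON Schema from the repo: one file per dataset."""
    return json.loads((schema_dir / f"{name}.json").read_text())

# payload = {
#     "url": "https://example.com/products/widget-42",
#     "prompt": "Extract the product details from this page.",
#     "schema": load_schema("product"),
# }
```

A schema change now shows up as a diff in code review, next to the code that consumes the field it adds.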

This is the thing CSS selectors never gave you: a typed, reviewable, diff-friendly contract for what the page means, separated from how the page looks.


Stekpad Team
We build Stekpad. We scrape the web, store it, and enrich it — from an API, from an app, or from Claude.

Try the API. Free to start.

3 free runs a day on the playground. No credit card. Install MCP for Claude in 60 seconds.
