Technique

Scrape without XPath — what a schema-first extractor actually looks like

XPath selectors break every time a site redesigns. Schema-first extraction with an AI selector doesn't. Here is why.

Stekpad Team · 6 min read

Every scraper built before 2023 has the same hidden tax. You write a selector. The selector is fragile. The site redesigns. The selector breaks. You patch it. Next month, the site A/B tests a new layout, and your CSS path matches the wrong element. You patch it again. A senior engineer once told me he spent forty percent of his time on a four-year-old scraping project maintaining selectors. Four out of ten working days, every week, patching strings like div.product > div:nth-child(3) > span.price.

This is the tax nobody prices into the build estimate. And it is what POST /v1/extract is designed to remove.

The problem, written out in code

Here is what a classic BeautifulSoup + XPath pipeline looks like for a single product page. It does exactly what you think it does, and it is going to break the next time the site ships a PR.

```python
import requests
from bs4 import BeautifulSoup

def scrape_product(url: str) -> dict:
    html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text
    soup = BeautifulSoup(html, "lxml")

    name_el = soup.select_one("h1.product-title")
    price_el = soup.select_one("div.price-wrapper span.price")
    stock_el = soup.select_one("div.availability > span")
    sku_el = soup.select_one("span[data-testid='product-sku']")

    return {
        "name": name_el.text.strip() if name_el else None,
        "price": float(price_el.text.replace("$", "").replace(",", "")) if price_el else None,
        "stock": stock_el.text.strip() if stock_el else None,
        "sku": sku_el.text.strip() if sku_el else None,
    }
```

Count the assumptions. There are at least nine. h1.product-title assumes the title is in an H1 with that exact class. div.price-wrapper span.price assumes two nested elements, with two exact class names. The price-parsing line assumes dollars and commas and no currency prefix. The SKU line assumes a specific data-testid attribute on the SKU element. Every one of those assumptions is a string the site owner can change without telling you.

The worst part is the silent failure. When the site renames product-title to ProductTitle, name_el becomes None. Without the guard, the .text call blows up; with the guard, you get a row with name: null that quietly pollutes your dataset for a week before someone notices the dashboard chart went flat.
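One cheap mitigation, independent of any API, is to make selector drift fail loudly instead of silently. A minimal sketch (the helper and exception are ours, not part of the scraper above):

```python
class SelectorDrift(Exception):
    """A selector that used to match now matches nothing."""

def require(value, field: str, selector: str):
    """Raise immediately on a missed selector instead of letting None
    flow into the dataset as a quiet null."""
    if value is None:
        raise SelectorDrift(f"{field}: {selector!r} matched nothing; did the page change?")
    return value

# Inside the scraper above, soup.select_one returns None on a miss, so:
# name = require(soup.select_one("h1.product-title"), "name", "h1.product-title").text
```

You trade a null column for an alert the same hour the redesign ships. It does not fix the selector, but at least the breakage is loud.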

What extract does instead

Stekpad's extract verb takes a URL, a JSON Schema, and a natural-language prompt. It renders the page in a real browser, turns the DOM into a clean text representation, and asks a language model to fill in the schema. The model sees the page the way a human reader does — semantic text, headings, labels — not the way a CSS selector does.

Here is the same product scrape against POST /v1/extract:

```bash
curl -X POST https://api.stekpad.com/v1/extract \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/products/widget-42",
    "prompt": "Extract the product details from this page.",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "number", "description": "Price in USD, numeric only" },
        "stock": { "type": "string", "enum": ["in_stock", "out_of_stock", "backorder"] },
        "sku": { "type": "string" }
      },
      "required": ["name", "price"]
    }
  }'
```

The response is a typed JSON object matching the schema, or a typed error if the schema cannot be filled. No selectors. No CSS paths. No "data-testid hunt" on a redesign.

```json
{
  "run_id": "run_01HZ8K...",
  "url": "https://example.com/products/widget-42",
  "status": "completed",
  "json": {
    "name": "Widget 42",
    "price": 129.00,
    "stock": "in_stock",
    "sku": "WDG-042"
  },
  "credits_charged": 5
}
```

Five credits. One call. Zero selectors.
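The same call is easy to wrap from Python. A minimal sketch using requests, with the endpoint and body fields taken from the curl example above; the helper names are ours:

```python
import requests

API_URL = "https://api.stekpad.com/v1/extract"  # endpoint from the curl example above

def build_payload(url: str, prompt: str, schema: dict) -> dict:
    """Assemble the extract request body: URL, prompt, and JSON Schema."""
    return {"url": url, "prompt": prompt, "schema": schema}

def extract(url: str, prompt: str, schema: dict, api_key: str) -> dict:
    """POST the payload and return the parsed response body."""
    resp = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_payload(url, prompt, schema),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()
```

Note there is nothing scraper-shaped in that function: no parser, no selectors, just a schema going up and typed JSON coming down.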

Why it survives a redesign

A CSS selector is coupled to the structure of the page. A schema is coupled to the meaning of the page. When the site renames a class from product-title to ProductTitle, the structure changes, but the meaning — "there is a product name, and it is the biggest heading at the top" — does not. The language model reads the page and finds the title the same way a human does: by looking at the text, not the class.

This is why a schema-based extractor survives a redesign that would break a BeautifulSoup script. The only thing that breaks an extract call is the site genuinely removing the data — if the page no longer lists a price, no selector on earth will find it.

Retries built in. extract validates the response against your schema before returning. If the first LLM pass produces a response that fails validation (say, price came back as the string "$129.00" instead of a number), Stekpad retries the extraction twice, each time with the schema error fed back to the model. You pay for the call once — the retries are on us.
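That validate-then-retry loop runs on Stekpad's side, but as a mental model, here is a stdlib-only sketch of the kind of check it implies, covering just the schema subset used in this article (required fields, primitive types, enums); a real validator handles far more:

```python
def validate(obj: dict, schema: dict) -> list[str]:
    """Check required fields, primitive types, and enums against a JSON
    Schema subset. Returns a list of error strings; empty means it passes."""
    errors = []
    types = {"string": str, "number": (int, float), "object": dict, "array": list}
    for field in schema.get("required", []):
        if field not in obj:
            errors.append(f"{field}: required but missing")
    for field, spec in schema.get("properties", {}).items():
        if field not in obj:
            continue
        expected = types.get(spec.get("type"))
        if expected and not isinstance(obj[field], expected):
            errors.append(f"{field}: expected {spec['type']}, got {type(obj[field]).__name__}")
        elif "enum" in spec and obj[field] not in spec["enum"]:
            errors.append(f"{field}: {obj[field]!r} is not one of {spec['enum']}")
    return errors
```

Feed the returned error strings back into the prompt and you have the retry loop in miniature: "price: expected number, got str" is exactly the hint a model needs to strip the dollar sign on the second pass.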

Before and after, on a real page

Let's be concrete. I pointed both pipelines at a Hacker News story page. The selector script used td.title > span.titleline > a for the headline and span.age > a for the timestamp. It worked.

Two weeks later, Hacker News — which famously does not redesign — ships one of its very rare tweaks. titleline becomes titleLine (hypothetical, but realistic). The selector script returns None for every title on every page it scrapes, for fourteen hours, until the on-call notices the dashboard.

The extract call, with a schema of { title: string, points: number, author: string, age: string }, keeps working. The model still sees a headline near the top of each row, a score next to it, an author below. The JSON keeps coming. The dashboard keeps drawing. Your on-call keeps sleeping.
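Written out in full, that shorthand is the same JSON Schema shape as the product example; the descriptions here are our own guesses at useful hints, not required:

```python
# Schema for one Hacker News story row, expanded from the shorthand above.
hn_story_schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "points": {"type": "number", "description": "Score shown next to the title, numeric only"},
        "author": {"type": "string"},
        "age": {"type": "string", "description": "Relative timestamp as displayed, e.g. '3 hours ago'"},
    },
    "required": ["title"],
}
```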

Multiply that across a scraper that covers ten sites. Or a hundred. Or the long tail of sites your growth team wants to pull data from once and never again. The selector tax grows linearly with coverage. The schema-first cost is flat.

When you still want raw selectors

Schema-first is not the answer to everything. There are three cases where writing a selector yourself is still the right call.

  1. You have a genuine structured API. If the site exposes a JSON endpoint — a Shopify product API, a GitHub repo API, an RSS feed — hit it directly. You do not need an LLM to parse JSON that is already JSON.
  2. The page is stable and high-volume. If you are hitting the same page a million times a day and it has not changed in three years, an extract call at 5 credits apiece is more expensive than a scrape call at 1 credit plus your own parser. Use POST /v1/scrape with formats: ["html"] and run your selectors locally.
  3. The value is a pixel-exact coordinate. If you want to know whether a banner is visually above the fold, or whether a specific DOM node has a specific inline style, a language model is the wrong tool. Use scrape with formats: ["screenshot"] and a visual check.

For everything else — product pages, company pages, article pages, job listings, real estate listings, event pages, restaurant menus — schema-first is the cheaper lifetime cost. The per-call price is 5 credits. The avoided-patch cost is dozens of engineering hours per year.

What the schema actually buys you

A JSON Schema is not just a prompt wrapper. It is a contract between the page and your pipeline. Every field you declare has three properties the language model has to respect: a type, an optional description, and a required/optional flag. The extractor validates the model's response against the schema before it hands anything back to you.

That validation is where the shape gets useful. A selector-based scraper can return name: null and you would not know until your dashboard chart goes flat. A schema-based extractor either returns a valid object, or it returns a typed error saying which field failed and what the model produced. You can branch on that error. You can retry with a better description. You can flag the URL for review. None of that is possible with a quiet None from a BeautifulSoup call.

The descriptions are load-bearing too. {"type": "number"} on a price field is fine, but {"type": "number", "description": "Price in USD, numeric only, no currency symbol"} is better — the model reads the description, strips the dollar sign, and gives you 129.00 instead of "$129.00". Good schema descriptions replace five lines of regex post-processing per field.

An under-appreciated property: enum fields. If the product page can be in one of three stock states, declare them as an enum. The model will either return one of your enum values or raise a validation error. You never get a rogue stock: "In stock — 3 left" string in the middle of a column that should only have three possible values.

```json
{
  "stock": {
    "type": "string",
    "enum": ["in_stock", "out_of_stock", "backorder"],
    "description": "Availability state of the product"
  }
}
```

That is three extra lines of config in exchange for a column that validates itself forever. There is no equivalent in the CSS-selector world. The closest you get is a post-processing step that does if "in stock" in text.lower() and prays the site does not change the wording.

A useful pattern: schema in version control

Because the schema is a JSON object, you can put it in your repo next to your code. One file per dataset you care about. Our own team keeps schemas under schemas/*.json, imports them into the extract call, and versions them with git. When the site changes meaningfully and you want to capture a new field, you edit the schema, bump the version, and the old dataset still validates.
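In code, that pattern is a one-line loader. A sketch with a hypothetical schemas/product.json; the directory layout and file names are conventions, not anything the API requires:

```python
import json
from pathlib import Path

def load_schema(name: str, schema_dir: Path = Path("schemas")) -> dict:
    """Load a reviewed, versioned JSON Schema from the repo: one file per dataset."""
    return json.loads((schema_dir / f"{name}.json").read_text())

# payload = {
#     "url": "https://example.com/products/widget-42",
#     "prompt": "Extract the product details from this page.",
#     "schema": load_schema("product"),
# }
```

A schema change now shows up as a diff in code review, next to the code that consumes the field it adds.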

This is the thing CSS selectors never gave you: a typed, reviewable, diff-friendly contract for what the page means, separated from how the page looks.


Stekpad Team
We build Stekpad. We scrape the web, store it, and enrich it — from an API, from an app, or from Claude.

Try the API. Free to start.

3 free runs a day on the playground. No credit card. Install MCP for Claude in 60 seconds.
