Manifesto

Why every existing scraper is broken for AI agents

Legacy scrapers were built for analysts with spreadsheets. Agents need a different contract — stateless, typed, priced per call.

Stekpad Team · 7 min read

Every scraping tool shipped before 2024 was built for a human with a spreadsheet. The human ran the job. The human downloaded the CSV. The human opened the file, squinted at a row, and decided what to do next. The job was a cron. The output was a file. The decision loop was a chair and a monitor.

Then agents showed up. And every single one of those tools turned out to be the wrong shape.

This is a short essay about why the gap exists, what it costs, and the specific contract an agent actually needs from a scraper. It is opinionated. That is the point.

The old contract, written out

The contract a traditional scraper signs with its user looks like this:

  1. You write a selector, or you click in a point-and-click UI.
  2. A job runs on a schedule.
  3. The job writes a CSV to storage or emails it to a human.
  4. The human opens the CSV and figures out what to do.

Look at what is missing. There is no typed response. There is no session. There is no memory across runs. There is no way for the consumer to say "give me the new rows since last Tuesday". There is no error protocol the consumer can handle automatically. There is a file, and there is a human, and the human handles everything the file does not.

This works when the consumer is a person. It falls apart the instant the consumer is an agent.

The new consumer is stateless, impatient, and typed

An agent running inside Claude Desktop, Cursor, or a Claude Code session has three properties a CSV was never designed for.

It is stateless between calls. A new chat starts with an empty context. If the agent scraped a site yesterday, it has no memory of that unless something outside the model stores the result. A CSV email does not count. The agent has no inbox.

It is time-bounded. The agent is waiting inside a user session. The model cannot sit for an hour watching a cron job. If the tool does not return in 60 seconds, the user walks away and the whole run is wasted. Async-with-polling is fine. Silent batch jobs are not.

It is typed. The model is wiring tool outputs into its next decision. If the return shape is "a CSV, somewhere, eventually", the model cannot reason about it. The return shape has to be a typed JSON payload with stable field names, predictable errors, and a credit cost the agent can budget against.

Legacy scrapers hit zero of three. You write a selector, you wait an hour, and you get a CSV. No state, no typed contract, no agent-shaped error handling. Which is why every team we talk to has tried to shove a cron scraper behind an agent and given up.

Three failure modes we watched happen in 2025

The stale CSV. A growth team hooks up an existing scraping service that emails a nightly CSV to a shared inbox. They wire the CSV into a RAG pipeline. The agent happily answers questions using data that is between one and 24 hours old, and the user cannot tell. This is not a scraping bug. This is the wrong contract, pretending to be the right one.

The payload-by-value trap. A developer wires a newer, API-first scraper into an agent. The API returns scraped markdown in the response body. The agent uses the markdown, the conversation ends, the markdown is gone. Tomorrow the user asks the agent about the same page. The agent re-scrapes. Two credits. Two minutes. For a page that has not changed. Firecrawl and most first-generation scraping APIs work this way. It is not wrong — it is just expensive for agents that come back.

The dashboard-only tool. Some of the prettiest scraping tools — Browse AI, Octoparse, ParseHub — have beautiful dashboards and no MCP server. An agent cannot use them at all. The user ends up copy-pasting rows from a dashboard into a chat window. At that point, you may as well not have the tool.

Three failure modes, one shared diagnosis: the tool was built for a human control loop, and the control loop is now inside a model.

What a tool built for agents actually looks like

We built Stekpad the other way. The agent is the first user. The human-facing web app is a surface on top of the same engine, not the other way round. That means five things a legacy scraper does not give you.

Every call is a typed JSON response. `POST /v1/scrape` returns a fixed shape: `run_id`, `status`, `markdown`, `json`, `html`, `metadata`, `dataset_id`, `row_id`, `credits_charged`. The agent can wire any field into its next step without guessing.
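As a sketch, a successful response might look like this (the field names are the ones listed above; every value is illustrative, not real output):

```json
{
  "run_id": "run_abc123",
  "status": "completed",
  "markdown": "# Example Corp\n...",
  "json": null,
  "html": null,
  "metadata": { "url": "https://example.com", "fetched_at": "2025-06-02T10:15:00Z" },
  "dataset_id": "ds_q2_targets",
  "row_id": "row_0042",
  "credits_charged": 1
}
```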

Every call has a credit cost in the response. The model can budget itself. If your workflow is "scrape until the wallet hits zero", the model knows when to stop.
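Because `credits_charged` comes back on every response, "scrape until the wallet hits zero" is a one-screen loop. A minimal sketch, with `scrape_fn` standing in for the real HTTP call:

```python
def scrape_until_budget(urls, budget, scrape_fn):
    """Call scrape_fn per URL, stopping once the credit budget is spent."""
    spent, rows = 0, []
    for url in urls:
        if spent >= budget:
            break
        resp = scrape_fn(url)           # one typed /v1/scrape payload per call
        spent += resp["credits_charged"]
        rows.append(resp["row_id"])
    return rows, spent

# Stand-in for the real API call; one credit per page, as in the essay.
def fake_scrape(url):
    return {"row_id": f"row_{url.rsplit('/', 1)[-1]}", "credits_charged": 1}

urls = [f"https://example.com/p/{i}" for i in range(5)]
rows, spent = scrape_until_budget(urls, budget=3, scrape_fn=fake_scrape)
print(rows, spent)  # ['row_0', 'row_1', 'row_2'] 3
```

The point is the shape, not the helper: the model can make the same stop-or-continue decision because the cost is a field it can read, not a line item on next month's invoice.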

Every scrape is persisted by default. Stekpad stores the result in a dataset owned by your workspace. Tomorrow, in a new chat, the agent can query that dataset for free. Reads against your own data do not cost credits. The cost scales with new information, not with re-reading yesterday.

Every verb is an MCP tool on day one: `scrape`, `crawl`, `map`, `extract`, `search`, `list_datasets`, `get_dataset`, `query_dataset`. Install the server once, restart Claude, and the model can compose them.

Every async call returns a `run_id` under 60 seconds. No polling timeout surprises inside a chat. If a crawl is going to take 20 minutes, you get the run_id instantly, and the agent can check back in another message.
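The agent-side polling loop is short. A sketch, assuming a status endpoint exists for a `run_id` (the essay only promises the `run_id` itself, so `get_status` is injected rather than a named route):

```python
import time

def wait_for_run(run_id, get_status, timeout=60, interval=2):
    """Poll a run until it reaches a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status(run_id)
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    # Still running: hand the run_id back; the agent checks again next message.
    return "pending"

states = iter(["queued", "running", "completed"])  # scripted statuses for the demo
print(wait_for_run("run_abc123", lambda _rid: next(states), interval=0))  # completed
```

The `"pending"` branch is the important one: instead of blocking a chat for 20 minutes, the agent returns the `run_id` to the conversation and resumes later.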

```json
{
  "mcpServers": {
    "stekpad": {
      "command": "npx",
      "args": ["-y", "@stekpad/mcp"],
      "env": {
        "STEKPAD_API_KEY": "stkpd_live_..."
      }
    }
  }
}
```

One small block of config. Every Stekpad verb becomes a tool. The agent takes it from there.

A concrete example: memory across sessions

Here is the shape of a conversation that does not work on a legacy scraper and does work on Stekpad.

Monday. You ask Claude to scrape a list of 200 companies and put them in a dataset called Q2 targets. Claude calls scrape 200 times. Every row lands in the dataset. Credit cost: 200.

Tuesday. New Claude chat. You ask, "What companies in Q2 targets are based in Berlin?" Claude calls list_datasets, finds the Q2 targets dataset, calls query_dataset with a filter. Credit cost: 0. Reads are free.

Wednesday. You ask Claude to add a funding_round column to the Berlin rows by calling the right enricher. Claude calls enrich on the subset. Credit cost: one per row.

Thursday. A new person on your team joins the workspace. They open a new Claude chat, ask "what do we know about the Berlin companies", and Claude has the whole history without the user having to paste anything. The dataset is the memory.
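In MCP terms, Tuesday's question could compile to a tool call shaped roughly like this (the argument names `dataset` and `filter` are assumptions for illustration; only the verb `query_dataset` appears in the tool list above):

```json
{
  "tool": "query_dataset",
  "arguments": {
    "dataset": "Q2 targets",
    "filter": { "city": "Berlin" }
  }
}
```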

That conversation is impossible on a scraper that emails CSVs. It is possible-but-expensive on a scraper that returns payloads by value. It is natural on Stekpad because the storage is the product.

Reads are free. Writes cost credits. This is the single design choice that makes Stekpad usable from an agent. The incentive is "scrape once, re-query forever". The incentive on a payload-by-value API is "re-scrape every time you forget". Agents forget a lot.

Typed errors are not a nice-to-have

One more thing legacy scrapers got wrong: errors. A cron job that fails sends an email to an on-call human. An agent that gets a 500 with an HTML body has no idea what to do next.

Every Stekpad error returns a typed code: `session_unavailable`, `rate_limited`, `schema_mismatch`, `robots_blocked`, `timeout`, `dataset_full`. The agent can branch on the code. The MCP server also includes a human-readable `next_action` field — "Open Chrome with the Stekpad extension active" — so the model can surface a useful prompt back to the user without inventing one.
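What branching on a typed code looks like in practice, as a sketch. The codes and the `next_action` field come from the essay; the specific policy per code is an illustrative assumption, not the official one:

```python
def next_step(error):
    """Map a typed Stekpad error payload to the agent's next move."""
    code = error.get("code")
    if code == "rate_limited":
        return "back off and retry"
    if code == "timeout":
        return "retry once, then report"
    if code in ("robots_blocked", "dataset_full", "schema_mismatch"):
        return "stop and tell the user"
    # Otherwise, surface the server-supplied hint when one exists.
    return error.get("next_action", "ask the user how to proceed")

err = {"code": "session_unavailable",
       "next_action": "Open Chrome with the Stekpad extension active"}
print(next_step(err))  # Open Chrome with the Stekpad extension active
```

Compare that to parsing an HTML 500 page: the model has nothing to branch on, so every failure becomes a dead end or a hallucinated recovery.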

That is a small thing. It is also the thing that makes the difference between an agent that works in a demo and an agent that works for a week without a human holding its hand.

The manifesto, in one paragraph

Scraping tools were built for analysts with spreadsheets. Agents are not analysts with spreadsheets. They are stateless processes that need typed responses, persistent memory, predictable error codes, and a way to pay for exactly one call at a time. Every scraper shipped before 2024 optimises for the wrong consumer. If you are putting an agent in front of one of them, you are paying a translation tax on every call. The tax compounds. The agent gets slower, worse, and more expensive per answered question.

The fix is not "add an MCP adapter to the cron scraper". The fix is to build the tool as an agent-first product and give humans a web app on top. That is what Stekpad is. That is why we are opinionated about it.

Next steps

Stekpad Team
We build Stekpad. We scrape the web, store it, and enrich it — from an API, from an app, or from Claude.

Try the API. Free to start.

3 free runs a day on the playground. No credit card. Install MCP for Claude in 60 seconds.
