Cron jobs are the oldest pattern in data engineering. At 3am, something runs. By 7am, a dashboard is fresh. A human shows up at 9am and reads it. That loop powered the first twenty years of business intelligence, and it is still how most data teams think about scraping.
It does not match the shape of an AI product. When the consumer of your data is a language model answering a question right now, the 3am-to-9am pipeline is six hours of latency between reality and the user. You cannot ship a product on top of that. You can only ship a dashboard.
This is an architecture piece, not a feature pitch. We are going to explain why cron is the wrong primitive for agent workflows, what the right primitive looks like, and when cron is still fine. Honest answer up front: it is still fine more often than you think. Just not for the things you are probably trying to build.
What cron was built for
Cron was built for a world where the cost of a fetch was much higher than the cost of storing the result. Dialing into a remote SQL server, pulling a week of rows, transforming them, landing them in a local warehouse — that round trip was expensive in the 1990s and the 2000s, and it got slightly cheaper but not structurally different in the 2010s.
The rational response was to batch. Pay the network cost once a night. Store everything. Let the humans query the warehouse the next morning. The warehouse became the product. The cron was the loader. Stale-by-design was a feature, not a bug, because the alternative was per-request fetches on a 500ms link.
Three assumptions held that story together.
The consumer is a human. A human is slow, a human reads one dashboard a day, a human does not care if the data is eight hours old because the human is going to spend thirty minutes staring at it anyway.
The query shape is predictable. You know in advance what tables to populate. You know the schema. You know the join keys. The cron loads exactly the tables the BI tool needs.
The upstream is stable. The source of truth does not change structure between runs. You can write the loader once and forget it.
AI agents break all three.
Why agents break the cron model
Agents are not slow. A Claude session that takes 8 seconds to answer feels bad. A session that takes 30 seconds feels broken. Whatever your data pipeline does, it has to fit inside that budget. A warehouse loaded at 3am does not fit — it does not even start.
The query shape is unpredictable. A user asks the agent about a competitor you have never seen before. A user asks about a page that was published an hour ago. A user asks about a product variant that did not exist yesterday. You cannot pre-populate a table with the 10,000 URLs that might matter, because the URL the user cares about is not on your list.
The upstream is the public internet. Structures change. A pricing page you scraped last week has a different DOM this week. A site you fetched as HTML now ships as a client-side React app. The cron quietly keeps loading, the schema quietly rots, and your agent quietly lies to users for three weeks before someone notices.
Cron is a coping mechanism for slow I/O. When the I/O is no longer slow, the mechanism is in the way.
The pull-time alternative
The alternative is obvious once you say it out loud. The agent pulls. The source fetches on demand. The response comes back in under ten seconds or it comes back with a typed error the agent can handle. Nothing is pre-computed unless you explicitly asked for it.
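That contract can be sketched in a few lines. This is a minimal sketch, not the Stekpad SDK: the `fetch_fn` callable, the `Row` and `FetchError` types, and the error kinds are all hypothetical, but the shape — a hard deadline, a row on success, a typed error on failure — is the contract the rest of this piece assumes.

```python
import time
from dataclasses import dataclass

@dataclass
class Row:
    url: str
    markdown: str
    fetched_at: float  # unix timestamp of when the fetch actually ran

@dataclass
class FetchError:
    url: str
    kind: str  # e.g. "timeout", "blocked", "parse_failed"

def pull(url, fetch_fn, deadline_s=10.0):
    """Fetch on demand: return a Row inside the deadline, or a typed error."""
    start = time.monotonic()
    try:
        markdown = fetch_fn(url)
    except TimeoutError:
        return FetchError(url, "timeout")
    if time.monotonic() - start > deadline_s:
        return FetchError(url, "timeout")
    return Row(url=url, markdown=markdown, fetched_at=time.time())

# A stand-in fetcher so the sketch runs without the network.
result = pull("https://example.com/product/sku-42", lambda u: "# SKU 42")
print(type(result).__name__)  # Row
```

The point of the typed error is that the agent can branch on it in conversation ("that page timed out, want me to retry?") instead of silently serving a stale row.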
This is what a pull-time fetch looks like against Stekpad.
```bash
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/product/sku-42",
    "formats": ["markdown", "json"]
  }'
```

Three seconds later, you get a row. That row is stored in a dataset you own, so the next time the agent asks about the same SKU, you can serve it from storage or re-fetch it — your choice, per call. The dataset is a cache by default and a source of truth when you ask it to be.
The response shape:

```json
{
  "run_id": "run_01HNZ5KV40",
  "url": "https://example.com/product/sku-42",
  "status": "succeeded",
  "markdown": "# SKU 42 - Widget...",
  "json": { "sku": "sku-42", "price": 129.00, "in_stock": true },
  "credits_charged": 1,
  "fetched_at": "2026-04-14T10:44:11Z"
}
```

The fetched_at is the whole point. It is honest about when the data was produced. Your agent can reason about freshness, expose it to the user, or refuse to answer if the row is too old. You cannot do that with a cron-loaded warehouse, because the warehouse does not know what the question is.
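Because every row carries its fetch timestamp, that freshness reasoning can live in the agent itself. A minimal sketch — the function name and thresholds here are illustrative, not Stekpad defaults:

```python
from datetime import datetime, timezone

def freshness_action(fetched_at_iso, max_age_s=3600, refuse_after_s=86400):
    """Decide what the agent does with a stored row, given its age."""
    # Normalize the trailing "Z" so fromisoformat accepts it on older Pythons.
    fetched = datetime.fromisoformat(fetched_at_iso.replace("Z", "+00:00"))
    age = (datetime.now(timezone.utc) - fetched).total_seconds()
    if age <= max_age_s:
        return "serve"    # fresh enough: answer from the stored row
    if age <= refuse_after_s:
        return "refetch"  # stale: pull the URL again before answering
    return "refuse"       # too old to trust, even as a fallback
```

The same three outcomes can also be surfaced to the user ("this price is from 40 minutes ago"), which is exactly the honesty a cron-loaded table cannot offer.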
MCP is the pull-time wiring
REST works, but MCP is how agents actually consume pull-time data. The model holds a tool surface. When it needs web data, it calls a tool. The tool calls Stekpad. Stekpad returns a row. The model reads the row. The user sees the answer. No intermediate batch, no intermediate table, no intermediate cron.
```json
{
  "mcpServers": {
    "stekpad": {
      "command": "npx",
      "args": ["-y", "@stekpad/mcp"],
      "env": { "STEKPAD_API_KEY": "stkpd_live_..." }
    }
  }
}
```

Paste that into Claude Desktop's config. The model now has scrape, crawl, map, extract, and search as first-class tools. Every write call charges credits from your workspace wallet. Every read call against a dataset is free. The budget is visible to you, the latency is visible to the model, the rows are stored where you can audit them.
That is the pull-time loop. The agent asks, the verb runs, the row lands, the answer goes out. No cron, no dashboard, no 3am batch.
A worked example: pricing intelligence
Say you are building a pricing intelligence feature for a B2B SaaS app. Your users ask questions like "has competitor X changed their enterprise pricing this month" and "what's the current trial length on product Y".
The cron approach: every night, crawl 200 competitor pricing pages, parse them with a brittle DOM selector, write rows to a Postgres table. Build a UI on top. Users get one day of latency and 200 competitors of coverage. A new competitor is not on the list until someone adds them, so the first time a user asks, the answer is wrong.
The pull-time approach: users ask the agent. The agent calls extract with a schema describing the pricing fields.
```bash
curl -X POST https://api.stekpad.com/v1/extract \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/pricing",
    "schema": {
      "type": "object",
      "properties": {
        "plans": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "name": {"type": "string"},
              "monthly_price_usd": {"type": "number"},
              "trial_days": {"type": "integer"}
            }
          }
        }
      }
    }
  }'
```

5 credits, about 6 seconds. The row lands in your dataset tagged with the URL and the fetch timestamp. The agent reads the JSON and answers. Next week, the same call either re-fetches or reads from the dataset — you decide per feature.
Coverage is now "every URL the user mentions", not "200 URLs you pre-listed". Freshness is now "seconds", not "hours". The schema lives in your code, not a brittle loader script.
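That "decide per feature" choice can be a one-line policy. A sketch with made-up feature names and illustrative staleness budgets — nothing here is a Stekpad API, just the decision the agent makes before spending a credit:

```python
# Illustrative per-feature staleness budgets, in seconds.
MAX_AGE_S = {
    "pricing": 3600,                     # pricing answers should be under an hour old
    "company_registration": 7 * 86400,   # changes rarely; a week-old row is fine
}

def source_for(feature, stored_fetched_at_s, now_s):
    """Read from the owned dataset if the stored row is young enough, else re-fetch."""
    budget = MAX_AGE_S.get(feature, 0)  # unknown feature: always re-fetch
    age = now_s - stored_fetched_at_s
    return "dataset" if age <= budget else "refetch"
```

Reads from the dataset are free and re-fetches cost credits, so this table is also a cost dial: tightening a budget trades credits for freshness, per feature, without touching any pipeline.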
When cron is still fine
We do not want to oversell this. Cron is the right primitive for a lot of real work, and we ship scheduling ourselves on our Cloud plans. Here is when we would use it instead of pull-time.
The consumer is a human on a daily rhythm. If the point is a morning dashboard that a team reads at 9am, schedule a crawl, land rows in a dataset, rebuild the dashboard at 8:45am. The user is slow, the cron is appropriate. That is what POST /v1/crawl with a webhook is for.
The workload is site-wide. If you need to crawl every page of a 5,000 page documentation site to build an embedding index, do it once a week as a batch job. Pull-time makes no sense here — the consumer is an offline process, not a conversation.
The data changes less often than the user asks. If you are tracking a competitor's company registration data from OpenCorporates, that changes maybe twice a year. Caching for a week is fine. Re-fetching on every query is wasteful.
Audit and compliance. If you need to prove that you fetched a given URL at a given time for a given reason, scheduled runs with signed logs are the cleanest pattern. The cron is the evidence.
The rule of thumb: if the consumer is a human or a batch process, cron is fine. If the consumer is a live conversation with a language model, cron is in the way.
Architecturally wrong, not just slow
The easiest misreading of this argument is "cron is slow, make it faster". It is not a speed problem. A cron that runs every minute is still a cron. It still pre-computes, it still assumes you know the URL list in advance, it still stales out between runs, it still breaks the freshness contract with the user.
The right fix is a different contract. The agent pulls, the source fetches at request time, the row lands in storage as a side effect. That is what Stekpad is shaped around. Every verb is sync-first and fast enough to finish inside an agent's patience window. Every write lands in a dataset you own. Every MCP tool returns a credits_charged field so the model can budget itself.
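The credits_charged field makes that self-budgeting mechanical. A sketch of an agent-side wallet guard — the class, the limit, and the estimated costs are hypothetical; the only grounded part is that every tool response reports what it charged:

```python
class CreditBudget:
    """Track credits spent across tool calls within one agent session."""

    def __init__(self, limit):
        self.limit = limit
        self.spent = 0

    def can_afford(self, estimated_cost):
        """Check before a write call; reads from a dataset cost nothing."""
        return self.spent + estimated_cost <= self.limit

    def record(self, credits_charged):
        """Call with the credits_charged field from each tool response."""
        self.spent += credits_charged
        return self.limit - self.spent  # remaining budget

budget = CreditBudget(limit=25)
if budget.can_afford(5):        # the extract call above charged 5 credits
    remaining = budget.record(5)
```

When the budget runs dry, the agent can fall back to dataset reads or say so to the user, instead of silently failing or silently overspending.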
Cron is fine. Cron is not enough. The agent has its own clock, and your data pipeline has to run on that clock or the product does not work.
Next steps
- Read Your AI agent needs live data for the companion manifesto on pull-time contracts.
- Start at the scrape API reference and the crawl API reference to see both sides of the sync/async split.
- Install the MCP server and wire it into Claude Desktop in five minutes.
- See pricing: 300 free credits per month, PAYG packs, no subscription required.