
The authenticated scraping guide for 2026

How to scrape pages that require a login without leaking cookies to a server. A practical guide using the Stekpad cookie bridge.

Stekpad Team · 7 min read

Half the web you actually care about lives behind a login. LinkedIn profiles, Stripe dashboards, Salesforce reports, an internal admin tool your team built last year. None of it is reachable with a vanilla curl. All of it is what your team is asking you to scrape.

The standard industry answer is to hand over your cookies. You paste a session cookie into a form, the scraper vendor stores it in a database, and their servers fetch pages on your behalf. That pattern is everywhere. It is also wrong, for reasons that have nothing to do with the vendor being evil and everything to do with how cookies and session binding work.

This guide walks through how to do authenticated scraping without server-side cookies. We will show the Stekpad cookie bridge, a working curl with use_session, the request flow step by step, and the legal considerations you should read before pointing any scraper at a logged-in page.

The standard pattern, and why it fails

Here is how most scraper vendors do it. You log into LinkedIn in your browser. You open devtools, copy the li_at cookie, and paste it into a form on the vendor's dashboard. The vendor stores the cookie in their database, possibly encrypted at rest. Their scraping workers then send that cookie with HTTP requests from the vendor's own data center IPs.

This fails in at least four ways.

One, session binding. LinkedIn, Stripe, Salesforce, and every sensible auth system bind the session to more than just the cookie. IP, user agent, TLS fingerprint, sometimes a device fingerprint. When your cookie shows up from an AWS us-east-1 IP with a different user agent, the site either returns a CAPTCHA, silently shadow-bans the account, or kills the session. You now have to log back in. Your scraper is flaky, and the reason is not a bug you can fix.
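To make the failure concrete, here is a minimal server-side sketch of a binding check. Every name and signal here is illustrative; real sites combine far richer signals, but the shape is the same: a valid cookie presented from a new context is challenged anyway.

```python
import hashlib

def session_fingerprint(ip: str, user_agent: str, tls_ja3: str) -> str:
    # Hash the context the session was created in.
    return hashlib.sha256(f"{ip}|{user_agent}|{tls_ja3}".encode()).hexdigest()

# At login, the site records the fingerprint alongside the cookie.
issued = session_fingerprint("203.0.113.7", "Mozilla/5.0 ...", "771,4865-4866")

def validate(cookie_ok: bool, ip: str, user_agent: str, tls_ja3: str) -> str:
    # A replayed cookie from a new context fails even though the cookie itself is valid.
    if not cookie_ok:
        return "reject"
    if session_fingerprint(ip, user_agent, tls_ja3) != issued:
        return "challenge"  # CAPTCHA, shadow-ban, or forced re-login
    return "allow"

# Same cookie, but a data-center IP and a scraper user agent: challenged.
validate(True, "54.163.0.9", "python-requests/2.31", "771,4865")  # -> "challenge"
```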

Two, credential leakage. The vendor now has a cookie that grants full access to your account. If they are breached, your account is breached. Not "your data" — your entire account, because a live session cookie is effectively a password. You did not sign up to give them that.

Three, terms of service. Many sites explicitly ban automated access from non-user IPs. The site is not banning scraping per se — they are banning the specific pattern of a cookie appearing from a non-human location. You inherit that risk by proxy.

Four, staleness. Sessions expire. The vendor has to prompt you to refresh every few days. The refresh form is itself a credential leak surface. The whole thing is a treadmill.

The problem with server-side cookie jars is not that vendors are untrustworthy. The problem is that you cannot make a cookie behave like a session from a place that is not the session's origin.

How the cookie bridge works

Stekpad solves authenticated scraping by never taking your cookies off your machine. The cookie bridge is a Chrome extension that you install yourself. It stays connected to our backend over a workspace-scoped WebSocket. When the API needs a fetch from a domain you have a session for, we push a job to your browser. Your browser fetches the page with your real cookies, your real IP, your real TLS fingerprint, and posts the rendered HTML back.

The cookie never appears in our backend. Not in a database, not in a log file, not in a memory dump. The architecture is not "we promise not to look" — the architecture is "there is nowhere for it to go".

Here is the flow, step by step, when you call /v1/scrape with a session domain.

  1. You call POST /v1/scrape with use_session: "linkedin.com".
  2. Our backend accepts the request and looks up whether your workspace has an active extension WebSocket.
  3. If yes, we push a fetch job over that WebSocket. The job contains the URL, the format you want, and any interaction steps. No cookies, no auth tokens, no headers.
  4. Your Chrome extension receives the job inside the tab of your choice. It performs the fetch. Your browser attaches its own cookies automatically, the way it would for any other request.
  5. The extension posts back the rendered HTML and any extracted JSON over the same WebSocket.
  6. Our backend post-processes the HTML (markdown conversion, schema extraction), stores it in your dataset, and returns the response to your API call.

If step 2 fails because your extension is not connected, the API returns a structured error you can act on, not a timeout.

```json
{
  "error": {
    "code": "session_unavailable",
    "domain": "linkedin.com",
    "message": "The cookie bridge for linkedin.com is not connected.",
    "guidance": "Open Chrome with the Stekpad extension active, or remove use_session from this request."
  }
}
```

That error is a feature. You want your agent to know immediately that it is being asked to do something the current environment cannot do, rather than silently hang for 60 seconds and then return empty HTML.
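Client-side, that error code is easy to branch on. A minimal sketch of the parsing logic; the fallback strategy is yours to choose:

```python
def handle_scrape_response(body: dict) -> str:
    # Decide what to do with a /v1/scrape response body.
    err = body.get("error")
    if err is None:
        return "ok"
    if err.get("code") == "session_unavailable":
        # Actionable: reconnect the extension, or retry without
        # use_session if the page might be public.
        return f"reconnect bridge for {err['domain']}"
    return "unhandled error: " + err.get("code", "unknown")

handle_scrape_response({
    "error": {"code": "session_unavailable", "domain": "linkedin.com"}
})  # -> "reconnect bridge for linkedin.com"
```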

A working curl, start to finish

Install the Stekpad Chrome extension from the Chrome Web Store. Sign into it with the same workspace you use for your API key. Then, with that browser window open, run this from a terminal:

```bash
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://www.linkedin.com/company/stripe/",
    "formats": ["markdown", "json"],
    "use_session": "linkedin.com"
  }'
```

Watch the browser: the Stekpad extension badge will flash for a second. The request lands, the backend pushes a job, your browser fetches the profile page with your real session, and you get back a response like this.

```json
{
  "run_id": "run_01HP4Z7K21",
  "url": "https://www.linkedin.com/company/stripe/",
  "status": "succeeded",
  "markdown": "# Stripe\n\nFinancial services...",
  "json": { "name": "Stripe", "employees": "5,001-10,000", ... },
  "source": "cookie_bridge",
  "session_domain": "linkedin.com",
  "credits_charged": 1,
  "fetched_at": "2026-04-14T11:02:44Z"
}
```

The source: "cookie_bridge" field tells you this request was served by your extension, not by an anonymous fetch. Your dataset row is stamped the same way, so you can audit later which pages came from which surface.
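Auditing is then a single filter over your dataset rows. A sketch with made-up rows; the "edge" source value for anonymous fetches is illustrative, not a documented constant:

```python
rows = [
    {"url": "https://example.com/public", "source": "edge"},
    {"url": "https://www.linkedin.com/company/stripe/",
     "source": "cookie_bridge", "session_domain": "linkedin.com"},
]

# Which pages were fetched through my browser, and under which session?
bridge_rows = [r for r in rows if r.get("source") == "cookie_bridge"]
```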

What about Stripe, Salesforce, and internal tools?

The same call works for any domain where you have an active browser session. You swap use_session: "linkedin.com" for use_session: "dashboard.stripe.com" or use_session: "yourcompany.my.salesforce.com". The extension does not care what site it is. It only cares that the tab it runs the fetch in has a valid session for that origin.

This matters for the long tail. Most teams have two or three internal admin tools nobody builds integrations for. A billing console, a support dashboard, a data labeling UI. You can scrape those too, without writing a custom adapter per tool, because the cookie bridge treats them the same way as LinkedIn.

```bash
curl -X POST https://api.stekpad.com/v1/scrape \
  -H "Authorization: Bearer stkpd_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://dashboard.stripe.com/reports/revenue",
    "formats": ["json"],
    "use_session": "dashboard.stripe.com"
  }'
```

The legal considerations

Authenticated scraping lives in a grey zone. Here is what we tell every customer before they point the cookie bridge at a site.

Read the terms of service. Some sites allow automated access for the account holder's own data. Some allow read-only reporting. Some ban any automated access outright. LinkedIn in particular has been aggressive about this — the hiQ v. LinkedIn ruling is old news and the current stance is restrictive. If you are scraping a site where the terms are unclear, talk to a lawyer before you ship.

Rate-limit yourself. The cookie bridge makes you look exactly like a normal user, which is the point, but a normal user does not load 400 profile pages in three minutes. If you fetch too fast, you will trip behavioral defenses regardless of the cookie situation. Stekpad's crawl verb defaults to polite pacing and respects robots.txt. Keep those defaults unless you know what you are doing.
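If you drive the API from your own loop, polite pacing is a few lines on the caller's side too. A minimal sketch; the timing numbers are illustrative, not Stekpad's defaults:

```python
import random
import time

def paced(urls, min_gap_s: float = 8.0, jitter_s: float = 4.0):
    # Yield URLs no faster than a human might click, with random spacing
    # so the requests do not land on a metronome.
    last = 0.0
    for url in urls:
        gap = min_gap_s + random.uniform(0, jitter_s)
        wait = last + gap - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        last = time.monotonic()
        yield url

# for url in paced(profile_urls):
#     scrape(url)  # your /v1/scrape call
```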

Scrape your own data first. If you are fetching pages that represent your own account's records — your Stripe revenue, your Salesforce pipeline, your Google Analytics dashboard — you are on the strongest footing. You are just automating a view you already have permission to see. This is the 80% use case and it is boring in a good way.

Do not resell private data. Personal data belongs to people. Even if you can technically scrape a profile, GDPR and similar regimes constrain what you can do with it afterward. A lead enrichment pipeline that stores a LinkedIn profile as a column in your CRM is a processing activity you need a legal basis for.

Sometimes you do not need authentication. If the page is public, skip use_session entirely. The regular scrape endpoint will fetch it from our edge without touching your browser, and the response comes back faster because there is no WebSocket hop.
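One way to make that decision mechanical is to keep an allowlist of domains you hold sessions for, and attach use_session only on a match. A sketch with hypothetical helper names; Stekpad's own matching rules may differ:

```python
from urllib.parse import urlparse

# Domains the cookie bridge covers for this workspace (illustrative).
SESSION_DOMAINS = {"linkedin.com", "dashboard.stripe.com"}

def build_request(url: str, formats: list[str]) -> dict:
    # Attach use_session only when the host, or a parent domain of it,
    # is in the allowlist. Public pages go through the plain edge fetch.
    host = urlparse(url).hostname or ""
    payload = {"url": url, "formats": formats}
    for domain in SESSION_DOMAINS:
        if host == domain or host.endswith("." + domain):
            payload["use_session"] = domain
            break
    return payload

build_request("https://www.linkedin.com/company/stripe/", ["markdown"])
# includes use_session; an unlisted public URL would omit it
```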

If you are scraping at truly industrial scale — tens of thousands of authenticated pages per hour — the cookie bridge is not built for that. It is scoped to what a single human's browser can reasonably load. At that scale, you should be talking directly to the site's API under a partnership agreement, not running a scraper.

And if the site is aggressively hostile to any automation regardless of source, no tool fixes that. We do not offer CAPTCHA farms or fingerprint spoofing. That is not a differentiator, it is a line we do not cross.
