Batch Extract PDF Tables to Google Sheets

How it works

Step 1: Pair a runner and add the Google credential
Both http-request and google-sheets are connectors marked runnerOnly, so they execute only on a paired @jadapps/runner, not in the browser. In the runner, store an OAuth2 credential with the Sheets scope under a name like google-prod; the workflow references it by that name via credentialRef.

Step 2: Fork the blueprint
From the workflow page, fork the blueprint. The orchestrator copies the three-node chain (http-request → pdf-to-text → google-sheets) into a new private draft owned by you, wires consecutive ports, and snapshots it as version 1. A forked audit event records the source blueprint slug.

Step 3: Configure the source URL
Open the http-request node and set url to the PDF you want (method GET, default timeoutMs 60000). The response streams to disk on the runner and passes to the next node as a file.

Step 4: Set the Sheets target
Open the google-sheets node: set action to appendRows, fill in spreadsheetId, set range (e.g. Sheet1!A:Z), keep valueInputOption at USER_ENTERED so dates and numbers parse, and set credentialRef to your stored credential name.

Step 5: Run and watch the trace
Trigger the run. The orchestrator walks the topologically sorted nodes once, firing a status callback per node. Each node shows pending → running → done/error with its duration and a one-line summary; the per-node trace is persisted on the run record.

Step 6 (optional): Schedule it
The blueprint ships with scheduleCron: null (manual/webhook). To run it on a schedule, set a schedule_cron on your forked workflow; the Cloudflare cron tick scans scheduled workflows, decides which are due since last_fired_at, and enqueues a run.
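
The run loop in step 5 can be pictured with a small sketch. This is not the orchestrator's code: the ChainNode, Edge, and StepEvent shapes and the topoSort and runChain names are hypothetical. Only the topological walk, the per-node callback, and the pending → running → done/error progression come from the description above.

```typescript
// Hypothetical shapes for illustration; the real orchestrator types may differ.
type StepStatus = "pending" | "running" | "done" | "error";
interface ChainNode { id: string; run: (input: unknown) => unknown; }
interface Edge { from: string; to: string; }
interface StepEvent { nodeId: string; status: StepStatus; durationMs?: number; }

// Kahn's algorithm: repeatedly emit nodes whose inputs have all been emitted.
function topoSort(nodes: ChainNode[], edges: Edge[]): ChainNode[] {
  const indegree: Record<string, number> = {};
  for (const n of nodes) indegree[n.id] = 0;
  for (const e of edges) indegree[e.to] += 1;
  const ready = nodes.filter((n) => indegree[n.id] === 0);
  const ordered: ChainNode[] = [];
  while (ready.length > 0) {
    const node = ready.shift() as ChainNode;
    ordered.push(node);
    for (const e of edges) {
      if (e.from !== node.id) continue;
      indegree[e.to] -= 1;
      if (indegree[e.to] === 0) ready.push(nodes.filter((n) => n.id === e.to)[0]);
    }
  }
  if (ordered.length !== nodes.length) throw new Error("workflow graph has a cycle");
  return ordered;
}

// Walk the sorted nodes once and report every status change through onStep.
// In a linear chain each node simply receives the previous node's output.
function runChain(nodes: ChainNode[], edges: Edge[], onStep: (e: StepEvent) => void): unknown {
  const ordered = topoSort(nodes, edges);
  for (const n of ordered) onStep({ nodeId: n.id, status: "pending" });
  let value: unknown = undefined;
  for (const n of ordered) {
    onStep({ nodeId: n.id, status: "running" });
    const started = Date.now();
    try {
      value = n.run(value);
      onStep({ nodeId: n.id, status: "done", durationMs: Date.now() - started });
    } catch (err) {
      onStep({ nodeId: n.id, status: "error", durationMs: Date.now() - started });
      throw err; // stop at the first failing node
    }
  }
  return value;
}
```

Collecting the StepEvent values in an array gives a per-node trace of the kind the run panel renders.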

The real node chain

Exactly what the pdf-batch-extract-to-sheets blueprint runs, in order. Source: lib/orchestrator/seo-workflow-blueprints.ts.

#	Tool slug	Category	Role in chain	Key config
1	http-request	connector (runnerOnly)	GET the source PDF by URL	method=GET, url=…/document.pdf, timeoutMs=60000
2	pdf-to-text	pdf	Extract plain text from the PDF on the runner	{} (defaults)
3	google-sheets	connector (runnerOnly)	Append the extracted rows to a tab	action=appendRows, range=Sheet1!A:Z, valueInputOption=USER_ENTERED, credentialRef

Trigger, credential & tier matrix

What each requirement actually maps to in the orchestrator runtime. Verified against tool-registry.ts, tier-precheck.ts, cron/tick, and the from-blueprint route.

Concern	How it works here	Where it is enforced
Default trigger	Manual or webhook — blueprint scheduleCron is null	from-blueprint copies schedule_cron=null into the new row
Scheduling	Opt-in: set schedule_cron after forking; cron tick enqueues due runs	Cloudflare Cron Trigger → /api/orchestrator/cron/tick
Credentials	Referenced by name (credentialRef), resolved on the paired runner; never stored on the JAD server	tool-registry connector config; runner resolves the secret
Runner requirement	Both connectors are runnerOnly; they throw 'requires a paired @jadapps/runner' if run in-browser	categoryHandlers.connector in tool-executor.ts
Pro gate	http-request and google-sheets are isPro:true connectors	UI/run affordance + runner pairing
Tier precheck	precheckWorkflowTier blocks only nodes whose minTier exceeds the user tier; these connectors carry no minTier, so they do NOT trip the pro_media precheck	tier-precheck.ts (tool.minTier ?? 'free')
Fork	Copies the graph to a private draft owned by the caller, snapshots v1, writes a 'forked' audit event	POST /api/orchestrator/workflows/from-blueprint
Run trace	Per-node status/duration/summary persisted on the run record	WorkflowRunTrace[] in types.ts; onStep callbacks in workflow-runner.ts
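
The tier precheck row can be sketched as follows. This is an illustration rather than the contents of tier-precheck.ts: the ToolEntry shape, the blockedByTier name, and the two-step tier ladder are assumptions (the text above names only 'free' and 'pro_media'). The `tool.minTier ?? 'free'` fallback is the one detail taken from the table.

```typescript
// Assumed tier ladder, lowest first. Only these two names appear in the text above.
const TIER_ORDER = ["free", "pro_media"] as const;
type Tier = (typeof TIER_ORDER)[number];

// Hypothetical registry entry shape.
interface ToolEntry {
  slug: string;
  isPro?: boolean;
  runnerOnly?: boolean;
  minTier?: Tier;
}

function tierRank(tier: Tier): number {
  return TIER_ORDER.indexOf(tier);
}

// Returns the slugs whose minTier exceeds the user's tier.
// An empty array means the precheck passes.
function blockedByTier(tools: ToolEntry[], userTier: Tier): string[] {
  return tools
    .filter((tool) => tierRank(tool.minTier ?? "free") > tierRank(userTier))
    .map((tool) => tool.slug);
}
```

Because http-request and google-sheets carry no minTier, they fall back to 'free' and never appear in the blocked list, which is why isPro and the tier precheck are separate gates.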

google-sheets node — appendRows config

The outbound config fields used by node 3. Source: tool-registry.ts google-sheets entry.

Field	Value for this workflow	Notes
action	appendRows	Outbound; the registry default for action is readRange, so set this explicitly
spreadsheetId	your sheet ID (e.g. 1AbCdE…)	Empty in the blueprint — you fill it after forking
range	Sheet1!A:Z	A1 range; appendRows appends after the last populated row in this range
valueInputOption	USER_ENTERED	Parses dates/numbers/formulas; use RAW to keep literal strings
credentialRef	google-prod (your runner credential)	OAuth2 with Sheets scope, resolved on the runner
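
To make the fields concrete, here is a sketch of how they map onto the public Google Sheets API v4 spreadsheets.values.append call. Whether the runner issues exactly this request is an assumption; buildAppendRequest and the AppendRowsConfig and AppendRequest types are hypothetical names. The OAuth token that credentialRef resolves to would go into an Authorization: Bearer header, which is left out here.

```typescript
type ValueInputOption = "RAW" | "USER_ENTERED";

// Mirrors the table above; the interface name is hypothetical.
interface AppendRowsConfig {
  action: "appendRows";
  spreadsheetId: string;
  range: string;
  valueInputOption: ValueInputOption;
  credentialRef: string;
}

interface AppendRequest {
  method: "POST";
  url: string;
  body: string;
}

// Builds the HTTP request for spreadsheets.values.append without sending it.
function buildAppendRequest(cfg: AppendRowsConfig, rows: string[][]): AppendRequest {
  const url =
    "https://sheets.googleapis.com/v4/spreadsheets/" +
    encodeURIComponent(cfg.spreadsheetId) +
    "/values/" +
    encodeURIComponent(cfg.range) + // A1 range, e.g. Sheet1!A:Z
    ":append?valueInputOption=" +
    cfg.valueInputOption;
  return { method: "POST", url, body: JSON.stringify({ values: rows }) };
}
```

Each inner array in rows becomes one appended row, and valueInputOption travels as a query parameter rather than in the body.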

Cookbook

These are real, copy-paste-ready configurations for the three nodes. Field names match the tool registry exactly, so what you paste into a node's config is what the runner executes. Because the blueprint ships with empty spreadsheetId/url and an inbound readRange default on the Sheets node, the recipes below show the deltas you must set after forking.

Minimal: one PDF URL → append to a tab

Example

The smallest working setup. http-request fetches the PDF, pdf-to-text extracts it, google-sheets appends. Set only what the blueprint leaves blank.

http-request:
  method: GET
  url: https://files.example.com/2026-04-invoice.pdf
  timeoutMs: 60000

pdf-to-text:
  {}   # defaults

google-sheets:
  action: appendRows
  spreadsheetId: 1AbCdEfGhIjKlMnOpQrStUvWxYz
  range: Sheet1!A:Z
  valueInputOption: USER_ENTERED
  credentialRef: google-prod

Authenticated source PDF (bearer token)

Example

When the PDF sits behind an API, store the token as a runner credential and let http-request inject auth from it. The credential name is resolved on the runner, never sent to the JAD server.

http-request:
  method: GET
  url: https://api.vendor.com/v1/invoices/8842/pdf
  credentialRef: vendor-api      # bearer/token resolved on the runner
  headers: {"Accept":"application/pdf"}
  timeoutMs: 120000

Write into a specific tab instead of Sheet1

Example

Point the range at a named tab. appendRows appends after the last populated row within the given A1 range.

google-sheets:
  action: appendRows
  spreadsheetId: 1AbCdEfGhIjKlMnOpQrStUvWxYz
  range: "'April Invoices'!A:Z"   # outer quotes keep the A1 single quotes intact
  valueInputOption: USER_ENTERED
  credentialRef: google-prod

Keep raw strings (no date/number coercion)

Example

Statement reference numbers like 0044012 lose their leading zeros under USER_ENTERED. Switch to RAW so cells stay literal.

google-sheets:
  action: appendRows
  spreadsheetId: 1AbCdEfGhIjKlMnOpQrStUvWxYz
  range: Statements!A:Z
  valueInputOption: RAW          # literal strings, no coercion
  credentialRef: google-prod

Schedule it after forking (opt-in cron)

Example

The blueprint is manual by default. To run it daily, set schedule_cron on your forked workflow; the Cloudflare cron tick enqueues a run once one is due since last_fired_at.

# On your forked workflow row:
schedule_cron: "0 6 * * *"     # every day at 06:00

# cron/tick then:
#  - reads workflows where schedule_cron is not null
#  - skips rows not due since last_fired_at
#  - inserts a workflow_run_queue row (source: cron)
#  - stamps last_fired_at and wakes the runner
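
The "due since last_fired_at" decision can be sketched as below. This is a deliberately narrow illustration, not the cron/tick implementation: it handles only daily expressions of the form "M H * * *", it assumes the schedule is evaluated in UTC, and previousDailyFire and isDue are hypothetical names.

```typescript
// Most recent scheduled fire time at or before `now` for a daily "M H * * *" expression (UTC).
function previousDailyFire(cron: string, now: Date): Date {
  const parts = cron.trim().split(/\s+/);
  if (parts.length !== 5 || parts[2] !== "*" || parts[3] !== "*" || parts[4] !== "*") {
    throw new Error("this sketch only handles daily expressions like '0 6 * * *'");
  }
  const minute = Number(parts[0]);
  const hour = Number(parts[1]);
  const badMinute = Math.floor(minute) !== minute || minute < 0 || minute > 59;
  const badHour = Math.floor(hour) !== hour || hour < 0 || hour > 23;
  if (badMinute || badHour) throw new Error("minute or hour field is not a plain number in range");

  const fire = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate(), hour, minute)
  );
  // If today's slot is still in the future, the previous slot was yesterday.
  if (fire.getTime() > now.getTime()) fire.setUTCDate(fire.getUTCDate() - 1);
  return fire;
}

// A workflow is due when it has never fired, or last fired before the most recent slot.
function isDue(cron: string, lastFiredAt: Date | null, now: Date): boolean {
  const fire = previousDailyFire(cron, now);
  return lastFiredAt === null || lastFiredAt.getTime() < fire.getTime();
}
```

Stamping last_fired_at after enqueueing is what stops a later tick from enqueueing the same slot twice.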

Edge cases and verbatim errors

No runner paired (run from the browser)

error

Both connectors are runnerOnly. The browser fallback for the connector category throws: "<slug> is a connector and requires a paired @jadapps/runner. Configure one in the orchestrator status pill." Pair a runner before running this workflow.

credentialRef missing or unknown on the runner

fail

If google-sheets references a credential name the runner doesn't have, the appendRows call has no OAuth token and the node errors. The node turns red in the trace with the failure message; the run aborts under the default abort error policy.

Source URL returns 404 / non-PDF

404

http-request streams whatever the URL returns. A 404 page or HTML body still flows to pdf-to-text, which then fails to parse a PDF. The error surfaces on the pdf-to-text node. Validate the URL, or set the node's error policy to retry for transient 5xx/timeouts.
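
One way to catch this earlier is to check the response before it reaches pdf-to-text. The sketch below is a hypothetical pre-check, not part of the blueprint: looksLikePdf and checkFetchedPdf are illustrative names. It relies only on the fact that a PDF file begins with the ASCII bytes %PDF-.

```typescript
// ASCII codes for "%PDF-", the signature at the start of a PDF file.
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46, 0x2d];

// Strict check at offset 0; returns false for HTML error pages and empty bodies.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_MAGIC.length) return false;
  for (let i = 0; i < PDF_MAGIC.length; i++) {
    if (bytes[i] !== PDF_MAGIC[i]) return false;
  }
  return true;
}

// Fails with a message that names the real cause instead of a downstream parse error.
function checkFetchedPdf(status: number, bytes: Uint8Array): void {
  if (status < 200 || status >= 300) {
    throw new Error("source URL returned HTTP " + status);
  }
  if (!looksLikePdf(bytes)) {
    throw new Error("response body is not a PDF (missing %PDF- header)");
  }
}
```

Only the first five bytes are inspected, so the check costs nothing even for large files.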

Empty or image-only PDF (no extractable text)

warning

pdf-to-text returns an empty (or near-empty) string for scanned/image-only PDFs because it does no OCR. The chain still completes and appendRows may add a blank row. For scanned documents, run an OCR step before this chain.
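
The page does not describe how extracted text is split into rows, so the following is purely an assumption-laden sketch of one way to do it while avoiding the blank-row case. textToRows is a hypothetical helper; it treats each non-empty line as a row and splits columns on tabs or runs of two or more spaces.

```typescript
// Turns extracted plain text into rows of cells.
// Blank lines are dropped, so an image-only PDF yields an empty array.
function textToRows(text: string): string[][] {
  const rows: string[][] = [];
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue;
    // Columns are assumed to be separated by a tab or by 2+ whitespace characters.
    rows.push(trimmed.split(/\s{2,}|\t/));
  }
  return rows;
}
```

A caller can skip the append entirely when the result is empty, rather than writing a blank row.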

Wrong Sheets action (registry default is readRange)

invalid

The google-sheets registry default for action is readRange (inbound). If you forget to switch it to appendRows, the node reads instead of writes and nothing is appended. Always set action explicitly after forking.

spreadsheetId left blank

fail

The blueprint ships spreadsheetId: "". Running without filling it in means the Sheets API call has no target spreadsheet and the node errors. Fill spreadsheetId (and confirm the credential has access to it) before the first run.
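
The Sheets misconfigurations described in this section (an unknown credentialRef, action left at readRange, a blank spreadsheetId) can all be caught before the first run. The sketch below is a hypothetical preflight check, not something the orchestrator is documented to do; preflightSheetsNode and SheetsNodeConfig are illustrative names, and the list of runner credential names is assumed to be available to the caller.

```typescript
// Hypothetical view of the google-sheets node config.
interface SheetsNodeConfig {
  action: string;
  spreadsheetId: string;
  range: string;
  credentialRef: string;
}

// Returns a list of human-readable problems; an empty list means the node looks runnable.
function preflightSheetsNode(cfg: SheetsNodeConfig, runnerCredentialNames: string[]): string[] {
  const problems: string[] = [];
  if (cfg.action !== "appendRows") {
    problems.push("action is '" + cfg.action + "'; set it to appendRows to write rows");
  }
  if (cfg.spreadsheetId.trim() === "") {
    problems.push("spreadsheetId is empty");
  }
  if (cfg.range.trim() === "") {
    problems.push("range is empty");
  }
  if (runnerCredentialNames.indexOf(cfg.credentialRef) === -1) {
    problems.push("credentialRef '" + cfg.credentialRef + "' is not stored on the runner");
  }
  return problems;
}
```

Running it against a freshly forked config with no stored credentials reports three problems at once instead of one failed run per fix.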

USER_ENTERED mangles IDs and codes

warning

USER_ENTERED parses values like a human typing into a cell: 0044 becomes 44, long account numbers switch to scientific notation, and date-like strings are converted to dates. Switch valueInputOption to RAW when you need exact strings.
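
If you would rather decide per batch than per workflow, a heuristic like the one below can flag values that USER_ENTERED is likely to reinterpret. It is a rough sketch under stated assumptions: isCoercionRisk and chooseValueInputOption are hypothetical helpers, and the three patterns are illustrative rather than a complete model of how Sheets parses input.

```typescript
// Heuristic only: flags strings that would probably change under USER_ENTERED.
function isCoercionRisk(value: string): boolean {
  const v = value.trim();
  if (/^0\d+$/.test(v)) return true; // leading zeros, e.g. 0044012
  if (/^\d{16,}$/.test(v)) return true; // 16+ digits, beyond ~15 significant digits
  if (/^\d{1,4}[-\/]\d{1,2}([-\/]\d{1,4})?$/.test(v)) return true; // date-like, e.g. 2026-04-30
  return false;
}

// Falls back to RAW as soon as any cell in the batch looks risky.
function chooseValueInputOption(rows: string[][]): "RAW" | "USER_ENTERED" {
  for (const row of rows) {
    for (const cell of row) {
      if (isCoercionRisk(cell)) return "RAW";
    }
  }
  return "USER_ENTERED";
}
```

The trade-off is that RAW also stops genuine numbers and dates from parsing, so this suits tabs where exact strings matter more than numeric cells.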

http-request timeout on a large PDF

error

Default timeoutMs is 60000. A slow host or a large PDF can exceed it and the request errors. Raise timeoutMs (max 600000) on the http-request node, and consider setting its error policy to retry — the runner retries up to 3 times with linear backoff.
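
The retry policy can be sketched like this. It is an illustration, not the runner's code: withRetry and backoffDelaysMs are hypothetical names, and the sketch reads "up to 3 times" as three attempts in total with linearly growing waits between them (200 ms, then 400 ms).

```typescript
// Waits between attempts grow linearly: base, 2 * base, and so on.
// Three attempts need two waits: [200, 400].
function backoffDelaysMs(maxAttempts: number, baseMs: number): number[] {
  const delays: number[] = [];
  for (let i = 1; i < maxAttempts; i++) delays.push(i * baseMs);
  return delays;
}

// Runs `step` until it succeeds or the attempts are used up, then rethrows the last error.
// `sleep` is injectable so the policy can be exercised without real timers.
async function withRetry<T>(
  step: () => Promise<T>,
  maxAttempts: number = 3,
  baseMs: number = 200,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms))
): Promise<T> {
  const delays = backoffDelaysMs(maxAttempts, baseMs);
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await step();
    } catch (err) {
      lastError = err;
      // No wait after the final attempt.
      if (attempt < delays.length) await sleep(delays[attempt]);
    }
  }
  throw lastError;
}
```

With these defaults a step that keeps failing adds about 600 ms of waiting on top of the three request timeouts, so a raised timeoutMs still dominates the worst-case run time.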

Manual workflow never fires on its own

scheduleCron is null in the blueprint, so a forked copy will only run when you trigger it (manually or by webhook). This is expected — it is not a misconfiguration. Add a schedule_cron if you want unattended runs.

Run completes; trace shows each node done

On success the trace records each node as done with its durationMs and a summary (e.g. page count for pdf-to-text, append result for google-sheets). The output_summary captures step count, success count, and total duration on the run record.
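
As a sketch of how the run-level summary relates to the per-node trace, the reduction below computes the three values named above. The TraceEntry and OutputSummary shapes and the summarizeRun name are hypothetical; the real WorkflowRunTrace type in types.ts may carry different or additional fields.

```typescript
// Hypothetical stand-in for one entry of the per-node trace.
interface TraceEntry {
  nodeId: string;
  status: "pending" | "running" | "done" | "error";
  durationMs: number;
  summary: string;
}

// Hypothetical shape for the run-level summary.
interface OutputSummary {
  stepCount: number;
  successCount: number;
  totalDurationMs: number;
}

// Folds the trace into step count, success count, and total duration.
function summarizeRun(trace: TraceEntry[]): OutputSummary {
  let successCount = 0;
  let totalDurationMs = 0;
  for (const step of trace) {
    if (step.status === "done") successCount += 1;
    totalDurationMs += step.durationMs;
  }
  return { stepCount: trace.length, successCount, totalDurationMs };
}
```

A run where successCount is lower than stepCount points straight at the node to open in the trace.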

Frequently asked questions

What exactly does this workflow chain?

Three nodes in order: http-request (GET a PDF by URL), pdf-to-text (extract the text on the runner), then google-sheets with action: appendRows (append the result to a tab). That is the literal blueprint chain.

Does it ingest a whole folder of PDFs at once?

Not as shipped. The blueprint chain fetches a single PDF per run via http-request. To process many PDFs you trigger the workflow per file (e.g. by webhook), or add a for-each loop node around the chain yourself after forking.

Are my PDFs uploaded to JAD's servers?

No. PDF text extraction runs on your paired runner. The connectors are runnerOnly. The only data sent off-machine is the extracted output going to Google Sheets over OAuth.

Where are my Google credentials stored?

On the runner. The google-sheets node references a credential by name (credentialRef, e.g. google-prod); the runner resolves the OAuth2 token at run time. The token is not stored on the JAD/Cloudflare server.

Do I need a paid plan to run this?

Both http-request and google-sheets are Pro connectors (isPro: true) and require a paired runner. Note this is separate from the media tier-precheck: precheckWorkflowTier only blocks nodes whose minTier exceeds your tier, and these connectors carry no minTier, so they do not trip the pro_media gate.

Is this workflow cron-scheduled out of the box?

No — the blueprint sets scheduleCron: null, so a fork runs manually or by webhook. Scheduling is opt-in: set a schedule_cron on your forked workflow and the Cloudflare cron tick will enqueue runs when they are due.

How does forking work?

Forking calls the from-blueprint route, which copies the three-node chain into a new private draft owned by you, wires consecutive ports, snapshots it as version 1, and writes a forked audit event tagged with the blueprint slug.
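
The fork steps can be sketched as a pure function. Everything here is illustrative: the Blueprint and ForkedDraft shapes, the index-based edges, and the forkBlueprint name are assumptions, and the real from-blueprint route also persists rows that this sketch only returns.

```typescript
// Hypothetical shapes; the real blueprint and workflow records may differ.
interface BlueprintNode { toolSlug: string; config: Record<string, unknown>; }
interface Blueprint { slug: string; nodes: BlueprintNode[]; scheduleCron: string | null; }
interface ForkedDraft {
  ownerId: string;
  visibility: "private";
  status: "draft";
  version: number;
  nodes: BlueprintNode[];
  edges: { from: number; to: number }[];
  scheduleCron: string | null;
  audit: { event: "forked"; sourceSlug: string };
}

function forkBlueprint(bp: Blueprint, ownerId: string): ForkedDraft {
  // Copy each node so edits to the fork never touch the blueprint.
  const nodes = bp.nodes.map((n) => ({ toolSlug: n.toolSlug, config: { ...n.config } }));
  // Wire consecutive nodes: 0 → 1, 1 → 2, and so on.
  const edges: { from: number; to: number }[] = [];
  for (let i = 0; i + 1 < nodes.length; i++) edges.push({ from: i, to: i + 1 });
  return {
    ownerId,
    visibility: "private",
    status: "draft",
    version: 1, // the first snapshot
    nodes,
    edges,
    scheduleCron: bp.scheduleCron, // null stays null: manual/webhook by default
    audit: { event: "forked", sourceSlug: bp.slug },
  };
}
```

For the three-node chain this yields two edges and a draft whose schedule is still null.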

Can I see a run history / trace?

Yes. Each run records a WorkflowRunTrace[] — per-node status, duration, output size, and a one-line summary — plus an output_summary with step/success counts and total duration. The run panel renders each node pending → running → done/error as the orchestrator fires its onStep callbacks.

What happens if a PDF has no tables or no text?

pdf-to-text returns an empty or near-empty string (it does not OCR scanned images). The chain still completes; appendRows may add a blank row. For scanned PDFs, run an OCR step before this chain.

Why did numbers like account IDs change in my sheet?

valueInputOption defaults to USER_ENTERED, which parses values like a human typing — leading zeros drop and long numbers go scientific. Switch the node to RAW to keep cells as literal strings.

What if the source URL is slow or flaky?

Raise timeoutMs on the http-request node (default 60000, max 600000) and set its error policy to retry. The runner retries a failing step up to 3 times with linear backoff (200ms, 400ms) before surfacing the error.

What other workflows are like this one?

See csv-to-slack-summary for a data-clean-then-connector pattern, rss-to-notion-digest for a fetch-then-create-page chain, and video-transcode-to-r2 for a local-process-then-upload chain.

Local-first by design

This workflow executes entirely on your jadapps-runner. API keys, database credentials, and OAuth tokens are stored in an AES-GCM-encrypted vault on your device — they are never uploaded to JAD Apps' servers. The server only stores the workflow graph (the recipe), not the secrets.

