How to map a PDF survey's questions and field types
- Step 1: Open the field extractor. Go to the PDF Form Field Extractor. It runs in your browser; survey data is never uploaded.
- Step 2: Drop one copy of the survey. Any copy works, because the schema is identical across respondents. The tool runs automatically; there are no options to set.
- Step 3: Read the question schema. The preview lists each field as { name, type, value }, with the total count below. This is your variable list.
- Step 4: Download the JSON schema. Save the field map as your survey's data dictionary. Keep it with your analysis scripts.
- Step 5: Classify each question by type. Use each field's type to decide how to code the variable: numeric for Likert radio groups, categorical for single-choice, multi-response for option lists, free text for comments.
- Step 6: Capture answers and load into your tool. Read each respondent's values in a separate step, build a tidy dataset keyed by the schema's field names, then analyse in Excel, R, or pandas.
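As a rough sketch of steps 3 and 4, the downloaded schema can be loaded and summarised with the Python standard library. The inline JSON below stands in for the downloaded file and reuses the illustrative field names from this page; your survey's names will differ.

```python
import json
from collections import Counter

# Stand-in for the downloaded schema file (illustrative field names).
schema_json = """
[
  {"name": "q1_satisfaction", "type": "PDFRadioGroup", "value": ""},
  {"name": "q2_likelihood", "type": "PDFRadioGroup", "value": ""},
  {"name": "q3_value", "type": "PDFRadioGroup", "value": ""},
  {"name": "comments", "type": "PDFTextField", "value": ""}
]
"""

schema = json.loads(schema_json)

# The variable list: field names in the order the schema reports them.
variables = [field["name"] for field in schema]

# How many fields of each type the survey contains.
type_counts = Counter(field["type"] for field in schema)

# Sanity check: a schema-only export carries no answers.
assert all(field["value"] == "" for field in schema)

print(variables)
print(dict(type_counts))
```

To use a real export, replace the inline string with the contents of the saved JSON file.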
Survey question type → analysis coding
How each pdf-lib field type maps to a survey variable. The schema gives you type; you decide the coding.
| Field type | Typical survey question | Suggested coding |
|---|---|---|
| PDFRadioGroup | Likert / single-choice (e.g. 1–5 satisfaction) | Ordinal or numeric (map labels to scores) |
| PDFDropdown | Pick-one from a list (e.g. age band) | Categorical |
| PDFOptionList | Select-all-that-apply | Multi-response (one indicator per option) |
| PDFCheckBox | Yes/no or single opt-in | Binary (0/1) |
| PDFTextField | Open comment / short answer | Free text (qualitative coding) |
| PDFButton | Submit/reset | Exclude |
| PDFSignature | Consent signature | Exclude from analysis (verify separately) |
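One way to apply the table is a small lookup that turns the schema into a draft codebook. This is only a sketch: the coding labels (such as "multi_response") and the "review_manually" fallback are hypothetical names chosen for illustration, and the final coding remains your decision.

```python
# Suggested coding per field type, taken from the table above.
# The label strings are illustrative, not part of the tool's output.
CODING_BY_TYPE = {
    "PDFRadioGroup": "ordinal_or_numeric",
    "PDFDropdown": "categorical",
    "PDFOptionList": "multi_response",
    "PDFCheckBox": "binary",
    "PDFTextField": "free_text",
    "PDFButton": "exclude",
    "PDFSignature": "exclude",
}


def build_codebook(schema):
    """Return {field name: suggested coding}; unknown types are flagged for review."""
    return {
        field["name"]: CODING_BY_TYPE.get(field["type"], "review_manually")
        for field in schema
    }


# Illustrative schema entries in the tool's { name, type, value } shape.
schema = [
    {"name": "q1_satisfaction", "type": "PDFRadioGroup", "value": ""},
    {"name": "channels_used", "type": "PDFOptionList", "value": ""},
    {"name": "consent", "type": "PDFSignature", "value": ""},
]

codebook = build_codebook(schema)
print(codebook)
```

Treat the result as a starting point: a radio group holding a nominal choice, for example, should be recoded as categorical rather than numeric.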
Workflow split for survey analysis
Where this tool fits and where your analysis pipeline takes over.
| Task | This tool | Your pipeline |
|---|---|---|
| Build the variable list | Field names + types (once) | Name variables, set coding |
| Get each answer | Empty strings (schema only) | Value-read each respondent file |
| Assemble the dataset | JSON schema | Tidy table: one row per respondent |
| Compute stats | — | AVERAGE in Sheets, mean()/describe() in R/pandas |
Cookbook
From a survey's schema to a tidy dataset. Field names are illustrative; your survey's actual names appear verbatim in the JSON.
A Likert survey's schema
Three rating questions plus a comment. The radio groups confirm these are single-choice — the schema models the questions, not anyone's ratings.
[
{ "name": "q1_satisfaction", "type": "PDFRadioGroup", "value": "" },
{ "name": "q2_likelihood", "type": "PDFRadioGroup", "value": "" },
{ "name": "q3_value", "type": "PDFRadioGroup", "value": "" },
{ "name": "comments", "type": "PDFTextField", "value": "" }
]
Single-choice vs select-all-that-apply
The type tells you how to model the question. Get this wrong and you'll mis-aggregate the data.
Single-choice (radio): code as ONE categorical variable
{ "name": "primary_channel", "type": "PDFRadioGroup", "value": "" }
Select-all (option list): code as MULTIPLE indicators
{ "name": "channels_used", "type": "PDFOptionList", "value": "" }
-> channels_used_email, channels_used_phone, ...
Schema → R/pandas variable list
Use the field names as column names so your import code matches the survey exactly.
Columns (from the schema):
q1_satisfaction, q2_likelihood, q3_value, comments
pandas:
import pandas as pd
# rows come from your separate value-reading step (one sequence of answers per respondent)
df = pd.DataFrame(rows, columns=[
    'q1_satisfaction', 'q2_likelihood', 'q3_value', 'comments'])
Computing means after capture
Once values are captured into a tidy table, the schema's type tells you which columns are numeric and safe to average.
q1_satisfaction,q2_likelihood,q3_value
5,4,5
3,3,4
4,5,4
Sheets: =AVERAGE(A2:A4) -> 4.0
R: mean(df$q1_satisfaction)
The schema confirmed these are radio (single-choice) scales.
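The same calculation can be sketched in Python using only the standard library, with the schema deciding which columns to average. It assumes the radio-group answers were captured as numeric scores, which is a choice made in your value-reading step and not something the schema guarantees.

```python
import csv
import io
from statistics import mean

# Schema entries for the illustrative survey on this page.
schema = [
    {"name": "q1_satisfaction", "type": "PDFRadioGroup", "value": ""},
    {"name": "q2_likelihood", "type": "PDFRadioGroup", "value": ""},
    {"name": "q3_value", "type": "PDFRadioGroup", "value": ""},
    {"name": "comments", "type": "PDFTextField", "value": ""},
]

# The captured table from the example above (assumed to hold numeric scores).
captured_csv = """q1_satisfaction,q2_likelihood,q3_value
5,4,5
3,3,4
4,5,4
"""

rows = list(csv.DictReader(io.StringIO(captured_csv)))

# Only radio groups are treated as rating scales; text fields are skipped.
numeric_cols = [f["name"] for f in schema if f["type"] == "PDFRadioGroup"]

means = {col: mean(float(row[col]) for row in rows) for col in numeric_cols}
print(means)
```

With pandas, the equivalent is to select the same columns from the DataFrame and call mean() on them.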
Detecting a changed survey version
Re-running the schema extraction on a new survey edition surfaces added/removed questions before they corrupt your panel data.
v1 fields: q1..q5, comments
v2 fields: q1..q5, q6_nps, comments
New question 'q6_nps' added in v2. Keep v1 and v2 responses in separate frames or add the column with NA for v1 respondents.
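A version comparison like this can be automated by diffing two schema exports. The sketch below assumes both are lists in the tool's { name, type, value } shape; the helper name diff_fields and the type given to q6_nps are hypothetical.

```python
def diff_fields(old_schema, new_schema):
    """Compare two schema exports by field name and type."""
    old = {f["name"]: f["type"] for f in old_schema}
    new = {f["name"]: f["type"] for f in new_schema}
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    # Fields present in both editions whose type changed.
    retyped = sorted(n for n in old.keys() & new.keys() if old[n] != new[n])
    return added, removed, retyped


def radio(name):
    return {"name": name, "type": "PDFRadioGroup", "value": ""}


comments = {"name": "comments", "type": "PDFTextField", "value": ""}

v1 = [radio(f"q{i}") for i in range(1, 6)] + [comments]
# Assumption for illustration: the new NPS question is a radio group.
v2 = [radio(f"q{i}") for i in range(1, 6)] + [radio("q6_nps"), comments]

added, removed, retyped = diff_fields(v1, v2)
print("added:", added, "removed:", removed, "retyped:", retyped)
```

Checking for retyped fields is worth the extra line: a question that switches from radio group to option list keeps its name but needs different coding.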
Edge cases and what actually happens
It does not extract the ratings/answers
Schema only: Every value is an empty string. The tool models the survey's questions (names + types); it does not read what respondents selected or wrote. Capture answers in a separate value-reading step keyed by these field names.
Output is JSON, not CSV
JSON only: The result is a JSON array. There's no CSV/Excel export here; assemble your analysis table downstream in a script or spreadsheet using the schema as the column definition.
Likert scale labels aren't in the output
Not included: A Likert question shows as one PDFRadioGroup, but the option labels/scores (1–5) are not part of this output. You define the score mapping yourself when coding the variable.
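A score mapping of this kind can be kept as a plain dictionary in your analysis code. The five labels below are hypothetical; a PDF may also store different export values than the labels shown on the page, so check what your value-reading step actually returns.

```python
# Hypothetical 5-point agreement scale; record your survey's real labels
# in the data dictionary and mirror them here.
LIKERT_SCORES = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def to_score(label):
    """Map a captured label to its score; blank or unknown labels become None."""
    return LIKERT_SCORES.get(label)


# Example captured answers for one question (illustrative).
captured = ["Agree", "Strongly agree", "", "Neutral"]
scores = [to_score(label) for label in captured]
print(scores)
```

Returning None for unanswered or unrecognised values keeps missing data visible instead of silently counting it as a score.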
Comparing many participants
Manual: The tool reads one file per run and returns schema, not data. Extract the schema once for the variable list, capture each participant's values separately, then combine rows into a single tidy dataset for comparison.
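The combining step might look like the following sketch, which writes one row per participant using the schema's field names as columns. The captured answers and the respondent_id column are hypothetical; they stand in for whatever your separate value-reading step produces.

```python
import csv
import io

schema = [
    {"name": "q1_satisfaction", "type": "PDFRadioGroup", "value": ""},
    {"name": "q2_likelihood", "type": "PDFRadioGroup", "value": ""},
    {"name": "q3_value", "type": "PDFRadioGroup", "value": ""},
    {"name": "comments", "type": "PDFTextField", "value": ""},
]

# Buttons and signatures are not analysis variables, so they get no column.
EXCLUDED = {"PDFButton", "PDFSignature"}
columns = ["respondent_id"] + [f["name"] for f in schema if f["type"] not in EXCLUDED]

# Hypothetical output of a value-reading step: {respondent id: {field: answer}}.
captured = {
    "r001": {"q1_satisfaction": "5", "q2_likelihood": "4", "q3_value": "5", "comments": "Good"},
    "r002": {"q1_satisfaction": "3", "q2_likelihood": "3", "q3_value": "4"},
}

rows = []
for respondent_id, answers in captured.items():
    row = {"respondent_id": respondent_id}
    for col in columns[1:]:
        row[col] = answers.get(col, "")  # unanswered fields stay blank
    rows.append(row)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because every row is built from the same column list, a participant who skipped a question still gets a cell for it, which keeps the table rectangular.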
Scanned paper survey
No fields: A printed-and-scanned survey is an image with no interactive fields, so the array is empty. Use PDF OCR to recognise the marks/text, then map them to questions manually.
Multi-select option list
Expected: A select-all question is a single PDFOptionList field. The schema flags the type; you expand it into one indicator column per option during analysis. The individual option labels are not in this output.
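The expansion into indicator columns can be sketched as below. Since the option labels are not in the schema, the options list is something you supply by hand; the values here are hypothetical, as is the helper name.

```python
def expand_multi_response(field_name, options, selected):
    """Turn one select-all answer into one 0/1 indicator per option."""
    chosen = set(selected)
    return {f"{field_name}_{option}": int(option in chosen) for option in options}


# Hypothetical option labels, recorded by hand from the PDF.
options = ["email", "phone", "chat"]

# One respondent's captured selections (illustrative).
indicators = expand_multi_response("channels_used", options, ["email", "chat"])
print(indicators)
```

Passing the full options list each time means an option nobody selected still appears as a column of zeros, so respondents remain comparable.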
Pure-XFA survey form
AcroForm only: Dynamic XFA surveys store fields in an XML layer pdf-lib doesn't parse, so they may return few or no fields. AcroForm-based surveys extract reliably.
Consent signature field
Expected: A consent signature appears as PDFSignature. Exclude it from statistical analysis. To confirm it's actually signed and valid, use PDF Signature Verify.
Free tier size/page limit
Rejected: Free tier caps at 2 MB and 50 pages; Pro at 50 MB and 500 pages. Multi-page survey booklets with images can be large, so upgrade or compress with PDF Compress (lossy) if a survey is rejected.
Frequently asked questions
Does this pull the survey answers out of the PDF?
No. It extracts the survey's field schema — each question field's name and type — and returns every value as an empty string. It models the questions, not the responses. That model is what you need first: it tells you exactly which variables exist and how to code each one before you capture the actual answers in a separate step.
How does this help survey analysis if it doesn't give me data?
It gives you a correct, reusable data dictionary. The hardest part of analysing PDF surveys is mapping each question to the right variable and coding (single-choice vs multi-select vs numeric). The schema removes that guesswork — field names become your column names and the type tells you how to code each one consistently across every respondent file.
Can I calculate averages from Likert questions with this?
Not directly — the tool doesn't read the selected values. What it does is confirm a Likert question is a single-choice PDFRadioGroup, so you know to treat it as one numeric/ordinal variable. After you capture the chosen values into a table, use AVERAGE() in Sheets or mean() in R on that column.
How are single-choice vs select-all questions represented?
Single-choice questions are a single PDFRadioGroup field. Select-all-that-apply questions are typically a PDFOptionList (or several PDFCheckBox fields). The type key lets you model each correctly — radio as one categorical variable, option list as multiple indicator variables.
Will the option labels (e.g. 'Strongly agree') be in the output?
No. The output is field name + type + empty value. The labels and their numeric scores aren't included — you define that mapping yourself when coding the variable. Inspect the labels in a PDF reader once and record them in your data dictionary.
Can I compare responses from multiple participants?
Yes, with the right workflow. Extract the schema once for the variable list, then capture each participant's values in a separate value-reading step, build a tidy table (one row per participant) keyed by the schema's field names, and compare in your analysis tool. The tool itself processes one file per run.
Does it export to CSV or Excel for analysis?
No — the only output is a JSON array. Use it as the column definition and assemble your CSV/Excel dataset downstream in a script or spreadsheet.
What about a scanned paper survey?
A scanned survey has no interactive fields, so the tool returns an empty array. Use PDF OCR to recognise the text and marks, then map them to questions yourself. For pure-XFA dynamic surveys, an XFA-aware desktop tool is needed.
How do I name variables in R or pandas?
Use the field names from the schema directly. Because they're the survey's real field names, your import code will match the source exactly, avoiding mislabelled columns. If a name is awkward (e.g. a dotted, fully-qualified name), rename it in your code but keep a mapping back to the original.
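A rename with a mapping back to the original might be sketched like this. The dotted field names and the safe_name helper are hypothetical examples, and the normalisation rule (non-word characters become underscores, lowercased) is just one reasonable convention.

```python
import re


def safe_name(field_name):
    """Make a column-friendly name: non-word runs become '_', lowercased."""
    return re.sub(r"\W+", "_", field_name).strip("_").lower()


# Hypothetical fully-qualified field names from a schema export.
original_names = ["form1.page1.q1_satisfaction", "Section A.Comments"]

rename_map = {name: safe_name(name) for name in original_names}
reverse_map = {new: old for old, new in rename_map.items()}

# Two different fields must not collapse into the same column name.
if len(reverse_map) != len(rename_map):
    raise ValueError("renaming produced duplicate column names")

print(rename_map)
```

Storing reverse_map alongside the dataset lets you trace any column back to the exact field in the PDF.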
Are signature/consent fields a problem?
They appear as PDFSignature and should be excluded from statistical analysis. If you need to confirm consent was actually signed, verify it cryptographically with PDF Signature Verify — this tool only reports that the field exists.
Is my survey data uploaded?
No. Parsing runs entirely in your browser via pdf-lib; survey files never reach a server, which matters for sensitive research data. The result panel shows '0 bytes uploaded'. Only an anonymous usage counter is recorded when you're signed in.
What are the size and page limits for a survey booklet?
Free tier accepts up to 2 MB and 50 pages; Pro raises that to 50 MB and 500 pages. Image-heavy survey booklets can exceed the free limit — compress them with PDF Compress (lossy) or upgrade.
Privacy first
All PDF processing runs locally in your browser using pdf-lib and pdf.js. No file is ever uploaded; only metadata counters are saved for signed-in dashboard stats.