How to clean a survey-response csv
- Step 1Export from your survey platform — **Google Forms**: Responses tab → ⋮ →
Download responses (.csv). **Typeform**: Results → Responses →Download all responses→ CSV. **SurveyMonkey**: Analyze Results → Save As →All response data→ CSV; pick theCondensed Columnslayout for the simplest shape. **Microsoft Forms**: Responses →Open in Excel→ File → Save As → CSV UTF-8. - Step 2Drop the file onto the cleaner above — Everything runs in your browser via PapaParse. The cleaner sniffs encoding from the BOM (or its absence) and the delimiter from the first row — you don't have to tell it which platform the file came from. Survey PII (emails, free-text answers, IP addresses) never reaches a server.
- Step 3Pick the dedup mode for your data — **Exact**: only collapses byte-identical duplicate rows. **Trim**: collapses rows that differ only in leading/trailing whitespace. **Case-insensitive**: collapses on lowercase comparison —
Sue@x.commatchessue@x.com. **Normalized**: trim + lowercase + CHAR(160) fold all applied to the comparison key; output cells keep their original casing. For survey email-based dedup,case-insensitiveis the sane default. Leave itoffif your survey allowed anonymous repeat submissions intentionally. - Step 4Enable Advanced → smart-quote + hidden-whitespace if you see textareas in your form — Free-text answer columns are where curly quotes and NBSPs hide. Mobile keyboards autocorrect
"to“and”; pasted answers from email/Word/Pages bring those plus CHAR(160) NBSPs. Both options are off by default to avoid changing data unnecessarily — turn them on when your form had any textarea / paragraph question. - Step 5Set the output line-ending for your downstream tool — **CRLF** for Excel-on-Windows, QuickBooks, Power BI, most legacy importers. **LF** for Mac terminals, Python pandas (handles either), Linux awk pipelines. **Preserve** keeps whatever the input had — useful when the file is going back into the same system.
- Step 6Download and verify the row count — The output panel shows: rows in, rows out, duplicates removed, smart quotes normalized, hidden-whitespace cells folded, rows skipped from top, BOM detected (yes/no), encoding detected (
utf8/utf16le/utf16be), delimiter detected (,/\t/;). Cross-check the rows-out count against the platform's reportedtotal responsesnumber before importing to your downstream tool.
Per-platform export quirks
What each survey platform writes when you click Export, and the specific cleaner option that addresses each quirk. Information is current as of May 2026.
| Platform | Encoding | BOM | Delimiter | Quirks | Cleaner option(s) |
|---|---|---|---|---|---|
| **Google Forms** | UTF-8 | **No** — Excel on Windows mojibakes non-ASCII chars | , | Always exports as Form Responses 1.csv with the form's name as a prefix; column order matches the question order; multi-select columns hold semicolon-separated values in a single cell | outputLineEnding: crlf for Excel; cleaner output adds the missing BOM |
| **Typeform** | UTF-8 | Yes (newer exports) | , | Forms created before June 2025 have no Ending column. Multiple-choice with multi-select shows answer choices as separate columns rather than one cell. Some legacy exports prepend a single metadata row (# Exported from typeform.com on ...) before the question headers | skipRows: 1 for the legacy metadata row; inputEncoding: auto handles either BOM state |
| **SurveyMonkey** (Standard) | UTF-8 with BOM | Yes | , | Two header rows by default (question text + answer-choice subhead); Respondent ID is a 16-digit number that displays as scientific notation in Excel if opened by double-click | skipRows: 1 to drop the second header; cleaner preserves the 16-digit Respondent ID as text |
| **SurveyMonkey** (Condensed) | UTF-8 with BOM | Yes | , | Single header row; multi-select stored as comma-separated values inside one cell (use csv-column-splitter to separate after cleaning) | Same as Standard; no skipRows needed |
| **Microsoft Forms** | UTF-8 | Yes (when exported via Excel → Save As CSV UTF-8) | , | Submission timestamp uses the form owner's local timezone — different respondents reading the same Excel file may see different times based on display locale | outputLineEnding: crlf for Excel; date fields untouched (cleaner is text-only) |
| **Jotform** | UTF-8 with BOM | Yes | , | Multi-row addresses are joined with \n literals inside one cell — quote-aware parser handles these; trimWhitespace only trims cell boundaries, not internal newlines | Default options |
| **Tally** | UTF-8 with BOM | Yes | , | Includes a Submission ID column that's a 24-character ULID — text-safe in Excel because it's not all-digits | Default options |
What the cleaner does
Every cleaner option mapped to the survey-specific problem it solves. The cleaner is a single pass — all selected options apply together.
| Cleaner option | Survey-data problem it solves | Default |
|---|---|---|
trimWhitespace | Mobile keyboards add a trailing space on autocomplete; trim removes it so dedup catches john@x.com matching john@x.com | On |
removeEmptyRows | Optional skipped questions leave blank cells; rows where every cell is blank (a respondent who started but submitted nothing) get dropped | On |
repairRows | Pads short rows / truncates long rows to the header width. Triggered when an unquoted comma in a free-text answer broke a row | On |
dedupMode: case-insensitive | Email-based dedup that catches Sue@gmail.com matching sue@gmail.com. For survey responses where users submitted twice — once on each device — this is the right default | Off (must enable) |
dedupMode: normalized | Trim + lowercase + CHAR(160) fold applied to the comparison key only. Catches Sue@gmail.com (trailing space) matching sue@gmail.com. The output cells keep their original form | Off (must enable) |
skipRows: 1 | Strip the metadata row Typeform legacy exports prepend, or the second header row SurveyMonkey Standard layout writes | 0 (none) |
inputEncoding: auto | Detects UTF-8 with/without BOM, UTF-16 LE, UTF-16 BE from the file's first bytes — handles all five major platforms automatically | auto |
delimiter: auto | Sniffs comma vs semicolon (some EU-locale Typeform exports use semicolon) vs tab from the first 4 KB | auto |
normalizeSmartQuotes | Converts “ ” ‘ ’ – — from mobile autocorrect and Word/Pages paste to ASCII straight quotes / hyphens. Preserves the words, fixes the encoding | Off (must enable) |
normalizeHiddenWhitespace | Folds CHAR(160) NBSP and U+200B zero-width space (both common from paste-from-mobile and paste-from-webpage) to regular space; collapses internal runs to single space | Off (must enable) |
outputLineEnding: crlf | Windows / QuickBooks / Power BI line endings. lf for Mac terminals / pandas; preserve to keep input format | crlf |
Cookbook
Real before/after rows from real survey exports. Headers shortened for brevity.
Google Forms response with mojibake when opened in Excel-on-Windows
ExampleGoogle Forms exports UTF-8 without a BOM. Excel on Windows defaults to Windows-1252 and reads é as the two-byte sequence é. The data is fine — the encoding declaration is missing. Cleaner output adds a BOM that Excel auto-detects.
Input (Google Forms UTF-8, no BOM, opened in Excel-on-Windows): Name,Comment José Smith,Bonjour à tous ↓ Excel-on-Windows displays as: José Smith,Bonjour à tous Cleaner output (UTF-8 BOM added): Name,Comment José Smith,Bonjour à tous ← Excel reads correctly
Typeform legacy export with prepended metadata row
ExampleSome pre-2025 Typeform exports prepend a single metadata row before the actual question headers. The cleaner's skipRows: 1 drops it so row 1 of the output is the real header.
Input (Typeform legacy): # Exported from typeform.com on 2026-05-15 Submit Date,Respondent,Q1: Name,Q2: Email 2026-05-15,abc123,John Smith,john@x.com Cleaner config: skipRows: 1 Output: Submit Date,Respondent,Q1: Name,Q2: Email 2026-05-15,abc123,John Smith,john@x.com
Free-text answer with curly quotes from mobile autocorrect
ExampleMobile keyboards autocorrect straight " to curly “ ”. The data is the same; the encoding isn't. Downstream tools that do exact text matching (BI dashboards, NLP pipelines) will treat "frustrated" and “frustrated” as different strings. normalizeSmartQuotes folds them to ASCII.
Input: Email,Feedback user@x.com,"The “frustrated” user said “I don't think this works”" Cleaner config: normalizeSmartQuotes: true Output: Email,Feedback user@x.com,"The ""frustrated"" user said ""I don't think this works"""
Email duplicates with case + whitespace differences
ExampleRespondent filled the form twice — once from desktop (John.Smith@Gmail.com) and once from mobile (john.smith@gmail.com with trailing space from autocomplete). Exact dedup misses both. case-insensitive mode catches the case difference; normalized mode catches case + whitespace + NBSP.
Input: Email,Response John.Smith@Gmail.com,Yes john.smith@gmail.com ,Yes Cleaner config: dedupMode: normalized Output (first occurrence kept): Email,Response John.Smith@Gmail.com,Yes
CHAR(160) NBSP from a paste-from-email answer
ExampleRespondent copied their support ticket from email and pasted it into a textarea. Outlook/Gmail/Apple Mail wrap text with CHAR(160) NBSPs between words — invisible to the eye, visible to any string-matching pipeline. normalizeHiddenWhitespace folds them to regular space.
Input (with NBSPs shown as · for clarity — they're U+00A0 in the actual file): Email,Description user@x.com,The·error·says·"Cannot·connect·to·server" Cleaner config: normalizeHiddenWhitespace: true Output: Email,Description user@x.com,The error says "Cannot connect to server"
Empty rows from respondents who started but didn't submit
ExampleMost survey platforms record an empty row when a respondent loads the form but doesn't submit. Skip-the-empties cleanup removes these silently.
Input: Email,Response user1@x.com,Yes , , user2@x.com,No Cleaner config: removeEmptyRows: true (default) Output: Email,Response user1@x.com,Yes user2@x.com,No
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
`é` displays as `é` after opening Google Forms CSV in Excel
Encoding mismatchGoogle Forms exports UTF-8 without a BOM. Excel on Windows defaults to Windows-1252 when no encoding is declared. Cause is not the data — the file is correctly UTF-8. Two fixes: (1) re-save through Google Sheets (which writes UTF-8 with smarter Excel handling); (2) open via Excel's Data → Get Data → From File → From Text/CSV and manually pick UTF-8. The cleaner adds a BOM on output so future opens auto-detect.
SurveyMonkey Respondent ID displays as scientific notation in Excel
Excel precision corruptionSurveyMonkey writes Respondent IDs as 16-digit integers. Excel coerces 12+ digit numbers to scientific notation (1.23457E+15) and 16+ digit values get digit 16 replaced with 0 — permanently after save. SurveyMonkey's docs recommend using the Text Import Wizard with the Respondent ID column set to Text. The cleaner treats every cell as a string so the ID round-trips correctly — but if Excel was used between download and cleaner, the damage is done.
Typeform multi-select questions have one column per choice instead of one cell
Schema choiceTypeform's CSV export writes multi-select (multiple-choice with multi-select enabled) as separate columns, one per answer choice, with Yes/No or 1/0 values. The question text itself doesn't appear as a column header. If your downstream tool expects a single multi-value cell, you'll need csv-column-merger after the cleaner. For most survey-analysis tools, the separate-column shape is preferred.
Comma in a free-text answer breaks the row
Quote-stripped pasteMost survey platforms quote-wrap free-text fields correctly. The exception: when a respondent's answer was copy-pasted into the form from a source that already had RFC 4180 quoting, and the form processed the literal " characters. Result: every comma in the answer shifts subsequent cells one column right. The cleaner's repairRows: true (default on) pads short rows or trims long rows to the header width. The underlying answer text may need manual review.
Survey allowed multiple submissions intentionally — dedup deletes legitimate data
Risk of data lossIf your form allowed respondents to submit multiple times by design (longitudinal surveys, weekly check-ins, multi-stage research), any email-keyed dedup will collapse legitimate distinct responses. Use dedupMode: off for these scenarios. The cleaner's other passes (trim, smart-quote, NBSP, empty-row) are still safe to apply.
Multi-row addresses with embedded newlines
Quote-aware parser handles thisJotform and Google Forms wrap multi-line address fields with \n inside the cell. RFC 4180 requires those cells to be quoted ("..."). The cleaner's parser is quote-aware so the embedded newlines stay inside the cell. If a downstream tool fails on these cells, the cleaner's outputLineEnding setting doesn't normalise newlines *inside* quoted cells — only between rows. For full sanitisation, run csv-find-replace with regex \r?\n → on the address column.
SurveyMonkey export has two header rows (question text + answer-choice subhead)
Standard layout defaultSurveyMonkey's Standard layout writes question text on row 1 and the answer-choice subhead on row 2. The cleaner's skipRows: 1 drops row 1, leaving the subhead as the canonical header. Alternative: switch to Condensed Columns in SurveyMonkey's export options for a single-header layout — generally easier downstream.
Free-text answers contain commas inside semicolon-separated lists
Locale-specific delimiter pitfallSome EU-locale Typeform exports use ; as the field delimiter (because , is the decimal separator). The cleaner's delimiter: auto sniffs from the first row. Force delimiter: ";" if your file is small enough that the sniffer guessed wrong (1–2 columns of data can confuse heuristics).
Microsoft Forms timestamp displays differently for different respondents
Display locale, not data corruptionMicrosoft Forms writes the submission timestamp in the form owner's timezone. Excel displays dates in the local timezone of whoever opens the file. The cleaner treats every cell as text, so the timestamp string is preserved exactly — the perceived difference is downstream display behaviour. If you need a fixed timezone, format the timestamp as text inside Forms before export.
Frequently asked questions
Why does my Google Forms CSV show `é` instead of `é` in Excel?
Google Forms exports as UTF-8 without a Byte Order Mark. Excel on Windows defaults to Windows-1252 (Western European Latin-1) when no encoding indicator is present, and Windows-1252 reads the UTF-8 byte sequence for é (0xC3 0xA9) as the two characters é. The data isn't corrupted — the encoding label is missing. Three fixes, easiest first: (1) open the cleaner above, drop the file, and download — the output adds a BOM Excel auto-detects; (2) Excel's Data → Get Data → From File → From Text/CSV lets you pick UTF-8 manually; (3) re-save through Google Sheets, which writes a UTF-8 CSV Excel handles better.
How do I dedupe survey responses by email when one was submitted with capital letters?
Use the cleaner's dedupMode: case-insensitive option (under Advanced). It lower-cases each cell *only for the comparison key* — the output preserves whatever the respondent typed. If you also want to catch trailing whitespace from mobile autocomplete (Sue@x.com vs Sue@x.com), use dedupMode: normalized instead — it adds trim + NBSP folding to the comparison pipeline. Neither mode modifies the visible cell values; only the dedup decision uses the normalised form.
Why does my SurveyMonkey Respondent ID look like `1.23457E+15` in Excel?
Respondent IDs are 16-digit integers. Excel coerces any 12+ digit number to scientific notation for display, and any 16+ digit number has its 16th digit replaced with 0 permanently after save. This is Excel's documented 15-digit precision limit (IEEE 754 floating-point). The cleaner treats every cell as a string, so if you drop the SurveyMonkey CSV directly onto the cleaner *before* opening it in Excel, the IDs survive. If you've already saved through Excel, the IDs are damaged — re-export from SurveyMonkey.
Should I turn on `normalizeSmartQuotes` for my survey CSV?
Turn it on if your form had **any** free-text or paragraph question. Mobile keyboards autocorrect straight " to curly “ ”; pasted answers from email or Word bring curly quotes plus en-dashes (–) plus the horizontal ellipsis (…). Downstream tools doing exact text matching — BI dashboards, NLP sentiment pipelines, customer-support search — treat the two strings as different. Folding them to ASCII straight quotes preserves the words and fixes the encoding. Leave it off if your form was all multiple-choice or numeric input.
What about CHAR(160) non-breaking space in responses?
Same advice as smart quotes: turn normalizeHiddenWhitespace on if your form had textarea / paragraph questions. CHAR(160) NBSPs come in via three paths: (1) Outlook / Gmail / Apple Mail wrap pasted text with NBSPs between words; (2) Webpages with survive copy-paste as U+00A0; (3) Some mobile keyboards on iOS Safari insert NBSP after punctuation. All three produce strings that look identical to human eyes but never match in exact-string comparison. The cleaner folds them to regular U+0020 space.
Does the cleaner work on `.xlsx` exports from Google Forms / Typeform / SurveyMonkey?
Not directly — the cleaner parses CSV. For XLSX, use the platform's CSV export option (every platform offers it). If you only have XLSX, open it in Excel and File → Save As → CSV UTF-8 (Comma delimited). The XLSX-direct dedup tool in our JAD Excel suite is the right home for native XLSX workflows; CSV is faster and lighter-weight for one-off survey-export cleanup.
My survey allows multiple submissions — will the cleaner delete legitimate duplicates?
Only if you turn dedup on. The default mode is exact (byte-for-byte row match) which still collapses exact duplicates — change to none if your form is a weekly check-in or repeat-research design where the same email submitting twice with different answers is expected. The cleaner's other passes (trim, smart-quote, NBSP, empty-row, repair) are non-destructive — they normalise formatting without changing meaning.
Why does my Typeform CSV have a metadata row before the headers?
Some Typeform exports — particularly older forms or files generated via the API rather than the web UI — prepend a single metadata row (# Exported from typeform.com on ...) before the actual question headers. The cleaner's skipRows: 1 drops it so row 1 of the output is the canonical header. SurveyMonkey's Standard layout has the same pattern — two header rows where you usually want the second.
How do I tell which encoding my export is in before dropping it onto the cleaner?
You don't need to. inputEncoding: auto (the default) sniffs the encoding from the first bytes: UTF-8 BOM (EF BB BF), UTF-16 LE BOM (FF FE), UTF-16 BE BOM (FE FF), or content heuristics for no-BOM files. Forced override options exist if you have a specific need — but for the five major survey platforms (Google Forms, Typeform, SurveyMonkey, Microsoft Forms, Jotform), auto-detection works correctly every time.
Will survey PII (emails, free-text answers, IP addresses) be uploaded to JAD Apps?
No. PapaParse runs entirely in your browser. The CSV is never uploaded — the only thing saved server-side is a single counter (file processed, no content) for signed-in dashboard stats. You can opt out in account settings. This matters for survey data because most jurisdictions treat email addresses + free-text responses as PII under GDPR (EU), CCPA (California), CPA (Colorado), VCDPA (Virginia).
How large a survey CSV can the cleaner handle?
Free tier caps at 2 MB — roughly 8,000–15,000 responses depending on answer length. Pro removes the limit, with a practical ceiling around 5–10M rows determined by browser memory. Even the busiest survey rarely exceeds that. The downstream constraint is your BI / accounting / analysis tool, not the cleaner: SurveyMonkey's max single export is 100,000 responses (paginate beyond that); Typeform paginates at 1,000 by default.
Can I run this as part of an automated pipeline?
Yes — the same csv-cleaner is available via your local @jadapps/runner. Hit GET /api/v1/tools/csv-cleaner from any MCP/SDK client to fetch the option schema (returns the same 11 options the UI exposes), then POST the payload to your runner's 127.0.0.1:9789/v1/tools/csv-cleaner/run endpoint with your file. Survey PII never reaches JAD's servers — the cleaner runs on your machine via the runner. Pair the runner once and you can automate the daily survey-export cleanup.
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.