How to clean a database csv export
- Step 1Export from your database CLI — **PostgreSQL**:
\copy table_name TO 'file.csv' WITH (FORMAT csv, HEADER true)— comma-delimited UTF-8. **MySQL**:SELECT * FROM table_name INTO OUTFILE '/tmp/file.csv' FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n'— explicitly comma-delimited, otherwise defaults to tab. **SQLite**:.mode csvthen.headers onthen.output file.csvthen your SELECT. **mysqldump tab mode**:mysqldump --tab=/tmp --fields-terminated-by=',' database_name— writes one.txtper table. - Step 2Drop the file onto the cleaner above — PapaParse runs in your browser — database rows never reach a server. The cleaner sniffs encoding from the BOM (or absence) and delimiter from the first row. You don't have to tell it whether the file came from psql, mysql, sqlite3, or SQL Server
bcp. - Step 3For mysql output: confirm the delimiter detected — If you ran
SELECT INTO OUTFILEwithout explicitFIELDS TERMINATED BY ',', the file is tab-separated (TSV with a.csvextension). The cleaner'sdelimiter: autodetects this from the first row. The output panel shows the detected delimiter — confirm it's\trather than,before downloading if you intend to convert to comma-CSV. Use the Advanced → Delimiter override to force a specific value if auto-detection guessed wrong (rare on multi-column files). - Step 4For psql output destined for Excel: enable BOM on output — Excel on Windows can't auto-detect UTF-8 without a BOM. psql's
\copywrites UTF-8 without a BOM (good for programmatic use, bad for Excel). The cleaner's output adds a BOM by default — your accented column values (José,Müller) will display correctly when the cleaned CSV is double-clicked in Excel. - Step 5For Windows-saved files going to Linux tools: choose LF line endings — If psql ran on Windows and produced
\r\nline endings, downstream Linux/pandas pipelines may show\rartefacts ('value\r'instead of'value'). SetoutputLineEnding: lfto normalise to Unix. Conversely, for output going back to Windows / QuickBooks, leave the defaultcrlf. - Step 6Decide whether to keep NULL markers as `\N` or convert to empty — psql's text-format mode uses
\Nfor NULL; CSV mode uses empty string. mysql uses\Nin tab-format text files. sqlite3 always uses empty string in CSV mode. The cleaner doesn't auto-convert — it preserves whatever's in the cell. If your downstream tool needs empty-for-NULL, run csv-find-replace withfind: \N,replace: empty,caseSensitive: trueafter cleaning. If it needs\N-for-NULL (a pg target table), leave them.
Per-database export shapes
What each major database CLI writes by default, the platform-specific overrides, and the cleaner option that addresses each quirk. Verified against the official documentation for PostgreSQL 18, MySQL 8.4, and SQLite (latest).
| Database | Default delimiter | NULL representation | Encoding / BOM | Line endings | Common quirks |
|---|---|---|---|---|---|
**PostgreSQL** \copy ... CSV HEADER | , | Empty string (unquoted) in CSV mode; \N in text mode | UTF-8, no BOM | \n on Linux/Mac; \r\n on Windows (server-side COPY); \copy ... TO STDOUT always \n | CHAR(N) columns pad with trailing spaces; embedded newlines in TEXT values are quote-wrapped per RFC 4180 (multi-line cells are valid) |
**MySQL** SELECT INTO OUTFILE | **\t (tab)** by default — must specify FIELDS TERMINATED BY ',' for true CSV | \N (per MySQL docs, mirrors text-format) | Server's character set — utf8mb4 if configured, often utf8 or latin1 on older defaults | \n by default; explicitly settable via LINES TERMINATED BY '\r\n' | Fields **not enclosed** by default — strings containing the delimiter break rows. Use OPTIONALLY ENCLOSED BY '"' for quoting |
**MySQL** mysqldump --tab | \t | \N | Per --default-character-set; **older mysqldump defaults to utf8 which can't dump utf8mb4 correctly** — known bug, must specify --default-character-set=utf8mb4 | \n | Writes .txt per table (not .csv); paired with a separate .sql file containing CREATE TABLE |
**SQLite** .mode csv | , | Empty string | UTF-8, no BOM (Windows users report Excel mojibake) | OS native (\r\n on Windows, \n on Linux/Mac) | In SQLite 3.53+, **CLI flag order** matters: sqlite3 -csv -header works, sqlite3 -header -csv produces no headers. For dot commands inside the REPL, the conventional order is .mode csv then .headers on |
**SQL Server** bcp (queryout) | \t (tab) by default; -t, for comma | Empty string | UCS-2 / UTF-16 LE if -w flag, ASCII if -c, UTF-8 if -c -t, -C 65001 | \r\n | Without -c flag, output is binary-format (not CSV). With -c it's text; quoting only via -q and specific FORMAT options |
**SQL Server** SQLCMD -E -Q ... -o | Configurable via -s (column separator) | NULL (literal text) — confusing if downstream tools expect empty | UTF-16 LE by default | \r\n | Adds a separator-line of dashes between header and data rows — skipRows: 1 after the header to remove it |
What the cleaner does
Each cleaner option mapped to the database-export problem it solves. Single-pass — all selected options apply together.
| Cleaner option | Database-export problem it solves | Default |
|---|---|---|
delimiter: auto | MySQL default-tab vs PostgreSQL default-comma vs SQLite default-comma — sniffs from the first row so you don't have to specify per-file | auto |
inputEncoding: auto | Detects UTF-8 with BOM (Excel-saved), UTF-8 no-BOM (psql, sqlite3, mysql defaults), UTF-16 LE (bcp -w from SQL Server). No flag needed for the common cases | auto |
outputLineEnding: crlf | Convert psql-on-Linux \n output to \r\n for QuickBooks / Excel-on-Windows / Power BI ingestion | crlf |
outputLineEnding: lf | Convert mysqldump-on-Windows \r\n to \n for awk / Python pandas / DuckDB / Linux pipeline ingestion | — |
trimWhitespace | Strip trailing spaces from CHAR(N) column values (PostgreSQL pads CHAR(10) to exactly 10 chars even if the data is shorter) | On |
removeEmptyRows | Drop rows that were empty in the source — common when a UNION ALL query left placeholder rows or a partial export was concatenated | On |
repairRows | Pad short rows / trim long rows caused by SELECT INTO OUTFILE writing unquoted strings containing the delimiter character | On |
dedupMode: exact | Collapse exact-row duplicates from a UNION ALL you should have made UNION DISTINCT — or accidentally re-running the same export and concatenating | exact (legacy default) |
dedupMode: trim | Collapse duplicates that differ only by trailing-space padding from CHAR(N) columns | — |
normalizeSmartQuotes | Rare for database exports — only useful if your TEXT columns contain user-entered content with curly quotes (e.g. a comments table populated by a Word-paste workflow) | Off |
normalizeHiddenWhitespace | Rare — only useful if your TEXT columns received user input from web forms that injected CHAR(160) NBSPs | Off |
skipRows: 1 | SQL Server SQLCMD writes a separator-line of dashes between header and data — skipRows: 1 removes it. Also useful for any custom export script that prepends a -- generated by X comment row | 0 |
Cookbook
Real before/after rows from common database CLI outputs. Headers and values illustrative.
MySQL default-tab output mistaken for comma-CSV
ExampleRunning SELECT * FROM users INTO OUTFILE '/tmp/users.csv' writes a TSV (tab-separated values) — the .csv extension is just a filename choice; the delimiter is \t because MySQL defaults to tab. Excel double-click merges every column into row 1 because Excel uses comma as its default delimiter.
Source MySQL command (no FIELDS TERMINATED BY): SELECT id, name, email FROM users INTO OUTFILE '/tmp/users.csv'; Actual file contents (TSV — note the tabs): 1\tJohn\tjohn@x.com 2\tJane\tjane@x.com Excel double-click: every row appears as one merged cell. Cleaner output (auto-detects tab delimiter, outputs comma-CSV): id,name,email 1,John,john@x.com 2,Jane,jane@x.com
psql `\copy` on Windows: CRLF line endings break Linux pandas
Examplepsql on Windows writes server-side files with CRLF line endings. pandas.read_csv() accepts either, but some text-only utilities (awk, simple split calls) see \r as part of the last column value. The cleaner's outputLineEnding: lf normalises.
Source (psql on Windows, written to a file):
id,name,email\r\n
1,John,john@x.com\r\n
Python pandas reads correctly (handles either ending), but:
awk -F, '{print $3}' users.csv
→ 'john@x.com\r' ← trailing carriage return is part of the value
Cleaner config: outputLineEnding: lf
Output:
id,name,email\n
1,John,john@x.com\nPostgreSQL `CHAR(N)` column padded with trailing spaces
ExampleCHAR(10) columns in PostgreSQL pad every value to exactly 10 characters with trailing spaces. \copy writes the padded form. Downstream string comparisons (WHERE code = 'ABC' vs the padded 'ABC ') fail until trimmed.
Source PostgreSQL table:
CREATE TABLE products (code CHAR(10), name VARCHAR(100));
INSERT INTO products VALUES ('ABC', 'Widget');
psql \copy output:
code,name
"ABC ",Widget
Cleaner config: trimWhitespace: true (default)
Output:
code,name
ABC,WidgetSQLite UTF-8-no-BOM CSV opened in Excel-on-Windows
ExampleSQLite's .mode csv writes UTF-8 without a BOM. Excel on Windows defaults to Windows-1252 and mojibakes any accented character. Same issue as Google Forms and Mailchimp — the data isn't corrupted, the encoding declaration is missing.
Source SQLite query result: id,name 1,José Müller Excel-on-Windows display (Windows-1252 fallback): id,name 1,José Müller Cleaner adds BOM on output → Excel auto-detects UTF-8 → displays correctly.
psql `\N` NULL marker survives cleaner unchanged
ExamplePostgreSQL text-format COPY writes \N for NULL. CSV-format COPY writes unquoted empty string. The cleaner is text-only — it preserves whatever's in the cell. If your downstream tool needs empty-for-NULL, chain with csv-find-replace.
Source (psql in text-format COPY mode): id\tname\temail 1\tJohn\t\N 2\tJane\tjane@x.com Cleaner output (preserves `\N` verbatim): id,name,email 1,John,\N 2,Jane,jane@x.com Follow-up: csv-find-replace with find: \\N, replace: empty, caseSensitive: true → converts to empty-string NULL.
Embedded newline in `TEXT` column from unquoted MySQL output
ExampleMySQL's SELECT INTO OUTFILE without OPTIONALLY ENCLOSED BY '"' doesn't quote strings. A TEXT value containing a newline becomes a row break — every subsequent column shifts up one row. The cleaner's repairRows: true pads/trims to header width but the underlying multi-line value needs manual review.
Source MySQL row: id=42, notes='Line 1\nLine 2', email='x@y.com' Unquoted SELECT INTO OUTFILE writes: 42\tLine 1\nLine 2\tx@y.com ↓ Parser sees this as two rows: 42\tLine 1 Line 2\tx@y.com Fix at source: use OPTIONALLY ENCLOSED BY '"' or switch to mysqldump. Fix at cleaner: repairRows: true (default) pads short rows to header width — the data still needs manual review.
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
MySQL `SELECT INTO OUTFILE` default-tab delimiter mistaken for CSV
MySQL default behaviourMySQL's FIELDS TERMINATED BY defaults to \t (tab). A SELECT INTO OUTFILE 'file.csv' without explicit FIELDS TERMINATED BY ',' produces a TSV with a .csv extension. Excel double-click merges all columns. The cleaner's delimiter: auto detects this; force delimiter: ',' if your downstream needs comma-CSV.
MySQL `SELECT INTO OUTFILE` without `OPTIONALLY ENCLOSED BY '"'`
Unquoted outputBy default, MySQL writes string values **without enclosure**. Any value containing the field delimiter or a newline breaks the row. Always use OPTIONALLY ENCLOSED BY '"' (or just ENCLOSED BY '"' if you don't mind quoting numeric fields too) for safe CSV output. The cleaner can repair some damage via repairRows but multi-line embedded values need manual review.
`mysqldump --default-character-set=utf8` corrupts `utf8mb4` data
Known MySQL bug (#71746)Older mysqldump versions default to utf8 (which is actually MySQL's 3-byte alias for UTF-8 BMP only — emoji and many CJK chars need 4-byte utf8mb4). If your tables are utf8mb4 but mysqldump uses utf8, 4-byte characters get replaced with ? permanently. Always specify --default-character-set=utf8mb4 for utf8mb4 tables. The cleaner can't recover characters that were truncated at the dump step.
PostgreSQL `\copy` on Windows produces `\r\n`, breaks Linux awk
Cross-platform line endingsServer-side COPY writes \n on Linux and \r\n on Windows. Client-side \copy ... TO STDOUT always writes \n. If a Windows analyst exported via \copy directly to a file (not stdout), downstream awk -F, shells see \r as part of the last column. The cleaner's outputLineEnding: lf normalises to \n.
SQLite CLI flag order: `sqlite3 -header -csv` produces no headers
SQLite 3.53+ behaviour changeConfirmed in the SQLite forum: starting in 3.53.0, the order of CLI flags now matters. sqlite3 -csv -header db.db < query.sql outputs headers correctly; sqlite3 -header -csv db.db < query.sql does not. The workaround is to always put -csv before -header in scripts. For dot commands inside the REPL, the conventional order is .mode csv then .headers on. The cleaner can't recover a missing header — re-run the export with the corrected flag order.
PostgreSQL `\N` NULL marker in CSV-mode output
Configurationpsql's \copy ... CSV writes empty string for NULL. psql's \copy ... CSV NULL '\N' writes the literal \N. Different configs produce different output for the same data — be explicit about which one you used. The cleaner preserves the cell content verbatim; conversion between the two needs csv-find-replace.
PostgreSQL `COPY FROM` rejects mixed line endings
Re-import strictnessPer PostgreSQL docs: 'COPY FROM will complain if the line endings in the input are not all alike.' If a file has some rows ending in \n and others in \r\n (common after manual edits in different editors), re-import fails. The cleaner's outputLineEnding: crlf (or lf) normalises all rows to one ending.
SQL Server `SQLCMD` writes a separator line of dashes between header and data
Pretty-printing artefactSQLCMD -E -Q '...' -o output.csv writes a line of - characters between the header row and the data rows. This breaks any standard CSV parser. skipRows: 1 after the header drops the separator line. Better: use the -s, flag with -W (no trailing whitespace) and -h -1 (no header repeat) — or use bcp queryout for cleaner output.
SQL Server `bcp -w` writes UTF-16 LE
Encoding non-defaultbcp queryout file.csv -w produces a UTF-16 Little-Endian file with a BOM. The .csv extension is misleading — Excel and pandas both default to assuming UTF-8 and may misread. The cleaner's inputEncoding: auto detects the UTF-16 BOM and decodes correctly; the output is UTF-8 (with BOM if outputLineEnding: crlf).
Embedded carriage returns in TEXT values escape as visible `\r`
psql text-mode escapingPostgreSQL's text-format COPY escapes embedded \r and \n as the two-character sequences \r and \n. CSV-format COPY quote-wraps the entire field and embeds the raw newline. If you see literal \r in your data after a text-mode COPY, the data isn't corrupted — psql preserved the value verbatim. Re-export with WITH (FORMAT csv) to get standard CSV quoting.
Frequently asked questions
Why does my MySQL `SELECT INTO OUTFILE` produce a tab-delimited file?
Because MySQL's default for FIELDS TERMINATED BY is \t (tab), not comma. Even when you name the output file .csv, the contents are TSV. Two fixes: (1) add FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"' LINES TERMINATED BY '\n' to the export to produce true CSV; (2) drop the existing file onto the cleaner above — delimiter: auto detects the tab and the output is comma-delimited.
Why does my `psql \copy` output show `\r` characters in Linux tools?
Server-side COPY on Windows writes CRLF (\r\n) line endings. When you grep, awk, or column-slice that file on Linux, the \r becomes part of the last column's value. Three fixes: (1) use \copy ... TO STDOUT > file.csv instead of \copy ... TO 'file.csv' — STDOUT always writes \n regardless of platform; (2) run dos2unix file.csv after the export; (3) drop the file onto the cleaner with outputLineEnding: lf.
How do I handle `\N` NULL markers from psql?
Depends on what your downstream tool expects. PostgreSQL's text-format COPY writes \N for NULL; CSV-format COPY writes empty string. Downstream pandas accepts either via na_values=['\\N']. Downstream re-import to PostgreSQL with \copy ... CSV NULL '\N' accepts \N. For BI tools and Excel, convert to empty: chain the cleaner with csv-find-replace (find: \\N, replace: empty, caseSensitive: true). For schema-aware downstreams (DuckDB, BigQuery), keep \N if their loader recognises it.
Will the cleaner damage data types like dates, booleans, JSON?
No. The cleaner treats every cell as a string from input to output. Dates stay as the string the database emitted (2026-05-15 or 2026-05-15 14:32:00+00:00). Booleans stay as t/f (psql), 1/0 (mysql), or whatever your CAST chose. JSON column values stay as the literal JSON string. There's no type coercion, no precision loss, no locale-based date reformatting.
Why is my emoji column showing `?` after a mysqldump round-trip?
MySQL bug #71746: older mysqldump defaults to --default-character-set=utf8 which is MySQL's 3-byte alias (BMP only). Emoji are 4-byte characters that need utf8mb4. The emoji are replaced with ? at the dump step — the data on disk is already corrupted before the cleaner sees it. Re-run mysqldump with --default-character-set=utf8mb4. The cleaner can't recover lost characters.
Can I clean a SQL Server `bcp` output with this?
Yes, but mind the flags. bcp ... queryout file.csv -c -t, writes ASCII text with comma delimiter — cleans straightforwardly. bcp ... queryout file.csv -w -t, writes UTF-16 LE with BOM — the cleaner's inputEncoding: auto detects this and decodes to UTF-8. Native-format bcp (without -c or -w) is binary and can't be cleaned as CSV — re-run with text mode.
How does the cleaner handle multi-line values inside a TEXT/VARCHAR column?
Correctly, **if** the source CSV quote-wrapped them per RFC 4180. PostgreSQL's CSV-format COPY and mysqldump (with proper enclosure flags) both do this — multi-line values appear as "Line 1\nLine 2" and the cleaner's parser keeps the embedded newline inside the cell. The breakage case is MySQL SELECT INTO OUTFILE without OPTIONALLY ENCLOSED BY '"' — unquoted multi-line values produce row-misalignment that no parser can fully recover from. repairRows: true pads/trims to header width but the underlying value still needs manual review.
Why does my SQLite CSV export have no header row?
Most likely a CLI flag-order issue in SQLite 3.53.0+. sqlite3 -csv -header db.db < query.sql outputs headers correctly; sqlite3 -header -csv db.db < query.sql does not. Documented in the SQLite forum as a behaviour change introduced in 3.53. Fix: put -csv before -header in your script. For dot commands inside the REPL, run .mode csv first, then .headers on, then .output file.csv, then your SELECT. The cleaner can't recover a missing header — re-run the export.
Can I dedupe rows by a single column (e.g. primary key) instead of whole-row match?
Not with csv-cleaner — its dedup is whole-row. For column-based dedup use csv-deduplicator which supports a columns config and strategy: first | last. Alternatively, dedup at the source: re-run your query with SELECT DISTINCT (psql/mysql/sqlite) or GROUP BY primary_key with appropriate aggregates.
Will database PII be uploaded to JAD Apps?
No. PapaParse runs entirely in your browser. Database rows — customer names, addresses, financial values — never reach a server. The only thing saved server-side is a single counter (file processed, no content) for signed-in dashboard stats. You can opt out in account settings. For corporate compliance, every modern database export tool offers an in-VPC option (RDS Data API, psql via SSH tunnel, etc.) so the file can go directly from database to your machine without intermediate exposure.
How large a database export can the cleaner handle?
Free tier caps at 2 MB — roughly 8k–30k rows depending on column count and value length. Pro removes the limit, with a practical ceiling around 5–10M rows determined by browser memory. For multi-million-row tables, the database-side fix is faster: re-run with a smaller LIMIT for testing, then run the production export through the cleaner via the runner API (POST 127.0.0.1:9789/v1/tools/csv-cleaner/run) to stream through without the browser-memory ceiling.
Can I run this in a CI pipeline before loading data into a warehouse?
Yes — GET /api/v1/tools/csv-cleaner returns the 11-option schema; pair the @jadapps/runner once on your CI runner and POST the payload to 127.0.0.1:9789/v1/tools/csv-cleaner/run. Common pipeline: nightly pg_dump --table → cleaner (outputLineEnding: lf, delimiter: auto, trimWhitespace: true) → load into Snowflake / BigQuery / Redshift via their CSV ingest. Database rows never reach JAD's servers — the cleaner runs locally on your CI runner.
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.