How to clean a ga4 / search console csv export
- Step 1Export from your analytics surface — **GA4 standard report**: Reports → pick report → top-right ⋮ → Download file → CSV. **GA4 Explore**: Exploration → Export data → CSV or TSV. **GSC**: Performance → Top right Export → CSV (downloads a ZIP). **Looker Studio**: ⋮ on any chart → Export → CSV → choose 'Standard CSV' or 'Excel-compatible'.
- Step 2Unzip GSC's export if you took the CSV option — GSC's CSV export is a ZIP containing typically two files: the main performance data CSV (Queries / Pages / Countries / Devices / depending on what was visible) and a
Filters.csvlisting which filters were active. For most analysis you only need the data CSV. To combine both into one file for audit purposes, use csv-merger on the two files first, then drop the merged file onto the cleaner. - Step 3For GA4 standard reports: set `skipRows` to match the preamble — GA4 standard reports prepend metadata rows (property name, date range, sampling status, and at least one blank row) before the actual column headers. The pandas community recipe uses
skiprows=5— verify against your specific export by opening the first 6 lines in a text editor and counting rows above the real header. Drop the file onto the cleaner with Advanced →skipRowsset to that count. GA4 Explore exports don't have this preamble — leaveskipRows: 0for those. - Step 4Drop the file onto the cleaner — PapaParse runs in your browser — GA4 / GSC data (which can include PII like user-id custom dimensions) never reaches a server. The cleaner sniffs encoding from the BOM (Looker Studio's Excel-compatible mode includes one; GA4 / GSC default exports do not) and delimiter from the first non-skipped row.
- Step 5For URL-based dedup: enable `dedupMode: case-insensitive` — If you exported the same data across two date ranges or two report views and concatenated them, the same Page URL may appear with different casing depending on the source.
Example.com/pageandexample.com/pageare the same page for SEO purposes.case-insensitivemode collapses them. For URL + path matching that also handles trailing-slash differences (example.com/pagevsexample.com/page/), follow up with csv-find-replace with regex/$→ empty before the dedup. - Step 6For Excel-on-Windows output: leave default BOM + CRLF — The cleaner's default output is UTF-8 with BOM and CRLF line endings — exactly what Excel-on-Windows needs to read accented page titles correctly. For Python pandas / DuckDB / Linux pipeline downstream, switch to
outputLineEnding: lf(BOM is harmless either way; pandas auto-detects it).
Per-surface export shapes
What each Google analytics surface writes when you export, and the cleaner option that addresses each quirk. Row limits verified against current Google docs as of May 2026.
| Surface | Export shape | Encoding / BOM | Row limit | Common quirks | Cleaner option(s) |
|---|---|---|---|---|---|
| **GA4 Standard Report** | Single CSV, comma-delimited | UTF-8, no BOM | **100,000 rows** (verified per Google docs) | Prepends a metadata block (property name, date range, sampling status, blank, then headers); pandas community recipe is skiprows=5 but verify the count for your specific report. (not set) placeholders for un-tracked dimensions | skipRows: 5 (typical) for the preamble |
| **GA4 Explore (TSV/CSV)** | Single TSV or CSV — chosen at export time | UTF-8, no BOM | **10,000,000 cells** for unsampled explorations (= ~50,000 rows × 20 columns) | No metadata preamble (unlike standard reports). TSV is tab-delimited; the cleaner auto-detects via delimiter: auto | skipRows: 0; auto delimiter |
| **GA4 Direct-to-Google-Sheets** | Live Google Sheet (re-exportable as CSV from Sheets) | UTF-8 (no BOM until Excel-saved) | Same 100k row cap as standard reports | Sheets export adds its own BOM; double round-trip through Sheets occasionally adds CHAR(160) NBSPs around merged dimension values | normalizeHiddenWhitespace if NBSPs appear |
| **Search Console UI export** | ZIP with 2 CSVs: Queries.csv (or similar) + Filters.csv | UTF-8 with BOM | **1,000 rows per report** through the UI — applies to every dimension (Queries / Pages / Countries / Devices) | The 1,000-row cap is shared between display and export — you can't export more by changing date range. For more, use the GSC API or BigQuery Bulk Export | csv-merger combines the two ZIP files into one for analysis — it does **not** bypass the 1,000-row source cap, only the GSC API or BigQuery Bulk Export does. dedupMode: case-insensitive for URL casing across concatenated exports |
| **Search Console Bulk Export → BigQuery** | Daily streamed dataset, not a CSV | — | No row limit | Not a CSV workflow — the cleaner doesn't apply. Export from BigQuery as CSV first if you need a flat file | — |
| **Looker Studio chart download** | Single CSV per chart | UTF-8, no BOM by default; **'Excel-compatible' option adds the BOM** | **750,000 rows per chart download** | If 'Excel-compatible' is selected, BOM is present; if not, Excel mojibakes accented values. Multi-page reports require one download per page | inputEncoding: auto handles either |
What the cleaner does
Each cleaner option mapped to the GA4 / GSC / Looker Studio problem it solves. Single-pass — all selected options apply together.
| Cleaner option | Analytics-export problem it solves | Default |
|---|---|---|
skipRows: 5 | **The canonical GA4 standard-report fix**: strips the 5-row metadata preamble (property + date range + sampling notice + blank + headers). Equivalent to pandas' pd.read_csv('GA-export.csv', skiprows=5) recipe | 0 |
inputEncoding: auto | Detects BOM (Looker Studio Excel-compatible, GSC) or no-BOM (GA4 standard, Looker Studio default, GA4 Explore). Handles all surfaces without manual config | auto |
outputLineEnding: crlf (default) | Excel-on-Windows compatibility — pairs with the BOM the cleaner adds on output. For pandas / DuckDB / Linux downstream, switch to lf | crlf |
normalizeHiddenWhitespace | Folds CHAR(160) NBSPs that GA4's (not set) and Looker Studio's merged-dimension rendering can introduce. Especially common in page-title columns where the source page used for visual alignment | Off (must enable) |
normalizeSmartQuotes | Useful for page-title and meta-description columns that include CMS-styled curly quotes. Common with news sites and blogs | Off (must enable) |
dedupMode: case-insensitive | URL casing varies across GA4 report views — example.com/Page from one view, example.com/page from another. Case-insensitive dedup catches both as the same URL | Off (must enable) |
dedupMode: normalized | Trim + lowercase + NBSP fold for the comparison key. Catches URL casing + trailing-NBSP edge cases together | Off (must enable) |
trimWhitespace | Removes trailing whitespace from dimension values — GA4 occasionally pads on export, GSC trims them automatically | On |
removeEmptyRows | Drops rows that GA4 includes for (not set) placeholder values when the column is otherwise empty. Use sparingly — sometimes (not set) rows are meaningful traffic-source signals | On |
repairRows | Pads short rows from a partial GA4 Explore export where some dimension columns were truncated | On |
delimiter: auto | Detects tab (GA4 Explore TSV) vs comma (everything else) automatically. Force \t if you renamed an Explore TSV to .csv and the sniffer guessed wrong | auto |
Cookbook
Real before/after rows from real GA4 / Search Console exports.
GA4 standard report metadata preamble (typically 5 rows)
ExampleThe canonical GA4 export problem. Pandas community recipe uses skiprows=5. The cleaner's skipRows does the same thing in a drop-on-file workflow. Verify the exact row count for your specific report by opening the first 6 lines in a text editor — the preamble structure can vary slightly by report type.
Input (GA4 standard report CSV, typical structure): # Property: example.com - GA4 # Date range: Apr 1, 2026 - Apr 30, 2026 # Sampling: 100% of sessions Page path,Active users,Sessions /blog/seo-tips,12453,15892 /products,8745,11203 Cleaner config: skipRows: 5 (Advanced — verify for your report) Output: Page path,Active users,Sessions /blog/seo-tips,12453,15892 /products,8745,11203
GSC ZIP with two CSVs — concatenate then dedupe
ExampleGSC's CSV export is a ZIP containing a main data CSV and a Filters.csv. For unified analysis, merge them, then dedupe on URL (case-insensitive) to catch the same page appearing with different casing in different report views.
Inputs (from GSC ZIP):
Queries.csv (Top queries — 1,000 rows)
Filters.csv (Active filters — 3 rows)
Workflow:
1. csv-merger combines into one file
2. csv-cleaner with dedupMode: case-insensitive on URL column
→ collapses example.com/Page ↔ example.com/page duplicatesExcel mojibake from GA4's no-BOM UTF-8
ExampleGA4 writes UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 and mojibakes accented page titles (José's Blog Post → José's Blog Post). Cleaner adds the BOM.
Source (GA4 CSV, opened in Excel-on-Windows): Page title,Sessions José's Blog Post,1245 Müller's Travel Notes,892 Cleaner adds UTF-8 BOM → Excel auto-detects → displays: Page title,Sessions José's Blog Post,1245 Müller's Travel Notes,892
URL casing variance across GA4 report views
ExampleGA4's Pages and Landing Pages reports occasionally produce the same URL with different casing depending on which view it came from. case-insensitive dedup catches the duplicates.
Input (concatenated Pages + Landing Pages exports): URL,Sessions example.com/Blog,1234 example.com/blog,5678 example.com/PRODUCTS,432 example.com/products,891 Cleaner config: dedupMode: case-insensitive on URL Output (first occurrence wins — keeps original casing in output): URL,Sessions example.com/Blog,1234 example.com/PRODUCTS,432
Looker Studio multi-page chart concatenation
ExampleLooker Studio caps at 750,000 rows per chart download. For larger reports, you download multiple times with date-range filters. Concatenate and dedupe to produce the full unified dataset.
Inputs (3 chart downloads, 500k rows each from Looker Studio):
chart1-jan-mar.csv
chart2-apr-jun.csv
chart3-jul-sep.csv
Workflow:
1. csv-merger combines (1.5M rows total)
2. csv-cleaner with dedupMode: exact (catches any
overlap between adjacent date-range exports)
3. Output: unified UTF-8 BOM CSV for Excel /
UTF-8 no-BOM for pandas (set outputLineEnding: lf)GSC query column with NBSP from special-character keyword
ExampleSearch queries that users typed using certain mobile keyboards (notably iOS Safari around punctuation) can include CHAR(160) NBSPs. The cleaner folds them so duplicate queries from different mobile inputs collapse on dedup.
Input (GSC Queries.csv, NBSP shown as ·): Query,Clicks,Impressions "best·practices",42,310 "best practices",28,245 "how to·clean csv",15,98 Cleaner config: normalizeHiddenWhitespace: true dedupMode: case-insensitive Output: Query,Clicks,Impressions best practices,42,310 how to clean csv,15,98 (Clicks/impressions need to be SUMMED, not first-wins — use csv-deduplicator with strategy: aggregate after the cleaner, OR sum manually in Excel.)
Errors and edge cases
Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.
GA4 metadata preamble corrupts pandas `read_csv` without `skiprows`
GA4 standard-report behaviourGA4 standard reports include metadata rows before the actual data: typically property name, date range, sampling status, blank row, then headers. The pandas community recipe — pd.read_csv('GA-export.csv', skiprows=5) — handles the common case but the exact count can vary slightly; verify by opening the first 6 lines in a text editor. The cleaner's skipRows option does this in a drop-on-file pass. GA4 Explore exports do NOT have this preamble — use skipRows: 0 for those.
Search Console UI caps every export at 1,000 rows
GSC hard limitVerified per Google docs: the GSC UI displays and exports a maximum of 1,000 rows per dimension (Queries, Pages, Countries, Devices). No filter or date-range trick exceeds this. For more rows, use the GSC API (no documented row cap) or the Bulk Export to BigQuery (streamed daily). The cleaner can't recover rows that never made it into the CSV — re-export with a different tool if you need >1,000 rows.
GA4 standard reports cap at 100,000 rows
GA4 documented limitStandard report downloads cap at 100,000 rows (raised from earlier 5,000 cap). Explore reports cap at 10,000,000 cells. For deeper analysis, switch from Standard reports to Explore — fewer dimensions per row but far more rows. Or use the GA4 Data API (no documented row cap). The cleaner is text-only — it processes whatever rows make it into the file.
Excel mojibake from GA4 UTF-8-no-BOM
Encoding mismatchGA4 writes UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 → accented page titles mojibake (José → José). Same root cause as the Google Forms case in the survey-response spoke. Cleaner adds BOM on output (default). For pandas downstream, no BOM needed — switch to outputLineEnding: lf and the BOM is still harmless.
Looker Studio default CSV lacks BOM; 'Excel-compatible' option includes it
User configurationLooker Studio's chart download has two CSV options: 'Standard CSV' (UTF-8 no BOM, optimised for programmatic use) and 'Excel-compatible' (UTF-8 with BOM). Many users select 'Standard CSV' by default and then complain about Excel mojibake. The cleaner detects either and outputs UTF-8 with BOM by default.
GA4 `(not set)` placeholder in dimension columns
GA4 default behaviourGA4 writes the literal string (not set) for any dimension value it couldn't track (e.g. Source/Medium for direct traffic, page-title for some events). The cleaner preserves these as cell content — they're meaningful (high (not set) rates indicate tracking issues). If you want to filter these rows out, chain with csv-column-filter (column: Source, operator: not_equals, value: (not set)).
GA4 sampling notice in the metadata preamble
Data-quality flagIf GA4 reduces row 3 to # Sampling: X% of sessions (where X < 100), your export is based on a sample, not full data. The cleaner's skipRows: 5 discards this notice — but you should NOT analyse the result as if it were complete. Switch the source GA4 report to a property/date-range that returns 100% sampling, or accept the sampled result and document the X% in your downstream report.
GSC ZIP export contains two CSVs with different schemas
Multi-file exportGSC's CSV export ZIP contains Queries.csv (or Pages.csv etc.) plus a separate Filters.csv listing the active filters. These have completely different schemas — concatenating them produces a mis-aligned file. For audit purposes, treat them as two separate datasets. The cleaner handles each individually.
Looker Studio multi-page reports require per-page exports
Manual concatenationLooker Studio's chart download is per-chart. A 5-page report with 3 charts per page = 15 downloads. Concatenate with csv-merger, then cleaner-dedupe on whatever the natural key is (URL, Date, Query). For automated multi-chart export, use the Looker Studio API or Google Apps Script.
Tab-delimited GA4 Explore renamed to `.csv`
Extension vs content mismatchGA4 Explore lets you choose TSV (tab-separated) or CSV (comma-separated) at export time. Users sometimes pick TSV and then rename the file to .csv thinking the extension determines the format. The actual content is tab-delimited. The cleaner's delimiter: auto detects tab from the first row and parses correctly regardless of extension.
Frequently asked questions
Why does pandas misread my GA4 CSV export?
GA4 standard reports prepend a metadata block before the actual data: property name, date range, sampling status, a blank row, then your column headers. pd.read_csv('GA-export.csv') without skiprows treats the first metadata row as the header — every subsequent data row is misaligned. The pandas community recipe is pd.read_csv('GA-export.csv', skiprows=5) — works for most reports, but the exact count can vary by report type, so verify by opening the first 6 lines in a text editor before relying on it. The cleaner's skipRows option does this in a drop-on-file pass. GA4 Explore exports don't have this preamble — use skipRows: 0 there.
Can I export more than 1,000 rows from Google Search Console?
Not through the UI — the 1,000-row cap applies to every dimension (Queries, Pages, Countries, Devices) regardless of date range or filters. For more rows, use the GSC API (no documented row cap; programmatic), the Bulk Export to BigQuery (daily streamed dataset), or third-party tools like Search Analytics for Sheets (~25,000 rows per query). The cleaner can't recover rows that never made it into the CSV — pick a tool that exports more, then drop on the cleaner.
What's the difference between GA4 Standard Reports and Explore exports?
**Standard Reports** are pre-built (Acquisition, Engagement, Demographics, etc.) with a 100,000-row CSV cap and the 5-row metadata preamble. **Explore** is the custom-report builder with up to 10,000,000 cells per export (≈ 50,000 rows × 20 columns for unsampled data) and no metadata preamble. For deeper analysis or larger row counts, use Explore. For quick stakeholder shareable views, use Standard. The cleaner handles both — set skipRows: 5 for Standard, 0 for Explore.
Why does Excel mojibake my GA4 / GSC export?
GA4 and GSC default exports write UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 when no encoding marker is present, mojibaking accented characters (é → é, ü → ü). The cleaner adds the UTF-8 BOM on output by default — equivalent to Looker Studio's 'Excel-compatible' export option or pandas' encoding='utf-8-sig'. After cleaning, Excel auto-detects UTF-8 and displays correctly.
Should I use `dedupMode: case-insensitive` on URL columns?
Yes if you're concatenating multiple GA4 reports or GA4 + GSC outputs. URL casing varies across report views and platforms — GA4's Landing Pages report may show example.com/Blog while the Pages report shows example.com/blog for the same canonical URL. SEO analysis treats these as one page; the cleaner's case-insensitive mode collapses them. Skip if your data is from a single export and casing isn't varying.
How do I clean GSC's two-CSV ZIP export?
Three steps: (1) extract the ZIP — you'll get Queries.csv (or Pages.csv, etc.) and Filters.csv; (2) use csv-merger if you want a unified audit file, or process them separately; (3) drop on the cleaner with default options (GSC's CSVs already include the UTF-8 BOM, so encoding is fine; the value-add is dedup if you concatenated, plus trim/empty-row removal). The Filters.csv is usually tiny (a handful of rows showing active filters at export time).
Will the cleaner handle GA4 Explore's TSV format?
Yes. The cleaner's delimiter: auto detects tab from the first row and parses correctly regardless of file extension. If your Explore export has a .csv extension but tab-delimited content (a common user mistake), the cleaner still reads it correctly. The output is comma-delimited UTF-8 by default — convert to TSV by specifying delimiter: \t in Advanced if you want to round-trip back to a TSV-only consumer.
Why does my GA4 export show `(not set)` everywhere?
(not set) is GA4's placeholder for dimension values it couldn't capture. Common causes: direct traffic for Source/Medium; page_view events fired before page-title was set; geographic data when the IP can't be resolved to a region. The cleaner preserves (not set) as cell content — it's a meaningful data-quality signal. To filter these rows out, chain with csv-column-filter (operator: not_equals, value: (not set)).
Can I dedupe SEO data by URL+date instead of whole row?
Not with csv-cleaner — its dedup is whole-row comparison. For multi-column dedup keys (URL + Date is common for SEO time-series), use csv-deduplicator with columns: URL,Date and strategy: first (or last). Combine with the cleaner: cleaner first (trim, encoding, normalize), then csv-deduplicator with the composite key.
Will analytics data be uploaded to JAD Apps?
No. PapaParse runs entirely in your browser. Analytics data — page URLs, search queries, user counts, custom dimensions (which may include user IDs) — never reaches a server. The only thing saved server-side is a single counter (file processed, no content) for signed-in dashboard stats. You can opt out in account settings. This matters because GA4 custom dimensions can carry PII (logged-in user IDs, email hashes, customer segments).
How large a GA4 / GSC export can the cleaner handle?
Free tier caps at 2 MB — comfortable for any single GSC export (1,000 rows × ~10 columns) and most GA4 standard reports. Pro removes the limit (5–10M rows browser-memory bound). For multi-million-row GA4 Explore exports or BigQuery-sourced datasets, run via the runner API (POST 127.0.0.1:9789/v1/tools/csv-cleaner/run) for streaming throughput. GA4 standard reports cap at 100k rows and GSC UI at 1,000 rows, so file size isn't usually the constraint at the source — it's at the cleaner's free tier.
Can I run this as part of a scheduled SEO reporting pipeline?
Yes — GET /api/v1/tools/csv-cleaner returns the 11-option schema; pair the @jadapps/runner once and POST the payload to 127.0.0.1:9789/v1/tools/csv-cleaner/run. Common pipeline: scheduled GA4 Data API + GSC API pulls → cleaner (skipRows: 5 for GA4 standard reports, outputLineEnding: lf, normalizeHiddenWhitespace: true) → Snowflake / BigQuery / pandas → Looker dashboard. Analytics data never reaches JAD's servers — the cleaner runs locally on your runner.
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.