Clean a Google Analytics 4 / Search Console CSV Export

How to clean a ga4 / search console csv export

Step 1
Export from your analytics surface — GA4 standard report: Reports → pick report → top-right ⋮ → Download file → CSV. GA4 Explore: Exploration → Export data → CSV or TSV. GSC: Performance → Top right Export → CSV (downloads a ZIP). Looker Studio: ⋮ on any chart → Export → CSV → choose 'Standard CSV' or 'Excel-compatible'.
Step 2
Unzip GSC's export if you took the CSV option — GSC's CSV export is a ZIP containing typically two files: the main performance data CSV (Queries / Pages / Countries / Devices / depending on what was visible) and a Filters.csv listing which filters were active. For most analysis you only need the data CSV. To combine both into one file for audit purposes, use csv-merger on the two files first, then drop the merged file onto the cleaner.
Step 3
For GA4 standard reports: set `skipRows` to match the preamble — GA4 standard reports prepend metadata rows (property name, date range, sampling status, and at least one blank row) before the actual column headers. The pandas community recipe uses skiprows=5 — verify against your specific export by opening the first 6 lines in a text editor and counting rows above the real header. Drop the file onto the cleaner with Advanced → skipRows set to that count. GA4 Explore exports don't have this preamble — leave skipRows: 0 for those.
Step 4
Drop the file onto the cleaner — PapaParse runs in your browser — GA4 / GSC data (which can include PII like user-id custom dimensions) never reaches a server. The cleaner sniffs encoding from the BOM (Looker Studio's Excel-compatible mode includes one; GA4 / GSC default exports do not) and delimiter from the first non-skipped row.
Step 5
For URL-based dedup: enable `dedupMode: case-insensitive` — If you exported the same data across two date ranges or two report views and concatenated them, the same Page URL may appear with different casing depending on the source. Example.com/page and example.com/page are the same page for SEO purposes. case-insensitive mode collapses them. For URL + path matching that also handles trailing-slash differences (example.com/page vs example.com/page/), follow up with csv-find-replace with regex /$ → empty before the dedup.
Step 6
For Excel-on-Windows output: leave default BOM + CRLF — The cleaner's default output is UTF-8 with BOM and CRLF line endings — exactly what Excel-on-Windows needs to read accented page titles correctly. For Python pandas / DuckDB / Linux pipeline downstream, switch to outputLineEnding: lf (BOM is harmless either way; pandas auto-detects it).

Per-surface export shapes

What each Google analytics surface writes when you export, and the cleaner option that addresses each quirk. Row limits verified against current Google docs as of May 2026.

Surface	Export shape	Encoding / BOM	Row limit	Common quirks	Cleaner option(s)
GA4 Standard Report	Single CSV, comma-delimited	UTF-8, no BOM	100,000 rows (verified per Google docs)	Prepends a metadata block (property name, date range, sampling status, blank, then headers); pandas community recipe is `skiprows=5` but verify the count for your specific report. `(not set)` placeholders for un-tracked dimensions	`skipRows: 5` (typical) for the preamble
GA4 Explore (TSV/CSV)	Single TSV or CSV — chosen at export time	UTF-8, no BOM	10,000,000 cells for unsampled explorations (= ~50,000 rows × 20 columns)	No metadata preamble (unlike standard reports). TSV is tab-delimited; the cleaner auto-detects via `delimiter: auto`	`skipRows: 0`; auto delimiter
GA4 Direct-to-Google-Sheets	Live Google Sheet (re-exportable as CSV from Sheets)	UTF-8 (no BOM until Excel-saved)	Same 100k row cap as standard reports	Sheets export adds its own BOM; double round-trip through Sheets occasionally adds CHAR(160) NBSPs around merged dimension values	`normalizeHiddenWhitespace` if NBSPs appear
Search Console UI export	ZIP with 2 CSVs: `Queries.csv` (or similar) + `Filters.csv`	UTF-8 with BOM	1,000 rows per report through the UI — applies to every dimension (Queries / Pages / Countries / Devices)	The 1,000-row cap is shared between display and export — you can't export more by changing date range. For more, use the GSC API or BigQuery Bulk Export	csv-merger combines the two ZIP files into one for analysis — it does not bypass the 1,000-row source cap, only the GSC API or BigQuery Bulk Export does. `dedupMode: case-insensitive` for URL casing across concatenated exports
Search Console Bulk Export → BigQuery	Daily streamed dataset, not a CSV	—	No row limit	Not a CSV workflow — the cleaner doesn't apply. Export from BigQuery as CSV first if you need a flat file	—
Looker Studio chart download	Single CSV per chart	UTF-8, no BOM by default; 'Excel-compatible' option adds the BOM	750,000 rows per chart download	If 'Excel-compatible' is selected, BOM is present; if not, Excel mojibakes accented values. Multi-page reports require one download per page	`inputEncoding: auto` handles either

What the cleaner does

Each cleaner option mapped to the GA4 / GSC / Looker Studio problem it solves. Single-pass — all selected options apply together.

Cleaner option	Analytics-export problem it solves	Default
`skipRows: 5`	The canonical GA4 standard-report fix: strips the 5-row metadata preamble (property + date range + sampling notice + blank + headers). Equivalent to pandas' `pd.read_csv('GA-export.csv', skiprows=5)` recipe	0
`inputEncoding: auto`	Detects BOM (Looker Studio Excel-compatible, GSC) or no-BOM (GA4 standard, Looker Studio default, GA4 Explore). Handles all surfaces without manual config	auto
`outputLineEnding: crlf` (default)	Excel-on-Windows compatibility — pairs with the BOM the cleaner adds on output. For pandas / DuckDB / Linux downstream, switch to `lf`	crlf
`normalizeHiddenWhitespace`	Folds CHAR(160) NBSPs that GA4's `(not set)` and Looker Studio's merged-dimension rendering can introduce. Especially common in page-title columns where the source page used ` ` for visual alignment	Off (must enable)
`normalizeSmartQuotes`	Useful for page-title and meta-description columns that include CMS-styled curly quotes. Common with news sites and blogs	Off (must enable)
`dedupMode: case-insensitive`	URL casing varies across GA4 report views — `example.com/Page` from one view, `example.com/page` from another. Case-insensitive dedup catches both as the same URL	Off (must enable)
`dedupMode: normalized`	Trim + lowercase + NBSP fold for the comparison key. Catches URL casing + trailing-NBSP edge cases together	Off (must enable)
`trimWhitespace`	Removes trailing whitespace from dimension values — GA4 occasionally pads on export, GSC trims them automatically	On
`removeEmptyRows`	Drops rows that GA4 includes for `(not set)` placeholder values when the column is otherwise empty. Use sparingly — sometimes `(not set)` rows are meaningful traffic-source signals	On
`repairRows`	Pads short rows from a partial GA4 Explore export where some dimension columns were truncated	On
`delimiter: auto`	Detects tab (GA4 Explore TSV) vs comma (everything else) automatically. Force `\t` if you renamed an Explore TSV to `.csv` and the sniffer guessed wrong	auto

Cookbook

Real before/after rows from real GA4 / Search Console exports.

GA4 standard report metadata preamble (typically 5 rows)

Example

The canonical GA4 export problem. Pandas community recipe uses skiprows=5. The cleaner's skipRows does the same thing in a drop-on-file workflow. Verify the exact row count for your specific report by opening the first 6 lines in a text editor — the preamble structure can vary slightly by report type.

Input (GA4 standard report CSV, typical structure):
# Property: example.com - GA4
# Date range: Apr 1, 2026 - Apr 30, 2026
# Sampling: 100% of sessions

Page path,Active users,Sessions
/blog/seo-tips,12453,15892
/products,8745,11203

Cleaner config: skipRows: 5 (Advanced — verify for your report)

Output:
Page path,Active users,Sessions
/blog/seo-tips,12453,15892
/products,8745,11203

GSC ZIP with two CSVs — concatenate then dedupe

Example

GSC's CSV export is a ZIP containing a main data CSV and a Filters.csv. For unified analysis, merge them, then dedupe on URL (case-insensitive) to catch the same page appearing with different casing in different report views.

Inputs (from GSC ZIP):
  Queries.csv (Top queries — 1,000 rows)
  Filters.csv (Active filters — 3 rows)

Workflow:
  1. csv-merger combines into one file
  2. csv-cleaner with dedupMode: case-insensitive on URL column
     → collapses example.com/Page ↔ example.com/page duplicates

Excel mojibake from GA4's no-BOM UTF-8

Example

GA4 writes UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 and mojibakes accented page titles (José's Blog Post → JosÃ©'s Blog Post). Cleaner adds the BOM.

Source (GA4 CSV, opened in Excel-on-Windows):
Page title,Sessions
JosÃ©'s Blog Post,1245
MÃ¼ller's Travel Notes,892

Cleaner adds UTF-8 BOM → Excel auto-detects → displays:
Page title,Sessions
José's Blog Post,1245
Müller's Travel Notes,892

URL casing variance across GA4 report views

Example

GA4's Pages and Landing Pages reports occasionally produce the same URL with different casing depending on which view it came from. case-insensitive dedup catches the duplicates.

Input (concatenated Pages + Landing Pages exports):
URL,Sessions
example.com/Blog,1234
example.com/blog,5678
example.com/PRODUCTS,432
example.com/products,891

Cleaner config: dedupMode: case-insensitive on URL

Output (first occurrence wins — keeps original casing in output):
URL,Sessions
example.com/Blog,1234
example.com/PRODUCTS,432

Looker Studio multi-page chart concatenation

Example

Looker Studio caps at 750,000 rows per chart download. For larger reports, you download multiple times with date-range filters. Concatenate and dedupe to produce the full unified dataset.

Inputs (3 chart downloads, 500k rows each from Looker Studio):
  chart1-jan-mar.csv
  chart2-apr-jun.csv
  chart3-jul-sep.csv

Workflow:
  1. csv-merger combines (1.5M rows total)
  2. csv-cleaner with dedupMode: exact (catches any
     overlap between adjacent date-range exports)
  3. Output: unified UTF-8 BOM CSV for Excel /
     UTF-8 no-BOM for pandas (set outputLineEnding: lf)

GSC query column with NBSP from special-character keyword

Example

Search queries that users typed using certain mobile keyboards (notably iOS Safari around punctuation) can include CHAR(160) NBSPs. The cleaner folds them so duplicate queries from different mobile inputs collapse on dedup.

Input (GSC Queries.csv, NBSP shown as ·):
Query,Clicks,Impressions
"best·practices",42,310
"best practices",28,245
"how to·clean csv",15,98

Cleaner config:
  normalizeHiddenWhitespace: true
  dedupMode: case-insensitive

Output:
Query,Clicks,Impressions
best practices,42,310
how to clean csv,15,98

(Clicks/impressions need to be SUMMED, not first-wins —
use csv-deduplicator with strategy: aggregate after
the cleaner, OR sum manually in Excel.)

Errors and edge cases

Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.

GA4 metadata preamble corrupts pandas `read_csv` without `skiprows`

GA4 standard-report behaviour

GA4 standard reports include metadata rows before the actual data: typically property name, date range, sampling status, blank row, then headers. The pandas community recipe — pd.read_csv('GA-export.csv', skiprows=5) — handles the common case but the exact count can vary slightly; verify by opening the first 6 lines in a text editor. The cleaner's skipRows option does this in a drop-on-file pass. GA4 Explore exports do NOT have this preamble — use skipRows: 0 for those.

Search Console UI caps every export at 1,000 rows

GSC hard limit

Verified per Google docs: the GSC UI displays and exports a maximum of 1,000 rows per dimension (Queries, Pages, Countries, Devices). No filter or date-range trick exceeds this. For more rows, use the GSC API (no documented row cap) or the Bulk Export to BigQuery (streamed daily). The cleaner can't recover rows that never made it into the CSV — re-export with a different tool if you need >1,000 rows.

GA4 standard reports cap at 100,000 rows

GA4 documented limit

Standard report downloads cap at 100,000 rows (raised from earlier 5,000 cap). Explore reports cap at 10,000,000 cells. For deeper analysis, switch from Standard reports to Explore — fewer dimensions per row but far more rows. Or use the GA4 Data API (no documented row cap). The cleaner is text-only — it processes whatever rows make it into the file.

Excel mojibake from GA4 UTF-8-no-BOM

Encoding mismatch

GA4 writes UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 → accented page titles mojibake (José → JosÃ©). Same root cause as the Google Forms case in the survey-response spoke. Cleaner adds BOM on output (default). For pandas downstream, no BOM needed — switch to outputLineEnding: lf and the BOM is still harmless.

Looker Studio default CSV lacks BOM; 'Excel-compatible' option includes it

User configuration

Looker Studio's chart download has two CSV options: 'Standard CSV' (UTF-8 no BOM, optimised for programmatic use) and 'Excel-compatible' (UTF-8 with BOM). Many users select 'Standard CSV' by default and then complain about Excel mojibake. The cleaner detects either and outputs UTF-8 with BOM by default.

GA4 `(not set)` placeholder in dimension columns

GA4 default behaviour

GA4 writes the literal string (not set) for any dimension value it couldn't track (e.g. Source/Medium for direct traffic, page-title for some events). The cleaner preserves these as cell content — they're meaningful (high (not set) rates indicate tracking issues). If you want to filter these rows out, chain with csv-column-filter (column: Source, operator: not_equals, value: (not set)).

GA4 sampling notice in the metadata preamble

Data-quality flag

If GA4 reduces row 3 to # Sampling: X% of sessions (where X < 100), your export is based on a sample, not full data. The cleaner's skipRows: 5 discards this notice — but you should NOT analyse the result as if it were complete. Switch the source GA4 report to a property/date-range that returns 100% sampling, or accept the sampled result and document the X% in your downstream report.

GSC ZIP export contains two CSVs with different schemas

Multi-file export

GSC's CSV export ZIP contains Queries.csv (or Pages.csv etc.) plus a separate Filters.csv listing the active filters. These have completely different schemas — concatenating them produces a mis-aligned file. For audit purposes, treat them as two separate datasets. The cleaner handles each individually.

Looker Studio multi-page reports require per-page exports

Manual concatenation

Looker Studio's chart download is per-chart. A 5-page report with 3 charts per page = 15 downloads. Concatenate with csv-merger, then cleaner-dedupe on whatever the natural key is (URL, Date, Query). For automated multi-chart export, use the Looker Studio API or Google Apps Script.

Tab-delimited GA4 Explore renamed to `.csv`

Extension vs content mismatch

GA4 Explore lets you choose TSV (tab-separated) or CSV (comma-separated) at export time. Users sometimes pick TSV and then rename the file to .csv thinking the extension determines the format. The actual content is tab-delimited. The cleaner's delimiter: auto detects tab from the first row and parses correctly regardless of extension.

Frequently asked questions

Why does pandas misread my GA4 CSV export?

GA4 standard reports prepend a metadata block before the actual data: property name, date range, sampling status, a blank row, then your column headers. pd.read_csv('GA-export.csv') without skiprows treats the first metadata row as the header — every subsequent data row is misaligned. The pandas community recipe is pd.read_csv('GA-export.csv', skiprows=5) — works for most reports, but the exact count can vary by report type, so verify by opening the first 6 lines in a text editor before relying on it. The cleaner's skipRows option does this in a drop-on-file pass. GA4 Explore exports don't have this preamble — use skipRows: 0 there.

Can I export more than 1,000 rows from Google Search Console?

Not through the UI — the 1,000-row cap applies to every dimension (Queries, Pages, Countries, Devices) regardless of date range or filters. For more rows, use the GSC API (no documented row cap; programmatic), the Bulk Export to BigQuery (daily streamed dataset), or third-party tools like Search Analytics for Sheets (~25,000 rows per query). The cleaner can't recover rows that never made it into the CSV — pick a tool that exports more, then drop on the cleaner.

What's the difference between GA4 Standard Reports and Explore exports?

Standard Reports are pre-built (Acquisition, Engagement, Demographics, etc.) with a 100,000-row CSV cap and the 5-row metadata preamble. Explore is the custom-report builder with up to 10,000,000 cells per export (≈ 50,000 rows × 20 columns for unsampled data) and no metadata preamble. For deeper analysis or larger row counts, use Explore. For quick stakeholder shareable views, use Standard. The cleaner handles both — set skipRows: 5 for Standard, 0 for Explore.

Why does Excel mojibake my GA4 / GSC export?

GA4 and GSC default exports write UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 when no encoding marker is present, mojibaking accented characters (é → Ã©, ü → Ã¼). The cleaner adds the UTF-8 BOM on output by default — equivalent to Looker Studio's 'Excel-compatible' export option or pandas' encoding='utf-8-sig'. After cleaning, Excel auto-detects UTF-8 and displays correctly.

Should I use `dedupMode: case-insensitive` on URL columns?

Yes if you're concatenating multiple GA4 reports or GA4 + GSC outputs. URL casing varies across report views and platforms — GA4's Landing Pages report may show example.com/Blog while the Pages report shows example.com/blog for the same canonical URL. SEO analysis treats these as one page; the cleaner's case-insensitive mode collapses them. Skip if your data is from a single export and casing isn't varying.

How do I clean GSC's two-CSV ZIP export?

Three steps: (1) extract the ZIP — you'll get Queries.csv (or Pages.csv, etc.) and Filters.csv; (2) use csv-merger if you want a unified audit file, or process them separately; (3) drop on the cleaner with default options (GSC's CSVs already include the UTF-8 BOM, so encoding is fine; the value-add is dedup if you concatenated, plus trim/empty-row removal). The Filters.csv is usually tiny (a handful of rows showing active filters at export time).

Will the cleaner handle GA4 Explore's TSV format?

Yes. The cleaner's delimiter: auto detects tab from the first row and parses correctly regardless of file extension. If your Explore export has a .csv extension but tab-delimited content (a common user mistake), the cleaner still reads it correctly. The output is comma-delimited UTF-8 by default — convert to TSV by specifying delimiter: \t in Advanced if you want to round-trip back to a TSV-only consumer.

Why does my GA4 export show `(not set)` everywhere?

(not set) is GA4's placeholder for dimension values it couldn't capture. Common causes: direct traffic for Source/Medium; page_view events fired before page-title was set; geographic data when the IP can't be resolved to a region. The cleaner preserves (not set) as cell content — it's a meaningful data-quality signal. To filter these rows out, chain with csv-column-filter (operator: not_equals, value: (not set)).

Can I dedupe SEO data by URL+date instead of whole row?

Not with csv-cleaner — its dedup is whole-row comparison. For multi-column dedup keys (URL + Date is common for SEO time-series), use csv-deduplicator with columns: URL,Date and strategy: first (or last). Combine with the cleaner: cleaner first (trim, encoding, normalize), then csv-deduplicator with the composite key.

Will analytics data be uploaded to JAD Apps?

No. PapaParse runs entirely in your browser. Analytics data — page URLs, search queries, user counts, custom dimensions (which may include user IDs) — never reaches a server. The only thing saved server-side is a single counter (file processed, no content) for signed-in dashboard stats. You can opt out in account settings. This matters because GA4 custom dimensions can carry PII (logged-in user IDs, email hashes, customer segments).

How large a GA4 / GSC export can the cleaner handle?

Free tier caps at 2 MB — comfortable for any single GSC export (1,000 rows × ~10 columns) and most GA4 standard reports. Pro removes the limit (5–10M rows browser-memory bound). For multi-million-row GA4 Explore exports or BigQuery-sourced datasets, run via the runner API (POST 127.0.0.1:9789/v1/tools/csv-cleaner/run) for streaming throughput. GA4 standard reports cap at 100k rows and GSC UI at 1,000 rows, so file size isn't usually the constraint at the source — it's at the cleaner's free tier.

Can I run this as part of a scheduled SEO reporting pipeline?

Yes — GET /api/v1/tools/csv-cleaner returns the 11-option schema; pair the @jadapps/runner once and POST the payload to 127.0.0.1:9789/v1/tools/csv-cleaner/run. Common pipeline: scheduled GA4 Data API + GSC API pulls → cleaner (skipRows: 5 for GA4 standard reports, outputLineEnding: lf, normalizeHiddenWhitespace: true) → Snowflake / BigQuery / pandas → Looker dashboard. Analytics data never reaches JAD's servers — the cleaner runs locally on your runner.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

How to clean a ga4 / search console csv export

Step 1
Export from your analytics surface — GA4 standard report: Reports → pick report → top-right ⋮ → Download file → CSV. GA4 Explore: Exploration → Export data → CSV or TSV. GSC: Performance → Top right Export → CSV (downloads a ZIP). Looker Studio: ⋮ on any chart → Export → CSV → choose 'Standard CSV' or 'Excel-compatible'.
Step 2
Unzip GSC's export if you took the CSV option — GSC's CSV export is a ZIP containing typically two files: the main performance data CSV (Queries / Pages / Countries / Devices / depending on what was visible) and a Filters.csv listing which filters were active. For most analysis you only need the data CSV. To combine both into one file for audit purposes, use csv-merger on the two files first, then drop the merged file onto the cleaner.
Step 3
For GA4 standard reports: set `skipRows` to match the preamble — GA4 standard reports prepend metadata rows (property name, date range, sampling status, and at least one blank row) before the actual column headers. The pandas community recipe uses skiprows=5 — verify against your specific export by opening the first 6 lines in a text editor and counting rows above the real header. Drop the file onto the cleaner with Advanced → skipRows set to that count. GA4 Explore exports don't have this preamble — leave skipRows: 0 for those.
Step 4
Drop the file onto the cleaner — PapaParse runs in your browser — GA4 / GSC data (which can include PII like user-id custom dimensions) never reaches a server. The cleaner sniffs encoding from the BOM (Looker Studio's Excel-compatible mode includes one; GA4 / GSC default exports do not) and delimiter from the first non-skipped row.
Step 5
For URL-based dedup: enable `dedupMode: case-insensitive` — If you exported the same data across two date ranges or two report views and concatenated them, the same Page URL may appear with different casing depending on the source. Example.com/page and example.com/page are the same page for SEO purposes. case-insensitive mode collapses them. For URL + path matching that also handles trailing-slash differences (example.com/page vs example.com/page/), follow up with csv-find-replace with regex /$ → empty before the dedup.
Step 6
For Excel-on-Windows output: leave default BOM + CRLF — The cleaner's default output is UTF-8 with BOM and CRLF line endings — exactly what Excel-on-Windows needs to read accented page titles correctly. For Python pandas / DuckDB / Linux pipeline downstream, switch to outputLineEnding: lf (BOM is harmless either way; pandas auto-detects it).

Per-surface export shapes

What each Google analytics surface writes when you export, and the cleaner option that addresses each quirk. Row limits verified against current Google docs as of May 2026.

Surface	Export shape	Encoding / BOM	Row limit	Common quirks	Cleaner option(s)
GA4 Standard Report	Single CSV, comma-delimited	UTF-8, no BOM	100,000 rows (verified per Google docs)	Prepends a metadata block (property name, date range, sampling status, blank, then headers); pandas community recipe is `skiprows=5` but verify the count for your specific report. `(not set)` placeholders for un-tracked dimensions	`skipRows: 5` (typical) for the preamble
GA4 Explore (TSV/CSV)	Single TSV or CSV — chosen at export time	UTF-8, no BOM	10,000,000 cells for unsampled explorations (= ~50,000 rows × 20 columns)	No metadata preamble (unlike standard reports). TSV is tab-delimited; the cleaner auto-detects via `delimiter: auto`	`skipRows: 0`; auto delimiter
GA4 Direct-to-Google-Sheets	Live Google Sheet (re-exportable as CSV from Sheets)	UTF-8 (no BOM until Excel-saved)	Same 100k row cap as standard reports	Sheets export adds its own BOM; double round-trip through Sheets occasionally adds CHAR(160) NBSPs around merged dimension values	`normalizeHiddenWhitespace` if NBSPs appear
Search Console UI export	ZIP with 2 CSVs: `Queries.csv` (or similar) + `Filters.csv`	UTF-8 with BOM	1,000 rows per report through the UI — applies to every dimension (Queries / Pages / Countries / Devices)	The 1,000-row cap is shared between display and export — you can't export more by changing date range. For more, use the GSC API or BigQuery Bulk Export	csv-merger combines the two ZIP files into one for analysis — it does not bypass the 1,000-row source cap, only the GSC API or BigQuery Bulk Export does. `dedupMode: case-insensitive` for URL casing across concatenated exports
Search Console Bulk Export → BigQuery	Daily streamed dataset, not a CSV	—	No row limit	Not a CSV workflow — the cleaner doesn't apply. Export from BigQuery as CSV first if you need a flat file	—
Looker Studio chart download	Single CSV per chart	UTF-8, no BOM by default; 'Excel-compatible' option adds the BOM	750,000 rows per chart download	If 'Excel-compatible' is selected, BOM is present; if not, Excel mojibakes accented values. Multi-page reports require one download per page	`inputEncoding: auto` handles either

What the cleaner does

Each cleaner option mapped to the GA4 / GSC / Looker Studio problem it solves. Single-pass — all selected options apply together.

Cleaner option	Analytics-export problem it solves	Default
`skipRows: 5`	The canonical GA4 standard-report fix: strips the 5-row metadata preamble (property + date range + sampling notice + blank + headers). Equivalent to pandas' `pd.read_csv('GA-export.csv', skiprows=5)` recipe	0
`inputEncoding: auto`	Detects BOM (Looker Studio Excel-compatible, GSC) or no-BOM (GA4 standard, Looker Studio default, GA4 Explore). Handles all surfaces without manual config	auto
`outputLineEnding: crlf` (default)	Excel-on-Windows compatibility — pairs with the BOM the cleaner adds on output. For pandas / DuckDB / Linux downstream, switch to `lf`	crlf
`normalizeHiddenWhitespace`	Folds CHAR(160) NBSPs that GA4's `(not set)` and Looker Studio's merged-dimension rendering can introduce. Especially common in page-title columns where the source page used ` ` for visual alignment	Off (must enable)
`normalizeSmartQuotes`	Useful for page-title and meta-description columns that include CMS-styled curly quotes. Common with news sites and blogs	Off (must enable)
`dedupMode: case-insensitive`	URL casing varies across GA4 report views — `example.com/Page` from one view, `example.com/page` from another. Case-insensitive dedup catches both as the same URL	Off (must enable)
`dedupMode: normalized`	Trim + lowercase + NBSP fold for the comparison key. Catches URL casing + trailing-NBSP edge cases together	Off (must enable)
`trimWhitespace`	Removes trailing whitespace from dimension values — GA4 occasionally pads on export, GSC trims them automatically	On
`removeEmptyRows`	Drops rows that GA4 includes for `(not set)` placeholder values when the column is otherwise empty. Use sparingly — sometimes `(not set)` rows are meaningful traffic-source signals	On
`repairRows`	Pads short rows from a partial GA4 Explore export where some dimension columns were truncated	On
`delimiter: auto`	Detects tab (GA4 Explore TSV) vs comma (everything else) automatically. Force `\t` if you renamed an Explore TSV to `.csv` and the sniffer guessed wrong	auto

Cookbook

Real before/after rows from real GA4 / Search Console exports.

GA4 standard report metadata preamble (typically 5 rows)

Example

Input (GA4 standard report CSV, typical structure):
# Property: example.com - GA4
# Date range: Apr 1, 2026 - Apr 30, 2026
# Sampling: 100% of sessions

Page path,Active users,Sessions
/blog/seo-tips,12453,15892
/products,8745,11203

Cleaner config: skipRows: 5 (Advanced — verify for your report)

Output:
Page path,Active users,Sessions
/blog/seo-tips,12453,15892
/products,8745,11203

GSC ZIP with two CSVs — concatenate then dedupe

Example

Inputs (from GSC ZIP):
  Queries.csv (Top queries — 1,000 rows)
  Filters.csv (Active filters — 3 rows)

Workflow:
  1. csv-merger combines into one file
  2. csv-cleaner with dedupMode: case-insensitive on URL column
     → collapses example.com/Page ↔ example.com/page duplicates

Excel mojibake from GA4's no-BOM UTF-8

Example

GA4 writes UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 and mojibakes accented page titles (José's Blog Post → JosÃ©'s Blog Post). Cleaner adds the BOM.

Source (GA4 CSV, opened in Excel-on-Windows):
Page title,Sessions
JosÃ©'s Blog Post,1245
MÃ¼ller's Travel Notes,892

Cleaner adds UTF-8 BOM → Excel auto-detects → displays:
Page title,Sessions
José's Blog Post,1245
Müller's Travel Notes,892

URL casing variance across GA4 report views

Example

GA4's Pages and Landing Pages reports occasionally produce the same URL with different casing depending on which view it came from. case-insensitive dedup catches the duplicates.

Input (concatenated Pages + Landing Pages exports):
URL,Sessions
example.com/Blog,1234
example.com/blog,5678
example.com/PRODUCTS,432
example.com/products,891

Cleaner config: dedupMode: case-insensitive on URL

Output (first occurrence wins — keeps original casing in output):
URL,Sessions
example.com/Blog,1234
example.com/PRODUCTS,432

Looker Studio multi-page chart concatenation

Example

Looker Studio caps at 750,000 rows per chart download. For larger reports, you download multiple times with date-range filters. Concatenate and dedupe to produce the full unified dataset.

Inputs (3 chart downloads, 500k rows each from Looker Studio):
  chart1-jan-mar.csv
  chart2-apr-jun.csv
  chart3-jul-sep.csv

Workflow:
  1. csv-merger combines (1.5M rows total)
  2. csv-cleaner with dedupMode: exact (catches any
     overlap between adjacent date-range exports)
  3. Output: unified UTF-8 BOM CSV for Excel /
     UTF-8 no-BOM for pandas (set outputLineEnding: lf)

GSC query column with NBSP from special-character keyword

Example

Input (GSC Queries.csv, NBSP shown as ·):
Query,Clicks,Impressions
"best·practices",42,310
"best practices",28,245
"how to·clean csv",15,98

Cleaner config:
  normalizeHiddenWhitespace: true
  dedupMode: case-insensitive

Output:
Query,Clicks,Impressions
best practices,42,310
how to clean csv,15,98

(Clicks/impressions need to be SUMMED, not first-wins —
use csv-deduplicator with strategy: aggregate after
the cleaner, OR sum manually in Excel.)

Errors and edge cases

Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.

GA4 metadata preamble corrupts pandas `read_csv` without `skiprows`

GA4 standard-report behaviour

Search Console UI caps every export at 1,000 rows

GSC hard limit

GA4 standard reports cap at 100,000 rows

GA4 documented limit

Excel mojibake from GA4 UTF-8-no-BOM

Encoding mismatch

Looker Studio default CSV lacks BOM; 'Excel-compatible' option includes it

User configuration

GA4 `(not set)` placeholder in dimension columns

GA4 default behaviour

GA4 sampling notice in the metadata preamble

Data-quality flag

GSC ZIP export contains two CSVs with different schemas

Multi-file export

Looker Studio multi-page reports require per-page exports

Manual concatenation

Tab-delimited GA4 Explore renamed to `.csv`

Extension vs content mismatch

Frequently asked questions

Why does pandas misread my GA4 CSV export?

Can I export more than 1,000 rows from Google Search Console?

What's the difference between GA4 Standard Reports and Explore exports?

Why does Excel mojibake my GA4 / GSC export?

Should I use `dedupMode: case-insensitive` on URL columns?

How do I clean GSC's two-CSV ZIP export?

Will the cleaner handle GA4 Explore's TSV format?

Why does my GA4 export show `(not set)` everywhere?

Can I dedupe SEO data by URL+date instead of whole row?

Will analytics data be uploaded to JAD Apps?

How large a GA4 / GSC export can the cleaner handle?

Can I run this as part of a scheduled SEO reporting pipeline?

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

Clean a GA4 / Search Console CSV Export

How to clean a ga4 / search console csv export

Per-surface export shapes

What the cleaner does

Cookbook

GA4 standard report metadata preamble (typically 5 rows)

GSC ZIP with two CSVs — concatenate then dedupe

Excel mojibake from GA4's no-BOM UTF-8

URL casing variance across GA4 report views

Looker Studio multi-page chart concatenation

GSC query column with NBSP from special-character keyword

Errors and edge cases

GA4 metadata preamble corrupts pandas `read_csv` without `skiprows`

Search Console UI caps every export at 1,000 rows

GA4 standard reports cap at 100,000 rows

Excel mojibake from GA4 UTF-8-no-BOM

Looker Studio default CSV lacks BOM; 'Excel-compatible' option includes it

GA4 `(not set)` placeholder in dimension columns

GA4 sampling notice in the metadata preamble

GSC ZIP export contains two CSVs with different schemas

Looker Studio multi-page reports require per-page exports

Tab-delimited GA4 Explore renamed to `.csv`

Frequently asked questions

Why does pandas misread my GA4 CSV export?

Can I export more than 1,000 rows from Google Search Console?

What's the difference between GA4 Standard Reports and Explore exports?

Why does Excel mojibake my GA4 / GSC export?

Should I use `dedupMode: case-insensitive` on URL columns?

How do I clean GSC's two-CSV ZIP export?

Will the cleaner handle GA4 Explore's TSV format?

Why does my GA4 export show `(not set)` everywhere?

Can I dedupe SEO data by URL+date instead of whole row?

Will analytics data be uploaded to JAD Apps?

How large a GA4 / GSC export can the cleaner handle?

Can I run this as part of a scheduled SEO reporting pipeline?

Privacy first

Related guides

Clean a GA4 / Search Console CSV Export

How to clean a ga4 / search console csv export

Per-surface export shapes

What the cleaner does

Cookbook

GA4 standard report metadata preamble (typically 5 rows)

GSC ZIP with two CSVs — concatenate then dedupe

Excel mojibake from GA4's no-BOM UTF-8

URL casing variance across GA4 report views

Looker Studio multi-page chart concatenation

GSC query column with NBSP from special-character keyword

Errors and edge cases

GA4 metadata preamble corrupts pandas `read_csv` without `skiprows`

Search Console UI caps every export at 1,000 rows

GA4 standard reports cap at 100,000 rows

Excel mojibake from GA4 UTF-8-no-BOM

Looker Studio default CSV lacks BOM; 'Excel-compatible' option includes it

GA4 `(not set)` placeholder in dimension columns

GA4 sampling notice in the metadata preamble

GSC ZIP export contains two CSVs with different schemas

Looker Studio multi-page reports require per-page exports

Tab-delimited GA4 Explore renamed to `.csv`

Frequently asked questions

Why does pandas misread my GA4 CSV export?

Can I export more than 1,000 rows from Google Search Console?

What's the difference between GA4 Standard Reports and Explore exports?

Why does Excel mojibake my GA4 / GSC export?

Should I use `dedupMode: case-insensitive` on URL columns?

How do I clean GSC's two-CSV ZIP export?

Will the cleaner handle GA4 Explore's TSV format?

Why does my GA4 export show `(not set)` everywhere?

Can I dedupe SEO data by URL+date instead of whole row?

Will analytics data be uploaded to JAD Apps?

How large a GA4 / GSC export can the cleaner handle?

Can I run this as part of a scheduled SEO reporting pipeline?

Privacy first

Related guides