Clean a Web-Scraped HTML Table CSV (pandas read_html, BeautifulSoup)

How to clean a web-scraped html table csv

Step 1
Scrape the table and save to CSV — pandas: tables = pd.read_html(url); tables[0].to_csv('out.csv', index=False, encoding='utf-8-sig'). BeautifulSoup: soup = BeautifulSoup(html, 'html.parser'); rows = [[c.get_text(strip=True) for c in tr.find_all(['td','th'])] for tr in soup.find_all('tr')] then write with csv.writer. Browser extension (Table Capture, Scraper, etc.): right-click → Export table → CSV.
Step 2
Drop the file onto the cleaner above — PapaParse runs in your browser. The cleaner sniffs encoding from the BOM (or absence) and delimiter from the first row — handles the output of any scraper without configuration. Scraped data — including any source-page PII — never reaches a server.
Step 3
Enable `normalizeHiddenWhitespace` if your source HTML had ` ` — This is the single most useful option for scraped tables. HTML tables commonly use   for visual cell padding; the entity decodes to CHAR(160) (U+00A0) in the resulting cell value. Visually identical to space; never matches in exact-string comparison. The cleaner folds U+00A0 + zero-width space (U+200B) + a few other invisible whitespace codepoints to a regular U+0020 space, then collapses internal runs to a single space.
Step 4
Enable `normalizeSmartQuotes` if the source page had article content embedded — Wikipedia, news sites, and most CMS-driven content auto-format " as “ ”, ' as ‘ ’, and - as –. Tables embedded in such pages inherit these styled characters. Downstream string matching, NLP, and BI tools treat the curly and straight forms as different strings. normalizeSmartQuotes folds them to ASCII.
Step 5
For Excel-on-Windows output: leave default `outputLineEnding: crlf` and let the BOM ride — The cleaner output is UTF-8 with a BOM by default — equivalent to pandas' encoding='utf-8-sig' flag. This is what Excel-on-Windows needs to auto-detect encoding. Without it, accented column values mojibake (José becomes JosÃ©).
Step 6
Download and verify against the source page — The output panel shows: rows in, rows out, hidden-whitespace cells folded, smart-quote characters converted, BOM detected (yes/no), encoding detected. Spot-check 3–5 rows in the cleaner's preview against the original HTML table to confirm no data loss — particularly for tables with merged cells where the scraper may have produced misaligned rows the cleaner can pad but not structurally repair.

Per-scraper output patterns

The CSV shape each major scraper produces, and the cleaner option that addresses each quirk. Verified against the official pandas docs and BeautifulSoup community guidance.

Scraper	Output shape	Encoding default	Common quirks	Cleaner option(s)
pandas `read_html` + `to_csv`	Comma-CSV, UTF-8 (without BOM unless `encoding='utf-8-sig'`)	UTF-8, no BOM (default `to_csv`)	Pandas' docs: 'expect to do cleanup after calling read_html()'. ` ` decodes to U+00A0; references like `[1]` survive as text; multi-row headers join with newline within the cell	`normalizeHiddenWhitespace` for `\xa0`; output BOM for Excel
pandas `read_html` with multi-index headers	Comma-CSV with two header rows or joined `("L1, L2")` tuples	Same as above	Multi-row HTML headers produce multi-row CSV headers; pandas joins them as Python tuples by default. The cleaner doesn't auto-flatten — use csv-header-rename to consolidate	`skipRows` if exporting flattened with a separator-row
BeautifulSoup + `csv.writer`	Comma-CSV, encoding specified at file-open time	Whatever you specified — `utf-8`, `utf-8-sig`, or system default	Cell text from `.get_text(strip=True)` strips outer whitespace but not internal `\xa0`; `\n` from `<br>` survives in cells (RFC 4180 quote-wrapped if you used `csv.writer` correctly)	`normalizeHiddenWhitespace` for residual NBSPs
Scrapy Item Pipeline → CSV	Comma-CSV, UTF-8 by default	UTF-8, no BOM	Items often retain HTML entities (`&`, `<`) if extracted with `xpath` without `string()` wrapping; CSS selectors may include trailing whitespace from indented HTML	`trimWhitespace` (default); pair with csv-find-replace for entity decoding
Browser 'Save table as CSV' extension	Varies by extension — typically comma-CSV with BOM (Excel-friendly)	Usually UTF-8 with BOM	Cells may have nested element text concatenated (`<span>Year</span><sup>1</sup>` becomes `Year1`); curly quotes from page styling survive	`normalizeSmartQuotes` + `normalizeHiddenWhitespace`
Wikipedia table (any scraper)	Same as scraper output	Same as scraper	Source-specific artefacts: `[edit]` link text, `[citation needed]`, footnote references (`[1]`, `[2]`), interlanguage links, `[change]` for Simple English. Most scrapers leave these as cell content	Cleaner handles whitespace + quotes; footnotes need [csv-find-replace](/tool/csv-find-replace) with regex `\[\d+\]` → empty

What the cleaner does

Each cleaner option mapped to the scraped-table problem it solves. Single-pass — all selected options apply together.

Cleaner option	Scraped-table problem it solves	Default
`normalizeHiddenWhitespace`	The primary value-add for scraped tables: folds `\xa0` (from ` `) + zero-width space + a few invisible codepoints to regular space. Equivalent to pandas' `unicodedata.normalize('NFKC', x).strip()` recipe	Off (must enable)
`normalizeSmartQuotes`	Folds `“ ” ‘ ’ – — …` from CMS-styled article content embedded in the source HTML. Common on news sites, Wikipedia article body text, blog tables	Off (must enable)
`inputEncoding: auto`	Detects UTF-8 (with/without BOM), UTF-16 LE/BE, Windows-1252-ish from no-BOM heuristics. Handles older Latin-1 source pages and the UTF-8 default from modern scrapers	auto
`outputLineEnding: crlf` + UTF-8 BOM (default)	Equivalent to `pandas.to_csv(encoding='utf-8-sig')`. Excel-on-Windows auto-detects encoding so `José` displays correctly instead of `JosÃ©`	crlf
`trimWhitespace`	Strips leading/trailing whitespace from cells. Handles indented HTML where the scraper kept the inner-text padding	On
`removeEmptyRows`	Drops fully-empty rows — common after `colspan`/`rowspan` mis-handling produces some rows with all cells consumed by the spanning cell	On
`repairRows`	Pads short rows / trims long rows to header width. Triggered by `colspan`/`rowspan` misalignment or commas inside CMS-styled text	On
`skipRows: 1` (or more)	Strip 'Generated by `scraper-name` on `YYYY-MM-DD`' preamble rows that some scrapers / browser extensions prepend	0
`dedupMode: exact`	Collapse duplicate rows from a multi-page scrape that concatenated outputs without deduping; rare for single-table scrapes	exact
`dedupMode: case-insensitive`	Useful if your scraped data includes URLs or domain names where casing varies (`Example.com` vs `example.com`)	Off (must enable)
`delimiter: auto`	Auto-detects if your scraper wrote TSV instead of CSV (rare but happens with some BeautifulSoup recipes using `\t.join`)	auto

Cookbook

Real before/after rows showing what survives a typical pandas.read_html() round-trip and how the cleaner addresses each.

` ` decodes to U+00A0 — the canonical scraped-table cleanup

Example

An HTML table cell <td>Population 1,234,567</td> becomes the string 'Population\xa01,234,567' in the resulting CSV. The character between Population and 1 is U+00A0, not U+0020. Downstream string comparisons (== 'Population 1,234,567') fail. The cleaner's normalizeHiddenWhitespace folds it. Pandas' own documented recipe (unicodedata.normalize('NFKC', x).strip()) does the same thing.

Source HTML:
  <td>Country</td><td>Population&nbsp;(2024)</td>

After pandas.read_html() + to_csv() (showing \xa0 as ·):
Country,Population·(2024)
France,67,500,000·(est.)

Cleaner config: normalizeHiddenWhitespace: true

Output:
Country,Population (2024)
France,67,500,000 (est.)

Curly quotes from a Wikipedia article-body table

Example

Wikipedia auto-formats quotes in article-body content. Tables embedded in articles inherit the styled characters. "Lord of the Rings" becomes “Lord of the Rings”. Downstream NLP / search-indexing pipelines treat the two forms as different strings.

Source (Wikipedia table cell, after read_html):
Title,Year
“The Lord of the Rings”,1954
“Foundation”,1951

Cleaner config: normalizeSmartQuotes: true

Output:
Title,Year
"The Lord of the Rings",1954
"Foundation",1951

Latin-1 source page mojibakes through pandas

Example

An older non-English source page declared as <meta charset="latin-1"> (or with no charset declaration, sniffing as Latin-1) passes through pandas/BeautifulSoup as Latin-1 strings. Saving with to_csv(encoding='utf-8') then re-reading produces mojibake on accented characters. The cleaner detects the actual encoding from the BOM (or content heuristics) and re-decodes correctly.

Source HTML (declared Latin-1 charset):
<td>José</td>

After scrape → save as utf-8 (mojibake):
Name
JosÃ©

Cleaner with inputEncoding: auto re-decodes correctly:
Name
José

Wikipedia footnotes survive as `[1]`, `[2]`, `[citation needed]`

Example

Wikipedia tables include inline footnote references. The cleaner doesn't strip these — that's outside its scope (text content vs whitespace/encoding). For Wikipedia-specific cleanup, chain with csv-find-replace.

Source (typical Wikipedia data table):
Population,Source
"67,500,000[1]",United Nations[2]
"60,200,000[citation needed]",

Cleaner output (whitespace + quotes normalised, footnotes preserved):
Population,Source
"67,500,000[1]",United Nations[2]
"60,200,000[citation needed]",

Follow-up: csv-find-replace with regex (?i)\[\d+\]|\[citation needed\]
  → empty for footnote removal.

Excel mojibake from pandas `to_csv` default (no BOM)

Example

pandas.to_csv('out.csv') writes UTF-8 without a BOM. Excel-on-Windows falls back to Windows-1252 → mojibake. The cleaner adds the BOM, equivalent to pandas' encoding='utf-8-sig'.

Source (after pandas.to_csv defaults):
Name
José Müller

Excel-on-Windows display (Windows-1252 fallback):
Name
JosÃ© MÃ¼ller

Cleaner adds UTF-8 BOM on output → Excel auto-detects → displays correctly.

Colspan / rowspan misalignment the cleaner can pad but not structurally fix

Example

A table with merged cells (colspan/rowspan) that pandas read_html misparsed produces rows of unequal length. The cleaner's repairRows: true pads short rows to the header width — but the structural relationships are lost. For correct extraction, parse with a colspan/rowspan-aware library (the tabulate-html Python package, or a custom BeautifulSoup loop).

Source table with colspan=2 in row 2 misparsed by read_html:
Country,Region,Population
France,Europe,67500000
Germany,83200000
  ↑ row 3 only has 2 cells because pandas didn't expand the rowspan

Cleaner with repairRows: true pads to 3 columns:
Country,Region,Population
France,Europe,67500000
Germany,,83200000
  ↑ now 3 columns but the Region value is empty — needs source-side fix

Errors and edge cases

Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.

Pandas docs: 'expect to do cleanup after calling read_html()'

Documented limitation

Quoted verbatim from the pandas reference: 'read_html attempts to push the idiosyncrasies of the HTML contained in tables to the user'. The cleaner is the cleanup pandas docs expect you to do. The canonical recipe — unicodedata.normalize('NFKC', x).strip() applied to every cell — does exactly what normalizeHiddenWhitespace: true + trimWhitespace: true (default) does, just in one drop-on-file pass instead of an applymap() call.

` ` survives every byte-for-byte CSV pipeline

HTML entity decoding

  is HTML 4 / HTML 5 syntax for U+00A0 (non-breaking space). Every modern HTML parser decodes it to the actual U+00A0 codepoint — which is then preserved as a real character through CSV, JSON, every database column. Visually identical to U+0020 regular space; never matches in exact-string comparison. The cleaner's normalizeHiddenWhitespace is built specifically for this.

Latin-1 / Windows-1252 source pages produce mojibake after UTF-8 save

Encoding misdetection

Some older / non-English source pages declare Latin-1 (<meta charset="latin-1">) or no charset at all (legal but ambiguous). pandas defaults to Latin-1 detection in some paths; the resulting strings are Latin-1 codepoints. Saving as encoding='utf-8' (without re-encoding) produces a file where each Latin-1 byte was reinterpreted as the same Unicode codepoint — mojibake. Fix at source: pass encoding='utf-8' and flavor='lxml' or flavor='html5lib' to read_html for better detection. Fix at output: the cleaner's inputEncoding: auto re-decodes if the file genuinely was saved as Latin-1; mojibake from a misdeclared save needs scraper-side fix.

Colspan / rowspan misalignment from pandas read_html

Structural failure — cleaner can pad but not fix

Per scraper-community consensus, pandas read_html handles ~85% of merged-cell tables correctly but breaks on complex multi-level headers, tables split across multiple <tbody> sections, or rowspan values that extend beyond the visible table. Result: CSV rows of unequal length. The cleaner's repairRows: true pads short rows to header width but the underlying relationships are lost (cells appear in the wrong columns). For complex tables, use a colspan/rowspan-aware library (tabulate-html, custom BeautifulSoup) at scrape time, then run the cleaner for whitespace/encoding.

Wikipedia `[edit]` / `[citation needed]` / footnote references in cells

Cleaner doesn't strip — chain csv-find-replace

Wikipedia and similar wiki sources embed [edit] links, [citation needed] superscripts, and footnote references ([1], [2]) as inline text. These survive the scrape as cell content. The cleaner doesn't strip them — that's a text-replacement task, not a whitespace/encoding task. Chain with csv-find-replace using regex \[\d+\]|\[edit\]|\[citation needed\] → empty, with caseSensitive: false.

Multi-row HTML headers join as tuple strings

pandas-specific behaviour

An HTML table with two <thead> rows (group header + subheader) produces a pandas MultiIndex column header. Saving to CSV writes the headers as Python tuple strings: "('Country', 'Name')", "('Population', '2024')". The cleaner doesn't auto-flatten these — use csv-header-rename with a renames JSON to consolidate into single-level headers, OR re-export with pd.read_html(url)[0].columns = pd.read_html(url)[0].columns.map('_'.join) before to_csv.

BeautifulSoup `.get_text(strip=True)` doesn't strip internal `\xa0`

Documented BeautifulSoup behaviour

.get_text(strip=True) strips leading and trailing whitespace from the concatenated text. It does NOT collapse internal whitespace and does NOT fold U+00A0 to U+0020. The string 'Population\xa02024' keeps the U+00A0. The cleaner's normalizeHiddenWhitespace handles it post-scrape.

`<br>` tags inside cells are NOT handled properly by pandas read_html

pandas issue #29528 (open)

Open in pandas' issue tracker since 2019: <br> tags inside table cells are not converted to whitespace or a newline — pandas concatenates the surrounding text together, often losing the visual separation. A cell with Apples<br>Oranges<br>Pears may parse as ApplesOrangesPears. The cleaner can't recover the lost separation — fix at scrape time by pre-processing the HTML (e.g. html.replace('<br>', '\n') before passing to read_html) or by using a custom BeautifulSoup loop that calls .get_text(separator='\n') which DOES respect <br>.

Browser extension exports HTML markup inside cells

Extension-specific

Some browser 'Save table as CSV' extensions copy the cell's innerHTML rather than innerText — producing <a href="...">Link</a> as literal cell content. The cleaner doesn't strip HTML — that's outside its scope. Either use a different extension that exports innerText, or chain with csv-find-replace using regex <[^>]+> → empty for HTML tag stripping.

Concatenated multi-page scrape produces exact-row duplicates

Source artefact

Scraping a paginated table (page 1, page 2, ...) and concatenating the outputs without explicit dedup produces duplicate rows when source pages overlap. The cleaner's dedupMode: exact (default) collapses byte-identical rows. For near-duplicates (case differences, whitespace), use dedupMode: trim or normalized.

Frequently asked questions

Why does pandas say 'expect to do cleanup after calling read_html()'?

Because HTML tables carry a set of formatting artefacts that don't have a clean structured equivalent:   (which decodes to U+00A0 in cell values), HTML entities (&, <), nested element text that gets concatenated, multi-row headers, colspan/rowspan misalignment, and inline footnotes. Pandas extracts the table structure; the cell values still carry HTML idiosyncrasies. The cleaner's normalizeHiddenWhitespace + normalizeSmartQuotes + trimWhitespace (default) cover the most common 80% of post-read_html cleanup. The remaining 20% (HTML entity decoding, footnote stripping) needs chained find-replace.

What's the difference between U+00A0 and U+0020?

U+0020 is the regular space — what you get when you press the spacebar. U+00A0 is the non-breaking space — what HTML's   decodes to. Visually identical in every font. Different bytes in every encoding (U+0020 is 0x20; U+00A0 is 0xC2 0xA0 in UTF-8). Every string-comparison function in every language treats them as different. The cleaner's normalizeHiddenWhitespace folds U+00A0 (plus zero-width space U+200B, word joiner U+2060, and the byte-order mark U+FEFF when not at the start of file) to regular U+0020 space.

Why does `José` show as `JosÃ©` in Excel after a pandas `to_csv`?

Pandas' default to_csv writes UTF-8 without a BOM. Excel-on-Windows defaults to Windows-1252 when no encoding indicator is present. The UTF-8 byte sequence for é (0xC3 0xA9) reads as the two Windows-1252 characters Ã©. Two fixes: (1) at source, use df.to_csv('out.csv', encoding='utf-8-sig') which adds the BOM; (2) at output, drop the file onto the cleaner — the default output is UTF-8 with BOM (Excel-on-Windows then auto-detects).

How do I strip Wikipedia footnotes like `[1]` and `[citation needed]`?

Outside the cleaner's scope — that's a text-replacement task, not a whitespace/encoding task. Chain with csv-find-replace using regex mode: find \[\d+\]|\[edit\]|\[citation needed\]|\[change\], replace empty, caseSensitive: false. Run this after the cleaner has handled whitespace and encoding, so your regex doesn't need to worry about NBSPs hiding inside the brackets.

What if my source page is Latin-1 / Windows-1252?

Two scenarios. (1) Source page is correctly declared as Latin-1, pandas / BeautifulSoup decode it correctly to Python strings, then to_csv re-encodes as UTF-8 — file is clean UTF-8. The cleaner handles it normally. (2) Source page misdeclared or undeclared, scraper sniffed wrong and produced mojibake before save — file is corrupt at the byte level. The cleaner can't recover mojibake created upstream; fix at scrape time with explicit encoding='latin-1' in pd.read_html() or BeautifulSoup(html, from_encoding='latin-1').

Can the cleaner handle a table that pandas got the colspan/rowspan wrong on?

Partially. repairRows: true (default) pads short rows to the header width, so the CSV is structurally valid. But the spanning cells' content is lost or misaligned — the cleaner can't reconstruct what <td rowspan=3>USA</td> should have filled across the next 3 rows. For tables with non-trivial merged-cell structures, fix at scrape time using a colspan/rowspan-aware library (the tabulate-html package, or a custom BeautifulSoup loop tracking grid occupancy) before running the cleaner for whitespace/encoding.

Does the cleaner work on tables saved via 'Save Page As' from a browser?

Not for the raw HTML (it's an HTML file, not CSV). For CSV output from a browser extension (Table Capture, Scraper, etc.): yes, treat it like any other scraped CSV. Encoding is usually UTF-8 with BOM (most extensions are Excel-friendly by default); curly quotes from page styling survive; NBSPs from   decode to U+00A0. Enable normalizeSmartQuotes and normalizeHiddenWhitespace.

What about HTML entities like `&` and `<` in scraped cells?

Most modern scrapers (pd.read_html, BeautifulSoup with 'html.parser' or 'lxml') decode standard HTML entities automatically — & becomes &, < becomes <. If you're seeing literal & in your CSV, the scraper didn't decode (e.g. you used regex against raw HTML instead of an HTML parser). The cleaner doesn't HTML-decode — chain with csv-find-replace (find: &, replace: &, then find: <, replace: <, etc.) or re-run the scrape with a proper HTML parser.

Can I dedupe rows from a paginated scrape where pages overlap?

Yes. Concatenate the page outputs (e.g. pd.concat([df1, df2, df3]) or csv-merger on the saved files), then drop onto the cleaner. dedupMode: exact (default) catches byte-identical duplicates from page-overlap. For near-duplicates that differ in whitespace or case (https://Example.com vs https://example.com), use dedupMode: trim or case-insensitive.

Will scraped PII be uploaded to JAD Apps?

No. PapaParse runs entirely in your browser. Scraped table data — including any PII the source page exposed — never reaches a server. The only thing saved server-side is a single counter (file processed, no content) for signed-in dashboard stats. You can opt out in account settings. Compliance reminder: scraping PII may trigger GDPR (EU), CCPA (California), or copyright concerns regardless of how clean the resulting data is — the cleaner handles formatting, not legal-basis questions.

How large a scraped CSV can the cleaner handle?

Free tier caps at 2 MB — roughly 8k–30k rows depending on column count and value length. Pro removes the limit (5–10M rows browser-memory bound). For multi-million-row scrapes (full Wikipedia category dumps, historical price data), run via the @jadapps/runner API (POST 127.0.0.1:9789/v1/tools/csv-cleaner/run) for streaming throughput. The Wikipedia table-of-tables (every infobox per article) is feasible at runner scale.

Can I run this as part of an automated scraping pipeline?

Yes — GET /api/v1/tools/csv-cleaner returns the 11-option schema; pair the @jadapps/runner once and POST the payload to 127.0.0.1:9789/v1/tools/csv-cleaner/run. Common pipeline: scheduled Scrapy spider → CSV pipeline output → cleaner (normalizeHiddenWhitespace: true, normalizeSmartQuotes: true, default trim + dedupe) → DuckDB / Snowflake / pandas analysis. Scraped data never reaches JAD's servers — the cleaner runs locally on your runner.

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

How to clean a web-scraped html table csv

Step 1
Scrape the table and save to CSV — pandas: tables = pd.read_html(url); tables[0].to_csv('out.csv', index=False, encoding='utf-8-sig'). BeautifulSoup: soup = BeautifulSoup(html, 'html.parser'); rows = [[c.get_text(strip=True) for c in tr.find_all(['td','th'])] for tr in soup.find_all('tr')] then write with csv.writer. Browser extension (Table Capture, Scraper, etc.): right-click → Export table → CSV.
Step 2
Drop the file onto the cleaner above — PapaParse runs in your browser. The cleaner sniffs encoding from the BOM (or absence) and delimiter from the first row — handles the output of any scraper without configuration. Scraped data — including any source-page PII — never reaches a server.
Step 3
Enable `normalizeHiddenWhitespace` if your source HTML had ` ` — This is the single most useful option for scraped tables. HTML tables commonly use   for visual cell padding; the entity decodes to CHAR(160) (U+00A0) in the resulting cell value. Visually identical to space; never matches in exact-string comparison. The cleaner folds U+00A0 + zero-width space (U+200B) + a few other invisible whitespace codepoints to a regular U+0020 space, then collapses internal runs to a single space.
Step 4
Enable `normalizeSmartQuotes` if the source page had article content embedded — Wikipedia, news sites, and most CMS-driven content auto-format " as “ ”, ' as ‘ ’, and - as –. Tables embedded in such pages inherit these styled characters. Downstream string matching, NLP, and BI tools treat the curly and straight forms as different strings. normalizeSmartQuotes folds them to ASCII.
Step 5
For Excel-on-Windows output: leave default `outputLineEnding: crlf` and let the BOM ride — The cleaner output is UTF-8 with a BOM by default — equivalent to pandas' encoding='utf-8-sig' flag. This is what Excel-on-Windows needs to auto-detect encoding. Without it, accented column values mojibake (José becomes JosÃ©).
Step 6
Download and verify against the source page — The output panel shows: rows in, rows out, hidden-whitespace cells folded, smart-quote characters converted, BOM detected (yes/no), encoding detected. Spot-check 3–5 rows in the cleaner's preview against the original HTML table to confirm no data loss — particularly for tables with merged cells where the scraper may have produced misaligned rows the cleaner can pad but not structurally repair.

Per-scraper output patterns

The CSV shape each major scraper produces, and the cleaner option that addresses each quirk. Verified against the official pandas docs and BeautifulSoup community guidance.

Scraper	Output shape	Encoding default	Common quirks	Cleaner option(s)
pandas `read_html` + `to_csv`	Comma-CSV, UTF-8 (without BOM unless `encoding='utf-8-sig'`)	UTF-8, no BOM (default `to_csv`)	Pandas' docs: 'expect to do cleanup after calling read_html()'. ` ` decodes to U+00A0; references like `[1]` survive as text; multi-row headers join with newline within the cell	`normalizeHiddenWhitespace` for `\xa0`; output BOM for Excel
pandas `read_html` with multi-index headers	Comma-CSV with two header rows or joined `("L1, L2")` tuples	Same as above	Multi-row HTML headers produce multi-row CSV headers; pandas joins them as Python tuples by default. The cleaner doesn't auto-flatten — use csv-header-rename to consolidate	`skipRows` if exporting flattened with a separator-row
BeautifulSoup + `csv.writer`	Comma-CSV, encoding specified at file-open time	Whatever you specified — `utf-8`, `utf-8-sig`, or system default	Cell text from `.get_text(strip=True)` strips outer whitespace but not internal `\xa0`; `\n` from `<br>` survives in cells (RFC 4180 quote-wrapped if you used `csv.writer` correctly)	`normalizeHiddenWhitespace` for residual NBSPs
Scrapy Item Pipeline → CSV	Comma-CSV, UTF-8 by default	UTF-8, no BOM	Items often retain HTML entities (`&`, `<`) if extracted with `xpath` without `string()` wrapping; CSS selectors may include trailing whitespace from indented HTML	`trimWhitespace` (default); pair with csv-find-replace for entity decoding
Browser 'Save table as CSV' extension	Varies by extension — typically comma-CSV with BOM (Excel-friendly)	Usually UTF-8 with BOM	Cells may have nested element text concatenated (`<span>Year</span><sup>1</sup>` becomes `Year1`); curly quotes from page styling survive	`normalizeSmartQuotes` + `normalizeHiddenWhitespace`
Wikipedia table (any scraper)	Same as scraper output	Same as scraper	Source-specific artefacts: `[edit]` link text, `[citation needed]`, footnote references (`[1]`, `[2]`), interlanguage links, `[change]` for Simple English. Most scrapers leave these as cell content	Cleaner handles whitespace + quotes; footnotes need [csv-find-replace](/tool/csv-find-replace) with regex `\[\d+\]` → empty

What the cleaner does

Each cleaner option mapped to the scraped-table problem it solves. Single-pass — all selected options apply together.

Cleaner option	Scraped-table problem it solves	Default
`normalizeHiddenWhitespace`	The primary value-add for scraped tables: folds `\xa0` (from ` `) + zero-width space + a few invisible codepoints to regular space. Equivalent to pandas' `unicodedata.normalize('NFKC', x).strip()` recipe	Off (must enable)
`normalizeSmartQuotes`	Folds `“ ” ‘ ’ – — …` from CMS-styled article content embedded in the source HTML. Common on news sites, Wikipedia article body text, blog tables	Off (must enable)
`inputEncoding: auto`	Detects UTF-8 (with/without BOM), UTF-16 LE/BE, Windows-1252-ish from no-BOM heuristics. Handles older Latin-1 source pages and the UTF-8 default from modern scrapers	auto
`outputLineEnding: crlf` + UTF-8 BOM (default)	Equivalent to `pandas.to_csv(encoding='utf-8-sig')`. Excel-on-Windows auto-detects encoding so `José` displays correctly instead of `JosÃ©`	crlf
`trimWhitespace`	Strips leading/trailing whitespace from cells. Handles indented HTML where the scraper kept the inner-text padding	On
`removeEmptyRows`	Drops fully-empty rows — common after `colspan`/`rowspan` mis-handling produces some rows with all cells consumed by the spanning cell	On
`repairRows`	Pads short rows / trims long rows to header width. Triggered by `colspan`/`rowspan` misalignment or commas inside CMS-styled text	On
`skipRows: 1` (or more)	Strip 'Generated by `scraper-name` on `YYYY-MM-DD`' preamble rows that some scrapers / browser extensions prepend	0
`dedupMode: exact`	Collapse duplicate rows from a multi-page scrape that concatenated outputs without deduping; rare for single-table scrapes	exact
`dedupMode: case-insensitive`	Useful if your scraped data includes URLs or domain names where casing varies (`Example.com` vs `example.com`)	Off (must enable)
`delimiter: auto`	Auto-detects if your scraper wrote TSV instead of CSV (rare but happens with some BeautifulSoup recipes using `\t.join`)	auto

Cookbook

Real before/after rows showing what survives a typical pandas.read_html() round-trip and how the cleaner addresses each.

` ` decodes to U+00A0 — the canonical scraped-table cleanup

Example

Source HTML:
  <td>Country</td><td>Population&nbsp;(2024)</td>

After pandas.read_html() + to_csv() (showing \xa0 as ·):
Country,Population·(2024)
France,67,500,000·(est.)

Cleaner config: normalizeHiddenWhitespace: true

Output:
Country,Population (2024)
France,67,500,000 (est.)

Curly quotes from a Wikipedia article-body table

Example

Source (Wikipedia table cell, after read_html):
Title,Year
“The Lord of the Rings”,1954
“Foundation”,1951

Cleaner config: normalizeSmartQuotes: true

Output:
Title,Year
"The Lord of the Rings",1954
"Foundation",1951

Latin-1 source page mojibakes through pandas

Example

Source HTML (declared Latin-1 charset):
<td>José</td>

After scrape → save as utf-8 (mojibake):
Name
JosÃ©

Cleaner with inputEncoding: auto re-decodes correctly:
Name
José

Wikipedia footnotes survive as `[1]`, `[2]`, `[citation needed]`

Example

Source (typical Wikipedia data table):
Population,Source
"67,500,000[1]",United Nations[2]
"60,200,000[citation needed]",

Cleaner output (whitespace + quotes normalised, footnotes preserved):
Population,Source
"67,500,000[1]",United Nations[2]
"60,200,000[citation needed]",

Follow-up: csv-find-replace with regex (?i)\[\d+\]|\[citation needed\]
  → empty for footnote removal.

Excel mojibake from pandas `to_csv` default (no BOM)

Example

pandas.to_csv('out.csv') writes UTF-8 without a BOM. Excel-on-Windows falls back to Windows-1252 → mojibake. The cleaner adds the BOM, equivalent to pandas' encoding='utf-8-sig'.

Source (after pandas.to_csv defaults):
Name
José Müller

Excel-on-Windows display (Windows-1252 fallback):
Name
JosÃ© MÃ¼ller

Cleaner adds UTF-8 BOM on output → Excel auto-detects → displays correctly.

Colspan / rowspan misalignment the cleaner can pad but not structurally fix

Example

Source table with colspan=2 in row 2 misparsed by read_html:
Country,Region,Population
France,Europe,67500000
Germany,83200000
  ↑ row 3 only has 2 cells because pandas didn't expand the rowspan

Cleaner with repairRows: true pads to 3 columns:
Country,Region,Population
France,Europe,67500000
Germany,,83200000
  ↑ now 3 columns but the Region value is empty — needs source-side fix

Errors and edge cases

Real errors and silent failures sourced from each platform's own documentation. Match the wording to the row, fix what the row says to fix.

Pandas docs: 'expect to do cleanup after calling read_html()'

Documented limitation

` ` survives every byte-for-byte CSV pipeline

HTML entity decoding

Latin-1 / Windows-1252 source pages produce mojibake after UTF-8 save

Encoding misdetection

Colspan / rowspan misalignment from pandas read_html

Structural failure — cleaner can pad but not fix

Wikipedia `[edit]` / `[citation needed]` / footnote references in cells

Cleaner doesn't strip — chain csv-find-replace

Multi-row HTML headers join as tuple strings

pandas-specific behaviour

BeautifulSoup `.get_text(strip=True)` doesn't strip internal `\xa0`

Documented BeautifulSoup behaviour

`<br>` tags inside cells are NOT handled properly by pandas read_html

pandas issue #29528 (open)

Browser extension exports HTML markup inside cells

Extension-specific

Concatenated multi-page scrape produces exact-row duplicates

Source artefact

Frequently asked questions

Why does pandas say 'expect to do cleanup after calling read_html()'?

What's the difference between U+00A0 and U+0020?

How do I strip Wikipedia footnotes like `[1]` and `[citation needed]`?

What if my source page is Latin-1 / Windows-1252?

Can the cleaner handle a table that pandas got the colspan/rowspan wrong on?

Does the cleaner work on tables saved via 'Save Page As' from a browser?

What about HTML entities like `&` and `<` in scraped cells?

Can I dedupe rows from a paginated scrape where pages overlap?

Will scraped PII be uploaded to JAD Apps?

How large a scraped CSV can the cleaner handle?

Can I run this as part of an automated scraping pipeline?

Privacy first

Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.

Clean a Web-Scraped HTML Table CSV

How to clean a web-scraped html table csv

Per-scraper output patterns

What the cleaner does

Cookbook

`&nbsp;` decodes to U+00A0 — the canonical scraped-table cleanup

Curly quotes from a Wikipedia article-body table

Latin-1 source page mojibakes through pandas

Wikipedia footnotes survive as `[1]`, `[2]`, `[citation needed]`

Excel mojibake from pandas `to_csv` default (no BOM)

Colspan / rowspan misalignment the cleaner can pad but not structurally fix

Errors and edge cases

Pandas docs: 'expect to do cleanup after calling read_html()'

`&nbsp;` survives every byte-for-byte CSV pipeline

Latin-1 / Windows-1252 source pages produce mojibake after UTF-8 save

Colspan / rowspan misalignment from pandas read_html

Wikipedia `[edit]` / `[citation needed]` / footnote references in cells

Multi-row HTML headers join as tuple strings

BeautifulSoup `.get_text(strip=True)` doesn't strip internal `\xa0`

`<br>` tags inside cells are NOT handled properly by pandas read_html

Browser extension exports HTML markup inside cells

Concatenated multi-page scrape produces exact-row duplicates

Frequently asked questions

Why does pandas say 'expect to do cleanup after calling read_html()'?

What's the difference between U+00A0 and U+0020?

Why does `José` show as `JosÃ©` in Excel after a pandas `to_csv`?

How do I strip Wikipedia footnotes like `[1]` and `[citation needed]`?

What if my source page is Latin-1 / Windows-1252?

Can the cleaner handle a table that pandas got the colspan/rowspan wrong on?

Does the cleaner work on tables saved via 'Save Page As' from a browser?

What about HTML entities like `&amp;` and `&lt;` in scraped cells?

Can I dedupe rows from a paginated scrape where pages overlap?

Will scraped PII be uploaded to JAD Apps?

How large a scraped CSV can the cleaner handle?

Can I run this as part of an automated scraping pipeline?

Privacy first

Related guides

Clean a Web-Scraped HTML Table CSV

How to clean a web-scraped html table csv

Per-scraper output patterns

What the cleaner does

Cookbook

`&nbsp;` decodes to U+00A0 — the canonical scraped-table cleanup

Curly quotes from a Wikipedia article-body table

Latin-1 source page mojibakes through pandas

Wikipedia footnotes survive as `[1]`, `[2]`, `[citation needed]`

Excel mojibake from pandas `to_csv` default (no BOM)

Colspan / rowspan misalignment the cleaner can pad but not structurally fix

Errors and edge cases

Pandas docs: 'expect to do cleanup after calling read_html()'

`&nbsp;` survives every byte-for-byte CSV pipeline

Latin-1 / Windows-1252 source pages produce mojibake after UTF-8 save

Colspan / rowspan misalignment from pandas read_html

Wikipedia `[edit]` / `[citation needed]` / footnote references in cells

Multi-row HTML headers join as tuple strings

BeautifulSoup `.get_text(strip=True)` doesn't strip internal `\xa0`

`<br>` tags inside cells are NOT handled properly by pandas read_html

Browser extension exports HTML markup inside cells

Concatenated multi-page scrape produces exact-row duplicates

Frequently asked questions

Why does pandas say 'expect to do cleanup after calling read_html()'?

What's the difference between U+00A0 and U+0020?

Why does `José` show as `JosÃ©` in Excel after a pandas `to_csv`?

How do I strip Wikipedia footnotes like `[1]` and `[citation needed]`?

What if my source page is Latin-1 / Windows-1252?

Can the cleaner handle a table that pandas got the colspan/rowspan wrong on?

Does the cleaner work on tables saved via 'Save Page As' from a browser?

What about HTML entities like `&amp;` and `&lt;` in scraped cells?

Can I dedupe rows from a paginated scrape where pages overlap?

Will scraped PII be uploaded to JAD Apps?

How large a scraped CSV can the cleaner handle?

Can I run this as part of an automated scraping pipeline?

Privacy first

Related guides

` ` decodes to U+00A0 — the canonical scraped-table cleanup

` ` survives every byte-for-byte CSV pipeline

What about HTML entities like `&` and `<` in scraped cells?

` ` decodes to U+00A0 — the canonical scraped-table cleanup

` ` survives every byte-for-byte CSV pipeline

What about HTML entities like `&` and `<` in scraped cells?