How to create a sample dataset from a large CSV
- Step 1: Decide your sample size. Most pipeline and dashboard work needs 1,000–5,000 data rows to exercise the column shapes. On the free tier the output is capped at 500 rows, so pick a Row limit at or below that, or upgrade to Pro for up to 100,000 output rows.
- Step 2: Drop the large CSV onto the tool. PapaParse reads the file locally; nothing uploads. Free tier accepts files up to 2 MB; Pro accepts up to 100 MB. A file larger than your tier's byte cap is blocked at drop with an upgrade prompt before any parsing happens.
- Step 3: Set the Row limit. Type the number of data rows you want to keep (default 1000, minimum 1). The header is added on top automatically, so a limit of 1000 produces a 1,001-line file.
- Step 4: Leave Row offset at 0 for a head sample. Offset 0 takes the rows directly after the header. Raise it only if you want to skip a leading block, for example offset 5000 to sample rows that come after the first 5,000.
- Step 5: Click Limit rows and check the stats. The result panel shows Total rows in, Rows out, and Rows skipped. Confirm Rows out matches your target before downloading.
- Step 6: Download and build against the sample. The file downloads as <name>.rows-<start>-<end>.csv (e.g. export.rows-1-1000.csv). Develop your pipeline on it, then re-run the logic on the full export once it's proven.
The two controls and what they actually do
The Row Limiter exposes exactly two numeric inputs. Everything else (header handling, output) is automatic. Numbers operate on data rows only — the header is row 0 and is always kept.
| Control | Default | Minimum | What it does |
|---|---|---|---|
| Row limit | 1000 | 1 | Number of data rows to keep, counted after the offset. The output is header + data.slice(offset, offset + limit). If the file has fewer rows than requested, you simply get all of them |
| Row offset (skip) | 0 | 0 | Number of leading data rows to skip before counting the limit. 0 = start right after the header. Lets you take a middle block without any sort |
| Header row | always kept | n/a | Row 0 of the file is treated as the header and is always emitted first. It is not counted by limit or offset — there is no control to drop or change it here |
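The slice rule in the table can be sketched in a few lines of TypeScript. This is an illustration only: `sliceRows` is a hypothetical helper name, and it assumes the file has already been parsed into an array of rows with the header at index 0. The tool's actual source is not shown here.

```typescript
// Hypothetical helper, not the tool's real code.
// rows[0] is the header; rows[1..] are the data rows.
function sliceRows(rows: string[][], limit: number, offset: number): string[][] {
  const header = rows[0];
  if (header === undefined) return []; // empty input: nothing to keep
  const data = rows.slice(1);
  // The header is always emitted first and is not counted by limit or offset.
  return [header, ...data.slice(offset, offset + limit)];
}

const rows: string[][] = [
  ["order_id", "sku"],
  ["1001", "WID-A"],
  ["1002", "WID-B"],
  ["1003", "WID-A"],
  ["1004", "WID-C"],
];

const middle = sliceRows(rows, 2, 1);   // header + the 2nd and 3rd data rows
const pastEnd = sliceRows(rows, 2, 99); // header only: offset is past the data
const all = sliceRows(rows, 1000, 0);   // header + every data row, no padding
```

Because `Array.prototype.slice` clamps to the array bounds, an oversized limit or offset needs no special handling in this sketch.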
Slice recipes: limit + offset combinations
Worked examples for a source CSV with a header plus 50,000 data rows. Output line count = kept data rows + 1 header line.
| Goal | Row limit | Row offset | Data rows in output | Download filename |
|---|---|---|---|---|
| First 1,000 rows (head sample) | 1000 | 0 | rows 1–1000 | <name>.rows-1-1000.csv |
| Second block of 1,000 | 1000 | 1000 | rows 1001–2000 | <name>.rows-1001-2000.csv |
| 500-row free-tier sample | 500 | 0 | rows 1–500 | <name>.rows-1-500.csv |
| Skip first 49,900, keep rest | 1000 | 49900 | rows 49901–50000 (100 rows) | <name>.rows-49901-50000.csv |
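The filename pattern in the table follows from the offset and the number of rows actually kept. The sketch below reproduces it under stated assumptions: `sliceFilename` is a hypothetical name, the `.csv` extension handling is a guess, and the name used for an empty slice is not documented here, so the sketch refuses that case rather than inventing one.

```typescript
// Hypothetical helper; mirrors the <name>.rows-<start>-<end>.csv pattern.
function sliceFilename(
  sourceName: string,
  totalDataRows: number,
  limit: number,
  offset: number,
): string {
  const base = sourceName.replace(/\.csv$/i, "");
  // Rows kept are clamped to what is available after the offset.
  const kept = Math.max(0, Math.min(limit, totalDataRows - offset));
  if (kept === 0) {
    // Assumption: the empty-slice filename is not documented, so it is not modelled.
    throw new RangeError("empty slice: filename not modelled in this sketch");
  }
  const start = offset + 1;  // data rows are numbered from 1
  const end = offset + kept; // last data row included
  return `${base}.rows-${start}-${end}.csv`;
}

console.log(sliceFilename("export.csv", 50000, 1000, 0));     // export.rows-1-1000.csv
console.log(sliceFilename("export.csv", 50000, 1000, 49900)); // export.rows-49901-50000.csv
```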
Tier limits that apply to sampling
The Row Limiter is a Pro tool with a free allowance. The free row cap is checked against the OUTPUT (rows kept), not the input file. Values from lib/tier-limits.ts.
| Tier | Max input file size | Max output rows | What happens past the cap |
|---|---|---|---|
| Free | 2 MB | 500 rows | A >2 MB file is blocked at drop. A run that would keep >500 rows is blocked after processing with a Pro upgrade prompt |
| Pro | 100 MB | 100,000 rows | Handles realistic production exports; keep Row limit at or under 100,000 |
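To make the two gates concrete, here is a rough sketch of the checks the table describes. The caps come from the table; everything else is an assumption: the type and function names are hypothetical, 1 MB is taken as 1024 × 1024 bytes, and the behaviour past the Pro output cap (which this page does not spell out) is treated the same as on free. It is not the contents of lib/tier-limits.ts.

```typescript
type Tier = "free" | "pro";
type Gate = "ok" | "blocked-at-drop" | "blocked-after-processing";

interface TierLimit {
  maxFileBytes: number;
  maxOutputRows: number;
}

// Caps from the table above; the object shape is an assumption.
const TIER_LIMITS: Record<Tier, TierLimit> = {
  free: { maxFileBytes: 2 * 1024 * 1024, maxOutputRows: 500 },
  pro: { maxFileBytes: 100 * 1024 * 1024, maxOutputRows: 100_000 },
};

function checkRun(
  tier: Tier,
  fileBytes: number,
  totalDataRows: number,
  limit: number,
  offset: number,
): Gate {
  const caps = TIER_LIMITS[tier];
  // Size gate: checked first, before any parsing.
  if (fileBytes > caps.maxFileBytes) return "blocked-at-drop";
  // Row gate: checked against the OUTPUT, clamped to the rows actually available.
  const rowsOut = Math.max(0, Math.min(limit, totalDataRows - offset));
  return rowsOut > caps.maxOutputRows ? "blocked-after-processing" : "ok";
}

// A 1.9 MB file with 40,000 rows loads on free, but only a 500-row slice passes.
console.log(checkRun("free", 1_900_000, 40_000, 500, 0));  // ok
console.log(checkRun("free", 1_900_000, 40_000, 1000, 0)); // blocked-after-processing
```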
Cookbook
Concrete before/after slices using the two real controls. Each code block shows the source shape, the limit/offset you set, and the resulting download. Data is illustrative.
Head sample of 3 data rows
Example: The default workflow: keep the header and the first N data rows. With offset 0 the slice starts immediately after the header. Shown with a small limit for clarity; in practice you'd use 1,000–5,000.

    Source (header + 6 data rows):
    order_id,sku,qty,total
    1001,WID-A,2,19.98
    1002,WID-B,1,7.50
    1003,WID-A,5,49.95
    1004,WID-C,3,29.97
    1005,WID-B,2,15.00
    1006,WID-D,1,4.25

    Row limit: 3
    Row offset: 0

    Output (export.rows-1-3.csv):
    order_id,sku,qty,total
    1001,WID-A,2,19.98
    1002,WID-B,1,7.50
    1003,WID-A,5,49.95

    Stats: Total rows in 6 · Rows out 3 · Rows skipped 3
Middle block via offset (no sort needed)
Example: Offset skips a leading block of data rows, then the limit counts from there. This is how you sample a slice from the middle of a file without sorting it first.

    Source: header + 6 data rows (order_id 1001..1006)

    Row limit: 2
    Row offset: 2

    Output (export.rows-3-4.csv):
    order_id,sku,qty,total
    1003,WID-A,5,49.95
    1004,WID-C,3,29.97

    Stats: Total rows in 6 · Rows out 2 · Rows skipped 4
    (2 skipped by the offset plus the 2 trailing rows beyond the limit)
Limit larger than the file
Example: Asking for more rows than exist is safe: you get every available data row. There is no padding and no error.

    Source: header + 6 data rows

    Row limit: 1000
    Row offset: 0

    Output (export.rows-1-6.csv):
    All 6 data rows + header

    Stats: Total rows in 6 · Rows out 6 · Rows skipped 0
Reproducible notebook fixture
Example: Because the slice is deterministic, the same input plus the same limit/offset always produces byte-identical rows, which is ideal for a sample committed alongside an analysis notebook.

    # documented in the notebook README
    source: customers_2026q2.csv (4.1M rows, Pro tier)
    slice:  Row limit 5000, Row offset 0
    output: customers_2026q2.rows-1-5000.csv

Replaying the same slice on the same export reproduces the exact 5,000 rows, with no random seed to track.
Sort first for a non-head sample
Example: The Row Limiter only takes a contiguous block from the current order. To sample the highest-value or most-recent rows, sort the file first, then slice the top.

    Step 1: csv-sorter, sort by total DESC
    Step 2: csv-row-limiter, Row limit 100, Row offset 0

    Result: the 100 highest-total orders, header preserved.
    (Without the sort, you'd get the first 100 in file order.)
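If you want to check the sort-then-slice result in your own pipeline code, the same two steps look roughly like this. The `Order` type and `topByTotal` function are hypothetical names for illustration, and the rows are the illustrative order data from the head-sample example; this is not how csv-sorter or the Row Limiter are implemented.

```typescript
interface Order {
  order_id: string;
  sku: string;
  qty: number;
  total: number;
}

// Sort a copy by total (descending), then keep the first n rows.
function topByTotal(orders: Order[], n: number): Order[] {
  // slice() copies first so the original file order is left untouched.
  return orders.slice().sort((a, b) => b.total - a.total).slice(0, n);
}

const orders: Order[] = [
  { order_id: "1001", sku: "WID-A", qty: 2, total: 19.98 },
  { order_id: "1002", sku: "WID-B", qty: 1, total: 7.5 },
  { order_id: "1003", sku: "WID-A", qty: 5, total: 49.95 },
  { order_id: "1004", sku: "WID-C", qty: 3, total: 29.97 },
  { order_id: "1005", sku: "WID-B", qty: 2, total: 15.0 },
  { order_id: "1006", sku: "WID-D", qty: 1, total: 4.25 },
];

const sortedTop = topByTotal(orders, 2).map((o) => o.order_id).join(",");
const fileOrderTop = orders.slice(0, 2).map((o) => o.order_id).join(",");

console.log(sortedTop);    // 1003,1004
console.log(fileOrderTop); // 1001,1002
```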
Errors and edge cases
Errors and silent edge cases you can hit with the Row Limiter, and what to do about each. Match your symptom to the entry below and apply the fix it describes.
Free-tier run keeps more than 500 rows
Pro required: The free row cap is enforced against the OUTPUT. If your Row limit (clamped to the rows actually available) would keep more than 500 rows, the run completes the slice in memory but is then blocked with a Pro upgrade prompt instead of producing a download. Set Row limit to 500 or less to stay within free, or upgrade for up to 100,000 output rows.
Source file larger than your tier's byte cap
Blocked at drop: File size is checked the moment you drop the file, before any parsing. Free blocks files over 2 MB; Pro blocks over 100 MB. The block message names the file and its size and offers an upgrade. This is a size gate, not a row gate: a 1.9 MB file with 40,000 rows still loads on free (though the 500-row output cap then applies).
Offset is larger than the number of data rows
Empty result: If offset skips past the end of the data, the slice is empty: you get a file containing only the header, Rows out 0. No error is thrown. Lower the offset or check Total rows in against the offset you set.
Blank lines inside the file
Counted as rows: The parser runs with skipEmptyLines: false, so blank lines are preserved and counted as data rows. A blank line at position 50 is row 50 in the slice math. If you want blanks gone first, run csv-empty-row-remover before sampling.
File has no header
First row treated as header: The tool always treats row 0 as the header and keeps it. If your CSV is headerless, the first data row is consumed as the header and won't appear in the data slice. Add a header row first, or account for the off-by-one in your downstream code.
Semicolon- or tab-delimited file
Auto-detected: PapaParse auto-detects the delimiter from the file, so EU-locale semicolon CSVs and tab-separated exports slice correctly. There is no delimiter dropdown in this tool; detection is automatic and the output uses the same field structure.
Quoted fields containing commas or newlines
Preserved: RFC 4180 quoted fields (a comma or newline inside double quotes) are parsed as a single cell, so a quoted multi-line note counts as one row, not several. Output is re-serialized with PapaParse using minimal quoting (quotes only where needed), so the data round-trips intact.
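To see why a quoted newline does not start a new row, here is a deliberately simplified record counter. It is not PapaParse and not the tool's parser: `countRecords` is a hypothetical function that only tracks whether the scanner is inside double quotes, which is enough to show the row-counting rule.

```typescript
// Simplified illustration of RFC 4180 record counting (not PapaParse).
// Counts records, header included; newlines inside double quotes do not end a record.
function countRecords(csv: string): number {
  let records = 0;
  let inQuotes = false;
  let sawChar = false; // true once the current record has any content
  for (let i = 0; i < csv.length; i++) {
    const ch = csv.charAt(i);
    if (ch === '"') {
      // An escaped quote ("") toggles twice, so the state ends up unchanged.
      inQuotes = !inQuotes;
      sawChar = true;
    } else if (ch === "\n" && !inQuotes) {
      records++; // blank lines are counted too
      sawChar = false;
    } else if (ch !== "\r") {
      sawChar = true;
    }
  }
  // Count a final record that has no trailing newline.
  return sawChar ? records + 1 : records;
}

const sample = 'id,note\n1,"line one\nline two"\n2,"a, b"\n';
console.log(countRecords(sample)); // 3 (1 header + 2 data rows)
```

A naive `split("\n")` on the same text would report four non-empty lines, which is the miscount the quoted-field handling avoids.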
Row limit set to 0 or blank
Run disabled: The Limit rows button stays disabled unless Row limit is at least 1 (the input enforces a minimum of 1 and the run requires limit >= 1). There is no 'keep everything' shortcut; to keep all rows, enter a number larger than the row count.
Non-UTF-8 source encoding
May mojibake: The file is read with the browser's File.text() (UTF-8 decode) and downloaded as UTF-8. A Latin-1 or UTF-16 source can show garbled accents. This tool has no encoding selector, so normalize encoding upstream (e.g. with csv-cleaner) if you see mojibake.
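If you'd like to check a file before dropping it, one option is to test whether its bytes decode as strict UTF-8. The sketch below uses the standard `TextDecoder` API with `fatal: true`; `isValidUtf8` is a hypothetical helper of ours, not something the tool exposes.

```typescript
// Returns true when the bytes are well-formed UTF-8.
// With fatal: true, TextDecoder throws instead of inserting U+FFFD.
function isValidUtf8(bytes: Uint8Array): boolean {
  try {
    new TextDecoder("utf-8", { fatal: true }).decode(bytes);
    return true;
  } catch {
    return false;
  }
}

// "café" encoded as UTF-8 (é = 0xC3 0xA9) and as Latin-1 (é = 0xE9).
const utf8Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xc3, 0xa9]);
const latin1Bytes = new Uint8Array([0x63, 0x61, 0x66, 0xe9]);

console.log(isValidUtf8(utf8Bytes));   // true
console.log(isValidUtf8(latin1Bytes)); // false
```

In a browser you could obtain the bytes with `new Uint8Array(await file.arrayBuffer())`. Pure-ASCII files pass this check whatever their declared encoding, so a `true` result only means the file will not be garbled by a UTF-8 decode.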
Frequently asked questions
Does it take rows from the top of the file or at random?
From the top. With Row offset 0 it keeps the first Row limit data rows in their original order. There is no random-sample option. For a value-biased sample, sort with csv-sorter first and then take the top N.
Can I take rows from the middle of the file?
Yes — that's what Row offset is for. Set the offset to the number of leading data rows to skip, then the limit counts from there. For example offset 10000, limit 1000 keeps rows 10001–11000.
Can I take rows from the end of the file?
Not directly — there's no negative offset or tail option. If you know the row count, set the offset to (total − N). Otherwise sort descending with csv-sorter so the rows you want are at the top, then take the first N.
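The (total − N) arithmetic is simple enough to do by hand, but a tiny sketch makes the clamping explicit. `tailOffset` is a hypothetical helper name, not a feature of the tool.

```typescript
// Offset to enter so that the slice covers the last n data rows.
// Clamped at 0 for files with fewer than n rows.
function tailOffset(totalDataRows: number, n: number): number {
  return Math.max(0, totalDataRows - n);
}

console.log(tailOffset(50000, 100)); // 49900, i.e. rows 49901–50000
console.log(tailOffset(60, 100));    // 0: the whole file is the tail
```

Set Row offset to the returned value and Row limit to N (or anything larger).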
Is the header row included in the sample?
Always. Row 0 is treated as the header and is emitted first regardless of your limit or offset. It does not count toward either number, so Row limit 1000 produces a 1,001-line file.
How many rows can I keep on the free tier?
Up to 500 output rows, and the source file must be 2 MB or smaller. A run that would keep more than 500 rows is blocked with a Pro prompt after processing. Pro raises this to 100,000 output rows and a 100 MB file size.
What does the download filename look like?
It's derived from your source name plus the slice range: <name>.rows-<start>-<end>.csv. A head sample of 1,000 from export.csv downloads as export.rows-1-1000.csv; an offset slice downloads with the offset reflected in the range.
Does it stream the file so a 500 MB CSV won't crash my browser?
No — it reads the whole file into memory via File.text() before parsing, so it is not a streaming reader. The practical ceiling is your tier's file-size cap (2 MB free, 100 MB Pro) plus available browser memory. For very large files, split first with csv-row-splitter.
Will it change my data, quoting, or column order?
No. It only selects a contiguous block of rows. Columns, order, and cell content are untouched. On output, fields are re-serialized with minimal quoting (quotes added only where a comma, quote, or newline requires them).
What if I ask for more rows than the file has?
You get every available data row and no error. The stats show Rows out equal to the actual count and Rows skipped 0 (when offset is 0).
Is my data uploaded anywhere?
No. Parsing and slicing happen entirely in your browser with PapaParse. The file content never reaches a JAD Apps server; only an anonymous run counter is recorded for signed-in dashboard stats.
How do I make a sample weighted toward specific values?
Chain tools. To sample by category, filter with csv-column-filter first; to sample top performers, sort with csv-sorter; then run the Row Limiter on the result to cap the count.
Does the offset value appear in the stats?
Indirectly. The panel shows Total rows in, Rows out, and Rows skipped, where Rows skipped is the rows not in your output (offset skips plus any trailing rows beyond the limit). Use Rows out to confirm the sample size.
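Read literally, that definition means Rows skipped is simply Total rows in minus Rows out. A minimal sketch of the arithmetic, with hypothetical names (`SliceStats`, `sliceStats`) and the assumption that all three counts refer to data rows only:

```typescript
interface SliceStats {
  totalRowsIn: number;
  rowsOut: number;
  rowsSkipped: number;
}

// Assumes counts exclude the header, as in the worked examples on this page.
function sliceStats(totalDataRows: number, limit: number, offset: number): SliceStats {
  // Rows kept are clamped to what is available after the offset.
  const rowsOut = Math.max(0, Math.min(limit, totalDataRows - offset));
  return {
    totalRowsIn: totalDataRows,
    rowsOut,
    // Everything not in the output: offset skips plus trailing rows.
    rowsSkipped: totalDataRows - rowsOut,
  };
}

console.log(sliceStats(6, 3, 0));    // { totalRowsIn: 6, rowsOut: 3, rowsSkipped: 3 }
console.log(sliceStats(6, 1000, 0)); // { totalRowsIn: 6, rowsOut: 6, rowsSkipped: 0 }
```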
Privacy first
Processing runs locally in your browser with PapaParse. No file is uploaded — only metadata counters are saved for signed-in dashboard stats.