How to generate pandas dataframe code from an excel file for jupyter notebooks
- Step 1Put the table on sheet 1 — Only the first sheet is read. If your analysis data is on a
Datatab behind aCovertab, move it to the front or export it alone — the tool will not reach into later sheets. - Step 2Upload and choose the pandas format — Drop the
.xlsxor.csvand setformattopandas. The generator wraps each column as a list insidepd.DataFrame({...})and assigns it todf. - Step 3Copy into a notebook cell — Use Copy (or Download
data.py) and paste into a Jupyter/Colab cell. Run it —dfexists immediately, noread_excel, no upload widget, no path. - Step 4Cast the typed-as-text columns — Numeric and boolean columns arrive quoted. Add
df = df.astype({"score": float, "count": int})ordf["score"] = pd.to_numeric(df["score"])right after the constructor. - Step 5Map the booleans — Boolean cells come out as
"TRUE"/"FALSE"strings. Convert withdf["active"] = df["active"].map({"TRUE": True, "FALSE": False})before any boolean indexing. - Step 6Treat it as a snapshot — Inline data is a frozen copy of up to 500 rows. If the source workbook changes, re-run the generator and replace the cell — for live or large data, keep
pandas.read_excel()instead. For a quick visual instead of a DataFrame, the same sheet can become an SVG chart.
Inline constructor vs read_excel for notebooks
When the generated pd.DataFrame constructor beats reading the file, and when it doesn't.
| Situation | Inline pd.DataFrame({...}) | pandas.read_excel() |
|---|---|---|
| Notebook shared on GitHub without the .xlsx | Works — data is in the cell | Fails — file not found |
| Colab/Kaggle kernel with no openpyxl | Works — no Excel reader needed | Needs openpyxl/xlrd installed |
| Reviewer wants to see the data in the diff | Visible inline | Opaque binary attachment |
| Dataset over ~500 rows | Truncated to 500 | Reads the whole file |
| Data refreshes weekly | Stale until you re-generate | Always current on re-read |
| Numeric/date dtypes matter | Arrive as text — cast manually | Inferred on read |
What the pandas format emits per value
The constructor is one Python list per column. Spreadsheet cells reach the generator as formatted text, so plan a cast for numeric/boolean columns.
| Cell | Emitted in the list | Cast you'll likely add |
|---|---|---|
Ann (text) | "Ann" | none — already str |
9.5 (number) | "9.5" | pd.to_numeric(df["col"]) |
42 (number) | "42" | df.astype({"col": int}) |
TRUE (bool) | "TRUE" | .map({"TRUE": True, "FALSE": False}) |
2024-01-15 (date) | "1/15/24" (formatted text) | pd.to_datetime(df["col"]) |
| (empty) | None | reads as NaN automatically |
Tier upload limits (DataFrame size still caps at 500 rows)
Tier limits govern the file you can upload; the inlined DataFrame is always at most 500 rows. The tool needs Pro or above.
| Tier | Max file | Max rows uploaded | Rows inlined in df |
|---|---|---|---|
| Free | n/a — tool gated | n/a | n/a (Pro required) |
| Pro | 50 MB | 100,000 | 500 |
| Pro-media | 200 MB | 500,000 | 500 |
| Developer | 500 MB | unlimited | 500 |
Cookbook
Constructor fragments and the dtype fixes you'll paste right after them. Values are quoted because cells arrive as formatted text.
Basic DataFrame constructor
The pandas format prepends the import and builds the column lists. df is ready after running the cell.
Excel (Sheet1):
region | revenue | growth
EMEA | 120000 | 0.12
APAC | 98000 | 0.31
format: pandas → data.py
import pandas as pd
df = pd.DataFrame({
"region": ["EMEA", "APAC"],
"revenue": ["120000", "98000"],
"growth": ["0.12", "0.31"]
})Fixing dtypes after the constructor
Numbers are strings. One astype call (or to_numeric per column) restores numeric dtypes for analysis.
import pandas as pd
df = pd.DataFrame({
"region": ["EMEA", "APAC"],
"revenue": ["120000", "98000"],
"growth": ["0.12", "0.31"]
})
df = df.astype({"revenue": int, "growth": float})
# now df["revenue"].sum() works as a number, not string concatEmpty cells become None → NaN
A blank cell is None in the constructor, which pandas treats as NaN — so missing-value methods work without extra cleaning.
Excel:
user | score
ann | 88
bob | <- blank
import pandas as pd
df = pd.DataFrame({
"user": ["ann", "bob"],
"score": ["88", None]
})
df["score"].isna() # bob -> True
df["score"] = pd.to_numeric(df["score"]) # NaN stays NaNDates come through as formatted text
Date cells are formatted to a display string by SheetJS, so they arrive quoted. Parse with pd.to_datetime after construction.
import pandas as pd
df = pd.DataFrame({
"event": ["launch", "review"],
"on": ["1/15/24", "3/1/24"]
})
df["on"] = pd.to_datetime(df["on"])
# now df.set_index("on").resample("M") etc. worksKeyword-safe columns for attribute access
df.class would be a syntax error; the sanitiser renames the column so dotted access and queries stay legal.
Excel header: class -> sanitised to class_col
import pandas as pd
df = pd.DataFrame({
"class_col": ["A", "B"],
"count": ["10", "5"]
})
df.class_col # legal — 'df.class' would not beEdge cases and what actually happens
No dtypes comment block is produced
By designThe pandas format emits only the import and the pd.DataFrame({...}) constructor — there is no auto-generated dtypes annotation. Because every value arrives as text, add your own astype/to_numeric line for numeric and date columns after pasting.
Numeric columns are strings
Expecteddf["revenue"].sum() on a freshly pasted DataFrame concatenates strings instead of adding numbers, because the cells came through as text. Cast first: df = df.astype({"revenue": int}) or pd.to_numeric. This is the single most common surprise with the pandas format.
Only sheet 1 becomes a DataFrame
By designA multi-sheet model yields a DataFrame for the first tab only. To analyse several tabs, run the tool once per tab (moving each to the front, or saving each as its own file) and concatenate the resulting DataFrames in your notebook. If the tabs share a key and you'd rather merge them into one sheet upstream, the sheet joiner does that before you generate.
DataFrame capped at 500 rows
TruncatedThe pandas format slices to the first 500 rows and — unlike list_of_dicts — adds no truncation comment, so a 5,000-row file silently produces a 500-row DataFrame. For the full dataset switch to pandas.read_excel() against the source file.
Line break inside a cell
Invalid PythonNewlines in cells are not escaped, so a multi-line cell yields a broken string literal and the cell won't run. Collapse line breaks before generating, or fix the offending literal in the output.
Boolean columns are 'TRUE'/'FALSE' strings
ExpectedBoolean cells arrive as text, so df["active"] is an object column of "TRUE"/"FALSE". Boolean indexing (df[df.active]) won't behave until you map them: df["active"].map({"TRUE": True, "FALSE": False}).
Duplicate / colliding column names
CollisionIf two headers sanitise to the same identifier, the second column's list overwrites the first in the constructor dict, so you lose a column. Rename headers first with csv-header-rename to keep them distinct.
Empty sheet
SupportedA sheet with no data rows produces an effectively empty structure rather than an error. With the pandas format you'll get a constructor with empty lists — a valid, empty DataFrame.
File too large for the tier
RejectedUploads above the tier ceiling (Pro 50 MB / 100,000 rows; Pro-media 200 MB / 500,000; Developer 500 MB / unlimited) are rejected before generation. Within the limit, still only 500 rows are inlined.
Run on Free tier
RejectedThe generator is Pro-gated and throws Python Generator requires Pro tier. for Free users. Upgrade to Pro to generate DataFrame code.
Frequently asked questions
Why does df["col"].sum() concatenate instead of add?
Because numeric cells are read as formatted text and emitted as quoted strings, the column has object dtype. Cast it first: df["col"] = pd.to_numeric(df["col"]) or df = df.astype({"col": int}).
Is there a dtypes block in the output?
No. The pandas format outputs only the import and the pd.DataFrame({...}) constructor. Add any dtype casting yourself after pasting — it's typically one astype line.
Why not just use pandas.read_excel()?
Use it when the file ships with the notebook and the kernel has openpyxl, and especially for data over 500 rows or data that changes. The inline constructor wins when the .xlsx isn't bundled (shared GitHub/Colab notebooks), when you want the data visible in the diff, or when you want zero Excel-reader dependency.
How many rows make it into the DataFrame?
At most 500 — a fixed code constant, not a tier setting. The pandas format truncates silently (no comment), so a larger file quietly yields a 500-row DataFrame.
Does it read multiple sheets into one DataFrame?
No. Only the first sheet is read. Generate per tab and pd.concat([...]) the results in your notebook.
What variable name is used?
df, and import pandas as pd is included. Rename to taste after pasting.
How are dates handled?
Date cells are formatted to display strings by SheetJS, so they arrive quoted (e.g. "1/15/24"). Parse them with pd.to_datetime(df["col"]) after construction.
What about empty cells?
They become None, which pandas reads as NaN. So df.isna(), df.dropna(), and df.fillna() all work without an extra cleaning step.
Are columns with spaces or keywords safe?
Yes. Headers are sanitised to valid identifiers (First Name → First_Name, class → class_col), so df.column_name attribute access and df.query(...) won't break. If you also want the source workbook's headers cleaned permanently, run csv-header-rename on the file first.
Does the constructor need pandas installed?
Yes — pd.DataFrame requires pandas where you run the cell. If you want zero dependency, use the list_of_dicts format and call pd.DataFrame(data) only where pandas is available.
Is my data sent to a server?
No. Everything runs in your browser via SheetJS. Nothing is uploaded, which is why unpublished experiment data is safe to convert.
What does the download look like?
A plain-text data.py containing the import and constructor. You can also Copy it directly into a notebook cell.
Privacy first
Every JAD Excel tool runs entirely in your browser using SheetJS and ExcelJS. Your spreadsheets, formulas, and data never leave your device — verified by zero outbound network requests during processing.