How to compute Shannon entropy for file analysis: a technical deep dive
- Step 1: Slice the file into 256-byte windows. Iterate the buffer in steps of 256. Each window spans [i, min(i+256, length)). The window length is end - i, which equals 256 for every window except possibly the last. This length, not a hardcoded 256, is the denominator for probabilities.
- Step 2: Build the 256-bucket frequency table. Allocate Uint32Array(256) and increment counts[byte] for every byte in the window. After the pass, counts[b] is the occurrence count of byte value b (0-255) in this window.
- Step 3: Convert counts to probabilities. For each byte value b with counts[b] > 0, compute p = counts[b] / windowLength. Byte values that never appear contribute nothing (the algorithm skips zero counts to avoid log2(0)).
- Step 4: Accumulate the Shannon sum. Maintain h = 0; for each non-zero p, do h -= p * Math.log2(p). Because p <= 1, log2(p) <= 0, so each term subtracts a non-positive number and h accumulates upward to at most 8.
- Step 5: Round and record the chunk value. Store Math.round(h * 1000) / 1000, the entropy to 3 decimal places. Push it onto the chunks array. Repeat for every window until the buffer is exhausted.
- Step 6: Summarize and plot. Count windows with value >= 7.5 into highEntropyChunks; total is chunks.length. The recharts LineChart plots each chunk against its index on a fixed [0,8] axis with the amber 7.5 reference line; the footer reports the max chunk value.
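The six steps can be sketched in TypeScript as follows. This is a minimal illustration, not the analyzer's actual source: the names analyzeEntropy, EntropyProfile, WINDOW_SIZE, and HIGH_ENTROPY_THRESHOLD are hypothetical, and only the arithmetic is taken from the steps above.

```typescript
// Hypothetical names; only the arithmetic follows the six steps described above.
const WINDOW_SIZE = 256;
const HIGH_ENTROPY_THRESHOLD = 7.5;

interface EntropyProfile {
  chunks: number[];
  highEntropyChunks: number;
  total: number;
}

function analyzeEntropy(buffer: Uint8Array): EntropyProfile {
  const chunks: number[] = [];
  for (let i = 0; i < buffer.length; i += WINDOW_SIZE) {
    // Step 1: window [i, end); the real length is the denominator.
    const end = Math.min(i + WINDOW_SIZE, buffer.length);
    const windowLength = end - i;

    // Step 2: one bucket per possible byte value.
    const counts = new Uint32Array(256);
    for (let j = i; j < end; j++) counts[buffer[j]]++;

    // Steps 3-4: probabilities and the Shannon sum over non-zero buckets.
    let h = 0;
    for (let b = 0; b < 256; b++) {
      if (counts[b] === 0) continue; // skip to avoid log2(0)
      const p = counts[b] / windowLength;
      h -= p * Math.log2(p);
    }

    // Step 5: round to 3 decimal places.
    chunks.push(Math.round(h * 1000) / 1000);
  }

  // Step 6: count windows at or above the 7.5 line.
  const highEntropyChunks = chunks.filter((v) => v >= HIGH_ENTROPY_THRESHOLD).length;
  return { chunks, highEntropyChunks, total: chunks.length };
}

// 300 zero bytes: one full window plus a 44-byte tail, both constant.
const profile = analyzeEntropy(new Uint8Array(300));
console.log(profile); // { chunks: [ 0, 0 ], highEntropyChunks: 0, total: 2 }
```

The window is read in place by index rather than copied, which keeps memory flat for large buffers; that is a choice of this sketch, not something the steps prescribe.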
The per-chunk algorithm, step by step
Exact computation the analyzer performs for each 256-byte window. Reproduce this and your numbers will match to 3 decimals.
| Step | Operation | Detail |
|---|---|---|
| 1. Window | [i, min(i+256, len)) | Non-overlapping; last window may be < 256 bytes |
| 2. Count | Uint32Array(256), counts[byte]++ | One bucket per possible byte value 0-255 |
| 3. Length | total = end - i | Actual window length, used as the denominator |
| 4. Probability | p = counts[b] / total | Skipped when counts[b] == 0 |
| 5. Accumulate | h -= p * log2(p) | Sum over non-zero buckets only |
| 6. Round | round(h * 1000) / 1000 | 3 decimal places, range 0.000 - 8.000 |
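For reference, the table rows correspond to the standard Shannon formula restricted to one window. The symbols c_b (for counts[b]), n (for total), and p_b are notation introduced here for illustration and do not appear in the tool itself.

```latex
H = -\sum_{b \,:\, c_b > 0} p_b \log_2 p_b,
\qquad p_b = \frac{c_b}{n},
\qquad 0 \le H \le \log_2 \min(n, 256) \le 8
```

The upper bound follows because a window of n bytes can contain at most min(n, 256) distinct values, and entropy is maximized when those values are equally frequent.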
Entropy signatures of common file types
Approximate Shannon entropy (bits/byte) by content. The analyzer flags chunks >= 7.5; the whole-file flag fires when more than 50% of chunks are flagged.
| File / content type | Typical entropy | Why |
|---|---|---|
| Plain ASCII text | 4.0 - 4.8 | ~70-90 distinct values, skewed toward letters/space |
| Source code | 4.5 - 5.5 | More symbols and structure than prose |
| UTF-8 with multibyte | 4.5 - 5.5 | Continuation bytes broaden the distribution |
| Bitmap (BMP, uncompressed) | varies, often 6-7 | Depends on image content; gradients lower, noise higher |
| x86/x64 machine code | 5.5 - 6.5 | Opcode distribution is non-uniform |
| JPEG | 6.8 - 7.3 | Lossy DCT + entropy coding |
| DEFLATE / gzip / zlib / PNG IDAT | 7.5 - 7.9 | Huffman coding leaves minor regularities |
| ZIP / 7z body | 7.6 - 8.0 | Compressed payloads |
| MP3 / AAC / FLAC | 7.7 - 7.9 | Entropy-coded audio |
| AES-GCM / strong encryption | 7.95 - 8.0 | Computationally indistinguishable from random |
| Null / 0xFF padding | 0.0 | Single byte value — zero uncertainty |
| Random key material | ~8.0 | Uniform distribution |
Output schema (no options)
Fields the analyzer returns. There is no configurable option — the option schema is empty and the window size is hardcoded to 256.
| Field | Type | Definition |
|---|---|---|
| chunks | number[] | Per-window entropy, each rounded to 3 decimals |
| highEntropyChunks | number | Count of windows with value >= 7.5 |
| total | number | Window count = ceil(byteLength / 256) |
| threatDetected | boolean | highEntropyChunks > total * 0.5 (browser findings) |
| Chart Y-axis | fixed [0, 8] | recharts domain; amber reference line at 7.5 |
| Footer | Max chunk value | Single highest entropy chunk — not an average |
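The fields in the table could be modeled roughly as below. The interface name EntropyResult and the helpers summarize and maxChunk are hypothetical; the field names and the two comparisons (>= 7.5 and > total * 0.5) come from the table. Returning 0 as the maximum of an empty array is an assumption of this sketch, not documented behavior.

```typescript
// Hypothetical type and helper names; field names and thresholds follow the table.
interface EntropyResult {
  chunks: number[];
  highEntropyChunks: number;
  total: number;
  threatDetected: boolean;
}

function summarize(chunks: number[]): EntropyResult {
  const highEntropyChunks = chunks.filter((v) => v >= 7.5).length;
  const total = chunks.length;
  // Strict inequality: exactly half the windows being high-entropy does not trigger the flag.
  const threatDetected = highEntropyChunks > total * 0.5;
  return { chunks, highEntropyChunks, total, threatDetected };
}

// Footer value: the single highest chunk. Returning 0 for an empty array is an assumption.
function maxChunk(chunks: number[]): number {
  return chunks.reduce((max, v) => (v > max ? v : max), 0);
}

console.log(summarize([7.9, 7.6, 4.2, 0]).threatDetected); // false (2 of 4 is not more than half)
console.log(summarize([7.9, 7.6, 7.5, 0]).threatDetected); // true (3 of 4)
console.log(maxChunk([7.9, 7.6, 4.2, 0])); // 7.9
```

A reduce is used for the maximum instead of Math.max(...chunks) because spreading a very long array into function arguments can exceed engine limits.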
Cookbook
Worked numerical examples. Entropy is computed exactly as the analyzer does: probabilities over a 256-byte window, summed, rounded to 3 decimals.
A window of all identical bytes -> entropy 0
A 256-byte window that is entirely 0x00. One byte value with p=1; log2(1)=0, so the sum is zero. This is the deep-valley case you see over PE padding and sparse regions.
counts[0x00] = 256, all others 0
p(0x00) = 256/256 = 1.0
H = -(1.0 * log2(1.0)) = -(1.0 * 0) = 0
chunk value: 0.000
A window with two equally frequent bytes -> entropy 1
128 bytes of 0xAA and 128 of 0xBB. Two symbols at p=0.5 each. Each contributes 0.5 bits; total exactly 1.0 — a one-bit-per-byte source.
p(0xAA) = 0.5, p(0xBB) = 0.5
H = -(0.5*log2 0.5 + 0.5*log2 0.5) = -(0.5*(-1) + 0.5*(-1)) = 1.0
chunk value: 1.000
A perfectly uniform window -> entropy 8 (the ceiling)
All 256 byte values appear exactly once in a 256-byte window. Every p=1/256; the sum is log2(256)=8. This is the theoretical maximum the algorithm can return.
counts[b] = 1 for every b in 0..255
p(b) = 1/256 for all 256 values
H = -256 * (1/256) * log2(1/256) = -log2(1/256) = log2(256) = 8.0
chunk value: 8.000
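The three exact cases so far (0, 1, and 8 bits/byte) can be checked with a short script. The helper name windowEntropy and the test windows are hypothetical; the computation follows the per-window description in this document.

```typescript
// Hypothetical helper: entropy of one window, rounded to 3 decimals.
function windowEntropy(bytes: Uint8Array): number {
  const counts = new Uint32Array(256);
  for (let i = 0; i < bytes.length; i++) counts[bytes[i]]++;
  let h = 0;
  for (let b = 0; b < 256; b++) {
    if (counts[b] === 0) continue;
    const p = counts[b] / bytes.length;
    h -= p * Math.log2(p);
  }
  return Math.round(h * 1000) / 1000;
}

// All 0x00: one symbol at p = 1.
const constant = new Uint8Array(256);

// 128 bytes of 0xAA followed by 128 bytes of 0xBB: two symbols at p = 0.5.
const twoSymbol = new Uint8Array(256);
for (let i = 0; i < 256; i++) twoSymbol[i] = i < 128 ? 0xaa : 0xbb;

// Every byte value exactly once: 256 symbols at p = 1/256.
const uniform = new Uint8Array(256);
for (let i = 0; i < 256; i++) uniform[i] = i;

console.log(windowEntropy(constant), windowEntropy(twoSymbol), windowEntropy(uniform)); // 0 1 8
```

The order of bytes inside a window does not matter, since only the counts enter the sum; an interleaved 0xAA/0xBB pattern gives the same 1.0.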
English prose -> ~4.2 bits/byte
A 256-byte window of typical English. Letters, space, and punctuation dominate a subset of the 256 values with a skewed distribution, yielding entropy around 4.0-4.8 — well below the amber line.
Distribution skewed to a-z, space, common punctuation (~50-70 distinct byte values, non-uniform)
H ~ 4.2 bits/byte
chunk value: ~4.200 (far below 7.5 -> not high-entropy)
AES ciphertext -> ~7.99, header chunk lower
Encrypted output is near-uniform, so chunks read ~7.99. But the container's first window holds format markers (salt, IV, magic) with lower entropy, so the very first chunk can read noticeably below the ciphertext body.
Body chunks: ~7.990 (near-uniform ciphertext)
First chunk: ~6.x (salt + IV + markers are structured)
highEntropyChunks ~= total - 1
threatDetected: true (almost all chunks >= 7.5)
Edge cases and what actually happens
Maximum entropy is exactly 8.0, never higher
By design: Because the alphabet is 256 byte values, log2(256) = 8 is the hard ceiling. The recharts Y-axis domain is fixed at [0, 8] to reflect this. A value above 8 would indicate a bug; the algorithm cannot produce one.
Last window shorter than 256 bytes
Preserved: The denominator is end - i (the real window length), not a hardcoded 256, so a final 100-byte window is normalized correctly. A 100-byte window has at most 100 distinct values, which caps its entropy at log2(100) ~= 6.64 even if every byte differs, so expect the last chunk to read a touch lower for small tails.
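To see why the denominator matters, the sketch below computes a 100-byte tail of all-distinct bytes twice: once with the real window length and once with a hardcoded 256. The function entropyWithDenominator is hypothetical and exists only to make the comparison; the analyzer is described as always using the real length.

```typescript
// Hypothetical helper that takes the denominator as a parameter, for comparison only.
function entropyWithDenominator(bytes: Uint8Array, denominator: number): number {
  const counts = new Uint32Array(256);
  for (let i = 0; i < bytes.length; i++) counts[bytes[i]]++;
  let h = 0;
  for (let b = 0; b < 256; b++) {
    if (counts[b] === 0) continue;
    const p = counts[b] / denominator;
    h -= p * Math.log2(p);
  }
  return Math.round(h * 1000) / 1000;
}

// A 100-byte tail in which every byte is different (values 0..99).
const tail = new Uint8Array(100);
for (let i = 0; i < tail.length; i++) tail[i] = i;

const correct = entropyWithDenominator(tail, tail.length); // real window length
const hardcoded = entropyWithDenominator(tail, 256); // wrong: probabilities sum to 100/256

console.log(correct); // 6.644, i.e. log2(100) rounded to 3 decimals
console.log(hardcoded); // 3.125
```

With the hardcoded denominator the "probabilities" no longer sum to 1, so the result is not an entropy at all; it merely looks like a plausible low value.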
Values rounded to 3 decimals
Preserved: Each chunk is round(h * 1000) / 1000. So 7.9994 records as 7.999 and a true 8.0 records as 8.000. The rounding is cosmetic for triage but means you should recompute from raw bytes if you need full floating-point precision for research.
Zero-count byte values are skipped, not treated as p=0
By design: Byte values that never appear contribute nothing and are skipped before the log, avoiding log2(0) = -Infinity. This is the standard convention (0 log 0 = 0) and is why a sparse window does not blow up the sum.
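The effect of the skip is easy to demonstrate in JavaScript arithmetic, where Math.log2(0) is -Infinity and 0 * -Infinity is NaN. The function entropyFromCounts and its skipZeros flag are hypothetical, added here only to contrast the two behaviors.

```typescript
// Hypothetical helper: unrounded entropy from a count table, with the zero skip switchable.
function entropyFromCounts(counts: Uint32Array, total: number, skipZeros: boolean): number {
  let h = 0;
  for (let b = 0; b < counts.length; b++) {
    if (skipZeros && counts[b] === 0) continue;
    const p = counts[b] / total;
    h -= p * Math.log2(p); // for p = 0 this is 0 * -Infinity, which is NaN
  }
  return h;
}

// A window of 256 identical bytes (0x41): one non-zero bucket, 255 empty ones.
const sparseCounts = new Uint32Array(256);
sparseCounts[0x41] = 256;

const naive = entropyFromCounts(sparseCounts, 256, false);
const skipped = entropyFromCounts(sparseCounts, 256, true);

console.log(naive); // NaN
console.log(skipped); // 0
```

Once a single NaN enters the running sum it never leaves, so one empty bucket is enough to ruin the whole window without the skip.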
Entropy does not equal compressibility
Common misconception: High Shannon entropy over 256-byte windows means the byte distribution within each window is flat; it does NOT prove the data is incompressible globally. Repeated high-entropy blocks (e.g. a duplicated encrypted region) still compress via deduplication even though each window reads ~8.0.
Window boundaries can blur transitions
Investigate: A transition from text to ciphertext that falls mid-window produces a single chunk whose value reflects the combined byte distribution of both, smoothing the step on the curve. The fixed 256-byte grid is not aligned to file structure, so feature edges land within +/- one chunk of their true offset.
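A synthetic example makes the smoothing concrete. Below, a window that is half zero padding and half all-distinct bytes is compared with its two pure neighbors. The helper windowEntropy and the test data are hypothetical; the data is chosen so the values are exact.

```typescript
// Hypothetical helper: entropy of one window, rounded to 3 decimals.
function windowEntropy(bytes: Uint8Array): number {
  const counts = new Uint32Array(256);
  for (let i = 0; i < bytes.length; i++) counts[bytes[i]]++;
  let h = 0;
  for (let b = 0; b < 256; b++) {
    if (counts[b] === 0) continue;
    const p = counts[b] / bytes.length;
    h -= p * Math.log2(p);
  }
  return Math.round(h * 1000) / 1000;
}

// Pure padding window: all 0x00.
const padding = new Uint8Array(256);

// Pure dense window: every byte value once.
const dense = new Uint8Array(256);
for (let i = 0; i < 256; i++) dense[i] = i;

// Straddling window: 128 zero bytes, then 128 distinct non-zero bytes (1..128).
const straddling = new Uint8Array(256);
for (let i = 128; i < 256; i++) straddling[i] = i - 127;

console.log(windowEntropy(padding), windowEntropy(straddling), windowEntropy(dense)); // 0 4.5 8
```

The straddling window reads 4.5: 0.5 bits from the zero byte at p = 0.5, plus 128 values at p = 1/256 contributing 4 bits. The single intermediate point hides where inside the window the transition actually occurs.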
Per-chunk, not whole-file, entropy
By design: The analyzer reports a profile of per-window values, not one global entropy number. Whole-file entropy would mask local structure (the very thing you want to see). If you need a single figure, average the chunks array yourself; the tool does not display an average.
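If a single figure is needed, the mean of the chunk values is straightforward to compute, though it is worth knowing that it is not the same quantity as entropy computed over the whole buffer at once. The sketch below shows both on a deliberately extreme input; the names entropy and meanOfChunks are hypothetical, and returning NaN for an empty array is an assumption of this sketch.

```typescript
// Hypothetical helper: unrounded entropy of any byte range.
function entropy(bytes: Uint8Array): number {
  const counts = new Uint32Array(256);
  for (let i = 0; i < bytes.length; i++) counts[bytes[i]]++;
  let h = 0;
  for (let b = 0; b < 256; b++) {
    if (counts[b] === 0) continue;
    const p = counts[b] / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Hypothetical helper: arithmetic mean of the chunk values (NaN when there are none).
function meanOfChunks(values: number[]): number {
  if (values.length === 0) return NaN;
  let sum = 0;
  for (let i = 0; i < values.length; i++) sum += values[i];
  return sum / values.length;
}

// Two windows: 256 bytes of 0x00 followed by 256 bytes of 0xFF.
const twoBlocks = new Uint8Array(512);
for (let i = 256; i < 512; i++) twoBlocks[i] = 0xff;

const perChunk = [entropy(twoBlocks.subarray(0, 256)), entropy(twoBlocks.subarray(256, 512))];

console.log(meanOfChunks(perChunk)); // 0 (each window is constant)
console.log(entropy(twoBlocks)); // 1 (two byte values at p = 0.5 over the whole buffer)
```

The mean summarizes local behavior, while the pooled value also counts differences between windows, so the pooled figure is never lower than the length-weighted mean of the windows.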
Identical bytes give 0, not undefined
Preserved: A constant window has one symbol at p=1, and 1 * log2(1) = 0, so entropy is exactly 0.000, which is well-defined, not NaN. This is why padding regions render as clean valleys rather than gaps.
File exceeds tier limit before computation
Rejected: The buffer reader enforces the size cap and throws before any window is processed: Free 10 MB, Pro 100 MB, Pro-media 500 MB, Developer 2 GB. No partial profile is returned for oversized input.
Empty file
Edge: A zero-byte buffer produces an empty chunks array and total = 0, so there are no windows to compute and nothing to plot. With no file at all, the tool throws No file provided.
Frequently asked questions
Why is 8 bits/byte the maximum?
Because the symbol alphabet is the 256 possible byte values and log2(256) = 8. Maximum entropy occurs when the distribution is uniform, with every byte value equally likely within the window. A 256-byte window reaches exactly 8 only when each of the 256 values appears exactly once, so real data almost always reads below it; the analyzer's Y-axis is fixed at [0, 8] to reflect this hard ceiling.
What is the entropy of AES-encrypted data?
AES-GCM output is computationally indistinguishable from random, so the ciphertext body reads ~7.99 bits/byte and nearly every chunk counts as high-entropy. The container's first window is usually lower because it carries structured format markers (salt, IV, magic) rather than ciphertext.
How does DEFLATE compression affect entropy?
DEFLATE output reads ~7.5-7.9 — high but not maximal, because Huffman coding leaves minor statistical regularities that keep it just below ciphertext. This is why gzip/zlib/PNG-IDAT regions land on or just above the amber 7.5 line, in the same band as (but slightly under) encryption.
What exact window size and rounding does the tool use?
A fixed 256-byte non-overlapping window, with each chunk's entropy rounded to 3 decimal places via round(h * 1000) / 1000. The window size is not configurable — the option schema is empty. The denominator is the real window length, so the final short chunk is normalized correctly.
Does the analyzer show whole-file entropy or a per-chunk profile?
A per-chunk profile — an array of one entropy value per 256-byte window, plotted as a curve. It does not display a single whole-file average (the footer shows the max chunk, not the mean). To get one global number, average the chunks array yourself.
Why are byte values that never appear ignored in the sum?
Because log2(0) diverges to -Infinity, so the term p * log2(p) cannot be evaluated directly at p = 0. The standard convention is 0 * log 0 = 0, so zero-count buckets contribute nothing and are skipped. This keeps the sum finite for sparse windows and is mathematically correct, not an approximation.
Can I reproduce the analyzer's numbers in my own code?
Yes — follow the six steps: 256-byte windows, a 256-bucket count table, p = count / windowLength, h -= p * log2(p) over non-zero buckets, then round to 3 decimals. The same algorithm runs server-side, so a runner/API call returns identical chunks, highEntropyChunks, and total.
Does high entropy mean the file is incompressible?
No. Per-window entropy near 8.0 means each window's byte distribution is flat, but it says nothing about global redundancy. Two identical high-entropy blocks still compress via deduplication. Shannon entropy here is a local randomness measure, not a global compressibility bound.
How is the chart rendered, and what does the footer mean?
It is a recharts LineChart: entropy on a fixed [0,8] Y-axis, chunk index on a hidden X-axis, with an amber dashed reference line at 7.5. The header readout shows the high-entropy chunk count and percentage; the footer shows Max: X.XX bits/byte — the single highest chunk, not an average.
What happens at the boundary between two content types?
A transition that falls mid-window produces one chunk whose value reflects the combined byte distribution of both content types, smoothing the step. Because the 256-byte grid is not aligned to file structure, a feature's edge on the curve lands within about one chunk of its true byte offset.
How large a file can the analyzer process?
Up to the tier cap: Free 10 MB, Pro 100 MB, Pro-media 500 MB, Developer 2 GB. The whole file is read into a buffer first, then chunked; oversized files are rejected with a plan-limit error before any computation. There is no streaming partial profile.
Can I integrate this entropy computation into a pipeline?
Yes. GET /api/v1/tools/entropy-analyzer returns the schema (no options), and the paired @jadapps/runner executes the identical algorithm locally, returning the chunks array plus highEntropyChunks and total. Cross-reference results with magic-byte-validator and hex-header-inspector for typed context.
Privacy first
Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.