Shannon Entropy in File Analysis — Technical Guide to the Per-Chunk Computation

How to compute Shannon entropy for file analysis: a technical deep dive

Step 1
Slice the file into 256-byte windows — Iterate the buffer in steps of 256. Each window spans [i, min(i+256, length)). The window length is end - i, which equals 256 for all windows except possibly the last. This length, not a hardcoded 256, is the denominator for probabilities.
Step 2
Build the 256-bucket frequency table — Allocate Uint32Array(256) and increment counts[byte] for every byte in the window. After the pass, counts[b] is the occurrence count of byte value b (0-255) in this window.
Step 3
Convert counts to probabilities — For each byte value b with counts[b] > 0, compute p = counts[b] / windowLength. Byte values that never appear contribute nothing (the algorithm skips zero counts to avoid log2(0)).
Step 4
Accumulate the Shannon sum — Maintain h = 0; for each non-zero p, do h -= p * Math.log2(p). Because p <= 1, log2(p) <= 0, so each term subtracts a non-positive number — h accumulates upward toward at most 8.
Step 5
Round and record the chunk value — Store Math.round(h * 1000) / 1000 — entropy to 3 decimal places. Push it onto the chunks array. Repeat for every window until the buffer is exhausted.
Step 6
Summarize and plot — Count windows with value >= 7.5 into highEntropyChunks; total is chunks.length. The recharts LineChart plots each chunk against its index on a fixed [0,8] axis with the amber 7.5 reference line; the footer reports the max chunk value.
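
The six steps above can be sketched in a few lines of JavaScript. This is a minimal sketch rather than the analyzer's source: the function name analyzeEntropy and the constants WINDOW and THRESHOLD are illustrative assumptions, and only the arithmetic follows the steps as described.

```javascript
// Sketch of the six steps. WINDOW and THRESHOLD mirror the values stated in
// this guide; the names themselves are assumptions made for illustration.
const WINDOW = 256;
const THRESHOLD = 7.5;

function analyzeEntropy(bytes) {
  const chunks = [];
  for (let i = 0; i < bytes.length; i += WINDOW) {
    const end = Math.min(i + WINDOW, bytes.length); // Step 1: window [i, end)
    const windowLength = end - i;                   // real length, not a fixed 256
    const counts = new Uint32Array(256);            // Step 2: frequency table
    for (let j = i; j < end; j++) counts[bytes[j]]++;
    let h = 0;
    for (let b = 0; b < 256; b++) {
      if (counts[b] === 0) continue;                // Step 3: skip zero counts
      const p = counts[b] / windowLength;
      h -= p * Math.log2(p);                        // Step 4: Shannon sum
    }
    chunks.push(Math.round(h * 1000) / 1000);       // Step 5: round to 3 decimals
  }
  // Step 6: summarize
  const highEntropyChunks = chunks.filter((v) => v >= THRESHOLD).length;
  return { chunks, highEntropyChunks, total: chunks.length };
}

// Usage: one constant window followed by one window holding every byte value once.
const sample = new Uint8Array(512);
for (let k = 0; k < 256; k++) sample[256 + k] = k;
console.log(analyzeEntropy(sample));
// chunks is [0, 8], highEntropyChunks is 1, total is 2
```

An empty buffer never enters the loop, so the same function returns an empty chunks array and total 0.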

The per-chunk algorithm, step by step

Exact computation the analyzer performs for each 256-byte window. Reproduce this and your numbers will match to 3 decimals.

Step	Operation	Detail
1. Window	`[i, min(i+256, len))`	Non-overlapping; last window may be < 256 bytes
2. Count	`Uint32Array(256)`, `counts[byte]++`	One bucket per possible byte value 0-255
3. Length	`total = end - i`	Actual window length, used as the denominator
4. Probability	`p = counts[b] / total`	Skipped when `counts[b] == 0`
5. Accumulate	`h -= p * log2(p)`	Sum over non-zero buckets only
6. Round	`round(h * 1000) / 1000`	3 decimal places, range 0.000 - 8.000

Entropy signatures of common file types

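Approximate Shannon entropy (bits/byte) by content. The analyzer flags individual chunks at >= 7.5, and the whole-file flag fires when more than 50% of chunks are flagged. The ranges describe large samples of each content type; a single 256-byte window holds only 256 observations, so high-entropy content typically reads lower per chunk than over a whole file.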

File / content type	Typical entropy	Why
Plain ASCII text	4.0 - 4.8	~70-90 distinct values, skewed toward letters/space
Source code	4.5 - 5.5	More symbols and structure than prose
UTF-8 with multibyte	4.5 - 5.5	Continuation bytes broaden the distribution
Bitmap (BMP, uncompressed)	varies, often 6-7	Depends on image content; gradients lower, noise higher
x86/x64 machine code	5.5 - 6.5	Opcode distribution is non-uniform
JPEG	6.8 - 7.3	Lossy DCT + entropy coding
DEFLATE / gzip / zlib / PNG IDAT	7.5 - 7.9	Huffman coding leaves minor regularities
ZIP / 7z body	7.6 - 8.0	Compressed payloads
MP3 / AAC / FLAC	7.7 - 7.9	Entropy-coded audio
AES-GCM / strong encryption	7.95 - 8.0	Computationally indistinguishable from random
Null / 0xFF padding	0.0	Single byte value — zero uncertainty
Random key material	~8.0	Uniform distribution

Output schema (no options)

Fields the analyzer returns. There is no configurable option — the option schema is empty and the window size is hardcoded to 256.

Field	Type	Definition
`chunks`	number[]	Per-window entropy, each rounded to 3 decimals
`highEntropyChunks`	number	Count of windows with value >= 7.5
`total`	number	Window count = ceil(byteLength / 256)
`threatDetected`	boolean	`highEntropyChunks > total * 0.5` (browser findings)
Chart Y-axis	fixed [0, 8]	recharts domain; amber reference line at 7.5
Footer	Max chunk value	Single highest entropy chunk — not an average
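
The summary fields can be derived from a chunks array as in the sketch below. The function name summarizeChunks and the returned max property are assumptions for illustration (the guide only says the footer shows the highest chunk); the threshold and the strict "more than half" comparison follow the table above.

```javascript
// Sketch: derive the summary fields from a chunks array.
// summarizeChunks and the "max" property are hypothetical names.
function summarizeChunks(chunks, threshold = 7.5) {
  const total = chunks.length;
  const highEntropyChunks = chunks.filter((v) => v >= threshold).length;
  return {
    total,
    highEntropyChunks,
    // Strict comparison: exactly half of the chunks is not enough.
    threatDetected: highEntropyChunks > total * 0.5,
    // Highest single chunk; returning 0 for an empty array is an assumption.
    max: chunks.reduce((m, v) => Math.max(m, v), 0),
  };
}

const mostlyHigh = summarizeChunks([6.2, 7.99, 7.98, 7.5]);
console.log(mostlyHigh.highEntropyChunks, mostlyHigh.threatDetected); // 3 true

const exactlyHalf = summarizeChunks([7.9, 4.2]);
console.log(exactlyHalf.highEntropyChunks, exactlyHalf.threatDetected); // 1 false
```

The reduce call is used instead of spreading the array into Math.max so that very long chunk arrays do not hit argument-count limits.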

Cookbook

Worked numerical examples. Entropy is computed exactly as the analyzer does: probabilities over a 256-byte window, summed, rounded to 3 decimals.

A window of all identical bytes -> entropy 0

A 256-byte window that is entirely 0x00. One byte value with p=1; log2(1)=0, so the sum is zero. This is the deep-valley case you see over PE padding and sparse regions.

counts[0x00] = 256, all others 0
p(0x00) = 256/256 = 1.0
H = -(1.0 * log2(1.0)) = -(1.0 * 0) = 0
chunk value: 0.000

A window with two equally frequent bytes -> entropy 1

128 bytes of 0xAA and 128 of 0xBB. Two symbols at p=0.5 each. Each contributes 0.5 bits; total exactly 1.0 — a one-bit-per-byte source.

p(0xAA) = 0.5, p(0xBB) = 0.5
H = -(0.5*log2 0.5 + 0.5*log2 0.5)
  = -(0.5*(-1) + 0.5*(-1)) = 1.0
chunk value: 1.000

A perfectly uniform window -> entropy 8 (the ceiling)

All 256 byte values appear exactly once in a 256-byte window. Every p=1/256; the sum is log2(256)=8. This is the theoretical maximum the algorithm can return.

counts[b] = 1 for every b in 0..255
p(b) = 1/256 for all 256 values
H = -256 * (1/256) * log2(1/256)
  = -log2(1/256) = log2(256) = 8.0
chunk value: 8.000
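
The three exact cases above (0, 1 and 8) are easy to check in code. The helper name windowEntropy is an assumption; it applies the per-window arithmetic described in this guide to a single window.

```javascript
// Entropy of one window, rounded to 3 decimals (hypothetical helper name).
function windowEntropy(window) {
  const counts = new Uint32Array(256);
  for (const byte of window) counts[byte]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;            // skip byte values that never appear
    const p = c / window.length;
    h -= p * Math.log2(p);
  }
  return Math.round(h * 1000) / 1000;
}

const constant = new Uint8Array(256); // 256 bytes of 0x00
const twoValues = Uint8Array.from({ length: 256 }, (_, i) => (i < 128 ? 0xaa : 0xbb));
const uniform = Uint8Array.from({ length: 256 }, (_, i) => i); // each value once

console.log(windowEntropy(constant), windowEntropy(twoValues), windowEntropy(uniform));
// 0 1 8
```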

English prose -> ~4.2 bits/byte

A 256-byte window of typical English. Letters, space, and punctuation dominate a subset of the 256 values with a skewed distribution, yielding entropy around 4.0-4.8 — well below the amber line.

Distribution skewed to a-z, space, common punctuation
(~50-70 distinct byte values, non-uniform)
H ~ 4.2 bits/byte
chunk value: ~4.200  (far below 7.5 -> not high-entropy)
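
For a concrete text window, the sketch below measures 256 bytes built from a repeated pangram. The sample is an assumption chosen for reproducibility, not typical prose: a pangram uses every letter, so its distribution is somewhat flatter than ordinary English, but it still falls inside the 4.0-4.8 band. The helper name windowEntropy is also hypothetical.

```javascript
// Entropy of one window, rounded to 3 decimals (hypothetical helper name).
function windowEntropy(window) {
  const counts = new Uint32Array(256);
  for (const byte of window) counts[byte]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / window.length;
    h -= p * Math.log2(p);
  }
  return Math.round(h * 1000) / 1000;
}

// Illustrative sample: a pangram repeated and cut to exactly 256 characters.
const sentence = "The quick brown fox jumps over the lazy dog. ";
const text = sentence.repeat(6).slice(0, 256);
// The sample is pure ASCII, so each character maps to one byte.
const bytes = Uint8Array.from(text, (ch) => ch.charCodeAt(0));

const h = windowEntropy(bytes);
console.log(bytes.length, h, h >= 7.5 ? "high-entropy" : "not high-entropy");
```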

AES ciphertext -> ~7.99, header chunk lower

Encrypted output is near-uniform, so chunks read ~7.99. But the container's first window holds format markers (salt, IV, magic) with lower entropy, so the very first chunk can read noticeably below the ciphertext body.

Body chunks:   ~7.990 (near-uniform ciphertext)
First chunk:   ~6.x   (salt + IV + markers are structured)

highEntropyChunks ~= total - 1
threatDetected: true (almost all chunks >= 7.5)

Edge cases and what actually happens

Maximum entropy is exactly 8.0, never higher

By design

Because the alphabet is 256 byte values, log2(256) = 8 is the hard ceiling. The recharts Y-axis domain is fixed at [0, 8] to reflect this. A value above 8 would indicate a bug; the algorithm cannot produce one.

Last window shorter than 256 bytes

Preserved

The denominator is end - i (the real window length), not a hardcoded 256, so a final 100-byte window is normalized correctly. A 100-byte window can hold at most 100 distinct values, which caps its entropy at log2(100) ~= 6.64 even if every byte differs, so expect the last chunk to read a touch lower for small tails.
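
The effect of the denominator on a short tail can be sketched as follows. The function name chunkEntropies and the 356-byte test buffer are assumptions for illustration: the buffer is one full window followed by a 100-byte tail in which every byte differs.

```javascript
// Per-window entropy using the real window length as the denominator.
function chunkEntropies(bytes, windowSize = 256) {
  const chunks = [];
  for (let i = 0; i < bytes.length; i += windowSize) {
    const end = Math.min(i + windowSize, bytes.length);
    const total = end - i;                         // 256, or less for the tail
    const counts = new Uint32Array(256);
    for (let j = i; j < end; j++) counts[bytes[j]]++;
    let h = 0;
    for (const c of counts) {
      if (c === 0) continue;
      const p = c / total;
      h -= p * Math.log2(p);
    }
    chunks.push(Math.round(h * 1000) / 1000);
  }
  return chunks;
}

const buf = new Uint8Array(356);
for (let k = 0; k < 256; k++) buf[k] = k;          // full window, every value once
for (let k = 0; k < 100; k++) buf[256 + k] = k;    // 100-byte tail, all distinct
console.log(chunkEntropies(buf)); // [ 8, 6.644 ]

// With a hardcoded 256 as the denominator the same tail would be under-reported:
const wrongTail = 100 * (1 / 256) * Math.log2(256);
console.log(wrongTail); // 3.125
```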

Values rounded to 3 decimals

Preserved

Each chunk is round(h * 1000) / 1000. So 7.9994 records as 7.999 and a true 8.0 records as 8.000. The rounding is cosmetic for triage but means you should recompute from raw bytes if you need full floating-point precision for research.

Zero-count byte values are skipped, not treated as p=0

By design

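Byte values that never appear contribute nothing and are skipped before the log is taken. This avoids evaluating log2(0), which is -Infinity and would turn the term 0 * log2(0) into NaN in floating-point arithmetic. Skipping implements the standard convention 0 log 0 = 0, which is why a sparse window does not blow up the sum.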
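
The difference the guard makes is visible in a few lines. Both function names below are hypothetical; naiveEntropy deliberately omits the zero-count check to show what goes wrong.

```javascript
// Without the guard: zero-count buckets are fed into the log.
function naiveEntropy(counts, n) {
  let h = 0;
  for (const c of counts) {
    const p = c / n;
    h -= p * Math.log2(p);  // for p = 0 this is 0 * -Infinity, which is NaN
  }
  return h;
}

// With the guard: zero-count buckets are skipped (0 log 0 = 0 convention).
function guardedEntropy(counts, n) {
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / n;
    h -= p * Math.log2(p);
  }
  return h;
}

const counts = new Uint32Array(256);
counts[0x41] = 256;                         // a window of 256 identical bytes

console.log(Math.log2(0));                  // -Infinity
console.log(naiveEntropy(counts, 256));     // NaN
console.log(guardedEntropy(counts, 256));   // 0
```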

Entropy does not equal compressibility

Common misconception

High Shannon entropy over 256-byte windows means the byte distribution within each window is flat — it does NOT prove the data is incompressible globally. Repeated high-entropy blocks (e.g. a duplicated encrypted region) still compress via deduplication even though each window reads ~8.0.
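
A small experiment makes the point. In the sketch below, a fixed permutation of the byte values 0..255 stands in for a high-entropy block (an assumption for reproducibility, not real ciphertext) and is repeated 16 times. Every window reads 8.000, yet the buffer consists of a single distinct block and is therefore highly redundant. The function name chunkEntropies is hypothetical.

```javascript
function chunkEntropies(bytes, windowSize = 256) {
  const chunks = [];
  for (let i = 0; i < bytes.length; i += windowSize) {
    const end = Math.min(i + windowSize, bytes.length);
    const counts = new Uint32Array(256);
    for (let j = i; j < end; j++) counts[bytes[j]]++;
    let h = 0;
    for (const c of counts) {
      if (c === 0) continue;
      const p = c / (end - i);
      h -= p * Math.log2(p);
    }
    chunks.push(Math.round(h * 1000) / 1000);
  }
  return chunks;
}

// Multiplying by an odd number modulo 256 permutes 0..255, so every value
// appears exactly once in the block.
const block = Uint8Array.from({ length: 256 }, (_, i) => (i * 167 + 13) & 0xff);
const data = new Uint8Array(256 * 16);
for (let n = 0; n < 16; n++) data.set(block, n * 256);

const chunks = chunkEntropies(data);
const distinctWindows = new Set();
for (let i = 0; i < data.length; i += 256) {
  distinctWindows.add(data.subarray(i, i + 256).join(","));
}
console.log(chunks.every((v) => v === 8), distinctWindows.size); // true 1
```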

Window boundaries can blur transitions

Investigate

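A transition from text to ciphertext that falls mid-window produces a single chunk computed over the combined byte counts of both regions, smoothing the step on the curve. The value is not the arithmetic mean of the two sides: it comes from the mixed histogram, and a mixture of two different distributions reads at least as high as the length-weighted mean of their separate entropies. The fixed 256-byte grid is not aligned to file structure, so feature edges land within +/- one chunk of their true offset.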
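
To see what a straddling window records, the sketch below builds a 256-byte window whose first half is a constant byte and whose second half holds 128 distinct values. Both halves are synthetic stand-ins chosen so the arithmetic is exact, and the helper name entropyOf is hypothetical. The mixed window reads 4.5, which is not the plain mean (3.5) of the two halves measured separately.

```javascript
// Unrounded entropy of any byte sequence (hypothetical helper name).
function entropyOf(bytes) {
  const counts = new Uint32Array(256);
  for (const byte of bytes) counts[byte]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}

const left = new Uint8Array(128).fill(0x20);                        // constant half
const right = Uint8Array.from({ length: 128 }, (_, i) => 128 + i);  // 128 distinct values

const mixed = new Uint8Array(256);
mixed.set(left, 0);
mixed.set(right, 128);

console.log(entropyOf(left), entropyOf(right), entropyOf(mixed)); // 0 7 4.5
```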

Per-chunk, not whole-file, entropy

By design

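The analyzer reports a profile of per-window values, not one global entropy number. Whole-file entropy would mask local structure (the very thing you want to see). If you need a single figure, average the chunks array yourself; the tool does not display an average. Keep in mind that this mean summarizes the profile and is not the same quantity as entropy computed over the whole file's byte histogram.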
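
The sketch below contrasts the mean of the per-window values with entropy computed over the whole buffer, using a deliberately extreme two-window example (256 bytes of 0x00 followed by 256 bytes of 0xFF). The helper name entropyOf is an assumption. Each window is constant, so the profile and its mean are 0, while the whole-buffer histogram has two equally likely values and reads 1.

```javascript
// Unrounded entropy of any byte sequence (hypothetical helper name).
function entropyOf(bytes) {
  if (bytes.length === 0) return 0;
  const counts = new Uint32Array(256);
  for (const byte of bytes) counts[byte]++;
  let h = 0;
  for (const c of counts) {
    if (c === 0) continue;
    const p = c / bytes.length;
    h -= p * Math.log2(p);
  }
  return h;
}

const data = new Uint8Array(512);   // first 256 bytes stay 0x00
data.fill(0xff, 256);               // last 256 bytes become 0xFF

const chunkValues = [];
for (let i = 0; i < data.length; i += 256) {
  chunkValues.push(entropyOf(data.subarray(i, Math.min(i + 256, data.length))));
}
const mean = chunkValues.reduce((a, b) => a + b, 0) / chunkValues.length;

console.log(mean, entropyOf(data)); // 0 1
```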

Identical bytes give 0, not undefined

Preserved

A constant window has one symbol at p=1, and 1 * log2(1) = 0, so entropy is exactly 0.000 — well-defined, not NaN. This is why padding regions render as clean valleys rather than gaps.

File exceeds tier limit before computation

Rejected

The buffer reader enforces the size cap and throws before any window is processed: Free 10 MB, Pro 100 MB, Pro-media 500 MB, Developer 2 GB. No partial profile is returned for oversized input.

Empty file

Edge

With no file at all, the tool throws "No file provided". A zero-byte buffer produces an empty chunks array and total = 0: there are no windows to compute, so there is nothing to plot.

Frequently asked questions

Why is 8 bits/byte the maximum?

Because the symbol alphabet is the 256 possible byte values and log2(256) = 8. Maximum entropy occurs when the distribution is uniform, with every byte value equally likely within the window. A 256-byte window reaches exactly 8.000 only when each of the 256 values appears exactly once, so real data sits below the ceiling; the analyzer's Y-axis is fixed at [0, 8] to reflect this hard limit.

What is the entropy of AES-encrypted data?

AES-GCM output is computationally indistinguishable from random, so the ciphertext body reads ~7.99 bits/byte and nearly every chunk counts as high-entropy. The container's first window is usually lower because it carries structured format markers (salt, IV, magic) rather than ciphertext.

How does DEFLATE compression affect entropy?

DEFLATE output reads ~7.5-7.9 — high but not maximal, because Huffman coding leaves minor statistical regularities that keep it just below ciphertext. This is why gzip/zlib/PNG-IDAT regions land on or just above the amber 7.5 line, in the same band as (but slightly under) encryption.

What exact window size and rounding does the tool use?

A fixed 256-byte non-overlapping window, with each chunk's entropy rounded to 3 decimal places via round(h * 1000) / 1000. The window size is not configurable — the option schema is empty. The denominator is the real window length, so the final short chunk is normalized correctly.

Does the analyzer show whole-file entropy or a per-chunk profile?

A per-chunk profile — an array of one entropy value per 256-byte window, plotted as a curve. It does not display a single whole-file average (the footer shows the max chunk, not the mean). To get one global number, average the chunks array yourself.

Why are byte values that never appear ignored in the sum?

Because log2(0) is not a finite number: Math.log2(0) returns -Infinity, and multiplying it by p = 0 yields NaN. The standard convention is 0 * log 0 = 0, so zero-count buckets contribute nothing and are skipped. This keeps the sum finite for sparse windows and is mathematically correct, not an approximation.

Can I reproduce the analyzer's numbers in my own code?

Yes — follow the six steps: 256-byte windows, a 256-bucket count table, p = count / windowLength, h -= p * log2(p) over non-zero buckets, then round to 3 decimals. The same algorithm runs server-side, so a runner/API call returns identical chunks, highEntropyChunks, and total.

Does high entropy mean the file is incompressible?

No. Per-window entropy near 8.0 means each window's byte distribution is flat, but it says nothing about global redundancy. Two identical high-entropy blocks still compress via deduplication. Shannon entropy here is a local randomness measure, not a global compressibility bound.

How is the chart rendered, and what does the footer mean?

It is a recharts LineChart: entropy on a fixed [0,8] Y-axis, chunk index on a hidden X-axis, with an amber dashed reference line at 7.5. The header readout shows the high-entropy chunk count and percentage; the footer shows Max: X.XX bits/byte — the single highest chunk, not an average.

What happens at the boundary between two content types?

A transition that falls mid-window produces one chunk computed from the combined byte counts of both content types, smoothing the step. Because the 256-byte grid is not aligned to file structure, a feature's edge on the curve lands within about one chunk of its true byte offset.

How large a file can the analyzer process?

Up to the tier cap: Free 10 MB, Pro 100 MB, Pro-media 500 MB, Developer 2 GB. The whole file is read into a buffer first, then chunked; oversized files are rejected with a plan-limit error before any computation. There is no streaming partial profile.

Can I integrate this entropy computation into a pipeline?

Yes. GET /api/v1/tools/entropy-analyzer returns the schema (no options), and the paired @jadapps/runner executes the identical algorithm locally, returning the chunks array plus highEntropyChunks and total. Cross-reference results with magic-byte-validator and hex-header-inspector for typed context.

Privacy first

Every JAD Security operation runs entirely in your browser. Files, passwords, and PGP private keys never leave your device — verified by zero outbound network requests during processing.
