How to use the ZIP metadata extractor for security & compliance
- Step 1: Stage the evidence locally. Pull the archive from your evidence store, artifact registry, or quarantine into a local folder. Nothing leaves the machine; the tool reads from disk via the File API.
- Step 2: Confirm it is actually a ZIP. Run /archive-tools/auto-format-detector first. A renamed .7z/.rar will throw 'Not a valid ZIP archive…' here, and the format mismatch itself can be a finding worth recording.
- Step 3: Extract the metadata. Open /archive-tools/archive-metadata-extractor, drop the ZIP, and click Process. For evidence above 50 MB use a Pro tier (up to 500 MB / 50,000 entries) so you do not have to split the artifact.
- Step 4: Scan for the red-flag fields. In the JSON, search for "encrypted": true, "utf8": false, unexpected hostOS values, and a crc32 of 00000000 on non-encrypted entries. These are the fields that most often distinguish a benign archive from a crafted one.
- Step 5: Anchor chain-of-custody with a hash. Run /archive-tools/checksum-generator to capture a SHA-256 of the whole archive before and after analysis, proving the bytes were not altered during inspection. Store the hash alongside the metadata JSON.
- Step 6: Hand off to your tooling. Download the <name>-metadata.json and attach it to the case, or pipe it into your SIEM. The schema is stable, so successive reports across an investigation diff cleanly.
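The Step 4 scan can also be scripted once the report is downloaded. The following is a minimal sketch, assuming the report has the shape the jq recipes on this page rely on (an `entries` array whose items carry `name`, `hostOS`, `crc32` and a `flags` object with `encrypted` and `utf8`). The function name `find_red_flags`, the finding labels, and the `expected_host_os` parameter are hypothetical and not part of the tool.

```python
from typing import Any, Dict, List, Optional


def find_red_flags(report: Dict[str, Any],
                   expected_host_os: Optional[int] = None) -> List[Dict[str, str]]:
    """Return one finding per suspicious field, following the Step 4 checklist."""
    findings: List[Dict[str, str]] = []
    for entry in report.get("entries", []):
        name = entry.get("name", "")
        flags = entry.get("flags", {})
        encrypted = bool(flags.get("encrypted"))

        if encrypted:
            findings.append({"name": name, "flag": "encrypted"})
        if flags.get("utf8") is False:
            findings.append({"name": name, "flag": "non-utf8-name"})
        # Only compare host OS when the analyst supplies an expected value.
        if expected_host_os is not None and entry.get("hostOS") != expected_host_os:
            findings.append({"name": name, "flag": "unexpected-hostOS"})
        # A zero CRC is only a red flag on entries that are not encrypted.
        if not encrypted and entry.get("crc32") == "00000000":
            findings.append({"name": name, "flag": "zero-crc-plaintext"})
    return findings
```

Load the report with `json.load` and pass the resulting dict. A zero-length file legitimately has a CRC-32 of 00000000, so treat a `zero-crc-plaintext` finding as a prompt to check the entry's size rather than as proof of tampering.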
Header fields and what they reveal in an investigation
How each reported field maps to a security or compliance question. All fields are read from plaintext central-directory metadata, so no decryption is required.
| Field | Security signal | Red flag to watch for |
|---|---|---|
| flags.encrypted | Entry payload is encrypted (bit 0) | Encrypted entry in an artifact that should be plaintext; scanners skip it |
| flags.utf8 | Filename encoding (bit 11) | false on a name with non-ASCII or path characters; possible code-page spoofing |
| hostOS | Creating operating system | Unix hostOS: 3 on an artifact claimed to be Windows-built (or vice versa) |
| compressionMethod | Per-entry method | AES (method 99) confirms encryption; an exotic method may evade naive parsers |
| crc32 | Stored checksum | 00000000 on a non-encrypted entry, or identical CRCs across distinct files |
| lastModified | DOS timestamp | Future dates, epoch (1980-01-01), or timestamps that contradict the build log |
| hasExtraField | Extra-field present | Unexpected extra fields can carry Unix mode bits, signatures, or padding used to hide data |
| versionNeeded | Min reader version | Surprisingly high values hint at ZIP64 or methods a downstream scanner cannot read |
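To make the table concrete, here is a minimal sketch of where each field sits in a central-directory record. The byte offsets and bit positions follow the standard ZIP layout; the output key names mirror the ones used in this page's jq examples, while the timestamp format and the helper names `read_entries` and `dos_timestamp` are assumptions made for illustration. It is not the tool's own parser.

```python
import struct

# Fixed 46-byte central directory file header, little-endian.
CDH = struct.Struct("<IHHHHHHIIIHHHHHII")


def dos_timestamp(date: int, time: int) -> str:
    """Decode the packed DOS date/time words into an ISO-like string."""
    year, month, day = (date >> 9) + 1980, (date >> 5) & 0x0F, date & 0x1F
    hour, minute, second = time >> 11, (time >> 5) & 0x3F, (time & 0x1F) * 2
    return f"{year:04d}-{month:02d}-{day:02d}T{hour:02d}:{minute:02d}:{second:02d}"


def read_entries(data: bytes) -> list:
    """Walk the central directory and report header fields only (no payload reads)."""
    eocd_at = data.rfind(b"PK\x05\x06")  # end-of-central-directory signature
    if eocd_at < 0:
        raise ValueError("Not a valid ZIP archive")
    total, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd_at + 10)

    entries, pos = [], cd_offset
    for _ in range(total):
        if data[pos:pos + 4] != b"PK\x01\x02":  # 0x02014b50 as stored on disk
            break  # stop at the first record without the signature
        (_sig, made_by, needed, flags, method, mtime, mdate, crc,
         _csize, _usize, name_len, extra_len, comment_len,
         _disk, _iattr, _eattr, _local_offset) = CDH.unpack_from(data, pos)
        raw_name = data[pos + 46:pos + 46 + name_len]
        entries.append({
            "name": raw_name.decode("utf-8", errors="replace"),  # lenient decode
            "hostOS": made_by >> 8,            # high byte of versionMadeBy
            "versionNeeded": needed,
            "flags": {
                "encrypted": bool(flags & 0x0001),  # general-purpose bit 0
                "utf8": bool(flags & 0x0800),       # general-purpose bit 11
            },
            "compressionMethod": method,       # raw number; 99 marks AES
            "crc32": f"{crc:08x}",
            "lastModified": dos_timestamp(mdate, mtime),
            "hasExtraField": extra_len > 0,
        })
        pos += 46 + name_len + extra_len + comment_len
    return entries
```

The sketch locates the end-of-central-directory record by searching backwards for its signature, which is enough for well-formed archives but can be misled by an archive comment that contains the same four bytes.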
Tier limits for evidence sizing
Per-job limits from the archive family in tier-limits.ts. Choose the tier that holds the artifact without splitting.
| Tier | Max archive size | Max entries |
|---|---|---|
| Free | 50 MB | 500 |
| Pro | 500 MB | 50,000 |
| Pro-media | 2 GB | 500,000 |
| Developer | 2 GB | 500,000 |
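If you triage many artifacts, the tier choice can be checked up front. This is a small sketch built only from the table above; it assumes binary units (1 MB = 1,048,576 bytes), which the source does not state, and `smallest_tier` is a hypothetical helper rather than anything exported by tier-limits.ts.

```python
from typing import Optional

MB = 1024 * 1024  # assumption: the table's MB/GB are binary units
GB = 1024 * MB

# (tier, max archive size in bytes, max entries), in ascending order.
TIERS = [
    ("Free", 50 * MB, 500),
    ("Pro", 500 * MB, 50_000),
    ("Pro-media", 2 * GB, 500_000),
    ("Developer", 2 * GB, 500_000),
]


def smallest_tier(size_bytes: int, entry_count: int) -> Optional[str]:
    """Return the first tier whose size and entry caps both hold the artifact."""
    for name, max_size, max_entries in TIERS:
        if size_bytes <= max_size and entry_count <= max_entries:
            return name
    return None  # larger than any listed tier; the artifact would need splitting
```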
Cookbook
Forensic recipes built from the report fields. Each shows what to grep for in the JSON and which sibling tool to chain next.
Find every encrypted entry without a password
General-purpose bit 0 lives in the plaintext central directory, so encrypted entries are visible even though their payloads are not. This is how you confirm that an upload contains an encrypted blob a content scanner would have skipped.
$ jq '.entries[] | select(.flags.encrypted) | {name, compressionMethod}' \
suspicious-metadata.json
{
"name": "payload.bin",
"compressionMethod": "AES"
}
→ One encrypted entry. To test a suspected password:
/archive-tools/archive-password-tester
→ To classify ZipCrypto vs AES across the archive:
/archive-tools/encrypted-archive-detector
Catch non-UTF-8 filenames hiding traversal characters
flags.utf8: false means the name was stored in a legacy code page. The tool decodes leniently, so suspicious bytes may surface as U+FFFD. Combined with path characters, this is a classic obfuscation for Zip Slip-style entry names.
$ jq '.entries[] | select(.flags.utf8 == false) | .name' meta.json
"..\\..\\windows\\system32\\evil.dll"
"\uFFFD\uFFFD config.json"
→ The first is a path-traversal attempt; the second has non-decodable bytes.
Sanitise names on extraction with /archive-tools/filename-sanitizer.
Cross-check provenance via hostOS
An artifact a vendor claims was built on Windows but whose entries all report hostOS 3 (Unix) is worth a question. The host-OS byte is written by the creating tool and is a cheap provenance signal.
$ jq '[.entries[].hostOS] | unique' build-from-vendor-meta.json
[
  3
]
→ Every entry hostOS 3 = Unix-built. If the SBOM claims a Windows build farm, that contradiction is a finding.
(0 = DOS/FAT, 3 = Unix, 11 = NTFS/Windows.)
Tamper smell test on CRC-32 values
A zero CRC on a non-encrypted entry, or the same CRC across files that should differ, suggests the directory was edited without recomputing checksums. The report gives you the stored CRCs to compare; verification of the payload is a separate step.
$ jq '.entries[] | {name, crc32, encrypted: .flags.encrypted}' meta.json
{ "name": "a.txt", "crc32": "00000000", "encrypted": false }
{ "name": "b.txt", "crc32": "00000000", "encrypted": false }
→ Two plaintext entries with zero CRC is suspicious.
Recompute and verify with /archive-tools/archive-integrity-tester.
Lock chain-of-custody with before/after hashes
Because everything is read-only and in-tab, the archive bytes are never modified. Prove it by hashing before and after analysis and storing both hashes with the metadata JSON.
Workflow:
1. /archive-tools/checksum-generator → SHA-256 of evidence.zip
e3b0c44298fc1c149afbf4c8996fb924...
2. /archive-tools/archive-metadata-extractor → metadata.json
3. /archive-tools/checksum-generator again → same SHA-256
Identical hashes prove the inspection did not alter the file.
Archive the two hashes + metadata.json in the case record.
Edge cases and what actually happens
Encrypted entries in the archive
Supported
Encryption never blocks this tool: the central directory is plaintext, so names, sizes, methods, timestamps and flags.encrypted all read normally. Only the payloads are encrypted, and the tool never reads payloads. This is precisely why it is useful for spotting encrypted blobs a scanner would skip.
Artifact is actually a renamed 7z/RAR
Unsupported format
The tool reads ZIP only and throws 'Not a valid ZIP archive (or unsupported format for metadata extraction)' on non-ZIP input. In a forensic context the mismatch (a .zip extension on a 7z payload) is itself a finding. Confirm with /archive-tools/auto-format-detector and record it.
Filename carries invalid UTF-8 bytes
Decoded leniently
Names are decoded with non-fatal UTF-8, so undecodable bytes become U+FFFD rather than throwing. With flags.utf8: false, a U+FFFD name signals a legacy code page, a common obfuscation vector. The original bytes are not preserved in the report; sanitise on extraction with /archive-tools/filename-sanitizer.
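The effect of lenient decoding is easy to reproduce at the byte level. The sketch below assumes you have the raw name bytes from your own directory parser (the report itself does not keep them). `decode_name` is a hypothetical helper, and the CP437 view is included only because it is the traditional fallback code page for ZIP names, not because the tool reports it.

```python
def decode_name(raw: bytes) -> dict:
    """Show how one stored filename looks under lenient UTF-8 and under CP437."""
    lenient = raw.decode("utf-8", errors="replace")  # bad bytes become U+FFFD
    return {
        "name": lenient,
        "hasReplacement": "\ufffd" in lenient,
        "cp437": raw.decode("cp437"),  # CP437 maps every byte, so this never fails
    }
```

For example, `decode_name(b"caf\xe9.txt")` yields a `name` containing one U+FFFD in place of the 0xE9 byte, which is the signal the report surfaces when a legacy code page was used.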
Zero CRC on an AES entry
Expected
AES-encrypted entries (compressionMethod: 'AES', flags.encrypted: true) commonly store crc32: '00000000' because the AE-2 variant of WinZip AES omits the CRC and relies on the authentication code stored with the encrypted data instead. A zero CRC here is normal; only a zero CRC on a plaintext entry is suspicious.
Comment or extra field used to smuggle data
Presence only
The report flags hasComment/hasExtraField: true but does not dump their bytes, so it tells you data is present without showing what. To inspect comment contents use /archive-tools/comment-extractor; for signing-related extra fields use /archive-tools/archive-signing-info.
Truncated or tampered central directory
Partial / rejected
Parsing stops at the first record missing the 0x02014b50 signature, yielding a partial report; damage before the first entry yields zero entries and the standard error. A directory that parses fewer entries than totalEntries (the EOCD count) is itself a tamper indicator. For recovery, use /archive-tools/corrupted-zip-repair.
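The declared-versus-parsed comparison can be reproduced offline. This is a minimal sketch of the check under the standard ZIP layout, not the tool's own code; `directory_counts` is a hypothetical name, and the walk stops at the first record without the central-directory signature, as described above.

```python
import struct


def directory_counts(data: bytes) -> tuple:
    """Return (entries declared in the EOCD, entries actually parsed)."""
    eocd_at = data.rfind(b"PK\x05\x06")  # end-of-central-directory record
    if eocd_at < 0:
        raise ValueError("Not a valid ZIP archive")
    declared, _cd_size, cd_offset = struct.unpack_from("<HII", data, eocd_at + 10)

    parsed, pos = 0, cd_offset
    while (parsed < declared
           and pos + 46 <= len(data)
           and data[pos:pos + 4] == b"PK\x01\x02"):  # 0x02014b50 on disk
        # Name, extra-field and comment lengths sit at offset 28 of the record.
        name_len, extra_len, comment_len = struct.unpack_from("<HHH", data, pos + 28)
        pos += 46 + name_len + extra_len + comment_len
        parsed += 1
    return declared, parsed
```

A result such as `(3, 1)` means the directory declares three entries but only one parses cleanly, which is the discrepancy worth recording.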
Multiple analysts inspecting concurrently
Supported
Each browser tab is an independent, stateless instance, so there is no shared server session to contend over. Free-tier limits apply per session; a Pro seat raises them to the Pro caps in the tier table. Nothing about one analyst's run affects another's.
Regulated environment (HIPAA / PCI / FedRAMP)
Boundary preserved
Because the file is read in-tab via the File API and nothing transits the network for archive tools (browserOnly: true), the regulated boundary does not move: the posture is the same as a local CLI. Confirm with your compliance team, but most treat in-browser local processing as equivalent to on-machine tooling.
Over 65,535 entries (ZIP64)
Limitation
The entry count is read from the 16-bit EOCD field and wraps past 65,535; the loop may stop early on a true ZIP64 directory. For very large evidence archives, corroborate the entry count with a ZIP64-aware tool. Tier entry caps keep typical artifacts well under this.
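One way to corroborate the count is to look for the ZIP64 records that sit just before the classic EOCD. The sketch below follows the standard ZIP64 layout (a 20-byte locator immediately ahead of the EOCD, pointing at a ZIP64 end-of-central-directory record whose 64-bit total-entries field is at offset 32). `true_entry_count` is a hypothetical helper, and multi-disk archives are ignored.

```python
import struct


def true_entry_count(data: bytes) -> int:
    """Prefer the 64-bit entry count from the ZIP64 EOCD record when present."""
    eocd_at = data.rfind(b"PK\x05\x06")  # classic end-of-central-directory
    if eocd_at < 0:
        raise ValueError("Not a valid ZIP archive")
    (count16,) = struct.unpack_from("<H", data, eocd_at + 10)

    locator_at = eocd_at - 20  # ZIP64 locator is 20 bytes, directly before EOCD
    if locator_at >= 0 and data[locator_at:locator_at + 4] == b"PK\x06\x07":
        (zip64_eocd_at,) = struct.unpack_from("<Q", data, locator_at + 8)
        if data[zip64_eocd_at:zip64_eocd_at + 4] == b"PK\x06\x06":
            (count64,) = struct.unpack_from("<Q", data, zip64_eocd_at + 32)
            return count64
    return count16  # no ZIP64 records: the 16-bit field is authoritative
```

If this value differs from the number of entries in the report, record both numbers and treat the report as partial.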
Frequently asked questions
Is evidence uploaded anywhere?
No. Archive tools are browser-only (browserOnly: true) — the ZIP is read in-tab with the File API and never sent to a server. For chain-of-custody this matters: using the tool does not move the file outside your machine, equivalent to running a local CLI.
Can I detect encrypted entries without the password?
Yes. flags.encrypted comes from general-purpose bit 0, which is plaintext in the central directory. You see exactly which entries are encrypted (and AES entries report compressionMethod: 'AES') without decrypting anything. To test a candidate password, chain /archive-tools/archive-password-tester.
How do I spot a spoofed or path-traversal filename?
Filter for flags.utf8: false and inspect those names — legacy code-page storage is where traversal payloads (..\..\) and look-alike names hide. The tool decodes names leniently, so undecodable bytes appear as U+FFFD. Sanitise on extraction with /archive-tools/filename-sanitizer.
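Once the flagged names are filtered out, each one can be checked mechanically. This is a minimal sketch of such a check; `name_findings` and its labels are hypothetical and are not what /archive-tools/filename-sanitizer does internally. It treats backslashes as noteworthy because the ZIP specification calls for forward slashes in stored names.

```python
def name_findings(name: str) -> list:
    """List the reasons a stored entry name deserves a closer look."""
    findings = []
    segments = name.replace("\\", "/").split("/")
    if ".." in segments:
        findings.append("parent-directory segment")
    # Leading separator or a drive letter such as "C:" means an absolute path.
    if name.startswith(("/", "\\")) or (len(name) >= 2 and name[1] == ":"):
        findings.append("absolute path")
    if "\\" in name:
        findings.append("backslash separator")
    if "\ufffd" in name:
        findings.append("undecodable bytes (U+FFFD)")
    return findings
```

An empty list means none of these patterns matched; it does not prove the name is safe, since look-alike characters need a separate comparison.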
Does it verify CRCs or just read them?
It reports the stored crc32 from the directory; it does not recompute against the payload. A zero CRC on a plaintext entry, or duplicated CRCs across distinct files, is a tamper smell test — confirm by recomputing with /archive-tools/archive-integrity-tester.
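For an offline cross-check of the same idea, Python's standard `zipfile` module compares each payload against the CRC stored in the central directory as it reads. The sketch below uses that behaviour; it is an independent check, not the implementation behind /archive-tools/archive-integrity-tester, and `failed_crc_entries` is a hypothetical name. Encrypted entries are skipped because they cannot be read without a password.

```python
import zipfile


def failed_crc_entries(source) -> list:
    """Return names of non-encrypted entries whose payload fails the stored CRC."""
    failed = []
    with zipfile.ZipFile(source) as archive:
        for info in archive.infolist():
            if info.flag_bits & 0x1:  # general-purpose bit 0: encrypted
                continue
            try:
                with archive.open(info) as payload:
                    # Reading to the end triggers zipfile's own CRC comparison.
                    while payload.read(1 << 16):
                        pass
            except zipfile.BadZipFile:
                failed.append(info.filename)
    return failed
```

`source` can be a path or a binary file object. Entries that use a compression method Python cannot decode raise `NotImplementedError`, which this sketch deliberately leaves uncaught so it is not mistaken for a CRC failure.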
What does hostOS tell me about provenance?
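It is the high byte of versionMadeBy: 0 = DOS/FAT, 3 = Unix, 11 = NTFS/Windows. If the host OS contradicts the artifact's claimed build environment, that is a provenance discrepancy worth recording. It reflects what the creating tool wrote, and tools disagree on the codes: PKWARE's APPNOTE lists 10 as Windows NTFS and 11 as MVS, Info-ZIP-derived tools use 11 for NTFS, and many Windows tools simply write 0. Treat the value as a hint about the creating tool rather than proof of the platform.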
How large an evidence archive can I process?
Free: 50 MB / 500 entries. Pro: 500 MB / 50,000 entries. Pro-media and Developer: 2 GB / 500,000 entries. The size cap is checked before processing; the entry cap is enforced by the archive tier schema.
Can I attach the output to a SIEM or ticket?
Yes — the report is plain JSON (<name>-metadata.json) with no JAD wrapper, ready for SIEM ingestion, ticket attachment, or jq filtering. The stable schema means reports across an investigation timeline diff cleanly.
Does it read the comment or extra-field contents?
No — only hasComment/hasExtraField booleans (presence). To read comment text use /archive-tools/comment-extractor; for signature extra fields use /archive-tools/archive-signing-info. Reporting presence is itself useful: it flags entries carrying hidden metadata for deeper inspection.
Is this suitable for supply-chain artifact audits?
Yes for the ZIP case — it surfaces method, encryption, host OS, timestamps and CRCs per entry, which are exactly the fields a supply-chain review checks against an SBOM or build log. Pair it with /archive-tools/checksum-generator to bind the report to a whole-archive hash.
What if the artifact is a .7z or .tar.gz?
This tool is ZIP-only and will throw 'Not a valid ZIP archive…'. Identify the true format with /archive-tools/auto-format-detector; for listing non-ZIP formats use /archive-tools/archive-previewer. Record any extension/format mismatch as a finding.
Can it prove the archive wasn't altered during analysis?
Indirectly. The tool is read-only and never writes the input, so hashing the file with /archive-tools/checksum-generator before and after analysis yields identical SHA-256 values — store both with the metadata JSON as chain-of-custody evidence.
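The same before/after comparison can be scripted for the case record. This is a minimal sketch using Python's standard `hashlib`; `sha256_file` and `custody_record` are hypothetical helpers, and the `analyse` callback stands in for whatever inspection step you run between the two hashes.

```python
import hashlib
from typing import Callable, Dict


def sha256_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large evidence need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def custody_record(path: str, analyse: Callable[[str], None]) -> Dict[str, object]:
    """Hash the evidence, run the analysis step, hash again, and compare."""
    before = sha256_file(path)
    analyse(path)
    after = sha256_file(path)
    return {"before": before, "after": after, "unchanged": before == after}
```

Store the returned dict next to the metadata JSON; an `unchanged` value of `False` means the file's bytes differ from the evidence you started with.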
Is there an API for automated scanning?
No public REST API — archive tools are browser-only (apiAvailable: false). On Pro+ tiers the @jadapps/runner can drive the tool via a headless browser. For unattended scanning at scale, a Node ZIP-directory parser mirroring this tool's JSON schema is the practical path.
Privacy first
Every JAD Archive tool runs entirely in your browser using fflate, @zip.js/zip.js, and the libarchive WASM bridge. Your archives never leave your device — verified by zero outbound network requests during processing.