Methodology

Corpus Inventory & Evidence Chain



What This Document Is

This is the foundational accounting of every piece of source material underlying this
investigation. Every report in this repository traces back to specific documents within
the corpus described here. If you want to verify a claim, this is where you start.

Bottom line: 1,380,937 PDF documents containing 2,731,785 pages and 3.18 billion
characters of text, plus 3,234 media files, 1,530 audio/video transcripts, 2,587,102
redaction records, and a 1,536-person entity registry. All are derived from 194.5 GB of
publicly released U.S. Department of Justice files.


Source Material

All source data was released by the U.S. Department of Justice at
justice.gov/epstein between December 2025 and February 2026. Bulk downloads were
obtained from archive.org mirrors of the DOJ release. No data was obtained through
leaks, hacking, FOIA requests, or any non-public channel.

DOJ Release Timeline

| Dataset | Release Date | Size | Format |
|---------|--------------|-----:|--------|
| DS1-7 | December 19, 2025 | ~3.0 GB | Individual PDFs on justice.gov |
| DS8 | January 17, 2026 | ~1.8 GB | ZIP archive |
| DS9 | January 30, 2026 | ~103.6 GB | tar.bz2 archive |
| DS10 | February 3, 2026 | ~68.3 GB | ZIP archive |
| DS11 | February 6, 2026 | ~26.8 GB | ZIP archive |
| DS12 | February 10, 2026 | ~0.1 GB | Individual PDFs |

Complete Document Inventory

PDF Documents by Dataset

| DS | Documents | Pages | Characters | Size (GB) | EFTA Range | Blank Pages |
|----|----------:|------:|-----------:|----------:|------------|------------:|
| 1 | 3,156 | 3,156 | 171,072 | 1.24 | EFTA00000001–EFTA00003158 | 0 |
| 2 | 574 | 699 | 46,593 | 0.62 | EFTA00003159–EFTA00003857 | 0 |
| 3 | 67 | 1,847 | 275,943 | 0.58 | EFTA00003858–EFTA00005586 | 0 |
| 4 | 152 | 2,704 | 3,364,909 | 0.35 | EFTA00005705–EFTA00008320 | 0 |
| 5 | 120 | 120 | 46,862 | 0.06 | EFTA00008409–EFTA00008528 | 0 |
| 6 | 13 | 487 | 491,355 | 0.05 | EFTA00008529–EFTA00008998 | 0 |
| 7 | 17 | 660 | 720,756 | 0.10 | EFTA00009016–EFTA00009664 | 0 |
| 8 | 10,593 | 29,343 | 38,733,380 | 1.78 | EFTA00009676–EFTA00039023 | 0 |
| 9 | 531,284 | 1,223,761 | 1,557,581,456 | 94.51 | EFTA00039025–EFTA01262781 | 32 |
| 10 | 503,154 | 950,101 | 1,060,544,619 | 68.32 | EFTA01262782–EFTA02212882 | 0 |
| 11 | 331,655 | 517,382 | 513,671,715 | 26.75 | EFTA02212883–EFTA02730262 | 0 |
| 12 | 152 | 1,525 | 1,658,327 | 0.12 | EFTA02730265–EFTA02731783 | 0 |
| Total | 1,380,937 | 2,731,785 | 3,177,306,987 | 194.5 | EFTA00000001–EFTA02731783 | 32 |

Disk-to-database reconciliation: Every PDF file on disk has a corresponding entry
in the database. Zero discrepancies across 11 of 12 datasets. DS8 has one duplicate
file (EFTA00022173 exists in two subdirectories at identical size; this is a packaging
artifact, not a data issue).
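
A check of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the repository's actual reconciliation script: the function name `reconcile`, the report keys, and the subdirectory names in the sample paths are all assumptions made for the example.

```python
from collections import Counter
from pathlib import Path


def reconcile(pdf_paths, db_efta_numbers):
    """Compare PDF files found on disk with EFTA numbers recorded in the database."""
    # Count file stems so the same EFTA number in two folders shows up as a duplicate.
    on_disk = Counter(Path(p).stem for p in pdf_paths)
    in_db = set(db_efta_numbers)
    return {
        "missing_from_db": sorted(set(on_disk) - in_db),
        "missing_from_disk": sorted(in_db - set(on_disk)),
        "duplicated_on_disk": sorted(name for name, n in on_disk.items() if n > 1),
    }


# Hypothetical paths: "sub_a" and "sub_b" are placeholders, not the real layout.
paths = [
    "DS8/sub_a/EFTA00022173.pdf",
    "DS8/sub_b/EFTA00022173.pdf",
    "DS8/sub_a/EFTA00022174.pdf",
]
report = reconcile(paths, ["EFTA00022173", "EFTA00022174"])
print(report)
```

In practice the path list would come from something like `Path(root).rglob("*.pdf")` and the EFTA numbers from the `documents` table described below.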

EFTA Number Gaps

EFTA numbers are not contiguous. The DOJ did not release every number; there are gaps
within each dataset's range. For example, DS9 spans EFTA00039025 to EFTA01262781
(a range of 1,223,757 possible numbers) but contains only 531,284 documents. These gaps
are part of the DOJ's release structure, not missing data on our end. The same pattern
holds across all datasets.
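
The DS9 arithmetic can be reproduced directly from the figures quoted in this document. The helper name `efta_span` below is ours, introduced only for illustration.

```python
def efta_span(first: str, last: str) -> int:
    """Count the EFTA numbers in an inclusive range."""
    # Strip the 4-character "EFTA" prefix and treat the rest as an integer.
    return int(last[4:]) - int(first[4:]) + 1


# DS9 figures as quoted in the inventory table.
ds9_span = efta_span("EFTA00039025", "EFTA01262781")
ds9_documents = 531_284
ds9_unused = ds9_span - ds9_documents

print(ds9_span)    # 1223757
print(ds9_unused)  # 692473
```

The second figure is the count of DS9 numbers with no corresponding document, which is the same 692,473 cited later under "What's NOT in This Corpus".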

Media Files (Non-PDF)

| Type | Count | Primary Location | Description |
|------|------:|------------------|-------------|
| .avi | 1,529 | DS9 | MCC/surveillance video clips (no audio) |
| .mp4 | 255 | DS8, DS9 | Surveillance video, longer-form recordings |
| .m4a | 78 | DS9 | Audio recordings (phone calls, interviews) |
| .vob | 10 | DS9 | DVD video objects |
| .m4v | 10 | DS9 | Video files |
| .wav | 9 | DS9 | Audio recordings |
| .mov | 8 | DS9 | QuickTime video |
| .wmv | 5 | DS9 | Windows Media video |
| .mp3 | 2 | DS9 | Audio files |
| .xlsx/.xls | 11 | DS8 | Spreadsheets (victim pseudonym lists, device inventories) |
| .csv | 4 | DS8 | Fully redacted tabular data (every cell blacked out) |
| Other | ~322 | DS8, DS10, DS11 | Miscellaneous native files |
| Total | ~3,234 | | |

Zero-Page Documents (Corrupted/Non-Standard PDFs)

Five documents across the entire corpus returned zero extractable pages from standard
PDF tools. Byte-level forensic analysis revealed these are not simply corrupted: they
are forensic disk image fragments and truncated scans. All five have been recovered.
See CORRUPTED_PDF_FORENSICS for full details.

| EFTA | Size | What It Actually Is | Recovered Content |
|------|-----:|---------------------|-------------------|
| EFTA00593870 | 46 KB | Null-padded PDF (64% zeroed) | Page 1 of CVRA motion (Jane Doe #1 & #2 v. US) |
| EFTA00597207 | 883 KB | Disk image with Apple Address Book | 8 contacts + iPhone 5s photo (Aug 2014, LSJ) |
| EFTA00645624 | 35 KB | Truncated Sharp scanner fax | Epstein fee dispute legal memo (Apr 2015) |
| EFTA01175426 | 827 KB | Truncated linearized PDF | 10-page court trust order (Zaffaroni/Packard) |
| EFTA01220934 | 1.1 MB | Raw Windows disk image fragment | Cached web images, application files (not case-relevant) |

Derived Databases

All analysis in this repository is built on four data stores derived from the source files: three databases and one JSON registry.

full_text_corpus.db (6.08 GB)

The primary analytical database. Contains the full text of every page of every document.

| Table | Records | Description |
|-------|--------:|-------------|
| documents | 1,380,937 | One row per PDF: EFTA number, dataset, file path, page count, file size |
| pages | 2,731,785 | One row per page: EFTA number, page number, full text content, character count |
| pages_fts | 2,731,785 | FTS5 full-text search index over all page text |

How it was built: PyMuPDF (fitz) text extraction on every PDF. For scanned documents,
this captures the invisible OCR text layer (rendering mode Tr=3) that the DOJ's scanning
vendor applied. Documents where PyMuPDF returned zero text were flagged for manual review
(this is how the 5 corrupted PDFs were identified).
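
The extract, index, and flag steps can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the column names (`efta`, `page_number`, `text`, `char_count`), the function names, and the `EXAMPLE-…` identifiers are all hypothetical, and the sample page texts are invented placeholders. The PyMuPDF import is deferred so the indexing part runs with the standard library alone.

```python
import sqlite3


def extract_page_texts(pdf_path):
    """Return one text string per page, or [] if the file cannot be opened as a PDF."""
    import fitz  # PyMuPDF, third-party: pip install pymupdf
    try:
        with fitz.open(pdf_path) as doc:
            return [page.get_text() for page in doc]
    except Exception:  # damaged files may fail to open; treat them as zero pages
        return []


def build_index(conn, docs):
    """Load {efta: [page_text, ...]} into a pages table plus an FTS5 index."""
    conn.execute(
        "CREATE TABLE pages (efta TEXT, page_number INTEGER, text TEXT, char_count INTEGER)"
    )
    conn.execute(
        "CREATE VIRTUAL TABLE pages_fts USING fts5(efta UNINDEXED, page_number UNINDEXED, text)"
    )
    for efta, texts in docs.items():
        for number, text in enumerate(texts, start=1):
            conn.execute("INSERT INTO pages VALUES (?, ?, ?, ?)", (efta, number, text, len(text)))
            conn.execute("INSERT INTO pages_fts VALUES (?, ?, ?)", (efta, number, text))


def zero_text_documents(docs):
    """EFTA numbers with no pages, or with pages that contain no text."""
    return [efta for efta, texts in docs.items() if not any(t.strip() for t in texts)]


# Placeholder data standing in for extract_page_texts() output.
docs = {
    "EXAMPLE-0001": ["cover sheet", "motion to compel discovery"],
    "EXAMPLE-0002": [],          # zero pages
    "EXAMPLE-0003": ["", "  "],  # pages present but no text layer
}
conn = sqlite3.connect(":memory:")
build_index(conn, docs)
hits = conn.execute(
    "SELECT efta, page_number FROM pages_fts WHERE pages_fts MATCH ? ORDER BY rank",
    ("discovery",),
).fetchall()
flagged = zero_text_documents(docs)
print(hits)
print(flagged)
```

The FTS5 query requires an SQLite build with FTS5 enabled, which is the case for most Python distributions but is not guaranteed.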

redaction_analysis_v2.db (0.95 GB)

Spatial analysis of every redaction rectangle in the corpus, with the text found at
each redaction's coordinates.

| Table | Records | Description |
|-------|--------:|-------------|
| redactions | 2,587,102 | Every redaction: position, type, hidden text, confidence |
| document_summary | 638,416 | Per-document redaction counts and flags |
| reconstructed_pages | 39,588 | Pages rebuilt from spatially-ordered redaction fragments |
| extracted_entities | 107,422 | Named entities extracted from reconstructed text |

Important caveat: The vast majority of bad_overlay redaction records (~98%) are
OCR noise: the scanner's OCR engine attempted to read black redaction bars and produced
garbage text. Only 12 documents contain genuinely failed redactions (Apple Mail PLIST
metadata exposed behind incompletely flattened overlays). See
DATA_QUALITY_AUDIT and EVIDENCE_RELIABILITY_AUDIT
for the full audit. The redaction database remains useful as a searchable index of text
found near redaction zones, but its hidden_text field should not be interpreted as
"recovered secret content."

transcripts.db (2.5 MB)

GPU-transcribed audio/video content using faster-whisper large-v3 on NVIDIA A100.

| Metric | Value |
|--------|------:|
| Total entries | 1,530 |
| With speech content | 375 |
| Total words transcribed | 92,153 |
| Silent/surveillance skipped | 1,155 |

Pre-screening classified 2,581 unique media files: 903 were processed, 1,633 were skipped
(silent surveillance footage: 77+ hours of MCC/facility video with no audio stream).

Notable content: BOP Warden OIG interview, 3 MCC prison phone calls, 20+ Grand Jury
testimony recordings, Deepak Chopra voicemails.
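
The exact pre-screening method is not documented here. One plausible way to separate silent footage from files worth transcribing is to ask `ffprobe` whether a file has any audio stream, then pass only those files to faster-whisper. The sketch below assumes `ffprobe` is installed; the function names are ours, the sample JSON strings are hand-written, and the third-party import is deferred so the parsing logic runs on its own.

```python
import json
import subprocess


def ffprobe_audio_json(media_path):
    """Ask ffprobe to list the audio streams of a media file as JSON text."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a",
         "-show_entries", "stream=codec_type", "-of", "json", media_path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_audio_stream(ffprobe_json):
    """True if the ffprobe JSON output lists at least one audio stream."""
    streams = json.loads(ffprobe_json).get("streams", [])
    return any(s.get("codec_type") == "audio" for s in streams)


def transcribe(model, media_path):
    """Join the text of all segments returned by a faster-whisper model."""
    segments, _info = model.transcribe(media_path)
    return " ".join(segment.text.strip() for segment in segments)


def load_model():
    """Load large-v3 on a CUDA GPU; float16 is an assumed compute type."""
    from faster_whisper import WhisperModel  # third-party: pip install faster-whisper
    return WhisperModel("large-v3", device="cuda", compute_type="float16")


# Hand-written examples of ffprobe output, for illustration only.
silent_clip = '{"programs": [], "streams": []}'
phone_call = '{"programs": [], "streams": [{"codec_type": "audio"}]}'
print(has_audio_stream(silent_clip))  # False
print(has_audio_stream(phone_call))   # True
```

A file with an audio stream can still be silent, so a real pre-screen would likely need a further check on the audio content itself.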

persons_registry.json

Unified entity registry merging three sources: our pipeline extraction, the la-rana-chicana
community research CSV, and the knowledge_graph.db entity table.

| Metric | Value |
|--------|------:|
| Total persons | 1,536 |
| With aliases | 203 |
| With descriptions | 237 |
| With 100+ document hits | 693 |
| With 10-99 document hits | 409 |

What's NOT in This Corpus

For transparency, here is what we do NOT have access to:

  • Sealed court filings: Multiple cases have sealed dockets (SDNY, SDFL, USVI).
    We only analyze publicly released material.

  • Grand jury transcripts (full text): We have audio recordings of some testimony
    sessions (transcribed in transcripts.db) but not official transcripts.

  • Classified intelligence material: References to intelligence connections are
    derived from what appears in the public DOJ files, not from classified sources.

  • Victim statements beyond what DOJ released: The DOJ's release is selective.
    Many victim interviews, depositions, and statements referenced in the documents are
    not included in the EFTA corpus.

  • The "missing" EFTA numbers: 692,473 EFTA numbers in DS9's range are not present.
    Whether these represent withheld documents, unimaged evidence, or simply unused
    numbers in the task force's tracking system is unknown.

  • Post-February 10, 2026 releases: This inventory reflects data through DS12.
    Additional datasets may be released.


Verification

Anyone can independently verify any finding in this repository:

  • Obtain the source PDFs from justice.gov/epstein
    or the archive.org mirrors linked in each report

  • Check any EFTA citation by constructing the URL:
    https://www.justice.gov/epstein/files/DataSet%20{N}/EFTA{########}.pdf

  • Reproduce the text extraction using PyMuPDF: fitz.open(path)[page].get_text()

  • Reproduce the redaction analysis using the methodology in
    REDACTION_TEXT_LAYER_ANALYSIS

The EFTA-to-dataset mapping table is in the main [README](../README.md#efta-number-to-dataset-mapping).
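
As an illustration of the URL step, the sketch below looks up the dataset for an EFTA number and fills in the URL pattern given above. The ranges are copied from the inventory table in this document rather than from the README mapping, and the function names are ours. A number inside a dataset's range may still fall in a gap, so a constructed URL is not guaranteed to resolve.

```python
# Inclusive EFTA ranges per dataset, copied from the inventory table in this document.
DATASET_RANGES = [
    (1, 1, 3158), (2, 3159, 3857), (3, 3858, 5586), (4, 5705, 8320),
    (5, 8409, 8528), (6, 8529, 8998), (7, 9016, 9664), (8, 9676, 39023),
    (9, 39025, 1262781), (10, 1262782, 2212882), (11, 2212883, 2730262),
    (12, 2730265, 2731783),
]


def dataset_for(efta_number: int):
    """Return the dataset whose range contains this EFTA number, or None."""
    for dataset, first, last in DATASET_RANGES:
        if first <= efta_number <= last:
            return dataset
    return None


def efta_url(efta_number: int) -> str:
    """Build the justice.gov URL for an EFTA number using the documented pattern."""
    dataset = dataset_for(efta_number)
    if dataset is None:
        raise ValueError(f"EFTA{efta_number:08d} is outside every dataset range")
    return (
        "https://www.justice.gov/epstein/files/"
        f"DataSet%20{dataset}/EFTA{efta_number:08d}.pdf"
    )


# EFTA00593870 is one of the zero-page documents listed earlier; it falls in DS9.
print(efta_url(593870))
```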


Processing Pipeline

```
DOJ PDFs (194.5 GB, 1.38M files)
├─→ PyMuPDF text extraction ──→ full_text_corpus.db (6.08 GB)
│     └─→ FTS5 full-text search index
├─→ Redaction rectangle analysis ──→ redaction_analysis_v2.db (0.95 GB)
│     ├─→ Reconstructed pages (39,588)
│     └─→ Entity extraction (107,422 entities)
├─→ Media file pre-screening + GPU transcription ──→ transcripts.db
├─→ Person registry unification ──→ persons_registry.json (1,536 persons)
└─→ Byte-level forensic recovery of 5 zero-page PDFs
      └─→ recovered_corrupted_pdfs/ (Apple Address Book, LSJ photo, etc.)
```

All processing was performed locally. No documents were uploaded to cloud services
or third-party APIs for analysis. Text extraction, OCR, transcription, and entity
extraction were all run on local hardware.