OPEN SOURCE

Epstein Pipeline

Open-source Python toolkit for downloading, processing, and analyzing Epstein case documents. Powers the data behind epsteinexposed.com.

Terminal
# Install from PyPI
pip install epstein-pipeline
# Or clone and install
git clone https://github.com/stonesalltheway1/Epstein-Pipeline.git
cd Epstein-Pipeline
pip install -e ".[dev]"

What It Does

A complete pipeline for turning raw DOJ releases into structured, searchable data

Downloaders

Fetch documents from DOJ EFTA, Kaggle, HuggingFace, and Archive.org with built-in rate limiting and resume support.
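Rate limiting and resume support can be sketched with the standard library alone. The example below is an illustration, not the pipeline's actual downloader: `RateLimiter`, `resume_headers`, and the `.part` file convention are hypothetical names chosen for this sketch. It shows a minimum-interval limiter and an HTTP `Range` header that asks the server to continue from the bytes already on disk.

```python
import os
import time
from typing import Optional


class RateLimiter:
    """Enforce a minimum delay between consecutive requests (hypothetical helper)."""

    def __init__(self, min_interval: float) -> None:
        self.min_interval = min_interval
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Sleep if the previous call was too recent; return seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = time.monotonic() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            time.sleep(delay)
        self._last = time.monotonic()
        return delay


def resume_headers(partial_path: str) -> dict[str, str]:
    """Build a Range header that continues a partially downloaded file."""
    if os.path.exists(partial_path):
        size = os.path.getsize(partial_path)
        if size > 0:
            # Ask only for the bytes that are still missing.
            return {"Range": f"bytes={size}-"}
    # Nothing on disk yet: request the whole file.
    return {}
```

A caller would invoke `limiter.wait()` before each request and pass `resume_headers("file.pdf.part")` along with it. Resuming only works when the server honors range requests and answers with `206 Partial Content`; otherwise the file has to be fetched again from the start.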

OCR Processing

Extract text from scanned PDFs using IBM Docling. Handles multi-page documents, redacted sections, and poor scan quality.

Entity Extraction

Identify persons, organizations, and locations in documents using spaCy NLP. Automatic person linking with fuzzy matching.

Deduplication

Find and merge duplicate documents using rapidfuzz similarity scoring. Configurable thresholds for title and content matching.
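As a rough illustration of threshold-based duplicate detection, the sketch below scores title and content similarity separately and flags a pair only when both scores pass. It deliberately uses `difflib` from the standard library as a stand-in for rapidfuzz so that it runs without extra dependencies. The function names, field names, threshold values, and sample records are all assumptions made for this example, not the pipeline's real configuration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Return a case-insensitive similarity score on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(docs, title_threshold=90.0, content_threshold=85.0):
    """Return ID pairs whose title AND content scores meet the thresholds."""
    pairs = []
    for left, right in combinations(docs, 2):
        title_score = similarity(left["title"], right["title"])
        content_score = similarity(left["content"], right["content"])
        if title_score >= title_threshold and content_score >= content_threshold:
            pairs.append((left["id"], right["id"]))
    return pairs


# Made-up sample records, for illustration only.
docs = [
    {"id": "a", "title": "Quarterly Report 2001",
     "content": "Summary of items received in March."},
    {"id": "b", "title": "quarterly report 2001",
     "content": "Summary of items received in March"},
    {"id": "c", "title": "Meeting Notes",
     "content": "Agenda for the annual review."},
]

print(find_duplicates(docs))  # [('a', 'b')]
```

Comparing every pair is quadratic in the number of documents, so a real run over a large corpus would likely need some form of blocking or pre-filtering before scoring.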

Validation

JSON schema validation and cross-reference integrity checks. Verify person IDs, document references, and data consistency.
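A cross-reference integrity check can be as simple as confirming that every person ID a document mentions exists in the person list. The sketch below assumes a hypothetical record shape (`id` on both record types, `person_ids` on documents); the pipeline's real schema may differ.

```python
def find_dangling_person_refs(persons, documents):
    """Return (document_id, person_id) pairs that point at unknown persons."""
    known_ids = {person["id"] for person in persons}
    return [
        (doc["id"], person_id)
        for doc in documents
        for person_id in doc.get("person_ids", [])
        if person_id not in known_ids
    ]


# Made-up sample records, for illustration only.
persons = [{"id": "p-1", "name": "Example Person"}]
documents = [
    {"id": "d-1", "person_ids": ["p-1"]},
    {"id": "d-2", "person_ids": ["p-1", "p-9"]},  # "p-9" is not defined
]

errors = find_dangling_person_refs(persons, documents)
print(errors)  # [('d-2', 'p-9')]
```

An empty result means every reference resolves; a non-empty one lists exactly which document points at which missing ID.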

Export

Export to JSON (compatible with epsteinexposed.com), CSV for spreadsheets, or SQLite with FTS5 full-text search.
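To give a sense of what an FTS5-backed export enables, here is a minimal sketch using Python's built-in `sqlite3` module. The table name `documents_fts`, its columns, and both helper functions are assumptions made for this example rather than the schema the exporter actually writes, and the sketch requires a SQLite build with FTS5 enabled.

```python
import sqlite3


def build_index(rows, db_path=":memory:"):
    """Create an FTS5 table and load (doc_id, title, body) rows into it."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS documents_fts "
        "USING fts5(doc_id UNINDEXED, title, body)"
    )
    conn.executemany(
        "INSERT INTO documents_fts (doc_id, title, body) VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn


def search(conn, query, limit=10):
    """Return matching document IDs, best match first."""
    cursor = conn.execute(
        "SELECT doc_id FROM documents_fts "
        "WHERE documents_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return [row[0] for row in cursor]


# Made-up sample rows, for illustration only.
rows = [
    ("d-1", "Sample memo", "budget review for the quarter"),
    ("d-2", "Sample letter", "travel schedule and itinerary"),
]
conn = build_index(rows)
print(search(conn, "budget"))  # ['d-1']
```

Marking `doc_id` as `UNINDEXED` keeps the identifier retrievable without adding it to the full-text index, so searches match only on title and body.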

CLI Reference

Full command-line interface for every step of the pipeline

Commands
epstein download doj             Fetch latest DOJ EFTA releases
epstein ocr ./data/pdfs/         OCR all PDFs in a directory
epstein extract-entities         Run NLP entity extraction
epstein dedup                    Find and merge duplicates
epstein validate                 Check data integrity
epstein export --format sqlite   Export to SQLite with FTS5
epstein stats                    Show dataset statistics

Built With

Python 3.11+, Pydantic v2, Click CLI, spaCy NLP, rapidfuzz, IBM Docling, SQLite FTS5, pytest, ruff, mypy, Docker, GitHub Actions

Contribute to the Pipeline

The pipeline is open source and welcomes contributions. Add new data sources, improve OCR accuracy, or build new exporters.