OPEN SOURCE

Epstein Pipeline

Open-source Python toolkit for downloading, processing, and analyzing Epstein case documents. Powers the data behind epsteinexposed.com.

Terminal

# Install from PyPI

pip install epstein-pipeline

# Or clone and install

git clone https://github.com/stonesalltheway1/Epstein-Pipeline.git

cd Epstein-Pipeline

pip install -e ".[dev]"

What It Does

A complete pipeline for turning raw DOJ releases into structured, searchable data

Fetch documents from DOJ EFTA, Kaggle, HuggingFace, and Archive.org with built-in rate limiting and resume support.
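As a rough illustration of what rate limiting and resume support can look like, here is a minimal standard-library sketch. It is not the pipeline's actual downloader: `RateLimiter` and `resume_headers` are hypothetical names, and the sketch assumes the remote server honors HTTP `Range` requests.

```python
import time
from pathlib import Path


class RateLimiter:
    """Allow at most one request per `min_interval` seconds (hypothetical helper)."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self) -> float:
        """Sleep just long enough to respect the interval; return the delay used."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay:
            self._sleep(delay)
        self._last = now + delay
        return delay


def resume_headers(partial_file: Path) -> dict[str, str]:
    """Build a Range header that continues after the bytes already on disk."""
    size = partial_file.stat().st_size if partial_file.exists() else 0
    return {"Range": f"bytes={size}-"} if size else {}
```

A downloader would call `wait()` before each request and pass `resume_headers(path)` along with it, appending the response body to the partial file.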

Extract text from scanned PDFs using IBM Docling. Handles multi-page documents, redacted sections, and poor scan quality.

Identify persons, organizations, and locations in documents using spaCy NLP. Automatic person linking with fuzzy matching.

Find and merge duplicate documents using rapidfuzz similarity scoring. Configurable thresholds for title and content matching.
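To make the idea of separate title and content thresholds concrete, here is a small sketch. It deliberately swaps in `difflib.SequenceMatcher` from the standard library rather than rapidfuzz so that it runs without extra dependencies. The function names, the record fields (`id`, `title`, `text`), and the default thresholds are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_duplicates(docs, title_threshold=90.0, content_threshold=95.0):
    """Return ID pairs whose title and content both meet their thresholds."""
    pairs = []
    for a, b in combinations(docs, 2):
        if (similarity(a["title"], b["title"]) >= title_threshold
                and similarity(a["text"], b["text"]) >= content_threshold):
            pairs.append((a["id"], b["id"]))
    return pairs
```

Comparing every pair is quadratic in the number of documents, so a real deduplication pass over a large corpus would likely need blocking or hashing to narrow the candidates first.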

Validate data against JSON schemas and run cross-reference integrity checks. Verify person IDs, document references, and data consistency.
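A cross-reference check of this kind might look something like the sketch below. The record layout (an `id` on each person, a `person_ids` list on each document) and the function name are assumptions, not the pipeline's actual schema.

```python
def find_broken_references(persons: list[dict], documents: list[dict]) -> list[tuple[str, str]]:
    """Return (document_id, person_id) pairs that point at an unknown person."""
    known_ids = {p["id"] for p in persons}
    return [
        (doc["id"], pid)
        for doc in documents
        for pid in doc.get("person_ids", [])  # documents without references are skipped
        if pid not in known_ids
    ]
```

An empty result means every person reference resolves; any returned pair identifies exactly which document needs fixing.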

Export to JSON (compatible with epsteinexposed.com), CSV for spreadsheets, or SQLite with FTS5 full-text search.
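For readers unfamiliar with FTS5, the following sketch shows the general shape of a full-text export using Python's built-in `sqlite3` module. The table name, column names, and helper functions are hypothetical and may differ from what the exporter actually produces. It also assumes the local SQLite build includes the FTS5 extension, which is common but not guaranteed.

```python
import sqlite3


def export_to_sqlite(docs, db_path=":memory:"):
    """Write documents into an FTS5 virtual table and return the open connection."""
    conn = sqlite3.connect(db_path)
    # doc_id is stored but not tokenized; title and body are searchable.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS documents "
        "USING fts5(doc_id UNINDEXED, title, body)"
    )
    conn.executemany(
        "INSERT INTO documents (doc_id, title, body) VALUES (?, ?, ?)",
        [(d["id"], d["title"], d["text"]) for d in docs],
    )
    conn.commit()
    return conn


def search(conn, query):
    """Return matching document IDs, best match first."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE documents MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]
```

Passing a file path instead of `":memory:"` would persist the index to disk, where it can be queried with any SQLite client.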

Full command-line interface for every step of the pipeline

Commands

epstein download doj              Fetch latest DOJ EFTA releases

epstein ocr ./data/pdfs/          OCR all PDFs in a directory

epstein extract-entities          Run NLP entity extraction

epstein dedup                     Find and merge duplicates

epstein validate                  Check data integrity

epstein export --format sqlite    Export to SQLite with FTS5

epstein stats                     Show dataset statistics

Python 3.11+, Pydantic v2, Click CLI, spaCy NLP, rapidfuzz, IBM Docling, SQLite FTS5, pytest, ruff, mypy, Docker, GitHub Actions

The pipeline is open source and welcomes contributions. Add new data sources, improve OCR accuracy, or build new exporters.