OPEN SOURCE

Epstein Pipeline

The data engine behind epsteinexposed.com. Downloads, OCRs, extracts entities, cross-references OSINT databases, and exports 2.1M+ documents to Neon Postgres with vector search. MIT licensed. pip installable.

2.1M+ Documents Processed
2.0M+ OCR Texts Extracted
1,700+ Persons Identified
2.4M+ Person-Doc Links
31 CLI Commands
95+ GitHub Stars
Terminal
# Install from PyPI
pip install epstein-pipeline
# Or clone and install with all extras
git clone https://github.com/stonesalltheway1/Epstein-Pipeline.git
cd Epstein-Pipeline
pip install -e ".[all]"
# Optional install groups: ocr, nlp, embeddings, neon, ai, audit, dev
pip install "epstein-pipeline[ocr,nlp,neon]"

10-Stage Pipeline

Raw DOJ releases in, structured searchable database out

01 Download: Fetch raw PDFs from 9 sources
02 OCR: Multi-backend text extraction with fallback
03 Entity Extraction: spaCy + GLiNER NER on full text
04 Person Linking: Fuzzy match names to canonical IDs
05 Classification: Zero-shot BART into 12 categories
06 Deduplication: Hash, MinHash/LSH, and semantic passes
07 Chunking: Paragraph-aware, 450 tokens per chunk
08 Embedding: nomic-embed-text-v2-moe vectors
09 Validation: Schema checks + cross-reference integrity
10 Export: JSON, CSV, SQLite, Neon, or direct site sync

What It Does

31 CLI commands covering ingestion, processing, cross-referencing, and export

9 Data Sources

Pull documents from DOJ EFTA (DS1-DS12), Kaggle, HuggingFace, Archive.org, FBI Vault, CourtListener, House Oversight, DocumentCloud, and Sea_Doughnut research databases.

4 OCR Backends

PyMuPDF for text-layer extraction, Surya for 90+ languages, olmOCR 2 (Allen AI) for VLM-based accuracy, and IBM Docling for table/layout understanding. Automatic fallback chain.
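As a rough illustration of the fallback idea, the sketch below tries a list of backends in order and keeps the first result that contains a usable amount of text. The backend functions, the `ocr_with_fallback` name, the ordering, and the `min_chars` threshold are all hypothetical stand-ins for illustration; they are not the pipeline's actual API or the real PyMuPDF, Surya, olmOCR, or Docling calls.

```python
from typing import Callable, Optional

# Hypothetical backend signature: takes a PDF path, returns extracted text.
OcrBackend = Callable[[str], str]


def ocr_with_fallback(
    pdf_path: str,
    backends: list[tuple[str, OcrBackend]],
    min_chars: int = 20,  # assumed cutoff for "this backend found real text"
) -> tuple[Optional[str], str]:
    """Try each backend in order; return (backend_name, text) for the first usable result."""
    for name, backend in backends:
        try:
            text = backend(pdf_path)
        except Exception:
            # A failing backend should not abort the run; move on to the next one.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    return None, ""


# Stand-in backends for illustration only.
def text_layer_stub(path: str) -> str:
    return ""  # simulates a scanned PDF with no embedded text layer


def vision_ocr_stub(path: str) -> str:
    return "Transcribed page text from a scanned document."


chain = [("text-layer", text_layer_stub), ("vision-ocr", vision_ocr_stub)]
used, text = ocr_with_fallback("example.pdf", chain)
```

Putting the cheapest backend first means the slower model-based backends only run on pages where text-layer extraction comes back empty.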

NLP Entity Extraction

spaCy transformer models and GLiNER zero-shot NER identify persons, organizations, locations, dates, and financial amounts. Fuzzy name matching links entities to canonical person IDs.
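A minimal sketch of the person-linking step, assuming a small in-memory registry of canonical names. It uses the standard library's `difflib` rather than rapidfuzz to stay dependency-free, so the scores will differ from whatever the pipeline computes. The registry contents, the `link_person` name, and the 0.85 threshold are placeholders invented for this example.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical canonical registry: person ID -> canonical name (placeholder data).
CANONICAL = {
    "p-0001": "Jane Alexandra Doe",
    "p-0002": "John Q. Public",
}


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def link_person(mention: str, threshold: float = 0.85) -> Optional[str]:
    """Return the best-matching canonical ID, or None if no score reaches the threshold."""
    target = normalize(mention)
    best_id, best_score = None, 0.0
    for person_id, canonical in CANONICAL.items():
        score = SequenceMatcher(None, target, normalize(canonical)).ratio()
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id if best_score >= threshold else None
```

Returning `None` below the threshold leaves an ambiguous mention unlinked instead of attaching it to the wrong person, which matters when the output names real individuals.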

Document Classification

Zero-shot BART classifier sorts documents into 12 legal categories: court filings, depositions, financial records, flight logs, correspondence, law enforcement, and more.

3-Pass Deduplication

First pass: SHA-256 exact hashing. Second pass: MinHash/LSH approximate matching. Third pass: semantic cosine similarity on embeddings. Configurable thresholds at each stage.
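The first two passes can be sketched with the standard library alone. This is an illustration under assumptions, not the pipeline's implementation: it compares MinHash signatures pairwise instead of using LSH banding, uses salted BLAKE2b as the hash family, shingles on 3-word windows, and omits the semantic pass because that needs an embedding model. The function names and the 0.8 threshold are hypothetical.

```python
import hashlib


def sha256_digest(text: str) -> str:
    """Pass 1 key: identical text yields an identical digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set[str]:
    """Case-insensitive k-word shingles; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; the salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(), "big"
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def find_duplicates(docs: dict[str, str], near_threshold: float = 0.8):
    """Return (exact_pairs, near_pairs) as lists of (kept_id, duplicate_id)."""
    exact, near = [], []
    seen_hashes: dict[str, str] = {}
    signatures: dict[str, list[int]] = {}
    for doc_id, text in docs.items():
        digest = sha256_digest(text)
        if digest in seen_hashes:
            exact.append((seen_hashes[digest], doc_id))
            continue  # exact duplicates never reach the MinHash pass
        seen_hashes[digest] = doc_id
        sig = minhash_signature(shingles(text))
        for other_id, other_sig in signatures.items():
            if estimated_jaccard(sig, other_sig) >= near_threshold:
                near.append((other_id, doc_id))
        signatures[doc_id] = sig
    return exact, near
```

Running the cheap exact-hash pass first keeps byte-identical copies out of the costlier similarity comparison. The pairwise loop here is quadratic, and LSH is the usual way to avoid that cost at scale.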

Vector Embeddings + Search

nomic-embed-text-v2-moe generates 768-dim (or 256-dim Matryoshka) vectors. Paragraph-aware chunking at 450 tokens. Stored in pgvector with cosine ANN indexes for semantic search.
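Paragraph-aware chunking can be sketched as greedy packing of whole paragraphs under a token budget. The sketch below assumes paragraphs are separated by blank lines and counts whitespace-separated words as a stand-in for the embedding model's tokenizer, so its chunk boundaries will not match the pipeline's exactly. The `chunk_paragraphs` name and the handling of oversized paragraphs are assumptions.

```python
def chunk_paragraphs(text: str, max_tokens: int = 450) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_tokens whitespace tokens."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue
        words = paragraph.split()
        # A paragraph larger than the budget is emitted on its own in fixed windows.
        if len(words) > max_tokens:
            if current:
                chunks.append("\n\n".join(current))
                current, current_len = [], 0
            for start in range(0, len(words), max_tokens):
                chunks.append(" ".join(words[start:start + max_tokens]))
            continue
        # Flush the running chunk if adding this paragraph would exceed the budget.
        if current and current_len + len(words) > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(words)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Breaking only at paragraph boundaries keeps each embedded chunk topically coherent, and the fixed-window fallback ensures no chunk exceeds the budget.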

OSINT Cross-Reference

Match persons against OpenSanctions (OFAC, EU, UN, Interpol), ICIJ Offshore Leaks (Panama/Paradise/Pandora Papers), FEC political donations, and IRS Form 990 nonprofit filings.

Person Integrity Audit

5-phase audit using Claude AI: deduplication check, Wikidata verification, fact-checking against source documents, internal coherence scoring, and confidence grading per person.

Neon Postgres Export

Direct push to Neon with pgvector, pg_trgm, and tsvector/GIN full-text search. Also exports to JSON, CSV, and SQLite with FTS5. Pydantic models use camelCase aliases to match the site's TypeScript types.
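The camelCase alias pattern can be illustrated with Pydantic v2's built-in alias generator. The model and field names below are hypothetical and do not come from the pipeline's schema; the sketch only shows how snake_case Python fields can serialize to the camelCase keys a TypeScript front end would expect.

```python
from pydantic import BaseModel, ConfigDict
from pydantic.alias_generators import to_camel


class PersonRecord(BaseModel):
    """Hypothetical export model: snake_case in Python, camelCase on the wire."""

    # populate_by_name lets callers construct the model with either naming style.
    model_config = ConfigDict(alias_generator=to_camel, populate_by_name=True)

    person_id: str
    full_name: str
    document_count: int = 0


record = PersonRecord(person_id="p-0001", full_name="Jane Doe", document_count=3)

# by_alias=True emits the camelCase keys.
payload = record.model_dump(by_alias=True)

# Incoming camelCase data validates back into the same model.
parsed = PersonRecord.model_validate({"personId": "p-0002", "fullName": "John Public"})
```

Generating the aliases from one rule, instead of writing each alias by hand, keeps the Python and TypeScript field names in step as the schema grows.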

CLI Reference

31 commands organized by stage

Data Ingestion
epstein download doj          Fetch latest DOJ EFTA releases (DS1-DS12)
epstein download kaggle       Pull Kaggle Epstein Ranker dataset
epstein download huggingface  Pull HuggingFace structured data
epstein download archive      Pull Archive.org media collections
epstein import sea-doughnut   Import 1.38M Sea_Doughnut docs
Processing
epstein ocr ./data/pdfs/      Multi-backend OCR (auto/pymupdf/surya/olmocr/docling)
epstein extract-entities      Run spaCy + GLiNER entity extraction
epstein classify              Zero-shot BART document classification
epstein dedup                 3-pass deduplication (hash/minhash/semantic)
epstein embed                 Generate vector embeddings
epstein analyze-redactions    Detect redacted sections + recover text
epstein extract-images        Extract images from PDFs (optional AI description)
epstein transcribe            Audio/video transcription via faster-whisper
OSINT Cross-Reference
epstein check-sanctions       Match against OFAC, EU, UN, Interpol, PEP lists
epstein check-icij            Cross-reference Panama/Paradise/Pandora Papers
epstein check-fec             Search FEC political donation records
epstein check-nonprofits      Search IRS Form 990 nonprofit filings
Export + Database
epstein export json           Export to JSON (site-compatible)
epstein export csv            Export to CSV for spreadsheets
epstein export sqlite         Export to SQLite with FTS5 full-text search
epstein export-neon           Push to Neon Postgres with pgvector
epstein search 'query'        Semantic search against pgvector
epstein migrate               Run idempotent Neon schema migration
Quality + Audit
epstein validate              JSON schema validation + integrity checks
epstein audit-persons         5-phase AI person integrity audit
epstein stats                 Show processing statistics
epstein build-graph           Build knowledge graph (JSON + GEXF)

Built With

Core
Python 3.10+, Click CLI, Pydantic v2, httpx, Rich
OCR
PyMuPDF, Surya, olmOCR 2, IBM Docling
NLP
spaCy, GLiNER, BART-large-mnli, rapidfuzz
Embeddings
nomic-embed-text-v2-moe, sentence-transformers, PyTorch
Database
Neon Postgres, pgvector, pg_trgm, SQLite FTS5
AI
OpenAI, Anthropic Claude, Voyage AI, Cohere
Infra
Docker, GitHub Actions, pytest, ruff, mypy

Data Sources

DOJ EFTA (DS1-DS12)        2.73M documents              ~218 GB
Sea_Doughnut Research DBs  1.38M documents              849K redaction analyses
Kaggle (Epstein Ranker)    ~23,700 documents            AI-analyzed
HuggingFace                Structured emails + filings
Archive.org                Media collections            Photos, videos, audio
FBI Vault                  FBI records
CourtListener              Court filings
House Oversight            Congressional releases
DocumentCloud              Searchable court docs

Contribute

MIT licensed. Add new data sources, improve OCR accuracy, write new exporters, or run the pipeline on your own infrastructure.