Epstein Pipeline
The data engine behind epsteinexposed.com. It downloads and OCRs source documents, extracts entities, cross-references them against OSINT databases, and exports 2.1M+ documents to Neon Postgres with vector search. MIT licensed and pip-installable.
10-Stage Pipeline
Raw DOJ releases in, structured searchable database out
What It Does
31 CLI commands covering ingestion, processing, cross-referencing, and export
9 Data Sources
Pull documents from DOJ EFTA (DS1-DS12), Kaggle, HuggingFace, Archive.org, FBI Vault, CourtListener, House Oversight, DocumentCloud, and Sea_Doughnut research databases.
4 OCR Backends
PyMuPDF for text-layer extraction, Surya for 90+ languages, olmOCR 2 (Allen AI) for VLM-based accuracy, and IBM Docling for table/layout understanding. Automatic fallback chain.
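A fallback chain of this kind can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `ocr_with_fallback` function, the `min_chars` cutoff, the backend order, and the stand-in backend functions are all assumptions made for the example.

```python
from typing import Callable, Optional

# Hypothetical backend signature: takes a PDF path and returns extracted text
# (possibly empty), or raises if the backend cannot process the file.
OcrBackend = Callable[[str], str]


def ocr_with_fallback(
    path: str,
    backends: list[tuple[str, OcrBackend]],
    min_chars: int = 20,
) -> tuple[Optional[str], str]:
    """Try each backend in order and return (backend_name, text) for the first
    one that yields at least `min_chars` non-whitespace characters."""
    for name, backend in backends:
        try:
            text = backend(path)
        except Exception:
            continue  # backend failed; fall through to the next one
        if len("".join(text.split())) >= min_chars:
            return name, text
    return None, ""  # every backend failed or produced too little text


# Stand-in backends for illustration only (not the real OCR engines).
def text_layer(path: str) -> str:
    return ""  # simulates a scanned PDF with no text layer


def broken_engine(path: str) -> str:
    raise RuntimeError("model not installed")


def vlm_engine(path: str) -> str:
    return "Deposition transcript, page 1 of 40."


# Assumed order: cheapest backend first, heavier models later.
chain = [("pymupdf", text_layer), ("surya", broken_engine), ("olmocr", vlm_engine)]
result = ocr_with_fallback("scan.pdf", chain)
print(result)
```

Trying the cheap text-layer extraction first means the heavier models only run on pages that actually need them.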
NLP Entity Extraction
spaCy transformer models and GLiNER zero-shot NER identify persons, organizations, locations, dates, and financial amounts. Fuzzy name matching links entities to canonical person IDs.
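The fuzzy-matching step can be illustrated with the standard library alone. The sketch below assumes a simple string-similarity approach using `difflib`; the pipeline's real matcher, its threshold, and its ID format are not documented here, so `link_person`, the 0.85 cutoff, and the `p-001` style IDs are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and reorder 'Last, First' to 'first last'."""
    if "," in name:
        last, first = name.split(",", 1)
        name = f"{first} {last}"
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def link_person(
    mention: str,
    canonical: dict[str, str],
    threshold: float = 0.85,  # hypothetical cutoff
) -> Optional[str]:
    """Return the canonical person ID whose name best matches `mention`,
    or None if no candidate reaches the similarity threshold."""
    target = normalize(mention)
    best_id, best_score = None, 0.0
    for person_id, name in canonical.items():
        score = SequenceMatcher(None, target, normalize(name)).ratio()
        if score > best_score:
            best_id, best_score = person_id, score
    return best_id if best_score >= threshold else None


# Placeholder registry of canonical persons (made-up names and IDs).
canonical = {"p-001": "Jane Example", "p-002": "John Sample"}

print(link_person("EXAMPLE, Jane", canonical))   # reordered form of p-001
print(link_person("Jane Exampel", canonical))    # OCR-style typo
print(link_person("Robert Unknown", canonical))  # no close candidate
```

A linear scan like this is fine for a small registry; a larger one would need blocking or an index to avoid comparing every mention against every person.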
Document Classification
Zero-shot BART classifier sorts documents into 12 legal categories: court filings, depositions, financial records, flight logs, correspondence, law enforcement, and more.
3-Pass Deduplication
First pass: SHA-256 exact hashing. Second pass: MinHash/LSH approximate matching. Third pass: semantic cosine similarity on embeddings. Configurable thresholds at each stage.
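The three passes can be sketched as a single comparison function. This is an illustrative approximation under stated assumptions: the thresholds are hypothetical, MinHash is implemented with salted BLAKE2b hashes over 3-word shingles, and signatures are compared pairwise rather than through an LSH index, which a real run over millions of documents would need.

```python
import hashlib
import math
from typing import Optional


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 3) -> set[str]:
    """Lowercased k-word shingles; short texts become a single shingle."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """One minimum per salted hash function; each salt stands in for a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "big")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(a: list[int], b: list[int]) -> float:
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def duplicate_kind(
    text_a: str,
    text_b: str,
    emb_a: list[float],
    emb_b: list[float],
    minhash_threshold: float = 0.8,   # hypothetical
    cosine_threshold: float = 0.95,   # hypothetical
) -> Optional[str]:
    """Run the passes cheapest-first and report which one flagged the pair."""
    if sha256_hex(text_a) == sha256_hex(text_b):
        return "exact"
    sig_a = minhash_signature(shingles(text_a))
    sig_b = minhash_signature(shingles(text_b))
    if estimated_jaccard(sig_a, sig_b) >= minhash_threshold:
        return "near"
    if cosine(emb_a, emb_b) >= cosine_threshold:
        return "semantic"
    return None
```

Ordering the passes from cheapest to most expensive means the embedding comparison only runs on pairs that the hash-based passes could not resolve.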
Vector Embeddings + Search
nomic-embed-text-v2-moe generates 768-dim (or 256-dim Matryoshka) vectors. Paragraph-aware chunking at 450 tokens. Stored in pgvector with cosine ANN indexes for semantic search.
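To make the chunking and Matryoshka steps concrete, here is a minimal sketch. It rests on two assumptions: tokens are approximated by whitespace-separated words (the real tokenizer will count differently), and a Matryoshka vector is shortened by keeping its leading dimensions and re-normalizing to unit length. The function names are hypothetical.

```python
import math


def chunk_paragraphs(text: str, max_tokens: int = 450) -> list[str]:
    """Pack whole paragraphs into chunks of at most `max_tokens` tokens.
    Tokens are approximated here by whitespace-separated words."""
    chunks: list[str] = []
    current: list[str] = []
    count = 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        words = para.split()
        # A paragraph longer than the limit is split on word boundaries
        # as a last resort; shorter paragraphs stay intact.
        pieces = [words[i:i + max_tokens] for i in range(0, len(words), max_tokens)]
        for piece in pieces:
            if current and count + len(piece) > max_tokens:
                chunks.append("\n\n".join(current))
                current, count = [], 0
            current.append(" ".join(piece))
            count += len(piece)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def truncate_matryoshka(vec: list[float], dim: int = 256) -> list[float]:
    """Keep the first `dim` dimensions and re-normalize to unit length,
    so cosine similarity stays meaningful on the shorter vector."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


# Three paragraphs of 200, 200, and 100 words: the first two fit in one
# chunk, and the third starts a new one instead of being cut mid-paragraph.
def paragraph(n: int) -> str:
    return " ".join(["word"] * n)


text = "\n\n".join([paragraph(200), paragraph(200), paragraph(100)])
sizes = [len(chunk.split()) for chunk in chunk_paragraphs(text)]
print(sizes)
```

The shorter 256-dim vectors cut storage and index size to a third, at some cost in retrieval quality.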
OSINT Cross-Reference
Match persons against OpenSanctions (OFAC, EU, UN, Interpol), ICIJ Offshore Leaks (Panama/Paradise/Pandora Papers), FEC political donations, and IRS Form 990 nonprofit filings.
Person Integrity Audit
5-phase audit using Claude AI: deduplication check, Wikidata verification, fact-checking against source documents, internal coherence scoring, and confidence grading per person.
Neon Postgres Export
Direct push to Neon with pgvector, pg_trgm, and tsvector/GIN full-text search. Also exports to JSON, CSV, and SQLite with FTS5. Pydantic models use camelCase aliases to match the site's TypeScript types.
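The camelCase aliasing can be illustrated without Pydantic. The sketch below uses a plain dataclass and a hand-written `to_camel` helper; the `DocumentRecord` model and its field names are hypothetical, not the pipeline's actual schema. In Pydantic v2, a function like this would typically be passed as `alias_generator` in the model config.

```python
from dataclasses import asdict, dataclass


def to_camel(snake: str) -> str:
    """Convert snake_case to camelCase, e.g. 'page_count' -> 'pageCount'."""
    head, *rest = snake.split("_")
    return head + "".join(part.capitalize() for part in rest)


@dataclass
class DocumentRecord:  # hypothetical model, for illustration only
    doc_id: str
    source_dataset: str
    page_count: int


def export_row(record: DocumentRecord) -> dict:
    """Serialize with camelCase keys so a TypeScript consumer can read
    the JSON without a renaming layer."""
    return {to_camel(key): value for key, value in asdict(record).items()}


row = export_row(DocumentRecord(doc_id="doc-1", source_dataset="DS1", page_count=12))
print(row)
```

Keeping Python field names in snake_case and converting only at the export boundary lets each side of the stack follow its own naming convention.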
CLI Reference
31 commands organized by stage
epstein download doj            Fetch latest DOJ EFTA releases (DS1-DS12)
epstein download kaggle         Pull Kaggle Epstein Ranker dataset
epstein download huggingface    Pull HuggingFace structured data
epstein download archive        Pull Archive.org media collections
epstein import sea-doughnut     Import 1.38M Sea_Doughnut docs
epstein ocr ./data/pdfs/        Multi-backend OCR (auto/pymupdf/surya/olmocr/docling)
epstein extract-entities        Run spaCy + GLiNER entity extraction
epstein classify                Zero-shot BART document classification
epstein dedup                   3-pass deduplication (hash/minhash/semantic)
epstein embed                   Generate vector embeddings
epstein analyze-redactions      Detect redacted sections + recover text
epstein extract-images          Extract images from PDFs (optional AI description)
epstein transcribe              Audio/video transcription via faster-whisper
epstein check-sanctions         Match against OFAC, EU, UN, Interpol, PEP lists
epstein check-icij              Cross-reference Panama/Paradise/Pandora Papers
epstein check-fec               Search FEC political donation records
epstein check-nonprofits        Search IRS Form 990 nonprofit filings
epstein export json             Export to JSON (site-compatible)
epstein export csv              Export to CSV for spreadsheets
epstein export sqlite           Export to SQLite with FTS5 full-text search
epstein export-neon             Push to Neon Postgres with pgvector
epstein search 'query'          Semantic search against pgvector
epstein migrate                 Run idempotent Neon schema migration
epstein validate                JSON schema validation + integrity checks
epstein audit-persons           5-phase AI person integrity audit
epstein stats                   Show processing statistics
epstein build-graph             Build knowledge graph (JSON + GEXF)
Built With
Data Sources
Contribute
MIT licensed. Add new data sources, improve OCR accuracy, write new exporters, or run the pipeline on your own infrastructure.