Case File
d-19915 House Oversight Other

Methodology for Resolving Homonymous Names in Large Text Corpora

The passage describes a technical process for name disambiguation and fame scoring in a database. It contains no specific allegations, actors, transactions, or actionable leads linking powerful indivi Outlines steps to generate all first‑name + last‑name combinations for query matching. Describes calculating time‑resolved word‑match frequencies as a 'fame signal'. Identifies and categorizes homony

Date
November 11, 2025
Source
House Oversight
Reference
House Oversight #017031
Pages
1
Persons
0
Integrity
No Hash Available

Tags

data-analysis, methodology, text-mining, name-disambiguation, house-oversight

Extracted Text (OCR)

EFTA Disclosure
Text extracted via OCR from the original document. May contain errors from the scanning process.
h. Add to the set of query names all pairs of “first names + last names” produced by combining the sets of first and last names.
i. This procedure is carried out for every raw name variant.

III.7.A.6 - Find the word match frequencies of all names.

Given the set of names which may refer to an individual, we wish to find the time-resolved word frequencies of these names. The frequency of a name, which measures how often an individual is mentioned, provides a metric for the fame of that person. We append the word frequencies of all the names which can potentially refer to an individual. This enables us, in a later step, to identify which name is the relevant one.

6) Append the fame signal for each query name of each record. The fame signal is the time series of normalized word matches in the complete English database.

III.7.A.7 - Find ambiguous names which can refer to multiple individuals.

Certain names are particularly popular and are shared by multiple people. This results in ambiguity, as the same query name may refer to several individuals. Homonymy conflicts occur between a group of individuals when they share part or all of their name. When these conflicts arise, the word frequency of a specific name may reflect the number of references not to a unique person but to an entire group. As such, the word frequency is not a clear means of tracking the fame of the individuals concerned. We identify homonymy conflicts by finding individuals whose names contain complete or partial matches. These conflicts are, when possible, resolved in the following step on the basis of the importance of the conflicted individuals. Typical homonymy conflicts are shown in Table S11.

7) Identify homonymy conflicts. Homonymy conflicts arise when the query names of two or more individuals contain a substring match. These conflicts are distinguished as follows:
a. For every query name of every record, find the set of substrings of query names.
b. For every query name of every record, search for matches in the set of query name substrings of all other records.
c. Bidirectional homonymy conflicts occur when a query name fully matches another query name. The conflicted name could be used to refer to both individuals. Unidirectional conflicts occur when a query name has a substring match within another query name. Thus, the conflicted name can refer to one of the individuals, but can also be part of a name referring to another.

III.7.A.8 - Resolve, when possible, the most likely origin of ambiguous names.

The problem of homonymous individuals is limiting because the word frequency data do not allow us to resolve the true identity behind a homonymous name. Nonetheless, in some cases, it is possible to distinguish conflicted individuals on the basis of their importance. For the database of people extracted from Encyclopedia Britannica, we argue that the quantity of information available about an individual provides a proxy for their relevance. Likewise, for people obtained from Wikipedia, we can judge their importance by the size of the article written about the person and the quantity of traffic the article generates. As such, we approach the problem of ambiguous names by comparing the notability of individuals, as evaluated by the amount of information available about them in the respective encyclopedic source. Examples of conflict resolution are shown in Tables S12 and S13.

8) Resolve homonymy conflicts.
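
The name-combination step (h), the conflict search (step 7), and the notability comparison (step 8) can be illustrated with a minimal Python sketch. It is not the authors' implementation. All function names, the record layout (a dict from record id to a set of query names), and the `margin` threshold are hypothetical. The excerpt does not say how substrings are formed, so the sketch assumes word-level substrings. It also assumes a single numeric notability score per record, standing in for the article size or traffic mentioned above.

```python
from itertools import product


def query_names(first_names, last_names):
    """Step h: every "first last" pair from the two name sets."""
    return {f"{first} {last}" for first, last in product(first_names, last_names)}


def word_substrings(name):
    """All contiguous word-level substrings of a name, the name included.

    Assumption: substrings are taken over whole words, not characters.
    """
    words = name.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }


def find_conflicts(records):
    """Step 7: records maps a record id to its set of query names.

    Returns a set of (name, owner_id, other_id, kind) tuples. kind is
    "bidirectional" when the name fully matches a query name of the other
    record, and "unidirectional" when it only matches part of one.
    """
    conflicts = set()
    for owner, names in records.items():
        for other, other_names in records.items():
            if other == owner:
                continue
            other_subs = set().union(*(word_substrings(n) for n in other_names))
            for name in names:
                if name in other_names:
                    conflicts.add((name, owner, other, "bidirectional"))
                elif name in other_subs:
                    conflicts.add((name, owner, other, "unidirectional"))
    return conflicts


def resolve(candidates, notability, margin=2.0):
    """Step 8 (hypothetical rule): assign the name to the most notable
    record if its score is at least `margin` times the runner-up's score.
    Returns None when the conflict cannot be resolved this way.
    """
    ranked = sorted(candidates, key=lambda r: notability[r], reverse=True)
    if len(ranked) == 1:
        return ranked[0]
    top, runner_up = ranked[0], ranked[1]
    if notability[top] >= margin * notability[runner_up]:
        return top
    return None


if __name__ == "__main__":
    # Invented records, for illustration only.
    records = {
        "A": query_names({"John", "J."}, {"Smith"}),
        "B": {"John Smith"},
        "C": {"John Smith Barnes"},
    }
    for conflict in sorted(find_conflicts(records)):
        print(conflict)
    print(resolve(["A", "B"], {"A": 5000, "B": 300}))
```

In this sketch a conflict is left unresolved (`None`) when the two scores are close. This mirrors the "when possible" wording of the text, but the factor of 2 is arbitrary and would need to be chosen against real data.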

