h. Add to the set of query names all pairs of “first names + last names” produced by
combining the sets of first and last names.
i. This procedure is carried out for every raw name variant.
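Steps h and i amount to taking the Cartesian product of the first-name and last-name sets, once per raw name variant. The following is a minimal sketch in Python, assuming both sets have already been built for a given variant; the function name `name_pairs` and the sample names are hypothetical and not part of the original procedure.

```python
from itertools import product

def name_pairs(first_names, last_names):
    """Return every "first last" combination of the two sets (step h)."""
    return {f"{first} {last}" for first, last in product(first_names, last_names)}

# Hypothetical first-name and last-name sets for a single raw name variant.
first_names = {"John", "Johnny"}
last_names = {"Smith", "Smyth"}

# Add the pairs to the set of query names for this record.
query_names = set()
query_names |= name_pairs(first_names, last_names)
```

Using sets means that pairs produced more than once, for example by different raw name variants, are stored only once.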
III.7.A.6 — Find the word match frequencies of all names.
Given the set of names which may refer to an individual, we wish to find the time-resolved word
frequencies of these names. The frequency of a name, which measures how often
an individual is mentioned, provides a metric for the fame of that person. We append the word
frequencies of all the names which can potentially refer to an individual. This enables us, in a later step,
to identify which name is the relevant one.
6) Append the fame signal for each query name of each record. The fame signal is the
time series of normalized word matches in the complete English database.
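As an illustration of step 6, the sketch below builds one normalized time series per query name. It assumes that "normalized" means dividing the yearly number of matches by the total number of words in the corpus for that year, which the text does not state explicitly. The function names, the record layout (a dict with a `query_names` key), and the input dictionaries are all hypothetical.

```python
def fame_signal(match_counts, total_counts):
    """Normalize the yearly match counts of one query name.

    match_counts: dict year -> number of matches of the query name
    total_counts: dict year -> total number of words in the corpus that year
    Assumption: normalization divides by the yearly corpus size.
    """
    return {
        year: match_counts.get(year, 0) / total
        for year, total in sorted(total_counts.items())
        if total > 0  # skip years with an empty corpus
    }

def append_fame_signals(record, counts_by_name, total_counts):
    """Attach one fame signal per query name to a record (a plain dict)."""
    record["fame_signals"] = {
        name: fame_signal(counts_by_name.get(name, {}), total_counts)
        for name in record["query_names"]
    }
    return record
```

Keeping a separate series for every query name, rather than summing them, preserves the information needed later to decide which name is the relevant one.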
III.7.A.7 — Find ambiguous names which can refer to multiple individuals.
Certain names are particularly popular and are shared by multiple people. This results in ambiguity, as
the same query name may refer to a plurality of individuals. Homonimity conflicts occur between a group
of individuals when they share part or all of their name. When these homonimity conflicts arise,
the word frequency of a specific name may reflect the number of references not to a unique person but to
an entire group. As such, the word frequency does not constitute a clear means of tracking the
fame of the concerned individuals. We identify homonimity conflicts by finding instances of individuals
whose names contain complete or partial matches. These conflicts are, when possible, resolved on the
basis of the importance of the conflicted individuals in the following step. Typical homonimity conflicts are
shown in Table S11.
7) Identify homonimity conflicts. Homonimity conflicts arise when the query names of two or more
individuals contain a substring match. These conflicts are identified as follows:
a. For every query name of every record, find the set of substrings of query names.
b. For every query name of every record, search for matches in the set of query name
substrings of all other records.
c. Bidirectional homonimity conflicts occur when a query name fully matches another query
name. The conflicted name could be used to refer to both individuals.
Unidirectional conflicts occur when a query name has a substring match within another
query name. Thus, the conflicted name can refer to one of the individuals, but also be
part of a name referring to another.
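Steps a to c can be sketched as follows. This is an illustrative Python sketch, not the original implementation: it assumes that a "substring" of a query name is any contiguous sequence of its whitespace-separated words, and it uses a hypothetical input layout (a dict mapping each record identifier to its set of query names).

```python
def word_substrings(name):
    """All contiguous word sequences of a name (assumed meaning of 'substring')."""
    words = name.split()
    return {
        " ".join(words[i:j])
        for i in range(len(words))
        for j in range(i + 1, len(words) + 1)
    }

def find_conflicts(records):
    """records: dict record_id -> set of query names.

    Returns a set of (record_id, query_name, other_id, kind) tuples, where
    kind is 'bidirectional' (full match) or 'unidirectional' (partial match).
    """
    # Step a: substrings of every query name of every record.
    substrings = {
        rid: {name: word_substrings(name) for name in names}
        for rid, names in records.items()
    }
    conflicts = set()
    # Step b: search each query name in the substrings of all other records.
    for rid, names in records.items():
        for name in names:
            for other_id, other_names in substrings.items():
                if other_id == rid:
                    continue
                for other_name, subs in other_names.items():
                    # Step c: classify the match.
                    if name == other_name:
                        conflicts.add((rid, name, other_id, "bidirectional"))
                    elif name in subs:
                        conflicts.add((rid, name, other_id, "unidirectional"))
    return conflicts
```

The pairwise comparison is quadratic in the number of query names; for a large database, an index from substring to record identifiers would likely be needed, but the classification logic would stay the same.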
III.7.A.8 — Resolve, when possible, the most likely origin of ambiguous names.
The problem of homonymous individuals is limiting because the word frequency data do not allow us to
resolve the true identity behind a homonymous name. Nonetheless, in some cases, it is possible to
distinguish conflicted individuals on the basis of their importance. For the database of people extracted
from Encyclopedia Britannica, we argue that the quantity of information available about an individual
provides a proxy for their relevance. Likewise, for people obtained from Wikipedia, we can judge their
importance by the size of the article written about the person and the quantity of traffic the article
generates. As such, we approach the problem of ambiguous names by comparing the notability of
individuals, as evaluated by the amount of information available about them in the respective
encyclopedic source. Examples of conflict resolution are shown in Tables S12 and S13.
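The comparison of notability can be illustrated with a short Python sketch. The text above does not give the exact decision rule, so the rule used here is an assumption: a conflicted name is attributed to the most notable individual only if that individual's notability score exceeds the runner-up's by a factor `min_ratio`, and the conflict is otherwise left unresolved. The function name, the `min_ratio` parameter and its default value, and the use of a single numeric score (for example, article length) are all hypothetical.

```python
def resolve_conflict(notability, min_ratio=10.0):
    """Pick the individual a conflicted name most likely refers to.

    notability: dict record_id -> notability score (e.g. article length).
    min_ratio:  hypothetical dominance factor, not taken from the source.
    Returns the winning record_id, or None if nobody clearly dominates.
    """
    ranked = sorted(notability.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (top_id, top_score), (_, second_score) = ranked[0], ranked[1]
    if second_score == 0:
        # Avoid dividing by zero: the top individual wins only with a positive score.
        return top_id if top_score > 0 else None
    return top_id if top_score / second_score >= min_ratio else None
```

Returning `None` mirrors the "when possible" qualifier: a conflict between individuals of comparable notability stays unresolved.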
8) Resolve homonimity conflicts.