Lee S. Jensen
James G. Shanahan
360 West 4800 North
Provo, UT 84604, USA
Church and Duncan Group Inc.
541 Duncan Street
San Francisco, CA 94131, USA
documents in multiple languages, spanning multiple centuries, from sources such as census, newspapers, vital records, and family histories. While current algorithms have made these collections accessible to the dedicated researcher, new advances are required in order to make family history research truly usable. This includes advances in fields such as information retrieval, information extraction, image processing, distributed systems, and record linkage.
The pervasive nature of the internet has caused a significant transformation in the field of genealogical research. It has not only changed how research is conducted but has also dramatically increased the number of people discovering their family history. Recent market research (Maritz Marketing 2000, Harris Interactive 2009) indicates that general interest in family history in the United States increased from 45% in 1996 to 60% in 2000 and 87% in 2009. This increased popularity has created a pressing need for improved algorithms for extracting, accessing, and processing genealogical data for use in building family trees.
This paper focuses on the task of labeling familial relationships in United States federal census data. The census data is a sequential listing, by household, of the population of the United States. In this task, the classification of one person in the census is strongly correlated with the correct classification of the preceding people. In fact, because familial relationships are defined relative to the head of the household, the classification label itself is relative to a single previous instance. Additionally, the discriminating features of the task are primarily derived by computing the difference, or similarity, between these two related instances.
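The pairwise feature idea described above can be illustrated with a minimal sketch. It is not the system used in this paper: the `Person` fields, the `household_id` column, the three example features, and the assumption that the head of household is listed first are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Person:
    household_id: int  # hypothetical field marking household membership
    surname: str
    age: int
    gender: str        # "M" or "F"


def segment_households(rows: List[Person]) -> List[List[Person]]:
    """Group a sequential listing into households, preserving row order."""
    households: List[List[Person]] = []
    for row in rows:
        if households and households[-1][0].household_id == row.household_id:
            households[-1].append(row)
        else:
            households.append([row])
    return households


def pair_features(head: Person, person: Person) -> Dict[str, object]:
    """Compare one household member with the head of the household."""
    return {
        "same_surname": head.surname.lower() == person.surname.lower(),
        "age_diff": head.age - person.age,
        "same_gender": head.gender == person.gender,
    }


def household_features(household: List[Person]) -> List[Dict[str, object]]:
    """Features for every non-head member, relative to the head."""
    head = household[0]  # assumption: the head is the first person listed
    return [pair_features(head, person) for person in household[1:]]


rows = [
    Person(1, "Smith", 40, "M"),
    Person(1, "Smith", 38, "F"),
    Person(1, "Smith", 12, "M"),
    Person(2, "Jones", 55, "F"),
    Person(2, "Brown", 23, "M"),
]
households = segment_households(rows)
features = [household_features(h) for h in households]
```

Each feature dictionary describes a person only in relation to the head, so a classifier trained on these features would be predicting a label such as spouse or child from differences and similarities rather than from the person's attributes alone.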
This paper presents one approach to algorithmic improvement in the family history domain, where we infer the familial relationships of households found in human-transcribed United States census data. By applying advances made in natural language processing, exploiting the sequential nature of the census, and using state-of-the-art machine learning algorithms, we were able to reduce the error by 35% relative to a hand-coded baseline system. The resulting system is immediately applicable to hundreds of millions of other genealogical records where families are represented, but the familial relationships are missing.
In many ways, labeling familial relationships in sequential data is similar to tasks found in processing sequences of natural language text. Both benefit from a segmentation of the data and a representation of the conditional dependency between sequential instances. Accurately representing the underlying relationships found in sequential data poses additional computational and representational challenges that are not always present in non-sequential data tasks.
Categories and Subject Descriptors
I.2.6 [Learning]: Knowledge acquisition
Many algorithmic simplifications made for reasons of representational complexity or availability have a greater impact when applied to sequential data. For example, the independence assumption made by the naive Bayes algorithm trades representational power for reduced computational complexity, at the cost of properly representing the conditional probability of the classification sequence. Likewise, the choice to generate only features that depend on the current instance’s features reduces the ability of the model to accurately represent the underlying distribution.
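The trade-off described above can be written out explicitly. The notation below is an assumption introduced for illustration and does not appear in the original text: $x_t$ is the feature vector of the $t$-th person in a household of $T$ people, $x_{t,j}$ its $j$-th feature, and $y_t$ the relationship label. The first line is the exact chain-rule factorization of the label sequence, the second is the per-instance naive Bayes approximation, and the third is the standard linear-chain conditional random field form, shown only as one common model that retains a dependency between adjacent labels.

```latex
\begin{align}
P(y_{1:T} \mid x_{1:T}) &= \prod_{t=1}^{T} P(y_t \mid y_{1:t-1},\, x_{1:T})
  && \text{(exact, by the chain rule)} \\
P(y_{1:T} \mid x_{1:T}) &\approx \prod_{t=1}^{T} P(y_t \mid x_t),
  \quad P(y_t \mid x_t) \propto P(y_t) \prod_{j} P(x_{t,j} \mid y_t)
  && \text{(naive Bayes, labels independent)} \\
P(y_{1:T} \mid x_{1:T}) &= \frac{1}{Z(x_{1:T})}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k\, f_k(y_{t-1}, y_t, x_{1:T}, t) \Big)
  && \text{(linear-chain CRF)}
\end{align}
```

In the second line every label is predicted without reference to any other label, which is the simplification the paragraph above warns about. In the third line the feature functions $f_k$ may inspect the previous label and the whole observation sequence, at the price of computing the normalizer $Z(x_{1:T})$.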
Keywords
Classification, CRF, genealogy, family history
The advent of the inexpensive...