While this document continues to be useful, it is out of date.  You can track current progress on my research blog.

Genealogy Provenance Tutorial


Working Draft


The Genealogy Provenance (GP) is an OWL ontology used for tracking the provenance of genealogical information. (Note: the actual ontology has not yet been defined. Its use in this document is only suggestive.) More importantly, it is a method for documenting provenance data in RDF. The RDF standard provides for reification, which is the ability to take a statement (subject-predicate-object) and treat it as the subject or object of another statement. Ideally this would work well for annotating genealogical information with provenance data. However, reification has a number of well-known problems; for one, it is verbose, since describing a single statement takes four additional triples.

Instead, the GP uses named graphs to track provenance. An RDF dataset consists of a primary graph (called the default graph in SPARQL) and zero or more named graphs. These named graphs are themselves resources and can therefore be described by statements in the same or other graphs. All of the information from a given source is captured in a single graph. Provenance chains with well-defined semantics can easily be created with this mechanism.
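To make the named-graph idea concrete, here is a minimal sketch in plain Python rather than RDF. It assumes a dataset can be modelled as a dictionary from graph name to a set of triples, with `None` standing for the primary graph. The names `#sourceA`, `#person-1`, `ex:name`, and `ex:Source` are made up for illustration and are not part of GP.

```python
# Hypothetical model: a dataset maps a graph name to a set of triples.
# The key None stands for the primary (unnamed) graph.
PRIMARY = None

dataset = {
    PRIMARY: set(),
    "#sourceA": set(),
}

# Statements taken from a source live inside that source's named graph.
dataset["#sourceA"].add(("#person-1", "ex:name", "Example Person"))

# The graph name is itself a resource, so the primary graph can describe it.
dataset[PRIMARY].add(("#sourceA", "rdf:type", "ex:Source"))


def graphs_stating(dataset, subject):
    """Return the names of all graphs that contain a statement about `subject`."""
    return {
        name
        for name, triples in dataset.items()
        if any(s == subject for s, _, _ in triples)
    }
```

Asking `graphs_stating(dataset, "#person-1")` tells us which source a statement came from, and asking the same question about `"#sourceA"` tells us where that source is itself described.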

This document introduces and demonstrates the use of GP.  It builds on the Genealogy Core Tutorial, which you should read first if you haven't already.


In the Genealogy Core Tutorial we built a graph of information on Rachel Anderson, her birth mother, and her adoptive parents.  We avoided citing any sources to keep it simple, but as any good genealogist knows, you don't know anything until you know where it came from.

It turns out that we can find the information about Rachel's birth from her birth certificate.  Let's store everything we gleaned from her birth certificate in a graph called <#birthCertificateOfRachel>:

As in the previous tutorial, we refer to this graph as <#birthCertificateOfRachel> to make the discussion clear, even though any arbitrary URI (such as <http://www.example.com/genealogy#birthCertificate-218> or <urn:uuid:138905bf-52c3-4498-84d3-070ce7e261e2>) would work just as well.  Also, notice that Rachel's surname was Matthews on the birth certificate, even though it later changed to Anderson with her adoption.

We would like to give more information about the birth certificate, which we can do by describing the <#birthCertificateOfRachel> graph in the primary graph:

This primary graph holds any information that we are asserting ourselves.  In this example, we are asserting that there is a graph called <#birthCertificateOfRachel> which contains information from a birth certificate (specifically birth certificate number 5260, issued by the Kalamazoo County Clerk on October 26, 1919).
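As a rough illustration of describing a graph from within the primary graph, the following Python sketch records the certificate details mentioned above. The dictionary-of-sets model and the property names (`gp:BirthCertificate`, `gp:certificateNumber`, `gp:issuedBy`, `gp:issuedOn`) are assumptions made for this example; the GP ontology itself has not been defined.

```python
from datetime import date

PRIMARY = None  # key for the primary graph in this toy model

dataset = {PRIMARY: set(), "#birthCertificateOfRachel": set()}

# Our own assertions about the source go in the primary graph.
# All gp:* names below are hypothetical placeholders.
dataset[PRIMARY] |= {
    ("#birthCertificateOfRachel", "rdf:type", "gp:BirthCertificate"),
    ("#birthCertificateOfRachel", "gp:certificateNumber", "5260"),
    ("#birthCertificateOfRachel", "gp:issuedBy", "Kalamazoo County Clerk"),
    ("#birthCertificateOfRachel", "gp:issuedOn", date(1919, 10, 26)),
}


def describe(dataset, graph_name, described_in=PRIMARY):
    """Collect the predicate/object pairs that `described_in` states about a graph."""
    return {p: o for s, p, o in dataset[described_in] if s == graph_name}


description = describe(dataset, "#birthCertificateOfRachel")
```

The contents of the certificate (Rachel's birth, her surname Matthews, and so on) would go into the `#birthCertificateOfRachel` set itself, keeping what the source says separate from what we say about the source.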

We remember hearing that Sarah was divorced, but we don't have a source for it, so we just throw it in the primary graph as well:

The rest of the information we received in a GEDCOM file from our good friend Mike Davis.  Most of the information isn't cited, but the adoption and death events each have a source.  First let's put the information on Rachel's adoption in the <#adoptionCertificateOfRachel> graph:

We then record information on the adoption certificate in the <#gedcomFromMike> graph:

Information on Naomi's death will be placed in the <#deathCertificateOfNaomi> graph:

Information about the death certificate itself will also be placed in the <#gedcomFromMike> graph:

The rest of the information without sources is thrown into the <#gedcomFromMike> graph as well:

We then record Mike's submitter information in the primary graph:

You will probably have noticed that the URI of each individual is different in each of the graphs.  This is because we cannot blindly assume that the individual referred to in one source is the same individual in another.  These assertions must be made explicitly.  In Mike's GEDCOM, both <#travis-3> and <#travis-5> refer to the same individual, as do <#naomi-3>, <#naomi-4>, and <#naomi-5>.  We assert these facts in the <#gedcomFromMike> graph like so:

We also happen to know that <#sarah-1> and <#sarah-2> are the same person, as are <#rachel-1> and <#rachel-3>.  As we have no source for this knowledge, we throw it in the primary graph:
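A minimal sketch of how such same-individual assertions could be resolved into groups, using a small union-find in Python. It assumes the assertions are recorded as pairs of URIs, keyed by the graph that states them; the pairing of the three Naomi URIs into two links is one possible encoding, not necessarily the one GP would use.

```python
# Same-individual assertions, keyed by the graph that makes them.
# Recording them as pairs is an assumption of this sketch.
same_as = {
    "#gedcomFromMike": [
        ("#travis-3", "#travis-5"),
        ("#naomi-3", "#naomi-4"),
        ("#naomi-4", "#naomi-5"),
    ],
    "(primary graph)": [
        ("#sarah-1", "#sarah-2"),
        ("#rachel-1", "#rachel-3"),
    ],
}


def merge_individuals(assertions):
    """Group URIs that are asserted, directly or transitively, to be the same person."""
    parent = {}

    def find(x):
        # Register unseen URIs as their own root, then walk up with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for pairs in assertions.values():
        for a, b in pairs:
            parent[find(a)] = find(b)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return sorted(sorted(group) for group in groups.values())


individuals = merge_individuals(same_as)
```

Because the assertions stay attached to the graph that made them, we could later drop everything from `#gedcomFromMike` and recompute the groups if Mike's file turned out to be unreliable.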

Our set of named graphs ends up looking something like this:
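As a rough sketch of that layout, the nesting described in this tutorial can be written down as a mapping from each graph to the graph that describes it, and then walked to obtain a provenance chain. The mapping mirrors the text above; the `PRIMARY` label and the function name are inventions of this example.

```python
PRIMARY = "(primary graph)"

# described_in[g] names the graph that holds the statements describing g.
described_in = {
    "#birthCertificateOfRachel": PRIMARY,
    "#adoptionCertificateOfRachel": "#gedcomFromMike",
    "#deathCertificateOfNaomi": "#gedcomFromMike",
    "#gedcomFromMike": PRIMARY,
}


def provenance_chain(graph_name):
    """Follow the 'described in' links from a source graph up to the primary graph."""
    chain = [graph_name]
    seen = {graph_name}
    while chain[-1] != PRIMARY:
        parent = described_in[chain[-1]]
        if parent in seen:
            raise ValueError("cycle in provenance chain at " + parent)
        chain.append(parent)
        seen.add(parent)
    return chain


adoption_chain = provenance_chain("#adoptionCertificateOfRachel")
```

Here `adoption_chain` passes through `#gedcomFromMike` before reaching the primary graph, which captures the fact that we know about the adoption certificate only through Mike's file.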

The complete graph is available in both TriX and TriG serializations.

Next is the Genealogy Trust Tutorial, which introduces a trust model for genealogical information.
