Citing Your Sources on the Web

I was involved in the World Wide Web Consortium (W3C) Provenance Working Group, which was an amazing experience, even though I couldn’t put as much time into it as I would have liked. My friend and collaborator, Tim Lebo, edited the Provenance Ontology (PROV-O). PROV-O is, in my narrow perspective of the world, a fantastic foundation for talking about how stuff happens and, most importantly to this post, how to cite people and resources on the web.

Using PROV-O that way requires a little tutorial on an extension to HTML called RDFa. RDFa stands for Resource Description Framework in Attributes, and is a way to embed a knowledge graph inside HTML documents. Think of a series of web pages as a bunch of nodes in a graph, where the hyperlinks in each page are links between the nodes. By itself, that’s really powerful. It’s what’s made the web run so well so far. However, imagine if one could add labels to those links, so that, for instance, I could say that the quote

RDFa is a thin layer of markup you can add to your web pages that makes them understandable for machines as well as people.

came from the page http://www.w3.org/MarkUp/2009/rdfa-for-html-authors, and that Steven Pemberton wrote it. With plain links (and without human interpretation), there’s no way to know how I’m using those two links; I just link out to Steven’s home page and to a tutorial. But if we add labels to those links, suddenly things get much more interesting. For instance, authors can get credit for any citation or reuse of their work, whether it be a conventional quote or a (hopefully) authorized syndication of the entire content. Together, RDFa and PROV-O can make this happen.

To do this, all we have to do is add a couple of new attributes to our HTML. This is something that can be done in many HTML editors, and is trivial for us techies who know how to write HTML by hand. I’m going to show the actual HTML, because the changes are that simple. Let’s start by making a nice blockquote and citation in normal HTML for the above quote:

RDFa is a thin layer of markup you can add to your web pages that makes them understandable for machines as well as people.

- Steven Pemberton, in “RDFa for HTML Authors”

The HTML for this looks like this:

<blockquote>
 <p>
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a href="http://homepages.cwi.nl/~steven/">Steven Pemberton</a>, in
  "<a href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>

There’s some CSS styling of the citation part to be right-aligned, etc., but it’s pretty basic HTML. The first thing we need to do is say what language (or vocabulary) we are using. We can do that by saying that whenever we say “prov:” we are speaking using the PROV-O vocabulary. The attribute prefix will let us say that, and we can add it to the blockquote:

<blockquote prefix="prov: http://www.w3.org/ns/prov#">
 <p>
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a href="http://homepages.cwi.nl/~steven/">Steven Pemberton</a>, in
  "<a href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>

A prefix is shorthand for the leading part of a URI that the terms in a vocabulary share. There are tools like Prefix.cc and Linked Open Vocabularies that will let you look for particular vocabularies and figure out what prefixes to use with them. By itself, the prefix declaration doesn’t add any knowledge to the web page; to do that we need to add some RDF statements.
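To make the prefix idea concrete, here is a minimal Python sketch of how a prefixed name like prov:value could be expanded into a full URI. The function name expand_curie and the PREFIXES dictionary are my own illustrative choices, not part of any RDFa library.

```python
# Hypothetical helper: expand a prefixed name using a prefix map.
# The only prefix declared here is the one used in this post.
PREFIXES = {"prov": "http://www.w3.org/ns/prov#"}


def expand_curie(name, prefixes=PREFIXES):
    """Return the full URI for a prefixed name such as 'prov:value'."""
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    # Unknown prefix (or already a full URI): leave the value untouched.
    return name


print(expand_curie("prov:value"))
# prints http://www.w3.org/ns/prov#value
```

A full URI passes through unchanged, because "http" is not one of the declared prefixes.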

RDF statements are very similar to English statements: they are composed of subjects, predicates, and objects. Thinking back to our web graph, the subject is the source node of the link, the predicate is the link’s label, and the object is the target node. We can set the subject of an HTML element using the attribute resource, which carries down to any child elements that haven’t set their own resource. If we don’t set a resource, RDFa assumes that the subject of any statements will be the URL for the web page itself. Let’s add a subject to the blockquote, so that we can identify the quote itself:

<blockquote prefix="prov: http://www.w3.org/ns/prov#"
            resource="#rdfaQuote">
 <p>
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a href="http://homepages.cwi.nl/~steven/">Steven Pemberton</a>, in
  "<a href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>

Note that I used a local URI using # to make this particular RDFa quote different from others that might appear on different web pages. Further, we don’t have much information about the quote itself, except that it exists. Let’s give it a prov:value, which is PROV’s way of marking some literal value (text, numbers, dates, etc.) as a direct representation of something. We should add it to the p element of the quote text itself:

<blockquote prefix="prov: http://www.w3.org/ns/prov#"
            resource="#rdfaQuote">
 <p property="prov:value">
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a href="http://homepages.cwi.nl/~steven/">Steven Pemberton</a>, in
  "<a href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>

There, we’ve added our first RDF statement. This would render into Turtle (a more direct way of writing RDF) as:

@prefix prov: <http://www.w3.org/ns/prov#> .
<#rdfaQuote> prov:value
  """RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.""" .
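
If you want to see roughly how a machine gets from the HTML to that statement, here is a toy sketch in Python using only the standard library. It is far from a conforming RDFa processor: the class name TinyRDFa and the page URL http://example.org/post are assumptions for illustration, and it only understands prefix, resource, and property (with the link target as the value when the element also has an href).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Elements with no closing tag; they must not be pushed on the stack.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class TinyRDFa(HTMLParser):
    """Toy extractor. Not handled: typeof, about, rel, content, datatype, vocab."""

    def __init__(self, base):
        super().__init__()
        self.base = base
        self.prefixes = {}
        # One frame per open element: its subject, plus the property URI
        # and collected text when the element carries a literal value.
        self.stack = [{"subject": base, "property": None, "text": []}]
        self.triples = []

    def expand(self, name):
        prefix, sep, local = name.partition(":")
        if sep and prefix in self.prefixes:
            return self.prefixes[prefix] + local
        return name

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        if "prefix" in attrs:
            # prefix="prov: http://..." holds alternating names and URIs.
            parts = attrs["prefix"].split()
            for name, uri in zip(parts[0::2], parts[1::2]):
                self.prefixes[name.rstrip(":")] = uri
        subject = self.stack[-1]["subject"]  # inherited from the parent
        prop = attrs.get("property")
        if prop is None and "resource" in attrs:
            subject = urljoin(self.base, attrs["resource"])
        elif prop is not None and "href" in attrs:
            # A labeled link: the link target is the object.
            target = urljoin(self.base, attrs["href"])
            self.triples.append((subject, self.expand(prop), target))
            prop = None
        self.stack.append({
            "subject": subject,
            "property": self.expand(prop) if prop else None,
            "text": [],
        })

    def handle_data(self, data):
        for frame in self.stack:
            if frame["property"]:
                frame["text"].append(data)

    def handle_endtag(self, tag):
        if tag in VOID or len(self.stack) == 1:
            return
        frame = self.stack.pop()
        if frame["property"]:
            # Whitespace is collapsed here for readability; a real
            # processor keeps the literal exactly as written.
            literal = " ".join("".join(frame["text"]).split())
            self.triples.append((frame["subject"], frame["property"], literal))


html = """
<blockquote prefix="prov: http://www.w3.org/ns/prov#" resource="#rdfaQuote">
 <p property="prov:value">
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
</blockquote>
"""

parser = TinyRDFa("http://example.org/post")  # assumed page URL
parser.feed(html)
for triple in parser.triples:
    print(triple)
```

Run against the snippet above, it produces a single (subject, predicate, object) tuple whose subject is the page URL followed by #rdfaQuote.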

Note that we haven’t labeled any of the links we already had. Let’s do that now, by saying that we can attribute the quote to Steven Pemberton, and that it was quoted from the RDFa for HTML Authors web page. Attribution of something to what PROV calls Agents is done using prov:wasAttributedTo. We don’t need to say that Steven is an agent: PROV-O declares that anything on the receiving end of prov:wasAttributedTo is a prov:Agent, so a reasoner can infer it simply because something was attributed to him. Quotations can be linked to their source material using prov:wasQuotedFrom. PROV has a number of other citation-related properties, such as prov:hadPrimarySource, prov:wasRevisionOf, and prov:wasDerivedFrom. We can add these labels to our links, again using the property attribute:

<blockquote prefix="prov: http://www.w3.org/ns/prov#"
            resource="#rdfaQuote">
 <p property="prov:value">
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a property="prov:wasAttributedTo"
     href="http://homepages.cwi.nl/~steven/">
   Steven Pemberton
  </a>
  , in
  "<a property="prov:wasQuotedFrom"
      href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>
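
The inference about Steven being an Agent can be sketched in a few lines of Python. This is only an illustration of a single range rule, not a real reasoner; the function name infer_types and the page URL in the example are assumptions of mine.

```python
PROV = "http://www.w3.org/ns/prov#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# PROV-O declares prov:Agent as the range of prov:wasAttributedTo.
RANGES = {PROV + "wasAttributedTo": PROV + "Agent"}


def infer_types(triples, ranges=RANGES):
    """Apply the range rule: the object of a property gets the range's type."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate in ranges:
            inferred.add((obj, RDF_TYPE, ranges[predicate]))
    return inferred


# The attribution statement from the markup above (page URL is assumed).
stated = [
    ("http://example.org/post#rdfaQuote",
     PROV + "wasAttributedTo",
     "http://homepages.cwi.nl/~steven/"),
]

for triple in infer_types(stated):
    print(triple)
```

The one inferred statement says that http://homepages.cwi.nl/~steven/ is a prov:Agent, even though the page never said so directly.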

We can also say that Steven is a prov:Person using the attribute typeof, and that we can use “Steven Pemberton” as a label for him using the rdfs:label property (rdfs is one of RDFa’s predefined prefixes). We need to set a new subject though. Because an element that has both property and resource treats the resource as the value of the property, we set the subject with the about attribute instead:

<blockquote prefix="prov: http://www.w3.org/ns/prov#"
            resource="#rdfaQuote">
 <p property="prov:value">
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a property="prov:wasAttributedTo"
     href="http://homepages.cwi.nl/~steven/">
   <span about="http://homepages.cwi.nl/~steven/"
         typeof="prov:Person"
         property="rdfs:label">Steven Pemberton</span>
  </a>
  , in
  "<a property="prov:wasQuotedFrom"
      href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>

We’ve now used the core RDFa attributes, but there is one more statement that needs to be added to the graph: that this document as a whole was derived from the quote. We should do this outside of the blockquote, as we have set a resource inside it. We can do this by adding an a element just before the blockquote with no text inside, and giving it a property:

<a property="http://www.w3.org/ns/prov#wasDerivedFrom"
   href="#rdfaQuote"></a>
<blockquote prefix="prov: http://www.w3.org/ns/prov#"
            resource="#rdfaQuote">
 <p property="prov:value">
  RDFa is a thin layer of markup you can add to your
  web pages that makes them understandable for
  machines as well as people.
 </p>
 <p style="text-align: right;">-
  <a property="prov:wasAttributedTo"
     href="http://homepages.cwi.nl/~steven/">
   <span about="http://homepages.cwi.nl/~steven/"
         typeof="prov:Person"
         property="rdfs:label">Steven Pemberton</span>
  </a>
  , in
  "<a property="prov:wasQuotedFrom"
      href="http://www.w3.org/MarkUp/2009/rdfa-for-html-authors">
   RDFa for HTML Authors
  </a>"
 </p>
</blockquote>

Note that I didn’t use a prefix here, partially to demonstrate using URIs directly, but also because we have not set the prefix at this level. Planning ahead, we could have wrapped both elements in a div that would have had the prefix attribute, but this will work just fine without it. The HTML looks exactly the same to humans, but it contains structured information that can be used by computers to do interesting things:

@prefix prov: <http://www.w3.org/ns/prov#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<> prov:wasDerivedFrom <#rdfaQuote> .
<#rdfaQuote> prov:value
    """RDFa is a thin layer of markup you can add to your
   web pages that makes them understandable for
   machines as well as people.""" ;
  prov:wasAttributedTo <http://homepages.cwi.nl/~steven/> ;
  prov:wasQuotedFrom <http://www.w3.org/MarkUp/2009/rdfa-for-html-authors> .
<http://homepages.cwi.nl/~steven/> a prov:Person ;
  rdfs:label "Steven Pemberton" .
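
As one small example of an “interesting thing,” here is a Python sketch that turns those statements into a formatted citation. The triples are written out by hand as tuples, the page URL http://example.org/post is assumed, and the helper names one and cite are hypothetical.

```python
PROV = "http://www.w3.org/ns/prov#"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

QUOTE = "http://example.org/post#rdfaQuote"  # assumed page URL
STEVEN = "http://homepages.cwi.nl/~steven/"
SOURCE = "http://www.w3.org/MarkUp/2009/rdfa-for-html-authors"

# The statements from the Turtle above, as (subject, predicate, object).
triples = [
    (QUOTE, PROV + "value",
     "RDFa is a thin layer of markup you can add to your web pages "
     "that makes them understandable for machines as well as people."),
    (QUOTE, PROV + "wasAttributedTo", STEVEN),
    (QUOTE, PROV + "wasQuotedFrom", SOURCE),
    (STEVEN, RDFS_LABEL, "Steven Pemberton"),
]


def one(subject, predicate):
    """Return the first object for a subject/predicate pair, or None."""
    for s, p, o in triples:
        if s == subject and p == predicate:
            return o
    return None


def cite(quote):
    """Build a plain-text citation for a quote from the graph."""
    text = one(quote, PROV + "value")
    author = one(quote, PROV + "wasAttributedTo")
    # Fall back to the author's URI when no label is available.
    name = one(author, RDFS_LABEL) or author
    source = one(quote, PROV + "wasQuotedFrom")
    return f'"{text}" ({name}, {source})'


print(cite(QUOTE))
```

The same lookup would work for any quote marked up this way, which is what lets a tool give authors credit automatically.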

Because WordPress.com is stripping out the RDFa from this post, I’ve made a separate page at RPI’s Tetherless World. You can view the embedded RDF using my RDF graph visualizer.

Since it’s so unobtrusive, and other JavaScript tools can take advantage of the embedded information, this should be the go-to way to quote sources in journalism, whether they are people, other web pages, or, in the case of data-driven journalism, the source data used to create the story.
