4 Reasons Why Semantics Help Make Biobanks Better

My first blog post at 5AM is up:

The Semantic Web provides a means to link pieces of information on the web to each other, and to things in real life, in an interoperable way. Internationalized Resource Identifiers (IRIs), of which URLs are a type, are used to identify nearly everything, and linked data makes it possible to visit those URLs to get more information about the things they identify. This has some very useful applications, especially in biobanking. Semantics was practically made for biomedical research, and here are 4 ways in which that relationship can help make biobanks better information resources…

Read more at http://info.5amsolutions.com/blog/bid/152921/4-Reasons-Why-Semantics-Help-Make-Biobanks-Better.

Validating RDF

I spent the day listening in on the RDF Validation Workshop, which kind of spilled over into the Cambridge Semantic Web Meetup. Here are my general musings and notes from the day. They may be completely wrong for many reasons, including possible misunderstandings of what the speaker said and information that is now out of date.

Google Says They’re Triplifying the Web

The Google Knowledge Graph creates a need to support users who provide rich snippets for Google search results. Google is building a validator for these formats against the RDF representations of their microdata. Most of their constraints are property paths, and they use SPARQL for the rest. They are mostly concerned with suitability to their own purposes, namely rich snippets and the Knowledge Graph. They are prototyping the SPARQL-based constraints with RDFLib, but will be moving to their own parser, which is the one used by the Structured Data Testing Tool. Here is an example path constraint, with the resulting SPARQL queries that are generated from it:

schema:reservationFor/schema:flightNumber
SELECT ?context WHERE {?context schema:flightNumber ?constraint.}
ASK WHERE {?context schema:flightNumber ?constraint.}

Currently, they only validate things that are required; they won't check for things that are optional.

Semantic Web Meetup

The general idea seems to be that the RDF community needs to provide a means to say the following things about RDF graphs:

  • The graph must at least contain X.
  • The graph must contain at most Y.
  • The graph can never contain Z.

The proposed approach seems to be to provisionally close any given RDF graph before validation in order to produce the report. That closure can include some fixed set of other graphs (such as the vocabularies used), but ultimately, for the purposes of validation, the Unique Name Assumption and the Closed World Assumption need to be applied to the graph as given. Eric Prud'hommeaux presented an interesting framework based on YACC-style grammars, which provides "shapes" of objects to validate. This is similar to OSLC's (Open Services for Lifecycle Collaboration) Resource Shape vocabulary, but with additional capabilities around disjunction and non-declarative validation processes.

Citing Your Sources on the Web

I was involved in the World Wide Web Consortium (W3C) Provenance Working Group, which was an amazing experience, even though I couldn’t put as much time into it as I would have liked. My friend and collaborator, Tim Lebo, edited the Provenance Ontology (PROV-O). PROV-O is, in my narrow perspective of the world, a fantastic foundation for talking about how stuff happens and, most importantly to this post, how to cite people and resources on the web.

Continue reading

Getting Up Early in the Morning

Well, sort of. I have some exciting news: in August I will be starting at 5AM Solutions as a data scientist. I’ll be finishing my time at Yale University with Michael Krauthammer, and will soon be wrapping up my computer science Ph.D. at Rensselaer Polytechnic Institute in the Tetherless World Constellation. Continue reading

Thanksgiving Science!

I’ve got a little formula that predicts how long it will take for our Thanksgiving turkey to cook. It works really well for our temperatures and preparation, but I’d like to make it a little more general so everyone else can use it, regardless of temperature. As a wise man once said, if it’s worth doing, it’s worth overdoing, unless you’re overcooking turkey.

Towards that end, and because this is a science blog, I would like to perform a hypothesis-generating experiment. If you’re willing to further science, please share some details about how you prepared your turkey, and how it turned out. Humanity will thank you. Turkeys will not. I will post the results when I can, and maybe we can try again next year for a full prediction.

Click here to share your turkey data.

Firetruck: In which I write and record my first song…

Ian asked me for a firetruck song tonight for bedtime songs. I thought maybe there might be some kind of melody hiding in the fire truck siren sound, so I started there and half-assed my way through a melody. After he went down, I thought I might have enough to make a “real” song. It only has one chord progression and melody, due to it being a kid’s song and the first song I’ve ever written, but I found a surprisingly good loop to match against, and it kind of came out as a ballad for firefighters from a kid’s perspective.

Like I said, be gentle, it’s supposed to be a little lame. And I’ve never recorded a song before either.

https://dl.dropbox.com/u/9752413/songs/Firetruck.m4a

Here are the lyrics:
https://dl.dropbox.com/u/9752413/songs/firetruck.txt

I’m releasing the recording and the lyrics as Creative Commons Share-Alike (http://creativecommons.org/licenses/by-sa/3.0/).

Data Processing with Python: Part 2

As I’ve said, I’ve been doing tons and tons of tabular data manipulation using Python in the past few years, and I’m sharing some of the patterns I’ve developed. Please look at Part 1 to see some of the more basic stuff, and review the rules of the road. Below the fold, we will be talking about filtering data by column and row and doing processing without loading the whole file into memory.
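In the spirit of that post, here is a sketch of one such pattern: a generator that filters a tab-delimited file row by row, so only the matching rows (and only the requested columns) are ever held in memory. The file name, column names, and predicate here are hypothetical.

```python
import csv

def filtered_rows(path, keep_columns, row_predicate):
    """Yield dicts containing only keep_columns, for rows passing row_predicate."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row_predicate(row):
                yield {col: row[col] for col in keep_columns}

# Usage: stream the id and score columns for rows whose score exceeds 0.5.
# for row in filtered_rows("data.tsv", ["id", "score"],
#                          lambda r: float(r["score"]) > 0.5):
#     print(row)
```

Because it is a generator, the loop pulls one row at a time off disk, so a multi-gigabyte file costs no more memory than a tiny one.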

Continue reading