A few weeks ago, as mentioned in an earlier post, I attended the Linked Data course at the Oxford Digital Humanities Summer School. As part of the course we wrote (or tried to write) some SPARQL queries. SPARQL is the query language used to interrogate RDF and was previously quite opaque to me. Now that I’ve done the course it makes a bit more sense but I really need to try out SPARQL on my own to see if I can get it to work.

You may know that there is a linked data version of Wikipedia, called DBpedia. It uses the information boxes found on most Wikipedia pages (because these are structured data) and publishes them as RDF. If you know what you’re doing with SPARQL you can derive all kinds of data from DBpedia; I don’t really know what I’m doing but I thought I’d write up my first query attempts here in case anyone finds them useful.

There are several SPARQL interfaces for DBpedia. The one I’ve been using is at http://dbpedia.org/snorql/. I like this interface because it declares prefixes for you (prefixes, or prefix namespaces, are really just aliases: they save you some typing). This was my first query:

SELECT * WHERE {
:Skipton ?b ?c
}
LIMIT 100

I’m asking for all the triples that have Skipton as a subject, or, to put it more bluntly, give me everything you’ve got on Skipton. Everything in RDF consists of three terms (a ‘triple’), so at least you have the advantage of knowing in advance of your query that there will be three possible fields to look at. I haven’t specified anything about the other two terms in the triple: ?b and ?c are just placeholders – they could be anything preceded by a question mark.

If you’re familiar with SQL then the SELECT statement will be an old friend. If you’re not, you just need to know that SELECT * means “show me every field”, and the WHERE clause adds “but only for triples whose first term (the subject) is Skipton”.
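To convince myself I understood what the placeholders are doing, I found it helpful to imitate the matching in a few lines of Python. This is only a rough sketch of the idea, not how DBpedia actually works: the three triples are invented for illustration (ex:country is a made-up term), and match is a hypothetical helper of my own.

```python
# Toy triples, invented for illustration; these are not real DBpedia results.
TRIPLES = [
    (":Skipton", "rdfs:label", "Skipton"),
    (":Skipton", "ex:country", ":England"),
    (":London", "rdfs:label", "London"),
]


def match(pattern, triples):
    """Return one dict of variable bindings per triple that fits the pattern.

    Terms starting with '?' are variables; anything else must match exactly.
    """
    results = []
    for triple in triples:
        bindings = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A repeated variable must bind to the same value each time.
                if bindings.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            # No term failed to match, so keep this set of bindings.
            results.append(bindings)
    return results


# Same shape as the query above: fixed subject, two placeholders.
rows = match((":Skipton", "?b", "?c"), TRIPLES)
for row in rows:
    print(row["?b"], row["?c"])
```

Only the two Skipton triples come back, each as a pair of values for ?b and ?c, which is roughly what the result table in the query interface shows.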

The : is the prefix that I mentioned earlier. The full form for Skipton in DBpedia is

<http://dbpedia.org/resource/Skipton>

(Note the capital S: these identifiers are case-sensitive.) So I could have written the query above as:

SELECT * WHERE {
<http://dbpedia.org/resource/Skipton> ?b ?c
}
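Since a prefix really is just an alias, the expansion can be sketched in a few lines of Python. Treat this as an illustration under assumptions rather than anything official: the mapping for : comes from the full form shown above, the mapping for dbpedia2 is my assumption about what the query page declares, and expand is a hypothetical helper name.

```python
# Prefix table. The ':' mapping comes from the full form shown in the post;
# the 'dbpedia2' mapping is an assumption about what the query page declares.
PREFIXES = {
    "": "http://dbpedia.org/resource/",
    "dbpedia2": "http://dbpedia.org/property/",
}


def expand(term, prefixes=PREFIXES):
    """Turn a prefixed name such as ':Skipton' into a full '<...>' identifier."""
    if term.startswith("?") or term.startswith("<"):
        return term  # variables and full identifiers are left alone
    prefix, sep, local = term.partition(":")
    if not sep or prefix not in prefixes:
        raise ValueError(f"unknown prefix in {term!r}")
    return f"<{prefixes[prefix]}{local}>"


print(expand(":Skipton"))
```

The point is simply that the short and long forms name the same thing, so either can be used in a query.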

The results for this query aren’t very interesting, but this broad approach is a useful tactic when first querying a dataset you don’t know: it gives you a sense of the ontologies and terms being used. You need to know these terms to be able to perform precise queries.

For example, looking through the triples about Skipton I can see that the term for being born in Skipton (or anywhere else) is dbpedia2:birthPlace. If I add that to my query, with Skipton now in the object position, I get a list of everyone born in Skipton (provided they have an entry in Wikipedia):

SELECT * WHERE {
?b dbpedia2:birthPlace :Skipton
}

Since I only get six results for people born in Skipton, I’ll now change the query to return people born in London. You can combine searches in SPARQL by simply listing them, separated with punctuation. Full triples are separated with a full stop. So I can get a list of women writers born in London with this query:

SELECT ?name WHERE {
?name ?c <http://dbpedia.org/class/yago/EnglishWomenWriters> .
?name dbpedia2:birthPlace :London .
}

The two patterns are pulled together by the shared variable ?name: only values that satisfy both lines are returned. You can see that I can easily add more lines to the query to refine it further. But at the moment I’m just pleased that I can get a simple SPARQL query to work.
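For what it’s worth, this is how I picture the shared variable doing its job, sketched in Python. Everything here is made up for illustration: the names are placeholders rather than real DBpedia results, and join is a hypothetical helper, not part of any SPARQL library.

```python
# Invented sample data: the names are placeholders, not real DBpedia results.
writers = [{"?name": ":Writer_A"}, {"?name": ":Writer_B"}]
born_in_london = [{"?name": ":Writer_B"}, {"?name": ":Person_C"}]


def join(left, right):
    """Combine two lists of bindings, keeping pairs that agree on shared variables."""
    joined = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[var] == r[var] for var in shared):
                # Merge the two rows into one set of bindings.
                joined.append({**l, **r})
    return joined


result = join(writers, born_in_london)
print(result)
```

Only the name that appears in both lists survives, which is the effect of using ?name in both lines of the query.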