I spent most of yesterday attending a workshop on Cultural Heritage and the Semantic Web at the British Museum. Unfortunately I had to miss the final two papers and the closing panel, so I’m not in a position to offer an overall summary. But certain themes and common notes kept arising in each of the six talks I did manage to catch, and these are worth commenting on in themselves, because they mark, to my mind, a new (and messier, and more interesting) direction for the Semantic Web than the one most frequently outlined in the past.
If there was one point hit hard by every speaker, it’s that Semantic Web content has at long last passed its tipping point: which is to say, there is now enough Linked Data available for it to be worth someone’s time to go looking for it. For the better part of a decade the Semantic Web community has been bemoaning the lack of usable and useful semantic data online; but now, thanks in large part to the various governmental open data initiatives, as well as to projects like Freebase and DBpedia, that problem has largely been solved. Indeed, John Sheridan (of legislation.gov.uk) even outlined a workflow his organisation already has in place for automatically publishing government data in Linked form.
There is, however, a rub, and it lies in the way this critical mass has been achieved: by favouring volume over structure. As Professor Wendy Hall, the event’s keynote speaker, put it, the Semantic Web, for the first four years of its existence, disappeared down an ‘AI rathole’: its chief enthusiasts were all more interested in solving abstract problems of computation and artificial intelligence than in building something workable and useful. This line of thinking was finally abandoned in favour of a “scruffy” approach inspired by the growth of the original WWW, one which prioritised the availability of data over neat ontological form. Unimaginable millions of RDF statements later, it’s an approach that can be said to have worked.
The problem, given the nature of the Semantic Web, is obvious: once you drop the AI, where does the I come from? The whole idea of the thing, as I understand it, was that inherent semantics would, by their very nature, enable intelligent retrieval, integration, and reasoning over SW data, and that such operations could therefore readily be undertaken by machines. Lowering the semantic weight and coherence of the data would seem to undermine this vision significantly; and so it proved to be. As far as I could tell from the presentations I saw, much of the burden of making sense of the semantics of the Semantic Web has been shifted back from machines to human agents.
At one end of the process we had speakers like Hugh Glaser (Seme4) and Atanas Kiryakov (Ontotext), both of whom develop core semantic technology, and both of whom undertake some form of data curation on the Semantic Web in order to do so. Indeed, the whole point of Ontotext’s FactForge is to provide what Kiryakov terms a ‘reason-able’ view of the SW’s data: in essence, a managed subset of the Semantic Web, with significant processing performed upon it in order to guarantee its coherence and computability.
This approach has a couple of obvious problems. First, it’s clearly not going to scale in any straightforward way. Second, even with all that processing, it’s still capable of returning nonsensical results. To illustrate the point, Kiryakov queried the FactForge integrated dataset to discover who the most popular German entertainer might be. The answer, as it turned out, was Friedrich Nietzsche, apparently because MusicBrainz (included in the FactForge ‘reason-able’ view) records that he was, in addition to being a philosopher, a pianist.
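To make the failure mode concrete, here is a minimal sketch of the sort of query that can go wrong in this way. The endpoint URL is a placeholder, and the choice of DBpedia-style properties is my own illustrative assumption; this is not Kiryakov’s actual query, nor FactForge’s actual schema. The point is simply that a structurally impeccable query will happily return Nietzsche as a German musical artist if that is what the underlying data assert.

```python
# Hypothetical sketch: querying a SPARQL endpoint for 'German musical
# artists'. The endpoint URL and the property choices are assumptions
# for illustration, not FactForge's real interface.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://example.org/sparql"  # placeholder, not a real endpoint

QUERY = """
PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?person ?label
WHERE {
  ?person a dbo:MusicalArtist ;      # MusicBrainz-derived typing: this is
          dbo:birthPlace ?place ;    # where a pianist-philosopher slips in
          rdfs:label ?label .
  ?place dbo:country <http://dbpedia.org/resource/Germany> .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""

sparql = SPARQLWrapper(ENDPOINT)
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"])
```

What redeems the absurd answer is that each triple pattern names the assumption it encodes, so the chain of reasoning can be inspected and corrected.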
This kind of absurdity is hardly fatal to the whole enterprise in itself. For what it’s worth, no matter how I formulate the search terms for ‘most popular German entertainer’, Google at this moment is determined to tell me about the “Wetten, dass…?” television show; and the form of the SPARQL query at least gives me a chain of logic I can trace back to work out where I’ve gone wrong. But the difficulty remains that the responsibility for working out the semantics of all this lies with the end (human) user. There’s thus plenty of room for the SW sceptic (after ten years of the SW sputtering along, there’s no shortage of them) to ask whether this is much of an advance on the web as it stands. After all, the cognitive heavy lifting is still largely being done by the user. Does it matter much, then, whether the process underneath is truly semantic or simply string-searching?
The weak response to this would, I guess, be that very often we have no choice. As Jonathan Whitson Cloud of the British Museum pointed out in passing, even relatively common heritage-sector problems, such as a multiplicity of identifiers for the same object, utterly confound string-based search (a sketch of the problem follows below). The alternative to a Semantic Web solution would be some kind of bespoke data-integration application; and, given that the work of people like Kiryakov and Glaser has shown the former to be perfectly viable, it would generally seem preferable to simply tunnelling between silos. Similarly, the domain John Sheridan is working in seems to favour semantic over string solutions, while the biomedical community around FactForge’s sister site, Linked Life Data, must have perceived a need for very large-scale data integration beyond anything the ordinary WWW provides.
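As a toy illustration of the identifier problem: suppose a single object has acquired two URIs over its institutional life. Everything below (the URIs, the label, the acquisitionYear property) is invented for the example; the sketch simply shows how an owl:sameAs link lets us recover all the facts about the object from either identifier, which no amount of string matching could guarantee.

```python
# Toy example: one museum object, two identifiers, linked by owl:sameAs.
# All URIs and data here are invented for illustration.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDFS

EX = Namespace("http://example.org/collection/")

g = Graph()
g.add((EX["object/A100"], RDFS.label, Literal("Limestone stela")))
g.add((EX["object/A100"], OWL.sameAs, EX["object/1905-417"]))      # same object,
g.add((EX["object/1905-417"], EX.acquisitionYear, Literal(1905)))  # older register number

def coreferents(graph, node):
    """Collect every identifier reachable from `node` via owl:sameAs,
    following the links in both directions."""
    seen, frontier = set(), {node}
    while frontier:
        n = frontier.pop()
        seen.add(n)
        for o in graph.objects(n, OWL.sameAs):
            if o not in seen:
                frontier.add(o)
        for s in graph.subjects(OWL.sameAs, n):
            if s not in seen:
                frontier.add(s)
    return seen

# Everything known about the object, whichever identifier we start from
for ident in coreferents(g, EX["object/A100"]):
    for p, o in g.predicate_objects(ident):
        print(ident, p, o)
```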
Semantic solutions, then, often seem to be desirable, and sometimes perhaps necessary, for specialised communities with distinctive and well-defined domain needs; the British Museum will discover whether it belongs to one such community when its own Linked Data set goes live next month. But where does this leave Tim Berners-Lee’s (and others’) more overarching vision for the Semantic Web?
In broad terms, it suggests to me that the Semantic Web has crossed one tipping point only to find another. In her keynote address, Wendy Hall noted that there had in fact been three tipping points in the take-off of the World Wide Web. The first, as for the Semantic Web, was attaining sufficient size. The second was the emergence of search engines, which allowed users to navigate the WWW beyond the links the sites themselves provided. And the third was the widespread availability of broadband.
If the analogy holds (and if I’m right that the evolution of the Semantic Web is, for the moment, tending to place a heavy informational burden on users), the next milestone would be the development of a tool that allows users easy “navigation” of the Semantic Web: one that exposes to users, in a clear, intuitive, and precise way, the semantics of the datasets they are querying and manipulating.
At the moment, as far as I know, no such application exists. Such Semantic Web browsers as are available tend to allow ready point-to-point navigation, but don’t assist the user much with semantics. Some of the postprocessing performed by FactForge (generation of a preferred label and image, for example) is intended to assist user comprehension; but clearly much more needs to be done. For the Semantic Web to become useful to the general public, interfaces will need to be developed that convey the semantics of the datasets of interest with all the accuracy and precision of a SPARQL query, but none of the cognitive overhead. Presumably Fresnel is intended to meet this need; but I haven’t played with it enough, or seen enough examples of it in action, to judge whether it actually does the job.
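For readers who, like me, have only glanced at Fresnel: its core idea is that display knowledge is itself RDF. A ‘lens’ declares which properties of a class should be shown to a human reader, and in what order. Below is a minimal sketch using what I understand to be the standard Fresnel vocabulary; the FOAF properties are chosen arbitrarily for illustration.

```python
# A minimal Fresnel lens, expressed in Turtle and parsed with rdflib.
# The vocabulary terms are Fresnel's; the lens itself is invented.
from rdflib import Graph

LENS = """
@prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
@prefix foaf:    <http://xmlns.com/foaf/0.1/> .
@prefix :        <http://example.org/lenses#> .

:personLens a fresnel:Lens ;
    fresnel:classLensDomain foaf:Person ;        # applies to any foaf:Person
    fresnel:showProperties ( foaf:name           # show these properties,
                             foaf:depiction ) .  # in this order
"""

g = Graph()
g.parse(data=LENS, format="turtle")
print(len(g), "triples in the lens definition")
```

The attraction is that the presentation layer is queryable RDF like everything else; whether that genuinely lifts the cognitive burden from end users, rather than relocating it to whoever writes the lenses, is exactly the open question.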
In this context, cultural heritage — and to an even greater extent, the humanities — form an interesting use case. The two sectors’ data typically exhibit many of the problematic features for which RDF and other SW technologies are often touted as a solution: holes, sparseness, messiness, an apparently limitless potential for remodelling, a long history of claims and counter-claims, etc. Both would stand to benefit considerably from Semantic Web technologies as these mature — and, as a result, might even be in a position to contribute something back to these technologies themselves.
This might seem a slight possibility, but it is, I think, a possibility nonetheless. Speaking in very general terms, the cultural domain typifies many of the problems the Semantic Web can be expected to encounter as it scales upwards. Leaving aside the question of speed (the humanities generally do not react at the speed of Twitter; but perhaps, given the current state of the Semantic Web’s evolution, that is no bad thing), design solutions that work for the humanities can be expected to apply, a fortiori and mutatis mutandis, more broadly.
This isn’t to say that there’s a one-size-fits-all solution to the Semantic Web usability problem, or to the humanities’ version of it, or even that there’s a solution at all. But it is to say that if these Semantic Web technologies do, as promised, come to open up new perspectives on the cultural domain, we also need to explore the extent to which those perspectives might in turn serve to frame the Semantic Web itself.