Reintegrating the Human(ities): Reflections on Cultural Heritage and the Semantic Web at the British Museum

I spent most of yesterday attending a workday on Cultural Heritage and the Semantic Web at the British Museum. Unfortunately I had to miss the final two papers and the closing panel, so I’m not in a position to offer an overall summary. But certain themes and common notes kept arising in each of the six talks I did manage to catch — and these are worth commenting on in themselves, because they mark, to my mind, a new (and, I think, messier and more interesting) direction for the Semantic Web than that most frequently outlined in the past.

If there was one point hit very heavily by every speaker, it’s that Semantic Web content has at long last passed its tipping point: which is to say, there is by now enough Linked Data out there and available for it to be worth someone’s time to go looking for it. For the better part of a decade the Semantic Web community has been bemoaning the lack of useable and useful semantic data online; but now, thanks in large part to the various governmental open data initiatives as well as groups like Freebase and DBpedia, that problem has largely been solved. Indeed, John Sheridan (of legislation.gov.uk) even outlined a workflow his organisation already has in place for automatically publishing government data in Linked form.

There is, however, a rub, and it lies in the way that this critical mass has been achieved — which is to say, by favouring volume over structure. As Professor Wendy Hall, the event’s keynote speaker, put it, the Semantic Web, for the first four years of its existence, disappeared down an ‘AI rathole’: the SW’s chief enthusiasts were all more interested in solving abstract problems of computation and artificial intelligence than in building something workable and useful. This line of thinking, however, was finally abandoned in favour of a “scruffy” approach inspired by the growth of the original WWW, prioritising availability of data over neat ontological form. Unimaginable millions of RDF statements later, it’s an approach that can be said to have worked.

The problem with this approach, given the nature of the Semantic Web, is that once you drop AI as an approach, where does the I come from? The whole idea of the thing, in my understanding of it, was that inherent semantics would per se and by their very nature enable intelligent retrieval, integration, and reasoning over SW data — and that, as a result, such operations could readily be undertaken by machines. But lowering the semantic weight and coherence of the data would seem to undermine this vision of the Semantic Web significantly — and so it proved to be. As far as I could tell from the presentations I saw, much of the burden in making sense of the semantics of the Semantic Web has been shifted back from the machine to human agents.

At one end of the process we had speakers like Hugh Glaser (Seme4) and Atanas Kiryakov (Ontotext), both of whom develop core semantic technology — and both of whom undertake some form of data curation in relation to the Semantic Web in order to do so. Indeed, the whole point of Ontotext’s FactForge is to provide what Kiryakov terms a ‘reason-able’ view of the SW’s data — in essence, a managed subset of the Semantic Web, with significant processing performed upon it in order to guarantee its coherence and computability.

This approach has a couple of obvious problems. First, it’s clearly not going to scale in any straightforward way. Second, even with the amount of processing done, it’s still capable of returning nonsensical results. To illustrate this point, Kiryakov queried the FactForge integrated dataset to discover who the most popular German entertainer might be. The result, as it turned out, was Friedrich Nietzsche — apparently because MusicBrainz (included in the FactForge ‘reason-able’ view) records that he was, in addition to being a philosopher, a pianist.
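
For the non-SPARQL-readers, a query behind that kind of result has roughly the following shape. This is my own reconstruction, not Kiryakov’s actual demo: the endpoint URL and the properties standing in for ‘German’ and ‘entertainer’ are assumptions on my part. The useful property is that every triple pattern is an inspectable link in the chain of logic, which matters when the answer comes back as Nietzsche.

```python
# My own reconstruction of the *shape* of such a query, not Kiryakov's
# actual demo. The endpoint URL and the properties standing in for
# 'German' and 'entertainer' are assumptions.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://factforge.net/sparql")  # assumed endpoint
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

    SELECT ?artist ?name WHERE {
        ?artist a dbo:MusicalArtist ;      # 'entertainer', very loosely
                dbo:birthPlace dbr:Germany ;
                rdfs:label ?name .
        FILTER (lang(?name) = "en")
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Each triple pattern above is one inspectable link in the chain of
# logic: if Nietzsche turns up, you can see exactly which assertion
# (here, a MusicBrainz-style artist typing) let him in.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["name"]["value"])
```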

This kind of absurdity in itself is hardly fatal to the whole enterprise. For what it’s worth, no matter how I formulate the search terms for ‘most popular German entertainer’, Google at this moment is determined to tell me about the “Wetten, dass…?” television show — and the form of the SPARQL query at least gives me a chain of logic I can track back to work out where I’ve gone wrong. But the difficulty remains that the responsibility for working out the semantics of all this lies with the end (human) user. And there’s thus plenty of room for the SW sceptic (after ten years of the SW sputtering along, there’s no shortage of them) to ask whether this is much of an advance on what we get from the web as it stands. After all, the cognitive heavy lifting is still largely being done by the user. Does it matter that much if the process it supports is truly semantic, or simply string-searching?

The weak response to this would (I guess) be that very often we have no choice. As Jonathan Whitson Cloud of the British Museum pointed out in passing, even relatively common heritage sector problems such as a multiplicity of object identifiers utterly confound string-based search. The alternative to Semantic Web solutions would be some kind of bespoke data-integration application; and, given that the work of people like Kiryakov and Glaser has shown the former to be perfectly viable, it would generally seem preferable to simply tunnelling between silos. Similarly, the domain John Sheridan is working in seems to favour semantic over string solutions — while the biomedical community around FactForge’s sister site, Linked Life Data, must have perceived a need for very large-scale data integration beyond what the WWW already provides.

Semantic solutions, then, seem to be often desirable, and sometimes perhaps necessary, for specialised communities with distinctive and well-defined domain needs; and the British Museum will discover whether it is part of one of these communities when its own Linked Data set goes live next month. But where does this leave Tim Berners-Lee’s (and others’) more over-arching vision for the Semantic Web?

In broad terms, it suggests to me that the Semantic Web has crossed one tipping point, only to find another. In her keynote address, Wendy Hall noted that there had in fact been three tipping points in the take-off of the World Wide Web. The first, as for the Semantic Web, was attaining some kind of size. The second was the emergence of search engines, allowing users to navigate the WWW beyond the links provided among sites. And the third was the widespread availability of broadband.

If the analogy holds true (and if I’m right that the evolution of the Semantic Web is for the moment tending to place a heavy informational burden on users), this would suggest that the next milestone is the development of some kind of tool that allows users easy “navigation” of the Semantic Web — which is to say, one that exposes to users, in some kind of clear, intuitive, and precise way, the semantics of the datasets they are querying and manipulating. At the moment, AFAIK, no such application exists. Such Semantic Web browsers as are available tend to allow ready point-to-point navigation, but don’t assist the user much in terms of semantics. Some of the postprocessing (generation of a preferred label and image, for example) performed by FactForge is intended to assist user comprehension; but clearly much more needs to be done. For the Semantic Web to become useful to the general public, interfaces will need to be developed that outline the semantics of the datasets of interest with all the accuracy and precision of a SPARQL query, but none of the cognitive overhead. Presumably Fresnel is intended to meet this need — but I haven’t played with it enough, or seen enough examples of it in action, to judge whether it actually does the job or not.
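
To give a sense of how far we are from such a tool: about the most accessible view of a dataset’s semantics currently available is a profiling query like the one below, which simply asks an endpoint which predicates it actually uses, and how often. (DBpedia’s public endpoint here; any SPARQL endpoint would do, though large ones may time out on the aggregation.)

```python
# About the crudest possible "what are the semantics of this dataset?"
# probe: ask an endpoint which predicates it actually uses, by frequency.
# A profiling hack, not the navigation tool argued for above; large
# endpoints may well time out on the aggregation.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://dbpedia.org/sparql")  # any endpoint will do
sparql.setQuery("""
    SELECT ?p (COUNT(*) AS ?uses)
    WHERE { ?s ?p ?o }
    GROUP BY ?p
    ORDER BY DESC(?uses)
    LIMIT 20
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["uses"]["value"].rjust(12), row["p"]["value"])
```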

In this context, cultural heritage — and to an even greater extent, the humanities — form an interesting use case. The two sectors’ data typically exhibit many of the problematic features for which RDF and other SW technologies are often touted as a solution: holes, sparseness, messiness, an apparently limitless potential for remodelling, a long history of claims and counter-claims, etc. Both would stand to benefit considerably from Semantic Web technologies as these mature — and, as a result, might even be in a position to contribute something back to these technologies themselves.

This might seem a slight possibility — but it is, I think, a possibility nonetheless. Speaking in very general terms, the cultural domain typifies many of the problems the Semantic Web can be expected to encounter as it scales upwards. Leaving aside the question of speed (the humanities generally do not react at the speed of Twitter, but perhaps, given the current state of the evolution of the Semantic Web, this is no bad thing), design solutions that work for the humanities can be expected to apply a fortiori and mutatis mutandis more broadly.

This isn’t to say that there’s a one-size-fits-all solution to the Semantic Web usability problem, or to the humanities’ Semantic Web usability problems, or even that there’s a solution at all. But it is to say that, if these Semantic Web technologies do, as promised, come to open up new perspectives on the cultural domain, we need also to explore the extent to which these perspectives might in turn serve to frame the Semantic Web itself.

19 thoughts on “Reintegrating the Human(ities): Reflections on Cultural Heritage and the Semantic Web at the British Museum”

  1. I’ve been a Semantic Web sceptic since I first learned about it – largely because the vision just screams AI-complete. I’m not at all convinced that a change in approach to getting lots of RDF triples out in the wild is going to help with the twin issues of poor structure/ontologies and differing structures/ontologies. This is why I like to use the term Semantic Ghettoes, to indicate that the benefits of the approach are mostly going to accrue to small communities that use largely agreed-upon ontologies and create rich data. From there, it might be possible to link multiple such ghettoes with carefully crafted mappings.

    Otherwise, it looks entirely possible that we’ll continue to be treated to rapturous proclamations of all the marvellous things that come out of using FOAF and DC.

  2. Well, there was plenty of fuel for the Semantic Web sceptic there (although thankfully neither FOAF nor DC raised their heads) – not to mention the fact that even a lot of the more technically-mundane but significant issues were kind of glossed over. There’s a sense in which the remote join problem has to be solved before we start worrying about the AI questions attendant upon such an operation …

    That being said, I suppose the extent of Semantic Ghetto-ization is going to depend upon how people actually use ontologies. The original vision for the SW (and I think how most CompSci people think about ontologies) was highly computational: what’s desired is a highly formalised schema specified in detail that allows all the work to be done by the machine.

    But insofar as I’m a humanities scholar, this isn’t what I want an ontology to do. First of all, if the question I’m working on really is a research question, then I’m not in a position to make any very strong ontological commitments. Second, I’m probably going to be a bit wary of strong ontological formulations even in relatively familiar domains, because a lot of interesting research in the humanities takes as its starting point a questioning of attributed ontological status. For instance, I remember we once had a discussion about the individual person being the “natural” unit of prosopographical databases. This is true, but the very idea of a prosopography stems from the notion that individual persons constitute a natural unit of historiographical enquiry generally – a notion that is in part a surprisingly recent one.

    This isn’t to say that I don’t want an ontology at all; I’m probably going to have to perform some kind of data mediation if I’m doing a really large-scale enquiry, and questioning the reality-status of the kinds of relationships that underpin OWL and description logics generally falls outside the remit of historians. But if I’m using an ontology, I want it to keep its commitments low, and above all I want it to show its work: I need to know what those commitments are and how they fit together. Ideally, I want to be able to manipulate it in some kind of way that’s transparent to me, and to add to it experimentally to see what follows.

    It is, as they say, a big ask. But it’s not, I think, an AI big ask. It’s an anthropological, cognitive, and visualisation ask.

    This, by the way, is all just me freestyling on things said at the workday. I didn’t hear any of the speakers actually say that the Semantic Web was rowing back quite strongly from its AI goals, and I think that’s a problem. On the one hand, a lot of the rhetoric was the usual SW stuff we’ve been hearing for a decade; on the other, it was pretty much assumed that the desired end result was something that you’d output to a browser, supporting an absolutely standard navigate-and-read style of interaction.

  3. I think we need to be more specific. If your purpose is the gathering of “raw” data (a problematic notion; for the moment, let’s think of it as data not mediated by a strong ontology), then it might well seem more effort than it is worth to navigate via someone else’s complex ontology.

    However, I think your desire that an ontology show its work points to a broader issue, which is that more scholarship should show its work. The “questioning of attributed ontological status” is far better done when that being questioned is there in front of you in explicit, manipulatable form. Indeed, I should imagine that the easiest way to start such questioning is to literally question the data via that ontology. The results, the working, can also be out there and available for others to use and question in turn. This is in accordance with your stated ideal; I just don’t see how having the ontology be simplistic is at all helpful for this.

  4. The reasons I want the ontology to keep its commitments low (not, ahem, simplistic) are practical ones.

    First, there’s the reason that Tom Gruber gave for formulating “minimal ontological commitment” as a basic desideratum of ontology design: all you need is enough shared vocabulary to enable communication. More than that and you’re creating technical and conceptual hassles for yourself.

    My second reason is that a lot of the most interesting and problematic research questions in the humanities are themselves questions about semantics. And to do this kind of work you don’t need much more (in fact, probably can’t use much more) than some very basic bootstrapping ontological formulations to get you going.

    For instance, to revert to the ethnonym example (not the best example, but the shortest): the HESTIA project was, and for all I know might still be, producing a geographical visualisation of all the places mentioned in Herodotus. The tricky bit here, however, is that the classical languages tend to refer to geographical areas less than you might think, and to names of peoples rather more frequently. And so the HESTIA people had decided to take words like Skythoi (“the Scythians”) as having a geographical reference, to the area of Scythia.

    Now this was, as you’ll appreciate, a hugely problematic decision. It’s just too coarse-grained: sometimes it makes sense to take ethnonyms as referring to a place; at other times to a people, without a geographic dimension; sometimes to a mix of the two; and, for all I know, in other senses as well (e.g., characterising somebody as “very Italian” or somesuch). Furthermore, these things have a historical drift: Skythoi becomes more and more geographical in sense through the course of the Byzantine Empire, for example, while Romaioi becomes absurdly less so.

    So anyway, there’s probably enough meat here for some valid humanities research to take place: which ethnonyms are predominantly geographic in reference? Which are predominantly ethnic? Does this change? With time? With genre? Etc. …

    Now, to do this research some kinds of ontological tools would be very handy. If somebody’s been kind enough to tag all the ethnonyms in the Histories I want to be able to experiment with assertions like “take Ethnonym X as a geographical referent within this timespan” and “take Ethnonym Y as an ethnic affiliation in the genre of oratory” and so on.
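
    Purely to make the shape of such assertions concrete, here is a toy version in code, with all the names, dates, and fields invented:

    ```python
    # A toy rendering of the kind of scoped, experimental assertion meant
    # above; every name and date here is invented for illustration.
    from typing import Optional

    class ScopedAssertion:
        """Take an ethnonym in a given sense, but only within a timespan
        (negative years = BC) and, optionally, within a genre."""
        def __init__(self, ethnonym: str, sense: str,
                     start: int, end: int, genre: Optional[str] = None):
            self.ethnonym, self.sense = ethnonym, sense
            self.start, self.end, self.genre = start, end, genre

        def applies(self, year: int, genre: Optional[str] = None) -> bool:
            return self.start <= year <= self.end and self.genre in (None, genre)

    rules = [
        ScopedAssertion("Skythoi", "geographical referent", -450, -400),
        ScopedAssertion("Skythoi", "ethnic affiliation", -450, -400, genre="oratory"),
    ]

    # Experiment: how should 'Skythoi' be read in a history written c. 430 BC?
    for rule in rules:
        if rule.applies(-430, genre="history"):
            print(f"Take {rule.ethnonym} as {rule.sense}")
    ```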

    The end result of all this experimentation would tend, I suppose, towards becoming an ontology-of-ethnonyms. But what I don’t think would be handy would be for this structure to then be used as some kind of organising principle for data access and classification. It will of course be useful as a reference, or as a starting point for further experiments – and maybe, eventually, you would end up with some kind of Grand-Unified-Theory-of-Places-and-Peoples. But even if this were the case it seems to me you’d want the GUTPP to act more as a knowledge base than an ontology. Insofar as your data users are researchers, they’re either going to be seeking to reformulate the GUTPP, or are going to have concerns that cross-cut it. In either case, the first thing they’re going to want to do is peel it back to some more ontologically primitive layer.

  5. Sorry for equating “low” with “simplistic”, although I think I might be somewhat forgiven given your last sentence, and the idea that more ontology equals more hassle.

    I think that the divide between accumulation of data on the one hand and use and presentation of data on the other is crucial here, as I said before. I’m not thinking that any GUT is possible or desirable, but rather that any ontology (indeed, any work) that a person develops should exist *out there* and be open to others to use, adopt, tinker with, etc. This is naturally something that comes out of communities, and is most useful within that community.

    I would go so far as to say that I think it’s a defining characteristic of what I consider a digital humanities project that it create something that can be poked and prodded. Show your working, indeed — if you’ve developed an interpretation of a text based on the collocations with a particular word, I want your site to have functions for showing all collocations with that word, and collocations with all of those words. I want to be able to feed your system texts by the same author, by other authors, to see what might show up in them.

    Applying ontologies to the actual data, in a way that makes it usable and testable by all, is just one example of this.

    I’m not convinced, to return to your specific example of ethnonyms, that having data classified by some wonderful structure would be an impediment to anything. It doesn’t prevent other structures being applied to the same data, and given that, what are the issues? My original point about the folly of the Semantic Web is that you need mappings between various ontologies. This need is not, however, a problem within a community, because it is an active part of the continuing enquiry of the field.

    I fear I am not making a great deal of sense.

  6. I think if we’re worried we’re not making sense (it’s a nagging thought that had occurred to me too) it’s because we’re actually agreeing on most common points, but envisioning slightly different purposes for Semantic Web/ontological constructs.

    The purpose of every ontology, to my mind, is to simplify: you’ve got a huge mass of data and you need to make it comprehensible and useful in some way. And the question of when the simple (clear, useable, good) tips over into the simplistic (inadequately representative, obstructive, bad) has to be answered by making some kind of cost/benefit analysis regarding what you want the ontology to do and what you want out of it.

    If what you want is a knowledge base supporting automated reasoning over items within the collection and rapid access to them (the original vision as far as I can tell for the SW) then even the simplest possible workable design will still be quite complex. The price of this desirable functionality is thus (a) that the risk of botching the ontology has been raised; and (b) you’ve probably limited the number of people to whom the ontology is transparent (to whom it can legibly “show its work”) to a community of experts – the “Semantic Ghetto” you’ve described.

    On the other hand, if what you want is data mediation across domains then you want to keep your commitments low. They are thus broadly “legible” (or can be made to be so) to a wide group of people. But the cost of this is speed and clarity: the user has to do most of the brain-work. And this seemed to be the revised vision of the SW that was presented last week.

    My contention, as it has evolved, is that the humanities use case more often resembles the latter scenario than the former. In part this is just because the humanities have been very fertile in multiplying incompatible constructs, meaning that in practice formalised semantics often do get in the way of researchers. More fundamentally, this diversity reflects the fact that the humanities are often ultimately about semantics and how one structures or makes sense of one’s data. The researcher is interested both in the characteristics of the data, and in the various ways these data might be held to interrelate. And this means it’s more important to present researchers with a comprehensible means of manipulating and experimenting with these interrelations than it is to formulate them in a way that’s amenable to treatment by hard AI.

    In other words, I’m interested in semantic technology more for its expressivity (“showing your work”) and as a means of experimentation than as an underpinning for automated operations, whether of calculation or retrieval.

    Of course, there’s no reason why the hard-AI and loose’n’sloppy visions of the Semantic Web should be mutually incompatible. But I do think it’s important to be clear about the existence of the spectrum, and the implications of its poles.

    Which I think is what we’ve been talking about?

  7. That’s a good, clear summary, Tim. And I agree with you right up until you link presenting researchers with “a comprehensible means of manipulating and experimenting” to not using something suitable for an automated approach. I think perhaps we’re getting stuck on what I think is an enormous gap between AI and automation. To my mind, if you need any sophisticated AI to work with your data, you’ve likely already lost. However, automation does not require AI and I’m rather confused that you’re contrasting expressivity and experimentability with automation — the two must surely be very close, as per the example in my previous comment.

  8. Well, I was really just using ‘hard-AI’ as a sort of short-hand for one extreme end of the automation spectrum, so maybe let’s throw that out of the conversation for the moment.

    I think I didn’t comment on your example because – while I agree that the functionality you describe would be desirable – it seems to me more like an example of NLP than of ontology kit. And I wonder if to some extent the reason we’re not quite seeing eye-to-eye on this is that you’re framing things in an XML-y kind of context, and I’m framing them in a relational DB-y kind of context. If what you’re talking about is a text to which the world has general access, then you can happily create your own little application and an ontology to power your interpretation of the text – and you might as well make the latter as complex as you like, because there’s always a version that comes prestructured as natural language out there. On the other hand, if we’re talking about a database holding unique information and we’re wondering about how to structure it, store it, and expose it to the world – well, then there’s considerable room for your formulations to make access and use more difficult, not easier.

  9. Well, I’ve more been looking at it from a Linked Data perspective than from an XML one. I think that whether the data is stored in an XML document or in a database, it should be exposed in ways that admit of as much repurposing as possible. Obviously this will vary a great deal — I would not expect most projects to allow individual words or even sentences to be addressed individually, though this would be appropriate for some (Peter’s paleography project comes to mind).

    I recall that one of the Semantic Web books floating around CCH talks about the advantages of the SW approach as opposed to the relational database model, and the rigidity of the latter was a major point. By using a generic database structure (such as can store a Topic Map or a set of RDF triples), it is possible to layer as many ways of seeing the data on top as you want.

    Coincidentally I’m currently working on a Django-based implementation of the Topic Maps API, which does precisely this. It’s good to have the storage mechanism be a separate layer from the data model, which is itself a distinct layer from the ontological model.
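
    In bare, non-Django Python, and as a deliberately dumb sketch rather than the actual implementation, the layering looks something like this:

    ```python
    # A deliberately dumb, non-Django sketch of the layering: a generic
    # statement store at the bottom, with ontological "views" layered on top.
    class Store:
        """Storage layer: holds bare (subject, predicate, object) statements
        and knows nothing about what they mean."""
        def __init__(self):
            self.statements = []

        def add(self, s, p, o):
            self.statements.append((s, p, o))

        def match(self, s=None, p=None, o=None):
            return [t for t in self.statements
                    if (s is None or t[0] == s)
                    and (p is None or t[1] == p)
                    and (o is None or t[2] == o)]

    class TypedView:
        """Ontological layer: one way of seeing the data. Any number of
        these can sit over the same Store without touching it."""
        def __init__(self, store, type_predicate="rdf:type"):
            self.store, self.type_predicate = store, type_predicate

        def instances_of(self, cls):
            return [s for s, _, _ in self.store.match(p=self.type_predicate, o=cls)]

    store = Store()
    store.add("Herodotus", "rdf:type", "Person")
    store.add("Scythia", "rdf:type", "Region")

    view = TypedView(store)
    print(view.instances_of("Region"))  # ['Scythia']
    ```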

  10. I couldn’t agree with you more on the architecture question – precisely because, as I mentioned above, I’d imagine most humanities researchers want to work pretty far down the ontological stack. When I talk about “peeling back” ontological constructs and you talk about “layer[ing] … many ways of seeing the data” I think we’re talking the same language. The point where we differ, I suppose, is in gauging where the dividing line between a useful and a simplistic ontology lies.

    The crux of the difficulty lies in the frequency with which the focus of humanities enquiry is itself semantic. Studying a topic of the form ‘x in the context of y’ (Suicide in Ancient Rome/Religious Ritual in Japan/The Theme of Poverty in Dickens, etc.) is a bog-standard approach in the humanities. And an equally bog-standard difficulty/opportunity/conclusion in this kind of enquiry is that ‘x’ in the context ‘y’ doesn’t map straightforwardly onto ‘x’ in the reference (presumably Western, modern) context: suicide in ancient Rome includes some acts no psychological or legal authority would consider such in the modern world, and excludes others which they would; the applicability of the word “religion” to Japanese culture is problematic; poverty is such an arresting theme in Dickens precisely because it differs radically from prevailing conceptions of it; and so on.

    This means that in a lot of humanities enquiries there is a sort of reflexivity between whatever it is you’re studying, and its semantics: you start off deciding you’re looking at a certain class of things, with a few exceptions; consideration of the exceptions makes you redefine the class; broadening the class redefines the membership set, but creates another set of exclusions, and so on and so forth. Now, having a formalised ontology could help immensely in this explorational process. But the point of complexity at which it will start to seem like an over-reification of the material is probably going to arise very early on. And past that point, as they say, no matter how fine you slice it, it will still be baloney.

    This doesn’t mean, of course, that highly structured ontologies don’t have a role to play. But they’re more like lenses with which one might experiment with viewing the data than something that has to be implemented before semantic technologies might be of use to researchers.

  11. I think my difficulty is in seeing how the semantic technologies are going to be useful to the specified subset of researchers, who are uncomfortable with any statement about the source material that they have not made themselves. (Yes, this isn’t what you said, but consider that any given simple ontology is going to be problematic for some researcher.) Aren’t such researchers simply going to work with the source texts directly, and use usual search techniques to find parts of those that might be relevant? Having found the appropriate source material for their study, they do their cogitating, perhaps build a model or three, maybe at some point apply it rigorously to the source material and test it. After a suitable number of repetitions, I have to hope that this work is published in a way that links in with whatever else is there.

    So, for example, if there already existed identifications for, and statements involving, a chap called Pliny and a mountain called Vesuvius, then I’d hope that if our researcher ended up saying something about how the latter was a bit of a deadly attraction to the former, it was done so with reference to those identifiers – even if the existing model claimed Pliny was a person and Vesuvius a geographical feature, while our researcher saw them as an identity and a volcano respectively.
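
    In RDF terms the point looks something like this (rdflib syntax, with all the URIs invented for illustration):

    ```python
    # rdflib sketch of the point above; all URIs are invented examples.
    from rdflib import Graph, Namespace, RDF

    EXISTING = Namespace("http://example.org/existing/")  # the prior model
    MINE = Namespace("http://example.org/researcher/")    # the researcher's terms

    g = Graph()

    # What the existing model claims about the shared identifiers ...
    g.add((EXISTING.Pliny, RDF.type, EXISTING.Person))
    g.add((EXISTING.Vesuvius, RDF.type, EXISTING.GeographicalFeature))

    # ... and what our researcher claims about the *same* identifiers.
    g.add((EXISTING.Pliny, RDF.type, MINE.Identity))
    g.add((EXISTING.Vesuvius, RDF.type, MINE.Volcano))
    g.add((EXISTING.Vesuvius, MINE.deadlyAttractionFor, EXISTING.Pliny))

    # Both typings coexist, and queries can draw on either or both.
    for _, _, o in g.triples((EXISTING.Vesuvius, RDF.type, None)):
        print(o)
    ```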

    I wonder if anyone else is reading this thread?

  12. Jamie said:
    Aren’t such researchers simply going to work with the source texts directly, and use usual search techniques to find parts of those that might be relevant?

    The answer to that is most likely “yes”, but the usual search techniques available in the majority of DH websites are usual in the sense of what is usually provided by the website creators, and not in the sense of what is usual to humanities researchers. I would like to wave a small flag on behalf of an old-fashioned tool from the print medium which is a kind of half-way house between full-blown ontology-driven data classification and the catch-as-catch-can of free text searching: I speak, of course, of the index. I mean a proper index, compiled by a professional who, iteratively, by engaging with a work, comes up with a rough, flexible, kind-of-ontology; it’s not formal, but it’s useful because it’s fit to context while at the same time using categories (or labels, or classifications – whatever you want to call them) that are commonly used. The result is a finding aid that, to use Tim’s words from his initial post, “exposes to users in some kind of clear, intuitive, and precise way, the semantics of the datasets they are querying”.

    So when I pick up a book about the painter Joe Bloggs, say, I can look in the back and see the entry for “Joe Bloggs” broken down into clusters of references – perhaps even sub-clusters within those – that (in the hands of a good indexer) accurately reflect the contents of the work. Of course to some extent it’s an interpretation, but that doesn’t stop it being extraordinarily useful to scholars. In the world of scholarly books, this IS the usual way of searching, and I think we in DH have done a huge disservice to humanities scholars by regularly failing to give them anything near as useful. Now we boast “you can find all occurrences of ‘Joe Bloggs’ within six milliseconds”. Yeah, great, but that’s not the same as being able to look up Joe Bloggs with respect to “– adultery with Jane Smith”, “– and Slade School of Fine Art”, “– military service”, etc. How have we let this go by the board? Perhaps before we try to work out how to make all our data universally discoverable, we could at least give our users something as useful as a good index, and it surely wouldn’t take the amount of effort that we know even relatively simple ontological markup takes.

  13. Well, I’m glad Paul’s comment indicated that at least someone else is reading this thread; I’ve been taking a breather to put in lots and lots and lots of time on a project launch …

    As an aside, I’d like to second both Paul’s and Jamie’s comments above on the tendency of most scholars simply to go to the source texts, and to use traditional scholarly aids to navigate around them. The other day my wife, a lecturer in Slavonic and Byzantine history, was doing a little research for a lecture on the web, and at one point exclaimed “This is such a great site!”. Eager, of course, to keep up with examples of best practice in the field, I peered over her shoulder and discovered that what she was looking at was the LacusCurtius Ammianus Marcellinus text. And what makes the site great in her opinion is that it has a lightly annotated table of contents that makes it easy for her to find her way around a text she’s not terribly familiar with but has to make frequent glancing reference to for background material.

    So I’m not doing down traditional scholarly aids, or thinking that semantic technologies are somehow going to replace the reading of texts. I’m really just envisaging situations in which these tools aren’t available, or don’t quite do the job, or where ontological enquiry is a necessary step to clarify what needs to be read – e.g., datastore mediation. For example, I alluded to the ethnonym problem above, whereby it’s desirable to correlate some but not all ethnonyms in, say, Herodotus, to geographical features. In this case, it would certainly be desirable to have an index broken down in the fashion “Scythians: geographical range of; chief settlements of; marriage customs; relationship to snow;”, etc., in which case your job is pretty much done for you. But assuming such an index doesn’t exist – how could you determine which ethnonyms should be taken as geographical, and which otherwise?

    One answer would be to do a big regex string search that yielded you all the ethnonyms in the corpus, and then go and read the passages in which these were found – which would be fine if you had about twenty years to do it in. More helpful, I think, would be to apply a little semantic technology, in combination with some NLP techniques, and do a bit of experimenting. Though I’m hardly a specialist in any of these areas and this is all very crudely conceived, I can imagine a sequence something like the following (roughly sketched in code after the list):

    (1) Consider all ethnonyms found within n words of a reference to a feature identified by Pleiades (or Google Ancient Places, or whatever) to be geographical
    (2) Consider all ethnonyms found within n words of words belonging to (some ancient-language equivalent to) Wordnet’s juridical, political, or artistic categories as non-geographical
    (3) Plot all geographical results on a map. Do a sanity check on outliers, unexpected results, and ethnonymic instances that fail to fall under either of the two rules.
    (4) Tweak the rules created in (1) and (2) and reapply (3)
    (5) Repeat until satisfied
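
    Purely by way of illustration, rules (1) and (2) might come out something like the following, with the gazetteer, the lexicon, and the window size all crude stubs of my own invention:

    ```python
    # Very rough code for rules (1) and (2) above. The gazetteer and the
    # lexicon are stubbed out: PLEIADES_PLACES and CULTURAL_TERMS stand in
    # for real lookups against Pleiades and a WordNet-like resource.
    PLEIADES_PLACES = {"Olbia", "Borysthenes", "Tanais"}     # stub gazetteer
    CULTURAL_TERMS = {"law", "assembly", "custom", "dance"}  # stub lexicon
    WINDOW = 10  # the 'n words' of rules (1) and (2); pure guesswork, to be tweaked

    def classify_ethnonyms(tokens, ethnonyms):
        """Tag each ethnonym occurrence by what falls within WINDOW words of it.
        Instances matching neither rule are flagged for the sanity check (3)."""
        results = []
        for i, token in enumerate(tokens):
            if token not in ethnonyms:
                continue
            context = set(tokens[max(0, i - WINDOW):i + WINDOW + 1])
            if context & PLEIADES_PLACES:
                sense = "geographical"      # rule (1)
            elif context & CULTURAL_TERMS:
                sense = "non-geographical"  # rule (2)
            else:
                sense = "unclassified"      # outlier: inspect by hand, per (3)
            results.append((i, token, sense))
        return results

    tokens = "the Skythoi dwell beyond the Borysthenes and honour custom".split()
    print(classify_ethnonyms(tokens, {"Skythoi"}))
    # [(1, 'Skythoi', 'geographical')]
    ```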

    The results of this would be twofold. First, a set of points on a map associated with various ethnoi; and second, a great deal of information about how (various) ethnoi are conceived of and represented in the Histories. The latter could easily be remediated as an index into the text – or possibly be expressed ontologically, as a basis for exploration of other corpora in other contexts. Such an ontological formulation, however, would be useful elsewhere only insofar as it could be “rolled back” and its underpinnings inspected.

  14. Not that I disagree with the desirability of being clear on what basis any sort of categorisation has been made, but I’m sensing a slight mismatch between what is expected of an index and what is expected of an ontological formulation. And I’m curious as to what sort of underpinnings you’d be satisfied with, Tim: presumably, in the case of your example, the full code that was used to generate the final results, with perhaps a statement of the reasoning behind the choices? But what of something that was entirely or largely manual — would a statement of the criteria used be sufficient?

    As for not providing traditional-style indices to digital works, I think there’s certainly room for a tool that does this. It would need to be customisable to a fairly large degree, however – essentially a system that can wrap multiple versions of Tim’s example NLP process. Or, as an alternative system, include lots of markup in the text that identifies whatever entities you care about, and leave the way those should be combined to the user.

    So, for example, there might be a page for Joe Bloggs, and on that page a form or set of links to Joe Bloggs in relationship with various other entities. If a form, then it might present options such as, “is mentioned in the same section/paragraph/sentence with”, “is cited”, “is quoted”, etc. And of course your entities can include “adultery with Jane Smith”.

    I can think of two types of subentry in an index:

    * The main entity (i.e., the subject of the index term) within a specific context.
    * An identifiable entity that is related to the main entity but which does not have a convenient name (e.g., the adultery).

    While the second might be seen as a subtype of the first, I think it’s useful to treat it distinctly — particularly if the secondary entity has relationships to other entities.

    Are there other types of index entry that an automated indexer would need to be able to handle? As it stands, if the entities are marked up wherever they meaningfully occur, there are two things that need to be done to generate a decent index:

    * Express ‘universal’ (extra-textual) relationships between entities in the entity management system. E.g., relate “adultery between Joe Bloggs and Jane Smith” to both Joe Bloggs and Jane Smith.
    * Derive contextual relationships between instances of entities within the text.

    The first is straightforward; the second potentially extremely difficult. But, worth a shot. Presumably there has been a fair amount of work in this area already, though I’ve never heard anything good about automatic index generators.
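
    As a toy sketch of those two steps, using the running examples from this thread (and leaving the genuinely difficult contextual part unattempted):

    ```python
    # Toy sketch of the two steps above, with the thread's running examples.
    # Step 1's 'universal' relationships live in an entity registry; step 2
    # derives co-occurrence per section. Deciding *what kind* of contextual
    # relationship holds -- the genuinely hard part -- is not attempted.
    from collections import defaultdict

    # Step 1: extra-textual relationships, declared once.
    related = {"adultery-1": {"Joe Bloggs", "Jane Smith"}}

    # The text, reduced to which entities are marked up in which section.
    sections = {
        "ch1": {"Joe Bloggs", "Slade School of Fine Art"},
        "ch3": {"Joe Bloggs", "Jane Smith", "adultery-1"},
    }

    index = defaultdict(lambda: defaultdict(set))

    # Step 2: contextual relationships, derived from co-occurrence.
    for section, entities in sections.items():
        for entity in entities:
            for other in entities - {entity}:
                index[entity][other].add(section)

    # Fold in the universal relationships (location-independent).
    for entity, others in related.items():
        for other in others:
            index[entity][other].add("(extra-textual)")
            index[other][entity].add("(extra-textual)")

    for sub, locs in sorted(index["Joe Bloggs"].items()):
        print(f"Joe Bloggs -- {sub}: {', '.join(sorted(locs))}")
    ```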

  15. With regard to exposing the underpinnings – I suppose this is why I mentioned the need for some kind of visualisation tool for this kind of thing to work. It would be both trivial and desirable to expose the chain of reasoning that led to the final formulation in the form of, say, SVN commits. But I’d find that less than helpful – and it would leave almost all humanities scholars in the dust. For this kind of thing to get any sort of traction, humanities scholars are going to have to feel as confident manipulating the toolset as they do working with natural language – or at least, that such a degree of confidence is reachable by them.

    As for indices … I’m sceptical of automation here. Indices are at their most useful (particularly in electronic texts) when they capture relationships that aren’t clearly recoverable by NLP techniques. I can easily imagine automating the kind of entry that reads ‘Caesar, Julius, 2, 5, 7, 8, 22-24, and passim’. But – to hark back once again to my own thesis – I’d say one of the most distinctive and important passages for any understanding of Roman attitudes towards suicide is this snippet from the Younger Seneca:

    Do you ask where the road to freedom lies? Fool! It lies over every cliff, and runs through every vein in your body!

    Now, to a human reader the language and rhetoric are almost embarrassingly bald. But I would suggest that NLP techniques would have a very difficult time working out that this passage is about suicide.

    Or am I being naive?

  16. I am also sceptical of automation in the generation of indices, as in many humanities-related areas (if you’ve never heard me ranting about the horror of trusting named entity recognition, you’re lucky). However, I think it is absolutely essential that that scepticism not prevent us from truly attempting to make that automation possible and worthwhile. Otherwise we lose the D from DH, and given that I think it’s hardly there to start with, that would be unfortunate.

    In the list of possible answers to the question ‘what is digital humanities’, the only one that I really hold to is, ‘using computation as a key part of performing humanities research’. Marking up electronic texts with TEI markup is not digital humanities work, except in as much as that markup is used by a process that performs computation (and most transformations of TEI to HTML are such boring computations as not to count). The same goes for stuffing material into a database and taking it back out again in one or two different ways. Automatically generating an index and improving the automation code over time so that it requires less and less manual intervention (without necessarily ever needing to reach none) to be suitable is digital humanities work.

    If people working in digital humanities are not the ones to push the idea that computers can do/help with a lot of things that are currently conceived as being the exclusive domain of human thinkers, then who will? I am likely displaying my ignorance of what fantastic work is being done around the world, but I’ve felt for years that the DH community is a timid one, more interested in trying to make pre-existing work methods and research practices easier and more convenient than in trying to redefine what parts of that work no longer require a human to do.

  17. Well, a hearty amen to all that.

    It seems to me that to be done well, DH has to thread a very difficult needle. On the one hand, there’s a tendency on the part of most humanities scholars to simply claim that it’s impossible to automate what they/we do: if a question can’t be answered using the by-now traditional methods of humanities scholarship, then it simply can’t or shouldn’t be answered, or at the very least isn’t “really” a humanities question, whatever that may be. On the other hand there’s a countervailing tendency to over-simplify and over-reify – to simply claim that something that can’t be readily automated using existing techniques doesn’t really count as knowledge. In other words, we’re stuck between devotees of ultra-conservative humanities methods and ideologies and devotees of ultra-conservative computational methods and ideologies.

    The difficulty, as you point out, is that walking a tightrope doesn’t get any easier if you do it timidly – which is exactly what’s happening now. It seems to me that ideologues of the romantic and algorithmic approaches shout the loudest, while those seeking either some middle ground or a reframing of the issue are rather hesitant. The issue’s compounded by existing admin and political arrangements – with funding bodies and current trends tending to favour the algorithm fundamentalists, and every other possible kind of pressure (tenure committees, peer review, etc.) pulling in the opposite direction …

  18. Hi chaps

    Fascinating discussion 🙂 I won’t be daft enough to wade in with any particularly new angles as you’ve covered so much interesting ground already, but as I’ve just got round to posting the slides from my talk at that BM workshop (which was one of the ones you missed, I think, Tim?) I thought it might be worth passing on the links:
    text
    slides

    The key points are (I believe) very much in line with what you and the other speakers have been saying. With my HESTIA/Google Ancient Places/PELAGIOS hat on, I’d certainly also concur that digital semantics are by no means a replacement for the human kind. Where I believe we _may_ see the kind of major transformations you discuss in the final posts is in those triggered by accessibility and discoverability on a previously unknown scale. What actually happens with that material will need to be ineluctably human, but the processes may nonetheless be quite different to what we are used to. Nevertheless, the key point IMHO is that it will be the transformative philosophy of openness and decentralization that Web technology facilitates (including the Semantic Web) that will have a disruptive impact on the Humanities, not the technology itself. Twitter doesn’t bring down dictators – it’s people using Twitter in ways that dictators don’t grok that gets ’em 🙂

    I’ll shut up now, but thanks for such a stimulating conversation.

  19. Tim said:

    But I would suggest that NLP techniques would have a very difficult time working out that this passage is about suicide.
    Or am I being naive?

    If you are being naive over NLP’s ability to work out that the Younger Seneca passage is about suicide, then I am going to be even more naive. My elder son is in English ninth grade (he’s thirteen). If you set him a reading comprehension question about that passage in the form “What is being discussed here?”, I would certainly hope that his class would come up with the answer “suicide” – but I wouldn’t call it *obvious*. I would say that when presented with that passage out of any context, the average teenager will not immediately grasp that the “freedom” being talked about is the individual’s freedom to take their own life. I really hope that an NLP specialist will be irritated enough by what I say next to jump in and slap me down, but I seriously doubt that computers are anywhere near that level of ability at answering simple questions put to them about texts.

    Here is a simple passage (invented):

    “It was August, and the temperature in the street was one hundred and ten degrees. In the air-conditioned hotel room Jane sweated; she was about to come face-to-face with the killer of her boyfriend.”

    Comprehension question: Why was Jane sweating?

    A very young child might first answer “because the weather was very hot”, not realizing that the information in the first part of the passage is negated by the phrase “air-conditioned”. An older child (including my son – I hope!) would read the whole thing and realize that Jane’s sweating was from nervous anticipation, not from being hot.

    Think of the combination of syntactic and semantic reasoning a computer would have to be able to go through to come to the correct answer. Could a computer today actually answer that question? (It sounds rhetorical, but it’s a genuine question – have AI and NLP progressed to the point where a machine would have no problem in saying why Jane was sweating? Literally, what is the reading level of the world’s most advanced computer? Third grade? Fourth grade?)

    The comprehension question above is a classic humanities question – albeit fairly low level. If we are ever to have a truly “digital” humanities, where (for example) a computer could compile an index that has “– and adultery with Jane Smith” with a reference pointing to a passage in a biography of Joe Bloggs where the words “adultery”, “Jane” and “Smith” don’t occur, then we need to concentrate on getting the computer through grade school.
