All posts by Michele Pasin

People of Medieval Scotland database launch

A public launch of the AHRC-funded People of Medieval Scotland (PoMS) database by a Scottish Cabinet Secretary was held at the Sir Charles Wilson Lecture Theatre at the University of Glasgow on Wednesday, 5 September. The official launch by the Education and Lifelong Learning Secretary Michael Russell last week capped five years of work by John Bradley and Michele Pasin from KCL’s Department of Digital Humanities (DDH) with historians from three institutions. Dauvit Broun, the lead historian on the PoMS project at the University of Glasgow, said at the public launch that the database “demonstrated [a] potential to transform History as a public discipline” through “the new techniques of digital humanity”, noting that it has been a “privilege and a pleasure” to work with the team’s “exceptional people”.


One of the highlights of the launch was the brand new PoMS Labs section. This is an innovative and thought-provoking area of the site that features tools and visualizations aimed at letting users gain new perspectives on the database materials. For example, such tools allow you to browse incrementally the network of relationships linking persons/institutions to other persons/institutions; to compare the different roles two agents played in the context of their common events; or to browse iteratively transactions and the witnesses associated with them.


In general, PoMS Labs aims to address the needs of both non-expert users (e.g., learners) – who can simultaneously access the data and get a feeling for the meaningful relations among them – and expert users (e.g., academic scholars) – who can be supported in analysing data along predefined dimensions, so as to highlight patterns of interest that would otherwise be hard to spot. For these reasons, the Labs were welcomed warmly both by the academics present at the launch and by the minister, who felt that tools of this kind could revolutionise the teaching of history in schools.


Seminar: the Role of Digital Humanities in a Natural Disaster


As part of the New Directions in the Digital Humanities series, this week we had a very inspiring presentation from Dr Paul Millar, Associate Professor and Head of the Department of English, Cinema and Digital Humanities at the University of Canterbury (NZ).

The talk focused on the CEISMIC project, with which Millar and his team intended to ‘crowdsource’ a digital resource to preserve the record of the earthquakes’ impacts, document the long-term process of recovery, and discover virtual solutions to issues of profound heritage loss.

Seminar: the National Archives Online


Last Thursday (8th March), as part of the New Directions in the Digital Humanities seminar series, we hosted a very interesting talk by Emma Bayne and Ruth Roberts about the most recent developments of the National Archives’ online presence.

Discovery and the Digital Challenge

Emma Bayne and Ruth Roberts talked about the changes to the National Archives’ online services, including the development of a new service – the Discovery Service. This is based on a new architecture and allows improved access to the National Archives Catalogue and digitised material. Features include a new taxonomy-based search and an API that allows bulk downloads of data.
They also discussed some of the challenges facing the National Archives in delivering large quantities of digital images of records online – moving from a gigabyte scale to a petabyte scale in a short period of time.
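As a purely illustrative aside, here is a minimal Python sketch of what keyword access to a catalogue search API might look like. The endpoint URL, parameter names and response fields below are hypothetical placeholders, not the actual Discovery API – consult the National Archives’ own documentation for the real interface and terms of use.

    import requests  # third-party HTTP library: pip install requests

    # Hypothetical endpoint and parameter names, for illustration only --
    # not the real Discovery API.
    SEARCH_URL = "https://api.example.org/catalogue/search"

    def search_catalogue(query, page=1, page_size=50):
        """Run a keyword search and return the parsed JSON response."""
        params = {"q": query, "page": page, "pageSize": page_size}
        response = requests.get(SEARCH_URL, params=params, timeout=30)
        response.raise_for_status()
        return response.json()

    if __name__ == "__main__":
        results = search_catalogue("Domesday")
        # The field names ("records", "reference", "title") are also invented here
        for record in results.get("records", []):
            print(record.get("reference"), "-", record.get("title"))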

Recording of the seminar:

If you’re interested you can listen again to the seminar (~1hr) by clicking here.

Relevant Links:

National Archives main site
National Archives Labs
Discovery Service

Tagore digital editions and Bengali textual computing

Professor Sukanta Chaudhuri yesterday gave a very interesting talk on the scope, methods and aims of ‘Bichitra’ (literally, ‘the various’), the ongoing project for an online variorum edition of the complete works of Rabindranath Tagore in English and Bengali. The talk (part of this year’s DDH research seminar) highlighted a number of issues I personally wasn’t very familiar with, so in this post I’m summarising them briefly and then offering a couple of suggestions.

Sukanta Chaudhuri is Professor Emeritus at Jadavpur University, Kolkata (Calcutta), where he was formerly Professor of English and Director of the School of Cultural Texts and Records. His core specializations are in Renaissance literature and in textual studies: he published The Metaphysics of Text from Cambridge University Press in 2010. He has also translated widely from Bengali into English, and is General Editor of the Oxford Tagore Translations.

Rabindranath Tagore (1861 – 1941), the first Nobel laureate from Asia, was arguably the most important icon of the modern Indian Renaissance. This recent project on the electronic collation of Tagore’s texts, called ‘the Bichitra project’, is being developed as part of the national commemoration of the 150th birth anniversary of the poet (here’s the official page). This is how the School of Cultural Texts and Records summarizes the project’s scope:

The School is carrying out pioneer work in computer collation of Tagore texts and creation of electronic hypertexts incorporating all variant readings. The first software for this purpose in any Indian language, named “Pathantar” (based on the earlier version “Tafat”), has been developed by the School. Two pilot projects have been carried out using this software, for the play Bisarjan (Sacrifice) and the poetical collection Sonar Tari (The Golden Boat). The CD/DVDs contain all text files of all significant variant versions in manuscript and print, and their collation using the “Pathantar” software. The DVD of Sonar Tari also contains image files of all the variant versions. These productions are the first output of the series “Jadavpur Electronic Tagore”.
Progressing from these early endeavours, we have now undertaken a two-year project entitled “Bichitra” for a complete electronic variorum edition of all Tagore’s works in English and Bengali. The project is funded by the Ministry of Culture, Government of India, and is being conducted in collaboration with Rabindra-Bhavana, Santiniketan. The target is to create a website which will contain (a) images of all significant variant versions, in manuscript and print, of all Tagore’s works; (b) text files of the same; and (c) collation of all versions applying the “Pathantar” software. To this end, the software itself is being radically redesigned. Simultaneously, manuscript and print material is being obtained and processed from Rabindra-Bhavana, downloaded from various online databases, and acquired from other sources. Work on the project commenced in March 2011 and is expected to end in March 2013, by which time the entire output will be uploaded onto a freely accessible website.

 

A few interesting points

 

  • Tagore, as Sukanta noted, “wrote voluminously and revised extensively“. From a DH point of view this means that creating a comprehensive digital edition of his works would require a lot of effort – much more than we could easily pay people for, if we wanted to mark up all of this text manually. For this reason it is essential to find semi-automatic methods for aligning and collating Tagore’s texts, e.g. the “Pathantar” software (a toy illustration of what automatic alignment involves follows after this list). A screenshot of the current collation interface follows.

    Tagore digital editions

  • The Bengali language, in which Tagore wrote, is widely spoken in the world (it is actually one of the most spoken languages, with nearly 300 million total speakers). However, this language poses serious problems for a DH project. In particular, the writing system is extremely difficult to parse using traditional OCR technologies: its vowel graphemes are mainly realized not as independent letters but as diacritics attached to its consonant letters. Furthermore, clusters of consonants are represented by distinct and sometimes quite irregular forms, so learning to read is complicated by the sheer size of the full set of letters and letter combinations, numbering about 350 (from Wikipedia). A short Unicode illustration of why this is awkward to process appears further down.
  • One of the critical points that emerged during the discussion had to do with the visual presentation of the results of the collation software. Given the large number of text editions they’re dealing with, and the potentially vast amount of variation between one edition and the others, a powerful and interactive visualization mechanism seems to be strongly needed. However, it is not clear what the possible approaches on this front are.
  • Textual computing, Sukanta pointed out, is not as developed in India as it is in the rest of the world. As a consequence, in the context of the “Bichitra” project, widely used approaches based on TEI and XML technologies haven’t really been investigated enough. The collation software mentioned above obviously marks up the text in some way; however, this markup remains hidden from the user and most likely is not compatible with other standards. More work would thus be desirable in this area – in particular within the Indian subcontinent.
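As promised above, here is a toy illustration of automatic alignment of two variant readings, using Python’s standard difflib module. This is not how “Pathantar” works internally – just a minimal sketch, with invented sample lines, of the kind of word-level alignment a collation tool has to produce.

    from difflib import SequenceMatcher

    # Two invented variant readings of the same line (illustrative only,
    # not actual Tagore text or Pathantar output).
    version_a = "the golden boat sails down the quiet river".split()
    version_b = "the golden boat drifts slowly down the river".split()

    matcher = SequenceMatcher(None, version_a, version_b)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # tag is one of 'equal', 'replace', 'delete', 'insert'
        print(f"{tag:8} A{version_a[i1:i2]}  B{version_b[j1:j2]}")

A real collation tool would of course have to handle whole manuscripts, Bengali script and markup of the aligned segments, but the underlying alignment problem is the same.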
Food for thought


  • On the visualization of the results of a collation. Some inspiration could be found in the type of visualizations normally used in version control software systems, where multiple and alternative versions of the same file must be tracked and shown to users. For example, we could think of the visualizations available on GitHub (a popular code-sharing site), which are described on this blog post and demonstrated via an interactive tool on this webpage. Here’s a screenshot:

    Github code visualization

    The situation is strikingly similar – or is it? Would it be feasible to reuse one of these approaches with textual sources?
    Another relevant visualization is the one used by popular file-comparison software (e.g. FileMerge on a Mac) for showing differences between two files:

    File Merge code visualization

  • On using language technologies with Bengali. I did a quick tour of what’s available online, and (quite unsurprisingly, considering the reputation Indian computer scientists have) found several research papers which seem highly relevant. Here are a few of them:
    – Asian language processing: current state-of-the-art [text]
    – Research report on Bengali NLP engine for TTS [text]
    – The Emile corpus, containing fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages [homepage]
    – A complete OCR system for continuous Bengali characters [text]
    – Parsing Bengali for Database Interface [text]
    – Unsupervised Morphological Parsing of Bengali [text]
  • On open-source software that appears to be usable with Bengali text. Not a lot of stuff, but more than enough to get started (the second project in particular seems pretty serious):
    – Open Bangla OCR – a BDOSDN (Bangladesh Open Source Development Network) project to develop a Bangla OCR
    – Bangla OCR project, mainly focused on the research and development of an Optical Character Recognizer for Bangla / Bengali script
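To make the earlier point about the script more concrete, here is a small Python example (illustrative only) showing how a single visual unit of Bengali decomposes into several Unicode codepoints – one reason why naive character-level OCR and string handling struggle with the script.

    import unicodedata

    # A single conjunct ("khiyo") is one glyph but three codepoints, and the
    # vowel sign in "ki" is drawn before the consonant it logically follows.
    conjunct = "\u0995\u09CD\u09B7"   # KA + VIRAMA + SSA
    syllable = "\u0995\u09BF"         # KA + VOWEL SIGN I

    for text in (conjunct, syllable):
        print(text, "->", [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text])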

    Any comments and/or ideas?

     

    Event: THATcamp Kansas and Digital Humanities Forum

    The THATcamp Kansas and Digital Humanities Forum took place last week at the Institute for Digital Research in the Humanities, which is part of the University of Kansas in beautiful Lawrence. I had the opportunity to be there and give a talk about some recent work I’ve been doing on digital prosopography and computer ontologies, so in this blog post I’m summing up the things that caught my attention at the conference.

    The event took place on September 22-24 and consisted of three separate parts:

  • Bootcamp Workshops: a set of in-depth workshops on digital tools and other DH topics http://kansas2011.thatcamp.org/bootcamps/.
  • THATCamp: an “unconference” for technologists and humanists http://kansas2011.thatcamp.org/.
  • Representing Knowledge in the DH conference: a one-day program of panels and poster sessions (schedule | abstracts )
    The workshop and THATcamp were both packed with interesting stuff, so I strongly suggest you take a look at the online documentation, which is very comprehensive. In what follows I’ll instead highlight some of the contributed papers which a) I liked and b) I was able to attend (needless to say, this list reflects only my individual preferences and interests). I hope you’ll find something of interest there too!

    A (quite subjective) list of interesting papers

     

  • The Graphic Visualization of XML Documents, by David Birnbaum ( abstract ): a quite inspiring example of how to employ visualizations in order to support philological research in the humanities. Mostly focused on Russian texts and XML-oriented technologies, but its principles are easily generalizable to other contexts and technologies.
  • Exploring Issues at the Intersection of Humanities and Computing with LADL, by Gregory Aist ( abstract ): the talk presented LADL, the Learning Activity Description Language, a fascinating software environment that provides a way to “describe both the information structure and the interaction structure of an interactive experience”, for the purpose of “constructing a single interactive Web page that allows for viewing and comparing of multiple source documents together with online tools”.
  • Making the most of free, unrestricted texts–a first look at the promise of the Text Creation Partnership, by Rebecca Welzenbach ( abstract ): an interesting report on the pros and cons of making available a large repository of SGML/XML encoded texts from the Eighteenth Century Collections Online (ECCO) corpus.
  • The hermeneutics of data representation, by Michael Sperberg-McQueen ( abstract ): a speculative and challenging investigation of the assumptions at the root of any machine-readable representation of knowledge – and their cultural implications.
  • Breaking the Historian’s Code: Finding Patterns of Historical Representation, by Ryan Shaw ( abstract ): an investigation into the use of natural language processing techniques for the purpose of ‘breaking down’ the ‘code’ of historical narrative. In particular, the sets of documents used are related to the civil rights movement, and the specific NLP techniques employed are named entity recognition, event extraction, and event chain mining.
  • Employing Geospatial Genealogy to Reveal Residential and Kinship Patterns in a Pre-Holocaust Ukrainian Village, by Stephen Egbert ( abstract ): this paper showed how it is possible to visualize residential and kinship patterns in the mixed-ethnic settlements of pre-Holocaust Eastern Europe by using geographic information systems (GIS), and how these results can provide useful materials for humanists to base their work on.
  • Prosopography and Computer Ontologies: towards a formal representation of the ‘factoid’ model by means of CIDOC-CRM, by me and John Bradley ( abstract ): this is the paper I presented (shameless self plug, I know). It’s about the evolution of structured prosopography (= the ‘study of people’ in history) from a mostly single-application and database-oriented scenario towards a more interoperable and linked-data one. In particular, I talked about the recent efforts for representing the notion of ‘factoids’ (a conceptual model normally used in our prosopographies) using the ontological language provided by CIDOC-CRM (a computational ontology commonly used in the museum community).

    That’s all! Many thanks to Arienne Dwyer and Brian Rosenblum for organizing the event!

     

    P.S.
    A copy of this article has been posted here too.

    Hack4Europe! – Europeana hackathon roadshow, June 2011

    Europeana is a multilingual digital collection containing more than 15 million resources that lets you explore Europe’s history from ancient times to the modern day. The Europeana API services are web services that allow search and display of Europeana collections in your own websites and applications. The folks at Europeana have been actively promoting experimentation with their APIs by organizing ‘hackathons’ – workshops for cultural informatics hackers where new ideas are discussed and implemented.

    Some examples of the outputs of the previous hackathon can be found here. Hack4Europe is the most recent of these dissemination activities:

    Hack4Europe! is a series of hack days organised by the Europeana Foundation and its partners Collections Trust, Museu Picasso, Poznan Supercomputing and Networking Center and Swedish National Heritage Board. The hackathon roadshow will be held simultaneously in 4 locations (London, Barcelona, Poznan and Stockholm) in the week 6 – 12 June and will provide an exciting environment to explore the potential of open cultural data for social and economic growth in Europe.

    Each hackathon will bring together up to 30 developers from the hosting country and the surrounding area. They will have access to the diverse and rich Europeana collections containing over 18 million records, Europeana Search API (incl. a test key and technical documentation) and Europeana Linked Open Data Pilot datasets which currently comprise about 3 million Europeana records available under a CC0 license.
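    To give a flavour of what hackathon participants typically start from, here is a minimal Python sketch of paging through a JSON search API and tallying data providers. The URL, key, parameter names and response fields are placeholders loosely modelled on search APIs of this kind, not the actual Europeana endpoints – register for a key and consult the official API documentation for the real interface.

        from collections import Counter
        import requests  # pip install requests

        # Placeholder endpoint, key and field names -- illustrative only,
        # not the real Europeana Search API.
        API_URL = "https://api.example.org/search.json"
        API_KEY = "YOUR_TEST_KEY"

        def iter_items(query, rows=100, max_items=500):
            """Yield search results page by page, up to max_items."""
            start, fetched = 1, 0
            while fetched < max_items:
                params = {"wskey": API_KEY, "query": query, "rows": rows, "start": start}
                data = requests.get(API_URL, params=params, timeout=30).json()
                items = data.get("items", [])
                if not items:
                    break
                for item in items:
                    yield item
                    fetched += 1
                    if fetched >= max_items:
                        return
                start += rows

        # Example: which institutions provide material about Picasso?
        providers = Counter(str(item.get("provider", "unknown")) for item in iter_items("Picasso"))
        print(providers.most_common(10))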

    There are four hackathons coming up, so if you’re interested make sure you sign up quickly:

    Hack4Europe! UK

    9 June 2011, London, hosted by Collections Trust

    Hack4Europe! Spain

    8 – 9 June 2011, Barcelona, hosted by Museu Picasso

    Hack4Europe! Poland

    7 – 8 June 2011, Poznan, hosted by Poznan Supercomputing and Networking Center and Kórnik Library of the Polish Academy of Sciences

    Hack4Europe! Sweden

    10 – 11 June 2011, Stockholm, hosted by Swedish National Heritage Board

    The Valley of the Shadow: Two Communities in the American Civil War


    From wikipedia:

    The Valley of the Shadow is a digital history project hosted by the University of Virginia detailing the experiences of Confederate soldiers from Augusta County, Virginia and Union soldiers from Franklin County, Pennsylvania. It is considered one of the most impressive uses of new technology in representing history. […] The Valley of the Shadows project is a great start to beginning to understand the personal side of the nation’s shared history.

    The website is clearly dated, but I found the non-linear approach to the representation of history quite interesting.

    When we build historical databases we often end up imposing the relational DB ‘way’ of doing things on the historical discipline, even at the visualization level – so, for example, everything gets displayed in tabular formats or similar. Is there an alternative to this? Can we represent the discourse of a discipline more faithfully?


    Bookmarking interesting DH projects..

    I just found out that WordPress makes available a bookmarklet (called Press This) that lets you post to the blog almost instantaneously from your web browser. You can find it by logging into the dashboard and clicking on the ‘Tools’ section.


    I thought we could use it as a handy way to ‘bookmark’ DH projects we find online, for discussion purposes, or even just to build some sort of repository of inspiring DH stuff. What do you think?

    I added a new category ‘DH projects‘ to the blog, and will be adding something I found online straightaway!

    Knowledge Representation workshop @ CCH

    A couple of months ago we started a Knowledge Representation workshop with a few enthusiastic colleagues at CCH. The basic idea is to take a broad perspective on the various topics related to KR, and then focus on the digital humanities to see how these approaches and technologies can best be applied to our domain.

    What is a knowledge representation? Although knowledge representation is one of the central and in some ways most familiar concepts in AI, the most fundamental question about it–What is it?–has rarely been answered directly. Numerous papers have lobbied for one or another variety of representation, other papers have argued for various properties a representation should have, while still others have focused on properties that are important to the notion of representation in general. [continue reading]

    Other than that, the scope of the workshop will remain deliberately unspecified, so that we can decide session by session what topics should be discussed. I’ll be posting the slides and research produced in the context of the workshop on this blog, so others may also be interested in taking part (either physically or electronically!). If you do, please get in touch 🙂

    The slides from our first meeting can be found online on SlideShare.



    Among the TOPICS that emerged as needing more reflection:

  • the OntoClean methodology: need more examples and rationale for each of the meta-principles
  • top level ontologies: is it sensible to aim for having only one? If not, what does a ‘relativist’ position entail?
  • the Cyc project: why didn’t it conquer the world? where were its flaws?
  • ontologizing ‘humanities’ data: is the subject domain posing specific challenges, or not?
  • implementing an ontology: what are the languages/frameworks available? (we mentioned the possibility of inviting an external speaker on this topic some time in the future; a small sketch of one such framework follows this list)
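    As a small, purely illustrative taste of one such framework, here is a sketch using the rdflib Python library to declare a couple of classes, a property and an instance, and serialize them as RDF/OWL in Turtle. The namespace, class and property names are invented for the example; alternatives would include OWL editors such as Protégé or Java toolkits such as the OWL API.

        from rdflib import Graph, Namespace, Literal
        from rdflib.namespace import OWL, RDF, RDFS

        EX = Namespace("http://example.org/kr-workshop/")  # invented namespace

        g = Graph()
        g.bind("ex", EX)
        g.bind("owl", OWL)

        # A tiny, hypothetical ontology fragment: people witnessing charters
        g.add((EX.Person, RDF.type, OWL.Class))
        g.add((EX.Charter, RDF.type, OWL.Class))
        g.add((EX.witnessed, RDF.type, OWL.ObjectProperty))
        g.add((EX.witnessed, RDFS.domain, EX.Person))
        g.add((EX.witnessed, RDFS.range, EX.Charter))

        # ...and one instance-level statement
        g.add((EX.duncan, RDF.type, EX.Person))
        g.add((EX.duncan, RDFS.label, Literal("Duncan, earl of Fife")))
        g.add((EX.duncan, EX.witnessed, EX.charter_42))

        print(g.serialize(format="turtle"))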
    Finally, some useful bibliography:

  • Doug. Ontologies: State of the Art, Business Potential, and Grand Challenges. Ontology Management: Semantic Web, Semantic Web Services, and Business Applications (2007) pp. 1-20
  • Sowa. Knowledge Representation: Logical, Philosophical and Computational Foundations. Course Technology (1999)
  • Niles and Pease. Towards a Standard Upper Ontology. FOIS’01 (2001)
  • Doerr. The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata. AI Magazine archive (2003) vol. 24 (3) pp. 75-92
  • Gangemi et al. Sweetening Ontologies with DOLCE. 13th International Conference on Knowledge Engineering and Knowledge Management (EKAW02) (2002)
  • Smith. Beyond Concepts: Ontology as Reality Representation. Proceedings of FOIS 2004. International Conference on Formal Ontology and Information Systems (2004)
  • Guha and Lenat. Cyc: A Midterm Report. AI Magazine (1990) pp. 1-28
  • Gruber. It Is What It Does: The Pragmatics of Ontology. Invited presentation to the meeting of the CIDOC Conceptual Reference Model committee (2003)
  • Guarino and Welty. Evaluating ontological decisions with OntoClean. Commun. ACM (2002) vol. 45 (2) pp. 61-65
    Stay tuned for future reports!