Monthly Archives: October 2011

Hafed Walda, media presence

The World Today 31/10/2011 06:05–07:00

Live news and current affairs, business and sport from around the world.

An Interview about the Treasure of Benghazi.

Here is the link to the BBC iPlayer (the interview starts at 0:48:45):

http://www.bbc.co.uk/iplayer/console/p00ldwhs

Treasure of Benghazi Stolen in One of the Biggest Heists in …


You SPILt my code: a modular approach to creating web front ends

One of the projects I’ve been working on since starting at DDH in May is a review of the front-end development framework we currently use to build websites: sUPL, the Simple Unified Presentation Layer. The aim of sUPL was to be a lightweight markup scheme — lightweight both in its use of minimal HTML markup and in its short class and ID names (commonly used to apply CSS styles and to trigger Javascript-based interactivity).

Whilst sUPL had served the department well for a number of projects, I wanted to update it to reflect recent changes in the front-end development world and also to put the emphasis back on the “simple” in sUPL. After reading around and trying out a number of existing front-end frameworks (e.g. Blueprint, YUI, HTML5 Boilerplate, OOCSS and 320 and Up), I felt our own framework should be updated along the following lines:

  • Be written in HTML5, to make use of new structural elements and prepare the ground for the use of HTML5 APIs;
  • Move away from terse class names to longer but more “human readable” ones;
  • Employ the Object Oriented CSS (OOCSS) methodology of maximising the reuse of CSS code by only applying CSS styles to classes, not IDs;
  • Use the OOCSS concept of “objects”, that is, reusable chunks of HTML, CSS and Javascript code, to build common design patterns.

Welcome to SPIL: the Simple Presentation and Interface Library

There are quite a few frameworks out there, so why create another one? Most of the existing frameworks have been created either for highly specific purposes (e.g. YUI) or to be deliberately generic (e.g. the HTML5 Boilerplate). sUPL’s successor, SPIL (the Simple Presentation and Interface Library), can be thought of as a toolkit (or Lego!) for constructing web pages and applications, providing both a generic structure for page layout and the ability to “plug in” interface design patterns which work “out of the box”.

HTML5

SPIL makes use of the new HTML5 structural elements such as header, footer, section, nav and aside, loading in the Modernizr Javascript library to provide support for older, less capable browsers such as Internet Explorer before version 9. Of course, relying on Javascript to provide this functionality may not always be appropriate, so SPIL provides some alternative markup in the form of reliable old-fashioned divs should you want to use XHTML 1.0. For instance, if we were marking up a primary navigation element in HTML5 we would use:

<nav class="primary"> … </nav>

But should we want to stick with XHTML we could use:

<div class="nav primary"> … </div>
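(For completeness, the Modernizr include mentioned above is just a script element in the page head. Here is a minimal sketch; the path and filename are illustrative rather than anything SPIL prescribes:)

<head>
  <meta charset="utf-8">
  <title>Example page</title>
  <!-- Illustrative path/filename: use whichever Modernizr build you ship with your project -->
  <script src="js/modernizr.min.js"></script>
</head>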

OOCSS

The development of SPIL has been heavily influenced by Object Oriented CSS (OOCSS) — both the concept and the CSS library. OOCSS encourages the reuse of code in order to enhance performance and keep CSS file size down (approximating the DRY — Don’t Repeat Yourself — principle from software engineering). One way to do this is to style only on classes — which can be used any number of times on a page — and not on IDs — which can be used only once and whose high specificity makes styles harder to override. Class names can be chained together to combine styling effects, reusing predefined styles.
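To make the chaining idea concrete, here is a minimal sketch; the class names below are purely illustrative and not taken from the actual SPIL stylesheet:

.mod { margin-bottom: 1.5em; }                      /* generic module container */
.inline li { display: inline; }                     /* lay list items out in a row */
.boxed { border: 1px solid #ccc; padding: 0.5em; }  /* reusable visual "skin" */

An element marked up as <ul class="inline boxed"> then picks up both the layout and the skin without a single new CSS rule being written.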

A useful concept in OOCSS is that of “modules”. Although SPIL’s implementation differs from that of the OOCSS library, the ideas are very similar. For instance, we can create a module object for a common design pattern, a tabbed display that can be plugged straight into a page template:

<div class="mod tabs">
 <ul class="tabControls inline">
  <li class="tabControlHeader"><a href="#tab1">Tab 1</a></li>
  <li class="tabControlHeader"><a href="#tab2">Tab 2</a></li>
 </ul>
 <div class="tabPanes">
  <section class="tabPaneItem" id="tab1">Tab 1 content</section>
  <section class="tabPaneItem" id="tab2">Tab 2 content</section>
 </div>
</div>

The structure for this module within the identifying div is built around what the jQuery Tools implementation of tabs expects, but the class names could also be applied to other implementations. To use this code with jQuery Tools we would simply include a line in our Javascript file, e.g.:

$(".tabControls").tabs(".tabPanes > section");

An advantage of taking a modularised approach to code is that we can start to build a library of predefined code snippets which can be slotted into place by anyone involved in interface building, from UI designers and programmers wanting to create a functional prototype through to front-end developers working on the final site build.

Development of SPIL

SPIL is being developed iteratively alongside new web projects within DDH. We’re feeding the work straight into an open source project which we hope to release soon. If you have any comments, or if there’s anything you’d like to see in the framework, why not let us know via the comments?

Geocoding your data

In many projects, the collection of data results in a list of items whose distribution can be shown spatially. The process of geocoding assigns a location (or set of locations) to an item of data: perhaps the site of a battle, the source of a text or the home of a notable person. Such visualisations allow for new perspectives on the relationships within the data, spatial or otherwise. A long-winded way of geocoding would be to simply go through the data, record by record, and assign each a set of coordinates using a third-party resource. If the list is short, or if a very precise location is needed, this may be a practical solution; however, it is easy to underestimate how long it takes to go through what may at first appear to be a short list.

Alternatively, data can be captured directly into a Geographic Information System (GIS) such as ArcGIS or the freely available Quantum GIS, ensuring a location point is recorded with each record added to the data set, though this may be impractical in a dark, dusty archive or may not lend itself well to your workflow. Often projects don’t require the sort of pinpoint accuracy that might be needed in a scientific project, and regional or town-level locations are suitable, or even preferable.

In many cases these research records end up as spreadsheets with a single column dedicated to recording location, and given the ease with which data formats can be converted and imported into other systems, this is an efficient way of systematically recording and organising the data.

If you have the know-how, and perhaps some special data requirements, building your own solution for geocoding is fairly straightforward. Given a list of postcodes, for example, you will quickly be able to create a geocoded dataset with a simple data table join. Matching multiple fields, expecting multiple matches and ranking the results is trickier.
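As a minimal sketch of that kind of join, here it is in Javascript for the sake of illustration; the field names and coordinates below are made up:

// A small postcode lookup table (coordinates approximate, for illustration only)
var postcodeLookup = {
  "WC2R 2LS": { lat: 51.511, lng: -0.116 },
  "SW1A 1AA": { lat: 51.501, lng: -0.142 }
};

// The research records, each with a postcode column
var records = [
  { id: 1, place: "Somerset House", postcode: "WC2R 2LS" },
  { id: 2, place: "Buckingham Palace", postcode: "SW1A 1AA" }
];

// The "join": copy coordinates onto any record whose postcode appears in the lookup
for (var i = 0; i < records.length; i++) {
  var match = postcodeLookup[records[i].postcode];
  if (match) {
    records[i].lat = match.lat;
    records[i].lng = match.lng;
  }
}

In practice you might do exactly the same thing with a VLOOKUP in a spreadsheet or a join in a database; the principle is identical.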

Fortunately, there seems to be an ever-increasing number of freely available resources with which to geocode your data. The Ordnance Survey of Great Britain last year released postcode point data and a national gazetteer that can be used to resolve a placename to coordinates. Going beyond the UK, Nominatim is a geocoding service offered by OpenStreetMap, and the Google Maps API also offers geocoding (though this is usually limited to a restricted number of requests per day). Pamela Fox has created a great Google gadget for use with Google Docs that will take your spreadsheet, query the Google geocoding service and return a list of coordinates for you to incorporate into new columns. An especially nice feature is that even when you (inevitably) have a few records left over that couldn’t be matched, they can be physically dragged and dropped onto a familiar Google map and given coordinates that way.
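To give a flavour of what these services look like from code, here is a hedged sketch of a Nominatim query using jQuery. Check Nominatim’s usage policy before sending many requests, and note that calling it directly from a browser page may run into cross-domain restrictions, in which case a server-side script or proxy is the safer route.

// Ask Nominatim for the coordinates of a single place name (a sketch, not production code)
$.getJSON("http://nominatim.openstreetmap.org/search", {
  q: "Lawrence, Kansas",
  format: "json",
  limit: 1
}, function (results) {
  if (results.length > 0) {
    // Each result carries lat, lon and a display_name you can use to sanity-check the match
    console.log(results[0].display_name, results[0].lat, results[0].lon);
  }
});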

The problem with making the best use of these resources is that very little thought is usually given, at the moment of capture, to how location should be recorded so that it can be used programmatically later: the location field tends to be treated as free text rather than as a discrete data type. The spreadsheet column for place or location may contain extraneous words and punctuation which prevent automatic matching. The hierarchy of location information is rarely considered. Sometimes secondary or tertiary candidate locations are recorded in the same column. As a result, using any automatic geocoding process usually requires extensive data cleaning, which must often be done at least partially by hand.
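A tiny example of the kind of cleaning step this implies, again in Javascript; the rules here are illustrative rather than exhaustive:

// Strip the punctuation and stray whitespace that commonly break automatic matching
function cleanLocation(value) {
  return value
    .replace(/[?!()"']/g, "")   // drop punctuation such as the trailing "?" in "Newport?"
    .replace(/\s+/g, " ")       // collapse runs of whitespace
    .replace(/^ | $/g, "");     // trim leading and trailing spaces
}

// cleanLocation("  Newport? (probably) ") returns "Newport probably"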

To avoid this situation, a few simple guidelines should be considered before embarking on a spreadsheet data acquisition that you anticipate may be geocoded.

  1. Always record the best location you can, regardless of your requirements – this will give you far more options for geocoding later on. If you have a postcode, use it; a house number is even better.
  2. Always split the location components across several columns – Don’t mix cities in with villages and colloquial names. Have a hierarchy in mind, split it across several columns and stick to it, e.g. House, Road, Town, County, Country. You don’t have to fill in every field for each record, but keep the schema consistent. Don’t worry about presentational considerations, as these values can be concatenated in another column, and the data will be far easier to manipulate in this form.
  3. Don’t merge several locations in one field – If, as is often the case, there is more than one candidate for a location, record them in separate columns. A GIS technician will be able to associate several points with one record if necessary. If you are worried about the spreadsheet becoming unwieldy, put these columns to the far right of the sheet, as they may not be needed very often.
  4. Avoid abbreviations and colloquial names – They are hard to match up in geocoding exercises.
  5. Avoid punctuation – Question marks and exclamation marks will mess up the match, and even humble commas should be avoided, as many formats use them as column delimiters.
  6. Avoid ambiguity – Add more detail than you might think immediately necessary, and spare a thought for the poor geocoder who, less familiar than you with the dataset, may need to choose from one of the 11 different Newports in the UK or more than 30 worldwide!
  7. Keep it on one line – Other data fields may naturally lend themselves to multiple-row entries, but try to stick to the rule of one row, one record. If you need more space in a cell, turn on word wrapping and make the cell higher and wider.

Following these guidelines will allow you to make the best use of your data when using tools like the Google Docs gadget above. You can try geocoding on different columns, or combinations of columns, or using the best available column in each record.
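As a sketch of what “best available columns” might mean in practice (the column names follow the hierarchy suggested above and are, of course, illustrative):

// Build a geocoding query from whichever location fields a row actually has
function buildQuery(row) {
  var parts = [row.house, row.road, row.town, row.county, row.country];
  var present = [];
  for (var i = 0; i < parts.length; i++) {
    if (parts[i]) {
      present.push(parts[i]);
    }
  }
  return present.join(", ");
}

// buildQuery({ town: "Newport", county: "Shropshire", country: "UK" })
// returns "Newport, Shropshire, UK"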


Tagore digital editions and Bengali textual computing

Professor Sukanta Chaudhuri yesterday gave a very interesting talk on the scope, methods and aims of ‘Bichitra’ (literally, ‘the various’), the ongoing project for an online variorum edition of the complete works of Rabindranath Tagore in English and Bengali. The talk (part of this year’s DDH research seminar) highlighted a number of issues I personally wasn’t very familiar with, so in this post I’m summarising them briefly and then offering a couple of possible suggestions.

Sukanta Chaudhuri is Professor Emeritus at Jadavpur University, Kolkata (Calcutta), where he was formerly Professor of English and Director of the School of Cultural Texts and Records. His core specializations are in Renaissance literature and in textual studies: he published The Metaphysics of Text from Cambridge University Press in 2010. He has also translated widely from Bengali into English, and is General Editor of the Oxford Tagore Translations.

Rabindranath Tagore (1861–1941), the first Nobel laureate of Asia, was arguably the most important icon of the modern Indian Renaissance. This recent project on the electronic collation of Tagore texts, called ‘the Bichitra project’, is being developed as part of the national commemoration of the 150th anniversary of the poet’s birth (here’s the official page). This is how the School of Cultural Texts and Records summarizes the project’s scope:

The School is carrying out pioneer work in computer collation of Tagore texts and creation of electronic hypertexts incorporating all variant readings. The first software for this purpose in any Indian language, named “Pathantar” (based on the earlier version “Tafat”), has been developed by the School. Two pilot projects have been carried out using this software, for the play Bisarjan (Sacrifice) and the poetical collection Sonar Tari (The Golden Boat). The CD/DVDs contain all text files of all significant variant versions in manuscript and print, and their collation using the “Pathantar” software. The DVD of Sonar Tari also contains image files of all the variant versions. These productions are the first output of the series “Jadavpur Electronic Tagore”.
Progressing from these early endeavours, we have now undertaken a two-year project entitled “Bichitra” for a complete electronic variorum edition of all Tagore’s works in English and Bengali. The project is funded by the Ministry of Culture, Government of India, and is being conducted in collaboration with Rabindra-Bhavana, Santiniketan. The target is to create a website which will contain (a) images of all significant variant versions, in manuscript and print, of all Tagore’s works; (b) text files of the same; and (c) collation of all versions applying the “Pathantar” software. To this end, the software itself is being radically redesigned. Simultaneously, manuscript and print material is being obtained and processed from Rabindra-Bhavana, downloaded from various online databases, and acquired from other sources. Work on the project commenced in March 2011 and is expected to end in March 2013, by which time the entire output will be uploaded onto a freely accessible website.

A few interesting points

  • Tagore, as Sukanta noted, “wrote voluminously and revised extensively”. From a DH point of view this means that creating a comprehensive digital edition of his works would require a lot of effort – much more than we could easily pay people for if we wanted to mark up all of this text manually. For this reason it is essential to find some kind of semi-automatic method for aligning and collating Tagore’s texts, e.g. the “Pathantar” software. A screenshot of the current collation interface follows.

    [Screenshot: the current collation interface]

  • The Bengali language, which Tagore used, is widely spoken in the world (it is actually one of the most widely spoken languages, with nearly 300 million speakers in total). However, this language poses serious problems for a DH project. In particular, the writing system is extremely difficult to parse using traditional OCR technologies: its vowel graphemes are mainly realized not as independent letters but as diacritics attached to its consonant letters. Furthermore, clusters of consonants are represented by different and sometimes quite irregular forms, so learning to read the script is complicated by the sheer size of the full set of letters and letter combinations, numbering about 350 (from Wikipedia).
  • One of the critical points that emerged during the discussion had to do with the visual presentation of the results of the collation software. Given the large volume of text editions they’re dealing with, and the potentially vast number of variations between one edition and the others, a powerful and interactive visualization mechanism seems to be strongly needed. However, it’s not clear what the possible approaches on this front might be.
  • Textual computing, Sukanta pointed out, is not as developed in India as it is in the rest of the world. As a consequence, in the context of the “Bichitra” project, widely used approaches based on TEI and XML technologies haven’t really been investigated enough. The collation software mentioned above obviously marks up the text in some way; however, this markup remains hidden from the user and most likely is not compatible with other standards. More work would thus be desirable in this area – in particular within the Indian subcontinent.

Food for thought

  • On the visualization of the results of a collation. Some inspiration could be found in the type of visualizations normally used in version control systems, where multiple, alternative versions of the same file must be tracked and shown to users. For example, we could think of the visualizations available on GitHub (a popular code-sharing site), which are described in this blog post and demonstrated via an interactive tool on this webpage. Here’s a screenshot:

    [Screenshot: GitHub code visualization]

    The situation is strikingly similar – or is it? Would it be feasible to reuse one of these approaches with textual sources?
    Another relevant visualization is the one used by popular file-comparison software (e.g. FileMerge on a Mac) for showing differences between two files:

    [Screenshot: FileMerge code visualization]

  • On using language technologies with Bengali. I did a quick tour of what’s available online and (quite unsurprisingly, considering the reputation Indian computer scientists have) found several research papers which seem highly relevant. Here are a few of them:
    – Asian language processing: current state-of-the-art [text]
    – Research report on Bengali NLP engine for TTS [text]
    – The EMILLE corpus, containing fourteen monolingual corpora, including both written and (for some languages) spoken data, for fourteen South Asian languages [homepage]
    – A complete OCR system for continuous Bengali characters [text]
    – Parsing Bengali for Database Interface [text]
    – Unsupervised Morphological Parsing of Bengali [text]
  • On open-source software that appears to be usable with Bengali text. Not a lot of stuff, but more than enough to get started (the second project in particular seems pretty serious):
    – Open Bangla OCR, a BDOSDN (Bangladesh Open Source Development Network) project to develop a Bangla OCR
    – Bangla OCR, a project mainly focused on the research and development of an Optical Character Recognizer for Bangla / Bengali script

Any comments and/or ideas?

Event: THATcamp Kansas and Digital Humanities Forum

The THATcamp Kansas and Digital Humanities Forum happened last week at the Institute for Digital Research in the Humanities, which is part of the University of Kansas in beautiful Lawrence. I had the opportunity to be there and give a talk about some recent work I’ve been doing on digital prosopography and computer ontologies, so in this blog post I’m summing up the things that caught my attention while at the conference.

The event took place on September 22–24 and consisted of three separate things:

  • Bootcamp Workshops: a set of in-depth workshops on digital tools and other DH topics http://kansas2011.thatcamp.org/bootcamps/.
  • THATCamp: an “unconference” for technologists and humanists http://kansas2011.thatcamp.org/.
  • Representing Knowledge in the DH conference: a one-day program of panels and poster sessions (schedule | abstracts).

The workshops and THATCamp were both packed with interesting stuff, so I strongly suggest you take a look at the online documentation, which is very comprehensive. In what follows I’ll instead highlight some of the contributed papers which a) I liked and b) I was able to attend (needless to say, this list reflects only my individual preferences and interests). Hope you’ll find something of interest there too!

A (quite subjective) list of interesting papers


  • The Graphic Visualization of XML Documents, by David Birnbaum (abstract): a quite inspiring example of how to employ visualizations to support philological research in the humanities. It is mostly focused on Russian texts and XML-oriented technologies, but its principles are easily generalizable to other contexts and technologies.
  • Exploring Issues at the Intersection of Humanities and Computing with LADL, by Gregory Aist (abstract): the talk presented LADL, the Learning Activity Description Language, a fascinating software environment that provides a way to “describe both the information structure and the interaction structure of an interactive experience”, for the purpose of “constructing a single interactive Web page that allows for viewing and comparing of multiple source documents together with online tools”.
  • Making the most of free, unrestricted texts – a first look at the promise of the Text Creation Partnership, by Rebecca Welzenbach (abstract): an interesting report on the pros and cons of making available a large repository of SGML/XML encoded texts from the Eighteenth Century Collections Online (ECCO) corpus.
  • The hermeneutics of data representation, by Michael Sperberg-McQueen (abstract): a speculative and challenging investigation of the assumptions at the root of any machine-readable representation of knowledge – and their cultural implications.
  • Breaking the Historian’s Code: Finding Patterns of Historical Representation, by Ryan Shaw (abstract): an investigation of the use of natural language processing techniques to ‘break down’ the ‘code’ of historical narrative. The sets of documents used relate to the civil rights movement, and the specific NLP techniques employed are named entity recognition, event extraction, and event chain mining.
  • Employing Geospatial Genealogy to Reveal Residential and Kinship Patterns in a Pre-Holocaust Ukrainian Village, by Stephen Egbert (abstract): this paper showed how residential and kinship patterns in the mixed-ethnic settlements of pre-Holocaust Eastern Europe can be visualized using geographic information systems (GIS), and how these results can provide useful materials for humanists to base their work on.
  • Prosopography and Computer Ontologies: towards a formal representation of the ‘factoid’ model by means of CIDOC-CRM, by me and John Bradley (abstract): this is the paper I presented (shameless self-plug, I know). It’s about the evolution of structured prosopography (the ‘study of people’ in history) from a mostly single-application, database-oriented scenario towards a more interoperable, linked-data one. In particular, I talked about the recent efforts to represent the notion of ‘factoids’ (a conceptual model normally used in our prosopographies) using the ontological language provided by CIDOC-CRM (a computational ontology commonly used in the museum community).

That’s all! Many thanks to Arienne Dwyer and Brian Rosenblum for organizing the event!


P.S.
A copy of this article has been posted here too.