Digital Humanities Software Developers Workshop

This post was co-edited by Geoffroy Noel and Miguel Vieira.

The first workshop for Digital Humanities Software Developers took place at the University of Cologne from the 28th to the 29th of November. The main aim of the workshop was to bring together DH developers, discuss ideas, and work collaboratively on projects. Around 40 participants attended the workshop, the large majority of them developers, along with some researchers.

First day

On the first day there were short, five-minute presentations from each participant and ten longer project presentations.

Short presentations

Some presentations were more formal or personal; others took the form of a brief live demonstration of original editorial tools. This first round clearly showed that many people have developed tools separately, and from scratch, to address a common set of core functionality. The most popular functionality was a user interface to link and annotate regions of a manuscript facsimile to parts of a marked-up text. However, the tools differ in the technology they are based on and in the approach they take to address those common needs. Most editorial interfaces are web-based; others are built on existing editors such as Eclipse. TEI was also widely supported among the presented tools.

Long presentations

  1. CATMA – Computer Aided Textual Markup & Analysis
  2. CollateX (API doc)
  3. TextGrid (demo)
  4. DigiLib (documentation)
  5. TILE – Text-image linking environment
  6. OAC – Open Annotation Collaboration
  7. xMod/Kiln
  8. SADE – Scalable Architecture for Digital Editions
  9. MOM-CA
  10. Interedition

General discussion

The long presentations were followed by a general discussion. Although no specific topic or structure was imposed, many fundamental questions and issues were raised with regard to application development for the digital humanities.

Among the problems most commonly faced by the participants was the lack of recognition of the value of DH editorial tools for their own sake, separately from the projects they were originally built for. Most of the tools presented that day have survived only through the funding and work effort allocated to the projects they support. This type of evolution imposes constraints on the kind of development and functionality that can be carried out, and on their priority. Reusability saves a lot of time on common technical requirements and hence allows a DH project to concentrate on higher-level or more advanced functionality. But the ingredients of reusability, such as sustainability (what happens to the tool after the project ends?), documentation, generalisation of the problem definition, extensibility, modularity or parametrisation, typically go beyond the needs of a single, particular application.

A common trait seems to be that most participants don't have enough time to work on tools, since they spend most of their time on project-specific work.

Second day

On the second day focus groups were created around the following topics:

  • Publication frameworks
  • Text-Image linking
  • Usability
  • Micro-services
  • DH developers community planning
  • Visualization
  • Backends, databases, data modeling and processing
  • Standoff markup, open annotation/linked data, annotation models

Publishing frameworks

This group discussion started with the goal of describing what a generic publishing framework that could be used by all would do, and how it could be developed. Given that there were already several existing tools on the table, no one seemed willing to let go of their own tool in favour of one that would be built by the community. The main reasons were the time already invested in the existing tools and the fact that they are custom-built to solve specific problems. Some of the developers were not even willing to try the other publishing frameworks. Given these constraints, the discussion focused more on how to share experiences, and on the idea that we should modularize the frameworks' components in order to make them reusable by others, as sketched below.
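
Purely as an illustration of what such modularization might look like (the interface and class names below are hypothetical, not taken from any of the frameworks mentioned above), a framework component could be reduced to a small, documented interface that other projects can slot into their own pipelines:

    # Hypothetical sketch: a minimal component interface that a publishing
    # framework could expose so that its parts can be reused elsewhere.
    from abc import ABC, abstractmethod

    class Component(ABC):
        """A reusable processing step: takes a document, returns a document."""

        @abstractmethod
        def process(self, document: str) -> str:
            ...

    class TeiToHtml(Component):
        """Placeholder transformation; a real component might run XSLT."""

        def process(self, document: str) -> str:
            return document.replace("<tei:", "<").replace("</tei:", "</")

    def run_pipeline(document: str, components: list[Component]) -> str:
        for component in components:
            document = component.process(document)
        return document

    print(run_pipeline("<tei:p>Hello</tei:p>", [TeiToHtml()]))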

Text-Image linking

Given the number of projects which offer tools to let a user draw a shape over an image and link it to a piece of text, this group was the largest of all. A more detailed demonstration of the tools by each participant showed a large variety in the user interfaces and their respective features. The majority of the tools display a facsimile of a manuscript side by side with its transcription as an XML document. The linking usually works by drawing a shape rendered with SVG and linking it to a division of the XML document (see the sketch below). Rectangles are always supported, but some tools also let you draw ovals and polygons, or rotate the shapes. Instead of an XML document, TILE works with free-text annotation of the regions or a pre-loaded plain-text transcription.
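
To make the linking mechanism concrete, here is a minimal sketch (all identifiers and coordinates are invented for illustration) of a rectangular SVG region over a facsimile pointing at a division of the transcription:

    # Minimal sketch of an image-text link: an SVG rectangle over a facsimile
    # page, pointing at a division of the transcription. All ids are invented.
    import xml.etree.ElementTree as ET

    SVG_NS = "http://www.w3.org/2000/svg"

    link = ET.Element("link", {"target": "#div-1"})  # the linked text division
    ET.SubElement(link, f"{{{SVG_NS}}}rect", {
        "x": "120", "y": "340", "width": "410", "height": "55",  # pixel coords
    })
    print(ET.tostring(link, encoding="unicode"))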

The most fundamental differences among the solutions are the formats they use to represent the text, the regions and the links. Some store the data as JSON, others represent it as XML or TEI, or a combination of TEI and SVG. The question of whether it would be desirable to join development efforts to build a single system that would eventually replace the others was met with many reservations. People don't want to simply abandon applications they have spent months or years building; those applications are already well integrated within project-specific workflows and architectures, and each tool offers functionality that the others don't have. It is interesting to note that the same question was raised independently within other discussion groups and met with the same conclusions. In all cases the suggested alternative was to first design a common data model, representation and exchange format for the information produced or manipulated by those tools, then publish it as a standard within the community. In a second stage each tool would be modified to support this standard format. The main advantage is that users are not locked into a single system; rather, they can use multiple specialised tools to process or enrich the same dataset.
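
As an illustration only, since the workshop did not settle on a concrete schema, such a common model might boil down to a couple of record types that every tool could import and export. All field names below are assumptions, not an agreed standard:

    # Illustrative sketch of a shared interchange model for image-text links.
    import json
    from dataclasses import dataclass, asdict

    @dataclass
    class Region:
        id: str
        image: str           # URL or identifier of the facsimile image
        shape: str           # "rect", "oval", "polygon", ...
        points: list[float]  # coordinates, ideally scale-free (see below)

    @dataclass
    class Link:
        region_id: str
        text_ref: str        # e.g. an xml:id or XPath into the TEI document

    region = Region("r1", "page1.jpg", "rect", [0.12, 0.34, 0.41, 0.05])
    link = Link("r1", "#div-1")
    print(json.dumps({"regions": [asdict(region)],
                      "links": [asdict(link)]}, indent=2))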

As part of this standardisation effort it should also be possible to share only part of the information among systems, for instance exporting and importing just the shapes without the text, the links or the annotations. The units for describing the coordinates of the regions should also be interoperable and scale-free. We took a look at the TextGrid data model for linking images: it is built on existing standards (SVG embedded in TEI), offers a consistent representation and a good separation of the three components (regions, text, and the links between them).
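
One simple way to keep coordinates scale-free is to store them as fractions of the image dimensions rather than as pixels, so regions survive resizing of the facsimile. A minimal sketch, with invented numbers:

    # Sketch: converting pixel coordinates to scale-free fractions of the
    # image size. Purely illustrative values.
    def to_scale_free(x, y, w, h, image_width, image_height):
        return (x / image_width, y / image_height,
                w / image_width, h / image_height)

    # A 410x55 px rectangle at (120, 340) on a 3400x4400 px scan:
    print(to_scale_free(120, 340, 410, 55, 3400, 4400))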

Micro-services

The discussions revolved around the definition of microservices, the technical challenges faced by the applications consuming the services, and the possibility of generalising the organisation of services for common editorial tasks. The use of microservices for text collation in the context of the Interedition project was presented as a proof of concept (http://www.interedition.eu/wiki/index.php/About_microservices). Parallels were observed between the microservices model and the way text inputs can be passed through a chain of independent, task-specific tools in a Linux environment. This parallel, and concerns about the dependency on the performance, reliability and long-term availability of web services hosted by external institutions, led us to wonder whether microservices should necessarily be remote HTTP services or could also include locally deployable code.
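
One way to keep that choice open is to hide it behind a single interface. The sketch below is an assumption-laden illustration (the URL and the JSON shape are invented, not the actual Interedition API): it tries a remote collation service and falls back to locally deployed code:

    # Sketch: a collation step that can run either as a remote HTTP
    # microservice or as locally installed code. URL and payload invented.
    import json
    import urllib.request

    def collate_remote(witnesses, url="https://example.org/collate"):
        payload = json.dumps({"witnesses": witnesses}).encode("utf-8")
        request = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.loads(response.read())

    def collate_local(witnesses):
        # Stand-in for a locally deployed collation library.
        return {"table": [witness.split() for witness in witnesses]}

    def collate(witnesses):
        try:
            return collate_remote(witnesses)
        except OSError:                    # service down, slow or unreachable
            return collate_local(witnesses)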

Platform-independent processes using standard data formats and communication protocols to perform a well-defined and very specific task are then the essence of microservices. They constitute a response to the need of some participants to build their own editorial environment from existing, modular and reusable services without having to redevelop everything from scratch or be tied to a single pre-existing solution. This implies that the same microservices are sufficiently universal that they can readily be called from different project-specific workflows. This is fine for simple input formats like plain text, but in practice the requirement is difficult to meet for services which accept or produce marked-up documents; and since services are meant to be used in sequence, like a pipeline, it is difficult not to make assumptions about the project-specific nature of the documents they manipulate. The issue is that, although the atomic processes in different projects might be identical, their manipulation of the information is not fully independent from the specific data model of the project they belong to.
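
The pipeline analogy is easy to make concrete: as long as every step consumes and produces the same simple format (plain text in this hedged sketch, where each function stands in for a microservice), the steps can be recombined freely:

    # Sketch: chaining task-specific steps like a Unix pipeline. Each
    # function stands in for a microservice taking and returning plain text.
    from functools import reduce

    def normalise(text: str) -> str:
        return " ".join(text.lower().split())

    def strip_punctuation(text: str) -> str:
        return "".join(c for c in text if c.isalnum() or c.isspace())

    def tokenise(text: str) -> str:
        return "\n".join(text.split())

    pipeline = [normalise, strip_punctuation, tokenise]
    print(reduce(lambda text, step: step(text), pipeline,
                 "Lorem, Ipsum dolor!"))  # one token per line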

Workflows using services are often built by starting from what is needed for a project and either looking for already available services that could fit in or building new ones. This is the bottom-up approach, where services emerge from specific needs and are opportunistically reused afterwards by other projects. The group discussed the need for the opposite approach, where a more abstract and generic set of DH editorial processes would be defined, broken down into modular tasks implemented by microservices, and categorised. New projects could then adapt this workflow to their requirements and benefit from a more coherent and reusable set of services. The feasibility of this approach was met with scepticism, due to the common claim that each project is unique (the argument of the particularism of each project, and how it sometimes hinders the search for reusable DH tools, was brought up multiple times during the workshop) and also to the risk of scholars not agreeing on a single, cross-project conceptualisation of what the digitised material and editorial processes are.

Another topic discussed by this group was the apparent lack of expertise among its members in building workflows that deal with asynchronous services. Synchronous requests are limited to small inputs or computationally cheap processes; anything beyond that causes a synchronous workflow to break because of excessive delays. This is even more relevant for front-end interactive interfaces, where the tolerance for delays is even lower. Working with asynchronous requests solves those problems but requires different development skills, software architectures and workflow management middleware.
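
A common pattern here, sketched below with invented endpoints, is to submit a job, receive an identifier immediately, and then poll for the result instead of blocking on one long synchronous call:

    # Sketch: the submit-then-poll pattern for a long-running service, so a
    # client never blocks on one slow synchronous request. Endpoints invented.
    import json
    import time
    import urllib.request

    def post_json(url, payload):
        request = urllib.request.Request(
            url, data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(request, timeout=10) as response:
            return json.loads(response.read())

    def get_json(url):
        with urllib.request.urlopen(url, timeout=10) as response:
            return json.loads(response.read())

    def collate_async(witnesses, base="https://example.org"):
        job = post_json(base + "/jobs",
                        {"task": "collate", "witnesses": witnesses})
        while True:
            status = get_json(base + "/jobs/" + job["id"])  # poll the job
            if status["state"] == "done":
                return status["result"]
            time.sleep(2)  # back off between polls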

DH developers community planning

This group focused on how to build and maintain a DH developer community. Interesting points of discussion included which communication tools to use, how to ensure that similar events can take place in the future, and how to gain recognition from our scholarly peers. Since there is already a big community built around Interedition, it was decided that we would use their wiki as a communication and dissemination tool. Future DH developer workshops should be integrated with academic conferences to ensure visibility and sustainability.

To ensure the future of the DH developer community it was decided that we would act on the following:

  • Create a list of themes underpinning a continued bootcamp series
  • Create and broadcast Forum/group/mailing list, IRC channel
    • http://lists.unc.edu/read/all_forums/subscribe?name=dh_developers
    • https://dhs.stanford.edu/algorithmic-literacy/digital-humanities-developers-mailing-list/
    • #interedition on freenode.net
  • Create a clear one paragraph statement about what the DH developer community is about
  • Create a list of concrete motivations about why community building is important
  • Everyone should contact his/her home institution to commit to supporting the effort
  • Next bootcamps: guerilla tactics for organizing bootcamps for as long as there is no sustained funding
  • Create a list of the bigger infrastructure projects
    • Approach them to ask if they are interested in hosting a bootcamp
  • Create a list of potential local bootcamp organisers

General discussion

The workshop ended with a general discussion about the future of the community of digital humanities developers. One of the most urgent and recurring demands was to establish a web-based platform to keep the community of DH developers alive: a centralised repository of descriptions of, and pointers to, the tools we build (the Interedition wiki was suggested as a place for sharing knowledge) and communication channels for keeping up to date and sharing ideas (topic-specific mailing lists were proposed).
