
Dataset refining and simplified queries

A current project to better understand the evolution of the Celtic language is making use of interactive maps that bring together many different data sets for comparison.

The project is refining existing data, much of which exists in flat tabular form, into a more scalable relational database that can be used to perform more complex queries than were previously possible.

For instance, one data set contains a list of burial sites with details including the nature of the burial, the type of pottery and other artefacts found in association with the individuals at the site, and, where possible, the gender and approximate age of each individual.

When these data were originally collected, a spreadsheet was used to classify each site, but the entries often followed a free text format, and it was perhaps not anticipated that this form of recording would be hard to use programmatically in the future. For instance, a typical summary description of the remains found at a site, recorded in a spreadsheet cell, might be:

“adult male burial asso with adult female and baby”.

Whilst this is descriptive and easily understood by a human, it is hard to make use of in a structured query. For this reason, much effort has been expended to normalise the spreadsheet data into a flexible relational database form. In the previous example, whilst we still retain one site record, we link this site to three separate buried individuals. Each individual in turn has its own strictly controlled categorisations:

Individual 1, Adult Male

Individual 2, Adult Female

Individual 3, Neonate

Each of these individuals may have descriptions of the manner in which they were interred: for example, the orientation of the long axis of the body, the position of the head, or the side of the body it is lying on. Additionally, grave goods are associated with the correct individual to aid interpretation. Wherever possible, all such free text fields have been converted into discrete lists of valid options.
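
As a rough illustration of what this normalisation looks like in practice, here is a minimal sketch of such a schema in SQLite; the table and column names are illustrative rather than the project's actual design.

```python
import sqlite3

# Illustrative schema only: table and column names are assumptions,
# not the project's actual database design.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE site (
    site_id INTEGER PRIMARY KEY,
    name    TEXT
);

CREATE TABLE site_feature (
    site_id INTEGER REFERENCES site(site_id),
    feature TEXT                  -- e.g. 'Cairn', 'Pit', 'Cist'
);

CREATE TABLE individual (
    individual_id INTEGER PRIMARY KEY,
    site_id       INTEGER REFERENCES site(site_id),
    age_category  TEXT CHECK (age_category IN ('Adult','Juvenile','Neonate')),
    sex           TEXT CHECK (sex IN ('Male','Female','Unknown')),
    orientation   TEXT            -- e.g. 'NW-SE', from a controlled list
);

CREATE TABLE grave_good (
    individual_id INTEGER REFERENCES individual(individual_id),
    category      TEXT            -- e.g. 'Pottery', from a controlled list
);
""")

# The free text cell "adult male burial asso with adult female and baby"
# becomes one site row linked to three individual rows.
conn.execute("INSERT INTO site (site_id, name) VALUES (1, 'Example site')")
conn.executemany(
    "INSERT INTO individual (site_id, age_category, sex) VALUES (?, ?, ?)",
    [(1, 'Adult', 'Male'), (1, 'Adult', 'Female'), (1, 'Neonate', 'Unknown')],
)
```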

This structure allows us to use queries like ‘Return all burial sites where there are more than five individuals, all male, regardless of age, found with a Cairn, but not a Pit, where at least one individual is oriented NW-SE’.

This is an admittedly complex example and one that is possibly unlikely to yield results, but it demonstrates the sort of complexity available.

If this sort of query were expressed in SQL, it would be verbose and probably unintelligible to a non-technically minded researcher.
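
To give a flavour of that verbosity, the query above might look roughly like this against the illustrative schema sketched earlier (a sketch, not the project's real query):

```python
# Continuing the SQLite sketch above; verbose, and not something most
# researchers would want to write by hand.
query = """
SELECT s.site_id
FROM site s
JOIN individual i ON i.site_id = s.site_id
WHERE EXISTS (SELECT 1 FROM site_feature f
              WHERE f.site_id = s.site_id AND f.feature = 'Cairn')
  AND NOT EXISTS (SELECT 1 FROM site_feature f
                  WHERE f.site_id = s.site_id AND f.feature = 'Pit')
GROUP BY s.site_id
HAVING COUNT(i.individual_id) > 5                          -- more than five
   AND SUM(CASE WHEN i.sex = 'Male' THEN 0 ELSE 1 END) = 0 -- all male
   AND SUM(CASE WHEN i.orientation = 'NW-SE'
                THEN 1 ELSE 0 END) >= 1                    -- at least one NW-SE
"""
sites = conn.execute(query).fetchall()
```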

To help researchers assemble similar queries, the data sets will be carefully pre-indexed to expose the most useful facets of the data. Each record may exhibit any number of the indexed facets, or none at all. These options are presented in simple checkbox lists that include or exclude particular facets, continually narrowing the results.

This screen grab from a functional prototype shows a user requesting burials from the database that are of type “Cist”, have been “Disturbed” and DO NOT include individuals aligned “East to West”.

[Screen grab: facet selection checkboxes in the prototype]
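
Under the hood, this kind of include/exclude narrowing can be as simple as pre-computing, for each facet, the set of record IDs that exhibit it, then intersecting and subtracting those sets as checkboxes are ticked. A minimal sketch with invented data, mirroring the example above:

```python
# Hypothetical pre-built facet index: facet name -> set of record IDs.
facet_index = {
    "type:Cist":  {1, 2, 5, 8},
    "Disturbed":  {2, 3, 5, 9},
    "orient:E-W": {5, 7},
}

def filter_records(all_ids, include, exclude):
    """Narrow all_ids to records exhibiting every included facet
    and none of the excluded ones."""
    result = set(all_ids)
    for facet in include:
        result &= facet_index.get(facet, set())
    for facet in exclude:
        result -= facet_index.get(facet, set())
    return result

# "Cist" and "Disturbed" ticked for inclusion, "East to West" excluded:
print(filter_records({1, 2, 3, 4, 5, 6, 7, 8, 9},
                     include=["type:Cist", "Disturbed"],
                     exclude=["orient:E-W"]))   # -> {2}
```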

Although these results could be displayed to the user in tabular form, researchers are presumably more interested in their geographic distribution, so the results are also mapped to show where these similar burials lie.

A common problem when the result set is large, or particularly concentrated in a small area, is that map markers tend to overlay one another: the relative density of clusters cannot easily be seen, nor individual markers easily identified. To deal with this, the mapped results are clustered and styled so that, at a glance, the researcher can see where her results are concentrated. Intensity of colour reflects an increasing count in these two example screen grabs (the second being the same result set as the first at a higher zoom level). A useful side effect of this approach is that the web browser doesn't have to render too many markers, which can impact performance.

[Two screen grabs: the same clustered result set at two zoom levels]
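
Most mapping libraries provide marker clustering out of the box, but the idea is simple enough to sketch: snap each point to a grid cell sized to the current zoom level, count the points per cell, and style each cluster marker by its count. A rough illustration, with arbitrary coordinates and grid sizes:

```python
from collections import Counter

def cluster(points, cell_size):
    """Group (lon, lat) points into grid cells and count them;
    a smaller cell_size corresponds to a higher zoom level."""
    counts = Counter(
        (round(lon / cell_size), round(lat / cell_size))
        for lon, lat in points
    )
    # One marker per cell: its snapped position and the number of points,
    # which the map layer can translate into colour intensity.
    return [
        (cx * cell_size, cy * cell_size, n)
        for (cx, cy), n in counts.items()
    ]

points = [(-3.21, 55.95), (-3.20, 55.94), (-3.19, 55.96), (-1.61, 54.97)]
print(cluster(points, cell_size=0.5))    # coarse zoom: two clusters
print(cluster(points, cell_size=0.01))   # finer zoom: the cluster breaks apart
```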

Geocoding your data

In many projects, the collection of data results in a list of items whose distribution can be shown spatially. The process of geocoding assigns a location (or set of locations) to an item of data: perhaps the site of a battle, the source of a text, or the home of a notable person. Such visualisations allow for new perspectives on the relationships within the data, spatial or otherwise. A long-winded way of geocoding would be to simply go through the data, record by record, and assign each a set of coordinates using a third-party resource. If the list is short, or if a very precise location is needed, this may be a practical solution; however, it is easy to underestimate how long it takes to go through what may at first appear to be a short list.

Alternatively, data can be captured directly into a Geographic Information System (GIS) such as ArcGIS or the freely available Quantum GIS, ensuring a location point is recorded with each record added to the data set, though this may be impractical in a dark, dusty archive or may not lend itself well to your workflow. Often projects don't require the pinpoint accuracy that might be needed in a scientific project, and regional or town-level locations are suitable, even preferable.

In many cases, these research records end up as spreadsheets with a single column dedicated to recording location. Given the ease with which data formats can be converted and imported into other systems, this is an efficient way of systematically recording and organising the data.

If you have the know-how, and perhaps some special data requirements, building your own solution for geocoding is fairly straightforward. Given a list of postcodes, for example, you will quickly be able to create a geocoded dataset with a simple data table join. Matching on multiple fields, expecting multiple matches, and ranking the results is more tricky.
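
For instance, a sketch of that postcode join using pandas; the lookup table here is invented, but in practice it would come from a resource such as the Ordnance Survey data mentioned below:

```python
import pandas as pd

# Invented example records and an invented postcode lookup table.
records = pd.DataFrame({
    "site":     ["Battle A", "Archive B"],
    "postcode": ["EH1 1YZ", "NE1 7RU"],
})
lookup = pd.DataFrame({
    "postcode": ["EH1 1YZ", "NE1 7RU"],
    "easting":  [325900, 424600],
    "northing": [673600, 564300],
})

# A simple table join geocodes every record whose postcode matches exactly.
geocoded = records.merge(lookup, on="postcode", how="left")
print(geocoded)
```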

Fortunately, there is an ever-increasing number of freely available resources with which to geocode your data. The Ordnance Survey of Great Britain last year released postcode point data and a national gazetteer that can be used to resolve a placename to coordinates. Going beyond the UK, Nominatim is a geolocation service offered by OpenStreetMap, and the Google Maps API also offers geocoding (though this is usually limited to a restricted number of requests per day). Pamela Fox has created a great Google gadget for use with Google Docs that will take your spreadsheet, query the Google geocoding service, and return a list of coordinates for you to incorporate into new columns. An especially nice feature is that even when you (inevitably) have a few records left over that couldn't be matched, they can be physically dragged and dropped on a familiar Google map, and these too are given coordinates.
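
A minimal sketch of calling Nominatim from Python (no rate limiting or error handling; Nominatim's usage policy also asks for an identifying User-Agent and modest request rates):

```python
import requests

def geocode(place):
    """Resolve a place name to (lat, lon) via OpenStreetMap's Nominatim."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": place, "format": "json", "limit": 1},
        headers={"User-Agent": "my-geocoding-script"},  # identify yourself
        timeout=10,
    )
    results = resp.json()
    if results:
        return float(results[0]["lat"]), float(results[0]["lon"])
    return None  # left over for manual matching

print(geocode("Newport, Wales"))
```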

The problem with making the best use of these resources is that very little thought is usually given, at the moment of capture, to how location is recorded so that it can be used programmatically at a later stage; the location field is typically treated as free text rather than as a discrete data type. The spreadsheet column for place or location may contain extraneous words and punctuation which prevent automatic matching. The hierarchy of location information is rarely considered. Sometimes secondary or tertiary candidate locations are recorded in the same column. In order to use any automatic geocoding process, there is usually a need for extensive data cleaning, which must often be done at least partially by hand.
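
Much of that cleaning is mundane string work. A small sketch of the kind of normalisation typically needed before automatic matching stands a chance; the patterns here are examples only, and every dataset will need its own list:

```python
import re

def clean_location(raw):
    """Normalise a free text location cell before automatic matching."""
    text = raw.strip()
    text = re.sub(r'[()\[\]?!"]', "", text)                   # stray punctuation
    text = re.sub(r"\b(nr|near|poss(ibly)?)\b\.?", "", text,  # hedging words
                  flags=re.IGNORECASE)
    text = re.sub(r"\s{2,}", " ", text).strip(" ,;.")
    return text

print(clean_location("nr. Newport?? (poss.)"))   # -> "Newport"
```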

To avoid this situation, a few simple guidelines should be considered before embarking on a spreadsheet data acquisition that you anticipate may be geocoded.

  1. Always record the best location you can, regardless of your requirements – this will give you far more options for geocoding later on. If you have a postcode, use it; a house number is even better
  2. Always split the location components across several columns – don't mix cities with villages and colloquial names. Have a hierarchy in mind, split it across several columns, and stick to it, e.g. House, Road, Town, County, Country. You don't have to fill in every field for each record, but keep the schema consistent. Don't worry about presentational considerations, as these values can be concatenated in another column; the data will be far more easily manipulable in this form
  3. Don't merge several locations in one field – if, as is often the case, there is more than one candidate for a location, record them in separate columns. A GIS technician will be able to associate several points with one record if necessary. If you are worried about the spreadsheet becoming unwieldy, put these columns at the far right of the sheet, as they may not be needed very often
  4. Avoid abbreviations and colloquial names – they are hard to match up in geocoding exercises
  5. Avoid punctuation – question marks and exclamation marks will mess up the match, and even humble commas should be avoided, as many formats use them as column delimiters
  6. Avoid ambiguity – Add more detail than you might think immediately necessary and spare a thought for the poor geocoder who, less familiar than yourself with the dataset, may need to choose from one of the 11 different Newports in the UK or more than 30 worldwide!
  7. Keep it on one line – Other data fields may naturally lend themselves to multiple row entries, but try to stick to the rule of one row, one record. If you need more space in a cell, turn on word wrapping and make the cell higher and wider.

Following these guidelines will allow you to make the best use of your data when using tools like the Google Docs gadget above. You can try geocoding on different columns, on combinations of columns, or using the best available column in each record.
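
As a final sketch, geocoding on the best available column might work by falling back through the hierarchy from guideline 2, combining each field with the more general ones for context (this reuses the hypothetical geocode() function from the Nominatim sketch above):

```python
def geocode_best(record, fields=("house", "road", "town", "county", "country")):
    """Try increasingly general location columns until one geocodes."""
    for i, field in enumerate(fields):
        if not record.get(field):
            continue
        # Build a query from this field plus everything more general than it.
        query = ", ".join(record[f] for f in fields[i:] if record.get(f))
        location = geocode(query)   # assumes the Nominatim sketch is in scope
        if location:
            return location
    return None

record = {"road": "High Street", "town": "Newport",
          "county": "Gwent", "country": "Wales"}
print(geocode_best(record))
```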