Geocoding your data

In many projects, the collection of data results in a list of items whose distribution can be shown spatially. The process of Geocoding assigns a location (or set of locations) to an item of data; perhaps the site of a battle, the source of a text or the home of notable person. Such visualisations allow for new perspectives on the relationships of data , spatial or otherwise. A long winded way of geocoding would be to simpy go through the data, record by record, and assign it a set of coordinates using a third party resource. If the list is short or if a very precise location is needed then this may be a practical solution, however it easy to underestimate how long it takes to go through what may at first appear to be a short list.

Alternatively, data can be captured directly into a Geographic Information System (GIS) such as ArcGIS or the freely available Quantum GIS, ensuring a location point is recorded with each record added to the data set, though this may be impractical in dark, dusty archive or may not lend itself well to your workflow. Often projects don’t require the sort of pinpoint accuracy that might be needed in scientific project and regional or town level locations are suitable and even preferable.

In many cases these research records end up as spreadsheets with a single column dedicated to recording location and given the ease with which data formats can be converted and imported into other systems, this is an efficient way of systematically recording and organising.

If you have the know how, and perhaps some special data requirements, building your own solution for geocoding is fairly straightforward. Given a list of post codes for example you will quickly be able to create a geocoded dataset with a simple data table join. Matching multiple fields, expecting multiple matches and ranking the results is more tricky.

Fortunately, there seem to be an ever increasing number of freely available resources with which to geocode your data. The Ordnance Survey of Great Britain last year released post code point data and a national gazetteer that can be used to resolve a placename to coordinates. Going beyond the UK, Nominatim is a geolocation service offered by OpenStreetMap, and the Google maps API also offers geocoding (though this is usually limited to restricted number of requests per day). Pamela Fox has created a great Google gadget for use with Google docs that will take your spreadsheet, query the Google geocoding service and return a list of coordinates for you to incorporate into new columns. A especially nice feature is that even when you (inevitably) have a few records left over that couldn’t be matched, they can be physically dragged and dropped on a familiar Google map and these too are given coordinates.

The problem with making the best use of these resources is that very little thought is given to how location is recorded at the moment of capture so that it might be made use of programmatically at a later stage, usually with the location field being treated as free text rather than a discrete data type. The spreadsheet column for place or location may contain extraneous words and punctuation which prevent automatic matching. The hierachy of location information is rarely considered. Sometimes, secondary or tertiary possible candidate locations are recorded in the same column. In order to use any automatic geocoding process there is usually a need for extensive data cleaning which must often be done at least partially manually.

To avoid this situation a few simple guidleines should be considered before embarking on a spreadsheet data acquisition that you anticipate may be geocoded.

  1. Always record the best location you can regardless of your requirements – this will give you far more options for geocoding later on. If you have post code use it, a house number even better
  2. Always split the location components across several columns – Don’t mix in cities with villages and colloquial names. Have a hierarchy in mind, split it across several columns and stick to it. e.g. House, Road, Town, County, Country. You don’t have to fill in every field for each record but keep the schema consistent. Don’t worry about presentational considerations as these values can be concatenated in another column and the data will be far more easily manipulable in this form
  3. Don’t merge several locations in one field – If, as is often the case there is more than one candidate for a location, record them in seperate columns. A GIS technician will be able to associate several points to one record if necessary. If you are worried about the spreadsheet becoming unwieldy, put these columns to the far right of the sheet as they may not be needed very often
  4. Avoid abbreviations and colloquial names – they are hard to match up in geocoding exercises
  5. Avoid punctuation – Question marks and exclamations will mess up the match and even humble commas should be avoided as many formats will use them as column delimiters
  6. Avoid ambiguity – Add more detail than you might think immediately necessary and spare a thought for the poor geocoder who, less familiar than yourself with the dataset, may need to choose from one of the 11 different Newports in the UK or more than 30 worldwide!
  7. Keep it on one line – Other data fields may naturally lend themselves to multiple row entries, but try to stick to the rule of one row, one record. If you need more space in a cell, turn on word wrapping and make the cell higher and wider.

Following these guidelines will allow you to make the best use of your data when using tolls like Google docs gadget above. You can try geocoding on different columns, or combinations of columns, or using the best available column in each record.



Leave a Reply

Your email address will not be published. Required fields are marked *