The IHR Blog

Text mining for Historians

Example page from the Text Mining course

The Institute of Historical Research now offer a wide selection of digital research training packages designed for historians and made available online on History SPOT. Most of these have been mentioned on this blog from time to time, and hopefully some of you will have had a good look at them. These courses are freely available; we only ask that you register for History SPOT to access them (which is a free and easy process). Full details of our online and face-to-face courses can also be found on the IHR website.

I thought that it might be useful to talk a little more about these courses on the blog and provide a brief sample. Over the coming months I will publish a series of posts, one for each of our training courses, and give you a little sneak peek so that you have a better idea of what to expect.

I have chosen the Text Mining module as the first because it is probably the one that best exemplifies what we are trying to do: make digital tools accessible to historians through a series of introductory training courses. The Text Mining for Historians module does just this, beginning with the very simple and moving gradually toward the more complex.

Text mining is not a tool in itself, but a series of tools that enable us to explore, interrogate, and analyse large bodies of text. Imagine, if you will, that you have gathered together a corpus of text – perhaps it’s a diary or series of diaries from a particular period, perhaps it’s a series of publications on a particular subject, or maybe it’s a set of official records spanning many decades or even centuries. Normally you would wade through these documents one at a time and take notes. Text mining allows you to automate certain elements of this task and helps you to discover trends and connections that you might never find by reading the texts through traditional methods.

This training module takes you from the theory (i.e. what is text mining all about) through to its application for historical texts, and eventually on to the more complex areas of what is called topic modelling, natural language processing, and named entity recognition.  In this post I’m going to quote from the opening section of this course as it gives a description of what historians might consider a good use for text mining.  In this example we are looking at the Old Bailey Trial accounts used on the popular Old Bailey Proceedings Online website:

 ****

Would you like to know how often the word ‘guilty’ appears in the Old Bailey trial accounts? The answer can be found using the standard search engine on the Old Bailey Online website (it’s 182,612). How about how many people were found guilty? The answer is 163,261. What about the number of defendants found guilty of murder? The answer is 1,518. These last two figures cannot be found through the standard search engine because they answer an entirely different type of question; we are not looking for how many times the word ‘guilty’ appears in the proceedings but how many trials resulted in a guilty verdict. We want to discover something meaningful within the body of texts automatically, rather than by manually checking each and every trial account.

This is a relatively simple example of text mining where the original documents have been marked up and tagged by surname, given name, alias, offence, verdict, and punishment. To calculate those results manually you would have to work your way through 197,745 criminal trial accounts (some 127 million words in total).
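
The difference between the two kinds of question can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML below is a hypothetical, heavily simplified markup invented for the example (the element and attribute names `trial`, `offence`, `verdict` and `category` are not the Old Bailey schema), and the three sample trials are made up.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup invented for this sketch.
# The real Old Bailey XML uses a different and much richer schema.
SAMPLE = """
<trials>
  <trial id="t1">
    <offence category="murder"/>
    <verdict category="guilty"/>
    <text>The prisoner pleaded not guilty but the jury found him guilty.</text>
  </trial>
  <trial id="t2">
    <offence category="theft"/>
    <verdict category="notGuilty"/>
    <text>She said she was not guilty and was acquitted.</text>
  </trial>
  <trial id="t3">
    <offence category="theft"/>
    <verdict category="guilty"/>
    <text>He confessed to taking the watch.</text>
  </trial>
</trials>
"""


def count_word(root, word):
    """Count how often a word appears in the running text (a plain search)."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(
        len(pattern.findall(trial.findtext("text", default="")))
        for trial in root.iter("trial")
    )


def count_verdicts(root, verdict, offence=None):
    """Count trials whose tagged verdict (and optionally offence) matches."""
    total = 0
    for trial in root.iter("trial"):
        v = trial.find("verdict")
        o = trial.find("offence")
        if v is None or v.get("category") != verdict:
            continue
        if offence is not None and (o is None or o.get("category") != offence):
            continue
        total += 1
    return total


root = ET.fromstring(SAMPLE)
word_hits = count_word(root, "guilty")                            # occurrences of the word
guilty_trials = count_verdicts(root, "guilty")                    # trials tagged guilty
guilty_murders = count_verdicts(root, "guilty", offence="murder") # guilty of murder
print(word_hits, guilty_trials, guilty_murders)
```

Note that the word ‘guilty’ appears in the second trial even though the verdict was an acquittal, while the third trial ended in a guilty verdict without the word appearing at all. That is why the word count and the verdict count answer different questions. The sketch counts trials rather than individual defendants, which is a further simplification.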

This form of text mining, however, is little more than an advanced search engine – useful but limited. As the creators of the Old Bailey Online themselves admit (and have attempted to redress in a subsequent project):

‘Analyzing this kind of data by decade, or trial type, or defendant gender etc., can re-enforce the categories, the assumptions, and the prejudices the user brings to each search and those applied by the team that provided the XML markup when the digital archive was first created’.

Dan Cohen et al, ‘Data Mining with Criminal Intent’, Final White Paper (31 August 2011), p. 12.

In other words the search options and text tagging were emphasising and reinforcing a pre-determined expectation of what the resource creators believed was the important data. Text mining tools can help to explore alternative questions more openly.

The Data Mining with Criminal Intent (DMCI) project has done just this by enabling researchers not only to query the Old Bailey site but also to export the results to a Zotero library, where they can be managed, and from there to Voyeur and other text mining tools for text analysis and visualisation.

The team behind the project uses the example of an investigator trying to understand the role poison might have had in murder cases. Using the search engine brings up 448 entries for ‘poison’ but doesn’t tell us much about what this means. Using Zotero and Voyeur it is possible to filter out the stop words and legal terminology common to all entries and find out what other words commonly appear near the word ‘poison’. Through this method of text mining it was possible to conclude that poison was probably more commonly administered through drinks such as coffee than through food (see pp. 6-7 of the white paper report ‘Data Mining with Criminal Intent’).

****
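
To give a flavour of what the poison example involves, here is a minimal sketch of the underlying idea: counting the words that appear near a keyword after filtering out stop words and common legal terms. It is plain Python standing in for the Zotero and Voyeur workflow the project actually used, not a reproduction of it. The function name `collocates`, the word lists and the three sample sentences are all invented for illustration.

```python
import re
from collections import Counter

# Illustrative word lists only; real stop-word lists are far longer.
STOP_WORDS = {
    "the", "a", "an", "and", "of", "in", "to", "for", "was", "that",
    "he", "she", "it", "her", "his", "had", "with", "into", "some",
}
LEGAL_TERMS = {"prisoner", "deceased", "indicted", "jury", "court"}


def collocates(texts, keyword, window=5, exclude=frozenset()):
    """Count words occurring within `window` tokens either side of `keyword`."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token != keyword:
                continue
            start = max(0, i - window)
            neighbours = tokens[start:i] + tokens[i + 1:i + 1 + window]
            counts.update(
                w for w in neighbours if w not in exclude and w != keyword
            )
    return counts


# Invented sentences, not quotations from the Proceedings.
SAMPLE_TEXTS = [
    "The prisoner put poison into the coffee that the deceased drank.",
    "She was indicted for mixing poison with his coffee.",
    "He had stirred some poison in the tea.",
]

top = collocates(SAMPLE_TEXTS, "poison", window=4,
                 exclude=STOP_WORDS | LEGAL_TERMS)
print(top.most_common(3))
```

On a real corpus the interesting work lies in choosing the window size and the exclusion lists, since both shape which neighbouring words rise to the top.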

If you would like to have a look at this module, please register for History SPOT for free and follow the instructions (http://historyspot.org.uk). If you would like further information about this course and the others that the IHR offer, please have a look at our Research Training pages on the IHR website.