The IHR Blog

IHR Digital

Wliat’s in a n^me? Post-correction of randomly misrecognized names in OCR data


This post originally appeared on the Digging into Linked Parliamentary Data project blog, and is a guest post by team member Kaspar Beelen.


Notwithstanding the recent optimization of Optical Character Recognition (OCR) techniques, the conversion from image to machine-readable text remains, more often than not, a problematic endeavor. The results are rarely perfect. The reasons for the defects are multiple and range from errors in the original prints to more systemic issues such as the quality of the scan, the selected font or typographic variation within the same document. When we converted the scans of the historical Canadian parliamentary proceedings, the last of these causes turned out to be especially problematic. Typographically speaking, the parliamentary proceedings are richly adorned with transitions between different font types and styles. These switches are not simply due to the aesthetic preferences of the editors, but are intended to facilitate reading by indicating the structure of the text. Structural elements of the proceedings, such as topic titles, the names of the MPs taking the floor, audience reactions and other crucial items, are distinguished from common speech by the use of bold or cursive type, small capitals or even a combination of these.

Moreover, if the scans are not optimized for OCR conversion, typographic variation lowers the quality of the data dramatically. In the case of the Belgian parliamentary proceedings, a huge effort was undertaken to make historical proceedings publicly available in PDF format. The scans were optimized for readability, but seemingly not for OCR processing, and unsurprisingly the conversion yielded a flawed and unreliable output. Although one might complain about this, it is at the same time highly unlikely that, considering the costs of scanning more than 100,000 pages, the process will be redone in the near future, so we have no option but to work with the data that is available.

For this reason, names printed in bold (Belgium) or small capitals (Canada) ended up misrecognized in an almost random manner, i.e. there was no logic in the way the software converted the name. Although this showcases the inventiveness of the OCR system, it makes linking names to an external database almost impossible. Below you see a small selection of the various ways ABBYY, the software package we are currently working with, screwed up the name of the Belgian progressive liberal “Houzeau de Lehaie”:

Table 1: Different outputs for “Houzeau de Lehaie”

Houzeau de Lehnie. Ilonzenu dc Lehnlc. lionceau de Lehale.
Ilonseau de Lehaie. Ilonzenu 4e Lehaie. HouKemi de Lehnlc.
lionceau de Lehaie. Honaeaa 4e Lehaie. Hoaieau de Lehnle.
Ilonzenn de Lehaie. Heaieaa ée Lehaie. Homean de Lehaie.
Heazeaa «le Lehaie. Houzcait de Lekale. Houteau de Lehaie.
Hoiizcan de Lchnle. Henxean dc Lehaie. Houxcau de Lehaie.
Hensean die Lehaie. IleuzeAit «Je Lehnie. Houzeau de Jlehuie.
Ileaieaa «Je Lehaie. Honzean dc Lehaie Houzeau de Lehaic.
Hoiizcnu de Lehaie. Honzeau de Lehaie. Ilouzeati de Lehaie.
Houxean de Lehaie. Hanseau de Lehaie. Etc.

Although the quality of the scanned Canadian Hansards is significantly better, the same phenomenon occurs.

Table 2: Sample of errors spotted in the conversion of the Canadian Hansards (1919)


In many other cases even an expert would have a hard time figuring out to whom the name refers.

Table 3: Misrecognition of names

I* nréeldcn*.

These observations are rather troubling, especially with respect to the construction of linked corpora: even if, let’s say, 99% of the text is correctly converted, the other 1% will contain many of the most crucial entities, needed for marking up the structure and linking the proceedings to other sources of information. To correct this tiny but highly important 1%, I will focus in this blog post on how to automatically normalize speaker entities, those parts of the proceedings that indicate who is taking the floor. In order to retrieve context information about the MPs, such as party and constituency, we have to link the proceedings to our biographical databases. Linking will only be possible if the speaker entities in the proceedings match those in our external corpus.

In most cases speaker entities include a title and a name, followed by optional elements indicating the function and/or the constituency of the orator. The colon marks the border between the speaker entity and the actual speech. In a more formal notation, a speaker entity follows this pattern:

Mr. {Initials} Name{, Function} {(Constituency)}: Speech.

Using regular expressions, we can easily extract these entities. The result of this extraction is summarized by the figures below, which show the frequency with which the different speaker entities occur.
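As a rough illustration of the extraction step, the pattern above could be turned into a regular expression along the following lines. This is a minimal sketch, not the expression used in the project: the list of titles, the group names and the helpers `parse_speaker` and `count_speakers` are assumptions, and the sample lines are invented.

```python
import re
from collections import Counter

# Hypothetical translation of
#   Mr. {Initials} Name{, Function} {(Constituency)}: Speech.
# The set of titles below is an assumption; extend it as needed.
SPEAKER_RE = re.compile(
    r"""^(?P<title>Mr\.|Mrs\.|Hon\.|Sir)\s+      # title
        (?P<initials>(?:[A-Z]\.\s*)+)?           # optional initials
        (?P<name>[^,(:]+?)                       # name
        (?:,\s*(?P<function>[^(:]+?))?           # optional function
        (?:\s*\((?P<constituency>[^)]*)\))?      # optional constituency
        \s*:\s*(?P<speech>.*)$                   # the colon, then the speech
    """,
    re.VERBOSE,
)


def parse_speaker(line):
    """Return the parts of a speaker entity as a dict, or None if no match."""
    match = SPEAKER_RE.match(line.strip())
    if match is None:
        return None
    return {key: (value.strip() if value else None)
            for key, value in match.groupdict().items()}


def count_speakers(lines):
    """Count how often each extracted name occurs."""
    parsed = (parse_speaker(line) for line in lines)
    return Counter(p["name"] for p in parsed if p)


# Invented sample lines, for illustration only.
sample = [
    "Mr. J. SMITH (Some Riding): I rise to a point of order.",
    "Mr. SMITH: Thank you.",
    "The House met at three o'clock.",
]
print(count_speakers(sample))
```

Counting the extracted names in this way gives the kind of frequency distribution that the figures summarize.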

 Figure 1: Distribution of extracted speaker entities (Canada, 1919)





Figure 2: Distribution of extracted speaker entities (Belgium, 1893)





The figures lay bare the scope of the problem caused by these random OCR errors in more detail. Ideally there shouldn’t be more speaker entities than there are MPs in the House, which is clearly not the case. As you can see for the Belgian proceedings from the year 1893, the set of items occurring only once or twice contains around 3000 unique elements. The output for the Canadian Hansards from 1919 looks slightly better, but there are still around 1000 almost unique items. As is clear from the plots, the distribution of the speakers is heavily right-skewed, due to the large number of unique, wrongly recognized names in the original scans. We will try to reduce this skew by replacing the almost unique elements with more common items.


In a first step we set out to replace these names with similar items that occur more frequently. Replacement happens in two consecutive rounds: first by searching in the local context of the sitting, and secondly by looking for a likely candidate in the set of items extracted from all the sittings of a particular year. To measure whether two names resemble each other, we calculated cosine similarity, based on n-grams of characters, with n running from one to four.

More formally, the correction starts with the following procedure:
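One way to sketch this two-round replacement in Python is shown below. It is an illustration under assumptions rather than the project's own code: the function names, the frequency cut-off `max_rare=2` and the similarity threshold of 0.5 are hypothetical choices. Only the character n-grams (n from one to four), the cosine similarity and the order of the two rounds come from the description above.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n_max=4):
    """Count all character n-grams of length 1..n_max in text."""
    return Counter(text[i:i + n]
                   for n in range(1, n_max + 1)
                   for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between the n-gram count vectors of two names."""
    va, vb = ngram_vector(a), ngram_vector(b)
    dot = sum(va[g] * vb[g] for g in va.keys() & vb.keys())
    norm = sqrt(sum(c * c for c in va.values()) *
                sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0


def best_candidate(name, counts, max_rare, threshold):
    """Most similar frequent item in counts, or None if none is close enough."""
    frequent = [c for c, k in counts.items() if k > max_rare and c != name]
    if not frequent:
        return None
    best = max(frequent, key=lambda c: cosine(name, c))
    return best if cosine(name, best) >= threshold else None


def generate_rules(sittings, max_rare=2, threshold=0.5):
    """sittings: one list of extracted speaker entities per sitting.

    Returns a dict mapping rare names to a proposed replacement.
    """
    year_counts = Counter(name for sitting in sittings for name in sitting)
    rules = {}
    for sitting in sittings:
        local_counts = Counter(sitting)
        for name in local_counts:
            if year_counts[name] > max_rare or name in rules:
                continue  # frequent enough, or already handled
            # Round 1: the sitting itself; round 2: the whole year.
            target = (best_candidate(name, local_counts, max_rare, threshold)
                      or best_candidate(name, year_counts, max_rare, threshold))
            if target:
                rules[name] = target
    return rules


# Toy input built from names in the tables above; the grouping is invented.
sittings = [
    ["Houzeau de Lehaie."] * 3 + ["Ilonzenu de Lehaie."],
    ["Vanderkindere."] * 3 + ["Houxean de Lehaie."],
]
for wrong, right in generate_rules(sittings).items():
    print(wrong, "->", right)
```

The threshold trades recall for precision: a lower value proposes more rules, but leaves more work for the manual filtering step.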

As shown in table 4, running this loop yields many replacement rules. Not all of them are correct, so we need to manually filter out and discard any illegitimate rules that this procedure has generated.

 Table 4: Selection of rules generated by above procedure

Legitimate rules Illegitimate rules

Just applying these corrected replacement rules would increase the quality of the text material a lot. But, as stated before, similarity won’t suffice when quality is awful, as is the case for the examples shown in table 2. We need to go beyond similarity, but how?

The solution I propose is to use the replacement rules to train a classifier and then apply the classifier to instances that couldn’t be assigned a correction during the previous steps. OCR correction thus becomes a multiclass classification task, in which each generated rule is used as a training instance. The right-hand side of the rule represents the class or the target variable. The left-hand side is converted to input variables or features. After training, the classifier will predict a correction, given a misrecognized name as input. For our experiment we used Multinomial Naïve Bayes, trained with n-grams of characters as features, with n again ranging from 1 to 4. This worked surprisingly well: around 90% of the rules generated by the classifier were correct, and only around 10% were either wrong or didn’t allow us to make a decision. Table 5 shows a small fragment of the rules produced by the classifier.

Table 5: Sample of classifier output given input name

Input name Classifier output
,%nsaaeh-l»al*saai. Anspach-Puissant.
aandcrklndcrc. Vanderkindere.
fiillleaiix. Gillieaux.
IYanoerklnaere. Vanderkindere.
I* nréeldcn*. le président.
Ilellcpuitc. Helleputte.
Thlcapaat. Thienpont.


As you can see in table 5, the predicted corrections aren’t necessarily very similar to the input name. If just a few elements are stable, the classifier can pick up the signal even when there is a lot of noise. Because OCR software mostly recognizes at least a handful of characters consistently, this method seems to perform well.
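To make the classification step more concrete, here is a small, dependency-free sketch of Multinomial Naive Bayes over character n-grams. It is not the implementation used in the experiment: the class `NgramNaiveBayes`, the add-one smoothing and the decision to ignore n-grams never seen in training are assumptions made for illustration. The training pairs are taken from the tables above; the name "Vandcrkinderc." in the usage lines is invented.

```python
from collections import Counter, defaultdict
from math import log


def char_ngrams(text, n_max=4):
    """All character n-grams of length 1..n_max, with repetitions."""
    return [text[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, rules):
        # rules: iterable of (misrecognized name, correction) pairs;
        # the correction is the class, the n-grams of the name the features.
        self.class_counts = Counter()
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for wrong, right in rules:
            grams = char_ngrams(wrong)
            self.class_counts[right] += 1
            self.feature_counts[right].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        total = sum(self.class_counts.values())
        best, best_score = None, float("-inf")
        for label, n_rules in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = log(n_rules / total)  # class prior
            for gram in char_ngrams(name):
                if gram in self.vocab:  # skip n-grams unseen in training
                    score += log((counts[gram] + 1) / denom)
            if score > best_score:
                best, best_score = label, score
        return best


# Replacement rules taken from the tables above.
rules = [
    ("Ilonzenu dc Lehnlc.", "Houzeau de Lehaie."),
    ("lionceau de Lehale.", "Houzeau de Lehaie."),
    ("Hoiizcan de Lchnle.", "Houzeau de Lehaie."),
    ("aandcrklndcrc.", "Vanderkindere."),
    ("IYanoerklnaere.", "Vanderkindere."),
    ("Ilellcpuitc.", "Helleputte."),
]
model = NgramNaiveBayes().fit(rules)
print(model.predict("HouKemi de Lehnlc."))
print(model.predict("Vandcrkinderc."))  # invented misrecognition
```

Because every shared n-gram adds evidence for a class, a handful of stable characters can be enough to outweigh the noise, which matches the behavior described above.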

To summarize: what are the strong points of this system? First of all, it is fairly simple, reasonably time-efficient and works even when the quality of the original data is very bad. Manual filtering can be done quickly: for each year of data, it takes an hour or two to correct the rules generated by each of the two processes and replace the names. Secondly, once a classifier is trained, it can also predict corrections for the other years of the same parliamentary session. Lastly, as mentioned before, the classifier can correctly predict replacements on the basis of just a few shared characters.

Some weak points need to be addressed as well. The system still needs supervision. Nonetheless, this is worth the effort, because it can enhance the quality of the data significantly, especially with respect to linking the speeches at a later stage. In some cases, however, it can be impossible to assess whether a replacement rule should be kept or not. Another crucial problem is that the manual supervision needs to be done by experts who are familiar both with the historical period of the text and with the OCR errors. That is, the expert has to know which names are valid and also has to be proficient in reading OCR errors.

At the moment, we are trying to improve and expand the method. So far, the model uses only the frequency of n-grams, and not their location in a token. By taking location into account, we expect that we could improve the results, but that would also increase dimensionality. Besides adding new features, we should also experiment with other algorithms, such as support-vector machines, which tend to cope well with high-dimensional feature spaces. We will also test whether we can expand the method to correct other structural elements of the parliamentary proceedings, such as topical titles.
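As a hedged illustration of what location-aware features might look like, the sketch below tags each character n-gram with a coarse position bucket inside the token. The function `positional_ngrams` and the bucketing scheme are hypothetical and not part of the method as it stands.

```python
def positional_ngrams(token, n_max=4, buckets=3):
    """Character n-grams tagged with a coarse position bucket.

    Bucket 0 covers the start of the token and bucket `buckets - 1` the end.
    The bucketing scheme is an assumption made for illustration.
    """
    features = []
    for n in range(1, n_max + 1):
        last_start = len(token) - n
        for i in range(last_start + 1):
            bucket = i * buckets // (last_start + 1)
            features.append((bucket, token[i:i + n]))
    return features


name = "Vanderkindere."
plain = {gram for _, gram in positional_ngrams(name)}
positional = set(positional_ngrams(name))
# The positional feature set is at least as large as the plain one.
print(len(plain), len(positional))
```

The same n-gram at the start and at the end of a name now counts as two different features, which is where the growth in dimensionality comes from.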

New reviews: Roy Jenkins and his biographer, Abraham Lincoln and early modern alehouses


More fruits of that pressure now, anyway, as we have a special feature on biographer John Campbell. Adam Timmins looks back over his previous work (no. 1740) as a prelude to Robert Saunders’s examination of his latest effort, Roy Jenkins: A Well-Rounded Life (no. 1741).

Then we cross the Atlantic, turning to Founders’ Son: A Life of Abraham Lincoln by Richard Brookhiser. Sean Ledwith and the author discuss an innovative biography of the 16th President (no. 1739, with response here).

Finally we have Mark Hailwood’s Alehouses and Good Fellowship in Early Modern England. Jennifer Bishop believes that this book makes a very strong case for the alehouse as one of the key institutions in early modern society (no. 1738).


New reviews: Early modern women x 2, French Revolution, colonial Seoul



Mary Sidney Herbert (1561-1621), one of the stars of Mediatrix

We start with Mediatrix: Women, Politics and Literary Production in Early Modern England by Julie Crawford. Alice Ferron and the author discuss a book which provides innovative close readings of the lives and writings of some of early modern England’s most famous and controversial aristocratic women (no. 1737, with response here).

Then we have Female Alliances: Gender, Identity and Friendship in Early Modern Britain by Amanda Herbert. Leonie Hannan praises a beautifully written and insightfully argued work, based on meticulous primary research (no. 1735).

Next up is Eric Hazan’s A People’s History of the French Revolution, and Michiel Rys believes this book succeeds in delivering a vivid, lucid, informative, detailed account of the French Revolution (no. 1736).

Finally we turn to Todd Henry’s Assimilating Seoul: Japanese Rule and the Politics of Public Space in Colonial Korea, 1910–1945. Mark Caprio finds this book brings an impressive depth to our understanding of the Japanese articulation of their colonial goals (no. 1734).

New reviews: Inter-war health, global history, Parisian smiles and US anti-communism


First up is The Politics of Hospital Provision in Early Twentieth-Century Britain by Barry Doyle, as Martin Gorsky and the author discuss a new study of Britain’s inter-war health services (no. 1733, with response here).

Then we turn to Lynn Hunt’s Writing History in the Global Era. Julia McClure believes this book’s identification of globalization as a paradigm establishes the foundations for analysing the meanings and implications of globalization narratives (no. 1732).

Next up is The Smile Revolution In Eighteenth Century Paris by Colin Jones, and Jennifer Wallis finds this book beautifully complicates the notion that the smile is a static and timeless form of emotional expression (no. 1731).

Finally we have Little “Red Scares”: Anti-Communism and Political Repression in the United States, 1921-1946, edited by Robert Justin Goldstein. Jennifer Luff welcomes a new edited collection on inter-war anti-communism (no. 1730).


New reviews: Trust, Italian Army, British India and Colonial Boston


We kick off this week with Geoffrey Hosking’s Trust: A History, with Eric M. Uslaner and the author disagreeing over this key issue (no. 1729, with response here).

Next up is The Italian Army and the First World War by John Gooch. Mario Draper reviews a book which will almost certainly remain a seminal text for scholars of the period and anyone else interested in European military history (no. 1728).

Then we turn to G. J. Bryant’s The Emergence of British Power in India, 1600-1784: A Grand Strategic Interpretation, and James Lees finds this book to be a refreshing addition to the historiography (no. 1727).

Finally we have Robert Love’s Warnings: Searching for Strangers in Colonial Boston by Cornelia Hughes Dayton and Sharon Salinger. Kristin O’Brassill-Kulfan believes this research fills an important gap in the on-the-ground history of pre-industrial poverty in the United States (no. 1726).

New reviews: John Wyclif, Medieval space, Cypriot communists and labour and liberalism


We start this week with John Wyclif on War and Peace by Rory Cox. Christopher Allmand and the author discuss a work which places Wyclif in a long historical context (no. 1725, with response here).

Then we turn to Space in the Medieval West: Places, Territories, and Imagined Geographies, edited by Meredith Cohen and Fanny Madeline. Sarah Ann Milne recommends a book which serves to substantiate and complement existing studies whilst offering a number of fascinating new explorations (no. 1724).

Next up is Yiannakis Kolokasidis’s History of the Communist Party in Cyprus: Colonialism, Class and the Cypriot Left, which Alexios Alecou finds to be an original contribution, rich with theoretical insights and practical implications (no. 1723).

Finally we turn to Labour and the Caucus: Working-Class Radicalism and Organised Liberalism in England, 1868-1888 by James Owen. Jules Gehrke believes this book is sure to become a valued part of the scholarly conversation (no. 1722).

New reviews: Eurasian Borderlands, peace, early American wars, Reformation


We start with The Struggle for the Eurasian Borderlands: From the Rise of Early Modern Empires to the End of the First World War by Alfred Rieber. Simone Pelizza and the author discuss a book which is destined to be an indispensable reference work for both students and researchers for many years to come (no. 1721, with response here).

Next up is William Mulligan’s The Great War for Peace. Cyril Pearce reviews a significant, if flawed, contribution to the debate about the impact of the First World War (no. 1720).

Then we have the Encyclopedia of the Wars of the Early American Republic, 1783-1812: A Political, Social, and Military History, edited by Spencer C. Tucker, which Jonathan Chandler believes will be a welcome addition to the shelves of any library (no. 1719).

Finally we turn to Reformation Unbound: Protestant Visions of Reform in England, 1525–1590 by Karl Gunther. Donald McKim finds this to be a splendid study which clearly delineates the various Protestant visions of reform in England (no. 1718).

New reviews: Lincoln and Latin America, English clergy, Louis XIV and the Indian Army


We start this week with Slavery, Race and Conquest in the Tropics: Lincoln, Douglas, and the Future of Latin America by Robert E. May. Phillip Magness and the author debate a book which gives us a Civil War that was both the product of international affairs, and a shaping force on their subsequent course (no. 1717, with response here).

Then we turn to Hugh M. Thomas’s The Secular Clergy in England, 1066-1216, and Katherine Harvey and the author discuss a book which is surely destined to become one of the definitive works in the field for many years to come (no. 1716, with response here).

Next up is Status Interaction During the Reign of Louis XIV by Giora Sternberg. Linda Kiernan believes this book presents historians of the court with a vigorous model to test (no. 1715).

Finally we have George Morton-Jack’s The Indian Army on the Western Front: India’s Expeditionary Force to France and Belgium in the First World War. Adam Prime finds this to be an extremely stimulating book, which should appeal to academics and enthusiasts alike (no. 1714).

New reviews: London women, Tokyo Zoo, Callaghan Government, Mystic Ark


First up is Women, Work and Sociability in Early Modern London by Tim Reinke-Williams. Hannah Hogan and the author discuss an inspiring starting-point for further, in-depth histories of women, work and sociability in early modern England (no. 1713, with response here).

Then we turn to Ian Jared Miller’s The Nature of the Beasts: Empire and Exhibition at the Tokyo Imperial Zoo, which Jonathan Saha recommends as being important beyond its obvious and substantial contribution to both Japanese history and zoo history (no. 1712).

Next up is Crisis? What Crisis? The Callaghan Government and the British ‘Winter of Discontent’ by John Shepherd. Ian Cawood reviews a concisely written, forensic political analysis of the defining historical myth by which all British political parties still live (no. 1711).

Finally we have The Mystic Ark: Hugh of Saint Victor, Art, and Thought in the Twelfth Century by Conrad Rudolph, which Karl Kinsella believes to be a thoroughly worked out and thoughtful piece of scholarship (no. 1710).


Welcome to PORT


This post has kindly been written for us by Dr Matthew Phillpott, SAS-Space Manager and SAS Digital Project Officer.

Those of you who have been using the Institute of Historical Research’s online research training platform, History SPOT, will have noticed a variety of changes recently. The site’s web address has changed, its name has changed, and its design has changed.

The refit of History SPOT and its transformation into PORT (Postgraduate Online Research Training) is an exciting development. We believed that the old site was beginning to look tired, yet its contents remain useful and relevant and there is still so much scope for expansion.

In addition the opportunity arose to merge the IHR’s efforts with the wider efforts of the School of Advanced Study (of which the IHR is one component). History SPOT has therefore become PORT, an online research training platform not just for historians, but for all humanities studies.

This is a good thing for historians. The extent of training provision on PORT will rapidly expand over the next few years and a vast amount of it will be relevant to students studying history.  Already, PORT provides additional resources offering advice about completing a PhD and a host of handbooks providing links to modern languages resources. Soon a resource will be launched providing introductory guidance to research using quantitative methods, various videos covering all kinds of research needs, and more ‘history’ focused courses, such as managing your data as an historian.

So please do check out PORT and let us know what you think.

[Note: Those familiar with History SPOT will see that not all of the old resources are currently online. These just require a quick fix to work with the new design and will be reappearing over the coming weeks]