The IHR Blog |

IHR Digital


New reviews: Broadcasting buildings, Joan of Arc, Demonology and Tawney

First up this week we have Broadcasting Buildings: Architecture on the Wireless, 1927-1945 by Shundana Yusaf, as Laura Carter and the author discuss a playful and scholarly new book (no. 1765, with response here).

Then we turn to Helen Castor’s Joan of Arc: A History. Kieron Creedon recommends a vivid and riveting book which combines a consummate skill for storytelling with the cogent precision of a trial lawyer (no. 1764).

Next up is Demonology and Scholarship in the Counter-Reformation by Jan Machielsen, which Francis Young believes is a book that deserves to be on the reading list of every course on the Counter-Reformation (no. 1763).

Finally we have The Life of R. H. Tawney: Socialism and History by Lawrence Goldman. Adam Timmins reviews the first full biography of the historian and social reformer (no. 1762).

New reviews: C16 Mexico and China, rural Indian technology, Venice and Dublin Press

We start this week with a lively discussion between Felipe Fernandez-Armesto and Serge Gruzinski over the latter’s new work of comparative global history, The Eagle and the Dragon: Globalization and European Dreams of Conquest in China and America in the Sixteenth Century (no. 1761, with response here).

Next up is Technology and Rural Change in Eastern India, 1830–1980 by Smritikumar Sarkar, and Amelia Bonea recommends a valuable book for anyone with an interest in the history of science and technology (no. 1760).

Then we have Rosa Salzberg’s Ephemeral City: Cheap Print and Urban Culture in Renaissance Venice, which Alexander Wilkinson believes is one of the best and most original works on book history to appear in recent years (no. 1759).

Finally we turn to Newspapers and Newsmakers: The Dublin Nationalist Press in the Mid-Nineteenth Century by Ann Andrews. Patrick Maume praises a useful contribution to the growing body of research on 19th-century Irish print media (no. 1758).

New reviews: US human rights, women and pre-modern law, strategy and latest VCH

We start this week with Reclaiming American Virtue: The Human Rights Revolution of the 1970s by Barbara Keys. Umberto Tulli and the author discuss a book which offers a new interpretation and will pave the way for future historical scholarship (no. 1757, with response here).

Next up is Women, Agency and the Law, 1300-1700, edited by Bronach Kane and Fiona Williamson, and Sparky Booker finds these essays break new ground in the history of women, law and agency in the pre-modern period (no. 1756).

Then we turn to Lawrence Freedman’s Strategy: a History, which Marcel Berni believes belongs with the classics in the field of strategic studies (no. 1755).

Finally James Bowen reviews Victoria County History: Shropshire VI Shrewsbury, edited by William A. Champion and Alan Thacker, a beautifully presented addition to the VCH series, of interest to both local and national historians as well as urban historians (no. 1754).

New reviews – Emotions, the end of the Iron Curtain, and Turkish heroin

This week we have a real treat for you, as we focus on Jan Plamper’s exciting new work The History of Emotions: An Introduction. There’s a lengthy review by Rob Boddice (no. 1752, with response here) and then a fascinating interview between Professor Plamper and our very own Jordan Landes (no. 1753).

Then we turn to another German work, and Eliten und Zivile Gesellschaft: Legitimitätskonflikte in Ostmitteleuropa by Helmut Fehr. Steven Jefferson believes this to be an impressive volume of detailed empirical research and careful analysis (no. 1751).

Finally, we have Ryan Gingeras’s Heroin, Organized Crime, and the Making of Modern Turkey, and Egemen Bezci reviews a remarkable contribution that paves the path for further studies on the topic (no. 1750).

New reviews: Magna Carta, Lady Antonia, memory and French Army

The slippers of Archbishop Walter on loan from Canterbury Cathedral on display in Magna Carta Law Liberty Legacy, British Library

We’re delighted to be able to present to you a review of the new BL exhibition on Magna Carta: Law, Liberty, Legacy. John Sabapathy reviews a wonderful exhibition which is as much about Magna Carta’s 800 year reception as its immediate 13th-century matrix (no. 1749).

A further treat is a new Daniel Snowman interview, in which he talks to Lady Antonia Fraser about her work as a historian and biographer (no. 1748).

Next we turn to The Memory of the People: Custom and Popular Senses of the Past in Early Modern England by Andy Wood. Brodie Waddell believes that the author has produced a study that proves the centrality of custom and popular memory across more than three centuries (no. 1747).

Finally, Mario Draper recommends The French Army and the First World War by Elizabeth Greenhalgh, on the grounds of the quality of the extensive research, the clarity with which it is delivered and the insightfulness on offer (no. 1746).

The Historical Aspects of Dilipad: Challenges and Opportunities

This post originally appeared on the Digging into Linked Parliamentary Data project blog, and is a guest post by one of the historians working on the project, Luke Blaxill.

The Dilipad project is on one hand exciting because it will allow us to investigate ambitious research questions that our team of historians, social and political scientists, and computational linguists couldn’t address otherwise. But it’s also exciting precisely because it is such an interdisciplinary undertaking, which has the capacity to inspire methodological innovation. For me as a historian, it offers a unique opportunity not just to investigate new scholarly questions, but also to analyse historical texts in a new way.

We must remember that, in History, the familiarity with corpus-driven content analysis and semantic approaches is minimal. Almost all historians of language use purely qualitative approaches (i.e. manual reading) and are unfamiliar even with basic word-counting and concordance techniques. Indeed, the very idea of ‘distant reading’ with computers, and categorising ephemeral and context-sensitive political vocabulary and phrases into analytical groups is massively controversial even for a single specific historical moment, let alone diachronically or transnationally over decades or even generations. The reasons for this situation in History are complex, but can reasonably be summarised as stemming from two major scholarly trends which have emerged in the last four decades. The first is the wide-scale abandonment of quantitative History after its perceived failures in the 1970s, and the migration of economic history away from the humanities. The second is the influence of post-structuralism from the mid-1980s, which encouraged historians of language to focus on close readings, and shift from the macro to the micro, and from the top-down to the bottom-up. Political historians’ ambitions became centred around reconstructions of localised culture rather than ontologies, cliometrics, model making, and broad theories. Unsurprisingly, computerised quantitative text analysis found few, if any, champions in this environment.

In the last five years, the release of a plethora of machine-readable historical texts (among them Hansard) online, together with the popularity of Google Ngram, has reopened the debate on how and how far text analysis techniques developed in linguistics and the social and political sciences can benefit historical research. The Dilipad project is thus a potentially timely intervention, and presents a genuine opportunity to push the methodological envelope in History.

We aim to publish outputs which will appeal to a mainstream audience of historians who will have little familiarity with our methodologies, rather than to prioritise a narrower digital humanities audience. We will aim to make telling interventions in existing historical debates which could not be made using traditional research methods. With this in mind, we are pursuing a number of exciting topics using our roughly two centuries-worth of Parliamentary data, including the language of gender, imperialism, and democracy. While future blog posts will expand upon all three areas in more detail, I offer a few thoughts below on the first.

The Parliamentary language of gender is a self-evidently interesting line of enquiry during a historic period where the role of women in the political process in Great Britain, Canada, and the Netherlands was entirely transformed. There has been considerable recent historical interest in the impact of women on the language of politics, and in female rhetorical culture. The Dilipad project will examine differences in vocabulary between male and female speakers, such as in the genre of topics raised, and also discursive elements, hedging, modality, the use of personal pronouns and other discourse markers, especially those which convey assertiveness and emotion. Next to purely textual features we will analyse how the position of women in parliament changed over time and between countries (time they spoke, how frequently they were interrupted, the impact of their discourse on the rest of the debate etc.).

A second area of great interest will be how women were presented and described in debate – both by men and by other women. This line of enquiry might present an opportunity to utilise sentiment analysis (which in itself would be methodologically significant) which might shed light on positive or negative attitudes towards women in the respective political cultures of our three countries. We will analyze tone, and investigate what vocabulary and lexical formations tended to be most associated with women. In addition, we can also investigate whether the portrayal of women varied across political parties.

More broadly, this historical analysis could help shed light on the wider impact of women in Parliamentary rhetorical culture. Was there a discernible ‘feminized language of politics’, and if so, where did it appear, and when? Similarly, was there any difference in Parliamentary behaviour between the sexes, with women contributing disproportionately more to debates on certain topics, and less to others? Finally, can we associate the introduction of new Parliamentary topics or forms of argument with the appearance of women speakers?

Insights in these areas – made possible only by linked ‘big data’ textual analysis – will undoubtedly be of great interest to historians, and will (we hope) demonstrate the practical utility of text mining and semantic methodologies in this field.

New reviews: Roy Foster interview, early modern pamphlets, C19 women professionals and Nat Turner

We start off this week with another in our occasional interview series, with Daniel Snowman talking to Professor Roy Foster about his recent work on the human dimension behind the Easter Rising, Vivid Faces (no. 1745).

Next we have Thomas Dekker and the Culture of Pamphleteering in Early Modern London by Anna Bayman. Kirsty Rolfe and the author discuss a highly readable study, with important implications for critical understanding of ‘popular print’ and the cultures with which it interacted (no. 1744, with response here).

Then we turn to Crafting the Woman Professional in the Long Nineteenth Century, edited by Kyriaki Hadjiafxendi and Patricia Zakreski, which Zoe Thomas believes will positively contribute to a number of academic fields (no. 1743).

Finally there is David F. Allmendinger Jr.’s Nat Turner and the Rising in Southampton County, as Vanessa Holden reviews an account of the most famous slave rebellion in American history (no. 1742).

 

Wliat’s in a n^me? Post-correction of randomly misrecognized names in OCR data

This post originally appeared on the Digging into Linked Parliamentary Data project blog, and is a guest post by team member Kaspar Beelen.

Problem.

Notwithstanding the recent optimization of Optical Character Recognition (OCR) techniques, the conversion from image to machine-readable text remains, more often than not, a problematic endeavor. The results are rarely perfect. The reasons for the defects are multiple and range from errors in the original prints to more systemic issues such as the quality of the scan, the selected font or typographic variation within the same document. When we converted the scans of the historical Canadian parliamentary proceedings, the last of these causes turned out to be especially problematic. Typographically speaking, the parliamentary proceedings are richly adorned with transitions between different font types and styles. These switches are not simply due to the esthetic preferences of the editors, but are intended to facilitate reading by indicating the structure of the text. Structural elements of the proceedings, such as topic titles, the names of the MPs taking the floor, audience reactions and other crucial items, are distinguished from common speech by the use of bold or cursive type, small capitals or even a combination of these.

Moreover, if the scans are not optimized for OCR conversion, the quality of the data decreases dramatically as a result of typographic variation. In the case of the Belgian parliamentary proceedings, a huge effort was undertaken to make historical proceedings publicly available in PDF format. The scans were optimized for readability, but seemingly not for OCR processing, and unsurprisingly the conversion yielded a flawed and unreliable output. Although one might complain about this, it is at the same time highly unlikely that, considering the costs of scanning more than 100,000 pages, the process will be redone in the near future, so we have no option but to work with the data that is available.

For this reason, names printed in bold (Belgium) or small capitals (Canada) ended up misrecognized in an almost random manner, i.e. there was no logic in the way the software converted the name. Although this showcases the inventiveness of the OCR system, it makes linking names to an external database almost impossible. Below you see a small selection of the various ways ABBYY, the software package we are currently working with, screwed up the name of the Belgian progressive liberal “Houzeau de Lehaie”:

Table 1: Different outputs for “Houzeau de Lehaie”

Houzeau de Lehnie. Ilonzenu dc Lehnlc. lionceau de Lehale.
Ilonseau de Lehaie. Ilonzenu 4e Lehaie. HouKemi de Lehnlc.
lionceau de Lehaie. Honaeaa 4e Lehaie. Hoaieau de Lehnle.
Ilonzenn de Lehaie. Heaieaa ée Lehaie. Homean de Lehaie.
Heazeaa «le Lehaie. Houzcait de Lekale. Houteau de Lehaie.
Hoiizcan de Lchnle. Henxean dc Lehaie. Houxcau de Lehaie.
Hensean die Lehaie. IleuzeAit «Je Lehnie. Houzeau de Jlehuie.
Ileaieaa «Je Lehaie. Honzean dc Lehaie Houzeau de Lehaic.
Hoiizcnu de Lehaie. Honzeau de Lehaie. Ilouzeati de Lehaie.
Houxean de Lehaie. Hanseau de Lehaie. Etc.

Although the quality of the scanned Canadian Hansards is significantly better, the same phenomenon occurs.

Table 2: Sample of errors spotted in the conversion of the Canadian Hansards (1919)

BALLANTYNE ARCHAMBAULT
BAILLANiTYNE ARCBAMBAULT
BALLAINTYNE ARCHAMBATJLT
BALLANT1NE AECBAMBAULT
BALLAiNTYNE ABCHAMBAULT
iBALiLANTYNE AROHASMBAULT
BAIiLANTYNE ARlQHAMBAULT
BALLANTYINE AECBAMBAULT

In many other cases even an expert would have a hard time figuring out to whom the name refers.

Table 3: Misrecognition of names

,%nsaaeh-l»al*saai.
aandcrklndcrc.
fiillleaiix.
IYanoerklnaere.
I* nréeldcn*.
Ilellcpuitc.
Thlcapaat.

These observations are rather troubling, especially with respect to the construction of linked corpora: even if, let’s say, 99% of the text is correctly converted, the other 1% will contain many of the most crucial entities, needed for marking up the structure and linking the proceedings to other sources of information. To correct the tiny but highly important 1%, I will focus in this blog post on how to automatically normalize speaker entities, those parts of the proceedings that indicate who is taking the floor. In order to retrieve context information about the MPs, such as party and constituency, we have to link the proceedings to our biographic databases. Linking will only be possible if the speaker entities in the proceedings match those in our external corpus.

In most cases speaker entities include a title and a name followed by optional elements indicating the function and/or the constituency of the orator. The colon forms the border between the speaker entity and the actual speech. In a more formal notation, a speaker entity consists of the following pattern:

Mr. {Initials} Name{, Function} {(Constituency)}: Speech.
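As a rough illustration, the pattern above could be matched with a regular expression along the following lines. This is only a sketch: the pattern `SPEAKER_RE`, the helper `extract_speakers`, the extra titles besides "Mr." and the sample lines are all assumptions made for the example, not the expression used in the project.

```python
import re
from collections import Counter

# Hypothetical pattern for "Mr. {Initials} Name{, Function} {(Constituency)}: Speech."
# Titles other than "Mr." are an assumption, not taken from the proceedings.
SPEAKER_RE = re.compile(
    r"^(?P<title>Mrs|Mr|Sir)\.?[ \t]+"           # title
    r"(?P<initials>(?:[A-Z]\.[ \t]*)+)?"         # optional initials, e.g. "W. F. "
    r"(?P<name>[^,:(\n]+?)"                      # name, up to the next delimiter
    r"(?:,[ \t]*(?P<function>[^:(\n]+?))?"       # optional function after a comma
    r"(?:[ \t]*\((?P<constituency>[^)\n]*)\))?"  # optional constituency in brackets
    r"[ \t]*:",                                  # colon closing the speaker entity
    re.MULTILINE,
)

def extract_speakers(text):
    """Return one dict per speaker entity found at the start of a line."""
    return [m.groupdict() for m in SPEAKER_RE.finditer(text)]

# Invented sample lines, including one misrecognized name.
sample = (
    "Mr. ROWELL: I beg to move the adjournment.\n"
    "Mr. W. F. COCKSHUTT (Brantford): On a point of order.\n"
    "Mr. BALLANTYNE, Minister of Marine: The answer is no.\n"
    "The motion was agreed to.\n"
    "Mr. ROWELiL: I thank the House.\n"
)
speakers = extract_speakers(sample)
# Frequency of each extracted name; misrecognized variants show up as rare items.
name_counts = Counter(s["name"] for s in speakers)
```

Counting the extracted names in this way gives the kind of frequency distribution discussed next: a correct name and its misrecognized variant are counted as two separate items.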

Using regular expressions we can easily extract these entities. The result of this extraction is summarized by the figures below, which show the frequency with which the different speaker entities occur.

 Figure 1: Distribution of extracted speaker entities (Canada, 1919)


Figure 2: Distribution of extracted speaker entities (Belgium, 1893)


The figures lay bare the scope of the problem caused by these random OCR errors in more detail. Ideally there shouldn’t be more speaker entities than there are MPs in the House, which is clearly not the case. As you can see for the Belgian proceedings from the year 1893, the set of items occurring only once or twice contains around 3000 unique elements. The output for the Canadian Hansards from 1919 looks slightly better, but there are still around 1000 almost unique items. Also, as is clear from the plots, the distribution of the speakers is strongly right-skewed, due to the large number of unique and wrongly recognized names in the original scans. We will try to reduce the right-skewness by replacing the almost unique elements with more common items.

Solution.

In a first step we set out to replace these names with similar items that occur more frequently. Replacement happens in two consecutive rounds: first by searching in the local context of the sitting, and secondly by looking for a likely candidate in the set of items extracted from all the sittings of a particular year. To measure whether two names resemble each other, we calculated cosine similarity, based on n-grams of characters, with n running from one to four.
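The two rounds described above can be sketched roughly as follows. This is a minimal illustration and not the project's code: the function names, the similarity threshold of 0.6, the cut-offs `rare_max` and `min_count`, and the example counts are all assumptions chosen for the example.

```python
from collections import Counter
from math import sqrt

def char_ngrams(name, n_min=1, n_max=4):
    """Count character n-grams of a name for n = n_min..n_max."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(name) - n + 1):
            grams[name[i:i + n]] += 1
    return grams

def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def best_match(name, pool, min_count, threshold):
    """Most similar sufficiently frequent name in pool, or None below threshold."""
    target = char_ngrams(name)
    best, best_score = None, threshold
    for candidate, count in pool.items():
        if candidate == name or count < min_count:
            continue
        score = cosine(target, char_ngrams(candidate))
        if score >= best_score:
            best, best_score = candidate, score
    return best

def propose_rules(sitting, year, rare_max=2, min_count=3, threshold=0.6):
    """Propose 'rare name -> frequent name' rules: sitting first, then whole year."""
    rules = {}
    for name in sitting:
        if year.get(name, 0) > rare_max:
            continue  # frequent enough over the year: leave untouched
        match = (best_match(name, sitting, min_count, threshold)
                 or best_match(name, year, min_count, threshold))
        if match:
            rules[name] = match
    return rules

# Invented frequency counts for one sitting and for the whole year.
sitting = Counter({"ROWELL": 5, "ROWELiL": 1, "ABCHAMBAULT": 1})
year = Counter({"ROWELL": 120, "ARCHAMBAULT": 40, "COCKSHUTT": 30,
                "ROWELiL": 1, "ABCHAMBAULT": 1})
rules = propose_rules(sitting, year)
```

With these toy counts, `ROWELiL` is resolved within the sitting, while `ABCHAMBAULT` only finds its match in the yearly set. The proposed rules would still need the manual check described below, since a high similarity score does not guarantee a correct replacement.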

More formally, the correction starts with the following procedure:

[Figure: correction procedure]

As shown in table 4, running this loop yields many replacement rules. Not all of them are correct, so we need to manually filter out and discard any illegitimate rules that this procedure has generated.

 Table 4: Selection of rules generated by above procedure

Legitimate rules Illegitimate rules
EOWELL->ROWELL W.HIDDEN -> DENIS
McOOIG->McCOIG SCOTT -> CAEVELL
ROWELiL->ROWELL THOMAS VIEN -> THOMAS WHITE
RUCHARBSON->RICHARDSON BRAKE -> SPEAKER
(MdMASTER->McMASTER CLARKE -> CLARK
ABCHAMBAULT->ARCHAMBAULT
AROHASMBAULT->ARCHAMBAULT
CQCKSHUTT->COCKSHUTT

Just applying these corrected replacement rules would increase the quality of the text material a lot. But, as stated before, similarity won’t suffice when quality is awful, such as is the case for the examples shown in table 3. We need to go beyond similarity, but how?

The solution I propose is to use the replacement rules to train a classifier and subsequently apply the classifier to instances that couldn’t be assigned a correction during the previous steps. OCR correction thus becomes a multiclass classification task, in which each generated rule is used as a training instance. The right-hand side of the rule represents the class or the target variable. The left-hand side is converted to input variables or features. After training, the classifier will predict a correction, given a misrecognized name as input. For our experiment we used Multinomial Naïve Bayes, trained with n-grams of characters as features, with n again ranging from 1 to 4. This worked surprisingly well: 90% of the rules it created were correct. Only around 10% of the rules generated by the classifier were either wrong or didn’t allow us to make a decision. Table 5 shows a small fragment of the rules produced by the classifier.

Table 5: Sample of classifier output given input name

Input name Classifier output
,%nsaaeh-l»al*saai. Anspach-Puissant.
aandcrklndcrc. Vanderkindere.
fiillleaiix. Gillieaux.
IYanoerklnaere. Vanderkindere.
I* nréeldcn*. le président.
Ilellcpuitc. Helleputte.
Thlcapaat. Thienpont.
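To make the classification step more concrete, here is a minimal sketch of a Multinomial Naïve Bayes classifier over character n-grams, written in plain Python. It is an illustration only: the class `NaiveBayesCorrector`, the add-one smoothing and the tiny training set (a few legitimate rules from table 4) are assumptions for the example, and the project's actual implementation and settings may well differ.

```python
from collections import Counter, defaultdict
from math import log

def char_ngrams(name, n_min=1, n_max=4):
    """Count character n-grams of a name for n = n_min..n_max."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(name) - n + 1):
            grams[name[i:i + n]] += 1
    return grams

class NaiveBayesCorrector:
    """Hypothetical multinomial Naive Bayes over character n-gram counts."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # smoothing (assumed value)
        self.class_docs = Counter()                # number of rules per target name
        self.gram_counts = defaultdict(Counter)    # n-gram counts per target name
        self.vocab = set()

    def fit(self, rules):
        """Each rule is (misrecognized name, correct name)."""
        for wrong, right in rules:
            grams = char_ngrams(wrong)
            self.class_docs[right] += 1
            self.gram_counts[right].update(grams)
            self.vocab.update(grams)
        return self

    def predict(self, name):
        """Return the target name with the highest posterior log-probability."""
        grams = char_ngrams(name)
        total_docs = sum(self.class_docs.values())
        best, best_score = None, float("-inf")
        for label, n_docs in self.class_docs.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = log(n_docs / total_docs)
            for gram, k in grams.items():
                if gram in self.vocab:             # n-grams never seen in training are ignored
                    score += k * log((counts[gram] + self.alpha) / denom)
            if score > best_score:
                best, best_score = label, score
        return best

# Training instances: a few of the legitimate rules listed in table 4.
training_rules = [
    ("EOWELL", "ROWELL"), ("ROWELiL", "ROWELL"),
    ("ABCHAMBAULT", "ARCHAMBAULT"), ("AROHASMBAULT", "ARCHAMBAULT"),
    ("CQCKSHUTT", "COCKSHUTT"), ("McOOIG", "McCOIG"),
    ("RUCHARBSON", "RICHARDSON"),
]
corrector = NaiveBayesCorrector().fit(training_rules)
prediction = corrector.predict("ARCBAMBAULT")      # a variant listed in table 2
```

Note that a classifier of this kind always returns one of the names it was trained on, so its output is a proposed rule that still has to pass the manual check.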

Conclusion.

As you can see in table 5, the predicted corrections aren’t necessarily very similar to the input name. If just a few elements are stable, the classifier can pick up the signal even when there is a lot of noise. Because OCR software mostly recognizes at least a handful of characters consistently, this method seems to perform well.

To summarize: What are the strong points of this system? First of all, it is fairly simple, reasonably time-efficient and works even when the quality of the original data is very bad. Manual filtering can be done quickly: for each year of data, it takes an hour or two to correct the rules generated by each of the two processes and replace the names. Secondly, once a classifier is trained, it can also predict corrections for the other years of the same parliamentary session. Lastly, as mentioned before, the classifier can correctly predict replacements just on the basis of a few shared characters.

Some weak points need to be addressed as well. The system still needs supervision. But, nonetheless, this is worth the effort, because it can enhance the quality of the data significantly, especially with respect to linking the speeches in a later stage. In some cases, however, it can be impossible to assess whether a replacement rule should be kept or not. Another crucial problem is that the manual supervision needs to be done by experts who are familiar both with the historical period of the text and with the OCR errors. That is, the expert has to know which names are legal and also has to be proficient in reading OCR errors.

At the moment, we are trying to improve and expand the method. So far, the model uses only the frequency of n-grams, and not their location in a token. By taking location into account, we expect that we could improve the results, but that would also increase dimensionality. Besides adding new features, we should also experiment with other algorithms, such as support-vector machines, which perform better in a high-dimensional space. We will also test whether we can expand the method to correct other structural elements of the parliamentary proceedings, such as topical titles.
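As a small illustration of why location-aware features increase dimensionality, the sketch below compares plain n-gram counts with one possible positional encoding, in which each n-gram is paired with its start index. This encoding is purely an assumption for the example; the post does not say how location would be represented.

```python
from collections import Counter

def char_ngrams(name, n_min=1, n_max=4):
    """Plain character n-gram counts (location ignored)."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(name) - n + 1):
            grams[name[i:i + n]] += 1
    return grams

def positional_ngrams(name, n_min=1, n_max=4):
    """Hypothetical variant: each feature is (start index, n-gram)."""
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(name) - n + 1):
            grams[(i, name[i:i + n])] += 1
    return grams

plain = char_ngrams("ROWELL")
positional = positional_ngrams("ROWELL")
# The two "L" unigrams collapse into one plain feature but stay separate
# once their position is part of the feature.
n_plain, n_positional = len(plain), len(positional)
```

Even for a single short name the positional variant produces more distinct features, and over a whole vocabulary of names the feature space grows much faster, which is the dimensionality concern raised above.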

New reviews: Roy Jenkins and his biographer, Abraham Lincoln and early modern alehouses

More fruits of that pressure now, anyway, as we have a special feature on biographer John Campbell. Adam Timmins looks back over his previous work (no. 1740) as a prelude to Robert Saunders’s examination of his latest effort, Roy Jenkins: A Well-Rounded Life (no. 1741).

Then we cross the Atlantic, turning to Founders’ Son: A Life of Abraham Lincoln by Richard Brookhiser. Sean Ledwith and the author discuss an innovative biography of the 16th President (no. 1739, with response here).

Finally we have Mark Hailwood’s Alehouses and Good Fellowship in Early Modern England. Jennifer Bishop believes that this book makes a very strong case for the alehouse as one of the key institutions in early modern society (no. 1738).

 

New reviews: Early modern women x 2, French Revolution, colonial Seoul

Mary Sidney Herbert (1561-1621), one of the stars of Mediatrix

We start with Mediatrix: Women, Politics and Literary Production in Early Modern England by Julie Crawford. Alice Ferron and the author discuss a book which provides innovative close readings of the lives and writings of some of early modern England’s most famous and controversial aristocratic women (no. 1737, with response here).

Then we have Female Alliances: Gender, Identity and Friendship in Early Modern Britain by Amanda Herbert. Leonie Hannan praises a beautifully written and insightfully argued work, based on meticulous primary research (no. 1735).

Next up is Eric Hazan’s A People’s History of the French Revolution, and Michiel Rys believes this book succeeds in delivering a vivid, lucid, informative, detailed account of the French Revolution (no. 1736).

Finally we turn to Todd Henry’s Assimilating Seoul: Japanese Rule and the Politics of Public Space in Colonial Korea, 1910–1945. Mark Caprio finds this book brings an impressive depth to our understanding of the Japanese articulation of their colonial goals (no. 1734).
