A few months ago, I posted an image of seventeen volumes to the BHO Twitter account. These volumes came from several different series: London Record Society, Victoria County History Gloucestershire and Oxfordshire, and Calendar of Scottish papers.
That day, we were sending this entire batch of volumes off to the scanners to begin the process of digitising them in order to add them to BHO.
Since that day, we have published several volumes on BHO, including three from that initial photograph (VCH Gloucestershire volume 7 and VCH Oxfordshire volumes 16 and 17). I thought I would explain the process of getting one of those texts from the printed volume stage to its published form on BHO.
The process is a collaborative one, and can be long and expensive. Depending on the size of the volume, the density of the text and the number of images, a single volume can cost between £900 and £1500 to digitise. However, we think that the final outcome of a reliable digitised text, accessible and searchable from anywhere in the world, is completely worth it.
The first step is sending a book to the scanners. The scanning process usually takes a few weeks and the books are returned to us along with files of high-resolution TIFF scans. We have occasionally considered doing the scanning ourselves, but the scanning company’s machines, with their automatic page turners and unrivalled speed, surpass any machines that we would ever be able to afford. The scanning company is able to produce high-quality scans without damaging the books, which is another important factor to consider. We pull most of our source volumes from the IHR Library’s shelves, so we want to make sure that we return them in the same state in which we took them.
Once we have received the scans, we must create the publications and components (that is, the sections that books are split into on BHO) in our database. Each publication and component has a unique ID, which allows us to keep our 100,000+ text files organised. At this stage, we prepare all the metadata associated with the text, including the tags that are used in our faceted search interface.
Once we have publication and component IDs, we prepare rekeying instructions. Our texts are transcribed through a process called double rekeying. This transcription method involves two typists inputting text independently from page scans. The two transcriptions are then compared and any differences are manually resolved. This process ensures a very high level of accuracy, as both typists are highly unlikely to make the same mistakes. All of our texts are transcribed in Extensible Markup Language (XML). Our instructions have to explain how to mark up a table or an index, for instance, in XML. We send the instructions, the list of IDs and the page scans to a rekeying company. Again, relying on experts for this kind of work is much more efficient and cost-effective than doing it ourselves. The particular company that we have partnered with for many years has reliably produced hundreds of accurately transcribed volumes. BHO and its users place enormous value on the quality of the transcriptions on the site (which are 99.995% accurate) and it is crucial for us to work with companies that we can count on. The rekeyers also extract any images from the text.
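For readers curious about what the comparison step in double rekeying might look like in practice, here is a minimal sketch in Python. It is not the rekeying company's tooling, which we have not described here; the function name `find_disagreements` and the sample sentences are invented purely for illustration. It compares two independently typed versions word by word and lists the places where they disagree, which a person would then resolve against the page scan.

```python
import difflib


def find_disagreements(first_keying: str, second_keying: str) -> list[tuple[str, str]]:
    """Return (first, second) word runs where two transcriptions differ.

    Hypothetical helper for illustration only.
    """
    a = first_keying.split()
    b = second_keying.split()
    # Compare the two transcriptions as sequences of words.
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    disagreements = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record what each typist keyed at the point of disagreement.
            disagreements.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return disagreements


# Invented sample text: the second typist has misread one word.
typist_one = "The manor of Stow was held by the abbey in 1086"
typist_two = "The manor of Slow was held by the abbey in 1086"
print(find_disagreements(typist_one, typist_two))  # [('Stow', 'Slow')]
```

Because the two typists work independently, an error only slips through this kind of comparison if both make the same mistake in the same place.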
They return the XML and image files to us and it is time for the next step. Although very few people enjoy sorting out copyright permissions, we love being able to show as many images from the text as possible so we bite the bullet. Some of our volumes were published fairly recently and so sorting out what images we can reuse on the web is pretty straightforward. Other times, we find ourselves contacting mostly retired editors asking if they remember the wording of their image licences from thirty years ago! Once we have collected image permissions, we edit the XML texts to make sure that we only show images for which we have received permission. Next, the images are processed further so that they don’t take up too much space on our servers. Conserving server space without sacrificing image quality can be a delicate balancing act, sometimes requiring several iterations before we find the perfect size and resolution for each image.
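As a rough illustration of the XML editing step, the sketch below removes image references for which no permission has been recorded. It makes several assumptions that are not drawn from our actual files: the element name `figure`, the `id` attribute, the sample document and the function name `keep_licensed_figures` are all hypothetical.

```python
import xml.etree.ElementTree as ET


def keep_licensed_figures(xml_text: str, licensed_ids: set[str]) -> str:
    """Drop every <figure> whose id is not in licensed_ids.

    The <figure> element and its id attribute are assumed for illustration.
    """
    root = ET.fromstring(xml_text)
    # ElementTree has no parent pointers, so collect (parent, child) pairs first.
    to_remove = [
        (parent, child)
        for parent in root.iter()
        for child in list(parent)
        if child.tag == "figure" and child.get("id") not in licensed_ids
    ]
    for parent, child in to_remove:
        parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Invented sample component with two images, only one of them licensed.
sample = (
    '<section id="comp-1">'
    "<para>Parish church.</para>"
    '<figure id="fig-1"/>'
    '<figure id="fig-2"/>'
    "</section>"
)
print(keep_licensed_figures(sample, {"fig-1"}))
```

One thing to watch with this approach is that removing an element in ElementTree also discards any text that immediately follows it inside the parent, so real files with mixed text and images would need more careful handling.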
Next, we upload the XML files to the database and the images to our servers so that we can begin our quality assurance checks. We visually check each file to make sure that all the formatting has been done correctly and every page looks the way that it should. We verify the URLs and the page numbers. We double-check that the image quality meets our standards and that only licensed images are visible on the site. Finally, we check the quality of the transcriptions themselves in order to ensure that they meet our 99.995% accuracy rate.
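To give a sense of how demanding that figure is: if accuracy is measured per character (an assumption made for this example, since the unit is not specified above), 99.995% allows at most 5 errors in every 100,000 characters, or one in 20,000. The short sketch below expresses that arithmetic; the function name `meets_target` and the character counts are hypothetical.

```python
def meets_target(errors_found: int, characters_checked: int) -> bool:
    """Check a sample against a 99.995% accuracy target.

    Assumes accuracy is counted per character, which means at most
    5 errors per 100,000 characters. Integer arithmetic avoids rounding issues.
    """
    return errors_found * 100_000 <= characters_checked * 5


# Invented example: a text of 2,000,000 characters may contain up to 100 errors.
print(meets_target(90, 2_000_000))   # True
print(meets_target(101, 2_000_000))  # False
```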
After some double- and triple-checking, it’s time for the most exciting part: publishing the volume on BHO! Once it has gone live, we do a final check to make sure everything was published correctly, celebrate with a nice cuppa, and then move on to the next volume.