The British Library is about to embark on its annual task of archiving the entire UK web space. We will be pushing the button, sending out our bots to crawl every British domain for storage in the UK Legal Deposit Web Archive. How much will we capture? Even our experts can only make an educated guess.
You’ve probably played the time-honoured village fete game: guess how many jelly beans are in the jar, and the nearest wins a prize. Well, perhaps we can ask you to guess the size of the UK internet, and the nearest gets … the glory of being right. Some facts from last year might help.
2013 Web Crawl
In 2013 the Library conducted the first crawl of all .uk websites. We started with 3.86 million seeds (websites), which led to the capture of 1.9 billion URLs (web pages, documents, images). All this resulted in 30.84 terabytes (TB) of data! It took the Library's robots 70 days to collect.
In addition to the .uk domains, the Library has the scope to collect websites that are hosted in the UK, so we will attempt to geolocate IP addresses within the geographical confines of the UK. This means that we will be pulling in many .com, .net, .info and other Top Level Domains (TLDs). How many extra websites? How much data? We just don’t know at this time.
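To picture how this scope decision might work, here is a minimal sketch in Python: a .uk hostname is always in scope, while any other TLD is checked against a geolocation lookup. The network ranges and function names below are purely illustrative assumptions (a real crawler would consult a full geolocation database, not a hard-coded list), and the ranges used are documentation/example prefixes, not actual UK allocations.

```python
import ipaddress
import socket

# Illustrative stand-in for a real IP geolocation database.
# These are reserved documentation prefixes, NOT real UK allocations.
UK_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]

def host_appears_uk(hostname: str) -> bool:
    """Resolve a hostname and check whether its IP falls in a 'UK' range."""
    try:
        ip = ipaddress.ip_address(socket.gethostbyname(hostname))
    except (socket.gaierror, ValueError):
        return False
    return any(ip in net for net in UK_NETWORKS)

def in_scope(hostname: str) -> bool:
    """A .uk domain is always in scope; other TLDs only if geolocated to the UK."""
    return hostname.endswith(".uk") or host_appears_uk(hostname)
```

In this sketch, `in_scope("news.example.co.uk")` is true by TLD alone, while a .com host would be fetched into scope only if its resolved address fell inside one of the UK ranges.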
A huge issue in collecting the web is the large number of duplicates that are captured and saved, which can add a great deal to the volume collected. Of the 1.9 billion URLs, a significant number are probably copies, and our technical team have worked hard this time to reduce this through ‘de-duplication’. We are, however, uncertain at the moment how much effect this will eventually have on the total volume of data collected.
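One common way to de-duplicate (sketched below; this is an illustration of the general technique, not the Library's actual implementation) is to hash each captured payload and, when the same digest has been seen before, store a lightweight "revisit" reference instead of a second copy of the bytes.

```python
import hashlib

def content_digest(payload: bytes) -> str:
    """Hash the response payload; identical content yields an identical digest."""
    return hashlib.sha1(payload).hexdigest()

def deduplicate(records, seen_digests):
    """Yield crawl records, marking repeats instead of re-storing them.

    `records` is an iterable of (url, payload) pairs; `seen_digests` is a
    set of digests, which in practice might be loaded from a previous
    crawl's index so duplicates across crawls are caught too.
    """
    for url, payload in records:
        digest = content_digest(payload)
        if digest in seen_digests:
            # Same bytes already stored: record a reference only.
            yield (url, "revisit", digest)
        else:
            seen_digests.add(digest)
            yield (url, "new", digest)
```

Note that two different URLs serving identical bytes (a logo fetched from many pages, say) produce one stored copy and many cheap revisit records, which is exactly where the volume savings come from.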
In summary then, in 2014 we will be looking to collect all of the .uk domain names plus all the websites that we can find that are hosted in the UK (.com, .net, .info etc.), overall a big increase in the number of ‘seeds’ (websites). It is hard, however, to predict what effect these changes will have compared to last year. What the final numbers might be is anyone’s guess. What do you think?
Let us know in the comments below, or on Twitter (@UKWebArchive), YOUR predictions for 2014: the number of URLs, the size in terabytes (TB) and (if you are feeling very brave) the number of hosts. Note that organisations like the BBC and NHS each consist of lots of websites but count as one ‘host’.
URLs (in billions)
Size (in terabytes)
Hosts (in millions)
We will announce the winner when all the data is safely on our servers sometime in the summer. Good luck.
We were delighted to hear on 15 January that the IHR, along with the universities of Amsterdam and Toronto, King’s College London and the History of Parliament Trust, has been awarded funding by the international Digging into Data Challenge 2013. ‘Digging into Linked Parliamentary Data’ is one of fourteen projects which, over the next two years, will investigate how computational techniques can be applied to ‘big data’ in the humanities and social sciences.
Parliamentary proceedings reflect our history from centuries ago to the present day. They exist in a common format that has survived the test of time, and reflect any event of significance (through times of war and peace, of economic crisis and prosperity). With carefully curated proceedings becoming available in digital form in many countries, new research opportunities arise to analyse this data, on an unprecedented longitudinal scale, and across different nations, cultures and systems of political representation.
Focusing on the UK, Canada and The Netherlands, this project will deliver a common format for encoding parliamentary proceedings (with an initial focus on 1800–yesterday); a joint dataset covering all three jurisdictions; a workbench with a range of tools for the comparative, longitudinal study of parliamentary data; and substantive case studies focusing on migration, left/right ideological polarization and parliamentary language. We hope that comparative analysis of this kind, and the tools to support it, will inform a new approach to the history of parliamentary communication and discourse, and address new research questions.
We are delighted to have been awarded AHRC funding for a new research project, ‘Big UK Domain Data for the Arts and Humanities’. The project aims to transform the way in which researchers in the arts and humanities engage with the archived web, focusing on data derived from the UK web domain crawl for the period 1996–2013. Web archives are an increasingly important resource for arts and humanities researchers, yet we have neither the expertise nor the tools to use them effectively. Both the data itself, totalling approximately 65 terabytes and constituting many billions of words, and the process of collection are poorly understood, and it is possible only to draw the broadest of conclusions from current analysis.
A key objective of the project will be to develop a theoretical and methodological framework within which to study this data, which will be applicable to the much larger on-going UK domain crawl, as well as in other national contexts. Researchers will work with developers at the British Library to co-produce tools which will support their requirements, testing different methods and approaches. In addition, a major study of the history of UK web space from 1996 to 2013 will be complemented by a series of small research projects from a range of disciplines, for example contemporary history, literature, gender studies and material culture.