Report on the 'Missing links: the enduring web' conference

The conference, organised by the Joint Information Systems Committee (JISC), the Digital Preservation Coalition (DPC) and the UK Web Archiving Consortium (UKWAC), was timely and of great significance (all of the presentations are available on the DPC website). In the main it was concerned with the ‘how’ of web archiving. Among others, instructive papers were given on the Dioscuri emulator software, archiving tools and the technological ‘arms race’ (Adrian Brown) involved in developing such tools fast enough to keep pace with the evolution of web technologies themselves.

These discussions were interesting enough to engage even a non-specialist audience, and indeed they are essential if the web is in fact to ‘endure’. However, as you might expect from historians, what is most obviously interesting to us is the ‘what’ of web archiving. Treatment of this question was inevitably less prominent, but theories and practice of selection necessarily underpin the work of preservation, and were touched on by most of the speakers at the conference. This report attempts to draw out some of the ideas and concerns that presented themselves to us during the day, one of which Dr Webster raised during the lively final round-table session.

One distinction that might well be made is between the history of the web and history on the web. Several papers (Eric Meyer, Richard Davis, Kevin Ashley and others) explored means by which web archiving might aid future historians in assessing (for instance) the impact of the movement of social interaction into cyberspace (Facebook etc.), or patterns in social bookmarking, or the ‘geographies’ of IP addresses. All of these issues can be fruitfully investigated using the metadata of the web, rather than the content of each individual page. This type of work is likely to be taken care of by precisely such bodies as the Oxford Internet Institute, and has its own existing impetus. There was considerably less discussion about history on the web, that is, the content. It is this content that will stand as the periodical press and institutional records for the late twentieth and early twenty-first century.

We were heartened to see the major move being made by The National Archives systematically to archive the web estate of central government. It was clear during the day, however, that local government, non-governmental bodies, commercial organisations and the voluntary sector have no such single archival organisation with the resources to act decisively in this way. This is not to mention the proliferating volume of content generated by individuals: the diaries and working papers that are now never committed to paper.

That there is a sense of urgency about the issue showed through very clearly. There was, however, a persistent tension in the proceedings between opposed approaches to the question of what should be archived: between large-scale, centrally-directed attempts at comprehensive or near-comprehensive archiving of the web, and smaller-scale, discipline-specific approaches; between single-interface national archives, and proliferating individual, community-curated repositories. Cathy Smith (TNA) strongly advocated immediate action comprehensively to archive UK webspace, and to present it through a single interface. There was, in contrast, a sense during the Q&A session that curation of archives, and the process of selecting content to archive, could usefully be devolved to the (very many and diverse) user groups who are arguably best placed to identify the most significant materials. We would also suggest that it is possible to be too agnostic about our collective ability to predict which content will be of interest to future scholars, and which may safely be discarded. The argument that we must archive everything since we cannot know what is important should not be overstressed.

It is a common theme of conferences such as this that it is very difficult to involve researchers working in university departments of history, politics, archaeology, physics, philosophy, etc. The dangers inherent in this, for both archivists and researchers, were brought out by Cathy Smith. The TNA study, ‘Delivering coordinated UK web archives to user communities’, sought to address the questions of ‘What audiences should web archives anticipate …?’ and ‘What will the web be like as an historical source …?’ However, it proved more difficult than anticipated to find individual users to consult, even among contemporary historians. A (very) unscientific survey of colleagues indicates that there is little awareness even of the idea of web archives, let alone of their actual existence. Meanwhile, Helen Hockx-Yu noted that the process of content selection by web archivists at present continues to be informal, undocumented and relatively unsupported, being down to the individual archivist’s sense of the subject area for which (s)he is responsible. There is clearly a role for learned societies and analogous bodies across all the disciplines as mediators between archivists and scholars with an interest in seeing their sources preserved. Institutions such as the IHR perhaps stand a better chance of engaging the attention of their constituency than more general ventures from national archiving bodies. How precisely such mediation might be arranged remains to be seen, but discussions might usefully be begun.

We were also interested by Hanno Lecher’s paper on the Digital Archive for Chinese Studies (DACHS) citation repository. This seems to offer a useful model of selective web archiving in which the user community in effect chooses the content that is worthy of archiving simply by citing it. In the Chinese context, as Dr Lecher demonstrated, this approach also allows for a crucial degree of responsiveness to a rapidly changing digital context. It may be argued that this ought to be managed by publishers (which raises questions of resources, since it is not obviously in the interest of commercial publishers, and is expensive for smaller ones). An alternative model might involve the development of discipline-specific citation repositories, which might also be administered by learned societies and subject organisations. However it is handled, the citation ‘problem’ needs to be solved as it remains a major barrier to the use and acknowledgement of digital resources in academic research. (1)

Our final point of note was that of impending legal deposit legislation. While recognising that this would alleviate the remarkable state of affairs in which UKWAC were able to secure permission to archive only 25% of sites contacted, the effect on smaller publishers, through which much humanities scholarship is made available, ought to be taken into account. Content provided by individuals also needs to be considered, as do resources with complex third-party rights issues relating to their content. To take just one example, in British local history much interesting and important work is being done on the web by individuals entirely outside any institutional framework. It is often hard to find, and even harder to identify as valuable, but it is there, and it is significant.

Much enormously valuable work is being done in the field of digital preservation, and we heard about some leading examples at this conference. However, as we are sure all of the speakers would acknowledge, there is still a huge amount to be done and some very important decisions to be made.

Peter Webster and Jane Winters, August 2009

(1) A 2006 report, produced by the IHR with funding from the Arts and Humanities Research Council, found that many researchers were reluctant to cite digital resources either because they did not know how to or because they were concerned about their authority or longevity. In some cases scholars were using digital resources to find information, and then converting the citation into a reference to, for example, the original manuscript (Peer review of digital resources for the arts and humanities (London: Institute of Historical Research, 2006), p. 24).
