Archiving the Web: collecting web crawls for the University Archive

Sam Brenton, Bridging the Digital Gap Trainee, writes about his work in Special Collections and the Theatre Collection.

Hello, my name’s Sam and I’m the Digital Archives trainee on the Bridging the Digital Gap programme from The National Archives. This scheme aims to place people with technical skills within archives around the country to help preserve the increasing number of digital items they collect. Over the past fifteen months I’ve been at the University of Bristol, working on a number of digital archiving projects with Special Collections and the Theatre Collection. One of the things I’ve been working on is expanding the quantity of web pages in the University’s web archive.

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark.

So what is a web archive? And how is it different from the website itself? A web archive is a collection of web pages preserved offline, totally independent of the source website. This means that should the original web pages become unavailable or be altered in any way, there is still a copy of the pages as they were when they were captured. The pages are stored as WARC files, a format specifically designed for the preservation of web pages, because it acts as a container for all the elements that make up a web page, such as text and images.
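To give a rough idea of what "container" means here, the sketch below builds two very simplified WARC-style records in memory (a page and an image) and reads them back. It is only an illustration: real WARC files carry more header fields (such as a record ID and date), are usually compressed, and are normally written by the crawler rather than by hand. The function names and the example.org addresses are made up for this example.

```python
import io


def build_record(record_type, target_uri, payload):
    """Assemble one simplified WARC-style record as bytes (illustrative only)."""
    headers = [
        "WARC/1.0",
        f"WARC-Type: {record_type}",
        f"WARC-Target-URI: {target_uri}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; a blank line separates headers from content.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is followed by two CRLFs.
    return head + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (headers, payload) for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        # 'line' is now the version line (e.g. WARC/1.0); read the named fields.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The content is read by length, so it can hold any bytes at all.
        payload = stream.read(int(headers["Content-Length"]))
        yield headers, payload


# Hypothetical page and image, stored side by side in one "archive".
page = b"<html><body><h1>News</h1><img src='logo.png'></body></html>"
image = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8
archive = (
    build_record("resource", "https://example.org/news", page)
    + build_record("resource", "https://example.org/logo.png", image)
)

for headers, payload in read_records(io.BytesIO(archive)):
    print(headers["WARC-Target-URI"], len(payload), "bytes")
```

Because each record states its own length, text and binary content such as images can sit in the same file without interfering with one another.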

When we’re concerned with long-term preservation, we can’t guarantee that the pages will still be hosted by their original source. Sometimes this is simply because an organisation likes to regularly refresh its content, for example a manufacturer listing their current products. But even something that appears to be more permanent, such as online encyclopedias and other information resources, may be altered, or older content might be removed without warning. It’s important that an archive is aware of any websites that come under the scope of its collection policy, particularly any that might be at risk.

I’ve been archiving parts of the University’s own website, such as the various news and announcements and the catalogues of courses offered, by crawling them in Preservica (our digital preservation system). The websites were identified by the University Archivist as being similar to traditional paper elements of the archive. So, in order to mirror those collections, I’ve been doing small individual web crawls based on dates (either year or month). These smaller crawls deliver more consistent results and will allow for better cataloguing in the future. Sometimes this is challenging, as it takes a lot of time to process each crawl. When web crawling, it is common to run into issues when rendering the pages. This is usually because the complex JavaScript elements of modern web pages, such as interactivity and animations, are difficult for crawlers to capture, so it was important that I checked each crawl before adding it to the archive. Fortunately for me, the sites I’ve been crawling are relatively simple, so the only issues I had were with the WARC viewers themselves. Each one behaves slightly differently, so if a crawl does not display properly it’s useful to try rendering it in a different viewer (such as Conifer) before re-doing the crawl.
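Part of checking a crawl can be done with a simple comparison of what was expected against what was captured. The sketch below is one possible way to do that and is not how Preservica works: the function name, the example.org addresses and the idea of having a list of expected pages to hand are all assumptions made for illustration. It would only flag missing or failed pages, so problems with how a page renders would still need checking by eye in a viewer.

```python
from collections import Counter


def summarise_crawl(expected_urls, captured):
    """Compare the pages a crawl was meant to capture with what it returned.

    expected_urls: the page addresses the crawl should have collected.
    captured: a dict mapping each captured address to its HTTP status code.
    """
    expected = set(expected_urls)
    # Pages we wanted but the crawl never collected.
    missing = sorted(expected - set(captured))
    # Pages that were collected but did not return a normal "200 OK".
    failed = sorted(url for url, status in captured.items() if status != 200)
    # A quick count of each status code seen in the crawl.
    statuses = dict(Counter(captured.values()))
    return {"missing": missing, "failed": failed, "statuses": statuses}


# Hypothetical month-by-month news pages for one small crawl.
expected = [
    "https://example.org/news/2020/01",
    "https://example.org/news/2020/02",
    "https://example.org/news/2020/03",
]
captured = {
    "https://example.org/news/2020/01": 200,
    "https://example.org/news/2020/02": 404,
}

report = summarise_crawl(expected, captured)
print("Missing:", report["missing"])
print("Failed:", report["failed"])
```

Keeping each crawl small, for example one month at a time, keeps a report like this short enough to read through before deciding whether the crawl needs to be re-done.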

WARC files of crawled University web pages in Preservica.

In the future I’d like to look into adding relevant external web pages to the collections. In due course, we also hope to be able to catalogue the web crawls and make them accessible. In the longer term I would like to look into archiving social media profiles. These are far more challenging to preserve due to log-in requirements and the large number of interactive elements, but they are arguably just as important as standalone web pages. The posts are far more ephemeral than web pages and we are reliant on the platform to maintain them. They are also a key way that the University communicates with the public.

Special Collections Blog

University of Bristol Library
