
Archiving the Web: collecting web crawls for the University Archive

Sam Brenton, Bridging the Digital Gap Trainee, writes about his work in Special Collections and the Theatre Collection.

Hello, my name’s Sam and I’m the Digital Archives trainee on the Bridging the Digital Gap programme from The National Archives. This scheme aims to place people with technical skills within archives around the country to help preserve the increasing number of digital items they collect. Over the past fifteen months I’ve been at the University of Bristol, working on a number of digital archiving projects with Special Collections and the Theatre Collection. One of the things I’ve been working on is expanding the quantity of web pages in the University’s web archive.

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark.


So what is a web archive? And how is it different from the website itself? A web archive is a collection of web pages preserved offline, totally independent of the source website. This means that should the original web pages become unavailable, or be altered in any way, there is still a perfect copy of the original. The pages are stored as WARC files, a format designed specifically for preserving web pages: it acts as a container for all the elements that make up a web page, such as text and images.
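To make the "container" idea a little more concrete, here is a minimal sketch in Python of how a WARC file lays out its records: a version line, some named header fields, a blank line, and then the captured content. It only handles uncompressed data held in memory, and the function names `build_warc_record` and `read_warc_records` are made up for this illustration. They are not part of Preservica or of any WARC tool, and real archives are normally written and read with dedicated software rather than by hand.

```python
import uuid
from datetime import datetime, timezone

CRLF = b"\r\n"


def build_warc_record(target_uri: str, payload: bytes, content_type: str) -> bytes:
    """Serialise one WARC 'resource' record (hypothetical helper for illustration)."""
    headers = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    head = b"WARC/1.0" + CRLF
    for name, value in headers:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line separates the headers from the content block,
    # and two CRLFs close the record.
    return head + CRLF + payload + CRLF + CRLF


def read_warc_records(data: bytes):
    """Yield (headers, payload) for each record in an uncompressed WARC byte string."""
    pos = 0
    while pos < len(data):
        head_end = data.index(CRLF + CRLF, pos)
        lines = data[pos:head_end].decode("utf-8").split("\r\n")
        # lines[0] is the version line ("WARC/1.0"); the rest are "Name: value" fields.
        headers = dict(line.split(": ", 1) for line in lines[1:])
        length = int(headers["Content-Length"])
        start = head_end + len(CRLF) * 2
        yield headers, data[start:start + length]
        # Skip the content block and the two CRLFs that end the record.
        pos = start + length + len(CRLF) * 2
```

Because every record carries its own address, content type and length, a page's HTML and its images can sit side by side in one file and still be pulled apart again later.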

When we’re concerned with long term preservation we can’t guarantee that the pages will still be hosted by their original source. Sometimes this is simply because an organisation likes to regularly refresh its content, for example a manufacturer listing their current products. But even something that appears to be more permanent, such as online encyclopedias and other information resources, may be altered, or older content might be removed without warning. It’s important that an archive is aware of any websites that come under the scope of its collection policy, particularly any that might be at risk.

I’ve been archiving parts of the University’s own website, such as the various news and announcements and the catalogues of courses offered, by crawling them in Preservica (our digital preservation system). The websites were identified by the University Archivist as being similar to traditional paper elements of the archive. So in order to mirror those collections, I’ve been doing small individual web crawls based on dates (either year or month). These smaller crawls deliver more consistent results and will allow for better cataloguing in the future. Sometimes this is challenging, as it takes a lot of time to process each crawl. When web crawling, it is common to run into issues when rendering the pages. This is usually because complex JavaScript elements of modern web pages, such as interactivity and animations, are difficult for the crawlers to capture, so it was important that I checked each crawl before adding it to the archive. Fortunately for me, the sites I’ve been crawling are relatively simple, so the only issues I had were with the WARC viewers themselves. Each one behaves slightly differently, so if a crawl has issues it’s useful to try rendering it in a different viewer (such as Conifer) before re-doing the crawl.
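As a rough illustration of splitting one large crawl into small date-based ones, the sketch below groups a list of pages by the year or month they were published, so each group could become its own crawl. The input format (pairs of URL and publication date), the function name `group_by_period` and the example.org addresses are all assumptions made for this example. It does not show how Preservica itself is configured.

```python
from collections import defaultdict
from datetime import date


def group_by_period(pages, by="month"):
    """Group (url, published_date) pairs into batches keyed by year or year-month.

    Hypothetical helper: each resulting batch stands for one small crawl.
    """
    batches = defaultdict(list)
    for url, published in pages:
        if by == "year":
            key = f"{published.year:04d}"
        else:
            key = f"{published.year:04d}-{published.month:02d}"
        batches[key].append(url)
    # Sort by key so the batches come out in chronological order.
    return dict(sorted(batches.items()))


# Example input with placeholder URLs.
pages = [
    ("https://example.org/news/2020/graduation", date(2020, 2, 3)),
    ("https://example.org/news/2019/open-day", date(2019, 11, 15)),
    ("https://example.org/news/2020/new-library", date(2020, 2, 20)),
]
monthly = group_by_period(pages)
yearly = group_by_period(pages, by="year")
```

Keeping each batch to a single month or year means a failed or badly rendered crawl only needs that one batch re-doing, and the batch label can double as a starting point for cataloguing.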

WARC files of crawled University web pages in Preservica.


In the future I’d like to look into adding relevant external web pages to the collections. In due course, we also hope to be able to catalogue and make web crawls accessible. In the longer term I would like to look into archiving social media profiles.  These are far more challenging to preserve due to log-in requirements and the large number of interactive elements, but they are arguably just as important as standalone web pages. The posts are far more ephemeral than web pages and we are reliant on the platform to maintain them.  They are also a key way that the University communicates with the public.


Covid-19 Collecting and the University of Bristol Community

Like me, you might well be sick of the phrase that we are living through interesting times, though it could be argued that living through a Covid-19 lockdown and trying to work/study from home is ‘interesting’.

As this is the case, and as we like to collect contemporary materials which may go into a future archive relating to the coronavirus and how it has affected the University of Bristol Community, we are proposing a collection of Covid-19 related materials.

We are well aware that this is a stressful and sensitive situation for all, and that many of you are busier than usual: trying to adapt to new situations; working to care for people and develop strategies to combat the pandemic; and suffering losses of loved ones. However, we would appeal to you all, as part of your busy day, to consider what is happening around you. If you think it could be relevant to how people in the future will study how we coped with the Covid-19 pandemic, do please get in touch with us.

We should receive archives of committees and the like due to our current collecting of archives of the University of Bristol, but there are many other strands that will be of interest.

-Webpages and SharePoint sites: The University Coronavirus web pages and SharePoint sites for students and staff



The work with the Community to help, support, and discover

The work of the Uncover team

We have created our own SharePoint site so people can upload material and submit it to us. This is new to us, so let us know if there are any problems with it.  It is now available here: https://uob.sharepoint.com/teams/grp-Covid-19-collecting

-Press and media: Our academics and students are busy engaging with many forms of media (we are aware of the public relations web pages and thank them for being supportive)

-Social Media: Blogs, Twitter, Instagram, Facebook, Yammer, all of which may show a more informal side of what is happening

-Emails: From colleagues/managers/schools to students/staff/individuals giving support and laying down new regimes/suggestions

-Talks and interviews: Such as staff addresses and talks from individual academics.

-Photographs: Images of your working-from-home desk/study area. Your new co-worker pets and family. Rainbows and teddy bears in the windows of houses around you. Signs in shop/business/domestic windows. Graffitied messages of support

-Objects: When we go back to campus keep the signs put up to record that a building/library was closed. Did you sew a face mask? Did someone you know create PPE using school 3D-printers or sew scrubs? Did you get involved in volunteering in the community in many different ways? If you don’t want to give up the actual object we would be happy to have a photograph.


-Writing: Some people are writing diaries, finding solace in poetry, reading more (or less). The Brigstow Institute has supported diary projects, Mass Observation is collecting diaries on 12 May, and we would love to see your work (but only if you are happy to share).



Our Request

We are going to concentrate on the University as a community, be that student, staff, or alumni. We are interested in your story, whatever faculty you are based in (not just the arts) and whatever your job title or course of study.

We are conscious that there is a lot of collecting already going on. For instance, M Shed in Bristol is collecting, as are multiple archives, libraries, museums, and organisations. Collections may be physical or digital, or a mixture of the two. The Wellcome Trust is also giving some good guidelines about the ethics of current collecting, which we are very anxious to follow. So if you would like to talk to us that is brilliant, but if you have already offered your materials to another organisation that is equally fine (we are a bit late in asking).

We also realise how busy everyone is, and though we seem to be entering the next stage of the pandemic after 7 weeks of lockdown, we would rather that you save something and get in touch in the future, when you have time to process what you are living through. As I write this on 11 May 2020, the Government, the University authorities, and the wider community are talking about what the next stage will be for us all in coping with what we are experiencing. It is a rapidly changing situation, and we would love to record this as it is happening.

Do get in touch with us at

grp-Covid-19-collecting@groups.bristol.ac.uk or special-collections@bristol.ac.uk

We would love to hear from you, and thank you for your time.

Special Collections Web Page: https://www.bristol.ac.uk/library/special-collections/

Hannah Lowery on behalf of the Special Collections Team

11 May 2020

(images all Hannah Lowery)