digital | Special Collections Blog

Emma Hancox, Digital Archivist, working in Special Collections and Theatre Collection, shares what she learnt at the Digital Preservation Coalition’s Unconference event in May.

In May I attended the Digital Preservation Coalition’s annual Unconference and Networking Event. The Digital Preservation Coalition or DPC is a membership organisation for institutions involved in digital preservation activities. Its vision is to build ‘a welcoming and inclusive global community, working together to bring about a sustainable future for our digital assets.’ My role at the University of Bristol is Digital Archivist and I was excited to attend this event as it was the perfect opportunity for me to meet others in similar positions from a variety of institutions in Europe, the UK and the US. The venue this year was the beautiful Royal Irish Academy, an historic building in the centre of Dublin. Its meeting room was lined with books and it was a grand setting for the events of the unconference which were spread over two days.

Digital Preservation Coalition Unconference and Networking Event 2024, Royal Irish Academy, Dublin.

The unconference had a strong focus on opportunities to network, share information and have conversations with others. This is particularly important to practitioners in the digital field as they are often the only member of staff working on digital preservation in their institution and there may not be any other Digital Archivists working in their local area. I am fortunate that we have two Digital Archive Assistants at the University of Bristol but it is still invaluable for me to be able to meet other Digital Archivists. The programme for the event was driven by members hence the ‘unconference’ name. As well as proposing talks in the run up to the event, we were able to suggest and vote on topics for discussion on the day. The winning areas were cloud storage, procurement of digital preservation systems, advocacy for digital preservation and lastly artificial intelligence which was no surprise due to its popularity as a topical issue!

One of the highlights of the event for me was going to the DPC’s reading club for the first time. Whilst I have attended reading groups outside of work, a reading group based on professional literature was new to me. This was the first in-person reading club as it is normally held online. The topic of the session was an article called ‘Toward a Conceptual Framework for Technical Debt in Archives’. ‘Technical debt’ is a term borrowed from commercial software development. It is a metaphor for future costs and work which are necessary because of compromises that were made (either intentionally or unintentionally) when setting up a system or project. Technical debt applies to digital archives work where there is often a legacy of early collections-based projects and infrastructure which are time intensive and resource heavy to maintain in the future and compete with the many other tasks archivists are faced with in their day-to-day jobs. The article proposed a model to assess past digital projects through an understanding of technical debt to make better decisions in the future. I found the article a useful starting point for evaluating the status of legacy projects and it gave me a framework I will use when analysing past projects that are presented to me. I enjoyed the discussion around how the article could be expanded to include case studies relating to born-digital rather than just digitised material. I plan to attend more of the DPC’s reading groups in the future online as they are a useful driver for engaging with professional literature.

Other talks I attended included one on fixity file checking in the cloud by Gen Schmitt from the University of Illinois. File fixity checking allows archivists to verify that files in their care have not become damaged or corrupted over time. The talk discussed performing fixities at scale across a whole repository of content and it was interesting to hear how the cost and efficiency of the process had been balanced. There was also a useful discussion around appraisal of born-digital collections led by Nicola Caldwell from the National Library of New Zealand. Appraisal in the digital realm is a very challenging area due to the sheer volume of digital files produced. It was encouraging to hear about tools that could help to make this piece of work easier such as Brunnhilde and the full version of FTK Imager. We are also grappling with challenges around how to appraise born-digital files at Bristol and because of the information gained from this session we will certainly look at these two tools as part of our future research and testing.

Professional visits were another part of the unconference and I was lucky to get a place on a tour of the Irish Traditional Music Archive. Housed in the Georgian Merrion Square the archive collects everything to do with Irish traditional music and has a fascinating and wide-ranging collection. After a tour of their digitisation and recording studios we learnt about their collections and how they provide a service to the public. As expected, the ITMA staff spend a lot of time clearing rights in their collections to be able to make them accessible. If you are ever in Dublin a visit is highly recommended.

I’m very grateful to have had the opportunity to attend this year’s unconference. It was a fantastic opportunity to make new connections and I plan to integrate what I learnt about dealing with legacy projects, fixity checking in the cloud and digital appraisal in my own role at the University of Bristol.

Sam Brenton, Bridging the Digital Gap Trainee, writes about his work in Special Collections and the Theatre Collection.

Hello, my name’s Sam and I’m the Digital Archives trainee on the Bridging the Digital Gap programme from The National Archives. This scheme aims to place people with technical skills within archives around the country to help preserve the increasing number of digital items they collect. Over the past fifteen months I’ve been at the University of Bristol, working on a number of digital archiving projects with Special Collections and the Theatre Collection. One of the things I’ve been working on is expanding the quantity of web pages in the University’s web archive.

Illustration by Jørgen Stamp digitalbevaring.dk CC BY 2.5 Denmark.

So what is a web archive? And how is it different from the website itself? A web archive is a collection of web pages preserved offline and is totally independent from the source website. This means that should the original web pages become unavailable, or are altered in any way, there is still a perfect copy of the original. The pages are stored as WARC files, a format specifically designed for the preservation of web pages as it acts as a container for all the elements that make up the web page, such as text and images.

When we’re concerned with long term preservation we can’t guarantee that the pages will still be hosted by their original source. Sometimes this is simply because an organisation likes to regularly refresh its content, for example a manufacturer listing their current products. But even something that appears to be more permanent, such as online encyclopedias and other information resources, may be altered, or older content might be removed without warning. It’s important that an archive is aware of any websites that come under the scope of its collection policy, particularly any that might be at risk.

I’ve been archiving parts of the University’s own website, such as the various news and announcements and the catalogues of courses offered, by crawling them in Preservica (our digital preservation system). The websites were identified by the University Archivist as being similar to traditional paper elements of the archive. So in order to mirror those collections, I’ve been doing small individual web-crawls based on dates (either year or month). These smaller crawls deliver more consistent results and will allow for better cataloguing in the future. Sometimes this is challenging, as it takes a lot of time to process each crawl. When web crawling, it is common to run in to issues when rendering the pages, this if usually because complex JavaScript elements of modern web pages, such as interactivity and animations, are difficult for the crawlers to capture, so it was important that I checked each crawl before adding it to the archive. Fortunately for me, the sites I’ve been crawling are relatively simple, so the only issues I had were with the .WARC viewers themselves. Each one behaves slightly differently, so it’s useful to try rendering the crawl in a different viewer (such as Conifer) if there are issues with it, before re-doing the crawl.

WARC files of crawled University web pages in Preservica.

In the future I’d like to look into adding relevant external web pages to the collections. In due course, we also hope to be able to catalogue and make web crawls accessible. In the longer term I would like to look into archiving social media profiles. These are far more challenging to preserve due to log-in requirements and the large number of interactive elements, but they are arguably just as important as standalone web pages. The posts are far more ephemeral than web pages and we are reliant on the platform to maintain them. They are also a key way that the University communicates with the public.

Special Collections Blog

University of Bristol Library

Tag Archives: digital

My experience of attending the DPC Unconference 2024

Archiving the Web: collecting web crawls for the University Archive