Fixity: Safely Sending, Receiving and Moving Digital Assets


Fixity is important for the long-term preservation of digital assets. You need to be able to trust that a digital object has remained unchanged and can be accessed in the future. I previously discussed how to ensure trust in digitisation techniques using a camera as part of GLAM Blog Club, and alluded to a follow-up post on trust in digital preservation. The September theme of "Safe" provides the perfect opportunity.

Knowing that a digital file remains exactly the same as it was when received is a bedrock principle of digital preservation (Bailey 2014). It is also important to know that the file is the same as when it was sent. This is where checksums come into play. Often referred to as a 'digital fingerprint', a checksum is generated using a cryptographic hash algorithm (such as MD5, as seen in the image below). There are many open source programs available that can do this for you, but you need to understand what checksums are and how you are going to use them before you begin.

A checksum is a fixed-length string of hexadecimal characters (0-9 and A-F). The length of the string is determined by the algorithm, and the longer the string, the longer it can take to generate. Theoretically, a checksum is unique to the bit stream of a file. Any change to the file, whether a single pixel in an image or a single character in a text document, will change the checksum. Because a checksum is generated from the content of the file, changes to the file name or to file system metadata such as timestamps will not result in a different checksum. Metadata embedded inside the file itself (EXIF or ID3 tags, for example) is part of the content, so editing it does change the checksum.
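To make this concrete, here is a minimal sketch using Python's standard `hashlib` module. The helper name `file_checksum` and the file names are my own choices for illustration, not part of any particular tool. It checks a file, renames it and changes its timestamps, then alters one character.

```python
import hashlib
import os
import tempfile


def file_checksum(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of a file's contents, read in chunks."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as workdir:
    original = os.path.join(workdir, "letter.txt")
    with open(original, "wb") as handle:
        handle.write(b"Dear reader, this is the original text.")
    before = file_checksum(original)

    # Renaming the file and changing its timestamps leaves the content alone.
    renamed = os.path.join(workdir, "renamed.txt")
    os.rename(original, renamed)
    os.utime(renamed, (1_000_000_000, 1_000_000_000))
    after_rename = file_checksum(renamed)

    # Changing a single character alters the content, and so the checksum.
    with open(renamed, "wb") as handle:
        handle.write(b"Dear reader, this is the original text!")
    after_edit = file_checksum(renamed)

# MD5 digests are 32 hex characters; the rename matches, the edit does not.
print(len(before), before == after_rename, before == after_edit)
# prints: 32 True False
```

Reading in chunks rather than all at once keeps memory use flat, which matters for large audio or video files.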


Example manifest from a bag conforming to the BagIt standard, utilising the MD5 checksum algorithm.

While on the topic of safety, it is important to note the difference between using checksums for fixity and using checksums for security. While weaker algorithms such as MD5 and SHA-1 are suitable for fixity, they have known vulnerabilities and are not suitable for anything that requires secure transfer or use in a legal setting. The vulnerabilities allow those with malicious intent to create collisions (where two different files have the same checksum) or make changes to a file while retaining the original checksum. While the risk of a collision occurring naturally is very low (theoretically, around 21 quintillion files would be needed for an even chance of a collision with MD5), this risk can be mitigated further by the use of multiple checksums in a digital repository.
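As a rough illustration of the multiple-checksum idea, the sketch below computes two digests in a single pass over the data, so the file only has to be read once. The function name `multi_checksum` and the choice of MD5 plus SHA-256 are assumptions made for the example, not a recommendation from any standard.

```python
import hashlib
import io


def multi_checksum(stream, algorithms=("md5", "sha256"), chunk_size=65536):
    """Compute several hex digests from one read of a binary stream."""
    digests = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        # Feed the same chunk to every algorithm before reading the next one.
        for digest in digests.values():
            digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


# An in-memory stream stands in for an open file here.
sums = multi_checksum(io.BytesIO(b"hello world"))
for name, value in sums.items():
    print(name, value)
```

An altered file would have to collide under both algorithms at once to go unnoticed, which is far harder than colliding under either one alone.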

A lot of the work I have been doing over the past six months has revolved around checksums and their use in both born-digital and digitised workflows. On the back of research into the BagIt standard towards the end of 2016, I have been involved in implementing its use for born-digital acquisitions as well as outsourced digitisation projects at the Library. The standard allows us to validate the checksums of each "bag" every time it is moved or copied. This means we can safely transfer the contents from physical media (external hard drive, USB, etc.) to our internal network storage, move them from one network location to another, and ingest digital assets into our digital repository, with the ability to validate the original checksum at every step.
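The validation step can be sketched in a few lines of Python. This is a hand-rolled illustration, not the Library's workflow and not the API of any BagIt tool: the helper names `md5_of` and `validate_manifest`, the bag name and the file names are all hypothetical. It relies only on the manifest layout defined by the standard, where each line of `manifest-md5.txt` holds a checksum followed by a path relative to the bag.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of(path):
    """Return the MD5 hex digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def validate_manifest(bag_dir, manifest_name="manifest-md5.txt"):
    """Return the paths whose current checksum differs from the manifest."""
    bag_dir = Path(bag_dir)
    failures = []
    manifest = (bag_dir / manifest_name).read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        expected, relative_path = line.split(maxsplit=1)
        target = bag_dir / relative_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(relative_path)
    return failures


with tempfile.TemporaryDirectory() as workdir:
    bag = Path(workdir) / "example_bag"
    (bag / "data").mkdir(parents=True)
    (bag / "data" / "page1.tif").write_bytes(b"first image bytes")
    (bag / "data" / "page2.tif").write_bytes(b"second image bytes")

    # Record the checksums once, at the point of receipt.
    payload = sorted((bag / "data").iterdir())
    lines = [f"{md5_of(p)}  {p.relative_to(bag).as_posix()}" for p in payload]
    (bag / "manifest-md5.txt").write_text("\n".join(lines) + "\n", encoding="utf-8")

    clean_result = validate_manifest(bag)

    # Simulate damage during a move or copy, then validate again.
    (bag / "data" / "page2.tif").write_bytes(b"second image bytez")
    damaged_result = validate_manifest(bag)

print(clean_result, damaged_result)
# prints: [] ['data/page2.tif']
```

Because the manifest travels with the bag, the same check can be repeated after every transfer without going back to the source media.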

Checksums are not the only way to ensure fixity. The BagIt standard also includes a Payload-Oxum, which records the total size of the payload in bytes and the number of files it contains. This is useful as a quick first check before beginning the process-heavy checksum validation, and is especially advantageous when dealing with larger digital assets such as audio or video.
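A minimal sketch of that quick check follows, assuming a bag whose `bag-info.txt` carries a `Payload-Oxum: OctetCount.StreamCount` line as the standard describes. The function names `payload_oxum` and `declared_oxum` and the sample files are hypothetical, chosen only for this example.

```python
import tempfile
from pathlib import Path


def payload_oxum(bag_dir):
    """Return 'OctetCount.StreamCount' for the files under data/."""
    files = [p for p in (Path(bag_dir) / "data").rglob("*") if p.is_file()]
    total_bytes = sum(p.stat().st_size for p in files)
    return f"{total_bytes}.{len(files)}"


def declared_oxum(bag_dir):
    """Read the Payload-Oxum value from bag-info.txt, or None if absent."""
    info = (Path(bag_dir) / "bag-info.txt").read_text(encoding="utf-8")
    for line in info.splitlines():
        label, _, value = line.partition(":")
        if label.strip().lower() == "payload-oxum":
            return value.strip()
    return None


with tempfile.TemporaryDirectory() as workdir:
    bag = Path(workdir) / "example_bag"
    (bag / "data").mkdir(parents=True)
    (bag / "data" / "track1.wav").write_bytes(b"\x00" * 1000)
    (bag / "data" / "track2.wav").write_bytes(b"\x00" * 500)
    (bag / "bag-info.txt").write_text("Payload-Oxum: 1500.2\n", encoding="utf-8")

    matches_before = payload_oxum(bag) == declared_oxum(bag)

    # A file lost in transfer changes both the byte total and the file count.
    (bag / "data" / "track2.wav").unlink()
    matches_after = payload_oxum(bag) == declared_oxum(bag)

print(matches_before, matches_after)
# prints: True False
```

This only reads file sizes, not file contents, so it is fast even for very large payloads. It cannot detect changes that leave sizes and counts intact, which is why it comes before checksum validation rather than replacing it.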



Cover image: courtesy of www.digitalbevaring.dk.

References and further reading:


Matthew Burgess
