I attended the 16th International Conference on Digital Preservation (iPRES 2019) as part of my research travel project funded by the Gordon Darling Foundation. My previous post discussed the tutorial and workshop I attended on day one. Here I will discuss a few selected highlights from the conference.

Towards a Universal Virtual Interactor (UVI) for Digital Objects

Euan Cochrane (Yale University Library), Klaus Rechert (OpenSLX GmbH), Ethan Gates (Yale University Library)

Photograph by Sebastiaan ter Burg, CC BY 4.0 / flickr
Cochrane discussed the Universal Virtual Interactor (UVI) that forms part of the Emulation-as-a-Service Infrastructure (EaaSI) program and how it builds on many years of preceding work. This includes the Universal Virtual Computer design concept from IBM, the Dioscuri emulator design from the National Library of the Netherlands, the Keeping Emulation Environments Portable (KEEP) project and the Baden-Württemberg Functional Long-Term Archiving (bwFLA) project which developed the suite of tools referred to as Emulation as a Service (EaaS).

UVI was described as a two-part process, where (1) a file is provided to a program interface for analysis, which suggests pre-configured emulators that can then (2) interact with or render it. Analysis is based on identification (using Siegfried and DROID), dates (eg. created, modified) and any further metadata provided or available. Cochrane used a Works word processor file, which does not render accurately in modern Microsoft Word, to demonstrate how the UVI utilises a script to automatically open the file in the appropriate software once the emulated environment opens in a web browser.
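Tools like Siegfried and DROID perform the identification step by matching byte signatures from the PRONOM registry against the file. As a toy illustration only (this is not how either tool is implemented, and the signature table here is a tiny hand-picked sample), a 'magic bytes' match might look like:

```python
# Toy format identification by magic bytes -- a simplified stand-in for
# the PRONOM signature matching performed by Siegfried and DROID.
SIGNATURES = {
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP-based container (e.g. OOXML, ODF)",
    b"\x89PNG\r\n\x1a\n": "PNG image",
}

def identify(first_bytes: bytes) -> str:
    """Return a format guess based on the leading bytes of a file."""
    for magic, name in SIGNATURES.items():
        if first_bytes.startswith(magic):
            return name
    return "unknown"
```

Real signature files also handle offsets, variable byte sequences and container inspection, which is why format identification is a project in its own right.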

Cochrane highlighted how documentation and discovery are an important part of the project; the team is documenting various aspects of software such as which file formats it can import, save to and export. As I mentioned in my IDCC19 reflection discussing EaaSI, copyright law is a challenge for a global emulation network spanning multiple legal jurisdictions, but they are using the fair-use rights available in the United States of America to facilitate the sharing of environments across the current network.


Digital Preservation and Enterprise Architecture Collaboration at the University of Melbourne: A Meeting of Mindsets 

Jaye Weatherburn (University of Melbourne), Lyle Winton (University of Melbourne) and Sean Turner (University of Melbourne)
Photograph by Matthew Burgess
Weatherburn, Winton and Turner provided a great example of working together to achieve a common goal. Their presentation style was engaging and I enjoyed their 'real talk', where they sat down for a question and answer session about a meeting of mindsets between digital preservation and enterprise architecture. They highlighted how collaboration has been a driver for greater visibility and understanding of digital preservation across the University of Melbourne, which is now on their Enterprise Architecture Roadmap as an important socio-technical ecosystem. Turner discussed key information that is useful for IT, where a focus on standards and models such as OAIS are useful by providing a framework and terminology to understand the challenge of digital preservation. Concrete understanding of digital preservation was the starting point, with an awareness of terms like 'born-digital collections' and their importance.

Their talk highlighted the importance of dedicated, permanent roles (rather than a project approach) with the ability to establish person-to-person relationships and shared mindsets. Enterprise architecture aims to reduce complexity and cost through standardisation, with a strong focus on cost and effectiveness, and digital preservation is now seen as a key component of this. Their collaborative approach has resulted in more understanding across the organisation that digital preservation is more than technology, and also requires ongoing work around resourcing, policy, process and governance.


Cloud Atlas: Navigating the Cloud for Digital Preservation 

Andrea Goethals (National Library of New Zealand), Jefferson Bailey (Internet Archive), Roslynn Ross (Library and Archives Canada) and Nicholas Taylor (Stanford Libraries)

This panel discussion offered contrasting institutional perspectives on the potential and perils of the cloud for digital preservation, featuring case studies on how memory institutions can leverage the cloud in deliberate and mission-supporting ways, and how some are working to build alternative, community-based infrastructures.

Roslynn Ross discussed questions to ask when setting course for cloud storage: How will we move our collection to the cloud? How will we preserve integrity? How will we deal with privacy and copyright? How will we ensure security? What is our exit strategy? She said that cloud providers guarantee they will get your data back, but not in what format, so you need to be clear about that. She spoke about how Library and Archives Canada is taking an iterative approach by choosing a pilot project/collection to begin with.

Nicholas Taylor said that "The Cloud" is playing a growing role in digital preservation, but which "The Cloud" we use, and how we use it, matters both for our missions and the likely success of our efforts. He discussed threats to digital information and the difference between commercial and community clouds, highlighting pilot models and values-aligned partnerships to build private clouds. He said that our values as memory institutions suggest there are questions we should be asking of large, for-profit cloud service providers, and asked whether we can claim to have custody and intellectual control over content stored in a commercial cloud. He also highlighted opaque data integrity with commercial providers, where we must trust that the service is performing fixity checks and it may be prohibitively expensive to retrieve content to perform hashing ourselves.
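Independent fixity checking is straightforward once you hold the bytes locally; the cost Taylor describes comes from having to retrieve content out of the cloud before you can hash it. A minimal sketch of the local half of that process, using only the standard library, might look like:

```python
import hashlib

def sha256_of(stream, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 digest of a byte stream in chunks,
    so large objects are never loaded into memory at once."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

Comparing this digest against one recorded at ingest is the fixity check; the hard part in a commercial cloud is the egress bill for reading the stream at all.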

Jefferson Bailey spoke about how the Internet Archive runs its own data centres, how its archiving service supports the archive itself, and that it does not monetise input and output. He said that only 20% of users download their data, which highlights that they trust the Internet Archive as a storage provider.

During discussions, concerns were raised about the cost of storage and retrieval in the cloud as collections grow. The Internet Archive was used as an example of the financial benefit of working on premises rather than outsourcing when working at scale: it would cost three or four years of the Internet Archive's budget to retrieve its entire store of data from a certain commercial provider. There was conversation about keeping a local copy, where the cloud copy could be seen as redundancy. Panellists were asked for their thoughts on the long-term reliability of the cloud, where Ross said you really need to start with your exit strategy in mind.


The Integrated Preservation Suite: Scaled and automated preservation planning for highly diverse digital collections

Peter May (British Library), Maureen Pennock (British Library), David A. Russo (British Library)

Photograph by Sebastiaan ter Burg, CC BY 4.0 / flickr

Peter May discussed the Integrated Preservation Suite (IPS) at the British Library, which aims to enhance preservation planning capability through automation and documentation. He highlighted how digital preservation knowledge is generated at the Library through collection profiles (documentation to understand what collections are about, their preservation intent), help desk (where colleagues from across the Library can send requests, whether it is about rendering issues in the reading room, or curators dealing with a new digital acquisition), projects (such as Emerging Formats) and file format assessments (detailed understanding on particular formats and their preservation risks).

May described IPS as a suite of tools and services providing a way to manage knowledge and facilitate access to it in an automated way. He said that the knowledge base underpins a lot of activity and was developed using their own data model. They pull information from various sources, where their own knowledge is treated as a separate data source. To avoid duplication of information across multiple sources, information is first held in a staging area where a user compares existing data and determines whether to keep/discard/merge new information. While this is currently a manual process, they aim to automate this in the future. He mentioned that the National Library of Australia provided them with a spreadsheet documenting links between file formats, software and hardware which expanded their knowledge base to help determine what file formats software can create, render, validate and extract metadata from.
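The staging step May describes (compare incoming information against what is already held, then keep, discard or merge) could be sketched roughly as below. The field handling and the `_conflicts` flag here are my own illustration, not the IPS data model or implementation:

```python
def stage(existing: dict, incoming: dict) -> dict:
    """Merge an incoming knowledge-base record into an existing one.
    Existing values win on conflict; genuinely new fields are added.
    Conflicting fields are flagged for the manual review step.
    (Hypothetical behaviour, sketched for illustration only.)"""
    merged = dict(incoming)
    merged.update(existing)  # existing assertions take precedence
    conflicts = {k for k in incoming
                 if k in existing and existing[k] != incoming[k]}
    merged["_conflicts"] = sorted(conflicts)
    return merged
```

Automating this, as the team intends, presumably means encoding rules for which source wins each conflict rather than surfacing everything for a human.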

Discussion at the end of this session highlighted the potential community benefits of this project, where it could possibly tie in well with the Preservation Action Registry and also be helpful to others if the knowledge base was made publicly available. May commented that this has been thought about, but there are no definitive plans on how they will do that at this point.

I was really interested in the concept of collection profiles and preservation plans, particularly when it comes to complex collections. I later discussed with May whether they intend to create a profile and/or plan for each individual collection, where we agreed it would not be necessary for all incoming collections given the analogous nature of some. I can see this in my current work, where it would be useful to create an overall collection profile and preservation plan for all incoming born-digital photographs for example, but look at more specific plans for incoming born-digital manuscript collections that might be more heterogeneous in scope and file formats.


The Australasia Preserves Story: Building a digital preservation community of practice in the Australasian region

Jaye Weatherburn (University of Melbourne)

Photograph by Sebastiaan ter Burg, CC BY 4.0 / flickr
Australasia Preserves' presence at iPRES was evident with Weatherburn's poster presentation on building capacity through collaboration. The poster aimed to highlight the growth of the digital preservation community of practice and generate discussion and input on how to build on the initiative. I had conveniently printed a bunch of flyers before I left Australia and decided to stand in solidarity with Jaye and help prompt discussion, answer questions and promote the community. The flyers proved popular and there were a few questions on whether we had any stickers - something to consider for next time!


You can find all of the papers I have discussed, and much more, on the iPRES2019 website (https://ipres2019.org/program/conference-programme/) and in the collaborative notes on Google Drive.

Cover image: Taken during the canal cruise on the way to the conference dinner, 18 September 2019. Photograph by Matthew Burgess

I was fortunate to have the opportunity to travel to Amsterdam and attend the 16th International Conference on Digital Preservation (iPRES 2019) as part of my research travel project funded by the Gordon Darling Foundation. Hosted by the Dutch Digital Heritage Network at the Eye Film Museum, this was a great opportunity to both attend iPRES and visit Amsterdam for the first time.

Coming off the back of three weeks in the United Kingdom discussing all things digital preservation, I was over my jet lag but will admit I was exhausted before the week even started. I had a burst of energy from day one, though, being surrounded by so many brilliant minds in the digital preservation community and with so many interesting sessions to attend and people to talk to.

Day One: Tutorials and workshops

There were many interesting tutorials and workshops to choose from, including preserving complex digital objects to understanding and implementing PREMIS. I based my choices on areas of interest relating to my current role, including email and preservation actions.

Review, Appraisal, and Triage of Mail (RATOM)

Presented by Cal Lee and Kam Woods, participants were introduced to the RATOM project and examined development efforts so far. With selection and appraisal a key motivation for the project, it was highlighted that with 48 years of history, emails serve as a massive source of evidence and information. RATOM aims to assist in determining which records need to be kept and which records might have sensitivities. The goal is not perfect identification, but useful information for appraisal - a stage before deciding which emails to keep, which to redact, which to tag for further review, and so on.

I was very interested to hear discussion on the aim to provide a way to migrate from the PST format, which is proprietary and complicated. It was noted that there are many different (often commercial) tools available to convert PST to other formats, but they are a 'black box' and their outputs may differ. You only need to look at the almost daily posts in the Digital Curation Google Group to see how difficult it can be to find a suitable tool for this task. Open-source forensic libraries allow RATOM to process emails from various sources and quickly generate feature files that are verifiable, reproducible and reusable.
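Once mail is out of PST and into an open format such as mbox or EML, header-level features useful for appraisal can be pulled with nothing but the standard library. A sketch of that idea (this is not RATOM's code, which builds on open-source forensic libraries and does far more):

```python
from email import message_from_string
from email.utils import parseaddr, parsedate_to_datetime

def appraisal_features(raw: str) -> dict:
    """Extract a few header fields from one RFC 5322 message,
    of the kind an appraisal feature file might record."""
    msg = message_from_string(raw)
    return {
        "from": parseaddr(msg.get("From", ""))[1],
        "subject": msg.get("Subject", ""),
        "date": parsedate_to_datetime(msg["Date"]) if msg["Date"] else None,
    }
```

Aggregating fields like these across an account is what makes questions such as "who wrote to whom, and when" answerable before anyone opens a single message.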

A RATOM Jupyter notebook (run through Binder) was used to demonstrate current progress, which I think worked quite well. Everyone was able to load and run the Python scripts from the web interface without having to install any software on their computer, which can often be a time-consuming process in tutorials/workshops. When working through the entity extraction component, which utilises spaCy natural language processing models, participant discussion centred around a law scenario where prosecution, defence and judges may have separate entities. Lee and Woods discussed the trade-off in building a model for a specific situation vs. using a generic one, commenting that manuscript repositories will have different entities within each acquisition, so specialised models are not so useful. Utilising spaCy means you can process data without having to train your own models, but Lee commented that it would be good to receive feedback on how the spaCy models in languages other than English fare.

Tutorial participants discussed their current challenges with email which included not knowing where to start, volume, sensitive information and the EU General Data Protection Regulation (GDPR), and the Capstone approach. Apart from being interested to see how RATOM continues with development into 2020, my key takeaways from this session were to think about what type of entities (eg. people, places, events) would be important for my organisation and to think about the types of metadata that can be made available in collections containing emails for discovery. I will be watching the outcomes of this project very closely.

Tutorial slides available here.

Preservation Action Rules Workshop. Parcore: See one, Do one, Teach one

Preservation Action Registries (PAR) is about describing digital preservation with interoperable registries of good practice and has two main concepts: preservation actions and business rules (context). Tools can be used in different ways to perform different types of digital preservation actions, and there may be multiple tool options for a single activity that produce different results, which is where business rules and context come into play. The main motivation for the workshop was for the creators to determine whether this way of describing preservation actions is useful and worth continuing, and to refine it further.

After an introduction to PAR and a few examples, participants were tasked with working in groups to write down some new examples and present them to the room. The workshop was popular, and I ended up in a rather large group. The group brainstormed possible preservation actions to work with, where three options were considered: 1. File normalisation (eg. proprietary camera raw to Digital Negative); 2. Adding a new format to a registry for a previously unidentifiable format; or 3. Editing METS files to correct an error. After a lengthy discussion, the group settled on option 2 as it was the most dissimilar to the presented examples.

My group focused more on discussion than on filling out the PAR use case template with our chosen scenario, which highlighted some issues that need to be clarified or resolved. These included the use of flowcharts or series of decisions based on conditions and results, ambiguity between what goes into business rules vs preservation actions, the need to describe what you don't do and why so others can learn from it, and validity dates on preservation actions. We also discussed the fact that a 'tool' could be a person or a community, where it is quite common to go to another expert or the community to resolve a problem (such as adding a new file format/signature to PRONOM) when the expertise is not available within your own organisation.
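To make the action/business-rule split concrete, our chosen scenario (requesting a new PRONOM signature for an unidentified format) might be captured in a PAR-style record roughly like the following. The field names are purely illustrative; they are not the actual PAR schema:

```python
# Illustrative PAR-style record for our group's scenario.
# Field names are hypothetical, not the PAR schema.
preservation_action = {
    "action": "request-new-format-signature",
    "tool": "PRONOM community submission",  # a 'tool' can be a community
    "input": "file unidentified by DROID/Siegfried",
    "output": "new PUID and signature published in PRONOM",
}
business_rule = {
    "applies_when": "format identification fails and local expertise is exhausted",
    "valid_until": "review when PRONOM releases an updated signature file",
}
```

Even in this toy form, the split is visible: the action says what is done and with what, while the business rule says when it applies and for how long it remains valid.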

There was a lot of interest in the room from participants for PAR to continue and, as was discussed at the end of the workshop, there is a huge amount of potential for it to enable the sharing of information as a community at a higher rate than at conferences such as iPRES.

Workshop slides available here, collaborative notes here.

Amsterdam canal, 18 September 2019, photograph by Matthew Burgess

Stay tuned for part 2 of my iPRES 2019 reflection on the main conference program.

Cover image: Eye Film Museum, 16 September 2019, photograph by Matthew Burgess
I am super excited and incredibly honoured to have the opportunity to undertake an overseas research travel project in August/September 2019. Supported by the Gordon Darling Foundation through the Darling Travel Grants - Global, the aim of my project is to investigate how cultural institutions abroad are acquiring, preserving and providing access to born-digital collections.

This will be my first time travelling to the United Kingdom and Europe, as well as my first time travelling solo overseas. I will be spending time in London, Edinburgh, Glasgow and Amsterdam, where I will be visiting some amazing organisations (and I am open to more visits - get in touch!) and attending iPRES for the first time as well! I will also be making a quick stopover in Dublin at the end of the trip, which I am also super excited about (more details to come).

I look forward to meeting new people as well as catching up with those I have previously met online and face to face in Australia. I haven't quite decided whether I will actively blog while travelling, but I will definitely be posting to my Twitter and Instagram accounts.

Cover image credit: Map by Free Vector Maps


I have come to appreciate the availability of free graphics with the amount of presentations I have had to put together, particularly over the last six months with teaching as well as presenting as part of my role at the Library. I am a heavy user of the wonderful digital preservation illustrations from Digitalbevaring.dk, which has inspired me to make my own illustrations available under a Creative Commons licence.

I have put together a small pack containing eight illustrations (with some extra colour variations) of digital physical carriers. This includes a 3.5" floppy disk, 5.25" floppy disk, Apple PowerBook 1400c, CD-ROM, Digital Linear Tape, Jaz Disk, USB drive and Zip Disk.

I am providing access to these under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) licence.

You can download a zip file on the following page, which may be updated in the future with more illustrations: https://blog.matthewburgess.net/p/illustrations.html

The International Digital Curation Conference travelled to Australia and the southern hemisphere for the first time this year, hosted by the University of Melbourne from 4 - 7 February 2019. With the theme of collaborations and partnerships in the field of digital curation and preservation, this event highlighted the collaborative nature of the community of practice surrounding this field of work on a global scale. Even as a relatively new professional, I recognised many names and faces ("I follow you on Twitter!") and found the programme engaging. I was fortunate to take part in many aspects of the event, from the pre-conference workshops to the unconference, and along with the excitement of attending the conference itself I was also excited to present at an international conference for the first time.

It all began on Monday morning with Digital Preservation Carpentry - a full-day, pre-conference workshop that aimed to trial a hands-on technical lesson for digital preservation processes using the teaching style of the Carpentries, and to gather feedback from participants to enhance further development of digital preservation lessons. The idea for the workshop was raised at the inaugural Australasia Preserves event in February 2018, where it was highlighted that there was a lack of training and education in this space within the Australasian context. It was a pleasure to be involved with a team of talented professionals in developing this first iteration. My focus was on the BagIt File Packaging Format and the use of Bagger as a tool. Overall, it was great to see a mix of participants (skill level, background, location etc) in an open and welcoming environment, keen to engage with the content. As organisers, we came away with some great feedback to improve on what we developed, as well as interest from the community in developing further lessons. Where to from here will be explored through Australasia Preserves, so make sure you join the Google Group if you would like to know more. Collaborative notes from the workshop are available here: https://tinyurl.com/y8zrk8oo


My lightning talk was part of the afternoon Parallel Session C - Digital curation & preservation on day one of the main conference and was chaired by Paul Wheatley, Head of Research and Practice at the Digital Preservation Coalition (DPC). We were fortunate to have Wheatley also volunteer his time in the Digital Preservation Carpentry workshop where he provided an impromptu thought session on what a manifest is really about in relation to the BagIt File Packaging Format. He highlighted threats to digital objects and noted that we should 'trust nothing, validate everything' and contemplate what minimum information is required for a meaningful, verifiable manifest. This reminded me of Ross Spencer's discourse on file digests, noting that 'understanding how to create a file digest, and what that means, provides a mechanism to ensure that a file transferred from a donor, or from a central government agency, to an archive remains unchanged'.
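'Trust nothing, validate everything' is easy to act on for a BagIt bag: a payload manifest is just lines of `<digest> <path>`, so verification amounts to re-hashing each payload file and comparing. A minimal sketch (a full BagIt validator also checks tag files, completeness and the bag declaration):

```python
import hashlib
from pathlib import Path

def verify_manifest(bag_root: str, manifest: str = "manifest-sha256.txt") -> list:
    """Return the payload paths whose current SHA-256 digest
    does not match the digest recorded in the bag's manifest."""
    failures = []
    for line in (Path(bag_root) / manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((Path(bag_root) / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            failures.append(rel_path)
    return failures
```

This is exactly the 'meaningful, verifiable manifest' idea: anyone holding the bag can repeat the check without trusting the sender.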


The lightning talk session included some fascinating talks, from Sayeed Choudhury's thoughts on the DCC Curation Lifecycle model, to Cal Lee on the use of BitCurator for processing, appraisal and iterative selection of email, to Lachlan Glanville discussing how the Germaine Greer archive drove digital preservation at the University of Melbourne Archives. The session, and day one, concluded with Carolyn Hank's energetic talk 'Dead, Dormant, Zoetic: Modeling the Blog Lifecycle', which made me think of my own blog that has been dormant since early 2018 (until now!). My lightning talk 'Digital preservation at the point of acquisition: Collecting born-digital photographs' aimed to highlight the collaborative process of developing new guidelines and specifications for collecting born-digital photographs and upskilling librarians through a hands-on photography workshop to understand the requirements being asked of donors and vendors. It is available as a blog post via the following link, along with a copy of the specifications and guidelines: http://bit.ly/2EGNMIj


BitCuratorEdu

After hearing Cal Lee speak during the minute madness rapid fire poster presentations, I was keen to hear about BitCuratorEdu - a two-year project to study and advance the adoption of digital forensics tools and methods in libraries and archives through professional education efforts. I will certainly be keeping a close eye on this project as one of the outputs includes the production and dissemination of a publicly accessible set of learning objects to be used in providing hands-on digital forensics education. This is something that is clearly lacking for both students as well as information professionals in the context of galleries, libraries, archives and museums (GLAM).

As someone fortunate to have had the opportunity to undertake computer forensics training utilising Forensic Toolkit (FTK) in 2017, I have since been interested in finding something that is specific to GLAM. The requirements for law enforcement are quite different when it comes to digital forensics, and the power of this software also raises ethical concerns when dealing with collection material.

I managed to catch Lee in between all of his engagements at the conference to discuss the project/poster. In discussing the preliminary finding that instructors desire realistic datasets and mechanisms to connect students to real-world projects, I commented that this is relevant when learning about many aspects of digital preservation and digital asset management. My education included the use of open source digital library software and dummy data to analyse and document requirements and specifications for the design of a digital asset management system. This makes me wonder how we can connect students with GLAM organisations in an effort to provide real-world projects, particularly in Australia. I think the challenge here revolves around the learning outcomes required for the project, and whether the organisation can provide enough autonomy for students to meet them. Lee pointed out Digital Corpora as a useful resource for computer forensics education, which contains freely available disk images and other files.

Scaling Emulation and Software Preservation Infrastructure, the EaaSI network 

Unfortunately I missed the demo by Euan Cochrane at the conference, but I managed to have a discussion with him regarding the EaaSI network over a drink or two and it sounds like an exciting program. Led by the Digital Preservation Services team at Yale University Library, EaaSI aims to enable broader access and use of preserved software and digital objects. Being able to click a link in an online catalogue to open an emulated environment in a web browser, looking at born-digital files within their original software configuration, is a future I would like to see! Unfortunately it sounds like this will not be an easy feat on a global scale, with Cochrane noting that the copyright and legal requirements of different jurisdictions create the need for local instances of the EaaSI network. Hearing about this project reinforced my current thinking, and the current policy in my organisation, to retain original files when acquiring born-digital collections - even after normalising them. It is important to be able to go back to the original, with emulation developments in the future providing alternative access mechanisms.

There was a lot of involvement in IDCC19 by Australasia Preserves, a digital preservation community of practice (CoP) established by the University of Melbourne in February 2018. From the Digital Preservation Carpentry workshop, to multiple sessions during the unconference, it was great to see. Jaye Weatherburn, Data Stewardship Coordinator at UniMelb, also gave a talk on 'Advancing digital preservation capability through collaborative connection' that promoted Australasia Preserves, highlighting its achievements in its first year. I have enjoyed my involvement with this CoP over the past 12 months and it provided me with the opportunity to organise my first event in July last year, which was a learning experience to say the least.

Further general comments and selected highlights from both days of the conference:

  • Christine Keneally's keynote discussed the significance of data curation in democracy, where institutions can destroy and rewrite important truths, with data curators as frontline guardians to the bedrock of society 
  • The importance of metadata was noted in several talks: Joakim Philipson's comment that validation is key to keeping metadata in good shape and adaptable for the future; Lars Vilhuber discussing the lack of consistent, reliable metadata for restricted data (eg, no information on licences, accessibility); and Donna Hensler's lesson learned that existing metadata needs to be in good shape before importing into new systems, where the curation of metadata is a substantial, time-consuming activity 
  • In discussion on collaboration across communities with Nancy McGovern and Clifford Lynch, chaired by Kevin Ashley, the question was raised on whether content creators should be involved. Lynch said that capturing intent of creators is important and has emerged as a key issue in the preservation of digital art. McGovern stated that we have to 'have our ducks in a row' before we start talking to content creators and understand what we are trying to do 
  • Flora Feltham's talk on building an Aotearoa New Zealand-wide digital curation community of practice for sector-wide collaboration, to give people the confidence and expertise to collect and manage born-digital materials 
  • Michelle Negus Cleary and Peter Neish spoke about collaborating across borders with the Anzac Gallipoli Archaeological Database (AGAD), highlighting the challenges in ongoing custodianship and the need for data management plans 
  • Dr Patricia Brennan's keynote, highlighting the importance of digital curation to maximise reuse of data for other studies, mitigate obsolescence, maintain value, facilitate reproducibility and increase pathways of discovery 


For me, the unconference was heavily geared towards Australasia Preserves where we had morning and afternoon sessions that looked at the past 12 months, discussing what worked well, what did not and how we can make it work going forward. This included discussions on how we could connect more with the private sector, how the Digital Preservation Coalition can help, how the National and State Libraries Australia (NSLA) digital preservation CoP operates and how we can ensure sustainability for another year. The day finished with some outcomes and next steps, including the development of a briefing pack to enable people to advocate to their management for involvement in the community as part of their professional development. Well done to Jaye for putting together such a great document. The collaborative notes from the unconference can be found here: https://bit.ly/2SBcxgw

The unconference also saw an impromptu, brief introduction-to-BitCurator workshop with Cal Lee, where he helped participants install the environment using VirtualBox. Unfortunately for me, my laptop did not have a suitable processor so I could not get it running on the day, but it did give me a very brief overview that spurred me to install and look at it once I returned to work and had access to a suitable computer. Lee highlighted that when working with disk images, you should create them in the virtual environment and then determine whether any further actions could be undertaken in the host environment, where you will have more processing power. He provided a very quick overview of Bulk Extractor Viewer, a graphical user interface that can be used to scan for personally identifiable information (PII). It was great to be able to attend this short session as I did not have the opportunity to stay in Melbourne for Lee's workshop the following morning.
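The feature scanning behind Bulk Extractor can be pictured as pattern matching over raw bytes, recording each hit with its offset into the image. As a toy stand-in only (Bulk Extractor itself is far more sophisticated, handling compressed, fragmented and deleted data), an email-address scan might look like:

```python
import re

# Toy feature scanner: find email addresses in a buffer of raw bytes.
# A rough stand-in for one of Bulk Extractor's scanners, not its implementation.
EMAIL_RE = re.compile(rb"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scan_for_emails(data: bytes) -> list:
    """Return (offset, match) pairs, as a feature file would record them."""
    return [(m.start(), m.group().decode("ascii", "replace"))
            for m in EMAIL_RE.finditer(data)]
```

Because the scan runs over raw bytes rather than the file system, it can surface PII in unallocated space that a file-by-file review would miss, which is exactly why tools like this matter for appraisal.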


As a first time attendee and presenter at a conference in this field of work, I found IDCC19 to be a welcoming and invigorating experience with great diversity in attendees and the programme. It provided the opportunity to meet professionals from across the globe as well as the Australasian region. While I was completely exhausted after a whirlwind four days, I returned home with a renewed passion for the work I do as well as practical plans for further research and actions based on presentations and impromptu conversations during networking events.

You can find out more details about the conference, with links to collaborative notes and slides, at the following location: http://www.dcc.ac.uk/events/idcc19

I have also created an #IDCC19 TAGS archive of tweets that you can access via the following link: http://bit.ly/2Eqm94K

Related links and useful resources:

  1. Australasia Preserves Briefing Pack
    https://blogs.unimelb.edu.au/digital-preservation-project/2019/02/27/australasia-preserves-briefing-pack-2019
  2. Australasia Preserves Google Group
    https://groups.google.com/forum/#!forum/australasia-preserves
  3. Australasia Preserves at IDCC 2019 blog post
    https://blogs.unimelb.edu.au/digital-preservation-project/2019/02/14/australasia-preserves-at-idcc-2019
  4. Bagger
    https://github.com/LibraryOfCongress/bagger
  5. BagIt File Packaging Format
    https://tools.ietf.org/html/rfc8493
  6. BitCuratorEDU project website
    https://educopia.org/bitcurator-edu
  7. Digital Corpora for computer forensics education research
    https://digitalcorpora.org
  8. Digital Preservation Carpentry workshop
    http://www.dcc.ac.uk/events/workshops/digital-preservation-carpentry
  9. Digital Preservation Coalition website
    https://www.dpconline.org
  10. Scaling Emulation and Software Preservation Infrastructure (EaaSI) website
    https://www.softwarepreservationnetwork.org/eaasi

Cover image: My view from the plane window on the way to Melbourne from Sydney, 3 February 2019.

Standards and best practice are important when dealing with digital assets in a cultural collecting institution, as they are when dealing with any information. They facilitate the access, discovery and sharing of digital resources, as well as their long-term preservation [1]. This is where preferred file formats, procedures for offloading digital assets from physical carriers, documenting preconditioning actions and many other activities come into play.

But you cannot always control what you receive when it comes to digital collections. Standards are there for guidance, and sometimes decisions need to be made on whether to allow something into the collection that does not meet them. The intrinsic value of an object, its uniqueness and rarity, may very well trump the technical requirements for digital collecting. With born-digital photographs, for example, where some institutions prefer a Camera Raw or uncompressed TIFF file format, a low-resolution JPEG might still be accepted under the right circumstances.

The digital collecting workflow has changed significantly in the last 12 months in my workplace with the introduction of new standards and tools such as BagIt [2] and Bagger [3], as well as beginning to ingest the significant backlog of both digitised and born-digital collections into our digital preservation system. We have strict control over the process for new acquisitions, but our legacy collections are another story.

While checksums have been generated for acquisitions for a number of years, there are legacy collections that do not contain as much metadata as we generate and use today, including checksums, virus scans and information about the physical carriers they were received on. So now that we have these new procedures, guidelines and workflows in place, what do we do with these legacy collections? Should we go back to the creator and ask them to submit the files again? Should we try to locate the physical carrier a collection was received on, from before we had a policy in place to manage and store carriers? While this may be possible in some cases, there is a point where you need to draw the line and accept things as they are.

Authenticity is an attribute that is highly valued in digital preservation, and appropriate steps need to be taken to ensure that it is not compromised during the process of managing digital assets [4]. It is important to establish authenticity (including fixity) as early as possible. Drawing the line with legacy material means accepting it as it is, generating checksums now and bringing it up to our current standards for our ingestion processes, to make it accessible now and into the future.
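Generating checksums for legacy material can be done with standard command-line tools; here is a minimal sketch of building and verifying a fixity manifest, assuming the collection sits in a local directory (the directory and file names are hypothetical):

```shell
# Build a SHA-256 manifest covering every file in a legacy collection
# (the directory name is hypothetical)
find legacy-collection -type f -exec sha256sum {} + > legacy-collection.sha256

# Later, verify fixity against the stored manifest; any altered or
# missing file is reported as FAILED
sha256sum -c legacy-collection.sha256
```

Keeping the manifest outside the collection directory means the verification step itself does not alter the material it is checking.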

Custodial control of digital assets can only be maintained within the context of organisational and system policies, procedures and guidelines, and by following best practice. These will change over time, and it is important to understand that you may have to let go of strict control requirements under some circumstances and do the best with what you have at the time.


This post is my contribution to the GLAM Blog Club April theme: 'Control'.


1. Standards and best practice, Digital Preservation Handbook. Digital Preservation Coalition.
https://www.dpconline.org/handbook/institutional-strategies/standards-and-best-practice

2. Kunze, J., Littman, J., Madden, L., Summers, E., Boyko, A., Vargas, B., 2016. The BagIt File Packaging Format (V0.97).
https://tools.ietf.org/html/draft-kunze-bagit-14

3. Bagger, Library of Congress.
https://github.com/LibraryOfCongress/bagger

4. Harvey, R., Weatherburn, J., 2018. Requirements for Successful Digital Preservation, in: Preserving Digital Materials. Rowman & Littlefield, Lanham, MD, United States, p. 86.



Cover image credit: Illustration copyright of digitalbevaring.dk and shared under a CC BY 2.5 Denmark licence (illustrations) https://creativecommons.org/licenses/by/2.5/dk/deed.en_GB , and a CC0 1.0 licence (icons) https://creativecommons.org/about/cc0.

Last year was a big year for me in terms of professional development (PD) and I am fortunate to be in a role that invariably includes learning (What file format is that? How do I get something off this physical carrier? How does this thing connect to that thing?). As 2018 begins, I feel like this year will be much the same. One of my goals this year is to actively take the time to learn more. It has been very easy to get swept up in everyday work over the last six months, so this year I plan on setting aside some dedicated time for learning.

I joined the ALIA PD Scheme this time last year and, as the scheme runs on the financial year, successfully completed my first year of compliance at the end of June. I admittedly have not taken the time to reflect on the last six months, so my first goal is to catch up on tracking my PD and plan the next five months to complete my second year of the scheme.

With that in mind, here are some things I want to learn in 2018:

  • Python: this has been at the top of my "to do" list for a while, and I am increasingly finding myself in situations where I believe it could be quite useful. I have no interest in becoming a software developer, but the language has become quite popular with working professionals who are using programming skills to get better at their jobs. I plan on starting with Automate The Boring Stuff With Python and going from there. 

  • Public speaking: I have mentioned previously that this has not always been a strong point for me. I had several opportunities to present last year, including presentations to students and industry, as well as developing and running a workshop for the first time. I am also looking forward to taking part in Australasia Preserves in Melbourne next month. This is not so much something I want to learn as an area where I would like more practice and experience.

  • Writing: GLAM Blog Club has given me a great reason to write something at least once a month for the past 12 months. While I plan on continuing that, I also want to start exploring other avenues. I had my first attempt at submitting an abstract to a conference last year. While unsuccessful, my plan is to build on that and keep trying.


This post is my contribution to the GLAM Blog Club January theme: 'What I want to learn in the year ahead'.

I have been taking photos with digital cameras since 2002. I still have a lot of those photos, but there have been two distinct occasions when a computer or hard drive failure resulted in the loss or corruption of some images. I have previously discussed my attitude to naming files over the years. Thankfully my education in both photography and information management improved that quite significantly, and I recently finished going through my archive, giving everything a meaningful file name and organising it into a logical structure. But the questions remain: how can I mitigate the risk of losing my digital files? Can I do anything to salvage the corrupt photos?

My first instance of a hard drive failure occurred in 2006. I was backing everything up to an external hard drive and occasionally to CD. It had been months since I had backed up to CD so I lost some photos. I wrote a personal blog entry at the time, stating "this enforces the fact that digital photography is not safe and I should take more precautions in the future". In 2007 my PC hard drive failed right after I purchased my first iMac. I was due to have some event photography published in a local magazine but missed the deadline due to the failure. I wrote at the time, "I no longer trust technology. It hates me. It's the second time in two years a hdd has died and i've lost my work."

Since then, my backup system has not been much of a system. Up until recently I had my archive across multiple hard drives, with at least two copies of everything. I was also using DVDs up until mid-2009. At the end of 2016, I decided it was time to bring my archive onto a single backup system so I had everything in one, accessible location. This led to the purchase of a two bay RAID enclosure, which I set up in a RAID 1 drive mirroring configuration using two identical 3TB hard drives. I was still using an iMac at the time as my main computer so I decided to set this up in Apple's HFS+ format.

I recently decided to build myself a computer for the first time, which presented its own challenges (I have nightmares about thermal paste). Switching from OS X to Windows 10 is problematic with an HFS+-formatted external hard drive. There are programs that can both read and write Apple-formatted hard drives on Windows computers (such as Paragon HFS+ or MacDrive), but that is not ideal. I made the decision to copy the archive onto a dedicated hard drive in my new computer before erasing the external hard drive and reformatting it for Windows, which provided the perfect opportunity to assess and organise my archive. I knew there were corrupted images in my archive but had not tried to do anything about them until now.

The photo at the beginning of this post is an example of one of the many corrupted Canon CR2 camera raw files I have in my archive from the time of the second hard drive failure. These images were recovered by a family member at the time, but obviously not everything was a success. All of my backup copies appear to have the same corrupted files. I am thankful that I had started shooting in a camera raw file format by then, as that gives me some options for recovery: most cameras embed a JPEG preview image in a raw file. Depending on the camera model and manufacturer, the embedded JPEG may be full resolution or smaller.

So how do you extract a JPEG from a camera raw file? With a tool I have been using a lot, both personally and professionally: ExifTool by Phil Harvey. ExifTool has a lot of great uses, particularly when it comes to digital photographs and metadata. Using the command line tool, it is possible to extract the JPEG image from the raw file and then copy all the metadata across from the original file (Harvey provides an example under "copying examples"). Unfortunately for me, the camera model I was using at the time embedded smaller-resolution previews, so I will never have the full-resolution images again, but a low-resolution JPEG is better than nothing!
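As a sketch of that two-step process (the file name here is hypothetical; the tags and options follow ExifTool's documentation):

```shell
# Step 1: extract the embedded JPEG preview from a CR2 raw file.
# -b writes the tag as binary data; -w _preview.jpg writes it to
# IMG_0001_preview.jpg alongside the original.
exiftool -b -PreviewImage -w _preview.jpg IMG_0001.CR2

# Step 2: copy the metadata from the original raw file into the
# extracted JPEG (per ExifTool's "copying examples")
exiftool -tagsfromfile IMG_0001.CR2 -all:all IMG_0001_preview.jpg
```

ExifTool keeps a backup of the JPEG as IMG_0001_preview.jpg_original when it rewrites the metadata, which can be suppressed with -overwrite_original if desired.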

Preview JPEG extracted from corrupted Canon CR2 camera raw file.
Once I have finished extracting all of the JPEGs from my corrupted CR2 files, my next step will be to reformat my external RAID hard drive to NTFS for Windows and copy everything back across from my computer. Before I do that, I will first make a backup copy of the files currently on my new computer to a regular portable external hard drive (encrypted with BitLocker). This will become my offsite copy, stored at a relative's house.

Ultimately, I will have three separate devices with a copy of all of my digital files. Technically this will be four copies with the RAID 1 setup. To combat the issue of file corruption, I plan to use the BagIt standard to store checksums, which can be validated on a regular basis. It is not ideal for working files, though, as any change to a file will change its checksum.
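As a sketch of how that might look with the Library of Congress bagit-python tool (installed with `pip install bagit`; the directory name is hypothetical):

```shell
# Turn the archive directory into a bag in place, generating
# sha256 manifests for every file it contains
bagit.py --sha256 photo-archive/

# Re-run periodically to verify fixity against the stored manifests
bagit.py --validate photo-archive/
```

The validate step recomputes each file's checksum and compares it to the manifest, so silent corruption shows up as a validation failure rather than going unnoticed.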

It is hard to follow digital preservation standards at a personal level because it is not cheap. Good practice includes keeping multiple independent copies that are geographically separated, using different storage technologies, and actively monitoring storage so that any problems are detected and corrected quickly (Digital Preservation Handbook). I did not even bother looking at the costs involved in cloud storage for my almost 3TB digital archive, and I am not sure there are any consumer equivalents to the digital asset and preservation management systems used by collecting institutions. My experience has taught me a lesson, though: it is important to do the best you can within your means to back up, monitor and organise your digital life. Digital preservation is an ongoing activity. I have previously made the mistake of just putting things on a hard drive/CD/DVD and thinking it was safe. I will not make that same mistake again.