Reflections on iPRES 2019: Part 1 - Tutorials and workshops

I was fortunate to have the opportunity to travel to Amsterdam and attend the 16th International Conference on Digital Preservation (iPRES 2019) as part of my research travel project funded by the Gordon Darling Foundation. Hosted by the Dutch Digital Heritage Network at the Eye Film Museum, this was a great opportunity to both attend iPRES and visit Amsterdam for the first time.

Coming off the back of three weeks in the United Kingdom discussing all things digital preservation, I was over my jet lag but will admit I was exhausted before the week even started. I had a burst of energy from day one, though, being surrounded by so many brilliant minds in the digital preservation community and with so many interesting sessions to attend and people to talk to.

Day One: Tutorials and workshops

There were many interesting tutorials and workshops to choose from, ranging from preserving complex digital objects to understanding and implementing PREMIS. I based my choices on areas of interest relating to my current role, including email and preservation actions.

Review, Appraisal, and Triage of Mail (RATOM)

In this tutorial, presented by Cal Lee and Kam Woods, participants were introduced to the RATOM project and examined development efforts so far. Selection and appraisal are a key motivation for the project, and it was highlighted that, with 48 years of history, email serves as a massive source of evidence and information. RATOM aims to assist in determining what records need to be kept and what records might have sensitivities. The goal is not perfect identification but useful information for appraisal - the stage before deciding which emails to keep, which to redact, which to tag for further review, and so on.

I was very interested to hear discussion on the aim to provide a way to migrate from the PST format, which is proprietary and complicated. It was noted that there are many different (often commercial) tools available to convert PST to other formats, but they are a 'black box' and their outputs may be different. You only need to look in the Digital Curation Google Group and see the almost daily spam posts to see how difficult it can be to find a suitable tool for this task. Open source forensic libraries allow RATOM to process emails from various sources and quickly generate feature files that are verifiable, reproducible and reusable.

A RATOM Jupyter notebook (run through Binder) was used to demonstrate current progress, which I think worked quite well. Everyone was able to load and run the Python scripts from the web interface without having to install any software on their computer, which can often be a time-consuming process in tutorials and workshops. When working through the entity extraction component, which utilises spaCy natural language processing models, participant discussion centred on a legal scenario where prosecution, defence and judges may have separate entities. Lee and Woods discussed the trade-off between building a model for a specific situation and using a generic one, commenting that manuscript repositories will have different entities within each acquisition, so specialised models are not so useful. Utilising spaCy means you can process data without having to train your own models, but Lee commented that it would be good to receive feedback on how the spaCy models for languages other than English fare.

Tutorial participants discussed their current challenges with email, which included not knowing where to start, volume, sensitive information and the EU General Data Protection Regulation (GDPR), and the Capstone approach. Apart from being interested to see how RATOM development continues into 2020, my key takeaways from this session were to think about what types of entities (e.g. people, places, events) would be important for my organisation and what types of metadata can be made available for discovery in collections containing emails. I will be watching the outcomes of this project very closely.
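To make the metadata takeaway a little more concrete, here is a minimal sketch of pulling basic discovery metadata out of a single email using only Python's standard library. This is not RATOM's code or approach; the function name `discovery_metadata`, the chosen fields and the sample message are all my own assumptions for illustration.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical sample message, invented for this illustration.
SAMPLE = """\
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.org>, carol@example.org
Date: Mon, 16 Sep 2019 09:30:00 +0200
Subject: Workshop notes

See you in Amsterdam.
"""


def discovery_metadata(raw):
    """Return a small dict of header metadata that could support discovery.

    The choice of fields is an assumption, not a recommended standard.
    """
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    date = msg.get("Date")
    return {
        "from": [addr for _, addr in senders],
        "to": [addr for _, addr in recipients],
        # Normalise the RFC 5322 date header to ISO 8601, if present.
        "date": parsedate_to_datetime(date).isoformat() if date else None,
        "subject": msg.get("Subject", ""),
    }


if __name__ == "__main__":
    print(discovery_metadata(SAMPLE))
```

Header fields like these are only a starting point; entities found in message bodies would need a natural language processing step on top.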

Tutorial slides available here.

Preservation Action Rules Workshop. Parcore: See one, Do one, Teach one

Preservation Action Registries (PAR) is about describing digital preservation with interoperable registries of good practice and has two main concepts: preservation actions and business rules (context). Tools can be used in different ways to perform different types of digital preservation actions, and there may be multiple tool options for a single activity that produce different results, which is where business rules and context come into play. The main motivation for the workshop was for the creators to determine whether this way of describing preservation actions is useful and worth continuing, and to gather input for further refinement.
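One way to picture the split between the two concepts is the rough sketch below. It is not the PAR data model: the class names, fields, tool names and rule are all hypothetical and only illustrate how a business rule might pick between two actions that perform the same activity with different tools.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PreservationAction:
    """What is done, and with which tool (hypothetical fields)."""
    name: str
    tool: str
    input_format: str
    output_format: str


@dataclass(frozen=True)
class BusinessRule:
    """The local context: which action an organisation applies, and why."""
    description: str
    applies_to_format: str
    action: PreservationAction


def select_action(file_format: str, rules: List[BusinessRule]) -> Optional[PreservationAction]:
    """Return the action of the first rule matching the format, else None."""
    for rule in rules:
        if rule.applies_to_format == file_format:
            return rule.action
    return None


# Two hypothetical actions for the same activity, using different tools.
normalise_a = PreservationAction("normalise", "tool-a", "camera-raw", "dng")
normalise_b = PreservationAction("normalise", "tool-b", "camera-raw", "dng")

# A hypothetical organisational rule that prefers one of them.
rules = [BusinessRule("Prefer tool-b for camera raw files", "camera-raw", normalise_b)]

if __name__ == "__main__":
    print(select_action("camera-raw", rules))
```

The actions could be shared between organisations, while each organisation would keep its own rules recording which one it chose and the reason.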

After an introduction to PAR and a few examples, participants were tasked with working in groups to write down some new examples and present them to the room. The workshop was popular, and I ended up in a rather large group. The group brainstormed possible preservation actions to work with and considered three options: 1. File normalisation (e.g. proprietary camera raw to Digital Negative), 2. Adding a new format to a registry for a previously unidentifiable format, or 3. Editing METS files to correct an error. After a lengthy discussion, the group settled on option 2 as it was dissimilar from the presented examples.

My group focused on discussion rather than on filling out the PAR use case template with our chosen scenario, and this highlighted some issues that need to be clarified or resolved. These included the use of flowcharts or series of decisions based on conditions and results, ambiguity between what goes into business rules vs preservation actions, the need to describe what you don't do and why so that others can learn from it, and validity dates on preservation actions. We also discussed the fact that a 'tool' could be a person or a community, as it is quite common to go to another expert or the community to resolve a problem (such as adding a new file format/signature to PRONOM) when the expertise is not available within your own organisation.

There was a lot of interest in the room from participants for PAR to continue and, as was discussed at the end of the workshop, there is a huge amount of potential for it to enable the sharing of information as a community at a higher rate than at conferences such as iPRES.

Workshop slides available here, collaborative notes here.

Amsterdam canal, 18 September 2019, photograph by Matthew Burgess

Stay tuned for part 2 of my iPRES 2019 reflection on the main conference program.

Cover image: Eye Film Museum, 16 September 2019, photograph by Matthew Burgess