DPU@SCARC: Current Digital Preservation

Although we reserve the term preservation-level digitization for cases where digitization is itself a preservation strategy for the object(s), due to active degradation or to issues related to obsolescence, the workflows for all of the digitization that we do are generally the same. That means we use well-maintained and calibrated equipment. Our scanners are calibrated via SilverFast with IT8 color targets. Our secondary displays are IPS LEDs, which offer consistent and wide-angle color accuracy. The displays are calibrated with X-Rite's ColorMunki calibration and profiling system. We then use ColorSync to share those calibration profiles with our scanning software, thereby ensuring color accuracy. We use full-spectrum lighting to soft-proof our reflective materials scanning. With all of our digitization, we try to adhere to the FADGI Technical Guidelines for Digitizing Cultural Heritage Materials.

Current Digital Preservation

PREMIS view of Photoshop edit history

DPU’s digital preservation work actually begins before any object has been scanned. According to PREMIS, change history for an object should be recorded as Event information. Unfortunately, this leaves out the entire life of the object during the quality control and file movement stages, which is the period when accidental and intentional changes are most likely to occur. Aside from the typical cropping and de-skewing, a substantial edit history raises a red flag on the object in terms of digital preservation and needs to be documented. DPU saves all editing actions to an embedded history file to help ensure the integrity of the files, and has been successful in pulling this embedded edit history out of the files with Archivematica and other PREMIS-aware digital preservation tools.
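For a rough sense of what that embedded history looks like, Photoshop writes its history log into the file's XMP metadata, where command-line tools can read it back. A minimal sketch, assuming exiftool (not a tool named above; Archivematica and similar PREMIS-aware tools handle this in production) and a hypothetical file name:

```bash
# Sketch: dump the xmpMM group of a TIFF's XMP metadata, which holds
# the Photoshop History events (action, software agent, timestamp).
# exiftool and the file name are assumptions for illustration.
exiftool -XMP-xmpMM:all scan_0001.tif
```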

Technician/Collection xml templates in Adobe Bridge

One goal of digital preservation is to save as much scanning session information as possible. This, however, has to be done in a way that does not impede digital production workflows. For still image materials, an XML template was written that embeds capture-related info into the image file. This information can be extracted with a variety of tools. The information being written to the file includes the following:

ImageProducer (student technician)

HostComputer (model identifier)

SensingMethod (scanner sensor type)

WorkType (AAT Source)

Mimetype (file type)

With the volume of materials that DPU works with, it did not make sense to apply the templates file by file. Applying the template to a directory of files through a batch process allowed us to maintain our digital production throughput without getting bogged down in item-level preservation metadata.
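As an illustration of that batch step, here is a minimal command-line sketch, assuming exiftool and hypothetical file and directory names; in practice the templates are applied through Adobe Bridge:

```bash
# Sketch: copy all XMP tags from a capture template onto every TIFF
# in a directory. Template and directory names are hypothetical.
exiftool -tagsFromFile capture_template.xmp -xmp:all \
  -ext tif -overwrite_original ./new_scans/
```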

Temporary folder-level fixity

Files are moved frequently throughout the quality control process: initially created on a student workstation, they are moved onto a temporary storage server and then onto the supervisor’s workstation for review and processing. Moving files is precisely when the chance of corruption and read/write errors is highest. Files are run through a checksum tool to create temporary folder-level fixity lists that can be easily verified throughout this process.
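A minimal sketch of that folder-level step, assuming the stock macOS shasum utility and hypothetical paths:

```bash
# Create a temporary folder-level fixity list before the move...
cd /Volumes/temp_storage/collection_batch   # hypothetical folder
shasum -a 256 *.tif > manifest.sha256

# ...then re-run from the folder's new location to verify that
# every file still matches its recorded hash.
shasum -c manifest.sha256
```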

DPF Manager

DPF Manager is an open source conformance checker for TIFF files designed to help archivists and digital content producers ensure that files are fit for long-term preservation. DPF Manager can parse file directories and determine that files are valid and that digitization specifications have been followed. Verifying this information in a batch process eliminates the extra steps of manually checking each file in Photoshop.
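A sketch of such a batch run; the `check` subcommand and per-file loop are assumptions about the DPF Manager command-line interface, which varies by version, so consult the help output for your install:

```bash
# Sketch: run the DPF Manager conformance checker across a directory.
# The CLI syntax here is an assumption; verify against your version.
for f in ./new_scans/*.tif; do
  dpf-manager check "$f"
done
```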

File name verification

After the files themselves are verified, the filenames are checked to ensure that numbering is consistent and that leading zeros have been used for accurate sorting.
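A quick sketch of the kind of check involved, using a hypothetical naming pattern (a collection prefix plus a four-digit, zero-padded sequence):

```bash
# Sketch: flag filenames that don't match an expected zero-padded
# pattern. The pattern is hypothetical; adjust to local conventions.
for f in ./new_scans/*.tif; do
  [[ "$(basename "$f")" =~ ^[a-z0-9]+_[0-9]{4}\.tif$ ]] \
    || echo "Inconsistent name: $f"
done
```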

It is at this point that a quick visual review of the files is done to ensure that cropping and orientation are consistent and that the appropriate color settings have been used.

BagIt is a widely used packaging format developed as part of the National Digital Information Infrastructure and Preservation Program to accurately transfer digital content across networks and file systems. The BagIt specification is organized around the notion of a "bag": a named file system directory that contains a "data" directory holding the digital content being preserved, along with a fixity manifest file. BagIt tools can examine the manifest file to make sure that the files are present and that their checksums are correct, allowing accidentally removed or corrupted files to be identified. DPU uses verified bags (BagIt) for final AIP transfers to Parthenon.

Rather than using the decidedly clunky Java version of BagIt, DPU has wrapped key BagIt Python commands inside shell scripts that can be executed via macOS right-clicks. This way, many files and folders can be bagged and verified with a simple right-click. DPU has shared those scripts on GitHub.
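The actual scripts are on DPU's GitHub; here is a minimal sketch of the idea, assuming bagit-python's command-line tool and a macOS Service that passes the selected folders in as arguments:

```bash
#!/bin/bash
# Sketch: bag and verify each folder passed in by the right-click
# Service ("$@" holds the selected folder paths).
for dir in "$@"; do
  bagit.py --sha256 "$dir"      # turn the folder into a bag in place
  bagit.py --validate "$dir"    # confirm payload files match the manifest
done
```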

Moving Image Materials

vrecord

With most moving image materials, a substantial level of degradation is actively occurring in the object, and as the source further degrades over time, we will be relying upon the newly created digital surrogate as the representation of the artifact. All magnetic media materials require preservation-level digitization. At this stage, it is digitization as a preservation strategy for the materials.

The digitization process for moving image materials is complex compared to traditional scanning of visual materials. There are a number of playback, correction, and signal processing components that need to be included as part of the digital production metadata. DPU keeps a spreadsheet that contains information on the physical tape, the transfer process, and the resulting digital file. Things like the condition of the source tape, the specific tape stock, the generation of the source (original or copy), and a risk-assessment ranking are documented.

AV digitization spreadsheet

As with most digital objects, the generation and verification of checksums can aid the confirmation (or denial) of digital authenticity over time. A mismatch is an alert that a file has changed from a prior state, potentially triggering retrieval of backups, review of hardware, or migration of content. It is for this reason that we run a file-level checksum on each video file. We chose SHA256 checksums over MD5 for the longer 256-bit output hash value. Because this is digitization as preservation, the high level of detail and added assurance of SHA256 is certainly warranted. To speed up the process, we have installed a shell script with the checksum terminal command as a macOS Service, so we can right-click on a file to run the checksum rather than keying in the terminal commands manually.
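A minimal sketch of such a Service script, assuming it is installed through Automator's "Run Shell Script" action with input passed as arguments; the output file name is hypothetical:

```bash
#!/bin/bash
# Sketch: SHA256 each selected file and append the hash to a fixity
# list stored alongside it. File-handling details are assumptions.
for f in "$@"; do
  shasum -a 256 "$f" >> "$(dirname "$f")/checksums.sha256"
done
```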

macOS Services Menu

Much of PREMIS relates to the individual bitstreams contained within a file. With video files, there are individual tracks for the video stream, one or more audio streams, and a timecode stream, all of which have frame-level values that can be parsed. Since these individual essence tracks can each incur some type of error separate from the larger file, I run smaller frame-level checksums for each video through FFmpeg, an open-source encoding library. Framemd5 creates MD5 hash values for each audio and video stream in each frame. With NTSC video there are 29.97 frames per second: lots of places for errors and flipped bits to hide. By producing checksums on a more granular level, it is more feasible to assess the extent or pinpoint the location of a digital change in the event of a checksum mismatch. Framecrc creates a transmission checksum (CRC) for each frame, which allows me to verify that each frame of video is received correctly.
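A sketch of the underlying FFmpeg commands, with hypothetical file names (`-map 0` pulls in every stream, so each essence track gets its own hashes):

```bash
# Per-frame MD5 hashes for every audio and video stream.
ffmpeg -i tape_transfer.mov -map 0 -f framemd5 tape_transfer.framemd5

# Per-frame CRC values for transmission verification.
ffmpeg -i tape_transfer.mov -map 0 -f framecrc tape_transfer.framecrc
```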

frameMD5 AppleScript

MediaInfo supplies technical and tag information about a video or audio file. As with FFmpeg and the checksums, I have this process installed as a macOS Service. I also use a tool from AVPreserve called Fixity to monitor locally stored video files. Fixity scans a directory, creating a manifest of the files, including their file paths and their checksums. A checksum validation can be run on a predefined schedule; I have set this up to run every Friday evening.
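A sketch of the MediaInfo step, with a hypothetical file name; XML output keeps the technical metadata machine-readable for the preservation record:

```bash
# Export MediaInfo's technical and tag metadata as XML.
mediainfo --Output=XML tape_transfer.mov > tape_transfer_mediainfo.xml
```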

Fixity

DPU also uses a local ZFS filesystem for temporary production-level storage that provides routine block-level fixity checks and self-healing for damaged files. DPU uses this exclusively for video files.
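As a sketch of what that looks like on the Ubuntu side, assuming a hypothetical mirrored pool name and devices; ZFS checksums every block on write, and a scheduled scrub re-verifies and repairs blocks from the mirror copy:

```bash
# Sketch: mirrored pool (name and devices hypothetical), plus the
# routine scrub that performs the block-level fixity check.
sudo zpool create avpool mirror /dev/sdb /dev/sdc
sudo zpool scrub avpool       # verify every block; self-heal from the mirror
sudo zpool status -v avpool   # report any checksum errors found or repaired
```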

ZFS Filesystem (Ubuntu)