The Future Is Now

PSA: Homebrew-digipres Repository Now Available!

Outside of archivy, I’m also a collaborator on Homebrew, the awesome, lightweight package manager for OS X. I’ve been building a private repository of niche packages which aren’t available in the core repository for one reason or another, and ended up collecting enough digital preservation tools to create a new digital preservation-focused repository. You can find the new homebrew-digipres here: https://github.com/mistydemeo/homebrew-digipres I’d welcome any contributions if you want to improve an existing formula, submit updates, or add a new package! Fork away.

File ID Hackathon Debrief: FITS Handles Video Now!

I took part in the 24-hour file ID hackathon on November 16th. It was a fantastic event, and between us, the 15-ish participants got a lot of practical work done. You can read more about it and what was accomplished at the CURATEcamp wiki.

I spent most of my time working with video content and with FITS, the File Information Tool Set. FITS is a useful tool, but it’s traditionally had some problems that have held it back from being as effective as it could be for digital preservation. Aside from its performance, which is an issue that still needs to be addressed, its support for audio-visual material has been pretty poor. I addressed a couple of the more serious items:

Its embedded Exiftool was badly out of date

FITS bundles its own versions of the various tools it uses, rather than using the versions installed elsewhere on the machine. In theory this is a good idea; incompatibilities in the tools it uses could subtly break its output. In practice, however, it means that FITS has missed out on a lot of format identification improvements its tools have made. Before the hackathon FITS included exiftool 7.74, which was released in April 2009. Back then exiftool had only rudimentary video support, but it’s made enormous strides in the past several years and now has very robust video metadata extraction. The first thing I did in FITS was update the embedded exiftool to the current release. That alone has made a big difference in format detection.

In the future I think it would be best to rethink the policy of embedding tools rather than using external copies, or at least provide the option to use another version the user has installed. exiftool is updated once every week or two and changes rapidly. I doubt FITS will be updated that frequently. A better option might be to recommend specific known-good versions of tools, but allow the user the option of running whichever tool version they prefer.
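As a rough illustration of that "recommend a known-good version, but let the user override it" idea, here is a minimal Python sketch. It is not how FITS works today; the `KNOWN_GOOD` table, the `choose_version` function and its `prefer_installed` switch are all hypothetical names made up for this example.

```python
from typing import Optional

# Hypothetical table of the tool versions a given FITS release was tested with.
KNOWN_GOOD = {"exiftool": "9.05"}


def parse_version(text: str) -> tuple:
    """Turn a dotted version string such as '9.05' into a comparable tuple."""
    return tuple(int(part) for part in text.strip().split("."))


def choose_version(tool: str, installed: Optional[str], prefer_installed: bool) -> tuple:
    """Decide which copy of a tool to run.

    Returns (source, version, warning). The bundled known-good copy is used
    unless the user has opted in to their own installed copy; in that case a
    warning is produced when the installed version differs from the tested one.
    """
    bundled = KNOWN_GOOD[tool]
    if installed is None or not prefer_installed:
        return ("bundled", bundled, None)
    warning = None
    if parse_version(installed) != parse_version(bundled):
        warning = f"{tool} {installed} is untested; known-good version is {bundled}"
    return ("installed", installed, warning)
```

In a real setup the `installed` value could come from running `exiftool -ver`, which prints the version number of the copy on the user's path.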

Its metadata mapping for video formats was primitive

FITS uses XSLT to map metadata fields from their native tag names to its own vocabulary, but the list of tags used for video was very short compared to other formats. As a result, a lot of potentially useful information from exiftool’s output was being discarded. Based on videos in my collection which had extensive embedded metadata, I beefed up FITS’s mapping table to enable it to grab many more common tags.
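To make the effect of a short mapping table concrete, here is a small Python sketch of the same idea. FITS itself does this in XSLT, so this is only an analogy: the left-hand names are my assumption of the ExifTool tag names involved, the right-hand names are FITS element names as they appear in its video output, and `map_tags` is a hypothetical helper.

```python
# Assumed ExifTool tag names (left) mapped to FITS element names (right).
VIDEO_TAG_MAP = {
    "Make": "digitalCameraManufacturer",
    "Model": "digitalCameraModelName",
    "Duration": "duration",
    "ImageWidth": "imageWidth",
    "ImageHeight": "imageHeight",
    "VideoStreamType": "videoStreamType",
    "FNumber": "fNumber",
    "WhiteBalance": "whiteBalance",
}


def map_tags(raw: dict, tag_map: dict = VIDEO_TAG_MAP) -> tuple:
    """Translate native tag names into the FITS vocabulary.

    Returns (mapped, discarded): any tag missing from the table is dropped,
    which is exactly how a short table ends up throwing information away.
    """
    mapped, discarded = {}, []
    for name, value in raw.items():
        target = tag_map.get(name)
        if target is None:
            discarded.append(name)
        else:
            mapped[target] = value
    return mapped, discarded
```

Every entry added to the table moves one more tag from the discarded list into the output, which is essentially all that beefing up the mapping amounts to.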

While this made a good short-term solution, it made me think a bit more about how FITS approaches mapping fields. In particular:

  1. FITS has separate mappings for types such as “image”, “video”, “audio.” In practice, though, many of these formats use the exact same tags to mean the same things; this means either some mapping logic is duplicated, or certain fields are skipped for some files even though they’re mapped for others. After looking at practical examples of how FITS maps images and videos, I’m not convinced that treating them separately is practical.

  2. Beyond that, FITS uses file extension to determine whether a file is an image, video, etc. In practice many container file extensions can represent many kinds of files; extension is a pretty fragile way of determining type. If FITS keeps a distinction between file type mappings, it should move to using something like mimetype instead of extension.
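The second point can be sketched in a few lines of Python. This is an illustration under assumptions rather than FITS code: the extension table is hypothetical, and the mimetype is assumed to have been reported already by an identification tool.

```python
# Hypothetical extension table of the kind an extension-based check relies on.
EXTENSION_CATEGORY = {"mov": "video", "jpg": "image", "wav": "audio"}


def category_from_extension(filename: str) -> str:
    """Guess the category from the file extension alone."""
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    return EXTENSION_CATEGORY.get(ext, "other")


def category_from_mimetype(mimetype: str) -> str:
    """Derive the category from the major part of an identified mimetype."""
    major = mimetype.split("/", 1)[0].lower()
    return major if major in {"image", "video", "audio"} else "other"
```

With this table, a file named 00001.MTS falls through to "other" because nobody listed its extension, while its identified mimetype of video/m2ts lands in "video" without any table at all.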

Aside from my work improving FITS, I also submitted a set of Quicktime videos to the OpenPlanets Format Corpus on GitHub. The 61-video set covers almost every codec Apple ships with Quicktime and Final Cut Pro, and should be useful for anyone who wants to try to identify individual codec/container combinations. They’re available at: https://github.com/openplanets/format-corpus/tree/master/video/Quicktime

I’ll end this off with some eye candy, to show how nicely FITS’s video support has improved.

Before. The video is detected only as “Unknown Binary” (this was sadly common for video), and no meaningful metadata is extracted.

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.1" timestamp="11/17/12 10:18 PM">
<identification status="UNKNOWN">
<identity format="Unknown Binary" mimetype="application/octet-stream" toolname="FITS" toolversion="0.6.1">
<tool toolname="Jhove" toolversion="1.5" />
</identity>
</identification>
<fileinfo>
<filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/Users/mistydemeo/Downloads/set1/00000.MTS</filepath>
<filename toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/Users/mistydemeo/Downloads/set1/00000.MTS</filename>
<size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">6039552</size>
<md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">8c7c728334017a3ab4caff6e78b30037</md5checksum>
<fslastmodified toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">1261684470000</fslastmodified>
</fileinfo>
<filestatus />
<metadata />
</fits>

After. Not only is the video format identified, but a good 18 video tags are extracted.

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.1" timestamp="11/17/12 10:20 PM">
<identification status="SINGLE_RESULT">
<identity format="M2TS" mimetype="video/m2ts" toolname="FITS" toolversion="0.6.1">
<tool toolname="Exiftool" toolversion="9.05" />
</identity>
</identification>
<fileinfo>
<lastmodified toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">2009:12:24 13:54:36-06:00</lastmodified>
<filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/Users/mistydemeo/Downloads/set1/00001.MTS</filepath>
<filename toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/Users/mistydemeo/Downloads/set1/00001.MTS</filename>
<size toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">4552704</size>
<md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">770fd667d68ca8e6509670b0ef50e61c</md5checksum>
<fslastmodified toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">1261684476000</fslastmodified>
</fileinfo>
<filestatus />
<metadata>
<video>
<digitalCameraManufacturer toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">Sony</digitalCameraManufacturer>
<digitalCameraModelName toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">HXR-NX5U</digitalCameraModelName>
<duration toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">0.09 s</duration>
<imageWidth toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">1920</imageWidth>
<imageHeight toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">1080</imageHeight>
<videoStreamType toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">DigiCipher II Video</videoStreamType>
<shutterSpeedValue toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">1/60</shutterSpeedValue>
<apertureSetting toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">Auto</apertureSetting>
<fNumber toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">3.7</fNumber>
<gain toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">-3 dB</gain>
<exposureTime toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">1/60</exposureTime>
<exposureProgram toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">Manual</exposureProgram>
<whiteBalance toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">Daylight</whiteBalance>
<imageStabilization toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">On (0x3f)</imageStabilization>
<focus toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">Manual (2.3)</focus>
<gpsVersionID toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">2.2.0.0</gpsVersionID>
<gpsStatus toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">V</gpsStatus>
<gpsMapDatum toolname="Exiftool" toolversion="9.05" status="SINGLE_RESULT">WGS-84</gpsMapDatum>
</video>
</metadata>
</fits>
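If you want to check output like this programmatically, a minimal Python sketch using the standard library's ElementTree is below. The namespace URI is the one from the FITS output above; the `summarize` function and the trimmed-down `SAMPLE` document are made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the FITS output; "fits" is just a local prefix.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Trimmed-down sample in the shape of the FITS output, for illustration only.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.1">
  <identification status="SINGLE_RESULT">
    <identity format="M2TS" mimetype="video/m2ts" toolname="FITS" toolversion="0.6.1" />
  </identification>
  <metadata>
    <video>
      <imageWidth toolname="Exiftool" toolversion="9.05">1920</imageWidth>
      <imageHeight toolname="Exiftool" toolversion="9.05">1080</imageHeight>
    </video>
  </metadata>
</fits>"""


def summarize(fits_xml: str) -> dict:
    """Pull the identified format and the extracted video tags out of FITS XML."""
    root = ET.fromstring(fits_xml)
    identity = root.find("fits:identification/fits:identity", FITS_NS)
    video = root.find("fits:metadata/fits:video", FITS_NS)
    tags = {}
    if video is not None:
        for element in video:
            # Element tags look like "{namespace}name"; keep only the local name.
            tags[element.tag.split("}", 1)[1]] = element.text
    return {
        "format": None if identity is None else identity.get("format"),
        "mimetype": None if identity is None else identity.get("mimetype"),
        "video_tags": tags,
    }
```

Calling `len(summarize(xml)["video_tags"])` on a full FITS document gives a quick count of how many video tags were extracted.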

Revisiting Archival Description – LOD-LAM Session Idea

Apologies for the brevity of this blog post – I’m keeping this brief to make sure I get it posted before LOD-LAM.

So, archival description.

Archival records are hard to find. They’re often in large bodies of records, difficult to browse through and generally less cut-and-dried than publications, which are intended for formal release and/or public consumption. Archival finding aids are the researcher’s traditional first point of contact, providing background biographical information on the organization and/or personal creator(s), as well as a description of how the records are arranged and of the various levels of organizational hierarchy. They’re useful!

But they’re also a bit old-fashioned, at least as typically implemented. The finding aid structure poses a few problems for linked open data applications.

I see two[^1] major problems with current archival description:

They’re hierarchical

Most countries’ archival description standards are based on a strict hierarchy from higher levels of description (fonds, etc.) to more precise levels of description (series, sub-series, file, item) with fairly rigidly prescribed relationships between items. The finding aid also assumes a “paper” whole-body approach, rather than a linking approach. This is kind of non-webby, and imposes a stricter order on documents than their creators may have had, in many cases.

(The Australians, of course, are a few steps ahead of the rest of us already.)

Perhaps even more importantly, though, a major problem is that:

They’re imprecise

This is the real issue, or at least the most immediate issue. Archival descriptions are designed for human eyes in a paper world, and so they’re often encoded with a level of ambiguity that’s difficult for machines to extract. (LOCAH has been doing a great job of identifying points of concern and trying to route around them.)

Archival descriptions have some inherent ambiguity because interpretation of archival holdings is not always cut and dried, but that doesn’t mean that we have to be ambiguous in how we create those descriptions. We can be precise about the ways in which our collections are ambiguous.

I’d love to get a conversation going about revising descriptive standards to make finding aids more precise, so that they’re easier to use as computer-readable metadata. I can see a number of areas for improvement:

  • More strongly-typed data fields, rather than “fuzzy” fields that can hold a variety of types of subjectively-defined data
  • More focus on “globally-scoped” names rather than “locally scoped” (as pointed out by Pete@LOCAH here)
  • A stricter, clearer inheritance model rather than ISAD(G)’s rule of non-repetition (Thanks to Pete again)
  • Certainly more, which we can talk about at LOD-LAM!
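To give a flavour of what the first three points could look like in practice, here is a small Python sketch. It isn't drawn from any existing standard or from LOCAH's work; the `Unit` and `DateRange` classes, the `resolve` method and the example.org URI are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DateRange:
    """A typed date span; certain=False states outright that the dates are estimated."""
    start: date
    end: date
    certain: bool = True


@dataclass
class Unit:
    """One level of description (fonds, series, file, item)."""
    level: str
    title: str
    creator_uri: Optional[str] = None  # globally scoped name, given as a URI
    dates: Optional[DateRange] = None
    parent: Optional["Unit"] = None

    def resolve(self, field_name: str):
        """Return a field from this unit, or from the nearest ancestor that sets it."""
        unit = self
        while unit is not None:
            value = getattr(unit, field_name)
            if value is not None:
                return value
            unit = unit.parent
        return None


fonds = Unit(
    "fonds",
    "Example fonds",
    creator_uri="https://example.org/agents/1",  # hypothetical identifier
    dates=DateRange(date(1950, 1, 1), date(1980, 12, 31), certain=False),
)
series = Unit("series", "Correspondence", parent=fonds)
item = Unit("item", "Letter", parent=series)
```

Here the item records nothing about its creator, in the spirit of non-repetition, but `item.resolve("creator_uri")` gives a machine an explicit rule for finding it, and the `certain` flag is one way of being precise about an ambiguous date.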

The extent to which all this can be implemented will depend on the organization, of course – retrofitting older archival descriptions for all of this would be time-consuming, if practical at all. But I think there are a lot of benefits to be gained by changing practices going forward, and I see this as an enhancement to current descriptive standards/practices that can benefit more than just linked open data applications.