Livingstone's 1871 Field Diary

A Multispectral Critical Edition

Data Management
Preliminary spectral imaging of Livingstone’s diary in Scotland and follow-up spectral image processing generated a significant amount of data and metadata. This data required far-sighted and detail-oriented data management to ensure efficient capture, production, and organization for long-term viability. The team’s data manager, Doug Emery, directed this process from beginning to end with assistance from members of the Livingstone team.
Figure 1. Doug Emery (foreground) and Roger L. Easton, Jr. review the Livingstone data.
The team summarized data management objectives in their original NEH grant application: "The Nyangwe field diary project will provide a complete package of images with documentation and full metadata. The resulting archive data set will be based on the archive and metadata model used for the Archimedes Palimpsest. The product will be a completely self-documenting and autonomous data set. […] This will be a full digital archive of images that meets library and archival standards, as we have produced for previous projects."
Visit the related section from Livingstone’s Letter from Bambarre.
Preliminary Data Management
The team followed a series of intermediate steps to meet the data objectives, a process that spanned nearly the whole 18-month NEH and British Academy grant periods. The process began prior to imaging in Scotland with a series of discussions by email and teleconference in which the team set out the details of the data archive deliverable, began to define XML transcription guidelines, and assessed metadata needs of key stakeholders. The team also explored the implications of various hosting and publication options.
During this period, Emery began planning for the collection of appropriate metadata. He created a spreadsheet of the folia to be imaged that included location shelfmark (DLC or NLS), details of Livingstone’s handwritten "overtexts" and pre-printed "undertexts," and shots required (to identify folia to be imaged in segments). He reconfigured his online logging application to meet project needs and used the application to create Livingstone imaging projects and associated shot sequences. Finally, he instructed the imaging scientists in collecting image setup data.
Figure 2. Hard drives and other equipment, spectral imaging phase, National Library of Scotland, June 2010.
From Baltimore, Emery continued to support daily metadata collection while the team imaged in Scotland – a heroic task given that it entailed waking up each day at 2:30 a.m. EST in order to be available for the team start time of 8:30 a.m. GMT!
"Data Scrubbing"
Once imaging ended, the team moved to a lengthy period of data assessment and correction. In late July 2010, Emery produced an initial evaluation of the Livingstone data. Based on this evaluation, he, Toth, and Wisnicki sought to refine the existing data management plan. In the following months, Toth also outlined the overall path forward in a series of emails to the team, ultimately developing a phased approach based on prioritized goals, capability needs, and funding available for each phase.
The duration of the assessment and correction period had partly to do with the file-naming scheme created by the team. The scheme took the following form:
| Segment | Referent | Example |
| --- | --- | --- |
| First | institution initials plus shelfmark | DLC297b |
| Second | Livingstone’s folio number(s) (in Arabic numerals) or, if not provided, begin with 001 (for both recto and verso), then 002 (recto and verso), etc. | 149-146 |
| Third | institution folio number or, if not provided, begin with 001 (for both recto and verso), then 002 (recto and verso), etc. plus "r" or "v" designation | 012r |
| Fourth | "0" if only one shot required; "0," "1," "2," etc. if folio imaged in parts | 0 |
| Fifth | shot sequence letter "A," "B," "C," etc. | A |
| Suffix | image or document type | dng |
This scheme produced a rather lengthy name for each image file, e.g., DLC297b_149-146_012r_0_A.dng. The advantage of the scheme lay in how much information it recorded. The disadvantage, which the team realized only in hindsight, lay in its complexity. The scheme left ample room for error, especially as the hours and days in the darkened imaging room lengthened. Moreover, because preliminary processing took place during the imaging phase and beyond, incorrect file names propagated exponentially into the processed data.
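The five-segment scheme can be checked mechanically. The following is a minimal sketch, not the team's actual tooling: the regular expression is inferred from the table and the single example name given here, so details such as the allowed shelfmark characters are assumptions, and `FILENAME_RE`, `ImageName`, and `parse_image_name` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the scheme described above. The exact character
# rules for each segment are assumptions, not documented project rules.
FILENAME_RE = re.compile(
    r"^(?P<shelfmark>[A-Z]+\d+[a-z]?)"            # institution initials plus shelfmark
    r"_(?P<livingstone_folio>\d{3}(?:-\d{3})*)"   # Livingstone's folio number(s)
    r"_(?P<institution_folio>\d{3}[rv])"          # institution folio plus r/v
    r"_(?P<part>\d+)"                             # 0 for one shot, else part index
    r"_(?P<shot>[A-Z])"                           # shot sequence letter
    r"\.(?P<suffix>[A-Za-z0-9]+)$"                # image or document type
)


class ImageName(NamedTuple):
    shelfmark: str
    livingstone_folio: str
    institution_folio: str
    part: int
    shot: str
    suffix: str


def parse_image_name(name: str) -> Optional[ImageName]:
    """Split a file name into its segments, or return None if malformed."""
    m = FILENAME_RE.match(name)
    if m is None:
        return None
    return ImageName(
        m["shelfmark"],
        m["livingstone_folio"],
        m["institution_folio"],
        int(m["part"]),
        m["shot"],
        m["suffix"],
    )
```

Running every name in a folder through a check of this kind would flag names with a dropped leading zero or a missing "r"/"v" designation before they reach later processing steps.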
After Emery’s evaluation, Knox and Wisnicki collaborated in reviewing and correcting the raw data and Knox’s processed data. Using Wisnicki’s list of file and folder corrections and additional input, Knox "scrubbed" the data in a series of steps:
  1. identify and correct duplicates, misnamed files, and excess files;
  2. create command files to convert the data according to Wisnicki’s list;
  3. produce a PowerPoint summary of the data set.
The process, which required many hours of work because of the complexity of the file names, lasted from August to December 2010, significantly longer than anticipated.
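The scrubbing steps above can be illustrated with a short sketch. It assumes the corrections list can be read as a mapping from old to new file names, and it only plans the renames rather than touching any files. The function `plan_renames` and the file names in the usage example are hypothetical; Knox's actual command files are not reproduced here.

```python
from collections import Counter
from typing import Dict, List, Tuple


def plan_renames(
    existing: List[str], corrections: Dict[str, str]
) -> Tuple[List[Tuple[str, str]], List[str]]:
    """Return (renames, problems) for a corrections list; no files are changed."""
    renames: List[Tuple[str, str]] = []
    problems: List[str] = []
    existing_set = set(existing)

    for old, new in corrections.items():
        if old not in existing_set:
            problems.append(f"missing source: {old}")
        elif old != new:
            renames.append((old, new))

    # Names that would exist after the renames: untouched files plus targets.
    renamed_sources = {old for old, _ in renames}
    final_names = [n for n in existing if n not in renamed_sources]
    final_names += [new for _, new in renames]

    # Any name that appears twice would overwrite another file.
    for name, count in Counter(final_names).items():
        if count > 1:
            problems.append(f"name collision: {name}")

    return renames, problems


# Hypothetical example: one misnamed file with a dropped leading zero.
existing = [
    "DLC297b_149-146_012r_0_A.dng",
    "DLC297b_149-146_12r_0_B.dng",
]
corrections = {"DLC297b_149-146_12r_0_B.dng": "DLC297b_149-146_012r_0_B.dng"}
renames, problems = plan_renames(existing, corrections)
```

Separating the plan from its execution lets a reviewer inspect the proposed renames and resolve any reported collisions before data is altered.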
Creating the Data Archive
The need for extensive data correction pushed the spectral image processing phase to the spring of 2011. This development, in turn, delayed Wisnicki and Simpson from beginning XML transcription until late February 2011. As a result, the team continued to produce significant new primary data (images and transcriptions) well into June 2011, which prevented final work on the archive until mid-June.
In the NEH grant application, the team had described this next phase as follows: "After imaging and processing is completed, the imaging logs will be collated with other required metadata to generate complete metadata records for raw and processed images, and to assemble the final data archive." To realize these objectives, Emery needed to complete a complex series of discrete tasks. These covered preparing the raw data, collecting the new data produced by the team, assembling and refining the database, collecting and building the metadata, and loading and packaging the data.
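The collation step described in the grant application can be sketched as a join between the image files and the imaging log. This is only an illustration under stated assumptions: the log is assumed to be keyed by file name without its suffix, and the function `collate` and the field `wavelength_nm` are hypothetical, not part of Emery's database or logging application.

```python
from typing import Dict, Iterable, List, Tuple


def collate(
    image_files: Iterable[str], shot_log: Dict[str, dict]
) -> Tuple[List[dict], List[str]]:
    """Join image files to log entries; return (records, unmatched files)."""
    records: List[dict] = []
    unmatched: List[str] = []
    for name in sorted(image_files):
        stem = name.rsplit(".", 1)[0]  # file name without its suffix
        entry = shot_log.get(stem)
        if entry is None:
            unmatched.append(name)     # no log entry: needs manual review
        else:
            records.append({"file": name, **entry})
    return records, unmatched


# Hypothetical data: two image files, only one of which has a log entry.
files = ["DLC297b_149-146_012r_0_B.dng", "DLC297b_149-146_012r_0_A.dng"]
log = {"DLC297b_149-146_012r_0_A": {"wavelength_nm": 940}}
records, unmatched = collate(files, log)
```

A join of this kind succeeds only when file names and log keys agree exactly, which suggests why the database had to be corrected to match the scrubbed file names before accurate metadata could be produced.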
During the final months of the data management phase (June to September 2011) Emery worked in close collaboration with Wisnicki and the imaging scientists to finalize the archive, refine details of the archive structure, collect all required metadata, and identify missing processed images. In his efforts, Emery prioritized the 1871 Field Diary data but, where possible, included NLS data in bulk management tasks.
Figure 3. Wisnicki opens the package from Emery containing the completed data archive, Indiana, Pennsylvania, 29 Sept. 2011.
Emery allocated much of his time to the labor-intensive task of correcting the database so that it would work with the corrected file names produced by Knox and so that he could generate accurate metadata. Emery also set up a server with the metadata database, allowing Wisnicki to verify the data and add new information as needed (including image rotation data). Finally, Emery and Wisnicki carefully corrected the names of processed files from Christens-Barry, Easton, and Houston. Emery completed the data archive on 27 September 2011, made a backup, then shipped the hard drive to Wisnicki, who received it on 29 September, made a copy, and shipped it on to Stephen Davison, Head of the UCLA Digital Library Program.
Figure 4. The data archive as it copies onto Wisnicki's computer.
Documents for Download
  1. Email: Evaluation of Data, Emery, 23 July 2010
  2. File and Folder Corrections Spreadsheet, Wisnicki, July 2010
  3. Summary of Livingstone Data Tasks, Emery, September 2010
  4. Internal: Data Management and Sharing Objectives, Livingstone Team, November 2010
  5. PowerPoint: Livingstone Image Data, Knox, December 2010
  6. PowerPoint: DLC297 Thumbnails, Knox, December 2010
  7. PowerPoint: NLS107nn Thumbnails, Knox, December 2010
  8. List of Corrected Raw Files, Knox, July 2011
  9. List of Knox Processed Images, Knox, July 2011
  10. Email: Livingstone Data Set Tasks, Emery, 4 June 2011
  11. Livingstone Document Metadata Summary, Wisnicki, June 2011
  12. Livingstone Data Set Tasks Update, Emery, 4 August 2011
  13. Metadata Sample Record: DLC297b ratio by 940
  14. Emery White Paper Input, October 2011