On Sunday, June 25, 2017, the Association for Library Collections and Technical Services (ALCTS) Metadata Interest Group sponsored “Metadata Migration: Managing Methods and Mayhem,” co-sponsored by the ALCTS Creative Ideas in Technical Services Interest Group and the ALCTS Cataloging and Metadata Management Section (CaMMS) Cataloging Norms Interest Group. The program featured two speakers who discussed their experiences with evolving procedures and tools for metadata migrations. Of note, the audio and slides are available on the conference event page.
The first speaker was Maggie Dickson, Metadata Architect for Duke University Libraries (DUL), who presented “Looking Back, Moving Forward: Remediating Duke Digital Collections Metadata.” Dickson gave an overview of a DUL Digital Collections remediation project prompted by the migration to a new digital repository. For the project, DUL created the Metadata Architect position and formed a task group, which established four guiding principles: fitness for purpose, broad applicability, broad shareability, and forward thinking. Dickson described how the group visualized the values in the dataset using OpenRefine and Tableau Public, which helped them determine which fields could be mapped to Dublin Core (DC) terms and which would remain local terms; the visualizations also showed which collections used which fields. The task group used Google Sheets to track its analyses and carry out discussions.

Following the development of an action plan, the group began remediating collection by collection in 2015, using OpenRefine, regular expressions (applied in the TextWrangler editor), and Ruby. The remediation focused on field normalization (including aligning usage with the Digital Public Library of America (DPLA) and applying core fields consistently across collections), value remediation, implementation of the Extended Date/Time Format (EDTF), and synchronization of unique identifiers with Encoded Archival Description (EAD) finding aids to allow bidirectional linking. Project outcomes include the formation of a metadata advisory group, documentation of the metadata application profile, a better user experience with faceting across collections, and more actionable structured data. The work also enables an easier transition to linked data, makes the metadata more shareable, and will ease the inevitable next migration.
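Dickson did not share the task group's scripts, but the kind of date remediation EDTF calls for is easy to sketch. The following is a minimal illustration (in Python rather than the Ruby the group used; the patterns and function name are assumptions, and real collections would need many more rules):

```python
import re

def normalize_to_edtf(value):
    """Map a few common free-text date patterns to EDTF strings.

    Handles only the illustrative cases below; anything else is
    returned unchanged for human review.
    """
    value = value.strip()

    # "circa 1920" / "ca. 1920" -> approximate date: "1920~"
    m = re.fullmatch(r"(?:circa|ca\.?)\s*(\d{4})", value, re.IGNORECASE)
    if m:
        return m.group(1) + "~"

    # "between 1930 and 1940" -> interval: "1930/1940"
    m = re.fullmatch(r"between\s+(\d{4})\s+and\s+(\d{4})", value, re.IGNORECASE)
    if m:
        return f"{m.group(1)}/{m.group(2)}"

    # "1950s" -> decade with an unspecified digit: "195X"
    m = re.fullmatch(r"(\d{3})0s", value)
    if m:
        return m.group(1) + "X"

    return value
```

The decade case uses the `X` unspecified-digit convention from the current EDTF specification; earlier drafts of the standard wrote it as `195u`.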
Looking forward, DUL will continue its remediation efforts, reconcile its data with linked data sources, develop programmatic approaches for auditing repository metadata quality, and integrate digital library metadata practices library-wide.
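The talk did not detail what DUL's auditing will look like, but at its simplest, a programmatic metadata-quality audit checks each record against a profile and flags the gaps. A hedged sketch (the required-field list and dict-based record shape here are assumptions, not DUL's actual application profile):

```python
# Illustrative audit: report which records are missing required fields.
# The field list and record shape are assumptions for this sketch.
REQUIRED_FIELDS = ["identifier", "title", "date", "rights"]

def audit_records(records):
    """Return {record id: [missing field names]} for incomplete records."""
    problems = {}
    for record in records:
        # A field that is absent or empty counts as missing.
        missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
        if missing:
            problems[record.get("identifier", "<no id>")] = missing
    return problems
```

Run against a whole repository on a schedule, a report like this turns metadata quality from a one-time cleanup into something continuously monitored.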
In a talk titled “Neverending Migration,” Gretchen Gueguen, Data Services Coordinator at the Digital Public Library of America (DPLA), described the tools used and lessons learned in the perpetual migrations of DPLA and its partners. Gueguen recounted the history of the project and the challenges of working with such a large and heterogeneous dataset. Using tools such as OpenRefine, metadata breakers, and XMLStarlet, she analyzes data contributed by partners and how it maps to DPLA’s profiles. DPLA also performs data normalization and enrichment, including conversion of dates to EDTF. Another tool, Kibana, allows DPLA to analyze its database and to create visualizations and reports. DPLA will be working toward an open source tool called Ingest3 that leverages “big data” principles to enhance its ingest process. Given the nature of DPLA’s data, Gueguen described data quality improvements as coming either from the grassroots or from the top down. Grassroots means data contributors enhancing their own data, for example by applying rightsstatements.org uniform resource identifiers (URIs); top down means DPLA making enhancements such as geospatial enrichment and format mapping. Gueguen observed that never-ending migrations are not necessarily bad; they allow DPLA to keep changing how it deals with its myriad challenges. She added that good tools and processes are collaborative and serve multiple purposes, and that creating consistency in data requires both human-mediated and programmatic approaches.
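The grassroots enhancement Gueguen mentioned can be as simple as a contributor mapping its local free-text rights values onto rightsstatements.org URIs. A minimal sketch (the local strings on the left are invented for illustration; the two URIs are real rightsstatements.org vocabulary terms):

```python
# Map a contributor's free-text rights values to rightsstatements.org URIs.
# The local strings are hypothetical; the URIs are real vocabulary terms.
RIGHTS_URIS = {
    "in copyright": "http://rightsstatements.org/vocab/InC/1.0/",
    "no known copyright": "http://rightsstatements.org/vocab/NKC/1.0/",
}

def rights_to_uri(value):
    """Return the matching URI, or None if the value is unrecognized."""
    return RIGHTS_URIS.get(value.strip().lower())
```

Unrecognized values returning `None` is deliberate: those records fall back to human review rather than being assigned a wrong statement.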
Both presenters took questions following their presentations.