About MIDS

DRAFT

Introduction

The need to rapidly digitise millions of specimens in natural history collections has led to the wide adoption of a staged approach to data capture. Mass digitisation programmes have generally started by creating skeletal or stub records, which can then be expanded as more funding or support becomes available. Combined with the earlier practice of creating relatively full data records for each specimen, this has produced huge variation in the level of digitisation both within and between collections. A range of international digitisation initiatives has also made clear that people understand the term ‘digitisation’ differently, which leads to confusion and uncertainty when something is described as having been digitised. Thus, the idea for a ‘minimum information standard’ was born to serve a range of aims that include:

  • Offering clarity to collection owners about the minimum information they should publish from digitisation initiatives so that digital specimen information is useful for multiple purposes, such as teaching, learning and research;
  • Assisting the global effort to digitise natural science collections, estimated at 3 billion specimens worldwide, by providing a structured framework that clarifies the outcomes of digitisation and the level of digitisation achieved, and so helps to prioritise the remaining work;
  • Supporting and contributing towards assessments of the fitness for purpose (suitability) of data for feeding specific types of data processing pipelines; and,
  • Assisting researchers to know what information to include in their journal articles and data deposits about specimens they have used in their research.

This framework includes making the data publicly available because open access policies in countries around the world require that digital data should be findable and accessible, even at the most minimal level of available digitised information.

Audience/Stakeholders

A range of key stakeholders from bio- and geodiversity domains can be identified as beneficiaries of MIDS. These include:

  • Developers of collection management systems (for automating MIDS calculation and managing missing data), of crowdsourcing platforms (in relation to field inclusion and managing missing data), and of other software tools;
  • Digitisation and administration staff, for example to identify and manage missing data and to calculate costs of and plan for further digitisation;
  • Management, for developing and managing digitisation strategies and for receiving reports;
  • Public relations staff, for public communication of the progress of digitisation; and,
  • Domain experts/researchers, for example for assessing and developing usability in research and teaching, and for data mining.

Motivation

In its most general sense, digitisation in natural sciences is the process of converting analogue information about physical specimens to digital form, which includes electronic text, images and other representations.

However, the term ‘digitisation’ is understood in diverse ways. It can mean, for example: creating database records (of various extents); making images of collection containers, specimens and/or their label(s); a level of data capture (transcription, excluding or including interpretation of data); and, more recently, semantic enrichment of data and notions of ‘born digital’/‘digital by default’. From one digitisation initiative to another, the outputs can vary widely because aims, practices and procedures vary across different collection types and institutions. Thus, when a curator, collections manager or scientist talks of something being digitised, it is not objectively apparent what is meant. Nor is it apparent what ‘sufficient digitisation’ means and when (if at all) digitisation is complete. Furthermore, most collections need to report on the progress of digitisation to management and/or funding agencies, so agreed measures are needed.

A harmonising framework captured as a TDWG standard can help clarify levels (depth) of digitisation and the minimum information captured and published at each level. This would help to ensure that enough data are captured, curated and published against specific requirements so that they are useful for the widest possible range of purposes. It would also make it easier to measure consistently the extent of digitisation achieved over time and to set priorities for remaining work. Such a framework would also be beneficial for ‘born digital’ specimens, where digital data are captured from the outset, beginning with the gathering event.

Inspired by the idea of ‘minimum information standards’ adopted in other areas of biology, we name this proposed TDWG standard ‘Minimum Information about a Digital Specimen’ (MIDS), the topic of the present task group. This harmonising framework includes making the data publicly available because open access policies in countries around the world require that digital data should be findable and accessible, even at the lowest level of available digitised information.

Context/history

Minimum information standards have been an initiative in biosciences to provide sets of guidelines for reporting data derived by relevant scientific methods. As a general principle, however, there is no reason to confine them to bioscientific disciplines. Minimum information standards can be applied wherever else it is necessary to capture and present (publish) data for interoperability and re-use by others. When followed, minimum information standards should ensure that such data can be easily verified, analysed and clearly interpreted by the wider scientific community. Minimum information standards also facilitate structured databases, public repositories, and development of processes, procedures and software tools.

The Minimum Information Standards for Scientific Collections (MISC)/Authority Files Working Group was established in 2012 by iDigBio, the National Resource for Advancing Digitization of Biodiversity Collections (ADBC) funded by the National Science Foundation. It was not an attempt to establish a standard for minimum information for scientific collections, but rather an attempt to suggest to data providers what data should be provided for ingestion into the iDigBio infrastructure. The guidance (MISC 2012) denoted three categories of elements (i) required, ii) highly desired, or iii) complementary) that were felt necessary to support better practices for discoverability, research use, and cross-linking (through the use of globally unique identifiers (GUIDs), for example). This work helped the US community move toward understanding what was needed to enhance discovery, research use, and linking.

Design study work funded by the European Union Horizon 2020 ICEDIG (https://icedig.eu/) project (2018-2020) for the future European Distributed System of Scientific Collections (DiSSCo) research infrastructure identified the need for a minimum information standard for the development of the infrastructure. An additional driver was the strategic objective of the Consortium of European Taxonomic Facilities (CETAF) to have 10% of the collections digitised. The CETAF Digitisation Working Group recognised that ambiguity in the understanding of the term ‘digitised’ called for standardised terminology to enable measuring and monitoring digitisation across more than 70 institutions.

The Minimum Information about a Digital Specimen is being developed within the Biodiversity Information Standards (TDWG) infrastructure.