Summary problem statement
While Data Publishers and Repositories work directly with researchers to create and improve metadata, our community routinely struggles with adoption. This creates a tension between the quantity of deposits and the quality of the metadata collected. Even successful data repositories still struggle with inconsistent information across their datasets. Many of these issues stem from the position of data repositories within researcher workflows, but other significant hurdles remain, including inconsistent metadata vocabularies, a lack of tools, and a lack of integrated metadata evaluation and guidance.
- Disciplinary communities, repositories, and journals create metadata “standards” and requirements without thorough examination, understanding, and evaluation of existing alternatives. Growing your own is often easier than adopting or adapting the work of others. This leads to redundancy, inconsistency, and confusion among scientists trying to plug into the system.
- Existing information about data repositories (or properties of repositories) that are important to journals is difficult to discover and/or comprehend because of inconsistency in vocabularies and presentation.
- Because of uncertainty surrounding best practices, many groups (publishers, repository certifiers, etc.) are trying to clarify approaches, and their efforts are not consistent. This causes problems for researchers and others when approaching data repositories.
- Data repositories are caught between the need for clean metadata and the need to improve adoption rates. The myth is that there must be a trade-off between collecting large amounts of data with incomplete metadata and holding a small number of well-documented datasets.
- Data repositories consistently operate within an unequal power dynamic with journal publishers. The two communities are not always aligned on metadata requirements, and researchers tend to value journal publishers’ requirements and guidance over those of data repositories.
- Researchers cannot find consistent or clear information about metadata and its role in helping future repository users understand and trust their data.
- Researchers, and others in the scholarly communications community, do not clearly understand the metadata lifecycle and the important synergies between data, processes, and the metadata that describe them.
- Automating metadata creation in instrumentation and processing workflows is an important step forward, but metadata necessary for understanding specific datasets must not be lost in the process.
- Development and integration of metadata recommendations and evaluation tools.
- Collaboration with other communities to synchronize vocabulary and evolve messaging for researchers surrounding the importance of and uses for metadata.
- Collaboration with other communities to develop consistent metadata best practices and principles to share with repositories and data publishers and guidance that supports those best practices and principles.
- Mapping the metadata lifecycle in a way that clearly demonstrates to others the journey of metadata and supports its diagnosis, so that we can identify gaps and breakages and isolate interoperability issues.
What exactly can Data Publishers / Repositories do?
- Develop an extendable foundation of broadly applicable metadata use cases and engage researchers to understand specific implementations and benefits of those use cases.
- Directly challenge the myth that there must be a trade-off between collecting large amounts of data with incomplete metadata and holding a small number of well-documented datasets. Lead a 360° community engagement with scientists, data repositories, and journals around this issue.
- Identify and share existing processes and behaviors that are consistent with best practices. These should include core principles surrounding metadata requirements as well as deviations from these principles that we view as positive or negative.
- Work with other communities to develop, share, and integrate metadata evaluation tools (e.g., with metadata librarians, data curators, and data repository certifiers).
- Build bridges between larger, more successful repositories and smaller data repositories to ensure quality can be achieved at a variety of scales.
- Ensure data repositories deposit their metadata with the appropriate community infrastructure provider (e.g., DataCite) so that data repository metadata is shared and utilized.
- Collaborate with other Metadata 2020 communities (such as Journal Publishers, Librarians, and Researchers) to form and share a consistent cross-community vocabulary surrounding metadata.
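As a rough illustration of what a lightweight metadata evaluation tool (as proposed above) could look like, the sketch below checks one record for completeness. The field names are hypothetical; a real tool would evaluate records against a community schema such as DataCite’s, not this ad hoc list.

```python
# Minimal sketch of a metadata completeness check.
# REQUIRED_FIELDS is illustrative only, loosely inspired by common
# elements (identifier, title, creator, ...); it is not taken from
# any specific standard.

REQUIRED_FIELDS = ["identifier", "title", "creator", "publisher", "publication_year"]

def evaluate_metadata(record: dict) -> dict:
    """Return a simple completeness report for one metadata record."""
    # Treat absent keys and empty values alike as missing.
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    score = (len(REQUIRED_FIELDS) - len(missing)) / len(REQUIRED_FIELDS)
    return {"complete": not missing, "missing": missing, "score": round(score, 2)}

record = {
    "identifier": "10.1234/example",
    "title": "Example dataset",
    "creator": "Doe, J.",
    "publisher": "",           # empty value counts as missing
    "publication_year": 2018,
}
print(evaluate_metadata(record))
```

A tool like this could run at deposit time to give researchers immediate, consistent feedback rather than leaving quality checks to later curation.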
Experienced in working with researchers to improve metadata, Data Publishers and Repositories are well positioned to help the scholarly communications community improve standards, best practices, evaluation tools, and communication with researchers as a whole. Data Publishers and Repositories will be an integral part of advancing the mapping and interoperability of metadata. The contribution this group can make to collaborative action across scholarly communications will, in turn, improve the quality of metadata deposited with Data Publishers and Repositories.
- John Chodacki, CDL and DataCite (Chair)
- Adrian Price, SUND/SCIENCE Bibliotek
- Barbara Chen, Modern Language Association
- Jennifer Lin, Crossref
- Scott Plutchak, University of Alabama at Birmingham (retired)
- Ted Habermann, HDF Group
The Data Publishers and Repositories Community Group is involved with: TBD February 2018
Please email us if you’d like to know more.