One of the roles of Dimensions is to make it easy to navigate the many links and relationships between grants, publications, clinical trials, patents, and policy documents. 

In order to get the complete picture of research, datasets must be treated as a first-class research output, just as the aforementioned outputs are by the academic community. 

Recently, funders have started mandating the publishing of non-traditional outputs such as datasets in order to support more reproducible and replicable research. Most recently, the NIH announced that it is strengthening its data sharing rules for the first time in 16 years: the draft policy will require all investigators with NIH funding to make their datasets available to colleagues and the wider public. Similar policies already exist at many funders globally, such as the EPSRC and the Gates Foundation.

With these new requirements, datasets are growing in importance both for performing research and for tracking scholarly impact. Within Digital Science, Figshare has been supporting researchers with publishing datasets and other non-traditional research outputs since 2011, alongside other generalist repositories such as Dryad and Zenodo. Subject-specific repositories have been operating in academia even longer. Whilst many in the life sciences will be familiar with big names like GenBank or the Protein Data Bank, the registry of research data repositories re3data now lists over 2,000!

Ingesting datasets into Dimensions demonstrates Digital Science's commitment to elevating data to a first-class output, and is a first step in what we see as a long and complex, but worthwhile, endeavour.

Dimensions aims to pull in high-quality metadata about datasets, linked out through DOIs and potentially other PIDs in the future. The approach so far has been to target the low-hanging fruit via the Figshare and DataCite APIs: Figshare metadata has been pulled for the first release, to be topped up with DataCite metadata from repositories known to have high-quality, clean metadata. This will be an ongoing process and, like the rest of Dimensions' indexing, will be curated by humans and machines and extended over time.
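To make the ingestion step concrete, here is a minimal sketch of extracting the core fields from a DataCite-style metadata record. The shape loosely follows the JSON:API responses of DataCite's public REST API (`https://api.datacite.org/dois/{doi}`), but the sample record, its values, and the helper function are invented for illustration, and a real pipeline would fetch records over HTTP rather than parse an inline sample.

```python
import json
from typing import Optional

# A trimmed, invented example of the JSON:API shape returned by
# DataCite's REST API for a single DOI. All values are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "id": "10.1234/example.dataset",
        "attributes": {
            "doi": "10.1234/example.dataset",
            "titles": [{"title": "An example dataset"}],
            "publisher": "Example University Data Repository",
            "publicationYear": 2019,
            "types": {"resourceTypeGeneral": "Dataset"},
        },
    }
})

def extract_dataset_metadata(response_body: str) -> Optional[dict]:
    """Pull the core fields an index would need, skipping records
    that are not labelled as datasets (mirroring the filtering
    described above)."""
    attrs = json.loads(response_body)["data"]["attributes"]
    if attrs.get("types", {}).get("resourceTypeGeneral") != "Dataset":
        return None  # not a dataset: do not index
    titles = attrs.get("titles") or []
    return {
        "doi": attrs["doi"],
        "title": titles[0]["title"] if titles else None,
        "publisher": attrs.get("publisher"),
        "year": attrs.get("publicationYear"),
    }

record = extract_dataset_metadata(SAMPLE_RESPONSE)
```

The `resourceTypeGeneral` check is the interesting part: it is one simple way a harvester can keep only content labelled as a dataset while ignoring other record types in the same repository.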

The first tranche contains nearly 1.5 million datasets and includes datasets associated with publications from publisher data repositories including Springer Nature, PLOS, Frontiers and the American Chemical Society, amongst others. Also included are Figshare, Zenodo, Dryad and Pangaea. Content in these repositories that is not labelled as a dataset is not indexed. By pulling metadata from DataCite, Dimensions will be able to add another large corpus of dataset metadata, including records from university data repositories.

We are also investigating the power of the existing Dimensions database to enhance the metadata around datasets, through confirmed linkages between types of output.

We believe the addition of datasets to Dimensions will be of tremendous value to all members of the research community. By making even more linked data available in one platform, rather than in disconnected databases, it will allow academic researchers to more easily discover datasets relevant to their work and to showcase their own datasets on their profiles. Research administrators and publishers will be able to use the datasets in trends and impact analyses, and Dimensions users working in corporate R&D will be able to enrich their field analyses. The possibilities are endless. 

Whilst this ongoing task will never be complete (indexing thousands of repositories well takes time!), once we achieve a standard sufficiently high to build on, where we go from there will be driven in large part by the community. 

What is most important? 

Sentiment analysis on links and references to determine why a dataset is being mentioned? 

Working to establish or solidify community-driven standards on what data metrics mean? 

Or is it something we haven’t even imagined yet?

Let’s make the future together. We’d love to hear your thoughts.