In this blog post, Digital Science’s CEO Daniel Hook shares some insights which can be derived from the more than 8 million datasets indexed in Dimensions.

A few weeks ago, we released a significant enhancement to the objects that Dimensions includes in its index. Previously, the Dimensions data index only included data from Figshare and the sources that Figshare includes in its index. Launched at the end of January 2020, this was a “toe in the water” for us to test how to handle some of the significant challenges of indexing datasets. Datasets are quite different to papers, patents, grants or the other objects that Dimensions indexes – the metadata attached to data is weaker than for other research outputs and textual descriptions are less easy to come by. 

Dimensions now indexes more than 8 million datasets (more than four times as many datasets as in the original launch of the data index). Our approach has always been about inclusiveness and lowering the bar to creating the context around a piece of research. As such, including all DOIs that are classified as “dataset” in DataCite in our index is completely aligned with two of the fundamental principles that we laid out when we launched Dimensions in 2018.

Including this wider world of datasets leads to some fascinating insights:

  • If we ignore the outlying point in 2014, the number of datasets given a DOI in each year has risen from around 300,000 in 2013 to just over 1 million in 2019 – tripling the number of datasets that are published annually in just 7 years.  Over the same period, traditional publications rose by almost 44%.

  • The number of datasets published in 2020 has decreased significantly. While publication output rose during the COVID-19 crisis as theoretical work continued and researchers around the world prioritised writing up previous work, datasets are produced by people being in the lab – a much less frequent place to visit this year. Countries that managed lockdowns efficiently in 2020, such as Australia and China, saw a relatively lower impact on their dataset production rates compared with countries that allowed the virus to spread, such as the US.

  • In the funders view in the Dimensions web interface, it is easy to see that, despite the continued dominance of the US in the global research landscape, fewer US grants are acknowledged by dataset producers. The lion’s share of dataset grant acknowledgements goes to the National Natural Science Foundation of China (NSFC), followed by the Japan Society for the Promotion of Science (JSPS) and the European Commission.

  • Beyond what can be seen rapidly from the standard Dimensions web application, these data are also available in the Dimensions API and in Dimensions on Google BigQuery. This adds significantly to the capability for analysts. Some fields in the underlying metadata held on datasets are not directly faceted in the Dimensions web interface. In this case, I wanted to know which countries have funded datasets that have been published under a Creative Commons ‘No Rights Reserved’ Licence (CC0). There is rich data licence information in Dimensions, and with BigQuery, we can drill into that. This simple example, showing the distribution of CC0 licence datasets by country, took just a few moments to create and leverages the dataset data around licences, the GRID data and the links from datasets to funders.
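
For readers who want a feel for the kind of aggregation described in the last point, here is a minimal sketch in Python. It is not the BigQuery query behind the example above, and it does not use the Dimensions schema: the records, the field names (`license`, `funder_countries`) and the function `cc0_datasets_by_country` are all hypothetical, chosen only to illustrate the logic of filtering on licence and counting by funder country.

```python
from collections import Counter

# Hypothetical, hand-made records. The field names are illustrative only
# and are NOT the actual Dimensions or BigQuery schema.
datasets = [
    {"id": "ds1", "license": "CC0", "funder_countries": ["China"]},
    {"id": "ds2", "license": "CC BY 4.0", "funder_countries": ["United States"]},
    {"id": "ds3", "license": "CC0", "funder_countries": ["Japan", "China"]},
    {"id": "ds4", "license": "CC0", "funder_countries": []},
    {"id": "ds5", "license": None, "funder_countries": ["Japan"]},
]


def cc0_datasets_by_country(records):
    """Count CC0-licensed datasets per funder country.

    A dataset with funders in several countries is counted once for
    each distinct country; datasets without funder links are skipped.
    """
    counts = Counter()
    for record in records:
        if record.get("license") != "CC0":
            continue
        # set() ensures a country is counted at most once per dataset
        for country in set(record.get("funder_countries") or []):
            counts[country] += 1
    return counts


counts = cc0_datasets_by_country(datasets)
for country, n in counts.most_common():
    print(f"{country}: {n}")
```

In a real analysis the filter and the grouping would be expressed against the actual licence, funder and GRID fields, and a decision would be needed on how to count datasets funded from more than one country; the sketch counts such a dataset once per country.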

At Digital Science, we continue to believe in the power of context. As has been the case several times in 2020, the data in Dimensions has given us insights into how COVID-19 has impacted research. When making arguments for funding, or decisions about how best to support colleagues across the research ecosystem, we believe that the right data, used sensitively and with the right context, can help everyone.