Earlier this week we saw some great comments about Dimensions from members of the lis-bibliometrics JISCMail list. Below you can find our feedback on key points, which we hope will be useful for reference.

The posts raised some very interesting questions on data, coverage and technical aspects of the database:

_________________________________________

Q: I’d been quite cautious of the coverage in Dimensions when I first looked at it, but between them these seem quite promising – the citation counts in both services compare quite well, and Dimensions has records for the vast majority of articles found in Scopus.

A: Thanks Andrew! We put a lot of work into the content we indexed prior to launch, and today Dimensions includes records for over 91 million research publications, all of which are linked wherever possible to associated grants, patents, and clinical trials. Where applicable, Altmetric data and citations for those publications are also displayed (more on the latter below). If helpful, more detail on all of the data in Dimensions can be found in this report.

Q: The newer paper highlights substantial inconsistencies for the “most cited ever” papers, which presumably indicates that Dimensions has patchier coverage for much older papers.

A: Thanks for focusing on this – you are right, but we would like to make the argument that this concerns a small number of records which do not represent the majority; they are outliers where systematic effects are greatly amplified. Thanks to two processes (both always ongoing) we have already improved citation counts for these ‘lighthouse records’: the first is a reload of the citation graph with an updated dataset we received from Crossref as a ‘backbone’, and the second is more publishers joining to make their content more discoverable by having their full-text records indexed.

The improved citation counts are in the range of a 2–5 percent increase, with one outlier where we have added more citations.

Q: The newer paper highlights a couple of the major current weaknesses – fragmented author records and an inability to filter/search by location (which exists, but only as a subscriber feature).

A: Our author disambiguation is currently still in beta, and in order to ensure high-quality results we have prioritised accuracy over recall – meaning that yes, in some places you may find that author records are missing or incomplete. This is something we continue to work on and hope to have improved substantially in the next few months! In one of the upcoming releases (weeks away, not months) we are integrating with ORCID, so that researchers can use Dimensions to complete their ORCID records and also improve how they are represented in Dimensions.

Q: They also find that the automated subject identification in Dimensions is very flaky – a lot of articles are getting characterised in unexpected fields, and non-English material is often not being assigned to a field at all. Coupled with the lack of facilities for really detailed keyword searching, this feels like it might be a bit of a problem for getting useful comparative data.

A: This does indeed vary by field – it is currently strongest in the medical and health sciences. In the majority of cases the inconsistency comes down to the size of the sample dataset the system learns from – where the sample is smaller, the matches are not as strong. This is, of course, something we are already working to improve. We have opted for an automatic classification approach based on the content of the document rather than using the journal as a proxy, since this allows us to apply the categorisation to other document types as well.

The search bar at the top offers boolean search, with the option to search just abstracts or the full text of all of the records in the system. We also provide a very powerful API, for which we developed a domain-specific query language that allows you not only to retrieve data, but also to aggregate and facet it in a single call. The documentation for it can be found here if you want to learn more.
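As an illustrative sketch only (the exact syntax, field names, and facets are defined in the API documentation linked above and may differ from what is shown here), a query in the domain-specific language combines retrieval with aggregation in one statement along these lines:

```
// hypothetical example – consult the DSL documentation for the real syntax
search publications for "bibliometrics"
    where year >= 2015
return funders aggregate count
```

The idea is that a single call both filters the publication records and returns them faceted by an entity such as funder, rather than requiring separate retrieval and aggregation requests.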

Q: Finally, there’s an interesting note in passing about the rapid increase in citation links – almost a 20% increase in the recorded total citations for one author between February and April 2018. This seems to suggest they’re still aggressively populating the database with historical data, rather than only updating with new material – it will be interesting to see what this looks like in a year or so.

A: Thank you for the observation – yes, this is mainly due to rebuilding the citation graph with an improved dataset provided by Crossref, and to the fact that we are adding more and more publishers.

Q: My understanding (I could well be wrong – often are), is that the underlying source of Dimensions citation data is Crossref. If so a DOI assignation is probably important in terms of the coverage parameters.

A: Thanks for the comment, Jason – not too far off! We integrate publication data in a two-step process. First, Crossref and the other databases available to us are aggregated into a ‘backbone’ – some of these sources have patchy metadata and incomplete reference lists. In a second step we work with publishers to index their content, for better discoverability and to improve the citation network and metadata – and we have done this for more than 60 million of the 90 million records.