Review of: "Finding citations for PubMed: A large-scale comparison between five open access data sources"



Review
In this article, the authors described a study that compares five data sources that provide their data freely concerning the citations for the literature included in PubMed. It is indeed a fascinating study highlighting the coverage of existing freely available citation data sources for a specific macro-area. Moreover, it allows one to monitor the status of the free availability of these data and their effectiveness in bibliometrics studies compared with proprietary data sources, i.e. Scopus and Web of Science.
While I do not have significant comments on the study itself, I think that some essential aspects of the work should be revised, and I discuss them below. Before doing so, though, a full disclosure: I am one of the Directors of OpenCitations and, as such, I am responsible for one of the data sources used by the authors in their experiment, i.e. COCI.

Open Access vs Open Data
Open Access is a term associated with traditional publications (i.e. articles). When we talk about data, as the authors do in this paper, the correct term to use is Open Data. Thus, it would be good to rephrase it as "... five open data sources" in the title and throughout the paper. To be more specific, the term "open" applied to publications (i.e. Open Access) and to data (i.e. Open Data) identifies a set of principles that go beyond the mere free availability of these objects (e.g. see https://oaspa.org/information-resources/frequently-asked-questions/#FAQ1). Indeed, Open Data does not only mean that you can freely access such data, but also that you can "use, modify, and share [them] for any purpose (subject, at most, to requirements that preserve provenance and openness)" (see https://opendefinition.org/). This is today the intended definition of Open Data.
Thus, according to this definition, some of the "open access data sources" mentioned by the authors are not "open" at all, only freely accessible. I think that this distinction should be made explicit in the paper.

Citing the data
There are several references (see the section "Collecting and matching citations") to the data that have been used in the study. Some of them have been taken from complete dumps. Some of these dumps are research objects per se and are identified with persistent identifiers (e.g. DOIs), and, as such, should be formally cited.

The authors wrote that they extracted only DOI-to-DOI citations from MAG. However, MAG also contains articles that are not identified by a DOI. Some of them may, in principle, have a PMID and could thus be included in the PubMed dataset. Why, then, did the authors not consider matching some metadata (title, first author, year of publication, etc.) of the MAG articles with no DOIs against PubMed, so as to include them in the analysis as well? Section 3 of a recent article by Visser et al. published in QSS [1] provides a strategy for doing this that could be adopted.
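The metadata-based matching suggested above could be sketched roughly as follows. This is a minimal illustration, not the strategy of Visser et al.: the record fields, identifiers, and the exact-match-on-normalized-key approach are all assumptions made for the sake of the example.

```python
import re

def norm(text):
    """Lowercase and strip non-alphanumeric characters before comparing."""
    return re.sub(r"[^a-z0-9]", "", text.lower())

def match_key(record):
    """Build a key from normalized title, first author, and publication year."""
    return (norm(record["title"]), norm(record["first_author"]), record["year"])

def match_mag_to_pubmed(mag_records, pubmed_records):
    """Return (mag_id, pmid) pairs whose normalized metadata keys coincide."""
    index = {match_key(r): r["pmid"] for r in pubmed_records}
    pairs = []
    for r in mag_records:
        pmid = index.get(match_key(r))
        if pmid is not None:
            pairs.append((r["mag_id"], pmid))
    return pairs

# Toy records (all identifiers are made up for illustration):
mag = [{"mag_id": "M1", "title": "Finding Citations!",
        "first_author": "Smith", "year": 2020}]
pubmed = [{"pmid": "12345", "title": "Finding citations",
           "first_author": "smith", "year": 2020}]
print(match_mag_to_pubmed(mag, pubmed))  # → [('M1', '12345')]
```

In practice one would likely want approximate title matching and disambiguation of multiple candidates, as discussed in the cited work; the sketch only shows where the non-DOI MAG records would enter the pipeline.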
New release of COCI
It has not yet been advertised on the website (that will happen on Monday, 2 August 2021); however, the new release of COCI was published on Figshare a few days ago [2]. It contains more than 1.09 billion citations (including the Elsevier ones). I know that it would take time, but I would love to see updated figures in the article that include this new release (even in comparison with the version of COCI currently considered in the article).
Final remarks
I do believe that this paper deserves to be published in Scientometrics, considering the importance of the topic addressed. However, all the aspects discussed above should be carefully addressed before acceptance.