Using OpenAlex

Introduction

OpenAlex is

An open and comprehensive catalog of scholarly papers, authors, institutions, and more.

Inspired by the ancient Library of Alexandria, OpenAlex is an index of hundreds of millions of interconnected entities across the global research system. We’re 100% free and open source, and offer access via a web interface, an API, and snapshots of the full database.

According to its FAQ, OpenAlex disambiguates authors:

Do you disambiguate authors?
Yes. Using coauthors, references, and other features of the data, we can tell that the same Jane Smith wrote both “Frog behavior” and “Frogs: A retrospective,” but it’s a different Jane Smith who wrote “Oats before boats: The breakfast customs of 17th-Century Dutch bargemen.”

This makes OpenAlex an interesting resource for disambiguated author names.
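
To get a feel for what these disambiguated author records look like, the sketch below searches the authors endpoint for a name and prints the distinct OpenAlex author IDs it returns. This is a minimal sketch, assuming the authors endpoint accepts a search parameter and that each result carries id, display_name, works_count and orcid fields as described in the public API documentation.

# Minimal sketch: authors sharing a name come back as separate,
# disambiguated records, each with its own OpenAlex ID.
# (search parameter and result fields assumed from the API docs)
import requests

r = requests.get(
    "https://api.openalex.org/authors",
    params={"search": "Jane Smith",
            "per-page": "5",
            "mailto": "meik.bittkowski@sciencemediacenter.de"},
)
r.raise_for_status()
for author in r.json()["results"]:
    print(author["id"], author["display_name"],
          author.get("works_count"), author.get("orcid"))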

Given the DOIs from our set of COVID-19-related preprints, we will try to get the following (a single-work sketch of the relevant fields follows this list):

  • additional and deduplicated data about authorships
  • data about the open access status of publications
  • data about possible retractions of publications
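
The sketch below fetches one work by DOI and shows where these three pieces of information sit in the response: the authorships list, the open_access object, and the is_retracted flag (the same fields the batch query below relies on). example_doi is a hypothetical placeholder, not a DOI from our dataset, and the single-entity lookup via https://api.openalex.org/works/https://doi.org/... is assumed from the API documentation.

# Single-work sketch; example_doi is a placeholder, substitute any DOI
# from the preprint set.
import requests

example_doi = "10.xxxx/example"
r = requests.get(
    f"https://api.openalex.org/works/https://doi.org/{example_doi}",
    params={"mailto": "meik.bittkowski@sciencemediacenter.de"},
)
if r.ok:
    work = r.json()
    # disambiguated authors of this work
    for authorship in work["authorships"]:
        print(authorship["author"]["id"], authorship["author"]["display_name"])
    # open access status and retraction flag
    print(work["open_access"])
    print(work["is_retracted"])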
Imports
import pandas as pd        # tabular data handling
import requests            # HTTP calls to the OpenAlex API
import time                # not used below; handy for pausing between requests

from funcy import chunks   # split the DOI list into fixed-size batches
Loading preprints
# INPUT_FILE: path to the CSV with the COVID-19 preprint metadata (defined elsewhere, e.g. as a notebook parameter)
df = pd.read_csv(INPUT_FILE)
dois = df.dropna(subset=["doi"])["doi"].values

Query OpenAlex

We query the works endpoint in batches of 50 DOIs, joined with | into a single OR filter, and pass our e-mail address via the mailto parameter so the requests are handled by OpenAlex's polite pool.
params = {"mailto":"meik.bittkowski@sciencemediacenter.de",
          "per-page":"50"}

authorship_data = []
open_access_data = []
retraction_data = []

for count, doi_chunk in enumerate(chunks(50, dois), start=1):
    # progress report every 10 chunks (500 DOIs)
    if count % 10 == 0:
        print(f"processed {count*50} / {len(dois)} ({count*50/len(dois)*100:.2f} %)")

    try:
        # one request per chunk: OR filter on up to 50 DOIs, joined with |
        doi_filter = '|'.join('https://doi.org/' + doi for doi in doi_chunk)
        r = requests.get(f"https://api.openalex.org/works?filter=doi:{doi_filter}", params=params)
    except requests.RequestException as err:
        print(f"Unexpected {err=}, {type(err)=}, {count=}, {doi_chunk=}")
    else:
        if r.ok:
            payload = r.json()
            print(payload["meta"])
            for work in payload["results"]:
                # one row per authorship, tagged with the work's DOI
                authorships_df = pd.json_normalize(work, record_path="authorships")
                authorships_df["doi"] = work["doi"]
                authorship_data.append(authorships_df)

                # open access status of the work
                open_access_df = pd.json_normalize(work["open_access"])
                open_access_df["doi"] = work["doi"]
                open_access_data.append(open_access_df)

                # retraction flag of the work
                is_retracted_df = pd.DataFrame([{"doi": work["doi"], "is_retracted": work["is_retracted"]}])
                retraction_data.append(is_retracted_df)

oa_authorships = pd.concat(authorship_data)
oa_open_access = pd.concat(open_access_data)
oa_is_retracted = pd.concat(retraction_data)
Save results
oa_authorships.to_csv(OUTPUT_FILE_AUTHORS, index=False)
oa_open_access.to_csv(OUTPUT_FILE_OPEN_ACCESS, index=False)
oa_is_retracted.to_csv(OUTPUT_FILE_RETRACTIONS, index=False)
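
As a quick sanity check after saving, we can summarize the collected frames. Note that the open-access column names used here (oa_status, is_oa) are assumptions based on the flattened open_access object in the OpenAlex works schema, not verified output of this run.

# Quick sanity check of the collected data
# (oa_status column assumed from the flattened open_access object)
print(len(oa_authorships), "authorship rows for", oa_authorships["doi"].nunique(), "works")
print(oa_open_access["oa_status"].value_counts(dropna=False))
print(oa_is_retracted["is_retracted"].value_counts())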