Using OpenAlex

Introduction

OpenAlex is

An open and comprehensive catalog of scholarly papers, authors, institutions, and more.

Inspired by the ancient Library of Alexandria, OpenAlex is an index of hundreds of millions of interconnected entities across the global research system. We’re 100% free and open source, and offer access via a web interface, an API, and snapshots of the full database.

According to its FAQ, OpenAlex disambiguates authors:

Do you disambiguate authors?
Yes. Using coauthors, references, and other features of the data, we can tell that the same Jane Smith wrote both “Frog behavior” and “Frogs: A retrospective,” but it’s a different Jane Smith who wrote “Oats before boats: The breakfast customs of 17th-Century Dutch bargemen.”

This makes OpenAlex an interesting resource for disambiguated author names.
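
To get a feel for what these disambiguated author records look like, the sketch below searches the authors endpoint for a name and prints the distinct OpenAlex author IDs it returns. This is a minimal sketch, assuming the authors endpoint accepts a search parameter and that each result carries id, display_name, works_count and orcid fields as described in the public API documentation.

# Minimal sketch: authors sharing a name come back as separate,
# disambiguated records, each with its own OpenAlex ID.
# (search parameter and result fields assumed from the API docs)
import requests

r = requests.get(
    "https://api.openalex.org/authors",
    params={"search": "Jane Smith",
            "per-page": "5",
            "mailto": "meik.bittkowski@sciencemediacenter.de"},
)
r.raise_for_status()
for author in r.json()["results"]:
    print(author["id"], author["display_name"],
          author.get("works_count"), author.get("orcid"))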

Given the DOIs from our set of COVID-19-related preprints, we will try to get the following (a single-work sketch of the relevant fields follows this list):

  • additional and deduplicated data about authorships
  • data about the open access status of publications
  • data about possible retractions of publications
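
The sketch below fetches one work by DOI and shows where these three pieces of information sit in the response: the authorships list, the open_access object, and the is_retracted flag (the same fields the batch query below relies on). example_doi is a hypothetical placeholder, not a DOI from our dataset, and the single-entity lookup via https://api.openalex.org/works/https://doi.org/... is assumed from the API documentation.

# Single-work sketch; example_doi is a placeholder, substitute any DOI
# from the preprint set.
import requests

example_doi = "10.xxxx/example"
r = requests.get(
    f"https://api.openalex.org/works/https://doi.org/{example_doi}",
    params={"mailto": "meik.bittkowski@sciencemediacenter.de"},
)
if r.ok:
    work = r.json()
    # disambiguated authors of this work
    for authorship in work["authorships"]:
        print(authorship["author"]["id"], authorship["author"]["display_name"])
    # open access status and retraction flag
    print(work["open_access"])
    print(work["is_retracted"])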
Imports
import pandas as pd        # tabular data handling
import requests            # HTTP calls to the OpenAlex API
import time                # not used below; handy for pausing between requests

from funcy import chunks   # split the DOI list into fixed-size batches
Loading preprints
# INPUT_FILE: path to the CSV with the COVID-19 preprint metadata (defined elsewhere, e.g. as a notebook parameter)
df = pd.read_csv(INPUT_FILE)
dois = df.dropna(subset=["doi"])["doi"].values

Query OpenAlex

We query the works endpoint in batches of 50 DOIs, joined with | into a single OR filter, and pass our e-mail address via the mailto parameter so the requests are handled by OpenAlex's polite pool.
params = {"mailto":"meik.bittkowski@sciencemediacenter.de",
          "per-page":"50"}

authorship_data = []
open_access_data = []
retraction_data = []

for count, doi_chunk in enumerate(chunks(50, dois), start=1):
    # progress report every 10 chunks (500 DOIs)
    if count % 10 == 0:
        print(f"processed {count*50} / {len(dois)} ({count*50/len(dois)*100:.2f} %)")

    try:
        # one request per chunk: OR filter on up to 50 DOIs, joined with |
        doi_filter = '|'.join('https://doi.org/' + doi for doi in doi_chunk)
        r = requests.get(f"https://api.openalex.org/works?filter=doi:{doi_filter}", params=params)
    except requests.RequestException as err:
        print(f"Unexpected {err=}, {type(err)=}, {count=}, {doi_chunk=}")
    else:
        if r.ok:
            payload = r.json()
            print(payload["meta"])
            for work in payload["results"]:
                # one row per authorship, tagged with the work's DOI
                authorships_df = pd.json_normalize(work, record_path="authorships")
                authorships_df["doi"] = work["doi"]
                authorship_data.append(authorships_df)

                # open access status of the work
                open_access_df = pd.json_normalize(work["open_access"])
                open_access_df["doi"] = work["doi"]
                open_access_data.append(open_access_df)

                # retraction flag of the work
                is_retracted_df = pd.DataFrame([{"doi": work["doi"], "is_retracted": work["is_retracted"]}])
                retraction_data.append(is_retracted_df)

oa_authorships = pd.concat(authorship_data)
oa_open_access = pd.concat(open_access_data)
oa_is_retracted = pd.concat(retraction_data)
Save results
oa_authorships.to_csv(OUTPUT_FILE_AUTHORS, index=False)
oa_open_access.to_csv(OUTPUT_FILE_OPEN_ACCESS, index=False)
oa_is_retracted.to_csv(OUTPUT_FILE_RETRACTIONS, index=False)
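
As a quick sanity check after saving, we can summarize the collected frames. Note that the open-access column names used here (oa_status, is_oa) are assumptions based on the flattened open_access object in the OpenAlex works schema, not verified output of this run.

# Quick sanity check of the collected data
# (oa_status column assumed from the flattened open_access object)
print(len(oa_authorships), "authorship rows for", oa_authorships["doi"].nunique(), "works")
print(oa_open_access["oa_status"].value_counts(dropna=False))
print(oa_is_retracted["is_retracted"].value_counts())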