Processing full texts

Introduction

The preprint dataset queried from Dimensions contains references to the full texts of preprints if available.

To process full texts, we first download all available PDF files in parallel. We then use GROBID to parse those PDF files into a standardised XML representation, i.e. XML-TEI.

GROBID needs to be installed separately (see this guide for a Docker deployment). Configuration options for the GROBID client are stored in a JSON file like the one at ../cfg/grobid_cfg.json.
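
For orientation, a configuration file of this kind might be generated as follows. This is only a sketch: the key names follow the example config shipped with recent versions of the GROBID Python client and may differ in the version you use. The server URL, the numeric values, and the output path are placeholder assumptions, not the contents of ../cfg/grobid_cfg.json.

```python
import json
import os
import tempfile

# Assumed settings; key names may vary between grobid_client versions.
grobid_cfg = {
    "grobid_server": "http://localhost:8070",  # assumed local Docker deployment
    "batch_size": 100,   # assumed number of PDFs handed to the client per batch
    "sleep_time": 5,     # assumed seconds to wait before retrying a busy server
    "timeout": 60,       # assumed request timeout in seconds
}

# Written to a temporary directory so the sketch cannot overwrite a real config.
cfg_path = os.path.join(tempfile.mkdtemp(), "grobid_cfg.json")
with open(cfg_path, "w", encoding="utf-8") as f:
    json.dump(grobid_cfg, f, indent=4)

# Read the file back to confirm that it is valid JSON.
with open(cfg_path, encoding="utf-8") as f:
    loaded_cfg = json.load(f)
print(loaded_cfg["grobid_server"])
```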

Imports and Login

import pandas as pd
import requests
import os
from grobid_client.grobid_client import GrobidClient
from slugify import slugify

from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

Download full texts

Load data

df = pd.read_csv(INPUT_FILE_PREPRINTS)

Show how many records have a missing linkout field (True = missing)

df.linkout.isna().value_counts()

False    31773
True     11912
Name: linkout, dtype: int64
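
To put these counts in proportion, the share of preprints with a linkout can be worked out directly. The snippet below is a small illustration that copies the two numbers from the output above; the variable names are ours.

```python
# Counts copied from the value_counts() output above
with_linkout = 31773     # linkout present (isna() is False)
without_linkout = 11912  # linkout missing (isna() is True)

total = with_linkout + without_linkout
share_with_linkout = with_linkout / total
print(f"{total} preprints, {share_with_linkout:.1%} with a linkout")
```

So roughly 73% of the records are candidates for a full-text download.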

Download helpers

# cf. https://opensourceoptions.com/blog/use-python-to-download-multiple-files-or-urls-in-parallel/
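def download_url(args):
    """Download one (url, filename) pair; save the response only if it is a PDF."""
    url, fn = args
    try:
        # A timeout (60 s chosen here) keeps a stalled server from blocking a worker thread
        r = requests.get(url, timeout=60)
        # Default to '' so a missing Content-Type header does not raise a TypeError
        content_type = r.headers.get('content-type', '')

        if 'application/pdf' in content_type:
            with open(fn, 'wb') as f:
                f.write(r.content)
            return f"{url} is PDF"
        return url
    except Exception as e:
        print('Exception in download_url():', e)
        # Return a marker instead of None so the failed URL shows up in the log
        return f"{url} failed"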

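def download_parallel(args):
    # Keep one core free, but always use at least one worker (cpu_count() may be 1)
    n_workers = max(1, cpu_count() - 1)
    # The context manager shuts the pool down once all results are consumed
    with ThreadPool(n_workers) as pool:
        for url in pool.imap_unordered(download_url, args):
            print('url:', url)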
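
The thread-pool pattern used here can be tried without any network access. In the sketch below, fake_download is a hypothetical stand-in for download_url, and the example.org URLs are made-up inputs in the same (url, filename) shape as the real ones.

```python
from multiprocessing.pool import ThreadPool

def fake_download(args):
    """Hypothetical stand-in for download_url(): no network access, just echoes its input."""
    url, fn = args
    return f"{url} -> {fn}"

# Made-up (url, filename) pairs in the same shape as the real inputs
demo_inputs = [(f"https://example.org/{i}.pdf", f"{i}.pdf") for i in range(5)]

with ThreadPool(2) as pool:
    # imap_unordered yields each result as soon as its worker finishes,
    # so the order of demo_results may differ from the order of demo_inputs
    demo_results = list(pool.imap_unordered(fake_download, demo_inputs))

print(sorted(demo_results))
```

Because downloads are I/O-bound, threads are usually sufficient here; the results arrive in completion order, which is why the log lines do not follow the input order.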

Download PDF documents

# an empty value makes requests fall back to its default CA bundle for the downloads
os.environ["REQUESTS_CA_BUNDLE"] = ""
df["filename"] = df.doi.dropna().apply(lambda x: f"{OUTPUT_PATH_FULLTEXTS_RAW}/{slugify(x)}.pdf")
download_inputs = list(df[["linkout", "filename"]].dropna().itertuples(index=False, name=None))
download_inputs = download_inputs[:TEST_N] if TEST else download_inputs

download_parallel(download_inputs)
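
A server can label a response as application/pdf without sending a valid PDF, so it may be worth checking the downloaded files before parsing them. The following is a minimal sketch, assuming that a usable PDF starts with the bytes %PDF-. looks_like_pdf is a hypothetical helper, not part of the pipeline above, and the two temporary files stand in for real downloads.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file starts with the PDF magic bytes '%PDF-'."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"

# Two stand-in files: one with a PDF header, one containing an HTML error page
demo_dir = tempfile.mkdtemp()
good_path = os.path.join(demo_dir, "good.pdf")
bad_path = os.path.join(demo_dir, "bad.pdf")
with open(good_path, "wb") as f:
    f.write(b"%PDF-1.7\n...")
with open(bad_path, "wb") as f:
    f.write(b"<html>Access denied</html>")

pdf_check = {os.path.basename(p): looks_like_pdf(p) for p in (good_path, bad_path)}
print(pdf_check)
```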

Parse full texts

Parse PDF documents with GROBID

# adds additional information about certificates if necessary
os.environ["REQUESTS_CA_BUNDLE"] = os.environ.get("REQUESTS_CA_BUNDLE_EXTRA", "")
client = GrobidClient(config_path=PATH_TO_GROBID_CONFIG)
client.process("processFulltextDocument", OUTPUT_PATH_FULLTEXTS_RAW, output=OUTPUT_PATH_FULLTEXTS_PARSED, n=20)
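
GROBID writes one TEI XML file per parsed PDF. As a rough sketch of how such a file could be read afterwards, the snippet below extracts the title and the body paragraphs using only the standard library. The inline document is a hand-written, heavily reduced stand-in for real GROBID output, and extract_title_and_paragraphs is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# TEI elements live in this namespace, so every lookup needs the prefix mapping
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written, reduced example; real GROBID output contains far more detail
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt>
    <title>An example preprint</title>
  </titleStmt></fileDesc></teiHeader>
  <text><body>
    <div><head>Introduction</head><p>First paragraph.</p><p>Second <ref>[1]</ref> paragraph.</p></div>
  </body></text>
</TEI>"""

def extract_title_and_paragraphs(tei_xml):
    """Return (title, list of paragraph texts) from a TEI XML string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS)
    # itertext() also collects text nested in inline elements such as <ref>
    paragraphs = [
        "".join(p.itertext()).strip()
        for p in root.iterfind(".//tei:body//tei:p", TEI_NS)
    ]
    return title, paragraphs

title, paragraphs = extract_title_and_paragraphs(SAMPLE_TEI)
print(title)
print(paragraphs)
```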