Using reference details from Semantic Scholar

Intro

The Semantic Scholar API is a good source for enriching data about scientific publications with information aggregated or computed by Semantic Scholar.

This notebook shows how to use the Semantic Scholar API to add citation intents and related reference details (citation contexts, influence flags, fields of study) to our dataset of COVID-19 related preprints.

Imports
import pandas as pd
import requests
from ratelimit import limits, sleep_and_retry
from funcy import merge, compact, partial
Read input data
df = pd.read_csv(INPUT_FILE).dropna(subset=["doi"])
df_sample = df.sample(SAMPLE_SIZE, random_state=42).copy()
Get Reference Details from Semantic Scholar
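The request in this section targets the `references` endpoint of the Semantic Scholar Graph API, with one call per DOI. As a minimal sketch, the URL can be assembled from its parts as shown here. The base URL, field list, and limit are the ones used in this notebook; the helper name `build_references_url` and the percent-encoding of the DOI are assumptions added for illustration, not part of the original code.

```python
from urllib.parse import quote, urlencode

# Base URL, fields, and limit as used in this notebook's request.
S2_GRAPH_BASE = "https://api.semanticscholar.org/graph/v1"
REFERENCE_FIELDS = [
    "title", "intents", "contexts",
    "isInfluential", "fieldsOfStudy", "referenceCount",
]


def build_references_url(doi, fields=REFERENCE_FIELDS, limit=1000):
    """Hypothetical helper: build the references URL for a single DOI."""
    # Assumption: percent-encode unusual DOI characters, keeping ':' and '/'.
    paper_id = quote(f"DOI:{doi}", safe=":/")
    # Keep commas literal so the field list stays readable.
    query = urlencode({"fields": ",".join(fields), "limit": limit}, safe=",")
    return f"{S2_GRAPH_BASE}/paper/{paper_id}/references?{query}"


url = build_references_url("10.1101/2020.04.27.20081562")
print(url)
```

For ordinary DOIs this yields the same URL as a plain f-string; the encoding only matters for DOIs containing characters such as `<` or spaces.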
if REQUEST_DATA:
    @sleep_and_retry
    @limits(calls=100, period=5*60)
    def call_s2_api(doi):
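        try:
            # A timeout keeps a stalled connection from blocking the whole run.
            res = requests.get(f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}/references?fields=title,intents,contexts,isInfluential,fieldsOfStudy,referenceCount&limit=1000", timeout=30)
        except requests.RequestException as err:
            # Network-level failure: report it and fall through to the empty result.
            print("some error occurred", doi, err)
        else:
            if res.ok:
                return merge({"doi":doi}, res.json())

        # Empty dicts are dropped later by compact().
        return {}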

    def normalize_result(result):
        doi = result["doi"]
        ds = result["data"]

        df_ds = pd.DataFrame(map(partial(merge, {"doi":doi}), ds))
        if not df_ds.empty:
            df_ds = pd.concat([df_ds.drop(columns="citedPaper"), (pd.json_normalize(df_ds.citedPaper))], axis=1)
        return df_ds

    dois = df_sample.doi.values if SAMPLE else df.doi.values
    results = list(compact(map(call_s2_api, dois)))

    df_results = pd.concat(map(normalize_result, results))\
                    .explode("intents")\
                    .explode("contexts")\
                    .explode("fieldsOfStudy")

    df_results.to_csv(OUTPUT_REFERENCE_DETAILS, index=False)
else:
    df_results = pd.read_csv(OUTPUT_REFERENCE_DETAILS)
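To make the reshaping above concrete, the following sketch runs the same flatten-and-explode steps on a small hand-written payload. The payload is invented and only mimics the shape the code above expects (a `data` list whose items carry `intents`, `contexts`, `isInfluential`, and a nested `citedPaper`); it is not real API output, and it uses plain pandas instead of `funcy`.

```python
import pandas as pd

# Hypothetical payload; all values are invented for illustration.
mock_result = {
    "doi": "10.0000/example",
    "data": [
        {
            "isInfluential": True,
            "intents": ["background", "methodology"],
            "contexts": ["First citing sentence.", "Second citing sentence."],
            "citedPaper": {"paperId": "abc123", "title": "Example cited paper",
                           "referenceCount": 12, "fieldsOfStudy": ["Medicine"]},
        },
        {
            "isInfluential": False,
            "intents": [],
            "contexts": [],
            "citedPaper": {"paperId": None, "title": "Unresolved reference",
                           "referenceCount": None, "fieldsOfStudy": None},
        },
    ],
}

# One row per reference, with the citing DOI attached.
refs = pd.DataFrame(mock_result["data"]).assign(doi=mock_result["doi"])

# Flatten the nested citedPaper dict into regular columns.
flat = pd.concat(
    [refs.drop(columns="citedPaper"),
     pd.json_normalize(refs["citedPaper"].tolist())],
    axis=1,
)

# Each explode turns list entries into rows; an empty list becomes one NaN row.
exploded = flat.explode("intents").explode("contexts").explode("fieldsOfStudy")
print(len(flat), len(exploded))
```

Note that exploding several list columns in sequence yields one row per combination of intent, context, and field of study, so a single reference can occupy several rows in `df_results`.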
Distribution of citation intents
df_results.intents.value_counts()
background     1271
methodology     474
result          160
Name: intents, dtype: int64
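Since `df_results` holds one row per combination of intent, context, and field of study, the counts above are row counts and not counts of distinct references. A hedged sketch of how one might count each reference once per intent follows; the rows are invented, and the choice of `doi`, `paperId`, and `title` as the identifying columns is an assumption (`title` is included because `paperId` can be missing).

```python
import pandas as pd

# Invented rows in the shape of df_results: the first reference appears
# four times (2 contexts x 2 fields of study), the second once.
rows = pd.DataFrame({
    "doi": ["10.0000/a"] * 4 + ["10.0000/b"],
    "paperId": ["p1"] * 4 + ["p2"],
    "title": ["Paper one"] * 4 + ["Paper two"],
    "intents": ["background"] * 4 + ["result"],
    "contexts": ["c1", "c1", "c2", "c2", "c3"],
    "fieldsOfStudy": ["Medicine", "Biology", "Medicine", "Biology", "Economics"],
})

# Row-level counts, as in the cell above.
row_counts = rows.intents.value_counts()

# Reference-level counts: keep one row per (citing DOI, cited paper, intent).
ref_counts = (
    rows.drop_duplicates(["doi", "paperId", "title", "intents"])
        .intents.value_counts()
)

print(row_counts)
print(ref_counts)
```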
Sample of ‘background’ citations
df_results[df_results.intents == "background"].sample(5, random_state=42)
| | doi | isInfluential | contexts | intents | paperId | title | referenceCount | fieldsOfStudy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2796 | 10.2139/ssrn.3727001 | False | One possible reason for this setback is that, ... | background | NaN | Unemployment rate at four-decade high of 6.1% ... | NaN | NaN |
| 2720 | 10.31235/osf.io/cpkj6 | False | The challenges of making ends meet lead to con... | background | e00b9531b09a5cded39959a0dbebc3e6a76e0cb5 | Financial Arrangements and Relationship Qualit... | 87.0 | Medicine |
| 1175 | 10.2139/ssrn.3775182 | False | Our combined custom hedge may offer a viable a... | background | caea0d6ea0b6948d866054c1d9ea785664c483ce | Managing Disruption Risk: The Interplay Betwee... | 71.0 | Business |
| 2400 | 10.2139/ssrn.3747468 | True | The paper adds to many more studies that analy... | background | 6195e04648d0b1c6cee3d9f852dbd70769a68ce0 | The Economic Effects of Energy Price Shocks | 81.0 | Economics |
| 1901 | 10.1101/2020.04.27.20081562 | False | Preliminary findings from Italian researchers ... | background | NaN | Inquinamento dell’aria e pandemia da Covid-19:... | NaN | NaN |
Sample of ‘methodology’ citations
df_results[df_results.intents == "methodology"].sample(5, random_state=42)
| | doi | isInfluential | contexts | intents | paperId | title | referenceCount | fieldsOfStudy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2529 | 10.1101/2020.04.25.20079343 | True | For RNA viruses qPCR relies on RNA purificatio... | methodology | b1e1d7a7fbff3828aebbb0ec2eb94624520f1198 | Real-Time Reverse Transcription PCR as a Tool ... | 10.0 | Biology |
| 2549 | 10.1101/2020.04.25.20079343 | False | Once we have the estimated prevalence in the p... | methodology | 05dcb88ff4f9d06864a4624fca6852a0e818ee64 | DNA pooling in mutation detection with referen... | 19.0 | Medicine |
| 319 | 10.2139/ssrn.3851941 | False | For an analysis of the similarities between Ha... | methodology | 810bf91f3a7722fa16a9e383584b2010d14756f3 | Individualism and Economic Order | 0.0 | Sociology |
| 2803 | 10.2139/ssrn.3727001 | True | …individual level data from the SRS are not av... | methodology | c446b2986386bd133445c8fd8eb2215e14f0ec91 | The contribution of age-specific mortality tow... | 47.0 | Medicine |
| 749 | 10.1101/2020.05.24.113043 | False | Mass spectrometric measurements were carried o... | methodology | f4b85a6972b3f3fe42f58e0031ee46b7f224b656 | Online Parallel Accumulation–Serial Fragmentat... | 56.0 | Medicine |
Sample of ‘result’ citations
df_results[df_results.intents == "result"].sample(5, random_state=42)
| | doi | isInfluential | contexts | intents | paperId | title | referenceCount | fieldsOfStudy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2201 | 10.2139/ssrn.3747468 | True | The role of openness in the context of macroec... | result | NaN | Oil e ciency, demand, and prices: a tale of up... | NaN | NaN |
| 2204 | 10.2139/ssrn.3747468 | True | For instance, Bodenstein and Guerrieri (2018) ... | result | NaN | Oil e ciency, demand, and prices: a tale of up... | NaN | NaN |
| 2374 | 10.2139/ssrn.3747468 | False | The empirical work complements the evidence pr... | result | fa860aa11f242fbaa4f941818cdea2d94684f98d | Did Unexpectedly Strong Economic Growth Cause ... | 42.0 | Economics |
| 1553 | 10.20944/preprints202007.0007.v1 | False | This is a need for the students that came in a... | result | 3b7a3db6299d384cf8ec7bd96b9f442c059cac46 | Work–Life Balance? It Is Not about Balance, bu... | 8.0 | Medicine |
| 2154 | 10.2139/ssrn.3747468 | True | This di ers from the interpretation of oil-spe... | result | 025ef8d95b3b6b19f1ad171c0ee5f48c12d0f46f | Structural Interpretation of Vector Autoregres... | 58.0 | Economics |