Intro
The Semantic Scholar API is a useful source for enriching data about scientific publications with information aggregated or computed by Semantic Scholar.
This notebook shows how to use it to add citation intents and related reference details (citation contexts, influence flags, fields of study, reference counts) to our dataset of COVID-19-related preprints.
Imports and Login
import pandas as pd
import requests
from ratelimit import limits, sleep_and_retry
from funcy import merge, compact, partial
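The cells in this notebook refer to several constants (INPUT_FILE, SAMPLE, SAMPLE_SIZE, REQUEST_DATA, OUTPUT_REFERENCE_DETAILS) whose definitions are not shown here. A minimal sketch of such a configuration cell might look like the following; every value is a placeholder assumption, not the setting actually used.

```python
# Hypothetical configuration: the names come from the notebook's code,
# but all values below are placeholders chosen for illustration only.
INPUT_FILE = "covid_preprints.csv"                      # assumed path; must contain a "doi" column
OUTPUT_REFERENCE_DETAILS = "s2_reference_details.csv"   # assumed path for the cached API results
SAMPLE = True          # if True, query only a random sample of the DOIs
SAMPLE_SIZE = 100      # number of rows drawn when SAMPLE is True
REQUEST_DATA = True    # if False, reuse the cached CSV instead of calling the API
```

Setting REQUEST_DATA to False after a first run avoids repeating the rate-limited API calls.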
Read input data
df = pd.read_csv(INPUT_FILE).dropna(subset=["doi"])
df_sample = df.sample(SAMPLE_SIZE, random_state=42).copy()
Get Reference Details from Semantic Scholar
if REQUEST_DATA:
    # Throttle to at most 100 calls per 5 minutes; sleep until the window resets.
    @sleep_and_retry
    @limits(calls=100, period=5 * 60)
    def call_s2_api(doi):
        """Fetch the references of the paper with this DOI; return {} on failure."""
        try:
            res = requests.get(f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}/references?fields=title,intents,contexts,isInfluential,fieldsOfStudy,referenceCount&limit=1000")
        except requests.RequestException:
            print("some error occurred", doi)
        else:
            if res.ok:
                return merge({"doi": doi}, res.json())
        return {}

    def normalize_result(result):
        """Turn one API response into a DataFrame with one row per reference."""
        doi = result["doi"]
        ds = result["data"]
        # Tag every reference entry with the DOI of the citing paper.
        df_ds = pd.DataFrame(map(partial(merge, {"doi": doi}), ds))
        if not df_ds.empty:
            # Flatten the nested "citedPaper" dict into top-level columns.
            df_ds = pd.concat([df_ds.drop(columns="citedPaper"), pd.json_normalize(df_ds.citedPaper)], axis=1)
        return df_ds

    dois = df_sample.doi.values if SAMPLE else df.doi.values
    # compact() drops the empty dicts returned for failed requests.
    results = list(compact(map(call_s2_api, dois)))
    # One row per (intent, context, field of study) combination.
    df_results = (
        pd.concat(map(normalize_result, results))
        .explode("intents")
        .explode("contexts")
        .explode("fieldsOfStudy")
    )
    df_results.to_csv(OUTPUT_REFERENCE_DETAILS, index=False)
else:
    df_results = pd.read_csv(OUTPUT_REFERENCE_DETAILS)
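The @sleep_and_retry and @limits decorators keep call_s2_api to at most 100 calls per five-minute window. The following is a minimal standard-library sketch of that idea, not the ratelimit package's actual implementation; the name throttle and its clock and sleep parameters are hypothetical, introduced so the behavior can be exercised without real waiting.

```python
import functools
import time


def throttle(calls, period, clock=time.monotonic, sleep=time.sleep):
    """Allow at most `calls` invocations per `period` seconds (fixed window)."""
    def decorator(func):
        window_start = clock()
        used = 0

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal window_start, used
            now = clock()
            if now - window_start >= period:
                # The previous window has expired: start a fresh one.
                window_start, used = now, 0
            if used >= calls:
                # Budget exhausted: wait until the window ends, then reset.
                sleep(window_start + period - now)
                window_start, used = clock(), 0
            used += 1
            return func(*args, **kwargs)

        return wrapper
    return decorator


# Demo with a fake clock so nothing actually sleeps.
fake_now = [0.0]
slept = []


def fake_clock():
    return fake_now[0]


def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds


@throttle(calls=2, period=10, clock=fake_clock, sleep=fake_sleep)
def double(x):
    return 2 * x


outputs = [double(i) for i in range(3)]  # the third call has to wait
```

In the demo, the first two calls run immediately and the third one triggers a single simulated wait for the remainder of the window.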
Distribution of citation intents
df_results.intents.value_counts()
background     1271
methodology     474
result          160
Name: intents, dtype: int64
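Because df_results was exploded on intents, contexts and fieldsOfStudy, these counts are per exploded row, so a single citation with several contexts or fields of study may be counted more than once. The standard-library sketch below imitates the chained explode on one made-up reference (the helper explode_rows and all data values are hypothetical) and contrasts row counts with counts of distinct citations.

```python
from collections import Counter
from itertools import product


def explode_rows(row, list_columns):
    """Yield one flat dict per combination of the list-valued columns."""
    # An empty or missing list still yields one row with None in that column,
    # similar to how DataFrame.explode keeps such rows with NaN.
    options = [row.get(col) or [None] for col in list_columns]
    for combo in product(*options):
        yield {**row, **dict(zip(list_columns, combo))}


# Made-up reference entry, for illustration only.
reference = {
    "doi": "10.0000/example",
    "title": "A made-up cited paper",
    "intents": ["background"],
    "contexts": ["first citing sentence", "second citing sentence"],
    "fieldsOfStudy": ["Medicine", "Biology"],
}

rows = list(explode_rows(reference, ["intents", "contexts", "fieldsOfStudy"]))

# Counting rows: the single citation contributes 1 * 2 * 2 = 4 rows.
row_counts = Counter(r["intents"] for r in rows)

# Counting distinct (citing DOI, cited title, intent) triples instead.
distinct = {(r["doi"], r["title"], r["intents"]) for r in rows}
citation_counts = Counter(intent for _, _, intent in distinct)
```

In pandas, a comparable de-duplicated count could be obtained with something like df_results.drop_duplicates(subset=["doi", "title", "intents"]).intents.value_counts(), assuming doi and title together identify a citation well enough.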
Sample of ‘background’ citations
df_results[df_results.intents == "background"].sample(5, random_state=42)

| | doi | isInfluential | contexts | intents | paperId | title | referenceCount | fieldsOfStudy |
|---|---|---|---|---|---|---|---|---|
| 2796 | 10.2139/ssrn.3727001 | False | One possible reason for this setback is that, ... | background | NaN | Unemployment rate at four-decade high of 6.1% ... | NaN | NaN |
| 2720 | 10.31235/osf.io/cpkj6 | False | The challenges of making ends meet lead to con... | background | e00b9531b09a5cded39959a0dbebc3e6a76e0cb5 | Financial Arrangements and Relationship Qualit... | 87.0 | Medicine |
| 1175 | 10.2139/ssrn.3775182 | False | Our combined custom hedge may offer a viable a... | background | caea0d6ea0b6948d866054c1d9ea785664c483ce | Managing Disruption Risk: The Interplay Betwee... | 71.0 | Business |
| 2400 | 10.2139/ssrn.3747468 | True | The paper adds to many more studies that analy... | background | 6195e04648d0b1c6cee3d9f852dbd70769a68ce0 | The Economic Effects of Energy Price Shocks | 81.0 | Economics |
| 1901 | 10.1101/2020.04.27.20081562 | False | Preliminary findings from Italian researchers ... | background | NaN | Inquinamento dell’aria e pandemia da Covid-19:... | NaN | NaN |
Sample of ‘methodology’ citations
df_results[df_results.intents == "methodology"].sample(5, random_state=42)

| | doi | isInfluential | contexts | intents | paperId | title | referenceCount | fieldsOfStudy |
|---|---|---|---|---|---|---|---|---|
| 2529 | 10.1101/2020.04.25.20079343 | True | For RNA viruses qPCR relies on RNA purificatio... | methodology | b1e1d7a7fbff3828aebbb0ec2eb94624520f1198 | Real-Time Reverse Transcription PCR as a Tool ... | 10.0 | Biology |
| 2549 | 10.1101/2020.04.25.20079343 | False | Once we have the estimated prevalence in the p... | methodology | 05dcb88ff4f9d06864a4624fca6852a0e818ee64 | DNA pooling in mutation detection with referen... | 19.0 | Medicine |
| 319 | 10.2139/ssrn.3851941 | False | For an analysis of the similarities between Ha... | methodology | 810bf91f3a7722fa16a9e383584b2010d14756f3 | Individualism and Economic Order | 0.0 | Sociology |
| 2803 | 10.2139/ssrn.3727001 | True | …individual level data from the SRS are not av... | methodology | c446b2986386bd133445c8fd8eb2215e14f0ec91 | The contribution of age-specific mortality tow... | 47.0 | Medicine |
| 749 | 10.1101/2020.05.24.113043 | False | Mass spectrometric measurements were carried o... | methodology | f4b85a6972b3f3fe42f58e0031ee46b7f224b656 | Online Parallel Accumulation–Serial Fragmentat... | 56.0 | Medicine |
Sample of ‘result’ citations
df_results[df_results.intents == "result"].sample(5, random_state=42)

| | doi | isInfluential | contexts | intents | paperId | title | referenceCount | fieldsOfStudy |
|---|---|---|---|---|---|---|---|---|
| 2201 | 10.2139/ssrn.3747468 | True | The role of openness in the context of macroec... | result | NaN | Oil e ciency, demand, and prices: a tale of up... | NaN | NaN |
| 2204 | 10.2139/ssrn.3747468 | True | For instance, Bodenstein and Guerrieri (2018) ... | result | NaN | Oil e ciency, demand, and prices: a tale of up... | NaN | NaN |
| 2374 | 10.2139/ssrn.3747468 | False | The empirical work complements the evidence pr... | result | fa860aa11f242fbaa4f941818cdea2d94684f98d | Did Unexpectedly Strong Economic Growth Cause ... | 42.0 | Economics |
| 1553 | 10.20944/preprints202007.0007.v1 | False | This is a need for the students that came in a... | result | 3b7a3db6299d384cf8ec7bd96b9f442c059cac46 | Work–Life Balance? It Is Not about Balance, bu... | 8.0 | Medicine |
| 2154 | 10.2139/ssrn.3747468 | True | This di ers from the interpretation of oil-spe... | result | 025ef8d95b3b6b19f1ad171c0ee5f48c12d0f46f | Structural Interpretation of Vector Autoregres... | 58.0 | Economics |