Analysis of fine-grained intent classification

Intro

A brief analysis of fine-grained intent classification on our dataset of COVID-19-related preprints.

Imports and Login
import pandas as pd
Load input data
df_refs = (
    pd.read_csv(INPUT_FILE_REFS)
    .dropna(subset=["contexts"])
    .drop_duplicates(subset=["paperId", "contexts"])
    .reset_index(drop=True)
)


df_multicite = pd.read_csv(INPUT_FILE_MULTICITE).rename(
    columns={"label": "multicite_label", "score": "multicite_score"}
)
df = pd.concat([df_refs, df_multicite], axis=1)
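`pd.concat(..., axis=1)` aligns rows by index label, not by a shared key, so the join above is only meaningful if both frames have the same rows in the same order. A minimal sketch of a guard for that assumption follows. The helper name `concat_aligned` and the toy frames are hypothetical stand-ins for the real inputs.

```python
import pandas as pd


def concat_aligned(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Concatenate column-wise, refusing frames whose row indexes differ."""
    if not left.index.equals(right.index):
        raise ValueError(
            f"row index mismatch: {len(left)} rows vs {len(right)} rows"
        )
    return pd.concat([left, right], axis=1)


# Toy stand-ins for df_refs and df_multicite (values are illustrative only).
refs = pd.DataFrame({"paperId": ["a", "b"], "contexts": ["ctx 1", "ctx 2"]})
labels = pd.DataFrame(
    {"multicite_label": ["uses", "background"], "multicite_score": [0.9, 0.8]}
)

combined = concat_aligned(refs, labels)
print(combined)
```

If the classifier output was produced from the unfiltered reference file, the indexes would differ and this check would raise instead of silently pairing labels with the wrong contexts.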
Distribution of isInfluential
df.isInfluential.value_counts()
False    134195
True      73821
Name: isInfluential, dtype: int64
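The raw counts are easier to compare as a share. A small sketch, rebuilding a boolean column from the counts printed above rather than from the real data frame:

```python
import pandas as pd

# Rebuild a boolean column with the same counts as the output above.
is_influential = pd.Series(
    [False] * 134195 + [True] * 73821, name="isInfluential"
)

# The mean of a boolean column is the fraction of True values.
influential_share = is_influential.mean()
print(f"{influential_share:.1%} of rows are flagged as influential")
```

On the real frame, `df.isInfluential.value_counts(normalize=True)` should give the same proportions for both classes.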
Distribution of intention classes
df.multicite_label.value_counts()
df.multicite_label.value_counts().plot(kind="bar")
[Figure: bar chart of the number of rows per multicite_label]

Intention classes by paper
df.pivot_table(index="doi", columns="multicite_label", values=["multicite_score"], aggfunc="count")
multicite_score
multicite_label              background  differences  extends  future_work  motivation  similarities  uses
doi
10.1101/2020.08.02.20129767        38.0          3.0      NaN          NaN         NaN           6.0   1.0
10.1101/2020.08.05.20169060        89.0          NaN      NaN          NaN         2.0           6.0   4.0
10.1101/2020.08.06.20168294        12.0          1.0      NaN          NaN         1.0           1.0   3.0
10.1101/2020.08.06.20169573        33.0          NaN      NaN          NaN         NaN           NaN   NaN
10.1101/2020.08.06.20169581        87.0          NaN      NaN          NaN         3.0           NaN  12.0
...                                 ...          ...      ...          ...         ...           ...   ...
10.5194/se-2020-155               116.0          NaN      3.0          NaN         1.0           3.0   7.0
10.5194/se-2020-194                14.0          NaN      NaN          NaN         NaN           1.0   2.0
10.5194/se-2020-200                62.0          NaN      NaN          2.0         NaN          12.0   9.0
10.5194/se-2020-203                81.0          2.0      NaN          NaN         2.0           6.0  12.0
10.5194/tc-2020-330                16.0          1.0      NaN          1.0         NaN           3.0   6.0

5247 rows × 7 columns

Statistics of intention class occurrences in papers
df.pivot_table(index="doi", columns="multicite_label", values=["multicite_score"], aggfunc="count").describe()
multicite_score
multicite_label   background  differences     extends  future_work   motivation  similarities         uses
count            5106.000000  1539.000000  415.000000   396.000000  1078.000000   2152.000000  3706.000000
mean               32.009597     2.957115    2.079518     1.631313     2.144712      3.623141     7.664868
std                40.483978     4.832646    2.119826     1.107136     1.796888      4.013191    10.844226
min                 1.000000     1.000000    1.000000     1.000000     1.000000      1.000000     1.000000
25%                 9.000000     1.000000    1.000000     1.000000     1.000000      1.000000     2.000000
50%                21.000000     2.000000    1.000000     1.000000     1.000000      2.000000     4.000000
75%                40.000000     3.000000    2.000000     2.000000     3.000000      4.000000     9.000000
max               885.000000   156.000000   31.000000     7.000000    15.000000     62.000000   218.000000
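Note that `describe()` skips NaN, so each column above summarises only the papers in which that class occurs at least once. The mean of about 2.08 for `extends`, for example, is taken over 415 papers, not over all 5247. If per-paper statistics over every paper are wanted, one option is to pass `fill_value=0` to `pivot_table`. A sketch on made-up toy data:

```python
import pandas as pd

# Toy data: three papers, labels and scores made up for illustration.
toy = pd.DataFrame({
    "doi": ["p1", "p1", "p1", "p2", "p3", "p3"],
    "multicite_label": ["background", "background", "uses",
                        "background", "background", "uses"],
    "multicite_score": [0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
})

# Absent classes become NaN, so describe() ignores papers without the class.
sparse = toy.pivot_table(index="doi", columns="multicite_label",
                         values="multicite_score", aggfunc="count")

# fill_value=0 counts an absent class as zero occurrences instead.
dense = toy.pivot_table(index="doi", columns="multicite_label",
                        values="multicite_score", aggfunc="count",
                        fill_value=0)

print(sparse["uses"].mean())  # mean over papers that contain the class
print(dense["uses"].mean())   # mean over all papers
```

Which variant is appropriate depends on the question: the NaN version describes how often a class appears when it appears at all, while the zero-filled version describes a typical paper.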
Distribution of intention class and isInfluential
df.groupby(by=["multicite_label", "isInfluential"])["multicite_label"].count().plot(kind="barh")
[Figure: horizontal bar chart of row counts per (multicite_label, isInfluential) pair]
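Because the classes differ greatly in size, raw counts per (label, isInfluential) pair are dominated by `background`. A possible complement is the share of influential rows within each class. The sketch below uses a made-up toy frame with the same two column names:

```python
import pandas as pd

# Toy data for illustration only; the real frame is `df` from above.
toy = pd.DataFrame({
    "multicite_label": ["background", "background", "background",
                        "uses", "uses"],
    "isInfluential": [False, False, True, True, True],
})

# The mean of a boolean column per group is the influential share per class.
influential_share = toy.groupby("multicite_label")["isInfluential"].mean()
print(influential_share.sort_values(ascending=False))
```

Applied to the real data, the resulting Series could be plotted with `.plot(kind="barh")` in the same way as the counts.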