Finer-grained intent classification

Intro

The 🤗 model allenai/multicite-multilabel-scibert is a useful starting point for building custom citation intent classifiers with more classes than the S2 (Semantic Scholar) API offers. (Paper)

It is trained on scientific articles and can predict multiple labels for each input text, drawn from the following intents:

  • Background
  • Motivation
  • Future Work
  • Similar
  • Difference
  • Uses
  • Extension
  • Unsure
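Because the model is multi-label, a single citation context can carry several of these intents at once. A minimal sketch of how that could be read off the pipeline is shown here. The helper names `select_labels` and `classify_multilabel`, the 0.5 threshold, and the example scores are all hypothetical; passing `function_to_apply="sigmoid"` assumes the scores should be treated as independent per label, which is worth checking against the model card.

```python
from typing import List


def select_labels(scores: List[dict], threshold: float = 0.5) -> List[str]:
    """Return every label whose score reaches the threshold, highest first."""
    kept = [s for s in scores if s["score"] >= threshold]
    kept.sort(key=lambda s: s["score"], reverse=True)
    return [s["label"] for s in kept]


def classify_multilabel(texts, threshold: float = 0.5) -> List[List[str]]:
    """Hypothetical helper: run the model and keep all labels above the threshold."""
    # Imported lazily so select_labels can be used without transformers installed.
    from transformers import pipeline

    classifier = pipeline(
        "text-classification",
        model="allenai/multicite-multilabel-scibert",
    )
    # top_k=None returns a score for every label instead of only the best one.
    # function_to_apply="sigmoid" is an assumption (independent per-label scores).
    results = classifier(
        list(texts), top_k=None, function_to_apply="sigmoid", truncation=True
    )
    return [select_labels(r, threshold) for r in results]


# Made-up scores in the shape the pipeline returns for one input with top_k=None.
example = [
    {"label": "background", "score": 0.91},
    {"label": "uses", "score": 0.64},
    {"label": "motivation", "score": 0.08},
]
print(select_labels(example))
```

The threshold trades recall for precision: lowering it yields more labels per context, raising it yields fewer.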
Imports and Login
import pandas as pd
from transformers import pipeline
Infer multicite intents
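if INFER_INTENTS:
    df = pd.read_csv(INPUT_FILE)
    # Skip rows without a citation context and keep one row per (paper, context) pair.
    df_contexts = df.dropna(subset=["contexts"]).drop_duplicates(subset=["paperId", "contexts"]).copy()

    # Use a distinct name so the imported `pipeline` factory is not shadowed on re-runs.
    # device=1 selects the second GPU; use device=0 or device=-1 (CPU) as needed.
    classifier = pipeline("text-classification", model="allenai/multicite-multilabel-scibert", device=1)

    def data():
        # Cap each context at 512 characters (characters, not tokens).
        for context in df_contexts["contexts"]:
            yield context[:512]

    # truncation=True additionally guards against inputs over the model's token limit.
    outputs = list(classifier(data(), batch_size=128, truncation=True))

    # By default the pipeline returns only the top-scoring label per input.
    pd.DataFrame(outputs).to_csv(OUTPUT_FILE, index=False)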
df = pd.read_csv(OUTPUT_FILE)
df.label.value_counts()
background      163441
uses             28406
similarities      7797
differences       4551
motivation        2312
extends            863
future_work        646
Name: label, dtype: int64
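The raw counts above are easier to compare as shares of the total. The sketch below recomputes them in plain Python from the numbers printed above; in the notebook itself, `df.label.value_counts(normalize=True)` should give the same proportions directly. Only seven of the eight intents appear in this output, since "unsure" was never the top label.

```python
# Counts copied from the value_counts() output above.
counts = {
    "background": 163441,
    "uses": 28406,
    "similarities": 7797,
    "differences": 4551,
    "motivation": 2312,
    "extends": 863,
    "future_work": 646,
}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Print each label with its share of all classified contexts.
for label, share in shares.items():
    print(f"{label:<13}{share:7.2%}")
```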