Extracting citation statements

Introduction

The parsed full text can be processed further to serve different information needs. For example, literature references can be extracted together with the passages that cite them.

Imports
from bs4 import BeautifulSoup
from funcy import lfilter, walk_values, takewhile, first, rest
import json
from glob import glob
from pprint import pprint

Extract references

Helpers for extracting references
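def _extract_references(head, tail, acc=None):
    # Use None as the default: a mutable default list would be shared across calls.
    if acc is None:
        acc = []
    # funcy.first returns None once the sequence is exhausted.
    if head is None:
        return acc
    # Text nodes have no tag name; each one is a candidate citing passage.
    if not head.name:
        passage = head
        # Collect the bibliographic <ref> tags that directly follow the passage.
        references = list(takewhile(lambda el: el.name == "ref" and el.get("type") == "bibr", tail))
        if references:
            acc.append({"passage": passage, "references": references})
    return _extract_references(first(tail), list(rest(tail)), acc)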

def extract_references(contents):
    return _extract_references(first(contents), list(rest(contents)), [])
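
To see what the helper produces without a parsed publication at hand, the same pairing logic can be sketched iteratively on a small hand-written paragraph. This is an illustration, not the notebook's implementation: it uses the standard library's xml.etree.ElementTree instead of BeautifulSoup, the names flatten_paragraph, is_bibr and pair_passages_with_refs are hypothetical, and the sample paragraph is made up and omits the TEI namespace that real parsed files may declare.

```python
import xml.etree.ElementTree as ET
from itertools import takewhile


def flatten_paragraph(p):
    """Return the paragraph's direct children as a flat list of text runs and elements."""
    nodes = []
    if p.text:
        nodes.append(p.text)
    for child in p:
        nodes.append(child)
        if child.tail:  # text that follows the child element
            nodes.append(child.tail)
    return nodes


def is_bibr(node):
    """True for <ref type="bibr"> elements, False for text and any other element."""
    return not isinstance(node, str) and node.tag == "ref" and node.get("type") == "bibr"


def pair_passages_with_refs(p):
    """Pair each text run with the bibliographic references that directly follow it."""
    nodes = flatten_paragraph(p)
    pairs = []
    for i, node in enumerate(nodes):
        if not isinstance(node, str):
            continue
        references = list(takewhile(is_bibr, nodes[i + 1:]))
        if references:
            pairs.append({"passage": node, "references": references})
    return pairs


# Made-up sample paragraph (no TEI namespace, unlike real parsed files may have).
sample = ET.fromstring(
    '<p>First claim <ref type="bibr" target="#b0">[1,</ref>'
    '<ref type="bibr" target="#b1">2]</ref>. See '
    '<ref type="figure">Figure 1</ref> for details '
    '<ref type="bibr" target="#b2">[3]</ref>.</p>'
)

for pair in pair_passages_with_refs(sample):
    print(repr(pair["passage"]), [ref.get("target") for ref in pair["references"]])
```

In this sample, the text run ". See " produces no entry because it is followed by a figure reference rather than a bibliographic one. This mirrors how takewhile stops at the first non-matching element in the recursive helper.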
Extract references
# PATH_TO_PARSED_PUBLICATIONS and PATH_TO_EXTRACTED_REFERENCES must be defined beforehand.
def convert_to_str(x):
    # Recursively turn BeautifulSoup objects into plain strings so they can be serialized as JSON.
    if isinstance(x, list):
        return list(map(convert_to_str, x))
    elif isinstance(x, dict):
        return walk_values(convert_to_str, x)
    else:
        return str(x)

for parsed_publication in glob(f"{PATH_TO_PARSED_PUBLICATIONS}/*tei.xml"):
    with open(parsed_publication, "rb") as f:
        soup = BeautifulSoup(f, features="xml")

    # Pair every paragraph with the citing passages found in it.
    data = [
        {"paragraph": p, "extracted_references": extract_references(p.contents)}
        for p in soup.find_all("p")
    ]

    output_file = parsed_publication.split("/")[-1].split(".tei.xml")[0]
    with open(f"{PATH_TO_EXTRACTED_REFERENCES}/{output_file}.json", "w") as f:
        json.dump(convert_to_str(data), f)
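
The recursive string conversion above can in principle be delegated to the JSON encoder. json.dumps accepts a default callable that is invoked for any object it cannot serialize itself, and passing str converts such objects on the fly. The sketch below illustrates this with a hypothetical stand-in class, FakeTag, rather than real BeautifulSoup objects, so whether it reproduces the notebook's output exactly for parsed publications is an assumption worth checking before relying on it.

```python
import json


class FakeTag:
    """Hypothetical stand-in for a parsed element: not JSON-serializable, but has a string form."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


data = [
    {
        "paragraph": FakeTag('<p>Some claim <ref type="bibr">[1]</ref>.</p>'),
        "extracted_references": [
            {"passage": "Some claim ", "references": [FakeTag('<ref type="bibr">[1]</ref>')]}
        ],
    }
]

# default=str is called only for objects the encoder cannot handle itself.
serialized = json.dumps(data, default=str)
print(serialized)
```

The explicit conversion used in the notebook has the advantage of making the string form visible before anything is written to disk, while the default hook keeps the serialization step to a single call.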

Examples of extracted references

Load and print example references
extracted_references = glob(f"{PATH_TO_EXTRACTED_REFERENCES}/*json")
with open(extracted_references[0], "rb") as f:
    ex_refs = json.load(f)
print("Preview of first 10 paragraphs")
pprint(ex_refs[:10])
Preview of first 10 paragraphs
[{'extracted_references': [],
  'paragraph': '<p>Copyright JMIR Publications Inc.</p>'},
 {'extracted_references': [],
  'paragraph': '<p>Background: With the approval of two COVID-19 vaccines in '
               'Canada, many people feel a sense of relief, as hope is on the '
               'horizon. However, only about 75% of people in Canada plan to '
               'receive one of the vaccines.</p>'},
 {'extracted_references': [],
  'paragraph': '<p>The purpose of this study is to determine the reasons why '
               'people in Canada feel hesitant toward receiving a COVID-19 '
               'vaccine.</p>'},
 {'extracted_references': [],
  'paragraph': '<p>We screened 3915 tweets from public Twitter profiles in '
               'Canada by using the search words "vaccine" and "COVID." The '
               'tweets that met the inclusion criteria (ie, those about '
               'COVID-19 vaccine hesitancy) were coded via content analysis. '
               'Codes were then organized into themes and interpreted by using '
               'the Theoretical Domains Framework.</p>'},
 {'extracted_references': [],
  'paragraph': '<p>Results: Overall, 605 tweets were identified as those about '
               'COVID-19 vaccine hesitancy. Vaccine hesitancy stemmed from the '
               'following themes: concerns over safety, suspicion about '
               'political or economic forces driving the COVID-19 pandemic or '
               'vaccine development, a lack of knowledge about the vaccine, '
               'antivaccine or confusing messages from authority figures, and '
               'a lack of legal liability from vaccine companies. This study '
               'also examined mistrust toward the medical industry not due to '
               'hesitancy, but due to the legacy of communities marginalized '
               'by health care institutions. These themes were categorized '
               'into the following five Theoretical Domains Framework '
               'constructs: knowledge, beliefs about consequences, '
               'environmental context and resources, social influence, and '
               'emotion.</p>'},
 {'extracted_references': [],
  'paragraph': '<p>With the World Health Organization stating that one of the '
               'worst threats to global health is vaccine hesitancy, it is '
               'important to have a comprehensive understanding of the reasons '
               'behind this reluctance. By using a behavioral science '
               'framework, this study adds to the emerging knowledge about '
               'vaccine hesitancy in relation to COVID-19 vaccines by '
               'analyzing public discourse in tweets in real time. Health care '
               'leaders and clinicians may use this knowledge to develop '
               'public health interventions that are responsive to the '
               'concerns of people who are hesitant to receive vaccines.</p>'},
 {'extracted_references': [{'passage': 'The approval of the Pfizer-BioNTech '
                                       'and Moderna vaccines sent waves of '
                                       'excitement and relief across the '
                                       'world. However, some people remain '
                                       'hesitant about receiving a vaccine for '
                                       'COVID-19 ',
                            'references': ['<ref target="#b0" '
                                           'type="bibr">[1,</ref>',
                                           '<ref target="#b1" '
                                           'type="bibr">2]</ref>']},
                           {'passage': '. The World Health Organization noted '
                                       'in 2019 that one of the greatest '
                                       'threats to global health was vaccine '
                                       'hesitancy ',
                            'references': ['<ref target="#b2" '
                                           'type="bibr">[3]</ref>']},
                           {'passage': '. Emerging international evidence on '
                                       'COVID-19 vaccine hesitancy suggests '
                                       'that there is a range of reasons for '
                                       'this reluctance, including doubts '
                                       'about the safety and efficacy of the '
                                       'vaccine, political or pharmaceutical '
                                       'mistrust, belief in natural immunity, '
                                       'and the belief that the virus is mild '
                                       'or not life-threatening ',
                            'references': ['<ref target="#b3" '
                                           'type="bibr">[4]</ref>',
                                           '<ref target="#b4" '
                                           'type="bibr">[5]</ref>',
                                           '<ref target="#b5" '
                                           'type="bibr">[6]</ref>']}],
  'paragraph': '<p>The approval of the Pfizer-BioNTech and Moderna vaccines '
               'sent waves of excitement and relief across the world. However, '
               'some people remain hesitant about receiving a vaccine for '
               'COVID-19 <ref target="#b0" type="bibr">[1,</ref><ref '
               'target="#b1" type="bibr">2]</ref>. The World Health '
               'Organization noted in 2019 that one of the greatest threats to '
               'global health was vaccine hesitancy <ref target="#b2" '
               'type="bibr">[3]</ref>. Emerging international evidence on '
               'COVID-19 vaccine hesitancy suggests that there is a range of '
               'reasons for this reluctance, including doubts about the safety '
               'and efficacy of the vaccine, political or pharmaceutical '
               'mistrust, belief in natural immunity, and the belief that the '
               'virus is mild or not life-threatening <ref target="#b3" '
               'type="bibr">[4]</ref><ref target="#b4" '
               'type="bibr">[5]</ref><ref target="#b5" '
               'type="bibr">[6]</ref>.</p>'},
 {'extracted_references': [{'passage': 'For herd immunity to any communicable '
                                       'disease to be effective, a '
                                       'considerable portion of the population '
                                       'needs to be vaccinated or have '
                                       'antibodies present from being recently '
                                       'infected. Achieving herd immunity is '
                                       'difficult when a large portion of the '
                                       'public is not vaccinated. For herd '
                                       'immunity to be effective for measles '
                                       'and polio, 95% and 80% of the '
                                       'population need to be vaccinated, '
                                       'respectively ',
                            'references': ['<ref target="#b6" '
                                           'type="bibr">[7]</ref>']},
                           {'passage': '. The exact percentage required for '
                                       'herd immunity to COVID-19 is difficult '
                                       'to estimate ',
                            'references': ['<ref target="#b6" '
                                           'type="bibr">[7]</ref>']}],
  'paragraph': '<p>For herd immunity to any communicable disease to be '
               'effective, a considerable portion of the population needs to '
               'be vaccinated or have antibodies present from being recently '
               'infected. Achieving herd immunity is difficult when a large '
               'portion of the public is not vaccinated. For herd immunity to '
               'be effective for measles and polio, 95% and 80% of the '
               'population need to be vaccinated, respectively <ref '
               'target="#b6" type="bibr">[7]</ref>. The exact percentage '
               'required for herd immunity to COVID-19 is difficult to '
               'estimate <ref target="#b6" type="bibr">[7]</ref>.</p>'},
 {'extracted_references': [{'passage': 'A Statistics Canada survey conducted '
                                       'in September 2020 (before a vaccine '
                                       'was approved) indicated that 75% of '
                                       'Canadians were either likely or '
                                       'somewhat likely to receive a '
                                       'vaccination ',
                            'references': ['<ref type="bibr">[8]</ref>']},
                           {'passage': '. An Angus Reid Institute ',
                            'references': ['<ref target="#b3" '
                                           'type="bibr">[4]</ref>']},
                           {'passage': ' study conducted between December 8 '
                                       'and 11, 2020 found that 48% of '
                                       'Canadians sampled wanted to be '
                                       'vaccinated immediately if a vaccine '
                                       'was available, and 31% wanted to be '
                                       'vaccinated but preferred to wait. '
                                       'Additionally, 7% of respondents '
                                       'indicated that they were unsure if '
                                       'they would receive a vaccination, and '
                                       '14% indicated that they would not get '
                                       'vaccinated ',
                            'references': ['<ref target="#b3" '
                                           'type="bibr">[4]</ref>']}],
  'paragraph': '<p>A Statistics Canada survey conducted in September 2020 '
               '(before a vaccine was approved) indicated that 75% of '
               'Canadians were either likely or somewhat likely to receive a '
               'vaccination <ref type="bibr">[8]</ref>. An Angus Reid '
               'Institute <ref target="#b3" type="bibr">[4]</ref> study '
               'conducted between December 8 and 11, 2020 found that 48% of '
               'Canadians sampled wanted to be vaccinated immediately if a '
               'vaccine was available, and 31% wanted to be vaccinated but '
               'preferred to wait. Additionally, 7% of respondents indicated '
               'that they were unsure if they would receive a vaccination, and '
               '14% indicated that they would not get vaccinated <ref '
               'target="#b3" type="bibr">[4]</ref>.</p>'},
 {'extracted_references': [{'passage': 'In the context of influenza '
                                       'vaccinations, there remains a broad, '
                                       "ethical imperative to respect others' "
                                       'agency over personal health decisions '
                                       '(eg, choosing to not get vaccinated). '
                                       'However, from a public health ethics '
                                       'perspective, the decision to not be '
                                       'vaccinated creates a conflict between '
                                       'population safety and personal '
                                       'liberty ',
                            'references': ['<ref target="#b7" '
                                           'type="bibr">[9]</ref>']},
                           {'passage': '. As of yet, COVID-19 vaccination has '
                                       'not been deemed mandatory by any '
                                       'nation, but conversations about '
                                       'whether such a public mandate should '
                                       'exist are emerging ',
                            'references': ['<ref target="#b8" '
                                           'type="bibr">[10]</ref>']}],
  'paragraph': '<p>In the context of influenza vaccinations, there remains a '
               "broad, ethical imperative to respect others' agency over "
               'personal health decisions (eg, choosing to not get '
               'vaccinated). However, from a public health ethics perspective, '
               'the decision to not be vaccinated creates a conflict between '
               'population safety and personal liberty <ref target="#b7" '
               'type="bibr">[9]</ref>. As of yet, COVID-19 vaccination has not '
               'been deemed mandatory by any nation, but conversations about '
               'whether such a public mandate should exist are emerging <ref '
               'target="#b8" type="bibr">[10]</ref>. Whether vaccines are '
               'mandated, it is worthwhile for public institutions to '
               'understand how to change behaviors concerning vaccine '
               'hesitancy to ensure that informed decision-making practices '
               'are being exercised.</p>'}]
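
The JSON files store each reference as a markup string, so a downstream step usually has to parse it again. As a rough sketch, the function below (hypothetical name citation_rows) flattens the loaded structure into one row per passage and reference, using the standard library's ElementTree to read the target attribute. The sample input copies three entries from the preview above, with the paragraph values replaced by placeholders. It assumes the stored markup carries no namespace prefix, as is the case in that preview.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def citation_rows(paragraphs):
    """Flatten extracted references into one row per (passage, reference)."""
    rows = []
    for paragraph in paragraphs:
        for entry in paragraph["extracted_references"]:
            for ref_markup in entry["references"]:
                ref = ET.fromstring(ref_markup)
                rows.append({
                    "passage": entry["passage"].strip(),
                    "target": ref.get("target"),  # None when the ref has no target
                    "label": (ref.text or "").strip(),
                })
    return rows


# Entries copied from the preview; the "paragraph" values are placeholders.
sample = [
    {
        "paragraph": "<p>(omitted)</p>",
        "extracted_references": [
            {
                "passage": ". The World Health Organization noted in 2019 that one of "
                           "the greatest threats to global health was vaccine hesitancy ",
                "references": ['<ref target="#b2" type="bibr">[3]</ref>'],
            },
        ],
    },
    {
        "paragraph": "<p>(omitted)</p>",
        "extracted_references": [
            {
                "passage": "A Statistics Canada survey conducted in September 2020 "
                           "(before a vaccine was approved) indicated that 75% of "
                           "Canadians were either likely or somewhat likely to "
                           "receive a vaccination ",
                "references": ['<ref type="bibr">[8]</ref>'],
            },
            {
                "passage": ". An Angus Reid Institute ",
                "references": ['<ref target="#b3" type="bibr">[4]</ref>'],
            },
        ],
    },
]

rows = citation_rows(sample)
for row in rows:
    print(row["label"], row["target"], "|", row["passage"])

# Count how often each bibliography entry is cited in the sample.
print(Counter(row["target"] for row in rows))
```

Passages that follow an earlier citation begin with the leftover sentence-ending punctuation (". "), because each passage is the raw text between two references. A consumer that needs clean sentences would have to trim or re-segment these passages.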