...

As a test run, I selected a small paragraph from the work-in-progress COVID-19 Interim User Guide. [3] The text is a data collection example for the disease's signs and symptoms:

Data collection may include questions about groups of symptoms, such as

  • GI symptoms (nausea, vomiting, diarrhea)
  • Cough (non-productive, productive, or haemoptisis)

The next step was to extract named entities by running the signs and symptoms text through the BC5CDR biomedical model. A named entity is a span of text labeled with the type of thing it names. For BC5CDR, the entity types are DISEASE and CHEMICAL. This process is often referred to as named entity recognition (NER). These are the results:

| Entity | Entity Type |
| --- | --- |
| nausea | DISEASE |
| vomiting | DISEASE |
| diarrhea | DISEASE |
| Cough | DISEASE |
| haemoptisis | DISEASE |
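
To make the shape of NER output concrete, here is a toy sketch that tags the same text by dictionary lookup. It is only a stand-in for illustration and not how the BC5CDR model works (the model is statistical); `TOY_LEXICON` and `toy_ner` are hypothetical names introduced here.

```python
import re

# Hypothetical lexicon, for illustration only. The real model is trained
# on an annotated corpus and does not rely on a fixed word list.
TOY_LEXICON = {
    "nausea": "DISEASE",
    "vomiting": "DISEASE",
    "diarrhea": "DISEASE",
    "cough": "DISEASE",
    "haemoptisis": "DISEASE",
}


def toy_ner(text):
    """Return (text, start_char, end_char, label) tuples ordered by position."""
    found = []
    for term, label in TOY_LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.group(0), match.start(), match.end(), label))
    return sorted(found, key=lambda item: item[1])


text = (
    "GI symptoms (nausea, vomiting, diarrhea)\n"
    "Cough (non-productive, productive, or haemoptisis)"
)

for entity in toy_ner(text):
    print(entity)
```

Each tuple carries the same four pieces of information that the example code later in this article prints for every entity: the matched text, its start and end character offsets, and its label.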

The first surprise was that a programmatic method exists to link each discovered entity to the Unified Medical Language System (UMLS), which is maintained by the U.S. National Library of Medicine (NLM). This is appealing: when a term or concept is curated in the UMLS, a formal definition exists. Each concept in the UMLS has a Concept Unique Identifier (CUI). [4] This process is typically called named entity linking (NEL).

Let's take a look at the outcome, with green shaded rows indicating my preferred match:

| Entity | CUI | Name | Definition | Score |
| --- | --- | --- | --- | --- |
| nausea | C0027497 | Nausea | An unpleasant sensation in the stomach usually accompanied by the urge to vomit. Common causes are early pregnancy, sea and motion sickness, emotional stress, intense pain, food poisoning, and various enteroviruses. | 1.0 |
| nausea | C4085862 | Bothered by Nausea | A question about whether an individual is or was bothered by nausea. | 1.0 |
| nausea | C4255480 | Nausea:Presence or Threshold:Point in time:^Patient:Ordinal | None | 1.0 |
| nausea | C4084796 | How Often Nausea | A question about how often an individual has or had nausea. | 1.0 |
| nausea | C1963179 | Nausea Adverse Event | None | 1.0 |
| vomiting | C0042963 | Vomiting | The forcible expulsion of the contents of the STOMACH through the MOUTH. | 1.0 |
| vomiting | C4084767 | Bothered by Vomiting | A question about whether an individual is or was bothered by vomiting. | 0.9999999403953552 |
| vomiting | C4084768 | Usual Severity Vomiting | A question about the usual severity of an individual's vomiting. | 0.9999999403953552 |
| vomiting | C1963281 | Vomiting Adverse Event | None | 0.9999999403953552 |
| vomiting | C4084766 | How Much Distress Vomiting | A question about an individual's distress from their vomiting. | 0.9999999403953552 |
| diarrhea | C0011991 | Diarrhea | An increased liquidity or decreased consistency of FECES, such as running stool. Fecal consistency is related to the ratio of water-holding capacity of insoluble solids to total water, rather than the amount of water present. Diarrhea is not hyperdefecation or increased fecal weight. | 1.0 |
| diarrhea | C4084784 | How Much Distress Diarrhea | A question about an individual's distress from their diarrhea. | 1.0 |
| diarrhea | C4084802 | Usual Severity Diarrhea | A question about the usual severity of an individual's diarrhea. | 1.0 |
| diarrhea | C1963091 | Diarrhea Adverse Event | None | 1.0 |
| diarrhea | C3641756 | Have Diarrhea | A question about whether an individual has or had diarrhea. | 1.0 |
| Cough | C0010200 | Coughing | A sudden, audible expulsion of air from the lungs through a partially closed glottis, preceded by inhalation. It is a protective response that serves to clear the trachea, bronchi, and/or lungs of irritants and secretions, or to prevent aspiration of foreign materials into the lungs. | 1.0 |
| Cough | C1961131 | Cough Adverse Event | None | 1.0 |
| Cough | C3274924 | Have Been Coughing | A question about whether an individual is or has been coughing. | 1.0 |
| Cough | C3815497 | Cough (guaifenesin) | None | 1.0 |
| Cough | C4084725 | Usual Severity Cough | A question about the usual severity of an individual's cough. | 1.0 |
| haemoptisis | None returned | | | |

Notice that the table above does not include any UMLS concept for the named entity haemoptisis. After some online searching, it came to me as another surprise that this was due to a typographical error. After correcting it to "hemoptysis", a hit appears in the outcome, as follows:

| Entity | CUI | Name | Definition | Score |
| --- | --- | --- | --- | --- |
| hemoptysis | C0019079 | Hemoptysis | Expectoration or spitting of blood originating from any part of the RESPIRATORY TRACT, usually from hemorrhage in the lung parenchyma (PULMONARY ALVEOLI) and the BRONCHIAL ARTERIES. | 1.0 |
| hemoptysis | C0030424 | Paragonimiasis | Infection with TREMATODA of the genus PARAGONIMUS. | 0.7546218633651733 |
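
The missing link for haemoptisis hints at a simple quality check: flag every entity that comes back without a UMLS candidate, then look for a close spelling among known terms. Below is a minimal sketch using Python's standard `difflib` module. The `KNOWN_TERMS` list, the function names, the `linking_results` sample, and the 0.8 cutoff are all assumptions made for illustration; a real check would draw its vocabulary from a curated source.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of correctly spelled terms (assumption).
KNOWN_TERMS = ["nausea", "vomiting", "diarrhea", "cough", "hemoptysis"]


def unlinked_entities(candidates_by_entity):
    """Return the entities whose list of linked CUIs is empty."""
    return [entity for entity, cuis in candidates_by_entity.items() if not cuis]


def suggest_spelling(term, vocabulary=KNOWN_TERMS, cutoff=0.8):
    """Return at most one close match for the term from the vocabulary."""
    return get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)


# Sample linking outcome shaped after the tables above (first CUI only).
linking_results = {
    "nausea": ["C0027497"],
    "Cough": ["C0010200"],
    "haemoptisis": [],
}

for entity in unlinked_entities(linking_results):
    print(entity, "->", suggest_spelling(entity))
```

An empty suggestion list would mean the term has no near neighbor in the vocabulary, in which case a manual search is still needed.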

It suffices to mention that these CUIs can be looked up in the NCI Metathesaurus. This is the URL template: https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp?dictionary=NCI%20Metathesaurus&code={CUI}
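
The URL template above can be filled in programmatically. The sketch below is one way to do it; the function name `ncim_concept_url` is hypothetical, and the validation assumes a CUI is the letter C followed by seven digits, which matches every CUI shown in this article.

```python
import re

# URL template from the text, with the CUI as the only placeholder.
NCIM_URL_TEMPLATE = (
    "https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp"
    "?dictionary=NCI%20Metathesaurus&code={cui}"
)

# Assumed CUI shape: "C" followed by seven digits, e.g. C0027497.
CUI_PATTERN = re.compile(r"C\d{7}")


def ncim_concept_url(cui):
    """Build the NCI Metathesaurus concept report URL for a CUI."""
    if not CUI_PATTERN.fullmatch(cui):
        raise ValueError(f"Not a CUI: {cui!r}")
    return NCIM_URL_TEMPLATE.format(cui=cui)


for cui in ["C0027497", "C0042963", "C0019079"]:
    print(ncim_concept_url(cui))
```

Rejecting malformed identifiers early keeps entity text such as "nausea" from being passed in by mistake where a CUI is expected.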

Visualization

spaCy includes built-in visualization tools to display part-of-speech tags and syntactic dependencies. The following graphic is the rendition using the text described above:

...

We discussed named entity recognition, which can be displayed as such:

Example Code

```python
import scispacy
import spacy
from scispacy.umls_linking import UmlsEntityLinker
from spacy import displacy

# Note: UmlsEntityLinker and nlp.add_pipe(<component object>) follow the
# scispaCy / spaCy API this example was written against; newer releases
# may differ.
nlp = spacy.load("en_ner_bc5cdr_md")

linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)

text = """
Data collection may include questions about groups of symptoms, such as
  GI symptoms (nausea, vomiting, diarrhea)
  Cough (non-productive, productive, or haemoptisis)
"""

doc = nlp(text)

# Named entity recognition: print each entity with its offsets and label
for entity in doc.ents:
    print(entity.text, entity.start_char, entity.end_char, entity.label_)

    # Named entity linking: each candidate is a (CUI, score) tuple
    for concept_id, score in entity._.umls_ents:
        print(f"Name: {entity}")
        print(f"CUI: {concept_id}, Score {score}")
        print(linker.umls.cui_to_entity[concept_id])
        print()

# 'lightorange' is not a CSS color name, so 'orange' is used instead
colors = {
    'CHEMICAL': 'lightpink',
    'DISEASE': 'orange',
}

# Show NER. serve() blocks until the server is stopped, so the dependency
# view below starts only after the entity view is interrupted.
displacy.serve(doc, style="ent", host="127.0.0.1", options={'colors': colors})
displacy.serve(doc, style="dep", host="127.0.0.1")
```

The Road Ahead

At this point, there seem to be a lot of NLP opportunities and applications in standards development. Linkage to the UMLS will allow team members to ensure semantic meaning by referencing the curated definition. Quality will increase, as demonstrated by the spelling error I detected without setting out to. I can certainly see its utility in CDISC 360's biomedical concept authoring. Named entities can also be used as keywords or tags in Example Collection.

...

[3] CDISC Interim User Guide for COVID-19. https://www.cdisc.org/interim-user-guide-covid-19/

[4] Unique Identifiers in the Metathesaurus. https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html