Blog from April, 2020

Disclaimer

The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.

I am learning about natural language processing (NLP) as part of my self-teaching journey in Data Science.

My setup is simple: Python and an NLP package. scispaCy is an open-source library for processing biomedical text. [1] It ships with multiple pre-trained models, such as one trained on the BioCreative V Chemical Disease Relation (BC5CDR) corpus for biomedical terms. [2] Another example is a model trained on the BioNLP13CG corpus, which covers cancer genetics.

As a test run, I selected a small paragraph from the work-in-progress COVID-19 Interim User Guide. [3] This is a text about a data collection example for the disease's signs and symptoms:

Data collection may include questions about groups of symptoms, such as

  • GI symptoms (nausea, vomiting, diarrhea)
  • Cough (non-productive, productive, or haemoptisis)

The next step was to extract named entities by running the signs and symptoms text through the BC5CDR biomedical model. A named entity is a span of text labeled with the type of thing it names. For BC5CDR, the entity types are DISEASE and CHEMICAL. This process is often referred to as named entity recognition (NER). These are the results:

Entity | Entity Type
nausea | DISEASE
vomiting | DISEASE
diarrhea | DISEASE
Cough | DISEASE
haemoptisis | DISEASE

The first surprise was a programmatic method to link each discovered entity with the Unified Medical Language System (UMLS), which is maintained by the U.S. National Library of Medicine (NLM). This is appealing because, when a term or concept is curated in the UMLS, a formal definition exists. Each concept in the UMLS has a Concept Unique Identifier (CUI). [4] This process is typically called named entity linking (NEL).

Let's take a look at the outcome, with green shaded rows indicating my preferred match:

Entity | CUI | Name | Definition | Score
nausea | C0027497 | Nausea | An unpleasant sensation in the stomach usually accompanied by the urge to vomit. Common causes are early pregnancy, sea and motion sickness, emotional stress, intense pain, food poisoning, and various enteroviruses. | 1.0
nausea | C4085862 | Bothered by Nausea | A question about whether an individual is or was bothered by nausea. | 1.0
nausea | C4255480 | Nausea:Presence or Threshold:Point in time:^Patient:Ordinal | None | 1.0
nausea | C4084796 | How Often Nausea | A question about how often an individual has or had nausea. | 1.0
nausea | C1963179 | Nausea Adverse Event | None | 1.0
vomiting | C0042963 | Vomiting | The forcible expulsion of the contents of the STOMACH through the MOUTH. | 1.0
vomiting | C4084767 | Bothered by Vomiting | A question about whether an individual is or was bothered by vomiting. | 0.9999999403953552
vomiting | C4084768 | Usual Severity Vomiting | A question about the usual severity of an individual's vomiting. | 0.9999999403953552
vomiting | C1963281 | Vomiting Adverse Event | None | 0.9999999403953552
vomiting | C4084766 | How Much Distress Vomiting | A question about an individual's distress from their vomiting. | 0.9999999403953552
diarrhea | C0011991 | Diarrhea | An increased liquidity or decreased consistency of FECES, such as running stool. Fecal consistency is related to the ratio of water-holding capacity of insoluble solids to total water, rather than the amount of water present. Diarrhea is not hyperdefecation or increased fecal weight. | 1.0
diarrhea | C4084784 | How Much Distress Diarrhea | A question about an individual's distress from their diarrhea. | 1.0
diarrhea | C4084802 | Usual Severity Diarrhea | A question about the usual severity of an individual's diarrhea. | 1.0
diarrhea | C1963091 | Diarrhea Adverse Event | None | 1.0
diarrhea | C3641756 | Have Diarrhea | A question about whether an individual has or had diarrhea. | 1.0
Cough | C0010200 | Coughing | A sudden, audible expulsion of air from the lungs through a partially closed glottis, preceded by inhalation. It is a protective response that serves to clear the trachea, bronchi, and/or lungs of irritants and secretions, or to prevent aspiration of foreign materials into the lungs. | 1.0
Cough | C1961131 | Cough Adverse Event | None | 1.0
Cough | C3274924 | Have Been Coughing | A question about whether an individual is or has been coughing. | 1.0
Cough | C3815497 | Cough (guaifenesin) | None | 1.0
Cough | C4084725 | Usual Severity Cough | A question about the usual severity of an individual's cough. | 1.0
haemoptisis | None returned

Notice that the table above does not include any UMLS concept for the named entity haemoptisis. After some online searches, I was surprised to learn that the cause was a typographical error. After correcting it to "hemoptysis," a hit appears in the outcome, as follows:

Entity | CUI | Name | Definition | Score
hemoptysis | C0019079 | Hemoptysis | Expectoration or spitting of blood originating from any part of the RESPIRATORY TRACT, usually from hemorrhage in the lung parenchyma (PULMONARY ALVEOLI) and the BRONCHIAL ARTERIES. | 1.0
hemoptysis | C0030424 | Paragonimiasis | Infection with TREMATODA of the genus PARAGONIMUS. | 0.7546218633651733
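
The missing match for haemoptisis suggests a simple check: treat any entity that the linker cannot resolve as a possible misspelling. Below is a minimal sketch of that idea in plain Python. The `linked` dictionary is a stand-in for the linker output, and `flag_unlinked` and `known_terms` are hypothetical names introduced for illustration. The suggestion step uses the standard library's difflib, which is my assumption and not part of scispaCy.

```python
from difflib import get_close_matches

# Stand-in for linker output: entity text -> list of (CUI, score) candidates.
# Values are copied from the tables above; haemoptisis returned no candidates.
linked = {
    "nausea": [("C0027497", 1.0)],
    "vomiting": [("C0042963", 1.0)],
    "diarrhea": [("C0011991", 1.0)],
    "Cough": [("C0010200", 1.0)],
    "haemoptisis": [],
}

# Hypothetical vocabulary of correctly spelled terms to suggest from.
known_terms = ["nausea", "vomiting", "diarrhea", "cough", "hemoptysis"]


def flag_unlinked(linked, vocabulary):
    """Return {entity: suggestions} for entities with no UMLS candidates."""
    flagged = {}
    for entity, candidates in linked.items():
        if not candidates:
            # Suggest up to three similarly spelled vocabulary terms.
            flagged[entity] = get_close_matches(entity.lower(), vocabulary, n=3)
    return flagged


suspects = flag_unlinked(linked, known_terms)
for entity, suggestions in suspects.items():
    print(f"No UMLS match for {entity!r}; did you mean {suggestions}?")
```

In practice the vocabulary would have to come from a curated source; the five-term list here only exists to keep the sketch self-contained.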

These CUIs can also be looked up in the NCI Metathesaurus. This is the URL template: https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp?dictionary=NCI%20Metathesaurus&code={CUI}
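
As a small illustration, the URL template can be filled in programmatically. This is only a sketch: `NCIM_TEMPLATE` and `ncim_url` are hypothetical names of mine, and the placeholder is lowercased to `{cui}` for use with Python string formatting.

```python
from urllib.parse import quote

# URL template quoted in the post; only the CUI varies.
NCIM_TEMPLATE = (
    "https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp"
    "?dictionary=NCI%20Metathesaurus&code={cui}"
)


def ncim_url(cui: str) -> str:
    """Build the NCI Metathesaurus concept report URL for a CUI."""
    return NCIM_TEMPLATE.format(cui=quote(cui))


# Hemoptysis, the CUI from the corrected table above.
print(ncim_url("C0019079"))
```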

Visualization

spaCy includes built-in visualizers to display part-of-speech tags and syntactic dependencies. The following graphic is the rendition using the text described above:

The named entities discussed earlier can be displayed in a similar way:

Example Code

import scispacy
import spacy
from scispacy.umls_linking import UmlsEntityLinker
from spacy import displacy

nlp = spacy.load("en_ner_bc5cdr_md")

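# Add the UMLS linker to the pipeline. This is the spaCy 2.x / scispaCy 0.2.x
# style of API; newer scispaCy releases register the linker as the
# "scispacy_linker" pipe instead.
linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)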

text = """
Data collection may include questions about groups of symptoms, such as
  GI symptoms (nausea, vomiting, diarrhea)
  Cough (non-productive, productive, or haemoptisis)
"""

doc = nlp(text)

entities = doc.ents
for entity in entities:
    print(entity.text, entity.start_char, entity.end_char, entity.label_)

    for umls_ent in entity._.umls_ents:
        # Each candidate is a (CUI, similarity score) tuple
        concept_id, score = umls_ent

        print(f"Name: {entity}")
        print(f"CUI: {concept_id}, Score: {score}")
        # Look up the full UMLS record for this CUI
        print(linker.umls.cui_to_entity[concept_id])
        print()

# CSS color names used to highlight each entity type
colors = {
    'CHEMICAL': 'lightpink',
    'DISEASE': 'orange',
}

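# Show the named entities. displacy.serve() blocks until the server is
# stopped (Ctrl+C); only then does the dependency view below get served.
displacy.serve(doc, style="ent", host="127.0.0.1", options={'colors': colors})
# Show the dependency parse
displacy.serve(doc, style="dep", host="127.0.0.1")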

The Road Ahead

At this point, there seem to be many NLP opportunities and applications in standards development. Linkage to UMLS will allow team members to ensure semantic meaning by referencing the curated definition. Quality will also increase: as demonstrated above, catching the spelling error was an unintended benefit. I can certainly see its utility in CDISC 360's biomedical concept authoring. Named entities can be used as keywords or tags in Example Collection.

Last Note

I want to share a note on installation. scispacy and spacy require Cython, a compiler for C extensions to Python. I spent too many hours troubleshooting before realizing I had installed a 32-bit build of Python on a PC running 64-bit Windows 10. This caused many compiler errors because all the Microsoft Visual Studio runtime redistributables and compilers were 64-bit. Installing the 64-bit binaries for Python corrected all the installation issues.
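
A quick way to check which build is installed, before starting on the packages, is to ask the interpreter for its pointer size. This is a minimal sketch using only the standard library; `python_bitness` is a hypothetical helper name.

```python
import platform
import struct


def python_bitness() -> int:
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    return struct.calcsize("P") * 8


print(f"Python {platform.python_version()} is a {python_bitness()}-bit build")
print(f"Machine type reported by the OS: {platform.machine()}")
```

If the first line reports 32 on a 64-bit machine, the interpreter and the compiler toolchain are likely mismatched in the way described above.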

References

[1] scispaCy. https://allenai.github.io/scispacy/

[2] BC5CDR corpus. https://www.ncbi.nlm.nih.gov/research/bionlp/Data/

[3] CDISC Interim User Guide for COVID-19. https://www.cdisc.org/interim-user-guide-covid-19/

[4] Unique Identifiers in the Metathesaurus. https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html

Disclaimer

The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.

In this blog, I want to highlight one part of a project deliverable from the Controlled Terminology (CT) Relationships subteam: metadata about CDISC CT for the SDTM TS dataset.

Background

Before going into detail, here is a bit about how this Standards Development team was established. The team began at the CDISC Working Group Meeting in 2017 in Silver Spring, Maryland, U.S.A. NCI EVS representatives raised maintenance issues that stemmed from the drastically different publication cadences of CDISC CT and the Implementation Guides. Volunteers also shared implementation challenges about CDISC CT. After much discussion, the attendees agreed to this general problem statement for a new development subteam to tackle:

Relationships between published terminology codelists and variable metadata are not explicit enough or are incomplete in published Implementation Guides (IG) or Therapeutic Area User Guides (TAUG).

Project Deliverables

Fast forward to today, the team recently finished reviewing all the SDTM v1.4 & SDTMIG v3.2 domain variables. A project deliverable is being compiled with two main components:

  1. A model for expressing CT relationships for SDTM
  2. Metadata that details the relationship between variables and CDISC CT codelists & terms, or external dictionaries

Problem Discussions

Of all the SDTM datasets reviewed, I find Trial Summary (TS) the most intriguing due to its complex CT requirements.

The SDTM TS dataset, by definition, is "a trial design domain that contains one record for each trial summary characteristic." [1] A trial summary characteristic is represented by two parts: 1) the TSPARM/TSPARMCD pair, or parameter/parameter code, respectively; and 2) TSVAL, or value. Permissible values for TSVAL depend on TSPARM/TSPARMCD. In other words, the CT requirement for TSVAL depends on TSPARM/TSPARMCD for any given dataset record.

Here is an excerpt from the SDTMIG v3.2's Appendix C1:

# | TSPARMCD | TSPARM | TSVAL (Codelist Name or Format)
1 | ADDON | Added on to Existing Treatments | No Yes Response
2 | TDIGRP | Diagnosis Group | SNOMED CT
3 | PCLAS | Pharmacological Class of Inv Therapy | NDF-RT
4 | TRT | Investigational Therapy or Treatment | UNII

Let's inspect and discuss each of them.

For #1, although seasoned CDISC users would likely recognize "No Yes Response" as one of the CDISC CT codelists, this notation inadvertently puts new users at a disadvantage. Even to trained users, it does not say whether all the terms within that CT codelist are permissible. From a process automation perspective, it gives a machine no information about its purpose. Therefore, it is ideal for neither human nor machine readability.

As for #2 and #3, SNOMED CT and NDF-RT are external dictionaries, and NDF-RT has since been renamed MED-RT. Not all users recognize these external dictionaries, especially when their usage is specific to certain geographical regions. Users also face this implementation challenge: which component of these external dictionaries do they use to populate TSVAL? Therefore, the information published in this SDTMIG appendix is neither current nor explicit.

UNII is a coded identifier for all registered ingredients used in products regulated by the U.S. FDA. For example, 362O9ITL9D is the UNII for acetaminophen. In #4, it is misleading to populate TSVAL with the UNII. It is more appropriate to populate this coded value in TSVALCD (parameter value code). The decode, so to speak, would instead go to TSVAL. In this instance, TSVAL shall correspond to the preferred substance name, a component in the Global Substance Registration System, which is maintained by the U.S. FDA.

Solutions

What extra information is needed to make example #1 more readable to both humans and machines? Since it is about CDISC CT, common attributes such as codelist names (short and long) and C-codes will immediately be helpful. An attribute for subsetting the codelist will be necessary to specify permissible values.

About the external dictionaries in examples #2 through #4, extra information to describe them will be elucidating, such as 1) owning organization, 2) dictionary's name, and, 3) dictionary's component.

An extra bit of metadata will be essential to cope with multiple regulatory requirements for SDTM data submissions.

All of the above together form the model (or structure) for complete disambiguation of the relationships between CDISC CT and SDTM variables. The following tables illustrate this model in a tabular manner, along with the example parameters:

For use when CDISC CT is relevant:

Columns:

  • Usages: Context of which this row of metadata applies; valid values are versioned foundational standards
  • Domain: A domain abbreviation found in foundational standard in "Usages"
  • Variable: A variable name
  • Condition 1: May be used for normalized datasets such as SuppQual, Findings domains, and TS. ** Use this for TESTCD and PARMCD; or, QNAM
  • C-Code for Value in Condition 1: Conditional Value's c-code in the Condition column, if applicable
  • Condition 2: May be used for normalized datasets such as SuppQual, Findings domains, and TS. ** Use this for TEST and PARM to pair with TESTCD and PARMCD; otherwise, not needed
  • C-Code for Value in Condition 2: Conditional Value's c-code in the Condition column, if applicable
  • CDISC CT Codelist Short Name: The CDISC CT Codelist that controls the values referenced in "Domain" and "Variable" columns
  • CDISC CT Codelist C-Code: C-code that pairs with "CDISC CT Codelist Short Name"
  • CDISC CT Codelist Long Name: Long name that pairs with "CDISC CT Codelist Short Name"
  • Permissible Value from CDISC CT: A semi-colon delimited value list subset from the codelist referenced in "CDISC CT Codelist Short Name"
  • Permissible Value's C-Code: C-codes for each value in "Permissible Value from CDISC CT", also semi-colon delimited
  • Health Authority Provisions: Specify to which health authority this set of metadata is applicable. Leave blank when not applicable. Example: "US FDA", "Japan PMDA"

Example row:

  #: 1
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVAL
  Condition 1: TSPARMCD EQ "ADDON"
  C-Code for Value in Condition 1: C49703
  Condition 2: TSPARM EQ "Added on to Existing Treatments"
  C-Code for Value in Condition 2: C49703
  CDISC CT Codelist Short Name: NY
  CDISC CT Codelist C-Code: C66742
  CDISC CT Codelist Long Name: No Yes Response
  Permissible Value from CDISC CT: N; Y
  Permissible Value's C-Code: C49488; C49487
  Health Authority Provisions: (blank)
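
To show how a machine could consume such a row, here is a minimal sketch that loads the ADDON example and checks a TS record against it. The class `CtRelationship`, the helpers `split_values` and `check_record`, and the field names are hypothetical and chosen for illustration; they are not the team's published format. Only a subset of the columns is modeled.

```python
from dataclasses import dataclass


def split_values(text: str) -> tuple:
    """Split a semi-colon delimited cell such as "N; Y" into a tuple."""
    return tuple(part.strip() for part in text.split(";") if part.strip())


@dataclass(frozen=True)
class CtRelationship:
    """A subset of the columns from one row of the CDISC CT table."""
    usage: str
    domain: str
    variable: str
    condition_variable: str
    condition_value: str
    codelist_short_name: str
    codelist_c_code: str
    permissible_values: tuple
    permissible_c_codes: tuple


# Row 1 from the table above: TSVAL when TSPARMCD EQ "ADDON".
ADDON = CtRelationship(
    usage="SDTMIG v3.2",
    domain="TS",
    variable="TSVAL",
    condition_variable="TSPARMCD",
    condition_value="ADDON",
    codelist_short_name="NY",
    codelist_c_code="C66742",
    permissible_values=split_values("N; Y"),
    permissible_c_codes=split_values("C49488; C49487"),
)


def check_record(record: dict, rule: CtRelationship):
    """Return True or False if the rule applies to the record, else None."""
    if record.get(rule.condition_variable) != rule.condition_value:
        return None  # the rule's condition does not match this record
    return record.get(rule.variable) in rule.permissible_values


print(check_record({"TSPARMCD": "ADDON", "TSVAL": "Y"}, ADDON))
print(check_record({"TSPARMCD": "ADDON", "TSVAL": "MAYBE"}, ADDON))
```

The point of the sketch is that the subset "N; Y" is explicit in the metadata, so a program does not have to assume every term in the No Yes Response codelist is allowed.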

For use when external dictionary is relevant:

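Columns:

  • Usages: Context of which this row of metadata applies; valid values are versioned foundational standards
  • Domain: A domain abbreviation found in foundational standard in "Usages"
  • Variable: A variable name
  • Condition 1: May be used for normalized datasets such as SuppQual, Findings domains, and TS. ** Use this for TESTCD and PARMCD; or, QNAM
  • C-Code for Value in Condition 1: Conditional Value's c-code in the Condition column, if applicable
  • Condition 2: May be used for normalized datasets such as SuppQual, Findings domains, and TS. ** Use this for TEST and PARM to pair with TESTCD and PARMCD; otherwise, not needed
  • C-Code for Value in Condition 2: Conditional Value's c-code in the Condition column, if applicable
  • External Dictionary's Organization: Used when "Variable" is controlled by an external dictionary. Example: "MSSO", "Regenstrief Institute"
  • External Dictionary's Name: Used when "Variable" is controlled by an external dictionary. Example: "MedDRA", "LOINC"
  • External Dictionary's Component: Used when "Variable" is controlled by an external dictionary. Example: "Preferred Term Code", "LOINC Code"
  • Descriptive Information: Additional information that is useful for implementers from a citable source. ** Citable implementation information that can't be molded into detail metadata; or, regulatory agency's requirements
  • Health Authority Provisions: Specify to which health authority this set of metadata is applicable. Leave blank when not applicable. Example: "US FDA", "Japan PMDA"

Example rows:

  #: 2
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVAL
  Condition 1: TSPARMCD EQ "TDIGRP"
  C-Code for Value in Condition 1: C49650
  Condition 2: TSPARM EQ "Diagnosis Group"
  C-Code for Value in Condition 2: C49650
  External Dictionary's Organization: International Health Terminology Standards Organisation (IHTSDO)
  External Dictionary's Name: SNOMED CT
  External Dictionary's Component: SNOMED CT Fully Specified Name
  Descriptive Information: Appendix C of SDTMIG v3.2 specifies SNOMED CT. See FDA TCG section 6.6.1.1. Also see Notes in Appendix C of SDTMIG v3.2: If the study population is healthy subjects (i.e., healthy subjects flag is Y), this parameter is not expected.
  Health Authority Provisions: US FDA

  #: 2
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVALCD
  Condition 1: TSPARMCD EQ "TDIGRP"
  C-Code for Value in Condition 1: C49650
  Condition 2: TSPARM EQ "Diagnosis Group"
  C-Code for Value in Condition 2: C49650
  External Dictionary's Organization: International Health Terminology Standards Organisation (IHTSDO)
  External Dictionary's Name: SNOMED CT
  External Dictionary's Component: SNOMED CT Identifier (SCTID)
  Descriptive Information: (blank)
  Health Authority Provisions: US FDA

  #: 3
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVAL
  Condition 1: TSPARMCD EQ "PCLAS"
  C-Code for Value in Condition 1: C98768
  Condition 2: TSPARM EQ "Pharmacologic Class"
  C-Code for Value in Condition 2: C98768
  External Dictionary's Organization: Department of Veterans Affairs/Veterans Health Administration
  External Dictionary's Name: Medication Reference Terminology (MED-RT)
  External Dictionary's Component: Established pharmacologic class (EPC)
  Descriptive Information: Note: Refer to citation in FDA TCG guidance. If the established pharmacologic class (EPC) is not available for an active moiety, then the sponsor should discuss the appropriate MOA, PE, and CS terms with the review division.
  Health Authority Provisions: US FDA; Japan PMDA

  #: 3
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVALCD
  Condition 1: TSPARMCD EQ "PCLAS"
  C-Code for Value in Condition 1: C98768
  Condition 2: TSPARM EQ "Pharmacologic Class"
  C-Code for Value in Condition 2: C98768
  External Dictionary's Organization: Department of Veterans Affairs/Veterans Health Administration
  External Dictionary's Name: Medication Reference Terminology (MED-RT)
  External Dictionary's Component: Alphanumeric unique identifier (NUI)
  Descriptive Information: (blank)
  Health Authority Provisions: US FDA; Japan PMDA

  #: 4
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVAL
  Condition 1: TSPARMCD EQ "TRT"
  C-Code for Value in Condition 1: C41161
  Condition 2: TSPARM EQ "Investigational Therapy or Treatment"
  C-Code for Value in Condition 2: C41161
  External Dictionary's Organization: U.S. Food and Drug Administration (US FDA)
  External Dictionary's Name: Global Substance Registration System
  External Dictionary's Component: Preferred substance name
  Descriptive Information: (blank)
  Health Authority Provisions: US FDA; Japan PMDA

  #: 4
  Usages: SDTMIG v3.2
  Domain: TS
  Variable: TSVALCD
  Condition 1: TSPARMCD EQ "TRT"
  C-Code for Value in Condition 1: C41161
  Condition 2: TSPARM EQ "Investigational Therapy or Treatment"
  C-Code for Value in Condition 2: C41161
  External Dictionary's Organization: U.S. Food and Drug Administration (US FDA)
  External Dictionary's Name: Global Substance Registration System
  External Dictionary's Component: Unique Ingredient Identifier (UNII)
  Descriptive Information: (blank)
  Health Authority Provisions: US FDA; Japan PMDA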
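
The external dictionary rows answer the question "which component goes into which variable?" The sketch below turns them into a lookup keyed by parameter code and variable. `ExternalDictionary`, `EXTERNAL_DICTIONARIES`, and `component_for` are hypothetical names for illustration only, and the TSVAL value in the sample record is an illustrative assumption of what the preferred substance name would look like.

```python
from typing import NamedTuple, Optional


class ExternalDictionary(NamedTuple):
    organization: str
    name: str
    component: str


IHTSDO = "International Health Terminology Standards Organisation (IHTSDO)"
VA = "Department of Veterans Affairs/Veterans Health Administration"
FDA = "U.S. Food and Drug Administration (US FDA)"
MED_RT = "Medication Reference Terminology (MED-RT)"
GSRS = "Global Substance Registration System"

# (TSPARMCD, variable) -> external dictionary details, from the rows above.
EXTERNAL_DICTIONARIES = {
    ("TDIGRP", "TSVAL"): ExternalDictionary(IHTSDO, "SNOMED CT", "SNOMED CT Fully Specified Name"),
    ("TDIGRP", "TSVALCD"): ExternalDictionary(IHTSDO, "SNOMED CT", "SNOMED CT Identifier (SCTID)"),
    ("PCLAS", "TSVAL"): ExternalDictionary(VA, MED_RT, "Established pharmacologic class (EPC)"),
    ("PCLAS", "TSVALCD"): ExternalDictionary(VA, MED_RT, "Alphanumeric unique identifier (NUI)"),
    ("TRT", "TSVAL"): ExternalDictionary(FDA, GSRS, "Preferred substance name"),
    ("TRT", "TSVALCD"): ExternalDictionary(FDA, GSRS, "Unique Ingredient Identifier (UNII)"),
}


def component_for(tsparmcd: str, variable: str) -> Optional[ExternalDictionary]:
    """Return the dictionary component that populates a TS variable, if any."""
    return EXTERNAL_DICTIONARIES.get((tsparmcd, variable))


# Sample TS record for acetaminophen; the TSVAL spelling is illustrative.
record = {"TSPARMCD": "TRT", "TSVAL": "ACETAMINOPHEN", "TSVALCD": "362O9ITL9D"}
for variable in ("TSVAL", "TSVALCD"):
    source = component_for(record["TSPARMCD"], variable)
    print(f"{variable}={record[variable]} <- {source.name}: {source.component}")
```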

Project Status

The project deliverable is currently undergoing Internal Review per CDISC's standard development process. [2] All artifacts created by the team are available on the CDISC Wiki, along with a Read Me section. [3] The team expects Public Review to begin in the third quarter of 2020.

Expected Outcomes

The team operates in tight alignment with CDISC's strategic goal to transform standards and clinical knowledge into a multidimensional representation to support automation. [4] Users can expect the metadata to be accessible via CDISC Library when it completes the development lifecycle. Future IGs and TAUGs may reference CT Relationships to keep pace with the CDISC CT publication cadence. The team may incorporate additional kinds of CT relationships metadata, e.g., CT codetables. [5] The team also aspires to expand, using the same methodology, to cover CDASH, SEND, and ADaM.

Acknowledgements

I want to acknowledge these people for their contributions and domain expertise: Kristin Kelly (Pinnacle 21), Michael Lozano (Eli Lilly), Sharon Weller (Eli Lilly), Donna Sattler (BMS), Debbie O’Neill (Merck), Smitha Karra* (Gilead), Judith Goud (Nurocor), Swarupa Sudini (Pfizer), Anna Pron-Zwick (AstraZeneca), Craig Zwickl (Independent), Erin Muhlbradt* (NCI EVS), Fred Wood (TalentMine), Trish Gleason (BMS), Sharon Hartpence (BMS), Diane Wold (CDISC). Special thanks to Ann White for copyediting.

* denotes team co-lead, current and past

References

[1] CDISC SDTM CT P34. Extracted from CDISC Library Data Standards Browser: https://library.cdisc.org/browser/ct/2018-06-29?products=sdtmct-2018-06-29&codelists=C66734&codevalue=C53483

[2] CDISC Operating Procedure CDISC-COP-001 Standards Development. https://www.cdisc.org/system/files/about/cop/CDISC-COP-001-Standards_Development_2019.pdf

[3] Internal Review package. https://wiki.cdisc.org/display/CT/Internal+Review

[4] CDISC Strategic Plan 2019-2022. https://www.cdisc.org/sites/default/files/resource/CDISC_2019_2022_Strategic_Plan.pdf

[5] CT codetables. https://www.cdisc.org/standards/terminology, expand Codetable Mapping Files