Blog

Disclaimer

The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.

I am learning about natural language processing (NLP) as part of my self-teaching journey in Data Science.

My setup is simple: Python and an NLP package. scispaCy is an open-source library for processing biomedical text. [1] It works with multiple pre-trained models, such as one trained on the BioCreative V Chemical Disease Relation (BC5CDR) corpus for biomedical terms. [2] Another example is the model trained on the BioNLP13CG corpus for cancer genetics.

As a test run, I selected a small paragraph from the work-in-progress COVID-19 Interim User Guide. [3] This is a text about a data collection example for the disease's signs and symptoms:

Data collection may include questions about groups of symptoms, such as

  • GI symptoms (nausea, vomiting, diarrhea)
  • Cough (non-productive, productive, or haemoptisis)

The next step was to extract named entities by running the signs-and-symptoms text through the BC5CDR biomedical model. A named entity is a span of text labeled as the name of something; for BC5CDR, the entity types are DISEASE and CHEMICAL. This process is often referred to as named entity recognition (NER). These are the results:

Entity | Entity Type
nausea | DISEASE
vomiting | DISEASE
diarrhea | DISEASE
Cough | DISEASE
haemoptisis | DISEASE

The first surprise was a programmatic method to link discovered entities to the Unified Medical Language System (UMLS), which is maintained by the U.S. National Library of Medicine (NLM). This is appealing because when a term or concept is curated in the UMLS, a formal definition exists. Each concept in the UMLS has a Concept Unique Identifier (CUI). [4] This process is typically called named entity linking (NEL).

Let's take a look at the outcome, with green shaded rows indicating my preferred match:

Entity | CUI | Name | Definition | Score
nausea | C0027497 | Nausea | An unpleasant sensation in the stomach usually accompanied by the urge to vomit. Common causes are early pregnancy, sea and motion sickness, emotional stress, intense pain, food poisoning, and various enteroviruses. | 1.0
nausea | C4085862 | Bothered by Nausea | A question about whether an individual is or was bothered by nausea. | 1.0
nausea | C4255480 | Nausea:Presence or Threshold:Point in time:^Patient:Ordinal | None | 1.0
nausea | C4084796 | How Often Nausea | A question about how often an individual has or had nausea. | 1.0
nausea | C1963179 | Nausea Adverse Event | None | 1.0
vomiting | C0042963 | Vomiting | The forcible expulsion of the contents of the STOMACH through the MOUTH. | 1.0
vomiting | C4084767 | Bothered by Vomiting | A question about whether an individual is or was bothered by vomiting. | 0.9999999403953552
vomiting | C4084768 | Usual Severity Vomiting | A question about the usual severity of an individual's vomiting. | 0.9999999403953552
vomiting | C1963281 | Vomiting Adverse Event | None | 0.9999999403953552
vomiting | C4084766 | How Much Distress Vomiting | A question about an individual's distress from their vomiting. | 0.9999999403953552
diarrhea | C0011991 | Diarrhea | An increased liquidity or decreased consistency of FECES, such as running stool. Fecal consistency is related to the ratio of water-holding capacity of insoluble solids to total water, rather than the amount of water present. Diarrhea is not hyperdefecation or increased fecal weight. | 1.0
diarrhea | C4084784 | How Much Distress Diarrhea | A question about an individual's distress from their diarrhea. | 1.0
diarrhea | C4084802 | Usual Severity Diarrhea | A question about the usual severity of an individual's diarrhea. | 1.0
diarrhea | C1963091 | Diarrhea Adverse Event | None | 1.0
diarrhea | C3641756 | Have Diarrhea | A question about whether an individual has or had diarrhea. | 1.0
Cough | C0010200 | Coughing | A sudden, audible expulsion of air from the lungs through a partially closed glottis, preceded by inhalation. It is a protective response that serves to clear the trachea, bronchi, and/or lungs of irritants and secretions, or to prevent aspiration of foreign materials into the lungs. | 1.0
Cough | C1961131 | Cough Adverse Event | None | 1.0
Cough | C3274924 | Have Been Coughing | A question about whether an individual is or has been coughing. | 1.0
Cough | C3815497 | Cough (guaifenesin) | None | 1.0
Cough | C4084725 | Usual Severity Cough | A question about the usual severity of an individual's cough. | 1.0
haemoptisis | (none returned)

Notice that the table above includes no UMLS concept for the named entity haemoptisis. After some online searching came the second surprise: the miss is due to a typographical error. After correcting it to "hemoptysis," a hit appears in the outcome, as follows:

Entity | CUI | Name | Definition | Score
hemoptysis | C0019079 | Hemoptysis | Expectoration or spitting of blood originating from any part of the RESPIRATORY TRACT, usually from hemorrhage in the lung parenchyma (PULMONARY ALVEOLI) and the BRONCHIAL ARTERIES. | 1.0
hemoptysis | C0030424 | Paragonimiasis | Infection with TREMATODA of the genus PARAGONIMUS. | 0.7546218633651733

Suffice it to say, these CUIs are available in the NCI Metathesaurus. This is the URL template: https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp?dictionary=NCI%20Metathesaurus&code={CUI}
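The URL template above can be filled in programmatically. A minimal sketch (the function name `concept_url` is my own, not part of any library):

```python
# Build an NCI Metathesaurus concept-report URL from a CUI,
# following the URL template quoted above.
TEMPLATE = ("https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp"
            "?dictionary=NCI%20Metathesaurus&code={cui}")

def concept_url(cui: str) -> str:
    return TEMPLATE.format(cui=cui)

print(concept_url("C0019079"))  # the CUI for hemoptysis, from the table above
```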

Visualization

spaCy includes built-in visualizers to display part-of-speech tags and syntactic dependencies. The following graphic is the rendition using the text described above:

We discussed named entity recognition, which can be displayed as such:

Example Code

import scispacy  # registers scispaCy components
import spacy
from scispacy.umls_linking import UmlsEntityLinker
from spacy import displacy

# Note: this uses the spaCy 2.x-era scispaCy pipeline API.
nlp = spacy.load("en_ner_bc5cdr_md")

linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)

text = """
Data collection may include questions about groups of symptoms, such as
  GI symptoms (nausea, vomiting, diarrhea)
  Cough (non-productive, productive, or haemoptisis)
"""

doc = nlp(text)

for entity in doc.ents:
    print(entity.text, entity.start_char, entity.end_char, entity.label_)

    for umls_ent in entity._.umls_ents:
        # each linked candidate is a (CUI, score) tuple
        cui, score = umls_ent

        print(f"Name: {entity}")
        print(f"CUI: {cui}, Score: {score}")
        print(linker.umls.cui_to_entity[cui])
        print()

colors = {
    'CHEMICAL': 'lightpink',
    'DISEASE': 'lightsalmon',  # 'lightorange' is not a valid CSS color name
}

# displacy.serve blocks until interrupted, so run one visualization at a time
displacy.serve(doc, style="ent", host="127.0.0.1", options={'colors': colors})
# displacy.serve(doc, style="dep", host="127.0.0.1")

The Road Ahead

At this point, there appear to be many NLP opportunities and applications in standards development. Linkage to UMLS will allow team members to ensure semantic meaning by referencing the curated definitions. Quality will also benefit: detecting the spelling error above was an unintended demonstration. I can certainly see utility in CDISC 360's biomedical concept authoring, and named entities can be used as keywords or tags in the Example Collection.

Last Note

I want to share a note on installation. scispacy and spacy require Cython, which compiles Python-like code into C extensions. I spent too many hours troubleshooting before realizing I had installed a 32-bit build of Python on a 64-bit Windows 10 PC. This caused many compiler errors because the Microsoft Visual Studio runtime redistributables and compilers were all 64-bit. Installing the 64-bit Python binaries resolved all the installation issues.

References

[1] scispaCy: https://allenai.github.io/scispacy/

[2] BC5CDR corpus: https://www.ncbi.nlm.nih.gov/research/bionlp/Data/

[3] CDISC Interim User Guide for COVID-19: https://www.cdisc.org/interim-user-guide-covid-19/

[4] Unique Identifiers in the Metathesaurus: https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html

Disclaimer

The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.

In this blog, I want to highlight one part of a project deliverable from the Controlled Terminology (CT) Relationships subteam - metadata about CDISC CT for the SDTM TS dataset.

Background

Before going into detail, here is a bit about how this Standards Development team was established. The team began at the CDISC Working Group Meeting in 2017 in Silver Spring, Maryland, U.S.A. NCI EVS representatives raised maintenance issues stemming from the drastically different publication cadences of CDISC CT and the Implementation Guides. Volunteers also shared implementation challenges with CDISC CT. After much discussion, the attendees agreed on this general problem statement for a new development subteam to tackle:

Relationships between published terminology codelists and variable metadata are not explicit enough or are incomplete in published Implementation Guides (IG) or Therapeutic Area User Guides (TAUG).

Project Deliverables

Fast forward to today, the team recently finished reviewing all the SDTM v1.4 & SDTMIG v3.2 domain variables. A project deliverable is being compiled with two main components:

  1. A model for expressing CT relationships for SDTM
  2. Metadata that details the relationship between variables and CDISC CT codelists & terms, or external dictionaries

Problem Discussions

Of all the SDTM datasets reviewed, I find Trial Summary (TS) the most intriguing due to its complex CT requirements.

The SDTM TS dataset, by definition, is "a trial design domain that contains one record for each trial summary characteristic." [1] A trial summary characteristic is represented by two parts: 1) the TSPARM/TSPARMCD pair, or parameter/parameter code, respectively; and 2) TSVAL, or value. Permissible values for TSVAL are dependent on TSPARM/TSPARMCD. In other words, the CT requirement for TSVAL depends on TSPARM/TSPARMCD for any given dataset record.
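This conditional dependency can be sketched in code. The lookup table below is hypothetical and heavily simplified (a real implementation would reference versioned codelists, and the "No Yes Response" codelist contains more terms than shown here); it only illustrates that the rule governing TSVAL is keyed off TSPARMCD:

```python
# Hypothetical, simplified lookup: which codelist or external dictionary
# governs TSVAL depends on the record's TSPARMCD.
TSVAL_CONTROL = {
    "ADDON":  {"type": "codelist", "name": "No Yes Response", "values": {"N", "Y"}},
    "TDIGRP": {"type": "external", "name": "SNOMED CT"},
}

def check_tsval(tsparmcd: str, tsval: str) -> bool:
    """Return True when TSVAL is permissible for the given TSPARMCD."""
    rule = TSVAL_CONTROL.get(tsparmcd)
    if rule is None:
        return False  # no rule on file for this parameter
    if rule["type"] == "codelist":
        return tsval in rule["values"]
    return True  # external dictionaries require a separate lookup service

print(check_tsval("ADDON", "Y"))      # permissible
print(check_tsval("ADDON", "Maybe"))  # not in the codelist subset
```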

Here is an excerpt from Appendix C1 of SDTMIG v3.2:

# | TSPARMCD | TSPARM | TSVAL (Codelist Name or Format)
1 | ADDON | Added on to Existing Treatments | No Yes Response
2 | TDIGRP | Diagnosis Group | SNOMED CT
3 | PCLAS | Pharmacological Class of Inv Therapy | NDF-RT
4 | TRT | Investigational Therapy or Treatment | UNII

Let's inspect and discuss each of them.

For #1, although seasoned CDISC users would likely recognize "No Yes Response" as one of the CDISC CT codelists, this notation inadvertently puts naive users at a disadvantage. Even for trained users, it does not mean all the terms within that codelist are permissible. From a process-automation perspective, it carries no information telling a machine its purpose. Therefore, it is neither human- nor machine-readable.

About #2 and #3: SNOMED CT and NDF-RT are external dictionaries, and NDF-RT has since been renamed MED-RT. Not all users recognize these external dictionaries, especially when usage may be specific to certain geographical regions. Users also face an implementation challenge: which component of these external dictionaries do they use to populate TSVAL? The information published in this SDTMIG appendix is therefore neither contemporary nor explicit.

UNII is a coded identifier for all registered ingredients used in products regulated by the US FDA; for example, 362O9ITL9D is the UNII for acetaminophen. In #4, it is misleading to populate TSVAL with a UNII. It is more appropriate to place this coded value in TSVALCD (parameter value code); the decode, so to speak, goes in TSVAL instead. In this instance, TSVAL should contain the preferred substance name, a component of the Global Substance Registration System maintained by the U.S. FDA.

Solutions

What extra information is needed to make example #1 more readable to both humans and machines? Since it concerns CDISC CT, common attributes such as codelist names (short and long) and c-codes are immediately helpful. An attribute for subsetting the codelist is necessary to specify permissible values.

For the external dictionaries in examples #2 through #4, extra descriptive information is elucidating: 1) the owning organization, 2) the dictionary's name, and 3) the dictionary's component.

An extra bit of metadata will be essential to cope with multiple regulatory requirements for SDTM data submissions.

All of the above together formulates the model (or structure) for completely disambiguating the relationships between CDISC CT and SDTM variables. The following tables illustrate this model in a tabular manner, along with the example parameters:

For use when CDISC CT is relevant:

Columns and their meanings:

  • # — Example number
  • Usages — Context to which this row of metadata applies; valid values are versioned foundational standards
  • Domain — A domain abbreviation found in the foundational standard named in "Usages"
  • Variable — A variable name. May be used for normalized datasets such as SuppQual, Findings domains, and TS
  • Condition 1 — May be used for normalized datasets such as SuppQual, Findings domains, and TS; use this for TESTCD and PARMCD, or QNAM
  • C-Code for Value in Condition 1 — The conditional value's c-code in the Condition 1 column, if applicable
  • Condition 2 — May be used for normalized datasets such as SuppQual, Findings domains, and TS; use this for TEST and PARM to pair with TESTCD and PARMCD; otherwise, not needed
  • C-Code for Value in Condition 2 — The conditional value's c-code in the Condition 2 column, if applicable
  • CDISC CT Codelist Short Name — The CDISC CT codelist that controls the values referenced in the "Domain" and "Variable" columns
  • CDISC CT Codelist C-Code — C-code that pairs with "CDISC CT Codelist Short Name"
  • CDISC CT Codelist Long Name — Long name that pairs with "CDISC CT Codelist Short Name"
  • Permissible Value from CDISC CT — A semicolon-delimited value list subset from the codelist referenced in "CDISC CT Codelist Short Name"
  • Permissible Value's C-Code — C-codes for each value in "Permissible Value from CDISC CT", also semicolon-delimited
  • Health Authority Provisions — The health authority to which this set of metadata applies; leave blank when not applicable. Example: "US FDA", "Japan PMDA"

Example row #1:
  Usages: SDTMIG v3.2; Domain: TS; Variable: TSVAL
  Condition 1: TSPARMCD EQ "ADDON" (C49703); Condition 2: TSPARM EQ "Added on to Existing Treatments" (C49703)
  Codelist: NY (C66742), "No Yes Response"
  Permissible Values: N; Y (C49488; C49487)
  Health Authority Provisions: (blank)

For use when external dictionary is relevant:

This table shares the #, Usages, Domain, Variable, Condition, and Health Authority Provisions columns with the previous table, and adds:

  • External Dictionary's Organization — Used when "Variable" is controlled by an external dictionary. Example: "MSSO", "Regenstrief Institute"
  • External Dictionary's Name — Used when "Variable" is controlled by an external dictionary. Example: "MedDRA", "LOINC"
  • External Dictionary's Component — Used when "Variable" is controlled by an external dictionary. Example: "Preferred Term Code", "LOINC Code"
  • Descriptive Information — Additional information useful to implementers from a citable source: citable implementation information that cannot be molded into detailed metadata, or a regulatory agency's requirements

Example rows #2 (TSVAL and TSVALCD):
  Usages: SDTMIG v3.2; Domain: TS
  Condition 1: TSPARMCD EQ "TDIGRP" (C49650); Condition 2: TSPARM EQ "Diagnosis Group" (C49650)
  Organization: International Health Terminology Standards Organisation (IHTSDO); Dictionary: SNOMED CT
  Component for TSVAL: SNOMED CT Fully Specified Name; Component for TSVALCD: SNOMED CT Identifier (SCTID)
  Descriptive Information (TSVAL): Appendix C of SDTMIG v3.2 specifies SNOMED CT; see FDA TCG section 6.6.1.1. Also see the Notes in Appendix C of SDTMIG v3.2: if the study population is healthy subjects (i.e., healthy subjects flag is Y), this parameter is not expected.
  Health Authority Provisions: US FDA

Example rows #3 (TSVAL and TSVALCD):
  Usages: SDTMIG v3.2; Domain: TS
  Condition 1: TSPARMCD EQ "PCLAS" (C98768); Condition 2: TSPARM EQ "Pharmacologic Class" (C98768)
  Organization: Department of Veterans Affairs/Veterans Health Administration; Dictionary: Medication Reference Terminology (MED-RT)
  Component for TSVAL: Established pharmacologic class (EPC); Component for TSVALCD: Alphanumeric unique identifier (NUI)
  Descriptive Information (TSVAL): Refer to the citation in the FDA TCG guidance. If the established pharmacologic class (EPC) is not available for an active moiety, the sponsor should discuss the appropriate MOA, PE, and CS terms with the review division.
  Health Authority Provisions: US FDA; Japan PMDA

Example rows #4 (TSVAL and TSVALCD):
  Usages: SDTMIG v3.2; Domain: TS
  Condition 1: TSPARMCD EQ "TRT" (C41161); Condition 2: TSPARM EQ "Investigational Therapy or Treatment" (C41161)
  Organization: U.S. Food and Drug Administration (US FDA); Dictionary: Global Substance Registration System
  Component for TSVAL: Preferred substance name; Component for TSVALCD: Unique Ingredient Identifier (UNII)
  Health Authority Provisions: US FDA; Japan PMDA

Project Status

The project deliverable is currently undergoing Internal Review per CDISC's standards development process. [2] All artifacts created by the team are available on the CDISC Wiki, along with a Read Me section. [3] The team expects Public Review to begin in the third quarter of 2020.

Expected Outcomes

The team operates in tight alignment with CDISC's strategic goal to transform standards and clinical knowledge into a multidimensional representation that supports automation. [4] Users can expect the metadata to be accessible via CDISC Library once it completes the development lifecycle. Future IGs and TAUGs may reference CT Relationships to stay current with the CDISC CT publication cadence. The team may incorporate additional kinds of CT relationships metadata, e.g., CT codetables. [5] The team also aspires to expand coverage, using the same methodology, to CDASH, SEND, and ADaM.

Acknowledgements

I want to acknowledge these people for their contributions and domain expertise: Kristin Kelly (Pinnacle 21), Michael Lozano (Eli Lilly), Sharon Weller (Eli Lilly), Donna Sattler (BMS), Debbie O’Neill (Merck), Smitha Karra* (Gilead), Judith Goud (Nurocor), Swarupa Sudini (Pfizer), Anna Pron-Zwick (AstraZeneca), Craig Zwickl (Independent), Erin Muhlbradt* (NCI EVS), Fred Wood (TalentMine), Trish Gleason (BMS), Sharon Hartpence (BMS), Diane Wold (CDISC). Special thanks to Ann White for copyediting.

* denotes team co-lead, current and past

References

[1] CDISC SDTM CT P34. Extracted from CDISC Library Data Standards Browser: https://library.cdisc.org/browser/ct/2018-06-29?products=sdtmct-2018-06-29&codelists=C66734&codevalue=C53483

[2] CDISC Operating Procedure CDISC-COP-001 Standards Development. https://www.cdisc.org/system/files/about/cop/CDISC-COP-001-Standards_Development_2019.pdf

[3] Internal Review package. https://wiki.cdisc.org/display/CT/Internal+Review

[4] CDISC Strategic Plan 2019-2022. https://www.cdisc.org/sites/default/files/resource/CDISC_2019_2022_Strategic_Plan.pdf

[5] CT codetables. https://www.cdisc.org/standards/terminology, expand Codetable Mapping Files


Disclaimer

The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.

The mainstream definition of unstructured data is data that does not contain descriptive information about itself. An email body is unstructured; a newspaper op-ed is unstructured. The definition goes on to describe structured data as having a rigid schema requirement. Semi-structured data fall in the middle, where the meaning of the data is, for example, self-described using markup tags.

Below are examples to illustrate these three kinds of data.

Example of unstructured data

Taylor was born on March 12th in 2008. Carter was born 10 years and 9 months after Taylor.

Example of semi-structured data

Name | Birthdate
Taylor 🐕 | 3/12/2008
Carter 🐈 | 12/12/2018

Example of structured data

{
    "$id": "http://example.com/example.json",
    "type": "object",
    "title": "The Root Schema",
    "description": "My root schema.",
    "required": [
        "animals"
    ],
    "properties": {
        "animals": {
            "$id": "#/properties/animals",
            "type": "array",
            "title": "The Animals Schema",
            "description": "Pet animals data.",
            "default": [],
            "items": {
                "$id": "#/properties/animals/items",
                "type": "object",
                "title": "The Items Schema",
                "description": "One record per pet animal",
                "default": {},
                "required": [
                    "name",
                    "species",
                    "birthdate",
                    "dateFormat",
                ],
                "properties": {
                    "name": {
                        "$id": "#/properties/animals/items/properties/name",
                        "type": "string",
                        "title": "The Name Schema",
                        "description": "Pet animal's given name"
                    },
                    "species": {
                        "$id": "#/properties/animals/items/properties/species",
                        "type": "string",
                        "enum": [
                            "canine",
                            "feline"
                        ]
                        "title": "The Species Schema",
                        "description": "Pet animal's species"
                    },
                    "birthdate": {
                        "$id": "#/properties/animals/items/properties/birthdate",
                        "type": "string",
                        "title": "The Birthdate Schema",
                        "description": "Pet animal's date of birth"
                    },
                    "dateFormat": {
                        "$id": "#/properties/animals/items/properties/dateFormat",
                        "type": "string",
                        "title": "The Dateformat Schema",
                        "description": "Date format used in birth date",
                        "default": "dd/mm/yyyy"
                    }
                }
            }
        }
    }
}


Data fitting the schema, in JSON format.
{
    "animals": [
        {
            "name": "Taylor",
            "species": "canine",
            "birthdate": "3/12/2008",
            "dateFormat": "mm/dd/yyyy"
        },
        {
            "name": "Carter",
            "species": "feline",
            "birthdate": "12/12/2018"
        }
    ]
}

Only in the structured example does it become clear that Taylor is a dog and Carter is a cat. The schema serves an important function: it gives context and rules to the data. This schema also carries information about defaults, optionality, and valid values. Because of the schema, a machine can process the pet animal data much more efficiently. Predictability is key.
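To make the "context and rules" point concrete, here is a minimal hand-rolled check (deliberately not a full JSON Schema validator) that enforces three of the schema's ideas: required keys, the species enum, and a default that fills in the optional dateFormat. The function and constant names are my own:

```python
# Minimal sketch of schema-driven validation: required keys, an enum
# constraint, and a default applied to an optional field.
SPECIES = {"canine", "feline"}
REQUIRED = {"name", "species", "birthdate"}
DEFAULTS = {"dateFormat": "dd/mm/yyyy"}

def validate(record: dict) -> dict:
    missing = REQUIRED - record.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if record["species"] not in SPECIES:
        raise ValueError(f"invalid species: {record['species']}")
    return {**DEFAULTS, **record}  # fill defaults for absent optional fields

# Carter's record omits dateFormat, so the schema default fills the gap.
carter = validate({"name": "Carter", "species": "feline", "birthdate": "12/12/2018"})
print(carter["dateFormat"])
```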

Before continuing, I will set aside the classification of normative versus informative content in CDISC publications, for two reasons. First, no publication delineates them clearly. More importantly, the degree of structuredness does not make content any more or less normative.

CDISC publishes foundational standards in PDF. When loading these standards into the metadata repository, with the exception of Controlled Terminology, a good amount of effort is spent teasing basic metadata out of the documents. By basic metadata, I mean those that matter most for process automation, such as the variable, dataset, and class information common in domain specifications.

These basic metadata do not always share an identical structure across CDISC products. CDASHIG variable attributes differ from SDTMIG variable attributes, and different versions of the same CDISC product do not always fit one structure. SDTM v1.2 and v1.3 contain flags for variable allowance in human clinical (SDTMIG) or animal toxicology (SEND) datasets; in subsequent SDTM versions, these flags were removed and subsumed into description text. Nor do the same attributes always have the same value domains: each product has its own set of values for the Core attribute. "HR", "O", and "R/C" belong to CDASHIG; SDTMIG has "Req", "Exp", and "Perm"; ADaMIG has "Req", "Cond", and "Perm". In other words, within this unstructured PDF container, certain semi-structured data can be asserted.
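The point about per-product value domains can be captured as a small lookup, using the Core values listed above. This is an illustrative sketch (the dictionary and function names are my own, not from any CDISC artifact):

```python
# The Core attribute's value domain differs by product, as noted above.
CORE_VALUES = {
    "CDASHIG": {"HR", "O", "R/C"},
    "SDTMIG":  {"Req", "Exp", "Perm"},
    "ADaMIG":  {"Req", "Cond", "Perm"},
}

def core_is_valid(product: str, core: str) -> bool:
    """Check a Core value against the product-specific value domain."""
    return core in CORE_VALUES.get(product, set())

print(core_is_valid("SDTMIG", "Exp"))   # valid for SDTMIG
print(core_is_valid("CDASHIG", "Exp"))  # not valid: CDASHIG uses HR, O, R/C
```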

Therapeutic Area User Guides (TAUG) are largely unstructured. Even though TAUG publications may seem to follow a common document pattern, they often contain knowledge graphs, images, and data examples, albeit with varying degrees of specificity. This is similar to digital photo files: even though they are in a file format (JPEG, PNG, or GIF), the content is unstructured. A digital photo file must be processed to become an image for human cognition, and image recognition and analytic tools are needed for a machine to learn and understand its meaning.

Typical composition of a CDISC publication that contains unstructured, semi-structured, and structured contents.

Controlled Terminology, on the other hand, is published in PDF rendered directly from raw data, and that raw data is highly structured. NCI EVS maintains all codelists and terms in Protégé, an ontology tool, which EVS extends with OWL, the Web Ontology Language. [1] Data serialization happens at each quarterly publication to produce the downloadable formats, such as CSV, Excel spreadsheet, and PDF. This structured-data approach enables many reuse scenarios: Define-XML, codelists and terms in implementation guides, and CDISC 360's biomedical concepts, to name a few. Controlled Terminology reaps the benefit of repeatability by virtue of its highly structured nature. CDISC publishes Controlled Terminology four times per year, with up to six packages each time; at this frequency and quantity, a repeatable process is a must.
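The serialization step described above can be illustrated with a toy example: a miniature, hypothetical codelist (two real NY codelist terms, but a made-up three-column layout) written out as CSV, one of the download formats mentioned:

```python
import csv
import io

# Hypothetical miniature codelist, serialized to CSV the way a quarterly
# CT publication derives flat files from the structured source data.
terms = [
    {"Code": "C49487", "Codelist": "NY", "Value": "Y"},
    {"Code": "C49488", "Codelist": "NY", "Value": "N"},
]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["Code", "Codelist", "Value"])
writer.writeheader()
writer.writerows(terms)

print(buf.getvalue())
```

Because the source is structured, the same rows could just as easily be serialized to Excel, PDF, or any other target; that is the repeatability the paragraph describes.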

Technology plays a significant role when working with data of varying structuredness. An overall data architecture is a strategy that must be well articulated, as it impacts storage, analytics, accessibility, and discoverability. Computing evolves at a rapid pace. Cloud computing especially has created many choices in the marketplace for different kinds of data: RDBMS, graph databases, document stores, multi-model databases. Long gone is the era when all data went into one single enterprise data store. Database-as-a-service has become a hot commodity, offering unprecedented flexibility, scalability, and modularity. Implementers today can choose to put their data in the right container for the right purpose.

In conclusion, today's CDISC products mostly fall into the semi-structured and unstructured categories. A data strategy developed with stakeholders around accessibility and usability should heavily influence the degree of structuredness in the kinds of data being managed, especially for novel content. Decision makers need to factor in the strengths and weaknesses of each technology offering. Trade-offs may be inevitable: sacrifice end-to-end interoperability for a simplified data store? Prefer repeatability over schema flexibility? Immediate access or deferred availability? None of these questions has one answer across the varying degrees of data structuredness.

[1] NCI Thesaurus Downloads: https://evs.nci.nih.gov/evs-download/thesaurus-downloads


Telephone (Chinese whispers or whisper down the lane, among other common name variations) is a children's game in which the first child whispers a phrase to the next child, and so on down the line. When it reaches the end, the last child reveals the phrase she heard to the entire group. The amusement comes from witnessing how the original phrase becomes increasingly distorted with each pass, especially when the game is played with an obscure phrase.

Variables are a common building block in all CDISC foundational standards. Their properties noticeably vary from one standard to another. For example, Variable Label, a descriptive text, is common across CDASH (data standards for collection), SEND/SDTM (for aggregation), and ADaM (for analysis). On the other hand, Role, which states how a variable functions in a given dataset, is a property unique to SDTM and the Implementation Guides it supports. This distinction in variable properties is reasonable, as each foundational standard addresses a specific purpose in the clinical data lifecycle. Nonetheless, variables must be accompanied by a Definition. The purpose of a Definition is to state, in a descriptive statement, the essential meaning of a variable precisely and unambiguously.

Similarly, in a game of telephone, the more comprehensible a phrase to the players, the better chance it retains the original form at the end. However, an obscure phrase will very likely finish as gobbledygook.

This is not to say Definitions are non-existent. Metadata tables published alongside the latest CDASH Model and Implementation Guide contain a column labeled Draft CDASH Definition for every variable. Section 3.5.1 of the CDASH Implementation Guide v2.0 (https://www.cdisc.org/standards/foundational/cdash/cdash-20#Bookmark18) explains that the CDASH team will harmonize its definitions with SDTM in the future.

Checking recent SDTM publications, the CDISC Notes column in the Implementation Guides (Description in the Model) comes closest to Definition, though it is not a direct match. Unlike Variable Label, which has a character-length limit due to regulatory data submission requirements, CDISC Notes has no such restriction. Due to its free-form nature, CDISC Notes contains a variety of useful information. It may contain explanatory text, e.g., 'Characterization of the duration of a biological process resulting in a particular finding.' It may contain data examples, e.g., 'Examples: "ng", "mg", or "mg/kg".' It may contain data rules, e.g., 'If MHTERM is modified to facilitate coding, then MHMODIFY will contain the modified text.' It may contain usage rules, e.g., 'When dosing of a treatment is recorded over multiple successive records, this variable is applicable only for the (chronologically) last record for the treatment.' Extracting Definition from CDISC Notes would disentangle these other properties and elucidate their purposes, such as rules, usages, and examples.

Data flow from one stage to another in a data lifecycle; sources and targets formulate data lineage. Data rules may be added to describe manipulations such as imputation, derivation, and transformation. It is a fundamental principle not to keep the same variable name when a manipulation between lifecycle stages changes the variable's meaning or any of its properties. Applied to CDISC foundational standards, a variable of the same name used in two lifecycle stages, e.g., from collection (CDASH) to aggregation (SEND/SDTM), ought to have the same essential meaning. The harmonization effort anticipated by the CDASH team embodies this principle.

Usability of data increases when they are dependable, and good definitions are key to dependable data. People reasonably expect a high degree of permanence once definitions are established; therefore, changes to definitions must be made judiciously. This requires active governance, with an effective team of experts charged with establishing and safeguarding good definitions.

Other suggestions for good definitions are:

  • Specify valid values, but in a separate property that complements the definition. Valid values could mean codelists, value lists, or external dictionaries. This is because data with different sets of valid values, more often than not, represent different (close, far, or unrelated) concepts.

  • Be type aware. A general concept may have specific usage with added constraints such as type. For example, a U.S. ZIP code, belonging to the postal code concept, is a group of five or nine numbers used in conjunction with a postal address to assist the sorting of mail. The definition would be incomplete without making clear that a U.S. ZIP code is numeric.

  • Use common lexicons. Referenceable knowledge eases comprehension. For example, epoch and randomization are words with specific connotations in the clinical context. It is wise to leverage the terms and definitions published in the CDISC Glossary (https://www.cdisc.org/standards/semantics/glossary) to ensure their consistent and correct usage within foundational and therapeutic area standards.

  • Be conservative with the Model, be liberal with the Implementation Guides. Since all variable concepts in an Implementation Guide are created within the confines of the Model it references, variable definitions in the Model should be easily applied to domain-specific variables in an Implementation Guide, provided governance of variable definitions in the Model is robust.

It is important to note that good definitions shall not have example values embedded in their text, to avoid these situations:

  • Perpetual modifications. There have been more than isolated cases of people requesting that example values be added, updated, or removed. These requests, even when fulfilled without altering a variable’s meaning, are modifications nonetheless. As harmless as it may seem, the act resets the state of permanence. A low degree or lack of permanence indicates poor definitions.

  • Conflicting information. Valid values can depend on context. For example, different therapeutic areas may have slightly different requirements for a variable’s valid values. At CDISC, a process has been established to document value subsets, conditional codelists, etc. Therefore, hard-wiring example values into Definition will lead to confusion and sometimes conflict with the actual valid values.

  • Chained definitions. Embedding an example value in a Definition assumes people understand its meaning. When that isn’t the case, they will be required to refer to another variable’s Definition or other sources, causing a chained effect. This chained effect is indicative of an imprecise explanation in the first place.

Certain areas will likely be impacted, to a degree yet to be known, until some mandates about definitions are put in place. Formulating definitions is often more of an art than a science, and the practice could affect the speed of development. Additionally, domains, which are comprised of variables, are versioned in SDTM Implementation Guides; variable definitions could impact the policy of when and how domains are up-versioned. Furthermore, concepts can have wide and narrow scopes (e.g., U.S. ZIP code is a subclass of the postal code concept; injection site is closely related to anatomical location), so a CDISC ontology may emerge over time after a certain volume of well-defined variables is achieved. Last, but not least, some machine-readable mechanism will be a welcome replacement for today’s two-dimensional view for documenting all the intricacies of variable properties.

Putting definitions front and center during all standards development activities will require cross-team debate and consensus. It is undoubtedly going to disrupt current processes and norms. However, enormous benefits await. Not only will variables be disambiguated, variables and data will also no longer be interpreted dissimilarly due to poor or missing definitions. With today’s strong emphasis on linked data (i.e., published data with a strong semantic backbone) and on how cures could be unlocked by cross-domain data analyses, dependable data sources are a basic requirement to even begin tapping into that power.

In closing, let's imagine the contrasting results of playing a game of telephone with these two versions of definitions for the variable Planned Arm Code (ARMCD). Unless all players have superior photographic memory, Game 1 will undoubtedly wind up far from the original. The succinct version in Game 2, in contrast, will likely survive the game with little to no distortion.

Game 1: CDISC Notes as published in SDTMIG v3.2

ARMCD is limited to 20 characters and does not have special character restrictions. The maximum length of ARMCD is longer than for other “short” variables to accommodate the kind of values that are likely to be needed for crossover trials. For example, if ARMCD values for a seven-period crossover were constructed using two-character abbreviations for each treatment and separating hyphens, the length of ARMCD values would be 20.

Game 2: Definition as proposed by the Variable Definitions team

A short sequence of characters that represents the planned arm to which the subject was assigned.

I have established, in this post, the rationale for extracting and maintaining formal definitions for CDISC variables. At the same time, opportunities exist to untangle the intricate variable properties from CDISC Notes or Description, bringing normative information such as rules, usages, and other standards conformance details to light. Further, I have proposed that variable definitions won’t be just another piece of information, but will serve an essential role in both governance and semantics. Lastly, I have recognized the risks and challenges of this disruptive proposal, though those concerns are outweighed by the many benefits. I hope to see your comments and debates on this topic.

Acknowledgements

I want to express my appreciation to my CDISC SHARE colleagues for their points of view on this topic, which inspired me to author this blog entry: Dr. Sam Hume, Darcy Wold, Julie Chason, Dr. Lauren Becnel, and Frederik Malfait. Special thanks to Erin Muhlbradt, Ph.D., of NCI Enterprise Vocabulary Services, for sharing her experience and insights in controlled terminology curation, as well as for proofreading this content.

Earlier this week, the curators released the metadata for SDTM v1.6 Final (https://www.cdisc.org/standards/foundational/sdtm) and SENDIG-DART v1.1 Final (https://www.cdisc.org/standards/foundational/send) to CDISC SHARE Exports (https://www.cdisc.org/members-only/share-exports). DART, an acronym for Developmental and Reproductive Toxicology, refers to study data typically found in embryo-fetal developmental toxicity studies. CDISC SHARE Exports is a part of the CDISC website where CDISC members can freely access static metadata output from SHARE. Please contact our Membership team for assistance with access (https://www.cdisc.org/contact).

Released together with the aforementioned metadata are Diff reports. Diff, shorthand for differences, is a computing term for detecting and displaying the alterations between two files. Changes made to SDTM v1.6 from its predecessor, such as new variables and updated properties, are detailed in the report. Another Diff report compares v1.1 to v1.0 of SENDIG-DART. SENDIG-DART v1.0 was published in August 2016 with a Provisional status. This second Diff is a useful tool to aid patching or updating to the Final version.

Curation for SENDIG-DART's metadata was particularly challenging; the SENDIG-DART is intended to be used in close concert with the SENDIG v3.1. The SENDIG-DART supplements the SENDIG with seven (7) new domains, spanning special purpose, findings, and trial design classes. The majority of the changes, however, lie in Section 7 - Changes to Existing Domains (https://www.cdisc.org/system/files/members/standard/foundational/send/SENDIG%20DART%20v1.1.pdf#page=38) and Appendix C - Repro Phase Day Timing Variables (https://www.cdisc.org/system/files/members/standard/foundational/send/SENDIG%20DART%20v1.1.pdf#page=45). These sections document modifications, of varying degrees, to sixteen (16) other domains for DART studies. The modifications are written primarily for human consumption, which poses the challenge. Extra time was necessary to carefully curate the full domain specifications for the affected domains. Curation mostly entailed interpreting textual information in multiple subsections (e.g., “After RFENDTC, before RFXENDTC”, “BWBLFL variable is Permissible for DART studies”, etc.), then stitching them back to the domain specification tables in SENDIG v3.1.

To ensure accurate interpretation and complete coverage, the curators arranged a special quality review cycle with the SEND Leadership Team. During that review, not only were a few corrections advised, but a last-minute update to the SENDIG-DART document was also warranted. The entire curation process took about three and a half months.

The metadata posted onto CDISC SHARE Exports would be impossible without the eyes and expertise of all parties involved. The end result is worthwhile, as fully vetted metadata (Excel and Define-XML v2.0 for SENDIG-DART) are accessible to all CDISC members. They are a significant time saver, replacing an otherwise tedious process that each sponsor company would have to undertake.

On a different note, it is opportune to summarize other metadata-related work completed in 2017. The SHARE team conducted public input periods for both the SHARE 2.0 Model (Q3) and the SHARE 2.0 API (Q4). This is important work laying a strong foundation for delivering CDISC metadata in machine-readable formats. They are the backbone for structured metadata, tools, and automation, so that CDISC will be able to make new metadata available much more rapidly. We are starting to reap benefits from several homegrown tools: 1) CRF Maker, which uses metadata to automate CRF annotations for use in Therapeutic Area User Guides; 2) QRS Maker, which generates questionnaire datasets with both question and result metadata; 3) Spec Grabber, which scrapes properly structured dataset specification tables in the CDISC WIKI and produces machine-readable metadata. This is an area where we will certainly do a lot more in 2018.

For coming attractions, there is development work this community may anticipate during Q1 and Q2 next year. Sample metadata will be loaded into our API specification development platform, hosted on SwaggerHub (https://app.swaggerhub.com/apis/CDISC1/share-2.0). This will greatly enhance the API testing experience, especially for prospective users experimenting with the RESTful API. A SHARE 2.0 soft launch, of both the Model and the API, is projected for late Q2, with a good set of popular versioned standards included. If time permits, HL7 FHIR mappings to CDASH and SDTM, limited to concepts covered in CDISC’s Diabetes Therapeutic Area User Guide, could be in scope. More details about the soft launch will be shared once they become available.

Lastly, thank you for your partnership in 2017. We truly believe CDISC members drive global standards.

May you have a restful holiday and a rewarding new year to come.

“Is SHARE available now?”
“How do I access eSHARE?”
“What kind of metadata is available in SHARE now?”
“What can I do with SHARE metadata?”
“What is new in SHARE for 2016?”

This is a sample of frequently asked questions at the SHARE booth during the Interchange; the SHARE and eSource demonstration booths were conveniently situated outside the conference break room, where attendees enjoyed their meals and refreshments. With guaranteed foot traffic, the location maximized our opportunity to socialize SHARE’s values and benefits.

We set up a TV monitor at the SHARE booth, looping a presentation slide deck that details what the team accomplished in 2015 and planned for 2016. To the right was the SHARE poster set on an easel. The CDISC model on the poster facilitated many conversations. It diagrams the lay of the land of all CDISC products in a simple way. It depicts product relationships and identifies SHARE deliverables in a single bird’s-eye view.

We also conducted a number of unscripted demonstrations. They ranged from signing onto eSHARE and reviewing the different types of metadata offered, to looking under the hood of SHARE and introducing the SHARE WIKI. The curious audience was often charmed by Semantics Manager, the front-end interface of SHARE, especially how it intuitively displays the inter-connection of CDASH, SDTM, ADaM, Controlled Terminology, BRIDG, and ISO 21090.

Many people were very interested in our biomedical concept development. They believed metadata such as protocol elements (e.g., objectives, endpoints) and value-level metadata (e.g., variables, codelist subsets, and value lists) would help streamline processes and enable software automation. Judging from the conversations, late-stage data conversion and retrospective Define.xml creation still run rampant in the industry. SHARE can certainly do more to influence a paradigm shift so people will no longer need to tackle information chaos with “black box” operations using archaic technology.

In closing, please watch our eSHARE Gold Member Rollout webinar, recorded on 2015-09-10, from the CDISC webinar archive. It gives a thorough overview of eSHARE. Furthermore, this short iSHARE video is the first of a mini-series designed to show how the metadata repository manages our standards.

The Standards Review Council (SRC) recently reviewed the SDTM conformance rules ("Rules") produced by the SDTMV. After painstakingly combing through SDTM v1.4 and SDTMIG v3.2, the team identified 400+ rule candidates. At the time of this blog post, the SRC is working with the sub-team to address some reviewer comments before making the package available for Public Review. As you can preview here, the construct is not very different from those published in the FDA SDTM Validation Rules and by the OpenCDISC Community: each Rule has an identifier, a context, a rule description in pre-specified lexicons, a condition, and a citation of the rule's source.

As a Metadata Curator, I need to ask myself what the Rules mean to SHARE as metadata. The text and description are, by definition, not metadata. Extra steps are needed to tease out the metadata. I thought I would first illustrate a typical rule construct, or model, shown here:
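A minimal Python sketch of that rule construct may help; the class and field names below are my own assumptions drawn from the description above, not a SHARE schema, and the instance data is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class ConformanceRule:
    """Sketch of the rule construct: identifier, context, description,
    condition, and source citation (names assumed, not SHARE metadata)."""
    identifier: str   # unique rule ID
    context: str      # where the rule applies, e.g., a class or domain
    description: str  # rule text in pre-specified lexicons
    condition: str    # when the rule applies, if conditional
    citation: str     # source of the rule in SDTM/SDTMIG

# A hypothetical instance, loosely based on the --TESTCD length rule.
rule = ConformanceRule(
    identifier="RULE-001",
    context="Findings",
    description="--TESTCD length must be less than or equal to 8",
    condition="",
    citation="SDTMIG v3.2",
)
print(rule.identifier, rule.context)
```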

 

 

Furthermore, I formulated these objectives to help me devise solutions (my philosophy for innovating: first understand the what's before bothering with the how's):

  1. A rule may not be limited to so-called validation; it may cover a wide variety of constraints such as referential integrity, data ranges, data derivations, inter-dependencies, etc.
  2. A rule may be direct, i.e., used to express a piece of data in some data elements, a data element in some classes, or some classes within a model.
  3. Conversely, a rule may be descriptive, whose description or condition is not directly related to the data, data elements, classes, or models.
  4. Rule metadata may be platform independent and vendor neutral.
  5. Rule metadata may be machine-computable, and at the same time may be complemented by natural language. This and Objective #4 together mean the rule metadata may not be machine-executable.
  6. Rule metadata may allow third party players such as vendors, pharmaceutical companies, regulatory agencies, etc., to consume a common set of constraints and enable them to do what they want with it.
  7. Rule metadata may eliminate the undesirable ambiguity often found in natural language descriptions of rules.

Additionally, I self-imposed some scoping limitations, i.e., a list of "won't do's" to keep implementation simple so this can be completed within a reasonable amount of time:

  1. Shall not invent a pseudocode mechanism. By the same token, shall not invent new grammars and expressions.
  2. Shall not invent software.
  3. Shall not need to cover all Rules, recognizing not everything will fit perfectly.

Having done some research, along with input from volunteers and peers, I found two choices available. Both are open standards and fit my objectives:

  1. HL7 GELLO, most current version Release 2
  2. Object Management Group (OMG) Object Constraint Language (OCL), most current version 2.4

At first, I found HL7 GELLO fascinating, supporting a huge range of medical and healthcare data. After all, it is designed for clinical decision support. That said, because it requires understanding the HL7 RIM and a specialized toolset, it would be very difficult to find a sustainable workforce to develop and maintain rules using the GELLO framework.

A little more research revealed that GELLO is in fact based on OMG OCL. Here are a few characteristics that resonate with me and my objectives:

  1. It is aligned with both Unified Modeling Language (UML) and Meta Object Facility (MOF). It is often described as partner standard to UML. It is fair to say you can't have UML without OCL.
  2. It is object-oriented, meaning constraints are placed directly onto objects and classes in a model.
  3. It has inheritance, therefore model expansions by instantiating new objects from classes will carry over existing constraints.
  4. Context matters. An OCL must declare context where the constraints apply. This is an essential element to achieve disambiguation. For example, context can be an object, a class, a set, a tuple (such as value list), an association, etc.
  5. It is assertion based, most noticeably through the invariant constraint type. Invariants are conditions that must be true. This coincides with the SDTM Validation Sub-Team's approach, which emphasizes the "positives."
  6. It supports both pre- and post-conditions. They will be useful to implement conditions information in Rules.
  7. It is supported by a number of commercial off-the-shelf (e.g., Enterprise Architect, MagicDraw) and open-source software (e.g., Eclipse).

This diagram nicely depicts the information architecture we use and how the CDISC product family stacks up in terms of the overall model framework.

With all that said and illustrated, OMG OCL is the obvious choice for me. UML, and hence OCL, is the next logical step to further (and complete) the architectural blueprint.

I have only recently begun studying the OCL specifications to solidify my thinking. I hope the little work I attempted helps demonstrate this proposal. Below is a subset of the SDTM Findings class drawn using Enterprise Architect:

I added a couple of OCL constraints to --TESTCD:

  1. --TESTCD, being a topic variable, cannot be null.
  2. --TESTCD has a length less than or equal to 8.
  3. --TESTCD has a data pattern requirement: it contains only uppercase letters, numbers, and underscores, and the first character must be an uppercase letter.

Their OCL expressions are as follows:

Imagine being able to run test data through the whole series of OCL constraints as an exercise to validate their correctness. This would enable us to test example data for validity prior to including it in Implementation Guides or User Guides. As a matter of fact, this is not a far-fetched idea. This YouTube video, posted by the makers of a third-party modeling tool called MagicDraw, adequately demonstrates the power of test automation using OCL functionality. At 6:00, the video shows how easy it is to validate an OCL constraint using some XML data: prepare an XML file guaranteed to trigger a constraint violation, then run it against the rules in compiled Java code and the auto-generated schema file. Pretty nifty.
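As a down-to-earth stand-in for such an exercise, the three --TESTCD constraints can also be prototyped as plain Python checks. This is my own sketch, not CDISC-published validation code:

```python
import re

# Constraint 3 from above: uppercase letters, digits, underscores,
# first character an uppercase letter.
TESTCD_PATTERN = re.compile(r"^[A-Z][A-Z0-9_]*$")

def validate_testcd(value):
    """Return the list of violated constraints for a --TESTCD value."""
    violations = []
    if not value:  # constraint 1: topic variable cannot be null
        return ["--TESTCD cannot be null"]
    if len(value) > 8:  # constraint 2: length <= 8
        violations.append("--TESTCD length must be <= 8")
    if not TESTCD_PATTERN.match(value):  # constraint 3: data pattern
        violations.append("--TESTCD must match [A-Z][A-Z0-9_]*")
    return violations

print(validate_testcd("SYSBP"))       # → []
print(validate_testcd("systolicbp"))  # two violations
```

Running conforming and non-conforming values through checks like these is exactly the kind of constraint self-test the MagicDraw demonstration automates with OCL.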

The vision of this proposal:

  1. CDISC will treat and maintain the SDTM as a true data model.
  2. Model constraints such as rules, derivations, and domain inter-dependencies will be normative and be part of each point release.
  3. SDTMIG will be drastically less voluminous. Much of the text aimed at explaining implementation will be replaced by rules and similar constraint metadata.
  4. Additional SHARE automation can be had through meta-metadata and its constraints.

In conclusion, SHARE brings a certain discipline and conduct to the standards development process. Engineering SDTM with a UML model and refitting validation rules using OCL are not only logical, but essential to lead the industry with technical innovation. Furthermore, this will address many model and implementation ambiguities that currently exist. Lastly, I'd like to make a call for volunteers to further this proposal. Perhaps a proof-of-concept project could create a testbed for applying model constraints and rules metadata toward submission data validation and other uses.

 

Earlier last week, my colleague shared a presentation in which he proposed the idea of "embedded keys" (at 01:45). What would James, the fictitious forms designer, do to build a data capture form with these embedded keys?

Let's use the same vital signs example. First, we have a few widgets for the physical layer, i.e., variables. Together they form a wireframe of a vital signs data collection module. Each variable has an NCI c-code, a name, and a definition:

Swivel Chair - Variables

Next, we have several widgets for the medical concepts surrounding blood pressure. Similar to the variable widgets, they also have an NCI c-code, a name, and a definition:

Swivel Chair - Concepts

In the above diagram, notice how the c-code for Blood Pressure and Vital Signs begins with CUI, or Concept Unique Identifier. In fact, in the NCI Metathesaurus, these concepts come pre-assembled with relationships, such that:

Swivel Chair - Concept relationship

The last set of widgets is the controlled terminology.

Swivel Chair - Codelists and codes

We have all the widgets, allowing us to assemble a well-defined capture form with all necessary "embedded keys".

Swivel Chair - Wireframe assembly
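As a rough sketch of that assembly, each widget and the assembled form field can be represented as simple dictionaries. All codes and definitions below are placeholders of my own, not actual NCIm content:

```python
def make_widget(c_code, name, definition):
    """One building block: every widget carries its embedded key (a c-code),
    a name, and a definition."""
    return {"c_code": c_code, "name": name, "definition": definition}

# Placeholder values for illustration only (not real NCI c-codes).
variable = make_widget("C-VAR-1", "VSORRES", "Collected result of the measurement")
concept = make_widget("C-CON-1", "Blood Pressure", "Pressure exerted by circulating blood")
codelist = make_widget("C-CL-1", "VSTESTCD", "Vital signs test codes")

# The assembled form field ties the physical variable to its concept and
# terminology via the embedded keys.
form_field = {"variable": variable, "concept": concept, "codelist": codelist}
print(form_field["concept"]["c_code"])
```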

In conclusion, the "embedded keys" will not only serve as the conduit to the Healthcare Link side of interoperability, they can also be used to tie conceptual ideas (knowledge) to physical implementations (SDTM). This way, the verbiage used on a CRF doesn't really matter, because the c-codes uphold the underlying semantics.


Introduction

CDISC Controlled Terminology (CT) maintains a codelist of units of measurement (codelist code C71620, short name UNIT). It is used to represent values for unit variables in various domains, such as demographics (AGEU), concomitant medications (CMDOSU), labs (LBORRESU, LBSTRESU), and vital signs (VSORRESU, VSSTRESU). Note: AGEU is a codelist subset of the UNIT superset.

Indiana University School of Medicine's Regenstrief Institute develops the Unified Code for Units of Measure (UCUM). It is a code system of units intended to be unambiguous to both humans and machines. It has many applications in the life sciences, such as EHRs, EDI, and HL7 electronic messaging. Logical Observation Identifiers Names and Codes (LOINC) is another code system that incorporates UCUM.

Even though many terms are identical between CDISC and UCUM, there are differences. For example, millimeter of mercury, a unit commonly used to measure blood pressure, is mmHg in CDISC, while mm[Hg] in UCUM. Here are a few examples showing differing values:

Table 1: CDISC CT vs. UCUM

Long Label            | CDISC CT | UCUM
Cells per Microliter  | cells/uL | {Cells}/uL
Hour                  | HOURS    | h
Joule                 | Joule    | J
Millimeter of Mercury | mmHg     | mm[Hg]
Millisecond           | msec     | ms
Pound                 | LB       | [lb_av]
Tablet Dosing Unit    | TABLET   | {tbl}

Therefore, a mapping between the two codelists helps any two heterogeneous systems interoperate and exchange data successfully.
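To make this concrete, here is a toy Python lookup seeded only with the rows of Table 1 above; a real mapping would of course cover the full codelists:

```python
# CDISC CT submission value → UCUM code, from Table 1 above.
CDISC_TO_UCUM = {
    "cells/uL": "{Cells}/uL",
    "HOURS": "h",
    "Joule": "J",
    "mmHg": "mm[Hg]",
    "msec": "ms",
    "LB": "[lb_av]",
    "TABLET": "{tbl}",
}

def to_ucum(cdisc_unit):
    """Translate a CDISC unit to UCUM; fall back to the input when unmapped,
    since many terms are identical between the two systems."""
    return CDISC_TO_UCUM.get(cdisc_unit, cdisc_unit)

print(to_ucum("mmHg"))  # → mm[Hg]
```

The rest of this post is about building such a mapping programmatically, rather than by hand.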

NCI EVS Resources

CDISC and NCI EVS have long been partners in curating and registering CDISC controlled terminologies in the NCI Thesaurus (NCIt), and hence the NCI Metathesaurus (NCIm). As a matter of fact, a careful examination of NCIm reveals that a relationship between CDISC and UCUM exists. A search for the term millimeter of mercury shows the evidence (source: http://1.usa.gov/1zoRA96):

At this point, we have gathered the following about the NCI EVS:

  • All CDISC CT can be retrieved from the NCIt browser
  • NCIt contains biomedical knowledge from multiple sources, e.g., CDISC, UCUM, SNOMED, etc.
  • Relationships between sources are maintained, where applicable

Thesaurus OWL/RDF

Despite being off to a good start, manual lookup via the NCIt browser would be too tedious to be useful. Upon further research, the NCI Center for Biomedical Informatics and Information Technology (CBIIT) publishes the NCIt in OWL/RDF format on a regular basis.

With an OWL/RDF file at our disposal, SPARQL is the tool to do some graph-based data analyses.

The following is a snippet from the Thesaurus OWL/RDF file, showing how it represents metadata for the term millimeter of mercury:

Millimeter of Mercury from Thesaurus.OWL
<!-- http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Millimeter_of_Mercury -->

<owl:Class rdf:about="#C49670">
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Millimeter of Mercury</rdfs:label>
    <rdfs:subClassOf rdf:resource="#C67332"/>
    <P97 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>A non-SI unit of pressure equal to 133,332 Pa or 1.316E10-3 standard atmosphere. Use of this unit is generally deprecated by ISO and IUPAC.</ncicp:def-definition><ncicp:def-source>NCI</ncicp:def-source></ncicp:ComplexDefinition></P97>
    <P325 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>A unit of pressure equal to 0.001316 atmosphere and equal to the pressure indicated by one millimeter rise of mercury in a barometer at the Earth's surface. (NCI)</ncicp:def-definition><ncicp:def-source>CDISC</ncicp:def-source></ncicp:ComplexDefinition></P325>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>mm[Hg]</ncicp:term-name><ncicp:term-group>AB</ncicp:term-group><ncicp:term-source>UCUM</ncicp:term-source></ncicp:ComplexTerm></P90>
    <code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">C49670</code>
    <P108 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Millimeter of Mercury</P108>
    <A8 rdf:resource="http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71620"/>
</owl:Class>

To de-reference the pseudo-properties P90 (Synonym with Source Data) and P108 (Preferred Name) above:

Definition of the Pseudo-Property P90
<!-- http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#FULL_SYN -->

<owl:DatatypeProperty rdf:about="#P90">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#AnnotationProperty"/>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FULL_SYN</rdfs:label>
    <P97 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>Fully qualified synonym, contains the string, term type, source, and an optional source code if appropriate. Each subfield is deliniated to facilitate interpretation by software.</ncicp:def-definition><ncicp:def-source>NCI</ncicp:def-source></ncicp:ComplexDefinition></P97>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>FULL_SYN</ncicp:term-name><ncicp:term-group>PT</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Synonym with Source Data</ncicp:term-name><ncicp:term-group>SY</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P106 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Conceptual Entity</P106>
    <P108 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FULL_SYN</P108>
    <code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">P90</code>
    <P107 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Term &amp; Source Data</P107>
</owl:DatatypeProperty>
Definition of the Pseudo-Property P108
<!-- http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Preferred_Name -->

<owl:AnnotationProperty rdf:about="#P108">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#DatatypeProperty"/>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Preferred_Name</rdfs:label>
    <P97 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>The word or phrase that NCI uses by preference to refer to the concept.</ncicp:def-definition><ncicp:def-source>NCI</ncicp:def-source></ncicp:ComplexDefinition></P97>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Preferred Name</ncicp:term-name><ncicp:term-group>SY</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Preferred Term</ncicp:term-name><ncicp:term-group>SY</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Preferred_Name</ncicp:term-name><ncicp:term-group>PT</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P106 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Conceptual Entity</P106>
    <code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">P108</code>
    <P107 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Preferred Name</P107>
    <P108 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Preferred_Name</P108>
</owl:AnnotationProperty>

An explanation for the above OWL snippet:

  • Line #3 is the beginning of the owl:Class C49670, which, by no accident, is the c-code for millimeter of mercury we are accustomed to seeing on the CDISC CT spreadsheet.
  • Line #11 shows that C49670 is associated with another class, C71620, which is the UNIT codelist itself.
  • On line #9, it contains the c-code as a property.
  • On line #10, it contains the NCI preferred name of C49670.
  • On line #8, it contains the UCUM mapping. Note that the value AB in the XML tag ncicp:term-group signifies the entry is about unit symbols used by UCUM. Other values may appear, but do not relate to this demonstration.
  • Lines #6-7 contain definitions of the class.
  • Lines #15-41 contain definitions of the pseudo-properties P90 and P108 for easy reference. There are many other pseudo-objects in the Thesaurus OWL/RDF files.

With the above sample depicting the model, the following is a SPARQL query for obtaining a list of objects having a UCUM mapping:

SPARQL for Extracting UCUM Mappings
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
 
SELECT ?conceptCode ?preferredName (STRAFTER(STRBEFORE(STR(?synonym), "</ncicp:term-name>"), "<ncicp:term-name>") as ?ucum)
FROM <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl>
WHERE {
    ?class a owl:Class ;
        nci:code ?conceptCode ;
        nci:P108 ?preferredName ;
        nci:P90 ?synonym .
        FILTER(STRAFTER(STRBEFORE(STR(?synonym), "</ncicp:term-source>"), "<ncicp:term-source>") = "UCUM")
        FILTER(STRAFTER(STRBEFORE(STR(?synonym), "</ncicp:term-group>"), "<ncicp:term-group>") = "AB")
}
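To make the two FILTER expressions concrete, here is a stdlib-only Python equivalent of the STRAFTER(STRBEFORE(...)) idiom, applied to a sample P90 literal abridged from the snippet above:

```python
def str_between(s, open_tag, close_tag):
    """Mirror SPARQL STRAFTER(STRBEFORE(s, close_tag), open_tag):
    return '' when either tag is absent, as the SPARQL functions do."""
    if close_tag not in s:
        return ""
    before = s.split(close_tag, 1)[0]
    if open_tag not in before:
        return ""
    return before.split(open_tag, 1)[1]

# Sample P90 value, abridged from the Thesaurus snippet above.
synonym = ('<ncicp:ComplexTerm><ncicp:term-name>mm[Hg]</ncicp:term-name>'
           '<ncicp:term-group>AB</ncicp:term-group>'
           '<ncicp:term-source>UCUM</ncicp:term-source></ncicp:ComplexTerm>')

# The two FILTERs keep only UCUM abbreviation (AB) entries...
is_ucum_ab = (str_between(synonym, "<ncicp:term-source>", "</ncicp:term-source>") == "UCUM"
              and str_between(synonym, "<ncicp:term-group>", "</ncicp:term-group>") == "AB")
# ...and the SELECT expression extracts the UCUM symbol itself.
ucum = str_between(synonym, "<ncicp:term-name>", "</ncicp:term-name>")
print(is_ucum_ab, ucum)  # → True mm[Hg]
```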

A sample output matching the terms shown in Table 1 above:

Table 2: Partial results from extracting metadata in Thesaurus OWL/RDF using SPARQL

conceptCode | preferredName         | ucum
C67242      | Cells per Microliter  | {Cells}/uL
C25529      | Hour                  | h
C42548      | Joule                 | J
C49670      | Millimeter of Mercury | mm[Hg]
C41140      | Millisecond           | ms
C48531      | Pound                 | [lb_av]
C48542      | Tablet Dosing Unit    | {tbl}

CDISC CT OWL/RDF

Incidentally, all CDISC CT packages are also available in OWL/RDF. The goal is to reduce the UCUM query above to only the entries found on the UNIT (C71620) codelist. Continuing with the flow, this is a snippet from the CDISC CT OWL/RDF for the same term, millimeter of mercury:

Millimeter of Mercury from sdtm-terminology.owl
<CodeList OID="CL.C71620.UNIT" Name="Unit" DataType="text" nciodm:ExtCodeID="C71620" nciodm:CodeListExtensible="Yes">
    <Description>
        <TranslatedText xml:lang="en">Terminology codelist used for units within CDISC.</TranslatedText>
    </Description>
    <EnumeratedItem CodedValue="mmHg" nciodm:ExtCodeID="C49670">
        <nciodm:CDISCSynonym>Millimeter of Mercury</nciodm:CDISCSynonym>
        <nciodm:CDISCDefinition>A unit of pressure equal to 0.001316 atmosphere and equal to the pressure indicated by one millimeter rise of mercury in a barometer at the Earth's surface. (NCI)</nciodm:CDISCDefinition>
        <nciodm:PreferredTerm>Millimeter of Mercury</nciodm:PreferredTerm>
    </EnumeratedItem>
    <nciodm:CDISCSubmissionValue>UNIT</nciodm:CDISCSubmissionValue>
    <nciodm:CDISCSynonym>Unit</nciodm:CDISCSynonym>
    <nciodm:PreferredTerm>CDISC SDTM Unit of Measure Terminology</nciodm:PreferredTerm>
</CodeList>

Unlike the esoteric nature of the Thesaurus OWL/RDF, the CDISC CT one is very straightforward and readable. With that, here is a SPARQL query to extract information such as submission values and their c-codes from the UNIT codelist:

SPARQL for Extracting the UNIT Codelist Submission Values
PREFIX mms: <http://rdf.cdisc.org/mms#>
PREFIX cts: <http://rdf.cdisc.org/ct/schema#>
 
SELECT ?conceptCode ?cdiscSubmissionVal
FROM <http://rdf.cdisc.org/ct/schema>
FROM <http://rdf.cdisc.org/mms>
FROM <http://rdf.cdisc.org/sdtm-terminology>
WHERE {
    ?pv a mms:PermissibleValue ;
        cts:nciCode ?conceptCode ;
        cts:cdiscSubmissionValue ?cdiscSubmissionVal ;
        mms:inValueDomain ?clCcode .
        {
            ?clCcode cts:codelistName ?clName ;
                cts:nciCode "C71620" ;
        }
} 

A sample output matching the terms shown in Tables 1 and 2 above:

Table 3: Partial results from extracting metadata in CDISC OWL/RDF using SPARQL
conceptCode | cdiscSubmissionVal
C67242      | cells/uL
C25529      | HOURS
C42548      | Joule
C48531      | LB
C49670      | mmHg
C41140      | msec
C48542      | TABLET

Final Query and Output

The two result sets can be linked via the individual term's c-code. Therefore, the final query is a combination of the two SPARQL queries above, with a slight adjustment to make the nested queries work efficiently. It yields 143 mappings for 130 unique terms.

  • cdisc_ct_ucum.rq - Final SPARQL text that extracts UCUM information from the Thesaurus and subsets it to the UNIT codelist in CDISC CT
  • cdisc_ct_ucum.txt - Result set in tab-delimited format
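Conceptually, the nested query is an inner join of the two result sets on the c-code. A Python sketch of that join using sample rows from Tables 2 and 3 (single-valued for simplicity; one concept may carry more than one UCUM synonym, which is why there are 143 mappings for 130 unique terms):

```python
# Sample rows: Thesaurus UCUM mappings (Table 2) and UNIT codelist (Table 3)
ucum_rows = {"C49670": "mm[Hg]", "C25529": "h", "C48531": "[lb_av]"}
unit_rows = {"C49670": "mmHg", "C25529": "HOURS", "C48531": "LB"}

# Inner join on the c-code, analogous to nesting the two SPARQL queries
joined = {code: (unit_rows[code], ucum)
          for code, ucum in ucum_rows.items() if code in unit_rows}

assert joined["C49670"] == ("mmHg", "mm[Hg]")
```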

Conclusion

NCI EVS actively maintains a rich repository of terminology and biomedical ontology. Their OWL/RDF offering enables scalable IT solutions to search, link, and combine intricate biomedical concepts. This demonstration illustrates one application of semantic web technology. SPARQL made it easy to analyze over 2,160,000 triples (2,100,000 from the Thesaurus, 60,000 from CDISC CT for SDTM). The more UCUM entries curated by NCI EVS, the more mappings will become available.

End Notes

  1. SPARQL as specified by W3C: http://www.w3.org/2009/sparql/wiki/Main_Page
  2. All SPARQL queries and OWL/RDF files were processed using TopQuadrant TopBraid Composer FE Version 4.4.0.
  3. These file versions are used in this demonstration: NCI Thesaurus 14.10d; and, CDISC CT 2014-09-26
  4. URL to download NCI Thesaurus OWL/RDF: http://cbiit.nci.nih.gov/evs-download/thesaurus-downloads
  5. URL to download CDISC CT OWL/RDF for SDTM: http://evs.nci.nih.gov/ftp1/CDISC/SDTM/SDTM%20Terminology.OWL.zip
Edits: 2015-01-10 Added another offending character, NO-BREAK SPACE

My recent focus has been loading additional content into SHARE to enrich our offerings, such as TAUG and draft SDTM publications published in 4Q2014. I spent a lot of time having to clean up the obstreperous characters in my Word document sources (i.e., the documents used to render the PDF). I feel compelled to write this short blog entry, hoping it will give you a jump start if you happen to perform similar tasks.

The most significant offenders are the so-called smart quotes. Microsoft Word, by default, auto-formats straight quotes to curly quotes, meaning it automatically corrects every time you hit the single- or double-quote key. So, what's wrong with smart quotes and why am I mucking around with them? First of all, they are not used consistently by our volunteer authors, since users can disable the AutoCorrect feature. Second, these characters are not ASCII and can only be understood by software applications that support UTF. Until our industry is more acquainted with XML technologies, forcing UTF would introduce the unnecessary burden of data transport incompatibilities.

Sidebar

SOA Semantics Manager is a web-based application that supports UTF. For SHARE, we use the default configuration option for character set display, which is UTF-8. The backend database uses UTF for character encoding.

Hyphens. Apart from the one on the keyboard, four other flavors have been detected, which are not part of ASCII.

Perhaps the worst kind of obstreperous characters are the non-printables. You know they are there, but you can't see them. Like household parasites, they are hard to detect, tagging along in the copy-paste buffer.

Below is a list of the known offending UTF characters, with the replacement values we apply in SHARE.

Unicode Name (Code Point) → Replacement ASCII Character (Decimal)

LEFT SINGLE QUOTATION MARK (U+2018) → ' (39)
  • Commonly referred to as the left curly single-quote
  • This character is part of Microsoft Word's default AutoFormat setting.
RIGHT SINGLE QUOTATION MARK (U+2019) → ' (39)
  • Commonly referred to as the right curly single-quote
  • This character is part of Microsoft Word's default AutoFormat setting.
LEFT DOUBLE QUOTATION MARK (U+201C) → " (34)
  • Commonly referred to as the left curly double-quote
  • This character is part of Microsoft Word's default AutoFormat setting.
RIGHT DOUBLE QUOTATION MARK (U+201D) → " (34)
  • Commonly referred to as the right curly double-quote
  • This character is part of Microsoft Word's default AutoFormat setting.
NON-BREAKING HYPHEN (U+2011) → - (45)
FIGURE DASH (U+2012) → - (45)
  • Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as ‒
EN DASH (U+2013) → - (45)
  • This character is part of Microsoft Word's default AutoCorrect setting.
  • Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as –
EM DASH (U+2014) → - (45)
  • Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as —
HORIZONTAL ELLIPSIS (U+2026) → ... (46, 3 times)
  • This character is part of Microsoft Word's default AutoCorrect setting.
  • Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as …
ZERO WIDTH SPACE (U+200B) → null (0)
  • Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as ​
  • This is a non-printable character. Microsoft Word uses it to represent optional breaks. It is visible after enabling the Show Formatting Symbols option:

An example with text:

NO-BREAK SPACE (U+00A0) → whitespace (32)

  • Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as ∩┐╜
  • Although it is a printable character, Microsoft Word uses it to represent a nonbreaking space, disguised as regular whitespace. It is visible after enabling the Show Formatting Symbols option:

An example with text:

Image credit: FileFormat.info

Lastly, here is a snippet of Perl code that implements the above table:

Transform Unicode
sub transformUnicode {
    my @input = @_;

    for (@input) {
        s/\x{2018}/'/g; s/\x{2019}/'/g; # left and right curly single-quote
        s/\x{201c}/"/g; s/\x{201d}/"/g; # left and right curly double-quote

        # all kinds of hyphens/dashes
        s/\x{2011}/-/g; # Non-breaking hyphen
        s/\x{2012}/-/g; # Figure dash
        s/\x{2013}/-/g; # En dash
        s/\x{2014}/-/g; # Em dash

        s/\x{2026}/.../g; # Horizontal ellipsis

        # all kinds of spaces
        s/\x{00a0}/ /g; # No-break space, a.k.a. Microsoft Word's nonbreaking space
        s/\x{200b}//g;  # Zero width space, a.k.a. Microsoft Word's optional break
    }

    return wantarray ? @input : $input[0];
}
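For readers who prefer Python, the same replacement table can be expressed with str.translate -- a sketch equivalent in spirit to the Perl sub above, not part of the SHARE toolchain:

```python
# Code point -> replacement string; None deletes the character
UNICODE_FIXES = {
    0x2018: "'", 0x2019: "'",   # left and right curly single-quote
    0x201C: '"', 0x201D: '"',   # left and right curly double-quote
    0x2011: "-", 0x2012: "-",   # non-breaking hyphen, figure dash
    0x2013: "-", 0x2014: "-",   # en dash, em dash
    0x2026: "...",              # horizontal ellipsis
    0x00A0: " ",                # no-break space
    0x200B: None,               # zero width space (removed)
}

def transform_unicode(text: str) -> str:
    return text.translate(UNICODE_FIXES)

assert transform_unicode("\u201cGood\u00a0idea\u201d\u200b\u2026") == '"Good idea"...'
```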

For reasons such as reporting, analyses, and common conventions, one can be motivated to represent a value in multiple ways to serve different functional purposes. One must not, though, neglect to preserve each value's semantic meaning in the process. In this blog entry, I will describe how we accomplish this goal in the SHARE MDR.

Let's start with an anatomy of metadata: a CDISC SDTM domain or ADaM dataset includes variables. A variable is represented either by a CDISC Controlled Terminology codelist or by a value format. For example, the ADaM ADAE (adverse event analysis) dataset includes the AESEV (Severity/Intensity) variable, which is represented by the AESEV (CDISC SDTM Severity Intensity Scale for Adverse Event Terminology) codelist.[1] This AESEV codelist has 3 permissible values -- MILD, MODERATE, and SEVERE -- each with a specific meaning.[2]

This ISO 11179 metamodel region[3] can be used to generalize these metadata components:

Therefore, after annotating the above diagram with the AESEV example, we have:

ADAE.AESEV has a numerical sibling, ADAE.AESEVN. Its CDISC Notes read "Code AE.AESEV to numeric," followed by "Low intensity should correspond to low value." Most importantly, its representation is "1, 2, 3". From that information, we can deduce that 1 corresponds to MILD, 2 to MODERATE, and 3 to SEVERE. In other words, 1 and MILD share the same value meaning, such that:

The remaining steps to do in the MDR are as follows:

  1. Proceed with building the remaining two permissible values
  2. Ensure each permissible value is part of the same concept
  3. Associate them to their respective codelists
  4. Assign the variables these codelists represent
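The shared value meanings can be sketched as a simple lookup -- a hypothetical structure for illustration, not SHARE's actual model:

```python
# Each value meaning is shared by two representations (hypothetical sketch)
VALUE_MEANINGS = {
    "mild severity":     {"AESEV": "MILD",     "AESEVN": 1},
    "moderate severity": {"AESEV": "MODERATE", "AESEVN": 2},
    "severe severity":   {"AESEV": "SEVERE",   "AESEVN": 3},
}

def to_numeric(aesev: str) -> int:
    """Map a coded AESEV value to its AESEVN sibling via the shared meaning."""
    for meaning in VALUE_MEANINGS.values():
        if meaning["AESEV"] == aesev:
            return meaning["AESEVN"]
    raise KeyError(aesev)

assert to_numeric("MODERATE") == 2
```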

The picture after stitching all these steps looks like this:

Notice, to the right of the picture, how the AESEV (C66769) codelist is associated with variables from 2 foundational standards? Therefore, at the physical layer, there is no ambiguity: these variables in 2 datasets share the same semantic content, underscoring reusability.

In conclusion, ISO 11179 Part 3's concept region outlines a relationship that unifies concepts, terms, codelists, and semantics in an MDR. In CDISC standards, though, it is rare to see multiple value domain sets having the same value meanings. This scenario is much more likely to exist in a pharma company. The ISO 3166 country codes are a good example with multiple value domains: e.g., the long name for site management using a CTMS, the alpha-3 code for submission data (specifically SDTM.DM.COUNTRY), etc.

References

[1] Analysis Data Model (ADaM) Data Structure for Adverse Event Analysis. Section 4.1.8, pg 18. http://www.cdisc.org/system/files/members/standard/adam_ae_final_v1.pdf

[2] Value set for CDISC SDTM Severity Intensity Scale for Adverse Event Terminology. NCI Thesaurus. National Cancer Institute. http://ncit.nci.nih.gov/ncitbrowser/ajax?action=values&vsd_uri=http://evs.nci.nih.gov/valueset/C66769

[3] ISO/IEC11179-3 Information technology — Metadata registries (MDR) — Part 3: Registry metamodel and basic attributes. Third edition 2013-02-15. http://metadata-standards.org/11179/index.html

Taking a cross-country flight from California to Dulles allows me ample time to write a blog entry. What else would I do when there is no SkyMall magazine to entertain me? Besides, Pittsburgh is getting crushed by the Jets. At least in the first quarter, anyway.

So, RCMap is the topic of this blog. The best way to give you a preview is to export the relevant components as web pages. It is about the low-density lipoprotein lab test, or LDL. I could not get tooltips to show. Please take a mental note that the tool does show extra metadata as tooltips when I hover over the bubbles, especially the Controlled Terminology ones. It looks like this:

I recently studied the SHARE metadata display templates (MDT) attached to two Therapeutic Area User Guides (TAUG), Asthma and Diabetes. My first reaction: "Oh no, more spreadsheets." I blogged here explaining how loading metadata from spreadsheets wasn't as easy as it seemed. Creating research concepts in spreadsheet format adds a new dimension of complexity. But, before we go deep into this blog topic, first please know I recognize the MDTs in the aforementioned TAUGs are prototypes, and I acknowledge a great deal of effort was put into laying a foundation for concept creation. Second, it helps to describe the methodology as I understood it: these templates use multiple data element concepts (DEC) to describe a group of related research concepts. Each DEC consists of a BRIDG class and attribute, an ISO 21090 data type, an SDTM variable, and CDISC Controlled Terminology. A single research concept can be created by instantiating from an MDT, subtracting irrelevant DECs, and constraining with additional controlled terms.
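The instantiate-subtract-constrain methodology can be sketched as set operations; the DEC names and terms below are hypothetical, for illustration only:

```python
# A metadata display template as a set of DECs, each carrying permitted
# controlled terms (hypothetical names and values)
mdt = {
    "Lab Test Name": {"GLUC", "LDL", "HDL"},     # constrained by CT
    "Lab Result":    {"*"},                      # free numeric result
    "Specimen Type": {"SERUM", "PLASMA", "URINE"},
}

def instantiate(template, drop, constraints):
    """Copy the MDT, subtract irrelevant DECs, then constrain the
    remaining DECs with additional controlled terms."""
    rc = {dec: terms for dec, terms in template.items() if dec not in drop}
    for dec, allowed in constraints.items():
        rc[dec] = rc[dec] & allowed
    return rc

ldl = instantiate(mdt, drop={"Specimen Type"},
                  constraints={"Lab Test Name": {"LDL"}})
assert ldl == {"Lab Test Name": {"LDL"}, "Lab Result": {"*"}}
```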

Challenges, as I see them, are multi-fold. First, friendliness. These metadata displays were probably never meant to be user-friendly, as they are very heavy on metadata without a layman's translation such as descriptions. They are also not very machine-friendly, because worksheet tabs, merged cells, color coding, and text stylization amount to hidden metadata that require additional extrapolation.

Second, consistency. The free-form nature of spreadsheets is not the only contributing factor. There is also the cross-referencing to Controlled Terminology. And I mean a lot of it. As I mentioned above, this research concept creation methodology hinges heavily on the use of controlled terms to constrain a template. You can imagine it requires a lot of looking up in Controlled Terminology (likely spreadsheets), followed by copying and pasting. Further, new therapeutic areas (TA) often require new codelists and terms. With many concurrent TAUG developments, communication is a definite challenge across TA teams to define new items collaboratively and consistently.

Third, communication, which is tightly related to consistency. We want to reap the benefit of reusability from well-crafted MDTs and research concepts. Think about labs and how frequently they are used in safety and primary endpoints. Reuse means time saving, which equates to efficiency. Moreover, a research concept can be a composite of several related research concepts. Related disease areas, such as asthma and COPD in pulmonary function disorders, can be developed in parallel. These advantages would be difficult to achieve with standalone spreadsheets.

Fourth, sustainability. Granted, it is more of an issue of process and resources (i.e., operationalizing the methodology) than of the methodology per se. It is nonetheless a growing challenge with an increasing rate of new TAs commencing. The toolset for creating research concepts needs to be easy to use, quick to learn, and without the heavy baggage of inherent spreadsheet annoyances.

All of the above reasons motivated me to explore options, thus RCMap. It is a methodology based on visualization and knowledge organization. As you may notice from the preview of the LDL test above, I used CmapTools as the conduit to demonstrate the idea. CmapTools is a visual tool. Users express ideas and concepts as bubbles (parts) and lines (relationships between any two parts). It has good search functionality. It uses a folder structure for organization. Drag-and-drop reuse of ideas and concepts is bliss. And it is familiar software to the metadata developers and is available as a free beta offering.

To make RCMap useful to concept modelers and metadata developers, I had to do some preparation work. First, I exported Controlled Terminology codelists and terms from SHARE and imported them as concepts. By making them concepts, users will be able to search for the terms they need, either by name or by any of the Controlled Terminology attributes. They will be able to review the results, then drag and drop a match into the research concept they are creating. Likewise, I made SDTM variables available as reference objects. This way, metadata developers will be able to easily associate a DEC with the appropriate SDTM variables, say LBTEST and LBTESTCD for Lab Test. The result is a tight binding among controlled term, SDTM variable, and DEC. It takes the ambiguity and guessing game away, which is what many CDISC implementers want.

MDTs are patterns in RCMap: patterns that contain visually appealing predefined DECs, relationships, and controlled terms, readily available for reuse. RCMap will not choke on complex research concepts that require intricate linkages to related concepts, because they are simply another set of bubbles and lines, where the target concept is just a click away. I can see the concept modelers using a folder-like structure to organize related patterns into a hierarchical arrangement. Here is a sample arranging research concepts, SDTM variables, and controlled terms:

Some Excel manipulation of the XML data exported from RCMap allows me to create a metadata display that mimics the ones bundled in the two TAUGs. Here is a sample for the LDL, Direct lab test research concept, where you can see both the SDTM variables and controlled terms.

Suffice it to say, automation is not the focus here. I believe enough research concepts need to exist before automation is attainable. Only when there are sufficient patterns can meta-patterns be ascertained.

At the time of this blog entry, one of the few remaining hurdles is to encapsulate BRIDG metadata into RCMap. I hope to discuss options with the metadata developers, such as using an approach similar to the one binding SDTM variables to DECs, i.e., binding BRIDG class and attribute to DEC. Another hurdle is identifying the role of SHARE in relation to RCMap. It is not a matter of technology, because CmapTools has an excellent XML technology backbone. Rather, it is the curation and governance process.

Well, my flight is on approach to Dulles. I look forward to the 12th Annual CDISC International Interchange.

The SDTM expert volunteers (Barrie Nelson, Joyce Hernandez, Janet Siani, Gary Cunningham, Abhishek Dabral), Julie Chason, and I have been busy lately, finalizing the metadata in SHARE for SDTM v1.3/SDTMIG v3.1.3. Before I go on with details, think of relationships as the lines between assets in this SHARE Metamodel.

 

ISO 11179 defines relationship as "connection among model elements". It is a simple definition, yet with profound applications in the SHARE MDR. We use relationships to impose constraints between asset types. For instance, a variable metadata element (MDE) may only be represented by 1 value domain (VD). We use relationships to represent collections and their members. Examples: A domain-level metadata element set (MDES), which is a collection entity, may contain multiple variable MDE asset members; or, a class-level MDES may contain multiple domain MDES. We also use relationships to express hierarchies. The SHARE Stack diagram does a great job illustrating the hierarchy among various asset types for SDTM. As you can tell by now, we don't take relationships lightly.

 

We imported the SDTM metadata spreadsheet from the CDISC website to formulate the baseline content. During our recent quality sweep, we realized some relationships didn't look right and some were missing. Take Findings About, for example. Findings About was shown as a class at the same level as the 3 general observation classes. Further, domain FA was a member of the Findings class (see the Before relationship graph on the right). Information about these relationships was not included in the original SDTM metadata spreadsheet, nor was it explicit in the standard. We approached the SDS Leadership Team and obtained the authoritative information about these relationships (see the After relationship graph). The revised metadata now reflects the true intention of the model, which the 2-dimensional metadata spreadsheet didn't address.

Findings About: Before Changes

Findings About: After Changes to Correct Relationships

The DOMAIN variable in each SDTMIG domain demonstrates another good use of relationships. In SDTM, DOMAIN gets a constant value, which depends on the domain it belongs to. For example, the value of AE.DOMAIN is "AE". This is quite different from AE.DOMAIN being represented by a character VD (see the Before relationship graph on the right). We decided we ought to be expressive and, most importantly, take advantage of the CDISC Controlled Terminology assets that already exist in the MDR. After the revision, AE.DOMAIN has a much richer set of metadata (see the After relationship graph) -- the variable now has a VD representation that contains only 1 domain value (DV) of "AE", which implements the NCI EVS concept code C49562, which in turn subsets the CDISC SDTM Submission Domain Abbreviation Terminology codelist. That's clarity.

AE.DOMAIN: Before Changes

AE.DOMAIN: After Changes to Add Richness in Metadata
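The revised chain of relationships for AE.DOMAIN can be sketched as linked records (an illustration only, not SHARE's internal representation):

```python
# Each link in the chain points at the asset it relates to (illustrative)
codelist = {"name": "CDISC SDTM Submission Domain Abbreviation Terminology"}
concept  = {"nci_code": "C49562", "subsets": codelist}
dv       = {"value": "AE", "implements": concept}
vd       = {"kind": "value domain", "contains": [dv]}   # only 1 domain value
variable = {"name": "AE.DOMAIN", "represented_by": vd}

# Walking the chain from variable all the way to the codelist
chain = variable["represented_by"]["contains"][0]["implements"]["subsets"]
assert chain["name"] == "CDISC SDTM Submission Domain Abbreviation Terminology"
```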

 

Some relationships for SDTM can be less than easy to discern from the PDF documents (the normative documents). The team has done a fabulous job making them explicit and correct in SHARE, so that the standard becomes less esoteric, turning expert knowledge into metadata.

 

Before answering the topic, here is a little background to set the stage: During R1 development late last year (2013), the SHARE dev team loaded SDTM v1.2 / SDTMIG v3.1.2 into SHARE. It serves as the baseline SDTM content. In March 2014, we held a kickoff meeting with the SDS volunteers to begin the journey of adding new content, i.e., SDTM v1.3 / SDTMIG v3.1.3 and SDTM v1.4 / SDTMIG v3.2.

 

Everything in SHARE is interconnected with relationships (see SHARE Metamodel). It was obvious we needed to divide the work into two pieces as SDTM v1.4 is a child of SDTM v1.3, which itself is a child of SDTM v1.2. In other words, SDTM v1.4 can’t be a parallel task, at least not until SDTM v1.3’s content is stable.

 

As part of a tool onboarding exercise, we thought it would be of great benefit to enter some metadata interactively on-line (as opposed to importing data). In April, the team was divided into three teams of two to develop content for the three new oncology domains RS, TR, and TU introduced in SDTMIG v3.1.3. As illustrated on the SHARE Stack diagram, the first hurdle was to correctly up-version the managed objects (or "assets," as they are called in the tool): since these new domains belong to the Findings class, the Findings class asset needs a new version to contain both new and existing domains. Further, the TU domain also implements new class variables --LAT, --DIR, and --PORTOT. Therefore, the General Findings class asset also needs a new version to contain both new and existing class variables; until SHARE, these relationships could only be deduced from section 6.1 of the SDTM v1.3 publication. SHARE forces us to be explicit and express them in a machine-readable way.

 

That was the easy part -- we had only been dealing with new items thus far. For existing domains, the team needed to know what exactly changed between the two versions of the standard. Such a manifest needs to be granular to be useful for SHARE. For example, the TSVAL variable in Domain TS has a change in CDISC Notes and Core; or, the only change in Domain CM is CMDOSFRM's role. To accomplish that, we needed reliable machine-readable metadata input in order to produce a reliable metadata comparison output[a]. At the time of this blog entry, the team is halfway through reviewing the SDTM v1.3 / SDTMIG v3.1.3 metadata spreadsheet posted on the CDISC website. With their keen eyes for detail, the team has already identified several discussion-worthy discrepancies between the PDF publication and the input metadata spreadsheet. We will discuss how to resolve them in the upcoming team meetings[b]. It is my gut feeling that decisions will be contingent on how prevalent each category is, i.e., one size may not fit all.

[a] Using the existing metadata spreadsheet, there are 1,134 changes from SDTM v1.2 / SDTMIG v3.1.2 to SDTM v1.3 / SDTMIG v3.1.3. The count reflects each attribute change, including variable name, label, order, data type, controlled term, role, CDISC Notes, and Core, and excludes the three new oncology domains.

[b] The intention is, after issues are resolved, the metadata curator will generate SHARE-friendly import files using the reviewed metadata. With the magnitude of changes, manual data entry will not be efficient.
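The attribute-level comparison behind note [a] can be sketched like this; the sample rows are hypothetical and heavily abbreviated, while the real comparison runs over the full metadata spreadsheets:

```python
ATTRIBUTES = ["name", "label", "order", "data_type", "controlled_term",
              "role", "cdisc_notes", "core"]

def count_changes(old, new):
    """List each changed attribute of each shared variable, as in note [a]."""
    changes = []
    for var in old.keys() & new.keys():
        for attr in ATTRIBUTES:
            if old[var].get(attr) != new[var].get(attr):
                changes.append((var, attr))
    return changes

# Hypothetical excerpt: TSVAL changed in CDISC Notes and Core between versions
v312 = {"TS.TSVAL": {"cdisc_notes": "Value of the parameter.", "core": "Req"}}
v313 = {"TS.TSVAL": {"cdisc_notes": "Value of the parameter, coded.", "core": "Exp"}}

assert sorted(count_changes(v312, v313)) == \
       [("TS.TSVAL", "cdisc_notes"), ("TS.TSVAL", "core")]
```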

Another challenge we face is CDISC Controlled Terminology. Specifically, it is the evolution of codelist development that creates an unanticipated complexity. In the sixteen months between the final publication of SDTM v1.3 (2012-07) and SDTM v1.4 (2013-11), five CDISC Controlled Terminology releases were published and new codelists were introduced. For example, C78735 (EVAL: Evaluator) for --EVAL, C99079 (EPOCH: Epoch) for EPOCH, and C66728 (STENRF: Relation to Reference Period) for --STRF, --ENRF, --STRTPT, and --ENRTPT[c]. We will have to decide how to handle this in SHARE.

[c] Credits go to the PhUSE Semantic Technology team. By analyzing their RDF materials using SPARQL, I realized the team retrospectively applied these CDISC Controlled Terminology codelists to the SDTM v1.3 / SDTMIG v3.1.3 triples. It is good to see how other people may interpret the standard.

There are other hurdles, too. The team will decide how to represent the value domain and conceptual domain (see ISO 11179 Part 4) for the new MedDRA variables in the Events class, as well as ISO 21090's nullFlavor associated with the TSVALNF variable in Domain TS. Note both instances are external standards whose values are managed by a different entity. Besides, intellectual property and copyright may further complicate the matter.

 

Suffice it to say, adding existing content into SHARE is easier said than done. Challenges that would not materialize on paper now surface because the metadata repository requires us to be precise and verbose. That said, it is important to do the right thing so the community will soon benefit from truly interoperable data.

 

Lastly, I leave you with this (in reference to SHARE). Your feedback is always welcome.

 

 

I recently admitted to my colleagues that I am a tinkerer by nature -- I indulge in intricacies and desire a deep understanding of the things I work on. On this fine Saturday, I had the urge to know the Chinese word for metadata. Through web searches, I realized there isn't one common name for it. Amongst those appearing in the search results, the more popular ones are 詮釋資料, which means annotative or explanatory information; 後設資料, mostly used in Taiwan, which literally means information behind the setting; and the third translation, 元數據. 元 (yuán) means beginning, first, and origin in Chinese, while 數據 (shùjù) means data.

 

While metadata can add annotations and background meaning to data, the notions of beginning and first truly accentuate the purpose of metadata; therefore, the last translation (元數據) is the one I prefer the most. My experience backs it up: as a Metadata Curator for the CDISC SHARE project, I came to the realization that defining metadata for existing standards is much more difficult than I thought. This retrospective exercise often entails tedious manual interventions, such as transcribing information from PDF publications into machine-readable formats[a]; discerning proper relationships[b]; and handling omitted information or mistakes in the original publications, whose details may deserve another blog post. All this manual work easily eclipses the effort spent to create the mechanical counterparts, which themselves are no easy feat.

[a] We first had to turn PDF or its Microsoft Word sources into a spreadsheet format. Then, other challenges emerged such as character encoding (e.g., non-printable characters and those smart curly quotes Microsoft Office auto-corrects as default setting), combining multiple sources (e.g., CDASH v1.1 and CDASH User Guide v1.0), and minute details such as handling the NullFlavor details described in SDTM's Trial Summary (TS) domain -- is it intended to be registered as a CDISC Controlled Terminology in NCI EVS? Further, NullFlavor's governing authority can be HL7 or ISO 21090.

[b] A good example is identifying codelist supersets and subsets in the CDISC Controlled Terminology. For instance, Age Unit (C66781) is a subset of Unit (C71620) codelist.

It is not only ideal, but essential, to tackle metadata up-front and during the standard development cycle. Accomplishing this will increase clarity and remove ambiguity for downstream consumers, hence easing the job for standards developers to quickly advance each CDISC standard. All these benefits are reasons why CDISC SHARE is such a game changer.