Disclaimer
The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.
I am learning about natural language processing (NLP) as part of my self-teaching journey in Data Science.
My setup is simple: Python and an NLP package. scispaCy is an open-source library for processing biomedical text. [1] It offers multiple pre-trained models, such as one trained on the BioCreative V Chemical Disease Relation (BC5CDR) corpus for biomedical terms. [2] Another example is a model trained on the BioNLP13CG corpus, which covers cancer genetics.
As a test run, I selected a small paragraph from the work-in-progress COVID-19 Interim User Guide. [3] This is a text about a data collection example for the disease's signs and symptoms:
Data collection may include questions about groups of symptoms, such as
- GI symptoms (nausea, vomiting, diarrhea)
- Cough (non-productive, productive, or haemoptisis)
The next step was to extract named entities by running the signs and symptoms text through the BC5CDR biomedical model. A named entity is a span of text labeled with the type of thing it names. For BC5CDR, the entity types are DISEASE and CHEMICAL. This process is often referred to as named entity recognition (NER). These are the results:
Entity | Entity Type |
---|---|
nausea | DISEASE |
vomiting | DISEASE |
diarrhea | DISEASE |
Cough | DISEASE |
haemoptisis | DISEASE |
The first surprise was that there is a programmatic method to link each discovered entity to the Unified Medical Language System (UMLS), which is maintained by the U.S. National Library of Medicine (NLM). This is appealing because, when a term or concept is curated in the UMLS, a formal definition usually exists. Each concept in the UMLS has a Concept Unique Identifier (CUI). [4] This process is typically called named entity linking (NEL).
Let's take a look at the outcome, with green shaded rows indicating my preferred match:
Entity | CUI | Name | Definition | Score |
---|---|---|---|---|
nausea | C0027497 | Nausea | An unpleasant sensation in the stomach usually accompanied by the urge to vomit. Common causes are early pregnancy, sea and motion sickness, emotional stress, intense pain, food poisoning, and various enteroviruses. | 1.0 |
nausea | C4085862 | Bothered by Nausea | A question about whether an individual is or was bothered by nausea. | 1.0 |
nausea | C4255480 | Nausea:Presence or Threshold:Point in time:^Patient:Ordinal | None | 1.0 |
nausea | C4084796 | How Often Nausea | A question about how often an individual has or had nausea. | 1.0 |
nausea | C1963179 | Nausea Adverse Event | None | 1.0 |
vomiting | C0042963 | Vomiting | The forcible expulsion of the contents of the STOMACH through the MOUTH. | 1.0 |
vomiting | C4084767 | Bothered by Vomiting | A question about whether an individual is or was bothered by vomiting. | 0.9999999403953552 |
vomiting | C4084768 | Usual Severity Vomiting | A question about the usual severity of an individual's vomiting. | 0.9999999403953552 |
vomiting | C1963281 | Vomiting Adverse Event | None | 0.9999999403953552 |
vomiting | C4084766 | How Much Distress Vomiting | A question about an individual's distress from their vomiting. | 0.9999999403953552 |
diarrhea | C0011991 | Diarrhea | An increased liquidity or decreased consistency of FECES, such as running stool. Fecal consistency is related to the ratio of water-holding capacity of insoluble solids to total water, rather than the amount of water present. Diarrhea is not hyperdefecation or increased fecal weight. | 1.0 |
diarrhea | C4084784 | How Much Distress Diarrhea | A question about an individual's distress from their diarrhea. | 1.0 |
diarrhea | C4084802 | Usual Severity Diarrhea | A question about the usual severity of an individual's diarrhea. | 1.0 |
diarrhea | C1963091 | Diarrhea Adverse Event | None | 1.0 |
diarrhea | C3641756 | Have Diarrhea | A question about whether an individual has or had diarrhea. | 1.0 |
Cough | C0010200 | Coughing | A sudden, audible expulsion of air from the lungs through a partially closed glottis, preceded by inhalation. It is a protective response that serves to clear the trachea, bronchi, and/or lungs of irritants and secretions, or to prevent aspiration of foreign materials into the lungs. | 1.0 |
Cough | C1961131 | Cough Adverse Event | None | 1.0 |
Cough | C3274924 | Have Been Coughing | A question about whether an individual is or has been coughing. | 1.0 |
Cough | C3815497 | Cough (guaifenesin) | None | 1.0 |
Cough | C4084725 | Usual Severity Cough | A question about the usual severity of an individual's cough. | 1.0 |
haemoptisis | None returned | | | |
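Because several candidates tie at a score of (nearly) 1.0, the score alone cannot single out one concept per entity. The following is a minimal sketch of one possible tie-break, using rows copied from the table above. The `Candidate` type, the `pick_preferred` function, and the heuristic itself (highest score, then concepts that have a definition, then the shortest name) are illustrative assumptions, not part of scispaCy and not necessarily the rule behind the preferred matches in this post.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    """One linker candidate (hypothetical container for a table row)."""
    cui: str
    name: str
    has_definition: bool
    score: float


def pick_preferred(candidates: List[Candidate]) -> Optional[Candidate]:
    """Pick one candidate using an assumed tie-break heuristic.

    Order of preference: higher score (rounded, so 0.99999994 ties with 1.0),
    then candidates with a definition, then the shortest concept name.
    """
    if not candidates:
        return None  # e.g. the misspelled "haemoptisis" returned nothing
    return min(
        candidates,
        key=lambda c: (-round(c.score, 6), not c.has_definition, len(c.name)),
    )


# Rows taken from the linking table above.
nausea = [
    Candidate("C0027497", "Nausea", True, 1.0),
    Candidate("C4085862", "Bothered by Nausea", True, 1.0),
    Candidate("C4255480", "Nausea:Presence or Threshold:Point in time:^Patient:Ordinal", False, 1.0),
    Candidate("C4084796", "How Often Nausea", True, 1.0),
    Candidate("C1963179", "Nausea Adverse Event", False, 1.0),
]
cough = [
    Candidate("C0010200", "Coughing", True, 1.0),
    Candidate("C1961131", "Cough Adverse Event", False, 1.0),
    Candidate("C3274924", "Have Been Coughing", True, 1.0),
    Candidate("C3815497", "Cough (guaifenesin)", False, 1.0),
    Candidate("C4084725", "Usual Severity Cough", True, 1.0),
]

for label, rows in (("nausea", nausea), ("Cough", cough)):
    best = pick_preferred(rows)
    print(label, "->", best.cui, best.name)
```

A heuristic like this only narrows the list; a human reviewer should still confirm that the chosen concept matches the intended meaning.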
Notice that the table above does not include any UMLS concept for the named entity haemoptisis. After some online searching, I found, to my surprise, that this is due to a typographical error. Once it is corrected to "hemoptysis," a hit appears in the outcome, as follows:
Entity | CUI | Name | Definition | Score |
---|---|---|---|---|
hemoptysis | C0019079 | Hemoptysis | Expectoration or spitting of blood originating from any part of the RESPIRATORY TRACT, usually from hemorrhage in the lung parenchyma (PULMONARY ALVEOLI) and the BRONCHIAL ARTERIES. | 1.0 |
hemoptysis | C0030424 | Paragonimiasis | Infection with TREMATODA of the genus PARAGONIMUS. | 0.7546218633651733 |
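The typo surfaced only because the linker returned nothing. As a rough sketch of how that signal could be turned into a routine check, the snippet below flags entities with no linked concept and suggests near spellings using Python's standard `difflib` module. The `KNOWN_TERMS` list, the function names, and the 0.8 cutoff are all hypothetical choices for illustration; scispaCy does not do this for you, and a real check would draw its vocabulary from a curated source such as the UMLS.

```python
import difflib
from typing import Dict, List

# Hypothetical reference vocabulary; in practice this would come from a
# curated terminology rather than a hand-typed list.
KNOWN_TERMS = ["nausea", "vomiting", "diarrhea", "cough", "hemoptysis", "haemoptysis"]


def suggest_spellings(term: str, vocabulary: List[str] = KNOWN_TERMS,
                      cutoff: float = 0.8) -> List[str]:
    """Return up to three vocabulary terms that closely resemble `term`."""
    return difflib.get_close_matches(term.lower(), vocabulary, n=3, cutoff=cutoff)


def review_unlinked(link_results: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """For every entity with no linked CUI, propose possible spellings.

    `link_results` maps entity text to the list of CUIs the linker returned.
    """
    return {
        entity: suggest_spellings(entity)
        for entity, cuis in link_results.items()
        if not cuis
    }


# Linking outcome from this post, reduced to the best CUI per entity.
results = {
    "nausea": ["C0027497"],
    "vomiting": ["C0042963"],
    "diarrhea": ["C0011991"],
    "Cough": ["C0010200"],
    "haemoptisis": [],  # nothing returned for the misspelled term
}

suggestions = review_unlinked(results)
for entity, options in suggestions.items():
    print(f"No UMLS match for {entity!r}; did you mean one of {options}?")
```

Only entities without any linked concept are reviewed, so correctly spelled terms pass through untouched.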
These CUIs can also be looked up in the NCI Metathesaurus. This is the URL template: https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp?dictionary=NCI%20Metathesaurus&code={CUI}
spaCy includes a built-in visualizer, displaCy, to display part-of-speech tags and syntactic dependencies. The following graphic is the rendering of the text described above:
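A small helper can fill in the URL template for any CUI. This is a minimal sketch: the function name `ncim_url` is made up for illustration, and the validation assumes the usual CUI shape of the letter C followed by seven digits, as seen in the tables above.

```python
import re
from urllib.parse import quote

NCIM_BASE = "https://ncim.nci.nih.gov/ncimbrowser/ConceptReport.jsp"


def ncim_url(cui: str) -> str:
    """Build the NCI Metathesaurus concept report URL for a CUI."""
    cui = cui.strip().upper()
    # Assumed CUI format: "C" followed by seven digits, e.g. C0019079.
    if not re.fullmatch(r"C\d{7}", cui):
        raise ValueError(f"Not a valid CUI: {cui!r}")
    # quote() percent-encodes the space in the dictionary name as %20.
    return f"{NCIM_BASE}?dictionary={quote('NCI Metathesaurus')}&code={cui}"


print(ncim_url("C0019079"))  # concept report link for Hemoptysis
```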
We discussed named entity recognition, which can be displayed as such:
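```python
import scispacy  # imported for its side effects; makes the scispaCy components available
import spacy
from scispacy.umls_linking import UmlsEntityLinker
from spacy import displacy

# Load the BC5CDR NER model and attach the UMLS entity linker.
# This is the API as used at the time of writing; later scispaCy/spaCy
# releases changed how the linker is added to the pipeline.
nlp = spacy.load("en_ner_bc5cdr_md")
linker = UmlsEntityLinker(resolve_abbreviations=True)
nlp.add_pipe(linker)

text = """
Data collection may include questions about groups of symptoms, such as
GI symptoms (nausea, vomiting, diarrhea)
Cough (non-productive, productive, or haemoptisis)
"""

doc = nlp(text)
entities = doc.ents

# Print each named entity, then every UMLS candidate linked to it
for entity in entities:
    print(entity.text, entity.start_char, entity.end_char, entity.label_)
    for umls_ent in entity._.umls_ents:
        # tuple with 2 values
        conceptId, score = umls_ent
        print(f"Name: {entity}")
        print(f"CUI: {conceptId}, Score {score}")
        print(linker.umls.cui_to_entity[umls_ent[0]])
        print()

colors = {
    'CHEMICAL': 'lightpink',
    'DISEASE': 'lightorange',
}

# show NER
# displacy.serve blocks until the server is stopped (Ctrl+C),
# so the dependency view is served only after the entity view is closed.
displacy.serve(doc, style="ent", host="127.0.0.1", options={'colors': colors})
displacy.serve(doc, style="dep", host="127.0.0.1")
```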
At this point, there seem to be many NLP opportunities and applications in standards development. Linkage to UMLS will allow team members to ensure semantic meaning by referencing the curated definition. Quality will also increase: as demonstrated above, a failed link exposed a spelling error I was not even looking for. I can certainly see its utility in CDISC 360's biomedical concept authoring. Named entities can also be used as keywords or tags in Example Collection.
I want to share a note on installation. scispacy and spacy require Cython, which compiles C extensions for Python. I spent too many hours troubleshooting before realizing I had installed a 32-bit build of Python on a PC running 64-bit Windows 10. This caused many compiler errors because all the Microsoft Visual Studio runtime redistributables and compilers were 64-bit. Installing the 64-bit Python binaries corrected all the installation issues.
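A quick way to rule out this mismatch before installing anything is to ask the interpreter for its pointer size. This is a minimal sketch using only the standard library; the function names are illustrative.

```python
import platform
import struct


def python_bitness() -> int:
    """Return 32 or 64 depending on the running interpreter's pointer size."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


def describe_interpreter() -> str:
    """Summarize the interpreter and the machine it runs on."""
    return (
        f"Python {platform.python_version()} ({python_bitness()}-bit) "
        f"on {platform.system()} {platform.machine()}"
    )


print(describe_interpreter())
if python_bitness() != 64:
    print("Warning: this is not a 64-bit Python build.")
```

Run it with the same interpreter that `pip` will install into, since several Python builds can coexist on one machine.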
[1] scispaCy. https://allenai.github.io/scispacy/
[2] BC5CDR corpus. https://www.ncbi.nlm.nih.gov/research/bionlp/Data/
[3] CDISC Interim User Guide for COVID-19. https://www.cdisc.org/interim-user-guide-covid-19/
[4] Unique Identifiers in the Metathesaurus. https://www.nlm.nih.gov/research/umls/new_users/online_learning/Meta_005.html
7 Comments
Saad Yousef
Thanks for sharing, Anthony. Did a bit of research to see if R has a tool like scispaCy but can't find anything yet. Learned quite a bit nonetheless from your post.
Saad
Saad Yousef
Just found something: .
Saad
Jozef Aerts
Anthony, this is not a baby step, it is a great leap forward! Especially linking to UMLS enables to go from simple text to machine-readable and machine-executable information. And it allows to bridge between CDISC-CT and other coding systems such as SNOMED, LOINC, ICD, ... as these are (just as CDISC-CT) represented in UMLS. I did something similar few years ago, but without NLP, just semi-automatically. It was a visual application allowing users to annotate protocols with CDISC-CT, SNOMED, LOINC, ... Using NLP, maybe in combination with AI, promises to finally come to clear, unambiguous, machine-readable protocols (and TAUGs) starting from "just text" that we have now. GREAT WORK!
Jozef Aerts
Anthony Chow Do you know of any worldwide standard for annotating text documents? I found https://www.w3.org/TR/annotation-model/ but I have no idea (yet) whether that would e.g. be suitable for annotating TAUGs and protocols with CDISC and other standards information.
Kit Howard
I'm arriving a bit late to the show, but thanks for a very informative blog. This puts some nuts and bolts to NLP, which is a term many folks like to wave about without really saying anything.
Just as a quick point about hemoptysis/haemoptysis. You've run into an example of two countries divided by a common language! Actually, we can say it's 2 languages because the root cause is Latin. Haemoptysis is the spelling derived from Latin, and that is still used in the UK. Similarly, oesophagus, gynaecological (the a and e are actually joined), tumour, analyse and titre. US English has simplified many of these spellings.
Another example of - until you tell them differently, computers really are very ignorant
Anthony Chow AUTHOR
Right, except it is either hemoptysis or haemoptysis, but never haemoptisis.
Anthony Chow AUTHOR
Also, UMLS has haemoptysis as a synonym to hemoptysis.