Introduction

CDISC Controlled Terminology (CT) maintains a codelist of units of measurement (codelist code C71620, short name UNIT). It is used to represent values for unit variables in various domains, such as demographics (AGEU), concomitant medications (CMDOSU), lab (LBORRESU, LBSTRESU), and vital signs (VSORRESU, VSSTRESU). Note: the AGEU codelist is a subset of the UNIT codelist.

Indiana University School of Medicine's Regenstrief Institute develops the Unified Code for Units of Measure (UCUM). It is a code system of units intended to be unambiguous to both humans and machines. It has many applications in the life sciences, such as EHR, EDI, and HL7 electronic messaging. Logical Observation Identifiers Names and Codes (LOINC) is another code system that incorporates UCUM.

Even though many terms are identical between CDISC and UCUM, there are differences. For example, millimeter of mercury, a unit commonly used to measure blood pressure, is mmHg in CDISC but mm[Hg] in UCUM. Here are a few examples of differing values:

Table 1: CDISC CT vs. UCUM
Long Label	CDISC CT	UCUM
Cells per Microliter	cells/uL	{Cells}/uL
Hour	HOURS	h
Joule	Joule	J
Millimeter of Mercury	mmHg	mm[Hg]
Millisecond	msec	ms
Pound	LB	[lb_av]
Tablet Dosing Unit	TABLET	{tbl}

Therefore, a mapping between the two codelists helps heterogeneous systems interoperate and exchange data successfully.
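To make the idea of a mapping concrete, here is a minimal Python sketch that treats the pairs from Table 1 as a lookup table. The function names `to_ucum` and `to_cdisc` are hypothetical and exist only for illustration; a real mapping would be loaded from a curated source rather than hard-coded.

```python
# Illustrative only: these pairs are copied from Table 1.
CDISC_TO_UCUM = {
    "cells/uL": "{Cells}/uL",
    "HOURS": "h",
    "Joule": "J",
    "mmHg": "mm[Hg]",
    "msec": "ms",
    "LB": "[lb_av]",
    "TABLET": "{tbl}",
}

# Reverse direction, built from the same pairs.
UCUM_TO_CDISC = {ucum: cdisc for cdisc, ucum in CDISC_TO_UCUM.items()}


def to_ucum(cdisc_unit):
    """Return the UCUM code for a CDISC submission value, or None if unmapped."""
    return CDISC_TO_UCUM.get(cdisc_unit)


def to_cdisc(ucum_unit):
    """Return the CDISC submission value for a UCUM code, or None if unmapped."""
    return UCUM_TO_CDISC.get(ucum_unit)
```

Returning `None` for an unmapped unit keeps the gap visible to the caller instead of silently passing the original value through.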

NCI EVS Resources

CDISC and NCI EVS have long been partners in curating and registering CDISC controlled terminologies in the NCI Thesaurus (NCIt) and, by extension, the NCI Metathesaurus (NCIm). In fact, a careful examination of NCIm reveals that a relationship between CDISC and UCUM already exists. A search for the term millimeter of mercury shows the evidence (source: http://1.usa.gov/1zoRA96):

At this point, we know the following about NCI EVS:

All CDISC CT can be retrieved from the NCIt browser
NCIt contains biomedical knowledge from multiple sources, e.g., CDISC, UCUM, SNOMED, etc.
Relationships between sources are maintained, where applicable

Thesaurus OWL/RDF

Despite this good start, manual lookup via the NCIt browser would be too tedious to be useful. Further research shows that the NCI Center for Biomedical Informatics and Information Technology (CBIIT) publishes the NCIt in OWL/RDF format on a regular basis.

With an OWL/RDF file at our disposal, SPARQL is the natural tool for graph-based data analysis.

The following is a snippet from the Thesaurus OWL/RDF file, showing how it represents metadata for the term millimeter of mercury:

Millimeter of Mercury from Thesaurus.OWL

<!-- http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Millimeter_of_Mercury -->

<owl:Class rdf:about="#C49670">
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Millimeter of Mercury</rdfs:label>
    <rdfs:subClassOf rdf:resource="#C67332"/>
    <P97 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>A non-SI unit of pressure equal to 133,332 Pa or 1.316E10-3 standard atmosphere. Use of this unit is generally deprecated by ISO and IUPAC.</ncicp:def-definition><ncicp:def-source>NCI</ncicp:def-source></ncicp:ComplexDefinition></P97>
    <P325 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>A unit of pressure equal to 0.001316 atmosphere and equal to the pressure indicated by one millimeter rise of mercury in a barometer at the Earth's surface. (NCI)</ncicp:def-definition><ncicp:def-source>CDISC</ncicp:def-source></ncicp:ComplexDefinition></P325>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>mm[Hg]</ncicp:term-name><ncicp:term-group>AB</ncicp:term-group><ncicp:term-source>UCUM</ncicp:term-source></ncicp:ComplexTerm></P90>
    <code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">C49670</code>
    <P108 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Millimeter of Mercury</P108>
    <A8 rdf:resource="http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#C71620"/>
</owl:Class>

To de-reference the pseudo-properties P90 (Synonym with Source Data) and P108 (Preferred Name) above:

Definition of the Pseudo-Property P90

<!-- http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#FULL_SYN -->

<owl:DatatypeProperty rdf:about="#P90">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#AnnotationProperty"/>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FULL_SYN</rdfs:label>
    <P97 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>Fully qualified synonym, contains the string, term type, source, and an optional source code if appropriate. Each subfield is deliniated to facilitate interpretation by software.</ncicp:def-definition><ncicp:def-source>NCI</ncicp:def-source></ncicp:ComplexDefinition></P97>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>FULL_SYN</ncicp:term-name><ncicp:term-group>PT</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Synonym with Source Data</ncicp:term-name><ncicp:term-group>SY</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P106 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Conceptual Entity</P106>
    <P108 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">FULL_SYN</P108>
    <code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">P90</code>
    <P107 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Term &amp; Source Data</P107>
</owl:DatatypeProperty>

Definition of the Pseudo-Property P108

<!-- http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#Preferred_Name -->

<owl:AnnotationProperty rdf:about="#P108">
    <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#DatatypeProperty"/>
    <rdfs:label rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Preferred_Name</rdfs:label>
    <P97 rdf:parseType="Literal"><ncicp:ComplexDefinition><ncicp:def-definition>The word or phrase that NCI uses by preference to refer to the concept.</ncicp:def-definition><ncicp:def-source>NCI</ncicp:def-source></ncicp:ComplexDefinition></P97>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Preferred Name</ncicp:term-name><ncicp:term-group>SY</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Preferred Term</ncicp:term-name><ncicp:term-group>SY</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>Preferred_Name</ncicp:term-name><ncicp:term-group>PT</ncicp:term-group><ncicp:term-source>NCI</ncicp:term-source></ncicp:ComplexTerm></P90>
    <P106 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Conceptual Entity</P106>
    <code rdf:datatype="http://www.w3.org/2001/XMLSchema#string">P108</code>
    <P107 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Preferred Name</P107>
    <P108 rdf:datatype="http://www.w3.org/2001/XMLSchema#string">Preferred_Name</P108>
</owl:AnnotationProperty>

An explanation of the above OWL snippet:

Line #3 begins the owl:Class C49670, which, not by accident, is the c-code for millimeter of mercury that we are accustomed to seeing on the CDISC CT spreadsheet.
Line #11 shows that C49670 is associated with another class, C71620, which is the UNIT codelist itself.
Line #9 contains the c-code as a property.
Line #10 contains the NCI preferred name of C49670.
Line #8 contains the UCUM mapping. Note that the value AB in the XML tag ncicp:term-group signifies that the entry is a unit symbol used by UCUM. Other values may appear, but they are not relevant to this demonstration.
Lines #6-7 contain definitions of the class.
Lines #15-41 contain the definitions of the pseudo-properties P90 and P108 for easy reference. There are many other pseudo-objects in the Thesaurus OWL/RDF file.
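The structure just described can also be walked programmatically. The following Python sketch parses a trimmed copy of the Millimeter of Mercury class and pulls out the c-code, the preferred name, and the UCUM symbol. It is an illustration under assumptions: the `ncicp` namespace URI below is a placeholder (substitute the one declared in the real file header), the `rdf:datatype` attributes are omitted for brevity, and the function name `ucum_synonyms` is hypothetical.

```python
import xml.etree.ElementTree as ET

OWL = "http://www.w3.org/2002/07/owl#"
NCI = "http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#"
# Assumption: placeholder URI for the ncicp prefix, not taken from the real file.
NCICP = "urn:example:ncicp"

# Trimmed copy of the owl:Class shown above, wrapped so that prefixes resolve.
FRAGMENT = f"""<rdf:RDF xmlns="{NCI}"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="{OWL}"
    xmlns:ncicp="{NCICP}">
  <owl:Class rdf:about="#C49670">
    <P90 rdf:parseType="Literal"><ncicp:ComplexTerm><ncicp:term-name>mm[Hg]</ncicp:term-name><ncicp:term-group>AB</ncicp:term-group><ncicp:term-source>UCUM</ncicp:term-source></ncicp:ComplexTerm></P90>
    <code>C49670</code>
    <P108>Millimeter of Mercury</P108>
  </owl:Class>
</rdf:RDF>"""


def ucum_synonyms(xml_text):
    """Return (c-code, preferred name, UCUM symbol) for each UCUM/AB synonym."""
    root = ET.fromstring(xml_text)
    rows = []
    for cls in root.iter(f"{{{OWL}}}Class"):
        code = cls.findtext(f"{{{NCI}}}code")
        name = cls.findtext(f"{{{NCI}}}P108")
        for term in cls.iter(f"{{{NCICP}}}ComplexTerm"):
            source = term.findtext(f"{{{NCICP}}}term-source")
            group = term.findtext(f"{{{NCICP}}}term-group")
            if source == "UCUM" and group == "AB":
                rows.append((code, name, term.findtext(f"{{{NCICP}}}term-name")))
    return rows


print(ucum_synonyms(FRAGMENT))
```

Parsing the full Thesaurus file this way would work in principle, but at roughly two million triples a triple store with SPARQL is the more practical route.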

With the OWL sample above depicting the model, the following SPARQL query obtains a list of objects that have a UCUM mapping:

SPARQL for Extracting UCUM Mappings

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX nci: <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#>

# Extract the term-name subfield of each P90 (FULL_SYN) literal as the UCUM symbol
SELECT ?conceptCode ?preferredName (STRAFTER(STRBEFORE(STR(?synonym), "</ncicp:term-name>"), "<ncicp:term-name>") AS ?ucum)
FROM <http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl>
WHERE {
    ?class a owl:Class ;
        nci:code ?conceptCode ;
        nci:P108 ?preferredName ;
        nci:P90 ?synonym .
    # Keep only synonyms whose source is UCUM and whose term group is AB
    FILTER(STRAFTER(STRBEFORE(STR(?synonym), "</ncicp:term-source>"), "<ncicp:term-source>") = "UCUM")
    FILTER(STRAFTER(STRBEFORE(STR(?synonym), "</ncicp:term-group>"), "<ncicp:term-group>") = "AB")
}

A sample output matching the terms shown in Table 1 above:

Table 2: Partial results from extracting metadata in Thesaurus OWL/RDF using SPARQL
conceptCode	preferredName	ucum
C67242	Cells per Microliter	{Cells}/uL
C25529	Hour	h
C42548	Joule	J
C49670	Millimeter of Mercury	mm[Hg]
C41140	Millisecond	ms
C48531	Pound	[lb_av]
C48542	Tablet Dosing Unit	{tbl}
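The nested STRAFTER(STRBEFORE(...)) calls in the Thesaurus query slice one subfield out of the P90 literal by plain string matching. The Python sketch below mirrors that logic for non-empty separators, so the extraction can be tried on a single literal. The helper names `str_before`, `str_after`, and `subfield` are hypothetical, and the sketch assumes the literal is serialized exactly as it appears in the OWL snippet.

```python
def str_before(text, sep):
    """Like SPARQL STRBEFORE: text before the first sep, or "" if sep is absent."""
    head, found, _ = text.partition(sep)
    return head if found else ""


def str_after(text, sep):
    """Like SPARQL STRAFTER: text after the first sep, or "" if sep is absent."""
    _, found, tail = text.partition(sep)
    return tail if found else ""


def subfield(literal, tag):
    """Cut at the closing tag first, then keep what follows the opening tag."""
    return str_after(str_before(literal, f"</ncicp:{tag}>"), f"<ncicp:{tag}>")


# The P90 literal for Millimeter of Mercury, copied from the OWL snippet.
P90_LITERAL = (
    "<ncicp:ComplexTerm><ncicp:term-name>mm[Hg]</ncicp:term-name>"
    "<ncicp:term-group>AB</ncicp:term-group>"
    "<ncicp:term-source>UCUM</ncicp:term-source></ncicp:ComplexTerm>"
)

print(subfield(P90_LITERAL, "term-name"))
```

Because this is string matching rather than XML parsing, it depends on the exact tag spelling and prefix; a literal serialized differently would yield an empty string instead of an error.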

CDISC CT OWL/RDF

Incidentally, all CDISC CT packages are also available in OWL/RDF. The goal is to reduce the UCUM query above to only the entries found on the UNIT (C71620) codelist. Continuing with the flow, this is a snippet from the CDISC CT OWL/RDF for the same term, millimeter of mercury:

Millimeter of Mercury from sdtm-terminology.owl

<CodeList OID="CL.C71620.UNIT" Name="Unit" DataType="text" nciodm:ExtCodeID="C71620" nciodm:CodeListExtensible="Yes">
    <Description>
        <TranslatedText xml:lang="en">Terminology codelist used for units within CDISC.</TranslatedText>
    </Description>
    <EnumeratedItem CodedValue="mmHg" nciodm:ExtCodeID="C49670">
        <nciodm:CDISCSynonym>Millimeter of Mercury</nciodm:CDISCSynonym>
        <nciodm:CDISCDefinition>A unit of pressure equal to 0.001316 atmosphere and equal to the pressure indicated by one millimeter rise of mercury in a barometer at the Earth's surface. (NCI)</nciodm:CDISCDefinition>
        <nciodm:PreferredTerm>Millimeter of Mercury</nciodm:PreferredTerm>
    </EnumeratedItem>
    <nciodm:CDISCSubmissionValue>UNIT</nciodm:CDISCSubmissionValue>
    <nciodm:CDISCSynonym>Unit</nciodm:CDISCSynonym>
    <nciodm:PreferredTerm>CDISC SDTM Unit of Measure Terminology</nciodm:PreferredTerm>
</CodeList>

Unlike the rather esoteric Thesaurus OWL/RDF, the CDISC CT file is straightforward and readable. With that, here is a SPARQL query to extract information such as submission values and their c-codes from the UNIT codelist:

SPARQL for Extracting UNIT Codelist Submission Values

PREFIX mms: <http://rdf.cdisc.org/mms#>
PREFIX cts: <http://rdf.cdisc.org/ct/schema#>

SELECT ?conceptCode ?cdiscSubmissionVal
FROM <http://rdf.cdisc.org/ct/schema>
FROM <http://rdf.cdisc.org/mms>
FROM <http://rdf.cdisc.org/sdtm-terminology>
WHERE {
    # Each permissible value carries its c-code and its submission value
    ?pv a mms:PermissibleValue ;
        cts:nciCode ?conceptCode ;
        cts:cdiscSubmissionValue ?cdiscSubmissionVal ;
        mms:inValueDomain ?clCcode .
    # Restrict to the value domain that is the UNIT codelist (C71620)
    {
        ?clCcode cts:codelistName ?clName ;
            cts:nciCode "C71620" .
    }
}

A sample output matching the terms shown in Tables 1 and 2 above:

Table 3: Partial results from extracting metadata in CDISC OWL/RDF using SPARQL
conceptCode	cdiscSubmissionVal
C67242	cells/uL
C25529	HOURS
C42548	Joule
C48531	LB
C49670	mmHg
C41140	msec
C48542	TABLET
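For comparison, the same c-code and submission value pairs can be read straight from the CodeList snippet shown earlier. This Python sketch is illustrative only: the `nciodm` namespace URI is a placeholder (the real file declares its own), the snippet is trimmed to the elements that matter, and the function name `submission_values` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: placeholder URI for the nciodm prefix, not taken from the real file.
NCIODM = "urn:example:nciodm"

# Trimmed copy of the UNIT CodeList with the namespace declared inline.
SNIPPET = f"""<CodeList xmlns:nciodm="{NCIODM}" OID="CL.C71620.UNIT" Name="Unit"
    DataType="text" nciodm:ExtCodeID="C71620" nciodm:CodeListExtensible="Yes">
    <EnumeratedItem CodedValue="mmHg" nciodm:ExtCodeID="C49670">
        <nciodm:PreferredTerm>Millimeter of Mercury</nciodm:PreferredTerm>
    </EnumeratedItem>
    <nciodm:CDISCSubmissionValue>UNIT</nciodm:CDISCSubmissionValue>
</CodeList>"""


def submission_values(xml_text, codelist_code):
    """Return (c-code, submission value) pairs if the codelist code matches."""
    root = ET.fromstring(xml_text)
    ext_code = f"{{{NCIODM}}}ExtCodeID"  # namespaced attribute key
    if root.get(ext_code) != codelist_code:
        return []
    return [
        (item.get(ext_code), item.get("CodedValue"))
        for item in root.iter("EnumeratedItem")
    ]


print(submission_values(SNIPPET, "C71620"))
```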

Final Query and Output

The two result sets can be linked via the individual term's c-code. Therefore, the final query is a combination of the two SPARQL queries above, with a slight adjustment to make the nested queries work efficiently. It yields 143 mappings for 130 unique terms.

cdisc_ct_ucum.rq - Final SPARQL text that extracts UCUM information from the Thesaurus and subsets it to the UNIT codelist in CDISC CT
cdisc_ct_ucum.txt - Result set in tab-delimited format
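The linking step is an ordinary join on the c-code. The Python sketch below is not the attached SPARQL file; it only illustrates the join using the rows from Tables 2 and 3, and the function name `join_on_code` is hypothetical. A list per c-code is used because the counts reported above (143 mappings for 130 unique terms) imply that one term can carry more than one UCUM mapping.

```python
# Rows copied from Table 2: (c-code, preferred name, UCUM symbol).
THESAURUS_ROWS = [
    ("C67242", "Cells per Microliter", "{Cells}/uL"),
    ("C25529", "Hour", "h"),
    ("C42548", "Joule", "J"),
    ("C49670", "Millimeter of Mercury", "mm[Hg]"),
    ("C41140", "Millisecond", "ms"),
    ("C48531", "Pound", "[lb_av]"),
    ("C48542", "Tablet Dosing Unit", "{tbl}"),
]

# Rows copied from Table 3: (c-code, CDISC submission value).
CT_ROWS = [
    ("C67242", "cells/uL"),
    ("C25529", "HOURS"),
    ("C42548", "Joule"),
    ("C48531", "LB"),
    ("C49670", "mmHg"),
    ("C41140", "msec"),
    ("C48542", "TABLET"),
]


def join_on_code(thesaurus_rows, ct_rows):
    """Inner-join the two result sets on c-code, keeping every UCUM mapping."""
    ucum_by_code = {}
    for code, name, ucum in thesaurus_rows:
        ucum_by_code.setdefault(code, []).append((name, ucum))
    joined = []
    for code, submission_value in ct_rows:
        for name, ucum in ucum_by_code.get(code, []):
            joined.append((code, name, submission_value, ucum))
    return joined


# Print the joined rows in tab-delimited form.
for row in join_on_code(THESAURUS_ROWS, CT_ROWS):
    print("\t".join(row))
```

Terms on the UNIT codelist that have no UCUM synonym in the Thesaurus simply drop out of the join, which is why the mapping grows only as more UCUM entries are curated.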

Conclusion

NCI EVS actively maintains a rich repository of terminology and biomedical ontology. Its OWL/RDF offering enables scalable IT solutions to search, link, and combine intricate biomedical concepts. This demonstration illustrates one application of semantic web technology. SPARQL made it easy to analyze over 2,160,000 triples (2,100,000 from the Thesaurus and 60,000 from CDISC CT for SDTM). The more UCUM entries NCI EVS curates, the more mappings will become available.

End Notes

SPARQL as specified by W3C: http://www.w3.org/2009/sparql/wiki/Main_Page
All SPARQL queries and OWL/RDF files were processed using TopQuadrant TopBraid Composer FE Version 4.4.0.
These file versions were used in this demonstration: NCI Thesaurus 14.10d and CDISC CT 2014-09-26.
URL to download NCI Thesaurus OWL/RDF: http://cbiit.nci.nih.gov/evs-download/thesaurus-downloads
URL to download CDISC CT OWL/RDF for SDTM: http://evs.nci.nih.gov/ftp1/CDISC/SDTM/SDTM%20Terminology.OWL.zip

Unicode Name (Code Point)	Replacement ASCII Character (Decimal)	Remarks
LEFT SINGLE QUOTATION MARK (U+2018)	' (39)	Commonly referred to as the left curly single quote. This character is part of Microsoft Word's default AutoFormat setting.
RIGHT SINGLE QUOTATION MARK (U+2019)	' (39)	Commonly referred to as the right curly single quote. This character is part of Microsoft Word's default AutoFormat setting.
LEFT DOUBLE QUOTATION MARK (U+201C)	" (34)	Commonly referred to as the left curly double quote. This character is part of Microsoft Word's default AutoFormat setting.
RIGHT DOUBLE QUOTATION MARK (U+201D)	" (34)	Commonly referred to as the right curly double quote. This character is part of Microsoft Word's default AutoFormat setting.
NON-BREAKING HYPHEN (U+2011)	- (45)
FIGURE DASH (U+2012)	- (45)	Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as â€’
EN DASH (U+2013)	- (45)	This character is part of Microsoft Word's default AutoCorrect setting. Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as â€“
EM DASH (U+2014)	- (45)	Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as â€”
HORIZONTAL ELLIPSIS (U+2026)	... (46, 3 times)	This character is part of Microsoft Word's default AutoCorrect setting. Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as â€¦
ZERO WIDTH SPACE (U+200B)	null (0)	Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as â€‹ This is a non-printable character. Microsoft Word uses it to represent optional breaks. It is visible after enabling the Show Formatting Symbols option.
NO-BREAK SPACE (U+00A0)	whitespace (32)	Viewing without proper encoding, e.g., Windows 1252 code page, this character will appear as Â Although it is a printable character, Microsoft Word uses it to represent nonbreaking space, but disguises it as regular whitespace. It is visible after enabling the Show Formatting Symbols option.

Blog from January, 2015