Blog from January, 2018

Telephone (Chinese whispers or whisper down the lane, among other common name variations) is a children’s game in which the first child whispers a phrase to the next child, and so on down the line. When the phrase reaches the end, the last child reveals what she heard to the entire group. The amusement comes from witnessing how the original phrase becomes increasingly distorted with each pass, especially when the game is played with an obscure phrase.

Variables are a common building block in all CDISC foundational standards. Their properties vary noticeably from one standard to another. For example, Variable Label, a descriptive text, is common across CDASH (data standards for collection), SEND/SDTM (for aggregation), and ADaM (for analysis). On the other hand, Role, which states how a variable functions in a given dataset, is a property unique to SDTM and the Implementation Guides it supports. This distinction in variable properties is reasonable, as each foundational standard addresses a specific purpose in the clinical data lifecycle. Nonetheless, every variable must be accompanied by a Definition. The purpose of Definition is to state, in a descriptive statement, the essential meaning of a variable in a precise and unambiguous manner.

Similarly, in a game of telephone, the more comprehensible a phrase is to the players, the better its chance of retaining its original form at the end. An obscure phrase, however, will very likely finish as gobbledygook.

This is not to say that Definition is non-existent. Metadata tables published alongside the latest CDASH Model and Implementation Guide contain a column labeled Draft CDASH Definition for every variable. Section 3.5.1 of the CDASH Implementation Guide v2.0 (https://www.cdisc.org/standards/foundational/cdash/cdash-20#Bookmark18) explains that the CDASH team will harmonize its definitions with SDTM in the future.

Checking recent SDTM publications, the CDISC Notes column in the Implementation Guides (including Description in the Model) comes closest to Definition, though it is not a direct match. Unlike Variable Label, which has a character length limit due to regulatory data submission requirements, CDISC Notes has no such restriction. Due to its free-form nature, CDISC Notes contains a variety of useful information:

  • Explanatory text, e.g., ‘Characterization of the duration of a biological process resulting in a particular finding.’

  • Data examples, e.g., ‘Examples: "ng", "mg", or "mg/kg".’

  • Data rules, e.g., ‘If MHTERM is modified to facilitate coding, then MHMODIFY will contain the modified text.’

  • Usage rules, e.g., ‘When dosing of a treatment is recorded over multiple successive records, this variable is applicable only for the (chronologically) last record for the treatment.’

Extracting Definition from CDISC Notes will also flush out the other entangled properties, such as rules, usages, and examples, and elucidate their purposes.
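As a rough illustration of this untangling, the sketch below tags the four fragments quoted above by hand and groups them by kind. The `NoteKind` categories and the `group_by_kind` helper are hypothetical names chosen for this example; they are not part of any CDISC publication or tool, and the tagging itself is manual rather than inferred.

```python
from collections import defaultdict
from enum import Enum


class NoteKind(Enum):
    """Hypothetical categories for the content found in CDISC Notes."""
    EXPLANATION = "explanation"
    EXAMPLE = "example"
    DATA_RULE = "data rule"
    USAGE_RULE = "usage rule"


# Fragments quoted in this post, tagged by hand.
TAGGED_FRAGMENTS = [
    (NoteKind.EXPLANATION,
     "Characterization of the duration of a biological process "
     "resulting in a particular finding."),
    (NoteKind.EXAMPLE, 'Examples: "ng", "mg", or "mg/kg".'),
    (NoteKind.DATA_RULE,
     "If MHTERM is modified to facilitate coding, then MHMODIFY "
     "will contain the modified text."),
    (NoteKind.USAGE_RULE,
     "When dosing of a treatment is recorded over multiple successive "
     "records, this variable is applicable only for the "
     "(chronologically) last record for the treatment."),
]


def group_by_kind(fragments):
    """Collect fragment texts under the kind they were tagged with."""
    grouped = defaultdict(list)
    for kind, text in fragments:
        grouped[kind].append(text)
    return dict(grouped)


grouped = group_by_kind(TAGGED_FRAGMENTS)
for kind, texts in grouped.items():
    print(f"{kind.value}: {len(texts)} fragment(s)")
```

Once the fragments are separated this way, only the explanatory text is a candidate for Definition; the other kinds can each be maintained as a property of their own.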

Data flow from one stage to another in a data lifecycle. Sources and targets form data lineage. Data rules may be added to describe manipulations such as data imputation, derivation, and transformation. It is a fundamental principle not to keep the same variable name whenever a manipulation between lifecycle stages changes the variable's meaning or any of its properties. Applying this principle to CDISC foundational standards, a variable of the same name used in two lifecycle stages, e.g., from collection (CDASH) to aggregation (SEND/SDTM), ought to have the same essential meaning. The harmonization effort anticipated by the CDASH team embodies this principle.
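The principle lends itself to a simple automated check. The sketch below is a minimal illustration, assuming definitions are available as plain text keyed by standard and variable name; the `find_definition_conflicts` function, the variable names `VAR1` and `VAR2`, and their definition texts are all made up for the example and do not come from any CDISC standard.

```python
from collections import defaultdict


def find_definition_conflicts(catalog):
    """Return {variable name: {standard: definition}} for every name
    whose definition text differs between standards."""
    by_name = defaultdict(dict)
    for standard, variables in catalog.items():
        for name, definition in variables.items():
            by_name[name][standard] = definition
    return {
        name: per_standard
        for name, per_standard in by_name.items()
        if len(set(per_standard.values())) > 1
    }


# Hypothetical catalog: names and definitions are placeholders only.
catalog = {
    "CDASH": {
        "VAR1": "Definition text A.",
        "VAR2": "Definition text B.",
    },
    "SDTM": {
        "VAR1": "Definition text A.",
        "VAR2": "Definition text B, with a changed meaning.",
    },
}

conflicts = find_definition_conflicts(catalog)
for name, per_standard in conflicts.items():
    print(name, "differs across", sorted(per_standard))
```

A name reported by such a check signals either that the definitions need harmonizing or that the variable deserves a different name in one of the stages.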

Usability of data increases when they are dependable. Good definitions are key to achieving dependable data. People reasonably expect a high degree of permanence once definitions are established. Therefore, changes to definitions need to be made judiciously. This requires active governance, with an effective team of experts charged with establishing and safeguarding good definitions.

Other suggestions for good definitions are:

  • Specify valid values, but in a separate property that complements definitions. Valid values could mean codelists, value lists, or external dictionaries. This is because data with different sets of valid values, more often than not, represent different (close, far, or unrelated) concepts.

  • Be type aware. A general concept may have specific usage with added constraints such as type. For example, a U.S. ZIP code, belonging to the postal code concept, is a group of five or nine numbers used in conjunction with a postal address to assist the sorting of mail. The definition may be incomplete without making clear that a U.S. ZIP code is numerical.

  • Use common lexicons. Referenceable knowledge eases comprehension. For example, epoch and randomization are words with specific connotations in the clinical context. It would be wise to leverage terms and definitions published in the CDISC Glossary (https://www.cdisc.org/standards/semantics/glossary) to ensure their consistent and correct usage within foundational and therapeutic area standards.

  • Be conservative with the Model, be liberal with the Implementation Guides. Since all variable concepts in an Implementation Guide are created within the confines of the Model it references, variable definitions in the Model should be easily applied to domain-specific variables in an Implementation Guide, provided governance is robust for variable definitions in the Model.
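The first two suggestions can be sketched as a small data structure in which the definition text, the type, and the valid-value constraint are separate properties. This is only an illustration: the `ConceptSpec` class and its field names are hypothetical, and the pattern assumes that a nine-digit ZIP code may be written with or without a hyphen after the fifth digit.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptSpec:
    """Hypothetical record keeping definition, type, and valid values apart."""
    name: str
    definition: str   # essential meaning only, no examples or rules
    data_type: str    # type constraint, stated separately
    pattern: str      # valid-value constraint, stated separately

    def accepts(self, value: str) -> bool:
        """Check a value against the valid-value pattern."""
        return re.fullmatch(self.pattern, value) is not None


us_zip = ConceptSpec(
    name="U.S. ZIP code",
    definition=("A group of five or nine numbers used in conjunction with "
                "a postal address to assist the sorting of mail."),
    data_type="digits",
    # Assumption: five digits, optionally followed by four more digits,
    # with an optional hyphen between the two groups.
    pattern=r"[0-9]{5}(-?[0-9]{4})?",
)

print(us_zip.accepts("12345"))       # five digits
print(us_zip.accepts("12345-6789"))  # nine digits with a hyphen
print(us_zip.accepts("1234A"))       # not numerical
```

Keeping the pattern out of the definition text means the valid values can be revised without touching the definition itself.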

It is important to note that good definitions should not have example values embedded in their text, in order to avoid these situations:

  • Perpetual modifications. There have been more than isolated cases of people requesting that example values be added, updated, or removed. These requests, even when fulfilled without altering a variable’s meaning, are modifications nonetheless. As harmless as it may seem, each one resets the state of permanence. A low degree or lack of permanence indicates poor definitions.

  • Conflicting information. Valid values could be dependent on context. For example, different therapeutic areas may have slightly different requirements for a variable’s valid values. At CDISC, a process has been established to document value subsets, conditional codelists, etc. Therefore, hard-wiring example values into Definition will lead to confusion and sometimes to conflicts with the actual valid values.

  • Chained definitions. Adding an example value to Definition assumes that people understand the meaning of that value. When that isn’t the case, they will have to refer to another variable’s Definition or to other sources, causing a chained effect. This chained effect is indicative of an imprecise explanation in the first place.

Certain areas will likely be impacted, to a degree yet to be known, until some mandates about definitions are put in place. Formulating definitions is often more of an art than a science, and the practice could impact the speed of development. Additionally, domains, which are composed of variables, are versioned in SDTM Implementation Guides. Variable definitions could impact the policy of when and how domains are up-versioned. Furthermore, because concepts can have wide and narrow scopes (e.g., U.S. ZIP code is a subclass of the postal code concept; injection site is closely related to anatomical location), a CDISC ontology may emerge over time once a certain volume of well-defined variables is reached. Last, but not least, a machine-readable mechanism would be a welcome replacement for today’s two-dimensional view for documenting all the intricacies of variable properties.

Putting definitions front and center during all standards development activities will require cross-team debate and consensus. It is undoubtedly going to disrupt current processes and norms. However, enormous benefits await. Not only will variables be disambiguated, but variables and data will also no longer be interpreted differently because of poor or missing definitions. With today's strong emphasis on linked data (i.e., published data with a strong semantic backbone) and on how cures could be unlocked by cross-domain data analyses, dependable data sources are a basic requirement to even begin tapping into that power.

In closing, let's imagine the contrasting results of playing a game of telephone with these two versions of definitions for the variable Planned Arm Code (ARMCD). Unless all players have superior photographic memory, Game 1 will undoubtedly wind up far from the original. The succinct version for Game 2, in contrast, will likely survive the game with little to no distortion.

Game 1: CDISC Notes as published in SDTMIG v3.2

ARMCD is limited to 20 characters and does not have special character restrictions. The maximum length of ARMCD is longer than for other “short” variables to accommodate the kind of values that are likely to be needed for crossover trials. For example, if ARMCD values for a seven-period crossover were constructed using two-character abbreviations for each treatment and separating hyphens, the length of ARMCD values would be 20.

Game 2: Definition as proposed by the Variable Definitions team

A short sequence of characters that represents the planned arm to which the subject was assigned.
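The earlier advice against embedding example values in definitions can also be checked mechanically, at least crudely. The sketch below is a naive heuristic, not an established CDISC check: the `embeds_examples` function and its list of marker phrases are assumptions made for this illustration, and a real review would still need human judgment.

```python
import re

# Assumed marker phrases that usually introduce example values.
EXAMPLE_MARKERS = re.compile(
    r"\b(for example|for instance|e\.g\.|examples?:)", re.IGNORECASE
)


def embeds_examples(text: str) -> bool:
    """Return True if the text contains a phrase that introduces examples."""
    return EXAMPLE_MARKERS.search(text) is not None


GAME_1 = (
    "ARMCD is limited to 20 characters and does not have special character "
    "restrictions. The maximum length of ARMCD is longer than for other "
    "“short” variables to accommodate the kind of values that are likely to "
    "be needed for crossover trials. For example, if ARMCD values for a "
    "seven-period crossover were constructed using two-character "
    "abbreviations for each treatment and separating hyphens, the length of "
    "ARMCD values would be 20."
)
GAME_2 = (
    "A short sequence of characters that represents the planned arm to "
    "which the subject was assigned."
)

for label, text in (("Game 1", GAME_1), ("Game 2", GAME_2)):
    print(label, "embeds examples:", embeds_examples(text))
```

Run against the two texts above, the heuristic flags the Game 1 text because of its "For example" sentence and leaves the Game 2 definition alone.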

In this post, I have established the rationale for extracting and maintaining formal definitions for CDISC variables. At the same time, there is an opportunity to untangle the intricate variable properties in CDISC Notes or Description and bring normative information such as rules, usages, and other standards conformance details to light. Further, I have proposed that variable definitions will not be just another piece of information, but will serve an essential role in both governance and semantics. Lastly, I have recognized the risks and challenges of this disruptive proposal, though those concerns are outweighed by its many benefits. I hope to see your comments and debates on this topic.

Acknowledgements

I want to express my appreciation to my CDISC SHARE colleagues for their points of view on this topic, which inspired me to author this blog entry: Dr. Sam Hume, Darcy Wold, Julie Chason, Dr. Lauren Becnel, and Frederik Malfait. Special thanks to Erin Muhlbradt, Ph.D., NCI Enterprise Vocabulary Services, for sharing her experience and insights in controlled terminology curation, as well as for proofreading this content.