A Proposal: Rules as Machine-Computable Metadata

The Standards Review Council (SRC) recently reviewed the SDTM conformance rules ("Rules") produced by the SDTMV. After painstakingly combing through SDTM v1.4 and SDTMIG v3.2, the team identified 400+ rule candidates. At the time of this blog post, the SRC is working with the sub-team to address some reviewer comments before making the package available for Public Review. As you can preview here, the construct is not very different from those published in the FDA SDTM Validation Rules and by the OpenCDISC community: each Rule has an identifier, a context, a rule description written in a pre-specified lexicon, a condition, and a citation of the rule's source.

As a Metadata Curator, I need to ask myself what the Rules mean to SHARE, as metadata. The text and description are, by definition, not metadata. Extra steps are needed to tease out the metadata. I thought I would first illustrate a typical rule construct, or model, shown here:
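
As a rough stand-in for that construct, the parts of a Rule named earlier (identifier, context, description, condition, and source citation) could be held in a small record type. This is only a sketch: the class name `Rule`, the field names, and the example values (including the identifier `EX0001`) are assumptions made for illustration and are not taken from the Rules package.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Rule:
    """One conformance rule, split into the parts named in the text."""
    identifier: str   # rule ID (hypothetical numbering scheme)
    context: str      # where the rule applies, e.g. a class or variable
    description: str  # natural-language statement of the rule
    condition: str    # when the rule applies
    citation: str     # source document the rule was drawn from


# Made-up example values, for illustration only.
example = Rule(
    identifier="EX0001",
    context="Findings.--TESTCD",
    description="--TESTCD must not be null.",
    condition="Always",
    citation="SDTMIG v3.2",
)

# A plain dictionary view, ready to be serialized or loaded into a repository.
print(asdict(example))
```

Keeping each part in its own field, rather than in one block of text, is what makes the rule usable as metadata.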

Furthermore, I formulated these objectives to help me devise solutions (my philosophy of innovation: first understand the whats before bothering with the hows):

1. A rule may not be limited to so-called validation; it may cover a wide variety of constraints such as referential integrity, data ranges, data derivations, inter-dependencies, etc.
2. A rule may be direct, that is, used to express a piece of data in some data elements, a data element in some classes, or some classes within a model.
3. Conversely, a rule may be descriptive, meaning its description or condition is not directly related to the data, data elements, classes, or models.
4. Rule metadata may be platform independent and vendor neutral.
5. Rule metadata may be machine-computable, and at the same time may be complemented by natural language. This and Objective #4 together mean the rule metadata may not be machine-executable.
6. Rule metadata may allow third-party players such as vendors, pharmaceutical companies, regulatory agencies, etc., to consume a common set of constraints and do what they want with it.
7. Rule metadata may resolve the undesirable ambiguity often found in the natural language used to describe rules.
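
To make the distinction between machine-computable and machine-executable concrete, here is a minimal sketch in Python. The rule is stored as plain data that any platform can read, and each consumer supplies its own interpreter. The JSON layout, the `max_length` constraint kind, the rule ID `EX0002`, and the function `evaluate` are all hypothetical; they are not part of any CDISC or SHARE specification.

```python
import json

# Hypothetical, vendor-neutral representation of one rule.
# The "text" field carries the natural-language complement.
RULE_JSON = '''
{
  "id": "EX0002",
  "context": "LBTESTCD",
  "constraint": {"kind": "max_length", "value": 8},
  "text": "LBTESTCD is at most 8 characters long."
}
'''


def evaluate(rule: dict, record: dict) -> bool:
    """Interpret the declarative constraint against one record.

    This is one possible consumer; a vendor could write its own
    interpreter in any language from the same metadata.
    """
    value = record.get(rule["context"])
    constraint = rule["constraint"]
    if constraint["kind"] == "max_length":
        return value is not None and len(value) <= constraint["value"]
    raise ValueError(f"unsupported constraint kind: {constraint['kind']}")


rule = json.loads(RULE_JSON)
print(evaluate(rule, {"LBTESTCD": "GLUC"}))         # True
print(evaluate(rule, {"LBTESTCD": "GLUCOSEFAST"}))  # False
```

The metadata itself contains no code, so it stays independent of any one tool, while remaining precise enough to be computed on.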

Additionally, I self-imposed some scoping limitations, i.e., a list of "won't do's" to keep implementation simple so this can be completed within a reasonable amount of time:

Shall not invent a pseudocode mechanism. By the same token, shall not invent new grammars and expressions.
Shall not invent software.
Shall not need to cover all Rules, recognizing not everything will fit perfectly.

After some research, along with input from volunteers and peers, I found two choices. Both are open standards and fit my objectives:

HL7 GELLO, most current version Release 2
Object Management Group (OMG) Object Constraint Language (OCL), most current version 2.4

At first, I found HL7 GELLO fascinating, as it supports a huge range of medical and healthcare data. After all, it is designed for clinical decision support. That said, because it requires an understanding of the HL7 RIM and a specialized toolset, it would be very difficult to find a sustainable workforce to develop and maintain rules using the GELLO framework.

A little more research revealed that GELLO is in fact based on OMG OCL. Here are a few characteristics of OCL that resonate with me and my Objectives:

It is aligned with both Unified Modeling Language (UML) and Meta Object Facility (MOF). It is often described as a partner standard to UML. It is fair to say you can't have UML without OCL.
It is object-oriented, meaning constraints are placed directly onto objects and classes in a model.
It has inheritance; therefore, model expansions that instantiate new objects from classes will carry over existing constraints.
Context matters. An OCL expression must declare the context in which its constraints apply. This is an essential element in achieving disambiguation. For example, the context can be an object, a class, a set, a tuple (such as a value list), an association, etc.
It is assertion based, most noticeably in the invariant constraint type. Invariants are conditions that must be true. This coincides with the SDTM Validation Sub-Team's approach, which emphasizes the "positives."
It supports both pre- and post-conditions. They will be useful for implementing the condition information in Rules.
It is supported by a number of commercial off-the-shelf (e.g., Enterprise Architect, MagicDraw) and open-source software (e.g., Eclipse).
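
Three of these characteristics (a declared context, invariants, and inheritance of constraints) can be mimicked in a few lines of Python. This is an analogy only, not OCL: the classes `Findings` and `LabFindings`, the `invariants` method, and the helper `violations` are hypothetical names invented for this sketch.

```python
class Findings:
    """Hypothetical stand-in for a class in the model; it is the context."""

    def __init__(self, testcd):
        self.testcd = testcd

    def invariants(self):
        # Conditions that must be true for every instance of this class.
        return {"testcd_not_null": self.testcd is not None}


class LabFindings(Findings):
    """Hypothetical extension of Findings; it inherits the invariant above."""

    def __init__(self, testcd, result):
        super().__init__(testcd)
        self.result = result

    def invariants(self):
        checks = super().invariants()  # constraints carried over from the parent
        checks["result_present"] = self.result is not None
        return checks


def violations(obj):
    """Return the names of the invariants that do not hold for obj."""
    return [name for name, ok in obj.invariants().items() if not ok]


print(violations(LabFindings("GLUC", "5.1")))  # []
print(violations(LabFindings(None, "5.1")))    # ['testcd_not_null']
```

The subclass never restates the null check, yet it is still enforced, which is the behavior the inheritance point above describes.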

This diagram nicely depicts the information architecture we use and how the CDISC product family stacks up in terms of the overall model framework.

With that said and illustrated, OMG OCL is a no-brainer choice to me. UML, and hence OCL, is the next logical step to further (and complete) the architectural blueprint.

I have only recently begun studying the OCL specifications to solidify my thinking. I hope the little work I attempted helps demonstrate this proposal. Below is a subset of the SDTM Findings class drawn using Enterprise Architect:

I added a few OCL constraints to --TESTCD:

--TESTCD, being a topic variable, cannot be null.
--TESTCD has a length less than or equal to 8.
--TESTCD has a data pattern requirement: it contains only uppercase letters, numbers, and underscores, and the first character must be an uppercase letter.

Their OCL expressions are as follows:
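
For readers less familiar with OCL, the same three constraints can also be sketched in plain Python. This is an illustration under assumptions, not a translation of the OCL: the function name `check_testcd`, the problem labels, the regular expression, and the choice to treat an empty string as null are all assumptions made for this sketch.

```python
import re

# Assumed reading of the pattern rule: first character is an uppercase
# letter, the rest are uppercase letters, digits, or underscores.
TESTCD_PATTERN = re.compile(r"[A-Z][A-Z0-9_]*")


def check_testcd(value):
    """Return a list of problems found in one --TESTCD value."""
    if value is None or value == "":  # assumption: blank counts as null
        return ["null"]
    problems = []
    if len(value) > 8:
        problems.append("too long")
    if not TESTCD_PATTERN.fullmatch(value):
        problems.append("bad pattern")
    return problems


for candidate in ["GLUC", None, "1ABC", "ABCDEFGHI"]:
    print(repr(candidate), check_testcd(candidate))
```

Each check maps to one of the three statements listed above, so a failing value can be traced back to the specific rule it breaks.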

Imagine being able to run test data through the whole series of OCL constraints as an exercise to validate their correctness. This would enable us to run example data to test the constraints' validity prior to including them in Implementation Guides or User Guides. As a matter of fact, these are not far-fetched ideas. This YouTube video, posted by the maker of a third-party modeling tool called MagicDraw, demonstrates the power of test automation using OCL functionality. At 6:00, the video shows how easy it is to validate an OCL constraint using some XML data: prepare an XML file guaranteed to trigger a constraint violation, then run it against the rules in compiled Java code and the auto-generated schema file. Pretty nifty.
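
The same idea can be tried at small scale without any modeling tool. The sketch below is not the MagicDraw workflow; it only imitates the step of feeding in an XML file built to trigger a violation. The XML layout, the `LBTESTCD` attribute placement, and the function `failing_values` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical test data: the second record is built to break the pattern rule.
TEST_XML = """
<records>
  <record LBTESTCD="GLUC"/>
  <record LBTESTCD="9BAD"/>
</records>
"""

# Uppercase letter first, then up to 7 uppercase letters, digits, or underscores.
PATTERN = re.compile(r"[A-Z][A-Z0-9_]{0,7}")


def failing_values(xml_text):
    """Return the LBTESTCD values that violate the pattern constraint."""
    root = ET.fromstring(xml_text.strip())
    return [
        rec.get("LBTESTCD")
        for rec in root.iter("record")
        if not PATTERN.fullmatch(rec.get("LBTESTCD") or "")
    ]


print(failing_values(TEST_XML))  # ['9BAD']
```

A test file like this doubles as documentation: it shows, by example, what the constraint is meant to reject.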

The vision of this proposal:

CDISC will treat and maintain the SDTM as a true data model.
Model constraints such as rules, derivations, and domain inter-dependencies will be normative and be part of each point release.
SDTMIG will be drastically less voluminous. Much of the text that aims to explain implementation will be replaced by rules and similar constraint metadata.
Additional SHARE automation can be had through meta-metadata and its constraints.

In conclusion, SHARE brings a certain discipline and conduct to the standards development process. Engineering SDTM with a UML model and refitting validation rules using OCL are not only logical, but essential to leading the industry with technical innovation. Furthermore, this will address a lot of the model and implementation ambiguities that currently exist. Lastly, I'd like to make a call for volunteers to further implement this proposal, perhaps through a proof-of-concept project to create a testbed that applies model constraints and rules metadata to submission data validation and other uses.
