The Standards Review Council (SRC) recently reviewed the SDTM conformance rules ("Rules") produced by the SDTM Validation Sub-Team. After painstakingly combing through SDTM v1.4 and SDTMIG v3.2, the team identified 400+ rule candidates. At the time of this blog post, the SRC is working with the sub-team to address reviewer comments before making the package available for Public Review. As you can preview here, the construct is not very different from the published FDA SDTM Validation Rules and OpenCDISC Community rules: each Rule has an identifier, a context, a rule description in a pre-specified lexicon, a condition, and a citation of the rule's source.

As a Metadata Curator, I need to ask myself what the Rules mean to SHARE as metadata. The text and descriptions are, by definition, not metadata; extra steps are needed to tease the metadata out. Let me first illustrate a typical rule construct, or model, shown here:

Furthermore, I formulated these objectives to help me devise solutions (my philosophy for innovation: first understand the what's before bothering with the how's):

  1. A rule may not be limited to so-called validation; it may cover a wide variety of constraints such as referential integrity, data ranges, data derivations, inter-dependencies, etc.
  2. A rule may be direct, expressing a piece of data in some data elements, a data element in some classes, or some classes within a model.
  3. Conversely, a rule may be descriptive, with a description or condition that is not directly related to the data, data elements, classes, or models.
  4. Rule metadata may be platform independent and vendor neutral.
  5. Rule metadata may be machine-computable and, at the same time, may be complemented by natural language. This and Objective #4 together mean the rule metadata need not be machine-executable.
  6. Rule metadata may allow third parties such as vendors, pharmaceutical companies, regulatory agencies, etc., to consume a common set of constraints and do what they want with them.
  7. Rule metadata may eliminate the ambiguity often found in the natural language used to describe rules.

Additionally, I imposed some scoping limitations on myself, i.e., a list of "won't do's," to keep implementation simple so this can be completed within a reasonable amount of time:

  1. Shall not invent a pseudocode mechanism. By the same token, shall not invent new grammars and expressions.
  2. Shall not invent new software.
  3. Shall not need to cover all Rules, recognizing not everything will fit perfectly.

After some research, along with input from volunteers and peers, I found two choices available. Both are open standards and fit my objectives:

  1. HL7 GELLO, most current version Release 2
  2. Object Management Group (OMG) Object Constraint Language (OCL), most current version 2.4

At first, I found HL7 GELLO fascinating, as it supports a huge range of medical and healthcare data. After all, it is designed for clinical decision support. That said, because it requires an understanding of the HL7 RIM and a specialized toolset, it would be very difficult to find a sustainable workforce to develop and maintain rules using the GELLO framework.

A little more research revealed that GELLO is in fact based on OMG OCL. Here are a few characteristics that resonate with me and my Objectives:

  1. It is aligned with both the Unified Modeling Language (UML) and the Meta Object Facility (MOF). It is often described as a partner standard to UML. It is fair to say you can't have UML without OCL.
  2. It is object-oriented, meaning constraints are placed directly onto objects and classes in a model.
  3. It supports inheritance, so model expansions that instantiate new objects from existing classes will carry over the existing constraints.
  4. Context matters. An OCL constraint must declare the context where it applies. This is an essential element for achieving disambiguation. For example, a context can be an object, a class, a set, a tuple (such as a value list), an association, etc.
  5. It is assertion based, most noticeably through the invariant constraint type. Invariants are conditions that must always be true. This coincides with the SDTM Validation Sub-Team's approach, which emphasizes the "positives."
  6. It supports both pre- and post-conditions, which will be useful for implementing the condition information in the Rules.
  7. It is supported by a number of commercial off-the-shelf tools (e.g., Enterprise Architect, MagicDraw) and open-source software (e.g., Eclipse).
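To make items 4 through 6 concrete, here is a minimal OCL sketch. The class, attribute, and operation names below are hypothetical placeholders for illustration, not taken from the SDTM model:

```ocl
-- Item 4: a context declaration pins the constraint to one class.
-- Item 5: an invariant asserts a condition that must always hold.
-- (Subject, age, enroll, etc. are illustrative names only.)
context Subject
inv ageNonNegative: self.age >= 0

-- Item 6: pre- and post-conditions attach to an operation,
-- mirroring the "condition" component of a Rule.
context Subject::enroll(armCode : String)
pre:  armCode.size() > 0
post: self.enrolled = true
```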

This diagram nicely depicts the information architecture we use and how the CDISC product family stacks up in terms of the overall model framework.

With all of that said and illustrated, OMG OCL is a no-brainer choice for me. UML, and hence OCL, is the next logical step to advance (and complete) the architectural blueprint.

I have only recently begun studying the OCL specification to solidify my thinking. I hope the little work I have attempted helps demonstrate this proposal. Below is a subset of the SDTM Findings class drawn using Enterprise Architect:

I added a few OCL constraints to --TESTCD:

  1. --TESTCD, being a topic variable, cannot be null.
  2. --TESTCD has a length less than or equal to 8.
  3. --TESTCD has a data pattern requirement: it contains only uppercase letters, numbers, and underscores, and the first character must be an uppercase letter.

Their OCL expressions are as follows:
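One possible rendering of the three constraints as invariants, assuming the Findings class exposes --TESTCD as an attribute named testcd (note that the regular-expression operation matches() is an Eclipse OCL extension rather than part of the base OCL specification):

```ocl
context Findings
-- 1. --TESTCD, being a topic variable, cannot be null.
inv testcdNotNull: not self.testcd.oclIsUndefined()

-- 2. --TESTCD has a length less than or equal to 8.
inv testcdMaxLength: self.testcd.size() <= 8

-- 3. --TESTCD contains only uppercase letters, numbers, and underscores,
--    and the first character must be an uppercase letter.
--    (matches() is available in Eclipse OCL, not base OCL 2.4.)
inv testcdPattern: self.testcd.matches('[A-Z][A-Z0-9_]*')
```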

Imagine being able to run test data through the whole series of OCL constraints as an exercise to validate their correctness. This would enable us to test rules against example data before including them in Implementation Guides or User Guides. As a matter of fact, this is not a far-fetched idea. This YouTube video, posted by the vendor of a third-party modeling tool called MagicDraw, adequately demonstrates the power of test automation using OCL functionality. At 6:00, the video shows how easy it is to validate an OCL constraint using some XML data: prepare an XML file guaranteed to trigger a constraint violation, then run it against the rules in compiled Java code with the auto-generated schema file. Pretty nifty.

The vision of this proposal:

  1. CDISC will treat and maintain the SDTM as a true data model.
  2. Model constraints such as rules, derivations, and domain inter-dependencies will be normative and be part of each point release.
  3. The SDTMIG will be drastically less voluminous. Much of the text aimed at explaining implementation will be replaced by rules and similar constraint metadata.
  4. Additional SHARE automation can be had through meta-metadata and its constraints.

In conclusion, SHARE brings a certain discipline and conduct to the standards development process. Engineering the SDTM as a UML model and refitting the validation rules in OCL is not only logical, but essential if we are to lead the industry in technical innovation. Furthermore, it will address many of the model and implementation ambiguities that currently exist. Lastly, I'd like to make a call for volunteers to take this proposal further, perhaps with a proof-of-concept project to create a testbed for applying model constraints and rules metadata to submission data validation and other uses.


6 Comments

  1. Really nice approach to rules.

  2. This is great!
    As you probably know, some of us have started writing a number of these rules (FDA, SDTM, ADaM) in XQuery, which is NOT vendor/technology-neutral, as it assumes Dataset-XML is used for the datasets. So, if we can describe such rules using OCL, and then have them (automatically) "translated" into things that actually execute the rules on XPT, XML, and JSON implementations of the standards, that will be great. The "proof of the pudding" will of course be to see whether some of the complex (e.g., ADaM) rules can be described.
    This would also be a next step into machine-readable IGs.

  3. Impressive analysis.  I completely agree with the vision and strategy outlined here.  Let's talk about how the Rules sub-team can support these efforts. 

  4. Anthony Chow AUTHOR

    UML software these days allows modelers to "annotate" an OCL expression with natural language. To me, this must be done to eliminate the "geek effect"; otherwise there will be no chance of adoption in the industry. Although a number of research papers examine mechanisms to translate natural language to OCL, it is a much more complicated matter to extract semantic meaning from natural language (ontology, natural language rules, etc.). For now, the first step is to get the basic framework in place so we can build some simple OCL expressions.

    That's it for a Sunday morning food for thought.

  5. Anthony

    I read this about a week ago, been meaning to comment since then, got round to it at last!

    First of all, very nice post. Yes, we need this; it was always the intent to move the text-based rules into some form of metadata. As you say, it will reduce the size of the documents. Since I first read the post you have also added your "geek effect" comment. I had a similar thought/concern when I first read it. OCL is not, shall we say, user friendly, and the CDISC community might find it hard to gain an understanding of it.

    Some specific thoughts

    • On the model: does a rule always contain, or refer to, a variable? I think we can have rules at the domain and standard levels. Just as a thought, should SDTM data always contain a DM domain?
    • The list directly below the model picture, number 2, raised a thought in my head about the division between a rule and what 'rules' may be inherent in any model (in the M1 layer). I am just thinking of the RDF meta-model schema. In there, there is the start of the constraints regarding SDTM Domain based on SDTM Class. You could model that, or you could have it in a rule. We need to ponder where the division is.

    Again, nice post

    Dave


  6. I would like to add my thoughts on this topic. I have already implemented a number of “workflows” that check the consistency of metadata in my organisation using the open source tool KNIME (www.knime.org).

    From my experience, creating simple rules using OCL expressions will be of some benefit in the short term, but I am afraid that as you delve deeper it will become apparent that greater precision is needed, and the OCL solution may show its limitations.

    My belief is that we will not be able to avoid the “geek effect” as rules become more complex. It would be great to see the documentation relating to the rules to be implemented (maybe I have missed this). This should be open to public review before starting any work on a future solution.