Blog from May, 2014

Before getting to the topic, here is a little background to set the stage. During R1 development in late 2013, the SHARE dev team loaded SDTM v1.2 / SDTMIG v3.1.2 into SHARE. It serves as the baseline SDTM content. In March 2014, we held a kickoff meeting with the SDS volunteers to begin the journey of adding new content, i.e., SDTM v1.3 / SDTMIG v3.1.3 and SDTM v1.4 / SDTMIG v3.2.

 

Everything in SHARE is interconnected with relationships (see SHARE Metamodel). It was obvious we needed to divide the work into two pieces as SDTM v1.4 is a child of SDTM v1.3, which itself is a child of SDTM v1.2. In other words, SDTM v1.4 can’t be a parallel task, at least not until SDTM v1.3’s content is stable.
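To make the ordering constraint concrete, here is a minimal Python sketch. The `PARENT` mapping and the `load_order` helper are my own illustration of the lineage described above, not anything taken from the SHARE tool.

```python
# Hypothetical parent links, mirroring the lineage described above.
PARENT = {
    "SDTM v1.2": None,
    "SDTM v1.3": "SDTM v1.2",
    "SDTM v1.4": "SDTM v1.3",
}


def load_order(parent):
    """Return the versions ordered so that every parent precedes its child."""
    ordered = []

    def visit(version):
        if version in ordered:
            return
        if parent[version] is not None:
            visit(parent[version])  # a child can only follow its parent
        ordered.append(version)

    for version in parent:
        visit(version)
    return ordered


print(load_order(PARENT))
```

Whatever order the versions are listed in, each one is placed only after its parent, which is why the SDTM v1.4 work has to wait for SDTM v1.3.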

 

As part of a tool onboarding exercise, we thought it would be of great benefit to enter some metadata interactively online (as opposed to importing data). In April, the team split into three pairs to develop content for the three new oncology domains RS, TR, and TU introduced in SDTMIG v3.1.3. As illustrated on the SHARE Stack diagram, the first hurdle is to correctly up-version the managed objects (or “assets”, as they are called in the tool). Since these new domains belong to the Findings class, the Findings class asset needs a new version to contain both new and existing domains. Further, the TU domain also implements the new class variables --LAT, --DIR, and --PORTOT. Therefore, the General Findings class asset also needs a new version to contain both new and existing class variables. Before SHARE, these relationships could only be deduced from section 6.1 of the SDTM v1.3 publication. SHARE forces us to be explicit and express them in a machine-readable way.

 

That was the easy part -- we had only been dealing with new items thus far. For existing domains, the team needed to know exactly what changed between the two versions of the standard. Such a manifest needs to be granular to be useful for SHARE. For example, the TSVAL variable in Domain TS has a change in CDISC Notes and Core, while the only change in Domain CM is CMDOSFRM’s role. To accomplish that, we needed reliable machine-readable metadata as input in order to produce a reliable metadata comparison as output[a]. At the time of this blog entry, the team is halfway through reviewing the SDTM v1.3 / SDTMIG v3.1.3 metadata spreadsheet posted on the CDISC website. With keen eyes for detail, the team has already identified several discussion-worthy discrepancies between the PDF publication and the input metadata spreadsheet. We will discuss how to resolve them in the upcoming team meetings[b]. My gut feeling is that the decisions will be contingent on how prevalent each category is, i.e., one size may not fit all.

[a] Using the existing metadata spreadsheet, there are 1,134 changes from SDTM v1.2 / SDTMIG v3.1.2 to SDTM v1.3 / SDTMIG v3.1.3. The count reflects each attribute change, including variable name, label, order, data type, controlled term, role, CDISC notes, and core, and excludes the three new oncology domains.

[b] The intention is that, once the issues are resolved, the metadata curator will generate SHARE-friendly import files using the reviewed metadata. Given the magnitude of changes, manual data entry would not be efficient.
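To illustrate what an attribute-level comparison can look like, here is a minimal Python sketch. It is not the script behind the count in [a]. The `diff_variables` function, the `ATTRIBUTES` tuple, and every attribute value below are placeholders I made up for illustration, and the real spreadsheet carries more attributes than the four shown.

```python
# Attributes compared per variable; a hypothetical subset of the real list.
ATTRIBUTES = ("label", "role", "core", "notes")


def diff_variables(old, new):
    """List (variable, attribute, before, after) for variables in both versions."""
    changes = []
    for name in sorted(old.keys() & new.keys()):  # skip brand-new variables
        for attr in ATTRIBUTES:
            before = old[name].get(attr)
            after = new[name].get(attr)
            if before != after:
                changes.append((name, attr, before, after))
    return changes


# Placeholder values only; these are not the published metadata.
v312 = {
    "TSVAL": {"role": "role-same", "core": "core-old", "notes": "notes-old"},
    "CMDOSFRM": {"role": "role-old", "core": "core-same"},
}
v313 = {
    "TSVAL": {"role": "role-same", "core": "core-new", "notes": "notes-new"},
    "CMDOSFRM": {"role": "role-new", "core": "core-same"},
    "NEWVAR": {"role": "role-any"},  # exists only in the newer version
}

for change in diff_variables(v312, v313):
    print(change)
```

Each tuple is one attribute change, so the length of the returned list corresponds to the kind of count quoted in [a].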

Another challenge we face is CDISC Controlled Terminology. Specifically, it is the evolution of codelist development that creates an unanticipated complexity. Within the sixteen months between the final publications of SDTM v1.3 (2012-07) and SDTM v1.4 (2013-11), five CDISC Controlled Terminology releases were published and new codelists were introduced. Examples include C78735 (EVAL: Evaluator) for --EVAL, C99079 (EPOCH: Epoch) for EPOCH, and C66728 (STENRF: Relation to Reference Period) for --STRF, --ENRF, --STRTPT, and --ENRTPT[c]. We will have to decide how to handle this in SHARE.

[c] Credits go to the PhUSE Semantic Technology team. By analyzing their RDF materials using SPARQL, I realized the team retrospectively applied these CDISC Controlled Terminology codelists to the SDTM v1.3 / SDTMIG v3.1.3 triples. It is good to see how other people may interpret the standard.

There are other hurdles too. The team will decide how to represent value domain and conceptual domain (see ISO 11179 Part 4) for the new MedDRA variables in the Events class, as well as ISO 21090’s nullFlavor associated with the TSVALNF variable in Domain TS. Note that both cases involve external standards whose values are managed by a different entity. In addition, intellectual property and copyright may further complicate the matter.

 

Suffice it to say, adding existing content into SHARE is easier said than done. Challenges that would not materialize on paper now surface because the metadata repository requires us to be precise and verbose. That said, it is important to do the right thing so the community will soon benefit from truly interoperable data.

 

Lastly, I leave you this (in reference to SHARE). Your feedback is always welcome.

 

 

I recently admitted to my colleagues that I am a tinkerer by nature -- I indulge in intricacies and desire a deep understanding of the things I work on. On this fine Saturday, I had the urge to find out the Chinese word for metadata. Through web searches, I realized there isn't one common name for it. Among those that appear in the search results, the more popular ones are 詮釋資料, which means annotative or explanatory information; 後設資料, mostly used in Taiwan, which literally means information behind the setting; and 元數據. 元 (yuán) means beginning, first, and origin in Chinese, while 數據 (shùjù) means data.

 

While metadata can add annotations and background meanings to data, the notions of beginning and first truly accentuate the purpose of metadata; therefore, the last translation (元數據) is the one I prefer the most. My experience backs it up: as a Metadata Curator for the CDISC SHARE project, I came to realize how much more difficult it is than I thought to define metadata for existing standards. This retrospective exercise often entails tedious manual interventions such as transcribing information from PDF publications into machine-readable formats[a]; discerning proper relationships[b]; and handling omitted information or mistakes in the original publications, the details of which may deserve another blog post. All this manual work easily eclipses the effort spent creating the mechanical counterparts, which themselves are no easy feat.

[a] We first had to turn the PDF or its Microsoft Word sources into a spreadsheet format. Then other challenges emerged, such as character encoding (e.g., non-printable characters and the smart curly quotes that Microsoft Office substitutes by default), combining multiple sources (e.g., CDASH v1.1 and CDASH User Guide v1.0), and minute details such as handling the NullFlavor details described in SDTM's Trial Summary (TS) domain -- is it intended to be registered as a CDISC Controlled Terminology in NCI EVS? Further, NullFlavor's governing authority can be HL7 or ISO 21090.
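As a rough idea of the character cleanup involved, here is a minimal Python sketch using only the standard library. The `clean_cell` helper and its replacement table are hypothetical, not the routine we actually used. Whether to straighten quotes at all is a judgment call, since it alters the published text.

```python
import unicodedata

# Hypothetical replacement table: curly quotes and non-breaking space.
SMART_PUNCTUATION = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00a0": " ",   # non-breaking space
})


def clean_cell(text):
    """Straighten smart punctuation and drop non-printable characters."""
    text = text.translate(SMART_PUNCTUATION)
    # Remove control (Cc) and format (Cf) characters, keeping tab and newline.
    return "".join(
        ch for ch in text
        if ch in "\t\n" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


print(clean_cell("\u201cSubject\u2019s age\u201d\u200b\x0b"))
```

The sample input carries a zero-width space and a vertical tab, both invisible in a spreadsheet cell, and both are removed along with the curly quotes.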

[b] A good example is identifying codelist supersets and subsets in the CDISC Controlled Terminology. For instance, the Age Unit (C66781) codelist is a subset of the Unit (C71620) codelist.
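Once the terms are machine-readable, the subset relationship in [b] can be checked mechanically. Here is a minimal Python sketch; the `subset_pairs` helper is my own, and the term sets are abbreviated for illustration and are not the complete published codelists.

```python
from itertools import permutations


def subset_pairs(codelists):
    """Return (subset_code, superset_code) for every proper-subset relationship."""
    return [
        (a, b)
        for a, b in permutations(codelists, 2)
        if codelists[a] < codelists[b]  # "<" on sets tests for a proper subset
    ]


# Abbreviated, illustrative term sets keyed by codelist code.
age_unit_terms = {"DAYS", "HOURS", "MONTHS", "WEEKS", "YEARS"}
codelists = {
    "C66781": age_unit_terms,                       # Age Unit
    "C71620": age_unit_terms | {"kg", "mg", "mL"},  # Unit (truncated)
}

print(subset_pairs(codelists))
```

Comparing every pair is quadratic in the number of codelists, which seems acceptable for a one-off curation check.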

It is not only ideal but essential to tackle metadata up front and throughout the standard development cycle. Accomplishing this will increase clarity and remove ambiguity for downstream consumers, and hence make it easier for standards developers to quickly advance each CDISC standard. All these benefits are reasons why CDISC SHARE is such a game changer.