Blog from February, 2020

Disclaimer

The views and opinions expressed in this blog entry are my own and do not reflect the official policy or position of CDISC.

The mainstream definition of unstructured data is data that do not contain descriptive information about themselves. An email body is unstructured. A newspaper op-ed is unstructured. The definition goes further and describes structured data as having a rigid schema requirement. Semi-structured data are the middle category, where the meaning of the data is, for example, made self-explanatory by markup tags.

Below are examples to illustrate these three kinds of data.

Example of unstructured data

Taylor was born on March 12th in 2008. Carter was born 10 years and 9 months after Taylor.

Example of semi-structured data

Name        Birthdate
Taylor 🐕   3/12/2008
Carter 🐈   12/12/2018

Example of structured data

{
    "$id": "http://example.com/example.json",
    "type": "object",
    "title": "The Root Schema",
    "description": "My root schema.",
    "required": [
        "animals"
    ],
    "properties": {
        "animals": {
            "$id": "#/properties/animals",
            "type": "array",
            "title": "The Animals Schema",
            "description": "Pet animals data.",
            "default": [],
            "items": {
                "$id": "#/properties/animals/items",
                "type": "object",
                "title": "The Items Schema",
                "description": "One record per pet animal",
                "default": {},
                "required": [
                    "name",
                    "species",
                    "birthdate"
                ],
                "properties": {
                    "name": {
                        "$id": "#/properties/animals/items/properties/name",
                        "type": "string",
                        "title": "The Name Schema",
                        "description": "Pet animal's given name"
                    },
                    "species": {
                        "$id": "#/properties/animals/items/properties/species",
                        "type": "string",
                        "enum": [
                            "canine",
                            "feline"
                        ],
                        "title": "The Species Schema",
                        "description": "Pet animal's species"
                    },
                    "birthdate": {
                        "$id": "#/properties/animals/items/properties/birthdate",
                        "type": "string",
                        "title": "The Birthdate Schema",
                        "description": "Pet animal's date of birth"
                    },
                    "dateFormat": {
                        "$id": "#/properties/animals/items/properties/dateFormat",
                        "type": "string",
                        "title": "The Dateformat Schema",
                        "description": "Date format used in birth date",
                        "default": "dd/mm/yyyy"
                    }
                }
            }
        }
    }
}


Data fitting the schema, in JSON format.
{
    "animals": [
        {
            "name": "Taylor",
            "species": "canine",
            "birthdate": "3/12/2008",
            "dateFormat": "mm/dd/yyyy"
        },
        {
            "name": "Carter",
            "species": "feline",
            "birthdate": "12/12/2018"
        }
    ]
}

Only with the structured data example would you know for certain that Taylor is a dog and Carter is a cat. The schema serves an important function: it gives context and rules to the data. This schema also contains information about defaults, optionality, and valid values. Because of the schema, a machine can process the pet animal data much more efficiently. Predictability is key.
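To make the point concrete, here is a minimal sketch of how a program might use those rules. It hand-codes three of them (required fields, the species enumeration, and the dateFormat default) instead of running a JSON Schema validator. The names REQUIRED, SPECIES, DEFAULT_DATE_FORMAT, STRPTIME, and normalize are illustrative assumptions, and treating dateFormat as optional is an assumption based on Carter's record omitting it.

```python
from datetime import datetime

# Rules hand-copied from the schema above; a real pipeline would more likely
# hand the schema itself to a JSON Schema validator.
REQUIRED = ("name", "species", "birthdate")
SPECIES = {"canine", "feline"}
DEFAULT_DATE_FORMAT = "dd/mm/yyyy"

# Assumed mapping from the schema's date-format labels to strptime patterns.
STRPTIME = {"dd/mm/yyyy": "%d/%m/%Y", "mm/dd/yyyy": "%m/%d/%Y"}


def normalize(record: dict) -> dict:
    """Check one pet record and return it with birthdate parsed into a date."""
    missing = [key for key in REQUIRED if key not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if record["species"] not in SPECIES:
        raise ValueError(f"invalid species: {record['species']!r}")
    # Fall back to the schema default when dateFormat is omitted.
    date_format = record.get("dateFormat", DEFAULT_DATE_FORMAT)
    birthdate = datetime.strptime(record["birthdate"], STRPTIME[date_format]).date()
    return {"name": record["name"], "species": record["species"], "birthdate": birthdate}


animals = [
    {"name": "Taylor", "species": "canine", "birthdate": "3/12/2008", "dateFormat": "mm/dd/yyyy"},
    {"name": "Carter", "species": "feline", "birthdate": "12/12/2018"},
]
pets = [normalize(animal) for animal in animals]
```

Without the dateFormat value, "3/12/2008" could be read as either March 12 or December 3. With it, the sketch resolves Taylor's birthdate to March 12, 2008, and a species outside the enumeration is rejected before any further processing.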

I would like to continue by setting aside the classification of normative versus informative content in a CDISC publication, for two reasons. First, no publication has them clearly delineated. More importantly, the degree of structuredness does not make content any more or less normative.

CDISC publishes foundational standards in PDF. When loading these standards into the metadata repository, with the exception of Controlled Terminology, a good amount of effort is spent teasing out basic metadata from the documents. By basic metadata, I mean those that matter most to process automation, such as the variable, data set, and class information common in domain specifications.

These basic metadata do not always share an identical structure across CDISC products. First, attributes differ between products: a CDASHIG variable's attributes differ from an SDTMIG variable's. Second, different versions of the same CDISC product do not always fit in one structure. SDTM v1.2 and v1.3 contain flags indicating whether a variable is allowed in human clinical (SDTMIG) or animal toxicology (SEND) data sets; in subsequent SDTM versions, these flags were removed and subsumed by description text. Third, the same attribute does not always have the same value domain. Each product has its own set of values for the Core attribute: "HR", "O", and "R/C" for CDASHIG; "Req", "Exp", and "Perm" for SDTMIG; and "Req", "Cond", and "Perm" for ADaMIG. In other words, within this unstructured PDF container, certain semi-structured data can be asserted.
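The differing value domains for the Core attribute can be sketched as a simple lookup. This is an illustration only: the values come from the paragraph above, while CORE_VALUES, is_valid_core, and products_accepting are hypothetical names and not part of any CDISC tool.

```python
# Core attribute values per product, as listed in the paragraph above.
CORE_VALUES = {
    "CDASHIG": {"HR", "O", "R/C"},
    "SDTMIG": {"Req", "Exp", "Perm"},
    "ADaMIG": {"Req", "Cond", "Perm"},
}


def is_valid_core(product: str, core: str) -> bool:
    """Return True when `core` belongs to the product's Core value domain."""
    return core in CORE_VALUES.get(product, set())


def products_accepting(core: str) -> list:
    """List, in sorted order, the products whose Core domain includes `core`."""
    return sorted(name for name, values in CORE_VALUES.items() if core in values)
```

A value such as "Exp" is valid only in the context of SDTMIG, so a loader cannot check the Core attribute without first knowing which product a variable belongs to.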

Therapeutic Area User Guides (TAUG) are largely unstructured. Even though TAUG publications may seem to follow a common document pattern, they often contain knowledge graphs, images, and data examples, albeit with varying degrees of specificity. This is similar to digital photo files. Even though they are in a file format (JPEG, PNG, or GIF), the content is unstructured. A digital photo file needs to be processed to become an image for human cognition. Image recognition and analytic tools are needed for a machine to learn and understand its meaning.

Typical composition of a CDISC publication that contains unstructured, semi-structured, and structured contents.

Controlled Terminology, on the other hand, is published in PDF format rendered directly from raw data. The raw data are highly structured. NCI EVS maintains all codelists and terms in Protégé, an ontology tool. EVS extends Protégé with OWL, the Web Ontology Language.[1] Data serialization happens at each quarterly publication to produce the formats available for download, such as CSV, Excel spreadsheet, and PDF. This structured data approach enables many scenarios of reusability: Define-XML, codelists and terms in implementation guides, and CDISC 360's biomedical concepts, just to name a few. Controlled Terminology reaps the benefit of repeatability by virtue of its highly structured nature. CDISC publishes Controlled Terminology four times per year, with a maximum of six packages each time. With this level of frequency and quantity, a repeatable process is a must.
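The idea of serializing one structured source into several download formats can be sketched as follows. This is not how NCI EVS produces its files. The records are hypothetical placeholders rather than real codelists or terms, and TERMS, FIELDS, to_csv, and to_json are names assumed for illustration.

```python
import csv
import io
import json

# Hypothetical placeholder records; not real CDISC codelists or terms.
TERMS = [
    {"codelist": "EXAMPLE_CL", "term": "TERM_A", "definition": "First placeholder term."},
    {"codelist": "EXAMPLE_CL", "term": "TERM_B", "definition": "Second placeholder term."},
]
FIELDS = ["codelist", "term", "definition"]


def to_csv(terms: list) -> str:
    """Serialize the term records as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(terms)
    return buffer.getvalue()


def to_json(terms: list) -> str:
    """Serialize the same records as a JSON array."""
    return json.dumps(terms, indent=2)


csv_text = to_csv(TERMS)
json_text = to_json(TERMS)
```

Because both outputs derive from the same records, rerunning the serializers for each release yields consistent files with no manual transcription, which is the repeatability the paragraph above describes.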

Technology plays a significant role when working with data of varying structuredness. An overall data architecture is a strategy that must be very well articulated, because it impacts storage, analytics, accessibility, and discoverability. Computing evolves at a rapid pace. Cloud computing in particular has created many choices in the marketplace for different kinds of data: RDBMS, graph database, document store, and multi-model database. Long gone is the era when all data would go into one single enterprise data store. Database-as-a-service has become a hot commodity, offering unprecedented flexibility, scalability, and modularity. Implementers today can choose to put their data in the right container for the right purpose.

In conclusion, today's CDISC products mostly fall into the semi-structured and unstructured categories. A data strategy about accessibility and usability, worked out with stakeholders, should heavily influence the degree of structuredness of the data being managed, especially for novel content. Decision makers need to factor in the strengths and weaknesses of each technology offering. Trade-offs may be inevitable. Sacrifice end-to-end interoperability for a simplified data store? Prefer repeatability over schema flexibility? Immediate access or deferred availability? None of these questions has a single answer that holds across all the varying degrees of data structuredness.

[1] NCI Thesaurus Downloads: https://evs.nci.nih.gov/evs-download/thesaurus-downloads