...

The mainstream definition of unstructured data is data that contain no descriptive information about themselves. An email body is unstructured, as is a newspaper op-ed. The definition goes on to describe structured data as data that must conform to a rigid schema. Semi-structured data fall in between: their meaning is conveyed by the data themselves, for example through markup tags.

Below are examples to illustrate these three kinds of data.

Example of unstructured data

Taylor was born on March 12th in 2008. Carter was born 10 years and 9 months after Taylor.
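
To see why this sentence counts as unstructured, consider what a program needs in order to recover Carter's birthdate from it: someone must first interpret the prose, and only then can the date arithmetic be done. The following is a minimal Python sketch, assuming the two facts (Taylor's birthdate and the offset of 10 years and 9 months) have already been extracted by hand. The helper name `add_months` is hypothetical.

```python
import calendar
from datetime import date

def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`.

    The day of month is clamped to the last day of the target month.
    """
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))

# Facts read manually from the sentence; nothing in the text labels them.
taylor = date(2008, 3, 12)
carter = add_months(taylor, 10 * 12 + 9)

print(carter.isoformat())  # 2018-12-12
```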

Example of semi-structured data

Name         Birthdate
Taylor 🐕    3/12/2008
Carter 🐈    12/12/2018
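
The column headers say what each value is, but not how to read it. As a small illustration (a sketch, not part of the original example), the Python below parses Taylor's birthdate under two plausible conventions and gets two different dates. This is the ambiguity that the `dateFormat` property in the structured example is meant to remove.

```python
from datetime import datetime

raw = "3/12/2008"

# Same text, two interpretations, depending on an unstated convention.
as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2008-03-12
print(as_day_first.isoformat())    # 2008-12-03
```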

Example of structured data


Data schema:
{
    "$id": "http://example.com/example.json",
    "type": "object",
    "title": "The Root Schema",
    "description": "My root schema.",
    "required": [
        "animals"
    ],
    "properties": {
        "animals": {
            "$id": "#/properties/animals",
            "type": "array",
            "title": "The Animals Schema",
            "description": "Pet animals data.",
            "default": [],
            "items": {
                "$id": "#/properties/animals/items",
                "type": "object",
                "title": "The Items Schema",
                "description": "One record per pet animal",
                "default": {},
                "required": [
                    "name",
                    "species",
                    "birthdate",
                    "dateFormat"
                ],
                "properties": {
                    "name": {
                        "$id": "#/properties/animals/items/properties/name",
                        "type": "string",
                        "title": "The Name Schema",
                        "description": "Pet animal's given name"
                    },
                    "species": {
                        "$id": "#/properties/animals/items/properties/species",
                        "type": "string",
                        "enum": [
                            "canine",
                            "feline"
                        ],
                        "title": "The Species Schema",
                        "description": "Pet animal's species"
                    },
                    "birthdate": {
                        "$id": "#/properties/animals/items/properties/birthdate",
                        "type": "string",
                        "title": "The Birthdate Schema",
                        "description": "Pet animal's date of birth"
                    },
                    "dateFormat": {
                        "$id": "#/properties/animals/items/properties/dateFormat",
                        "type": "string",
                        "title": "The Dateformat Schema",
                        "description": "Date format used in birth date",
                        "default": "dd/mm/yyyy"
                    }
                }
            }
        }
    }
}


Data fitting the schema, in JSON format:
{
    "animals": [
        {
            "name": "Taylor",
            "species": "canine",
            "birthdate": "3/12/2008",
            "dateFormat": "mm/dd/yyyy"
        },
        {
            "name": "Carter",
            "species": "feline",
            "birthdate": "12/12/2018",
            "dateFormat": "dd/mm/yyyy"
        }
    ]
}
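
A rigid schema makes it possible to check records mechanically. The sketch below uses only the Python standard library and hand-codes three rules from the schema above for animal records (`required`, the `species` enum, and the string type). It is not a JSON Schema implementation, and a real project would more likely rely on a validator library. The function name `check_animal` and the second, deliberately incomplete record are hypothetical.

```python
import json

# Rules copied by hand from the schema's "items" definition.
REQUIRED = ("name", "species", "birthdate", "dateFormat")
SPECIES = ("canine", "feline")

def check_animal(record: dict) -> list[str]:
    """Return the problems found in one animal record (empty list if none)."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED
        if key not in record
    ]
    species = record.get("species")
    if species is not None and species not in SPECIES:
        problems.append(f"species not in enum: {species}")
    for key, value in record.items():
        if key in REQUIRED and not isinstance(value, str):
            problems.append(f"{key} is not a string")
    return problems

document = json.loads("""
{
    "animals": [
        {"name": "Taylor", "species": "canine",
         "birthdate": "3/12/2008", "dateFormat": "mm/dd/yyyy"},
        {"name": "Example", "species": "avian"}
    ]
}
""")

for animal in document["animals"]:
    print(animal["name"], check_animal(animal))
```

The first record passes with no problems. The second is reported as missing `birthdate` and `dateFormat` and as having a species outside the enum.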

...

Controlled Terminology, on the other hand, is published in PDF format rendered directly from raw data. The raw data is highly structured. NCI EVS maintains all codelists and terms in Protégé, an ontology tool. EVS extends Protégé with OWL (Web Ontology Language), a semantic web layer.[1] Data serialization happens at each quarterly publication to produce the formats available for download, such as CSV, Excel spreadsheet, and PDF. This structured data approach enables many reuse scenarios: Define-XML, codelists and terms in implementation guides, and CDISC 360's biomedical concepts, to name a few. Controlled Terminology reaps the benefit of repeatability by virtue of its highly structured nature. CDISC publishes Controlled Terminology four times per year, with a maximum of six packages each time. At this frequency and volume, a repeatable process is a must.
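
The idea of serializing one structured source into several download formats can be sketched as follows. This is an illustrative Python example using only the standard library, not a description of the actual NCI EVS or CDISC pipeline. The field names and term values are placeholders invented for the sketch, not real Controlled Terminology content.

```python
import csv
import io
import json

# Hypothetical terms for illustration only.
terms = [
    {"codelist": "EXAMPLE_CL", "submission_value": "TERM_A",
     "definition": "First placeholder term."},
    {"codelist": "EXAMPLE_CL", "submission_value": "TERM_B",
     "definition": "Second placeholder term."},
]

def to_csv(rows: list[dict]) -> str:
    """Serialize the rows to CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def to_json(rows: list[dict]) -> str:
    """Serialize the same rows to indented JSON text."""
    return json.dumps(rows, indent=2)

# One source, two renderings; rerunning yields identical output each time.
csv_text = to_csv(terms)
json_text = to_json(terms)
print(csv_text)
print(json_text)
```

Because every output is derived from the same records, adding another format means adding another function, not re-entering the data.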

Technology plays a significant role when working with data of varying degrees of structure. An overall data architecture is a strategy that must be very well articulated, because it impacts storage, analytics, accessibility, and discoverability. Computing evolves at a rapid pace. Cloud computing in particular has created many choices in the marketplace for different kinds of data: RDBMS, graph database, document store, multi-model database. Long gone is the era when all data would go into one single enterprise data store. Database-as-a-service has become a hot commodity, offering unprecedented flexibility, scalability, and modularity. Implementers today can choose to put their data in the right container for the right purpose.

...