I recently admitted to my colleagues that I am a tinkerer by nature: I indulge in intricacies and want a deep understanding of the things I work on. On this fine Saturday, I had the urge to find out the Chinese word for metadata. Through web searches, I realized there isn't one common name for it. Among those that appear in the search results, the more popular ones are 詮釋資料, which means annotative or explanatory information; 後設資料, mostly used in Taiwan, which literally means information behind the setting; and 元數據. 元 (yuán) means beginning, first, and origin in Chinese, while 數據 (shùjù) means data.

 

While metadata can add annotations and background meaning to data, the notions of beginning and first truly accentuate the purpose of metadata; therefore, the last translation (元數據) is the one I prefer the most. My experience backs it up: as a Metadata Curator for the CDISC SHARE project, I came to realize that defining metadata for existing standards is much more difficult than I had thought. This retrospective exercise often entails tedious manual interventions such as transcribing information from PDF publications into machine-readable formats[a]; discerning proper relationships[b]; and handling omitted information or mistakes in the original publications, the details of which may deserve another blog post. All this manual work easily eclipses the effort spent creating the mechanical counterparts, which are themselves no easy feat.

[a] We first had to turn each PDF, or its Microsoft Word source, into a spreadsheet format. Then other challenges emerged, such as character encoding (e.g., non-printable characters and the smart curly quotes that Microsoft Office substitutes by default), combining multiple sources (e.g., CDASH v1.1 and CDASH User Guide v1.0), and minute details such as handling the NullFlavor details described in SDTM's Trial Summary (TS) domain: is it intended to be registered as CDISC Controlled Terminology in NCI EVS? Further, NullFlavor's governing authority can be HL7 or ISO 21090.
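To make the character-encoding chore concrete, here is a minimal sketch of the kind of cleanup involved. It is not the tooling we actually used on SHARE; the names `SMART_PUNCTUATION` and `clean_cell` are made up for illustration, and the list of substitutions is an assumption about what typically sneaks in from Microsoft Office.

```python
import unicodedata

# Assumed set of "smart" characters worth normalizing; extend as needed.
SMART_PUNCTUATION = {
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u2013": "-",   # en dash
    "\u2014": "-",   # em dash
    "\u00a0": " ",   # non-breaking space
}
_TRANSLATION = str.maketrans(SMART_PUNCTUATION)


def clean_cell(text: str) -> str:
    """Normalize smart punctuation and drop non-printable characters."""
    text = text.translate(_TRANSLATION)
    # Unicode categories starting with "C" cover control, format,
    # surrogate, private-use, and unassigned code points.
    # Newlines and tabs are kept on purpose.
    return "".join(
        ch
        for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


# A cell with curly quotes, a zero-width space, and a bell character.
print(clean_cell("\u201cAge\u201d\u200b unit\x07"))  # prints: "Age" unit
```

Running every spreadsheet cell through a function like this before loading it keeps later comparisons between sources from failing on invisible differences.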

[b] A good example is identifying codelist supersets and subsets in the CDISC Controlled Terminology. For instance, the Age Unit (C66781) codelist is a subset of the Unit (C71620) codelist.
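Once the codelists are machine-readable, the subset check itself is simple to sketch. The function name `find_subsets` is hypothetical, and the term values below are a small illustrative sample, not the full published content of either codelist.

```python
from itertools import permutations


def find_subsets(codelists: dict[str, set[str]]) -> list[tuple[str, str]]:
    """Return (subset, superset) pairs of codelist codes.

    A pair is reported when every term of the first codelist also
    appears in the second and the second has at least one more term.
    """
    return [
        (a, b)
        for a, b in permutations(codelists, 2)
        if codelists[a] < codelists[b]  # proper-subset comparison on sets
    ]


# Illustrative sample terms only (assumed, abbreviated).
codelists = {
    "C66781": {"DAYS", "HOURS", "MONTHS", "WEEKS", "YEARS"},              # Age Unit
    "C71620": {"DAYS", "HOURS", "MONTHS", "WEEKS", "YEARS", "kg", "mg"},  # Unit
}

print(find_subsets(codelists))  # prints: [('C66781', 'C71620')]
```

The hard part in practice was not this comparison but getting clean, trustworthy term lists to feed into it.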

It is not only ideal but essential to tackle metadata up front and during the standard development cycle. Accomplishing this will increase clarity and remove ambiguity for downstream consumers, and hence make it easier for standards developers to advance each CDISC standard quickly. All these benefits are reasons why CDISC SHARE is such a game changer.

 
