How To Determine Where Data Belong

Implementing the standards in this guide starts with selecting the standards used to collect, represent, and/or exchange data. Once the standards are selected, it is possible to determine how the data are collected, represented, or exchanged using each standard.

Determining the Standard

At the highest level, sets of standards in this guide are aligned with both the tobacco study use cases and the data-related processes they support. In the TIG:

Standards for Collection will guide development and use of case report forms (CRFs) by implementing the CDISC CDASH Model.
Standards for Tabulation will guide organization of data collected, assigned, or derived for a study by implementing SDTM.
Standards for Analysis will specify the principles to follow in the creation of analysis datasets and associated metadata by implementing ADaM.
Standards for Data Exchange will support sharing of structured data between parties and across different information systems by implementing specified standard specifications and resources.

Use cases, activities, and associated sets of standards in scope for this guide are shown in the table below.

Use case	Standards for Collection	Standards for Tabulation	Standards for Analysis	Standards for Data Exchange
Product Description		SDTM	ADaM	Define-XML
Nonclinical		SDTM		Define-XML
Product Impact on Individual Health	CDASH Model	SDTM	ADaM	ODM-XML, Define-XML
Product Impact on Population Health			ADaM	Define-XML
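
As an illustration only, the table above can be transcribed into a small lookup structure. The following Python sketch is not part of the guide; the dictionary name, the activity keys, and the helper function are hypothetical names chosen for the example, and an empty list stands for an empty table cell.

```python
# Illustrative transcription of the use case table above.
# All names here (STANDARDS_BY_USE_CASE, standards_for, the activity keys)
# are hypothetical; an empty list represents an empty cell in the table.
STANDARDS_BY_USE_CASE = {
    "Product Description": {
        "collection": [],
        "tabulation": ["SDTM"],
        "analysis": ["ADaM"],
        "data_exchange": ["Define-XML"],
    },
    "Nonclinical": {
        "collection": [],
        "tabulation": ["SDTM"],
        "analysis": [],
        "data_exchange": ["Define-XML"],
    },
    "Product Impact on Individual Health": {
        "collection": ["CDASH Model"],
        "tabulation": ["SDTM"],
        "analysis": ["ADaM"],
        "data_exchange": ["ODM-XML", "Define-XML"],
    },
    "Product Impact on Population Health": {
        "collection": [],
        "tabulation": [],
        "analysis": ["ADaM"],
        "data_exchange": ["Define-XML"],
    },
}


def standards_for(use_case, activity):
    """Return the standards listed for a use case and activity (may be empty)."""
    return STANDARDS_BY_USE_CASE[use_case][activity]


print(standards_for("Product Impact on Individual Health", "data_exchange"))
```

A lookup of this kind only mirrors the table; the table itself remains the reference for which standards are in scope.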

Determining Where Data Belong

Standards for collection, tabulation, and analysis are

All models implemented as part of this guide collect and represent data by common topics with:

CDASH and SDTM grouping logically related data points in domains; and
ADaM dataset design customizable for analysis requirements.

The terms “domain” and “dataset” are commonly used in CDISC’s nomenclature and found frequently in the Study Data Tabulation Model (SDTM). For example, the SDTM v1.8 includes 134 instances of "domain" and says "A collection of observations on a particular topic is considered a domain." The model also includes 78 instances of "dataset," and certain structures in the model are called "datasets" rather than "domains." Is there a difference between a domain and a dataset?

The CDISC Glossary defines these terms as follows:

Domain: A collection of logically related observations with a common, specific topic that are normally collected for all subjects in a clinical investigation. NOTE: The logic of the relationship may pertain to the scientific subject matter of the data or to its role in the trial. Example domains include laboratory test results (LB), adverse events (AE), concomitant medications (CM). [After SDTM Implementation Guide version 3.2, CDISC.org] See also general observation class.
Dataset: A collection of structured data in a single file. [CDISC, ODM, and SDS] Compare to analysis dataset, tabulation dataset.

In plainer terms, a domain is a grouping of related observations, while a dataset is the data structure associated with that grouping of observations. Both domains and datasets use the same nomenclature, which is why they are often confused.

The distinction between domain and dataset is most clearly seen in cases where a general observation class domain is split into multiple datasets in a submission. Common examples are splitting the Laboratory Test Results (LB) domain due to size, splitting the Questionnaires (QS) domain by questionnaire, and splitting the Findings About Events or Interventions (FA) domain by parent domain.

However, since in most cases there is a one-to-one relationship between a conceptual domain and a dataset based on that conceptual domain, the words are used interchangeably in the standards and, therefore, by most users. The structures called “relationship datasets” were given that name because they are mechanisms for connecting information represented in different datasets rather than observations about study subjects. Note that none of the relationship datasets includes the variable DOMAIN. However, in a submission, these datasets need dataset names, and character strings used in those names are included in the CDISC Codelist called "SDTM Domain Abbreviations."

In conclusion, there is a clear distinction between the meanings of "domain" and "dataset," but because the naming conventions are the same for both terms, in many cases they can be considered interchangeable.
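
The split-domain case described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not text from the standards: the dataset names QSA and QSB are hypothetical, the records are invented, and the sketch assumes that each record keeps the DOMAIN value of its conceptual domain when a domain is split.

```python
from collections import defaultdict

# Hypothetical submission: the QS domain is split into two datasets,
# while AE has the usual one-to-one domain/dataset relationship.
# Dataset names "QSA" and "QSB" are invented for this example.
datasets = {
    "AE": [{"DOMAIN": "AE"}, {"DOMAIN": "AE"}],
    "QSA": [{"DOMAIN": "QS"}],
    "QSB": [{"DOMAIN": "QS"}],
}


def datasets_by_domain(datasets):
    """Group dataset names by the DOMAIN value found in their records."""
    grouped = defaultdict(set)
    for name, rows in datasets.items():
        for row in rows:
            grouped[row["DOMAIN"]].add(name)
    return {domain: sorted(names) for domain, names in grouped.items()}


mapping = datasets_by_domain(datasets)
print(mapping)
```

In this sketch one domain (QS) maps to two datasets, which is the situation in which the two terms stop being interchangeable.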

Domains

SDTM

Observations about study subjects are normally collected for all subjects in a series of domains. A domain is defined as a collection of logically related observations with a common topic. The logic of the relationship may pertain to the scientific subject matter of the data or to its role in the trial. Each domain is represented by a single dataset.

Each domain dataset is distinguished by a unique, 2-character code that should be used consistently throughout the submission. This code, which is stored in the SDTM variable named DOMAIN, is used in 4 ways: as the dataset name, as the value of the DOMAIN variable in that dataset, as a prefix for most variable names in that dataset, and as a value in the RDOMAIN variable in relationship tables (see Section 8, Representing Relationships and Data).
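
The 4 uses of the domain code listed above can be shown with a minimal Python sketch. The function name and the returned keys are hypothetical and chosen only for illustration; LB and the variable stem TESTCD are used as a familiar example. Because the code prefixes most, not all, variable names, the prefixing step applies only to variables that follow that convention.

```python
def domain_code_uses(code, variable_stem):
    """Show the 4 places a 2-character domain code appears (illustrative only)."""
    if len(code) != 2:
        raise ValueError("a domain code is expected to be 2 characters")
    return {
        "dataset_name": code,                   # 1. name of the dataset
        "DOMAIN": code,                         # 2. value of DOMAIN in that dataset
        "variable_name": code + variable_stem,  # 3. prefix for most variable names
        "RDOMAIN": code,                        # 4. value of RDOMAIN in relationship tables
    }


uses = domain_code_uses("LB", "TESTCD")
print(uses["variable_name"])
```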

SEND

Aside from a limited number of special-purpose domains, all subject-level SDTM datasets are based on 1 of the 3 general observation classes. When faced with a set of data that were collected and that "go together" in some sense, the first step is to identify SDTM observations within the data and the general observation class of each observation. Once these observations are identified at a high level, 2 other tasks remain:

Determining whether the relationships between these observations need to be represented using GRPID within a dataset, as described in Section 8.1, (SENDIG v3.1.1) Relating Groups of Records Within a Domain Using the --GRPID Variable, or using RELREC between datasets, as described in Section 8.3, (SENDIG v3.1.1) Supplemental Qualifiers - SUPP-- Datasets
Placing all the data items in 1 of the identified general observation class records, or in a SUPP-- dataset, as described in Section 8.5, (SENDIG v3.1.1) Relating Findings To Multiple Subjects - Subject Pooling

In practice, considering the representation of relationships and placing individual data items may lead to reconsidering the identification of observations, so the whole process may require several iterations.
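
The 2 remaining tasks described above can be sketched as simple decision rules. The Python below is a simplified illustration, not a procedure defined by the standard: the function names are hypothetical, the set of standard variables is a small invented subset, and LBEXTRA is a made-up non-standard item. Real decisions usually need the iteration and judgment noted above.

```python
def relationship_mechanism(dataset_a, dataset_b):
    """Pick the mechanism named in the text for relating two observations.

    Simplifying assumption: records in the same dataset are related with
    --GRPID, and records in different datasets are related with RELREC.
    """
    return "--GRPID" if dataset_a == dataset_b else "RELREC"


def place_item(item, domain, standard_variables):
    """Place a data item in the domain record or in its SUPP-- dataset.

    standard_variables is a hypothetical set of variable names available
    in the general observation class record for this domain.
    """
    return domain if item in standard_variables else "SUPP" + domain


# Invented subset of variables, for illustration only.
lb_variables = {"LBTESTCD", "LBORRES"}

print(relationship_mechanism("LB", "LB"))
print(relationship_mechanism("LB", "CL"))
print(place_item("LBORRES", "LB", lb_variables))
print(place_item("LBEXTRA", "LB", lb_variables))
```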

