Requirements for data submission are defined and managed by the regulatory authorities to whom data are submitted. This section describes general requirements for datasets that may be part of a submission. However, additional conventions may be defined by regulatory bodies or negotiated with regulatory reviewers. In such cases, the additional requirements must be followed.

Tabulation Datasets

Observations about tobacco products and study subjects generated over the course of a study to support a submission are represented in a series of datasets aligned with logical groupings of data into domains. Domains described in this guide are generally implemented as a single dataset file in which to represent the data in scope for the domain. All datasets are structured as flat files with rows representing observations and columns representing variables. In some cases, a dataset for a domain may be split into physically separate dataset files to support submission when needed and as allowable by the regulatory authority.

Generally, a domain is represented by a single dataset.

All datasets are structured as flat files with rows representing observations and columns representing variables. Data represented in tabulation datasets will include:

Data as originally collected or received.
Data from the protocol.
Assigned data.
Derived data.

Dataset names will reflect the following conventions:

Names will be a unique 2- to 4-character code.
This code, which is stored in the SDTM variable named DOMAIN, is used in 4 ways: as the dataset name, as the value of the DOMAIN variable in that dataset, as a prefix for most variable names in that dataset, and as a value in the RDOMAIN variable in relationship tables (see Section 8, Representing Relationships and Data).
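
The four uses of the domain code can be illustrated with a small sketch. The helper names (`is_domain_code`, `expand_variable`) and the sample values are hypothetical and introduced only for illustration; they are not defined by the SDTM or the TIG. The sketch assumes an unsplit domain, where the dataset name equals the DOMAIN value.

```python
import re


def is_domain_code(code):
    """Return True if code is 2 to 4 uppercase letters (assumed pattern)."""
    return re.fullmatch(r"[A-Z]{2,4}", code) is not None


def expand_variable(template, domain):
    """Replace a leading '--' placeholder with the domain code."""
    if template.startswith("--"):
        return domain + template[2:]
    return template


domain = "LB"

# Use 1: the dataset name (for an unsplit domain).
dataset_name = domain

# Use 2: the value of DOMAIN; use 3: the prefix of most variable names.
record = {
    "DOMAIN": domain,
    expand_variable("--SEQ", domain): 1,
    expand_variable("--TESTCD", domain): "HGB",  # made-up test code
}

# Use 4: the value of RDOMAIN in a relationship table.
relationship_row = {"RDOMAIN": domain, "IDVAR": expand_variable("--SEQ", domain)}
```

Here `record` ends up with the keys DOMAIN, LBSEQ, and LBTESTCD, and `relationship_row` points back to the same domain through RDOMAIN.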

The following guidance will be adhered to for tabulation datasets:

Num

Guidance For

Implementation

1

Dataset content

Data represented in datasets will include the following per regulatory requirements, scientific needs, and standards in this guide:

Data as originally collected or received (using controlled terminology where applicable) to support the submission
Data from external references relevant to the submission (e.g., study protocol)
Data assigned per conventions in the TIG
Data derived per regulatory and TIG conventions

2

Dataset naming

Domain datasets based on the SDTM general observation classes will be named using the 2-character code for the domain or using the applicable 4-character code when a dataset is split (e.g., LB, LBHM).
Supplemental Qualifier datasets will be named using "SUPP" concatenated with the 2-character domain code for the parent domain (e.g., SUPPDM, SUPPFA) or the 4-character code for the parent dataset when a dataset is split (e.g., SUPPFACM).
All other datasets will be named using the code for the domain or dataset (e.g., DM, RELREC).

3

Variable order

Dataset variables will be ordered per guidance in the SDTM.
Variable order in TIG domain specifications aligns with variable order in the SDTM.

4

Variable names

Variables will be named per guidance in the SDTM. The SDTM guidance uses fragment names in the CDISC Non-Standard Variables Registry.
Variable names in TIG domain specifications align with naming conventions in the SDTM.
Variable names will be 8 characters or less and uppercase.

5

Variable labels

Descriptive labels per this guide, up to 40 characters, will be provided as data variable labels for all variables, including Supplemental Qualifier variables.

6

Variable length

When variable length is referenced in the TIG, this refers to the length in bytes of ASCII character strings.

The maximum length of character variables is 200 characters, and the full 200 characters should not be used unless necessary.
Applicants will consider the nature of the data and apply reasonable, appropriate lengths to variables. For example:
- --TESTCD and IDVAR values will never be longer than 8 characters, so the lengths of those variables can be set to 8.
- The length for variables that use controlled terminology can be set to the length of the longest term.

7

Variable value text case

Values from controlled terminology or response values for QRS instruments specified by the instrument documentation will be in the case specified by those sources.
Otherwise, text data will be represented in upper case (e.g., NEGATIVE).

8

Missing variable values

Missing values for individual data items will be represented by nulls.

9

Splitting datasets

A domain dataset may be split into physically separate datasets to support submission when needed and as allowable by the regulatory authority. The following conventions must be adhered to when splitting domains into separate datasets:

A domain based on a general observation class may be split according to values in variable --CAT. When a domain is split on --CAT, --CAT must not be null.
The Findings About Events or Interventions (FA) domain may be split according to the domain in which the interventions or events in --OBJ are represented (or would be represented).

To ensure split datasets can be appended back into 1 domain dataset:

The value of DOMAIN must be consistent across the separate datasets as it would have been if they had not been split (e.g., LB, FA).
All variables that require a domain prefix (e.g., --TESTCD, --LOC) must use the value of DOMAIN as the prefix value (e.g., LB, FA).
--SEQ must be unique within USUBJID for all records across all the split datasets. If there are 1000 records for a USUBJID across the separate datasets, all 1000 records need unique values for --SEQ.
When relationship datasets (e.g., SUPPxx, FAxx, CO, RELREC) relate back to split parent domains, the value of IDVAR will be from a variable with unique values for each observation.
Permissible variables included in one split dataset need not be included in all split datasets.
For domains with 2-letter domain codes, split dataset names can be up to 4 characters in length. For example, if splitting by --CAT, dataset names would be the domain name plus up to 2 additional characters to indicate the value of --CAT (e.g., LBHM for LB if the value of --CAT is HEMATOLOGY). If splitting Findings About by parent domain, then the dataset name would be the domain code, "FA", plus the 2-character domain code for the parent domain (e.g., "FACM"). The 4-character dataset-name limitation allows the use of a Supplemental Qualifier dataset associated with the split dataset.
Supplemental Qualifier datasets for split domains will also be split. The nomenclature will include the additional 1 to 2 characters used to identify the split dataset (e.g., SUPPLBHM, SUPPFACM). The value of RDOMAIN in the SUPP-- datasets would be the 2-character domain code (e.g., LB, FA).
In RELREC, if a dataset-level relationship is defined for a split Findings About domain, then RDOMAIN will contain the 4-character dataset name, rather than the domain name "FA" (e.g., the value of RDOMAIN will be FACM).
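
The splitting conventions in row 9 lend themselves to an automated check before submission. The following is a minimal sketch, assuming each split dataset has already been read into a list of row dictionaries. The function names (`supp_name`, `check_split`), the dataset name LBCH, and the sample rows are hypothetical and are not part of the TIG.

```python
from collections import Counter


def supp_name(dataset_name):
    """Name of the Supplemental Qualifier dataset for a parent dataset."""
    return "SUPP" + dataset_name


def check_split(domain, split_datasets, split_on_cat=True):
    """Return a list of problems found in the split datasets of one domain.

    split_datasets maps each dataset name (e.g., "LBHM") to a list of rows,
    where each row is a dict of variable name -> value.
    """
    problems = []
    seq_var = domain + "SEQ"  # --SEQ with the DOMAIN value as prefix
    cat_var = domain + "CAT"  # --CAT with the DOMAIN value as prefix
    seen = Counter()
    for name, rows in split_datasets.items():
        # Names for a 2-letter domain code may be up to 4 characters long.
        if not name.startswith(domain) or len(name) > 4:
            problems.append(f"{name}: invalid split dataset name for {domain}")
        for row in rows:
            if row.get("DOMAIN") != domain:
                problems.append(f"{name}: DOMAIN must be {domain}")
            if split_on_cat and not row.get(cat_var):
                problems.append(f"{name}: {cat_var} must not be null")
            seen[(row["USUBJID"], row[seq_var])] += 1
    # --SEQ must be unique within USUBJID across all split datasets.
    for (usubjid, seq), count in seen.items():
        if count > 1:
            problems.append(f"{seq_var}={seq} is not unique for USUBJID {usubjid}")
    return problems


# Made-up example: LB split on LBCAT into two datasets.
split = {
    "LBHM": [
        {"DOMAIN": "LB", "USUBJID": "S1", "LBSEQ": 1, "LBCAT": "HEMATOLOGY"},
        {"DOMAIN": "LB", "USUBJID": "S1", "LBSEQ": 2, "LBCAT": "HEMATOLOGY"},
    ],
    "LBCH": [
        {"DOMAIN": "LB", "USUBJID": "S1", "LBSEQ": 2, "LBCAT": "CHEMISTRY"},
    ],
}

problems_before = check_split("LB", split)

# Renumber the clashing record, then append the datasets back together.
split["LBCH"][0]["LBSEQ"] = 3
problems_after = check_split("LB", split)
combined = [row for rows in split.values() for row in rows]
```

In this example `problems_before` reports the repeated LBSEQ value for subject S1, and `problems_after` is empty once the record is renumbered. `supp_name("LBHM")` gives SUPPLBHM, matching the naming convention for split Supplemental Qualifier datasets.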

Analysis Datasets

Observations about tobacco products and study subjects generated to support analysis in a submission are represented in a series of datasets based on the CLASS values described in the TIG. Datasets described in this guide are generally created to support a certain type of analysis, but sometimes analysis datasets are created to support the creation of a subsequent dataset that will be used for analysis. All datasets are structured as flat files with rows representing observations and columns representing variables.

The following guidance will be adhered to for analysis datasets:

Num	Guidance For	Implementation
1	Dataset content	Data represented in datasets will include the following per regulatory requirements, scientific needs, and standards in this guide: data as originally collected or received (using controlled terminology where applicable) to support the submission; data from external references relevant to the submission (e.g., reference data); data assigned per conventions in the TIG; and data derived per regulatory and TIG conventions.
2	Dataset naming	Analysis dataset names are generally not predefined. The only predefined name is ADSL, which is suggested for studies where a one-record-per-subject dataset is created to capture subject-level demographics, product usage, and/or trial experience information. All ADaM datasets other than ADSL should be named AD + applicant-defined name (ADXXXXXX). The exception to this general naming convention is the RF prefix for reference data, which has been introduced in the TIG. There is no rule that noncompliant datasets must start with AX or that they cannot start with AD. ADaM datasets should be named logically, if possible, and consistent naming conventions should be used across studies within a submission.
3	Variable order	There is no variable ordering defined for the ADaM standards, although having variables ordered together within a variable group helps review and dataset understanding. Variable order in the ADaM dataset must match the order in the define.xml file.
4	Variable names	Variables will be named per ADaM guidance, which uses fragment names in the CDISC NSV Registry. Variable names in TIG ADaM specifications align with naming conventions in ADaM. Variable names will be 8 characters or less and uppercase.
5	Variable labels	Descriptive labels per this guide, up to 40 characters, will be provided as data variable labels for all variables. All variables must use labels defined in Section 2.9.5, Predefined Standard Variables for ADSL, Predefined Standard Variables for BDS, and Section 2.9.7, Predefined Standard Variables for OCCDS (apart from the 2 exceptions described therein).
6	Variable length	When variable length is referenced in the TIG, this refers to the length in bytes of ASCII character strings. The maximum length of character variables is 200 characters, and the full 200 characters should not be used unless necessary. Applicants will consider the nature of the data and apply reasonable, appropriate lengths to variables. For example: PARAMCD values will never be longer than 8 characters, so the length of that variable can be set to 8. The length for variables that use controlled terminology can be set to the length of the longest term.
7	Variable value text case	The text case of variable values generally depends on how the variable is used and how it is presented in outputs; there is no requirement to follow a particular case.
8	Missing variable values	Missing values for individual data items will be represented by nulls if necessary for analysis. Otherwise, it is up to the dataset creator whether to include missing values in an analysis dataset.
9	Splitting datasets	An analysis dataset may be split into physically separate datasets to support submission when needed. ADaM currently has no conventions as to the proper way to split analysis datasets, although like types of data should have similar dataset naming.
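
The limits on variable names, labels, and lengths in rows 4 to 6 can be checked mechanically. The sketch below is one possible approach; the function names (`check_variable`, `longest_term_length`) and the sample term list are hypothetical and not taken from the TIG or ADaM.

```python
def check_variable(name, label, length=None):
    """Return a list of problems for one variable's metadata.

    name   -- variable name (8 characters or less, uppercase)
    label  -- descriptive label (required, up to 40 characters)
    length -- declared length of a character variable, if any (max 200)
    """
    problems = []
    if len(name) > 8 or name != name.upper():
        problems.append(f"{name}: name must be uppercase, 8 characters or less")
    if not label or len(label) > 40:
        problems.append(f"{name}: label is required, up to 40 characters")
    if length is not None and length > 200:
        problems.append(f"{name}: character length must not exceed 200")
    return problems


def longest_term_length(terms):
    """Length in bytes of the longest term, assuming ASCII values."""
    return max(len(term.encode("ascii")) for term in terms)


# PARAMCD values are never longer than 8 characters, so length 8 is enough.
paramcd_problems = check_variable("PARAMCD", "Parameter Code", 8)

# A variable using controlled terminology can be sized to its longest term.
# The term list here is made up for illustration.
result_length = longest_term_length(["NEGATIVE", "POSITIVE", "UNKNOWN"])
```

Sizing a variable from `longest_term_length` follows the guidance that the full 200 characters should not be used unless necessary.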


The terms domain and dataset

The terms “domain” and “dataset” are commonly used in CDISC’s nomenclature and found frequently in the Study Data Tabulation Model (SDTM). For example, the SDTM v1.8 includes 134 instances of "domain" and says "A collection of observations on a particular topic is considered a domain." The Model includes 78 instances of dataset and certain structures in the model are called "datasets" rather than "domains." Is there a difference between a domain and a dataset?

The CDISC Glossary defines these terms as follows:

Domain: A collection of logically related observations with a common, specific topic that are normally collected for all subjects in a clinical investigation. NOTE: The logic of the relationship may pertain to the scientific subject matter of the data or to its role in the trial. Example domains include laboratory test results (LB), adverse events (AE), concomitant medications (CM). [After SDTM Implementation Guide version 3.2, CDISC.org] See also general observation class.
Dataset: A collection of structured data in a single file. [CDISC, ODM, and SDS] Compare to analysis dataset, tabulation dataset.

In plainer terms, a domain is a grouping of observations that are related while a dataset is the data structure associated with that grouping of observations. Both domains and datasets use the same nomenclature, which is why they are often confused.

The distinction between domain and dataset is most clearly seen in cases where a general observation class domain is split into multiple datasets in a submission. Common examples are splitting the Laboratory Test Results (LB) domain due to size, splitting the Questionnaires (QS) domain by questionnaire, and splitting the Findings About Events or Interventions (FA) domain by parent domain.

However, since in most cases there is a one-to-one relationship between a conceptual domain and a dataset based on that conceptual domain, the words are used interchangeably in the standards and, therefore, by most users. The structures called “relationship datasets” were given that name because they are mechanisms for connecting information represented in different datasets rather than observations about study subjects. Note that none of the relationship datasets includes the variable DOMAIN. However, in a submission, these datasets need dataset names, and character strings used in those names are included in the CDISC Codelist called "SDTM Domain Abbreviations."

In conclusion, there is a clear distinction between the meaning of "domain" and "dataset" but given that the naming conventions are the same across both terms, in many cases they can be considered interchangeable.
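
The distinction can be made concrete with a short sketch that groups dataset names by the domain they belong to. The mapping below is a made-up example (LBCH in particular is a hypothetical split dataset name), and `datasets_by_domain` is an illustrative helper rather than anything defined by CDISC.

```python
from collections import defaultdict

# Dataset name -> value of DOMAIN in that dataset (made-up submission).
dataset_domains = {
    "DM": "DM",
    "LBHM": "LB",
    "LBCH": "LB",
    "FACM": "FA",
}


def datasets_by_domain(mapping):
    """Group dataset names under the domain each one represents."""
    grouped = defaultdict(list)
    for dataset, domain in sorted(mapping.items()):
        grouped[domain].append(dataset)
    return dict(grouped)


grouped = datasets_by_domain(dataset_domains)
```

For DM the domain and the dataset coincide, which is the common one-to-one case. For LB, one domain is represented by two datasets, which is where the two terms stop being interchangeable.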

Language taken from https://www.cdisc.org/kb/articles/domain-vs-dataset-whats-difference
