Requirements for data submission are defined and managed by the regulatory authorities to whom data are submitted. This section describes general requirements for datasets that may be part of a submission. However, additional conventions may be defined by regulatory bodies or negotiated with regulatory reviewers. In such cases, those additional requirements must be followed.

Tabulation Datasets

Observations about tobacco products and study subjects generated to support a submission are represented in a series of datasets aligned with logical groupings of data into domains. Domains described in this guide are generally aligned with implementation of a single dataset file in which to represent data in scope for a domain. All datasets are structured as flat files with rows representing observations and columns representing variables. In some cases, a dataset implemented for a domain may be split into physically separate dataset files to support submission when needed and as allowable by the regulatory authority.

The following guidance will be adhered to for tabulation datasets: 

Num | Guidance for | Implementation
1 | Dataset content

Data represented in datasets will include the following per regulatory requirements, scientific needs, and standards in this guide:

  • Data as originally collected or received (using controlled terminology where applicable) to support the submission.
  • Data from external references relevant to the submission (e.g., study protocol).
  • Data assigned per conventions in the TIG.
  • Data derived per regulatory and TIG conventions.
2 | Dataset naming
  • Domain datasets based on the SDTM general observation classes will be named using the 2-character code for the domain or using the applicable 4-character code when a dataset is split (e.g., LB, LBHM).
  • Supplemental Qualifier datasets will be named using "SUPP" concatenated with the 2-character domain code for the parent domain (e.g., SUPPDM, SUPPFA) or the 4-character code for the parent dataset when a dataset is split (e.g., SUPPFACM).
  • All other datasets will be named using the code for the domain or dataset (e.g., DM, RELREC).
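The Supplemental Qualifier naming rule above can be sketched in a few lines of Python. This is only an illustration: the function name `supp_dataset_name` and the pattern used to sanity-check the parent code (2 to 4 alphanumeric characters starting with a letter) are assumptions made for this example, not part of the guide.

```python
import re


def supp_dataset_name(parent_dataset: str) -> str:
    """Return the Supplemental Qualifier dataset name for a parent dataset.

    parent_dataset is the 2-character domain code (e.g., "DM") or the
    code of a split dataset, up to 4 characters (e.g., "FACM").
    """
    code = parent_dataset.upper()
    # Assumed sanity check (not from the guide): 2 to 4 alphanumeric
    # characters, starting with a letter.
    if not re.fullmatch(r"[A-Z][A-Z0-9]{1,3}", code):
        raise ValueError(f"unexpected parent dataset code: {parent_dataset!r}")
    return "SUPP" + code


print(supp_dataset_name("DM"))    # SUPPDM
print(supp_dataset_name("FACM"))  # SUPPFACM
```

Because the parent code is limited to 4 characters, the resulting name never exceeds 8 characters.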
3 | Variable order
  • Dataset variables will be ordered per guidance in the SDTM.
  • Variable order in TIG domain specifications aligns with variable order in the SDTM.
4 | Variable names
  • Variables will be named per guidance in the SDTM. The SDTM guidance uses fragment names in the CDISC Non-Standard Variables Registry.
  • Variable names in TIG domain specifications align with naming conventions in the SDTM.
  • Variable names will be 8 characters or less and uppercase.
5 | Variable labels

Descriptive labels per this guide, up to 40 characters, will be provided as data variable labels for all variables, including Supplemental Qualifier variables.
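The name and label limits above (names of 8 characters or less in uppercase, labels of up to 40 characters) lend themselves to a simple automated check. The sketch below is a minimal illustration; `check_variable` and its message wording are hypothetical and not defined by the TIG.

```python
def check_variable(name: str, label: str) -> list[str]:
    """Return a list of problems found for one variable name and label."""
    problems = []
    if len(name) > 8:
        problems.append(f"{name}: name is longer than 8 characters")
    if name != name.upper():
        problems.append(f"{name}: name is not uppercase")
    if len(label) > 40:
        problems.append(f"{name}: label is longer than 40 characters")
    return problems


# A conforming variable produces no problems.
print(check_variable("USUBJID", "Unique Subject Identifier"))  # []
```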

6 | Variable length

When variable length is referenced in the TIG, this refers to the length in bytes of ASCII character strings.

  • The maximum length of character variables is 200 characters, and the full 200 characters should not be used unless necessary.
  • Applicants will consider the nature of the data and apply reasonable, appropriate lengths to variables. For example:
      • The length of flags will always be 1.
      • --TESTCD and IDVAR values will never be longer than 8 characters, so the lengths of those variables can be set to 8.
      • The length for variables that use controlled terminology can be set to the length of the longest term.
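One way to apply "reasonable, appropriate lengths" is to size each character variable to its longest value, measured in bytes of the ASCII string. The sketch below assumes values are held as Python strings with `None` for missing; the names `suggested_length` and `MAX_LENGTH` are illustrative only.

```python
MAX_LENGTH = 200  # maximum length of a character variable


def suggested_length(values) -> int:
    """Return the smallest length in bytes that fits every non-missing value."""
    longest = 1  # assumed minimum of 1, even when all values are missing
    for value in values:
        if value is None:
            continue
        # encode("ascii") raises UnicodeEncodeError for non-ASCII text
        longest = max(longest, len(value.encode("ascii")))
    if longest > MAX_LENGTH:
        raise ValueError(f"value of {longest} bytes exceeds {MAX_LENGTH}")
    return longest


print(suggested_length(["Y", None, "N"]))            # 1
print(suggested_length(["HEMATOLOGY", "CHEMISTRY"]))  # 10
```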
    7 | Variable value text case
    • Values from controlled terminology or response values for QRS instruments specified by the instrument documentation will be in the case specified by those sources.
    • Otherwise, text data will be represented in uppercase (e.g., NEGATIVE).
    8 | Missing variable values

    Missing values for individual data items will be represented by nulls. 

    9 | Splitting datasets

    A domain dataset may be split into physically separate datasets to support submission when needed and as allowable by the regulatory authority. The following conventions must be adhered to when splitting domains into separate datasets:

    • A domain based on a General Observation Class may be split according to values in variable --CAT. When a domain is split on --CAT, --CAT must not be null.
    • The Findings About Events or Interventions (FA) domain may be split according to the domain in which the interventions or events in --OBJ are represented (or would be represented).

    To ensure split datasets can be appended back into 1 domain dataset:

    • The value of DOMAIN must be consistent across the separate datasets as it would have been if they had not been split (e.g., LB, FA).
    • All variables that require a domain prefix (e.g., --TESTCD, --LOC) must use the value of DOMAIN as the prefix value (e.g., LB, FA).
    • --SEQ must be unique within USUBJID for all records across all the split datasets. If there are 1000 records for a USUBJID across the separate datasets, all 1000 records need unique values for --SEQ.
    • When relationship datasets (e.g., SUPPxx, FAxx, CO, RELREC) relate back to split parent domains, the value of IDVAR will be from a variable with unique values for each observation. When possible, the variable represented in IDVAR will have values from collected data (e.g., SPID, RECID), but variables with derived values may also be used (e.g., --SEQ).
    • Permissible variables included in one split dataset need not be included in all split datasets.
    • For domains with 2-letter domain codes, split dataset names can be up to 4 characters in length. For example, if splitting by --CAT, dataset names would be the domain name plus up to 2 additional characters to indicate the value of --CAT (e.g., LBHM for LB if the value of --CAT is HEMATOLOGY). If splitting Findings About by parent domain, then the dataset name would be the domain code, "FA", plus the 2-character domain code for the parent domain (e.g., "FACM"). The 4-character dataset-name limitation allows the use of a Supplemental Qualifier dataset associated with the split dataset.
    • Supplemental Qualifier datasets for split domains will also be split. The nomenclature will include the additional 1 to 2 characters used to identify the split dataset (e.g., SUPPLBHM, SUPPFACM). The value of RDOMAIN in the SUPP-- datasets would be the 2-character domain code (e.g., LB, FA).
    • In RELREC, if a dataset-level relationship is defined for a split Findings About domain, then RDOMAIN will contain the 4-character dataset name, rather than the domain name "FA" (e.g., the value of RDOMAIN will be FACM).
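Two of the conventions above, a consistent DOMAIN value and --SEQ values that are unique within USUBJID across all split datasets, can be verified while appending the split datasets back together. The sketch below assumes each dataset is held in memory as a list of dictionaries; `append_split_datasets`, the sample records, and the dataset name LBCH are hypothetical and used only for illustration.

```python
from collections import Counter


def append_split_datasets(split_datasets, domain):
    """Append split datasets into one list of records and check the conventions.

    split_datasets maps a dataset name (e.g., "LBHM") to a list of records,
    each record being a dict of variable name -> value.
    """
    seq_var = f"{domain}SEQ"  # --SEQ with the domain prefix, e.g., LBSEQ
    combined = []
    for name, records in split_datasets.items():
        for record in records:
            # DOMAIN keeps the unsplit value (e.g., LB), not the dataset name.
            if record["DOMAIN"] != domain:
                raise ValueError(
                    f"{name}: DOMAIN is {record['DOMAIN']!r}, expected {domain!r}"
                )
            combined.append(record)

    # --SEQ must be unique within USUBJID across all split datasets.
    counts = Counter((r["USUBJID"], r[seq_var]) for r in combined)
    duplicates = sorted(key for key, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"duplicate {seq_var} within USUBJID: {duplicates}")
    return combined


# Illustrative records only; LBCH is a hypothetical split dataset name.
lb = append_split_datasets(
    {
        "LBHM": [{"DOMAIN": "LB", "USUBJID": "S1", "LBSEQ": 1}],
        "LBCH": [
            {"DOMAIN": "LB", "USUBJID": "S1", "LBSEQ": 2},
            {"DOMAIN": "LB", "USUBJID": "S2", "LBSEQ": 1},
        ],
    },
    "LB",
)
print(len(lb))  # 3
```

A duplicate (USUBJID, --SEQ) pair or a record whose DOMAIN holds the split dataset name would raise a `ValueError` instead of silently producing a combined dataset that cannot be used.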

    Analysis Datasets


    Observations about tobacco products and study subjects generated to support analysis in a submission are represented in a series of datasets based on the CLASS values described in the TIG. Datasets are generally created to support a certain type of analysis, but sometimes analysis datasets are created to support the creation of a subsequent dataset that will be used for analysis. All datasets are structured as flat files with rows representing observations and columns representing variables. In some cases, a dataset may be split into physically separate dataset files to support submission when needed and as allowable by the regulatory authority.

    The following guidance will be adhered to for analysis datasets:

    Num | Guidance for | Implementation
    1 | Dataset content

    Data represented in datasets will include the following per regulatory requirements, scientific needs, and standards in this guide:

    • Data as originally collected or received (using controlled terminology where applicable) to support the submission.
    • Data from external references relevant to the submission (e.g., reference data).
    • Data assigned per conventions in the TIG.
    • Data derived per regulatory and TIG conventions.
    2 | Dataset naming
    • With one exception, analysis dataset names are not predefined. The only predefined name is ADSL, which is suggested for studies where a one-record-per-subject dataset is created to capture subject-level demographics, product usage, and/or trial experience information.
    • All ADaM datasets other than ADSL should be named AD + applicant-defined name (ADXXXXXX). The exception to this general naming convention is the RF prefix for reference data, which has been introduced in the TIG.
    • There is no rule that noncompliant datasets must start with AX or that they cannot start with AD.
    • ADaM datasets should be named logically, if possible, and consistent naming conventions should be used across studies within a submission.
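The analysis dataset naming patterns can be illustrated with a small classifier. This is a sketch under stated assumptions: the function name and the returned descriptions are hypothetical, and the 8-character, alphanumeric limit is inferred from the ADXXXXXX pattern rather than stated as a rule in this section.

```python
import re


def classify_analysis_dataset_name(name: str) -> str:
    """Describe which naming pattern an analysis dataset name follows."""
    name = name.upper()
    if name == "ADSL":
        return "predefined subject-level dataset"
    # Assumption: AD followed by 1 to 6 alphanumeric characters (ADXXXXXX).
    if re.fullmatch(r"AD[A-Z0-9]{1,6}", name):
        return "AD + applicant-defined name"
    if name.startswith("RF"):
        return "reference data (RF prefix)"
    return "other"


# ADTTE is a hypothetical applicant-defined name used for illustration.
for example in ("ADSL", "ADTTE"):
    print(example, "->", classify_analysis_dataset_name(example))
```

A result of "other" is not necessarily an error; as noted above, there is no rule about how noncompliant datasets must be named.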
    3 | Variable order
    • There is no variable ordering defined for the ADaM standards, although having variables ordered together within a variable group helps review and dataset understanding.
    • Variable order in the ADaM dataset must match the order in the define.xml file.
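Matching the dataset's variable order to define.xml can be done mechanically once the variable order has been read from the define.xml file. The sketch below assumes that order is already available as a plain list of names (parsing define.xml is out of scope here); `reorder_record` and the sample variable list are illustrative only.

```python
def reorder_record(record: dict, define_order: list[str]) -> dict:
    """Return a copy of record with variables in the define.xml order."""
    if set(record) != set(define_order):
        mismatched = sorted(set(record) ^ set(define_order))
        raise ValueError(f"variables do not match define.xml: {mismatched}")
    # Python dicts preserve insertion order, so this fixes the variable order.
    return {name: record[name] for name in define_order}


define_order = ["STUDYID", "USUBJID", "PARAMCD", "AVAL"]  # assumed order
record = {"AVAL": 5.2, "USUBJID": "S1", "PARAMCD": "X", "STUDYID": "T1"}
print(list(reorder_record(record, define_order)))
# ['STUDYID', 'USUBJID', 'PARAMCD', 'AVAL']
```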
    4 | Variable names
    • Variables will be named per ADaM guidance, which uses fragment names in the CDISC NSV Registry.
    • Variable names in TIG ADaM specifications align with naming conventions in ADaM.
    • Variable names will be 8 characters or less and uppercase.
    5 | Variable labels
    • Descriptive labels per this guide, up to 40 characters, will be provided as data variable labels for all variables.
    6 | Variable length

    When variable length is referenced in the TIG, this refers to the length in bytes of ASCII character strings.

    • The maximum length of character variables is 200 characters, and the full 200 characters should not be used unless necessary.
    • Applicants will consider the nature of the data and apply reasonable, appropriate lengths to variables. For example:
      • The length of flags will always be 1.
      • PARAMCD values will never be longer than 8 characters, so the length of that variable can always be set to 8.
      • The length for variables that use controlled terminology can be set to the length of the longest term.
    7 | Variable value text case

    Variable value text case generally depends on the variable usage and how it is presented on outputs (but there is no requirement that this usage must be followed).

    8 | Missing variable values

    Missing values for individual data items will be represented by nulls if necessary for analysis. Otherwise, it is up to the dataset creator whether to include missing values in an analysis dataset.

    9 | Splitting datasets

    An analysis dataset may be split into physically separate datasets to support submission when needed and as allowable by the regulatory authority. ADaM currently has no conventions as to the proper way to split analysis datasets, although like types of data should have similar dataset naming.
