Creation of Derived Columns Versus Creation of Derived Rows

This section provides specific rules to use in building a BDS dataset. These rules are essential, because they ensure the BDS dataset is analysis-focused, with all analysis-enabling variables and supportive variables included in a predictable structure, while preventing a "horizontalization" of the dataset.

The rows (i.e., records) in the ADaM BDS represent subject data for analysis parameters and timepoints (as applicable). There may be multiple rows within a given combination of subject, parameter, and timepoint, depending on the number of observations collected or derived, baseline definition, and so on.

The ADaM BDS structure contains a central set of columns (i.e., variables) that represent the data being analyzed. These variables include the value being analyzed (e.g., AVAL) and the description of the value being analyzed (e.g., PARAM). Other columns in the dataset provide more information about the value being analyzed (e.g., the subject identification), describe and trace its derivation (e.g., DTYPE), or support its analysis (e.g., product variables, covariates). Standard columns exist for a variety of purposes, such as SDTM record identifiers for traceability; population and other record selection flags; analysis values; and some standard functions of analysis values. Permissible columns are not limited to those whose variable names are specified in Section 2.9.5, Predefined Standard Variables for ADSL; Section 2.9.6, Predefined Standard Variables for BDS; and Section 2.9.7, Predefined Standard Variables for OCCDS. They may also include study-specific analysis model covariates, subgrouping variables, variables supportive of traceability, and other variables needed for analysis or useful for review.

The BDS is flexible in that derived data can be added to the collected data as additional rows and columns that support the analyses and provide traceability. However, there are some constraints on how to incorporate derived data in the BDS dataset. This section addresses when derived data that are functions of analysis values should be added as additional columns, and when they should be added as additional rows.

The precise sequence of steps involved in creating a BDS dataset varies according to operational and study-specific needs. For the purposes of this discussion, it is useful to consider 2 fundamental steps.

1. Create an initial dataset from the source datasets. The first step is to create a set of rows and columns more or less directly derived from or loaded from input datasets (primarily SDTM datasets and other ADaM datasets) into their appropriate places. This step includes creation and population of columns containing the analysis parameter (PARAM), analysis timepoint (e.g., AVISIT), and analysis values (e.g., AVAL, AVALC). It also includes adding columns containing identifiers (e.g., STUDYID, USUBJID, SUBJID, SITEID) and other SDTM variables for traceability (e.g., VISIT, --SEQ).
2. Add additional derived data as needed for the analysis. The second step consists of adding derived rows and columns based on the initial set of ADaM dataset records and columns. The rules below govern this step. These rules are further described and illustrated in the remaining subsections of this section.
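
The first step can be sketched in code. The following is a minimal illustration in Python, not part of the ADaM standard. The input records, the `PARAM_MAP` lookup, the `build_initial_bds` function, and the choice to derive AVISIT by title-casing VISIT are all assumptions made for this example; only the variable names (STUDYID, USUBJID, PARAM, AVISIT, AVAL, VISIT, VISITNUM, VSSEQ) come from the text above.

```python
# Hypothetical SDTM-like VS records; all values are illustrative only.
vs = [
    {"STUDYID": "XYZ", "USUBJID": "XYZ-001", "VSSEQ": 1, "VSTESTCD": "WEIGHT",
     "VSSTRESN": 80.0, "VISIT": "BASELINE", "VISITNUM": 1},
    {"STUDYID": "XYZ", "USUBJID": "XYZ-001", "VSSEQ": 2, "VSTESTCD": "WEIGHT",
     "VSSTRESN": 78.0, "VISIT": "WEEK 2", "VISITNUM": 2},
    {"STUDYID": "XYZ", "USUBJID": "XYZ-001", "VSSEQ": 3, "VSTESTCD": "WEIGHT",
     "VSSTRESN": 76.0, "VISIT": "WEEK 4", "VISITNUM": 3},
]

# Assumed lookup from SDTM test code to (PARAMCD, PARAM); PARAM includes units.
PARAM_MAP = {"WEIGHT": ("WEIGHT", "Weight (kg)")}


def build_initial_bds(vs_records):
    """Step 1: load source values into their places in a BDS-shaped row."""
    rows = []
    for rec in vs_records:
        paramcd, param = PARAM_MAP[rec["VSTESTCD"]]
        rows.append({
            # Identifiers
            "STUDYID": rec["STUDYID"],
            "USUBJID": rec["USUBJID"],
            # Analysis parameter, timepoint, and value
            "PARAMCD": paramcd,
            "PARAM": param,
            "AVISIT": rec["VISIT"].title(),  # assumed mapping for this sketch
            "AVAL": rec["VSSTRESN"],
            # SDTM variables kept for traceability
            "VISIT": rec["VISIT"],
            "VISITNUM": rec["VISITNUM"],
            "VSSEQ": rec["VSSEQ"],
        })
    return rows


initial = build_initial_bds(vs)
```

Each source record yields one row at this stage; derived rows and columns are added only in the second step.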

Num	Rules	Implementation
1	Rule 1: A parameter-invariant function of AVAL and BASE on the same row that does not involve a transform of BASE should be added as a new column.	The 3 conditions of rule 1 for when a function of AVAL and BASE should be added as a column (i.e., a function column) are: the function is of AVAL and, optionally, BASE, on the same row; the function is parameter-invariant; and the function does not involve a transform of BASE. The remainder of the discussion of this rule is devoted to explaining these conditions. PARAM uniquely describes the contents of AVAL or AVALC. Often, AVAL itself is not the value that is needed for analysis. For example, in a change from baseline analysis, it is the change from baseline CHG that is analyzed. The change from baseline column CHG should be created according to rule 1 because it satisfies the 3 conditions: CHG is derived from AVAL and BASE on the same row. The same calculation applies on all rows in the dataset on which CHG is populated (the function CHG=AVAL-BASE does not vary according to PARAM). This second condition is known as the "property of parameter-invariance"; unless listed in Section 2.9.5, Predefined Standard Variables for ADSL; Section 2.9.6, Predefined Standard Variables for BDS; or Section 2.9.7, Predefined Standard Variables for OCCDS, a function of AVAL (and optionally BASE) may not be derived as a column if it is parameter-variant (i.e., is calculated differently for different parameters). In the function CHG=AVAL-BASE, BASE is not transformed. The intent is to use the standard columns as much as possible, to keep the structure as standard as possible, and avoid undue "horizontalization," while still permitting efficient use of function columns.
2	Rule 2: A transformation of AVAL that does not meet the conditions of rule 1 should be added as a new parameter, and AVAL should contain the transformed value.	If the intention is to redefine AVAL, BASE, CHG, and so on in terms of a transform of AVAL, then a new parameter must be added in which PARAM describes the transform. The creation of a new parameter results, by definition, in the creation of a new set of rows. For example, as described in the discussion of rule 1, in a change from baseline analysis of the logarithm of weight, AVAL should contain the log of weight, BASE should contain the baseline value of the log of weight, and CHG should contain the difference between the 2. PARAM should contain a description of the transformed data contained in AVAL (e.g., "Log10 (Weight (kg))"). In this way, the ADaM standard accommodates an analysis of transformed data in the standard columns without creating a multiplicity of new special-purpose columns. A related application of rule 2 is the case where it is necessary to support analysis and reporting in 2 different systems of units. In SDTM Findings domains (e.g., LB, QS, EG), the --STRESN column is the only numeric result column, and is also the only standardized numeric result column. The --ORRES column contains a character representation of the collected result, in the collected units specified in the --ORRESU column. The --ORRES column is not standardized. So for example, if data are typically collected in conventional units, SDTM cannot accommodate standardized data in both conventional units and the International System of Units (SI). In SDTM, for any given --TEST, a producer can standardize in 1 system of units but not 2. If one wishes to be able to analyze standardized results in both conventional units and in SI units, a transform in an ADaM dataset is needed. In each such case, a new parameter must be created in order to accommodate standardized data in the other system of units. The description in the PARAM column must contain the units, as well as any other information (e.g., location, specimen type) that is needed to ensure that PARAM uniquely describes what is in AVAL, and differentiates between parameters as needed. PARAM cannot be the same for different units. When a record is derived from a single record in the dataset, retain on the derived record any variable values from the original record that do not change and that make sense in the context of the new record (e.g., --SEQ, VISIT, VISITNUM, --TPT, covariates).
3	Rule 3: A function of 1 or more rows within the same parameter for the purpose of creating an analysis timepoint should be added as a new row for the same parameter.	For analysis purposes, there is often a need to impute missing data, or to create a derived conceptual timepoint. Such derivations should result in the creation of new derived records within the same parameter. As a general rule, when a record is derived from a single record in the dataset, retain on the derived record any variable values from the original record that do not change and that make sense in the context of the new record (e.g., --SEQ, VISIT, VISITNUM, --TPT, covariates). When a record is derived from multiple records, retain on the derived record all variable values that are constant across the original records, that do not change, and that make sense in the context of the new record. Note that there are situations when retention of values from an original record or records would make no sense on the derived record; in such cases, do not retain those values. For example, suppose that the analysis endpoint value is defined as the average of the last 2 available post-baseline values. In this case, a new row should be added, with a corresponding description in AVISIT, and the DTYPE (derivation type) column should contain a description on that row such as "AVERAGE" to indicate both that the row was derived and the derivation method. The metadata associated with AVISIT=Endpoint should adequately describe which records are used in the definition of the average. Note that even though the set of records for the log transformation of weight are derived, DTYPE is not populated for every row. DTYPE should be used to indicate rows that are derived within a given value of PARAM and is not to be used as an indication of whether the record exists in SDTM. An extension of rule 3 is necessary in the case where there is record-level population flagging. For example, assume the SAP states that if the subject is off drug for 7 days prior to a visit, the measurement collected at that visit is not included in the per-protocol analysis. Then, for some subjects, the last 2 available values may be different for intent-to-treat and for per-protocol analyses, so that the calculated endpoint averages would be different. For such subjects, 2 distinct derived endpoint rows would be needed, the appropriate row for each analysis indicated by the record-level population flags ITTRFL and PPROTRFL.
4	Rule 4: A function of multiple rows within a parameter should be added as a new parameter.	Rule 4 is a special case of rule 2. The functions covered by this rule violate the first condition of rule 1 (they are not same-row functions of AVAL), and may also violate the second and third conditions.
5	Rule 5: A function of more than 1 parameter should be added as a new parameter.	There is often a need to derive for analysis a parameter that was not collected. Such parameters may be quite complex functions of data from multiple SDTM domains and domain classes. Rule 5 addresses the case where a parameter is derived from other parameters already present in the dataset. For example, a questionnaire total domain score is calculated as a function of more than 1 observed question. The total domain score should be added as a new parameter, with its corresponding set of derived rows. For this derived parameter, the value of PARAM could be "Total Domain Score". As usual, the value of the total domain score would be stored in the standard AVAL column, the baseline value in the standard BASE column, and the change from baseline in CHG.
6	Rule 6: When there is more than 1 definition of baseline, each additional definition of baseline requires the creation of its own set of rows.	In case there is more than 1 definition of baseline in an ADaM dataset, new rows must be created for each additional alternative definition of baseline. There will therefore be multiple sets of rows, where each set of rows corresponds to a particular definition of baseline. Whenever there is more than 1 definition of baseline, the BASETYPE column is required. BASETYPE identifies the definition of baseline that corresponds to the value of BASE in each row. There is only 1 BASE column, and only 1 column for each qualifying function of AVAL and BASE.
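
Rules 1 through 3 in the table above can be sketched in code. The following is a minimal illustration in Python, not part of the ADaM standard. The input rows, the function names, the parameter code `L10WT`, the numeric timepoint values in AVISITN, and the choice to leave CHG empty on baseline rows are all assumptions made for this example. Rules 4 through 6 are not covered.

```python
import math

# Hypothetical initial BDS rows for one subject and one parameter.
rows = [
    {"USUBJID": "XYZ-001", "PARAMCD": "WEIGHT", "PARAM": "Weight (kg)",
     "AVISIT": "Baseline", "AVISITN": 0, "AVAL": 80.0, "VSSEQ": 1},
    {"USUBJID": "XYZ-001", "PARAMCD": "WEIGHT", "PARAM": "Weight (kg)",
     "AVISIT": "Week 2", "AVISITN": 2, "AVAL": 78.0, "VSSEQ": 2},
    {"USUBJID": "XYZ-001", "PARAMCD": "WEIGHT", "PARAM": "Weight (kg)",
     "AVISIT": "Week 4", "AVISITN": 4, "AVAL": 76.0, "VSSEQ": 3},
]


def add_log_parameter(rows, source_paramcd, new_paramcd, new_param):
    """Rule 2: a transform of AVAL becomes a new parameter (a new set of rows)."""
    derived = []
    for r in rows:
        if r["PARAMCD"] == source_paramcd:
            new = dict(r)  # retain unchanged values such as VSSEQ and AVISIT
            new.update(PARAMCD=new_paramcd, PARAM=new_param,
                       AVAL=math.log10(r["AVAL"]))
            derived.append(new)
    return rows + derived


def add_endpoint_average(rows):
    """Rule 3: average of the last 2 post-baseline values, added as a new
    row within the same parameter and marked with DTYPE."""
    derived = []
    for usubjid, paramcd in sorted({(r["USUBJID"], r["PARAMCD"]) for r in rows}):
        post = sorted(
            (r for r in rows
             if r["USUBJID"] == usubjid and r["PARAMCD"] == paramcd
             and r["AVISITN"] > 0),
            key=lambda r: r["AVISITN"],
        )
        if len(post) < 2:
            continue  # assumption: no endpoint row without 2 values
        last2 = post[-2:]
        derived.append({
            "USUBJID": usubjid, "PARAMCD": paramcd, "PARAM": last2[0]["PARAM"],
            "AVISIT": "Endpoint", "AVISITN": 99,  # 99 is an arbitrary sort value
            "AVAL": (last2[0]["AVAL"] + last2[1]["AVAL"]) / 2,
            "DTYPE": "AVERAGE",
            # VSSEQ is not retained: it differs across the 2 source records.
        })
    return rows + derived


def add_base_and_chg(rows):
    """Rule 1: CHG = AVAL - BASE as a column; same formula for every PARAM."""
    base = {(r["USUBJID"], r["PARAMCD"]): r["AVAL"]
            for r in rows if r["AVISIT"] == "Baseline"}
    out = []
    for r in rows:
        new = dict(r)
        new["BASE"] = base.get((r["USUBJID"], r["PARAMCD"]))
        post_baseline = r["AVISIT"] != "Baseline" and new["BASE"] is not None
        new["CHG"] = new["AVAL"] - new["BASE"] if post_baseline else None
        out.append(new)
    return out


bds = add_base_and_chg(
    add_endpoint_average(
        add_log_parameter(rows, "WEIGHT", "L10WT", "Log10 (Weight (kg))")
    )
)
```

In this sketch the log-transformed rows carry no DTYPE, because they are not derived within their own parameter, whereas the endpoint rows do. BASE and CHG are computed last, so the same column logic applies unchanged to the original parameter, the transformed parameter, and the derived endpoint rows.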
