Creating Data Collection Instruments

Best practices in this section are operational recommendations to support data collection, suggested CRF development workflow, and methods for creating data-collection instruments.

Num	Best Practice Recommendation	Rationale
1	When a binary response is expected, "Yes/No" responses are preferred over "Check all that apply," because a missing response could lead to a misinterpretation of critical data. For example, if adverse events (AEs) are determined to be "serious" based only upon checking the applicable criterion (e.g., Hospitalization, Congenital Anomaly), failure to check a criterion could potentially delay identification of a serious adverse event (SAE). If an assessment has composite responses (e.g., presence or absence of 2 or more symptoms), "Yes/No" questions for each component response (e.g., symptom) are preferred to "Check all that apply" questions. Exceptions to this recommendation include: (1) Assessments where the majority of options would be answered "No," such as in the collection of electrocardiogram (ECG) abnormality data where approximately 45 abnormalities may be listed but only a few will apply. (2) When a validated instrument contains checkboxes. In this case, they should remain checkboxes in the CRF or eCRF. (3) When controlled terminology governs the values being collected. For example, if collecting RACE using the "Check all that apply" option, the RACE values defined by controlled terminology should be collected as individual check boxes, and not as a "Yes/No" response. In cases where the sponsor chooses to use "Check all that apply," additional quality checks (e.g., source data verification, SDV) should be considered to ensure the data collected in the CRF are correct and complete.	"Yes/No" questions provide a definite answer. The absence of a response is ambiguous as it can mean "No," "None," or that the response is missing. In situations where there is no other dependent or related field by which to gauge the completeness of the field in question, a "Yes/No" response ensures that the data are complete. For example, when an AE End Date is blank, a "Yes" response to the question "Is the AE ongoing?" ensures that the data are complete. When the end date is provided, it is not necessary to answer the question "No".
2	The database should contain an indication that a planned exam/assessment was not performed. The mechanism for this may be different from system to system or from paper to EDC. For example, the data collection instrument/CRF could contain a field that allows the site to record an indication that a Vital Sign assessment was not performed (e.g., VSPERF="N" or TEMP_VSSTAT="NOT DONE"). A "Yes/No – assessment completed" question is preferred over a "Check if not done" box, unless the "Check if not done" field can be compared to a completed data field using a validation check to confirm that one or the other has data. In situations where there is no other dependent or related field to gauge the completeness of the field in question, a "Yes/No" response format should be used to eliminate ambiguity. When another related field is present, the "Yes/No" response is optional. For example, when a value for temperature is missing, a simple "Not Done" box may be checked. It is not necessary to respond "Done" when a temperature value is present.	This will provide a definitive indicator that a data field has missing data and has not been overlooked. This will prevent unnecessary data queries to clarify whether an assessment has been performed. The use of the "Yes/No" format helps to eliminate ambiguity about whether an assessment has been completed.
3	Data-cleaning prompts should be used to confirm that blank CRFs are intentionally blank. Usually this will be a "Yes/No" question (e.g., AEYN) but it may be a "Check if blank" box, if a validation check can be used to confirm that either the "Check if blank" box is checked, or that there are data recorded in the CRF.	This will provide a definitive indicator that a CRF is blank on purpose and has not been overlooked. This will prevent unnecessary data queries.
4	The same data (i.e., the same information at the same time) should not be collected more than once.	Collecting the same data more than once: (1) creates the opportunity for discrepancies between the entered values (e.g., if a subject’s birthdate or age is collected on the Demographics page, it is not necessary to collect age on the Lab CRF at every visit); (2) requires extra reconciliation; and (3) may affect frequency counts and analysis results.
5	A "Check if ongoing" question is recommended to confirm ongoing against an end date. This is a special-use case of "Yes/No," where the data entry field may be presented as a single possible response of "Yes" in conjunction with an End Date variable. If the box is checked, the operational variable may contain "Yes". If the box is not checked and the End Date is populated, the value of the variable may be set to "No". For some EDC systems, it may be better to display the possible responses to the "Check if ongoing" question as radio buttons. Conditional logic can then be used to solicit the collection of the end date only if the answer to the "Ongoing" question is "N" (No).	In the "Check if ongoing" use case, the data are considered "clean" only when exactly 1 of the 2 responses is present and the other response is blank. The presence of the end date therefore provides confirmation that the event is not ongoing.
6	CRFs should use a consistent order of responses (e.g., "Yes/No") from question to question, for questions with response boxes or other standardized lists of values. Exceptions to this would be cases where a validated instrument (e.g., a standardized assessment questionnaire) is used.	A consistent order of response boxes promotes ease of use of the CRF to help reduce data entry errors and to avoid introducing bias or leading the investigator to a desired response.
7	CRF questions and completion instructions should be unambiguous, and should not "lead" the site to answer the question in a particular way.	Data should be collected in a way that does not introduce bias or errors into the study data. Questions should be clear and unambiguous. This includes making sure that the options for answering the question are complete, such as providing options for "Other" and "None" when applicable.
8	CRF questions should be as self-explanatory as possible, thereby reducing the need for separate instructions. If required, short instructions may be placed on the CRF page, especially if the Prompt is not specific enough. More detailed instructions may be presented in a CRF completion guideline. All instructions should be concise. Instructions should be standardized as much as possible.	Putting short instructions and prompts on the CRF increases the probability that they will be read and followed, and can reduce the number of queries and the overall data cleaning costs. Having standard instructions supports all sites using the same conventions for completing the fields. Providing short instructions and prompts on the CRF and moving long instructions to a separate instruction booklet, facing page, or checklist will decrease the number of CRF pages, with the following benefits: (1) decreased CDM costs (e.g., decreased data entry costs); (2) a CRF that can be formatted so that the reader can easily identify the fields to be completed; and (3) a less cluttered page format, making it easier for site personnel and monitors to identify fields with missing responses.
9	Collection of dates should use an unambiguous format, such as DD-MON-YYYY, where each part of the date is a unique format: "DD" is the day as a 2-digit numeric value; "MON" is the month as a 3-character letter abbreviation in English, or similar character abbreviation or representation in the local language; and "YYYY" is the year as a 4-digit numeric value. For EDC, the user may be able to select a date from a calendar, and this would also meet the recommendation for an unambiguous date. If the recommended approach is not adaptable to the local language, a similarly unambiguous format should be used. The method for capturing date values should allow the collection of partial dates, and should use a consistent method or convention for collecting the known date parts to facilitate standard mapping to SDTM. See the CDASH Model for standard date variable names.	Using this data collection format (i.e., DD-MON-YYYY) will provide unambiguous dates. For example, the date "06/08/02" is ambiguous because it can be interpreted as June 8, 2002, or August 6, 2002. If subject-completed CRF pages are translated into a local language, the CDASH recommended date format for collection may make translation of the documents easier. Dates are collected in this format, but reformatted and submitted in ISO 8601 format. See the SDTMIG and Section 3.6, Timing Variables: Collection, Conversion, and Imputation of Dates, for more information about the ISO 8601 format.
10	To eliminate ambiguity, times should be collected with the use of a 24-hour clock, using the hh:mm:ss format for recording times. Use only as many of the hh:mm:ss elements as are needed for a particular field. Sites should be cautioned not to "zero-fill" time components that are not known. For example, 21:00:00 means "exactly 9 pm"; if the site does not know how many seconds after 9 pm the event occurred, the seconds should not be recorded. Subject-completed times may be recorded using a 12-hour clock and an "am" or "pm" designation. The time should then be transformed to a 24-hour clock in the database.	SDTM-based datasets use ISO 8601 date/time formats. Collecting times using a 24-hour clock eliminates both ambiguity and the need to convert values from 12-hour to 24-hour clock time.
11	Manually calculated fields should not typically be recorded within the CRF when the raw data on which the calculation is based are recorded in the CRF. An exception is when a treatment and/or study conduct decision should be made based on those calculations. In such cases it may be useful for the calculated field to be recorded within the CRF. It may also be useful to provide the site a step-by-step worksheet to calculate this data.	Data items that can be calculated from other data captured within the CRF are more accurately reported if they are calculated programmatically using validated algorithms. The noted exception may be in cases where it is important to show how an investigator determined a protocol-defined endpoint from collected raw data.
12	Questions with free-text responses should be limited to cases of specific safety or therapeutic need in reporting or analysis, such as adverse events, concomitant medications, or medical history, generally in cases where the data will be subsequently coded. Questions should be specific and clear rather than open-ended. Rather than relying on free-text comment fields, the protocol development team should thoroughly review the CRF to maximize the use of predefined lists of responses. See Section 7.2, CO - Comments, for additional recommendations.	The collection and processing of free text requires significant resources, both for data entry and for clinical data management (CDM) review of the text for safety information and for inconsistencies with other recorded data. Free text is also of limited use when analyzing and reporting clinical data. Another risk is that sites may enter data into free-text fields that should be recorded elsewhere.
13	Subject-specific data should be collected and recorded by the site and should not be pre-populated in the CRF/eCRF.	The CRF is a tool to collect subject-level data. However, pre-population of some identifying (e.g., investigator name, site identification, protocol number) or timing (e.g., Visit Name) information prevents errors and reduces data entry time. Fields on the CRF or in the database that are known to be the same for all subjects may be pre-populated (e.g., measurements for which there is only 1 possible unit, such as Respiratory Rate or Blood Pressure). The units can be displayed on the CRF and populated in the database.
14	The anatomical location of a measurement, position of subject, or method of measurement should be collected only if the protocol specifies the allowable options, or if the parameter is relevant to the consistency or meaning of the resulting data.	When a parameter such as location, position, or method is specified in a protocol and is part of the analysis, the CRF may include the common options for these parameters to ensure the site can report what actually happened and protocol deviations can be identified. If the parameter is pre-populated on the CRF and other options are not available, then the site should be directed to not record data that was not collected per protocol specifications. Taking measurements in multiple anatomical locations may affect the value of the measurement and/or the ability to analyze the data in a meaningful way (e.g., when data obtained from different locations may bias or skew the analysis). In this case, collecting the location may be necessary to ensure consistent readings. For example, temperature obtained from the ear, mouth, or skin may yield different results. If there is no such rationale for collecting location, position, method, or any other value, it would be considered unnecessary data. See Section 4.3, Organizational Best Practices to Support Data Collection, Num 1.
15	Sites should record verbatim terms for non-solicited adverse events, concomitant medications, or medical history-reported terms. Sites should not be asked to select a preferred term from a coding dictionary as a mechanism for recording data.	When the site records information about spontaneously reported adverse events or medical history, recording responses verbatim ensures that no information is omitted. Site personnel are not expected to be coding experts and may not be familiar with the coding dictionaries used in clinical research. Having sites record adverse events from a standardized list is the same as having them code these events. Having multiple sites "coding" data, on the other hand, will likely result in inconsistencies in coding across sites. See Section 6, Other Recommendations, for more information about collecting data for coding purposes.
16	An SDTMIG variable name should only be used as a data collection/operational variable name if the collected value will directly populate the SDTMIG variable with no transformation (other than changing case). Otherwise, create a "collected" version of the variable and write a standard mapping to the SDTMIG variable.	This practice provides clearer traceability from data collection to submission, and facilitates a more automated process of transforming collected data to the standardized data tabulations for submission.
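
Best practices 9 and 10 describe collecting dates as DD-MON-YYYY and times on a 24-hour clock, with later reformatting to ISO 8601 and no zero-filling of unknown parts. The Python sketch below shows one way such a conversion might look. The function names (date_parts_to_iso8601, time_12h_to_24h), the use of separate day/month/year inputs, and the choice to represent a partial date by truncating at the first unknown part are assumptions made for illustration only; the SDTMIG remains the authority on how partial dates are represented.

```python
from typing import Optional

# English 3-letter month abbreviations, as in the DD-MON-YYYY collection format.
_MONTH_ABBREVIATIONS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
                        "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]
MONTHS = {abbr: f"{i:02d}" for i, abbr in enumerate(_MONTH_ABBREVIATIONS, start=1)}


def date_parts_to_iso8601(day: Optional[str], mon: Optional[str], year: Optional[str]) -> str:
    """Convert collected date parts to an ISO 8601 string.

    Assumption: unknown parts arrive as None or "" and the result is
    truncated at the first unknown part (year -> month -> day).
    """
    if not year:
        return ""  # nothing usable was collected
    iso = f"{int(year):04d}"
    if not mon:
        return iso  # year only
    iso += "-" + MONTHS[mon.upper()]
    if not day:
        return iso  # year and month only
    return iso + f"-{int(day):02d}"


def time_12h_to_24h(hour: int, minute: Optional[int], meridiem: str) -> str:
    """Convert a subject-reported 12-hour time to 24-hour clock text.

    Unknown minutes (None) are omitted rather than zero-filled.
    """
    marker = meridiem.strip().lower()
    if marker not in ("am", "pm"):
        raise ValueError("meridiem must be 'am' or 'pm'")
    if not 1 <= hour <= 12:
        raise ValueError("hour must be between 1 and 12")
    hour_24 = hour % 12 + (12 if marker == "pm" else 0)
    if minute is None:
        return f"{hour_24:02d}"
    return f"{hour_24:02d}:{minute:02d}"


print(date_parts_to_iso8601("06", "AUG", "2002"))  # 2002-08-06
print(date_parts_to_iso8601(None, "AUG", "2002"))  # 2002-08
print(time_12h_to_24h(9, 0, "pm"))                 # 21:00
```

Keeping the date parts separate until conversion means a missing day or month never has to be guessed, which is the point of allowing partial dates at collection.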

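Best practices 2, 3, and 5 each rely on a validation check confirming that exactly one of two things is present: a checked box ("Check if not done," "Check if blank," "Check if ongoing") or the related data value. A minimal Python sketch of such a check follows. The function names, parameter names, and query wording are hypothetical and do not come from any particular EDC system.

```python
from typing import Optional


def one_or_other_query(box_checked: bool, value: Optional[str],
                       box_label: str, value_label: str) -> Optional[str]:
    """Return a data query message, or None when the pair of fields is clean.

    Clean means exactly one of the two is present: the box is checked
    or the related value is populated, but not both and not neither.
    """
    has_value = bool(value and value.strip())
    if box_checked and has_value:
        return f"Both '{box_label}' and {value_label} are recorded; only one is expected."
    if not box_checked and not has_value:
        return f"Neither '{box_label}' nor {value_label} is recorded; one is required."
    return None


def derive_ongoing(ongoing_checked: bool, end_date: Optional[str]) -> Optional[str]:
    """Derive the operational "Yes"/"No" value for a 'Check if ongoing' box.

    Returns None when the data are not clean and a query is needed first.
    """
    if one_or_other_query(ongoing_checked, end_date, "Ongoing", "End Date") is not None:
        return None
    return "Yes" if ongoing_checked else "No"


# Box unchecked and end date present: clean, so the variable is set to "No".
print(derive_ongoing(False, "15-MAR-2021"))  # No

# Temperature missing and 'Not Done' unchecked: the check raises a query.
print(one_or_other_query(False, "", "Not Done", "Temperature"))
```

The same helper covers all three practices because each reduces to the same rule; only the labels differ.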