Creating Data Collection Instruments

The following best practice recommendations are general principles for the development of data collection instruments to ensure data collected are complete and of high quality.

Num	Best Practice Recommendation	Rationale
1	When a binary response is expected, "Yes/No" responses are preferred over "Check all that apply," because a missing response could lead to a misinterpretation of critical data. If an assessment has composite responses (e.g., presence or absence of 2 or more subject characteristics), "Yes/No" questions for each component response (e.g., characteristic) are preferred to "Check all that apply" questions. Exceptions to this recommendation include: (a) assessments where the majority of options would be answered "No"; (b) validated instruments that contain checkboxes, which should remain checkboxes in the CRF or eCRF; and (c) cases where controlled terminology is required for the values being collected. In cases where the applicant chooses to use "Check all that apply", additional quality checks should be considered (e.g., source data verification) to ensure the data collected in the CRF are correct and complete.	"Yes/No" questions provide a definite answer. The absence of a response is ambiguous as it can mean "No," "None," or that the response is missing. In situations where there is no other dependent or related field by which to gauge the completeness of the field in question, a "Yes/No" response ensures that the data are complete.
2	The database should contain an indication that a planned exam/assessment was not performed. The mechanism for this may be different from system to system or from paper to EDC. A "Yes/No – assessment completed" question is preferred over a "Check if not done" box, unless the "Check if not done" field can be compared to a completed data field using a validation check to confirm that one or the other has data. In situations where there is no other dependent or related field to gauge the completeness of the field in question, a "Yes/No" response format should be used to eliminate ambiguity. When another related field is present, the "Yes/No" response is optional. For example, when a value for temperature is missing, a simple "Not Done" box may be checked. It is not necessary to respond "Done" when a temperature value is present.	This will provide a definitive indicator that a data field has missing data and has not been overlooked. This will prevent unnecessary data queries to clarify whether an assessment has been performed. The use of the "Yes/No" format helps to eliminate ambiguity about whether an assessment has been completed.
3	Data-cleaning prompts should be used to confirm that blank CRFs are intentionally blank. Usually this will be a "Yes/No" question but it may be a "Check if blank" box, if a validation check can be used to confirm that either the "Check if blank" box is checked, or that there are data recorded in the CRF.	This will provide a definitive indicator that a CRF is blank on purpose and has not been overlooked. This will prevent unnecessary data queries.
4	The same data (i.e., the same information at the same time) should not be collected more than once.	Collecting the same data more than once creates the opportunity for discrepancies between the entered values, requires extra reconciliation, and may affect frequency counts and analysis results. For example, if the subject's birthdate or age is collected on the Demographics page, it is not necessary to collect age on the Lab CRF at every visit.
5	A "Check if ongoing" question is recommended to confirm ongoing against an end date. This is a special-use case of "Yes/No," where the data entry field may be presented as a single possible response of "Yes" in conjunction with an End Date variable. If the box is checked, the operational variable may contain "Yes". If the box is not checked and the End Date is populated, the value of the variable may be set to "No". For some EDC systems, it may be better to display the possible responses to the "Check if ongoing" question as radio buttons. Conditional logic can then be used to solicit the collection of the end date only if the answer to the "Ongoing" question is "N" (No).	For the use case of "Check if ongoing," for the data to be considered "clean," 1 of the 2 responses must be present and the other response must be blank. So, the presence of the end date provides confirmation that the event is not ongoing.
6	CRFs should use a consistent order of responses (e.g., "Yes/No") from question to question, for questions with response boxes or other standardized lists of values. Exceptions to this would be validated instruments (e.g., a standardized assessment questionnaire) where the order of responses is dictated by the instrument.	A consistent order of response boxes promotes ease of use of the CRF to help reduce data entry errors and to avoid introducing bias or leading the investigator to a desired response.
7	CRF questions and completion instructions should be unambiguous and should not "lead" the answer to the question in a particular way.	Data should be collected in a way that does not introduce bias or errors into the study data. Questions should be clear and unambiguous. This includes making sure that the options for answering the question are complete, such as providing options for "Other" and "None" when applicable.
8	CRF questions should be as self-explanatory as possible, thereby reducing the need for separate instructions. If required, short instructions may be placed on the CRF page, especially if the Prompt is not specific enough. More detailed instructions may be presented in a CRF completion guideline. All instructions should be concise. Instructions should be standardized as much as possible.	Putting short instructions and prompts on the CRF increases the probability that they will be read and followed, and can reduce the number of queries and the overall data cleaning costs. Having standard instructions supports all data collection using the same conventions for completing the fields. Providing short instructions and prompts on the CRF and moving long instructions to a separate instruction booklet, facing page, or checklist will decrease the number of CRF pages, with the following benefits: decreased CDM costs (e.g., decreased data entry costs); a CRF format in which the reader can easily identify the fields to be completed; and a less cluttered page, making it easier for individuals involved in data collection and monitoring to identify fields with missing responses.
9	Collection of dates should use an unambiguous format, such as DD-MON-YYYY, where each part of the date is a unique format: "DD" is the day as a 2-digit numeric value; "MON" is the month as a 3-character letter abbreviation in English, or similar character abbreviation or representation in the local language; and "YYYY" is the year as a 4-digit numeric value. For EDC, the user may be able to select a date from a calendar, and this would also meet the recommendation for an unambiguous date. If the recommended approach is not adaptable to the local language, a similarly unambiguous format should be used. The method for capturing date values should allow the collection of partial dates and should use a consistent method or convention for collecting the known date parts.	Using this data collection format (i.e., DD-MON-YYYY) will provide unambiguous dates. For example, the date "06/08/02" is ambiguous because it can be interpreted as June 8, 2002, or August 6, 2002. If subject-completed CRF pages are translated into a local language, the CDASH recommended date format for collection may make translation of the documents easier.
10	To eliminate ambiguity, times should be collected with the use of a 24-hour clock, using the hh:mm:ss format for recording times. Use only as many of the hh:mm:ss elements as are needed for a particular field. Individuals involved in data collection should be cautioned not to "zero-fill" time components if these are not known (for example, 21:00:00 means "exactly 9 PM"; if the number of seconds after 9 PM is unknown, seconds should not be recorded). Subject-completed times may be recorded using a 12-hour clock and an "AM" or "PM" designation. The time should then be transformed to a 24-hour clock in the database.	SDTM-based datasets use ISO 8601 date/time formats. Collecting times using a 24-hour clock eliminates both ambiguity and the need to convert values from 12-hour to 24-hour clock time.
11	Manually calculated fields should not typically be recorded within the CRF when the raw data on which the calculation is based are recorded in the CRF. An exception is when a study conduct decision should be made based on those calculations. In such cases it may be useful for the calculated field to be recorded within the CRF. It may also be useful to provide the individuals involved in data collection with a step-by-step worksheet to calculate these data.	Data items that can be calculated from other data captured within the CRF are more accurately reported if they are calculated programmatically using validated algorithms. The noted exception may be in cases where it is important to show how an investigator determined a protocol-defined endpoint from collected raw data.
12	Questions with free-text responses should be limited. Questions should be specific and clear rather than open-ended. Instead of free-text comment fields, a thorough review of the CRF by the protocol development team should be performed to maximize the use of predefined lists of responses.	The collection and processing of free text requires significant resources for data entry: It requires data management (DM) resources to review the text for safety information and for inconsistencies with other recorded data and is of limited use when analyzing and reporting data. Another risk is data entry into free-text fields that should be recorded elsewhere.
13	Subject-specific data should be collected and recorded and should not be pre-populated in the CRF/eCRF.	The CRF is a tool to collect subject-level data. However, pre-population of some identifying information (e.g., investigator name, protocol number) or timing information prevents errors and reduces data entry time. Fields on the CRF or in the database that are known to be the same for all subjects may be pre-populated (e.g., measurements for which there is only 1 possible unit, such as Pulse Rate = "beats/min"). The units can be displayed on the CRF and populated in the database.
14	The anatomical location of a measurement, position of subject, or method of measurement should be collected only if the protocol specifies the allowable options, or if the parameter is relevant to the consistency or meaning of the resulting data.	When a parameter such as location, position, or method is specified in a protocol and is part of the analysis, the CRF may include the common options for these parameters to ensure what actually happened can be reported and protocol deviations can be identified. If the parameter is pre-populated on the CRF and other options are not available, then data should not be recorded that was not collected per protocol specifications. Taking measurements in multiple anatomical locations may affect the value of the measurement and/or the ability to analyze the data in a meaningful way (e.g., when data obtained from different locations may bias or skew the analysis). In this case, collecting the location may be necessary to ensure consistent readings. For example, temperature obtained from the ear, mouth, or skin may yield different results. If there is no such rationale for collecting location, position, method, or any other value, it would be considered unnecessary data.
15	Verbatim terms for non-solicited adverse events, concomitant medications, or medical history-reported terms should be recorded. A preferred term should not be selected from a coding dictionary as a mechanism for recording data.	When information about spontaneously reported adverse events or medical history is reported, recording responses verbatim ensures that no information is omitted. Individuals involved in data collection are not expected to be coding experts and may not be familiar with the coding dictionaries used. Having these individuals select adverse events from a standardized list is the same as having them code the events.
16	A tabulation variable name should only be used as a data collection/operational variable name if the collected value will directly populate the tabulation variable with no transformation (other than changing case). Otherwise, create a "collected" version of the variable and write a standard mapping to the tabulation variable.	This practice provides clearer traceability from data collection to submission and facilitates a more automated process of transforming collected data to the standardized data tabulations for submission.
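
Recommendations 9 and 10 can be illustrated with a minimal sketch of how collected date and time values might be normalized to ISO 8601 in the database. The function names `date_to_iso` and `time_to_24h` are hypothetical, and truncating a partial date at the first unknown component is a simplifying assumption, not a convention stated above.

```python
import re
from typing import Optional

# English 3-letter month abbreviations used by the DD-MON-YYYY collection format.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def date_to_iso(day: Optional[str], mon: Optional[str], year: Optional[str]) -> str:
    """Build an ISO 8601 date from separately collected parts.

    Assumption: unknown parts arrive as None or "", and the result is
    truncated at the first unknown part (year -> month -> day).
    """
    if not year:
        return ""
    iso = f"{int(year):04d}"
    if not mon:
        return iso
    month_number = MONTHS.index(mon.upper()) + 1  # ValueError if not a known month
    iso += f"-{month_number:02d}"
    if not day:
        return iso
    return iso + f"-{int(day):02d}"


_TIME_12H = re.compile(r"\s*(\d{1,2})(?::(\d{2}))?(?::(\d{2}))?\s*([AaPp][Mm])\s*")


def time_to_24h(text: str) -> str:
    """Convert a subject-recorded 12-hour time to a 24-hour clock value.

    Only the elements that were actually recorded are kept, so unknown
    minutes or seconds are never zero-filled.
    """
    match = _TIME_12H.fullmatch(text)
    if match is None:
        raise ValueError(f"not a 12-hour time: {text!r}")
    hour = int(match.group(1))
    if not 1 <= hour <= 12:
        raise ValueError(f"hour out of range: {text!r}")
    # 12 AM maps to 00 and 12 PM stays 12; other PM hours gain 12.
    hour = hour % 12 + (12 if match.group(4).lower() == "pm" else 0)
    parts = [f"{hour:02d}"]
    parts += [p for p in (match.group(2), match.group(3)) if p is not None]
    return ":".join(parts)
```

Under these assumptions, `date_to_iso("06", "AUG", "2002")` yields `2002-08-06`, `date_to_iso(None, "AUG", "2002")` yields `2002-08`, and `time_to_24h("9:05 pm")` yields `21:05` with no seconds appended.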

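The validation checks described in recommendations 2 and 5 share one rule: of two paired fields, exactly 1 should carry data. The sketch below shows one way such a check might look. The function names, the query messages, and the "Y"/"N" coding are illustrative assumptions and do not come from any particular EDC system.

```python
from typing import Optional


def _has_data(value: Optional[str]) -> bool:
    """A field counts as populated only if it holds non-blank text."""
    return bool(value and value.strip())


def check_not_done(result: Optional[str], not_done_checked: bool) -> Optional[str]:
    """Recommendation 2: a result or the 'Not Done' box, but not both or neither.

    Returns a query message, or None when the pair is clean.
    """
    if _has_data(result) and not_done_checked:
        return "Result is present but 'Not Done' is checked."
    if not _has_data(result) and not not_done_checked:
        return "Result is missing and 'Not Done' is not checked."
    return None


def check_ongoing(end_date: Optional[str], ongoing_checked: bool) -> Optional[str]:
    """Recommendation 5: an end date or the 'Ongoing' box, but not both or neither."""
    if _has_data(end_date) and ongoing_checked:
        return "End date is present but 'Ongoing' is checked."
    if not _has_data(end_date) and not ongoing_checked:
        return "End date is missing and 'Ongoing' is not checked."
    return None


def derive_ongoing(end_date: Optional[str], ongoing_checked: bool) -> Optional[str]:
    """Derive the operational variable: 'Y' if ongoing, 'N' if an end date exists.

    Returns None when the pair is inconsistent and should be queried instead.
    """
    if check_ongoing(end_date, ongoing_checked) is not None:
        return None
    return "Y" if ongoing_checked else "N"
```

With this sketch, a temperature value with an unchecked "Not Done" box passes silently, while a blank value with an unchecked box raises a query, which matches the intent that missing data be flagged rather than overlooked.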