The aim of any assessment of diet [1,2], physical activity [4-7], or anthropometry [8,9] is to accurately estimate the true value. Even for the most accurate tool or overall method, this estimate consists of the true value plus error. Validity is the extent to which the estimated value matches the true value or, equivalently, the extent to which a method measures what it is supposed to measure.
Since we cannot know the true value with absolute certainty, the interpretation of validity cannot be reduced to the question: is this method valid or not? Instead, validity differs according to the variable of interest, study design, population, and context. Validity can vary when:
Poor validity is typically the result of systematic error, which causes the estimated value to be distorted in a particular direction away from the true value. One example would be a measurement of height of study participants with their shoes still on their feet. Their shoes would cause the consistent effect of producing values systematically greater than their true height and thus decrease the truthfulness of the resulting data.
Validity is closely linked to reliability. However, whilst reliability relates to the consistency of results, validity relates to their accuracy. It is therefore possible for a highly reliable method to have limited validity.
In the height example above, repeated measurements of height of the same individual would be the same each time and therefore reliable. However, the measurement would not be valid due to the underlying poor agreement with the true height caused by the shoes. Reliability and validity are described visually by the target example in Figure C.2.1 below.
Figure C.2.1 Relationships between reliability and validity at an individual level.
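The height example can be made concrete with a small simulation. This is only an illustrative sketch: the true height, the shoe offset, and the size of the random error are hypothetical values chosen for the illustration, not figures from any study.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# All three constants are hypothetical values for illustration only.
TRUE_HEIGHT_CM = 170.0   # the (normally unknown) true height
SHOE_OFFSET_CM = 2.5     # systematic error added by the shoes
NOISE_SD_CM = 0.1        # small random error on each reading

# Ten repeated readings of the same person, shoes on every time.
readings = [
    TRUE_HEIGHT_CM + SHOE_OFFSET_CM + random.gauss(0, NOISE_SD_CM)
    for _ in range(10)
]

# A small spread across repeats indicates high reliability.
spread = statistics.stdev(readings)
# A large gap between the mean reading and the truth indicates poor validity.
bias = statistics.mean(readings) - TRUE_HEIGHT_CM

print(f"spread of repeated readings: {spread:.2f} cm")
print(f"mean bias against true height: {bias:.2f} cm")
```

The repeated readings agree closely with each other, yet all of them sit roughly one shoe offset above the true height, which is the reliable-but-not-valid pattern described above.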
It is possible for a method to be unreliable at the individual level, but provide valid estimates at the group level using the mean, as shown in Figure C.2.2 below. Such a method would not be valid at the individual level.
Figure C.2.2 Relationships between reliability and validity at the group level. Neither reliable nor valid at the individual level, but valid at the group level because the mean of all values matches the target.
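The group-level case can be sketched in the same way. The simulation below assumes a method with large random error but no systematic error; the population mean, spread, and error size are hypothetical.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical values for illustration only.
N_PEOPLE = 2000
NOISE_SD = 15.0  # large random error per reading, no systematic error

true_values = [random.gauss(100, 10) for _ in range(N_PEOPLE)]
estimates = [t + random.gauss(0, NOISE_SD) for t in true_values]

# Individual level: how far is each estimate from that person's true value?
individual_errors = [abs(e - t) for e, t in zip(estimates, true_values)]
typical_individual_error = statistics.mean(individual_errors)

# Group level: how far is the mean estimate from the mean true value?
group_error = abs(statistics.mean(estimates) - statistics.mean(true_values))

print(f"typical individual error: {typical_individual_error:.1f}")
print(f"error of the group mean:  {group_error:.2f}")
```

Because the random errors are centred on zero, they largely cancel in the mean, so the group estimate lands close to the target even though individual estimates do not.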
Validity is a broad concept that has been defined in different ways and for different purposes. Some of the more commonly used forms of validity are described below [4].
Face validity
The degree to which a method appears to provide the desired information about the variable it is designed to measure. This is typically a more qualitative judgement which, given the multi-dimensional nature of diet, physical activity, and anthropometry, can be an important step in determining whether a method is fit for purpose.
Content validity (also known as logical validity)
The extent to which the method is considered to assess the specific aspects of the phenomenon it is intended to assess. This is important when measuring health behaviours since they can be broken down into various dimensions and domains. Similar to face validity, this is a more qualitative judgement made by considering the target variable to be measured alongside the dimensions captured by the method.
Construct validity
The extent to which a method measures the theoretical construct it is designed to measure. It is demonstrated when the method yields data as might be expected, given its intended purpose.
Criterion-related validity
The extent to which estimated values relate to those derived from a comparison or ‘criterion’ method, preferably one of very high validity and thought to provide the closest approximation of the true value, commonly referred to as a ‘gold standard’ method. For example:
Convergent validity
Like criterion-related validity, this is the extent to which estimated values match those derived from a comparison method, but one not generally accepted to be the gold standard.
A validity study can assess the extent to which a method produces estimated values which are consistent with ‘true’ values. Typically, the method being examined and another method - ideally a gold standard - are used to assess the same phenomenon, followed by evaluation of the data from each.
Validity for absolute and relative measures
The relationship between the two measures can be expressed in absolute or relative terms:
A method may not be valid for capturing absolute levels of exposure, yet still be valid for capturing relative differences between individuals in a study population. For example, food frequency questionnaires, which assess how often selected foods are consumed, are often used without assessing portion sizes. Estimates of absolute nutrient intake from such questionnaires therefore cannot be considered valid.
Even so, ranking individuals by level of nutrient intake can be valid, and such rankings can be used in a study of a lifestyle-disease association. Depending on the research question, validity for absolute measures is not always necessary.
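The distinction between absolute and relative validity can be illustrated with a short simulation. It assumes a hypothetical questionnaire that reports about 60% of true intake plus a little random error; the scaling factor, error size, and intake distribution are invented for the sketch. Ranking agreement is summarised with a rank (Spearman-type) correlation computed by hand.

```python
import random
import statistics

random.seed(2)  # fixed seed so the sketch is reproducible

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def ranks(values):
    """Rank positions (1 = smallest); ties are not handled, which is
    acceptable here because the simulated values are continuous."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

# Hypothetical true intakes and a questionnaire that under-reports them.
true_intake = [random.gauss(80, 20) for _ in range(500)]
reported = [0.6 * t + random.gauss(0, 4) for t in true_intake]

# Absolute validity: difference between mean reported and mean true intake.
mean_bias = statistics.mean(reported) - statistics.mean(true_intake)
# Relative validity: agreement in how individuals are ranked.
rank_corr = pearson(ranks(true_intake), ranks(reported))

print(f"mean bias: {mean_bias:.1f}")
print(f"rank correlation: {rank_corr:.2f}")
```

Under these assumptions the reported values are far too low in absolute terms, but individuals are ordered almost as they would be by their true intake.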
Absolute validity can be separated according to whether the interpretations are to be made about groups or individuals.
What if no gold standard is available?
There are often circumstances in which no gold standard method is available for use as the criterion [10]. This may be because:
In such instances, the validity of a method can only be estimated by comparing its data with those of another method with known systematic errors and biases [11]. This type of comparison indicates convergent validity.
When no gold standard method is available, it is desirable that the comparison method relies on a different type of measurement, in order to avoid introducing correlated errors. For example, comparing a 24-hour dietary recall with an estimated food diary carries the risk that both methods under-report in a similar way, so that their errors are correlated.
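A brief simulation shows why correlated errors matter. It assumes two hypothetical self-report methods that share the same person-specific reporting error; all distributions and error sizes are invented for illustration.

```python
import random
import statistics

random.seed(3)  # fixed seed so the sketch is reproducible

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

N_PEOPLE = 1000

# Hypothetical true intakes and a reporting error shared by both methods
# (e.g. a person's general tendency to misreport).
true_intake = [random.gauss(2000, 300) for _ in range(N_PEOPLE)]
shared_error = [random.gauss(0, 400) for _ in range(N_PEOPLE)]

# Each method adds the shared error plus a small error of its own.
recall = [t + s + random.gauss(0, 100) for t, s in zip(true_intake, shared_error)]
diary = [t + s + random.gauss(0, 100) for t, s in zip(true_intake, shared_error)]

agreement_between_methods = pearson(recall, diary)
agreement_with_truth = pearson(recall, true_intake)

print(f"recall vs diary: {agreement_between_methods:.2f}")
print(f"recall vs truth: {agreement_with_truth:.2f}")
```

The two methods agree strongly with each other because they share the same error, while the recall's agreement with the true intake is much weaker. Comparing the two methods alone would therefore overstate validity.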
Even if the validity of a tool has been assessed through comparison with a gold standard, it should not be assumed that it is appropriate for use in every research scenario. In practice this is rarely the case; validity is tied to the overall method, plus the intended purpose, population, and context where it is applied.
The following should be considered when assessing the validity of a method to measure any aspect of diet, physical activity or anthropometry:
Internal and external validity
The sample used in a validation study, or in any other type of study, should be reviewed to ascertain whether the results are likely to be generalisable to other populations or contexts. This is known as external validity. Sample characteristics such as age, sex, ethnic origin, and socio-economic status may all limit generalisability. For example, a physical activity questionnaire that is valid for adults may not be suitable for use in a youth population.
In contrast, internal validity is the extent to which the study or estimate is free from bias or systematic error – i.e. the appropriateness and rigour of the study design, data collection protocols, and/or analysis.
Face and content validity
Another important consideration should be whether the criterion used to evaluate a method would be suitable for use in answering your research question. For example, validity reported when compared to doubly labelled water (gold standard estimate of overall energy expenditure), would not be sufficient evidence to support the use of a questionnaire to estimate subcategories of activity such as active commuting. A method with acceptable validity for one dimension of behaviour may not be relevant or generalisable to another dimension.
Suitability for study design and research question
It is very important to recognise that the degree of validity of a method may be more or less acceptable for studies designed for different purposes. Table C.2.1 illustrates different validity for different outcomes assuming use of a ‘gold standard’ method, such as:
Table C.2.1 illustrates that even if a perfect method is used, the validity of that method varies with its application.
Table C.2.1 Theoretical validity of a ‘gold standard’ measurement by exposure type.
Number of assessments per person (number of participants) | Once (n = 5) | 1000 times* (n = 5) | Once (n = 50,000)† | 1000 times* (n = 50,000)† |
---|---|---|---|---|
Internal validity | ||||
Exposure on a specific day of each person | Valid‡ | Valid‡ | Valid‡ | Valid‡ |
Habitual exposure* of each person | ? | Valid‡ | ? | Valid‡ |
External validity | ||||
Average habitual exposure* of the population | ? | ? | Valid‡ | Valid‡ |
Variation of habitual exposure* of the population | ? | ? | ? | Valid‡ |
% of the population meeting a certain public guideline or clinical cut-off | ? | ? | ? | Valid‡ |
* Assumed to be sufficient to represent a habitual condition over a long period in a person.
† Assumed to be sufficient to represent the source population.
‡ Assumed to have no change in participant’s characteristics in response to each measurement and to have no errors in measurement, processing, and analysis.
For example, gold standard measures of 24-hour calorimetry in 50,000 people can capture the energy expenditure of a specific day for each individual. In addition, even though energy expenditure varies over time, the average of the 50,000 measures can be a valid estimate of the average habitual energy expenditure of the source population.
However, those 50,000 single measures do not provide a valid measure of the variability of habitual energy expenditure between individuals. This is because variability estimated from one measurement per person mixes between-person variability with within-person variability (a matter of reliability), which precludes studying between-person variability on its own. If there is little or no within-person variability in a measurement (e.g. knee height), measuring many individuals just once does allow inference about between-individual variability.
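The mixing of between-person and within-person variability can be sketched as follows. The simulation assumes an error-free daily measurement, as in Table C.2.1, with hypothetical values for the between-person and day-to-day standard deviations. It uses fewer people and repeats than the table so that it runs quickly.

```python
import random
import statistics

random.seed(4)  # fixed seed so the sketch is reproducible

# Hypothetical values for illustration only.
N_PEOPLE = 2000
N_REPEATS = 200
BETWEEN_SD = 200.0   # spread of habitual expenditure across people
WITHIN_SD = 300.0    # day-to-day variation within one person

habitual = [random.gauss(2500, BETWEEN_SD) for _ in range(N_PEOPLE)]

def measure_one_day(habitual_value):
    """A perfect measurement of one day: habitual level plus that day's deviation."""
    return habitual_value + random.gauss(0, WITHIN_SD)

# Design A: each person measured once.
single = [measure_one_day(h) for h in habitual]

# Design B: each person measured on many days, then averaged.
repeated_means = [
    sum(measure_one_day(h) for _ in range(N_REPEATS)) / N_REPEATS
    for h in habitual
]

var_true = statistics.variance(habitual)
var_single = statistics.variance(single)
var_repeated = statistics.variance(repeated_means)

print(f"true between-person variance:   {var_true:.0f}")
print(f"variance from single measures:  {var_single:.0f}")
print(f"variance from repeated means:   {var_repeated:.0f}")
```

With one measurement per person, the observed variance is approximately the between-person variance plus the within-person variance, so it overstates how much people differ in their habitual level. Averaging many repeats per person shrinks the within-person component and brings the estimate close to the true between-person variance.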