Reliability is the degree to which a method provides estimates that are stable or consistent, as opposed to erratic or variable. It describes the extent to which a method is able to yield reproducible data under the various conditions or contexts for which it has been designed.
The following terms have often been used interchangeably with reliability in the literature [1,2]:
A good rule of thumb is to describe in sufficient detail what is meant by each term in the context in which it is used. Reliability is decreased by measurement error, most commonly random error, which causes estimated values to vary around the true value in an unpredictable way. Such error can arise from chance differences in the method, the researcher, or the participant. Poor reliability attenuates observed associations between exposure and outcome variables, which can conceal true relationships between behaviour and disease [3,4]. It also leads to misclassification when data are categorised. Reliability is a concern for any method and determines the upper limit of its validity.
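To illustrate how random error attenuates observed associations, the Python sketch below (all variable names, effect sizes, and error magnitudes are hypothetical) simulates a true exposure–outcome relationship and shows that adding random error to the exposure weakens the observed correlation.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

# Hypothetical "true" exposure (e.g. habitual energy intake, kcal/day) and an
# outcome that depends linearly on it plus unrelated biological variation.
true_exposure = rng.normal(loc=2000, scale=300, size=n)
outcome = 0.01 * true_exposure + rng.normal(scale=5, size=n)

# The measured exposure adds random measurement error to the true exposure.
measured_exposure = true_exposure + rng.normal(scale=300, size=n)

print(np.corrcoef(true_exposure, outcome)[0, 1])      # association with the true exposure
print(np.corrcoef(measured_exposure, outcome)[0, 1])  # attenuated observed association
```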
Reliability is closely linked to validity; however, while validity relates to the accuracy of a method, reliability relates to its consistency. It is therefore possible for a method with poor validity to be very reliable. For example, replicate measurements using the same faulty tape measure might give very similar, and therefore reliable, estimates of length, but these would not be valid because of poor agreement with the true length.
The relationship between reliability and validity is illustrated by the target diagrams in Figure C.3.1; the faulty tape measure example above corresponds to the ‘reliable but not valid’ target.
Figure C.3.1 The relationship between reliability and validity at an individual level.
It is possible for a method to be unreliable at an individual level but still provide valid estimates at the group level when the mean is used, as shown in Figure C.3.2 below. Such a method would, however, not be valid at an individual level.
Figure C.3.2 Relationships between reliability and validity at group level. Not reliable and not valid at individual level, but valid at group level using mean of all values.
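As a rough numerical illustration of Figure C.3.2 (all values assumed), the Python sketch below simulates a measurement with large random error for each individual: individual estimates are unreliable, but the group mean remains close to the true group mean.

```python
import numpy as np

rng = np.random.default_rng(0)
n_people = 500

true_values = rng.normal(loc=60, scale=10, size=n_people)    # e.g. true minutes of activity/day
measured = true_values + rng.normal(scale=20, size=n_people)  # large random error per person

# Individual-level error is substantial...
print(np.mean(np.abs(measured - true_values)))   # mean absolute error per individual
# ...but random errors largely cancel out at the group level.
print(true_values.mean(), measured.mean())       # group means are very close
```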
Test-retest reliability (also known as stability)
The extent to which a method produces consistent data under similar conditions across multiple time points. For example, fat mass assessed with an electronic device may vary due to random errors resulting from differences in calibration. To assess how reliable the test is, replicate assessments can be undertaken in exactly the same setting.
Demonstrating this type of reliability may be problematic in the assessment of diet and physical activity due to the relatively high intra-individual variation of these variables. Some methods are likely to have particularly high day-to-day variation, e.g. 24-hour dietary or physical activity recall. Consideration should be given to the time between assessments when assessing test-retest reliability.
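A minimal sketch (hypothetical fat-mass data, Python) of quantifying test-retest reliability from two replicate assessments, here using a one-way intraclass correlation coefficient computed from variance components:

```python
import numpy as np

def icc_oneway(x: np.ndarray) -> float:
    """One-way random-effects ICC(1,1) for an (n_subjects, k_occasions) array."""
    n, k = x.shape
    grand_mean = x.mean()
    subject_means = x.mean(axis=1)
    msb = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)    # between-subject mean square
    msw = np.sum((x - subject_means[:, None]) ** 2) / (n * (k - 1))  # within-subject mean square
    return (msb - msw) / (msb + (k - 1) * msw)

# Hypothetical fat-mass (kg) readings taken on two occasions under similar conditions.
rng = np.random.default_rng(1)
true_fat_mass = rng.normal(25, 5, size=100)
occasion_1 = true_fat_mass + rng.normal(0, 1.0, size=100)
occasion_2 = true_fat_mass + rng.normal(0, 1.0, size=100)

data = np.column_stack([occasion_1, occasion_2])
print(f"Test-retest ICC: {icc_oneway(data):.2f}")
```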
Internal consistency reliability
Inter-rater reliability (also known as inter-rater agreement or concordance)
The similarity between data from different observers of the same phenomenon, for example, the concordance of:
The degree of inter-rater reliability is particularly important when using methods that require a certain level of interpretation or observation by investigators. High inter-rater reliability indicates a good level of standardisation, whereas low inter-rater reliability indicates a high level of measurement error due to variability in subjective assessment.
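As an illustrative sketch (the ratings are hypothetical), agreement between two observers assigning the same categorical labels can be summarised with Cohen's kappa, which adjusts observed agreement for agreement expected by chance:

```python
import numpy as np

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning the same categorical labels."""
    rater_a, rater_b = np.asarray(rater_a), np.asarray(rater_b)
    categories = np.union1d(rater_a, rater_b)
    observed_agreement = np.mean(rater_a == rater_b)
    # Chance agreement: product of each rater's marginal proportions, summed over categories.
    expected_agreement = sum(
        np.mean(rater_a == c) * np.mean(rater_b == c) for c in categories
    )
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)

# Hypothetical example: two observers classify the same 12 activity bouts
# as "sed" (sedentary), "light" or "mvpa" (moderate-to-vigorous).
rater_1 = ["sed", "light", "mvpa", "sed", "light", "light", "sed", "mvpa", "light", "sed", "sed", "mvpa"]
rater_2 = ["sed", "light", "light", "sed", "light", "mvpa", "sed", "mvpa", "light", "sed", "light", "mvpa"]
print(f"Cohen's kappa: {cohens_kappa(rater_1, rater_2):.2f}")
```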
Assessing reliability when measuring highly variable data
The choice of an appropriate method should always involve consideration of its ability to provide reliable data, and this can be tested in a reliability study. Reliability is relatively straightforward to assess for variables that remain stable over long periods, such as psychological traits, social attitudes, or the height of an adult.
Diet and physical activity, on the other hand, are inherently variable on a day-to-day basis. There are consequently two main sources of variability which affect the reliability of a measurement:
It can be difficult or even impossible to distinguish between these components. An assessment of reliability relates the degree of measurement error to the underlying true variability in the variable:
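In general terms (a standard formulation, not necessarily the exact expression intended here), reliability can be written as the proportion of total observed variance that reflects true between-subject differences, i.e. the intraclass correlation coefficient:

```latex
% Reliability as the share of total variance attributable to true
% between-subject differences rather than measurement error.
\mathrm{Reliability} \;=\;
  \frac{\sigma^{2}_{\text{between-subject}}}
       {\sigma^{2}_{\text{between-subject}} + \sigma^{2}_{\text{measurement error}}}
```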
Reliability for absolute or relative measures
Absolute changes in an individual’s behaviour or characteristics over time reduce reliability. However, even if the absolute values of individuals’ characteristics change over time and are not reproducible, the ranking of, or relative differences between, individuals can still be reproducible and reliable (see the sketch after Figure C.3.3).
For example, growth curves are used to monitor the growth of a child or children:
Alternatively, physical activity may be assessed repeatedly throughout a six-month period:
Figure C.3.3 Hypothetical example of the absolute and relative reliability of a physical activity measure between winter and summer.
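A rough sketch of this distinction (all numbers assumed, Python): if every individual's activity drops in winter, absolute agreement between winter and summer measurements is poor, yet the ranking of individuals, and hence relative reliability, remains high.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(7)
n = 200

summer = rng.normal(60, 15, size=n)              # minutes of activity per day in summer
winter = summer - 20 + rng.normal(0, 3, size=n)  # systematic seasonal drop plus small noise

rho, _ = spearmanr(summer, winter)
print(f"Mean summer-winter difference: {np.mean(summer - winter):.1f} min/day (poor absolute agreement)")
print(f"Spearman rank correlation: {rho:.2f} (high relative reliability)")
```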
Reliability in a multi-centre study
Reliability is a loosely used term in population-health sciences, owing to a lack of clarity about the repeatability of an instrument versus the variability of the target characteristic over time. Ideally, a study should begin with a clear design in which the different types of reliability of a method are accounted for, together with a statistical plan to assess them.
The level of method reliability that is required or desirable can vary by study. As such, it is essential to consider wider factors when evaluating reliability, such as:
High reliability does not necessarily mean high validity
Reliability affected by multiple non-biological effects
Figure C.3.4 Four scenarios demonstrating non-biological effects on the reliability of two repeated measures.
In the four different laboratories described in Figure C.3.4:
Effects detrimental to reliability can be reduced by the proper application of quality control measures. For example, unwanted effects of blood storage can be corrected for by quantifying the effect and calibrating observed values accordingly.
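As a hedged illustration (the storage effect, analyte, and coefficients are hypothetical), a drift estimated from quality-control samples can be used to recalibrate observed values back to their original scale:

```python
import numpy as np

# Hypothetical: quality-control blood samples re-assayed after storage show a
# systematic drift that can be estimated and then removed from study samples.
months_stored = np.array([0, 3, 6, 9, 12])
qc_measured   = np.array([5.00, 4.88, 4.79, 4.66, 4.55])  # analyte concentration (mmol/L)

# Estimate the storage effect with a simple linear fit (drift per month stored).
slope, intercept = np.polyfit(months_stored, qc_measured, deg=1)

def calibrate(observed, months):
    """Correct an observed value for the estimated storage drift."""
    return observed - slope * months

print(calibrate(observed=4.60, months=10))  # value adjusted back to its 'fresh' scale
```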
The reliability of a diet, physical activity, or anthropometric method depends on the:
In order to increase the reliability of an assessment, the sources and types of error must be identified. Random error may be reduced by:
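For instance, taking the mean of several replicate measurements reduces random error approximately in proportion to 1/√n, as the Python sketch below (assumed values) illustrates:

```python
import numpy as np

rng = np.random.default_rng(3)
true_value = 70.0   # e.g. a person's true weight in kg (hypothetical)
error_sd = 1.5      # standard deviation of random measurement error (hypothetical)

for n_replicates in (1, 4, 16):
    # Simulate many assessment days, each averaging n replicate measurements.
    measurements = true_value + rng.normal(0, error_sd, size=(100_000, n_replicates))
    averaged = measurements.mean(axis=1)
    print(f"n={n_replicates:2d}: SD of averaged estimate = {averaged.std():.2f} kg")
```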