Measurement Toolkit - Data processing

Contents

Introduction
Data cleaning
Outliers
Transformation
Missing data
References

Introduction

Data on diet, physical activity, and anthropometry are not always suitable for use in their raw state. Data processing refers to the series of steps used to derive analysis-ready variables from the raw measurements.

Each of the diet, physical activity, and anthropometry domains involves its own data processing steps, for example converting food consumption to nutrient intakes, heart rate to energy expenditure, and x-ray absorption to fat mass, as discussed on the relevant pages. This section describes the following more general topics in data processing:

Data cleaning
Outliers
Transformation
Missing data

These steps are usually applied after standard data processing has been undertaken, and the considerations are common across assessments of diet, physical activity, and anthropometry. Each should be considered as an option depending on the research aim, the study design, the availability of data, and biological or statistical plausibility.

Data cleaning

Data cleaning involves identifying invalid or incorrect records within the dataset, for example:

Reporting errors, e.g. minutes in hours column in questionnaire data
Incomplete data, e.g. providing the weight and type of food but not frequency with which it is consumed
Illegible data
Biologically implausible data, e.g. extremely high acceleration or energy consumption
Data that are ambiguous or difficult to interpret, e.g. malfunction of an objective measure, use of 12-hour clock, or overlapping diary entries
Data entry errors, e.g. mistakes in manual input of questionnaire responses to an electronic database

In some cases, it may be possible to correct data or contact the participant to clarify details.
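
The checks listed above can be partly automated by flagging records for review. The following is a minimal sketch, assuming a hypothetical walking questionnaire; the field names (walk_hours, walk_minutes, days_per_week) and the limits are illustrative assumptions, not part of any specific instrument.

```python
# Hypothetical questionnaire records; field names and limits are illustrative only.
records = [
    {"id": 1, "walk_hours": 1, "walk_minutes": 30, "days_per_week": 5},
    {"id": 2, "walk_hours": 45, "walk_minutes": 0, "days_per_week": 3},     # minutes in hours column?
    {"id": 3, "walk_hours": 0, "walk_minutes": 20, "days_per_week": None},  # frequency not reported
    {"id": 4, "walk_hours": 2, "walk_minutes": 75, "days_per_week": 7},     # minutes outside 0-59
]


def cleaning_flags(record):
    """Return the reasons a record needs review (an empty list if none)."""
    flags = []
    if record["walk_hours"] > 24:
        flags.append("more than 24 hours per day")
    if not 0 <= record["walk_minutes"] <= 59:
        flags.append("minutes outside 0-59")
    if record["days_per_week"] is None:
        flags.append("frequency missing")
    return flags


# Keep only records with at least one flag, keyed by participant id.
flagged = {}
for record in records:
    flags = cleaning_flags(record)
    if flags:
        flagged[record["id"]] = flags

for participant_id, flags in flagged.items():
    print(participant_id, flags)
```

Flagging rather than deleting keeps the original record available, so that it can be corrected or queried with the participant where that is possible.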

Outliers

An outlier is an extreme observation that deviates markedly from the other observations in a defined sample. Outliers can have a large effect on statistical analysis, and the size of this effect depends on how many outliers there are. Some statistics, such as the mean and least squares regression, are particularly vulnerable to the effects of outliers [1].

Identifying outliers depends upon the underlying distribution of the data, and should therefore begin with inspecting this distribution. Outliers can be identified visually using plots such as a scatterplot, histogram, or box plot. Statistical thresholds can also be used; for example, data points three or more standard deviations from the mean can be flagged.
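
The standard deviation threshold can be sketched as follows. This is an illustrative example only: the energy intake values are made up, and the function name flag_outliers and the default of three standard deviations are assumptions chosen to match the text.

```python
import statistics


def flag_outliers(values, n_sd=3.0):
    """Return values lying n_sd or more sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    return [v for v in values if abs(v - mean) >= n_sd * sd]


# Made-up daily energy intakes (kcal/day): 20 typical values and one extreme value.
energy_kcal = [1800, 1900, 2000, 2100, 2200] * 4 + [9000]
outliers = flag_outliers(energy_kcal)
print(outliers)
```

Because an extreme value inflates both the mean and the standard deviation, this rule can fail to flag outliers in small samples, which is one reason to inspect the distribution visually as well.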

An outlier may indicate an error in the data but may also be a legitimate extreme case sampled from the population by chance. Outliers must therefore be carefully investigated to understand their cause(s) and to decide what should be done about them. Decisions must be biologically or statistically plausible. The aim is to produce results that are not affected by a few outliers. Methods that can be used include truncation, winsorization, or transformation of the data.

Truncation

Truncation removes values above or below an absolute threshold (e.g. a fixed energy intake in kcal/day) or a relative threshold (e.g. the mean ± three standard deviations).

The drawback of this method is that potentially valuable data are lost; as such the sample may not be fully representative of the population of interest. The choice to truncate data and the threshold to use should be considered carefully.
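
A minimal sketch of truncation with absolute thresholds is shown below. The limits of 500 and 5000 kcal/day are hypothetical values used only for illustration; the function also reports how many observations were removed, since that loss is the main drawback of the method.

```python
def truncate(values, lower=None, upper=None):
    """Drop values outside [lower, upper]; return the kept values and the number removed."""
    kept = [
        v for v in values
        if (lower is None or v >= lower) and (upper is None or v <= upper)
    ]
    return kept, len(values) - len(kept)


# Made-up daily energy intakes (kcal/day) and hypothetical plausibility limits.
energy_kcal = [350, 1800, 2100, 2500, 7200]
kept, n_removed = truncate(energy_kcal, lower=500, upper=5000)
print(kept, n_removed)
```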

Winsorization

Winsorization involves recoding extreme values to the nearest ‘reasonable’ values (either minimum or maximum). For example, when using the International Physical Activity Questionnaire (IPAQ) [2] in a population where a sedentary lifestyle is a concern, values for walking exceeding three hours per day may appear to be outliers. Those outlying values can be winsorized to three hours per day, permitting a maximum of 21 hours of walking per week.

Winsorization can reduce the effect of outliers without removing individual cases from the dataset. It also maintains the relative ordering of scores: the highest and lowest scores remain the highest and lowest, although ties are created at the cut points. This limits the harm to statistical inference.

The choice of the highest or lowest ‘reasonable’ value is important and should be justified. For example, cut points may reflect their clinical meaning, pilot data, biological plausibility, or statistical plausibility. Cut points can also be based on percentiles.
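
The walking example can be sketched as below, assuming walking time is recorded in minutes per day so that three hours corresponds to an upper limit of 180 minutes. The data and the function name winsorize are illustrative.

```python
def winsorize(values, lower=None, upper=None):
    """Recode values below lower to lower and values above upper to upper."""
    result = []
    for v in values:
        if lower is not None and v < lower:
            v = lower
        if upper is not None and v > upper:
            v = upper
        result.append(v)
    return result


# Made-up walking times (minutes/day); values above 3 hours are capped at 180 minutes.
walking_min_per_day = [20, 45, 90, 240, 600]
capped = winsorize(walking_min_per_day, upper=180)
max_hours_per_week = max(capped) * 7 / 60
print(capped, max_hours_per_week)
```

No cases are dropped, and the two largest values remain the largest after recoding, although they are now tied.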

Transformation

Diet, physical activity, and anthropometric data often have skewed distributions, mostly because they are naturally bounded at zero and have no upper limit. Transformation of data may be required for statistical analysis and modelling, and may also reduce the effect of outliers.

Transformation has the advantages of keeping all values in the dataset and keeping the relative ranking of scores. Common types of transformation include:

Log
Square root
Reciprocal
Reverse score
Box-Cox power transformation

Box-Cox transformation is a type of power transformation used to bring the distribution of a variable close to normality. A positive variable x is transformed to (x^k - 1)/k, where the power k is selected so that the transformed variable has skewness as close to zero as possible; as k approaches zero, the transformation approaches the natural logarithm of x. This transformation is useful for obtaining an approximately normally distributed variable, but the transformed variable no longer has the original units and is harder to interpret. In practice, the variable can be standardised after Box-Cox transformation so that 1 unit represents 1 standard deviation.
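
The selection of k can be sketched with a simple grid search that picks the power giving the skewness closest to zero. This is an illustrative sketch rather than a definitive implementation: the grid from -2 to 2 in steps of 0.1, the function names, and the data are all assumptions, and statistical software may select k by other criteria.

```python
import math


def box_cox(values, k):
    """Box-Cox transform of positive values; k == 0 is the natural log (limiting case)."""
    if k == 0:
        return [math.log(x) for x in values]
    return [(x ** k - 1) / k for x in values]


def skewness(values):
    """Moment-based skewness: third central moment over the 1.5 power of the second."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def choose_k(values, grid=None):
    """Return the power on the grid whose transform has skewness closest to zero."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]  # -2.0, -1.9, ..., 2.0
    return min(grid, key=lambda k: abs(skewness(box_cox(values, k))))


# Made-up right-skewed data, sorted in ascending order.
raw = [1, 2, 2, 3, 3, 3, 4, 5, 8, 20]
k = choose_k(raw)
transformed = box_cox(raw, k)
print(k, skewness(raw), skewness(transformed))
```

The transformed values keep the same ordering as the raw values, so the relative ranking of participants is unchanged.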

Transformation always alters the meaning of one unit of the target variable, but the transformed variable does not necessarily become difficult to interpret. For example, if a variable is transformed with a common logarithm (log₁₀), a one-unit difference in the transformed variable corresponds to a 10-fold difference in the original variable (i.e. log₁₀10 = 1, log₁₀100 = 2, log₁₀1000 = 3, etc.).
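
The arithmetic of the common logarithm example can be checked with a few lines. The values and the helper name fold_difference are illustrative assumptions.

```python
import math

# Values that are 10-fold apart differ by exactly one unit on the log10 scale.
intakes = [10, 100, 1000]
log_intakes = [math.log10(x) for x in intakes]


def fold_difference(log10_difference):
    """Convert a difference on the log10 scale back to a ratio on the original scale."""
    return 10 ** log10_difference


print(log_intakes)
print(fold_difference(1))    # one log10 unit
print(fold_difference(0.3))  # roughly a two-fold difference
```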

Categorisation

Depending on the aim of the study, a continuous variable can be categorised to divide the study population into groups. This reduces the influence of the shape of the distribution and of outliers. Categorisation can be based on:

Cut-points based on clinical practice or public health: e.g. body mass index (BMI) of 25 and 30 kg/m² to identify overweight and obese adults; physical activity ≥150 or <150 min/week to identify adults meeting the physical activity recommendation in the United Kingdom
Cut-points determined by the data: e.g. four groups based on quartiles (25th, 50th, and 75th percentile values)
Cut-points based on both clinical or public health implications and the data: e.g. five categories of alcohol consumption: non-drinkers (0 grams/day), former drinkers (0 grams/day), and three groups among drinkers based on tertiles (33rd and 67th percentiles)

Categorisation facilitates statistical analysis and interpretation. However, categorising a continuous variable always discards the differences between individuals within the same category. For instance, if BMI is categorised as above, individuals with a BMI of 30 kg/m² and a BMI of 40 kg/m² are treated as the same.
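
Both fixed and data-driven cut-points can be sketched as follows. The BMI values are made up, and the convention that a value equal to a cut-point falls in the higher category is an assumption of this sketch; quartile cut-points also depend on the percentile method used by the software.

```python
import bisect
import statistics

BMI_CUT_POINTS = [25, 30]               # kg/m², fixed cut-points from the text
BMI_LABELS = ["<25", "25-29.9", ">=30"]


def bmi_category(bmi):
    """Assign a BMI category; a value equal to a cut-point goes to the higher category."""
    return BMI_LABELS[bisect.bisect_right(BMI_CUT_POINTS, bmi)]


def quartile_groups(values):
    """Assign each value to group 1-4 using the quartiles of the data itself."""
    cut_points = statistics.quantiles(values, n=4)  # 25th, 50th, 75th percentiles
    return [bisect.bisect_right(cut_points, v) + 1 for v in values]


# Made-up BMI values (kg/m²).
bmis = [19.5, 21.0, 23.4, 24.9, 26.2, 28.7, 31.0, 40.0]
print([bmi_category(b) for b in bmis])
print(quartile_groups(bmis))
```

Note that bmi_category(30) and bmi_category(40) return the same label, which illustrates the loss of within-category information described above.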

Missing data

Missing data represent a potential source of bias in population health research. Research that is carefully designed and conducted can minimise the potential for missing data. It may be useful to consider whether the method to be used is likely to result in missing data in the population of interest. For example:

Self-administered diet or physical activity questionnaires may result in missing data because of limited literacy, cognitive performance, or interest in responding to questions about diet or physical activity.
Anthropometric measurements can result in missing information for morbidly obese individuals whose fat mass cannot be assessed for technical reasons.

Identifying the subgroups most likely to have missing data, and collecting additional information describing these groups, can help to address missing data that do occur.

Types of missing data

Despite careful design, missing data often occur. The mechanism by which data become missing must be accounted for when the missing information is handled. In the statistics and epidemiology literature [3], missing data are categorised as:

Missing completely at random – missingness is not related to the values of the dataset (either missing or observed)
Missing at random – missingness is related to observed values (e.g. baseline characteristic) but not the missing unobserved values
Missing not at random – missingness is related to the missing unobserved values
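
The three mechanisms can be made concrete with a small simulation in which fat mass is the variable that may be missing and BMI is always observed. This is a hypothetical sketch: the probabilities, thresholds, data, and function names are all assumptions chosen for illustration.

```python
import random


def p_missing_mcar(bmi, fat_mass):
    """Missing completely at random: the probability ignores all values."""
    return 0.2


def p_missing_mar(bmi, fat_mass):
    """Missing at random: the probability depends only on the observed BMI."""
    return 0.6 if bmi >= 40 else 0.1


def p_missing_mnar(bmi, fat_mass):
    """Missing not at random: the probability depends on the unobserved fat mass itself."""
    return 0.6 if fat_mass >= 50 else 0.1


def apply_missingness(people, p_missing, seed=0):
    """Return fat mass values with None where the value is (randomly) made missing."""
    rng = random.Random(seed)
    return [
        None if rng.random() < p_missing(bmi, fat_mass) else fat_mass
        for bmi, fat_mass in people
    ]


# Made-up (BMI in kg/m², fat mass in kg) pairs.
people = [(22.0, 15.0), (27.5, 24.0), (31.0, 33.0), (42.0, 55.0), (45.5, 62.0)]
observed_mar = apply_missingness(people, p_missing_mar)
print(observed_mar)
```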

Dealing with missing data

The type of missing data may affect the chosen method for dealing with missing data [3,4]. Broadly, the principal options are:

Ignore the missing data and analyse only observed values (complete-case analysis)
Replace unobserved values with estimated values and analyse them as if they were observed (imputation analysis)

Imputation can be performed on a case-by-case basis. For example:

Missing information on plasma vitamin C levels of some adults was due to the vitamin C assay giving no detectable signal. Excluding these individuals because of the missing information would remove the lowest end of the vitamin C distribution, which would be invalid. Instead, the detection limit (or, for instance, half of the detection limit) could be imputed.
In an exercise trial measuring maximal oxygen uptake at three time points (baseline, 4 weeks, and 8 weeks), some participants did not attend the final visit but consented to the use of data collected up to 4 weeks. The endpoint results at 8 weeks may be imputed with the results at 4 weeks (‘last observation carried forward’). With this imputation, all participants are kept in the analysis, which can then be described as an ‘intention to treat’ analysis.
Imputation of the mean or median of a continuous variable is generally discouraged, because this shrinks the variability of the variable.
For categorical variables, a missing category is often used. For missing information on BMI, for example, four categories may be used: <25 kg/m², 25-29.9 kg/m², ≥30 kg/m², and ‘unknown’. However, this is generally not recommended because it often discards continuous information.
Missing information can be imputed with regression analysis. For example, if waist circumference were missing for 5% of a study population but BMI were available for all, waist circumference could be imputed from a linear regression model predicting waist circumference from BMI and other available variables (e.g. age, sex).
When imputed values are uncertain (e.g. regression-based imputation), the imputation should be repeated. Repeated imputation produces multiple datasets with different imputed values, which is called ‘multiple imputation’. The overall inference should account for the variability between the multiple datasets.
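
Three of the approaches above (detection-limit imputation, last observation carried forward, and repeated regression-based imputation) can be sketched as follows. This is a simplified illustration under stated assumptions: the data, function names, and the choice of five imputations are hypothetical; the regression uses BMI only; and the imputation adds residual noise without also reflecting uncertainty in the regression coefficients, which a full multiple imputation procedure would do. The pooling step follows Rubin's rules, combining within-imputation and between-imputation variance.

```python
import random
import statistics


def impute_detection_limit(values, limit, fraction=0.5):
    """Replace undetectable (None) assay results with a fraction of the detection limit."""
    return [limit * fraction if v is None else v for v in values]


def locf(visits):
    """Last observation carried forward across one participant's ordered visits."""
    filled, last = [], None
    for value in visits:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fit_line(x, y):
    """Least squares fit of y = intercept + slope * x; also returns the residual SD."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    return intercept, slope, (rss / (len(x) - 2)) ** 0.5


def multiply_impute(bmi, waist, m=5, seed=0):
    """Create m completed datasets: prediction from BMI plus random residual noise."""
    rng = random.Random(seed)
    observed = [(b, w) for b, w in zip(bmi, waist) if w is not None]
    intercept, slope, resid_sd = fit_line([b for b, _ in observed],
                                          [w for _, w in observed])
    return [
        [w if w is not None else intercept + slope * b + rng.gauss(0, resid_sd)
         for b, w in zip(bmi, waist)]
        for _ in range(m)
    ]


def pool(estimates, variances):
    """Rubin's rules: pooled estimate and total variance across imputed datasets."""
    m = len(estimates)
    within = statistics.mean(variances)
    between = statistics.variance(estimates)
    return statistics.mean(estimates), within + (1 + 1 / m) * between


# Made-up data: BMI (kg/m²) observed for all, waist circumference (cm) missing for two.
bmi = [21.0, 23.5, 25.0, 27.2, 29.8, 31.5, 33.0, 36.4]
waist = [74.0, 80.5, None, 92.0, 97.5, None, 108.0, 115.5]
datasets = multiply_impute(bmi, waist)
means = [statistics.mean(d) for d in datasets]
variances_of_means = [statistics.variance(d) / len(d) for d in datasets]
pooled_mean, pooled_variance = pool(means, variances_of_means)
print(pooled_mean, pooled_variance)
```

The pooled variance is never smaller than the average within-dataset variance, which is how the inference reflects the extra uncertainty introduced by imputing.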

References

1. Osborne JW, Overbay A. The power of outliers (and why researchers should always check for them). Pract Assess Res Eval. 2004;9(6):1-12.
2. Craig CL, Marshall AL, Sjöström M, Bauman AE, Booth ML, Ainsworth BE, Pratt M, Ekelund U, Yngve A, Sallis JF, et al. International physical activity questionnaire: 12-country reliability and validity. Med Sci Sports Exerc. 2003;35:1381-95.
3. Sterne JA, White IR, Carlin JB, Spratt M, Royston P, Kenward MG, Wood AM, Carpenter JR. Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls. BMJ. 2009;338:b2393.
4. White IR, Horton NJ, Carpenter J, Pocock SJ. Strategy for intention to treat analysis in randomised trials with missing outcome data. BMJ. 2011;342:d40.