Introduction
In multivariate quantitative research there is always a margin for error that is unrelated to the design of the study, its conduct, or the strategies implemented to prevent mistakes. Such errors slip through the system and can obscure the results and lead to wrong conclusions (Cramer, 2016). While it is impossible to eliminate every error that could affect the research, several data-cleaning strategies can correct these errors or minimize their negative impact on the study. The purpose of this paper is to describe several data-cleaning methods and strategies, outline the most common error types, review error deletion and correction rates, and identify differences in research outcomes, using a multivariate research article as an example.
Article Summary: Multivariate Statistical Analysis of Cigarette Design Feature Influence on ISO TNCO Yields
The study I chose to analyze in this paper is a multivariate statistical analysis of cigarette design features and their effects on reducing the presence of various harmful components in the smoke that enters a person's lungs. The researchers chose a multivariate design because of the number of variables they had to include in the research. These variables include not only the quality and type of tobacco used in over 50 domestic US brands, but also the rod length, filter length, circumference, overlap, draw resistance, pressure drop, and filter tip ventilation (Agnew-Heard et al., 2016). The researchers analyzed their data in several steps. The data were first transformed into a set of uncorrelated principal components before regression analysis was performed. These components were then analyzed using a K-means clustering algorithm, which grouped cigarettes with similar component scores into clusters and revealed potential data similarities. The last stage of the multivariate analysis was the partial least squares method, which determined the relationship between the original ISO TNCO yields and the nine design parameters established previously.
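To make the sequence of steps concrete, the following is a minimal sketch of how such a pipeline (principal components, then K-means clustering, then partial least squares) could be set up in Python with scikit-learn. It is not the authors' actual code; the file names, column contents, component count, and cluster count are hypothetical assumptions for illustration only.

```python
# Sketch of the analysis pipeline described above: PCA -> K-means -> PLS.
# Assumes a CSV of the nine design parameters and a CSV of ISO TNCO yields;
# file names, component count, and cluster count are illustrative.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.cross_decomposition import PLSRegression

design = pd.read_csv("cigarette_design_features.csv")   # hypothetical input file
tnco = pd.read_csv("iso_tnco_yields.csv")                # hypothetical input file

# 1. Transform the correlated design parameters into uncorrelated principal components.
X = StandardScaler().fit_transform(design)
scores = PCA(n_components=3).fit_transform(X)

# 2. Group cigarettes with similar component scores using K-means clustering.
clusters = KMeans(n_clusters=4, random_state=0).fit_predict(scores)

# 3. Relate the original design parameters to the TNCO yields with partial least squares.
pls = PLSRegression(n_components=3).fit(X, tnco)
print(pls.score(X, tnco))   # proportion of TNCO variance explained by the model
```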
The research concluded that only three components out of nine accounted for 65% of the variability in the TNCO values across the presented clusters. The three components were strongly correlated with one another, with coefficients between 0.5 and 0.9 (Agnew-Heard et al., 2016). Although these three components had a larger influence than the rest, none of the nine tested components showed a particular dominance over the others, as the results varied from cluster to cluster.
Data Cleaning as a Process
The difference between data cleaning and error-prevention strategies lies in the fact that the former eliminates data-related problems once they have occurred, while the latter tries to prevent errors from happening in the first place. A standard data-cleaning strategy for multivariate data involves a three-stage process of screening, diagnosing, and editing or eliminating any suspected data abnormalities. This process can be initiated at any stage of the study. While it is possible to correct errors on the spot as the research is being performed, it is recommended to screen for errors actively, systematically, and in a planned way, to ensure that all data are adequately scanned and analyzed. The data-cleaning process can be performed manually or with the assistance of specialized software. Every step of the process has its own techniques and suggestions that can be applied to the data provided by the article.
Screening Phase
During the data screening process in multivariate research, one has to detect and distinguish five basic types of anomalous data: missing or excess data, outliers and inconsistencies, strange patterns and distributions, unexpected analysis results, and other types of interferences and potential errors. Screening methods can be statistical or non-statistical. Outliers and inconsistencies are often detected by comparing the data to prior expectations based on experience, evidence in the literature, previous studies, or common sense (Johnson & Wichern, 2014).
Detecting erroneous inliers, or erroneous data points that fall within the expected range, is harder, and such inliers often escape the screening process. To detect erroneous inliers in multivariate research, the suspect values need to be viewed in relation to other variables using regression analyses and consistency checks. Remeasurement is a recommended strategy for dealing with erroneous inliers, but it is not always feasible (Johnson & Wichern, 2014). Another strategy involves examining a sample of inliers in order to estimate the approximate number of potential errors in the research. A brief sketch of a regression-based consistency check is given below.
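The following is a minimal sketch of such a consistency check: one variable is regressed on related variables, and observations with unusually large residuals are flagged for follow-up even though their raw values look plausible. The function, variable names, and threshold are hypothetical and serve only to illustrate the idea.

```python
# Regression-based consistency check for erroneous inliers (illustrative sketch).
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def flag_inconsistent(df: pd.DataFrame, target: str, predictors: list, z_cut: float = 3.0):
    """Flag rows whose `target` value is inconsistent with related variables."""
    model = LinearRegression().fit(df[predictors], df[target])
    residuals = df[target] - model.predict(df[predictors])
    z_scores = (residuals - residuals.mean()) / residuals.std()
    # Rows with large residuals are worth re-checking against the source records.
    return df.index[(np.abs(z_scores) > z_cut).to_numpy()]

# Example call, assuming a DataFrame of design parameters as in the earlier sketch:
# suspects = flag_inconsistent(design, "filter_length", ["rod_length", "circumference"])
```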
Some of the screening methods that could be applied to the chosen article are as follows (Aggarwal, 2013); a sketch of the last item, statistical outlier detection, is given after the list:
- Checking data for double entry
- Graphical presentation and exploration of the datasets using histograms, diagrams, scatterplots, etc.
- Frequency distributions
- Cross-tabulations
- Statistical methods for outlier detection
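One common statistical screen for outliers in multivariate data is the Mahalanobis distance of each observation from the multivariate mean, compared against a chi-square cutoff. The sketch below illustrates this; the DataFrame, column names, and significance level are hypothetical and not taken from the article.

```python
# Mahalanobis-distance screen for multivariate outliers (illustrative sketch).
import numpy as np
import pandas as pd
from scipy.stats import chi2

def mahalanobis_outliers(df: pd.DataFrame, alpha: float = 0.001):
    """Return the index labels of observations far from the multivariate mean."""
    X = df.to_numpy(dtype=float)
    diff = X - X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", diff, inv_cov, diff)   # squared Mahalanobis distances
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])           # chi-square cutoff for p variables
    return df.index[d2 > cutoff]                          # candidate outliers for diagnosis

# Example with hypothetical design-parameter columns:
# suspects = mahalanobis_outliers(design[["rod_length", "filter_length", "circumference"]])
```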
Diagnostic Phase
After suspected erroneous data points have been identified in the screening phase, it is necessary to determine their nature. There are several potential outcomes of this process: the data points could be erroneous, normal (if the expectations were incorrect), true extremes, or undecided (without any explanation or reason for the extremity). Some of these data points can be singled out on the grounds of being logically or physically impossible (Chu, Ilyas, & Papotti, 2013). In many cases, several diagnostic procedures may be required to determine the true nature of each troublesome data pattern (Osborne, 2013).
Depending on the number and nature of errors in a multivariate study, the researchers might be required to reconsider their expectations for the results. In addition, quality control procedures may need to be reviewed and adjusted. This is why the diagnostic phase is considered a labor-intensive and expensive procedure (Osborne, 2013). The costs of data diagnostics can be lowered if the data-cleaning process is implemented throughout the entire study rather than after it has been concluded. Data-diagnostic software may be used to speed up the process.
One diagnostic strategy that can be applied to the chosen multivariate article involves returning to the previous stages of the data flow in order to reassess their consistency (Osborne, 2013). Once an unjustified change at any stage is detected, the next step involves looking for information that confirms whether the data point is erroneous or merely extreme. For example, the data regarding the length of cigarette filters could be erroneous due to a measurement error or a sampling error. To be effective, this strategy requires insight at both the statistical and the subject-matter level. A simple rule-based plausibility check of this kind is sketched below.
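The following sketch illustrates a rule-based plausibility check that could support the diagnostic phase: values outside a physically possible range are marked as errors, while values inside the range but far from typical are marked as extremes to be verified. The column names and ranges are hypothetical assumptions, not values from the article.

```python
# Rule-based plausibility check for the diagnostic phase (illustrative sketch).
import pandas as pd

# Hypothetical physically plausible ranges for two design parameters.
PLAUSIBLE = {"filter_length_mm": (15.0, 40.0), "circumference_mm": (15.0, 30.0)}

def diagnose(df: pd.DataFrame) -> pd.DataFrame:
    """Label each value as ok, impossible (error), or extreme (needs verification)."""
    labels = pd.DataFrame("ok", index=df.index, columns=list(PLAUSIBLE))
    for col, (lo, hi) in PLAUSIBLE.items():
        impossible = (df[col] < lo) | (df[col] > hi)
        labels.loc[impossible, col] = "impossible: correct or delete"
        typical = df[col].quantile([0.01, 0.99])
        extreme = ~impossible & ~df[col].between(typical.iloc[0], typical.iloc[1])
        labels.loc[extreme, col] = "extreme: verify against source records"
    return labels
```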
Treatment Phase
After the diagnostic stage is completed, all of the suspected errors in a multivariate study are identified as actual errors, missing values, or true values. There are only three options for dealing with each: correcting, deleting, or keeping the data unchanged (Downey & Fellows, 2012). The last option applies to true values, as removing true extremes from the research would inevitably distort the results. Values that are physically or logically impossible are never left unchanged; they should be either removed or, if possible, corrected. When two data points measured within a short period of time differ only slightly, it is recommended to average them in order to improve data accuracy (Downey & Fellows, 2012). Depending on the severity and number of factual errors, it may be necessary to amend the research protocol or even start anew. These treatment options are sketched below.
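The sketch below illustrates the three treatment options in code form: impossible values are set to missing for later correction, two closely spaced repeated measurements are averaged, and verified true extremes are left unchanged. The column names and the closeness threshold are hypothetical.

```python
# Treatment phase sketch: delete, correct (average), or keep (illustrative only).
import numpy as np
import pandas as pd

def treat(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Delete: mark physically impossible values as missing so they can be corrected later.
    out.loc[out["filter_length_mm"] <= 0, "filter_length_mm"] = np.nan
    # Correct: average two repeated measurements that differ only slightly.
    close = (out["measurement_1"] - out["measurement_2"]).abs() < 0.5
    out.loc[close, "measurement_final"] = out.loc[close, ["measurement_1", "measurement_2"]].mean(axis=1)
    # Keep: verified true extremes are left unchanged.
    return out
```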
With regard to the multivariate article chosen for this paper, it is possible to remeasure the individual design components in order to establish whether any suspicious data are accurate or erroneous. Extreme values should be kept in the research if there were no errors in the test samples or the measurements, as removing genuine extremes would distort the results. However, to ensure that the results of the research are accurate, an additional screening must be performed for any extraneous influences that may have been unaccounted for.
Conclusions
Data cleaning is an important part of multivariate research, as it provides the opportunity to eliminate errors that may have occurred during the data collection, processing, and analysis stages of the study and that the research framework and other error-prevention measures did not account for. The data-cleaning process includes screening, diagnostic, and treatment stages. During the screening stage, all of the information used in a multivariate study is examined, and suspected data points are singled out for diagnosis. During the diagnostic stage, it is determined whether the flagged data are actually erroneous or simply extreme. During the treatment stage, erroneous data are either corrected or deleted. This three-step process, applied repeatedly, helps ensure the accuracy and validity of the research.
References
Aggarwal, C. C. (2013). Managing and mining sensor data. New York, NY: Springer.
Agnew-Heard, K. A., Lancaster, V. A., Bravo, R., Watson, C., Walters, M. J., & Holman, M. R. (2016). Multivariate statistical analysis of cigarette design feature influence on ISO TNCO yields. Chemical Research in Toxicology, 29(6), 1051-1063.
Chu, X., Ilyas, I. F., & Papotti, P. (2013). Holistic data cleaning: putting violations into context. Data Engineering (ICDE), 1(1), 5-16.
Cramer, H. (2016). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Downey, R. G., & Fellows, M. R. (2012). Parameterized complexity. New York, NY: Springer.
Johnson, R. A., & Wichern, D. W. (2014). Applied multivariate statistical analysis. Duxbury, MA: Duxbury Publishing.
Osborne, J. W. (2013). Best practices in data cleaning. New York, NY: Sage.