Introduction
Multivariate analysis is among the most frequently used methods of statistical data evaluation in contemporary research. Fifty years ago, many multivariate techniques were impractical because of the difficulty of accessing, processing, analyzing, and synthesizing information. With the development of information and communication technology, they can now be applied more often than ever before. Technology enables researchers to collect, store, and transfer large datasets from observation and experimentation. Multivariate techniques are invaluable for their ability to reveal the relationships among variables within massive sets of data. It is therefore important to understand the methodology behind multivariate research, its numerous methods and their implementation, and its strengths and weaknesses in the context of a particular situation. The purpose of this paper is to explore these concepts by comparing and contrasting two studies that use a multivariate approach to analyzing data.
What is Multivariate Analysis?
Multivariate analysis, as the name suggests, is based on the observation and analysis of several statistical outcome variables at the same time. It differs from multiple regression in that a multiple regression, while including more than one independent variable in its equation, always has only one dependent variable, whereas a multivariate analysis includes several. Multivariate analysis is used to perform studies across several dimensions of research while taking account of numerous effects and responses at the same time. It is well suited to studying increasingly large and complex amounts of data (Mengual-Macenlle, Marcos, Golpe, & Gonzales-Riva, 2015). Because it allows several variables to be modeled and analyzed simultaneously for each sample studied, multivariate analysis can represent reality more faithfully than methods that consider one outcome at a time. There are numerous types of multivariate analysis, but in general, they can be classified into three separate groups, based on the types of variables included in the research. These types are (Mengual-Macenlle et al., 2015):
- Dependent
- Interdependent
- Structural.
When the dependent variable is quantitative, four methods of multivariate analysis are commonly used: regression analysis, survival analysis, analysis of variance, and canonical correlation. When the dependent variable is qualitative, researchers are advised to use one of three statistical models: discriminant analysis, conjoint analysis, or logistic regression.
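The distinction drawn earlier in this section between multiple regression (one dependent variable) and multivariate analysis (several dependent variables) can be illustrated with a minimal sketch. The function names and the toy data below are hypothetical and are not taken from the cited sources. Note that fitting each response separately by least squares yields the same coefficient estimates as a joint fit; the multivariate character matters mainly when the responses are tested or interpreted together.

```python
from typing import List

Matrix = List[List[float]]


def solve(a: Matrix, b: List[float]) -> List[float]:
    """Solve the linear system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def fit_ols(predictors: Matrix, response: List[float]) -> List[float]:
    """Multiple regression: several predictors, ONE dependent variable.
    Returns [intercept, slope_1, slope_2, ...] from the normal equations."""
    design = [[1.0] + row for row in predictors]  # prepend an intercept column
    k = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(design, response)) for i in range(k)]
    return solve(xtx, xty)


def fit_multivariate(predictors: Matrix, responses: Matrix) -> Matrix:
    """Multivariate regression: the same predictors, SEVERAL dependent variables.
    Returns one coefficient vector per response column."""
    n_responses = len(responses[0])
    return [
        fit_ols(predictors, [row[j] for row in responses])
        for j in range(n_responses)
    ]


# Synthetic toy data (an assumption for illustration): two predictors, two responses.
predictors = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0], [3.0, 1.0], [4.0, 2.0]]
responses = [[1 + 2 * a + 3 * b, -1 + 0.5 * a - b] for a, b in predictors]
coefficients = fit_multivariate(predictors, responses)
```

Here `coefficients` holds one row of estimates for each of the two dependent variables, whereas a single call to `fit_ols` would describe only one of them.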
Weaknesses and Limitations of Multivariate Analysis
Every kind of analysis has its weaknesses. In the case of multivariate analysis, the main weakness of the method lies in its complexity. Manual processing of data is virtually impossible due to the number of complex calculations required to perform the analysis. Multivariate research cannot be completed without high-level statistical software, which is expensive. The results of multivariate analysis are complex and prone to misinterpretation because they rest on assumptions that are difficult to assess (Mengual-Macenlle et al., 2015). In addition, multivariate analyses tend to have large standard errors, which demand large sample sizes for the results to be meaningful. Lastly, running statistical programs requires a specialist to make sense of the data output. Because of these limitations, multivariate analysis is generally restricted to large-scale studies with large budgets and data samples (Mengual-Macenlle et al., 2015).
Pulmonary Arterial Enlargement and Acute Exacerbations of COPD
Multivariate analysis is often used in medical research. The spread of electronic patient records has opened new possibilities for multivariate analysis, as it has significantly simplified data collection and enabled access to very large sample databases. The first study examined here investigates pulmonary arterial enlargement and acute exacerbations of COPD (Wells et al., 2012). It was published in the New England Journal of Medicine. The study tests the hypothesis that a computed tomographic metric of pulmonary vascular disease is associated with previous severe exacerbations of COPD. To determine the level of association between various patient characteristics and the occurrence of severe exacerbations, the researchers first used univariate logistic regression.
The second step of the analysis dealt with variables that showed a univariate association with severe COPD exacerbations at P < 0.10 (Wells et al., 2012). These variables were entered into a stepwise backward multivariate logistic model. Variables that had been individually associated with COPD exacerbations in the ECLIPSE study were also included in this model.
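The screening step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the data are synthetic, the names `covariate_a` and `covariate_b` are hypothetical, each candidate is fitted in its own one-predictor logistic model by Newton-Raphson, and the P value is a two-sided Wald test. The subsequent backward stepwise multivariate model is not shown.

```python
import math
from typing import Dict, List, Tuple


def _gradient_and_information(b0: float, b1: float, x: List[float], y: List[int]):
    """Score vector and information matrix entries for logit P(y=1) = b0 + b1*x."""
    g0 = g1 = h00 = h01 = h11 = 0.0
    for xi, yi in zip(x, y):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
        w = p * (1.0 - p)
        g0 += yi - p
        g1 += (yi - p) * xi
        h00 += w
        h01 += w * xi
        h11 += w * xi * xi
    return g0, g1, h00, h01, h11


def fit_univariate_logistic(x: List[float], y: List[int], iterations: int = 25) -> Tuple[float, float]:
    """Return (slope, two-sided Wald P value) for a one-predictor logistic model."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0, g1, h00, h01, h11 = _gradient_and_information(b0, b1, x, y)
        det = h00 * h11 - h01 * h01
        # Newton-Raphson update: beta += inverse(information) @ score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Variance of the slope is the matching diagonal entry of the inverse information.
    _, _, h00, h01, h11 = _gradient_and_information(b0, b1, x, y)
    std_error = math.sqrt(h00 / (h00 * h11 - h01 * h01))
    p_value = math.erfc(abs(b1 / std_error) / math.sqrt(2.0))
    return b1, p_value


def screen(candidates: Dict[str, List[float]], outcome: List[int], threshold: float = 0.10) -> List[str]:
    """Keep the candidates whose univariate P value falls below the threshold."""
    return [
        name
        for name, values in candidates.items()
        if fit_univariate_logistic(values, outcome)[1] < threshold
    ]


# Synthetic data (an assumption): 20 patients, 10 with a severe exacerbation.
outcome = [1] * 10 + [0] * 10
candidates = {
    # 8 of 10 exposed patients had the outcome, 2 of 10 unexposed did.
    "covariate_a": [1.0] * 8 + [0.0] * 2 + [1.0] * 2 + [0.0] * 8,
    # 6 of 10 exposed patients had the outcome, 4 of 10 unexposed did.
    "covariate_b": [1.0] * 6 + [0.0] * 4 + [1.0] * 4 + [0.0] * 6,
}
selected = screen(candidates, outcome)
```

With these made-up data only `covariate_a` passes the P < 0.10 screen and would be carried forward to the multivariate model.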
The authors found that there are significant associations between severe COPD exacerbations and the following variables (Wells et al., 2012):
- Younger age
- Black race
- Congestive heart failure
- Sleep deprivation
- Asthma
- Chronic Bronchitis
- Hazardous labor
Multivariate analysis was useful in handling many covariates and confounders, as it enabled the use of interaction testing and the assessment of effect modification. In addition, multiple logistic regression analyses showed an association between severe COPD-related exacerbations and several other factors, such as younger age, high SGRQ scores, and lower FEV1 values (Wells et al., 2012).
The multiple logistic regression analysis used in this research is a robust method of data analysis, which is resistant to many common limitations and vulnerabilities of other methods of analyzing quantitative and qualitative data. However, in the context of this research, its weakness lies in the restricted form of its outcome variable. Where COPD-related exacerbations vary by degree of severity, multiple logistic regression analysis is limited because it is best suited to categorical rather than continuous outcomes (Hampel, Ronchetti, Rousseeuw, & Stahel, 2011).
Multivariate Statistical Analysis of Cigarette Design Feature Influence on ISO TNCO Yields
This study was published in Chemical Research in Toxicology in 2016. It is concerned with differences in the physical design of cigarettes and how they affect the variance of tar, nicotine, and carbon monoxide in inhaled mainstream smoke, measured against the ISO smoke standards (Agnew-Heard et al., 2016). The choice of multivariate statistical analysis for this research was dictated by the large number of variables that needed to be processed to reach any meaningful conclusions. These variables include not only the quality and type of tobacco used in over 50 domestic US brands, but also the rod length, filter length, circumference, overlap, draw resistance, pressure drop, and filter tip ventilation (Agnew-Heard et al., 2016). These components are present in 44 out of 50 samples and constitute the majority of cigarettes produced worldwide (Agnew-Heard et al., 2016).
The data were analyzed using statistical software. Multivariate analysis was performed in several steps. First, all data received from the tests had to be transformed into a set of uncorrelated principal components to perform a multiple logistic regression analysis, as this is considered a primary requirement for many multivariate analyses. Afterward, the principal components were analyzed using a K-means cluster algorithm to form groups of cigarettes based on the similarity of their data. Lastly, the relationship between the original ISO TNCO yields and the nine design parameters was established using partial least squares analysis. To summarize the systematic multivariate analysis implemented in this research, physical parameters, univariate correlations, and principal components were used to form the K-means clusters, which were then used to perform the partial least squares method within and between sample groups (Agnew-Heard et al., 2016).
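The first two stages of the pipeline described above, principal components followed by K-means clustering, can be sketched in a few lines. This is a simplified illustration under several assumptions: the measurements are synthetic, only the first principal component is extracted (by power iteration), and the clustering is one-dimensional with two clusters. The partial least squares stage is not shown, and none of the function names come from the cited study.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(rows: Matrix) -> Matrix:
    """Center each column and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def first_component(z: Matrix, iterations: int = 200) -> List[float]:
    """Leading eigenvector of the covariance matrix of z, found by power iteration.
    Assumes the start vector is not orthogonal to the leading eigenvector."""
    n, k = len(z), len(z[0])
    cov = [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(k)] for i in range(k)]
    v = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def kmeans_1d(scores: List[float], k: int = 2, iterations: int = 50) -> List[int]:
    """Plain K-means on scalar scores, with centroids initialized evenly
    between the minimum and maximum score so the result is deterministic."""
    lo, hi = min(scores), max(scores)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(scores)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(s - centroids[c])) for s in scores]
        for c in range(k):
            members = [s for s, lab in zip(scores, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


# Synthetic measurements (an assumption): six products, three design features.
measurements = [
    [1.0, 2.0, 1.5], [1.2, 1.9, 1.4], [0.9, 2.1, 1.6],
    [3.0, 4.0, 3.5], [3.1, 4.2, 3.4], [2.9, 3.9, 3.6],
]
z = standardize(measurements)
loading = first_component(z)
scores = [sum(a * b for a, b in zip(row, loading)) for row in z]
labels = kmeans_1d(scores, k=2)
```

In this toy case the first three products fall into one cluster and the last three into the other, mirroring the idea of grouping cigarettes by similarity before modeling yields within and between groups.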
After evaluating the 50 cigarette brands, the research identified three components out of nine that accounted for 65% of the variability in the TNCO values for each of the clusters (Agnew-Heard et al., 2016). The correlation between the predicted and observed yields was between 0.5 and 0.9 (Agnew-Heard et al., 2016). The multivariate analysis results indicate that while all nine components have a degree of influence on the final TNCO yields, none of them is a sufficiently dominant factor across all 50 samples. The variation between yields is associated not with the position of the components within the cigarette, but with the qualities of each component determined by its construction.
The research uses a solid method of step-by-step data analysis, first converting the data into non-correlated data sets, and then using cluster analysis to determine the connection between variables within and between clusters. This gradual analysis is very robust against most deviations and sample-related errors, but not all of them (Cramer, 2016). One of the biggest weaknesses of multivariate data analyses lies in the assumptions that researchers have to accept before implementing them. One of these assumptions is homoscedasticity, which assumes that the variance of the errors is equal across all samples (Hampel et al., 2011). In the case of this study, this assumption may not hold because there are differences between different components in cigarettes. This refers to the construction of filters, the quality of tobacco, structural compositions, features, and ingredients. According to the researchers, “the PLSR models in this study cannot account for cigarettes with different filter technologies and paper types, tobacco blend compositions, ingredients, or variations in static burn rate, which impact the total puff count, even though the resulting ISO generated TNCO yields can be similar in select cases” (Agnew-Heard et al., 2016, p. 1060).
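A rough way to see what the homoscedasticity assumption means in practice is to compare the spread of model residuals between groups of samples. The sketch below is an informal check on made-up residuals, not a procedure used in the cited study. The function name and group labels are hypothetical, and formal alternatives such as Levene's test or the Breusch-Pagan test would be preferable in real work.

```python
from statistics import variance
from typing import Dict, List


def variance_ratio(residuals_by_group: Dict[str, List[float]]) -> float:
    """Ratio of the largest to the smallest sample variance of residuals.
    A value near 1 is consistent with equal error variance across groups;
    a value far above 1 suggests the assumption may be violated."""
    variances = [variance(values) for values in residuals_by_group.values()]
    return max(variances) / min(variances)


# Synthetic residuals (an assumption): group_b is three times as spread out as group_a.
residuals = {
    "group_a": [-1.0, 0.0, 1.0],
    "group_b": [-3.0, 0.0, 3.0],
}
ratio = variance_ratio(residuals)
```

In this example the sample variances are 1 and 9, so the ratio is 9, which would cast doubt on treating the two groups as having equal error variance.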
Multivariate Analysis Methods Comparison
When the two studies are compared, it is apparent that the second used a more complicated data analysis model than the first. The study of COPD exacerbations used multiple logistic regression models as its primary tool of data analysis. The second study, which analyzed the influence of cigarette design features on the yields of tar, nicotine, and carbon monoxide, used a step-by-step multivariate analysis in which logistic-regression models served only as an intermediary tool to transform the data into a form more suitable for further analysis with the cluster method and, later, the partial least squares regression method, which is considered more thorough (Cramer, 2016). Both studies have to deal with several weaknesses connected to their respective methodologies and research designs. The latter study has fewer weaknesses due to increased funding, which enabled it to use a more thorough and software-demanding analysis.
Conclusions
Multivariate techniques and methodologies are very efficient at analyzing large databases and processing many variables to model a more complete and authentic statistical representation of reality. Multivariate analysis methods rely heavily on statistical software due to the complexity of their calculations and data processing, which makes them costly for small-scale and personal research. Nevertheless, they provide large amounts of reliable data, as can be seen in the example studies discussed in this paper. At the same time, multivariate analysis designs have several weaknesses associated with sample sizes and variable homogeneity, which can obscure the results through large standard errors.
References
Agnew-Heard, K. A., Lancaster, V. A., Bravo, R., Watson, C., Walters, M. J., & Holman, M. R. (2016). Multivariate statistical analysis of cigarette design feature influence on ISO TNCO yields. Chemical Research in Toxicology, 29(6), 1051-1063.
Cramer, H. (2016). Mathematical methods of statistics. Princeton, NJ: Princeton University Press.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (2011). Robust statistics: The approach based on influence functions. Hoboken, NJ: John Wiley & Sons.
Mengual-Macenlle, N., Marcos, P. J., Golpe, R., & Gonzales-Riva, D. (2015). Multivariate analysis in thoracic research. Journal of Thoracic Disease, 7(3), 2-6.
Wells, J. M., Washko, G. R., Han, M. K., Abbas, N., Nath, H., Mamary, A. J., ... Dransfield, M. T. (2012). Pulmonary arterial enlargement and acute exacerbations of COPD. New England Journal of Medicine, 367(10), 913-921.