Atmospheric Pollution Constituents Coursework


Summary

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y, measured on a scale of 0 (extremely bad) to 1000 (absolutely pure), together with its dependence on component pollutant variables X1, X2, …, X6. The aim of the department is to establish which of the component variables contributes most to local atmospheric pollution.

This report analyzed and discussed the association of the purity index Y with the component pollutant variables and developed a model to forecast the purity index. The analysis suggested that the component pollutant variables X1, X2, X4, X5, and X6 are significantly related to purity index Y (p < .05). However, only two component pollutant variables, X1 and X5, are most likely to contribute significantly to atmospheric pollution (purity index Y). The equation for the chosen (best) regression model is Y = 0.185 + 1.111X1 + 7.598X5.

Further, for the chosen model, all the underlying assumptions of the regression analysis (no multicollinearity, normality of errors, constant variance, and no autocorrelation) are satisfied.

Introduction

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y. The purity index Y is measured on a scale of 0 to 1000, with 0 being extremely bad and 1000 being absolutely pure, together with the dependence of this index on component pollutant variables X1, X2, …, X6. The aim of the department is to establish which of the component variables contributes most to local atmospheric pollution.

This report will analyze and discuss the association of the purity index Y with the component pollutant variables X1, X2, …, X6. Further, this report will develop a model for forecasting the purity index Y based on the component pollutant variables X1, X2, …, X6. For this, sample data for a period of 50 days were obtained. The test is a ‘blind’ one in the sense that none of the pollutants has been identified by name, because of its association with the source and the possibility at this stage of unwanted litigation.

Correlation and Scatterplot Analysis

Figures 1 to 6 show the scatterplots of purity index Y against the component pollutant variables X1, X2, …, X6.

Figure 1: Y versus X1
Figure 2: Y versus X2
Figure 3: Y versus X3
Figure 4: Y versus X4
Figure 5: Y versus X5
Figure 6: Y versus X6

There appears to be a strong linear relationship between Y and X1, Y and X2, and Y and X5. In addition, there appears to be a moderately strong linear relationship between Y and X6. Furthermore, there appears to be a weak or no linear relationship between Y and X3 and between Y and X4. Table 1 shows the correlation matrix (using MegaStat, an Excel add-in) for purity index Y and the component pollutant variables X1, X2, …, X6.

Table 1: Correlation Matrix

        X1       X2       X3       X4       X5       X6       Y
X1     1.000
X2      .738    1.000
X3     -.293    -.283    1.000
X4      .201     .287    -.130    1.000
X5      .605     .803    -.094     .307    1.000
X6      .491     .675    -.163     .109     .521    1.000
Y       .881     .778    -.261     .290     .805     .533    1.000

Sample size: 50
Critical value, .05 (two-tail): ±.279
Critical value, .01 (two-tail): ±.361

As shown in Table 1, the correlation of Y is significant with X1, X2, X4, X5, and X6. Therefore, based on the correlation and scatterplot analysis, component pollutant variable X3 is excluded from the first multiple regression model.
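The analysis above was produced with MegaStat in Excel. Purely as an illustration, a minimal Python sketch of the same step is given below; it assumes the 50-day sample is stored in a hypothetical file named pollution.csv with columns Y, X1, …, X6.

```python
# Illustrative sketch (not the MegaStat workflow used in this report).
# Assumes the 50-day sample is stored in a hypothetical file "pollution.csv"
# with columns Y, X1, X2, X3, X4, X5, X6.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("pollution.csv")

# Pearson correlation matrix, comparable to Table 1.
print(df.corr().round(3))

# Scatterplots of Y against each pollutant variable (cf. Figures 1 to 6).
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, col in zip(axes.ravel(), ["X1", "X2", "X3", "X4", "X5", "X6"]):
    ax.scatter(df[col], df["Y"], s=15)
    ax.set_xlabel(col)
    ax.set_ylabel("Purity index Y")
plt.tight_layout()
plt.show()
```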

Multiple Regression Model

Model with Five Independent Variables (Excluding X3)

Table 2

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9477
R Square             0.8982
Adjusted R Square    0.8866
Standard Error       44.0675
Observations         50

ANOVA
              df    SS            MS            F         Significance F
Regression    5     753910.3541   150782.0708   77.6449   0.0000
Residual      44    85445.5108    1941.9434
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
Intercept    -80.3818       67.5179          -1.1905   0.2402    -216.4552    55.6917
X1           1.1879         0.1278           9.2925    0.0000    0.9302       1.4455
X2           -1.4448        1.0805           -1.3372   0.1880    -3.6223      0.7327
X4           6.2999         7.2074           0.8741    0.3868    -8.2257      20.8255
X5           8.4910         1.4413           5.8911    0.0000    5.5862       11.3959
X6           2.4322         3.2784           0.7419    0.4621    -4.1750      9.0393

Table 2 shows the regression model with five component pollutant variables. Although the regression model is significant (F = 77.64, p < .001), the p-values for the coefficients of component pollutant variables X2, X4, and X6 are greater than 0.05. The p-value for the coefficient of X6 (0.462) is higher than those of X2 (0.188) and X4 (0.387); thus, component pollutant variable X6 is excluded from further multiple regression analysis.
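For reference, a hedged sketch of one such backward-elimination step in Python with statsmodels (again assuming the hypothetical pollution.csv file) would fit the five-variable model and report the predictor with the largest p-value:

```python
# Sketch of one backward-elimination step with statsmodels (illustrative only;
# the report's tables come from Excel/MegaStat). Assumes "pollution.csv".
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pollution.csv")
X = sm.add_constant(df[["X1", "X2", "X4", "X5", "X6"]])  # X3 already excluded
model = sm.OLS(df["Y"], X).fit()

print(model.summary())                       # F-statistic, R², coefficient p-values
print(model.pvalues.drop("const").idxmax())  # candidate to drop next (largest p-value)
```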

Model with Four Independent Variables (Excluding X3 and X6)

Table 3

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9471
R Square             0.8969
Adjusted R Square    0.8878
Standard Error       43.8468
Observations         50

ANOVA
              df    SS            MS            F         Significance F
Regression    4     752841.5355   188210.3839   97.8967   0.0000
Residual      45    86514.3294    1922.5407
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
Intercept    -46.0844       48.9615          -0.9412   0.3516    -144.6979    52.5291
X1           1.1863         0.1272           9.3280    0.0000    0.9301       1.4424
X2           -1.0792        0.9567           -1.1280   0.2653    -3.0061      0.8477
X4           5.6856         7.1238           0.7981    0.4290    -8.6625      20.0338
X5           8.4570         1.4334           5.9000    0.0000    5.5700       11.3440

Table 3 shows the regression model with four component pollutant variables. Although the regression model is significant (F = 97.90, p < .001), the p-values for the coefficients of component pollutant variables X2 and X4 are greater than 0.05. The p-value for the coefficient of X4 (0.429) is higher than that of X2 (0.265); thus, component pollutant variable X4 is excluded from further multiple regression analysis.

Model with Three Independent Variables (Excluding X3, X4 and X6)

Table 4

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9463
R Square             0.8955
Adjusted R Square    0.8887
Standard Error       43.6734
Observations         50

ANOVA
              df    SS            MS            F          Significance F
Regression    3     751616.9110   250538.9703   131.3532   0.0000
Residual      46    87738.9539    1907.3686
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -12.7192       25.3857          -0.5010   0.6187    -63.8179    38.3795
X1           1.1842         0.1266           9.3505    0.0000    0.9293      1.4391
X2           -1.0248        0.9505           -1.0781   0.2866    -2.9381     0.8885
X5           8.6109         1.4147           6.0865    0.0000    5.7631      11.4586

Table 4 shows the regression model with three component pollutant variables. Although the regression model is significant (F = 131.35, p < .001), the p-value for the coefficient of component pollutant variable X2 (0.287) is greater than 0.05; thus, component pollutant variable X2 is excluded from further multiple regression analysis.

Model with Two Independent Variables X1 and X5

Table 5

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9449
R Square             0.8928
Adjusted R Square    0.8883
Standard Error       43.7488
Observations         50

ANOVA
              df    SS            MS            F          Significance F
Regression    2     749399.7991   374699.8996   195.7722   0.0000
Residual      47    89956.0658    1913.9588
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    0.1849         22.4256          0.0082    0.9935    -44.9296    45.2995
X1           1.1114         0.1074           10.3531   0.0000    0.8955      1.3274
X5           7.5978         1.0594           7.1717    0.0000    5.4665      9.7290

Table 5 shows the regression model with the two component pollutant variables X1 and X5. The regression model is significant (F = 195.77, p < .001). The p-values for the coefficients of X1 and X5 are both significant, which indicates that both component pollutant variables X1 and X5 significantly predict purity index Y in the regression model.

Table 6 shows the stepwise regression output (using MegaStat, an Excel add-in) giving the best model for each number of variables n. As shown in Table 6, the best multiple regression model is given by component pollutant variables X1 and X5, as its overall p-value is the lowest.

Table 6: Multiple regression model with different number of independent variables

                     p-values for the coefficients
Nvar   X1      X2      X3      X4      X5      X6        Se       Adj R²   R²     p-value
1      .0000                                              62.649   .771     .776   3.46E-17
2      .0000                           .0000              43.749   .888     .893   1.61E-23
3      .0000   .2866                   .0000              43.673   .889     .895   1.45E-22
4      .0000   .1933           .2597   .0000              43.530   .889     .898   9.57E-22
5      .0000   .1428           .2482   .0000   .4818      43.772   .888     .900   7.98E-21
6      .0000   .1282   .2802   .4404   .0000   .4349      43.970   .887     .901   5.57E-20

Adjusted R² is a criterion for deciding the number of independent variables in a multiple regression model. Figure 7 shows Adjusted R² versus the number of independent variables. As shown in Figure 7, there is little increase in Adjusted R² after the two independent variables X1 and X5. The Adjusted R² value is approximately the same (0.888) for more than two independent variables in the multiple regression model. Therefore, the best regression model is given by taking only the two independent variables X1 and X5.

Figure 7: Adjusted R² versus Number of Independent Variables
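As a cross-check, the Adjusted R² values in Table 6 and Figure 7 could be reproduced by fitting each "best for n" predictor subset, for example with statsmodels (illustrative sketch, again assuming the hypothetical pollution.csv file):

```python
# Adjusted R² for each "best for n" predictor subset listed in Table 6
# (illustrative sketch; assumes the hypothetical "pollution.csv" file).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pollution.csv")
subsets = [
    ["X1"],
    ["X1", "X5"],
    ["X1", "X2", "X5"],
    ["X1", "X2", "X4", "X5"],
    ["X1", "X2", "X4", "X5", "X6"],
    ["X1", "X2", "X3", "X4", "X5", "X6"],
]
for cols in subsets:
    fit = sm.OLS(df["Y"], sm.add_constant(df[cols])).fit()
    print(len(cols), "variables -> adjusted R² =", round(fit.rsquared_adj, 3))
```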

Chosen Multiple Regression Model

The equation for the chosen (best) regression model is given by Y = 0.185 + 1.111X1 + 7.598X5.

The regression slope coefficient of 1.111 for X1 indicates that, for each one-point increase in X1, purity index Y increases by about 1.111 on average, holding component pollutant variable X5 fixed.

The regression slope coefficient of 7.598 for X5 indicates that, for each one-point increase in X5, purity index Y increases by about 7.598 on average, holding component pollutant variable X1 fixed.

Component pollutant variables X1 and X5 explain about 89.3% of the variation in purity index Y. The remaining 10.7% of the variation in purity index Y is unexplained and may be due to other factors.
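As a simple worked example, the chosen equation can be used directly to forecast the purity index; the pollutant readings below are hypothetical and only illustrate the calculation.

```python
# Forecast from the chosen model Y = 0.185 + 1.111*X1 + 7.598*X5.
def predict_purity(x1: float, x5: float) -> float:
    """Predicted purity index for given levels of pollutants X1 and X5."""
    return 0.185 + 1.111 * x1 + 7.598 * x5

# Hypothetical pollutant readings (not taken from the 50-day sample).
print(round(predict_purity(x1=200.0, x5=30.0), 1))  # 450.3
```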

T-tests on Individual Coefficients

The null and alternate hypotheses are:

H0: βj = 0 (the population slope coefficient of the predictor Xj is zero), tested in turn for j = 1 and j = 5.

H1: βj ≠ 0 (the population slope coefficient of the predictor Xj is not zero).

The selected level of significance is 0.05 and the selected test is t-test for Zero Slope.

The decision rule is to reject H0 if the p-value ≤ 0.05. Otherwise, do not reject H0.

Component pollutant variable X1 significantly predicts purity index Y, t(47) = 10.35, p <.001.

Component pollutant variable X5 significantly predicts purity index Y, t(47) = 7.17, p <.001.
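These t statistics follow directly from Table 5: each is the estimated coefficient divided by its standard error, as the short check below shows.

```python
# t statistic = coefficient / standard error (values from Table 5).
coef_x1, se_x1 = 1.1114, 0.1074
coef_x5, se_x5 = 7.5978, 1.0594

print(round(coef_x1 / se_x1, 2))  # 10.35 -> t(47) for X1
print(round(coef_x5 / se_x5, 2))  # 7.17  -> t(47) for X5
```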

F-test on All Coefficients

The null and alternate hypotheses are:

H0: β1 = β5 = 0 (neither X1 nor X5 is linearly related to Y).

H1: At least one βj ≠ 0 (at least one of X1 and X5 is linearly related to Y).

The selected level of significance is 0.05 and the selected test is F-test.

The decision rule is to reject H0 if the p-value ≤ 0.05. Otherwise, do not reject H0.

The regression model is significant, R2 =.893, F(2, 47) = 195.77, p <.001.
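The F statistic likewise follows from the ANOVA section of Table 5 as the ratio of the regression mean square to the residual mean square; the quick check below also recomputes the upper-tail p-value with scipy as an illustrative alternative to reading Significance F from the table.

```python
# F = MS(regression) / MS(residual), using the ANOVA values in Table 5.
from scipy.stats import f

ms_regression = 374699.8996   # SS 749399.7991 / df 2
ms_residual = 1913.9588       # SS 89956.0658 / df 47

f_stat = ms_regression / ms_residual
print(round(f_stat, 2))               # 195.77
print(f.sf(f_stat, dfn=2, dfd=47))    # upper-tail p-value, effectively 0
```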

Assumptions of Regression Model

Multicollinearity

Klein’s Rule suggests that we should worry about the stability of the regression coefficient estimates only when a pairwise predictor correlation exceeds the multiple correlation coefficient R (i.e., the square root of R²). The value of the correlation coefficient between X1 and X5 is 0.605. The value of Multiple R for the final regression model with X1 and X5 is 0.945 and far exceeds 0.605, which suggests that the confidence intervals and t-tests may not be affected.

Another approach for checking multicollinearity is the Variance Inflation Factor (VIF). Figure 8 shows the interpretation of the VIF. As a rule of thumb, we should not worry about multicollinearity if the VIF for each explanatory variable is less than 10.

Figure 8: Variance Inflation Factor (VIF) and Interpretation

Table 7: Variance Inflation Factor (VIF) using MegaStat

Regression Analysis

R²                   0.893
Adjusted R²          0.888          n            50
R                    0.945          k            2
Std. Error           43.749         Dep. Var.    Y

ANOVA table
Source        SS             df    MS             F        p-value
Regression    749,399.7991   2     374,699.8996   195.77   1.61E-23
Residual      89,956.0658    47    1,913.9588
Total         839,355.8649   49

Regression output                                                        confidence interval
variables    coefficients   std. error   t (df=47)   p-value    95% lower   95% upper   std. coeff.   VIF
Intercept    0.1849         22.4256      0.008       .9935      -44.9296    45.2995     0.000
X1           1.1114         0.1074       10.353      1.03E-13   0.8955      1.3274      0.621         1.576
X5           7.5978         1.0594       7.172       4.49E-09   5.4665      9.7290      0.430         1.576

As shown in Table 7, the VIF for both X1 and X5 is 1.576; thus, there is no cause for concern.
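With only two predictors, the VIF can also be verified by hand from the X1 and X5 correlation in Table 1, since VIF = 1 / (1 − r²); a short check follows (the commented statsmodels lines are an optional illustrative alternative).

```python
# VIF for a two-predictor model: 1 / (1 - r^2), with r the X1-X5 correlation.
r_x1_x5 = 0.605                      # from Table 1
vif = 1.0 / (1.0 - r_x1_x5 ** 2)
print(round(vif, 3))                 # ≈ 1.577, in line with MegaStat's 1.576

# Optional equivalent using statsmodels (assumes the hypothetical "pollution.csv"):
# import pandas as pd, statsmodels.api as sm
# from statsmodels.stats.outliers_influence import variance_inflation_factor
# X = sm.add_constant(pd.read_csv("pollution.csv")[["X1", "X5"]])
# print([variance_inflation_factor(X.values, i) for i in (1, 2)])
```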

Non-Normal Errors

Figure 9 shows the normal probability plot of the residuals. As shown in Figure 9, the plot is approximately linear; thus, the residuals seem to be consistent with the hypothesis of normality.

Figure 9: Normal Probability Plot of Residuals
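The same diagnostic could be reproduced outside Excel, for instance with a normal probability plot of the residuals from the fitted two-variable model (illustrative sketch, assuming the hypothetical pollution.csv file):

```python
# Normal probability plot of the residuals of the chosen model (cf. Figure 9).
# Illustrative sketch; assumes the hypothetical "pollution.csv" file.
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
import matplotlib.pyplot as plt

df = pd.read_csv("pollution.csv")
model = sm.OLS(df["Y"], sm.add_constant(df[["X1", "X5"]])).fit()

stats.probplot(model.resid, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```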

Nonconstant Variance (Heteroscedasticity)

Figures 10 and 11 show the plots of residuals against X1 and against X5.

Figure 10: Residuals by X1
Figure 11: Residuals by X5

As shown in Figures 10 and 11, the data points are scattered and there is no pattern in the residuals as we move from left to right; thus, the residuals seem to be consistent with the hypothesis of homoscedasticity (constant variance).

Autocorrelation

Autocorrelation exists when the residuals are correlated with each other. With time-series data, one needs to be aware of the possibility of autocorrelation, a pattern of nonindependent errors that violates the regression assumption that each error is independent of its predecessor. The most common test for autocorrelation is the Durbin-Watson test. The DW statistic lies between 0 and 4. For no autocorrelation, the DW statistic will be near 2. In this case, DW = 2.33, which is near 2; thus, the errors are not autocorrelated. However, for cross-sectional data, the DW statistic is usually ignored.
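For completeness, the Durbin-Watson statistic reported above could be recomputed from the residuals of the chosen model, for example as in the sketch below (assuming the hypothetical pollution.csv file):

```python
# Durbin-Watson statistic for the residuals of the chosen model.
# Illustrative sketch; assumes the hypothetical "pollution.csv" file.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("pollution.csv")
model = sm.OLS(df["Y"], sm.add_constant(df[["X1", "X5"]])).fit()
print(round(durbin_watson(model.resid), 2))   # reported as 2.33 in this analysis
```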

Figure 12 shows the residuals by observation number. As shown in Figure 12, the sign of a residual cannot be predicted from the sign of the preceding one, which means that there is no autocorrelation.

Figure 12: Residuals by Observations

Thus, for the chosen model, all the underlying assumptions of the regression analysis are valid.

Pollutant Variables (X) Contributing to Atmospheric Pollution (Purity Index Y)

As shown in Table 1 (correlation matrix), the component pollutant variables X1, X2, X4, X5, and X6 are significantly related to purity index Y (p < .05). Thus, each of them individually contributes significantly to atmospheric pollution. However, based on the multiple regression analysis, only the two component pollutant variables X1 and X5 are most likely to contribute significantly to atmospheric pollution (purity index Y).
