Atmospheric Pollution Constituents Coursework


Summary

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y, measured on a scale of 0 (extremely bad) to 1000 (absolutely pure), together with its dependence on component pollutant variables X1, X2, …, X6. The aim of the department is to establish which of the component variables contributes most to local atmospheric pollution.

This report analyzed and discussed the association of the purity index Y with the component pollutant variables and developed a model to forecast the purity index. The analysis suggested that the component pollutant variables X1, X2, X4, X5, and X6 are significantly related to purity index Y (p < .05). However, only two component pollutant variables, X1 and X5, are most likely to contribute significantly to atmospheric pollution (purity index Y). The equation for the chosen (best) regression model is Y = 0.185 + 1.111X1 + 7.598X5.

Further, for the chosen model, all the underlying assumptions of the regression analysis (no multicollinearity, normality of errors, constant variance, and no autocorrelation) are satisfied.

Introduction

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y. The purity index Y is measured on a scale of 0 to 1000, with 0 being extremely bad and 1000 being absolutely pure, together with the dependence of this index on component pollutant variables X1, X2, …, X6. The aim of the department is to establish which of the component variables contributes most to local atmospheric pollution.

This report will analyze and discuss the association of the purity index Y with the component pollutant variables X1, X2, …, X6. Further, this report will develop a model for forecasting the purity index Y based on the component pollutant variables X1, X2, …, X6. For this, sample data for a period of 50 days were obtained. The test is a ‘blind’ one in the sense that none of the pollutants has been identified by name, because of its association with the source and the possibility at this stage of unwanted litigation.

Correlation and Scatterplot Analysis

Figures 1 to 6 show the scatterplots of purity index Y against the component pollutant variables X1, X2, …, X6.

Figure 1: Y versus X1
Figure 2: Y versus X2
Figure 3: Y versus X3
Figure 4: Y versus X4
Figure 5: Y versus X5
Figure 6: Y versus X6

There appears to be a strong linear relationship between Y and X1, Y and X2, and Y and X5. In addition, there appears to be a moderately strong linear relationship between Y and X6. Furthermore, there appears to be a weak or no linear relationship between Y and X3 and between Y and X4. Table 1 shows the correlation matrix (using MegaStat, an Excel add-in) for purity index Y and the component pollutant variables X1, X2, …, X6.

Table 1: Correlation Matrix

        X1       X2       X3       X4       X5       X6       Y
X1     1.000
X2      .738    1.000
X3     -.293    -.283    1.000
X4      .201     .287    -.130    1.000
X5      .605     .803    -.094     .307    1.000
X6      .491     .675    -.163     .109     .521    1.000
Y       .881     .778    -.261     .290     .805     .533    1.000

Sample size: 50
Critical value, .05 (two-tail): ±.279
Critical value, .01 (two-tail): ±.361

As shown in Table 1, the correlation of Y is significant with X1, X2, X4, X5, and X6. Therefore, based on the correlation and scatterplot analysis, component pollutant variable X3 is excluded from the first multiple regression model.
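The analysis above was produced with MegaStat in Excel. Purely as an illustration, a minimal Python sketch of the same step is given below; it assumes the 50-day sample is stored in a hypothetical file named pollution.csv with columns Y, X1, …, X6.

```python
# Illustrative sketch (not the MegaStat workflow used in this report).
# Assumes the 50-day sample is stored in a hypothetical file "pollution.csv"
# with columns Y, X1, X2, X3, X4, X5, X6.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("pollution.csv")

# Pearson correlation matrix, comparable to Table 1.
print(df.corr().round(3))

# Scatterplots of Y against each pollutant variable (cf. Figures 1 to 6).
fig, axes = plt.subplots(2, 3, figsize=(12, 7))
for ax, col in zip(axes.ravel(), ["X1", "X2", "X3", "X4", "X5", "X6"]):
    ax.scatter(df[col], df["Y"], s=15)
    ax.set_xlabel(col)
    ax.set_ylabel("Purity index Y")
plt.tight_layout()
plt.show()
```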

Multiple Regression Model

Model with Five Independent Variables (Excluding X3)

Table 2

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9477
R Square             0.8982
Adjusted R Square    0.8866
Standard Error       44.0675
Observations         50

ANOVA
              df    SS            MS            F         Significance F
Regression    5     753910.3541   150782.0708   77.6449   0.0000
Residual      44    85445.5108    1941.9434
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
Intercept    -80.3818       67.5179          -1.1905   0.2402    -216.4552    55.6917
X1           1.1879         0.1278           9.2925    0.0000    0.9302       1.4455
X2           -1.4448        1.0805           -1.3372   0.1880    -3.6223      0.7327
X4           6.2999         7.2074           0.8741    0.3868    -8.2257      20.8255
X5           8.4910         1.4413           5.8911    0.0000    5.5862       11.3959
X6           2.4322         3.2784           0.7419    0.4621    -4.1750      9.0393

Table 2 shows the regression model with five component pollutant variables. Although the regression model is significant (F = 77.64, p < .001), the p-values for the coefficients of component pollutant variables X2, X4, and X6 are greater than 0.05. The p-value for the coefficient of X6 (0.462) is higher than those of X2 (0.188) and X4 (0.387); thus, component pollutant variable X6 is excluded from further multiple regression analysis.
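For reference, a hedged sketch of one such backward-elimination step in Python with statsmodels (again assuming the hypothetical pollution.csv file) would fit the five-variable model and report the predictor with the largest p-value:

```python
# Sketch of one backward-elimination step with statsmodels (illustrative only;
# the report's tables come from Excel/MegaStat). Assumes "pollution.csv".
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pollution.csv")
X = sm.add_constant(df[["X1", "X2", "X4", "X5", "X6"]])  # X3 already excluded
model = sm.OLS(df["Y"], X).fit()

print(model.summary())                       # F-statistic, R², coefficient p-values
print(model.pvalues.drop("const").idxmax())  # candidate to drop next (largest p-value)
```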

Model with Four Independent Variables (Excluding X3 and X6)

Table 3

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9471
R Square             0.8969
Adjusted R Square    0.8878
Standard Error       43.8468
Observations         50

ANOVA
              df    SS            MS            F         Significance F
Regression    4     752841.5355   188210.3839   97.8967   0.0000
Residual      45    86514.3294    1922.5407
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%    Upper 95%
Intercept    -46.0844       48.9615          -0.9412   0.3516    -144.6979    52.5291
X1           1.1863         0.1272           9.3280    0.0000    0.9301       1.4424
X2           -1.0792        0.9567           -1.1280   0.2653    -3.0061      0.8477
X4           5.6856         7.1238           0.7981    0.4290    -8.6625      20.0338
X5           8.4570         1.4334           5.9000    0.0000    5.5700       11.3440

Table 3 shows the regression model with four component pollutant variables. Although the regression model is significant (F = 97.90, p < .001), the p-values for the coefficients of component pollutant variables X2 and X4 are greater than 0.05. The p-value for the coefficient of X4 (0.429) is higher than that of X2 (0.265); thus, component pollutant variable X4 is excluded from further multiple regression analysis.

Model with Three Independent Variables (Excluding X3, X4 and X6)

Table 4

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9463
R Square             0.8955
Adjusted R Square    0.8887
Standard Error       43.6734
Observations         50

ANOVA
              df    SS            MS            F          Significance F
Regression    3     751616.9110   250538.9703   131.3532   0.0000
Residual      46    87738.9539    1907.3686
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -12.7192       25.3857          -0.5010   0.6187    -63.8179    38.3795
X1           1.1842         0.1266           9.3505    0.0000    0.9293      1.4391
X2           -1.0248        0.9505           -1.0781   0.2866    -2.9381     0.8885
X5           8.6109         1.4147           6.0865    0.0000    5.7631      11.4586

Table 4 shows the regression model with three component pollutant variables. Although the regression model is significant (F = 131.35, p < .001), the p-value for the coefficient of component pollutant variable X2 (0.287) is greater than 0.05; thus, component pollutant variable X2 is excluded from further multiple regression analysis.

Model with Two Independent Variables X1 and X5

Table 5

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.9449
R Square             0.8928
Adjusted R Square    0.8883
Standard Error       43.7488
Observations         50

ANOVA
              df    SS            MS            F          Significance F
Regression    2     749399.7991   374699.8996   195.7722   0.0000
Residual      47    89956.0658    1913.9588
Total         49    839355.8649

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    0.1849         22.4256          0.0082    0.9935    -44.9296    45.2995
X1           1.1114         0.1074           10.3531   0.0000    0.8955      1.3274
X5           7.5978         1.0594           7.1717    0.0000    5.4665      9.7290

Table 5 shows the regression model with the two component pollutant variables X1 and X5. The regression model is significant (F = 195.77, p < .001). The p-values for the coefficients of X1 and X5 are both significant, which indicates that both component pollutant variables X1 and X5 significantly predict purity index Y in the regression model.

Table 6 shows the stepwise regression output (using MegaStat, an Excel add-in) giving the best model for each number of variables n. As shown in Table 6, the best multiple regression model is given by component pollutant variables X1 and X5, as its overall p-value is the lowest.

Table 6: Multiple regression model with different number of independent variables

                     p-values for the coefficients
Nvar   X1      X2      X3      X4      X5      X6        Se       Adj R²   R²     p-value
1      .0000                                              62.649   .771     .776   3.46E-17
2      .0000                           .0000              43.749   .888     .893   1.61E-23
3      .0000   .2866                   .0000              43.673   .889     .895   1.45E-22
4      .0000   .1933           .2597   .0000              43.530   .889     .898   9.57E-22
5      .0000   .1428           .2482   .0000   .4818      43.772   .888     .900   7.98E-21
6      .0000   .1282   .2802   .4404   .0000   .4349      43.970   .887     .901   5.57E-20

Adjusted R² is a criterion for deciding the number of independent variables in a multiple regression model. Figure 7 shows Adjusted R² versus the number of independent variables. As shown in Figure 7, there is little increase in Adjusted R² after the two independent variables X1 and X5. The Adjusted R² value is approximately the same (0.888) for more than two independent variables in the multiple regression model. Therefore, the best regression model is given by taking only the two independent variables X1 and X5.

Figure 7: Adjusted R² versus Number of Independent Variables
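As a cross-check, the Adjusted R² values in Table 6 and Figure 7 could be reproduced by fitting each "best for n" predictor subset, for example with statsmodels (illustrative sketch, again assuming the hypothetical pollution.csv file):

```python
# Adjusted R² for each "best for n" predictor subset listed in Table 6
# (illustrative sketch; assumes the hypothetical "pollution.csv" file).
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("pollution.csv")
subsets = [
    ["X1"],
    ["X1", "X5"],
    ["X1", "X2", "X5"],
    ["X1", "X2", "X4", "X5"],
    ["X1", "X2", "X4", "X5", "X6"],
    ["X1", "X2", "X3", "X4", "X5", "X6"],
]
for cols in subsets:
    fit = sm.OLS(df["Y"], sm.add_constant(df[cols])).fit()
    print(len(cols), "variables -> adjusted R² =", round(fit.rsquared_adj, 3))
```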

Chosen Multiple Regression Model

The equation for the chosen (best) regression model is given by Y = 0.185 + 1.111X1 + 7.598X5.

The regression slope coefficient of 1.111 for X1 indicates that, for each one-point increase in X1, purity index Y increases by about 1.111 on average, holding component pollutant variable X5 fixed.

The regression slope coefficient of 7.598 for X5 indicates that, for each one-point increase in X5, purity index Y increases by about 7.598 on average, holding component pollutant variable X1 fixed.

Component pollutant variables X1 and X5 explain about 89.3% of the variation in purity index Y. The remaining 10.7% of the variation in purity index Y is unexplained and may be due to other factors.
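As a simple worked example, the chosen equation can be used directly to forecast the purity index; the pollutant readings below are hypothetical and only illustrate the calculation.

```python
# Forecast from the chosen model Y = 0.185 + 1.111*X1 + 7.598*X5.
def predict_purity(x1: float, x5: float) -> float:
    """Predicted purity index for given levels of pollutants X1 and X5."""
    return 0.185 + 1.111 * x1 + 7.598 * x5

# Hypothetical pollutant readings (not taken from the 50-day sample).
print(round(predict_purity(x1=200.0, x5=30.0), 1))  # 450.3
```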

T-tests on Individual Coefficients

The null and alternate hypotheses are:

H0: βj = 0 (the population slope coefficient of the predictor Xj is zero), tested in turn for j = 1 and j = 5.

H1: βj ≠ 0 (the population slope coefficient of the predictor Xj is not zero).

The selected level of significance is 0.05 and the selected test is t-test for Zero Slope.

The decision rule is to reject H0 if the p-value ≤ 0.05. Otherwise, do not reject H0.

Component pollutant variable X1 significantly predicts purity index Y, t(47) = 10.35, p <.001.

Component pollutant variable X5 significantly predicts purity index Y, t(47) = 7.17, p <.001.
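These t statistics follow directly from Table 5: each is the estimated coefficient divided by its standard error, as the short check below shows.

```python
# t statistic = coefficient / standard error (values from Table 5).
coef_x1, se_x1 = 1.1114, 0.1074
coef_x5, se_x5 = 7.5978, 1.0594

print(round(coef_x1 / se_x1, 2))  # 10.35 -> t(47) for X1
print(round(coef_x5 / se_x5, 2))  # 7.17  -> t(47) for X5
```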

F-test on All Coefficients

The null and alternate hypotheses are:

H0: β1 = β5 = 0 (neither X1 nor X5 is linearly related to Y).

H1: At least one βj ≠ 0 (at least one of X1 and X5 is linearly related to Y).

The selected level of significance is 0.05 and the selected test is F-test.

The decision rule is to reject H0 if the p-value ≤ 0.05. Otherwise, do not reject H0.

The regression model is significant, R2 =.893, F(2, 47) = 195.77, p <.001.
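The F statistic likewise follows from the ANOVA section of Table 5 as the ratio of the regression mean square to the residual mean square; the quick check below also recomputes the upper-tail p-value with scipy as an illustrative alternative to reading Significance F from the table.

```python
# F = MS(regression) / MS(residual), using the ANOVA values in Table 5.
from scipy.stats import f

ms_regression = 374699.8996   # SS 749399.7991 / df 2
ms_residual = 1913.9588       # SS 89956.0658 / df 47

f_stat = ms_regression / ms_residual
print(round(f_stat, 2))               # 195.77
print(f.sf(f_stat, dfn=2, dfd=47))    # upper-tail p-value, effectively 0
```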

Assumptions of Regression Model

Multicollinearity

Klein’s Rule suggests that we should worry about the stability of the regression coefficient estimates only when a pairwise predictor correlation exceeds the multiple correlation coefficient R (i.e., the square root of R²). The value of the correlation coefficient between X1 and X5 is 0.605. The value of Multiple R for the final regression model with X1 and X5 is 0.945 and far exceeds 0.605, which suggests that the confidence intervals and t-tests may not be affected.

Another approach for checking multicollinearity is the Variance Inflation Factor (VIF). Figure 8 shows the interpretation of the VIF. As a rule of thumb, we should not worry about multicollinearity if the VIF for each explanatory variable is less than 10.

Figure 8: Variance Inflation Factor (VIF) and Interpretation

Table 7: Variance Inflation Factor (VIF) using MegaStat

Regression Analysis

R²                   0.893
Adjusted R²          0.888          n            50
R                    0.945          k            2
Std. Error           43.749         Dep. Var.    Y

ANOVA table
Source        SS             df    MS             F        p-value
Regression    749,399.7991   2     374,699.8996   195.77   1.61E-23
Residual      89,956.0658    47    1,913.9588
Total         839,355.8649   49

Regression output                                                        confidence interval
variables    coefficients   std. error   t (df=47)   p-value    95% lower   95% upper   std. coeff.   VIF
Intercept    0.1849         22.4256      0.008       .9935      -44.9296    45.2995     0.000
X1           1.1114         0.1074       10.353      1.03E-13   0.8955      1.3274      0.621         1.576
X5           7.5978         1.0594       7.172       4.49E-09   5.4665      9.7290      0.430         1.576

As shown in Table 7, the VIF for both X1 and X5 is 1.576; thus, there is no cause for concern.
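With only two predictors, the VIF can also be verified by hand from the X1 and X5 correlation in Table 1, since VIF = 1 / (1 − r²); a short check follows (the commented statsmodels lines are an optional illustrative alternative).

```python
# VIF for a two-predictor model: 1 / (1 - r^2), with r the X1-X5 correlation.
r_x1_x5 = 0.605                      # from Table 1
vif = 1.0 / (1.0 - r_x1_x5 ** 2)
print(round(vif, 3))                 # ≈ 1.577, in line with MegaStat's 1.576

# Optional equivalent using statsmodels (assumes the hypothetical "pollution.csv"):
# import pandas as pd, statsmodels.api as sm
# from statsmodels.stats.outliers_influence import variance_inflation_factor
# X = sm.add_constant(pd.read_csv("pollution.csv")[["X1", "X5"]])
# print([variance_inflation_factor(X.values, i) for i in (1, 2)])
```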

Non-Normal Errors

Figure 9 shows the normal probability plot of the residuals. As shown in Figure 9, the plot is approximately linear; thus, the residuals seem to be consistent with the hypothesis of normality.

Figure 9: Normal Probability Plot of Residuals
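The same diagnostic could be reproduced outside Excel, for instance with a normal probability plot of the residuals from the fitted two-variable model (illustrative sketch, assuming the hypothetical pollution.csv file):

```python
# Normal probability plot of the residuals of the chosen model (cf. Figure 9).
# Illustrative sketch; assumes the hypothetical "pollution.csv" file.
import pandas as pd
import statsmodels.api as sm
import scipy.stats as stats
import matplotlib.pyplot as plt

df = pd.read_csv("pollution.csv")
model = sm.OLS(df["Y"], sm.add_constant(df[["X1", "X5"]])).fit()

stats.probplot(model.resid, dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()
```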

Nonconstant Variance (Heteroscedasticity)

Figures 10 and 11 show the plots of residuals against X1 and against X5.

Figure 10: Residuals by X1
Figure 11: Residuals by X5

As shown in Figures 10 and 11, the data points are scattered and there is no pattern in the residuals as we move from left to right; thus, the residuals seem to be consistent with the hypothesis of homoscedasticity (constant variance).

Autocorrelation

Autocorrelation exists when the residuals are correlated with each other. With time-series data, one needs to be aware of the possibility of autocorrelation, a pattern of nonindependent errors that violates the regression assumption that each error is independent of its predecessor. The most common test for autocorrelation is the Durbin-Watson test. The DW statistic lies between 0 and 4. For no autocorrelation, the DW statistic will be near 2. In this case, DW = 2.33, which is near 2; thus, the errors are not autocorrelated. However, for cross-sectional data, the DW statistic is usually ignored.
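For completeness, the Durbin-Watson statistic reported above could be recomputed from the residuals of the chosen model, for example as in the sketch below (assuming the hypothetical pollution.csv file):

```python
# Durbin-Watson statistic for the residuals of the chosen model.
# Illustrative sketch; assumes the hypothetical "pollution.csv" file.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("pollution.csv")
model = sm.OLS(df["Y"], sm.add_constant(df[["X1", "X5"]])).fit()
print(round(durbin_watson(model.resid), 2))   # reported as 2.33 in this analysis
```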

Figure 12 shows the residuals by observation number. As shown in Figure 12, the sign of a residual cannot be predicted from the sign of the preceding one, which means that there is no autocorrelation.

Figure 12: Residuals by Observations

Thus, for the chosen model, all the underlying assumptions of the regression analysis are valid.

Pollutant Variables (X) Contributing to Atmospheric Pollution (Purity Index Y)

As shown in Table 1 (correlation matrix), the component pollutant variables X1, X2, X4, X5, and X6 are significantly related to purity index Y (p < .05). Thus, each of them individually contributes significantly to atmospheric pollution. However, based on the multiple regression analysis, only the two component pollutant variables X1 and X5 are most likely to contribute significantly to atmospheric pollution (purity index Y).
