The analysis of the assessment was performed using a quantitative approach. The primary reason for this choice is that the scores produced by the assessment are easily quantifiable and uniform, which allows for relatively consistent results. The tool chosen for the analysis is the paired t-test, which can be used to determine change based on two datasets retrieved at different points in time from the same population (Hinton, McMurray, & Brownlow, 2014).
Importantly, the paired t-test also makes it possible to ascertain the reliability of the results by calculating a t-value and the corresponding p-value, which is especially valuable when several small samples are available (five classes in the case at hand) (Hahs-Vaughn & Lomax, 2012). The following paper presents the results of the analysis of the scores and outlines the methods of ascertaining validity and reliability.
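The paired t-test described above operates on the per-student differences between the two measurements. A minimal computational sketch is given below; the score lists are hypothetical placeholders for illustration only, not the assessment data analyzed in this paper.

```python
# Minimal sketch of a paired t-test computed by hand.
# The score lists are hypothetical placeholders, not the study data.
from math import sqrt
from statistics import mean, stdev

pre = [52, 48, 61, 44, 57, 50, 63, 46]    # hypothetical pre-test scores (%)
post = [65, 60, 70, 55, 68, 62, 74, 58]   # hypothetical post-test scores (%)

# The paired t-test works on the per-student differences.
diffs = [b - a for a, b in zip(pre, post)]
n = len(diffs)

# t = mean difference divided by its standard error; df = n - 1.
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))

print(f"t({n - 1}) = {t_stat:.2f}")
```

The resulting t-value is then compared against the critical value for n - 1 degrees of freedom to obtain the p-values reported in the following sections.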
For the sake of uniformity, it was assumed that the criterion score for success in the assessment was 70 percent. The analysis of the data disaggregated by class yielded the following results. In class 1, the pre-test class average was 52% ± 7%. Of the 23 students, one (4.3%) met the 70% criterion. There was a statistically significant difference between the scores and the criterion of success (t(22) = -4.94, p < 0.001). The post-test results averaged 65.3% ± 7.7%. Eleven students met the success criterion, which accounts for 47.8% of the class. The difference was not statistically significant (t(22) = -1.2, p > 0.05). The comparison displays a 13% improvement, which was statistically significant (t(22), p < 0.001).
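The per-class comparison against the 70% criterion is a one-sample t-test of the class scores against a fixed value. A brief sketch follows; the scores are hypothetical placeholders, not the actual class data.

```python
# One-sample t-test of class scores against the 70% success criterion.
# The scores are hypothetical placeholders, not the actual class data.
from math import sqrt
from statistics import mean, stdev

criterion = 70  # success criterion (%)
scores = [52, 48, 61, 44, 57, 50, 63, 46]  # hypothetical class scores (%)

n = len(scores)
# t = (sample mean - criterion) / standard error; df = n - 1.
t_stat = (mean(scores) - criterion) / (stdev(scores) / sqrt(n))

# A negative t indicates the class mean falls below the criterion.
print(f"t({n - 1}) = {t_stat:.2f}")
```

The negative t-values reported throughout the per-class results reflect the same pattern: class averages below the 70% criterion.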
For the second class, the pre-test average was 42.4% ± 10.4%. One of the students (5.6%) exceeded the success criterion. The result was statistically significant (t(17) = -5.18, p < 0.001). In the post-test, the class average was 55.3% ± 13.4%. Nine students (50%) met the criterion of success. The result was statistically significant (t(17) = -2.15, p < 0.05). The comparison between the tests identified a 12.8% improvement, which was statistically significant (t(17), p < 0.025).
In the third class, the pre-test average was 48.3% ± 6.8%, with two of the 29 students meeting the success criterion (6.9% of the total). The difference was statistically significant (t(28) = -6.26, p < 0.001). The post-test average was 58.1% ± 8.7%, with 12 of the 29 students meeting the criterion (41.4%). The difference was statistically significant (t(28) = -2.7, p < 0.025). The change between the pre-test and post-test scores constitutes a 9.8% improvement, which is a statistically significant result (t(28), p < 0.05).
In the fourth class, the pre-test average was 51.4% ± 9.4%. Three of the 20 students met the success criterion. The difference was statistically significant (t(19) = -3.88, p < 0.005). The post-test average was 61.5% ± 9%, with nine of the 20 students succeeding (45%) and no statistically significant difference between the scores and the criterion of success (t(19) = -1.85, p > 0.05). The comparison indicated a 10.1% improvement in the class, which was a statistically significant result (t(19), p < 0.025).
In the fifth class, the pre-test average was 51.1% ± 8.8%. Of the 21 students, two met the success criterion (9.5%). The difference was statistically significant (t(20) = -4.23, p < 0.001). In the post-test, the average was 60.6% ± 7.1%, with eight of the 21 students reaching the 70% score (38.1%). The difference was statistically significant (t(20) = -2.61, p < 0.025). However, the 9.5% improvement between the pre-test and post-test did not reach statistical significance (t(20), p > 0.05).
Finally, the overall results across all five classes were as follows. The average pre-test score was 49.2% ± 3.7%. Nine of the 111 students met the criterion, which amounts to 8.1% of the total. The difference was statistically significant (t(110) = -11.02, p < 0.001). In the post-test, the average was 60.2% ± 4.1%, and 49 students (44.1%) exceeded the criterion, with a statistically significant difference (t(110) = -4.72, p < 0.001). An overall 11% improvement was observed across the total population, with high statistical significance (t(110), p < 0.001).
The results of the analysis allow us to conclude that the academic improvement demonstrated across the five classes can be expected to occur in a similar setting. The paired t-test across the five classes yields a p-value below 0.001, which constitutes a high degree of reliability (Elliott & Woodward, 2016). The majority of per-class results are consistent with this conclusion. The internal validity of the results is partially supported by the fact that both pre-test and post-test scores from the same students were used in the analysis, which strengthens the causal interpretation.
However, it should be noted that additional validity could be attained by involving a control group of students (Elliott & Woodward, 2016). The external validity derives primarily from the fact that the scores were obtained in a real-world setting (Turner, 2014). It could be further increased by accounting for the specific features of the environment before applying the intervention in a similar setting.
References
Elliott, A. C., & Woodward, W. A. (2016). IBM SPSS by example: A practical guide to statistical data analysis (2nd ed.). Thousand Oaks, CA: Sage Publications.
Hahs-Vaughn, D. L., & Lomax, R. G. (2012). An introduction to statistical concepts (3rd ed.). New York, NY: Routledge.
Hinton, P. R., McMurray, I., & Brownlow, C. (2014). SPSS explained (2nd ed.). New York, NY: Routledge.
Turner, J. L. (2014). Using statistics in small-scale language education research: Focus on non-parametric data. New York, NY: Routledge.