In regression models, the coefficient of determination, R², can serve as a measure of reliability. According to Zhang et al., “the coefficient of determination is the measurement of how well the regression model fits the data” (Zhang et al., 2018, p. 2). The parameter can also be explained differently: R² is the fraction of the variance of the dependent variable that the regression model accounts for. For example, if the coefficient of determination for a particular data set is 0.9, then 90% of the variance in the dependent variable is explained by the constructed model, which indicates a fairly close fit. It follows that, for an ordinary least-squares model with an intercept evaluated on the data used to fit it, R² ranges from 0.00 to 1.00, where 1.00 is a perfect fit. This interpretation justifies the use of R² in regression analysis. When a linear regression is constructed for a set of data, the researcher wants to know how much of the variation in the data the model captures; R² is a tool for assessing exactly that. If the researcher finds that the coefficient of determination is extremely low, that may be the first signal that the regression model is not valid for the data set. In essence, I would evaluate R² as a tool for assessing the fit, or coverage, of the constructed model for the data set: in general, the higher it is, the better, although this does not always hold.
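The "fraction of variance" reading can be made concrete with a short calculation. The sketch below is only illustrative: it assumes a simple one-predictor least-squares line, and the function names and the data points are made up for the example rather than taken from any cited source. It computes R² as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (intercept, slope)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def r_squared(x, y):
    """R^2 = 1 - SS_res / SS_tot for the fitted line."""
    intercept, slope = fit_line(x, y)
    y_bar = mean(y)
    # Variation left unexplained by the line
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    # Total variation of the dependent variable around its mean
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical data, invented for illustration only
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
print(f"R^2 = {r_squared(x, y):.4f}")
```

Because the invented points lie almost on a straight line, the residual sum of squares is small relative to the total, and the resulting R² is close to 1.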
The relationship between the coefficient of determination and sample size is ambiguous. A researcher may use a small sample (N = 10) for which the linear regression gives a value close to the maximum (R² ≈ 1.00), but when the sample is increased to N = 20, the coefficient of determination may decrease. In such cases, the high value is largely a matter of luck in how the small sample happened to be drawn, since for real data the coefficient of determination is almost never equal to one. There is another way to look at this relationship: according to the central limit theorem (CLT), as the sample size increases, the sampling distribution of the sample mean approaches a normal distribution and its spread narrows, so individual outliers have less influence on the estimates (Ganti, 2022). As a consequence, if the sample is correctly formed, increasing the sample size can also lead to an increase in R² (Jonson, 2021). It is worth emphasizing that the coefficient of determination does not by itself indicate the significance of the constructed model. For example, a regression may have an extremely high R² value, that is, it may account for most of the variance, while the p-value of the model exceeds the chosen significance level, which can happen when the sample is very small. In this case, despite the high R², the regression model is not statistically significant. Conversely, a low R² can accompany a statistically significant model, which means that there is no universal connection between the level of the coefficient of determination and the significance of the model.
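Both points in this paragraph can be checked numerically. The following is a rough sketch under stated assumptions, not a definitive analysis: the data-generating process (slope of 1, Gaussian noise with standard deviation 3, predictor uniform on 0 to 10), the number of repetitions, and the function names are all chosen for illustration. The first part draws many samples of each size from the same process and reports how much R² varies between samples. The second part uses the standard identity F = R² / (1 − R²) × (N − 2) for a one-predictor regression and compares it with tabulated 5% critical values of the F distribution.

```python
import random
from statistics import mean, stdev

def r_squared(x, y):
    """R^2 of a simple least-squares line fitted to (x, y)."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)  # equals 1 - SS_res/SS_tot for one predictor

def simulate_r2(n, reps, rng):
    """R^2 values from `reps` samples of size n drawn from one noisy linear process."""
    values = []
    for _ in range(reps):
        x = [rng.uniform(0, 10) for _ in range(n)]
        y = [1.0 * xi + rng.gauss(0, 3) for xi in x]  # assumed slope and noise level
        values.append(r_squared(x, y))
    return values

def f_statistic(r2, n):
    """Overall F statistic for a one-predictor regression with n observations."""
    return r2 / (1.0 - r2) * (n - 2)

# Part 1: sample-to-sample variability of R^2 at different sample sizes
rng = random.Random(0)  # fixed seed so the run is repeatable
spread = {n: stdev(simulate_r2(n, 2000, rng)) for n in (10, 20, 200)}
for n, s in spread.items():
    print(f"N={n:>3}: standard deviation of R^2 across samples = {s:.3f}")

# Part 2: R^2 versus significance.
# Tabulated 5% critical values of F(1, N - 2): about 161.4 for N = 3, about 3.89 for N = 200.
cases = [(0.90, 3, 161.4), (0.05, 200, 3.89)]
for r2, n, f_crit in cases:
    f = f_statistic(r2, n)
    print(f"R^2={r2:.2f}, N={n}: F={f:.2f}, significant at 5%: {f > f_crit}")
```

In this setup, the smaller samples produce a wider spread of R² values, which is why a near-perfect R² at N = 10 can be an accident of sampling. The second part shows the two mismatches described above: R² = 0.90 with N = 3 falls far short of the critical value, while R² = 0.05 with N = 200 exceeds it.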
References
Ganti, A. (2022). Central Limit Theorem (CLT): Definition and key characteristics. Investopedia. Web.
Jonson, E. (2021). Does R2 change with sample size? SI. Web.
Zhang, Y., Yang, X., Shardt, Y. A., Cui, J., & Tong, C. (2018). A KPI-based probabilistic soft sensor development approach that maximizes the coefficient of determination. Sensors, 18(9), 1-17. Web.