## Multicollinearity in Linear Regression Models


The objective of multiple regression analysis is to approximate the relationship between a dependent variable and a set of regressors, that is, a relationship of dependency rather than interdependency. It is assumed that the dependent variable $y$ and the regressors $X$'s are linearly related to each other (Graybill, 1980; Johnston, 1963; Malinvaud, 1968). Therefore, the inferences drawn from any regression model are:

(i) identification of the relative influence of the regressors,
(ii) prediction and/or estimation, and
(iii) selection of an appropriate set of regressors for the model.
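As a concrete (if minimal) illustration of fitting such a model, the sketch below estimates the coefficients of a two-regressor linear model by ordinary least squares. All data and coefficient values are simulated purely for illustration.

```python
# Minimal OLS sketch: fit y = b0 + b1*x1 + b2*x2 + e on simulated data.
import numpy as np

rng = np.random.default_rng(42)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.5 * x2 + rng.normal(scale=0.1, size=n)

X = np.column_stack([np.ones(n), x1, x2])     # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares estimates
print(beta)  # close to the true values [2.0, 1.5, -0.5]
```

With well-behaved (non-collinear) regressors like these, the estimates recover the true coefficients closely.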

Among these, one purpose of a regression model is to ascertain to what extent the dependent variable can be predicted from the regressors in the model. To draw sound inferences, however, the regressors should be orthogonal, i.e., there should be no linear dependencies among them. In most applications of regression analysis the regressors are not orthogonal, which leads to misleading and erroneous inferences, especially when the regressors are perfectly or nearly perfectly collinear with each other. This condition of non-orthogonality is also referred to as the problem of multicollinearity, or collinear data (for example, see Gunst and Mason, 1977; Mason et al., 1975; Ragnar, 1934). Multicollinearity is also synonymous with ill-conditioning of the $X'X$ matrix.
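A quick way to see non-orthogonality in practice is to inspect the pairwise correlation matrix of the regressors: orthogonal regressors have (near) zero off-diagonal correlations, while large off-diagonal entries flag linear dependence. The variables below are simulated purely for illustration.

```python
# Sketch: detect non-orthogonality via the correlation matrix of the regressors.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                       # independent of x1
x3 = 0.95 * x1 + 0.05 * rng.normal(size=n)    # nearly collinear with x1

R = np.corrcoef(np.column_stack([x1, x2, x3]), rowvar=False)
print(np.round(R, 2))
# The off-diagonal entry R[0, 2] is close to 1, flagging near-collinearity
# between x1 and x3, while R[0, 1] stays near zero.
```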

The presence of interdependence, or the lack of independence, is signified by high-order inter-correlations ($R = X'X$) within a set of regressors (Dorsett et al., 1983; Farrar and Glauber, 1967; Gunst and Mason, 1977; Mason et al., 1975). Perfect multicollinearity is a pathological extreme: it can easily be detected and resolved by dropping the regressor(s) causing it (Belsley et al., 1980). Under perfect multicollinearity, the regression coefficients are indeterminate and their standard errors are infinite; perfectly collinear regressors likewise destroy the uniqueness of the least-squares estimators (Belsley et al., 1980; Belsley, 1991). When explanatory variables (regressors/predictors) are highly collinear, it is very difficult to infer the separate influence of each on the response variable ($y$); that is, estimation of the regression coefficients becomes difficult, because each coefficient measures the effect of its regressor while holding all other regressors constant.

Near (non-perfect) multicollinearity is extremely hard to detect (Chatterjee and Hadi, 2006), as it is not a specification or modeling error but a condition of deficient data (Hadi and Chatterjee, 1988). On the other hand, the existence of multicollinearity has no impact on the overall regression model and its associated statistics, such as $R^2$, the $F$-ratio, and $p$-values. Multicollinearity does not lessen the predictive power or reliability of the regression model as a whole; it only affects inferences about the individual regressors (Koutsoyiannis, 1977). Note that multicollinearity refers only to linear relationships among the regressors; it does not rule out nonlinear relationships among them.
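The key consequence described above can be sketched numerically: since $\mathrm{Var}(\hat\beta) = \sigma^2 (X'X)^{-1}$, near-collinearity inflates the variances of the individual coefficient estimates even though the overall fit is unaffected. The simulation below (illustrative data only) compares the variance of $\hat\beta_1$ when the second regressor is independent of $x_1$ versus nearly collinear with it.

```python
# Sketch: collinearity inflates coefficient variances, diag(sigma^2 (X'X)^-1).
import numpy as np

rng = np.random.default_rng(7)
n = 500
x1 = rng.normal(size=n)

def coef_variances(x2):
    X = np.column_stack([np.ones(n), x1, x2])
    # Var(beta_hat) = sigma^2 * (X'X)^{-1}; take sigma^2 = 1 for comparison.
    return np.diag(np.linalg.inv(X.T @ X))

v_orth = coef_variances(rng.normal(size=n))              # x2 independent of x1
v_coll = coef_variances(x1 + 0.05 * rng.normal(size=n))  # x2 nearly equal to x1

print(v_orth[1], v_coll[1])  # variance of b1 is far larger under collinearity
```

The variance of $\hat\beta_1$ is orders of magnitude larger in the collinear case, which is exactly why individual coefficients become unreliable while the model's overall fit does not.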

To draw suitable inferences from the model, the existence of (multi)collinearity should always be tested as an initial step when examining a data set for multiple regression analysis. High collinearity is rare, but some degree of collinearity almost always exists.

A distinction between collinearity and multicollinearity should be made. Strictly speaking, multicollinearity refers to the existence of more than one exact linear relationship among the regressors, while collinearity refers to the existence of a single such relationship. Nowadays, however, the term multicollinearity is used to cover both cases.

There are many methods for the detection/testing of (multi)collinearity among regressors. However, remedies based on removing regressors can destroy the usefulness of the model, since relevant regressor(s) may be dropped. Note that if there are only two regressors, the pairwise correlation is sufficient to detect a collinearity problem. To check the severity of the collinearity, however, diagnostic measures such as VIF/TOL, the eigenvalues of $X'X$, or other indices can be used.
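Two of the diagnostics just mentioned can be sketched directly: the variance inflation factor, $\mathrm{VIF}_j = 1/(1 - R_j^2)$ where $R_j^2$ comes from regressing $x_j$ on the other regressors, and an eigenvalue-based measure, the condition number of the column-scaled $X$ matrix. Conventional rules of thumb (e.g. VIF > 10, condition number > 30) flag severe collinearity; the data and thresholds below are illustrative, not the mctest package's implementation.

```python
# Sketch: VIF and condition-number diagnostics on simulated collinear data.
import numpy as np

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    """VIF of column j: 1/(1 - R_j^2) from regressing x_j on the others."""
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    resid = X[:, j] - A @ np.linalg.lstsq(A, X[:, j], rcond=None)[0]
    r2 = 1 - resid.var() / X[:, j].var()
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
Xs = X / np.linalg.norm(X, axis=0)   # scale columns to unit length
cond = np.linalg.cond(Xs)            # ratio of largest to smallest singular value
print(np.round(vifs, 1), round(cond, 1))
# x1 and x3 show large VIFs, x2 does not; the condition number is also large.
```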

For further detail, see:

• Belsley, D., Kuh, E., and Welsch, R. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons, New York, chap. 3.
• Belsley, D. A. (1991). A Guide to Using the Collinearity Diagnostics. Computer Science in Economics and Management, 4(1), 33–50.
• Chatterjee, S. and Hadi, A. S. (2006). Regression Analysis by Example. Wiley and Sons, 4th edition.
• Dorsett, D., Gunst, R. F., and Gartland, E. C. J. (1983). Multicollinear Effects of Weighted Least Squares Regression. Statistics & Probability Letters, 1(4), 207–211.
• Graybill, F. (1980). An Introduction to Linear Statistical Models. McGraw Hill.
• Gunst, R. and Mason, R. (1977). Advantages of Examining Multicollinearities in Regression Analysis. Biometrics, 33, 249–260.
• Hadi, A. and Chatterjee, S. (1988). Sensitivity Analysis in Linear Regression. John Wiley & Sons.
• Imdadullah, M., Aslam, M., and Altaf, S. (2016). mctest: An R Package for Detection of Collinearity Among Regressors.
• Johnston, J. (1963). Econometric Methods. McGraw Hill, New York.
• Koutsoyiannis, A. (1977). Theory of Econometrics. Macmillan Education Limited.
• Malinvaud, E. (1968). Statistical Methods of Econometrics. North Holland, Amsterdam, pp. 187–192.
• Mason, R., Gunst, R., and Webster, J. (1975). Regression Analysis and Problems of Multicollinearity. Communications in Statistics, 4(3), 277–292.
• Ragnar, F. (1934). Statistical Confluence Analysis by Means of Complete Regression Systems. Universitetets Økonomiske Institutt, Publ. No. 5.