Determining the presence of multicollinearity. Definition of multicollinearity. Theoretical implications of multicollinearity in general terms

When constructing a multiple regression equation, the problem of multicollinearity of factors may arise. Multicollinearity is a linear relationship between two or more explanatory variables, which can manifest itself in a functional (explicit) or stochastic (latent) form.
Identification of the relationship between the selected characteristics and quantitative assessment of the closeness of that relationship are carried out using the methods of correlation analysis. To solve these problems, the matrix of pairwise correlation coefficients is estimated first; then, on its basis, partial and multiple correlation and determination coefficients are determined and their significance is checked. The ultimate goal of correlation analysis is the selection of the factor characteristics x1, x2, …, xm for the subsequent construction of the regression equation.

If the factor variables are connected by a strict functional dependence, we speak of complete (full) multicollinearity. In this case, among the columns of the matrix of factor variables X there are linearly dependent columns and, by the properties of determinants, det(XᵀX) = 0, i.e. the matrix XᵀX is singular and has no inverse. The matrix (XᵀX)⁻¹ is used in constructing OLS estimates. Thus, complete multicollinearity does not allow us to unambiguously estimate the parameters of the original regression model.

What difficulties does multicollinearity of factors included in the model lead to, and how can they be resolved?

Multicollinearity can lead to undesirable consequences:

  1. Parameter estimates become unreliable: they have large standard errors and change as the number of observations changes (not only in magnitude but even in sign), which makes the model unsuitable for analysis and forecasting.
  2. It becomes difficult to interpret the multiple regression parameters as characteristics of the action of the factors in a “pure” form, because the factors are correlated; the linear regression parameters lose their economic meaning.
  3. It becomes impossible to determine the isolated influence of the factors on the performance indicator.

The type of multicollinearity in which the factor variables are related by a stochastic dependence is called partial. If there is a high degree of correlation between the factor variables, the matrix XᵀX is close to singular, i.e. det(XᵀX) ≈ 0.
The matrix (XᵀX)⁻¹ is then ill-conditioned, which leads to instability of the OLS estimates. Partial multicollinearity leads to the following consequences:

  • the variances of the parameter estimates increase, which widens the interval estimates and worsens their accuracy;
  • the t-statistics of the coefficients decrease, which leads to incorrect conclusions about the significance of the factors;
  • the OLS estimates and their variances become unstable.

There are no precise quantitative criteria for detecting partial multicollinearity. Its presence can be indicated by the closeness of the determinant of the matrix XᵀX to zero. The values of the pairwise correlation coefficients are also examined: if the determinant of the interfactor correlation matrix is close to one, there is no multicollinearity.

There are various approaches to overcoming strong interfactor correlation. The simplest of them is to exclude from the model the factor (or factors) most responsible for the multicollinearity, provided that the quality of the model suffers insignificantly (namely, that the theoretical coefficient of determination R²y(x1…xm) decreases insignificantly).

What measure cannot be used to eliminate multicollinearity?
a) increasing the sample size;
b) excluding variables that are highly correlated with others;
c) change in model specification;
d) transformation of the random component.

Paired (linear) and partial correlation coefficients

The closeness of the relationship, for example, between the variables x and y for a sample of values (xᵢ, yᵢ), i = 1, …, n, is measured by the pairwise (linear) correlation coefficient

r_xy = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / (n·S_x·S_y),   (1)

where x̄ and ȳ are the sample means and S_x and S_y are the standard deviations of the corresponding samples.

The pairwise correlation coefficient varies from –1 to +1. The closer it is in absolute value to unity, the closer the statistical relationship between x and y is to a linear functional one. A positive value of the coefficient indicates that the relationship between the characteristics is direct (as x increases, the value of y increases), a negative value indicates that the relationship is inverse (as x increases, the value of y decreases).
We can give the following qualitative interpretation of the possible values of the correlation coefficient: if |r| < 0.3, the relationship is practically absent; 0.3 ≤ |r| < 0.7 – the relationship is moderate; 0.7 ≤ |r| < 0.9 – the relationship is strong; 0.9 ≤ |r| < 0.99 – the relationship is very strong.
To assess the multicollinearity of the factors, a matrix of pairwise correlation coefficients of the dependent (resultative) characteristic y with the factor characteristics x1, x2, …, xm is used; it allows one to assess the degree of influence of each factor xj on the dependent variable y, as well as the closeness of the relationships between the factors. In the general case the correlation matrix has the form

R = | 1       r_yx1    r_yx2    …   r_yxm   |
    | r_yx1   1        r_x1x2   …   r_x1xm  |
    | r_yx2   r_x1x2   1        …   r_x2xm  |
    | …       …        …        …   …       |
    | r_yxm   r_x1xm   r_x2xm   …   1       |

The matrix is symmetric and has ones on its diagonal. If the matrix contains an interfactor correlation coefficient r_xjxi > 0.7, then there is multicollinearity in the multiple regression model.
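A minimal NumPy sketch of this check, using a small hypothetical data set and illustrative variable names:

```python
import numpy as np

# Hypothetical sample: y is the dependent variable, the columns of X are the factors x1, x2.
y = np.array([6.0, 6.0, 7.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0])
X = np.array([[10, 3.5], [12, 3.6], [15, 3.9], [17, 4.1], [19, 4.5],
              [20, 5.3], [21, 6.0], [22, 6.4], [25, 7.5], [30, 8.2]])

# Matrix of pairwise correlation coefficients for (y, x1, ..., xm).
R = np.corrcoef(np.column_stack([y, X]), rowvar=False)
print(np.round(R, 3))

# Flag interfactor correlations whose absolute value exceeds the 0.7 threshold.
m = X.shape[1]
for i in range(1, m + 1):
    for j in range(i + 1, m + 1):
        if abs(R[i, j]) > 0.7:
            print(f"x{i} and x{j} look collinear: r = {R[i, j]:.3f}")
```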
Since the source data from which the relationship of characteristics is established is a sample from a certain general population, the correlation coefficients calculated from these data will be selective, i.e. they only estimate the relationship. A significance test is needed, which answers the question: are the obtained calculation results random or not?
The significance of the pairwise correlation coefficients is checked with Student's t-test. The hypothesis that the population correlation coefficient is equal to zero, H0: ρ = 0, is put forward. Then the significance level α and the number of degrees of freedom v = n − 2 are set. Using these parameters, t_crit is found from the table of critical points of the Student distribution, and the observed value of the criterion is calculated from the available data:

t_obs = r·√(n − 2) / √(1 − r²),   (2)

where r is the pairwise correlation coefficient calculated from the data selected for the study. The pairwise correlation coefficient is considered significant (the hypothesis that the coefficient equals zero is rejected) with confidence probability γ = 1 − α if |t_obs| is greater than t_crit.
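A small sketch of this significance check, assuming SciPy is available; the function name and the illustrative numbers r = 0.8, n = 25 are arbitrary:

```python
import numpy as np
from scipy import stats

def pair_corr_significance(r, n, alpha=0.05):
    """Student's t-test of H0: rho = 0 for a pairwise correlation coefficient (formula (2))."""
    t_obs = r * np.sqrt(n - 2) / np.sqrt(1.0 - r ** 2)   # observed value of the criterion
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)         # two-sided critical point
    return t_obs, t_crit, abs(t_obs) > t_crit

# Illustrative call: a coefficient r = 0.8 computed from a sample of n = 25 observations.
print(pair_corr_significance(0.8, 25))
```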
If variables are correlated with each other, then the value of the correlation coefficient is partially affected by the influence of other variables.

The partial correlation coefficient characterizes the closeness of the linear relationship between the result and the corresponding factor when the influence of the other factors is eliminated; it evaluates the closeness of the relationship between two variables with the values of the other factors fixed. If, for example, r_yx1|x2 is calculated (the partial correlation coefficient between y and x1 with the influence of x2 fixed), this means that a quantitative measure of the linear relationship between y and x1 is determined as it would be if the influence of x2 on both characteristics were eliminated. If the influence of only one factor is excluded, we obtain a first-order partial correlation coefficient.
Comparison of the values of the pairwise and partial correlation coefficients shows the direction of influence of the fixed factor. If the partial correlation coefficient r_yx1|x2 is less than the corresponding pairwise coefficient r_yx1, the relationship between y and x1 is to some extent determined by the influence of the fixed variable x2 on them. Conversely, a larger value of the partial coefficient compared with the pairwise one indicates that the fixed variable x2 weakens the relationship between y and x1 by its influence.
The partial correlation coefficient between two variables (y and x2) with the influence of one factor (x1) excluded can be calculated using the following formula:

r_yx2|x1 = (r_yx2 − r_yx1·r_x1x2) / √((1 − r²_yx1)(1 − r²_x1x2)).   (3)

For the other variables the formulas are constructed in a similar way. With x2 fixed,

r_yx1|x2 = (r_yx1 − r_yx2·r_x1x2) / √((1 − r²_yx2)(1 − r²_x1x2));

with x3 fixed,

r_yx1|x3 = (r_yx1 − r_yx3·r_x1x3) / √((1 − r²_yx3)(1 − r²_x1x3)).
The significance of partial correlation coefficients is checked in the same way as for pairwise correlation coefficients. The only difference is the number of degrees of freedom, which should be taken as v = n − l − 2, where l is the number of fixed factors.
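A sketch of formula (3) and of the corresponding significance check with v = n − l − 2 degrees of freedom; the function names and the sample numbers are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def partial_corr(r_yx, r_yz, r_xz):
    """First-order partial correlation r_yx|z computed from the three pairwise coefficients."""
    return (r_yx - r_yz * r_xz) / np.sqrt((1 - r_yz ** 2) * (1 - r_xz ** 2))

def partial_corr_significance(r_part, n, l=1, alpha=0.05):
    """t-test of a partial correlation coefficient with v = n - l - 2 degrees of freedom."""
    v = n - l - 2
    t_obs = r_part * np.sqrt(v) / np.sqrt(1 - r_part ** 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=v)
    return t_obs, t_crit, abs(t_obs) > t_crit

# Illustrative values: r_yx1 = 0.85, r_yx2 = 0.90, r_x1x2 = 0.75, n = 30 observations.
r = partial_corr(0.85, 0.90, 0.75)
print(round(r, 3), partial_corr_significance(r, n=30, l=1))
```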

Stepwise regression

The selection of the factors x1, x2, …, xm to include in a multiple regression model is one of the most important stages of econometric modelling. The method of sequential (stepwise) inclusion (or exclusion) of factors allows one to select from the possible set of variables exactly those that will improve the quality of the model.
When the method is implemented, the first step is to compute the correlation matrix. Based on the pairwise correlation coefficients, the presence of collinear factors is revealed. Factors xi and xj are considered collinear if r_xjxi > 0.7. Only one of each pair of interrelated factors is included in the model. If there are no collinear factors, then any factors that have a significant impact on y may be included.

At the second step, a regression equation is constructed with one variable that has the maximum absolute value of the pairwise correlation coefficient with the resulting attribute.

At the third step, a new variable is introduced into the model, which has the largest absolute value of the partial correlation coefficient with the dependent variable with a fixed influence of the previously introduced variable.
When an additional factor is introduced into the model, the coefficient of determination should increase and the residual variance should decrease. If this does not happen, i.e. the coefficient of multiple determination increases only insignificantly, the introduction of the new factor is considered inappropriate.
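A minimal NumPy sketch of this sequential-inclusion idea. It selects the next factor by the gain in R² (equivalent to the largest squared partial correlation with y given the factors already included); the stopping threshold min_gain is an illustrative assumption:

```python
import numpy as np

def r_squared(y, X):
    """R^2 of an OLS regression of y on the columns of X (with intercept)."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(y, X, min_gain=0.01):
    """Sequential inclusion of factors: at each step the factor giving the largest
    gain in R^2 is added; selection stops when the gain falls below min_gain."""
    included, remaining, r2 = [], list(range(X.shape[1])), 0.0
    while remaining:
        gains = {j: r_squared(y, X[:, included + [j]]) - r2 for j in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        included.append(best)
        remaining.remove(best)
        r2 += gains[best]
    return included, r2
```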

Example No. 1. For 20 enterprises in the region, the dependence of output per employee y (thousand rubles) on the share of highly qualified workers in the total number of workers x1 (%) and on the commissioning of new fixed assets x2 (% of the value of assets at the end of the year) is studied.

Y     X1    X2
6     10    3.5
6     12    3.6
7     15    3.9
7     17    4.1
7     18    4.2
8     19    4.5
8     19    5.3
9     20    5.3
9     20    5.6
10    21    6
10    21    6.3
11    22    6.4
11    23    7
12    25    7.5
12    28    7.9
13    30    8.2
13    31    8.4
14    31    8.6
14    35    9.5
15    36    10

Required:

  1. Construct a correlation field between output per worker and the share of highly qualified workers. Put forward a hypothesis about the closeness and type of relationship between indicators X1 and Y.
  2. Assess the closeness of the linear relationship between output per worker and the proportion of highly qualified workers with a reliability of 0.9.
  3. Calculate the coefficients of the linear regression equation for the dependence of output per worker on the share of highly qualified workers.
  4. Check the statistical significance of the parameters of the regression equation with a reliability of 0.9 and construct confidence intervals for them.
  5. Calculate the coefficient of determination. Using Fisher's F test, evaluate the statistical significance of the regression equation with a reliability of 0.9.
  6. Give a point and interval forecast with a reliability of 0.9 output per employee for an enterprise where 24% of workers are highly qualified.
  7. Calculate the coefficients of the linear multiple regression equation and explain the economic meaning of its parameters.
  8. Analyze the statistical significance of multiple equation coefficients with a reliability of 0.9 and construct confidence intervals for them.
  9. Find the pair and partial correlation coefficients. Analyze them.
  10. Find the adjusted coefficient of multiple determination. Compare it with the unadjusted (overall) coefficient of determination.
  11. Using Fisher's F test, evaluate the adequacy of the regression equation with a reliability of 0.9.
  12. Give a point and interval forecast with a reliability of 0.9 output per employee for an enterprise in which 24% of workers are highly qualified, and the commissioning of new fixed assets is 5%.
  13. Check the constructed equation for the presence of multicollinearity using: Student's test; χ2 test. Compare the results.

Solution. The calculations are performed with the help of a calculator; below is the course of the solution to item 13.
Matrix of pair correlation coefficients R:

      y      x1     x2
y     1      0.97   0.991
x1    0.97   1      0.977
x2    0.991  0.977  1

In the presence of multicollinearity, the determinant of the correlation matrix is ​​close to zero. For our example: det = 0.00081158, which indicates the presence of strong multicollinearity.
To select the most significant factors x i, the following conditions are taken into account:
- the connection between the resultant characteristic and the factor one must be higher than the interfactor connection;
- the relationship between the factors should be no more than 0.7; if the matrix has an interfactor correlation coefficient r_xjxi > 0.7, then there is multicollinearity in this multiple regression model;
- with a high interfactor connection of a characteristic, factors with a lower correlation coefficient between them are selected.
In our case r_x1x2 = 0.977, i.e. |r| > 0.7, which indicates multicollinearity of the factors and the need to exclude one of them from further analysis.
Analysis of the first row of this matrix allows the selection of the factor characteristics that can be included in the multiple regression model. Factor characteristics with |r_yxi| < 0.3 are excluded, since for them the relationship with y is practically absent; 0.3 ≤ |r| ≤ 0.7 indicates a moderate relationship; 0.7 ≤ |r| ≤ 0.9 a strong one; |r| > 0.9 a very strong one.
Let's check the significance of the obtained pairwise correlation coefficients using Student's t-test. Coefficients for which the values ​​of the t-statistics modulo are greater than the found critical value are considered significant.
Let us calculate the observed value of the t-statistic for r_yx1 using the formula

t_obs = r_yx1·√(n − m − 1) / √(1 − r²_yx1) = 0.97·√18 / √(1 − 0.97²) ≈ 16.9,

where m = 1 is the number of factors in the regression equation.

Using the Student table we find the critical value:
t_crit(n − m − 1; α/2) = t(18; 0.025) = 2.101
Since t obs > t crit, we reject the hypothesis that the correlation coefficient is equal to 0. In other words, the correlation coefficient is statistically significant
Let us calculate the observed value of the t-statistic for r_yx2 using the same formula:

t_obs = 0.991·√18 / √(1 − 0.991²) ≈ 31.4
Since t obs > t crit, we reject the hypothesis that the correlation coefficient is equal to 0. In other words, the correlation coefficient is statistically significant
Thus, the relationships between (y and x1) and (y and x2) are significant.
The factor x2 (r = 0.99) has the greatest influence on the effective attribute, which means that when constructing the model, it will be the first to enter the regression equation.
Testing and eliminating multicollinearity.
The most complete algorithm for studying multicollinearity is the Farrar-Glauber algorithm. It tests three types of multicollinearity:
1. All factors (χ 2 - chi-square).
2. Each factor with the others (Fisher’s criterion).
3. Each pair of factors (Student's t-test).
Let us check the variables for multicollinearity using the Farrar-Glauber method with the first type of statistical criterion (the chi-square test).
The Farrar-Glauber statistic is calculated by the formula

χ² = −[n − 1 − (2m + 5)/6]·ln(det[R]),

where m = 2 is the number of factors, n = 20 is the number of observations, and det[R] is the determinant of the matrix of pairwise correlation coefficients R.
We compare it with the table value at v = m(m − 1)/2 = 1 degrees of freedom and significance level α. If χ² > χ²_table, then there is multicollinearity in the vector of factors.
χ table 2 (1;0.05) = 3.84146
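The check can be reproduced with the sketch below. One assumption to note: the worked example applies det[R] to the full matrix (including y), whereas the Farrar-Glauber test in its strict form uses only the interfactor correlation matrix; the sketch computes both, and in this example either variant far exceeds the table value 3.84.

```python
import numpy as np
from scipy import stats

n, m = 20, 2  # number of observations and number of factors in Example No. 1

# Matrix of pairwise correlations from the example (y, x1, x2).
R = np.array([[1.0,   0.97,  0.991],
              [0.97,  1.0,   0.977],
              [0.991, 0.977, 1.0]])

def farrar_glauber_chi2(detR, n, m):
    """Farrar-Glauber statistic: chi2 = -(n - 1 - (2m + 5)/6) * ln(det R)."""
    return -(n - 1 - (2 * m + 5) / 6) * np.log(detR)

chi2_full = farrar_glauber_chi2(np.linalg.det(R), n, m)             # det[R] as used in the text
chi2_factors = farrar_glauber_chi2(np.linalg.det(R[1:, 1:]), n, m)  # interfactor matrix only
chi2_crit = stats.chi2.ppf(0.95, df=m * (m - 1) // 2)
print(chi2_full, chi2_factors, chi2_crit)  # either way, chi2 >> 3.84 -> multicollinearity
```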
Let us check the variables for multicollinearity using the second type of statistical criterion (Fisher's test): each factor is regressed on the remaining ones, and the significance of the resulting coefficient of determination is assessed with the F statistic.

Let's check the variables for multicollinearity using the third type of statistical criteria (Student's test). To do this, we will find partial correlation coefficients.
Partial correlation coefficients.
The partial correlation coefficient differs from the simple linear pair correlation coefficient in that it measures the pairwise correlation of the corresponding characteristics (y and x i), provided that the influence of other factors (x j) on them is eliminated.
Based on the partial coefficients, we can conclude that the inclusion of variables in the regression model is justified. If the value of the coefficient is small or insignificant, this means that the relationship between this factor and the outcome variable is either very weak or completely absent, so the factor can be excluded from the model.

r_yx1|x2 = (0.97 − 0.991·0.977) / √((1 − 0.991²)(1 − 0.977²)) ≈ 0.063;
r_yx2|x1 = (0.991 − 0.97·0.977) / √((1 − 0.97²)(1 − 0.977²)) ≈ 0.836.

The closeness of the relationship between y and x1 with x2 fixed is low.
Let us determine the significance of the correlation coefficient r_yx1|x2: t_obs = 0.063·√17 / √(1 − 0.063²) ≈ 0.26, which is well below the critical value, so the coefficient is insignificant. As we can see, the relationship between y and x1, given that x2 is included in the model, has practically disappeared. From this we can conclude that introducing x1 into the regression equation in addition to x2 is inappropriate.
We can conclude that when constructing the regression equation the factor x2 should be selected.

Example No. 2. For 30 observations, the matrix of paired correlation coefficients turned out to be as follows:

      y      x1     x2     x3
y     1.0
x1    0.30   1.0
x2    0.60   0.10   1.0
x3    0.40   0.15   0.80   1.0
Assess multicollinearity of factors. Construct a regression equation on a standard scale and draw conclusions.
  • 11. The problem of multicollinearity. Consequences of the presence and diagnosis of multicollinearity.

    If there is an exact linear relationship among the exogenous variables (for example, when one regressor is a linear combination of the others), then OLS estimates do not exist, because the matrix XᵀX is singular and has no inverse. In econometrics this situation is called the problem of multicollinearity.

    Reasons for multicollinearity:

    incorrect model specification

    careless collection of statistical data (use of repeated observations).

    A distinction is made between explicit and implicit multicollinearity.

    Explicit: an exact linear relationship between the model variables is known.

    For example, if a model of the investment process includes both the nominal and the real interest rate among the regressors,

    while the relationship between the real rate r, the nominal rate i and the inflation rate π is known (r ≈ i − π),

    then there is explicit multicollinearity.

    Implicit multicollinearity occurs when there is a stochastic (uncertain, random) linear dependence between the exogenous variables.

    In practice implicit multicollinearity prevails; its presence is characterized by six signs:

    1. OLS estimates of the model parameters become unreliable (unstable).

    2. The variances of the OLS estimates increase:

    since Var(b_j) = σ² / [Σᵢ(x_ij − x̄_j)²·(1 − R_j²)], where R_j² is the coefficient of determination of the regression of x_j on the remaining regressors, a correlation between regressors close to 1 makes (1 − R_j²) close to 0, which entails a sharp growth of Var(b_j).

    3. The t-statistics, which serve as indicators of the significance of the parameters, decrease: t_j = b_j / S_bj, and as S_bj grows, t_j falls.

    4. The coefficient of determination is no longer a measure of the adequacy of the model, since low values of the t-statistics lead to distrust of the selected dependence model.

    5. Parameter estimates for non-collinear exogenous variables become very sensitive to changes in data.

    6. Parameter estimates for non-collinear exogenous variables become insignificant.

    Methods for diagnosing multicollinearity:

    Step 1. In the original multiple linear regression model we go through all the auxiliary submodels in which each exogenous variable in turn is treated as endogenous, i.e. x_j is regressed on the remaining regressors x_1, …, x_{j−1}, x_{j+1}, …, x_m.

    Step 2. We calculate the coefficients of determination R_j² of all the resulting submodels and, on their basis, the so-called variance inflation factors: VIF_j = 1 / (1 − R_j²).

    If a VIF_j is large (a common rule of thumb is VIF_j > 10), it is concluded that multicollinearity is present.
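    A sketch of this diagnostic in NumPy; the function name is illustrative, and VIF > 10 is a common rule of thumb rather than a value fixed by the text above:

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the other columns."""
    n, m = X.shape
    vifs = []
    for j in range(m):
        y_j = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y_j, rcond=None)
        resid = y_j - Z @ beta
        r2_j = 1 - resid @ resid / ((y_j - y_j.mean()) @ (y_j - y_j.mean()))
        vifs.append(1.0 / (1.0 - r2_j))
    return np.array(vifs)

# A VIF_j noticeably above 10 signals that x_j is strongly explained by the other factors.
```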

    Having detected multicollinearity, one of the following is usually done: a) the structure of the model is not changed at all, but, using computer-based least squares, the presence of the multicollinearity problem is analysed by visual methods;

    b) improve the model specification by eliminating collinear exogenous variables from the original model.

    c) increase the volume of statistical data.

    d) combine collinear variables and include a common exogenous variable in the model.

    12. Methods for eliminating multicollinearity. Principal component method. Ridge regression.

    If the main task of the model is to predict future values ​​of the dependent variable, then with a sufficiently large coefficient of determination R2 (≥ 0.9), the presence of multicollinearity often does not affect the predictive qualities of the model.

    If the purpose of the study is to determine the degree of influence of each of the explanatory variables on the dependent variable, then the presence of multicollinearity will distort the true relationships between the variables. In this situation, multicollinearity appears to be a serious problem.

    Note that there is no single method for eliminating multicollinearity that is suitable in any case. This is because the causes and consequences of multicollinearity are ambiguous and largely depend on the results of the sample.

    METHODS:

    Excluding Variable(s) from the Model

    For example, when studying the demand for a certain good, the price of this good and the prices of substitutes for this good, which often correlate with each other, can be used as explanatory variables. By excluding the prices of substitutes from the model, we are likely to introduce a specification error. As a result, it is possible to obtain biased estimates and draw unfounded conclusions. In applied econometric models, it is desirable not to exclude explanatory variables until collinearity becomes a serious problem.

    Getting more data or a new sample

    Sometimes it is enough to increase the sample size. For example, if you are using annual data, you can move to quarterly data. Increasing the amount of data reduces the variance of regression coefficients and thereby increases their statistical significance. However, obtaining a new sample or expanding an old one is not always possible or is associated with serious costs. In addition, this approach can strengthen autocorrelation. These problems limit the ability to use this method.

    Changing Model Specification

    In some cases, the problem of multicollinearity can be solved by changing the specification of the model: either by changing the form of the model, or by adding explanatory variables that are not taken into account in the original model, but significantly affect the dependent variable.

    Using advance information about some parameters

    Sometimes, when building a multiple regression model, one can use some preliminary information, in particular the known values of some regression coefficients. It is likely that coefficient values obtained for preliminary (usually simpler) models, or for a similar model based on a previously obtained sample, can be used for the model currently being developed.

    To illustrate, consider the following example. The regression Y = β0 + β1X1 + β2X2 + ε is being built. Assume that the variables X1 and X2 are correlated. For a previously constructed paired regression model Y = γ0 + γ1X1 + υ, a statistically significant coefficient γ1 was obtained (for definiteness, let γ1 = 0.8) connecting Y with X1. If there is reason to believe that the relationship between Y and X1 will remain unchanged, we can set γ1 = β1 = 0.8. Then:

    Y = β0 + 0.8X1 + β2X2 + ε  ⇒  Y − 0.8X1 = β0 + β2X2 + ε.

    The resulting equation is in fact a paired regression equation for which the problem of multicollinearity does not exist.
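    A minimal sketch (with hypothetical names) of the substitution described above: β1 is fixed at the previously obtained value 0.8, and Y − 0.8·X1 is regressed on X2 alone, so X1 and X2 no longer enter the estimation together:

```python
import numpy as np

def restricted_fit(Y, X1, X2, beta1_known=0.8):
    """Use prior information: move the known term beta1_known*X1 to the left-hand side
    and estimate the remaining coefficients by OLS on X2 only."""
    y_star = Y - beta1_known * X1
    A = np.column_stack([np.ones(len(Y)), X2])
    (beta0, beta2), *_ = np.linalg.lstsq(A, y_star, rcond=None)
    return beta0, beta1_known, beta2
```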

    The limitations of using this method are due to:

      obtaining the preliminary information is often difficult;

      the probability that the transferred regression coefficient will be the same in different models is not high.

    Converting Variables

    In some cases, the problem of multicollinearity can be minimized or even eliminated by transforming variables.

    For example, let the empirical regression equation be Y = b0 + b1X1 + b2X2

    where X1 and X2 are correlated variables. In this situation one can try to estimate regression dependencies between relative values (ratios) of the variables rather than their levels. It is likely that in such models the problem of multicollinearity will not be present.

    The principal component method is one of the main methods for eliminating variables from a multiple regression model.

    This method is used to eliminate or reduce the multicollinearity of the factor variables in a regression model. The essence of the method: the number of factor variables is reduced to the most significantly influencing factors. This is achieved by a linear transformation of all the factor variables xi (i = 0, …, n) into new variables called principal components, i.e. a transition is made from the matrix of factor variables X to the matrix of principal components F. In this case, the requirement is imposed that the first principal component correspond to the maximum of the total variance of all the factor variables xi, the second component to the maximum of the remaining variance after the influence of the first principal component is eliminated, and so on.
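    A sketch of regression on principal components, under the usual assumption that the factors are standardized first; the number k of retained components is left to the user:

```python
import numpy as np

def principal_component_regression(X, y, k):
    """Regress y on the first k principal components of the standardized factors."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the factors
    eigval, eigvec = np.linalg.eigh(np.cov(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]                  # components by decreasing variance
    W = eigvec[:, order[:k]]                          # loadings of the first k components
    F = Z @ W                                         # matrix of principal components
    A = np.column_stack([np.ones(len(y)), F])
    gamma, *_ = np.linalg.lstsq(A, y, rcond=None)     # OLS on the (orthogonal) components
    return gamma, W
```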

    If none of the factor variables included in the multiple regression model can be excluded, then one of the main biased methods for estimating the regression coefficients is used: ridge regression. With the ridge regression method a small number τ (10⁻⁶ < τ < 0.1) is added to all diagonal elements of the matrix XᵀX. The unknown parameters of the multiple regression model are then estimated by the formula

    β̂_τ = (XᵀX + τ·I_n)⁻¹XᵀY,

    where I_n is the identity matrix.
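    A sketch of the ridge estimate from the formula above; τ = 10⁻³ is just an illustrative value inside the interval 10⁻⁶ < τ < 0.1 mentioned in the text:

```python
import numpy as np

def ridge_estimate(X, y, tau=1e-3):
    """Ridge estimate b_tau = (X'X + tau*I)^(-1) X'y; X should already contain the
    intercept column if one is wanted, and tau is the small 'ridge' addition."""
    XtX = X.T @ X
    return np.linalg.solve(XtX + tau * np.eye(XtX.shape[0]), X.T @ y)
```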

    Basic provisions

    If the regressors in the model are connected by a strict functional dependence, we speak of complete (perfect) multicollinearity. This type of multicollinearity can arise, for example, in a linear regression problem solved by the least squares method if the determinant of the matrix XᵀX is equal to zero. Complete multicollinearity does not allow the parameters of the original model to be estimated unambiguously or the contributions of the regressors to the output variable to be separated on the basis of the observation results.

    In problems with real data the case of complete multicollinearity is extremely rare. Instead, in applied work one often has to deal with partial multicollinearity, which is characterized by large pairwise correlation coefficients between the regressors. In the case of partial multicollinearity the matrix XᵀX has full rank, but its determinant is close to zero. In this case it is formally possible to obtain estimates of the model parameters and their accuracy indicators, but all of them will be unstable.

    Among the consequences of partial multicollinearity are the following:

    • increase in variances of parameter estimates
    • decrease in t-statistic values ​​for parameters, which leads to an incorrect conclusion about their statistical significance
    • obtaining unstable estimates of model parameters and their variances
    • the possibility of obtaining an incorrect sign from the theoretical point of view of the parameter estimate

    There are no precise quantitative criteria for detecting partial multicollinearity. The signs of its presence most often used are the heuristics discussed above: the closeness of det(XᵀX) to zero, high pairwise correlation coefficients between the regressors, and large variance inflation factors.

    Methods for eliminating multicollinearity

    There are two main approaches to solving this problem: reducing the number of factors (excluding or combining collinear regressors) and moving from OLS to special biased estimation methods (ridge regression, regression on principal components).

    No matter how the selection of factors is carried out, reducing their number leads to an improvement in the conditionality of the matrix, and, consequently, to an increase in the quality of estimates of the model parameters.

    In addition to the methods listed, there is another, simpler one that gives fairly good results: the pre-centering method. Its essence is that before the parameters of the mathematical model are found, the source data are centered: the mean of the series is subtracted from each value of the data series, x̃ᵢ = xᵢ − x̄. This procedure separates the hyperplanes of the least-squares conditions so that the angles between them become close to right angles, and as a result the model estimates become stable (Construction of multifactor models under conditions of multicollinearity).
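    A minimal sketch of pre-centering before least squares; the slopes coincide with the ordinary OLS slopes, while the cross-product matrix of the centered data is typically better conditioned:

```python
import numpy as np

def centered_ols(X, y):
    """OLS on mean-centered data: slopes from centered X, intercept recovered afterwards."""
    Xc = X - X.mean(axis=0)          # subtract the mean of each series
    yc = y - y.mean()
    slopes, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    intercept = y.mean() - X.mean(axis=0) @ slopes
    return intercept, slopes
```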

    Federal Agency for Education and Science of the Russian Federation

    Kostroma State Technological University

    Department of Higher Mathematics

    Paper in econometrics on the topic: Multicollinearity

    Performed by a 1st-year student of the correspondence faculty, specialty "Accounting, Analysis and Audit"

    Checked by: Katerzhina S.F.

    Kostroma 2008


    Multicollinearity

    Multicollinearity refers to the high mutual correlation of explanatory variables. Multicollinearity can manifest itself in functional (explicit) and stochastic (hidden) forms.

    In the functional form of multicollinearity, at least one of the pairwise relationships between the explanatory variables is a linear functional relationship. In this case the matrix X`X is singular, since it contains linearly dependent column vectors and its determinant is equal to zero, i.e. a premise of regression analysis is violated. This makes it impossible to solve the corresponding system of normal equations and to obtain estimates of the parameters of the regression model.

    However, in economic research, multicollinearity more often manifests itself in a stochastic form, when there is a close correlation between at least two explanatory variables. The matrix X`X in this case is non-singular, but its determinant is very small.

    At the same time, the vector of estimates b and its covariance matrix ∑b are proportional to the inverse matrix (X`X)⁻¹, which means that their elements are inversely proportional to the value of the determinant |X`X|. As a result, large standard deviations (standard errors) of the regression coefficients b0, b1, …, bp are obtained, and assessing their significance with the t-test does not make sense, although in general the regression model may turn out to be significant according to the F-test.

    Estimates become very sensitive to small changes in observations and sample size. Regression equations in this case, as a rule, have no real meaning, since some of its coefficients may have incorrect signs from the point of view of economic theory and unreasonably large values.

    There are no precise quantitative criteria for determining the presence or absence of multicollinearity. However, there are some heuristic approaches to identify it.

    One such approach is to analyse the correlation matrix between the explanatory variables X1, X2, …, Xp and to identify pairs of variables that have high correlation coefficients (usually greater than 0.8). If such variables exist, they are said to be multicollinear. It is also useful to find the multiple coefficients of determination between one of the explanatory variables and some group of the others. The presence of a high multiple coefficient of determination (usually greater than 0.6) indicates multicollinearity.

    Another approach is to examine the matrix X`X. If the determinant of the matrix X`X or its minimum eigenvalue λmin is close to zero (for example, of the same order of magnitude as the accumulated calculation errors), this indicates the presence of multicollinearity. The same may be indicated by a significant deviation of the maximum eigenvalue λmax of the matrix X`X from its minimum eigenvalue λmin.

    A number of methods are used to eliminate or reduce multicollinearity. The simplest of them (but not always possible) is that of two explanatory variables that have a high correlation coefficient (more than 0.8), one variable is excluded from consideration. At the same time, which variable to leave and which to remove from the analysis is decided primarily on the basis of economic considerations. If, from an economic point of view, none of the variables can be given preference, then the one of the two variables that has a higher correlation coefficient with the dependent variable is retained.

    Another method for eliminating or reducing multicollinearity is to move from the unbiased estimates determined by the least squares method to biased estimates which, however, have a smaller scatter around the estimated parameter, i.e. a smaller mathematical expectation of the squared deviation of the estimate b j from the parameter β j, M(b j − β j)².

    The estimates determined by the vector b = (X`X)⁻¹X`Y have, in accordance with the Gauss-Markov theorem, the minimum variances in the class of all linear unbiased estimators, but in the presence of multicollinearity these variances may be too large, and turning to the corresponding biased estimators can improve the accuracy of estimation of the regression parameters. The figure shows the case where a biased estimate β j ^, whose sampling distribution is given by the density φ(β j ^), is concentrated more closely around β j than the unbiased estimate b j.

    Indeed, let the maximum permissible confidence interval for the estimated parameter β j be (β j − Δ, β j + Δ). Then the confidence probability, or reliability of the estimate, determined by the area under the distribution curve over the interval (β j − Δ, β j + Δ), will, as is easy to see from the figure, be greater in this case for the estimate β j ^ than for b j (in the figure these areas are shaded). Accordingly, the mean squared deviation from the estimated parameter is smaller for the biased estimate, i.e.:

    M(β j ^ − β j)² < M(b j − β j)²

    When “ridge regression” is used, instead of the unbiased estimates we consider the biased estimates specified by the vector

    β τ ^ = (X`X + τE p+1)⁻¹X`Y,

    where τ is some positive number called the “ridge”,

    and E p+1 is the identity matrix of order (p + 1).

    Adding τ to the diagonal elements of the matrix X`X makes the estimates of the model parameters biased, but at the same time the determinant of the matrix of the system of normal equations increases: instead of |X`X| it becomes equal to

    |X`X + τE p+1|.

    Thus, it becomes possible to exclude multicollinearity in the case when the determinant |X`X| is close to zero.

    To eliminate multicollinearity, a transition from the original explanatory variables X 1 , X 2 ,…, X n , interconnected by a fairly close correlation, to new variables representing linear combinations of the original ones can be used. In this case, the new variables must be weakly correlated or completely uncorrelated. As such variables, we take, for example, the so-called principal components of the vector of initial explanatory variables, studied in component analysis, and consider regression on the principal components, in which the latter act as generalized explanatory variables, subject to further meaningful (economic) interpretation.

    The orthogonality of the principal components prevents the multicollinearity effect. In addition, the method used allows us to limit ourselves to a small number of principal components with a relatively large number of initial explanatory variables.

    Multicollinearity is a concept used to describe the problem in which a non-strict (approximate) linear relationship between the explanatory variables leads to unreliable regression estimates. Of course, such a relationship does not necessarily produce unsatisfactory estimates: if all the other conditions are favourable, that is, if the number of observations and the sample variances of the explanatory variables are large and the variance of the random term is small, quite good estimates can be obtained in the end.

    So multicollinearity must be caused by a combination of a non-strict linear relationship and one (or more) unfavourable conditions; it is a question of the degree to which the phenomenon manifests itself, not of its kind. The estimation of any regression will suffer from it to some extent unless all the independent variables turn out to be completely uncorrelated. Consideration of this problem begins only when it seriously affects the results of the regression estimation.

    This problem is common in time series regressions, that is, when the data consists of a number of observations over a period of time. If two or more independent variables have a strong time trend, they will be highly correlated, and this can lead to multicollinearity.


    What can be done in this case?

    The various techniques that can be used to mitigate multicollinearity fall into two categories: the first involves attempts to improve the degree to which the four conditions for the reliability of regression estimates are met; the second involves the use of external information. If we first consider what can be done with directly obtained data, it is obvious that it would be useful to increase the number of observations.

    If you are using time series data, this can be done by shortening the duration of each time period. For example, when estimating the demand function equations in Exercises 5.3 and 5.6, you can switch from using annual data to quarterly data.

    After this, instead of 25 observations there will be 100. This is so obvious and so easy to do that most researchers using time series almost automatically use quarterly data, if they are available, instead of annual data, even if multicollinearity is not a problem, simply in order to minimize the theoretical variances of the regression coefficients. There are, however, potential problems with this approach. Autocorrelation can be introduced or strengthened, although it can be neutralized. In addition, bias due to measurement errors can be introduced (or amplified) if the quarterly data are measured with less precision than the corresponding annual data. This problem is not easy to solve, but it may not be significant.

    Multicollinearity is the correlation of two or more explanatory variables in a regression equation. It can be functional (explicit) or stochastic (hidden). With functional multicollinearity the matrix XTX is degenerate and (XTX)-1 does not exist, so the OLS estimates cannot be determined. More often, multicollinearity manifests itself in a stochastic form; in that case the OLS estimates formally exist but have a number of disadvantages:

    • 1) a small change in the initial data leads to a significant change in the regression estimates;
    • 2) the estimates have large standard errors and low significance, while the model as a whole is significant (high R2 value);
    • 3) interval estimates of coefficients expand, worsening their accuracy;
    • 4) it is possible to obtain the wrong sign for the regression coefficient.

    Detection

    There are several signs by which the presence of multicollinearity can be determined.

    First, analysis of the correlation matrix of pairwise correlation coefficients:

    • - if there are pairs of variables that have high correlation coefficients (> 0.75 - 0.8), they speak of multicollinearity between them;
    • - if the factors are uncorrelated, then det Q = 1, if there is complete correlation, then det Q = 0.

    One can test H0: det Q = 1 using the statistical criterion

    χ² = −[n − 1 − (2m + 5)/6]·ln(det Q),

    where n is the number of observations and m = p + 1; the statistic is compared with the table value of the χ² distribution with m(m − 1)/2 degrees of freedom.

    If χ² > χ²_table, then H0 is rejected and multicollinearity is considered established.

    Secondly, multiple coefficients of determination of one of the explanatory variables and some group of others are determined. The presence of a high R2 (> 0.6) indicates multicollinearity.

    Thirdly, the proximity to zero of the minimum eigenvalue of the XTX matrix (i.e., the solution to the equation) indicates that det(XTX) is also close to zero and, therefore, multicollinearity.

    Fourthly, high partial correlation coefficients:

    r_ij.rest = −Q_ij / √(Q_ii·Q_jj),

    where Q_ij are the algebraic complements (cofactors) of the elements of the matrix of sample correlation coefficients. Partial correlation coefficients of higher orders can be determined through partial correlation coefficients of lower orders using the recurrence formula:

    r_ij.k = (r_ij − r_ik·r_jk) / √((1 − r²_ik)(1 − r²_jk)),

    where, for higher orders, the coefficients on the right-hand side are taken at the previous order.
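    A sketch of this computation via the inverse of the correlation matrix (equivalent to the cofactor formula above); the matrix used here is the one from Example No. 2, purely for illustration:

```python
import numpy as np

def partial_correlations(Q):
    """Partial correlation of each pair of variables given all the others, computed
    from the inverse of the correlation matrix Q (equivalent to the cofactor formula)."""
    P = np.linalg.inv(Q)
    d = np.sqrt(np.diag(P))
    part = -P / np.outer(d, d)
    np.fill_diagonal(part, 1.0)
    return part

Q = np.array([[1.0,  0.30, 0.60, 0.40],
              [0.30, 1.0,  0.10, 0.15],
              [0.60, 0.10, 1.0,  0.80],
              [0.40, 0.15, 0.80, 1.0]])  # correlation matrix from Example No. 2
print(np.round(partial_correlations(Q), 3))
```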

    Fifthly, the presence of multicollinearity can be judged by external signs of the constructed model that are its consequences. These include the following:

    • · some of the estimates have incorrect signs from the point of view of economic theory or unreasonably large absolute values;
    • · a small change in the initial statistical data (adding or removing some observations) leads to a significant change in the estimates of the model coefficients, even changing their signs;
    • · most or even all estimates of regression coefficients turn out to be statistically insignificant according to the t-test, while the model as a whole is significant according to the F-test.

    There are a number of other methods for determining multicollinearity.

    If the main task of the model is to predict future values ​​of the dependent variable, then with a sufficiently large coefficient of determination R2 (> 0.9), the presence of multicollinearity usually does not affect the predictive qualities of the model. This statement will be justified if the same relationships between the correlated variables remain in the future.

    If the purpose of the study is to determine the degree of influence of each of the explanatory variables on the dependent variable, then the presence of multicollinearity, which leads to increased standard errors, will most likely distort the true relationships between the variables. In this situation multicollinearity is a serious problem.