Summary output

Table 8.3a. Regression statistics

Regression statistics
Multiple R          0.998364
R-square            0.99673
Adjusted R-square   0.996321
Standard error      0.42405
Observations        10

First, consider the upper part of the output, presented in Table 8.3a: the regression statistics.

The value of R-square, also called the measure of certainty, characterizes the quality of the obtained regression line. This quality is expressed by the degree of correspondence between the source data and the regression model (the calculated data). The measure of certainty always lies within an interval whose extreme values are zero and one.

If the value of R-square is close to one, the constructed model explains almost all of the variability of the corresponding variables. Conversely, a value of R-square close to zero means poor quality of the constructed model.

In our example, the measure of certainty is 0.99673, which indicates a very good fit of the regression line to the initial data.

Multiple R, the multiple correlation coefficient R, expresses the degree of dependence between the independent variables (X) and the dependent variable (Y).

Multiple R is equal to the square root of the coefficient of determination; this quantity takes values in the range from zero to one.

In simple linear regression analysis, multiple R equals the Pearson correlation coefficient. Indeed, multiple R in our case equals the Pearson correlation coefficient from the previous example (0.998364).

Table 8.3b. Regression coefficients*

              Coefficients   Standard error   t-statistic
Y-intercept   2.694545455    0.33176878       8.121757129
Variable X1   2.305454545    0.04668634       49.38177965

* A truncated version of the calculations is given.

Now consider the middle part of the output, presented in Table 8.3b: the regression coefficient b (2.305454545) and the offset along the ordinate axis, i.e. the constant a (2.694545455).

Based on the calculations, we can write the regression equation as follows:

Y = 2.305454545 * X + 2.694545455
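As a cross-check, the coefficients and R-square above can be reproduced outside Excel. The sketch below does this in Python; note that the x and y values are not given in the text and are reconstructed here from the "Predicted Y" and "Residuals" columns of Table 8.3c, so treat them as an assumption rather than the original worksheet data.

```python
import numpy as np

# Data reconstructed from Table 8.3c: y = predicted + residual,
# x chosen so that predicted = 2.305454545*x + 2.694545455.
x = np.array([3, 2, 4, 5, 6, 7, 8, 9, 10, 11], dtype=float)
y = np.array([9.0, 7.0, 12.0, 15.0, 17.0, 19.0, 21.0, 23.4, 25.6, 27.8])

b, a = np.polyfit(x, y, 1)           # slope and intercept of the fitted line
predicted = b * x + a
residuals = y - predicted

r = np.corrcoef(x, y)[0, 1]          # multiple R (= Pearson r for one predictor)
r2 = r ** 2                          # R-square, the measure of certainty

print(f"b = {b:.9f}, a = {a:.9f}")   # ~2.305454545 and ~2.694545455
print(f"Multiple R = {r:.6f}, R-square = {r2:.6f}")
```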

The direction of the relationship between the variables is determined by the sign (negative or positive) of the regression coefficient (coefficient b).

If the sign of the regression coefficient is positive, the relationship of the dependent variable with the independent one is positive. In our case the regression coefficient is positive, therefore the relationship is also positive.

If the sign of the regression coefficient is negative, the relationship of the dependent variable with the independent one is negative (inverse).

Table 8.3c presents the residual output. For these results to appear in the report, the "Residuals" checkbox must be activated when running the "Regression" tool.

Residual output

Table 8.3c. Residuals

Observation   Predicted Y    Residuals      Standard residuals
1             9.610909091    -0.610909091   -1.528044662
2             7.305454545    -0.305454545   -0.764022331
3             11.91636364    0.083636364    0.209196591
4             14.22181818    0.778181818    1.946437843
5             16.52727273    0.472727273    1.182415512
6             18.83272727    0.167272727    0.418393181
7             21.13818182    -0.138181818   -0.34562915
8             23.44363636    -0.043636364   -0.109146047
9             25.74909091    -0.149090909   -0.372915662
10            28.05454545    -0.254545455   -0.636685276

Using this part of the report, we can see the deviation of each point from the constructed regression line. The residual with the greatest absolute value belongs to observation 4 (0.778181818).

Regression analysis is a method of modeling measured data and studying their properties. The data consist of pairs of values of a dependent variable (the response variable) and an independent variable (the explanatory variable). A regression model is a function of the independent variable and of parameters, with an added random variable. The model parameters are tuned so that the model best approximates the data. The criterion for the quality of the approximation (the objective function) is usually the standard error: the sum of squared differences between the model values and the dependent variable over all values of the independent variable taken as the argument. Regression analysis is a branch of mathematical statistics and machine learning. It is assumed that the dependent variable is the sum of the values of some model and a random variable. Assumptions are made about the nature of the distribution of this variable, called the data-generation hypothesis. To confirm or refute this hypothesis, statistical tests are performed, called residual analysis. It is assumed that the independent variable contains no errors. Regression analysis is used for forecasting, time series analysis, hypothesis testing, and discovering hidden relationships in data.

Definition of regression analysis

The sample may be not a function but a relation. For example, the data for constructing a regression may contain pairs in which one value of the independent variable corresponds to several values of the dependent variable.

Linear regression

Linear regression assumes that the function depends on the parameters linearly. Linear dependence on the free variable itself is optional:

y = w0 + w1 g1(x) + ... + wk gk(x) + ε,

where g1, ..., gk are functions of the free variable. In the case when these features are simply the components of the vector, gj(x) = xj, linear regression takes the form

y = w0 + w1 x1 + ... + wk xk + ε,

here xj are the components of the vector x.

The values of the parameters in the case of linear regression are found by the least squares method. The use of this method is justified by the assumption of a Gaussian distribution of the random variable.

The differences between the actual values of the dependent variable and the reconstructed ones are called regression residuals. Synonyms are also used in the literature: discrepancies and errors. One of the important estimates of the quality of the obtained dependence is the sum of squared residuals:

SSE = (y1 - f(x1))² + ... + (yn - f(xn))².

Here SSE stands for Sum of Squared Errors.

The variance of the residuals is calculated by the formula

MSE = SSE / n.

Here MSE stands for Mean Squared Error; its square root is the standard error.
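A minimal sketch of these two formulas; the actual and model values here are made up solely to show the computation.

```python
import numpy as np

# Hypothetical actual values and model predictions, purely illustrative.
y = np.array([9.0, 7.0, 12.0, 15.0, 17.0])
f = np.array([9.61, 7.31, 11.92, 14.22, 16.53])

residuals = y - f                 # regression residuals
sse = np.sum(residuals ** 2)      # SSE: sum of squared errors
mse = sse / len(y)                # MSE: mean squared error
rmse = np.sqrt(mse)               # standard error of the model
print(sse, mse, rmse)
```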

The graphs show samples indicated by blue dots and regression dependencies drawn as solid lines. The free variable is plotted along the abscissa axis and the dependent variable along the ordinate axis. All three dependencies are linear with respect to the parameters.

Nonlinear regression

Nonlinear regression models are models of the form

y = f(w, x) + ε

that cannot be represented as a scalar product

y = <w, g(x)> + ε,

where w are the parameters of the regression model, x is a free variable from the space X, y is the dependent variable, ε is a random variable, and g is a function from some given set.

The values of the parameters in the case of nonlinear regression are found using one of the gradient descent methods, for example the Levenberg-Marquardt algorithm.
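For a concrete picture, the following sketch fits a hypothetical exponential model with SciPy; curve_fit uses the Levenberg-Marquardt algorithm by default for unconstrained problems. The model and data are illustrative assumptions, not an example from the text.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, w0, w1):
    # Nonlinear in the parameter w1: cannot be written as <w, g(x)>.
    return w0 * np.exp(w1 * x)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0, 50)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.1, x.size)

# method="lm" selects Levenberg-Marquardt explicitly.
params, cov = curve_fit(model, x, y, p0=[1.0, 1.0], method="lm")
print(params)   # estimates of w0 and w1, close to 1.5 and 0.8
```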

About terms

The term "regression" was introduced by Francis Galton at the end of the 19th century. Galton found that the children of parents with high or low growth usually do not inherit an outstanding growth and called this phenomenon "regression to mediocreness." At first, this term was used exclusively in the biological sense. After the works of Charles Pearson, this term began to use and in statistics.

The statistical literature distinguishes regression involving one free variable from regression involving several free variables: one-dimensional and multidimensional regression. It is assumed here that we use several free variables, that is, the free variable is a vector; in the particular case of a scalar free variable this will be stated explicitly. Linear and nonlinear regression are also distinguished. If the regression model is not a linear combination of functions of the parameters, one speaks of nonlinear regression; in that case the model can be an arbitrary superposition of functions from some set. Nonlinear models include exponential, trigonometric and other models (for example, radial basis functions or the Rosenblatt perceptron) that make the relationship between the parameters and the dependent variable nonlinear.

Parametric and non-parametric regression are also distinguished. A strict boundary between these two types of regression is difficult to draw; there is currently no generally accepted criterion for distinguishing one type of model from the other. For example, linear models are considered parametric, while models that involve averaging the dependent variable over the space of the free variable are considered non-parametric. Examples of parametric regression models: a linear predictor, a multilayer perceptron. An example of a mixed regression model: radial basis functions. A non-parametric model is a moving average in a window of some width. In general, non-parametric regression differs from parametric regression in that the dependent variable depends not on a single value of the free variable but on some neighborhood of that value.

The terms "approximation of functions", "approximation", "interpolation" and "regression" are distinguished as follows.

Approximation of functions. A function of a discrete or continuous argument is given. It is required to find a function from some parametric family, for example among algebraic polynomials of a given degree. The function parameters must minimize some functional, for example the sum of squared deviations.

The term approximation is a synonym for "approximation of functions". It is used more often when the given function is a function of a discrete argument. Here too it is required to find a function that passes as close as possible to all points of the given function. In this case the concept of a residual is introduced: the distances between the points of the continuous function and the corresponding points of the function of the discrete argument.

Interpolation of functions is a special case of the approximation problem in which the values of the function and of the approximating function must coincide at certain points, called interpolation nodes. In a more general case, restrictions are also imposed on the values of certain derivatives. That is, a function of a discrete argument is given, and it is required to find a function that passes through all its points. A metric is usually not used here, but the concept of "smoothness" of the sought function is often introduced.
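The difference between approximation and interpolation can be made concrete with a small sketch: a least squares polynomial leaves nonzero residuals at the given points, while a spline passes through every node exactly. The data are illustrative.

```python
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 0.9, 4.2, 8.8, 16.1])

approx = np.polyfit(x, y, 2)        # degree-2 polynomial, minimizes squared residuals
spline = CubicSpline(x, y)          # passes through every interpolation node exactly

print(np.polyval(approx, x) - y)    # nonzero residuals (approximation)
print(spline(x) - y)                # zeros up to rounding (interpolation)
```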

The purpose of regression analysis is to measure the relationship between a dependent variable and one (paired regression analysis) or several (multiple regression analysis) independent variables. Independent variables are also called factor, explanatory, or determining variables, regressors and predictors.

The dependent variable is sometimes called the determined, explained or "response" variable. The extremely wide spread of regression analysis in empirical research is connected not only with the fact that it is a convenient tool for testing hypotheses. Regression, especially multiple regression, is an effective method of modeling and forecasting.

We begin our explanation of the principles of working with regression analysis with the simpler, paired method.

Paired regression analysis

The first steps in regression analysis are almost identical to those we took when computing the correlation coefficient. The three main conditions for the effectiveness of correlation analysis by the Pearson method, namely normal distribution of the variables, interval measurement of the variables, and a linear relationship between the variables, are also relevant for multiple regression. Accordingly, at the first stage scatter diagrams are built, a descriptive statistical analysis of the variables is carried out, and the regression line is computed. As in correlation analysis, regression lines are built by the least squares method.

To illustrate the differences between the two methods of data analysis more clearly, let us turn to the example already considered, with the variables "Support for the ATP" and "Share of the rural population". The source data are identical. The difference in the scatter diagrams is that in regression analysis the dependent variable must be placed correctly, in our case "Support for the ATP" along the Y axis, whereas in correlation analysis this does not matter. After removing outliers, the scatter diagram looks as follows:

The fundamental idea of regression analysis is that, having the general trend of the variables in the form of a regression line, one can predict the value of the dependent variable from the value of the independent one.

Imagine an ordinary mathematical linear function. Any straight line in Euclidean space can be described by the formula

y = bx + a,

where a is a constant that sets the offset along the ordinate axis, and b is a coefficient that determines the slope of the line.

Knowing the slope coefficient and the constant, one can calculate (predict) the value of y for any x.

This simplest function forms the basis of the regression analysis model, with the reservation that we predict the value of y not exactly but within a certain confidence interval, i.e. approximately.

The constant a is the point of intersection of the regression line and the ordinate axis (the Y-intercept, in statistical packages usually denoted "Intercept"). In our example with voting for the ATP, its rounded value is 10.55. The slope coefficient b is approximately -0.1 (as in correlation analysis, the sign shows the type of relationship, direct or inverse). Thus the resulting model has the form ATP = -0.1 x rural pop. + 10.55.

So, for the case of the "Republic of Adygea" with the shares of the rural population of 47% of the predicted value will be 5.63:

ATP \u003d -0.10 x 47 + 10.55 \u003d 5.63.

The difference between the initial and predicted values is called the residual (we have already encountered this term, fundamental for statistics, when analyzing contingency tables). So, for the case of the Republic of Adygea the residual equals 3.92 - 5.63 = -1.71. The larger the absolute value of the residual, the less successfully the value is predicted.

Let us calculate the predicted values and residuals for all cases:

Case                         Rural pop., %   ATP (initial)   ATP (predicted)   Residuals
Republic of Adygea           47              3.92            5.63              -1.71
Altai Republic               76              5.40            2.59              2.81
Republic of Bashkortostan    36              6.04            6.78              -0.74
The Republic of Buryatia     41              8.36            6.25              2.11
The Republic of Dagestan     59              1.22            4.37              -3.15
The Republic of Ingushetia   59              0.38            4.37              -3.99
Etc.
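The residuals column can be verified directly, since the residual is simply the initial value minus the predicted value. A small sketch over the rows shown above (the values are taken from the table):

```python
cases = {
    # case: (ATP initial, ATP predicted)
    "Republic of Adygea":         (3.92, 5.63),
    "Altai Republic":             (5.40, 2.59),
    "Republic of Bashkortostan":  (6.04, 6.78),
    "The Republic of Buryatia":   (8.36, 6.25),
    "The Republic of Dagestan":   (1.22, 4.37),
    "The Republic of Ingushetia": (0.38, 4.37),
}
for name, (initial, predicted) in cases.items():
    # residual = initial - predicted
    print(f"{name}: residual = {initial - predicted:+.2f}")
```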

Analysis of the relation between the initial and predicted values is used to assess the quality of the obtained model and its predictive power. One of the main indicators of regression statistics is the multiple correlation coefficient R: the correlation coefficient between the initial and predicted values of the dependent variable. In paired regression analysis it equals the ordinary Pearson correlation coefficient between the dependent and independent variables, in our case 0.63. To interpret multiple R meaningfully, it must be converted into the determination coefficient. This is done the same way as in correlation analysis: by squaring it. The determination coefficient R-square (R²) shows the proportion of variation of the dependent variable explained by the independent variable(s).

In our case, R² = 0.39 (0.63²); this means that the variable "share of the rural population" explains about 40% of the variation of the variable "Support for the ATP". The greater the value of the determination coefficient, the higher the quality of the model.

Another indicator of model quality is the standard error of estimate. It shows how strongly the points are "scattered" around the regression line. The measure of variation for interval variables is the standard deviation; accordingly, the standard error of estimate is the standard deviation of the distribution of residuals. The higher its value, the stronger the scatter and the worse the model. In our case the standard error is 2.18: by this amount our model will "err on average" when predicting the value of the "Support for the ATP" variable.

Regression statistics also include analysis of variance. With its help we find out: 1) what proportion of the variation (variance) of the dependent variable is explained by the independent variable; 2) what proportion of the variance of the dependent variable falls on the residuals (the unexplained part); 3) what the ratio of these two quantities is (the F-ratio). Variance statistics are especially important for sample studies: they show how likely it is that a relationship between the independent and dependent variables exists in the general population. However, for complete (census) studies (as in our example) the results of the analysis of variance are also useful. In this case one checks whether the identified statistical pattern is caused by a coincidence of random circumstances, and how characteristic it is of the complex of conditions in which the surveyed population finds itself; that is, one establishes not the truth of the obtained result for some wider general population, but the degree of its regularity and freedom from random influences.

In our case the analysis of variance statistics are as follows:

             SS        df    MS        F        Significance F
Regression   258.77    1     258.77    54.29    0.000000001
Residual     395.59    83    4.77
Total        654.36

The F-ratio of 54.29 is significant at the 0.000000001 level. Accordingly, we can confidently reject the null hypothesis (that the relationship we discovered is of a random character).

A similar function is performed by the t-test, but with respect to the regression coefficients (the slope and the Y-intercept). Using the t-test we check the hypothesis that in the general population the regression coefficients are equal to zero. In our case we can again confidently reject the null hypothesis.

Multiple regression analysis

The multiple regression model is essentially identical to the paired regression model; the only difference is that several independent variables are included in the linear function one after another:

Y = b1x1 + b2x2 + ... + bpxp + a.

If there are more than two independent variables, we cannot get a visual idea of their relationship; in this respect multiple regression is less "visual" than paired regression. When there are two independent variables, it is useful to display the data on a three-dimensional scatter diagram. Professional statistical software packages (for example, Statistica) have an option to rotate the three-dimensional diagram, which allows one to visualize the data structure.

When working with multiple regression, in contrast to paired regression, it is necessary to choose the analysis algorithm. The standard algorithm includes all available predictors in the final regression model. The stepwise algorithm implies sequential inclusion (or exclusion) of independent variables based on their explanatory "weight". The stepwise method is good when there are many independent variables; it "cleans" the model of frankly weak predictors, making it more compact and laconic.

An additional condition for the correctness of multiple regression (along with interval measurement, normality and linearity) is the absence of multicollinearity, i.e. of strong correlations between the independent variables.

The interpretation of multiple regression statistics includes all the elements we considered for the case of paired regression. In addition, the statistics of multiple regression analysis have other important components.

We illustrate work with multiple regression using the example of testing hypotheses that explain differences in the level of electoral activity across the regions of Russia. Specific empirical studies have suggested that the level of voter turnout is affected by:

the national factor (the variable "Russian population", operationalized as the share of the Russian population in the subjects of the Russian Federation); it is assumed that an increase in the share of the Russian population leads to a decrease in voter turnout;

the urbanization factor (the variable "urban population", operationalized as the share of the urban population in the subjects of the Russian Federation; we have already worked with this factor within the correlation analysis); it is assumed that an increase in the share of the urban population also leads to a decrease in voter turnout.

The dependent variable - the "intensity of electoral activity" ("asset") is survived through the averaged data of the appearance of the regions in the federal elections from 1995 to 2003. The source table of data for two independent and one dependent variable will have the following form:

Case                         Activity   Urban pop., %   Russian pop., %
Republic of Adygea           64.92      53              68
Altai Republic               68.60      24              60
The Republic of Buryatia     60.75      59              70
The Republic of Dagestan     79.92      41              9
The Republic of Ingushetia   75.05      41              23
Republic of Kalmykia         68.52      39              37
Karachay-Cherkess Republic   66.68      44              42
Republic of Karelia          61.70      73              73
Komi Republic                59.60      74              57
Mari El Republic             65.19      62              47

Etc. (after removing outliers, 83 cases out of 88)

Statistics describing the quality of the model:

1. Multiple R = 0.62; R-square = 0.38. Consequently, the national factor and the urbanization factor together explain about 38% of the variation of the "electoral activity" variable.

2. The standard error is 3.38. This is how much the constructed model "errs on average" when predicting the level of turnout.

3. The F-ratio of explained to unexplained variation is 25.2, significant at the 0.000000003 level. The null hypothesis that the identified relationships are random is rejected.

4. The t-test for the constant and for the regression coefficients of the variables "urban population" and "Russian population" is significant at the levels 0.0000001, 0.00005 and 0.007, respectively. The null hypothesis that the coefficients are random is rejected.

Additional useful statistics in analyzing the relation between the initial and predicted values of the dependent variable are the Mahalanobis distance and Cook's distance. The first is a measure of the uniqueness of a case (it shows how much the combination of values of all independent variables for a given case deviates from the mean values of all independent variables simultaneously). The second is a measure of the influence of a case. Different observations affect the slope of the regression line differently, and Cook's distance allows them to be compared on this indicator. This is useful when removing outliers (an outlier can be thought of as an overly influential case).

In our example, Dagestan is among the unique and influential cases.

Case                         Initial value   Predicted value   Residuals   Mahalanobis distance   Cook's distance
Adygea                       64.92           66.33             -1.40       0.69                   0.00
Altai Republic               68.60           69.91             -1.31       6.80                   0.01
The Republic of Buryatia     60.75           65.56             -4.81       0.23                   0.01
The Republic of Dagestan     79.92           71.01             8.91        10.57                  0.44
The Republic of Ingushetia   75.05           70.21             4.84        6.73                   0.08
Republic of Kalmykia         68.52           69.59             -1.07       4.20                   0.00

The regression model itself has the following parameters: Y-intercept (constant) = 75.99; b (urban pop.) = -0.1; b (Russian pop.) = -0.06. Final formula:

Activity = -0.1 x urban pop. - 0.06 x Russian pop. + 75.99.
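Applying this final formula to the table rows is straightforward; a sketch is below. Because the printed coefficients are rounded to one or two digits, the recomputed predictions differ slightly from the "predicted value" column above.

```python
rows = {
    # case: (urban population %, Russian population %)
    "Adygea":                   (53, 68),
    "Altai Republic":           (24, 60),
    "The Republic of Buryatia": (59, 70),
    "The Republic of Dagestan": (41, 9),
}
for name, (urban, russian) in rows.items():
    predicted = -0.1 * urban - 0.06 * russian + 75.99
    print(f"{name}: predicted turnout = {predicted:.2f}")
```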

Can we compare the "explanatory force" of the predictors, based on the value of the coefficient 61. In this case, yes, since both independent variables have the same percentage format. However, most often multiple regression deals with variables measured in different scales (for example, the level of income in rubles and age in years). Therefore, in general, to compare the predictive possibilities of variables by the regression ratio incorrectly. In the statistics of multiple regression for this purpose, there is a special beta coefficient (B) calculated separately for each independent variable. It is a private (calculated after taking into account the influence of all other predictors) the correlation coefficient of factor and response and shows the independent contribution of the factor in the prediction of the response values. In pair regression analysis, beta coefficients for obvious reasons is equal to the pair correlation coefficient between the dependent and independent variable.

In our example, beta (urban pop.) = -0.43 and beta (Russian pop.) = -0.28. Thus both factors negatively affect the level of electoral activity, and the significance of the urbanization factor is noticeably higher than that of the national factor. The combined influence of the two factors explains about 38% of the variation of the "electoral activity" variable (see the R-square value).
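One common way to obtain standardized (beta) coefficients from raw ones is to rescale each b by the ratio of standard deviations, beta = b * sd(x) / sd(y). A sketch of this idea, using only the few rows shown earlier rather than all 83 cases, so the resulting numbers are illustrative only:

```python
import numpy as np

def beta_coefficients(X, y, b):
    """beta_j = b_j * std(x_j) / std(y) for each predictor column."""
    return b * X.std(axis=0) / y.std()

# Columns: urban pop. %, Russian pop. % (first five table rows only).
X = np.array([[53, 68], [24, 60], [59, 70], [41, 9], [41, 23]], dtype=float)
y = np.array([64.92, 68.60, 60.75, 79.92, 75.05])
b = np.array([-0.1, -0.06])          # raw regression coefficients

print(beta_coefficients(X, y, b))
```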

Regression analysis is a statistical research method that shows the dependence of a parameter on one or several independent variables. In the pre-computer era it was difficult to apply, especially with large amounts of data. Today, having learned how to build a regression in Excel, you can solve complex statistical problems in literally a couple of minutes. Below are concrete examples from the field of economics.

Types of regression

The concept itself was introduced into mathematics in 1886. Regression can be:

  • linear;
  • parabolic;
  • power;
  • exponential;
  • hyperbolic;
  • indicative;
  • logarithmic.

Example 1.

Consider the problem of determining how the number of employees who quit depends on the average salary at six industrial enterprises.

Task. At six enterprises the average monthly salary and the number of employees who quit of their own accord were analyzed. In tabular form we have:

The number who quit

The salary

30,000 rubles

35,000 rubles

40,000 rubles

45,000 rubles

50,000 rubles

55,000 rubles

60,000 rubles

For the problem of determining the dependence of the number of employees who quit on the average salary at six enterprises, the regression model has the form of the equation Y = a0 + a1x1 + ... + akxk, where xi are the influencing variables, ai are the regression coefficients and k is the number of factors.

For this problem, Y is the indicator of employees who quit, and the influencing factor, the salary, is denoted by X.

Using the capabilities of the "Excel" table processor

Regression analysis in Excel must be preceded by applying the built-in functions to the existing table data. However, for these purposes it is better to use the very useful "Analysis Package" add-in. To activate it you need to:

  • from the "File" tab, go to the "Options" section;
  • in the window that opens, select the "Add-ins" line;
  • click the "Go" button at the bottom, to the right of the "Manage" field;
  • put a check mark next to "Analysis Package" and confirm your actions by clicking OK.

If everything is done correctly, the required button will appear on the right side of the "Data" tab, located above the Excel worksheet.


Now that we have all the necessary virtual tools for carrying out econometric calculations, we can proceed to solve our problem. For this:

  • click on the "Data Analysis" button;
  • in the window that opens, click on the "Regression" button;
  • in the tab that appears, we enter the range of values \u200b\u200bfor Y (the number of abolished employees) and for x (their salaries);
  • confirm your actions by pressing the "OK" button.

As a result, the program will automatically fill a new sheet of the table processor with the regression analysis data. Note! Excel lets you specify the location you prefer for this output. For example, it can be the same sheet that holds the Y and X values, or even a new workbook specifically designed to store such data.

Analysis of regression results for R-square

In Excel, the results of processing the data under consideration look as follows:

First of all, pay attention to the R-square value. It is the determination coefficient. In this example R-square = 0.755 (75.5%), i.e. the calculated parameters of the model explain the relationship between the parameters under consideration by 75.5%. The higher the value of the determination coefficient, the more applicable the chosen model is considered for the specific task. It is believed that the model correctly describes the actual situation when the R-square value is above 0.8. If R-square < 0.5, then such a regression analysis in Excel cannot be considered reasonable.

Analysis of coefficients

The number 64.1428 shows what Y will be if all the variables xi in the model under consideration are set to zero. In other words, it can be argued that the value of the analyzed parameter is also affected by other factors not described in the specific model.

The next coefficient, -0.16285, located in cell B18, shows the weight of the influence of variable X on Y. This means that, within the model under consideration, the average monthly salary affects the number of employees who quit with a weight of -0.16285, i.e. the degree of its influence is quite small. The "-" sign indicates that the coefficient is negative. This is to be expected, since everyone knows that the higher the salary at an enterprise, the fewer people express a desire to terminate their employment contract and quit.
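To make the interpretation concrete, the fitted equation can be evaluated directly. The sketch below assumes the salary X is expressed in thousands of rubles; the output table does not show the scale, so this choice of units is an assumption.

```python
def predicted_quits(salary_thousands: float) -> float:
    # Y = 64.1428 - 0.16285 * X, coefficients taken from the output above;
    # the units of X (thousands of rubles) are an assumption.
    return 64.1428 - 0.16285 * salary_thousands

for salary in (30, 40, 50, 60):
    print(salary, "->", round(predicted_quits(salary), 2))
```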

Multiple regression

This term is understood as an equation of relationship with several independent variables of the form:

y = f(x1, x2, ..., xm) + ε, where y is the resulting feature (the dependent variable) and x1, x2, ..., xm are factor features (independent variables).

Evaluation of parameters

For multiple regression (MR) it is carried out using the least squares method (OLS). For linear equations of the form y = a + b1x1 + ... + bmxm + ε we build the system of normal equations (see below).

To understand the principle of the method, consider the two-factor case. Then we have the situation described by the equation y = a + b1x1 + b2x2 + ε, and the normal equations take the form:

Σy = na + b1Σx1 + b2Σx2,
Σyx1 = aΣx1 + b1Σx1² + b2Σx1x2,
Σyx2 = aΣx2 + b1Σx1x2 + b2Σx2².

From here we also get the relation between the raw and standardized coefficients, bi = βi(σy / σxi), where σ is the standard deviation of the corresponding feature indicated in the index.
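A sketch of the least squares solution via the normal equations for the two-factor case; the data are illustrative, not taken from the text.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = np.array([4.1, 4.9, 8.2, 8.8, 11.9])

# Design matrix with a column of ones for the constant term a.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Normal equations: (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # estimates of a, b1, b2
```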

OLS is also applicable to the MR equation on a standardized scale. In this case we get the equation:

ty = β1 tx1 + ... + βm txm,

in which ty, tx1, ..., txm are standardized variables, whose mean values are 0 and whose standard deviations are 1; βi are the standardized regression coefficients.

Note that all βi in this case are normalized and centered, so their comparison with one another is considered correct and admissible. In addition, it is customary to rank the factors by them, discarding those with the smallest values of βi.

Task using linear regression equation

Suppose we have a table of the price dynamics of a specific product N over the past 8 months. It is necessary to decide whether it is worthwhile to purchase a batch of it at a price of 1850 rubles/t.

Month number   Price of product N
1              1750 rubles per ton
2              1755 rubles per ton
3              1767 rubles per ton
4              1760 rubles per ton
5              1770 rubles per ton
6              1790 rubles per ton
7              1810 rubles per ton
8              1840 rubles per ton

To solve this problem in the Excel table processor, use the "Data Analysis" tool presented above. Then select the "Regression" section and set the parameters. Remember that the range of values of the dependent variable (in this case, the price of the product in specific months) must be entered in the "Input range Y" field, and the independent variable (the month number) in the "Input range X" field. Confirm the actions by pressing OK. On a new sheet (if that was specified) we obtain the regression output.

We build a linear equation of the form y = ax + b, where the coefficients from the rows "Variable X1" and "Y-intercept" of the sheet with the regression analysis results act as the parameters a and b. Thus, the linear regression equation for task 3 is written in the form:

Price of product N = 11.714 * month number + 1727.54,

or in algebraic notation

y = 11.714x + 1727.54
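The fit and the forecast can be checked from the table values alone; the sketch below reproduces the coefficients and extends the trend one month ahead, which is what bears on the 1850 rubles/t decision.

```python
import numpy as np

months = np.arange(1, 9)
prices = np.array([1750, 1755, 1767, 1760, 1770, 1790, 1810, 1840], dtype=float)

a, b = np.polyfit(months, prices, 1)   # slope a and intercept b
print(f"y = {a:.3f} x + {b:.2f}")      # ~11.714 x + 1727.54

forecast_month_9 = a * 9 + b
# ~1832.97 rubles/t, below the offered 1850 rubles/t
print(f"Forecast for month 9: {forecast_month_9:.2f}")
```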

Analysis of the results

To decide whether the obtained linear regression equation is adequate, the multiple correlation coefficient (KMK) and the determination coefficient are used, as well as Fisher's test and Student's test. In the Excel table with the regression results they appear as multiple R, R-square, F-statistic and t-statistic, respectively.

The KMK R makes it possible to assess the closeness of the probabilistic relationship between the independent and dependent variables. Its high value indicates a fairly strong relationship between the variables "month number" and "price of product N in rubles per ton". However, the nature of this relationship remains unknown.

The square of the determination coefficient, R² (R-square), is a numerical characteristic of the share of the total scatter the model explains: it shows what part of the experimental data, i.e. of the values of the dependent variable, corresponds to the linear regression equation. In the problem under consideration this value is 84.8%, i.e. the statistical data are described by the obtained regression equation with a high degree of accuracy.

The F-statistic, also called Fisher's test, is used to assess the significance of the linear relationship, refuting or confirming the hypothesis of its existence.

The t-statistic (Student's test) helps assess the significance of the coefficient of the unknown and of the free term of the linear relationship. If the value of the t-statistic exceeds the critical value, the hypothesis that the free term of the linear equation is insignificant is rejected.

In the problem under consideration, Excel's tools yield for the free term t = 169.20903 and p = 2.89E-12, i.e. the probability that the correct hypothesis of the insignificance of the free term would be rejected is practically zero. For the coefficient of the unknown, t = 5.79405 and p = 0.001158. In other words, the probability that the correct hypothesis of the insignificance of the coefficient of the unknown would be rejected is 0.12%.

Thus, it can be asserted that the obtained linear regression equation is adequate.

Task on the feasibility of buying a package of shares

Multiple regression in Excel is performed using the same "Data Analysis" tool. Let us consider a specific applied problem.

Management Company "NNN" should decide on the feasibility of buying a 20% stake in MMM JSC. The cost of the package (SP) is 70 million US dollars. Specialists "NNN" collected data on similar transactions. It was decided to assess the cost of a stake in such parameters expressed in millions of American dollars as:

  • accounts payable (VK);
  • volume of annual turnover (VO);
  • receivables (VD);
  • the cost of fixed assets (SOF).

In addition, the enterprise's wage arrears (VZP) in thousands of US dollars are used.

Solution using the Excel table processor tools

First of all, you need to compile a table of the source data. It has the following form. Then:

  • call the "Data Analysis" window;
  • select the section "Regression";
  • in the "Input Interval Y" window, a range of values \u200b\u200bof dependent variables from column G are introduced;
  • click on the icon with a red arrow to the right of the window "Input interval X" and allocate the range of all values \u200b\u200bfrom columns B, C, D, F.

The item "New Work List" and click "OK".

We obtain the regression analysis for this problem.

Study of the results and conclusions

"Collect" from the rounded data presented above on a sheet of a table processor Excel, the regression equation:

SP \u003d 0.103 * Sof + 0.541 * VO - 0.031 * VK + 0.405 * Vd + 0.691 * VZP - 265,844.

In a more familiar mathematical form, it can be written as:

y \u003d 0.103 * x1 + 0,541 * x2 - 0.031 * x3 + 0,405 * x4 + 0,691 * x5 - 265,844
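The resulting equation can be packaged as a function, as sketched below. Since the MMM JSC input table is not reproduced here, the values in the example call are placeholders, not the actual figures behind the 64.72 million result.

```python
def estimate_stake(sof, vo, vk, vd, vzp):
    """Estimated cost of the stake (SP), millions of US dollars."""
    return (0.103 * sof + 0.541 * vo - 0.031 * vk
            + 0.405 * vd + 0.691 * vzp - 265.844)

# Example call with hypothetical inputs (not the MMM JSC data):
print(estimate_stake(sof=200.0, vo=450.0, vk=80.0, vd=120.0, vzp=180.0))
```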

Data for MMM JSC are presented in Table:

Substituting them into the regression equation yields a figure of 64.72 million US dollars. This means that the shares of MMM JSC are not worth buying, since their price of 70 million US dollars is rather overvalued.

As we can see, using the Excel table processor and the regression equation made it possible to make an informed decision about the feasibility of a very specific transaction.

Now you know what regression is. The Excel examples discussed above will help you solve practical problems in econometrics.

Regression analysis is one of the most sought-after methods of statistical research. With it you can establish the degree of influence of independent values on a dependent variable. Microsoft Excel has tools intended for this type of analysis. Let us examine what they are and how to use them.

But to use the function that allows you to carry out regression analysis, you first need to activate the analysis package. Only then will the tools needed for this procedure appear on the Excel ribbon.


Now, when we go to the "Data" tab, we will see a new button, "Data Analysis", on the ribbon in the "Analysis" tool block.

Types of regression analysis

There are several types of regressions:

  • parabolic;
  • power;
  • logarithmic;
  • exponential;
  • indicative;
  • hyperbolic;
  • linear regression.

Below we will talk in more detail about how to perform the last type of regression analysis, linear regression, in Excel.

Linear regression in the Excel program

Below, as an example, a table is presented showing the average daily outdoor air temperature and the number of store customers for the corresponding working day. Let us find out, using regression analysis, exactly how weather conditions in the form of air temperature can affect the attendance of a retail establishment.

The general equation of linear regression looks as follows: y = a0 + a1x1 + ... + akxk. In this formula, y is the variable whose dependence on the factors we are trying to study; in our case it is the number of customers. The values x are the various factors affecting the variable. The parameters a are the regression coefficients, i.e. they determine the weight of a particular factor. The index k denotes the total number of these factors.


Analysis of the results

The results of the regression analysis are displayed in the form of a table in the place indicated in the settings.

One of the main indicators is R-square. It indicates the quality of the model. In our case this coefficient is 0.705, or about 70.5%. This is an acceptable level of quality; a dependence below 0.5 is considered poor.

Another important indicator is located in the cell at the intersection of the "Y-intercept" row and the "Coefficients" column. It indicates what value Y will have (in our case, the number of customers) when all other factors are equal to zero. In this table that value is 58.04.

The value at the intersection of the "Variable X1" row and the "Coefficients" column shows the level of dependence of Y on X; in our case, the level of dependence of the number of store customers on temperature. A coefficient of 1.31 is considered a rather high indicator of influence.
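To make the two coefficients concrete, the fitted model can be evaluated for a few temperatures. The intercept 58.04 and the slope 1.31 are taken from the output table described above; the temperature units follow the source data.

```python
def predicted_buyers(temperature: float) -> float:
    # Y = 58.04 + 1.31 * X, coefficients from the regression output above.
    return 58.04 + 1.31 * temperature

for t in (10, 15, 20, 25):
    print(t, "->", round(predicted_buyers(t), 1))
```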

As you can see, it is quite easy to build a regression analysis table using Microsoft Excel. But only a trained person can work with the output data and understand its essence.

