Before starting the main topic, I would like to briefly introduce regression analysis and time series data.
Regression analysis is one of the most popular and frequently used statistical techniques in machine learning. It is useful for investigating and modelling the relationship between a dependent feature/variable (y) and one or more independent features/variables (x).
In simple words, time series data is data whose points are recorded in a time sequence. In other words, the data is collected at different points in time.
Example : the annual expenditures of a particular person.
I hope you now have an idea of what regression analysis and time series data are. Let's come to the point.
Many applications of regression analysis involve independent/predictor and dependent/response variables that are both time series, meaning the variables are recorded in a time sequence. The assumption of uncorrelated or independent errors, typically made for regression data that is not time-dependent, is usually not appropriate for time series data. The errors in time series data often have an autocorrelated structure. Autocorrelation, also known as serial correlation, means that the errors are correlated with themselves across different time periods.
There are many sources of autocorrelation in time series regression data. In many cases, the cause is the analyst's failure to include one or more important predictor variables in the model.
Example : suppose we wish to regress the sales of a product in a particular region of the country against the annual advertising expenditures for that product.
In the above example, the growth of the population in that region over the study period will also influence product sales. Failing to include population size may cause the errors in the model to be positively autocorrelated: if the per-capita demand for the product is either constant or increasing with time, population size is positively correlated with product sales.
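This effect is easy to see in a small simulation. The sketch below (hypothetical numbers, not the article's data) generates sales driven by both advertising and a steadily growing population, then regresses sales on advertising alone; the omitted trend leaks into the residuals as strong positive lag-one autocorrelation.

```python
import numpy as np

# Hypothetical data: sales depend on advertising AND a growing
# population, but we regress on advertising alone.
rng = np.random.default_rng(0)
n = 50
t = np.arange(n)
advertising = rng.normal(100, 10, n)
population = 1000 + 20 * t                      # grows steadily over time
sales = 2 * advertising + 0.5 * population + rng.normal(0, 5, n)

# Fit sales ~ advertising only (population omitted)
X = np.column_stack([np.ones(n), advertising])
beta, *_ = np.linalg.lstsq(X, sales, rcond=None)
resid = sales - X @ beta

# The omitted trend leaks into the residuals: successive residuals
# are positively correlated (lag-1 sample autocorrelation near 1)
r1 = np.corrcoef(resid[:-1], resid[1:])[0, 1]
print(f"lag-1 residual autocorrelation: {r1:.2f}")
```

Adding `population` as a second column of `X` would make this autocorrelation disappear, which is exactly the first remedy discussed below.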
The presence of autocorrelation in the errors has several effects on the ordinary least-squares regression procedure.
- Regression coefficients are still unbiased, but they are no longer minimum-variance estimates.
- When the errors are positively autocorrelated, the residual mean square may seriously underestimate the error variance.
- Confidence intervals, prediction intervals, and tests of hypotheses based on the t and F distributions are, strictly speaking, no longer exact procedures.
We can deal with autocorrelation using three approaches. First, if autocorrelation is present because of the failure to include one or more predictors, and if the analyst can identify and include those predictors in the model, the observed autocorrelation should disappear.
Second, weighted least squares or generalised least squares can be used if there is sufficient knowledge of the autocorrelation structure. Third, if neither of these approaches can be used, the analyst must turn to a model that explicitly includes the autocorrelation structure; such models usually require special parameter estimation techniques. How can we identify whether autocorrelation is present in our data? This is a very common question for every analyst, which is why I am going to discuss how we can detect autocorrelation using statistical techniques, with an example.
Residual plots can be useful for detecting autocorrelation. Plotting the residuals versus time gives a meaningful and useful visualisation.
There are two possibilities when detecting autocorrelation.
Positive autocorrelation : indicated by a cyclical pattern in the residual plot over time. The correlation between observations recorded in time sequence is positive.
Negative autocorrelation : indicated by an alternating pattern in which the residuals cross the time axis more frequently than they would if they were distributed randomly. The correlation between observations recorded in time sequence is negative.
See the figure below, which visualises autocorrelation by showing the relationship between the residuals (y-axis) and time (x-axis).
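A figure like this can be reproduced with a short matplotlib sketch. The residual series below are hypothetical, generated from first-order autoregressive errors with coefficients +0.8 and -0.8 to illustrate the two patterns.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical residual series: AR(1) errors with rho = +0.8
# (positive autocorrelation) and rho = -0.8 (negative).
rng = np.random.default_rng(1)
n = 30
t = np.arange(n)
pos_resid = np.zeros(n)
neg_resid = np.zeros(n)
for i in range(1, n):
    pos_resid[i] = 0.8 * pos_resid[i - 1] + rng.normal()   # smooth, cyclical drift
    neg_resid[i] = -0.8 * neg_resid[i - 1] + rng.normal()  # rapid sign flipping

fig, axes = plt.subplots(1, 2, figsize=(10, 3))
for ax, resid, title in zip(axes, [pos_resid, neg_resid],
                            ["Positive autocorrelation", "Negative autocorrelation"]):
    ax.plot(t, resid, marker="o")
    ax.axhline(0, color="grey", linewidth=0.8)  # time axis: count the crossings
    ax.set_xlabel("Time")
    ax.set_ylabel("Residual")
    ax.set_title(title)
plt.tight_layout()
plt.show()
```

The left panel drifts smoothly above and below zero; the right panel crosses the time axis almost every period.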
Various statistical tests can be used to detect the presence of autocorrelation. The test developed by Durbin and Watson (1950, 1951, 1971) is a very widely used procedure. It tests for first-order autocorrelation, i.e., it assumes that the errors in the regression model are generated by a first-order autoregressive process observed at equally spaced time periods.
For uncorrelated errors, the lag-one sample autocorrelation coefficient equals 0 (at least approximately), so the value of the Durbin-Watson statistic should be approximately 2. Statistical testing is necessary to determine how far from 2 the statistic must fall before we conclude that the assumption of uncorrelated errors is violated. The decision procedure is as follows: compare the statistic d with the lower and upper critical values dL and dU from the Durbin-Watson tables. If d < dL, conclude that the errors are positively autocorrelated; if d > dU, do not reject the hypothesis of uncorrelated errors; if dL ≤ d ≤ dU, the test is inconclusive.
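The statistic itself is simple to compute, and statsmodels ships a ready-made implementation. The sketch below (hypothetical series, not the article's data) compares the statistic for uncorrelated errors with that for a strongly positively autocorrelated series.

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

# Durbin-Watson statistic: d = sum((e_t - e_{t-1})^2) / sum(e_t^2)
#   d near 2       -> no evidence of first-order autocorrelation
#   d well below 2 -> evidence of positive autocorrelation
#   d well above 2 -> evidence of negative autocorrelation
# (compare against the tabulated critical values dL and dU)
rng = np.random.default_rng(2)
white_noise = rng.normal(size=100)     # uncorrelated errors
trending = np.cumsum(white_noise)      # random walk: strongly positively autocorrelated

print(durbin_watson(white_noise))      # close to 2
print(durbin_watson(trending))         # much less than 2
```

For a formal test, the computed value is compared with dL and dU for the given sample size and number of predictors.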
Example: a company wants to use a regression model to relate annual regional advertising expenses to annual regional concentrate sales for a soft drink company. Table 1 presents 20 years of these data. We will initially assume that a linear relationship is appropriate and fit a simple linear regression by ordinary least squares.
Fitting a simple linear regression model using Python:
The plot of residuals versus time is shown in the figure below. The residual plot has a pattern indicative of potential autocorrelation: there is a definite upward trend in the plot.
Null hypothesis : there is no autocorrelation present in the errors of the model.
Alternative hypothesis : there is positive autocorrelation present in the errors of the model.
Endnote : A significant value of the Durbin-Watson statistic or a suspicious residual plot indicates a potential problem with autocorrelated model errors. This could be the result of an actual time dependence in the errors, or an 'artificial' time dependence caused by the omission of one or more important predictor variables. If the apparent autocorrelation results from missing predictors, and if these can be identified and incorporated into the model, the autocorrelation problem may be eliminated. If the autocorrelation cannot be removed by adding one or more new predictors, it is necessary to take explicit account of the autocorrelative structure in the model and use an appropriate parameter estimation method. A very good and widely used approach is the procedure devised by Cochrane and Orcutt (1949).
Reference : Montgomery, D. C., Peck, E. A. and Vining, G. G. (2001). Introduction to Linear Regression Analysis. 3rd Edition. New York: John Wiley & Sons.