Why do we take logs of variables in Regression analysis?

We should remember that a regression equation has two parts

i) The Dependent variable (Predictand)

ii) The Independent variables (Predictors), of which there can be one or more, and which can be of different types (Categorical or Continuous).

The nature of the regression we should run depends on the type of dependent variable we are dealing with in our model. For example, if the dependent variable is Continuous we might run OLS (though this requires some other conditions to be satisfied for good results) to get estimates of the parameters, whereas if our Predictand is a binary Categorical variable (taking values 0 or 1) we might want to run a Logistic regression.

It has to be noted that Linear Regression has certain conditions that need to be satisfied for it to give good results, one of them being normally distributed residuals, which in many instances they are not. If the errors between the observed and the fitted values are not normally distributed, that could be because the response variable is skewed. In such cases we can take a log transformation of the variable to normalize it. The question is whether we should. According to some statisticians there are other regression methods that handle these problems efficiently without such transformations, the justification being that it is advisable to “use a method that fits the data than to make the data fit the method”. So if the residuals are non-normal we can turn to Robust Regression, Quantile Regression or, in some cases, MARS. It has to be noted here that OLS regression does not require the variables themselves to be normal, only the errors, which are estimated by the residuals. However, if there are outliers in the dependent or independent variables, taking a logarithmic transformation can reduce the influence of those observations.
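As a small illustration of why a log can normalize a skewed response, here is a pure-Python sketch. The data are made up for illustration (symmetric values pushed through an exponential to mimic a right-skewed variable), and the skewness function is hand-rolled rather than taken from any particular library:

```python
import math

def skewness(xs):
    # Sample skewness: average standardized cubed deviation from the mean
    n = len(xs)
    m = sum(xs) / n
    sd = (sum((x - m) ** 2 for x in xs) / n) ** 0.5
    return sum(((x - m) / sd) ** 3 for x in xs) / n

# Hypothetical right-skewed response: exponentiated symmetric values
raw = [math.exp(v) for v in [-2, -1, -0.5, 0, 0.5, 1, 2]]
logged = [math.log(y) for y in raw]

print(skewness(raw) > 0)             # True: right-skewed on the raw scale
print(abs(skewness(logged)) < 1e-9)  # True: symmetric after taking logs
```

Of course, the log only helps when the skew is of this multiplicative kind; it is not a universal fix.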

So if transforming variables merely to normalize them is not a great move, what else could be the reason variables are still transformed in practice?

One good reason is that it can make substantive sense, namely when the raw values of the variables are not linearly related. For example, a unit change in X may cause a constant percentage change in Y: a unit change in X has a small effect on Y to begin with, but subsequent increments in X have greater and greater impacts on Y, yielding a non-linear relationship between the raw values. Taking a logarithmic transformation of the response variable (a log-level model) lets us estimate this relationship. A similar transformation of X can be made if a percentage change in X causes a constant unit change in Y; such a transformation is generally used when the impact of the independent variable on the dependent variable decreases as the value of the independent variable increases. Finally, we can take logs of both the response and the independent variable if a percentage change in X causes a constant percentage change in Y, which is called a double-log or log-log model. The estimated parameter there is interpreted as an elasticity.
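The log-level case can be sketched in a few lines of Python. The data and the true coefficients (0.5 and 0.1) are made up for illustration; the fit is an ordinary least-squares line on log(y), computed with the closed-form simple-regression formulas:

```python
import math

# Hypothetical data where a unit change in x gives a constant
# percentage change in y: y = exp(0.5 + 0.1 * x)
xs = list(range(1, 11))
ys = [math.exp(0.5 + 0.1 * x) for x in xs]

# Simple OLS on the log-level model: log(y) = a + b * x
ly = [math.log(y) for y in ys]
n = len(xs)
xbar, ybar = sum(xs) / n, sum(ly) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ly)) / \
    sum((x - xbar) ** 2 for x in xs)
a = ybar - b * xbar

print(round(a, 3), round(b, 3))  # 0.5 0.1
```

Here a one-unit increase in x multiplies y by exp(b), i.e. a constant percentage change, which is exactly the interpretation described above.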

In some cases the relationship between variables is multiplicative, for example

Y = c · K^a · L^b

where a and b are the parameters we want to estimate and c is a constant. Taking logs of both sides gives log Y = log c + a·log K + b·log L, which is linear in the parameters and can therefore be estimated with a Linear Regression.
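A quick numeric sanity check of this linearization, with made-up parameter values (c, a, b and the inputs K, L are arbitrary choices for illustration):

```python
import math

# Hypothetical parameters and inputs for Y = c * K^a * L^b
c, a, b = 2.0, 0.3, 0.7
K, L = 100.0, 50.0

Y = c * K ** a * L ** b

# Taking logs of both sides: log Y = log c + a*log K + b*log L
lhs = math.log(Y)
rhs = math.log(c) + a * math.log(K) + b * math.log(L)

print(abs(lhs - rhs) < 1e-9)  # True: the relation is linear in the logs
```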

Or, in some other cases, a log transformation is used to stabilize the variance of the response (reduce heteroskedasticity).
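A minimal sketch of variance stabilization, assuming a hypothetical response with multiplicative noise (a fixed noise pattern is used instead of random draws so the result is reproducible):

```python
import math

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# Deterministic multiplicative noise pattern (made up for illustration)
noise = [0.9, 1.1, 0.95, 1.05]

# The same noise applied at a small and a large level of x: y = x * noise
y_small = [5 * e for e in noise]
y_large = [500 * e for e in noise]

# Raw scale: the spread grows with the level (heteroskedasticity)
print(round(variance(y_large) / variance(y_small)))  # 10000

# Log scale: the spread is the same at both levels
v1 = variance([math.log(y) for y in y_small])
v2 = variance([math.log(y) for y in y_large])
print(abs(v1 - v2) < 1e-9)  # True
```

The log turns multiplicative noise into additive noise of constant spread, which is why it stabilizes the variance in this situation.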

At the end of the day, all we do is choose a line or functional form that best fits the data, and while doing so the primary consideration must be the nature of the relationship between the response and the independent variables. Whatever we do, there has to be a perfectly good reason for doing it.

Credit: Data Science Central By: Sibashis Chakraborty