We all know that data is going to shape humanity’s future, so understanding data and building intelligence from it is very important.
But learning Data Science can be intimidating at times: firstly because of its complexity and steep learning curve, and secondly because of the overwhelming range of resources.
Research suggests, and I believe, that learning through questions is a good way to challenge your understanding and clear up your concepts. It is also very helpful for technical interviews.
I have collected questions from this very helpful resource, and I rate the difficulty level of each question as given in the source article. Let’s begin:
1. What is regression? Which models can you use to solve a regression problem? [Easy]
Regression analysis is a statistical method for modeling the relationship between a dependent (target) variable and one or more independent (predictor) variables.
For example, you buy a piece of land for $50000 and sell it for $80000 after 3 years. Similarly, you buy more plots at various prices and sell them for various amounts. Given this past data, we want to predict what a plot bought for a certain amount will be worth after 3 years. This is the work of regression.
Regression is a supervised learning technique that helps in finding the relationship between variables and enables us to predict a continuous output variable based on one or more predictor variables. It is mainly used for prediction, forecasting, time-series modeling, etc.
There are different types of Regression:
a. Linear Regression
b. Ridge Regression
c. Lasso Regression
d. Polynomial Regression etc
The description of the above types is another blog altogether. Let’s reserve it for another day.
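Without going into every type, here is a minimal sketch of two of them, ordinary linear regression and Ridge regression, applied to the land example. The prices are made-up numbers, and the helper name fit_linear and the penalty value alpha=100 are my own illustrative choices. Lasso is left out because it has no closed-form solution.

```python
import numpy as np

# Hypothetical data: purchase price and sale price three years later (USD)
bought = np.array([50_000, 60_000, 75_000, 90_000, 120_000], dtype=float)
sold = np.array([80_000, 95_000, 118_000, 141_000, 190_000], dtype=float)

# Work in thousands of dollars to keep the numbers small
x = bought / 1000
y = sold / 1000

# Design matrix with a column of ones for the intercept
X = np.column_stack([np.ones_like(x), x])


def fit_linear(X, y, alpha=0.0):
    """Closed-form least squares; alpha > 0 adds a ridge (L2) penalty on the slope."""
    penalty = alpha * np.eye(X.shape[1])
    penalty[0, 0] = 0.0  # the intercept is not penalized
    return np.linalg.solve(X.T @ X + penalty, X.T @ y)


ols = fit_linear(X, y)                 # [intercept, slope]
ridge = fit_linear(X, y, alpha=100.0)  # slope is shrunk towards zero

new_purchase = 100.0  # a plot bought for $100,000
print("linear prediction (thousands):", ols[0] + ols[1] * new_purchase)
print("ridge prediction (thousands): ", ridge[0] + ridge[1] * new_purchase)
```

The only difference between the two fits is the penalty term, which pulls the Ridge slope towards zero.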
2. What is linear regression? When do we use it? [Easy]
Linear regression is a statistical regression method that is used for predictive analysis.
It is used for solving regression problems in machine learning. Linear regression models a linear (straight-line) relationship between the independent variable (X-axis) and the dependent variable (Y-axis), hence the name linear regression.
The above graph shows the relationship between employee salary and experience. Based on the graph, we can predict an employee’s salary for a given number of years of experience.
Linear regression is generally used in time-series forecasting, stock market analysis, investment profit analysis, weather forecasting, etc. More generally, it is a common starting point for regression analysis in supervised learning.
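As a small illustration of the salary example, the sketch below fits a straight line with one predictor using the textbook least-squares formulas. The experience and salary figures are invented for this example, and predict_salary is a hypothetical helper name.

```python
import numpy as np

# Hypothetical data: years of experience and salary (in thousands)
years = np.array([1, 2, 3, 5, 7, 10], dtype=float)
salary = np.array([40, 45, 52, 63, 75, 92], dtype=float)

# Least-squares fit for a single predictor:
# slope = covariance(x, y) / variance(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(years, salary, bias=True)[0, 1] / np.var(years)
intercept = salary.mean() - slope * years.mean()


def predict_salary(experience_years):
    """Predicted salary (in thousands) from the fitted straight line."""
    return intercept + slope * experience_years


print("predicted salary for 4 years of experience:", predict_salary(4.0))
```

The slope is the estimated salary increase per extra year of experience, and the intercept is the fitted salary at zero years.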
3. What’s the normal distribution? Why do we care about it? [Easy]
The normal distribution, or Gaussian distribution, is a probability distribution that describes how the values of a variable are distributed. It is a symmetric distribution where most of the observations cluster around the central peak and the probabilities for the values further away from the mean taper off equally in both directions. Extreme values in both tails of the distribution are similarly unlikely.
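To see the symmetry and the clustering around the mean in numbers, here is a short sketch using NormalDist from Python’s standard library. The helper name prob_within is my own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k):
    """Probability that a value lies within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")

# Symmetry: the two tails beyond 2 standard deviations are equally likely
left_tail = standard_normal.cdf(-2.0)
right_tail = 1.0 - standard_normal.cdf(2.0)
print("left tail:", left_tail, "right tail:", right_tail)
```

Most of the probability sits within one or two standard deviations of the mean, and the two tails match.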
The normal distribution is the most important probability distribution in statistics because it fits many natural phenomena. For example, heights, blood pressure, measurement error, and IQ scores approximately follow the normal distribution.
It also makes data science easier, because many statistical methods assume normally distributed data.
If we speak more statistically, the normal distribution is important because of the central limit theorem. It says that when we add up (or average) many independent random variables with finite variance, the aggregate tends towards a normal distribution, whatever the distribution of the individual variables.
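The central limit theorem is easy to check with a quick simulation. In the sketch below I sum 50 independent uniform draws, repeat that 10,000 times, and compare the result with what a normal distribution would give. The sample sizes and the seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# 10,000 experiments, each summing 50 independent uniform(0, 1) draws
n_terms = 50
sums = rng.uniform(0.0, 1.0, size=(10_000, n_terms)).sum(axis=1)

# Standardize with the theoretical mean (n/2) and variance (n/12) of the sum
z = (sums - n_terms * 0.5) / np.sqrt(n_terms / 12.0)

# For a normal distribution these shares are roughly 68% and 95%
share_within_1sd = np.mean(np.abs(z) < 1.0)
share_within_2sd = np.mean(np.abs(z) < 2.0)
print("share within 1 sd:", share_within_1sd)
print("share within 2 sd:", share_within_2sd)
```

Each individual draw is uniform, not normal, yet the sums behave very much like a normal variable.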
4. How do we check if a variable follows the normal distribution? [Medium]
4.1. Histogram
The first and foremost method is using a simple histogram. The histogram is a great way to quickly visualize the distribution of a single variable.
The left-hand side graph is normally distributed whereas the right-hand side graph isn’t normally distributed.
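A rough sketch of the same comparison, without a plotting library: it bins a normal sample and a skewed (exponential) sample with numpy and prints a text histogram. The samples are simulated and text_histogram is a hypothetical helper. In practice you would more likely call a plotting function such as matplotlib’s hist.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
normal_sample = rng.normal(loc=0.0, scale=1.0, size=5_000)
skewed_sample = rng.exponential(scale=1.0, size=5_000)


def text_histogram(values, bins=10, width=40):
    """Return the bin counts and one text bar per bin."""
    counts, edges = np.histogram(values, bins=bins)
    lines = []
    for count, left_edge in zip(counts, edges[:-1]):
        bar = "#" * int(round(width * count / counts.max()))
        lines.append(f"{left_edge:7.2f} | {bar}")
    return counts, lines


normal_counts, normal_lines = text_histogram(normal_sample)
skewed_counts, skewed_lines = text_histogram(skewed_sample)

print("normal sample")
print("\n".join(normal_lines))
print("skewed sample")
print("\n".join(skewed_lines))
```

The normal sample peaks in the middle bins and falls off on both sides, while the skewed sample peaks in the first bin and has a long right tail.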
4.2. Box Plot
The Box Plot plots the five-number summary of a variable: minimum, first quartile, median, third quartile, and maximum. The boxplot is a great way to visualize the distributions of multiple variables at the same time.
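The five numbers behind a box plot can be computed directly. This is a minimal sketch using numpy’s percentile function with its default linear interpolation. The function name five_number_summary and the sample values are my own.

```python
import numpy as np


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, and maximum."""
    values = np.asarray(values, dtype=float)
    minimum, q1, median, q3, maximum = np.percentile(values, [0, 25, 50, 75, 100])
    return {
        "min": float(minimum),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(maximum),
    }


summary = five_number_summary([1, 2, 3, 4, 5, 6, 7, 8, 9])
print(summary)
```

For a roughly normal variable the median sits near the middle of the box, and the two halves of the box are of similar length.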
4.3. QQ Plot
QQ plot stands for quantile vs quantile plot. It plots the theoretical quantiles of a reference distribution (here, the normal distribution) against the actual quantiles of our variable.
The QQ plot allows us to see deviations from a normal distribution much better than a Histogram or Box Plot does.
As seen in the above image, the following are the properties of the QQ plot.
- The normal QQ plot follows a straight line (left image).
- The uniform distribution has too many observations at both extremities, i.e., at very high and very low values (middle image).
- The exponential distribution has too many observations at the lower values, but too few at the higher values (right image).
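The points of a normal QQ plot can be computed by hand, which also gives a simple number for how straight the line is. In this sketch the plotting positions (i - 0.5) / n and the use of a correlation coefficient as a straightness score are my own choices, and the three samples are simulated. Libraries such as SciPy offer ready-made QQ plot functions.

```python
from statistics import NormalDist

import numpy as np


def normal_qq_points(values):
    """Return (theoretical, observed) quantile pairs for a normal QQ plot."""
    observed = np.sort(np.asarray(values, dtype=float))
    n = observed.size
    # Plotting positions strictly between 0 and 1 avoid infinite quantiles
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, observed


def qq_straightness(values):
    """Correlation between theoretical and observed quantiles (1.0 is a perfect line)."""
    theoretical, observed = normal_qq_points(values)
    return float(np.corrcoef(theoretical, observed)[0, 1])


rng = np.random.default_rng(seed=2)
r_normal = qq_straightness(rng.normal(size=2_000))
r_uniform = qq_straightness(rng.uniform(size=2_000))
r_exponential = qq_straightness(rng.exponential(size=2_000))

print("normal:     ", r_normal)
print("uniform:    ", r_uniform)
print("exponential:", r_exponential)
```

The closer the score is to 1, the closer the QQ plot is to a straight line. The normal sample should score highest of the three.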
There are many other tests. You can certainly have a look at this blog for more techniques.
5. What if we want to build a model for predicting prices? Are prices distributed normally? Do we need to do any pre-processing for prices? [Medium]
First and foremost, we would analyze the data. We would check whether the data is normally distributed, whether it has outliers, and whether it is univariate or multivariate. This is a regression problem, because we are given past prices and have to predict a continuous price. So we can use algorithms like Linear Regression to solve this problem statement.
To understand whether the data is normally distributed, we can plot a histogram or a QQ plot of the price variable. If the data has multiple variables, we draw one plot per variable. If the model needs the variables on a common scale, we can also normalize the data using sklearn’s StandardScaler or MinMaxScaler, or we can simply code the scaling from scratch. Note that such scaling only shifts and rescales the values; it does not change the shape of the distribution.
As we have seen above, we can apply min-max or standard scaling to the data and, in some cases, remove outliers before carrying out further processing.
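Here is a from-scratch sketch of these pre-processing steps on a handful of made-up prices. The two scaling functions mirror what sklearn’s StandardScaler and MinMaxScaler compute for a single column. The skewness helper, the 1.5 x IQR outlier rule, and the log transform are my own additions for illustration and are not part of the answer above.

```python
import numpy as np

# Hypothetical prices with one extreme value
prices = np.array([120.0, 150.0, 180.0, 200.0, 240.0, 310.0, 2_500.0])


def standard_scale(values):
    """Shift to mean 0 and rescale to standard deviation 1."""
    return (values - values.mean()) / values.std()


def min_max_scale(values):
    """Rescale to the range [0, 1]."""
    return (values - values.min()) / (values.max() - values.min())


def skewness(values):
    """Moment-based skewness; 0 for a perfectly symmetric sample."""
    centered = values - values.mean()
    return float(np.mean(centered ** 3) / np.mean(centered ** 2) ** 1.5)


def iqr_outlier_mask(values):
    """Flag values more than 1.5 * IQR outside the quartiles (an assumed rule)."""
    q1, q3 = np.percentile(values, [25, 75])
    spread = q3 - q1
    return (values < q1 - 1.5 * spread) | (values > q3 + 1.5 * spread)


standardized = standard_scale(prices)
rescaled = min_max_scale(prices)
log_prices = np.log(prices)  # assumed option for right-skewed, positive prices
outliers = iqr_outlier_mask(prices)

print("skewness raw:         ", skewness(prices))
print("skewness standardized:", skewness(standardized))
print("skewness after log:   ", skewness(log_prices))
print("flagged as outliers:  ", prices[outliers])
```

Scaling leaves the skewness unchanged, which is why it cannot make skewed prices look normal. The log transform reduces the skew in this sample, though it does not remove it.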