Deep dive-in : Linear Regression using PySpark MLlib
PREREQUISITE : Amateur level knowledge of PySpark
spark.ml is a package introduced in Spark 1.2, which aims to provide a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
Do not get worried about the imports now. I would explain each packages and its use.
Our data is stored in a Dataframe data . Need to check schema of the data before proceeding as
VectorAssembler accepts the following input column types: all numeric types, Boolean type and vector type, if not one of these types we could have explicitly type-casted
data = data.select(data.columnName.astype(‘int’))
from pyspark import SQLContext, SparkConf, SparkContext
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
conf = SparkConf().setMaster('local').setAppName('ML_learning')
sc = SparkContext(conf=conf)
sqlcontext = SQLContext(sc)
data = sqlcontext.read.csv(path='salary_d.csv', header = True, inferSchema = True)
There are two terms which you often come across in ML. “Feature” and “Label”.
In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon being observed. Choosing informative, discriminating and independent features is a crucial step for effective algorithms in pattern recognition, classification and regression. Features are usually numeric, but structural features such as strings and graphs are used in syntactic pattern recognition. The concept of “feature” is related to that of explanatory variable used in statistical techniques such as linear regression.
In simple terms features are all the independent variables that you think will help you to predict the values of dependent variable. In many a cases there are one or more independent variable or say “Feature”.
VectorAssembler combines all feature into single vector. Assume x1, x2, x3 are three columns having values 1, 2 ,3 which you want to combine into a single feature vector called
features and use it to predict dependent variable. If we set
VectorAssembler input columns to
x3 and output column to
features, after transformation we get the following DataFrame:
| — — — — — — |
| [1.0, 2.0, 3.0]|
Label is dependent variable whose value our model predicts.
data2 = data.select(data.Expr_yrs,data.Salary.alias('label'))
train, test = data2.randomSplit([0.7,0.3])
assembler = VectorAssembler().setInputCols(['Expr_yrs',])
train01 = assembler.transform(train)
''' we only need features and label column '''
train02 = train01.select("features","label")
We split our data into 2 parts train and test, train data is used to train our model based on certain algorithm and perform prediction on test data. Split size is complete choice of programmer and is debatable.
lr = LinearRegression()
model = lr.fit(train02)
test01 = assembler.transform(test)
test02 = test01.select('features', 'label')
test03 = model.transform(test02)
ml package of PySpark and would use as our algorithm for predictive modeling to create
fit() method we passed train data to train our model. Based on this data supplied our model will predict the values for test data which we supplied later and got the above result.
evaluator = RegressionEvaluator()
To verify how good our model is predicting the values of label and whether selecting linear regression as our algorithm for our model was good choice or not ( means whether independent variable explains dependent variable) we use a evaluator.
ml package evaluates regression model. The metric used is R². We can use other metrics like
rmse, mse (root-mean square error and mean square error ) etc.
R² = 0.68. The R² value depends upon train and test data which is divided randomly, so the result might vary.
R² value ~ 1 is a good fit. Hence we can conclude that our assumption that “Years of Experience” explains “Salary”is correct.
Hope the concepts are bit clear now. Feel free to write your suggestions. Thanks !