Credit: IBM

# Offloading your Informix data in Spark, Part 5: Add a little intelligence


## Where are we? Where are we going?

In the first four parts of this series, you learned how to offload data from Informix into Spark and analyze it there.

You might be wondering what else there is to do. It’s true that this tutorial

series has covered many aspects of Apache Spark, but there is so much more

to discover with this analytics platform. It’s time to explore one of the

key features of Spark: its support for machine learning.

Through the previous parts of this series, you’ve used sales

data for a store as an example. The idea is simply to show you what you

can do with machine learning: Forecast future orders based on

previous orders.

What you’ll need:

- The Spark ML (for machine learning) library, which is in the project on GitHub.
- The stores_demo data set included with every Informix® database. Note: Don’t worry if you don’t have Informix knowledge; you do not need it to read and understand this tutorial. Nevertheless, feel free to consider IBM Informix as the RDBMS for your next project.
- The code you used for parts 1-4.
- For this part, the labs are in the `net.jgp.labs.informix2spark.l5x0` package on GitHub.

## Mathematics

I love math. Coming from the French centralized education system, I can tell you that you had better love math if you want to access the best engineering schools. This is also true in many other places, but France seems to be an extreme case.

Unfortunately for me, the mathematics behind machine learning (ML) is a

lot of statistics and probabilities. As much as I enjoyed statistics, I am

not the greatest fan of probabilities. Perhaps this is because of their

random aspects in some scenarios.

Therefore, I always try to minimize the impact of math on ML. I find this makes ML understandable to most of us.

## Linear regression

Linear regression is the concept you will implement. Imagine the following

graph: your x-axis (abscissa) is the week number, your y-axis (ordinate)

is the total amount of orders for this week. It should look like the image

below.

The idea behind linear regression is to draw a straight line, which is the

least distant from all the points on the chart.
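To make the "least distant line" idea concrete, here is a minimal plain-Java sketch of ordinary least squares, the classic way to compute such a line. It is not part of the labs on GitHub; the `LeastSquares` class is illustrative only, and the data points are the weekly totals you will extract later in this tutorial.

```java
// Illustrative only -- not part of the GitHub labs.
// Ordinary least squares fits y = intercept + slope * x by minimizing
// the sum of the squared vertical distances to the line.
public class LeastSquares {

  // Returns {intercept, slope} for the best-fit line.
  public static double[] fit(double[] x, double[] y) {
    int n = x.length;
    double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
    for (int i = 0; i < n; i++) {
      sumX += x[i];
      sumY += y[i];
      sumXX += x[i] * x[i];
      sumXY += x[i] * y[i];
    }
    double slope = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    double intercept = (sumY - slope * sumX) / n;
    return new double[] { intercept, slope };
  }

  public static void main(String[] args) {
    // The weekly totals you will extract later in this tutorial
    double[] weeks = { 21, 22, 23, 24, 25, 26, 28, 30 };
    double[] totals = { 4387.00, 2144.00, 940.00, 450.00,
        1366.80, 2544.00, 3652.97, 2670.00 };
    double[] line = fit(weeks, totals);
    System.out.printf("y = %.4f + %.4f * week%n", line[0], line[1]);
  }
}
```

Spark ML does this (and much more) for you; the point is only that there is no magic behind the line.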

In this context, the regression line is roughly y ≈ 43.35x + 1191.04, where x is the week number and y is the weekly order total; you will compute the exact coefficients later in this tutorial.

You can now imagine that we will continue this line to see where it goes.

However, first you need to know how we got the data.

## Getting the data with Spark

Use the examples in the previous parts of this series and adapt them to get the orders, then group the sales amount by week. The output should look like this:

```
+----------+----------------+
|order_week|sum(total_price)|
+----------+----------------+
|        21|         4387.00|
|        22|         2144.00|
|        23|          940.00|
|        24|          450.00|
|        25|         1366.80|
|        26|         2544.00|
|        28|         3652.97|
|        30|         2670.00|
+----------+----------------+
```

Note: In this part of the series, I’m not explaining every part of

the code. By this time, you should feel comfortable reading the code

without breaking it into small chunks. That said, if you have issues,

please ask questions in the comments.

Your code should look like:

```java
package net.jgp.labs.informix2spark.l500;

import static org.apache.spark.sql.functions.lit;
import static org.apache.spark.sql.functions.weekofyear;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.jdbc.JdbcDialect;
import org.apache.spark.sql.jdbc.JdbcDialects;

import net.jgp.labs.informix2spark.utils.Config;
import net.jgp.labs.informix2spark.utils.ConfigManager;
import net.jgp.labs.informix2spark.utils.InformixJdbcDialect;
import net.jgp.labs.informix2spark.utils.K;

public class OrdersPerWeekApp {

  public static void main(String[] args) {
    OrdersPerWeekApp app = new OrdersPerWeekApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession
        .builder()
        .appName("Sales per week")
        .master("local")
        .getOrCreate();

    // List of all tables we want to work with
    List<String> tables = new ArrayList<>();
    tables.add("orders");
    tables.add("items");

    // Specific Informix dialect
    JdbcDialect dialect = new InformixJdbcDialect();
    JdbcDialects.registerDialect(dialect);

    // Let's connect to the database
    Config config = ConfigManager.getConfig(K.INFORMIX);

    // Let's build our data lake
    Map<String, Dataset<Row>> datalake = new HashMap<>();
    for (String table : tables) {
      System.out.print("Loading table [" + table + "] ... ");
      Dataset<Row> df = spark.read()
          .format("jdbc")
          .option("url", config.getJdbcUrl())
          .option("dbtable", table)
          .option("user", config.getUser())
          .option("password", config.getPassword())
          .option("driver", config.getDriver())
          .load();
      datalake.put(table, df);
      System.out.println("done");
    }

    System.out.println("We have loaded " + datalake.size()
        + " table(s) in our data lake");

    // Let's look at the content
    Dataset<Row> ordersDf = datalake.get("orders");
    Dataset<Row> itemsDf = datalake.get("items");

    // Builds the dataset in 2 steps, first with the week number...
    Dataset<Row> allDf = ordersDf
        .join(
            itemsDf,
            ordersDf.col("order_num").equalTo(itemsDf.col("order_num")),
            "full_outer")
        .drop(ordersDf.col("customer_num"))
        .drop(itemsDf.col("order_num"))
        .withColumn("order_week", lit(weekofyear(ordersDf.col("order_date"))));

    // ... then by summing total_price per week
    allDf = allDf
        .groupBy(allDf.col("order_week"))
        .sum("total_price")
        .orderBy(allDf.col("order_week"));

    allDf.show(50);
  }
}
```

The “meat” of this app starts when you create the `allDf`

dataframe. First, create a column named `order_week`

based on

`order_date`

. Use the `weekofyear()`

static method

to determine the week number from a date and the `lit()`

static

method to create a column from scratch in your dataframe. Both methods are

statically imported at the beginning of the code.
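As a side note, Spark’s `weekofyear()` follows ISO-8601 week numbering (to the best of my knowledge), so you can double-check a week number without starting Spark by using `java.time`. This `WeekOfYearDemo` class is illustrative only and is not part of the labs:

```java
import java.time.LocalDate;
import java.time.temporal.WeekFields;

// Illustrative only: ISO-8601 week numbers, which is what Spark's
// weekofyear() function is expected to return for a given date.
public class WeekOfYearDemo {

  public static int isoWeek(LocalDate date) {
    return date.get(WeekFields.ISO.weekOfWeekBasedYear());
  }

  public static void main(String[] args) {
    // By the ISO definition, January 4 always falls in week 1
    System.out.println(isoWeek(LocalDate.of(2018, 1, 4))); // 1
    // Edge case: December 31, 2018 (a Monday) belongs to week 1 of 2019
    System.out.println(isoWeek(LocalDate.of(2018, 12, 31))); // 1
  }
}
```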

## Data quality

It is always important to look at the data. You might not catch every

anomaly by doing so, especially with big data. But by

looking at it, you can see that weeks 27 and 29 are missing. From

this observation, you have (at least) two decisions to make:

- Ignore the missing data. Perhaps the central system has not been updated yet, it is not the first time it has happened, or maybe it’s the intern who crashed the system the other day.
- Assume there weren’t any orders; it would mean that you have to insert two rows with an amount of 0.

I recommend you go with the first solution: don’t blame it on the interns,

but keep track of your decision.
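Had you chosen the second option, you would have to generate the zero rows. In the Spark pipeline that would be a union with the missing rows; the gap-filling logic itself is trivial, as this plain-Java sketch shows (the class name and the use of a `TreeMap` are illustrative only):

```java
import java.util.TreeMap;

// Illustrative only: insert a 0.00 total for every missing week
// between the first and the last observed week.
public class FillMissingWeeks {

  public static TreeMap<Integer, Double> fill(TreeMap<Integer, Double> sales) {
    TreeMap<Integer, Double> filled = new TreeMap<>(sales);
    for (int week = sales.firstKey(); week <= sales.lastKey(); week++) {
      filled.putIfAbsent(week, 0.0); // only the absent weeks get 0.00
    }
    return filled;
  }

  public static void main(String[] args) {
    TreeMap<Integer, Double> sales = new TreeMap<>();
    sales.put(26, 2544.00);
    sales.put(28, 3652.97); // weeks 27 and 29 are missing
    sales.put(30, 2670.00);
    System.out.println(fill(sales)); // {26=2544.0, 27=0.0, 28=3652.97, 29=0.0, 30=2670.0}
  }
}
```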

## The two-second introduction to machine learning

ML algorithms can be complex. However, the principle is

really easy. You *build* (or *train*) a

*model*, then you apply this model to a data set

to *predict* an outcome. In this scenario, you will execute step 2 only once, but you can easily imagine different scenarios where the model does not change and can be reused in steps 3, 4, and so on.
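The train-then-predict loop can be sketched in a few lines of plain Java. This toy "model" simply learns the mean of past sales; it has nothing to do with the Spark ML API and is only here to show that a model is built once and then reused:

```java
import java.util.function.DoubleUnaryOperator;

// Illustrative only: a toy model that predicts the mean of past sales,
// whatever the week. Real models (like linear regression) do better.
public class TrainThenPredict {

  // Step 1: build (train) the model once from historical data
  public static DoubleUnaryOperator train(double[] pastSales) {
    double sum = 0;
    for (double s : pastSales) {
      sum += s;
    }
    double mean = sum / pastSales.length;
    return week -> mean;
  }

  public static void main(String[] args) {
    DoubleUnaryOperator model = train(new double[] { 4387.00, 2144.00, 940.00 });
    // Step 2 (and 3, and 4...): reuse the same model on new data
    System.out.println(model.applyAsDouble(31));
    System.out.println(model.applyAsDouble(32));
  }
}
```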

Of course, as a data professional, you can imagine the full spectrum of

lifecycle activities deriving from this model: validating, refining,

testing, etc. However, these activities are slightly

outside the scope of this primer.

## Building the model

You have the data, and you have the theory. Now you can practice.

Your first step is to prepare the data for the ML trainer to

digest. It varies depending on the type of algorithm, but the linear

regression expects *features* and *labels*.

In essence, the label is what you are studying, and the features define it.

So, if you look at the orders of week 28 where you made $3,652.97, the label is `3652.97`, and one of its features is `28`.

You can add more features, such as:

- Temperature
- Precipitation level
- Total of orders during the same week of previous years
- Number of days before or after a holiday, etc.

I remember a friend of mine who sold swimming pools. He had roughly a six

months’ lead time. He sold more pools when it was sunny, so

applying the amount of sunshine to his model made sense.

A common mistake is to confuse the label and features, especially in a case

like this one when you only have one feature.

To use a linear regression, Spark expects a vector of features, even if

your vector contains only one element. Basically, Spark expects the

following dataframe:

```
+----------+----------------+-------+--------+
|order_week|sum(total_price)|  label|features|
+----------+----------------+-------+--------+
|        21|         4387.00|4387.00|  [21.0]|
|        22|         2144.00|2144.00|  [22.0]|
|        23|          940.00| 940.00|  [23.0]|
|        24|          450.00| 450.00|  [24.0]|
|        25|         1366.80|1366.80|  [25.0]|
|        26|         2544.00|2544.00|  [26.0]|
|        28|         3652.97|3652.97|  [28.0]|
|        30|         2670.00|2670.00|  [30.0]|
+----------+----------------+-------+--------+
```

You could have simply renamed the `sum(total_price)` column to `label`, but because both the label and features columns are merely technical constraints for the linear regression algorithm, I prefer to keep the data separate from the technical constraints.

To build the vector, you can use a user-defined function (UDF). This

extension creates a vector from the original value.

```java
package net.jgp.labs.informix2spark.l520;

import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.api.java.UDF1;

public class VectorBuilderInteger implements UDF1<Integer, Vector> {
  private static final long serialVersionUID = -2991355883253063841L;

  @Override
  public Vector call(Integer t1) throws Exception {
    double d = t1.doubleValue();
    return Vectors.dense(d);
  }
}
```

The UDF implements `UDF1<Integer, Vector>`: `Integer` is the input type and `Vector` is the return type. Vectors expect double values, so you need to transform the integer into a double.

Before using the UDF, you have to register it in the Spark session. Make

sure to register the UDF right after you create the Spark session.

```java
spark.udf().register("vectorBuilder", new VectorBuilderInteger(), new VectorUDT());
```

In this scenario:

- `vectorBuilder` is the name of the function you are adding to Spark SQL.
- `VectorBuilderInteger` is the class implementing the UDF.
- `VectorUDT` is the return type.

In your transformation code, you can simply call the `vectorBuilder()`

function to create the column.

```java
Dataset<Row> df = allDf
    .withColumn("values_for_features", allDf.col("order_week"))
    .withColumn("label", allDf.col("sum(total_price)"))
    .withColumn("features", callUDF("vectorBuilder", col("values_for_features")))
    .drop(col("values_for_features"));
```

Now that you have the data in the correct form, creating your model only

takes two lines of code.

```java
LinearRegression lr = new LinearRegression().setMaxIter(20);
LinearRegressionModel model = lr.fit(df);
```

## What about a little introspection?

This section is optional. Imagine that I added it to raise the suspense

towards the end goal of discovering future orders — but I also added it for

those math lovers who want to understand that there really is no crystal

ball but some methodology and science.

Spark provides the tools needed to inspect your model. First, apply the

model to the full dataframe you had:

```java
model.transform(df).show();
```

This adds a prediction column (the value on the linear regression

line).

```
+----------+----------------+-------+--------+------------------+
|order_week|sum(total_price)|  label|features|        prediction|
+----------+----------------+-------+--------+------------------+
|        21|         4387.00|4387.00|  [21.0]|2101.3694797687876|
|        22|         2144.00|2144.00|  [22.0]|2144.7183236994233|
|        23|          940.00| 940.00|  [23.0]|2188.0671676300585|
|        24|          450.00| 450.00|  [24.0]|2231.4160115606937|
|        25|         1366.80|1366.80|  [25.0]|2274.7648554913294|
|        26|         2544.00|2544.00|  [26.0]|2318.1136994219646|
|        28|         3652.97|3652.97|  [28.0]|2404.8113872832355|
|        30|         2670.00|2670.00|  [30.0]| 2491.509075144506|
+----------+----------------+-------+--------+------------------+
```

Here’s a look at the different mathematical computations associated with the model:

```java
LinearRegressionTrainingSummary trainingSummary = model.summary();
System.out.println("numIterations: " + trainingSummary.totalIterations());
System.out.println("objectiveHistory: "
    + Vectors.dense(trainingSummary.objectiveHistory()));
trainingSummary.residuals().show();
System.out.println("RMSE: " + trainingSummary.rootMeanSquaredError());
System.out.println("r2: " + trainingSummary.r2());
```

This code returns:

```
numIterations: 1
objectiveHistory: [0.0]
+-------------------+
|          residuals|
+-------------------+
| 2285.6305202312124|
|-0.7183236994233084|
|-1248.0671676300585|
|-1781.4160115606937|
| -907.9648554913294|
|  225.8863005780354|
| 1248.1586127167643|
|  178.4909248554941|
+-------------------+
RMSE: 1246.0139337359603
r2: 0.009719742211204974
```

Let’s look at one of those criteria. The root-mean-square error (RMSE),

also called root-mean-square deviation (RMSD), is used to measure the

differences between values (sample and population values) predicted by a

model or an estimator and the values observed. Because this is a distance,

the smaller the number, the better. And when you compare it to the values of the labels (which range from a few hundred to a few thousand dollars), it means that the model is pretty far off, which is not good.
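If you want to verify where that RMSE comes from, you can recompute it by hand from the residuals column above: square each residual, take the mean, then the square root. A plain-Java check (illustrative only, with the residual values copied from the training summary):

```java
// Illustrative only: recompute the RMSE from the residuals printed
// by the training summary above.
public class RmseCheck {

  public static double rmse(double[] residuals) {
    double sumOfSquares = 0;
    for (double r : residuals) {
      sumOfSquares += r * r;
    }
    return Math.sqrt(sumOfSquares / residuals.length);
  }

  public static void main(String[] args) {
    double[] residuals = {
        2285.6305202312124, -0.7183236994233084, -1248.0671676300585,
        -1781.4160115606937, -907.9648554913294, 225.8863005780354,
        1248.1586127167643, 178.4909248554941 };
    System.out.println("RMSE: " + rmse(residuals)); // ~1246.01, as reported by Spark
  }
}
```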

While this is not good, the explanation is easy. There is a great disparity

in the labels because there is a limited number of features. This is

definitely not big data.

The other parameters define the line: the intercept, the regression

parameter, and the convergence tolerance of iterations.

```java
double intercept = model.intercept();
System.out.println("Intersection: " + intercept);
double regParam = model.getRegParam();
System.out.println("Regression parameter: " + regParam);
double tol = model.getTol();
System.out.println("Tol: " + tol);
```

And the results are:

```
Intersection: 1191.0437572254443
Regression parameter: 0.0
Tol: 1.0E-6
```

## The not-that-magic crystal ball

You are now ready to predict orders for the next three weeks. You are about

to discover the complex code to do so. To let the suspense grow, here is the code first:

```java
for (double feature = 31.0; feature < 34; feature++) {
  Vector features = Vectors.dense(feature);
  double p = model.predict(features);
  System.out.printf("Total orders prediction for week #%d is $%4.2f.\n",
      Double.valueOf(feature).intValue(), p);
}
```

Remember what you saw earlier: features are stored in a vector. Even if you only have one feature (the week number), the model still takes a vector, so you must build that vector from the feature.

Then you can call the `predict()`

method from the model, using the vector.

That’s it; you just made your first prediction!

It takes seven lines of code to apply the model to the new features (here, the

week number). Among these seven lines, three are for display and two are for the

loop. The results:

```
Total orders prediction for week #31 is $2534.86.
Total orders prediction for week #32 is $2578.21.
Total orders prediction for week #33 is $2621.56.
```
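As a sanity check, you can reproduce these three numbers without Spark by applying the line equation directly, using the intercept printed earlier and the slope, which you can read off as the difference between two consecutive weekly predictions in the introspection table (about 43.3488). This `PredictByHand` class is illustrative only:

```java
// Illustrative only: apply the regression line y = intercept + slope * week
// with the coefficients observed during the model introspection above.
public class PredictByHand {

  static final double INTERCEPT = 1191.0437572254443; // printed by model.intercept()
  static final double SLOPE = 43.3488439306357;       // gap between consecutive predictions

  public static double predict(double week) {
    return INTERCEPT + SLOPE * week;
  }

  public static void main(String[] args) {
    for (int week = 31; week <= 33; week++) {
      System.out.printf("Total orders prediction for week #%d is $%.2f.%n",
          week, predict(week));
    }
  }
}
```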

Give yourself a high-five! You followed this long, and hopefully not too painful, tutorial series, and you even discovered that your company's orders are on the rise.

## What you learned

This fifth part of the Offloading your Informix Data in Spark series taught you:

- A use case of ML based on RDBMS data, showing that ML does not always need tons of data.
- A little bit of mathematics, such as how the RMSE measures the quality of your model.
- The importance of data quality (and I should probably teach more DQ).
- That linear regression is an easy form of ML.
- That ML has a simple process: train the model, then reuse the model on new data.

## Farewell

This is also the last part of this series. I sincerely hope you enjoyed it.

I took great pleasure in writing each part of this series, and I want to thank the support team at IBM, especially Robin Wood, who showed patience, tolerated my Frenglish, and was always ready to help. Thanks, Robin.

Let’s keep in touch via Twitter (@jgperrin), email jgp@jgp.net (I

reply to all emails), or in the comments below. See you for more Spark

content and at IBM Think in 2018!
