Variance

Models in Machine Learning

16th Friday Fun Session (Part 1) – 5th May 2017

Usually, machine learning algorithms build a model. We are trying to understand what a model looks like and some associated concepts around it.

Descriptive vs. predictive analytics

Descriptive analytics looks at historical data to understand what happened. It describes historical data using different statistical techniques and visualization.

Predictive analytics looks at historical data to understand and predict future. It also uses different statistical techniques, machine learning algorithms etc.

Both can have models, called as descriptive and predictive models. Here by model we are referring to predictive model.

Understanding model

The globe of the Earth does not show every detail of it but this small model can be put on our desk and it can give us an idea as to how the Earth looks like. When we talk about machine learning model, it is similar to that. It represents the data that is used to build it.

Suppose we have some employee data as shown in the table below.

Training Data.png

We want to build a model using this data set. Later, given an employee’s salary and experience, we would like to know her designation, using it.

Model as logical statement or rule

Based on the above data, we can construct two logical statements as shown below.

If salary is more than 3,000 and experience is more than 5 years 
then the employee is a Senior Software Engineer.

Otherwise, the employee is a Junior Software Engineer

Given an employee’s salary and experience, we can find her designation by using the above formula. That formula is called the model.

Model as function

The formula can take the form of a function as well. Let us draw a function that can work as a model for the above data set.

Function Model.png

We have used X-axis for experience, and Y-axis for salary (in thousands). The three data points would be placed as shown above.

The red line can work as a model. All employees on the left side of it, shown as green points are Junior Software Engineers. And all employees on the right side of it, shown as blue points, are Senior Software Engineers.

Note that, this model is not an exact translation/equivalent of the earlier model expressed as logical statements. Meaning, the same input data might be classified (Junior Software Engineer or Senior Software Engineer) differently.

The (red) line equation can be written as

   x + y = 5
=> x + y - 5 = 0
=> f(x, y) = x + y - 5

The model can be expressed this way:
if f(x,y) >= 0, then Senior Software Engineer
if f(x,y) < 0, then Junior Software Engineer

If a new employee input comes with salary 4,500, and experience 1 year, this model would classify her as a Senior Software Engineer.

   f(x, y) = x + y - 5
=> f(x,y) = 4.5 + 1 - 5
=> f(x,y) = 0.5
=> f(x, y) >= 0

If we use the earlier model to classify this input, it would classify it as Junior Software Engineer – different prediction!

Hybrid model

A model can use a combination of both logical statement and function.

To summarize, a model can be expressed using logical statement, function or a hybrid of both.

When a model is built?

At the beginning when we have some data, usually we split it into training data and test data; we provide the training data to a machine learning algorithm that in turn builds a model. Then we use the test data to check how this model is performing and tune the model, as and if required. The model can be further updated, say, periodically, when more data is gathered.

Different machine learning algorithms build different kinds of model. Some build at first, some delay it.

Eager vs. Lazy learning

When a machine learning algorithm builds a model soon after receiving training data set, it is called eager learning. It is called eager; because, when it gets the data set, the first thing it does – build the model. Then it forgets the training data. Later, when an input data comes, it uses this model to evaluate it. Most machine learning algorithms are eager learners.

On the contrary, when a machine learning algorithm does not build a model immediately after receiving the training data, rather waits till it is provided with an input data to evaluate, it is called lazy learning. It is called lazy; because, it delays building a model, if it builds any, until it is absolutely necessary. When it gets training data, it only stores them. Later, when input data comes, only then it uses this stored data to evaluate the result.

There is a different set of pros and cons associated with eager and lazy learning. It is obvious that lazy learning would take less time during training but more time during prediction.

Eager learning builds a model for the whole data set, meaning it generalizes the data set, at the beginning. It might suffer accuracy compare to lazy learning that has more options in terms of the availability of the whole data set as well as the mechanisms to make use of it.

Lazy learning is typically divided into two types: instance-based learning and lazy Bayesian rules.

Instance-based learning

Do all machine learning algorithms build a model?

No, all machine learning algorithms don’t build a model, when by model we mean generalizing the data. For example, decision tree builds a model but k-NN does not.

A row is also called an instance, meaning a set of attributes. Hence a set of training data is also called a set of instances. When a machine learning algorithm does not build a model, rather uses the set of instances directly to evaluate the result, it is called instance-based learning. It is also called memory based learning as it memorizes the whole data set. For the same reason it is also called rote learning.

Instance-based learning, as mentioned above is one kind of lazy learning.

Supervised, unsupervised and semi-supervised learning

The employee example that we have discussed here is an example of supervised learning. Here we wanted to predict an output variable – designation of an employee by providing input variables – salary and experience. While building the model, we provided training data having most (all for the example here) values for input variables and all values for the corresponding output variable.

An important requirement of supervised learning is that, for all the training data we must provide the output variable value. Because, supervised learning learns from it. Most machine learning algorithms use supervised learning.

However, some machine learning algorithms don’t predict an output variable. They just take only the input variables and try to understand, say the distribution or pattern of the data. They are classified mainly into clustering and association rules. For example, when we do clustering, it might come up and say the given data falls into 3 groups.

There is a third type of learning, in the middle of supervised and unsupervised, called semi-supervised. In many real life examples, a good portion of the training data does not have labels for the target variable, meaning many instances of the training data don’t have the output attribute value known. It might be expensive to label them as it might require domain experts.

In this situation, unsupervised learning comes to rescue. It labels them, and then the labelled data is fed into supervised algorithm (to build model) for prediction. This process (unsupervised algorithm labels them and supervised algorithm predicts), might be repeated unless satisfactory accuracy is acquired.

In the above example, we have seen examples of supervised models. However, predictive models include unsupervised and semi-supervised models as well, the latter being a combination of supervised and unsupervised models.

Parametric vs. non-parametric model

Some machine learning algorithms would come up with a model with a predetermined form. For example, it would construct a function with 2 parameters. Now given the training set it would compute that two parameters and come up with a function. An example would be naive Bayes. They are called parametric. For prediction they would depend on this function alone.

On the other hand, some machine learning algorithms would construct a model based on the information derived from the data. For example, k-NN, C4.5 – a decision tree etc. They are called non-parametric. Non-parametric does not mean no parameter, rather no predetermined parameters.

k-NN as an example of a non-parametric model might create a little confusion as k-NN does not build any model at the first place. Well, here model is used in a broader sense to mean how it computes output value. K-NN uses the complete training data set to do so. Hence the whole training data set is the input parameter. Adding one more training data can be thought as increasing the parameter by one. That perfectly matches another definition of non-parametric model that says – the number of parameters grows with the amount of training data.

Classification vs. regression

The model that we are discussing so far – given salary and experience, predict the designation, is called a classification model. The reason is the output variable, designation – a categorical variable. Meaning, it would take one of the predefined categories or classes. In this example, there are two categories or classes: Junior Software Engineer and Senior Software Engineer.

Let us alter the input and output a bit for the model. Suppose the model would now predict the salary, given experience and designation. Then this model would be called a regression model. The reason is the output variable, salary – a continuous variable. Meaning, it can take any value, not limited by a predefined set of classes, unlike earlier example.

Bias vs. variance

Let us continue with the previous example of the model that predicts salary. Ideally, the salary would be calculated by taking the input values, experience and designation into consideration.

But assume the model that is built by a machine learning algorithm, is so simple and dumb. Let us say, given the training data, it computes the average of the salary (2500 + 5500 + 6200) / 3 = 4733 by ignoring all other parameters. Now when an input comes asking the salary, it does not care the experience or designation of the input. The only output (salary) that comes out of it is 4733. Now that is called a highly biased model. It generalizes the data so much that it ignores the input values of the training data and hence underfits the training data. A biased model that does not distinguish a 2 years experienced Junior Software Engineer from a 15 years experienced Senior Software Engineer, and predicts the same salary of 4733 for both, is not a good model, for obvious reasons.

By the way, what machine learning algorithm can possibly come up with such a model and under what condition?

On the other extreme, there is this model with high variance that considers each minute detail of the training data that is possibly nothing but noise, to be the ultimate truth and incorporates them into the model. Hence it is said to be overfitting the training data, and results in a highly complex model. However, a model with all these intricate truths of training data, even though performs very well with training data (after all, the model is built to overfit the training data), do not stand the test of real world. This kind of model, with highly fluctuating prediction, due to little changes in input parameters, is not desirable either.

What we need is a balance, a trade-off between bias and variance; achieving which is a prerequisite, for a good model.

Understanding Correlation Coefficient

19th Friday Fun Session (Part 1) – 26th May 2017

What is correlation?

Correlation measures whether two sets of variables are related and if yes, how strongly. For example, when rain increases in Singapore, temperature drops. So rain and temperature are negatively correlated. Negatively because, while one increases the other decreases. On the other hand, as job experience increases salary also increases. Hence the two are positively correlated. Let us consider a third case, I am cooking at home on weekends and that time you are running. As far as we understand there is no relation between the two. And hence there is no correlation.

What values does the correlation coefficient take?

Correlation is measured in terms of correlation coefficient. Correlation coefficient varies from -1 to +1. The value +1 means that the two variables are moving together in the same direction (both are increasing or both are decreasing) in the strongest possible way. To be precise, both variables are moving towards the same direction at the same magnitude. The value -1 means the same just that they are moving in the opposite direction. The value 0 means there is no linear relationship. Generally a value between -0.1 to +0.1 signifies no correlation.

How correlation coefficient is calculated?

Correlation is very important in data analytics. Hence it is not enough to know only the meaning of it but also how it is computed. What is the formula that brings out this relationship and how? While doing data analytics (for that matter doing anything) it always helps tremendously if we know underneath mechanism. It helps us to truly appreciate and understand its meaning, and the weakness and strength of it.

Rain and Temperature in Singapore

We take a few days’ small sample of rain and temperature data of April 2017 for Changi, Singapore where our JLTi office is located.

Rain in cm, X = (42.8, 37.8, 30.4) and corresponding temperature in Celsius, Y = (22.8, 22.9, 23.9). Looking at this data we can clearly see while rain increases, temperature always decreases, and vice versa. So there is a strong negative correlation here.

Average

Average1

Sample Variance

Sample Variance1.png

Sample Standard Deviation

Sample Standard Deviation1

Sample Covariance

Sample Covariance

It is the covariance that tells how the two variables are moving together. A non-zero value (3.58) says that they are associated and when it is negative it says that they moving towards opposite direction.

We now need to understand how this value is calculated so that we get the intuition behind it.

Show Points.png

We are talking about two variables: rain and temperature, and whether they are associated or not. By that we mean to say whether on all the three days they moved and if yes, then to which direction. When we are talking movement, we measure it in terms of their respective average.

We see that on day 1, rain 42.8 cm was higher than average 37. On that day, temperature 22.8 degree Celsius was lower than average 23.2. That means: rain higher, temperature lower.

On day 2, rain 37.8 cm was higher than average 37. On that day, temperature 22.9 degree Celsius was lower than average 23.2. That means: rain higher, temperature lower.

On day 3, rain 30.4 cm was lower than average 37. On that day, temperature 22.9 degree Celsius was higher than average 23.2. That means: rain lower, temperature higher.

So in all 3 days both the variables moved from their respective averages and they moved towards opposite direction. The covariance formula captured exactly this. 3 components got added for 3 days. Each day the rain and temperature movement (difference with their respective average) was calculated and multiplied. Since they moved to the opposite direction in each of the three days, a negative value came from each of them in the calculation/formula.

Had they moved towards the same direction all the time, we would have got positive values from all of them resulting in a positive correlation.

Had any of them stayed on average without moving, for example, if on a day rain were 37 cm, same as rain average, there would not be any contribution to covariance. Since movement of one variable, rain would have been 0.

Had some of the days both moved in one direction and some of the days they moved in the opposite direction, then the first set would have given positive contribution and the later negative contribution, cancelling some or all of each other’s positive and negative contribution and in that process reducing covariance, rightfully indicating less association.

Sample Correlation Coefficient

Correlation is, in a sense a normalized form of covariance. We normalize by dividing covariance by the standard deviations of the two variables. Doing so makes sure it stays between -1 to +1. This helps when we want to understand the strength of the relation. It also helps when we want to compare two different correlations, say which correlation is stronger: correlation between somebody’s height and weight, or correlation between Singapore’s rain and temperature.

Correlation coefficient

Why did we add a Sample before all?

We used the term Sample before many of the terms, like sample variance, sample standard deviation, sample covariance, and sample correlation coefficient. Why? Well, we have taken only 3 days’ data (rain and temperature). This is a sample from the complete data set (population).

When we are not dealing with the complete population, rather a sample, we use (n-1) as denominator in the formula. Note that even if we have 3 datasets, we divided by (3-1) = 2 in the variance, covariance etc. formulas.

Had we used the complete population, we could divide by the actual number of dataset. This correction (dividing by n-1 instead of n) is called Bessel’s correction.

Sample and population measurements also use a different set of symbols to indicate them. For example, sample standard deviation is s while population standard deviation is σ (sigma).

Why did we use the term linear relationship?

The correlation coefficient that we discussed above is called Pearson product-moment correlation coefficient developed by Karl Pearson from a related idea from Francis Galton. This measures the linear relationship. What does that mean?

This essentially tries to draw a line to best fit the two sets of data. The correlation coefficient essentially tells how far the points are from the best fit line.

Let us see how well, the 3 data points that we have worked with so far fit in a line, by drawing a scatter plot using R.

Scatter Plot RT

They are fitting in a line quite well (the middle one a bit lower) and that’s why we got a very good negative correlation value, -0.94 (our own manual calculation ignoring some precision might slightly vary if calculated correctly, may be 0.946). How could we get a perfect score of 1? Well, we could get so had they fit in a line in the best way. What does it mean? It means all points should fall on the best fit line. It means, the slope of the line has to be respected by all points. How could we get so? Well, slope = y/x. Suppose the rain points are: 20, 30, and 40. Suppose, we fix the slope at 1.2. Then, the temperature (Y) values have to be: slope  * x, that is: 24, 36, and 48 respectively. Let us now compute the correlation coefficient using R.

> vp1 <-c(20, 30, 40) > vp2 <-c(24, 36, 48) > cor(vp1, vp2)
[1] 1

We get a perfect score of 1! Now let us visualize it once again using R.

> dfp = data.frame(vp1, vp2)
> names(dfp)  dfp
 Rain Temperature
1 20 24
2 30 36
3 40 48
> dfp %>% ggvis(~Rain, ~Temperature) %>% layer_points()

Scatter Plot RTP.png

All three points fit in a line. By the way, did you notice the positive and negative correlation in the the two lines shown in the previous two figures?

Expected Value

One final point: in variance etc. calculation we have used average. In some formulas you might encounter expected value E[X]. If all your friends with similar age and experience earning on average 5K per month, would not you also expect your salary to be hovering around the same? Expectation and average are the same in some cases. In this context, when we talked about expected value of a random variable (rain or temperature), expected value, mean and average, all mean the same.