Ever heard of model assumptions? What are they? And why are they important? A model is a simplified version of reality, and machine learning models are no different. To create models, we need to make assumptions, and if these assumptions are not verified and met, we may get into trouble.
Every (machine learning) model has a different set of assumptions. We make assumptions about the data, about the relationships between different variables, and about the model we create with this data. Most of these assumptions can actually be verified. So one thing you’ll always want to do is ask whether the assumptions have been verified. Some assumptions are only relevant for drawing conclusions about relationships (e.g. a 1-degree increase in temperature corresponds to a 4% increase in ice-cream sales), and others are also relevant for predicting outcomes (we predict ice cream sales of x tomorrow).
Let’s go through the assumptions that are made for the simplest model out there: linear regression.
Assumption 1: fixed regressors
What this actually means is that we assume the variables (input data) are not random variables but fixed numbers, and that if we reran the experiment (collected the data again in the same manner), we would expect the same values.
The opposite of a fixed regressor is a random (or stochastic) regressor, which is typically viewed as a sample drawn from a wider population. If this is the case, you can only draw conclusions ‘conditional’ on the data: the same conclusions hold, but only for this dataset. You cannot generalize beyond it.
The verdict — If your data is (representative of) the population, you are good. Otherwise, try to collect representative data, or draw conclusions only about the data you built the model on.
For business readers — If you have data on all your customers and want to predict the behavior of new customers, you are fine as long as you are targeting a similar type of customer. If not, you risk drawing entirely wrong conclusions about these new customers, and losing them before you have even won them. So ask whether the dataset is representative.
Ask whether the dataset is representative. If the data is representative of the population, you are good.
Assumption 2: random disturbances, zero mean
We assume that the errors around our model are random and, on average, level out across all observations. This is something you can actually check.
The verdict — Take the average of all your error terms and verify whether it is significantly different from zero. If it is → you may want to adjust your model and include more terms.
For business readers — You want your model to predict the right thing. If this condition is not met, you are either systematically under- or overestimating. For example, if your error term is on average 3.5, you are on average overestimating by 3.5. Not a good thing to happen if you are predicting stock prices and making automatic trading decisions. So ask for the average of the error terms.
Ask for the average of the error terms, to understand whether you are over- or underestimating. If the average is about 0, you are good.
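The check above can be sketched in a few lines of Python. The data here is hypothetical and simulated; note that in-sample OLS residuals (from a model with an intercept) average to exactly zero by construction, so the check only bites on held-out prediction errors:

```python
import numpy as np
from scipy import stats

# Hypothetical simulated data: a linear relationship plus Gaussian noise.
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 400)
y = 3.0 + 2.0 * x + rng.normal(0.0, 1.0, 400)

# Hold-out split: fit on the first 300 points, evaluate on the last 100.
x_train, x_test = x[:300], x[300:]
y_train, y_test = y[:300], y[300:]

slope, intercept = np.polyfit(x_train, y_train, deg=1)

# In-sample residuals always average (numerically) to zero here.
train_resid = y_train - (slope * x_train + intercept)

# Out-of-sample errors: one-sample t-test against a mean of zero.
errors = y_test - (slope * x_test + intercept)
t_stat, p_value = stats.ttest_1samp(errors, popmean=0.0)
print(f"mean error = {errors.mean():.3f}, p-value = {p_value:.3f}")
```

A small p-value would suggest a systematic bias in the predictions, which is the signal to add terms to the model.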
Assumption 3: homoscedasticity
The variance of the disturbances exists and is equal across observations. This means we expect the error of the model to be of similar size for all data points, which is sometimes referred to as homogeneity of variance. This only holds if the relationship we are modeling behaves the same way at all levels of the data.
For example, take the relationship between income and spending on travel. The spread will be much smaller for lower incomes than for higher incomes, simply because a higher income leaves more choice in what to spend. The result is that your model gets ‘pulled’ in the wrong direction (because it assumes the spread is equal everywhere and tries to reduce the error), and the influence of the higher-income data points on the model is much larger than that of the lower-income data points.
In addition, this affects your ability to draw conclusions about the significance of your parameters.
The verdict — If you want to use your model for inference, test for homoscedasticity. If you find your error terms aren’t equally spread → scale (one of) your variable(s) or use weighted least squares (WLS).
For business readers — You want the error terms to have homogeneous variance; otherwise, some of your data points may have too large an influence on the model and distort the view for the rest of the data points. It is not that big of an issue for prediction: your model will still predict the right thing. So if prediction is all you care about, this is one to let slip.
If you just want to predict, let this one slip. If you want to draw conclusions about relationships, better make a change.
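One simple way to sketch this check, on hypothetical simulated income-vs-spending data, is a Goldfeld-Quandt-style comparison: sort the data by income and compare the residual variance in the lowest and highest thirds. Under homoscedasticity the ratio should be close to 1:

```python
import numpy as np
from scipy import stats

# Hypothetical data where the noise grows with income: the classic
# heteroscedastic pattern described above.
rng = np.random.default_rng(1)
income = rng.uniform(20, 100, 300)                      # e.g. in thousands
spending = 0.1 * income + rng.normal(0, 0.02 * income)  # spread grows with income

slope, intercept = np.polyfit(income, spending, deg=1)
residuals = spending - (slope * income + intercept)

# Compare residual variance in the lowest vs. highest income thirds.
order = np.argsort(income)
low, high = residuals[order[:100]], residuals[order[-100:]]
f_ratio = high.var(ddof=1) / low.var(ddof=1)
p_value = 1.0 - stats.f.cdf(f_ratio, dfn=99, dfd=99)
print(f"variance ratio = {f_ratio:.2f}, p-value = {p_value:.4f}")
```

A ratio far above 1 with a small p-value flags heteroscedasticity; formal tests such as Breusch-Pagan (available in statsmodels) do the same job more rigorously.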
Assumption 4: no correlation
We assume the error terms are uncorrelated. If they are correlated, there is still “explanatory” power left over, and thus potential to improve the model. Violating this assumption biases the coefficients of your model: the coefficients “absorb” information that belongs in the error terms.
The verdict — If you want to use your model for inference, test for correlation in your error terms, and if you find correlation → add more variables.
For business readers — If you are interested in drawing conclusions about relationships, correlation in the error terms is a no-go. It also tells you there is potential to improve the model and generate better predictions.
If correlation is present, improve the model: your predictions will get better and your inference will start to make sense.
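A standard check for this is the Durbin-Watson statistic on the residuals: a value near 2 indicates no autocorrelation, while values well below 2 indicate positive autocorrelation. A minimal sketch on hypothetical simulated time-series data, where the noise is deliberately AR(1)-correlated:

```python
import numpy as np

# Hypothetical trend data whose errors follow an AR(1) process, so
# consecutive error terms are correlated (violating assumption 4).
rng = np.random.default_rng(7)
n = 300
t = np.arange(n)

e = np.zeros(n)
for i in range(1, n):
    e[i] = 0.8 * e[i - 1] + rng.normal(0, 1)  # autocorrelated noise
y = 1.0 + 0.5 * t + e

slope, intercept = np.polyfit(t, y, deg=1)
resid = y - (slope * t + intercept)

# Durbin-Watson statistic: sum of squared successive differences
# divided by the residual sum of squares.
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(f"Durbin-Watson = {dw:.2f}")
```

Here the statistic comes out well below 2, flagging the autocorrelation; statsmodels also ships this as `durbin_watson` if you prefer a library call.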
Assumption 5: constant parameters
The parameters we estimate with the model are assumed to be fixed and unknown numbers. For starters, if they were known, there would be no need for a model. The reason we assume they are fixed is that we want to rule out changes over time, that is, over the period in which the data was collected. If the parameters do change over time, we may need to include two different parameters or use only the most recent part of the sample.
An example of a violation: the data was collected by asking customers how much money they have paid into their pension fund, and the yearly maximum contribution was raised last year, so suddenly people can add in a few thousand more. In that case, your parameters aren’t constant, and you need to account for it.
The verdict — Can you safely say that the data at hand has been produced by the same process, one that hasn’t changed over time? → Then you are good. If not → adjust your model and allow new variables to enter.
For business readers — The key here is that the data was produced by the same process. Has the data collection changed over time? If it has, conclusions about the relations between the different variables will not hold, and predictions on new incoming data may be under- or overestimated.
Has the data collection changed over time? Then adjust the model, or you risk over- or underestimating your predictions on new incoming data.
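A quick-and-dirty way to probe for a structural break like the pension example is to fit the model separately on the older and newer parts of the sample and compare the estimated parameters (a formal version of this is the Chow test). A sketch on hypothetical simulated data with a deliberate break halfway through:

```python
import numpy as np

# Hypothetical structural break: the slope changes halfway through the
# sample (e.g. after a rule change in how much can be contributed).
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 400)
noise = rng.normal(0, 1, 400)
y = np.where(np.arange(400) < 200,
             2.0 * x + noise,   # old regime: slope 2
             4.0 * x + noise)   # new regime: slope 4

# Fit each half separately and compare the slopes. A gap that is large
# relative to the noise suggests the parameters are not constant.
slope_old, _ = np.polyfit(x[:200], y[:200], deg=1)
slope_new, _ = np.polyfit(x[200:], y[200:], deg=1)
print(f"old slope = {slope_old:.2f}, new slope = {slope_new:.2f}")
```

If the two estimates differ substantially, either model the regimes separately (e.g. with an interaction or dummy variable) or restrict the training data to the current regime.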
Assumption 6: linear model
We assume the relationship between the different variables is linear. If it isn’t, and you have a non-linear relationship, you cannot estimate a model that fits your data properly. Therefore, when you create a linear model, you need to assume linearity. Take, for example, the relationship between outdoor temperature and the number of people on the streets: more people go out as it warms up, but above a certain point the streets empty again. This is not a linear relationship, and if you treated it as one, you would estimate many people on the streets at 50 degrees Celsius.
The verdict — Test for linearity (scatterplots do the trick), and if the relationship isn’t linear → transform your variables or go for a different model.
For business readers — This type of model dictates the structure between what we try to predict and what goes into the model. If the structure isn’t met (in this case linearity), the model is meaningless. You can reason logically about whether the relationship is expected to be linear. If it’s not, and the tests tell you the relationship isn’t linear → this is a no-go, and the model needs adjustment both for drawing conclusions about the relationship and for prediction.
If the model is linear, but the relationship isn’t, you can forget about inference as well as prediction.
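Beyond eyeballing a scatterplot, one simple numeric check is to compare a straight-line fit against a more flexible one. Here is a sketch on hypothetical simulated data shaped like the temperature example above (an inverted U peaking at a comfortable temperature): if a quadratic fit reduces the residual sum of squares dramatically, the relationship isn’t linear.

```python
import numpy as np

# Hypothetical temperature-vs-people-on-the-street data: turnout peaks
# around 22 degrees Celsius and drops at both extremes.
rng = np.random.default_rng(5)
temp = rng.uniform(-10, 50, 300)
people = 1000 - 2.0 * (temp - 22) ** 2 + rng.normal(0, 30, 300)

# Compare a linear fit with a quadratic fit via residual sum of squares.
lin = np.polyfit(temp, people, deg=1)
quad = np.polyfit(temp, people, deg=2)
rss_lin = np.sum((people - np.polyval(lin, temp)) ** 2)
rss_quad = np.sum((people - np.polyval(quad, temp)) ** 2)
print(f"linear RSS / quadratic RSS = {rss_lin / rss_quad:.1f}")
```

A ratio far above 1 is the cue to transform the variables (e.g. add the squared term) or switch to a different model class.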
Assumption 7: normality
This assumption says that the error terms are normally distributed. We want to verify it because it is what lets us run significance tests and define confidence intervals.
The verdict — Plot your error terms and verify whether they are normal. If they are not normally distributed → check your linearity assumption again.
For business readers — This assumption lets us say something about how sure we are of the estimated values in our model. If it is not met, we cannot draw conclusions about relationships; we can still predict, though.
Without this assumption, we cannot say how sure we are about our estimated parameters. We can still predict on new data, though.
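Besides plotting (a histogram or Q-Q plot of the residuals), a common numeric check is the Shapiro-Wilk test from scipy. A sketch on hypothetical simulated data, contrasting well-behaved Gaussian noise with heavily skewed noise:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
x = rng.uniform(0, 10, 200)

# Case 1: hypothetical regression with Gaussian noise.
y = 1.5 * x + rng.normal(0, 1, 200)
slope, intercept = np.polyfit(x, y, deg=1)
resid = y - (slope * x + intercept)
stat, p_value = stats.shapiro(resid)
print(f"Gaussian noise: Shapiro-Wilk p-value = {p_value:.3f}")

# Case 2: heavily skewed (exponential) noise should be flagged.
y_bad = 1.5 * x + rng.exponential(scale=1.0, size=200)
slope_b, intercept_b = np.polyfit(x, y_bad, deg=1)
resid_bad = y_bad - (slope_b * x + intercept_b)
stat_bad, p_bad = stats.shapiro(resid_bad)
print(f"Skewed noise:   Shapiro-Wilk p-value = {p_bad:.6f}")
```

A small p-value (say below 0.05) suggests non-normal errors, in which case the significance tests and confidence intervals from the model should not be trusted.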
Inspired by: “Econometric Methods with Applications in Business and Economics” by Christiaan Heij, Paul de Boer, Philip Hans Franses, Teun Kloek and Herman K. van Dijk
About me: I am an Analytics Consultant and Director of Studies for “AI Management” at a local business school. I am on a mission to help organizations generate business value with AI and to create an environment in which Data Scientists can thrive.