Croatia osiguranje & BIRD Incubator Data Challenge – Part 1
Data science isn’t only about using fancy visualization tools and robust machine learning algorithms. There is a lot of manual hard work beneath the shiny and aesthetically pleasing solutions. In this blog post, divided into two parts, we sum up the data science concepts, technologies, and algorithms we used while working on the Croatia osiguranje & BIRD Incubator Data Challenge.
Let’s dive into the first part, the one where we will explain basic data science concepts and techniques that were necessary to complete this Data Challenge.
Data science techniques
Every data scientist knows that before letting the machine do its job and learn something from the data, data needs to be preprocessed using a number of concepts and techniques that play a key role in ensuring the quality of the final model. Also, after the machine does its job and produces the model we asked for, we should check if the model is good enough for us. In most situations, we will want to have a large number of models that we will later compare using model evaluation techniques.
This section gives a brief overview of key data science concepts and techniques that were used during this Data Challenge to ensure the final solution is the best one possible.
One of the first steps in any data science project should be removing rogue data. Corrupt, inaccurate, and inconsistent data can lead to incorrect conclusions, which means that detecting and removing these records from the dataset is crucial for the project’s success.
We were on the lookout for missing values, duplicates, and outliers. It’s important to note that our dataset was large enough to make it safe to remove the rogue data completely. Had it been significantly smaller, we would have had to fill in the missing values or replace the ones we detected as problematic.
The existence of rogue data in this dataset can be attributed to the fact that every record was entered manually. This is something that should be avoided whenever possible.
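As a rough illustration of the cleaning steps described above, here is a minimal sketch using only the Python standard library. The record layout, the `clean_records` helper, and the 1.5 × IQR outlier rule are assumptions made for this example, not the exact procedure used in the challenge.

```python
from statistics import quantiles

def clean_records(records, field, k=1.5):
    """Drop incomplete rows, exact duplicates, and IQR outliers on `field`."""
    # 1. Drop rows with any missing (None) value.
    complete = [r for r in records if all(v is not None for v in r.values())]

    # 2. Drop exact duplicates while keeping the first occurrence.
    seen, unique = set(), []
    for r in complete:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)

    # 3. Drop outliers: values further than k * IQR from the quartiles
    #    (an assumed rule of thumb, chosen only for illustration).
    q1, _, q3 = quantiles([r[field] for r in unique], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [r for r in unique if low <= r[field] <= high]

# Hypothetical toy data with one duplicate, one missing value, one outlier.
records = [
    {"company": "A", "revenue": 100.0},
    {"company": "B", "revenue": 120.0},
    {"company": "B", "revenue": 120.0},   # duplicate
    {"company": "C", "revenue": None},    # missing value
    {"company": "D", "revenue": 110.0},
    {"company": "E", "revenue": 95.0},
    {"company": "F", "revenue": 105.0},
    {"company": "H", "revenue": 115.0},
    {"company": "I", "revenue": 90.0},
    {"company": "G", "revenue": 9000.0},  # outlier
]
cleaned = clean_records(records, "revenue")
print([r["company"] for r in cleaned])
```

Removing rows outright, as done here, is only reasonable when plenty of data remains afterwards, which matches the situation described above.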
Feature scaling, another important part of data preprocessing, is performed to ensure all values fit into a predefined range. This is important because the scale and distribution of the data can strongly affect the quality of machine learning model outputs.
To know which feature scaling method should be used, and whether one should be used at all, one should be familiar with the nature of the machine learning model being used and with the data itself. After preliminary analysis, we noticed there was a significant difference between the ranges of features in our dataset. That often happens when working with a dataset made up of different types of features. In this case, there were a lot of financial features of differing nature, as well as features regarding the number of employees.
The most frequently used feature scaling methods are Min-Max Normalization, which we used throughout this project, Mean Normalization, and Standardization, also known as Z-Score Normalization.
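To make the difference between two of these methods concrete, here is a small sketch of Min-Max Normalization and Standardization in plain Python. The function names and the toy revenue and employee figures are made up for this example.

```python
from statistics import mean, pstdev

def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Min-Max Normalization: map values linearly onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no range information.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]

def standardize(values):
    """Standardization (Z-score): zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical features on very different scales.
revenue = [200_000.0, 1_500_000.0, 750_000.0, 3_200_000.0]
employees = [4, 35, 12, 80]

scaled_revenue = min_max_scale(revenue)
scaled_employees = min_max_scale(employees)
z_revenue = standardize(revenue)
print(scaled_revenue, scaled_employees)
```

After Min-Max Normalization both features live in the range from 0 to 1, so neither dominates a distance-based model simply because of its units.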
When working with time series, that is, series of data points indexed in time order, stationarity is a very important concept to keep in mind at all times. A stationary time series is one whose mean and variance do not change over time. Put differently, a stationary time series is one whose statistical properties do not depend on the moment at which the series is observed.
If these properties vary over time, the time series is called non-stationary. A non-stationary time series typically follows one of three behavioral patterns: trend, seasonal, or cyclic. In each of these cases, we have to transform the time series into a stationary one. The most commonly used technique for such a task is called differencing, explained in the next subsection.
Detecting stationarity can be done using various methods: inspecting autocorrelation function (ACF) plots, analyzing summary statistics of multiple randomly selected data periods, and conducting statistical tests, the most popular ones being the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test.
We decided to use statistical tests, both ADF and KPSS. Given that we used ARIMA models, differencing was performed automatically as part of the forecasting process. In most cases, ARIMA managed to achieve stationarity, but some time series remained non-stationary even after differencing.
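The summary-statistics check mentioned above is easy to sketch without any extra libraries. The example below is a simplification: it compares consecutive chunks rather than randomly selected periods, and the `split_summary` helper and toy series are assumptions for illustration. For the ADF and KPSS tests themselves, the statsmodels library provides `adfuller` and `kpss` in `statsmodels.tsa.stattools`.

```python
from statistics import mean, pvariance

def split_summary(series, parts=2):
    """Return (mean, variance) for each of `parts` consecutive chunks."""
    size = len(series) // parts
    chunks = [series[i * size:(i + 1) * size] for i in range(parts)]
    return [(mean(c), pvariance(c)) for c in chunks]

trending = [float(t) for t in range(20)]  # the mean keeps growing over time
flat = [1.0, -1.0] * 10                   # mean and variance stay the same

trend_stats = split_summary(trending)
flat_stats = split_summary(flat)
print(trend_stats)  # chunk means differ, which hints at non-stationarity
print(flat_stats)   # chunk statistics match, consistent with stationarity
```

A check like this is only a quick heuristic. Unlike a formal test, it gives no significance level, so it is best used alongside ADF or KPSS rather than instead of them.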
Differencing is a technique used in data science projects to stabilize the mean of a time series over time. This is done by computing the differences between consecutive terms in the time series.
$$\Delta y_t = y_t - y_{t-1}$$
Equation 1: First difference of a time series, where $y_t$ is the value of the series at time $t$
In most cases, a first-order difference is enough to make the time series stationary. However, sometimes a difference of higher order must be used.
$$\Delta^n y_t = \Delta^{n-1} y_t - \Delta^{n-1} y_{t-1}, \quad \Delta^0 y_t = y_t$$
Equation 2: General expression for n-order difference of a time series
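Higher-order differencing is simply first-order differencing applied repeatedly, which the following short sketch shows. The `difference` helper and the quadratic toy series are made up for this example.

```python
def difference(series, order=1):
    """Apply first-order differencing `order` times."""
    result = list(series)
    for _ in range(order):
        # Each pass subtracts every term from its successor,
        # so the series gets one entry shorter per pass.
        result = [b - a for a, b in zip(result, result[1:])]
    return result

quadratic = [t * t for t in range(6)]     # 0, 1, 4, 9, 16, 25
first = difference(quadratic)             # still trending upwards
second = difference(quadratic, order=2)   # constant
print(first, second)
```

The first difference of the quadratic series still grows, while the second difference is constant, which is why a single pass is not always enough.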
Differencing should be applied repeatedly until the time series becomes stationary according to the stationarity checks. An example of an ACF plot before and after differencing is shown in Figures 1 and 2. In the first figure a seasonal pattern can be noticed, while in the second figure the autocorrelations are more randomly dispersed.
Figure 1: ACF plot before differencing
Figure 2: ACF plot after first-order differencing
As already mentioned, ARIMA automatically performs differencing as part of its forecasting process, so we didn’t have to worry about it. Had we not worked with ARIMA, we would have had to do the job manually. The trouble with manual differencing is that predictions produced from a differenced time series are themselves differences. That means we have to apply an inverse difference to the model’s output, which makes the implementation more complicated and error-prone. Also, in the process of differencing and inverse differencing some entries are inevitably lost, which can be especially important when working with smaller datasets.
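To show what the manual route involves, here is a minimal sketch of first-order differencing and its inverse. The helper names and the `predicted_diffs` values are hypothetical; they stand in for the output of a model trained on differenced data.

```python
from itertools import accumulate

def difference(series):
    """First-order difference: one entry shorter than the input."""
    return [b - a for a, b in zip(series, series[1:])]

def inverse_difference(start, diffs):
    """Rebuild levels by cumulatively adding diffs to a known starting level."""
    return list(accumulate(diffs, initial=start))

history = [10.0, 12.0, 15.0, 19.0]
diffs = difference(history)  # one entry is lost here
restored = inverse_difference(history[0], diffs)

# Hypothetical model output, expressed as differences rather than levels.
predicted_diffs = [5.0, 6.0]
# Anchor the forecast on the last observed level, then drop that anchor.
forecast = inverse_difference(history[-1], predicted_diffs)[1:]
print(diffs, restored, forecast)
```

Note that the inverse step needs a known starting level, which is exactly the bookkeeping that makes manual differencing error-prone.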
Techniques used for model evaluation differ depending on the problem that is being solved.
In the case of unsupervised machine learning models, such as clustering, there isn’t a go-to metric, especially if we don’t have the values of a dependent variable. One possible solution to this problem, which is what we did during this project, is to plot each of the trained clustering models and visually assess which one is the best. Additionally, we compared the results of the clustering models with the values of a similar variable inside the dataset and found a satisfactory correlation.
However, when working with supervised machine learning models, there are a number of widely accepted metrics one can use.
To begin with, the coefficient of determination, also known as R-squared (R²), indicates the proportion of variance in the dependent variable that is explained by the regression model. The maximum value of R-squared is 1, which means the regression predictions fit the data perfectly.
Other commonly used metrics are mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Unlike R-squared, these metrics take smaller values for better machine learning models.
We used all of the above metrics to compare models. First, we compared R-squared for each model. Whenever the values were very similar, we would look at the other metrics.
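For reference, all five metrics can be computed in a few lines of plain Python. This is a sketch using the standard textbook definitions. The `regression_metrics` helper and the toy values are assumptions for illustration, and MAPE is expressed in percent here.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Compute R-squared, MSE, RMSE, MAE, and MAPE (in percent)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    mse = ss_res / n
    return {
        "r2": 1 - ss_res / ss_tot,
        "mse": mse,
        "rmse": sqrt(mse),
        "mae": sum(abs(e) for e in errors) / n,
        # MAPE is undefined when any observed value is zero.
        "mape": 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }

# Hypothetical observed and predicted values.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)
```

Because MSE and RMSE square the errors, they penalize large misses more heavily than MAE does, which is one reason to look at several metrics rather than just one.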
Residuals are differences between the observed value of the dependent variable and the value predicted by a regression estimator. They represent the portion of the validation data that can’t be explained using the given model.
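$$e_i = y_i - \hat{y}_i$$
Equation 3: Expression for calculating residuals, where $y_i$ is the observed value and $\hat{y}_i$ is the predicted value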
A residual plot should show that residuals are randomly and symmetrically distributed around the zero line. In other words, there mustn’t be any clearly visible patterns if we are to conclude there’s nothing wrong with the model. If there is a visible pattern, e.g. residuals form a recognizable shape or an outlier exists, the machine learning model should be improved.
Figure 3: Example of a residual plot
Figure 3 shows an example of a residual plot that isn’t ideal. The distribution of residuals closely resembles the normal distribution, which is how it should be, but we can see that the distribution is shifted slightly to the right of zero. That means the difference between observed and predicted values is, on average, larger than zero, from which we can conclude that this model tends to underestimate.
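The same kind of shift can be spotted numerically by looking at the mean of the residuals. Below is a minimal sketch with made-up observed and predicted values, using the convention that a residual is the observed value minus the predicted value.

```python
from statistics import mean

def residuals(observed, predicted):
    """Residual = observed value minus predicted value."""
    return [o - p for o, p in zip(observed, predicted)]

# Hypothetical validation data where predictions are consistently too low.
observed = [10.0, 12.0, 15.0, 11.0, 14.0]
predicted = [9.0, 11.5, 13.0, 10.5, 13.0]

res = residuals(observed, predicted)
bias = mean(res)
print(res, bias)  # a clearly positive mean suggests underestimation
```

A mean residual close to zero does not by itself prove the model is fine, since patterns can still hide in the plot, but a mean far from zero is a quick sign of systematic over- or underestimation.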
Now that we have covered some basic concepts and techniques, stay tuned for part 2 of this blog post, where we turn to the methods we used in the Croatia osiguranje & BIRD Incubator Data Challenge to create a marketing segmentation and forecasting solution.