Croatia Osiguranje & BIRD Incubator Data Challenge – Part 2
Data science tools
As always, it is important to use the right tool for the job when working on a data science project. It is also important to follow trends and stay familiar with the tools and technologies currently available. We choose our technologies carefully, keeping in mind the standards we have to meet to satisfy both ourselves and our clients.
This section gives an insight into the technologies that we, along with millions of others, use every day to make our code more efficient and our user interfaces more inviting.
Python data science stack
When working in Python, we used data science libraries that are widely adopted in the industry: scikit-learn, pandas, and seaborn. Scikit-learn is a simple, efficient, and easy-to-use machine learning library. Pandas is a fast and flexible library for data analysis and manipulation. Seaborn is a statistical data visualization library built on matplotlib that provides a high-level interface for drawing informative and interesting statistical graphics.
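To illustrate how these pieces fit together, here is a minimal sketch of the stack on toy data (the column names are invented for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy policy data (column names are invented for illustration)
df = pd.DataFrame({
    "premium": [120.0, 340.0, 95.0, 410.0],
    "claims": [0, 2, 1, 3],
})

# pandas: quick summary statistics for feature inspection
summary = df.describe()

# scikit-learn: standardize features before feeding them to a model
scaled = StandardScaler().fit_transform(df)

print(summary.loc["mean", "premium"])  # 241.25
```

Seaborn would then plot such a frame directly, e.g. `seaborn.pairplot(df)`, without any manual reshaping.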
We also used Jupyter Notebook, a widely used open-source web application for creating and sharing documents that combine blocks of live code with visualizations.
Furthermore, for forecasting, we used pmdarima, a statistical library for working with time series, and for stationarity checks we used statsmodels, a library for statistical tests and model estimation.
We had to adjust some parts of the code to our problem, but we didn’t write any machine learning algorithms from scratch.
Anvil is a Python platform for building simple, modern, and powerful web applications. It lets the programmer design an interface using drag and drop, integrates with Jupyter notebooks, and can be deployed easily. Its goal is to free programmers from repetitive tasks and let them build attractive, useful applications without complications. It also gives non-technical staff the opportunity to use machine learning without having to work through a ton of tutorials and educational materials.
Clustering is a complex problem in the domain of unsupervised learning. Deciding which models to use can be very difficult, which is why we decided to keep it simple. Analyzing the nature of the data and of the available clustering models led us to pick only two models, one deterministic and one probabilistic. The probabilistic one, the Gaussian mixture model, turned out to be our choice for this Data Challenge.
The biggest challenge was to find the optimal number of clusters. Ultimately, we did it by comparing an existing variable in the dataset with the results of multiple variations of our models, each initialized with a different value for the number of clusters.
To be exact, we compared the distribution of a variable representing the segmentation according to the actual business rules with the distribution of the clusters generated by our models. It's important to point out that our idea wasn't to fit these two distributions perfectly – that would turn this into a supervised learning problem. The existing variable was meant simply to steer us in the right direction while we still detected new rules and reached new conclusions.
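The selection loop described above can be sketched roughly as follows, on toy blobs with a made-up reference segmentation (the comparison metric and data are illustrative assumptions, not the challenge's exact procedure):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Three well-separated toy "customer" blobs of equal size
X = np.vstack([
    rng.normal(loc=c, scale=0.5, size=(100, 2))
    for c in ([0, 0], [5, 5], [0, 5])
])

# Hypothetical reference segmentation shares (stand-in for business rules)
reference = np.sort(np.array([100, 100, 100]) / 300)

best_k, best_gap = None, np.inf
for k in range(2, 6):
    labels = GaussianMixture(n_components=k, random_state=0).fit_predict(X)
    sizes = np.sort(np.bincount(labels, minlength=k) / len(X))
    # Pad both share vectors to equal length, then compare them
    # via total variation distance
    m = max(k, len(reference))
    gap = 0.5 * np.abs(
        np.pad(sizes, (m - k, 0)) - np.pad(reference, (m - len(reference), 0))
    ).sum()
    if gap < best_gap:
        best_k, best_gap = k, gap

print(best_k)
```

The reference distribution only guides the choice of k; the clusters themselves are still discovered freely by the model.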
K-Means is an iterative algorithm whose main goal is to partition the dataset into K clusters, where each observation belongs to exactly one cluster – the one with the nearest mean. It is a simple, general-purpose algorithm that minimizes the distance between points and their cluster centroids.
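A minimal K-Means sketch with scikit-learn, on two synthetic blobs (the data is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, size=(50, 2)) for c in ([0, 0], [8, 8])])

# K-Means alternates between assigning points to the nearest centroid
# and moving each centroid to the mean of its assigned points
km = KMeans(n_clusters=2, n_init=10, random_state=1).fit(X)

print(km.cluster_centers_)  # centroids near (0, 0) and (8, 8)
```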
A Gaussian mixture model is a probabilistic model that assumes all data points come from a mixture of a finite number of Gaussian distributions with unknown parameters. In short, the algorithm first chooses random cluster parameters and then tunes them until the change in parameters falls below a predefined threshold or the maximum number of iterations is reached.
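In scikit-learn, those two stopping criteria map directly onto the `tol` and `max_iter` parameters of `GaussianMixture`; a sketch on a synthetic two-component mixture:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# One-dimensional mixture of two Gaussians centered at -3 and 3
X = np.concatenate([rng.normal(-3, 1, 300), rng.normal(3, 1, 300)]).reshape(-1, 1)

# EM iterates until the improvement drops below `tol`
# or `max_iter` is reached -- the two stopping criteria from the text
gmm = GaussianMixture(n_components=2, tol=1e-4, max_iter=200, random_state=2).fit(X)

print(np.sort(gmm.means_.ravel()))  # approximately [-3, 3]
print(gmm.converged_)
```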
Although the Gaussian mixture algorithm is probabilistic and K-Means is deterministic, the two share some similarities; for example, both take the desired number of clusters as an initial parameter. In the end, Gaussian mixture models gave much better results during this Data Challenge, which is why the final solution was generated using GMM.
The most natural solution to insurance market forecasting seemed to be building a time series and using an ARIMA model. There were several reasons for taking this approach, chiefly the lack of detail in the data and its seasonal nature. ARIMA produced quality forecasts we were satisfied with, and we stuck with it until the end of the project, even though we also tried VAR, a more complex model similar to ARIMA, which didn't give us better results.
ARIMA was the name that instantly popped into our heads when we read the task description for this Data Challenge. ARIMA stands for Autoregressive Integrated Moving Average; it is a statistical model that uses past values of a time series to predict its future points. A non-seasonal ARIMA has three parameters: the order of the autoregressive part, the degree of differencing, and the order of the moving-average part. A seasonal ARIMA has six such parameters – each of the three above for both the non-seasonal and the seasonal part of the model – plus the length of the season.
Let's compare two ARIMA models: one that did poorly and one that produced good results. The first model is the one with poor performance, which all the metrics show – its R-squared is much lower and all its errors are much higher. Perhaps the most intuitive metric is MAPE (Mean Absolute Percentage Error): the first model's error is slightly above 50 percent, while the second one's is below 10 percent.
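For reference, MAPE is simple to compute by hand; a small sketch with made-up numbers:

```python
import numpy as np

def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return np.mean(np.abs((actual - predicted) / actual)) * 100

# A forecast that is off by 10% on every point yields a MAPE of 10
actual = np.array([100.0, 200.0, 300.0])
predicted = actual * 1.10
print(mape(actual, predicted))  # ~10.0
```

Note that MAPE is undefined when any actual value is zero, which is worth checking before relying on it.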
Vector autoregression, or simply VAR, is a statistical model similar to ARIMA, the main difference being that it captures relationships between multiple variables, allowing for multivariate time series. Despite being more complex than ARIMA, VAR didn't generate better results at all. On the contrary, it did much worse, which, combined with its time and performance costs, steered us toward choosing ARIMA as our solution for this project.
In any data science project, you're going to go through trial-and-error phases to determine the best possible tools and models. You have to work with the data, not against it. Sometimes the results won't be satisfactory, and you'll have to adjust your approach. Your best course of action is to do what brings more value to you and your projects, regardless of complexity.