With recent developments in the field of machine learning, collecting and analyzing data has become a crucial part of improving almost every industry in the world. This is why researchers from the Faculty of Education and Rehabilitation Sciences in Zagreb approached us to determine to what degree state-of-the-art machine learning methods can predict the mood of high school students. A successful model could be used for the early detection of mental disorders, such as depression, in high school students.
Understanding the data
At the heart of every machine learning project is data. This makes understanding the dataset the most crucial part of any machine learning project. To understand our dataset, we must first understand how it was collected, what types of data were collected, and which variables will be used as features and which as prediction targets.
Data collection
Data about Croatian high school students was collected (with their knowledge and consent) by the Zagreb Faculty of Education and Rehabilitation Sciences over the course of eight days. It included data about their physical activity, music taste, application usage, amount of sleep, mood (affect), and behavior. Our task was to predict their mood using all other available data sources.
Passive and active data
Before even looking at our dataset, it is important to understand how the data was obtained. Data regarding physical activity, music taste, application usage, and amount of sleep were gathered directly from the subjects' phones, without the subjects needing to enter any information themselves. On the other hand, data about their mood and behavior was collected through questionnaires. The first group of data is called passively collected data and the second actively collected data. Both methods have advantages and disadvantages.
The advantage of passively collected data is that it does not rely on a person's subjective view of the situation. For example, if you were interested in someone's sleep duration, you would probably get more accurate data by measuring it with a stopwatch than by asking them how long they slept that day. The biggest disadvantage of passively collected data is that subjective things like emotions or taste can't be measured. Also, as we will see later on, the quality of passively collected data depends largely on the algorithm used for collection. If the algorithm is flawed, we won't be able to extract useful information from the data.
Unlike passively collected data, actively collected data can be used to gain information about almost any topic; however, its main disadvantage is that it almost always provides data of lesser quality. Different people can react differently to the same situation, and people can, intentionally or unintentionally, provide wrong or conflicting answers to some questions. Since the variable we are trying to predict is actively collected, we will need to take these disadvantages into account when assessing our model's performance.
Exploring the dataset
Now that we understand the data collection process, we can finally dig into our dataset. The first thing we have to do is visualize our variables.
Sleep duration data
One of the things that stands out in our data is that we have two data sources for sleep duration: an accelerometer and the phone manufacturer's internal algorithm. To decide which one to use, we will visualize both using a box plot.

Image 1: Sleep duration box-plots
The box part of the plot represents the interquartile range (IQR) of the data; in other words, it shows us where the middle 50% of the data points are located, with the line inside the box representing the median. The whiskers extend from the box and cover the range of the data beyond the IQR, typically up to 1.5 times the IQR. Data points outside that range are considered outliers and are marked with dots.
Now that we know how to read a box plot, we can combine it with some general knowledge to select the data source we want to use. A major advantage of the accelerometer is that its median sleep duration of around 8 hours is much more realistic than the algorithm's median of more than 9 hours. Also, extreme values like 20 hours of sleep are labeled as outliers in the accelerometer's box plot, while they fall inside the expected distribution for the phone algorithm data. For these reasons, we decided to use the accelerometer data.
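For illustration, here is a minimal sketch of how such a comparison could be plotted with pandas and matplotlib. The file name and the columns `sleep_accelerometer` and `sleep_algorithm` are hypothetical, not the project's actual schema.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical file and column names; the real dataset uses its own schema.
df = pd.read_csv("sleep_data.csv")

fig, ax = plt.subplots(figsize=(6, 4))
# One box per sleep-duration source, values in hours.
ax.boxplot([df["sleep_accelerometer"].dropna(), df["sleep_algorithm"].dropna()])
ax.set_xticklabels(["Accelerometer", "Phone algorithm"])
ax.set_ylabel("Sleep duration (hours)")
ax.set_title("Sleep duration by data source")
plt.show()
```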
Physical activity data
Another interesting data distribution is that of the physical activity data. We have plotted a histogram of how long the subjects were stationary.

Image 2: Stationarity distribution
We can see that the histogram looks mostly normal; however, there is a large number of values representing less than 100 minutes of stationarity per day. Since high school students in Croatia have at least six classes per day, and each class is 45 minutes long, we can conclude that they should spend at least 270 minutes stationary every day. Any value lower than that can be considered an outlier or a measurement error. This is a problem because around 35% of our stationarity data would have to be discarded. To minimize this problem, we decided to replace all outliers with the median value of the rest of the dataset.
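As a rough sketch of this imputation step, assuming a hypothetical `stationary_minutes` column (the 270-minute threshold comes from the reasoning above):

```python
import pandas as pd

MIN_STATIONARY_MINUTES = 270  # six 45-minute classes per day

df = pd.read_csv("activity_data.csv")  # hypothetical file name

# Values below the plausible minimum are treated as measurement errors.
valid = df["stationary_minutes"] >= MIN_STATIONARY_MINUTES
median_valid = df.loc[valid, "stationary_minutes"].median()

# Replace implausible values with the median of the remaining data.
df.loc[~valid, "stationary_minutes"] = median_valid
```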
Correlation between features and labels
Last but not least, it is important to visualize the correlation between our features and our labels. In the next image, we visualize the correlation between physical activity and the positive/negative affect we are trying to predict.

Image 3: Correlation between physical activity and affect
We can see that the correlations are quite low, and we see the same problem across all data sources. To increase the correlations we can try combining certain variables. For example, instead of using time spent on Instagram, Facebook, YouTube, and so on as separate features, we can combine them into a single variable, time spent on social media, which may have a higher correlation. While machine learning models can capture more complex relationships between variables than a simple correlation, such low correlations suggest that the previously mentioned data quality problems will affect our final result.
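A minimal sketch of this kind of feature combination and correlation check, assuming hypothetical column names for the per-app usage and the affect scores:

```python
import pandas as pd

df = pd.read_csv("daily_data.csv")  # hypothetical file and column names

# Combine per-app usage into a single "social media" feature.
social_apps = ["instagram_minutes", "facebook_minutes", "youtube_minutes"]
df["social_media_minutes"] = df[social_apps].sum(axis=1)

# Pearson correlation with the actively collected affect scores.
print(df["social_media_minutes"].corr(df["positive_affect"]))
print(df["social_media_minutes"].corr(df["negative_affect"]))
```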
Aggregated Data
Another problem with our data is that some of it is aggregated by day, which can lead to a significant loss of information. This problem is best explained with the application usage data, but it is also relevant for other data sources. For example, imagine a scenario in which a subject wakes up early, fills out the questionnaire right away, and later that day spends 5 hours on Instagram. If we are trying to determine how social media affects the subject's mood, we will lose that information, since the time spent on social media came after we had already received the information about the subject's mood.
Training and evaluating the model
Now that we have understood and preprocessed our dataset, we can focus on training and evaluating our models. The first step is to split our data into training and validation datasets. A common way to do this is to randomly split the dataset into two parts, with the training dataset being the larger of the two. While this approach can be sufficient, a better way is to use a method called cross-validation.
K-fold cross-validation
Cross-validation is a method used in machine learning that enables us to use most of our dataset. Instead of randomly splitting the dataset once and then using only a portion of the data for training, we will split the dataset into K equal parts called folds. One fold is used as the validation set, while the remaining folds are combined to form the training set. The model is then trained on the training set and evaluated on the validation set. This procedure is repeated multiple times, with each fold taking a turn as the validation set.
Cross-validation is valuable for hyperparameter tuning and model selection, as it helps us choose the configuration that generalizes best to unseen data. Other benefits of cross-validation are a better estimate of performance and efficient use of data, since we can retrain the model on all of the data once we have found the optimal hyperparameters.
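As a minimal illustration of plain K-fold cross-validation with scikit-learn (the estimator choice and the randomly generated data are placeholders, not the project's actual setup):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.svm import SVR

# Placeholder data: in practice X and y come from the preprocessed dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 10))
y = rng.normal(size=150)

# Five folds: each fold serves once as the validation set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(SVR(), X, y, cv=cv,
                         scoring="neg_mean_absolute_error")
print(-scores.mean())  # average MAE over the five validation folds
```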
Group K-fold cross-validation
There is one problem with using either a random split or plain cross-validation on our dataset: they both assume the data points are independent. While this assumption holds for many datasets, in our case there are two reasons why our data is not independent. First, there is a time variable built into our data. Remember that the subjects were monitored over eight days; if we were to randomly split our data, we could end up predicting the past using future data, which is not something we ever want to do. Second, our data contains groups: each subject has their own characteristics, which makes data points from the same subject mutually dependent. To combat these issues, we will use group k-fold cross-validation.
Group k-fold cross-validation is a variant of cross-validation that ensures that samples within the same group always appear in the same fold during the cross-validation process. This means no group appears in both training and validation datasets. The difference between regular cross-validation and group cross-validation is visualized in Image 4.

Image 4: Visualization of cross-validation and group cross-validation
Each bar represents one iteration. The yellow section of the bar represents the validation fold and the gray section represents the training folds. The last bar represents the groups inside the data. The groups in these images are for visualization purposes only and do not represent our data. If we look at the plot on the right, we can see that no group appears in both the validation and training folds (yellow and gray sections) in any of the cross-validation iterations. This is not the case in the plot on the left.
By using group k-fold, the model is always evaluated on groups it never saw during training, which gives us a much better estimate of how well it generalizes to new, unseen groups.
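A minimal sketch of group k-fold splitting with scikit-learn; the subject identifiers used as groups and the generated data are placeholders:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Placeholder data: one row per subject per day, eight days per subject.
n_subjects, n_days = 150, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n_subjects * n_days, 10))
y = rng.normal(size=n_subjects * n_days)
groups = np.repeat(np.arange(n_subjects), n_days)  # subject id per row

gkf = GroupKFold(n_splits=5)
for train_idx, val_idx in gkf.split(X, y, groups=groups):
    # No subject appears in both the training and the validation indices.
    assert not set(groups[train_idx]) & set(groups[val_idx])
```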
Model training and results
We can finally train our machine learning models. For training, we used nested group k-fold cross-validation and tuned the hyperparameters with a grid search. To evaluate our models we used the mean absolute error (MAE), which is the average absolute difference between the predicted and true affect values. The models we tried were linear regression, random forest, support vector machine, and simple neural networks. Support vector machines provided the best results, with an error of 0.650 for the positive affect and 0.629 for the negative affect.
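A rough sketch of a nested group k-fold setup with a grid search over SVR hyperparameters; the parameter grid, pipeline, and generated data are assumptions for illustration, not the exact configuration used in the project:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data: in practice X, y, and groups come from the dataset.
n_subjects, n_days = 150, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(n_subjects * n_days, 10))
y = rng.normal(size=n_subjects * n_days)
groups = np.repeat(np.arange(n_subjects), n_days)

param_grid = {"svr__C": [0.1, 1, 10], "svr__epsilon": [0.01, 0.1]}
outer_cv, inner_cv = GroupKFold(n_splits=5), GroupKFold(n_splits=3)

maes = []
for train_idx, val_idx in outer_cv.split(X, y, groups=groups):
    # Inner loop: grid search with group-aware splits on the training part.
    search = GridSearchCV(
        make_pipeline(StandardScaler(), SVR()),
        param_grid, cv=inner_cv,
        scoring="neg_mean_absolute_error",
    )
    search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])
    # Outer loop: evaluate the tuned model on held-out subjects.
    preds = search.predict(X[val_idx])
    maes.append(mean_absolute_error(y[val_idx], preds))

print(np.mean(maes))  # nested cross-validation estimate of MAE
```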
Final thought
As a final thought, we want to highlight the importance of data privacy. We managed to gain significant insight into the mental state of high school students using a small dataset of only 150 subjects. Imagine what could be done if the dataset were scaled to the several million, or even billion, users available to big data companies.