Technology and the tech era have brought many advantages, both to individuals and to companies. But no one is influenced by technology and its trends more than developers and engineers. Trends and hype push them into constant switchovers between technologies. New technology can significantly improve an engineer’s life, but in some cases it can also become their nemesis.
The Loan
For us, the story began while developing a data-driven application for one of our clients. The main aspect of the application is data visualization, with the underlying data coming from a database designed specifically to serve that purpose. The existing database system was fairly expensive and did not cover all of the functionality we badly needed. As the license was about to expire, it was decided to replace it with an alternative technology.
But which one to choose? The license expiration set a short deadline, and a lack of experience in the domain did not help either. Some of the technical staff had limited experience with a distributed in-memory database called Apache Ignite, so the tech team ran tests to confirm that Ignite was the right choice for the current needs. At the time, the data-driven application was in its initial phase of development. Since the requirements and scope of the application were not yet set, the tech team had the thankless job of testing for capabilities that would have to stay future-proof. In the end, after basic testing produced positive results, the decision was made and the existing database system was replaced with Apache Ignite.
Looking back on the decision, one could ask a few questions. The first is about complexity. From an infrastructure and configuration standpoint, the distributed nature of Apache Ignite brings extra complexity, so it is questionable whether that complexity was worth it for our use case. Maybe a simpler alternative would have worked just fine.
That brings us to the next question. Were any other alternatives tested and compared to Apache Ignite? It’s always beneficial to compare multiple options and decide which one suits you best.
Finally, there is the question of whether the testing phase covered all the important aspects. Knowing the constraints the team was under, it is hard to blame them for the gaps in testing.
These issues can be linked to two terms: technical debt and the hype around new technologies.
The technical debt was taken on when a fast decision had to be made, prioritizing client value and time constraints over the technical aspects of the solution. It may not have been the ideal choice, but the transition to Ignite was done quickly and solved the issue the client had. In other words, the team consciously decided to take a risk, solve the current needs, and deal later with whatever issues might occur along the way.
Hype becomes an issue when a certain technology, or a group of them, becomes trendy. Advertising and media influence, combined with a lack of experience and maturity, can create the illusion that the technology will act as a silver bullet for all of our issues. This phenomenon is frequently compared to the Dunning-Kruger effect, which has its origins in psychology. One well-known example of its application to technology is the Gartner Hype Cycle, which visually represents the maturity of a technology and its suitability for solving business problems.
This happened with Apache Ignite as well. At the time, it was at the peak of its popularity, which pushed the technical staff to consider it. While preparing this blog post, I found an interesting article that describes hype in the tech industry. It defines a term called the “loudest guy-driven decision”, meaning that decision-making is left to the team member(s) who talk the most, often on the basis of limited experience gathered from talks or conferences. To counter the hype, keep maturity in focus, and reduce bias, it is beneficial to involve more people in the decision-making process.
Debt Collection
Every debt eventually gets collected, and the technical kind is no different. Let’s see how ours accumulated over time, and what caused it to explode in the end.
The initial usage period wasn’t bad per se. Configuration and infrastructure were kept as simple as possible: for each environment we had two single-node clusters, one serving the application and the other acting as a backup replica. Hosting on the AWS cloud gave us flexibility in resource allocation; resources like RAM, CPU, and disk were sized based on estimations and adjusted based on monitoring results.
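For a sense of scale, here is a minimal sketch of what a single-node setup like ours could look like when configured programmatically. The region name and sizes are hypothetical placeholders, not our actual production values:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SingleNodeStartup {
    public static void main(String[] args) {
        // Hypothetical sizing -- in practice the values came from estimations
        // and were adjusted based on monitoring results.
        DataRegionConfiguration region = new DataRegionConfiguration()
                .setName("visualization_data")
                .setInitialSize(4L * 1024 * 1024 * 1024)   // start with 4 GB
                .setMaxSize(16L * 1024 * 1024 * 1024)      // cap the in-memory region at 16 GB
                .setPersistenceEnabled(true);              // keep a copy of the data on disk

        IgniteConfiguration cfg = new IgniteConfiguration()
                .setDataStorageConfiguration(new DataStorageConfiguration()
                        .setDefaultDataRegionConfiguration(region));

        // Each environment ran two of these single-node clusters:
        // one serving the application and one acting as a backup replica.
        Ignite ignite = Ignition.start(cfg);
    }
}
```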
As the application was in an early stage of development, each sprint brought many new features. One of them included advanced data filtering and aggregations. Without going into too much detail, the core of the feature was to let users choose their own subsets of data on which they could perform filtering and aggregations. To keep this efficient and the application performant, we had to introduce some data redundancy.
All at once, our tables stored many more records than before. As mentioned, the application is built around visualizations, all of which pull their data from Apache Ignite using SELECT statements. The increased number of records, combined with frequent querying, led to an increased load on our database system. Unfortunately, this was not fully taken into account: the testing phase was quite short and was done in an environment that lacked a representative subset of production data.
Why We Don’t Deploy on Fridays
We deployed the new feature on Friday morning. The monitoring did not show any issues during the initial data synchronizations and throughout the rest of the day. It seemed like everything went off without a hitch. Or so we thought.
The first issue occurred in the late afternoon, which wasn’t a surprise since most of our users are based in the US. Given the time difference, our afternoon is the start of their working day.
It all started with an application slowdown, which reflected the slowness of the database serving the data. We did not have to wait much longer for the database to crash under such a load. The backup server was assigned to take over, but the outcome was the same. As a consequence, we had to roll back the feature and postpone the release until we figured out what had happened.
It was an unforgiving situation, caused by the debt of the previous year combined with the lack of performance testing during feature development. We had no option but to repay it all at once.
Let’s Get Down to Business
Our data team took on the task of fixing the issue. But to do that, we first had to understand its root cause.
Knowing the architecture, we identified two main factors that contribute to the load on the database. The first is related to the application itself. As users open visualizations, the application sends requests; the backend then translates these requests into SQL queries and executes them against the database. Our concern was whether the requests were too heavy for the database to handle. Here, heaviness refers to the requested data volume (which could theoretically be much larger than before this feature), the request frequency, or a combination of both.
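To make the notion of heaviness more concrete, here is a hypothetical sketch of the kind of SQL the backend could generate for a single visualization. The table and column names are made up for illustration and are not our actual schema:

```java
// A hypothetical example of a query generated for one visualization. "Heavy" here
// means it scans and aggregates a large, user-defined slice of a big table rather
// than a small, indexed one.
public final class GeneratedQueries {

    public static final String REGIONAL_TOTALS =
            "SELECT region, metric_name, SUM(metric_value) AS total " +
            "FROM user_metrics " +                    // wide, redundant table introduced by the new feature
            "WHERE metric_date BETWEEN ? AND ? " +    // the user-selected range may span years of data
            "GROUP BY region, metric_name " +
            "ORDER BY total DESC";

    private GeneratedQueries() { }
}
```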
The other source of database load is the scheduled synchronization processes that fill the database with fresh data. Since the new feature introduced redundancy, a valid question was whether these data ingestion processes were causing unexpectedly high load in the production environment.
Once we knew what the possible causes were, we started working on a test plan. The idea was to simulate application usage similar to the usage at the time our database server crashed. We knew this meant introducing parallelization into our test process: we had to simulate requests from many users, each accessing different visualizations in the application.
To simplify and speed up the process, we decided to create a representative subset of visualizations by sampling them based on their complexity. That way we avoided the mistake of ending up with a subset of “light” queries that do not contribute significantly to the database load.
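The sampling itself does not have to be sophisticated. A minimal sketch of the idea, with a made-up complexity heuristic, could look like this:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class QuerySampler {

    // Rough, hypothetical heuristic: aggregations, sorting, and missing filters
    // make a visualization query "heavier".
    static int complexity(String sql) {
        String s = sql.toUpperCase();
        int score = 0;
        if (s.contains("JOIN")) score += 2;
        if (s.contains("GROUP BY")) score += 2;
        if (s.contains("ORDER BY")) score += 1;
        if (!s.contains("WHERE")) score += 3; // unbounded scans hurt the most
        return score;
    }

    // Take up to `perBucket` queries from every complexity bucket, so heavy
    // queries are guaranteed to be represented in the test subset.
    static List<String> sample(List<String> allQueries, int perBucket) {
        Map<Integer, List<String>> buckets =
                allQueries.stream().collect(Collectors.groupingBy(QuerySampler::complexity));
        List<String> sampled = new ArrayList<>();
        buckets.values().forEach(bucket ->
                sampled.addAll(bucket.stream().limit(perBucket).collect(Collectors.toList())));
        return sampled;
    }
}
```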
“With a new environment that simulated production on the day of the release, we had full control over what was happening with data.”
To test the data ingestion processes as the second contributor, we decided to take the same jobs we run in production and test their impact in the worst-case scenario. For data ingestion or synchronization processes, the impact depends directly on the number of records being ingested into the database. For that reason, we configured these jobs to ingest as many records as theoretically possible and measured how that affected database performance.
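As a sketch of that idea, a worst-case ingestion run could be driven through Ignite’s data streamer roughly like this. The cache name, record count, and payload are hypothetical, and our real synchronization jobs are more involved:

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.Ignition;

public class WorstCaseIngestionTest {
    public static void main(String[] args) {
        // Join the test cluster as a client node (cluster configuration omitted).
        Ignition.setClientMode(true);
        try (Ignite ignite = Ignition.start()) {
            long recordCount = 50_000_000L; // the theoretical maximum, hypothetical here
            long start = System.currentTimeMillis();

            // The data streamer batches and distributes puts -- the typical way
            // a bulk synchronization job loads data into Ignite.
            // The "userMetrics" cache is assumed to already exist on the cluster.
            try (IgniteDataStreamer<Long, String> streamer = ignite.dataStreamer("userMetrics")) {
                streamer.allowOverwrite(true);
                for (long i = 0; i < recordCount; i++) {
                    streamer.addData(i, "record-" + i); // placeholder payload
                }
            }

            System.out.printf("Ingested %d records in %d ms%n",
                    recordCount, System.currentTimeMillis() - start);
            // The server-side impact (heap, GC pauses, CPU) is observed separately,
            // e.g. through VisualVM.
        }
    }
}
```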
Once our test plan was defined, we started with experiments. The first task was to set up a new environment that simulated production on the day of release. This way, we had full control over what was happening with the data, thus ensuring test repeatability.
To be effective, we had to choose tools appropriate for our case.
The application-driven load was simulated with a combination of two tools: Gatling and JMeter. Gatling is a great tool for simulating the stress generated by web applications, but since we knew the application itself was not the issue, we decided to focus on the SQL queries it generates in the background. This is where JMeter comes into play.
Finally, we had to visualize the database server’s performance metrics, for which we used VisualVM.
- Gatling – load testing tool for web applications
- JMeter – load testing tool, used here to load test the database via JDBC connections
- VisualVM – used to visualize database server performance metrics
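For the curious, the essence of what the JMeter JDBC samplers do can be sketched in plain Java. The connection URL, queries, and user count below are placeholders, since the real load was driven from a JMeter test plan:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class JdbcLoadSimulation {
    public static void main(String[] args) throws Exception {
        // Sampled visualization queries (placeholders for the real generated SQL).
        List<String> sampledQueries = List.of(
                "SELECT region, SUM(metric_value) FROM user_metrics GROUP BY region",
                "SELECT * FROM user_metrics WHERE metric_date = CURRENT_DATE()");

        int virtualUsers = 50; // hypothetical number of concurrent "users"
        ExecutorService pool = Executors.newFixedThreadPool(virtualUsers);

        for (int u = 0; u < virtualUsers; u++) {
            pool.submit(() -> {
                // Each virtual user opens its own connection through the Ignite thin JDBC driver.
                try (Connection conn =
                             DriverManager.getConnection("jdbc:ignite:thin://127.0.0.1:10800");
                     Statement stmt = conn.createStatement()) {
                    for (String sql : sampledQueries) {
                        long start = System.nanoTime();
                        try (ResultSet rs = stmt.executeQuery(sql)) {
                            while (rs.next()) { /* drain the result set */ }
                        }
                        System.out.printf("%s -> %d ms%n",
                                sql, (System.nanoTime() - start) / 1_000_000);
                    }
                } catch (Exception e) {
                    e.printStackTrace(); // slowdowns and crashes surface here
                }
                return null;
            });
        }

        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.MINUTES);
    }
}
```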
“Given the server configuration and nature of the queries we had, all the issues should’ve been expected.”
Experiments showed that data ingestion processes alone were unable to crash the database or even slow it down in a way that would cause issues. After simulating application-caused load, we confirmed that heavy queries combined with large production datasets can lead to database slowdowns and crashes.
The output of the testing phase was a list of visualizations that could cause issues for our database.
Query analysis and the available resources showed us that Apache Ignite is not the best choice for our scenario. It turned out that all the issues we faced were to be expected given the server configuration and the nature of our queries.
Here are the most important downsides of Apache Ignite related to the configuration and scenario we had:
- It is not well suited to a single-server deployment
- A custom, complex SQL query is likely to run for a long time or even cause out-of-memory errors
- GROUP BY / ORDER BY over a large table without a condition can cause out-of-memory errors
- SELECT over a large table without a condition can cause out-of-memory errors (a sketch of this pattern follows below)
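To illustrate the last two points, this is roughly the query pattern that hurt us: an unbounded SELECT with a sort can force the query node to build up the whole result set in memory before returning anything. The cache and table names are again hypothetical:

```java
import java.util.List;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.query.SqlFieldsQuery;

public class UnboundedQueryExample {
    public static void main(String[] args) {
        Ignition.setClientMode(true);
        try (Ignite ignite = Ignition.start()) {
            // The cache backing the hypothetical USER_METRICS table is assumed to exist.
            IgniteCache<Long, Object> cache = ignite.cache("userMetrics");

            // No WHERE clause: the whole table is scanned, and the sort forces
            // the result set to be materialized before anything is returned.
            SqlFieldsQuery unbounded = new SqlFieldsQuery(
                    "SELECT region, metric_name, metric_value " +
                    "FROM user_metrics " +
                    "ORDER BY metric_value DESC");

            // On a large table, this is the pattern that ran for a long time or
            // ended in an out-of-memory error on our single-node server.
            List<List<?>> allRows = cache.query(unbounded).getAll();
            System.out.println("Rows fetched: " + allRows.size());
        }
    }
}
```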
To see how we solved these issues and what approaches we took, stay tuned for part II of this post, where we will walk through the solutions and the query migration that resolved our Friday deployment setback.