In part I, we described the problems caused by the wrong technology choice. Now, in part II, we discuss how we searched for, and migrated to, a better solution.
There were two possible solutions to our problems. The first was to keep Ignite and invest effort in optimizing queries and improving the cluster configuration: rewriting queries, changing configuration options, optimizing indexes, and adding more nodes to the cluster. The main concerns with this approach were the time and money the changes would require, and there was no guarantee they would achieve the desired results. The alternative was to replace Ignite with a well-known, mature technology. During the testing phase, we had found that limited resources and a smaller developer community seriously impaired our ability to understand the issues we faced. A mature technology with a larger community and plenty of available resources would certainly help us along the way.
“When comparing test results, PostgreSQL was 10 to 1000 times faster than Ignite.”
After some research, we decided to try out PostgreSQL, which matches the criteria above. Migrating to a new database engine is not a trivial task. It requires modifying the application code that connects to the database, as well as all ETL processes connected to it. We definitely did not want to invest that much effort only to find out that the alternative did not work either.
That’s why we went back to the list of sampled visualizations that generated the most stress on the database. The test setup required only a new PostgreSQL database instance and the production snapshot data we had used during the experiments. As SQL syntax differs between Ignite and PostgreSQL, we had to modify the table DDLs and queries. Since we tested only a subset of the most relevant queries, this step did not take much time, even when done manually.
Comparing test results against Ignite showed that PostgreSQL was 10 to 1000 times faster. It also stayed stable during the entire test process. It’s important to note that the allocated server resources were the same for both engines, meaning the same infrastructure cost. Having a single server instead of a cluster topology also resulted in simpler configuration and administration procedures compared to Ignite.
Promising results like these led us to full-scale testing of the entire application. This is where we get to heterogeneous database migration. Heterogeneity means that the existing and new databases are not the same technology: they differ in schema structure, data types, and SQL syntax.
Heterogeneous migration is highly demanding. There are tools that can help, but none of them completely solves the problem without manual intervention (even though some guarantee it). This is especially true for query translation. Besides that, Ignite is not a widely adopted technology, so fitting these tools to it might be challenging. Finally, there is the question of how much time it would take us to master them.
The first phase of the full-scale migration was assessment, where we examined the complexity of the database schema and queries, the amount of data we had to transfer, the application changes needed to support the new engine, and the impact on the infrastructure.
Starting with the last one, infrastructure, we had no major issues. All systems were hosted on the Amazon cloud, so setting up a new database was simple. Besides the new development and production environments, we created two additional experimental environments: one for a PostgreSQL-based application and the other for an Ignite-based one. The idea was simple: the only way to test whether the migration was successful was to compare visualizations with the existing application. As the development environment is constantly changing (in terms of data and features), we needed stable, isolated environments to experiment with.
Once we had our environments in place, we could start the migration. First, we had to create the schemas and tables specified by the DDLs, a step known as schema migration. As the number of schemas and tables was not large, we took the manual migration approach. The most common changes involved differences in data type names.
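A minimal sketch of what such a DDL translation can look like. The specific type mappings here (TINYINT, DOUBLE, VARBINARY) are illustrative assumptions based on common differences between Ignite's H2-style types and PostgreSQL, not the exact list from our project:

```python
import re

# Hypothetical mapping of Ignite (H2-style) type names to PostgreSQL
# equivalents -- the exact set depends on the schemas being migrated.
TYPE_MAP = {
    "TINYINT": "SMALLINT",
    "DOUBLE": "DOUBLE PRECISION",
    "VARBINARY": "BYTEA",
}

def translate_ddl(ddl: str) -> str:
    """Replace Ignite type names with PostgreSQL ones in a CREATE TABLE DDL."""
    for src, dst in TYPE_MAP.items():
        # Word boundaries keep us from rewriting substrings of other identifiers.
        ddl = re.sub(rf"\b{src}\b", dst, ddl, flags=re.IGNORECASE)
    return ddl

print(translate_ddl("CREATE TABLE t (a TINYINT, b DOUBLE)"))
# CREATE TABLE t (a SMALLINT, b DOUBLE PRECISION)
```

With only a handful of tables, running a helper like this by hand on each DDL is perfectly adequate; automation only pays off later, at query-translation scale.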
“The biggest challenge we faced was related to the query translation process.”
The next step was to populate the database with the data snapshot from our Ignite experimental environment. For this purpose, we used the Pentaho Data Integration tool. Having used it for ETL development back in the day, we were familiar with its functionality, which let us implement the data migration quickly and easily.
The application development team did a great job on their side. The main activities included changing and configuring the connection pool and the way the application builds queries on the backend.
The biggest challenge our team faced was the query translation process. As mentioned, the Ignite and PostgreSQL engines do not share the same SQL dialect. Simply put, some clauses written for Ignite would cause syntax errors when run on PostgreSQL. That’s why we had to translate and validate every query the application used.
Having more than 300 complex queries, a short deadline, and limited resources meant we had to work efficiently to finish the project on time. It was clear that we could not use the manual approach from the PoC phase, as it is tedious and error-prone. We had to adopt something more effective.
We started by examining the subset of queries translated during the PoC phase, using the file comparison feature in Visual Studio Code to spot patterns. Once we had compared all source–translated query pairs, it was clear that many queries shared the same translation logic: Ignite-specific clauses were translated to their PostgreSQL counterparts in the same way, differing only in the number of columns used, their names, or their formatting.
Strings and patterns – ring a bell? Of course, regular expressions were the first thing that came to mind. In a short time, we had a few expressions that covered all of the queries from the PoC subset; regex101, a great online tool, helped us here. With our assumption confirmed, we were encouraged to develop a full-scale process.
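To give a flavor of such a rule, here is a sketch with one illustrative translation: H2 (the SQL engine behind Ignite) offers NVL, which PostgreSQL spells COALESCE. The rule shown is an assumption for demonstration, not one of our actual expressions:

```python
import re

# Illustrative translation rules -- the real set was derived from the PoC
# query pairs. NVL is an H2/Ignite function; PostgreSQL uses COALESCE.
RULES = [
    (re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE("),
]

def translate(query: str) -> str:
    for pattern, replacement in RULES:
        query = pattern.sub(replacement, query)
    return query

print(translate("SELECT NVL(amount, 0) FROM orders"))
# SELECT COALESCE(amount, 0) FROM orders
```

One rule like this covers every query that uses the clause, regardless of column names or formatting, which is exactly the pattern we observed in the PoC pairs.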
The last thing we wanted was a messy, semi-automated process depending on numerous scripts glued together by ad hoc human intervention. Validation was also a crucial aspect of the process: we had to trust the results and be entirely sure that the translated queries would produce the same output as the originals.
The plan was to create a single automated process with the following connected components:
- Extract all Ignite-based application queries and store them in CSV
- Loop over each query to apply regular expressions
- Loop over each query and apply manually defined fixes
- Store translated query to CSV
- Point PostgreSQL-based application to use translated queries
- Validate outputs from PostgreSQL-based application against Ignite-based one
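The translation steps above can be sketched as a single pass over a CSV of queries. File names, the one regex rule, and the manual-fix table are all hypothetical placeholders here; the real process used the rule set and fixes we accumulated iteratively:

```python
import csv
import re

# Hypothetical rule set and manual-fix table, keyed by query id.
REGEX_RULES = [(re.compile(r"\bNVL\s*\(", re.IGNORECASE), "COALESCE(")]
MANUAL_FIXES = {}  # e.g. {"q42": "<hand-translated query text>"}

def run_pipeline(src_csv: str, dst_csv: str) -> None:
    """Read Ignite queries from src_csv, translate them, write to dst_csv."""
    with open(src_csv, newline="") as src, open(dst_csv, "w", newline="") as dst:
        reader = csv.DictReader(src)              # columns: id, query
        writer = csv.DictWriter(dst, fieldnames=["id", "query"])
        writer.writeheader()
        for row in reader:
            query = row["query"]
            for pattern, repl in REGEX_RULES:     # step 2: regex pass
                query = pattern.sub(repl, query)
            query = MANUAL_FIXES.get(row["id"], query)  # step 3: manual fixes
            writer.writerow({"id": row["id"], "query": query})
```

Keeping the manual fixes in a table that the pipeline applies on every run, rather than editing the output by hand, is what kept the process repeatable.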
Coming from a data background, we found the choice of programming language simple – Python. It offers good regular expression support through the re library and is quite flexible for this kind of project.
Now, let’s get down to the details of the process. In the first step, we connected to the application and downloaded all the queries it uses into a CSV file. The next step was to apply the regular expressions we had identified by examining the input Ignite queries. As we could not cover all translations with regular expressions, some manual fixes were required. All of them were defined in a separate component and applied every time the process ran, which ensured consistency. Once the translation was complete, all queries were stored in a new CSV file, and another process imported them into the PostgreSQL-based application. Finally, we could perform the validation step, where outputs from both environments were tested against each other. Here we had to resolve minor issues caused by the different rounding and formatting options of the two database engines. As a result of the process, we had a PostgreSQL-based application configured to use the translated queries, and a list of Boolean values defining the validation output for each query.
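A sketch of how that validation step can tolerate engine-specific rounding while still catching real mismatches. The function name and tolerance are illustrative assumptions:

```python
from math import isclose

def rows_match(ignite_rows, pg_rows, rel_tol=1e-9):
    """Compare result sets from the two engines, tolerating tiny float
    rounding differences but rejecting any other discrepancy."""
    if len(ignite_rows) != len(pg_rows):
        return False
    for a, b in zip(ignite_rows, pg_rows):
        for x, y in zip(a, b):
            if isinstance(x, float) and isinstance(y, float):
                # Engines may round floating-point results differently.
                if not isclose(x, y, rel_tol=rel_tol):
                    return False
            elif x != y:
                return False
    return True

print(rows_match([(1, 0.30000000000000004)], [(1, 0.3)]))  # True
```

Running a check like this per query yields exactly the per-query Boolean verdicts the process produced.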
It was obvious that development needed to be iterative; we could not solve all cases at once. Following a trial-and-error approach, we eventually resolved all queries. Around 90% of them were covered by regular expressions, and only a minority depended on the manual fixes from the third step.
One thing to keep in mind when applying multiple regular expressions to the same string is order. A general rule is to apply the more specific expressions first and move towards the more general ones. In most cases, generality is determined by the number of wildcards in an expression: the more wildcards, the more general it is. Applying a general, wildcard-heavy expression first can modify exactly the specific substring a later expression was supposed to handle, so the specific one must run first.
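An illustrative example of why the ordering matters. Both rules below translate H2's FORMATDATETIME to PostgreSQL's to_char, but the specific one also converts a known Java-style format string; if the general, wildcard-based rule ran first, it would consume every match and leave that format string untranslated (the rules themselves are assumptions for demonstration):

```python
import re

# Specific rule: known format string, converted to PostgreSQL's pattern syntax.
SPECIFIC = (re.compile(r"FORMATDATETIME\((\w+), 'yyyy-MM-dd'\)"),
            r"to_char(\1, 'YYYY-MM-DD')")
# General rule: any format string, passed through unchanged.
GENERAL = (re.compile(r"FORMATDATETIME\((\w+), '([^']*)'\)"),
           r"to_char(\1, '\2')")

def apply_rules(sql: str) -> str:
    for pattern, repl in (SPECIFIC, GENERAL):   # specific first!
        sql = pattern.sub(repl, sql)
    return sql

print(apply_rules("SELECT FORMATDATETIME(created, 'yyyy-MM-dd') FROM t"))
# SELECT to_char(created, 'YYYY-MM-DD') FROM t
```

Swap the tuple order and the general rule rewrites the call before the specific one ever sees it, silently shipping the wrong format string to PostgreSQL.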
After finishing the work on the experimental environment, we imported the queries and switched database engines on the production and development environments as well.
There are a few lessons we learned along the way. The most important one is to choose the right technology for your project. Trendy technologies might be bright and shiny, but don’t let that fool you. Invest time and resources into research and testing to prove the technology fits your use case. Cutting corners during this phase may save you some time at the beginning, but it will cause much more trouble in the long term.
Secondly, we had ignored the importance of load and stress testing. When developing new features, it’s crucial to see how they impact the underlying infrastructure. Doing this with only a dummy dataset or a small subset of production data might not be enough. That’s why, for upcoming releases, we improved our testing phase by importing representative data into the development environment.
Lastly, avoid deploying on Fridays. Nobody wants to start a weekend with work on their mind. There is also a big chance that nobody will be around to fix critical issues that occur over the weekend, leaving users with a faulty application. This is why we shifted all of our releases towards the beginning of the week.