For any data engineer, data pipelines are very important. Choosing how to develop one is integral for the person building it since it’s going to determine the complexity of the process. When developing a data pipeline you can use two approaches. One of them is to use a programming language and a data processing framework or libraries like Apache Spark and Pandas. Another one is to use data integration tools that offer out-of-the-box components for data loading and processing. But how do you decide which one to use? How to determine which fits you better?
Both approaches have their benefits and drawbacks, it depends on the situation what is best for you. For instance, if you do not expect your data pipeline to be huge and don’t have any experience with such tools, it is probably better to stick with a programming language you already know. If you expect your data pipeline to have many components, data sources, data sinks, and a lot of processing then data integration tools could help you to do that effectively. Also, you should consider how much data is there, do you need to utilize cluster computing platforms such as Apache Spark or you could get away with one server where you installed your tool.
In our practice, we have used data integration tools more often than coding the processes. It saved us a lot of time and helped the team to have easier communication and understanding.
Data Integration Tools
There are many data integration tools on the market, some of them are open source and some require expensive licences. Each one has its advantages and disadvantages, and some are a better choice than others. The idea is not to describe and compare all of them but to mention a few and describe one that you can try for free. Deciding on one is determined by many factors and it’s usually your needs and goals that drive the selection.
Some of the popular data integration tools are Pentaho Data Integration, AWS Glue, Hevo, Talend, Informatica PowerCenter, and Oracle Data Integrator (ODI).
Pentaho Data Integration (PDI), nowadays known as Lumada Data integrator, is an ETL tool that contains built-in capabilities for extracting data from different sources, transformations, and loading data to target systems. It offers a free community and paid enterprise version that comes with the support. You can install it locally, or on a server hosted in a cloud or on-premise.
Data can be extracted from almost all popular databases, input streams, different file formats from local or remote servers, files from cloud and clustering storages, input streams, and APIs.
Although Pentaho is an ETL tool, we also use it as an ELT (Extract, Load, Transform) tool. You can achieve that by extracting and loading directly to a data warehouse which should be fast and efficient. The number of used built-in transformations is minimized. That way transformations that define business logic are done in the end with SQL statements which are executed in database engines that are optimized for data processing, especially if you use column-oriented databases like Vertica and Redshift. We could say that in the case of ELT, Pentaho serves as a data loading and workflow management tool (control flow, job order, transformation order).
Now, why would you use ETL tools? ETL looks like a simple process, you just have three steps and you are done. But in reality, you have many problems to solve even in the first phase of extracting data from the source. For instance, data structure from input files can change over time, you can have incomplete and wrong data that needs to be filled or be marked for manual checks and not cause the whole process to stop. To sort out those kinds of issues, you’ll have to exchange a lot of emails or calls with your client which takes time. In situations like that, it is good to have a tool in which you can do all those changes without too much code refactoring and messing up the parts that work. The ultimate goal is to deliver a project on time, and utilizing those tools can help you in that.
Benefits of data integration tools
- Easy to understand the process as it has a graphical interface. This is beneficial as in some cases you’ll have the whole team of business analysts, data scientists and data engineers, or even backend developers working together so you can present the process or some problems easily. Also, the graphical interface and visualisation of the process save time when you open a project that you haven’t worked on for some time, or present it to a new data engineer.
- Offers a great deal of built-in functionalities for extracting, processing, and loading data
- Built-in functionalities can be customized
- Supports custom scripting and programming. Supports execution of external scripts and programs.
- Some patterns from programming can be applied, for instance, control flow, loops that repeat the same job or transformation for each source file, variables for job parametrization, and reusability
Drawbacks of data integration tools
- Harder to version control. Some tools like Pentaho generate XML files that change even when you make just visual changes like moving the job entry on the canvas board so you can not use Git effectively.
- If a job is not structured well it could be hard to understand what is happening, where are variables set or updated. This is also the case with messy code.
- Each built-in feature can be configured which is good, but those options are often set by checking boxes, or radio buttons. One check-box with wrong ticks can cause your job to do unexpected things and is hard to debug.
- Such tools are not interesting to some developers so you could have a hard time hiring someone if you’ll need an extra pair of hands to help you.
- For someone that is not used to working with such tools, it can be tedious to use a graphical interface.
Some companies prefer to build their data pipelines by using popular programming languages, frameworks, and libraries. In most cases, the pipeline consists of Python, R, Scala scripts, Apache Spark jobs, MapReduce jobs, Apache Storm, or some other parallel processing frameworks. Scripts and jobs are usually orchestrated by Apache Airflow, Luigi, or other workflow management platforms. We have used Apache Spark with Apache Airflow to support the ETL process in some of our projects where we gained experience in utilizing these solutions. The important thing to remember is that you should choose the most appropriate technology. If there is no need for cluster computing then stick to Python and Pandas.
If you need to implement a data pipeline and already have developers on board, they will be familiar with this approach as they know programming languages, documentation, testing, release process, and so on. There is nothing new for them in this approach, besides learning new technologies like Apache Spark. In many cases, it is easier to educate your developers than to find ones that are experienced with specialized data integration tools or have the will to learn one.
The main benefit of this approach is performance as each framework and technology is specialized for a different purpose. You can choose the right one for each data pipeline you build, depending on data volumes, frequency, structure, etc. Of course, all that customisation comes with a cost as it takes more time to develop custom solutions than to use existing tools.
The hybrid approach takes the best from both worlds by utilizing commonly used features from data integration tools, and additionally using them for control flow. For specific problems or cases when built-in features are not enough, you can call custom scripts from those tools to process data. For instance, Pentaho offers Spark job execution as one step in the job. That way you have a clear view of how your pipeline looks, you can use predefined steps to save time, and still use specialized frameworks and libraries for your specific need.
This example compares the code and data integration tools approaches. The data pipeline will have the following components:
- Extract JSON data from the API endpoint
- Do a simple transformation
- Load data to a PostgreSQL database
We have used fake user data from this repository. The API endpoint and PostgreSQL database will be set up locally.
To start the local HTTP server you can run this Python command in a folder where you have saved the user’s JSON file python3 -m http.server 8080. The server will always serve the same static JSON file, but we could pretend that it is a service that serves data about new or updated users.
To run the PostgreSQL server you’ll have to install it, and run the command postgres -D /usr/local/pgsql/data. You can use DBeaver or some other database administration tool for creating tables and querying data. SQL statements for setting up the database are in this file.
Data Integration Tool Approach
We have used Pentaho Data Integration for this example. You can download it for free and try it yourself, all files are in the repository. Let’s see how it looks!
Pentaho jobs are made of Job and Transformation steps. We will not go into much detail about it, but we can say that Job steps are used for defining control flow and Transformation steps are used for data extraction, manipulation, and loading. In the Transformation step, data flows between Transformation entries that are connected with arrows, and processing is done by a Pentaho engine. Where in Job steps arrows define execution order and conditions. However, you can do data manipulation or loading in Job entries as well by SQL scripts in which case data is getting processed by the database engine.
Every Job and Transformation step is described by a Note that briefly describes it. If you take a closer look you’ll see that it is actually an ELT and not an ETL process, as data is loaded directly to a staging table and then transformed into the database. Screenshots below show the Main transformation that contains ELT steps, and the wrapper job around it that takes care of control mechanisms like updating metadata about the job (start and end time, batch ID, success flag…) and sending an email if the job fails.
It took us around 1 hour to come up with a solution, create tables, and build the pipeline. Please note that this pipeline is not ready for production, it contains a minimum number of steps to load data into a database and log the job results to a control table. Also, this is not the only solution for this task, for instance, you could download the JSON file from the API endpoint to your local machine, a server, or a cloud storage and then process it.
The solution is clean and easy to understand, although there is not much going on in it. The wrapper job first inserts new metadata to a control table, then the main job extracts and loads data to a stage table, transforms it, and moves to a final table. Later the wrapper job updates metadata depending on the success of the main job and sends an email if the main job failed.
For the code approach, we have used Python with json, requests, and psycopg2 libraries. The ELT process code can be found here.
We also implemented an ELT process that utilizes a database engine rather than Python. Although there is no need for ELT as transformations are very simple, but in real-life scenarios you can expect much more data and more complex transformations.
For this small example for us, it was faster to use a code approach rather than data integration tools. If you know which libraries to use and how to use them you should write concise code in a short time. With data integration tools you have to spend some time on setting all job and transformation steps, organize them in multiple files, and debugging is slower. But now think about the situation where you stop working on a project and come back to it after a few months. In that case, data integration tools are better as you can quickly scan through each job and get a hang of what is going on.
Using code or data integration tools approaches depends on the projects you have, the time limit, and the complexity of it. It also heavily depends on people working on the mentioned projects. Both approaches have their advantages and disadvantages, so it is up to you to choose which one suits you better. With the code approach, you have more flexibility and you can use what you already know from programming languages, which in turn allows you better performance since each function is specially created for a specific purpose. Using data integration tools is a bit limiting, but it can be beneficial since with them you get a clear overview of your work and processes. It also offers built-in functionalities. Deciding which one to use is dependent on what you want to achieve and how much time you want to invest in it. In the end, the processes and tools are defined by what is your main goal and which of those tools will most effectively achieve that goal.