Welcome to the final post in the data fabric series. So far, we have introduced data fabric and discussed the business benefits it brings. Now it is time to look at the technical questions that may come to mind as you explore the concept.
Integration With Existing Data Systems
Unless you are starting a greenfield project, you already have some data systems in place. These can include warehouses, lakes, lakehouses, or some other types of storage that were built to solve the data needs of your organization.
Implementing data fabric does not mean building from scratch and throwing existing components away. Instead, it treats these as data sources, the same way as any other source from the hybrid-cloud layer. Once data stewards catalog them, they are either served directly to the consumers or sent for additional integration with other data assets.
This means that embracing data fabric will not interfere with the data system investments you have made so far. It builds on top of them and adds features such as governance, lineage, and data democratization.
What About Data Mesh?
Both data fabric and data mesh are oftentimes mentioned when discussing data democratization. If you are not familiar with the data mesh concept, check out a comprehensive explanation provided in the data mesh architecture post.
Although both of them provide a solution for enabling data democratization, they take different approaches. The main difference is that data mesh focuses on organizational pieces, emphasizing the value of data ownership and domain-centric organization, while data fabric is a technology-oriented and domain-agnostic approach for achieving unified and integrated data architecture.
At first glance, the two may look mutually exclusive, but that does not need to be the case. A company could embrace both approaches, combining organizational and technical aspects. This produces a solution composed of two layers, where the data fabric is encapsulated within the data mesh.
In that case, data mesh divides the organization into domain teams and defines what each team owns, while data fabric serves as the technical solution for all tasks within and across domain teams. These tasks include data virtualization, cataloging, integration, and access provisioning, which together enable seamless data exchange and governance.
Both standalone and hybrid approaches make sense for the right setup. So as always, the answer to the question of what to choose is “It depends”. In general, you would add data mesh to the story only if you have data complexity that requires domain orientation, a large number of domains, or some other organizational issue that requires decentralization of your data teams.
Does Data Virtualization Kill Performance?
The core idea of data virtualization is to provide a local virtual view on top of multiple remote physical data sources, without the need to move or replicate them to a central storage. In traditional ETL workloads, moving and replicating data is how data engineers ensure performance. So a valid question is how virtualization affects performance, either when accessing virtualized assets directly or when combining them with other data assets in the data integration stage.
The same issue could be explained with SQL. If you have multiple underlying data sources that you need to join, you can either go with a physical or virtual approach. In a physical approach, you would create a new table, and run a classic ETL process that will join underlying tables, do necessary transformations, and load the data to the new table. On the other hand, in a virtual approach, you would only create a view on top of underlying tables.
The physical approach offers fast access to already transformed data but requires a synchronization process to keep the final table up to date. It also brings additional storage requirements, because the same data is stored twice. The virtual approach always serves the most recent data and does not require extra storage, but it can be slow if you have big tables and complex join conditions.
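The trade-off can be shown with a small sketch. It uses Python's built-in sqlite3 module and a single in-memory database as a stand-in for what would really be several remote sources. The table names, view name, and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ivan');
    INSERT INTO orders VALUES (1, 1, 100.0), (2, 2, 50.0);

    -- Physical approach: the join result is materialized into a new table.
    CREATE TABLE customer_totals_physical AS
        SELECT c.name AS name, SUM(o.amount) AS total
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name;

    -- Virtual approach: only the query definition is stored, as a view.
    CREATE VIEW customer_totals_virtual AS
        SELECT c.name AS name, SUM(o.amount) AS total
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name;
""")

# A new order arrives in the source table after the "ETL run" above.
conn.execute("INSERT INTO orders VALUES (3, 1, 25.0)")


def totals(relation):
    """Read name/total pairs from the given table or view."""
    rows = conn.execute(f"SELECT name, total FROM {relation} ORDER BY name")
    return dict(rows.fetchall())


physical = totals("customer_totals_physical")  # stale until the ETL is re-run
virtual = totals("customer_totals_virtual")    # recomputed on every read
```

The physical table still holds the totals from before the new order, while the view reflects it immediately, at the cost of running the join again on each read.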
So how does data virtualization meet performance requirements? The techniques differ from tool to tool. Let’s start with the simplest. To speed things up, a data virtualization layer could use parallelization to access all physical sources at once. Caching frequently accessed data is another option, as it reduces network and storage latency.
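Here is a minimal sketch of these two ideas, assuming three hypothetical sources represented by in-memory lists. It does not reflect how any particular virtualization product works internally. It only combines a thread pool for parallel reads with a simple cache from the Python standard library.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache

# Hypothetical stand-ins for remote physical sources.
SOURCES = {
    "warehouse": [("Ana", 100.0)],
    "lake": [("Ivan", 50.0)],
    "crm": [("Maja", 75.0)],
}
# Counts how many times each source was actually read.
fetch_count = {name: 0 for name in SOURCES}


@lru_cache(maxsize=32)
def fetch(source_name):
    """Simulate a remote read; the result is cached per source name."""
    fetch_count[source_name] += 1
    return tuple(SOURCES[source_name])


def virtual_view():
    """Read all sources in parallel and union their rows."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        parts = list(pool.map(fetch, sorted(SOURCES)))
    return [row for part in parts for row in part]


first = virtual_view()   # reads every source once, in parallel
second = virtual_view()  # served from the cache, no source is read again
```

A cache like this trades freshness for speed, so a real setup would also need an expiry or invalidation policy, which this sketch leaves out.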
More complex techniques include query optimization and filter pushdown. Filter pushdown applies filters on the source system itself, before data is moved to the virtualization layer. This minimizes data movement between the source systems and the virtualization layer. It also enables faster joins, as data coming from the individual physical sources does not contain unnecessary records.
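The effect of filter pushdown on data movement can be sketched as follows. An in-memory SQLite database plays the role of one remote source, and the table, columns, and rows are assumptions made for the example.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory database standing in for one remote source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return conn


source = make_source(
    [(1, "HR", 100.0), (2, "DE", 50.0), (3, "HR", 25.0), (4, "US", 10.0)]
)


def without_pushdown(conn, country):
    """Pull every row into the virtualization layer, then filter locally."""
    moved = conn.execute(
        "SELECT id, country, amount FROM orders ORDER BY id"
    ).fetchall()
    result = [row for row in moved if row[1] == country]
    return result, len(moved)


def with_pushdown(conn, country):
    """Send the filter to the source so only matching rows are moved."""
    moved = conn.execute(
        "SELECT id, country, amount FROM orders WHERE country = ? ORDER BY id",
        (country,),
    ).fetchall()
    return moved, len(moved)


rows_a, moved_a = without_pushdown(source, "HR")
rows_b, moved_b = with_pushdown(source, "HR")
```

Both calls return the same two rows, but the first one moves all four rows out of the source and the second moves only the two that match.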
The combination of these techniques improves the performance of data integration and access, while still providing a unified virtual view of data across multiple systems. Unless your latency requirements are particularly stringent, a proper data virtualization tool may serve you very well.
Augmented Data Catalog
Along with data virtualization, the data catalog forms the core of the data fabric architecture. A catalog is managed by data stewards and used to enrich data assets with all required metadata. This metadata includes data classifications, tags, business terms, data protection rules, and data quality rules. With metadata assigned, users get quick access to high-quality data assets, while only authorized users can access sensitive and confidential information.
As organizations have many data assets, and each one of them has its own set of columns that needs to be profiled and classified, manual metadata assignments can be time-consuming. That is where data catalog augmentation can help. By leveraging AI, some tools offer automated data profiling capabilities. These include automated data classifications and assignment of business terms, along with the detection of columns containing personal or other sensitive information that should be protected.
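To make the idea of automated classification concrete, here is a deliberately simple sketch. It uses regular expressions rather than AI, so it is a rule-based stand-in and not how any catalog product actually classifies data. The patterns, the 0.8 threshold, and the sample columns are all assumptions.

```python
import re

# Hypothetical patterns; real catalogs use far richer classifiers.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d[\d\s-]{6,}\d$"),
}


def classify_column(values, threshold=0.8):
    """Return (label, ratio); label is None if no pattern reaches the threshold."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return None, 0.0
    best_label, best_ratio = None, 0.0
    for label, pattern in PATTERNS.items():
        matches = sum(bool(pattern.match(v)) for v in non_empty)
        ratio = matches / len(non_empty)
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    if best_ratio < threshold:
        return None, best_ratio
    return best_label, best_ratio


# Made-up sample values for three columns of one data asset.
table = {
    "contact": ["ana@example.com", "ivan@example.org", "n/a",
                "maja@example.net", "luka@example.com"],
    "mobile": ["+385 91 123 4567", "091-555-0100"],
    "city": ["Zagreb", "Split"],
}

# Suggested classification per column, to be reviewed by a data steward.
suggestions = {column: classify_column(values) for column, values in table.items()}
```

The output is a set of suggestions with a match ratio, not a final decision. A column flagged as "email" or "phone" would be a candidate for a data protection rule once a steward confirms it.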
Because of business complexity and specific requirements, these AI capabilities are not designed to work autonomously. Data stewards still need to monitor the results and intervene when they notice incorrect assignments. Despite not being perfect, data catalog augmentation still brings major value, and the underlying AI capabilities are likely to keep improving.
There are many tools that might help you in the data fabric journey. Gartner did comprehensive research on data integration tools. Some of the tools focus on individual components of data fabric like storage, virtualization, or catalog, while others offer complete data fabric solutions. These vendors, listed in alphabetical order, include Denodo, IBM, Informatica, Palantir, Talend, and TIBCO.
Figure: Magic Quadrant for Data Integration Tools (August 2022)
Digital Poirots, as part of the Deegloo company, is an IBM partner, so I had the chance to work with IBM Cloud Pak for Data (CPD). It offers an end-to-end data fabric solution for hybrid cloud environments. The technologies I used include data virtualization in the form of Watson Query, data cataloging with Watson Knowledge Catalog, DataStage as the integration tool, and a collaborative data science platform called Watson Studio.
With CPD, I like the modularization, scalability, and deployment flexibility, which covers everything from cloud to on-premises. Modularization is especially important because organizations most likely want to cherry-pick and pay only for the products they find useful and can easily integrate into their existing landscape.
Integrating with existing landscapes is something I wrote about in the first section. IBM realized they could not go with an all-or-nothing approach, so they did a very good job of supporting a multi-vendor stack by integrating with other clouds and tools. For example, you could use AWS S3 as a data lake and treat it as a data source for Knowledge Catalog, or use the reporting tool you already pay for as a data consumer that connects to the same governed catalog. IBM also offers help with implementing your business scenarios, which could be a major benefit if you are just starting the journey.
The choice of vendor will depend on your needs, strategy, and tech stack preferences. Before you start picking, it is important to have management that understands what data fabric can bring to your company. Then you can work on the strategy and define the use cases that should be covered. Finally, you need to assemble a technical team that will carry out the implementation and train employees to use the tools properly. Doing so will greatly increase the chance of project success.
Whether you’re up for data fabric integration or not, it’s good to consider everything mentioned above. As it is, data fabric is making waves and will continue to do so. We will see more and more innovations here because of the rise of data democratization and data catalogs, as mentioned in this and previous posts. Which way companies will go depends on their data needs and how data-driven they want to become.
And this marks the end of our blog post series. We hope you found something new and interesting. If you have any further questions or concerns about the topic, just let us know in the comment section.