Recently, our colleague Valentina Dugan published a great post, How to tackle data security and concerns. Some key terms she covers there are unauthorized access, compliance, data masking, and data democratization. I couldn’t have imagined a more perfect backdrop for data fabric, the topic I’ve been wanting to write about for the past few months.
I’ll divide the topic into two parts. This first part explains the data fabric architecture and why it is crucial for companies that aim to be data-driven. In the second part, I’ll discuss technical details and provide a list of potential vendor solutions. Together, the series should be a valuable resource for anyone involved in data-related work, from governance leaders and business users to IT leaders and developers.
Drawbacks of traditional data management
Let’s kick off the topic with a few questions.
- What’s the typical turnaround time for delivering requested data to business users or data scientists?
- What are the implications of having data coming from multiple sources, such as on-premises and different cloud-based systems?
- Do you have adequate IT staff to fulfill data requests in a timely manner?
- What measures do you take to ensure that only authorized individuals have access to the data?
- What does your data quality monitoring process look like?
- Is it a wannabe automated process?
- Are your users typically the first to identify data quality issues?
- How long does it take to trace the data from data consumer level back to the source?
- How confident are you in the underlying data of the reports you generate?
- Do you find yourself in meetings where multiple attendees present different reports on the same topic?
- Are you able to effectively implement compliance measures across all data flows?
Are you experiencing any discomfort while reading these questions? If not, congratulations, you may close the tab. However, I believe that most of us would struggle to provide satisfactory answers to all of them. If you can relate to this, I suggest you continue reading.
What steps can we take to address the data management issues that were mentioned? You could address each issue individually, or come up with your own framework to tackle them collectively. However, it would require a significant investment of time and money, and if your IT team lacks sufficient expertise, the desired outcomes may not be achieved.
The good news is that a design concept called data fabric already exists, and it might be just what you are looking for. Although definitions vary slightly depending on who you ask, the general idea and architecture remain the same.
A data fabric is a data architecture that integrates a set of technologies and services designed to achieve the ultimate objective of data democratization and self-service across the enterprise. If you are not familiar with the term data democratization, be sure to read Deep dive into data democratization by our colleague Valentina.
Achieving data democratization is not a trivial task. It has prerequisites that must be met, and all of them are defined to serve the democratization goal. A data fabric incorporates practices that ensure secure access to data and meet regulatory requirements and compliance standards throughout the entire data lifecycle.
Figure: Data Fabric Architecture
The first challenge data fabric addresses is data collection. Enterprises often collect data from various sources that can reside on-premises and in multiple public clouds. Traditionally, data collection has relied on ETL/ELT data integration processes that consolidate the data into a single location. Once the data is centralized, a governance layer is applied to it, and only then can consumers access it.
However, this approach has proven hard to scale. It requires significant effort from IT departments that are likely understaffed. Additionally, data replication brings challenges of data latency and data security, and adds overall complexity whenever changes in business strategy or acquisitions introduce new data to integrate.
A data fabric takes a different approach. Instead of centralizing all data, enterprises can keep data wherever it makes the most sense and centralize only the governance piece. To accomplish this, data fabric incorporates two core components: data virtualization and data catalog.
The data virtualization layer abstracts the data integration complexity from data consumers, enabling real-time access and data integration, regardless of where it resides. In this way, data virtualization reduces time spent on data preparation, which leaves more time for analytics. It cuts down IT costs and efficiently provides the right data to the right people, ultimately leading to increased pace of innovation and competitiveness.
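The idea of reading data in place can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `VirtualLayer`, the two toy sources, and `order_totals_by_customer` are hypothetical names, and a real virtualization product would add query pushdown, caching, and security on top.

```python
import sqlite3
from typing import Any, Callable, Dict, Iterable, List

Row = Dict[str, Any]


class VirtualLayer:
    """Hypothetical facade: one entry point over several physical sources."""

    def __init__(self) -> None:
        self._readers: Dict[str, Callable[[], Iterable[Row]]] = {}

    def register(self, name: str, reader: Callable[[], Iterable[Row]]) -> None:
        # Only a reader function is stored; no data is copied at registration.
        self._readers[name] = reader

    def read(self, name: str) -> List[Row]:
        # Data is fetched from the source at query time.
        return list(self._readers[name]())


# Source 1: stand-in for an on-premises database (in-memory SQLite here).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ana"), (2, "Ivan")])


def read_customers() -> Iterable[Row]:
    cursor = conn.execute("SELECT id, name FROM customers")
    return [{"id": row[0], "name": row[1]} for row in cursor]


# Source 2: stand-in for records served by a cloud application.
cloud_orders: List[Row] = [
    {"customer_id": 1, "total": 120.0},
    {"customer_id": 1, "total": 30.0},
    {"customer_id": 2, "total": 75.5},
]

layer = VirtualLayer()
layer.register("customers", read_customers)
layer.register("orders", lambda: cloud_orders)


def order_totals_by_customer(virtual: VirtualLayer) -> Dict[str, float]:
    # The consumer joins two assets without knowing where either one lives.
    names = {c["id"]: c["name"] for c in virtual.read("customers")}
    totals: Dict[str, float] = {}
    for order in virtual.read("orders"):
        name = names[order["customer_id"]]
        totals[name] = totals.get(name, 0.0) + order["total"]
    return totals
```

Because the layer holds only reader functions, a row added to either source shows up on the next read, with no replication job in between.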
Having virtualized data sources alone is not enough to achieve data democratization, particularly for business users who may not know which data source to consult when searching for specific information. That is where the data catalog comes in. By storing metadata, the catalog helps to organize virtualized data assets and assists users in quickly finding the most appropriate data. It also provides a centralized governance layer where data stewards define data protection rules and access policies to achieve a consistent and secure data management process.
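As a rough illustration of metadata-driven discovery, here is a small Python sketch. The `CatalogEntry` fields, the `DataCatalog` class, and the sample assets are all hypothetical; commercial catalogs offer far richer metadata and search.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class CatalogEntry:
    """Hypothetical metadata record describing one virtualized data asset."""
    name: str
    description: str
    owner: str
    tags: Set[str] = field(default_factory=set)


class DataCatalog:
    def __init__(self) -> None:
        self._entries: List[CatalogEntry] = []

    def add(self, entry: CatalogEntry) -> None:
        self._entries.append(entry)

    def search(self, term: str) -> List[str]:
        # Case-insensitive match on name, description, or tags.
        # Only metadata is inspected; the underlying data is never read.
        needle = term.lower()
        return sorted(
            e.name
            for e in self._entries
            if needle in e.name.lower()
            or needle in e.description.lower()
            or any(needle in tag.lower() for tag in e.tags)
        )


catalog = DataCatalog()
catalog.add(CatalogEntry("sales.orders", "Customer orders with invoiced revenue",
                         "sales-team", {"finance", "sales"}))
catalog.add(CatalogEntry("crm.customers", "Customer master data",
                         "crm-team", {"pii", "sales"}))
catalog.add(CatalogEntry("hr.payroll", "Monthly salary payments",
                         "hr-team", {"finance", "pii"}))
```

A business user looking for financial data could call `catalog.search("finance")` and get the matching asset names back without knowing which system holds them.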
I’m sure we can all agree that data quality issues are inevitable, as there will always be data sources that are unreliable by design. However, the way we handle these issues can have a significant impact on the end result, whether it’s customer satisfaction or the accuracy of decisions made based on the data. This is again the task for the data catalog.
A good catalog automates data quality tasks to monitor, identify, and prevent data quality issues at both the column and the data asset level. It also tracks data lineage through the entire data lifecycle, which helps in finding the root cause of an issue. This is especially important in complex systems where data records go through numerous steps between the data collection and presentation layers.
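Both ideas, a column-level quality rule and a lineage walk back to the source, can be sketched in a few lines of Python. The null-rate rule, the threshold values, and the `LINEAGE` graph below are assumptions chosen for illustration, not how any particular catalog works.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def null_rate(rows: List[Row], column: str) -> float:
    """Share of rows in which the column is missing or None."""
    if not rows:
        return 0.0
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def failing_columns(rows: List[Row], max_null_rate: Dict[str, float]) -> List[str]:
    # A column fails when its null rate exceeds the allowed threshold.
    return [col for col, limit in max_null_rate.items()
            if null_rate(rows, col) > limit]


# Hypothetical lineage graph: each asset maps to its direct upstream assets.
LINEAGE: Dict[str, List[str]] = {
    "revenue_report": ["orders_clean"],
    "orders_clean": ["orders_raw", "currency_rates"],
    "orders_raw": [],
    "currency_rates": [],
}


def trace_to_sources(asset: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Walk upstream from an asset and return the original sources it depends on."""
    sources: List[str] = []
    seen = set()
    stack = [asset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        upstream = lineage.get(current, [])
        if not upstream:
            # No upstream assets means this is an original source.
            sources.append(current)
        stack.extend(upstream)
    return sorted(sources)


orders = [
    {"id": 1, "amount": 10.0},
    {"id": 2, "amount": None},
    {"id": 3, "amount": 5.0},
    {"id": 4, "amount": None},
]
```

If `failing_columns(orders, {"id": 0.0, "amount": 0.1})` flags a column used in a report, `trace_to_sources("revenue_report", LINEAGE)` narrows down which source systems to inspect first.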
Data virtualization and the catalog give us virtualized, governed data assets. However, to create a comprehensive source of data for our analytics team or business users, it’s often necessary to filter, join, and merge multiple assets. Data fabric does not leave integration behind. It places it after the governance phase and treats data assets from the catalog as data sources.
The advantage of this approach over directly integrating the raw data source or its replica is that data security is enforced throughout the entire process. For instance, if you are processing sensitive data such as credit card information, you likely do not want to grant access to the entire IT department during integration. By applying masking and data protection rules, you can control access at the user or group level, ensuring that only authorized users have access to the sensitive data.
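To make the masking idea concrete, here is a minimal Python sketch of a role-based protection rule applied to a row before it reaches an integration job. The role names, the `POLICY` table, and the masking format are hypothetical; a real platform would define them in its governance layer.

```python
from typing import Any, Callable, Dict, Set

Row = Dict[str, Any]


def mask_card(number: str) -> str:
    """Hide all but the last four digits of a card number."""
    digits = number.replace(" ", "")
    if len(digits) <= 4:
        return "*" * len(digits)
    return "*" * (len(digits) - 4) + digits[-4:]


# Hypothetical policy: protected column -> roles allowed to see the clear value.
POLICY: Dict[str, Set[str]] = {"card_number": {"fraud_analyst"}}

# Hypothetical masking function per protected column.
MASKERS: Dict[str, Callable[[str], str]] = {"card_number": mask_card}


def apply_policy(row: Row, roles: Set[str]) -> Row:
    """Return a copy of the row with protected columns masked for these roles."""
    result: Row = {}
    for column, value in row.items():
        allowed = POLICY.get(column)
        if allowed is None or roles & allowed:
            # Unprotected column, or the user holds an authorized role.
            result[column] = value
        else:
            # Fall back to full redaction if no specific masker is defined.
            masker = MASKERS.get(column, lambda _: "***")
            result[column] = masker(str(value))
    return result


payment = {"customer": "Ana", "card_number": "4111 1111 1111 1111"}
engineer_view = apply_policy(payment, {"data_engineer"})
analyst_view = apply_policy(payment, {"fraud_analyst"})
```

In this sketch the engineer running the integration sees only the masked value, while the analyst with the authorized role sees the original, and the source row itself is never modified.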
That covers the overview and architecture of the data fabric. I hope you found it interesting and useful. The next post in the series will bring additional technical details, including performance questions and a discussion of how traditional data warehouses, and the not-so-traditional data lakes and lakehouses, come into play with data fabric. I will also compare data mesh and data fabric, two concepts that might be seen as mutually exclusive.
New paradigm, new challenges
Because of the many components, the data fabric architecture can be challenging to implement, both organizationally and technically. Firstly, it’s important to identify business cases that justify its implementation. That’s why it could be wise to start with a proof of concept that delivers the highest ROI. Additionally, you will need a team of experts who understand data management, integration, governance, and analysis, as they will execute the implementation.
We also must not leave out the human element and cultural change. Staff may be used to ways of working that will be affected by the introduction of new tools and processes. It’s crucial to provide the necessary education and to support the shift in mindset, as the new architecture will be of no value if it is not used effectively.
Implementing all the components of the architecture is a challenging task, especially when it comes to combining them. It can become even more complex when deployment and configuration need to support big data use cases, where performance and high availability are a must-have.
To mitigate these challenges, you could consider partnering with vendors who offer data fabric solutions. The choice of vendor will depend on your needs and requirements, as vendors differ in offerings and pricing. With their assistance, you can define business cases, provide the necessary training to employees, and set up the infrastructure and services. Additionally, if you prefer a cloud approach, check whether vendors offer data fabric as a service. That way, you would not need to invest time in setting up the infrastructure, installation, scaling, and security, and you would pay only for the capacity you use. More details about vendor options will be given in the next post.
Now, let’s wrap it all up with a conclusion.
To summarize, adopting a data fabric architecture brings the following benefits:
- Accelerate time to value by efficiently giving the right data to the right people
- Reduce infrastructure and storage costs due to data virtualization
- Focus time on analyzing rather than preparing data
- Increase compliance and security due to its centralization
- Gain more accurate insights due to improved data quality
Implementing this across the entire organization could be a significant effort. That’s why it would be wise to start small by conducting a proof of concept on new data flows or on legacy ones that you have identified as important. Once you have proved the value and gained some experience, it will be much easier to roll it out to the rest of the organization.
If you’re interested in more technical details and information on vendor offerings, stay tuned for additional content coming soon.