Centralized Governance of Data Lake, Data Fabric with adopted Data Mesh Setup

Sagar Jaswani

Data Engineering

Tags: Data Lakehouse, Data Mesh, Data Fabric, Master Data Management

This article explains the data governance perspective in relation to the Data Mesh, Data Fabric, and Data Lakehouse architectures.

Organizations across industries have multiple functional units, and data governance is needed to oversee the data assets and data flows connected to those units, their security, and the processes governing the data products relevant to the business use cases.

Let's take a deep dive into data governance as the first step.

Data Governance

The role of data governance also includes democratizing data, tracking data lineage, overseeing data quality, and keeping data compliant with regional regulations.

Microsoft Purview stands out for the 150+ regulations covered by its Compliance Manager portal.

Data governance uses artificial intelligence to improve data quality, drawing on data profiling results and the historical quality of the data sets.

Master Data Management (MDM) stores the organization's common master data across domains, de-duplicating records and maintaining the relationships between entities to give a 360-degree view. A unique data set combined with role-based access control adds a further layer of governance and supports business insights.
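
The de-duplication idea can be illustrated with a minimal sketch. It assumes customer records are matched on a normalized email address; the field names, the sample records, and the matching rule are illustrative assumptions, not the behavior of any particular MDM product.

```python
def normalize_key(record):
    """Build a match key. Assumed rule: trimmed, case-insensitive email."""
    return record["email"].strip().lower()


def build_golden_records(records):
    """Merge records that share a match key into one golden record.

    Later records only fill in fields that are still empty, and every
    contributing source system is remembered in a "sources" list.
    """
    golden = {}
    for record in records:
        key = normalize_key(record)
        if key not in golden:
            first = {k: v for k, v in record.items() if k != "source"}
            first["email"] = key
            first["sources"] = [record["source"]]
            golden[key] = first
            continue
        merged = golden[key]
        for field, value in record.items():
            if field in ("source", "email"):
                continue
            if not merged.get(field) and value:
                merged[field] = value
        merged["sources"].append(record["source"])
    return list(golden.values())


# Hypothetical records arriving from two source systems.
records = [
    {"email": "Ana@Example.com ", "name": "Ana Silva", "phone": None, "source": "crm"},
    {"email": "ana@example.com", "name": "", "phone": "555-0100", "source": "billing"},
    {"email": "raj@example.com", "name": "Raj Patel", "phone": None, "source": "crm"},
]
golden_records = build_golden_records(records)
```

Real MDM tools usually apply fuzzy matching and survivorship rules rather than a single exact key; the exact-key rule here only keeps the sketch short.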

Data governance helps create a Data Marketplace for the controlled exchange of golden-quality data products between data sources and consumers. The AWS DataZone SaaS offering specializes in Data Marketplace capabilities.

Reference data, together with Master Data Management, enables data standardization, which matters when data is exchanged on the Data Marketplace platform between the organization, its subsidiaries, and its partners in line with industry-level standards.

Remember that data governance depends on collaboration between technical and business users.

Technical users collect data assets from the data sources, review the metadata and data quality, and enrich data quality by building the applicable data quality rules before the data is stored.

Business users, on the other hand, guide the building of the business glossary for data assets down to the column level, define the Critical Data Elements (CDEs), specify the sensitive data fields that should be masked or excluded before data is shared with consumers, and cooperate on data quality enrichment requests.
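
As a rough sketch of how these two roles meet in practice, the snippet below applies column-level quality rules (the technical side) and then masks or excludes sensitive fields (the business-side classification) before rows are shared. All rule definitions, field names, and sample rows are hypothetical, and the truncated SHA-256 digest is only a stand-in for whatever masking scheme an organization actually uses.

```python
import hashlib

# Hypothetical quality rules a technical user might define per column.
QUALITY_RULES = {
    "customer_id": lambda v: v is not None,
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
}

# Hypothetical classification supplied by business users.
MASKED_FIELDS = {"email"}
EXCLUDED_FIELDS = {"ssn"}


def failed_rules(row):
    """Return the names of the columns whose quality rule fails."""
    return [col for col, rule in QUALITY_RULES.items() if not rule(row.get(col))]


def mask(value):
    """Replace a value with a short SHA-256 digest (illustrative only)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def prepare_for_consumers(rows):
    """Split rows into shareable (masked) rows and rejected rows."""
    shareable, rejected = [], []
    for row in rows:
        failures = failed_rules(row)
        if failures:
            rejected.append((row, failures))
            continue
        cleaned = {}
        for col, value in row.items():
            if col in EXCLUDED_FIELDS:
                continue  # never leaves the platform
            cleaned[col] = mask(value) if col in MASKED_FIELDS else value
        shareable.append(cleaned)
    return shareable, rejected


rows = [
    {"customer_id": 1, "age": 34, "email": "a@x.com", "ssn": "123-45-6789"},
    {"customer_id": None, "age": 200, "email": "b@x.com", "ssn": "987-65-4321"},
]
shareable, rejected = prepare_for_consumers(rows)
```

Rejected rows are returned together with the failing rule names so that they can feed a data quality enrichment request rather than being silently dropped.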

The best practice is to follow a bottom-up approach between business and technical users. Even after the data governance framework has been set up, governance tasks continue, which means business stakeholders should be well trained in the framework.

Process automation is another stepping stone in data governance. For example, a workflow needs to be defined that notifies the data custodians about the data quality enrichment steps to be taken; once the data quality has been revised, the workflow forwards the data set back to the marketplace for data consumers to use.
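
A workflow of this kind can be sketched as a small state machine. The quality threshold, the status names, and the in-memory notification list are assumptions made for illustration; a real deployment would rely on the workflow engine of the chosen governance tool.

```python
from dataclasses import dataclass, field


@dataclass
class DataSet:
    name: str
    quality_score: float
    custodian: str
    status: str = "draft"


@dataclass
class GovernanceWorkflow:
    threshold: float = 0.9  # assumed minimum score for publishing
    notifications: list = field(default_factory=list)

    def submit(self, dataset):
        """Publish the data set, or notify its custodian if quality is too low."""
        if dataset.quality_score >= self.threshold:
            dataset.status = "published"
        else:
            dataset.status = "needs_enrichment"
            message = (
                f"Improve quality of '{dataset.name}' "
                f"(score {dataset.quality_score:.2f})"
            )
            self.notifications.append((dataset.custodian, message))
        return dataset.status

    def resubmit(self, dataset, revised_score):
        """Record the revised score and send the data set through again."""
        dataset.quality_score = revised_score
        return self.submit(dataset)


workflow = GovernanceWorkflow()
orders = DataSet(name="orders", quality_score=0.72, custodian="alice")
first_status = workflow.submit(orders)           # custodian is notified
second_status = workflow.resubmit(orders, 0.95)  # goes back to the marketplace
```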

Data discovery is another automation step: a workflow scans the data sources for metadata on a defined schedule and loads the incremental changes into the inventory, triggering the downstream tasks in the defined data flow.
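
The incremental part of such a scan can be sketched as a comparison between the latest metadata and the existing inventory. The `scan_source` function is a hypothetical stand-in for a real connector, and scheduling is assumed to be handled by an external scheduler.

```python
def scan_source(source):
    """Stand-in for a real connector: returns {table: {column: type}}."""
    return source["tables"]


def discover_changes(inventory, scanned):
    """Compare a fresh scan against the inventory and return only the delta."""
    changes = []
    for table, columns in scanned.items():
        known = inventory.get(table)
        if known is None:
            changes.append(("new_table", table))
        elif known != columns:
            changes.append(("schema_changed", table))
    for table in inventory:
        if table not in scanned:
            changes.append(("removed_table", table))
    return changes


def apply_changes(inventory, scanned, changes, on_change):
    """Load the delta into the inventory and trigger a task per change."""
    for kind, table in changes:
        if kind == "removed_table":
            inventory.pop(table, None)
        else:
            inventory[table] = dict(scanned[table])
        on_change(kind, table)


# Hypothetical inventory and source metadata.
inventory = {"orders": {"id": "int"}, "legacy": {"x": "str"}}
source = {"tables": {"orders": {"id": "int", "total": "decimal"},
                     "customers": {"id": "int"}}}

triggered = []
scanned = scan_source(source)
changes = discover_changes(inventory, scanned)
apply_changes(inventory, scanned, changes, lambda kind, table: triggered.append((kind, table)))
```

Only the tables that changed reach the `on_change` callback, which is where downstream tasks such as re-profiling or lineage refresh would be started.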

The data governance approach may change depending on the Data Mesh, Data Fabric, or Data Lakehouse architecture. Let's look at this in more detail.

Data Mesh vs Data Fabric vs Data Lake Architectures

In every organization, multiple data sources store data in different formats and media. Once connected to these sources, the integration layer extracts, loads, and transforms (ELT) the data and saves it in the storage medium, from where it is consumed. These data sources and consumers can be internal or external to the organization, depending on the extensibility and the use case involved in the business scenario.

This lifecycle becomes heavy as the volume of data sets in the organization grows. Complexity increases when data quality is poor, app connectors are not available, data integration is not smooth, or data sets are not discoverable.

Rather than piling all the data sets into a single warehouse, organizations segregate the data products, apps, ELT, storage, and related processes across business units. This is what we term the Data Mesh architecture.

Data Mesh at the domain level leads to de-centralized data management, clear data accountability, and smooth data pipelines, and helps discard any data silos that aren't being used across domains.

Most data pipelines flow within a single domain's data sets, but some pipelines also cross domains. Data Fabric joins the data sets and pipelines across domains in an integrated architecture.

Data virtualization and data orchestration techniques help reduce the segregation of the technical landscape, but overall they affect performance and increase complexity.

There is another approach that companies are interested in as part of digital transformation: migrating data sets from storage media segregated along different dimensions to a centralized Data Lakehouse.

Data sets are loaded into a single Data Lakehouse, preferably in a Medallion architecture, starting with the Bronze layer, which holds the raw data.

Next, after cleansing and transformation, the data is segregated by individual domain on the same storage medium, building up the Silver layer.

Finally, for analytics, the Gold layer is prepared with a compatible dimension-and-fact data model.
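
The three layers can be sketched in plain Python to show what each step contributes. The sample rows, column names, and cleansing rules are hypothetical; on a real Lakehouse the same steps would typically run as Spark or SQL jobs over tables rather than over in-memory lists.

```python
from collections import defaultdict

# Bronze: raw records exactly as ingested (hypothetical sample data).
bronze = [
    {"domain": "sales", "order_id": "1", "amount": " 120.50 ", "region": "emea"},
    {"domain": "sales", "order_id": "2", "amount": "n/a", "region": "EMEA"},
    {"domain": "sales", "order_id": "3", "amount": "80", "region": "apac"},
    {"domain": "hr", "employee_id": "7", "salary": "5000"},
]


def to_silver(bronze_rows):
    """Cleanse sales rows and group every row by its domain."""
    silver = defaultdict(list)
    for row in bronze_rows:
        if row["domain"] == "sales":
            try:
                amount = float(row["amount"].strip())
            except ValueError:
                continue  # drop rows whose amount cannot be parsed
            silver["sales"].append({
                "order_id": int(row["order_id"]),
                "amount": amount,
                "region": row["region"].upper(),
            })
        else:
            silver[row["domain"]].append(dict(row))
    return dict(silver)


def to_gold(silver_layer):
    """Build a tiny star schema: a region dimension and a sales fact table."""
    sales = silver_layer.get("sales", [])
    regions = sorted({row["region"] for row in sales})
    dim_region = {name: key for key, name in enumerate(regions, start=1)}
    fact_sales = [
        {"order_id": row["order_id"],
         "region_key": dim_region[row["region"]],
         "amount": row["amount"]}
        for row in sales
    ]
    return dim_region, fact_sales


silver = to_silver(bronze)
dim_region, fact_sales = to_gold(silver)
```

Keeping Bronze untouched means a cleansing rule can be changed later and Silver rebuilt without re-ingesting from the sources.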

This centralized storage resembles a Data Mesh adopted on a Data Lakehouse setup.

Various cloud platforms, Microsoft Fabric, and Databricks provide capabilities for this.

Data Governance options

Data governance follows the same pattern as the centralized or de-centralized implementation architecture.

Federated governance aligns with Data Mesh, while centralized governance fits the Data Fabric and Data Lakehouse architectures.

Federated governance is justified in a complex legacy setup, such as a large organization with multiple branches across domains, each with its own domain-level local governance officers.

These local governance officers track the data pipelines and govern access to the individual storage media, integration layers, and apps involved, so that whenever a data set changes, the data catalog tool can collect the metadata of those changes.
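
Domain-scoped access control of this kind can be sketched with two lookup tables. The users, roles, and permissions below are hypothetical; in practice they would live in the access management layer of the catalog or platform rather than in code.

```python
# Hypothetical role assignments: user -> {domain: role}.
ROLE_ASSIGNMENTS = {
    "alice": {"sales": "steward"},
    "bob": {"sales": "consumer", "hr": "consumer"},
}

# Hypothetical permissions granted by each role.
ROLE_PERMISSIONS = {
    "steward": {"read", "write", "approve"},
    "consumer": {"read"},
}


def is_allowed(user, domain, action):
    """Check whether the user's role in the given domain permits the action."""
    role = ROLE_ASSIGNMENTS.get(user, {}).get(domain)
    return action in ROLE_PERMISSIONS.get(role, set())
```

For example, `is_allowed("alice", "sales", "approve")` is true, while `is_allowed("alice", "hr", "read")` is false because Alice holds no role in the HR domain.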

A centralized governance committee with data custodians handles the other two scenarios: the Data Fabric and Data Lakehouse setups.

Take the example of a data fabric where data is spread across different storage media: say, Databricks for machine learning, Snowflake for visualization reports, databases and files as data sources, and cloud services for data processing. In such a scenario, end-to-end centralized data governance is feasible via data virtualization and data orchestration services.

Similar central governance applies where the complete implementation is on a single platform, such as the AWS cloud platform.

AWS Glue Data Catalog can be used to track the technical data assets, and AWS DataZone to exchange data between data sources and data consumers once the business glossary has been tagged to the technical assets.

Azure with Microsoft Purview, Microsoft Fabric with Purview, Snowflake with Horizon, Databricks with Unity Catalog, and AWS with Glue Data Catalog and DataZone: these and other platforms provide the scalability needed to store big data sets, build the Medallion architecture, and readily apply centralized data governance.

Conclusion

Overall, data governance is a relevant framework that works hand in hand with Data Mesh, Data Fabric, Data Lakehouse, data quality, integration with data sources, consumers, and apps, data storage, MDM, data modeling, data catalog, security, process automation, and AI.

Along with these technologies, data governance requires the support of business stakeholders, stewards, data analysts, data custodians, data operations engineers, and the Chief Data Officer. These profiles make up the Data Governance Committee.

Deciding between the Data Mesh, Data Fabric, and Data Lakehouse approaches depends on the organization's current setup, the business units involved, the distribution of data across those units, and the business use cases.

The current industry trend favors migrating distributed data sets and processes to a centralized Lakehouse, with a workspace for each domain so that the adopted Data Mesh is supported too.

This gives centralized data governance the upper hand: it can track data pipelines and data synchronization across domains, provide column-level traceability from source to consumer via data lineage, enforce role-based access control on domain-level data sets, and offer quick and easy search because the data sets sit on a single platform.
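
Column-level traceability can be illustrated as a walk over a lineage graph. The sketch below assumes lineage is stored as a mapping from each column to the columns it is derived from; the table and column names are hypothetical and follow a Bronze, Silver, and Gold naming scheme.

```python
# Hypothetical column-level lineage: column -> columns it is derived from.
LINEAGE = {
    "gold.fact_sales.amount": ["silver.sales.amount"],
    "silver.sales.amount": ["bronze.orders_raw.amount"],
    "gold.fact_sales.region_key": ["gold.dim_region.region_key"],
    "gold.dim_region.region_key": ["silver.sales.region"],
    "silver.sales.region": ["bronze.orders_raw.region"],
}


def trace_to_sources(column, lineage=LINEAGE):
    """Return the columns with no upstream that ultimately feed `column`."""
    sources, stack, seen = set(), [column], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the lineage graph
        seen.add(current)
        upstream = lineage.get(current, [])
        if not upstream:
            sources.add(current)
        stack.extend(upstream)
    return sorted(sources)
```

Because all layers sit on one platform, a single mapping like this can cover the whole path from a consumer-facing column back to its raw source.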


