Introduction to Data Governance in Databricks


A Stack Overflow question about Data Governance in Databricks caught my attention recently. I am currently exploring Immuta and Privacera, so I can’t compare the two tools in detail yet. Still, there are ways to cover some data governance aspects without buying an external component. Here is my answer.

Security and user access

  • By default, all users have access to all data stored in a cluster’s managed tables unless table access control is enabled for that cluster. Once table access control is enabled, users can set permissions on data objects within that cluster. You can also create dedicated Hive views and apply row-level security (see the sketch after this list). Read more about Table access control

  • Use Azure Active Directory credential passthrough to avoid creating system users that are not linked to your LDAP. Unfortunately, Scala is not supported; you get only SQL and Python. Read more about accessing data from ADLS using Azure AD

  • You need to set up permissions on Azure Data Lake Gen 2 using ACLs. IMHO, setting ACLs is an awful experience, because changing the default ACL on a parent does not affect the access ACL or default ACL of child items that already exist. Read more about Data Lake Gen 2 ACLs

  • Please avoid creating dataset snapshots with column/row subsets as a way to solve column- or row-level security. Data duplication is never a good idea!
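
To make the table access control point more concrete, here is a minimal sketch of object permissions and a row-level-security view, run from a Python notebook on a cluster where table access control is already enabled. The table, view and group names (sales_raw, sales_rls, analysts, managers) are illustrative, and the exact GRANT syntax may vary slightly between runtime versions:

# A minimal sketch, assuming table access control is enabled on the cluster.
# Table, view and group names below are illustrative.

# Object-level permission: let the analysts group read one table
spark.sql("GRANT SELECT ON TABLE default.sales_raw TO `analysts`")

# Row-level security via a dynamic view: managers see all rows,
# everyone else only sees deals below one million.
# is_member() is a dynamic view function available with table ACLs.
spark.sql("""
    CREATE OR REPLACE VIEW default.sales_rls AS
    SELECT user_id, country, product, total
    FROM default.sales_raw
    WHERE CASE
            WHEN is_member('managers') THEN TRUE
            ELSE total <= 1000000
          END
""")

# Users query the view instead of the underlying table
spark.sql("GRANT SELECT ON VIEW default.sales_rls TO `analysts`")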

Lineage

  • One option for collecting lineage in Databricks and Spark is Apache Atlas combined with Spline. Here is an article describing how to get lineage from Databricks using Spline and Apache Atlas.

  • Unfortunately, Spline is still under active development, and even reproducing the setup described in the article is not straightforward. The good news is that Apache Atlas 3.0 comes with many type definitions for Azure Data Lake Gen 2 and other sources.

  • Another option is custom logging of all reads and writes (see the sketch after this list). Based on these logs, you can build a Power BI report to visualize the lineage.

  • Also, consider using Azure Data Factory for orchestration. With a proper ADF pipeline structure, you get high-level lineage and can see dependencies and rerun failed activities. You can read more about the Data Factory dependency framework here.

  • Take a look at Marquez - a small open-source project that has some nice features, including data lineage.
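
As a starting point for the custom read/write logging mentioned above, here is a minimal sketch in Python. The lineage_log location, schema and helper names are illustrative assumptions, not an existing API:

from datetime import datetime
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

LINEAGE_LOG_PATH = "/mnt/governance/lineage_log"  # illustrative location

LOG_SCHEMA = StructType([
    StructField("job_name", StringType()),
    StructField("source", StringType()),
    StructField("destination", StringType()),
    StructField("logged_at", StringType()),
])

def log_lineage(job_name, source, destination):
    # Append one lineage record; a Power BI report can be built on top of this log
    record = [(job_name, source, destination, datetime.utcnow().isoformat())]
    spark.createDataFrame(record, LOG_SCHEMA) \
        .write.format("delta").mode("append").save(LINEAGE_LOG_PATH)

def read_with_lineage(job_name, source_path):
    log_lineage(job_name, source_path, None)
    return spark.read.format("delta").load(source_path)

def write_with_lineage(df, job_name, destination_path):
    log_lineage(job_name, None, destination_path)
    df.write.format("delta").mode("overwrite").save(destination_path)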

Data quality

  • Investigate Amazon Deequ. It supports Scala only so far, but it offers some nice predefined data quality checks.

  • Also, you don’t necessarily need a fancy data quality library: simple asserts can already give you better control over your data (see the sketch after this list).
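
A small sketch of the "simple asserts" approach, using an illustrative customers dataset and column names:

from pyspark.sql import functions as F

def check_customers(df):
    # Fail the job early when basic expectations are violated
    total = df.count()

    # The dataset must not be empty
    assert total > 0, "customers dataset is empty"

    # The primary key must not contain NULLs
    null_ids = df.filter(F.col("customer_id").isNull()).count()
    assert null_ids == 0, f"{null_ids} rows have a NULL customer_id"

    # The primary key must be unique
    distinct_ids = df.select("customer_id").distinct().count()
    assert distinct_ids == total, "customer_id is not unique"

customers = spark.read.format("delta").load("/mnt/raw/customers")  # illustrative path
check_customers(customers)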

Data life cycle management

  • One option is to use native Data Lake Storage lifecycle management. However, it is not a viable option for Delta/Parquet datasets, because lifecycle rules operate on individual files rather than on logical tables.

  • A second option: imagine you have a table with information about all datasets (dataset_friendly_name, path, retention time, zone, sensitive_columns, owner, etc.). Your Databricks users then use a small wrapper to read and write:

df = DataWrapper.Read("dataset_friendly_name")
DataWrapper.Write(df, "destination_dataset_friendly_name")
  • It’s then up to you to implement the logging and data loading inside the wrapper. You can skip sensitive_columns or define actions based on retention time (both available in the dataset info table). You can always expand this table with a more advanced schema and add extra information about pipelines, dependencies, etc. A minimal sketch of such a wrapper follows this list.

  • Utilize the Delta format and its superior features to implement retention and pseudonymization. After all, it seems that file-based formats are finally getting back the functionality of relational databases.
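
To make the wrapper idea more tangible, here is a minimal sketch of what DataWrapper could look like. It assumes a dataset_info Delta table with the columns mentioned above, plus a retention value expressed in hours; the paths, column names and retention logic are illustrative assumptions, not a finished implementation:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

DATASET_INFO_PATH = "/mnt/governance/dataset_info"  # illustrative location

class DataWrapper:
    """Thin read/write wrapper driven by the dataset info table."""

    @staticmethod
    def _info(friendly_name):
        # Look up the row describing the dataset (path, sensitive columns, retention, ...)
        return (spark.read.format("delta").load(DATASET_INFO_PATH)
                .filter(F.col("dataset_friendly_name") == friendly_name)
                .first())

    @staticmethod
    def Read(friendly_name):
        info = DataWrapper._info(friendly_name)
        df = spark.read.format("delta").load(info["path"])
        # Governance hook: drop sensitive columns before handing data to the user
        if info["sensitive_columns"]:
            df = df.drop(*info["sensitive_columns"].split(","))
        return df

    @staticmethod
    def Write(df, friendly_name):
        info = DataWrapper._info(friendly_name)
        df.write.format("delta").mode("overwrite").save(info["path"])
        # Governance hook: enforce retention with Delta VACUUM
        # (by default VACUUM refuses retention periods below 168 hours)
        spark.sql(f"VACUUM delta.`{info['path']}` RETAIN {info['retention_hours']} HOURS")

Logging every Read and Write call into the lineage log from the previous section would then give you both lineage and basic access auditing from the same wrapper.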

As you can see, setting up Data Governance in Databricks is not straightforward. There are many moving parts that require custom implementation.

Soon I plan to release my research on Immuta and Privacera - commercial solutions for Data Governance in Databricks (and beyond). Subscribe to get notified about new posts.


I'm Valdas Maksimavicius. I write about data, cloud technologies and personal development. You can find more about me here.