What is Immuta and how does it improve data governance in Databricks?

data-platform software

A few days ago I wrote an article, Introduction to Data Governance in Databricks, describing how to approach data governance in Databricks without buying additional components.

As there are many moving parts that require custom implementation, I decided to explore the Immuta trial. Initially, I started this post as an Evernote note to myself. Eventually the note grew in length, and I decided to share my steps and lessons learned with you.


Jump to a section:

Get started

Connect Immuta with Databricks

Add data sources

Query data

Create data policies

Create projects and purpose-based policies

Collaboration and data catalog

Governance (tags, logs, reports, purposes)

Query editor

Connecting with Power BI

How does Immuta work?

Installation

My observations


Get started

First, let’s sign up at https://www.immuta.com/try/ to get access to the Immuta trial.

Connect Immuta with Databricks

Next, I want to link Immuta with my own Databricks workspace.

Immuta integrates with Databricks

The portal shows me a straightforward step-by-step tutorial for linking Immuta with Databricks. It takes less than 10 minutes to complete. The major steps:

  • Download the Immuta JAR, which later has to be uploaded to DBFS
  • Set up a cluster init script to make sure the JAR gets loaded during startup
  • Create a Databricks secret and paste in your unique Immuta account key
  • Create a cluster policy (copy-pasted from Immuta)
  • Create a Databricks cluster
  • Execute your queries

Almost everything goes smoothly, except the last step. I am not able to run queries and get an error: NonImmutaUserAuthorized: Unable to create an API key (401: Unauthorized)

Getting NonImmutaUserAuthorized error

The reason? I didn’t read the documentation and chose two different users in Immuta and Databricks.

“It is important that the users’ principal names in Immuta match those in your Databricks Workspace. For instance, if you authenticate to Databricks with your.name@company.com, you will need to authenticate with Immuta using the same username.”

After solving my user issue, I get another error related to disabled JDBC. It seems to be related to the cluster policy I created earlier.

Adding “SET immuta.enable.jdbc=true;” solves the issue. Now I can see the medical_claims data from Immuta (available as demo data).
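To make this concrete, here is roughly what the working notebook cell looks like (a minimal sketch against the trial’s demo data; your database and table names may differ):

    -- Enable JDBC on the Immuta-enabled cluster, then query the demo table
    SET immuta.enable.jdbc=true;

    SELECT *
    FROM immuta.medical_claims
    LIMIT 100;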

And here is how the database and tables look in the Databricks metastore. We see the Immuta database and the medical_claims table. (We see other tables too, because I took the screenshot below after connecting other sources.)

Immuta adds database to Databricks metastore

Add data sources

Great, what’s next? I want to connect a few other sources and see how Immuta works with Azure Data Lake Storage Gen2 and SQL databases.

Immuta trial available connectors

I see quite a few connectors, but I am missing Azure Data Lake Storage Gen2. Only after visiting the Settings page do I get a disappointing message:

Anyway, we have to live without it for now. In that case, let’s try to connect to my Azure SQL instance. After playing with Immuta for half an hour, the UI feels modern and friendly. Tips and pop-ups help me understand the tool even faster.

In the example below, the popup proactively reminds me to whitelist an IP address so that Immuta can connect to my database. The trial version of Immuta runs on their infrastructure, but for enterprise deployments you can deploy it within your own virtual network. More about that later.

Immuta provides popup tips

Once the connection succeeds, I get a message asking what metadata I want to ingest. The message emphasizes metadata ingestion; nothing about copying data. I decide to let Immuta fetch 15 out of 16 tables.

Later, there is an interesting Advanced section. I go with the default settings, without investigating the different options much.

Immuta advanced source settings

After clicking Create, my data source is connected and the tables become visible in the Data Sources tab.

Immuta populates source metadata

Query data

Let’s open the Customer table. The table view has new tabs: members, policies, data dictionary, queries, etc. I am going to explore those later.

Now, I am interested in reading the table I linked earlier from Databricks. According to the description, it should be ‘SELECT * FROM saleslt_customer limit 100’. However, the database name in the Databricks metastore is “Immuta”, so we have to add it as a prefix. Let’s try.
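The prefixed statement looks like this (a small sketch):

    -- Immuta's suggested query, with the metastore database name added as a prefix
    SELECT * FROM immuta.saleslt_customer LIMIT 100;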

Something fails: “Reference to database and/or server name in ‘adventure-works-db.SalesLT.Customer’ is not supported in this version of SQL Server.” What? It sounds like a SQL Server driver issue…

Interestingly, Immuta shows data health issues in its UI. It’s a good idea to surface these in one central place right after any client faces issues.

Immuta shows dataset health

I can even run tests and look for possible fixes. 

Immuta shows dataset health notifications

Unfortunately, the issue still persists. Quick googling says that referencing a full table name, like “SELECT * FROM [DatabaseB].[dbo].[MyTable]”, is not allowed in Azure SQL. As this is only a demo and I want to get things working ASAP, I decide to use my own SQL statement instead of Immuta’s. I assume there is a better solution for this, but let’s hack a bit for now.
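For illustration, the hack boils down to replacing the generated three-part name with a plain two-part reference (a sketch; the exact statement Immuta generates may differ):

    -- Fails in Azure SQL: three-part, database-qualified name
    -- SELECT * FROM [adventure-works-db].[SalesLT].[Customer];

    -- Works: two-part schema.table reference within the connected database
    SELECT * FROM SalesLT.Customer;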

Ladies and gentlemen, we have lift-off! My SQL table query goes through Immuta and I see results in Databricks!

Immuta applies global masking policy and hides sensitive information on the fly

Hey, wait a minute! Why are FirstName and LastName replaced with a static value, REDACTED? The SQL query is proxied through the virtual Immuta table down to the Azure SQL database while the policies are enforced. Policies applied on the fly.
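Here is an illustration of what I see from the Databricks side (the company names are made-up sample values):

    SELECT FirstName, LastName, CompanyName
    FROM immuta.saleslt_customer
    LIMIT 2;

    -- FirstName | LastName | CompanyName
    -- REDACTED  | REDACTED | A Bike Store
    -- REDACTED  | REDACTED | Progressive Sports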

Create data policies

Why are the FirstName and LastName columns hidden? After visiting the Policies tab inside the table view, I realize there is a global policy named “Mask PII”. It picked up my schema automatically and decided to hide these columns based on predefined patterns.

Immuta global policies

After connecting to my Azure SQL database, Immuta posted the notification below. It makes more sense now.

Global policies are managed at the Immuta subscription level. That’s great functionality to ensure sensitive information is not leaked by accident. Better to hide it globally by default than be sorry later.

In addition to the global “Mask PII” policy, I want to create a custom Customer table policy. As you can see below, we have quite a few actions available:

Immuta create table policies

The policy builder offers impressive functionality. Based on user names, groups, and LDAP attributes, you can create foundational policies:

  • Mask values - many options like hashing, replacing with a constant, regex, rounding, k-anonymization, format-preserving masking
  • Limit rows - row-level security

There are also some more advanced and interesting options, like:

  • Limit usage to purpose - to get access, you need to be granted purpose privileges, e.g. claims analysis, medical health analysis, credit card transactions, etc.
  • Minimize the amount of data shown (e.g. top 100 rows for all tables, just enough to get familiar with the schema)
  • Make differentially private - the policy will only return results for a certain type of SQL query: aggregates, such as the COUNT and SUM functions (see the sketch below)
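To illustrate the differential privacy option (my own sketch; claim_amount is a hypothetical column name):

    -- Allowed under a differential privacy policy: aggregate-only queries
    SELECT COUNT(*), SUM(claim_amount)  -- claim_amount is a hypothetical column
    FROM immuta.medical_claims;

    -- Rejected: row-level results are not returned
    -- SELECT * FROM immuta.medical_claims;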

Talking about data policies, it’s worth mentioning the subscriber policy, which defines how access to a table is managed.

Immuta subscriber policies

All my Azure SQL tables got the “Most Restricted” access policy. As you can read in the description, the tables are not even visible in search, and data source owners must manually add/remove users based on a rationale.

There are weaker options available. One grants access to all members of a specific group. Another lets users ask for permission, with the data owner getting a notification to approve/deny access.

Or, if a dataset contains no sensitive information (like general statistics), you can select Anyone and make it public.

More about access and row-level policies

Create projects and purpose-based policies

Let’s create a secret project. The idea here is to group members via a project to better control data usage, and to clearly link it with purpose restrictions.

Next, let’s add members and specify time-bound access. Brilliant functionality!

And now, as a member of the secret project working under a purpose, I can view all the columns.

Collaboration and data catalog

As I mentioned earlier, Immuta’s UI is quite intuitive. In addition to data governance features, there are also metadata management and data dictionary features.

First of all, search: find all registered tables (restricted tables are not even visible via search).

Immuta search

There is a Data Dictionary tab to look up the schema, tags, and descriptions. One can also connect to external data catalogs through APIs.

Immuta Data Dictionary

If something is unclear about a table or its column descriptions, users can ask questions and create tasks via the Immuta Discussions tab. The table owner will be notified.

Immuta Discussions tab

There is also a Lineage tab, but here I am surprised to see just the names of projects using the table. Maybe it can show something more, but that’s not end-to-end lineage of all data pipelines.

Immuta Data Lineage

The Public Queries tab allows users to keep track of their personal queries, share their queries with others, and debug. Users can send a request for a query debug, which sends the query plan to the data owner. Interesting functionality, but I haven’t tested it yet.

The Metrics tab shows details about data source usage and general statistics.

Immuta Metrics

Governance (tags, logs, reports, purposes)

Here is a section that I, as an engineer, will not appreciate enough. Immuta has powerful governance reports for data owners and data stewards, slicing and dicing access, sources, usage, and other metrics critical for compliance.

View per user

Immuta Governance Reports per user

View per source

Immuta Governance Reports per source

View per project

Immuta Governance Reports per project

Other views

Immuta Governance Reports options available

There is also granular audit information.

Immuta detailed audit

Also, there are predefined tags and the possibility to add your own tags (in a parent-child hierarchy).

Immuta define tags

By creating data usage purposes, we can limit data access to specific usage scenarios.

Immuta Add purpose

Query editor

There is a query editor, which got me thinking about the underlying computation. In my case, I connected an Azure SQL database. Does it work with Data Lake Storage, and what compute does it use then? What are the hardware requirements? Unfortunately, I can’t test this properly, as ADLS and other similar storage services are not allowed in my trial version.

For me, the query editor is functionality for data owners and source admins rather than analysts. Analysts will be using other tools they like, e.g. Databricks or Power BI. The query editor seems like a good place to check your policies, see data quickly, and debug user queries if needed.

Immuta Query Editor

Connecting with Power BI

Here is where things start to get interesting. To access Immuta from Power BI, I am asked to install a PostgreSQL connector.

Immuta Connect to Power BI

Next, I need to enter SQL credentials, which I can create on my profile page in Immuta.

Immuta Create SQL credentials

And eventually I receive connection strings with host and port information. This connection can be used not only from Power BI but also from any other client supporting ODBC (with PostgreSQL drivers installed).

Immuta get connection strings
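For reference, an ODBC connection string built from these details would look roughly like this (every value below is a placeholder, not the actual details from my trial):

    Driver={PostgreSQL Unicode};Server=<immuta-host>;Port=<port>;Database=<database>;Uid=<sql-username>;Pwd=<sql-password>;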

More about setting up Power BI and the enterprise gateway

How does Immuta work?

What is now obvious to me: Immuta is not a data virtualization technology. It seems to be a data governance wrapper between sources and computation engines. Based on Immuta’s documentation, I see six different data access patterns; each pattern seems to use a different data retrieval and authentication strategy (definitions taken from Immuta’s documentation):

1. Databricks
Data sources exposed via Immuta are available as tables in a Databricks cluster. Users can then query these data sources through their notebooks. Like other integrations, policies are applied to the plan that Spark builds for a user’s query and all data access is native.

2. Immuta Query Engine
Users get a basic Immuta PostgreSQL connection (that explains the PostgreSQL driver installation). The tables within this connection represent all the connected data across your organization. Those tables, however, are virtual tables, completely empty until a query is run. At query time, the SQL is proxied through the virtual Immuta table down to the native database while enforcing the policy automatically. The Immuta SQL connection can be used within any tool supporting ODBC connections. (A small sketch of this pattern follows after the list.)

3. HDFS
HDFS is the least interesting one for me. The Immuta HDFS layer can only act on data stored in HDFS. There are certain limitations with regard to HDFS deployments and the accessibility of other Immuta data layers.

4. S3
Immuta supports an S3-style REST API, which allows users to communicate with Immuta the same way they would with S3. 

5. Snowflake
Native Snowflake workspaces allow users to access protected data directly in Snowflake without having to go through the Immuta Query Engine. Within these workspaces, users can interact directly with Snowflake secure views, create derived data sources, and collaborate with other project members at a common access level. Because these derived data sources will inherit all appropriate policies, that data can then be shared outside the project. 

6. SparkSQL
Users are able to access subscribed data sources within their Spark Jobs by utilizing Spark SQL with the ImmutaContext class. All tables are virtual and are not populated until a query is materialized. When a query is materialized, data from metastore backed data sources, such as Hive and Impala, will be accessed using standard Spark libraries to access the data in the underlying files stored in HDFS. All other data source types will access data using the Query Engine which will proxy the query to the native database technology. Policies for each data source will be enforced automatically.
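As a quick illustration of the Query Engine pattern (number 2 above), this is the kind of statement any PostgreSQL-speaking client could send. The table is virtual; Immuta proxies the query to the underlying Azure SQL database with policies applied (a sketch based on my trial setup):

    -- Sent over the Immuta PostgreSQL connection (psql, ODBC, Power BI, etc.)
    SELECT FirstName, LastName
    FROM saleslt_customer
    LIMIT 10;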

More about data access patterns

Installation

I have used a trial instance of Immuta’s managed installation. Based on the documentation, I see a Kubernetes option available, which should satisfy most enterprise needs. Immuta on AWS also has a Managed Cloud deployment. I can’t wait to see a managed cloud deployment on Azure.

More about installation options

My observations

Strong points

  • Quick to get started with Databricks
  • It provides one data access layer (tracks data health, access management, etc.) 
  • Advanced global and dataset policy management scenarios
  • Advanced data governance reports and detailed audit information
  • Native data collaboration features (dictionary, discussions, search)
  • Smooth UI experience (except some glitches with data sources)
  • If used just with Databricks, there is a DBU-consumption-based pricing model

Considerations

  • No Azure managed cloud offering / not available in the Azure Marketplace yet
  • All access would be provided via Immuta, but it’s not quite a full data virtualization platform: no support for any kind of caching, like Databricks Delta, Dremio, or Synapse offer.
  • No semantic model or virtual dataset creation. Immuta gives access to what you have already built elsewhere.
  • It uses different data proxying techniques for different sources. Cross-source performance and accessibility (across the different patterns) should be tested more thoroughly.
  • Azure Data Lake Storage Gen2 is not supported in the trial. Overall, it seems a computation engine is still needed to access ADLS Gen2 data: AWS has Athena; Azure has Synapse or Databricks.
  • Licence-based pricing model, not as flexible as the one for Databricks.
  • An external data catalog might still be needed for an end-to-end view (lineage, covering sources outside Immuta, etc.).

I'm Valdas Maksimavicius. I write about data, cloud technologies and personal development. You can find more about me here.