What is Immuta and how does it improve data governance in Databricks?

data-platform software

A few days ago I wrote an article, Introduction to Data Governance in Databricks, describing how to approach data governance in Databricks without buying additional components.

As there are many moving parts that require custom implementation, I decided to explore the Immuta trial. Initially, I started this post as an Evernote note to myself. Eventually the note grew in length, and I decided to share my steps and lessons learned with you.


Jump to a section:

Get started

Connect Immuta with Databricks

Add data sources

Query data

Create data policies

Create projects and purpose-based policies

Collaboration and data catalog

Governance (tags, logs, reports, purposes)

Query editor

Connecting with Power BI

How does Immuta work?

Installation

My observations


Get started

First, let’s sign up at https://www.immuta.com/try/ to get access to the Immuta trial.

Connect Immuta with Databricks

Next, I want to link Immuta with my own Databricks workspace.

Immuta integrates with Databricks

The portal shows me a straightforward step-by-step tutorial for linking Immuta with Databricks. It takes less than 10 minutes to complete. The major steps:

  • Download the Immuta JAR, which later has to be uploaded to DBFS
  • Set up a cluster init script to make sure the JAR gets loaded during startup
  • Create a Databricks secret and paste in your unique Immuta account key
  • Create a cluster policy (copy-pasted from Immuta)
  • Create a Databricks cluster
  • Execute your queries

Almost everything goes smoothly, except the last step. I am not able to run queries and get an error: NonImmutaUserAuthorized: Unable to create an API key (401: Unauthorized)

Getting NonImmutaUserAuthorized error

The reason? I didn’t read the documentation and chose two different users in Immuta and Databricks.

“It is important that the users’ principal names in Immuta match those in your Databricks Workspace. For instance, if you authenticate to Databricks with your.name@company.com, you will need to authenticate with Immuta using the same username.”

After solving my user issue, I get another error related to disabled JDBC. It seems to be related to the cluster policy I created earlier.

Adding “SET immuta.enable.jdbc=true;” solves the issue. Now I can see the medical_claims data from Immuta (available as demo data).
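To make this concrete, here is roughly what the working notebook cell looks like (a minimal sketch against the trial’s demo data; your database and table names may differ):

    -- Enable JDBC on the Immuta-enabled cluster, then query the demo table
    SET immuta.enable.jdbc=true;

    SELECT *
    FROM immuta.medical_claims
    LIMIT 100;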

And here is how the database and tables look in the Databricks metastore. We see the Immuta database and the medical_claims table. (We see other tables too, because I took the screenshot below after connecting other sources.)

Immuta adds database to Databricks metastore

Add data sources

Great, what’s next? I want to connect a few other sources and see how Immuta works with Azure Data Lake Storage Gen2 and SQL databases.

Immuta trial available connectors

I see quite a few connectors, but I am missing Azure Data Lake Storage Gen2. Only after visiting the Settings page do I get a disappointing message:

Anyway, we have to live without it for now. In that case, let’s try to connect to my Azure SQL instance. After playing with Immuta for half an hour, the UI feels modern and friendly. Tips and pop-ups help me understand the tool even faster.

In the example below, the popup proactively reminds me to whitelist an IP address so that Immuta can connect to my database. The trial version of Immuta runs on their infrastructure, but for enterprise deployments you can deploy it within your own virtual network. More about that later.

Immuta provides popup tips

Once the connection succeeds, I get a message asking what metadata I want to ingest. The message emphasizes metadata ingestion; nothing about copying data. I decide to let Immuta fetch 15 out of 16 tables.

Later, there is an interesting Advanced section. I go with the default settings, without investigating the different options much.

Immuta advanced source settings

After clicking Create, my data source is connected and the tables become visible in the Data Sources tab.

Immuta populates source metadata

Query data

Let’s open the Customer table. The table view has new tabs: members, policies, data dictionary, queries, etc. I am going to explore those later.

Now, I am interested in reading the table I linked earlier from Databricks. According to the description, it should be ‘SELECT * FROM saleslt_customer limit 100’. However, the database name in the Databricks metastore is “Immuta”, so we have to add it as a prefix. Let’s try.
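The prefixed statement looks like this (a small sketch):

    -- Immuta's suggested query, with the metastore database name added as a prefix
    SELECT * FROM immuta.saleslt_customer LIMIT 100;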

Something fails: “Reference to database and/or server name in ‘adventure-works-db.SalesLT.Customer’ is not supported in this version of SQL Server.” What? It sounds like a SQL Server driver issue…

Interestingly, Immuta shows data health issues in its UI. It’s a good idea to surface these in one central place right after any client faces issues.

Immuta shows dataset health

I can even run tests and look for possible fixes. 

Immuta shows dataset health notifications

Unfortunately, the issue still persists. Quick googling says that referencing a full table name, like “SELECT * FROM [DatabaseB].[dbo].[MyTable]”, is not allowed in Azure SQL. As this is only a demo and I want to get things working ASAP, I decide to use my own SQL statement instead of Immuta’s. I assume there is a better solution for this, but let’s hack a bit for now.
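For illustration, the hack boils down to replacing the generated three-part name with a plain two-part reference (a sketch; the exact statement Immuta generates may differ):

    -- Fails in Azure SQL: three-part, database-qualified name
    -- SELECT * FROM [adventure-works-db].[SalesLT].[Customer];

    -- Works: two-part schema.table reference within the connected database
    SELECT * FROM SalesLT.Customer;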

Ladies and gentlemen, we have lift-off! My SQL table query goes through Immuta and I see results in Databricks!

Immuta applies global masking policy and hides sensitive information on the fly

Hey, wait a minute! Why are FirstName and LastName replaced with a static value, REDACTED? The SQL query is proxied through the virtual Immuta table down to the Azure SQL database while the policies are enforced. Policies applied on the fly.
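Here is an illustration of what I see from the Databricks side (the company names are made-up sample values):

    SELECT FirstName, LastName, CompanyName
    FROM immuta.saleslt_customer
    LIMIT 2;

    -- FirstName | LastName | CompanyName
    -- REDACTED  | REDACTED | A Bike Store
    -- REDACTED  | REDACTED | Progressive Sports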

Create data policies

Why are the FirstName and LastName columns hidden? After visiting the Policies tab inside the table view, I realize there is a global policy named “Mask PII”. It picked up my schema automatically and decided to hide these columns based on predefined patterns.

Immuta global policies

After connecting to my Azure SQL database, Immuta posted the notification below. It makes more sense now.

Global policies are managed at the Immuta subscription level. That’s great functionality to ensure sensitive information is not leaked by accident. Better to hide it globally by default than be sorry later.

In addition to the global “Mask PII” policy, I want to create a custom Customer table policy. As you can see below, we have quite a few actions available:

Immuta create table policies

The policy builder offers impressive functionality. Based on user names, groups, and LDAP attributes, you can create foundational policies:

  • Mask values - many options like hashing, replacing with a constant, regex, rounding, k-anonymization, format-preserving masking
  • Limit rows - row-level security

There are also some more advanced and interesting options, like:

  • Limit usage to purpose - to get access, you need to be granted purpose privileges, e.g. claims analysis, medical health analysis, credit card transactions, etc.
  • Minimize the amount of data shown (e.g. top 100 rows for all tables, just enough to get familiar with the schema)
  • Make differentially private - the policy will only return results for a certain type of SQL query: aggregates, such as the COUNT and SUM functions (see the sketch below)
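To illustrate the differential privacy option (my own sketch; claim_amount is a hypothetical column name):

    -- Allowed under a differential privacy policy: aggregate-only queries
    SELECT COUNT(*), SUM(claim_amount)  -- claim_amount is a hypothetical column
    FROM immuta.medical_claims;

    -- Rejected: row-level results are not returned
    -- SELECT * FROM immuta.medical_claims;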

Talking about data policies, it’s worth mentioning the subscriber policy, which defines how access to a table is managed.

Immuta subscriber policies

All my Azure SQL tables got the “Most Restricted” access policy. As you can read in the description, the tables are not even visible in search, and data source owners must manually add/remove users based on a rationale.

There are weaker options available. One grants access to all members of a specific group. Another lets users ask for permission, with the data owner getting a notification to approve/deny access.

Or, if a dataset contains no sensitive information (like general statistics), you can select Anyone and make it public.

More about access and row-level policies

Create projects and purpose-based policies

Let’s create a secret project. The idea here is to group members via a project to better control data usage, and to clearly link it with purpose restrictions.

Next, let’s add members and specify time-bound access. Brilliant functionality!

And now, as a member of the secret project working under a purpose, I can view all the columns.

Collaboration and data catalog

As I mentioned earlier, Immuta’s UI is quite intuitive. In addition to data governance features, there are also metadata management and data dictionary features.

First of all, search: find all registered tables (restricted tables are not even visible via search).

Immuta search

There is a Data Dictionary tab to look up the schema, tags, and descriptions. One can also connect to external data catalogs through APIs.

Immuta Data Dictionary

If something is unclear about a table or its column descriptions, users can ask questions and create tasks via the Immuta Discussions tab. The table owner will be notified.

Immuta Discussions tab

There is also a Lineage tab, but here I am surprised to see just the names of projects using the table. Maybe it can show something more, but that’s not end-to-end lineage of all data pipelines.

Immuta Data Lineage

The Public Queries tab allows users to keep track of their personal queries, share their queries with others, and debug. Users can send a request for a query debug, which sends the query plan to the data owner. Interesting functionality, but I haven’t tested it yet.

The Metrics tab shows details about data source usage and general statistics.

Immuta Metrics

Governance (tags, logs, reports, purposes)

Here is a section that I, as an engineer, will not appreciate enough. Immuta has powerful governance reports for data owners and data stewards, slicing and dicing access, sources, usage, and other metrics critical for compliance.

View per user

Immuta Governance Reports per user

View per source

Immuta Governance Reports per source

View per project

Immuta Governance Reports per project

Other views

Immuta Governance Reports options available

There is also granular audit information.

Immuta detailed audit

Also, there are predefined tags and the possibility to add your own tags (in a parent-child hierarchy).

Immuta define tags

By creating data usage purposes, we can limit data access to specific usage scenarios.

Immuta Add purpose

Query editor

There is a query editor, which got me thinking about the underlying computation. In my case, I connected an Azure SQL database. Does it work with Data Lake Storage, and what compute does it use then? What are the hardware requirements? Unfortunately, I can’t test this properly, as ADLS and other similar storage services are not allowed in my trial version.

For me, the query editor is functionality for data owners and source admins rather than analysts. Analysts will be using other tools they like, e.g. Databricks or Power BI. The query editor seems like a good place to check your policies, see data quickly, and debug user queries if needed.

Immuta Query Editor

Connecting with Power BI

Here is where things start to get interesting. To access Immuta from Power BI, I am asked to install a PostgreSQL connector.

Immuta Connect to Power BI

Next, I need to enter SQL credentials, which I can create on my profile page in Immuta.

Immuta Create SQL credentials

And eventually I receive connection strings with host and port information. This connection can be used not only from Power BI but also from any other client supporting ODBC (with PostgreSQL drivers installed).

Immuta get connection strings
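For reference, an ODBC connection string built from these details would look roughly like this (every value below is a placeholder, not the actual details from my trial):

    Driver={PostgreSQL Unicode};Server=<immuta-host>;Port=<port>;Database=<database>;Uid=<sql-username>;Pwd=<sql-password>;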

More about setting up Power BI and the enterprise gateway

How does Immuta work?

What is now obvious to me: Immuta is not a data virtualization technology. It seems to be a data governance wrapper between sources and computation engines. Based on Immuta’s documentation, I see six different data access patterns; each pattern seems to use a different data retrieval and authentication strategy (definitions taken from Immuta’s documentation):

1. Databricks
Data sources exposed via Immuta are available as tables in a Databricks cluster. Users can then query these data sources through their notebooks. Like other integrations, policies are applied to the plan that Spark builds for a user’s query and all data access is native.

2. Immuta Query Engine
Users get a basic Immuta PostgreSQL connection (that explains the PostgreSQL driver installation). The tables within this connection represent all the connected data across your organization. Those tables, however, are virtual tables, completely empty until a query is run. At query time, the SQL is proxied through the virtual Immuta table down to the native database while enforcing the policy automatically. The Immuta SQL connection can be used within any tool supporting ODBC connections. (A small sketch of this pattern follows after the list.)

3. HDFS
HDFS is the least interesting one for me. The Immuta HDFS layer can only act on data stored in HDFS. There are certain limitations with regard to HDFS deployments and the accessibility of other Immuta data layers.

4. S3
Immuta supports an S3-style REST API, which allows users to communicate with Immuta the same way they would with S3. 

5. Snowflake
Native Snowflake workspaces allow users to access protected data directly in Snowflake without having to go through the Immuta Query Engine. Within these workspaces, users can interact directly with Snowflake secure views, create derived data sources, and collaborate with other project members at a common access level. Because these derived data sources will inherit all appropriate policies, that data can then be shared outside the project. 

6. SparkSQL
Users are able to access subscribed data sources within their Spark Jobs by utilizing Spark SQL with the ImmutaContext class. All tables are virtual and are not populated until a query is materialized. When a query is materialized, data from metastore backed data sources, such as Hive and Impala, will be accessed using standard Spark libraries to access the data in the underlying files stored in HDFS. All other data source types will access data using the Query Engine which will proxy the query to the native database technology. Policies for each data source will be enforced automatically.
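As a quick illustration of the Query Engine pattern (number 2 above), this is the kind of statement any PostgreSQL-speaking client could send. The table is virtual; Immuta proxies the query to the underlying Azure SQL database with policies applied (a sketch based on my trial setup):

    -- Sent over the Immuta PostgreSQL connection (psql, ODBC, Power BI, etc.)
    SELECT FirstName, LastName
    FROM saleslt_customer
    LIMIT 10;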

More about data access patterns

Installation

I have used a trial instance of Immuta’s managed installation. Based on the documentation, I see a Kubernetes option available, which should satisfy most enterprise needs. Immuta on AWS also has a Managed Cloud deployment. I can’t wait to see a managed cloud deployment on Azure.

More about installation options

My observations

Strong points

  • Quick to get started with Databricks
  • It provides one data access layer (tracks data health, access management, etc.) 
  • Advanced global and dataset policy management scenarios
  • Advanced data governance reports and detailed audit information
  • Native data collaboration features (dictionary, discussions, search)
  • Smooth UI experience (except some glitches with data sources)
  • If used just with Databricks, there is a DBU-consumption-based pricing model

Considerations

  • No Azure managed cloud offering / not available in the Azure Marketplace yet
  • All access would be provided via Immuta, but it’s not quite a full data virtualization platform: no support for any kind of caching, like Databricks Delta, Dremio, or Synapse offer.
  • No semantic model or virtual dataset creation. Immuta gives access to what you have already built elsewhere.
  • It uses different data proxying techniques for different sources. Cross-source performance and accessibility (across the different patterns) should be tested more thoroughly.
  • Azure Data Lake Storage Gen2 is not supported in the trial. Overall, it seems a computation engine is still needed to access ADLS Gen2 data: AWS has Athena; Azure has Synapse or Databricks.
  • Licence-based pricing model, not as flexible as the one for Databricks.
  • An external data catalog might still be needed for an end-to-end view (lineage, covering sources outside Immuta, etc.).

I'm Valdas Maksimavicius. I write about data, cloud technologies and personal development. You can find more about me here.