Scheduling Hive jobs using SSIS ETL is a blasphemy. Is it?


Here’s a conversation I had a while ago with my colleague:

  • SSIS ETL is the last tool to schedule Hive jobs on a separate cluster”
  • But what if … - I muttered.
  • There are so many great tools you can choose from, Apache Oozie, Nifi ,or even the cron. And you chose the worst.

For a moment let’s forget about Apache Oozie and similar services. Here are the prerequisites:

  • Data is in SQL Server
  • SSIS is being used across the organisation (you’ll definitely find someone when help needed)
  • HDInsight in Azure cannot access servers on-premises. Only push from on-premises is viable.
  • SSIS is being used to upload data to Azure Storage.

Having this in mind, SSIS ETL doesn’t seem to be that bad. But is it working properly? In January 2017 I was so fed up withconstant fights with Azure and HDInsight. Azure Feature Pack for SSIS supported HDInsight up to 3.2. Yes, the Windows version. My cluster (v 3.4) was absolutely invisible. But then, in March, Microsoft releases a new version. Brilliant!!! Detailed description of Hive task

Limitation If you run a script file, there are no options to inject Hive environment variables. (Sick suggestion) Inject variables in Hive script before running Azure HDInsight Hive Task. If you run in-line script, then inject as you wish in SSIS variables.

How it works

  • HDInsightConnection - Azure HDInsight Connection Manager (DNS, user name and password)
  • AzureStorageConnection - Azure Storage Connection Manager (account name and key)
  • BlobContainer - this is only necessary if you want to download the script execution output and error logs
  • LocalLogFolder - this is only necessary if you want to download the script execution output and error logs locally.
  • Script - Hive script, if no file specified
  • BlobName - the script in Azure Blob Storage. Path to file after container.
  • ExternalStorageAccountName and ExternalBlobContainer - the script storage

Parameters A standard way to inject values into Azure HDInsight Hive Task

Main picture

Final remarks Don’t forget to whitelist your SSIS ETL server in HDInsight VNet :)

I'm Valdas Maksimavicius. I write about data, cloud technologies and personal development. You can find more about me here.