Making Data Scientists Productive in Azure


Doing data science today is far more difficult than it will be in the next 5-10 years. Sharing and collaborating on workflows is painful, and pushing models into production is challenging. Let’s explore what Azure provides to ease Data Scientists’ pains.

In November this year, I got an opportunity to speak at Big Data Conference Vilnius 2018 about Data Analytics products in Azure. Here is a written version of my talk. In this post, you will learn about Azure Machine Learning Studio, Azure Machine Learning Service, Azure Databricks, Data Science Virtual Machine, and Cognitive Services. What tools and services can we choose based on a problem definition, skillset or infrastructure requirements?

One thing about Microsoft - they have many ways to solve the same problem

Picking a good name for your classes, methods or variables is essential (and difficult). Finding a good name for a product or service seems to be even more challenging. When I look at the Azure service names (“machine learning” this, “machine learning” that), it is clear that even big companies, like Microsoft, have difficulty finding catchy and straightforward names.

As a result, there are many different services with similar names. For example, what is the difference between Azure Machine Learning Service and Azure Machine Learning Studio? Is Microsoft Machine Learning Server the same thing as Data Science Virtual Machine? Let’s find out!


Jump to:

Azure Cognitive Services

Azure Machine Learning Studio

Azure Machine Learning Service

Microsoft Machine Learning Server

Azure Databricks

Data Science Virtual Machine


What do you mean by saying “Making Data Scientists Productive in Azure”?

Matei Zaharia, the author of Spark, pointed out the main aspects of the machine learning lifecycle in one of his presentations.

In the underlying machine learning lifecycle, we start with data. Next come data preparation, model training, and model deployment. Then, if our application is doing anything important, we want to monitor it to see how it’s doing, collect extra data, and feed it back into the process. Each step involves many tools that often need tuning for better results and performance. To experiment effectively, it is essential to know which parameters were used at each stage to produce a specific result. And everything needs to happen at scale.
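To make the "which parameters were used at each stage" point concrete, here is a minimal sketch of a run log in plain Python. It is not tied to any Azure service; the class names, stage names, and values are all hypothetical and only illustrate the kind of bookkeeping the tools below take care of for you.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class StageRecord:
    """Parameters and metrics captured for one lifecycle stage."""
    stage: str
    params: Dict[str, Any]
    metrics: Dict[str, float] = field(default_factory=dict)


@dataclass
class RunLog:
    """Hypothetical in-memory log of every stage in one end-to-end run."""
    run_id: str
    stages: List[StageRecord] = field(default_factory=list)

    def record(self, stage: str, params: Dict[str, Any],
               metrics: Optional[Dict[str, float]] = None) -> None:
        # Copy the dicts so later mutation by the caller cannot rewrite history.
        self.stages.append(StageRecord(stage, dict(params), dict(metrics or {})))

    def params_for(self, stage: str) -> Dict[str, Any]:
        """Return the parameters used at a stage (the latest record wins)."""
        for rec in reversed(self.stages):
            if rec.stage == stage:
                return rec.params
        raise KeyError(stage)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Illustrative values only, not real results.
log = RunLog("run-001")
log.record("data_preparation", {"drop_nulls": True, "scaler": "standard"})
log.record("training", {"algorithm": "boosted_tree", "learning_rate": 0.1},
           {"auc": 0.9})
```

Calling `log.params_for("training")` later answers the question "how did I get this result?" without digging through old notebooks.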

By “productive” in the title of this post, I mean a collection of well-integrated tools that support the whole machine learning lifecycle.

Azure Cognitive Services

What is it?

Azure services with pre-built AI and ML models

What can you do with it?

Add intelligent features to your apps

Azure Cognitive Services is a powerful capability that allows software developers (no machine learning knowledge required) to use state-of-the-art machine learning models and integrate them with other applications by calling APIs or importing SDKs.

Azure Cognitive Services lets you build apps with powerful algorithms using a few lines of code, and run them across devices and platforms such as iOS, Android, and Windows. The set of Cognitive Services continually expands with new features. Many services offer free demos (https://azure.microsoft.com/en-us/services/cognitive-services/directory/).

For example, here is a face detection API returning my face parameters. The service marks my face, which IMHO is a minimal viable product in this context. Additionally, the result includes other attributes, such as hair color, smile, gender. But the first property is BALD: 0.13! My wife said the value is wrong because it’s too low :)
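As a rough idea of what "a few lines of code" means here, the sketch below builds (but does not send) an HTTP request to the Face API detect operation using only the Python standard library. The region in the endpoint, the subscription key, and the image URL are placeholders, and the attribute names are simply the ones mentioned above; which attributes the service returns may differ between API versions.

```python
import json
import urllib.parse
import urllib.request


def build_face_detect_request(endpoint, subscription_key, image_url,
                              attributes=("hair", "smile", "gender")):
    """Build (but do not send) a Face API 'detect' request for an image URL."""
    query = urllib.parse.urlencode(
        {"returnFaceAttributes": ",".join(attributes)}
    )
    url = f"{endpoint.rstrip('/')}/face/v1.0/detect?{query}"
    body = json.dumps({"url": image_url}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Cognitive Services authenticates REST calls with this header.
        "Ocp-Apim-Subscription-Key": subscription_key,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


request = build_face_detect_request(
    endpoint="https://westeurope.api.cognitive.microsoft.com",  # placeholder region
    subscription_key="<your-subscription-key>",                 # placeholder
    image_url="https://example.com/portrait.jpg",               # placeholder
)

# Sending it requires a valid key:
# with urllib.request.urlopen(request) as response:
#     faces = json.loads(response.read())
```

The response is a JSON list with one entry per detected face, which is where values like the "bald" score come from.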

Azure Cognitive Services - Summary

Key benefits:

  • Minimal development effort
  • Easy integration via HTTP REST
  • Built-in integrations with other Azure services

Considerations:

  • Only available over the web (an exception is the Custom Vision Service)
  • Partial customization allowed
  • Limited support for non-English languages

Azure Machine Learning Studio

What is it?

Drag-and-drop visual interface for ML

What can you do with it?

Build, experiment and deploy models using pre-configured algorithms

Azure Machine Learning Studio (ML Studio) is a collaborative, drag-and-drop visual workspace where you can build, test, and deploy machine learning solutions without needing to write code. It uses pre-built and pre-configured machine learning algorithms and data-handling modules. Business analysts/statisticians without R/Python knowledge would be productive with this tool.

Azure Machine Learning Studio is an impressive service that can make people productive quickly. Yet, an experienced Data Scientist might find the tool very limiting and slow.

Use ML Studio when you want to experiment with machine learning models quickly and easily, and the built-in machine learning algorithms are enough for your solutions.

The whole experiment looks like a graph, with inputs at the top and outputs (predictions) at the bottom. In the “Binary Classification: Direct marketing” example, I compare two algorithms (two-class boosted decision tree and two-class support vector machine), and the tool makes it easy to deploy the better-performing model as a web service.

ML Studio offers support for the whole cycle, from data ingestion to deployment via exposed web services. However, the available data connectors, algorithms and deployment methods are limited. ML Studio reads data only from other Azure services (Azure SQL, Azure Cosmos DB, Event Hub, HDInsight). For data preparation, a user can pick from the available data cleaning and transformation operations. There is limited support for writing custom Python/R transformations. For training, there are many predefined algorithms available. To serve models, you can deploy your experiment as a web service (batch + request/response endpoints).
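Once an experiment is deployed, the request/response endpoint is a plain HTTP API. The sketch below builds such a request with the standard library. It is modeled on the sample code that ML Studio generates for a deployed service, so treat the payload layout as an assumption to check against your own service's API help page. The service URL, API key, input name `input1`, and column names are all placeholders.

```python
import json
import urllib.request


def build_scoring_request(service_url, api_key, column_names, rows):
    """Build (but do not send) a request/response call to a deployed experiment."""
    payload = {
        "Inputs": {
            "input1": {  # input port name; assumed default, check your service
                "ColumnNames": list(column_names),
                "Values": [[str(value) for value in row] for row in rows],
            }
        },
        "GlobalParameters": {},
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + api_key,
    }
    return urllib.request.Request(
        service_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


scoring_request = build_scoring_request(
    service_url="https://example.invalid/execute?api-version=2.0",  # placeholder
    api_key="<your-api-key>",                                       # placeholder
    column_names=["age", "income"],                                 # hypothetical
    rows=[[42, 55000]],
)
```

Sending the request with `urllib.request.urlopen(scoring_request)` returns the predictions as JSON, so any application that can make an HTTP call can consume the model.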

Azure Machine Learning Studio - Summary

Key benefits:

  • Interactive visual interface
  • Built-in Jupyter Notebooks for data exploration
  • Direct deployment of trained models as web services
  • Built-in integrations with other Azure services

Considerations:

  • Limited scalability (the largest size of a training dataset is 10GB)
  • Online only
  • Limited support for custom Python/R code

Azure Machine Learning Service

What is it?

Managed cloud service for ML

What can you do with it?

Train, deploy and manage models in Azure using Python and CLI

First of all, be aware that we are now discussing Azure Machine Learning Service, NOT STUDIO (presented earlier).

Azure Machine Learning Service (ML Service) provides a cloud-based environment you can use to develop, train, test, deploy, manage, and track machine learning models. It supports open-source technologies so that you can use Python open-source packages with machine learning components.

You might consider ML Service if you work in a Python environment, you want more control over your machine learning algorithms, or you want to use open-source machine learning libraries.

The workflow generally follows this sequence:

  1. Develop machine learning training scripts in Python.
  2. Create and configure a compute target (local computer, VM in Azure, Azure Batch AI cluster, Databricks, Azure Container Instances, Apache Spark for HDInsight).
  3. Submit the scripts to the configured compute target to run in that environment. During training, the compute target stores run records to a datastore (Blob Storage).
  4. Query the experiment for logged metrics from the current and past runs. If the metrics don’t indicate the desired outcome, loop back to step 1 and iterate on your scripts.
  5. After a satisfactory run is found, register the persisted model in the model registry.
  6. Develop a scoring script.
  7. Create an image and register it in the image registry.
  8. Deploy the image as a web service in Azure (Azure Container Instances, Azure Kubernetes Service, Azure IoT Edge).
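The scoring script from step 6 is the least obvious part of this workflow, so here is a minimal sketch of its shape. ML Service expects the script to expose an `init()` function, called once when the container starts, and a `run(raw_data)` function, called for every request. To keep the example runnable without an Azure subscription, the "model" is a hypothetical stand-in class; a real script would load the registered model file inside `init()`. The `"data"` key in the request body is also an assumption of this sketch.

```python
import json

model = None  # populated by init()


class ThresholdModel:
    """Stand-in for a trained model, used only to keep this sketch runnable."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, rows):
        # Predict 1 when the sum of a row's features exceeds the threshold.
        return [int(sum(row) > self.threshold) for row in rows]


def init():
    """Called once at container start-up: load the model into memory."""
    global model
    model = ThresholdModel(threshold=1.0)


def run(raw_data):
    """Called per request: parse the JSON body, score it, return the result."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"predictions": model.predict(rows)}
    except Exception as exc:
        # Return the error to the caller instead of crashing the service.
        return {"error": str(exc)}


# Local smoke test before building the image:
init()
result = run(json.dumps({"data": [[0.2, 0.3], [0.9, 0.8]]}))
```

Being able to call `init()` and `run()` locally like this is a cheap way to catch mistakes before creating the image in step 7.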

By using ML Service, you can start training on your local machine and then scale out to the cloud. With many available compute targets, and with advanced hyperparameter tuning services, you can build better models faster by using the power of the cloud.

ML Service supports the whole cycle, from data ingestion to deployment using Docker containers. Data should be available in Azure Blob Storage. For data preparation and training you can use any Python open-source package. For deployment, the easiest setup is achievable with Azure Container Instances or Azure Kubernetes Service.

Azure Machine Learning Service - Summary

Key benefits:

  • Central management of scripts and run history
  • Run model training scripts locally (offline), and then scale out to the cloud
  • Management and deployment of models to the cloud or edge devices

Considerations:

  • Python only

Microsoft Machine Learning Server

What is it?

Cross-platform standalone server for predictive analysis

What can you do with it?

Build and deploy models written in R or Python

In September 2017, Microsoft R Server was released under the new name of Microsoft Machine Learning Server (because of added Python support). Microsoft Machine Learning Server (ML Server) is a flexible choice for analyzing data at scale, building intelligent apps, and discovering insights. It includes a collection of R packages, Python packages, interpreters, and infrastructure for developing and deploying distributed R and Python-based machine learning solutions on a range of platforms across on-premises and cloud.

Satisfy security and compliance needs of any enterprise

ML Server offers best-in-class operationalization - from the time a machine learning model is completed, it takes just a few clicks to generate web services APIs. These web services are hosted on a server grid on-premises or in the cloud and can be integrated with line-of-business applications. Additionally, ML Server integrates seamlessly with Active Directory and Azure Active Directory and includes role-based access control to satisfy security and compliance needs of your enterprise.

ML Server has full support for the data science lifecycle of R and Python-based analytics.

Microsoft Machine Learning Server - Summary

Key benefits:

  • Built on a legacy of Microsoft R Server and Revolution R Enterprise
  • Advanced security options
  • Deploy R and Python models as web services

Considerations:

  • You need to deploy and manage Machine Learning Server in your enterprise

Azure Databricks

What is it?

Spark-based analytics platform

What can you do with it?

Build and deploy models and data workflows

Databricks provides a managed cloud platform built around Spark that delivers 1) fully managed Spark clusters, 2) an interactive workspace for exploration and visualization, 3) a production pipeline scheduler, and 4) a platform for powering your Spark-based applications.

The main concepts:

  • Databricks Runtime (Apache Spark, concurrent clusters, REST APIs, libraries)
  • Collaborative workspace (notebooks, user access, git integration)
  • Deploy Jobs & Workflows (job scheduler, notifications & logs, multi-stage pipelines)
  • Security (single sign-on (SSO), access control list (ACL), secrets via Azure Key Vault)

Azure Databricks, with the help of extra libraries and services, supports the complete machine learning cycle. The Databricks folks are working on a project called MLflow, an open source platform to manage the ML lifecycle, including experimentation, reproducibility, and deployment. It is still in beta, but it seems to be a great tool to simplify model development, management, run tracking and deployment.

Azure Databricks - Summary

Key benefits:

  • The most mature development environment for ML on the Azure platform
  • Integrated with other Azure services (e.g., Azure Data Factory, Azure Key Vault)

Considerations:

  • Online only
  • Expensive

Data Science Virtual Machine

What is it?

An Azure virtual machine with pre-installed data science tools

What can you do with it?

Develop ML solutions in a pre-configured environment

Data Science Virtual Machine (DSVM) is a pre-installed and pre-configured set of images for Windows or Linux virtual machines. DSVM includes the most popular data science tools. Since it has access to the full potential of Azure networking and scalability, DSVM can be a great environment even for data science teams.

Data Science Virtual Machine can be useful for learning and comparing different machine learning tools.

Data Science Virtual Machine - Summary

Key benefits:

  • The most complete development environment for ML on the Azure platform
  • Reduced time to install, manage, and troubleshoot data science tools and frameworks
  • Includes the latest versions of all commonly used tools and frameworks
  • Virtual machine options include scalable GPU images

Considerations:

  • Online only
  • Infrastructure as a service (IaaS), not a managed data science solution

After quite an extensive blog post, can you name the differences between Azure Machine Learning Service and Azure Machine Learning Studio, and between Microsoft Machine Learning Server and Data Science Virtual Machine?

Overall, it seems that Azure Databricks is the most powerful and mature service currently available in Azure. I have been using Databricks for over a year, with ups and downs, but it is a great data science tool. Azure Machine Learning Service might be a great framework if you know Python. Azure Machine Learning Studio can make people productive but has severe limitations. Data Science VM is a free-ride - install what you want, scale when you want.

Are you using Azure analytics products? What products do you use?