Platforms and tools for data science projects

Microsoft offers a comprehensive range of analytics resources for both cloud and on-premises platforms that can improve the efficiency and scalability of your data science projects. The Team Data Science Process (TDSP) is a framework for teams that want to implement data science projects collaboratively, with traceability and version control. For an overview of the key personnel roles, and the tasks assigned to each role, on a data science team that standardizes on this process, see Roles and tasks in the Team Data Science Process.

The following analysis resources are available for data science teams using the TDSP:

  • Data Science VMs (Windows and Linux CentOS)
  • HDInsight Spark cluster
  • Azure Synapse Analytics
  • Azure Data Lake
  • HDInsight Hive cluster
  • Azure File Storage
  • SQL Server 2019 R and Python Services
  • Azure Databricks

This document briefly describes these resources and links to the tutorials and walkthroughs that TDSP teams have published, which explain in detail how to use the resources to build intelligent applications. For more information about each resource, see its product page.

Data Science VM (DSVM)

The Data Science VM is offered by Microsoft for both Windows and Linux and contains popular tools for data science modeling and development activities. These tools include the following:

  • Microsoft R Server Developer Edition
  • Anaconda Python Distribution
  • Jupyter notebooks for Python and R
  • Visual Studio Community Edition with Python and R tools (Windows) / Eclipse (Linux)
  • Power BI Desktop for Windows
  • SQL Server 2016 Developer Edition (Windows) / Postgres (Linux)

It also includes machine learning and AI tools such as XGBoost, MXNet, and Vowpal Wabbit.

The Data Science VM is currently available for Windows and for Linux (CentOS). Choose the size of your Data Science VM (number of CPU cores and amount of memory) based on the requirements of the data science projects you plan to run on it.

For more information on the Windows edition of the Data Science VM, see Microsoft Data Science Virtual Machine in the Azure Marketplace. For the Linux edition of the Data Science VM, see Data Science Virtual Machine for Linux (CentOS).

To learn how to efficiently perform some common data science tasks on the Data Science VM, see Ten Things You Can Do With the Data Science Virtual Machine.

Azure HDInsight Spark cluster

Apache Spark is an open-source parallel processing framework that supports in-memory processing to boost the performance of analytics applications on large volumes of data. The Spark processing engine is built for speed, ease of use, and sophisticated analytics. Spark's in-memory computation capabilities make it especially well suited for iterative machine learning algorithms and for graph computations. Spark is also compatible with Azure Blob Storage (WASB), so existing data stored in Azure can easily be processed with Spark.

When you create a Spark cluster in HDInsight, Azure compute resources with Spark installed and configured are provisioned for you. It takes about ten minutes to create a Spark cluster in HDInsight. Store the data to be processed in Azure Blob Storage. For information about using Azure Blob Storage with a cluster, see Use Azure Storage with Azure HDInsight clusters.
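
For example, once the cluster is running, data in Blob storage can be loaded directly into a Spark DataFrame. The following PySpark sketch uses a hypothetical storage account, container, and CSV file:

    # Minimal PySpark sketch: load a CSV file from Azure Blob Storage (WASB)
    # into a Spark DataFrame. Account, container, path, and column names are
    # hypothetical placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wasb-example").getOrCreate()

    # URI scheme: wasbs://<container>@<account>.blob.core.windows.net/<path>
    df = spark.read.csv(
        "wasbs://mycontainer@myaccount.blob.core.windows.net/data/trips.csv",
        header=True,
        inferSchema=True,
    )
    df.groupBy("passenger_count").count().show()  # quick sanity check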

Microsoft's TDSP team has published two comprehensive walkthroughs for building data science solutions with Azure HDInsight Spark clusters: one in Python and one in Scala. For more information about Spark clusters in Azure HDInsight, see Introduction to Spark in HDInsight. To learn how to build a data science solution in Python on an Azure HDInsight Spark cluster, see the overview of Data Science with Spark in Azure HDInsight. To learn how to build a data science solution in Scala on an Azure HDInsight Spark cluster, see Data Science Using Scala and Spark on Azure.

Azure Synapse Analytics

With Azure Synapse Analytics, you can provision compute resources quickly and easily, without over-provisioning or over-paying. It also offers the unique option to pause compute resources, giving you better control over your cloud costs. With scalable compute resources at your disposal, you can bring all of your data into Azure Synapse Analytics. Storage costs are minimal, and you can run compute only against the portions of the datasets that you want to analyze.

For more information on Azure Synapse Analytics, see the Azure Synapse Analytics website. For information on building comprehensive advanced analytics solutions with Azure Synapse Analytics, see The Team Data Science Process in Action: Using Azure Synapse Analytics.

Azure Data Lake

Azure Data Lake is an enterprise-wide repository for every type of data, collected in a single location prior to any formal definition of requirements or schemas. This flexibility allows every type of data to be kept in a data lake, regardless of its size or structure or how fast it is ingested. Organizations can then use Hadoop or advanced analytics to find patterns in these data lakes. Data lakes can also serve as a repository for lower-cost data preparation before the data is curated and moved into a data warehouse.

For more information about Azure Data Lake, see Introducing Azure Data Lake. To learn how to build a scalable, end-to-end data science solution with Azure Data Lake, see Scalable Data Science in Azure Data Lake: An end-to-end walkthrough.

Azure HDInsight Hive cluster (Hadoop)

Apache Hive is a data warehouse system for Hadoop that enables data summarization, querying, and analysis using HiveQL, a query language similar to SQL. Hive can be used to interactively explore data or to create reusable batch processing jobs.

Hive lets you project structure onto largely unstructured data. After you define the structure, you can use Hive to query that data in a Hadoop cluster without having to use Java or MapReduce. HiveQL (the Hive query language) lets you write queries with statements that are similar to T-SQL.

Data scientists can run custom Python user-defined functions (UDFs) in Hive queries to process records, which considerably extends the capabilities of Hive queries for data analysis. In particular, data scientists can engineer scalable features in the two languages they are most familiar with: the SQL-like HiveQL and Python.
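
For instance, Hive can stream rows through a Python script with the TRANSFORM clause. In the sketch below, the table and columns are hypothetical; Hive passes rows to the script as tab-separated text on stdin and reads the transformed rows back from stdout:

    # hive_udf.py - hypothetical Python UDF for Hive streaming.
    # Invoked from HiveQL, for example:
    #   ADD FILE hive_udf.py;
    #   SELECT TRANSFORM (pickup_datetime, trip_distance)
    #   USING 'python hive_udf.py' AS (pickup_hour, trip_distance)
    #   FROM trips;
    import sys

    for line in sys.stdin:
        pickup_datetime, trip_distance = line.strip().split("\t")
        # "2013-01-01 08:15:00" -> "08"
        pickup_hour = pickup_datetime.split(" ")[1].split(":")[0]
        print("\t".join([pickup_hour, trip_distance]))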

For more information about Azure HDInsight Hive clusters, see What are Apache Hive and HiveQL in Azure HDInsight?. For information on building a comprehensive, scalable data science solution using Azure HDInsight Hive clusters, see The Team Data Science Process in Action: Using Azure HDInsight Hadoop Clusters.

Azure File Storage

The Azure File Storage service provides file shares in the cloud using the standard Server Message Block (SMB) protocol; both SMB 2.1 and SMB 3.0 are supported. With Azure File Storage, you can quickly migrate legacy applications that rely on file shares to Azure without costly rewrites. Applications running on Azure virtual machines, in cloud services, or on on-premises clients can mount a file share in the cloud just as a desktop application mounts a typical SMB share. Any number of application components can then mount and access the File Storage share simultaneously.

Particularly useful for data science projects is the ability to create an Azure file store for sharing project data with the members of the project team, which gives each member access to the same copy of the data. The file store can also be used to share feature sets generated during project execution. If the project is a client engagement, your clients can create an Azure file store under their own Azure subscription to share the project data and features with you, giving them full control over the project's data assets. For more information about Azure File Storage, see Developing for Azure Files with .NET and Using Azure Files with Linux.
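
As an illustration of programmatic access, the following sketch uses the azure-storage-file-share Python package to download a shared file; the connection string, share name, and file path are placeholders:

    # Minimal sketch with the azure-storage-file-share package
    # (pip install azure-storage-file-share). The connection string,
    # share name, and file path are hypothetical placeholders.
    from azure.storage.fileshare import ShareFileClient

    file_client = ShareFileClient.from_connection_string(
        conn_str="<storage-connection-string>",
        share_name="projectdata",
        file_path="features/training_features.csv",
    )

    # Download the shared file to a local copy.
    with open("training_features.csv", "wb") as local_file:
        local_file.write(file_client.download_file().readall())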

SQL Server 2019 R and Python Services

R Services (in-database) provides a platform for developing and deploying intelligent applications that uncover new insights. You can use the rich and powerful R language, including the many packages provided by the R community, to create models and generate predictions from your SQL Server data. Because R Services (in-database) integrates the R language with SQL Server, analytics are kept close to the data, which eliminates the costs and security risks associated with moving data.

R Services (in-database) supports the open-source R language with a comprehensive set of SQL Server tools and technologies. You get the benefits of superior performance, security, reliability, and manageability, and you can deploy R solutions using convenient, familiar tools. Your production applications can call the R runtime and use Transact-SQL to retrieve predictions and visuals. You can also use the ScaleR libraries to improve the scale and performance of your R solutions. For more information, see SQL Server R Services.
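
To illustrate the calling pattern, the following Transact-SQL sketch invokes the in-database Python runtime (available alongside R in SQL Server 2019) through sp_execute_external_script; the table and column names are hypothetical, and external scripts must be enabled on the instance:

    -- Minimal sketch: call the in-database Python runtime via
    -- sp_execute_external_script. InputDataSet/OutputDataSet are the
    -- default data frame names; dbo.nyctaxi_sample is a made-up table.
    EXECUTE sp_execute_external_script
        @language = N'Python',
        @script = N'OutputDataSet = InputDataSet.mean().to_frame().T',
        @input_data_1 = N'SELECT fare_amount FROM dbo.nyctaxi_sample'
    WITH RESULT SETS ((avg_fare FLOAT));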

Microsoft's TDSP team has published two comprehensive walkthroughs for building data science solutions with SQL Server 2016 R Services: one for R programmers and one for SQL developers. The end-to-end data science walkthrough for R programmers is available here; the in-database advanced analytics tutorial for SQL developers is available here.

Appendix: Tools for setting up data science projects

Install Git Credential Manager on Windows

If you're running the TDSP on Windows, you need to install Git Credential Manager (GCM) to communicate with Git repositories. To install GCM, first install Chocolatey. To install Chocolatey and GCM, run the following commands in Windows PowerShell as administrator:
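
    # Run in an elevated (administrator) PowerShell session.
    # Sketch only: verify the current GCM package name on Chocolatey first.
    iwr https://chocolatey.org/install.ps1 -UseBasicParsing | iex
    choco install git-credential-manager-for-windows -y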

Installing Git on Computers Running Linux (CentOS)

Run the following bash command to install Git on computers running Linux (CentOS):
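
    # Install Git from the CentOS package repositories.
    sudo yum install git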

Generate SSH public keys on computers running Linux (CentOS)

If you are running the Git commands on a Linux (CentOS) machine, you need to add your machine's SSH public key to Azure DevOps Services so that the machine is recognized by Azure DevOps Services. First generate an SSH public key, and then add it to the list of SSH public keys on your Azure DevOps Services security settings page.

  1. To generate the SSH key, run the following two commands:
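
     # Generate an RSA key pair (accept the defaults), then print the
     # public key so that it can be copied.
     ssh-keygen -t rsa
     cat ~/.ssh/id_rsa.pub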

  2. Copy the entire SSH key, including the leading ssh-rsa prefix.

  3. Sign in to Azure DevOps Services.

  4. At the top right of the page, click <Your Name>, and then click Security.

  5. Click SSH public keys, and then click + Add.

  6. Paste the copied SSH key into the text box and save it.

Next Steps

Complete end-to-end walkthroughs that demonstrate all the steps in the process for specific scenarios are also available. They're listed, with links and thumbnail descriptions, in the Walkthroughs topic. The walkthroughs show how to combine cloud and on-premises tools and services into a workflow or pipeline to create an intelligent application.

For examples that show how to execute the steps in the Team Data Science Process with Azure Machine Learning Studio (classic), see the With Azure ML learning path.