Announcing General Availability of HDInsight on Linux + new Data Lake Services and Language

Monday, September 28, 2015

Today, I’m happy to announce several key additions to our big data services in Azure, including the General Availability of HDInsight on Linux, as well as the introduction of our new Azure Data Lake and Language services.

General Availability of HDInsight on Linux

Today we are announcing general availability of our HDInsight service on Ubuntu Linux. HDInsight enables you to easily run managed Hadoop clusters in the cloud. With today’s release we now allow you to configure these clusters to run using both a Windows Server Operating System as well as an Ubuntu based Linux Operating System.

HDInsight on Linux enables even broader support for Hadoop ecosystem partners to run in HDInsight providing you even greater choice of preferred tools and applications for running Hadoop workloads. Both Linux and Windows clusters in HDInsight are built on the same standard Hadoop distribution and offer the same set of rich capabilities.

Today’s new release also enables additional capabilities, such as, cluster scaling, virtual network integration and script action support. Furthermore, in addition to Hadoop cluster type, you can now create HBase and Storm clusters on Linux for your NoSQL and real time processing needs such as building an IoT application.

Create a cluster

HDInsight clusters running using Linux can now be easily created from the Azure Management portal under the Data + Analytics section. Simply select Ubuntu from the cluster operating system drop-down, as well as optionally choose the cluster type you wish to create (we support base Hadoop as well as clusters pre-configured for workloads like Storm, Spark, HBase, etc).

All HDInsight Linux clusters can be managed by Apache Ambari. Ambari provides the ability to customize configuration settings of your Hadoop cluster while giving you a unified view of the performance and state of your cluster and providing monitoring and alerting within the HDInsight cluster.

Installing additional applications and Hadoop components

Similar to HDInsight Windows clusters, you can now customize your Linux cluster by installing additional applications or Hadoop components that are not part of default HDInsight deployment. This can be accomplished using Bash scripts with script action capability. As an example, you can now install Hue on an HDInsight Linux cluster and easily use it with your workloads:

Develop using Familiar Tools

All HDInsight Linux clusters come with SSH connectivity enabled by default. You can connect to the cluster via a SSH client of your choice. Moreover, SSH tunneling can be leveraged to remotely access all of the Hadoop web applications from the browser.

New Azure Data Lake Services and Language

We continue to see customers enabling amazing scenarios with big data in Azure including analyzing social graphs to increase charitable giving, analyzing radiation exposure and using the signals from thousands of devices to simulate ways for utility customers to optimize their monthly bills. These and other use cases are resulting in even more data being collected in Azure. In order to be able to dive deep into all of this data, and process it in different ways, you can now use our Azure Data Lake capabilities – which are 3 services that make big data easy.

The first service in the family is available today: Azure HDInsight, our managed Hadoop service that lets you focus on finding insights, and not spend your time having to manage clusters. HDInsight lets you deploy Hadoop, Spark, Storm and HBase clusters, running on Linux or Windows, managed, monitored and supported by Microsoft with a 99.9% SLA.

The other two services, Azure Data Lake Store and Azure Data Lake Analytics introduced below, are available in private preview today and will be available broadly for public usage shortly.

Azure Data Lake Store

Azure Data Lake Store is a hyper-scale HDFS repository designed specifically for big data analytics workloads in the cloud. Azure Data Lake Store solves the big data challenges of volume, variety, and velocity by enabling you to store data of any type, at any size, and process it at any scale. Azure Data Lake Store can support near real-time scenarios such as the Internet of Things (IoT) as well as throughput-intensive analytics on huge data volumes. The Azure Data Lake Store also supports a variety of computation workloads by removing many of the restrictions constraining traditional analytics infrastructure like the pre-definition of schema and the creation of multiple data silos. Once located in the Azure Data Lake Store, Hadoop-based engines such as Azure HDInsight can easily mine the data to discover new insights.

Some of the key capabilities of Azure Data Lake Store include:

Any Data: A distributed file store that allows you to store data in its native format, Azure Data Lake Store eliminates the need to transform or pre-define schema in order to store data.
Any Size: With no fixed limits to file or account sizes, Azure Data Lake Store enables you to store kilobytes to exabytes with immediate read/write access.
At Any Scale: You can scale throughput to meet the demands of your analytic systems including the high throughput needed to analyze exabytes of data. In addition, it is built to handle high volumes of small writes at low latency making it optimal for near real-time scenarios like website analytics, and Internet of Things (IoT).
HDFS Compatible: It works out-of-the-box with the Hadoop ecosystem including other Azure Data Lake services such as HDInsight.
Fully Integrated with Azure Active Directory: Azure Data Lake Store is integrated with Azure Active Directory for identity and access management over all of your data.

Azure Data Lake Analytics with U-SQL

The new Azure Data Lake Analytics service makes it much easier to create and manage big data jobs. Built on YARN and years of experience running analytics pipelines for Office 365, XBox Live, Windows and Bing, the Azure Data Lake Analytics service is the most productive way to get insights from big data. You can get started in the Azure management portal, querying across data in blobs, Azure Data Lake Store, and Azure SQL DB. By simply moving a slider, you can scale up as much computing power as you’d like to run your data transformation jobs.

Today we are introducing a new U-SQL offering in the analytics service, an evolution of the familiar syntax of SQL. U-SQL allows you to write declarative big data jobs, as well as easily include your own user code as part of those jobs. Inside Microsoft, developers have been using this combination in order to be productive operating on massive data sets of many exabytes of scale, processing mission critical data pipelines. In addition to providing an easy to use experience in the Azure management portal, we are delivering a rich set of tools in Visual Studio for debugging and optimizing your U-SQL jobs. This lets you play back and analyze your big data jobs, understanding bottlenecks and opportunities to improve both performance and efficiency, so that you can pay only for the resources you need and continually tune your operations.

Learn More

For more information and to get started, check out the following links:

Hope this helps,

Scott

This is great news! One question...if configuration changes are made via Ambari, are they persisted if a node is re-imaged due to maintenance reasons?

Chris Campbell - Tuesday, September 29, 2015 10:23:05 PM

Yes, configuration changes made via Ambari are persisted through the VM life cycle including maintenance events.

Adnan Ijaz - Wednesday, September 30, 2015 1:28:42 AM

I'm new on this, can you give tutorial about installing HDInsight ? Thank a lot. I'm waiting :)

Sewa Elf Surabaya - Friday, October 30, 2015 4:39:38 PM

General Availability of HDInsight on Linux

New Azure Data Lake Services and Language

Azure Data Lake Store

Azure Data Lake Analytics with U-SQL

Learn More

3 Comments