Azure: Announcing New Real-time Data Streaming and Data Factory Services

The last three weeks have been busy ones for Azure.  Two weeks ago we announced a partnership with Docker to enable great container-based development experiences on Linux, Windows Server and Microsoft Azure.

Last week we held our Cloud Day event and announced our new G-Series of Virtual Machines as well as Premium Storage offering.  The G-Series VMs provide the largest VM sizes available in the public cloud today (nearly 2x more memory than the largest AWS offering, and 4x more memory than the largest Google offering).  The new Premium Storage offering (which will work with both our D-series and G-series of VMs) will support up to 32TB of storage per VM, >50,000 IOPS of disk IO per VM, and enable sub-1ms read latency.  Combined they provide an enormous amount of power that enables you to run even bigger and better solutions in the cloud.

Earlier this week, we officially opened our new Azure Australia regions – which are our 18th and 19th Azure regions open for business around the world.  Then at TechEd Europe we announced another round of new features – including the launch of the new Azure MarketPlace, a bunch of great network improvements, our new Batch computing service, general availability of our Azure Automation service and more.

Today, I’m excited to blog about even more new services we have released this week in the Azure Data space.  These include:

  • Event Hubs: is a scalable service for ingesting and storing data from websites, client apps, and IoT sensors.
  • Stream Analytics: is a cost-effective event processing engine that helps uncover real-time insights from event streams.
  • Data Factory: enables better information production by orchestrating and managing diverse data and data movement.

Azure Event Hub is now available in general availability, and the new Azure Stream Analytics and Data Factory services are now in public preview.

Event Hubs: Log Millions of events per second in near real time

The Azure Event Hub service is a highly scalable telemetry ingestion service that can log millions of events per second in near real time.  You can use the Event Hub service to collect data/events from any IoT device, from any app (web, mobile, or a backend service), or via feeds like social networks.  We are using it internally within Microsoft to monitor some of our largest online systems.

Once you collect events with Event Hub you can then analyze the data using any real-time analytics system (like Apache Storm or our new Azure Stream Analytics service) and store/transform it into any data storage system (including HDInsight and Hadoop based solutions).

Event Hub is delivered as a managed service on Azure (meaning we run, scale and patch it for you and provide an enterprise SLA).  It delivers:

  • Ability to log millions of events per second in near real time
  • Elastic scaling support with the ability to scale-up/down with no interruption
  • Support for multiple protocols including support for HTTP and AMQP based events
  • Flexible authorization and throttling device policies
  • Time-based event buffering with event order preservation

The pricing model for Event Hubs is very flexible – for just $11/month you can provision a basic Event Hub with guaranteed performance capacity to capture 1 MB/sec of events sent to your Event Hub.  You can then provision as many additional capacity units as you need if your event traffic goes higher. 

Getting Started with Capturing Events

You can create a new Event Hub using the Azure Portal or via the command-line.  Choose New->App Service->Service Bus->Event Hub in the portal to do so:

image

Once created, events can be sent to an Event Hub with either a strongly-typed API (e.g. .NET or Java client library) or by just sending a raw HTTP or AMQP message to the service.  Below is a simple example of how easy it is to log an IoT event to an Event Hub using just a standard HTTP post request.  Notice the Authorization header in the HTTP post – you can use this to optionally enable flexible authentication/authorization for your devices:

POST https://your-namespace.servicebus.windows.net/your-event-hub/messages?timeout=60&api-version=2014-01 HTTP/1.1

Authorization: SharedAccessSignature sr=your-namespace.servicebus.windows.net&sig=tYu8qdH563Pc96Lky0SFs5PhbGnljF7mLYQwCZmk9M0%3d&se=1403736877&skn=RootManageSharedAccessKey

ContentType: application/atom+xml;type=entry;charset=utf-8

Host: your-namespace.servicebus.windows.net

Content-Length: 42

Expect: 100-continue

 

{ "DeviceId":"dev-01", "Temperature":"37.0" }

Your Event Hub can collect up to millions of messages per second like this, each storing whatever data schema you want within them, and the Event Hubs service will store them in-order for you to later read/consume.

Downstream Event Processing

Once you collect events, you no doubt want to do something with them.  Event Hubs includes an intelligent processing agent that allows for automatic partition management and load distribution across readers.  You can implement any logic you want within readers, and the data sent to the readers is delivered in the order it was sent to the Event Hub.

In addition to supporting the ability for you to write custom Event Readers, we also have two easy ways to work with pre-built stream processing systems: including our new Azure Stream Analytics Service and Apache Storm.  Our new Azure Stream Analytics service supports doing stream processing directly from Event Hubs, and Microsoft has created an Event Hubs Storm Spout for use with Apache Storm clusters.

The below diagram helps express some of the many rich ways you can use Event Hubs to collect and then hand-off events/data for processing:

image

Event Hubs provides a super flexible and cost effective building-block that you can use to collect and process any events or data you can stream to the cloud.  It is very cost effective, and provides the scalability you need to meet any needs.

Learning More about Event Hubs

For more information about Azure Event Hubs, please review the following resources:

Stream Analytics: Distributed Stream Processing Service for Azure

I’m excited to announce the preview our new Azure Stream Analytics service – a fully managed real-time distributed stream computation service that provides low latency, scalable processing of streaming data in the cloud with an enterprise grade SLA. The new Azure Stream Analytics service easily scales from small projects with just a few KB/sec of throughput to a gigabyte/sec or more of streamed data messages/events.  

Our Stream Analytics pricing model enable you to run low throughput streaming workloads continuously at low cost, and enables you to only have to scale up as your business needs increase.  We do this while maintaining built in guarantees of event delivery, and state management for fast recovery which enables mission critical business continuity.

Dramatically Simpler Developer Experience for Stream Processing Data

Stream Analytics supports a SQL-like language that dramatically lowers the bar of the developer expertise required to create a scalable stream processing solution. A developer can simply write a few lines of SQL to do common operations including basic filtering, temporal analysis operations, joining multiple live streams of data with other static data sources, and detecting stream patterns (or lack thereof).

This dramatically reduces the complexity and time it takes to develop, maintain and apply time-sensitive computations on real-time streams of data. Most other streaming solutions available today require you to write complex custom code, but with Azure Stream Analytics you can write simple, declarative and familiar SQL.

Fully Managed Service that is Easy to Setup

With Stream Analytics you can dramatically accelerate how quickly you can derive valuable real time insights and analytics on data from devices, sensors, infrastructure, or applications. With a few clicks in the Azure Portal, you can create a streaming pipeline, configure its inputs and outputs, and provide SQL-like queries to describe the desired stream transformations/analysis you wish to do on the data. Once running, you are able to monitor the scale/speed of your overall streaming pipeline and make adjustments to achieve the desired throughput and latency.

You can create a new Stream Analytics Job in the Azure Portal, by choosing New->Data Services->Stream Analytics:

image

Setup Streaming Data Input

Once created, your first step will be to add a Streaming Data Input.  This allows you to indicate where the data you want to perform stream processing on is coming from.  From within the portal you can choose Inputs->Add An Input to launch a wizard that enables you to specify this:

image

We can use the Azure Event Hub Service to deliver us a stream of data to perform processing on. If you already have an Event Hub created, you can choose it from a list populated in the wizard above.  You will also be asked to specify the format that is being used to serialize incoming event in the Event Hub (e.g. JSON, CSV or Avro formats).

Setup Output Location

The next step to developing our Stream Analytics job is to add a Streaming Output Location.  This will configure where we want the output results of our stream processing pipeline to go.  We can choose to easily output the results to Blob Storage, another Event Hub, or a SQL Database:

image

Note that being able to use another Event Hub as a target provides a powerful way to connect multiple streams into an overall pipeline with multiple steps.

Write Streaming Queries

Now that we have our input and output sources configured, we can now write SQL queries to transform, aggregate and/or correlate the incoming input (or set of inputs in the event of multiple input sources) and output them to our output target.  We can do this within the portal by selecting the QUERY tab at the top.

image

There are a number of interesting queries you can write to processing the incoming stream of data.  For example, in the previous Event Hub section of this blog post I showed how you can use an HTTP POST command to submit JSON based temperature data from an IoT device to an Event Hub with data in JSON format like so:

{ "DeviceId":"dev-01", "Temperature":"37.0" }

When multiple devices are streaming events simultaneously into our Event Hub like this, it would feed into our Stream Analytics job as a stream of continuous data events that look like the sequence below:

Wouldn’t it be interesting to be able to analyze this data using a time-window perspective instead?  For example, it would be useful to calculate in real-time what the average temperature of each device was in the last 5 seconds of multiple readings.

With the Stream Analytics Service we can now dynamically calculate this over our incoming live stream of data just by writing a SQL query like so:

Running this query in our Stream Analytics job will aggregate/transform our incoming stream of data events and output data like below into the output source we configured for our job (e,g, a blog storage file or a SQL Database):

The great thing about this approach is that the data is being aggregated/transformed in real time as events are being streamed to us, and it scales to handle literally gigabytes of data event streamed per second.

Scaling your Stream Analytics Job

Once defined, you can easily monitor the activity of your Stream Analytics Jobs in the Azure Portal:

image

You can use the SCALE tab to dynamically increase or decrease scale capacity for your stream processing – allowing you to pay only for the compute capacity you need, and enabling you to handle jobs with gigabytes/sec of streamed data. 

Learning More about Stream Analytics Service

For more information about Stream Analytics, please review the following resources:

Data Factory: Fully managed service to build and manage information production pipelines

Organizations are increasingly looking to fully leverage all of the data available to their business.  As they do so, the data processing landscape is becoming more diverse than ever before – data is being processed across geographic locations, on-premises and cloud, across a wide variety of data types and sources (SQL, NoSQL, Hadoop, etc), and the volume of data needing to be processed is increasing exponentially. Developers today are often left writing large amounts of custom logic to deliver an information production system that can manage and co-ordinate all of this data and processing work.

To help make this process simpler, I’m excited to announce the preview of our new Azure Data Factory service – a fully managed service that makes it easy to compose data storage, processing, and data movement services into streamlined, scalable & reliable data production pipelines. Once a pipeline is deployed, Data Factory enables easy monitoring and management of it, greatly reducing operational costs. 

Easy to Get Started

The Azure Data Factory is a fully managed service. Getting started with Data Factory is simple. With a few clicks in the Azure preview portal, or via our command line operations, a developer can create a new data factory and link it to data and processing resources.  From the new Azure Marketplace in the Azure Preview Portal, choose Data + Analytics –> Data Factory to create a new instance in Azure:

image

Orchestrating Information Production Pipelines across multiple data sources

Data Factory makes it easy to coordinate and manage data sources from a variety of locations – including ones both in the cloud and on-premises.  Support for working with data on-premises inside SQL Server, as well as Azure Blob, Tables, HDInsight Hadoop systems and SQL Databases is included in this week’s preview release. 

Access to on-premises data is supported through a data management gateway that allows for easy configuration and management of secure connections to your on-premises SQL Servers.  Data Factory balances the scale & agility provided by the cloud, Hadoop and non-relational platforms, with the management & monitoring that enterprise systems require to enable information production in a hybrid environment.

Custom Data Processing Activities using Hive, Pig and C#

This week’s preview enables data processing using Hive, Pig and custom C# code activities.  Data Factory activities can be used to clean data, anonymize/mask critical data fields, and transform the data in a wide variety of complex ways.

The Hive and Pig activities can be run on an HDInsight cluster you create, or alternatively you can allow Data Factory to fully manage the Hadoop cluster lifecycle on your behalf.  Simply author your activities, combine them into a pipeline, set an execution schedule and you’re done – no manual Hadoop cluster setup or management required. 

Built-in Information Production Monitoring and Dashboarding

Data Factory also offers an up-to-the moment monitoring dashboard, which means you can deploy your data pipelines and immediately begin to view them as part of your monitoring dashboard.  Once you have created and deployed pipelines to your Data Factory you can quickly assess end-to-end data pipeline health, pinpoint issues, and take corrective action as needed.

Within the Azure Preview Portal, you get a visual layout of all of your pipelines and data inputs and outputs. You can see all the relationships and dependencies of your data pipelines across all of your sources so you always know where data is coming from and where it is going at a glance. We also provide you with a historical accounting of job execution, data production status, and system health in a single monitoring dashboard:

image

Learning More about Stream Analytics Service

For more information about Data Factory, please review the following resources:

Other Great Data Improvements

Today’s releases make it even easier for customers to stream, process and manage the movement of data in the cloud.  Over the last few months we’ve released a bunch of other great data updates as well that make Azure a great platform to perform any data needs.  Since August: 

We released a major update of our SQL Database service, which is a relational database as a service offering.  The new SQL DB editions (Basic/Standard/Premium ) support a 99.99% SLA, larger database sizes, dedicated performance guarantees, point-in-time recovery, new auditing features, and the ability to easily setup active geo-DR support. 

We released a preview of our new DocumentDB service, which is a fully-managed, highly-scalable, NoSQL Document Database service that supports saving and querying JSON based data.  It enables you to linearly scale your document store and scale to any application size.  Microsoft MSN portal recently was rewritten to use it – and stores more than 20TB of data within it.

We released our new Redis Cache service, which is a secure/dedicated Redis cache offering, managed as a service by Microsoft.  Redis is a popular open-source solution that enables high-performance data types, and our Redis Cache service enables you to standup an in-memory cache that can make the performance of any application much faster.

We released major updates to our HDInsight Hadoop service, which is a 100% Apache Hadoop-based service in the cloud. We have also added built-in support for using two popular frameworks in the Hadoop ecosystem: Apache HBase and Apache Storm.

We released a preview of our new Search-As-A-Service offering, which provides a managed search offering based on ElasticSearch that you can easily integrate into any Web or Mobile Application.  It enables you to build search experiences over any data your application uses (including data in SQLDB, DocDB, Hadoop and more).

And we have released a preview of our Machine Learning service, which provides a powerful cloud-based predictive analytics service.  It is designed for both new and experienced data scientists, includes 100s of algorithms from both the open source world and Microsoft Research, and supports writing ML solutions using the popular R open-source language.

You’ll continue to see major data improvements in the months ahead – we have an exciting roadmap of improvements ahead.

Summary

Today’s Microsoft Azure release enables some great new data scenarios, and makes building applications that work with data in the cloud even easier.

If you don’t already have a Azure account, you can sign-up for a free trial and start using all of the above features today.  Then visit the Microsoft Azure Developer Center to learn more about how to build apps with it.

Hope this helps,

Scott

P.S. In addition to blogging, I am also now using Twitter for quick updates and to share links. Follow me at: twitter.com/scottgu

9 Comments

  • "log millions of events per second" - this could change everything.

  • Azure Stream Analytics but no Linq neither no Rx support? Quite strange to query only with SQL like syntax...

  • I'm glad to see Azure branching out to incorporate technologies like NoSQL and others. All of these offerings are awesome! Quite an impressive set of tools.

  • Looks almost like an Event Store as a service...
    I really like the direction Microsoft is going. This underlines perfectly "mobile first, cloud first"

  • We are really helpful to you, for adding such a great features to the Azure!

  • This is a lot of information to process in one go, but I definitely like the sound of “log millions of events per sec”. Would have liked some greater insight into Hive.

  • Thanks For Your valuable posting, it was very informative. Am working in<a href="http://www.excelanto.com"> Erp Software Company In India</a>

  • Hi, we are trying to use Event Hubs form Apache Spark Streaming using Scala language and Java JMS client library.

    We were not able to start reading the stream from a given position using the AMQP 1.0 JMS client library. It always starts from the first message.

    Is it possible to start reading from a given offset with the JMS library?

    Thanks in advance.-

  • You need to use the messaging annotation to get the offset. The JIRA for this capability is here: https://issues.apache.org/jira/browse/QPID-6065 and it is resolved. When you connect you will need to specify an offset so that you can start reading at that point in the stream again. A good overview is on this blog: http://www.mikelanzetta.com/2014/11/talking-to-eventhub-with-amqp-1-0/

Comments have been disabled for this content.