You might associate the term “data streaming” with a service like Netflix. But data streaming has a much broader set of applications.

For example, a smart building system uses data streaming to determine which building sections have occupants and adjust A/C settings in real-time. A high-precision digital microscope can stream what it “sees” directly to a research computer. You’ll also find data streaming in modern household appliances and devices, like automated vacuum cleaners and outdoor security cameras.

What exactly is data streaming and why do we need it? Keep reading as we go over data streaming basics, key applications, and some tools you can use in your own data streaming projects. 

Data Streaming Basics

Data streaming is the continuous transmission of data from a source to a destination. With streaming, data sources send data frequently, sometimes multiple times per second, and in small quantities. Contrast that with the more traditional batch processing, where operations run infrequently and transmit larger amounts of data each time.
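
To make the contrast concrete, here’s a minimal, framework-agnostic Python sketch. The event source is simulated purely for illustration; the code computes the same average twice, once as a batch over all the data and once incrementally as each record arrives:

```python
# Illustrative only: a simulated event source, not a real streaming framework.
import time
import random

def event_source(n):
    """Simulate a source that emits one small reading at a time."""
    for _ in range(n):
        yield {"timestamp": time.time(), "value": random.random()}

# Batch processing: collect everything first, then run one large computation.
batch = list(event_source(1000))
print(f"batch average: {sum(e['value'] for e in batch) / len(batch):.3f}")

# Stream processing: update the result incrementally as each record arrives,
# so an up-to-date answer is available at any moment with low latency.
count, running_sum = 0, 0.0
for event in event_source(1000):
    count += 1
    running_sum += event["value"]
print(f"streaming average: {running_sum / count:.3f}")
```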

Let’s return to the Netflix example for a second. The streaming service revolutionized the entertainment industry by having users download small chunks of a movie while viewing it (streaming), instead of having them download an entire movie in advance (traditional batch processing). This saves the user disk space, and if they don’t like the movie, they haven’t wasted much bandwidth.

Because streaming is continuous, data sources can adjust their streams in near real-time. For example, Netflix has created a range of innovations that improve stream quality on slow connections and in busy networks. The company also collects a large amount of data from each stream and uses it to improve its content.

Data streams are not limited to video. Here are a few non-video applications of data streaming:

  • Storing logs generated in web or mobile applications for easy search and future analysis;
  • Information about online purchases, including the time taken to complete an order, browsing habits, and the route a customer follows through the website;
  • Social media actions such as likes, views, new posts, and messages; and,
  • Navigation data: position of a vehicle at a given time and its progress against its planned route.

Compared to batch processing, however, data streaming is more resource-intensive: streaming one movie over, say, ten thousand separate HTTP requests carries far more overhead than downloading it in a single request. Streaming trades this inefficiency for convenience and the added value of having data available quickly.

The table below compares batch processing to data streaming side-by-side:

|             | Batch Processing                                | Data Streaming                                                                      |
|-------------|-------------------------------------------------|-------------------------------------------------------------------------------------|
| Data Scope  | Processing of all or most of the available data | Processing of data within a restricted time window, or of only the most recent data |
| Data Size   | Large batches                                   | Individual records or very small batches                                            |
| Performance | Minutes to hours of latency                     | Milliseconds to seconds of latency                                                   |

As a data scientist or data engineer, you’ll find that working with streams differs from batch processing projects. For example, you may need specialized hardware such as GPUs, TPUs, or FPGAs to process large volumes of streaming data with low latency. You’ll also need a solid understanding of how programs handle input/output and interact with networks, as well as how to use dedicated data streaming tools.

Data Streaming Applications  

Data streaming has found its place across a range of industries. Common data streaming applications include entertainment, transport & industry, financial markets, solar energy, and multiplayer gaming—just to name a few. 

Entertainment

Beyond what we already know about Netflix, video entertainment services use data streaming to analyze what users are watching, how long it takes them to finish a show, and which moments most capture their attention. This information allows entertainment services to rapidly adapt their offerings to the tastes or habits of a particular audience. Companies like Netflix can also detect patterns in streams that may expose a technical issue. Based on slow or disconnecting streams, Netflix can often detect an issue with a particular Internet service provider sooner than the ISP itself!

Transport & Industry

Sensors in transport vehicles, industrial equipment, and agricultural machinery continuously send information about their status and performance. Data analysis tools monitor machine performance, detect faults, and order spare parts as quickly as possible. The amount of data the equipment generates lets analysts and engineers make precise predictions of work capacity and timelines.
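
As an illustration of the monitoring side, here is a small Python sketch in which the sensor readings and the temperature threshold are invented for the example; it flags a fault as soon as a reading in the stream crosses a limit:

```python
# Hypothetical example: simulated temperature readings and an arbitrary threshold.
import random

FAULT_THRESHOLD = 90.0  # degrees Celsius; an assumed limit for this sketch

def sensor_readings(n):
    """Simulate a stream of temperature readings from one machine."""
    for i in range(n):
        yield {"reading_id": i, "temperature": random.gauss(70.0, 12.0)}

for reading in sensor_readings(500):
    # Each reading is checked the moment it arrives, not in a nightly batch,
    # so a fault can trigger an alert (or a spare-part order) within seconds.
    if reading["temperature"] > FAULT_THRESHOLD:
        print(f"fault detected in reading {reading['reading_id']}: "
              f"{reading['temperature']:.1f} °C")
```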

Financial Markets

In the financial markets, transactions happen on the scale of milliseconds. Automated algorithms execute trades so quickly that it’s impossible for humans to keep up, at least at the transaction level. These algorithms analyze streams of data in near real-time, from price movements to market sentiment, and try to predict a market’s behavior to gain an edge over the competition.
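
One classic streaming computation in this space is a moving average over the most recent price ticks. The sketch below is only illustrative: the prices are simulated, and a real system would read them from a market data feed:

```python
# Illustrative only: a windowed aggregate computed over a simulated price stream.
from collections import deque
import random

WINDOW_SIZE = 20  # number of most recent ticks to average over

def price_ticks(n):
    """Simulate a stream of price updates."""
    price = 100.0
    for _ in range(n):
        price += random.uniform(-0.5, 0.5)
        yield price

window = deque(maxlen=WINDOW_SIZE)
for price in price_ticks(200):
    window.append(price)
    moving_average = sum(window) / len(window)
    # A trading algorithm could compare the latest price to the moving
    # average here and react within milliseconds of the new tick.
    print(f"price={price:.2f} moving_avg={moving_average:.2f}")
```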

Solar Energy

Solar energy companies need to keep their equipment in working order, so they frequently build applications that monitor every panel in their network and schedule repair and maintenance tasks in real-time. Monitoring the “health” of individual solar panels helps maximize energy production and reduce operational losses.

Multiplayer Games

Multiplayer gaming companies collect streams of data that describe interactions between players and game elements. They then analyze the data in real-time to provide a dynamic experience, adapting the digital world to each player’s style in ways that would not be possible in statically-designed games.

Data Streaming Tools To Consider

To get the most out of data streaming, you’ll need tools that will help you capture and analyze the data, and extract useful information from it. In this section, we cover tools designed to facilitate your data streaming projects.

Apache Spark

Apache Spark is an open-source data processing engine based on cluster computing: it pools the resources of many machines to enable large-scale analysis. Spark includes a SQL query engine, stream processing libraries (Spark Streaming and the newer Structured Streaming), a graph processing system (GraphX), and a machine learning library (MLlib). It can perform both real-time and batch processing.
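
As a hedged example, here is a minimal PySpark Structured Streaming job adapted from the classic word-count pattern. It assumes PySpark is installed and that a text source is listening on localhost:9999 (for instance, started with nc -lk 9999); adjust the host and port for your setup:

```python
# A minimal Structured Streaming sketch: count words arriving over a socket.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("StreamingWordCount").getOrCreate()

# Read a continuous stream of text lines from a socket source.
lines = spark.readStream.format("socket") \
    .option("host", "localhost").option("port", 9999).load()

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
word_counts = words.groupBy("word").count()

# Continuously print the updated counts to the console as new data arrives.
query = word_counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```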

Apache Kafka

Apache Kafka is an open-source event streaming platform commonly used to collect, process, and store continuous streams of event data or data that has no precise beginning or end. Kafka has excellent throughput and features like built-in partitioning, replication, and fault tolerance. This makes it a widely-used solution for large-scale message processing and streaming applications.
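
For a flavor of how an application talks to Kafka, here is a minimal sketch using the kafka-python client (one of several available client libraries). The broker address and the topic name, “events”, are assumptions for the example:

```python
# Produce one JSON event to a Kafka topic, then consume events from it.
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: serialize each event as JSON and send it to the "events" topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("events", {"user_id": 42, "action": "page_view"})
producer.flush()

# Consumer: read events from the same topic and process them as they arrive.
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # each record is handled as soon as it is read
```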

Amazon Kinesis

Amazon Kinesis is a fully-managed service for processing real-time streaming data at massive scale. You can configure hundreds of thousands of data producers to add streaming data to an Amazon Kinesis stream. This can include data from website clickstreams, application logs, and social network feeds. In less than a second, the data is available to your Amazon Kinesis applications, which can read and process it from the stream.
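
The sketch below shows one way to write to and read from a Kinesis data stream with boto3, the AWS SDK for Python. The stream name, shard ID, and region are placeholders: the stream must already exist and AWS credentials must be configured.

```python
# Put a record into a Kinesis stream, then read records back from one shard.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

# A producer adds a record to the stream; the partition key controls sharding.
kinesis.put_record(
    StreamName="example-stream",
    Data=json.dumps({"user_id": 42, "action": "click"}).encode("utf-8"),
    PartitionKey="42",
)

# A consumer reads records from one shard using a shard iterator.
shard_iterator = kinesis.get_shard_iterator(
    StreamName="example-stream",
    ShardId="shardId-000000000000",
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

records = kinesis.get_records(ShardIterator=shard_iterator, Limit=10)
for record in records["Records"]:
    print(json.loads(record["Data"]))
```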

Master Data Streaming With Udacity

Data engineers no longer rely on static data alone. Data flows constantly from many sources and at high speed. Knowing how to manipulate data from data streams is an in-demand skill.

Udacity’s Data Streaming Nanodegree allows you to get real-world experience with data streaming applications. By learning more about data streaming, you’ll be at the forefront of modern data engineering. Enroll today!