March 28, 2023

How Does Data Locality Affect Real-Time Streaming Applications?

Ajay Govindarajan

Companies need to be nimble. With the rapid pace of change we experience today, quarterly — or even annual — planning doesn’t cut it. Competitive organizations need the ability to adjust by the day, hour, and even minute and second to accommodate market shifts.

Data locality is a concept making this business agility possible.

Data locality describes the practice of bringing computation closer to where your data resides within your network. In this post, we’ll explain why data locality is a key factor in empowering modern streaming applications and modern businesses.

Why does data locality matter to modern businesses?

When it comes to building business applications, organizations want applications with lower latency and faster performance. This is often easier said than done since processing huge amounts of data can lead to network congestion. It’s this congestion that causes a lag in obtaining real-time insights.

Streaming data is created continuously, and storing high volumes of raw stream logs prior to processing requires significant storage and bandwidth resources.

Effective data locality keeps data that is commonly accessed together on the same computing node, so an application can read it from memory or local disk instead of making a network request.

Fetching data over a network is roughly 100 times slower than reading it locally, and that assumes the data is served from memory on the remote node. If the remote node cannot serve it from memory, the gap grows to roughly 10,000 times per transmission, and there could be tens of thousands of such requests occurring at a given time.
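To make those ratios concrete, here is a rough back-of-the-envelope sketch. The 100x and 10,000x factors come from the paragraph above; the 100 ns local baseline and the 50,000 request count are assumptions chosen only for illustration, not measurements.

```python
# Relative cost factors quoted in the text above.
REMOTE_MEMORY_FACTOR = 100      # remote node serves the data from memory
REMOTE_DISK_FACTOR = 10_000     # remote node cannot serve it from memory

# Hypothetical baseline: one local in-memory access, in nanoseconds.
LOCAL_ACCESS_NS = 100


def total_time_ms(accesses: int, factor: int = 1) -> float:
    """Total time in milliseconds for `accesses` sequential reads."""
    return accesses * LOCAL_ACCESS_NS * factor / 1_000_000


accesses = 50_000  # assumed: "tens of thousands" of requests

local_ms = total_time_ms(accesses)
remote_memory_ms = total_time_ms(accesses, REMOTE_MEMORY_FACTOR)
remote_disk_ms = total_time_ms(accesses, REMOTE_DISK_FACTOR)

print(f"local:         {local_ms:>10.1f} ms")
print(f"remote memory: {remote_memory_ms:>10.1f} ms")
print(f"remote disk:   {remote_disk_ms:>10.1f} ms")
```

Under these assumptions, the same workload goes from about 5 milliseconds locally to half a second against remote memory and roughly 50 seconds against remote disk, which is the difference between real time and not.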

Data locality wasn’t a great concern for a long time due to the prevalence of stateless applications. Stateless applications process requests without retaining any state information, which is why stateless microservices are sometimes referred to as ‘forgetful.’ Popular frameworks Hadoop and Spark have stateless architectures as well.

However, with the ability to obtain real-time data at our fingertips, more companies have started incorporating stateful applications into their offerings and processes. Stateful applications must carry their own state in some form to function, and retrieving that state can introduce latency. The increased popularity of stateful applications is causing businesses to rethink and revamp their data locality strategy.

Why modern businesses struggle with data locality

Cutting-edge companies use real-time streaming applications to zoom in on hyper-specific problems as they’re happening, such as cellular networks using applications to determine the operating status of any cell tower at any given time. The actions of these streaming applications are performed within the context of previous transactions, and these “memories” can be used to inform future transactions.

Because of these “memories,” companies need easy access to their stored data. This is where issues like latency and data locality come into play: the farther away that information is stored, the harder it is for companies to make real-time decisions.

How does data locality affect real-time data?

We’ve said this before, but databases are where real-time streaming data goes to die. The farther away you store your data, the longer it will take for you to access it.

In general, databases write to disk, meaning the information they store is persisted on a hard drive. As a result, databases have a relative access time of around 10 ms, which is about 40 times slower than in-memory access. This is why latency often occurs in the data retrieval process.

Databases are also designed for querying. The data stored in them is at rest, and it is only put back into motion when a query retrieves it. Because only the database knows when any of its data has changed, anyone who wants to see which pieces of data have changed must execute queries repeatedly, wasting both the database’s resources and their own.
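The cost of repeated querying can be sketched with two toy in-memory stores. Both classes below are hypothetical stand-ins written for this illustration (they are not a real database client or an Nstream API): one must be polled to detect changes, the other pushes changes to subscribers.

```python
from typing import Callable, Dict, List, Optional


class PolledStore:
    """Stand-in for a database: readers must query to discover changes."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self.queries_served = 0

    def write(self, key: str, value: int) -> None:
        self._data[key] = value

    def read(self, key: str) -> Optional[int]:
        self.queries_served += 1
        return self._data.get(key)


class ObservableStore:
    """Stand-in for a push model: the store notifies readers on change."""

    def __init__(self) -> None:
        self._data: Dict[str, int] = {}
        self._observers: Dict[str, List[Callable[[int], None]]] = {}
        self.notifications_sent = 0

    def subscribe(self, key: str, callback: Callable[[int], None]) -> None:
        self._observers.setdefault(key, []).append(callback)

    def write(self, key: str, value: int) -> None:
        if self._data.get(key) == value:
            return  # nothing changed, so nobody is notified
        self._data[key] = value
        for callback in self._observers.get(key, []):
            self.notifications_sent += 1
            callback(value)


# Polling: ten queries to observe two distinct values.
polled = PolledStore()
polled.write("tower-1", 0)
seen_by_polling: List[Optional[int]] = []
last_seen: Optional[int] = None
for tick in range(10):
    if tick == 5:
        polled.write("tower-1", 1)  # the only change during the run
    value = polled.read("tower-1")
    if value != last_seen:
        seen_by_polling.append(value)
        last_seen = value

# Push: the reader does no work until something actually changes.
pushed = ObservableStore()
seen_by_push: List[int] = []
pushed.subscribe("tower-1", seen_by_push.append)
pushed.write("tower-1", 0)
for tick in range(10):
    if tick == 5:
        pushed.write("tower-1", 1)

print(polled.queries_served, seen_by_polling)
print(pushed.notifications_sent, seen_by_push)
```

Both readers end up seeing the same two values, but the polling reader issued ten queries to get them while the subscriber received exactly two notifications.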

Storing often-requested data in localized nodes instead of databases allows stateful applications to quickly access needed information, cutting out retrieval latency and providing applications with authentic, real-time data.
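One common way to keep related data on the same node is to route every update for an entity by a hash of its key. The sketch below illustrates that general idea only; the node names, the `Node` class, and the `route` helper are all hypothetical and are not taken from any particular platform.

```python
import hashlib
from typing import Any, Dict

# Hypothetical cluster of three nodes.
NODE_NAMES = ["node-a", "node-b", "node-c"]


def owner_node(entity_id: str) -> str:
    """Deterministically map an entity to the node that owns its state."""
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(NODE_NAMES)
    return NODE_NAMES[index]


class Node:
    """A node that keeps the state of the entities it owns in memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state: Dict[str, Dict[str, Any]] = {}

    def update(self, entity_id: str, field: str, value: Any) -> None:
        self.state.setdefault(entity_id, {})[field] = value


cluster = {name: Node(name) for name in NODE_NAMES}


def route(entity_id: str, field: str, value: Any) -> str:
    """Send an update to the owning node and return that node's name."""
    node = cluster[owner_node(entity_id)]
    node.update(entity_id, field, value)
    return node.name


# Two different updates for the same tower land on the same node,
# so reading them together later needs no network request.
first = route("tower-42", "signal_dbm", -71)
second = route("tower-42", "status", "online")
print(first == second, cluster[first].state["tower-42"])
```

Because the mapping depends only on the entity key, any part of the system can compute where an entity's state lives without a lookup. A production system would also need to handle nodes joining and leaving, which this sketch ignores.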

How Nstream Sees Localized Data for Real-Time Streaming Applications

Companies need to be able to process streaming data and contextual data together in order to take action quickly. To achieve real-time insights, developers need to efficiently combine static and streaming data sources without adding latency or excess cost.

Nstream uses data locality to make real-time streaming applications a reality, using a stateful streaming architecture in which concurrent, stateful Web Agents act as digital twins of real-world entities. Each Web Agent statefully consumes raw data from the real world, and the set of Web Agents is a mirror of the real world. With Nstream, collected data can be grouped into nodes, allowing for easy, real-time retrieval.
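The general pattern of one stateful object per real-world entity can be sketched in a few lines. This is a generic illustration and not Nstream's Web Agent API: the `TowerAgent` class, its fields, the `dispatch` helper, and the spike rule are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TowerAgent:
    """Hypothetical digital twin of one cell tower, holding its own state."""

    tower_id: str
    readings_seen: int = 0
    mean_latency_ms: float = 0.0
    alerts: List[str] = field(default_factory=list)

    def on_reading(self, latency_ms: float) -> None:
        # Judge the new reading against remembered state before updating it.
        # Assumed rule: after 3 readings, flag anything over twice the mean.
        if self.readings_seen >= 3 and latency_ms > 2 * self.mean_latency_ms:
            self.alerts.append(f"{self.tower_id}: latency spike {latency_ms} ms")
        # Incremental mean, so no history needs to be stored or re-queried.
        self.readings_seen += 1
        self.mean_latency_ms += (latency_ms - self.mean_latency_ms) / self.readings_seen


agents: Dict[str, TowerAgent] = {}


def dispatch(tower_id: str, latency_ms: float) -> None:
    """Route a raw reading to the agent that mirrors that tower."""
    if tower_id not in agents:
        agents[tower_id] = TowerAgent(tower_id)
    agents[tower_id].on_reading(latency_ms)


readings = [
    ("tower-1", 10.0),
    ("tower-1", 12.0),
    ("tower-1", 11.0),
    ("tower-1", 40.0),  # spike relative to tower-1's own history
    ("tower-2", 40.0),  # first reading for tower-2, so no context yet
]
for tower_id, latency_ms in readings:
    dispatch(tower_id, latency_ms)

for agent in agents.values():
    print(agent.tower_id, agent.readings_seen, agent.mean_latency_ms, agent.alerts)
```

The same 40 ms reading triggers an alert for one tower and not the other, because each agent decides using its own local state rather than a query against a shared database.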

The result is that all data appears instantaneously at rest, despite constant motion between operations. This means entire applications can now run at the speed of their fastest data sources, and your information can arrive in true real time.

Getting started with real-time streaming applications

Once you select your data sources, you can begin creating cutting-edge real-time streaming applications.

Nstream is the first (and only) open-source, full-stack platform for building, managing, and operating streaming applications at scale. See how other companies have used real-time data insights to drive ROI for their business.