
Real-time is the only time

Storage is the leading cause of data loss in the industry

Not for the reasons you might think.

Not because of hardware failures, or lost backups.

But because of systemic architectural limitations imposed by requiring all data to pass through storage before use.

Think about all the data that never reaches a database

Most data is discarded pretty soon after it’s created

You could actually define real-time data as data that hasn’t yet reached storage

All data is real-time at some point

The totality of real-time data generated in the world is far too big to store

Because most data architectures require storage before use, they impede the collection of real-time data in the first place

Why collect data if you can’t use it?

But real-time data can be used to generate immense value by real-time applications, without the data having to pass through storage

The de facto storage requirement causes the data to be unavailable for use by real-time applications

That’s why storage is the leading cause of data loss in the industry

The data is lost before it ever reaches storage, because the data is too big to store.

But if you can process real-time data as it arrives, use it, and then discard it, you can avoid the loss.

Most of this data is only useful in real-time, so there’s no loss in discarding it after use.

An analyze-then-store architecture is the key to preventing real-time data loss

Data loss occurs when data isn’t available for use

If a tree falls in the woods and nobody’s around, does it make a sound? If data is dropped and nobody ever asks to read it, will it be missed?

Reality is too big to fit on disk

You either have to limit your perception of reality to what you can permanently store.

Or you have to interpret data first, and only store the gist of what you processed.
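Here’s a rough sketch of what “store the gist” can look like in code. The names and shapes are hypothetical, not any particular product’s API: each event is analyzed the moment it arrives, folded into a running summary, and discarded; only the compact summary is ever persisted.

```typescript
// A minimal analyze-then-store sketch (hypothetical names; not Swim's API).
// Raw events are processed as they arrive; only a compact summary is persisted.

interface SensorEvent {
  sensorId: string;
  value: number;
  timestamp: number;
}

interface Summary {
  count: number;
  min: number;
  max: number;
  mean: number;
}

class RunningSummary {
  private count = 0;
  private min = Infinity;
  private max = -Infinity;
  private sum = 0;

  // Analyze each event in real time, then let it go; no raw event is stored.
  update(e: SensorEvent): void {
    this.count += 1;
    this.min = Math.min(this.min, e.value);
    this.max = Math.max(this.max, e.value);
    this.sum += e.value;
  }

  snapshot(): Summary {
    return {
      count: this.count,
      min: this.min,
      max: this.max,
      mean: this.count ? this.sum / this.count : 0,
    };
  }
}

// Persist only the "gist": a few bytes per window instead of every raw event.
async function persistSummary(s: Summary): Promise<void> {
  // Stand-in for a database write; the raw stream never touches storage.
  console.log("storing summary", s);
}

// Hypothetical driver: analyze a stream, store a summary once per window.
async function run(stream: AsyncIterable<SensorEvent>, windowMs: number) {
  const summary = new RunningSummary();
  let windowStart = Date.now();
  for await (const event of stream) {
    summary.update(event); // analyze first...
    if (Date.now() - windowStart >= windowMs) {
      await persistSummary(summary.snapshot()); // ...then store the gist
      windowStart = Date.now();
    }
  }
}
```

The summary here is deliberately tiny; the point is the ordering: analyze, then store, then discard.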

Latency is equivalent to data loss for many use cases

Data is effectively “lost” during the time in which it’s being ingested and processed.

Data not getting where it’s needed in time is indistinguishable from data loss due to a hardware failure.

Latency causes data loss when it prevents you from collecting data in the first place

“I store all data forever. But I only collect data every 15 minutes because that’s all I can store.” This is data loss, plain and simple.

Lack of visibility into what’s happening in real-time is data loss

The data isn’t available when you need it, just like after an unrecoverable disk failure.

The industry currently accepts ludicrously high rates of data loss in the name of “zero data loss”

The status quo is to lose as much data as you possibly can until you’ve thrown out so much that what remains fits permanently onto multi-region replicated disks.

Because reality doesn’t fit onto hard drives that exist in reality

If you’re storing everything you receive, you’re missing most of what there is to know.

If reality is too big to store, then real-time is the only time you have

The data is too big to store, so you have to process it before you store it. It’s tautological.

There is literally no other opportunity to process the full picture of what’s going on except in real-time. If you wait too long, the data will be lost forever, because it’s too big to store

I don’t remember what I ate for breakfast this morning, or any morning. But I’ve still managed to learn what I like for breakfast

I analyze what I eat, and then store what I like. I don’t store everything I’ve ever eaten and then periodically analyze my eating habits to determine what I’m statistically most likely to like. That would be stupid. And unnecessarily imprecise.

Long-term analysis might be useful from a nutritional perspective. My stored recollection of what I tend to eat is like a highly compressed JPEG of my actual food intake: it’s close enough. And its small size frees up my capacity to remember a lot more.

Storage isn’t the only way to guarantee consistency

Just ask TCP. Consistency is an end-to-end problem, to which the end-to-end principle should be applied.

The end-to-end principle is the idea that the endpoints of a transaction are in the best position to enforce consistency constraints.

The reason the end-to-end principle gets overlooked by software architectures is that no one piece of software controls the end-to-end picture. A vertically integrated software stack like Swim changes that.
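As a hedged sketch of what endpoint-enforced consistency can look like, TCP-style: the sender keeps retrying until the receiving application, not an intermediate store, confirms it has processed the message. The names and transport below are illustrative assumptions, not Swim’s API.

```typescript
// End-to-end acknowledgment sketch (illustrative names, hypothetical transport).
// A message counts as delivered only when the application at the far end
// confirms it has processed it—consistency enforced at the endpoints,
// not by a durable store in the middle.

interface Ack {
  messageId: string;
  processed: boolean;
}

// Hypothetical transport: any unreliable channel will do.
type Send = (messageId: string, payload: unknown) => Promise<Ack>;

async function sendUntilProcessed(
  send: Send,
  messageId: string,
  payload: unknown,
  maxRetries = 5
): Promise<void> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      const ack = await send(messageId, payload);
      if (ack.processed) {
        return; // End-to-end confirmation: the receiver applied the update.
      }
    } catch {
      // Lost or timed out in transit; the endpoint retries, like TCP does.
    }
  }
  throw new Error(`message ${messageId} not confirmed after ${maxRetries} attempts`);
}
```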

Storage isn’t necessary for resilience

Storage itself gets its resilience from replication. Stateful applications can directly replicate state without touching storage.
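A minimal sketch of that idea, assuming a simple in-memory map and a handful of hypothetical peers: every update is applied locally and fanned out to replicas, so resilience comes from copies of the state on other nodes rather than from a disk write. This is not Swim’s replication protocol, just the shape of the idea.

```typescript
// Sketch: replicating in-memory state directly to peers, with no storage
// on the critical path (hypothetical names and interfaces).

type Delta = { key: string; value: unknown };

// Hypothetical peer interface: anything that can receive a state delta.
interface Peer {
  apply(delta: Delta): Promise<void>;
}

class ReplicatedState {
  private readonly state = new Map<string, unknown>();

  constructor(private readonly peers: Peer[]) {}

  // Update local memory and fan the delta out to replicas; no disk write.
  async set(key: string, value: unknown): Promise<void> {
    this.state.set(key, value);
    await Promise.all(this.peers.map((p) => p.apply({ key, value })));
  }

  get(key: string): unknown {
    return this.state.get(key);
  }
}
```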

Don’t get me wrong, storage is super useful, and critically important. It just doesn’t need to be on the critical performance path

Here’s an idea: use storage to… store stuff! Stop using storage as a consistency and resilience crutch; doing so cripples you more than you may realize. There are better ways to solve these problems that don’t force you to limp around for all eternity.