Streaming-First Architecture

Imagine a world where software evolved the way broadcast media did: streaming first

In the early days of radio and TV, real-time was the only option. Recorded media was a relatively late arrival, as far as broadcast media is concerned.

If the internet had reached critical mass a few years earlier, when storage was still pretty firmly capped at 1.44 MB, we might be having a very different conversation today.

It’s conceivable that streaming data could have taken hold before batch processing took root. In such a world, the idea that you wouldn’t stream data by default might be just as inconceivable as the idea of not having streaming broadcast media is to us.

Might we in fact be living in the more awkward and unnatural branch of history?

(Granted, the gramophone appeared first, but I would argue that radio hit critical mass first.)

At the end of the day, it’s all streamed anyway

Streaming is quite literally the only way of moving data around. There is no alternative. Whether it’s over a serial port, on a RAM bus, or from a Kafka topic, all data transfer is streaming.

Live streaming vs. time-shifting

What “non-streaming” architectures really do is time-shift the delivery of data. The data still gets delivered as some sort of stream; its arrival is just shifted in time from when the data was generated to when the data was queried.

A streaming-first architecture is simply one that does not force you to time-shift data

Nothing about a streaming-first architecture precludes you from time-shifting data.

Newsflash: Live-streamed TV can broadcast pre-recorded content!

The opposite is not true. Recorded media cannot carry live TV.

This is why software architectures should be streaming first: it’s the more general approach.

You sacrifice nothing by being streaming first. You can always stream historical data straight out of a database.
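To make that concrete, here is a minimal sketch of a consumer that cannot tell whether it is being fed from a database or from a live feed. Everything in it is an assumption made for illustration: the `events` table, its `ts` and `payload` columns, and the `replay`, `consume`, and `demo` helpers are hypothetical, and the "live" feed is just an in-memory iterator standing in for a real source.

```python
import sqlite3
from itertools import chain
from typing import Iterable, Iterator, List, Tuple

Event = Tuple[int, str]  # (timestamp, payload); hypothetical record shape


def replay(conn: sqlite3.Connection) -> Iterator[Event]:
    """Stream stored events out of the database in timestamp order."""
    cursor = conn.execute("SELECT ts, payload FROM events ORDER BY ts")
    for ts, payload in cursor:
        yield ts, payload


def consume(stream: Iterable[Event]) -> List[str]:
    """A consumer that only knows how to read a stream, live or replayed."""
    return [f"{ts}:{payload}" for ts, payload in stream]


def demo() -> List[str]:
    # Hypothetical in-memory table holding the "recorded" part of the data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts INTEGER, payload TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, "a"), (2, "b")])

    # Stand-in for a live feed; a real one would block waiting for new data.
    live = iter([(3, "c"), (4, "d")])

    # History first, then live: the consumer sees one continuous stream.
    return consume(chain(replay(conn), live))
```

The point of the sketch is the direction of the dependency: the database is just one more source that can be played into the stream, rather than a gate every record has to pass through first.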

But you fundamentally cripple your architecture when you make it Blockbuster-first.

Forcing all data to pass through a database is like forcing all TV broadcasts to pass through a Blockbuster

You can imagine the excuses people would make in our alternate reality: “We can’t stream TELEVISION! Are you nuts? What if somebody has to go to the bathroom? They’ll LOSE DATA! Forcing everyone to go to Blockbuster for everything GUARANTEES they’ll never miss an episode.”

What it actually guarantees is that everyone sees way less content.

Near real-time is like trying to time-shift live TV by 5 seconds with a VCR

Record. Pause. Rewind and play back really quickly. Record another “mini batch”. Pause, rewind, and replay really quickly. People have been given Ph-fucking-D’s for coming up with this shit.
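Stripped of the VCR, the record-pause-replay loop looks roughly like the sketch below. It is an illustration under assumptions, not how any particular engine implements micro-batching: `micro_batches` and its `size` parameter are hypothetical names, and real systems usually cut batches by time interval rather than by count.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def micro_batches(stream: Iterable[T], size: int) -> Iterator[List[T]]:
    """Record `size` items, then hand them over all at once."""
    batch: List[T] = []
    for item in stream:
        batch.append(item)        # "record"
        if len(batch) == size:
            yield batch           # "rewind and play back"
            batch = []            # start recording the next mini batch
    if batch:
        yield batch               # flush the final partial batch
```

For example, `list(micro_batches(range(5), 2))` yields `[[0, 1], [2, 3], [4]]`. Note that the first item of each batch is held back until the batch is cut, which is exactly the delay the VCR adds.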

Anything you can do to time-shifted data you can also do with real-time data

You just have to approach it the right way. When you watch TV, you remember what happened in the last scene. Watching TV is a stateful process. This is why it works.

The reason people think applications can’t run in-stream is because they assume statelessness.

If you assumed that viewers had the memory span of a fish, live TV would seem impossible too. But it’s a bogus, artificially imposed assumption.
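Here is a minimal sketch of what "remembering the last scene" looks like in code: a per-key running mean that is updated one record at a time, with no query against stored history. The `running_mean` function and the `(key, value)` record shape are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def running_mean(
    readings: Iterable[Tuple[str, float]]
) -> Iterator[Tuple[str, float]]:
    """Emit the up-to-date mean for a key every time a reading arrives."""
    # The state: everything the processor "remembers" between records.
    counts: Dict[str, int] = defaultdict(int)
    totals: Dict[str, float] = defaultdict(float)

    for key, value in readings:
        counts[key] += 1
        totals[key] += value
        yield key, totals[key] / counts[key]
```

Feeding it `[("a", 2.0), ("a", 4.0), ("b", 10.0)]` yields `("a", 2.0)`, `("a", 3.0)`, `("b", 10.0)`. The answer is available the moment each record lands, and the only thing kept around is two numbers per key.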

It’s not just me saying so, the laws of physics say so

Conservation of energy quite literally arises from the invariance of the laws of physics under translation in time. Noether’s theorem states that every continuous symmetry gives rise to a conservation law.

If you can do it later, you can do it now too.

Doing it later can be far more expensive though. You have to expend a lot of energy storing data to ensure that there is a later.

You can have your cake and eat it too—for a while

You can still go to Blockbuster if you like. All I’m saying is, don’t force everyone to go to Blockbuster just because it’s what you’re used to. I like Blockbuster too. Or at least I used to. When I had to. Why is it that Blockbuster went away again? Oh yeah, because it sucked! Streaming is way better.

It just takes time to get used to it. And you need the right infrastructure and software to make it easy. Once that software exists and is broadly adopted though, there’s likely no turning back.

The timescales of change

Bill Gates perhaps put it best when he said, “people overestimate what can happen in five years, but underestimate what can happen in ten years.”

We’re about five years into the transition to streaming data, edge computing, and AI. Most people grossly overestimated where we’d be by now. But they also probably grossly underestimated how different the landscape will look another five years from now.

Where will the Blockbusters of data be in the next five years?

And who will be the Netflix of streaming data? Will it be Confluent? Or will Confluent just be one of the pipes used by the streaming data Netflix?