A Quick Note on Continuous Analysis

Analyzing data on the fly is tricky: data sets are unbounded, and real-time responses demand fast analysis. Incremental algorithms can be used for statistical analysis, set membership, regression, and both the training of learning algorithms and prediction with them, among others. In specific use cases, domain-specific algorithms also apply.

Measures like the mean, variance, higher moments, and correlation coefficients are easy to compute incrementally, but recursive algorithms for other measures, such as quantiles, are not so simple. Statistical measures both describe properties of the data and serve as estimates for model parameters in the context of hypothesis testing. For example, the sample mean is an estimator for the expected value of the distribution. Statistical moments of second and higher order provide information about the variance, the skewness, and the kurtosis of a distribution.
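
To make the "easy to compute incrementally" claim concrete, here is a minimal sketch of a running mean and variance using Welford's method, which updates the estimates one sample at a time without storing the stream. The class name `RunningStats` and the sample values are illustrative assumptions, not part of Swim's API.

```python
class RunningStats:
    """Tracks count, mean, and variance one sample at a time (Welford's method)."""

    def __init__(self):
        self.n = 0        # number of samples seen so far
        self.mean = 0.0   # running mean
        self.m2 = 0.0     # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        # Uses the deviation from both the old and the new mean.
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Unbiased sample variance; undefined for fewer than two samples.
        return self.m2 / (self.n - 1) if self.n > 1 else float("nan")


# Hypothetical readings arriving one by one.
stats = RunningStats()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(value)

print(stats.n, stats.mean, stats.variance)
```

Each update costs constant time and memory, which is what makes this style of estimator suitable for unbounded streams. Quantiles have no comparably simple exact recursion, which is why streaming systems typically fall back on approximations for them.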

An example of change detection for time series data from a waste water treatment plant

Non-stationarity matters in many practical use cases, such as predictive maintenance. In effect, the parameters of the sampled distribution may change over time, and detecting such a change requires a statistical test such as a χ²-test or a t-test. Change detection is also important for deriving optimal sampling strategies, such as the size of a time window for parameter estimation.
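
As a rough illustration of t-test-based change detection, the sketch below compares the most recent window of samples against the window just before it using Welch's t statistic, and flags positions where the two differ strongly. The function names, the window size of 20, and the fixed threshold of 4.0 are all assumptions chosen for illustration; a more careful version would compare the statistic against a critical value of the t-distribution.

```python
import math
from collections import deque
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    mean_a, mean_b = mean(a), mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # Both samples are constant: any difference in means is unambiguous.
        return 0.0 if mean_a == mean_b else math.copysign(math.inf, mean_a - mean_b)
    return (mean_a - mean_b) / se


def detect_changes(stream, window=20, threshold=4.0):
    """Yield indices where the recent window differs from the preceding one."""
    reference = deque(maxlen=window)  # the window just before `recent`
    recent = deque(maxlen=window)     # the latest `window` samples
    for i, x in enumerate(stream):
        if len(recent) == window:
            # The oldest recent sample is about to be evicted; keep it as reference.
            reference.append(recent[0])
        recent.append(x)
        if len(reference) == window:
            if abs(welch_t(list(reference), list(recent))) > threshold:
                yield i
```

Given a hypothetical signal whose level jumps partway through, `list(detect_changes(signal))` returns a run of indices shortly after the jump and nothing once both windows have settled at the new level. The window size trades detection delay against sensitivity to noise, which is the sampling-strategy question raised above.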

Swim can deal with many of these incremental functions, and we continue to add more. But there is also a treasure trove of open source tools, for example in Apache Flink, that you can plug into Swim. We’ll take care of the hard parts, like distributed coherence…