
Stream Processing Anomaly Detection Using Yurita Framework

Key Takeaways

  • There is a need to have better automated monitoring capabilities which can detect abnormal behavior in our systems, both from the infrastructure side as well as from the user side (like user fraud).
  • Anomaly detection is usually considered a branch of unsupervised machine learning.
  • Some of the anomaly detection approaches include statistical, curve fitting, clustering and deep learning.
  • Yurita, an open source anomaly detection framework for stream processing, is built on the Spark Structured Streaming framework.
  • A typical Yurita data streaming pipeline includes stages such as data extraction, windowing, data modeling, anomaly detection, and output reports.

As the scale at which modern software must operate, both in the number of users it serves and in the infrastructure supporting it, continues to increase, it becomes virtually impossible for a human to manually monitor that systems behave as they should. Meanwhile, software user expectations have also increased dramatically in the last decade: unresponsive pages, long wait times and transaction failures are no longer tolerated. Users would probably move to a competitor's service without giving it a second thought. This has set the stage for a need for better automated monitoring capabilities that can detect abnormal behavior in our systems, both on the infrastructure side and on the user side (like user fraud); in other words, anomaly detection capabilities.

There has been a lot of interest in anomaly detection in recent years, with many diverse technical talks at conferences and proprietary products in the market (e.g. Elasticsearch Anomaly Detection). Yet I would argue that the term anomaly detection is still too ambiguous, and we as an industry should come up with better terminology for the problems we are actually trying to solve. I like to compare it to the early days of Natural Language Understanding. Breaking that term down, Natural Language can be interpreted as a language a human would use, but Understanding is ambiguous, as it is subjective. Today it is widely accepted that the Natural Language Understanding domain groups a growing set of well-defined tasks like Sentiment Analysis, Named Entity Recognition and Text Summarization.
Today, the status of the anomaly detection domain is closer to the Oxford Dictionary definition, "Something that deviates from what is standard, normal, or expected," plus a variety of methodologies borrowed from statistics and machine learning. In real-world use cases, that "something" often refers to a time series (a window of continuous values ordered by time of observation), a distribution of categorical values, or a vector representing the feature set of an event, but it can have many more interpretations.

Anomaly Detection Landscape

Developing accurate anomaly detection models is considered challenging due to the lack of positive samples (observations that present an anomaly are rare), the lack of tagged data, and the fact that anomalies can be so distinct and unexpected that it is difficult to train models to recognize them. This is why anomaly detection is usually considered a branch of unsupervised machine learning.

I will not deep dive into any specific method of anomaly detection, but I feel a general survey of a few current state-of-the-art approaches is required at this point.

  1. Statistical approaches: Intuitively, we estimate what the probability of an observation is at different value ranges; if an observation appears in a low enough probability region, it might be an anomaly, e.g. Bayesian inference.
  2. Curve fitting approaches: We try to fit our data with a mathematical function composed of simpler known mathematical functions; we then use this function to create a forecast of the expected observation at the current time. After that, we evaluate (often using statistical methods) whether the current observation is significantly different from the expected one, e.g. Fast Fourier Transform.
  3. Clustering approaches: We would look for outliers – points that don’t belong to any cluster, as they might be anomalous. Because we are interested in the outliers (and not the actual clusters), density-based models are often a better match, e.g. DBSCAN.
  4. Deep learning approaches: Of course, this article would not be complete without talking about deep learning. The rapid advancements in neural network methodologies in recent years did not skip the anomaly detection domain. There is a lot of ongoing work in the industry to disrupt this area, similar to what deep learning has done in other machine learning areas. Recent works have shown promising results, mostly using LSTMs and Auto-Encoders, e.g. Donut (Variational Auto-Encoder).
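To make the statistical family of approaches concrete, here is a minimal sketch (plain Python, not Yurita code) of its simplest form: flag an observation whose z-score against a window of history exceeds a threshold. It assumes roughly Gaussian data; real metrics often need heavier-tailed models.

```python
import math

def zscore_anomaly(history, current, threshold=3.0):
    """Flag `current` as anomalous if it lies more than `threshold`
    standard deviations from the mean of `history`."""
    mean = sum(history) / len(history)
    var = sum((x - mean) ** 2 for x in history) / len(history)
    std = math.sqrt(var)
    if std == 0:
        # Degenerate history: any deviation at all is suspicious.
        return current != mean
    return abs(current - mean) / std > threshold

# A stable metric with one sudden spike:
normal_history = [100, 102, 98, 101, 99, 100, 103, 97]
print(zscore_anomaly(normal_history, 101))  # False: within normal range
print(zscore_anomaly(normal_history, 180))  # True: far outside it
```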



Working at PayPal on a next generation stream processing platform, we started to notice that many of our users wanted to use stream processing to apply anomaly detection models in real time. After exploring different architectures to create a flexible, production-grade framework that can scale to real-world workloads, we eventually decided to go with a pipeline-based API, inspired by other open source projects like scikit-learn and Spark MLlib.

This work led to the development of Yurita, an open source anomaly detection framework for stream processing. Yurita is based on the Spark Structured Streaming framework and utilizes its processing engine to achieve scalable, performant execution. The name Yurita comes from a traditional Japanese gold panning tool (we liked the analogy between finding gold in dirt and finding anomalies in data).


When using pipelines, the developer models the data processing as a series of transformations applied to the input data. In Yurita, we found the following pipeline stages to be common across different applications:

  • Extraction: pulling the metrics we are interested in analyzing for anomalies out of the raw data stream.
  • Windowing: as we are working with an infinite stream of data, it is common in stream processing to group events by time window.
  • Data modeling: we transform the events in each window into a data structure (or computation) ready to be analyzed by an anomaly detection model.
  • Anomaly detection: we apply the detection model to the data, evaluate whether it is an anomaly, and produce an output report.
  • Output: a stream of reports ready for post-processing, for activating an alerting system, or simply to be plugged into a dashboard.

These stages are quite generic, and can be related to other data processing applications as well, but where Yurita really shines is in its internal window referencing mechanism.
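The extraction, windowing and data-modeling stages can be sketched outside of any framework. The following minimal illustration (plain Python with hypothetical login events, not Yurita code) groups a categorical metric into fixed 20-minute windows and models each window as a count distribution:

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

# Hypothetical raw login events: (timestamp, extracted metric value)
events = [
    (datetime(2019, 3, 2, 12, 5, tzinfo=timezone.utc), "M"),
    (datetime(2019, 3, 2, 12, 25, tzinfo=timezone.utc), "F"),
    (datetime(2019, 3, 2, 12, 30, tzinfo=timezone.utc), "F"),
]

def window_key(ts, minutes=20):
    """Windowing: assign each event to a fixed-size time window."""
    return ts.replace(minute=(ts.minute // minutes) * minutes,
                      second=0, microsecond=0)

def build_windows(events):
    """Data modeling: represent each window as a count distribution."""
    windows = defaultdict(Counter)
    for ts, value in events:
        windows[window_key(ts)][value] += 1
    return windows

windows = build_windows(events)
# Two 20-minute windows: 12:00 holds one "M", 12:20 holds two "F"
```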

Without knowing what the normal behavior of a metric is, we would only be able to use simple anomaly detection techniques, like rule-based decisions, which also require a deep understanding of each specific dataset and are therefore not scalable from a productivity point of view. The window referencing mechanism enables the system to dynamically represent the expected behavior of the data according to a user-specified high-level strategy.

For example, if a dataset has seasonality that correlates with day of the week and hour of the day, then when analyzing a current time window, e.g. 12:00-13:00 on a Saturday, we might want to look at previous Saturdays at 12:00-13:00 to understand what is expected.
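A recurring reference like this is easy to compute. Here is a sketch in plain Python (the function name and parameters are illustrative, not Yurita's API) that yields the start times of the matching historical windows:

```python
from datetime import datetime, timedelta, timezone

def recurring_references(current_window,
                         recurrence=timedelta(weeks=1),
                         lookback=timedelta(weeks=4)):
    """Yield start times of past windows at the same weekly offset,
    e.g. the previous four Saturdays at 12:00 for a Saturday 12:00 window."""
    ref = current_window - recurrence
    while current_window - ref <= lookback:
        yield ref
        ref -= recurrence

current = datetime(2019, 3, 2, 12, 0, tzinfo=timezone.utc)  # a Saturday
refs = list(recurring_references(current))
# refs: 12:00 on each of the four preceding Saturdays
```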

The process works as follows:

  • After a windowed data model is evaluated by the detection model, it remains in the system in a historical pool. This pool can be stored in memory or offloaded to an external key-value store.
  • A user can parameterize built-in referencing strategies or implement a new one using a Yurita interface.
  • Under the hood, Yurita will calculate which historical windows are relevant to the detection model when analyzing a certain real-time window, and will retrieve them from the pool.
  • A user can supply a reducer function that merges the historical data retrieved and creates a reference expected behavior. For example, in simple cases, the reducer can calculate the average. In more advanced cases, the reducer can create a forecasted value.
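For the simple averaging case mentioned above, a reducer over categorical windows could look like this (a plain Python sketch, not Yurita's reducer interface):

```python
from collections import Counter

def average_reducer(historical_windows):
    """Merge retrieved historical windows into one expected distribution
    by averaging the per-category counts."""
    merged = Counter()
    for window in historical_windows:
        merged.update(window)
    n = len(historical_windows)
    return {category: count / n for category, count in merged.items()}

# Gender counts from three previous Saturdays' 12:00-13:00 windows:
history = [Counter(M=40, F=60), Counter(M=44, F=56), Counter(M=42, F=58)]
expected = average_reducer(history)
# expected behavior: {'M': 42.0, 'F': 58.0}
```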


In code, a pipeline will look similar to this example:

val pipeline = AnomalyPipeline("gender",
   Window.fixed("20 minutes"),
   WindowReference.MultiRef.recurringDuration("1 hour", "1 week", "1 month"),
   new Categorical.AggregationFactory,
   Categorical.avgRefStrategy,        // reducer over the referenced windows (name illustrative)
   Categorical.entropyAnomalyModel)   // entropy-based detection model (name illustrative)

In the example, we track a stream of account logins where we chose the user's gender as the metric for anomaly detection. We also grouped events into 20-minute window aggregations. Since the gender data is categorical (M or F), we count how many times each value appears in the window. The system will generate a baseline window from the same hour on the same day of the week over the last month. It will then compare the real-time window with the baseline window using a statistical entropy function.
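One plausible choice of "statistical entropy function" is the Kullback-Leibler divergence of the current window's distribution from the baseline's (a plain Python sketch; not necessarily the exact function Yurita applies):

```python
import math

def relative_entropy(current, baseline):
    """KL divergence of the current category counts from the baseline
    counts: 0 means identical distributions, larger means bigger drift."""
    cur_total = sum(current.values())
    base_total = sum(baseline.values())
    kl = 0.0
    for category, count in current.items():
        p = count / cur_total
        q = baseline.get(category, 0) / base_total
        if p > 0 and q > 0:
            kl += p * math.log(p / q)
    return kl

baseline = {"M": 50, "F": 50}
print(relative_entropy({"M": 49, "F": 51}, baseline))  # close to 0
print(relative_entropy({"M": 5, "F": 95}, baseline))   # much larger
```

A threshold on this score (or a statistical test built on it) then decides whether the window is reported as anomalous.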

In this design, we have decoupled the logic of creating the expected behavior from the logic of the anomaly evaluation. This separation of concerns helps with experimentation during research, which brings me to the next part of this article.

Research and Production

Many times, research software and production software are separate; you often see researchers developing models entirely in runtimes like MATLAB or R, and then software engineers converting them to a production runtime like Java. A process like this can create a lot of friction, often resulting in inconsistencies between the research model and the production model. New frameworks try to minimize this gap and allow researchers to directly create models that are production-ready. In order to do that, the following two factors should be taken into account.

Lowering the Entry Barrier

I will use the Apache Spark project, which has become the industry-standard batch processing engine, as an example. It is a relatively complex project with many innovative and optimized pieces, yet I would argue that one of its simpler features is also one of its most important: setMaster("local[*]"). Adding this line changes the deployment mode to run locally on a developer machine, even though Spark was designed to run in a much more complex distributed environment with many machines. Making it so easy to get started and try out the project was a key driver in the framework's rapid adoption in the industry, and indeed we see this pattern in the majority of successful open source projects, which allow developers to play around with them with very low friction.

Similarly, we wanted Yurita to be friendly to early adopters, so we utilized Spark's ability to run in local mode and to run stream processing code in batch mode, and topped it with built-in, ready-to-run example applications. We also made sure to create a lightweight packaging scheme for jars that are available in our notebooks' runtime, ready for data scientists to experiment with.

Visibility into Models

A critical component of any ML framework is extensive visibility into the model's processing and its output. It is not enough to say that the model detected an anomaly; being able to answer why a model returned a given output is a must. I am familiar with cases where better models were rejected because their output could not be adequately explained. Some important qualities we get from model visibility:

  • It creates trust: the developer can verify the model is working correctly and can evaluate whether it is reliable enough for a production setting.
  • It makes the model implementation easier to debug.
  • It allows tuning the model's parameters and observing their effect.

In Yurita, when we output a report of an anomaly detection analysis, we try to include as much information as we can in the report. When running in production, probably only the top-level information is relevant, such as whether it is an anomaly and what its immediate cause is. But when doing research, having a visual representation of the processing can boost productivity and help build better models. Even though the framework was designed for stream processing and checks for anomalies in the new data it encounters, we added a research-specific feature that simulates all the outputs the model would have produced in real time over a long historical dataset (together with an internally optimized execution of this mode).
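The idea behind such a replay mode can be sketched as a simple backtest loop (plain Python, illustrative only; Yurita's internal execution is optimized rather than a naive loop like this):

```python
def replay(historical_windows, detect, warmup=3):
    """Simulate what the detector would have flagged in real time:
    walk the historical windows in order, treating everything seen so
    far as the reference pool for the next window."""
    reports = []
    for i in range(warmup, len(historical_windows)):
        reference = historical_windows[:i]
        reports.append(detect(reference, historical_windows[i]))
    return reports

# A toy detector: flag values far from the running historical mean.
def simple_detect(reference, current):
    mean = sum(reference) / len(reference)
    return abs(current - mean) > 10

series = [100, 101, 99, 100, 150, 100]
print(replay(series, simple_detect))  # [False, True, False]
```

Inspecting the full list of simulated reports, rather than a single live alert, is what makes it possible to evaluate and tune a model against a long stretch of history before deploying it.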

Final Thoughts

In this article, I shared my two cents on why we should communicate better when we talk about anomaly detection, while also hoping it can act as a good starting point for anyone getting into this domain. As event-based architectures and stream processing technologies evolve and are increasingly adopted, we might see more frameworks designed to tackle the new challenges that come with them. Automated monitoring for long-running applications and keeping models continuously up-to-date are only some of the areas where we have found room for innovation. You are welcome to try out Yurita, join its early community, and contribute.

About the Author

Guy Gerson is a former software engineer at PayPal, where he worked on a next generation stream processing platform and on the adaptation of statistical and machine learning methodologies as part of real-time data pipelines. Prior to PayPal, he was a researcher in the IBM Cloud and Data Technologies group, focusing on designing large-scale Internet of Things analytics architectures.

