
Anomaly Detection Using ML.NET

Key Takeaways

  • The pillar of machine learning development in .NET is the ML.NET framework, a library dedicated to the C# and F# programming languages.
  • ML.NET allows you to train a machine learning model with a variety of algorithms, or to reuse existing models, and then run them in any .NET environment.
  • One of the algorithms you can use in the ML.NET library is Anomaly Detection.
  • There are three settings for Anomaly Detection: supervised, unsupervised, and clean.
  • Anomaly Detection is used in cases such as fraud detection or validation of the values entered into the system.
  • You can evaluate Anomaly Detection in ML.NET using the AUC-ROC metric or Detection Rate At False Positive Count.

Nowadays, we are flooded on all sides with an enormous amount of data. Large companies store vast amounts of statistical information, production data, and financial reports. More and more often, we want or need to deduce something from the available data.

These deductions could be the analysis of customer recommendations based on data collected from similar customers or the detection of potentially illegal transactions. It is for such purposes that we use machine learning algorithms. We select them according to the problem with which we are dealing.  

Anomaly Detection

One such algorithm is Anomaly Detection. As the name suggests, it is about finding what deviates from what you expect in day-to-day operation. It helps identify data points, observations, or events that deviate from the normal behavior of a dataset. Many distributed systems now need their performance monitored, and a considerable amount of data and events pass through them.

Anomaly detection makes it possible to pinpoint the source of a problem, which significantly reduces the time needed to rectify the fault. It also allows us to detect outliers and report them accordingly.

All of these applications have one common focus, which I mentioned earlier: outliers. These are cases where data points are distant from the others, do not follow a particular pattern, or match known anomalies.

Each of these data points can be useful for identifying these anomalies and responding correctly to them.

The main applications of Anomaly Detection are:

  • Data cleaning
  • Fraud detection
  • Detection of hacker intrusions
  • Validation of the values entered in the system
  • Disease detection

For better understanding, consider the following example. Imagine that you are responsible for controlling an internal road in a city where only passenger cars are allowed to drive. You use a system for this, which gets a dataset with tagged data and recognizes passenger cars. With this, it recognizes a lorry, as you can see in the picture below (1), which is an anomaly in this case. However, one day the city celebrates "green day", when all vehicles in the city center are supposed to be green (including lorries). The dataset is updated, and the anomaly in this case is already different (2). As you can see, it is much easier to define what an anomaly is if you establish the norm or, in other words, if you label the examples in the dataset.

In this case, we are talking about the supervised setting for anomaly detection, because the dataset is labeled. It is worth mentioning that there are other settings as well: clean and unsupervised. Let's take a closer look at each of them.

Clean

In the clean setting, all data are assumed to be "nominal" and uncontaminated by "anomaly" points; the data group is complete and clean. It is then the detector's task to detect anomalies in the new observations it receives.

Unsupervised

This is definitely the hardest case: the training data is unlabeled and consists of a mix of "nominal" and "anomaly" points. There is no prior knowledge of what result to expect, so the model has to decide for itself what is anomalous and what is nominal. The main goal here is to create clusters from the data and then find the few points that do not belong to any of them. It can be said that all anomaly detection algorithms of this kind are some form of approximate density estimation. The methods used here include K-means, one-class Support Vector Machines, and self-organizing maps; a small K-means sketch follows below.
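
To illustrate the clustering approach, here is a minimal, self-contained sketch in ML.NET that trains a K-means model on unlabeled points and uses the distance to the nearest centroid as an anomaly score. The DataPoint and ClusterPrediction classes, the sample values, and the single-cluster choice are my own illustrative assumptions, not something taken from the dataset used later in this article:

using System;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;

public static class KMeansAnomalySketch
{
    // Illustrative input type: one three-dimensional measurement.
    public class DataPoint
    {
        [VectorType(3)]
        public float[] Features { get; set; }
    }

    // K-means output: Score contains the squared distances to each centroid.
    public class ClusterPrediction
    {
        public uint PredictedLabel { get; set; }
        public float[] Score { get; set; }
    }

    public static void Main()
    {
        var ml = new MLContext(seed: 1);
        var data = ml.Data.LoadFromEnumerable(new[]
        {
            new DataPoint { Features = new[] { 1.0f, 1.0f, 1.0f } },
            new DataPoint { Features = new[] { 1.1f, 0.9f, 1.0f } },
            new DataPoint { Features = new[] { 9.0f, 9.0f, 9.0f } } // a likely outlier
        });

        // No labels are involved: the model only clusters the data.
        var model = ml.Clustering.Trainers.KMeans("Features", numberOfClusters: 1).Fit(data);
        var clusters = ml.Data.CreateEnumerable<ClusterPrediction>(
            model.Transform(data), reuseRowObject: false);

        // Points far from every centroid are candidates for anomalies.
        foreach (var c in clusters)
        {
            Console.WriteLine("Distance to nearest centroid: {0}", c.Score.Min());
        }
    }
}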

Supervised

The supervised setting is the case where the dataset is prepared with all data points marked as anomalous or nominal, which is a great convenience: all anomalous points are known in advance, and the model can learn directly from these labels. In this case, we can use algorithms such as k-nearest neighbors, Support Vector Machines, or Randomized Principal Component Analysis (PCA), which I will describe in detail below.

Randomized PCA

I think that is enough anomaly detection theory; let's focus on the algorithm we are going to use in this case. In the context of Anomaly Detection, Randomized PCA analyzes the available features to determine what represents the "normal" class and uses distance metrics to identify cases that represent anomalies. Thanks to that, the model can be trained on existing data.

Randomized PCA means an approximate principal component analysis (PCA) model using the randomized singular value decomposition (SVD) algorithm. I know, it sounds complicated. However, I will try to give you a brief overview of what this method is all about. If you want to explore the subject in more detail, I refer you to the paper by Vladimir Rokhlin, Arthur Szlam, and Mark Tygert.

PCA is a frequently used technique in machine learning, particularly for data analysis, as it allows us to learn about the internal structure of data. The algorithm works by looking for correlations between variables and determining the combination of values that best represents the differences in results. These combined feature values are then used to create a more compact feature space, whose dimensions are called principal components.

In Anomaly Detection, every new input is analyzed. The algorithm calculates an anomaly score determined by the normalized error of the input data: the higher the error, the more anomalous the instance is (5).
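
To give a feel for this idea, here is a small, hand-rolled sketch (my own illustration, not ML.NET's internal implementation): a point is projected onto a principal component, reconstructed from that projection, and the size of the reconstruction error serves as the anomaly score. The component vector and sample points are made-up values:

using System;
using System.Linq;

public static class ReconstructionErrorSketch
{
    // Illustrative only: a single, roughly unit-length principal component in 3-D space.
    private static readonly double[] Component = { 0.577, 0.577, 0.577 };

    // Project x onto the component, reconstruct it, and return the error norm.
    public static double AnomalyScore(double[] x)
    {
        var projection = x.Zip(Component, (a, b) => a * b).Sum();
        var reconstruction = Component.Select(c => c * projection).ToArray();
        return Math.Sqrt(x.Zip(reconstruction, (a, b) => (a - b) * (a - b)).Sum());
    }

    public static void Main()
    {
        Console.WriteLine(AnomalyScore(new[] { 1.0, 1.0, 1.0 }));  // lies along the component: score near 0
        Console.WriteLine(AnomalyScore(new[] { 1.0, -1.0, 0.5 })); // far from it: high score
    }
}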

Example usage of Randomized PCA in ML.NET

In this example, we will use the Anomaly Detection Falling People dataset from Kaggle, which was prepared to detect falls of elderly people in smart homes. Four sensors were used during the experiments, attached to each person's chest, ankles, and belt.

The features include the three-dimensional parameters X, Y, and Z, which represent the positions of the sensors in each sample. The other features are one-hot encoded representations of the activity of each sensor. As for the labels, they represent a fall/normal-life event, where 0 means normal and 1 means an anomalous fall event.

The first step is to create a console application project (6) and download ML.NET from NuGet Packages (7). You can then proceed to implement and create the model.

The basic idea is to create classes corresponding to the attributes of our dataset:

public class Features
{
    [LoadColumn(0)]
    public float Xposition { get; set; }
    [LoadColumn(1)]
    public float Yposition { get; set; }
    [LoadColumn(2)]
    public float Zposition { get; set; }
    [LoadColumn(3)]
    public float FirstSensorActivity { get; set; }
    [LoadColumn(4)]
    public float SecondSensorActivity { get; set; }
    [LoadColumn(5)]
    public float ThirdSensorActivity { get; set; }
    [LoadColumn(6)]
    public float FourthSensorActivity { get; set; }
    [LoadColumn(7)]
    public float Anomaly { get; set; } // label: 0 = normal activity, 1 = fall (anomaly)
}

// Captures the output columns produced by the anomaly detection model.
public class Result
{
    public bool PredictedLabel { get; set; } // true = outlier, false = inlier
    public float Score { get; set; }         // anomaly score
}

Then you can load the dataset mentioned earlier from the project's folder. We have a testing set that contains twenty files and a training set with five files.

The code to load these files is as follows:

var trainingSetPath = "TrainingSet/*";
var testingSetPath = "TestingSet/*";
var ml = new MLContext();

var trainingDataView = ml.Data.LoadFromTextFile<Features>(trainingSetPath, hasHeader: true, separatorChar: ',');
var testingDataView = ml.Data.LoadFromTextFile<Features>(testingSetPath, hasHeader: true, separatorChar: ',');
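
Optionally, you can sanity-check the load before training. The preview step below is my own addition rather than part of the original workflow; it prints the first few rows so you can confirm that the schema and separator are correct:

var preview = trainingDataView.Preview(maxRows: 5);
foreach (var row in preview.RowView)
{
    Console.WriteLine(string.Join(", ", row.Values.Select(v => v.Key + "=" + v.Value)));
}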

Now create a training pipeline. Here you select an Anomaly Detection trainer in the form of Randomized PCA and pass it the names of the feature columns. Additionally, you can set options such as Rank (the number of components in the PCA) or Seed (the seed for random number generation).

// Feature columns to combine into a single "Features" vector.
var columnNames = new[] { "Xposition", "Yposition", "Zposition", "FirstSensorActivity",
    "SecondSensorActivity", "ThirdSensorActivity", "FourthSensorActivity" };

var options = new Microsoft.ML.Trainers.RandomizedPcaTrainer.Options { Rank = 4 };
var pipeline = ml.Transforms.Concatenate("Features", columnNames)
    .Append(ml.AnomalyDetection.Trainers.RandomizedPca(options));

Finally, you can move on to training and testing the model, which is limited to three lines of code.

var model = pipeline.Fit(trainingDataView);

var predictions = model.Transform(testingDataView);

var results = ml.Data.CreateEnumerable<Result>(predictions,
    reuseRowObject: false).ToList();

As you may have noticed before, the Result class is used to capture the predictions. PredictedLabel determines whether a data point is an outlier (true) or an inlier (false). Score is the anomaly score; a data point with a predicted score higher than 0.5 is usually considered an outlier.
You can display the outliers in the console using the following code:

foreach (var result in results.Where(result => result.PredictedLabel))
{
    Console.WriteLine("The example is an outlier with a score of {0}", result.Score);
}

Additionally, you can check the number of falls detected, which corresponds to the number of outliers:

var numberOfFallsDetected = results.Count(result => result.PredictedLabel);
Console.WriteLine("Number of falls detected: {0}, where the total number of measurements is: {1}",
    numberOfFallsDetected, results.Count);

A crucial part of the model development process is evaluation; it is this phase that determines how well the model performs. Therefore, it is important to assess the model's performance with every applicable evaluation method.

Each group of machine learning algorithms has its own dedicated metrics.

In the case of Anomaly Detection, ML.NET gives us access to two metrics: the AUC-ROC curve and Detection Rate At False Positive Count.

The AUC-ROC metric tells us about the model's ability to discriminate between classes. The higher the AUC value, the better the model; values closer to 1 are better. Only if the score is greater than 0.5 can we say that our model is effective: values of 0.5 or lower indicate that the model is no better than randomly allocating inputs to the anomalous and normal categories, which means it is not useful at all.

AUC-ROC curves are typically used to graphically represent the relationship and trade-off between sensitivity and specificity for each possible cut-off point for the test performed or for combinations of tests performed. If you want to know the details of this metric, I refer you to a great article by Victor Dey.

Detection Rate At False Positive Count is the ratio of correctly identified anomalies to the total number of anomalies in our test set, indexed by each false positive. In other words, for each allowed number of false positives there is a corresponding detection rate value. Values closer to 1 are better; if there are no false positives, this value is equal to 1.

To use these metrics in ML.NET, we can call the Evaluate method, whose arguments are the prediction results and the name of the label column. In our case, that column is called Anomaly, and we will use the AUC-ROC metric:

var testMetrics = ml.AnomalyDetection.Evaluate(predictions, "Anomaly");
var areaUnderRocCurve = testMetrics.AreaUnderRocCurve;
Console.WriteLine("Area under ROC curve: {0}", areaUnderRocCurve);

The Area Under ROC Curve value of 0.74 may indicate that the created model is fairly robust. As I mentioned earlier, values close to 1 are better, and such a result can suggest that the algorithm works efficiently and the results are valuable.
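
The second metric can be read from the same evaluation. The snippet below assumes the Evaluate overload with a falsePositiveCount parameter (its default in current ML.NET versions is 10):

var metricsAtTenFalsePositives = ml.AnomalyDetection.Evaluate(predictions,
    labelColumnName: "Anomaly", falsePositiveCount: 10);
Console.WriteLine("Detection rate at 10 false positives: {0}",
    metricsAtTenFalsePositives.DetectionRateAtFalsePositiveCount);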

Summary

In this article, I have briefly outlined the theory behind Anomaly Detection. Additionally, I have introduced the PCA method, which is a common choice for this problem. Of course, to use this algorithm in ML.NET, you do not need to know the theory behind it. However, I believe that knowing the basics allows for a better understanding of the issues and of the results obtained. Additionally, you can later compare these outcomes with the results obtained from other libraries or sources.

About the Author

Robert Krzaczyński is a software engineer who specialises in Microsoft technologies. On a daily basis, he develops software primarily in .NET, but his interests reach much further: he also delves into machine learning and artificial intelligence. After hours, Robert shares his knowledge on his blog (bush-dev.com). He holds a BSc Eng degree in Control Engineering and Robotics and an MSc Eng degree in Computer Science.
