thesenmap/README

Ensembles are a way to combine multiple models into a more powerful one. In anomaly detection you can use a concept called feature bagging to create multiple predictions from the same algorithm: each run of the algorithm only works on a subset of the features. Generally this is used to increase the robustness of the anomaly detection method. If two features seem really important, an anomaly detection method might neglect the other features; runs that lack these important features force the algorithm to still consider the less important ones. But I would like to explore a slightly different question:
If you are given multiple predictions, you will see events that are anomalous according to some predictions but normal according to others. And when each model has different inputs, you might find that the models that consider a certain feature rate the event as anomalous, while the models that don't consider it rate the event as normal. In this case you could say that this input feature is the reason the event is anomalous.
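To make the idea concrete, here is a minimal sketch on toy synthetic data with only 8 features, where a few hundred models are enough. It is not the algorithm used for the images in this README: the per-model detector (a mean squared z-score over the visible features), the function names fit_ensemble, model_scores and anomaly_reason, and all parameter values are assumptions chosen for illustration. The "anomaly reason" of a feature is taken to be the mean score of the models that see the feature minus the mean score of the models that do not.

```python
import numpy as np

def fit_ensemble(X_train, n_models=200, subset_size=3, seed=0):
    """Draw random feature subsets (feature bagging) and fit a toy detector."""
    rng = np.random.default_rng(seed)
    n_features = X_train.shape[1]
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0) + 1e-9  # avoid division by zero
    # masks[m, j] is True when model m sees feature j
    masks = np.zeros((n_models, n_features), dtype=bool)
    for m in range(n_models):
        chosen = rng.choice(n_features, size=subset_size, replace=False)
        masks[m, chosen] = True
    return mean, std, masks

def model_scores(x, mean, std, masks):
    """Anomaly score of each model: mean squared z-score over its features."""
    z2 = ((x - mean) / std) ** 2
    return (masks * z2).sum(axis=1) / masks.sum(axis=1)

def anomaly_reason(x, mean, std, masks):
    """Per feature: mean score of models with it minus models without it."""
    scores = model_scores(x, mean, std, masks)
    reason = np.zeros(masks.shape[1])
    for j in range(masks.shape[1]):
        with_j = scores[masks[:, j]]
        without_j = scores[~masks[:, j]]
        if len(with_j) and len(without_j):
            reason[j] = with_j.mean() - without_j.mean()
    return reason

# Toy data: 8 independent standard-normal features are "normal".
rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 8))
mean, std, masks = fit_ensemble(X_train)

# Test event: unremarkable everywhere except feature 5.
x = np.zeros(8)
x[5] = 8.0
reason = anomaly_reason(x, mean, std, masks)
print(np.round(reason, 2))
```

In this toy setup the planted feature should receive by far the largest value. With image data the same vector could be reshaped to the image size and drawn as a heatmap.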
Your task would be to develop this into a method to analyze the reason for a given anomaly. Normally I would now include some example code, but since my trivial example needs thousands of models to output something useful, I only show two example images.
In both I train an ensemble of anomaly detection methods to differentiate MNIST data (handwritten digits). The model should consider a "7" as normal, while finding every other digit anomalous. The images shown are my favorites from the ~20 I have looked at.
The first image (example1.pdf; a PDF because it uses vector graphics) shows a slightly weird 7 (a 7 with an additional line at the top) on the left and the "anomaly reason" on the right. You see the part of the 7 that we would initially consider normal in black (low anomaly reason), but not the additional line, as this is not a usual part of a "7".
The second image (example2.pdf) shows a "2" and thus an anomaly. You can view this "2" again as a "7" with an additional line. Again you see the basic structure of the "7" represented in the image, but this time the second line is really anomalous (we cannot expect there to be a 7 with a line below, but we could imagine the test set containing another 7 with a line above). So it is found by the algorithm, and the heatmap shows the result: this image is not a "7" because it contains another line.
The biggest drawback of this algorithm is that it requires many different anomaly predictions (I used ~2000 here, which is only possible because I use an anomaly detection algorithm I came up with that is really fast). This is partially because the MNIST images used have many (784) features, and we can assume that this effect will be weaker with fewer features. You can probably still reduce the cost (the number of models) quite a lot. A better query strategy for the feature bagging, a better combination function for the resulting anomaly scores, or even a more active approach (train a model to test the current hypothesis) should help quite a lot.
On the other hand, this algorithm could also be used with fewer features (where it will be much faster), but then you could also consider relations between the features. For example, given two inputs that each lie between 0 and 1 but are always equal to each other, an event is anomalous not because of any single value, but whenever the two inputs differ.
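The two-input example can be sketched as follows. This is only an illustration under stated assumptions: the Mahalanobis distance stands in for "a model that sees both features together", the names marginal_score and joint_score are made up here, and a little noise is added to the second input so that the covariance matrix stays invertible.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two inputs in [0, 1] that are (almost) always the same.
a = rng.uniform(0.0, 1.0, size=1000)
X_train = np.stack([a, a + rng.normal(scale=0.01, size=1000)], axis=1)

mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))

def marginal_score(x):
    """Looks at each input on its own: largest absolute z-score."""
    return float(np.max(np.abs((x - mean) / std)))

def joint_score(x):
    """Looks at both inputs together: Mahalanobis distance to the mean."""
    d = x - mean
    return float(np.sqrt(d @ cov_inv @ d))

x_normal = np.array([0.3, 0.3])     # both inputs agree
x_anomalous = np.array([0.3, 0.7])  # each value is fine, the pair is not

for name, x in [("normal", x_normal), ("anomalous", x_anomalous)]:
    print(name, round(marginal_score(x), 2), round(joint_score(x), 2))
```

Looking at each input alone, both events appear unremarkable; only the score that sees the pair separates them. A reason analysis for such data would therefore have to attribute the anomaly to the pair of features rather than to a single one.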
If you have any questions, feel free to write an email to Simon.Kluettermann@cs.tu-dortmund.de