Now let's focus a bit more on the trivial model. In it, I compare just the angular part to 0 (its mean), and as you see on the left side, the distribution for tops is way more complicated (logarithmic color coding!). Since comparing to zero amounts to approximating this radius, tops are clearly classifiable this way (a minimal sketch of this score is given below). Compare this to the distribution in #p_t#: basically no preference, and it even switches depending on the displayed particle.

This is related to another problem: if you trained a working autoencoder not on QCD data but on top data, it would still consider tops more complicated. You can see this best in AUC maps. These show the AUC as a function of the particle id and the current feature:
blue color = QCD data is simpler
red color = top data is simpler
white color = no preference
A perfectly working network would be dark blue if trained on QCD and dark red if trained on top. You can subtract those maps; here, more different = more red, and there is basically no difference in the angular data (a sketch of how such a map could be computed is also given below).

You have the same problem of adding d-distributions as you have in the scaling case, so you could ask yourself if adding something to the angular data actually helps. Comparing the angular-only data to the full data, you see that it in fact hurts the AUC (even though just a bit). This effectively means my current network does not use #p_t# at all. But again, this does not mean that there is no information in #p_t#; in fact, you see in these AUC maps that the #p_t# part is actually red where it should be red and blue where it should be blue.

So how about using only #p_t#? You obviously lose quality. Also, training an autoencoder to get a high AUC in #p_t# is not yet trivial: multiplicative scaling does not really work, and the best network reaches an AUC of about #0.78#, which is about the same as QCDorWhat gets for minimally mass-decorrelated networks.

Benefits and problems: you basically split your training into a network with a good AUC and one that (hopefully) learns non-trivial stuff. So maybe you could do the same with some different preprocessing, one that does not just give you trivial information.

Easiest transformation: no transformation (4-vectors), so energy, #p_1#, #p_2#, #p_3#. Trained on QCD, but it prefers top! Why is that so? Maybe just a bad network: the compare metrics (which define the distance in TopK) basically require the network to learn the meaning of #phi# and #eta# itself, so without them there is no concept of locality, meaning no useful graph.

Add a dense network in front of the TopK (sketched below): better, but still not good. Run the TopK still on the preprocessed data: good, but numerical problems require going down to 4 particles and less training data. It shows the same good reconstruction in #p_1# and #p_2#, which makes sense since #Eq(p_t**2, p_1**2 + p_2**2)#, but apparently energy and #p_3# prefer tops.
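To make the trivial model concrete, here is a minimal sketch of the "compare the angular part to 0" score and its AUC. The array layout and index positions are assumptions (jets as an (n_jets, n_particles, n_features) array with #eta# and #phi# relative to the jet axis), not the actual code used for the plots.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def trivial_angular_score(jets, eta_idx=1, phi_idx=2):
    """Trivial-model anomaly score: squared distance of the angular part
    from 0 (its mean), i.e. the summed radius eta^2 + phi^2 per jet.
    Assumes jets has shape (n_jets, n_particles, n_features)."""
    ang = jets[:, :, [eta_idx, phi_idx]]
    return np.sum(ang ** 2, axis=(1, 2))

# hypothetical usage with QCD and top constituent arrays:
# scores = np.concatenate([trivial_angular_score(qcd), trivial_angular_score(top)])
# labels = np.concatenate([np.zeros(len(qcd)), np.ones(len(top))])
# print(roc_auc_score(labels, scores))
```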
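The AUC maps could be produced along these lines. Again a sketch under assumed shapes: per-entry reconstruction errors as (n_jets, n_particles, n_features) arrays for each class, with top labelled 1 and QCD labelled 0; the function and variable names are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_map(err_qcd, err_top):
    """AUC as a function of particle id and feature, using each entry's
    reconstruction error as the anomaly score.  0.5 means no preference
    (white in the maps); values away from 0.5 mean one class has
    systematically larger error (the blue/red coding)."""
    labels = np.concatenate([np.zeros(len(err_qcd)), np.ones(len(err_top))])
    errs = np.concatenate([err_qcd, err_top])   # (n_jets, n_particles, n_features)
    n_p, n_f = errs.shape[1], errs.shape[2]
    amap = np.zeros((n_p, n_f))
    for i in range(n_p):
        for j in range(n_f):
            amap[i, j] = roc_auc_score(labels, errs[:, i, j])
    return amap

# difference map ("more different = more red") as a subtraction of two such maps:
# diff = auc_map_trained_on_top - auc_map_trained_on_qcd
```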
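The "dense network in front of the TopK" idea can be sketched as follows. This is a hypothetical PyTorch module with a generic k-nearest-neighbour selection standing in for the actual TopK layer; the point is only that the graph is built in learned coordinates instead of in #eta#-#phi#.

```python
import torch
import torch.nn as nn

class CoordsBeforeTopK(nn.Module):
    """Sketch: a small dense network maps raw 4-vectors (E, p1, p2, p3) to
    learned coordinates, and the kNN/TopK graph is built in that learned
    space, so the network does not have to rediscover eta/phi locality."""
    def __init__(self, n_in=4, n_coords=2, k=4):
        super().__init__()
        self.k = k
        self.coord_net = nn.Sequential(
            nn.Linear(n_in, 32), nn.ReLU(),
            nn.Linear(32, n_coords),
        )

    def forward(self, x):                    # x: (batch, n_particles, 4)
        coords = self.coord_net(x)           # learned "angular-like" coordinates
        d = torch.cdist(coords, coords)      # pairwise distances in learned space
        knn_idx = d.topk(self.k, largest=False).indices  # k nearest (includes self)
        return coords, knn_idx               # downstream graph layers use knn_idx
```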