The Undeniable Beauty of Cross EntropyBy Eric Antoine Scuccimarra
When I began working on this project my intention was to do multi-class classification of the images. To this end I built my graph with logits and a cross-entropy loss function. I soon realized that the decision to do multi-class classification was quite ambitious, and scaled back to doing binary classification into positive and negative. My goal was to implement the multi-class approach once I had the binary approach working reasonably well, so I left the cross entropy in place.
Over the months I have been working on this I have realized that, for many reasons, the multi-class classification was a bad idea. For an academic project it might have made sense, but for any sort of real world use case it made none. There is really no use to outputting a simple classification for something as important as detecting cancer. A much more useful output is the probabilities that each area of the image contains an abnormality as this could aid a radiologist in diagnosing abnormalities rather than completely replacing her. Yet for some reason I never bothered to change the output or the loss function.
The limiting factor on the size of the model has been the GPU memory of the Google Cloud instance I am training this monstrosity on. So I've been trying to optimize the model to run within the RAM constraints and train in a reasonable amount of time. Mostly this has involved trying to keep the number of parameters to a minimum, but today I was looking at the model and realized that the logits were definitely not helping the situation.
For this problem classification was absolutely the wrong approach. We aren't trying to classify the content of the image, we are trying to detect abnormalities. The negative class was not really a separate class, but the absence of any abnormalities, and the graph and the loss function should reflect this. In order to coerce the logits into an output that reflected the reality just described, I put the logits through a softmax and then discarded the negative probability - as I said the negative class doesn't really exist. However the cross entropy function does not know this and it places equal importance on the imaginary negative class as on the positive class (subject to the cross entropy weighting of course.) This means that the gradients placed equal weight on trying to find imaginary "normal" patterns, despite the fact that this information is discarded and never used.
So I reduced the logits layer to one unit, replaced the softmax activation with a sigmoid activation, and replaced the cross entropy with binary cross entropy. And the change has been more impactful than I imagined it would be. The model immediately began performing better than the same model with the logits/cross entropy structure. It seems obvious that this would be the case as now the model can focus on detecting abnormalities rather than wasting half of it's efforts on trying to detect normal patterns.
I am not sure why I waited so long to make this change and my best guess is that I was seduced by the undeniable elegance of the cross entropy loss function. For multi-class classification it is truly a thing of beauty, and I may have been blinded by that into attempting to use it in a situation it was not designed for.