Wonderful explanation. One thing I would like your perspective on: if this method works well enough, in what situations should we consider using an R-CNN (region-based) approach instead? Here is what I think.
R-CNN would help in cases where we need to identify where in the image a label occurs, not just whether it is present. Since it produces a separate detection for each object, we could possibly also count how many times each label occurs in the image.
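To make the counting idea concrete, here is a rough sketch in plain Python. It is not tied to any particular R-CNN library: the detection format (a dict with `label`, `score`, and `box`), the helper names `count_labels` and `boxes_for`, the sample values, and the 0.5 score threshold are all assumptions for illustration only.

```python
from collections import Counter

def count_labels(detections, score_threshold=0.5):
    """Count detections per label, ignoring low-confidence ones."""
    return Counter(
        det["label"] for det in detections if det["score"] >= score_threshold
    )

def boxes_for(detections, label, score_threshold=0.5):
    """Return the bounding boxes of confident detections for one label."""
    return [
        det["box"]
        for det in detections
        if det["label"] == label and det["score"] >= score_threshold
    ]

# Hypothetical detector output for a single image.
# Each box is assumed to be (x_min, y_min, x_max, y_max) in pixels.
detections = [
    {"label": "dog", "score": 0.92, "box": (10, 20, 110, 200)},
    {"label": "dog", "score": 0.81, "box": (150, 30, 260, 210)},
    {"label": "cat", "score": 0.40, "box": (300, 50, 380, 160)},
    {"label": "person", "score": 0.77, "box": (5, 5, 90, 300)},
]

counts = count_labels(detections)
dog_boxes = boxes_for(detections, "dog")
print(counts)
print(dog_boxes)
```

A plain image classifier would only tell us that "dog" is present; with per-object detections we get both the locations (`dog_boxes`) and the count (`counts["dog"]`).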
Would love to hear your perspective on this.