Dear Necip,
Your understanding in the first paragraph is correct. And yes, a smaller kernel size yields better results (at least in this post), so one could simply use a per-pixel ANN architecture instead. That is not always the case, though: my team works in this area, and we have developed a CNN model that outperforms an ANN because it accounts for the geographical context. As for UNET, I generally prefer it for high-resolution image segmentation and tasks such as building footprint extraction, which are not possible using spectral characteristics alone. This article is mainly meant to get remote sensing folks with basic programming and ML knowledge started with CNNs, so the objective was not to present the best classifier but to demonstrate a basic CNN workflow from scratch.
As for matching the content with the title, I am seriously considering updating it so that the post reaches the right audience.
Thank you for your input!