Hi, my responses are below.
1- During prediction, every conv layer you take predictions from outputs bounding boxes and class scores. You then apply NMS (non-maximum suppression) to the combined set to get the final bounding box predictions.
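2- The final detection layer is nothing but a convolution layer with 4+n_class+1 filters per predicted box: 4 box coordinates plus n_class+1 class scores, where the extra one is typically the background class.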
3- Scale you can take care of in two ways: either scale the ground truth box down to the resolution of the feature map, or apply deconvolution to upsample the feature map.
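
The three points above can be sketched in plain Python. This is only a minimal illustration, not the code of any particular detector: the function names (`iou`, `nms`, `head_filters`, `scale_box`), the (x1, y1, x2, y2) box format, the 0.5 IoU threshold and the `boxes_per_location` parameter are all assumptions made for the example. In practice NMS is usually run separately for each class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS (point 1): return indices of kept boxes, best score first.

    `boxes` and `scores` are the predictions gathered from all layers.
    The threshold value is an assumed default, not taken from the answer.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


def head_filters(n_class, boxes_per_location=1):
    """Filter count of the detection conv layer (point 2).

    4 box coordinates + n_class scores + 1 extra score, repeated for each
    box predicted at a location (hypothetical parameter, defaults to 1).
    """
    return boxes_per_location * (4 + n_class + 1)


def scale_box(box, stride):
    """Scale a ground truth box from image pixels down to feature-map
    cells (first option in point 3). `stride` is the assumed total
    downsampling factor of the layer."""
    return tuple(c / stride for c in box)


boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 150, 150)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))                  # [0, 2]
print(head_filters(20))                    # 25
print(scale_box((32, 64, 96, 128), 16))    # (2.0, 4.0, 6.0, 8.0)
```

Here the second box overlaps the first heavily (IoU of about 0.82), so NMS drops it and keeps the higher-scoring one along with the non-overlapping third box.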