On a dual Xeon E5-2670 (Sandy Bridge-EP, 2.6 GHz, SSE4.2, AVX, 8 cores / 16 threads per socket), Windows Server 2016, vanilla TF 1.10, your original cifar10_train.py initially had trouble:
2.7 examples/sec; 46.836 sec/batch
while utilizing 100% CPU. I guess something is wrong with detecting the number of threads/cores.
I had to add
config = tf.ConfigProto(intra_op_parallelism_threads=5, inter_op_parallelism_threads=10, allow_soft_placement=True, device_count={'CPU': 2})
session = tf.Session(config=config)
tf.keras.backend.set_session(session)
to its beginning to get at least 261.6 examples/sec; 0.489 sec/batch.
The intra_op_parallelism_threads setting was what mattered most.
Still not sure about optimal settings for that hardware configuration.
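For picking a starting point on other machines, here is a small sketch of a heuristic I'd try: derive the thread counts from the logical CPU count instead of hardcoding them. The suggest_thread_config helper and the halving rule are my own assumptions (loosely mirroring the 5/10 split that worked on the 16-thread box), not a verified recipe.

```python
import os


def suggest_thread_config(logical_cpus=None):
    """Heuristic sketch: return (intra_op, inter_op) thread counts.

    Assumption: give intra-op parallelism roughly one thread per
    physical core (logical CPUs / 2 when Hyper-Threading is on) and
    let inter-op use the full logical count. Tune from there.
    """
    if logical_cpus is None:
        logical_cpus = os.cpu_count() or 1
    intra = max(1, logical_cpus // 2)  # ~one thread per physical core
    inter = max(1, logical_cpus)       # ops that may run concurrently
    return intra, inter


# Applying it in TF 1.x would look like the snippet above, e.g.:
# import tensorflow as tf
# intra, inter = suggest_thread_config()
# config = tf.ConfigProto(intra_op_parallelism_threads=intra,
#                         inter_op_parallelism_threads=inter,
#                         allow_soft_placement=True)
# tf.keras.backend.set_session(tf.Session(config=config))
```

These are only first guesses; the best values still depend on the model and the batch size, which is why some benchmarking per machine seems unavoidable.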
On my laptop's i7-7700HQ (Kaby Lake, 2.8 GHz, AVX2, FMA3, 4 cores / 8 threads), Win 10, tf-gpu 1.10 built against CUDA 9.2 / cuDNN 7.2 with AVX support, GTX 1050 Ti mobile, it was alternating between ~5400 and ~3500 examples/sec.