[-------------- normalize @ torchvision==0.15.0a0+b1f6c9e ---------------]
                                          |       v1       |       v2     
1 threads: ---------------------------------------------------------------
      (3, 400, 400) / float32 / cpu       |  128 (+-  1)   |   93 (+-  2) 
      (3, 400, 400) / float32 / cuda      |   91 (+-  0)   |   54 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  3528 (+- 26)  |  2507 (+-  9)
      (16, 3, 400, 400) / float32 / cuda  |  764 (+-  2)   |  501 (+-  1) 
6 threads: ---------------------------------------------------------------
      (3, 400, 400) / float32 / cpu       |   54 (+-  0)   |   36 (+-  0) 
      (16, 3, 400, 400) / float32 / cpu   |  381 (+-  3)   |  289 (+-  3) 

Times are in microseconds (us).

Aggregated performance change of v2 vs. v1: -30.1% (improvement)
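
The per-configuration changes behind the summary line can be checked by hand. Below is a minimal Python sketch, assuming the rounded means printed in the table and the usual (v2 - v1) / v1 definition of relative change. The names `rows` and `relative_change` are illustrative only. The aggregation behind the -30.1% figure is not stated here, so the sketch reports per-row changes and does not attempt to reproduce that number.

```python
# Mean times in microseconds, copied from the table above: (label, v1, v2).
rows = [
    ("1 thread  / (3, 400, 400) / cpu", 128, 93),
    ("1 thread  / (3, 400, 400) / cuda", 91, 54),
    ("1 thread  / (16, 3, 400, 400) / cpu", 3528, 2507),
    ("1 thread  / (16, 3, 400, 400) / cuda", 764, 501),
    ("6 threads / (3, 400, 400) / cpu", 54, 36),
    ("6 threads / (16, 3, 400, 400) / cpu", 381, 289),
]


def relative_change(v1, v2):
    """Percentage change of v2 relative to v1; negative means v2 is faster."""
    return (v2 - v1) / v1 * 100


# One relative change per benchmarked configuration, in table order.
changes = [relative_change(v1, v2) for _, v1, v2 in rows]

for (label, v1, v2), change in zip(rows, changes):
    print(f"{label:<38} {v1:>5} -> {v2:>5} us  {change:+.1f}%")
```

Because the inputs are the rounded values shown in the table, the per-row percentages are approximate.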