Experiencing the same problem.
I noticed that with the PyTorch backend, GPU memory usage is ~10x lower, so I increased the batch size to 16x the original, and training became ~16x faster. It is now comparable to the TensorFlow backend (though GPU utilization is still low: ~3% vs. ~30% with TF).
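One possible explanation (just my guess, with made-up numbers): if each training step carries a roughly fixed per-batch overhead, small batches are dominated by that overhead, so a 16x larger batch can give close to 16x throughput. A toy back-of-the-envelope sketch, with illustrative constants rather than measured values:

```python
def epoch_time(n_samples, batch_size, overhead_s=0.015, per_sample_s=1e-6):
    """Estimated epoch time assuming a fixed per-batch overhead.

    overhead_s and per_sample_s are hypothetical numbers chosen only
    to illustrate overhead-dominated training, not real measurements.
    """
    n_batches = -(-n_samples // batch_size)  # ceil division
    return n_batches * (overhead_s + batch_size * per_sample_s)

small = epoch_time(60_000, batch_size=32)
large = epoch_time(60_000, batch_size=32 * 16)
print(f"batch 32:  {small:.2f}s/epoch (estimated)")
print(f"batch 512: {large:.2f}s/epoch (estimated)")
print(f"speedup:   {small / large:.1f}x")
```

With these assumed constants the larger batch comes out roughly an order of magnitude faster, which matches what I'm seeing, but the real per-batch overhead would need to be profiled to confirm this.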
NOTE: increasing the batch size may affect training quality; I haven't compared that yet.
I suspect batch size has different performance semantics with the PyTorch backend than the traditional Keras/TF ones. See here: https://discuss.pytorch.org/t/solved-pytorch-lstm-50x-slower-than-keras-tf-cudnnlstm/10043/8