For anyone who comes across this article, I'd recommend Hpk FFT if it's supported on your platform. It's a new library with better performance and accuracy than the other options out there.
They also did a great job of minimizing the overhead of the Python bindings, which makes it a good fit for small and large problems alike. Here's a quick timing example:
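from timeit import default_timer as timer

import hpk
import numpy as np

# Factory for single-precision complex-to-complex FFTs
factory = hpk.fft.makeFactoryCC(np.float32)
# In-place 256x256 transform applied to a batch of 1000 arrays
fft = factory.makeInplace([256, 256], 1000)
# Input batch: the leading axis is the batch, matching the plan above
a = np.ones([1000, 256, 256], dtype=np.complex64)

start = timer()
fft.forward(a)  # overwrites `a` with its forward FFT
end = timer()
print("Time to execute", (end - start))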
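To judge the timing on your own machine, it helps to have a baseline for the same problem shape. The sketch below is not part of Hpk; it only uses NumPy's `np.fft.fft2`, and the helper name `time_numpy_fft2` and its parameters are my own choices for illustration. It does one small warm-up call and reports the best of a few runs, since a single measurement can be noisy.

```python
from timeit import default_timer as timer

import numpy as np


def time_numpy_fft2(batch=1000, n=256, repeats=3):
    """Return (best_seconds, result) for a batched 2D FFT using NumPy.

    Hypothetical helper for comparison purposes only.
    """
    a = np.ones([batch, n, n], dtype=np.complex64)
    # Warm-up on a single slice so one-time setup cost is not timed
    np.fft.fft2(a[:1], axes=(-2, -1))
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = timer()
        # Transform the last two axes; the leading axis is the batch
        result = np.fft.fft2(a, axes=(-2, -1))
        end = timer()
        best = min(best, end - start)
    return best, result
```

Calling `best, out = time_numpy_fft2()` uses the same shape as the example above; expect it to need roughly 1 GB of memory or more, because NumPy returns a new array instead of working in place. For an all-ones input, every transform should have the value 256 * 256 in its `[0, 0]` bin and zeros elsewhere, which is a quick sanity check for either library's output.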