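A batch size of 1 looks like the problem to me. You are feeding one data point at a time, so every weight update comes from a single-example gradient. Those updates have high variance, which makes convergence difficult and unstable. Averaging the gradient over a larger mini-batch reduces that noise.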
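To make the variance point concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the toy data (y = 2x plus noise), the one-parameter squared-error model, the batch sizes, and the helper names `per_example_grad` and `batch_grad_variance`. It just measures how much the mini-batch gradient estimate fluctuates for batch size 1 versus 32.

```python
import random
import statistics

def per_example_grad(w, x, y):
    # Gradient of 0.5 * (w * x - y) ** 2 with respect to w.
    return (w * x - y) * x

def batch_grad_variance(data, w, batch_size, trials, rng):
    # Variance of the mini-batch gradient estimate across random batches.
    estimates = []
    for _ in range(trials):
        batch = rng.choices(data, k=batch_size)  # sample with replacement
        grads = [per_example_grad(w, x, y) for x, y in batch]
        estimates.append(sum(grads) / batch_size)
    return statistics.pvariance(estimates)

rng = random.Random(0)

# Hypothetical toy data: y = 2x plus Gaussian noise.
data = []
for _ in range(1000):
    x = rng.uniform(-1.0, 1.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))

var_bs1 = batch_grad_variance(data, w=0.0, batch_size=1, trials=2000, rng=rng)
var_bs32 = batch_grad_variance(data, w=0.0, batch_size=32, trials=2000, rng=rng)
print(f"gradient variance, batch_size=1:  {var_bs1:.4f}")
print(f"gradient variance, batch_size=32: {var_bs32:.4f}")
```

For independently sampled examples, the variance of the averaged gradient shrinks roughly in proportion to 1 / batch_size, so the second number should come out much smaller than the first.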
Also, try scaling the gradients before clipping them.
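One way to read "scaling before clipping" is: first rescale the whole gradient so its global L2 norm stays under a threshold (this keeps the direction), then clip individual values as a last safety net. The sketch below assumes that reading. The function name `scale_then_clip` and the thresholds `max_norm` and `clip_value` are hypothetical, not taken from any library.

```python
import math

def scale_then_clip(grads, max_norm, clip_value):
    # Step 1: rescale by the global L2 norm so the direction is preserved.
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > max_norm:
        factor = max_norm / norm
        grads = [g * factor for g in grads]
    # Step 2: element-wise clipping on the already-scaled gradients.
    return [max(-clip_value, min(clip_value, g)) for g in grads]

# Hypothetical gradient with norm 5: scaled to norm 1, then clipped at 0.7.
clipped = scale_then_clip([3.0, 4.0], max_norm=1.0, clip_value=0.7)
print(clipped)
```

Doing it in this order means the element-wise clip only touches the few components that are still large after rescaling, instead of distorting the direction of every large gradient.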