Thanks to chrslg: I removed the division by n in the derivatives. After changing the parametres for the learning rate and max iteration, the algorithm is much faster and also optimizes b. Using more data than only the small toy data set was helpful. And thanks Dr. Snoopy: It was helpful to plot the loss for each epoch.