Polynomials, and more generally parameter-linear models of the form y = w·g(x), are linear in their parameters. That means learning/optimization reduces to solving a set of linear equations for w, since y and x become measured data. Through g(x) you define the nonlinear mapping from the features x to y. If x is not too long (len(x) < 100), you can do well with up to a 4th-order full polynomial, or you can define whatever nonlinearities are intrinsic to the system. Levenberg-Marquardt and Conjugate Gradients both work great here. But everything matters: the choice of x, of g(x), the sampling, and so on. I recently found cases where 1st-, 2nd-, and 3rd-order polynomials did not work, but the 4th-order one converged nicely. Even then, successful training does not mean the model truly learned the proper causal relations... it still needs proper testing (and perhaps pruning).
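
As a minimal sketch of the "linear in parameters" point (using a plain least-squares solve rather than LM/CG, and illustrative helper names that aren't from any particular library): once you fix the feature map g(x), fitting w is just an ordinary linear solve, and you can compare polynomial orders directly.

```python
import numpy as np

# Sketch: parameter-linear fit y = w . g(x). The nonlinearity lives entirely
# in the feature map g(x); solving for w is a *linear* least-squares problem.
# poly_features / fit_parameter_linear are illustrative names, not a library API.

def poly_features(x, degree):
    """Map scalar inputs x to polynomial features [1, x, x^2, ..., x^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)

def fit_parameter_linear(x, y, degree):
    """Solve min_w ||G w - y||^2 where G[i] = g(x[i]); linear in w."""
    G = poly_features(x, degree)
    w, *_ = np.linalg.lstsq(G, y, rcond=None)
    return w

# Toy data with a quartic relation, so low-order polynomials can't capture it.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=200)
y = 0.5 * x**4 - x**2 + 0.1 * rng.normal(size=x.size)

for degree in (1, 2, 3, 4):
    w = fit_parameter_linear(x, y, degree)
    resid = np.linalg.norm(poly_features(x, degree) @ w - y)
    print(f"degree {degree}: residual {resid:.3f}")  # degree 4 fits far better
```

For a multivariate x you'd replace poly_features with the full cross-term expansion (or your own system-specific nonlinearities), but the w-solve stays linear either way.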