I ended up implementing an RPY ball spring: https://github.com/RobotLocomotion/drake/compare/master...krish-suresh:drake:ball_spring which runs faster than the linear bushing + ball constraints, though I'm not sure exactly why.