Regarding your overarching problem: as far as I know, it is in principle better to split an AD computation into multiple components, although one has to weigh that against the overhead of passing quantities between components. I'm still looking for more references and trying to prove it. Have you gained a better understanding of the problem in the meantime? That would help me a lot.