I recommend starting with a larger pre-trained model. Since you've already experimented with a model in the 6.7B range, you can go with something that has more parameters. It should handle complex prompts more robustly, even though it isn't fully tuned to your exact needs.