Do you have an implementation of the recipe you describe that you can share? Maybe your current implementation can be optimized. Or would it be an option to run it in parallel to speed up your processing?