If you can spare the memory, you can use your function to generate a 256-entry lookup array at compile time. When you need to do the rearrange job at run time, just look the result up in the array. In the best case (table already in cache) this cuts the whole operation down to a single load, on the order of one clock cycle.
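Here is a minimal sketch of the idea in C++14 or later. I don't know what your actual function does, so `rearrange` below is a hypothetical stand-in that just reverses the bits of a byte; `Table`, `make_table`, `kTable` and `rearrange_fast` are also names I made up for illustration. Swap in your own function, as long as it can be written as `constexpr`.

```cpp
#include <cstdint>

// Hypothetical stand-in for "your function": reverses the bit order of a byte.
// Replace the body with your real rearrangement (it must be constexpr-friendly).
constexpr std::uint8_t rearrange(std::uint8_t x) {
    std::uint8_t r = 0;
    for (int i = 0; i < 8; ++i) {
        // Shift the result left and append bit i of x at the bottom.
        r = static_cast<std::uint8_t>((r << 1) | ((x >> i) & 1u));
    }
    return r;
}

// Plain wrapper so the 256-byte array can be returned from a constexpr function.
struct Table {
    std::uint8_t v[256];
};

// Runs the slow function once for every possible byte value.
constexpr Table make_table() {
    Table t{};
    for (int i = 0; i < 256; ++i) {
        t.v[i] = rearrange(static_cast<std::uint8_t>(i));
    }
    return t;
}

// Because kTable is constexpr, the table is built by the compiler, not at run time.
constexpr Table kTable = make_table();

// Compile-time check that the table really was evaluated during compilation.
static_assert(kTable.v[0x01] == 0x80, "table must be built at compile time");

// Run-time version: one array lookup instead of the bit loop.
inline std::uint8_t rearrange_fast(std::uint8_t x) {
    return kTable.v[x];
}
```

At run time you only call `rearrange_fast(b)`. The table costs 256 bytes, which fits comfortably in L1 cache, but the first access after it has been evicted will be slower than the best case.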