Yes, one unit operates on one warp in lockstep, but warps can be swapped out with a context switch. Obviously there are usually many more warps than warp schedulers, so the warps also get processed one after another. In theory a GPU could group threads into warps depending on which branch they take (I don't know if any do, but it seems like a viable option; constant branches are eliminated anyway and usually have no effect with modern compilers and GPUs).
I can't really imagine a device that executes different branches of one warp at the same time. It may be possible, but as long as you have fewer schedulers than warps to process, it only half makes sense: you switch the whole warp, so you still have to wait for the longest branch to finish. Sure, latency would be lower because both branches could execute concurrently, but that still doesn't eliminate the other problem.
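To put rough numbers on that, here is a toy cost model in Python. It is only a sketch: it assumes a warp of 32 lanes that runs each side of an if/else in turn with the non-participating lanes masked off, and the function name `warp_issue_slots` and the instruction counts are made up for illustration, not taken from any real GPU.

```python
WARP_SIZE = 32  # assumed warp width for this toy model


def warp_issue_slots(takes_if, if_len, else_len):
    """Instruction slots one warp spends on an if/else in a simple masking model.

    takes_if: one bool per lane, True if that lane takes the 'if' side.
    if_len, else_len: instruction counts of the two sides (hypothetical numbers).
    """
    slots = 0
    if any(takes_if):      # at least one lane needs the 'if' side
        slots += if_len
    if not all(takes_if):  # at least one lane needs the 'else' side
        slots += else_len
    return slots


uniform = [True] * WARP_SIZE                             # every lane agrees
diverged = [lane % 2 == 0 for lane in range(WARP_SIZE)]  # lanes alternate

print(warp_issue_slots(uniform, 10, 30))   # 10: only the 'if' side is issued
print(warp_issue_slots(diverged, 10, 30))  # 40: both sides, one after the other
print(max(10, 30))                         # 30: bound set by the longest side
```

In this model, running both sides concurrently would bring the diverged case down from 40 to 30 slots at best, so the warp still waits for the longest branch.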
As long as you do the same operations in the same order in the different branches and only the data differs, it should be okay: the same instructions run for all threads, without stalls and without computing both variants.
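A small Python sketch of what I mean by "same operations, different data". The functions and constants are made up for illustration, and whether a given compiler or GPU performs this rewrite on its own is an assumption on my side; the point is only that the two forms compute the same thing.

```python
def branchy(x, flag):
    # Two code paths with the same shape: one multiply, then one add/subtract.
    if flag:
        return 2.0 * x + 1.0
    else:
        return 0.5 * x - 3.0


# The same computation with the difference moved into data:
# index 0 holds the 'else' constants, index 1 the 'if' constants.
SCALE = (0.5, 2.0)
OFFSET = (-3.0, 1.0)


def uniform(x, flag):
    i = int(flag)                     # each lane picks its operands...
    return SCALE[i] * x + OFFSET[i]   # ...then all lanes run the same multiply-add


# Eight pretend lanes with mixed flags; both versions agree on every lane.
lanes = [(float(lane), lane % 3 == 0) for lane in range(8)]
assert all(branchy(x, f) == uniform(x, f) for x, f in lanes)
```

In the second form every lane executes the identical instruction sequence, so there is nothing left to diverge on; only the operands differ per lane.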
The last thing I'm generally curious about is whether a GPU architecture can allow threads to be swapped between warps; that would open up even better possibilities for handling branches. Also, don't take my statements as the complete truth, I don't know that much either. It may be better to look at AMD, since their architecture is more open (to look at).