Following @aMike's tip, I moved the code inside a proc so that TCL could optimize the bytecode. This improved performance significantly: it now runs in twice the time of the Python version.
TCL 8.6
array for: 154s
array while: 152s
dict for: 197s
dict while: 200s
list for: 363s
list while: 364s
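
For anyone wondering what "inside a proc" looks like in practice, here is a minimal sketch. It is not the benchmark code itself: the proc name `sum_array` and the workload (filling an array and summing it) are made up for illustration. As far as I understand it, variables inside a proc are compiled as locals, which the bytecode engine can reach faster than global variables used at the top level of a script.

```tcl
# Hypothetical workload (not the original benchmark): fill an array
# with the integers 0..n-1, then sum them.
proc sum_array {n} {
    # Inside a proc, i, data and total are local variables, so the
    # compiled bytecode can access them without a global name lookup.
    for {set i 0} {$i < $n} {incr i} {
        set data($i) $i
    }
    set total 0
    for {set i 0} {$i < $n} {incr i} {
        incr total $data($i)
    }
    return $total
}

# time runs the script the given number of times and reports the
# average microseconds per iteration.
puts [time {sum_array 100000} 5]
```

Running the same two loops directly at the top level of the script, with global variables, and wrapping that in `time` as well should give a rough idea of the difference on your own machine.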