When I enqueued many (say, 1000) kernels, I noticed that each enqueue operation took longer and longer. Calling clFinish(queue) periodically, rather than only once at the end, gave a ~15% increase in overall speed.
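A minimal sketch of the periodic-sync pattern, written without OpenCL headers so it compiles on its own. The callback types, `enqueue_in_batches`, and the counting stubs are hypothetical names for illustration only. In a real program the enqueue callback would wrap `clEnqueueNDRangeKernel` and the finish callback would wrap `clFinish(queue)`. The batch size is an assumption and would need to be tuned by measurement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback types (not part of OpenCL). In real code,
   enqueue_fn would wrap clEnqueueNDRangeKernel and finish_fn would
   wrap clFinish(queue). A nonzero return value means failure. */
typedef int (*enqueue_fn)(size_t index, void *user);
typedef int (*finish_fn)(void *user);

/* Enqueue `total` commands, draining the queue after every `batch`
   of them so the number of pending commands stays bounded.
   Returns 0 on success or the first nonzero code from a callback. */
int enqueue_in_batches(size_t total, size_t batch,
                       enqueue_fn enqueue, finish_fn finish, void *user)
{
    assert(batch > 0);
    for (size_t i = 0; i < total; ++i) {
        int err = enqueue(i, user);
        if (err != 0)
            return err;
        /* Synchronize once a full batch has been submitted. */
        if ((i + 1) % batch == 0) {
            err = finish(user);
            if (err != 0)
                return err;
        }
    }
    /* Drain a trailing partial batch, if there is one. */
    if (total % batch != 0)
        return finish(user);
    return 0;
}

/* Stand-in callbacks that only count calls, so the control flow can
   be checked without an OpenCL device. */
struct counters {
    size_t enqueued;
    size_t finished;
};

int count_enqueue(size_t index, void *user)
{
    (void)index;
    ((struct counters *)user)->enqueued++;
    return 0;
}

int count_finish(void *user)
{
    ((struct counters *)user)->finished++;
    return 0;
}
```

With 1000 commands and an assumed batch size of 100, the finish callback runs 10 times. Whether a smaller or larger batch helps more depends on the driver and the kernels, so it is worth timing a few values.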