It looks like your controllers only have 1Gi of memory, which might not be enough. Try increasing it and see if that helps. Kafka controllers need constant inter-node communication to stay in sync. If a controller stalls or its requests are delayed (which can happen when it is short on memory), a controller that is no longer active might still think it's in charge. You could also try increasing controller.quorum.fetch.timeout.ms and controller.quorum.retry.backoff.ms in the Kafka config to allow more time for coordination.
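For reference, here is a rough sketch of what that change could look like in the controller's properties file. The values are only illustrative starting points, not recommendations from any official guide. As far as I know the defaults in recent Kafka releases are 2000 ms and 20 ms respectively, so check the docs for your version before copying anything.

```properties
# Illustrative values only (assumed, not tuned for your cluster).
# Time a voter waits without a successful fetch from the current leader
# before it becomes a candidate and triggers a new election.
controller.quorum.fetch.timeout.ms=5000

# Time to wait before retrying a failed request to a quorum node.
controller.quorum.retry.backoff.ms=100
```

Raising the fetch timeout makes the quorum more tolerant of short pauses, but it also means a genuinely dead leader takes longer to be replaced, so I'd raise it in small steps and fix the memory limit first.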