
There are a few considerations when building a health check for an application like this.

First, consider when and why you should check the health of the Kafka brokers. If your application can continue to function without Kafka, then you do not want your app to be considered down when Kafka is down. You might still want a health check in this scenario, but be careful about how you use it: for example, expose it for monitoring and alerting without letting it fail the application's overall health status.
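One way to picture this is a small aggregator in which only critical components can mark the app as down, while an optional dependency such as Kafka can at most degrade it. This is a minimal sketch with hypothetical names (`Status`, `HealthAggregator`, and the component labels are not from any framework), and it assumes your app really can run without Kafka:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Overall status reported by the application (hypothetical enum). */
enum Status { UP, DEGRADED, DOWN }

/** Collects component results; only critical components can produce DOWN. */
class HealthAggregator {
    private final Map<String, Boolean> critical = new LinkedHashMap<>();
    private final Map<String, Boolean> optional = new LinkedHashMap<>();

    void reportCritical(String name, boolean healthy) { critical.put(name, healthy); }

    void reportOptional(String name, boolean healthy) { optional.put(name, healthy); }

    Status overall() {
        // A failed critical component takes the whole app down.
        if (critical.containsValue(false)) return Status.DOWN;
        // A failed optional component (e.g. Kafka) is surfaced but not fatal.
        if (optional.containsValue(false)) return Status.DEGRADED;
        return Status.UP;
    }
}
```

With this shape, reporting Kafka through `reportOptional("kafka", false)` yields `DEGRADED` rather than `DOWN`, so a load balancer or orchestrator that only reacts to `DOWN` would keep the instance in service.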

Second, a production Kafka cluster will have more than one broker and will be set up to provide high availability. If one of the brokers goes down, your consumers and producers will most likely continue to function just fine. There may be a temporary blip with some failed requests, but for the most part I have observed production systems run just fine with a single broker down. It usually takes a significant disaster to bring an entire Kafka cluster down.

Third, the admin client itself can hit timeouts and other issues that cause an exception to be thrown from any of its methods. When that happens, it does not necessarily mean the cluster is down, so you need to take transient failures into account.

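Here is a simple sketch in Java that you could use for the health check:

    import java.util.Properties;
    import java.util.concurrent.TimeUnit;
    import org.apache.kafka.clients.admin.AdminClient;

    class KafkaHealthCheck {
        private static final int N = 3;          // successive failures tolerated before reporting down
        private final AdminClient adminClient;   // create once and reuse across checks
        private int failureCounter = 0;          // must persist between calls

        KafkaHealthCheck(Properties props) {
            adminClient = AdminClient.create(props);   // props needs at least bootstrap.servers
        }

        synchronized boolean isHealthy() {
            try {
                // describeCluster() is asynchronous, so wait on the future to surface errors
                // (the 5 second timeout is an arbitrary choice)
                adminClient.describeCluster().nodes().get(5, TimeUnit.SECONDS);
                failureCounter = 0;
                return true;    // health up
            } catch (Exception e) {   // ExecutionException, TimeoutException, InterruptedException
                // Keep track of how many times the check fails successively
                failureCounter++;
                return failureCounter < N;   // N successive failures might indicate a problem
            }
        }
    }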
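The successive-failure rule above can also be isolated from Kafka so that it is testable without a broker. The following is a minimal sketch, not a definitive implementation: `ConsecutiveFailureCheck` and its `probe` parameter are hypothetical names, and the probe is assumed to wrap whatever admin client call you choose.

```java
import java.util.concurrent.Callable;

/** Reports unhealthy only after `threshold` consecutive probe failures. */
class ConsecutiveFailureCheck {
    private final Callable<?> probe;   // hypothetical: wraps the real broker call
    private final int threshold;       // the N from the text above
    private int failures = 0;          // persists across calls

    ConsecutiveFailureCheck(Callable<?> probe, int threshold) {
        this.probe = probe;
        this.threshold = threshold;
    }

    synchronized boolean isHealthy() {
        try {
            probe.call();
            failures = 0;              // any success resets the streak
            return true;
        } catch (Exception e) {
            failures++;
            // Tolerate transient errors until the streak reaches the threshold.
            return failures < threshold;
        }
    }
}
```

In a real application the probe could be a lambda such as `() -> adminClient.describeCluster().nodes().get(5, TimeUnit.SECONDS)`, while a unit test can pass a lambda that throws on demand.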

The describeCluster() method suffices as a basic check. In addition, you might want to use the describeAcls() method to verify that specific ACLs are set, since ACL issues can prevent your producers or consumers from functioning properly even when the cluster itself is up.

I answered a similar question here: https://stackoverflow.com/a/79349717/5468867
