This feels an awful lot like a memory leak, but in my experience that's not always the case. I'll throw out a few things to try based on the configurations that you provided. I think the key culprits here, though, may be related to one (or more) of: the state changelog, checkpoint retention, or possibly some other configuration.
I'll provide a few suggestions (feel free to try one or more):
Disabling State Changelog
This is a recommendation that could well explain part of the situation, as the changelog state backend can add additional memory pressure (it effectively buffers state changes before persisting them, as mentioned in this older blog post). I'd suspect that using it in conjunction with RocksDB could compound memory utilization:
state.backend.changelog.enabled: false
Limiting Checkpoint Retention
Your job currently has checkpoint retention enabled, which is fine; however, you may want to limit how many retained checkpoints are kept around (otherwise things can balloon) and make sure they are cleaned up so that too many don't stick around:
state.checkpoints.num-retained: 5
execution.checkpointing.externalized-checkpoint-retention: DELETE_ON_CANCELLATION
Gather Metrics
One thing that I would highly suggest as you monitor the job and these changes is to expose and watch some of the built-in memory/JVM metrics that Flink provides out of the box. Exporting these to something like Prometheus and visualizing them (e.g., in Grafana) would allow you to easily keep an eye on the JobManager, the changelog, checkpointing, etc.
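For example, assuming the flink-metrics-prometheus jar is available to your deployment, wiring up the reporter is only a couple of configuration lines (the reporter name prom and the port range here are just arbitrary choices for illustration):

# flink-conf.yaml: expose Flink's built-in metrics for Prometheus to scrape
metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 9249-9260

Each JobManager/TaskManager process will then serve its metrics on the first free port in that range.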
Definitely check out:
Any/all memory-related metrics (or the JobManager metrics in general, too)
Any/all changelog-related metrics
Checkpointing sizes/durations
RocksDB metrics (these need to be enabled separately; see the sketch below)
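For the RocksDB ones specifically, there is a family of state.backend.rocksdb.metrics.* options you can switch on (they're disabled by default since each adds a little overhead). A few that tend to be useful for memory questions, as a sketch:

# Surface a few of RocksDB's native metrics through Flink's metric system
state.backend.rocksdb.metrics.block-cache-usage: true
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.num-running-compactions: true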
Questions
As far as your questions go -- I'll try to give a few possible explanations:
Please help me understand why would checkpointing consume such large buffers gradually? Even then why aren't they getting released?
tl;dr: there are a lot more moving pieces to the puzzle when checkpointing is enabled, and they can increase memory pressure (even gradually) in a system where data is constantly flowing.
So there are a lot of things at play when checkpointing is enabled (versus why things are rainbows and butterflies when it's disabled). Checkpointing brings RocksDB into the picture, which has a native memory impact with each checkpoint; this, in conjunction with the changelog, can put quite a bit more pressure on the changelog-related memory segments as well.
Many of these things can stick around much longer than expected if data is continually flowing into the system at scale, and they may require tuning. RocksDB, for example, does a decent job of cleaning up during its compaction process; however, if the job is busy 24/7, it may never get the opportunity to do so, especially with all of the checkpointing operations and state constantly being touched.
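If RocksDB's native memory is what keeps creeping up, one thing worth double-checking (a sketch; these are the defaults in recent Flink versions, so it's only relevant if they've been changed) is that RocksDB stays inside Flink's managed memory budget rather than sizing itself independently:

# Bound RocksDB's block cache and memtables by Flink's managed memory
state.backend.rocksdb.memory.managed: true
# Fraction of TaskManager memory reserved as managed memory (RocksDB lives here)
taskmanager.memory.managed.fraction: 0.4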
What exactly is getting stored in this buffer memory by checkpoint co-ordinator?
Obviously there are things like plain direct memory (e.g., Netty for the network stack), checkpoint buffers for your filesystem, tons of RocksDB-related allocations (e.g., cache, tables, etc.), and the changelog has its own set of buffered content as well.
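If it helps to map those consumers onto the pools Flink actually sizes, these are the main off-heap knobs involved (the values here are just the usual defaults, shown for illustration rather than as recommendations):

# Network stack (Netty) buffers
taskmanager.memory.network.fraction: 0.1
# Direct/off-heap memory for the framework and for user/task code
taskmanager.memory.framework.off-heap.size: 128m
taskmanager.memory.task.off-heap.size: 0m
# Native headroom for everything else (thread stacks, GC, code cache, ...)
taskmanager.memory.jvm-overhead.fraction: 0.1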
How can i handle this issue or apply tuning so that this wouldn't occur? What can be my next steps of action to try out and resolve this?
I'm combining these two since this is already way too long, but hopefully some of the configurations that I provided above can help relieve the issue. Checkpointing and OOM-type errors can be really nasty to troubleshoot, even when you know the ins and outs of a given job, but I'll keep my fingers crossed for you.
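One last thing to experiment with as a next step (a sketch, assuming you're on a version that supports the execution.checkpointing.* options) is pacing the checkpoints themselves, so that less checkpoint work overlaps and the job gets some breathing room between checkpoints:

# Space checkpoints out and guarantee a pause between them
execution.checkpointing.interval: 5min
execution.checkpointing.min-pause: 2min
execution.checkpointing.max-concurrent-checkpoints: 1

The exact values are workload-dependent; the point is just to make sure checkpoints aren't constantly running back-to-back.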