
Since StackOverflow endorses answering your own question: I have the answers after deploying a cluster with horizontally scaled writes myself. There weren't any good how-to guides on this so here goes.

Brief preamble: years ago Citus deprecated shard replication, which they also call "statement-based replication" (not to be confused with statement-based replication in general). HA and fault tolerance are now achieved by (1) having a coordinator cluster and 1+ worker clusters, and (2) bringing your own cluster admin tools like Patroni. This migration solved subtle but serious deadlock scenarios that plagued the statement-based shard replication strategy.

Terminology:

coordinator cluster: 1+ nodes that are candidates for being the coordinator
- coordinator leader: the primary/leader node serving as "the coordinator" for Citus
- coordinator followers: the followers/standby nodes
worker cluster: same idea as coordinator cluster; there is a leader and 0+ followers for each Citus worker [group]

Citus' documentation often uses "the coordinator" or "a worker" to refer to the coordinator leader or to a worker group's leader. I'll try to avoid that ambiguity below.

Citus mostly deals only with the leader of each group. The main exception is the citus.use_secondary_nodes GUC. Another exception is that Citus keeps a metadata table (pg_dist_node) in which every node is tagged as leader or follower (primary or secondary, in Citus' terms). This table is used to direct DDL and DML statements to the correct node within each group/cluster. Your bring-your-own HA solution, such as Patroni, is responsible for updating this table correctly.
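The metadata table is easier to picture with a toy model. Below is a minimal Python sketch, not Citus or Patroni code: the Node class and the promote and leaders helpers are names I made up for illustration, and numbering the coordinator group 0 is an assumption of the sketch. It only shows what "keeping the table up to date" means after a failover in one group.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Node:
    """One row of a hypothetical node-metadata table."""
    name: str
    group_id: int  # assumption in this sketch: 0 is the coordinator group
    role: str      # "primary" (leader) or "secondary" (follower)


def promote(nodes, group_id, new_leader):
    """Return new metadata in which new_leader is the only primary of its group."""
    if not any(n.name == new_leader and n.group_id == group_id for n in nodes):
        raise ValueError(f"{new_leader} is not a member of group {group_id}")
    updated = []
    for n in nodes:
        if n.group_id != group_id:
            updated.append(n)  # other groups are untouched
        elif n.name == new_leader:
            updated.append(replace(n, role="primary"))
        else:
            updated.append(replace(n, role="secondary"))
    return updated


def leaders(nodes):
    """Map each group id to the name of its current leader."""
    return {n.group_id: n.name for n in nodes if n.role == "primary"}


if __name__ == "__main__":
    table = [
        Node("coord-a", 0, "primary"),
        Node("coord-b", 0, "secondary"),
        Node("w1-a", 1, "primary"),
        Node("w1-b", 1, "secondary"),
    ]
    # Simulate a failover in worker group 1: w1-b takes over from w1-a.
    print(leaders(promote(table, 1, "w1-b")))
```

If the HA tool fails over a group but never performs the equivalent of promote on the real table, statements keep being routed to the old leader, which is exactly the failure mode a "Citus-aware" tool is supposed to prevent.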

Concise Guide:

you can direct DDL statements only to the coordinator leader
you can direct DML statements to any leader node, but only to leader nodes
- you CANNOT direct DML to coordinator followers, even though DML in theory doesn't change data on the coordinators. Attempting to do so results in Postgres errors.
therefore to scale writes you need to add worker clusters
- there is only ever one coordinator leader
- therefore the only way to add leaders is adding worker clusters to get 1 new writable leader per new worker cluster
to scale reads you have two options
- option 1: citus.use_secondary_nodes = never and add more worker clusters; never means all queries are sent to the leader of each worker cluster, so scaling requires adding worker clusters
- option 2: citus.use_secondary_nodes = always and add followers to all worker clusters; always means queries are sent only to the followers (replicas) within each group
use Patroni or another solution that both supports HA and is "Citus-aware" to enable HA on the cluster
- "Citus-aware": the Citus table that tracks leader versus follower status for all nodes must be kept up to date with the nodes' actual states as your tool handles failover etc.
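The rules in this guide can be condensed into a small routing sketch. This is plain Python that models the rules as I described them, not anything Citus ships: Node, entry_points, and shard_read_targets are hypothetical names, and the cluster labels are made up.

```python
from typing import NamedTuple


class Node(NamedTuple):
    name: str
    cluster: str  # "coordinator" or a worker cluster label such as "worker-1"
    role: str     # "leader" or "follower"


def entry_points(kind, nodes):
    """Nodes a client may send a statement of the given kind to."""
    if kind == "DDL":
        # Only the coordinator leader accepts DDL.
        return [n.name for n in nodes
                if n.cluster == "coordinator" and n.role == "leader"]
    if kind == "DML":
        # Any leader (coordinator or worker), never a follower.
        return [n.name for n in nodes if n.role == "leader"]
    if kind == "SELECT":
        # Read-only queries can run on any node.
        return [n.name for n in nodes]
    raise ValueError(f"unknown statement kind: {kind}")


def shard_read_targets(nodes, use_secondary_nodes):
    """Worker nodes that serve reads for a given citus.use_secondary_nodes value."""
    if use_secondary_nodes not in ("never", "always"):
        raise ValueError("use_secondary_nodes must be 'never' or 'always'")
    wanted = "leader" if use_secondary_nodes == "never" else "follower"
    return [n.name for n in nodes
            if n.cluster != "coordinator" and n.role == wanted]


if __name__ == "__main__":
    cluster = [
        Node("coord-a", "coordinator", "leader"),
        Node("coord-b", "coordinator", "follower"),
        Node("w1-a", "worker-1", "leader"),
        Node("w1-b", "worker-1", "follower"),
        Node("w2-a", "worker-2", "leader"),
    ]
    print(entry_points("DML", cluster))
    print(shard_read_targets(cluster, "always"))
```

In this model, adding a follower to an existing cluster never lengthens the DML list; only adding a whole worker cluster (and with it a new leader) does, which is the point about scaling writes.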

Adding worker clusters to scale writes likely seems counterintuitive. There are two reasons it works this way:

again, DML statements can only be directed to leaders. Postgres itself denies DML on followers, even coordinator followers
since version ~10 (years ago) Citus has had a "queries on any node" feature where you can send statements to any worker and they get rerouted to the correct worker and shard automatically
- "any node" is not quite correct though; DML can only be sent to leaders. Only SELECT queries can truly be run on any node
