Duplicating the partition key as a clustering column is technically valid in CQL (a column can't appear twice in the primary key, so in practice you add a second column carrying the same value), but it usually doesn't buy you much and can introduce unnecessary overhead.
A few points to consider:
The partition key determines data placement in Cassandra (which node(s) a row lives on).
The clustering key determines row ordering within the partition.
If you duplicate the partition key as a clustering column, every row in the partition will have the same value for that column. It adds no real ordering value, and any query that filters on it must already be bound by the partition key anyway (see the sketch just below).
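To make that concrete, here's a minimal sketch. The keyspace `ks`, table `readings`, and column names are all hypothetical, not from your schema; the "duplicate" is a second column (`sensor_id_copy`) holding the same value as the partition key:

```scala
import com.datastax.oss.driver.api.core.CqlSession

object DuplicatedKeySketch extends App {
  // Assumes a local Cassandra node and an existing keyspace "ks".
  val session = CqlSession.builder().withKeyspace("ks").build()

  // sensor_id_copy holds the same value as the partition key sensor_id.
  // Inside any one partition it is constant, so rows are effectively
  // ordered by ts alone -- the extra clustering column contributes nothing.
  session.execute(
    """CREATE TABLE IF NOT EXISTS readings (
      |  sensor_id      text,
      |  sensor_id_copy text,
      |  ts             timestamp,
      |  value          double,
      |  PRIMARY KEY ((sensor_id), sensor_id_copy, ts)
      |)""".stripMargin)

  session.close()
}
```

Any SELECT that could restrict on sensor_id_copy must already supply sensor_id (the partition key), so the restriction is redundant.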
A SASI index on the duplicated clustering column won't make cross-partition search efficient either: like all Cassandra secondary indexes, SASI indexes are node-local, so a query that isn't restricted by partition key has to fan out across the cluster.
To search across partitions by some other column, you need a secondary index (clustering columns won't do it) or, better, a separate lookup/index table, a common C* data modeling pattern (sketched below).
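Here's a hedged sketch of the lookup-table pattern, reusing the hypothetical `readings` table from above and assuming you want to find readings by region:

```scala
import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

object LookupTableSketch extends App {
  val session = CqlSession.builder().withKeyspace("ks").build()

  // Index table: partitioned by the attribute you want to search on,
  // mapping it back to the base table's partition keys.
  session.execute(
    """CREATE TABLE IF NOT EXISTS sensors_by_region (
      |  region    text,
      |  sensor_id text,
      |  PRIMARY KEY ((region), sensor_id)
      |)""".stripMargin)

  // The application writes to both tables on every insert.
  session.execute(
    "INSERT INTO sensors_by_region (region, sensor_id) VALUES ('eu-west', 's-42')")

  // Two-step read: resolve partition keys first, then hit the base table
  // one partition at a time (use bound statements in real code).
  val sensorIds = session
    .execute("SELECT sensor_id FROM sensors_by_region WHERE region = 'eu-west'")
    .all().asScala.map(_.getString("sensor_id"))

  sensorIds.foreach { id =>
    session.execute(s"SELECT * FROM readings WHERE sensor_id = '$id'")
  }

  session.close()
}
```

Each read in the second step is a single-partition query, which is exactly the access pattern Cassandra is built for.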
For Spark workloads, it’s normal to scan multiple partitions:
The Spark-Cassandra Connector is designed to push down partition key restrictions if you provide them.
If you don’t, it will parallelize the scan across nodes automatically.
So in practice you don't need to "duplicate" keys for Spark: if your jobs are supposed to span multiple partitions, Spark already handles that efficiently (see the sketch below).
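A minimal sketch of both cases with the connector's DataFrame API, using the same hypothetical keyspace/table names and assuming the spark-cassandra-connector package is on the classpath:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SparkScanSketch extends App {
  val spark = SparkSession.builder()
    .appName("cassandra-scan-sketch")
    .config("spark.cassandra.connection.host", "127.0.0.1")
    .getOrCreate()

  val readings = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> "readings"))
    .load()

  // Partition key predicate: the connector pushes this down to Cassandra,
  // so only the matching partition is read.
  readings.filter(col("sensor_id") === "s-42").show()

  // No key predicate: the connector splits the full token range into
  // Spark partitions and scans them in parallel across the executors.
  readings.filter(col("value") > 100.0).show()

  spark.stop()
}
```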
Pro: You could argue that duplicating the key makes the schema "symmetric" and lets certain queries be written uniformly.
Con: You waste storage, you risk confusion, and you don’t actually improve queryability across partitions.