Thanks for your input!
Last week we implemented the following scenario:
We have a 'raw' table that ingests the data, with a retention (TTL) of 3 days.
We have a 'HashIndex' table per raw table: a single-column table that stores hashes computed with the hash_sha256() function over a few uniquely identifying columns, joined with "|":
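In ADX terms, the 3-day TTL is a retention policy on the raw table. A minimal sketch (the table name 'Raw' is a placeholder, not our actual name):

```kusto
// Soft-delete raw records after 3 days; adjust recoverability as needed.
.alter-merge table Raw policy retention softdelete = 3d recoverability = disabled
```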
hash_sha256(strcat(Column1, "|", Column2, "|", Column3))
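For readers outside Kusto, the same composite-key hashing can be sketched in Python (the column values are placeholders). The "|" separator matters: it keeps adjacent columns from running together and colliding:

```python
import hashlib

def row_hash(*columns: str) -> str:
    """SHA-256 over the unique columns joined with '|',
    mirroring hash_sha256(strcat(Column1, "|", Column2, ...))."""
    key = "|".join(columns)
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Without the separator, ("ab", "c") and ("a", "bc") would both hash "abc";
# with it, the keys "ab|c" and "a|bc" stay distinct.
print(row_hash("ab", "c") == row_hash("a", "bc"))  # False
```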
We have a 'deduplicated' table that stores unique records.
Whenever data is ingested into the 'raw' table, an update policy runs: it computes a hash over the unique columns and checks the 'HashIndex' table to see whether the hash is already present. If it is not, the record is ingested into the 'deduplicated' table; otherwise nothing happens.
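As a rough sketch, the query function behind that first update policy could look like this (all table and column names here are assumptions, not our exact schema):

```kusto
// Function bound to the update policy on Raw: keep only rows whose hash
// is not yet present in HashIndex.
.create-or-alter function DeduplicateRaw() {
    Raw
    | extend Hash = hash_sha256(strcat(Column1, "|", Column2, "|", Column3))
    | join kind=leftanti (HashIndex | project Hash) on Hash
}
```

When bound to an update policy, the query runs only over the newly ingested extents of Raw, so each batch is checked against the index rather than the full history.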
When a record is successfully ingested into the 'deduplicated' table, a second update policy runs and adds the hash value to the 'HashIndex' table, so that we hopefully do not get many false positives.
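The second update policy, triggered on the deduplicated table, would then project just the hash into the index table. Again a sketch with assumed names:

```kusto
// Function bound to the update policy on Deduplicated: record the hash so
// future ingestions of the same record are filtered out.
.create-or-alter function IndexHashes() {
    Deduplicated
    | project Hash = hash_sha256(strcat(Column1, "|", Column2, "|", Column3))
}

// Attach it to HashIndex (transactional, so the hash and the record
// either both land or both fail).
.alter table HashIndex policy update
@'[{"IsEnabled": true, "Source": "Deduplicated", "Query": "IndexHashes()", "IsTransactional": true}]'
```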
We have chosen not to check against the deduplicated table itself on every ingestion, because that would put a lot of load on our main table: ingestion happens very frequently, and we also have a frontend that queries the deduplicated table.
Hopefully this approach will work and the two update policies will play well together, in particular updating the HashIndex table promptly. Our initial tests look fine, but we still have to stress test.
If it doesn't hold up, we will have to change the approach and try to replace the two update policies with a single one.
Kind regards!