79781199

Date: 2025-10-02 18:26:53
Score: 1.5
Natty:
Report link

I have consistently been able to reproduce cases where a query that uses OR operators in its WHERE clause, with the OR conditions depending on columns from multiple joined tables, runs orders of magnitude slower than the same query rewritten with UNION ALL. The OR version ran for at least 40 minutes, while the UNION ALL version finished in under a minute. In every example the underlying tables held several hundred million rows (charge transactions in healthcare encounter-related tables). In the execution plans of both queries, the UNION ALL version looked more complex, with many more parallel tasks, yet it was the faster one; the OR version looked simpler, with fewer parallel tasks, yet it performed worse. The benefits of the Spark engine are realized when your data tasks can be parallelized as much as possible, helped by data skipping over the Parquet files underneath the Delta tables.
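
To make the rewrite concrete, here is a minimal sketch of the pattern. It is not my production query: the table and column names (`charges`, `encounters`, `providers`, `dept`, `specialty`) are hypothetical, and it uses Python's built-in `sqlite3` only so that it runs anywhere. It demonstrates that the two forms return the same rows, not the performance difference, which I only observed on Spark with Delta tables. Note the extra predicate in the second branch: without it, a row that matches both conditions would come back twice from UNION ALL.

```python
import sqlite3

# Hypothetical toy schema standing in for the real Delta tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE charges (charge_id INTEGER, encounter_id INTEGER, provider_id INTEGER);
    CREATE TABLE encounters (encounter_id INTEGER PRIMARY KEY, dept TEXT);
    CREATE TABLE providers (provider_id INTEGER PRIMARY KEY, specialty TEXT);

    INSERT INTO charges VALUES (1, 10, 100), (2, 11, 101), (3, 12, 100), (4, 13, 102);
    INSERT INTO encounters VALUES (10, 'ER'), (11, 'ICU'), (12, 'ER'), (13, 'ICU');
    INSERT INTO providers VALUES (100, 'cardiology'), (101, 'cardiology'), (102, 'oncology');
""")

# Original shape: one query, OR across columns of two different joined tables.
OR_QUERY = """
    SELECT c.charge_id
    FROM charges c
    LEFT JOIN encounters e ON e.encounter_id = c.encounter_id
    LEFT JOIN providers  p ON p.provider_id  = c.provider_id
    WHERE e.dept = 'ER' OR p.specialty = 'cardiology'
"""

# Rewritten shape: one branch per OR condition, each with a simple filter.
# The second branch excludes rows already returned by the first branch,
# so UNION ALL does not produce duplicates.
UNION_ALL_QUERY = """
    SELECT c.charge_id
    FROM charges c
    JOIN encounters e ON e.encounter_id = c.encounter_id
    WHERE e.dept = 'ER'

    UNION ALL

    SELECT c.charge_id
    FROM charges c
    JOIN providers p ON p.provider_id = c.provider_id
    LEFT JOIN encounters e ON e.encounter_id = c.encounter_id
    WHERE p.specialty = 'cardiology'
      AND (e.dept IS NULL OR e.dept <> 'ER')
"""

or_rows = sorted(row[0] for row in conn.execute(OR_QUERY))
union_rows = sorted(row[0] for row in conn.execute(UNION_ALL_QUERY))

print(or_rows)
print(union_rows)
```

Both queries return the same charge IDs here. The sketch assumes `encounter_id` and `provider_id` are unique in their lookup tables; if the joins can fan out, the two forms need a closer look before you swap one for the other. On Spark you can compare the two physical plans with `EXPLAIN` to see whether the rewrite actually changes the join and scan strategy for your data.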

Reasons:
  • Long answer (-0.5):
  • No code block (0.5):
  • Single line (0.5):
  • Low reputation (1):
Posted by: Nelson A