79388540

Date: 2025-01-26 12:50:33
Score: 1.5
Natty:
Report link

After some more debugging, I think the issue in Code 2 is with the WHERE clause:

idx NOT IN (SELECT idx FROM cte)

Code 2 works if I change the WHERE clause to:

idx > (SELECT MAX(idx) FROM cte)

I think this answers the intent of the original question, though I'm somewhat unsatisfied with this because I would expect both WHERE clauses to work equivalently in this case.

Additionally, it seems that Code 2 is slower than Code 1. This is also counter intuitive to me because Code 1 will have a recursive iteration for each row on the ref table, whereas Code 2 should have a recursive iteration for each sequence of rows filtered to have a cumulative sum of val greater than or equal to 30.

The original ref dataset is so small that the speed difference is irrelevant, you can replace ref:

ref <- data.table(
  idx=1:10000
  ,val=sample(1:32,10000,replace=T)
)

Given the above, I suppose there are 3 questions I have now.

  1. Why does idx NOT IN (SELECT idx FROM cte) not work?
  2. What makes the Code 2 approach slower than the Code 1 approach?
  3. Is there a faster way to implemented this in SQL DuckDB?

Thank you kindly!

P.S. G. Grothendieck - thank you for your response. I agree that it is much faster in R than in SQL. I was looking for a DuckDB specific solution, though I think your solution would be helpful for other looking for a more general solution. Thank you!

Reasons:
  • Blacklisted phrase (0.5): Thank you
  • Blacklisted phrase (0.5): thank you
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Self-answer (0.5):
  • Low reputation (1):
Posted by: ohwaitwhatohok