Reports

After some more debugging, I think the issue in Code 2 is with the WHERE clause:

idx NOT IN (SELECT idx FROM cte)

Code 2 works if I change the WHERE clause to:

idx > (SELECT MAX(idx) FROM cte)

I think this answers the intent of the original question, though I'm somewhat unsatisfied with this because I would expect both WHERE clauses to work equivalently in this case.

Additionally, it seems that Code 2 is slower than Code 1. This is also counter intuitive to me because Code 1 will have a recursive iteration for each row on the ref table, whereas Code 2 should have a recursive iteration for each sequence of rows filtered to have a cumulative sum of val greater than or equal to 30.

The original ref dataset is so small that the speed difference is irrelevant, you can replace ref:

ref <- data.table(
  idx=1:10000
  ,val=sample(1:32,10000,replace=T)
)

Given the above, I suppose there are 3 questions I have now.

Why does idx NOT IN (SELECT idx FROM cte) not work?
What makes the Code 2 approach slower than the Code 1 approach?
Is there a faster way to implemented this in SQL DuckDB?

Thank you kindly!

P.S. G. Grothendieck - thank you for your response. I agree that it is much faster in R than in SQL. I was looking for a DuckDB specific solution, though I think your solution would be helpful for other looking for a more general solution. Thank you!

79388540