After some more debugging, I think the issue in Code 2 is with the WHERE clause:
idx NOT IN (SELECT idx FROM cte)
Code 2 works if I change the WHERE clause to:
idx > (SELECT MAX(idx) FROM cte)
I think this answers the intent of the original question, though I'm somewhat unsatisfied with this because I would expect both WHERE clauses to work equivalently in this case.
Additionally, it seems that Code 2 is slower than Code 1. This is also counter intuitive to me because Code 1 will have a recursive iteration for each row on the ref table, whereas Code 2 should have a recursive iteration for each sequence of rows filtered to have a cumulative sum of val greater than or equal to 30.
The original ref dataset is so small that the speed difference is irrelevant, you can replace ref:
ref <- data.table(
idx=1:10000
,val=sample(1:32,10000,replace=T)
)
Given the above, I suppose there are 3 questions I have now.
idx NOT IN (SELECT idx FROM cte)
not work?Thank you kindly!
P.S. G. Grothendieck - thank you for your response. I agree that it is much faster in R than in SQL. I was looking for a DuckDB specific solution, though I think your solution would be helpful for other looking for a more general solution. Thank you!