79510948

Date: 2025-03-15 10:25:46
Score: 1

I came across this incomplete ticket on the same topic: https://issues.apache.org/jira/browse/SPARK-24932. The PR was closed without merging, and the excerpt below is from its discussion. What I understand from it is that the behavior for joins is not as straightforward as it is for aggregations. Unlike aggregations, joins mostly create new rows, so append mode is the simpler solution, which I suppose is why it was supported first. You will not see the problem below for a stream-static join, because the data from the other table is static, so there is no situation where data arrives late for the join. Hope this helps.

Hi @attilapiros , before going on to add test cases (actually I am doing this right now), I think there is one more thing that needs to be figured out first.

As described in this PR, I want to support stream-stream joins in update mode by letting them behave exactly the same as in append mode. This is totally fine for inner joins, but not so straightforward for outer joins.

For example:

Assuming watermark delay is set to 10 minutes, we run a query like A left outer join B, while event A1 comes at 10:01 and event B1 comes at 10:02. In append mode, of course, A1 will wait for B1 to produce a join result A1-B1 at 10:02.

However, in update mode, we can either keep this behavior OR take the following actions, which also look reasonable but are somewhat costly:

  1. Emit an A1-null at 10:01.

  2. Emit an A1-B1 at 10:02 when B1 appears, and expect the data sink to overwrite the previous result.

So which is the correct behavior: the same as in append mode, or the approach above? Please let me know your opinion.
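
To make the two options in the excerpt concrete, here is a toy simulation in plain Python. It is not Spark code and does not use any Spark API; the function names (`append_like`, `update_like`, `upsert_sink`) and the join key `k1` are made up for illustration. It also ignores the watermark, so the case where an unmatched A row is eventually emitted with a null is left out. Treat it as a rough sketch of the two emission strategies, not as how Spark implements them.

```python
# Toy model only: plain Python, NOT Spark. All names are hypothetical.
# Each event is (arrival time, side "A" or "B", join key, label).
EVENTS = [
    ("10:01", "A", "k1", "A1"),
    ("10:02", "B", "k1", "B1"),
]


def append_like(events):
    """Emit a joined row only once both sides have arrived."""
    left, right, out = {}, {}, []
    for time, side, key, label in events:
        if side == "A":
            left[key] = label
        else:
            right[key] = label
        if key in left and key in right:
            out.append((time, left[key], right[key]))
    return out


def update_like(events):
    """Emit A-null as soon as A arrives, then A-B when B arrives."""
    left, right, out = {}, {}, []
    for time, side, key, label in events:
        if side == "A":
            left[key] = label
            # Provisional row: the right side is None if B is not here yet.
            out.append((time, label, right.get(key)))
        else:
            right[key] = label
            if key in left:
                # Corrected row that is meant to replace the provisional one.
                out.append((time, left[key], label))
    return out


def upsert_sink(emissions):
    """A sink that overwrites earlier rows with the same left-side label."""
    table = {}
    for _time, left_label, right_label in emissions:
        table[left_label] = right_label
    return table


if __name__ == "__main__":
    print(append_like(EVENTS))               # one row, emitted at 10:02
    print(update_like(EVENTS))               # two rows, at 10:01 and 10:02
    print(upsert_sink(update_like(EVENTS)))  # final sink state
```

In this toy model both strategies leave an overwriting sink in the same final state, but the second one emits an extra row and only works if the sink can overwrite earlier results, which is the cost the excerpt mentions.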

Reasons:
  • Whitelisted phrase (-1): Hope this helps
  • RegEx Blacklisted phrase (2.5): Please let me know your
  • RegEx Blacklisted phrase (1): I want
  • Long answer (-1):
  • Has code block (-0.5):
Posted by: Vindhya G