Yeah, I'm not an expert, but I've been going through the same issues lately and might have some leads for you.

  1. I would not recommend using multiple stateful streaming operations in the same query, at least not while you are debugging your code.

See: https://spark.apache.org/docs/3.5.1/structured-streaming-programming-guide.html#unsupported-operations

"Chaining multiple stateful operations on streaming Datasets is not supported with Update and Complete mode."

I know it doesn't say that it is not supported with "Append" mode, but better safe than sorry. It's also easier to check and debug your code if you split it into smaller steps; if everything works, you can still try to combine it again afterwards. There's a sketch of such a split below.
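
For illustration, here's a minimal PySpark sketch of that split, with one stateful operator per query and an intermediate table in between (the table, column, and checkpoint names are all made up):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Query 1: a single stateful operator (a windowed aggregation),
# materialized into an intermediate table.
stage1 = (
    spark.readStream.table("events")
    .withWatermark("event_time", "10 minutes")
    .groupBy(F.window("event_time", "5 minutes"), "user_id")
    .agg(F.count("*").alias("event_count"))
)

q1 = (
    stage1.writeStream
    .option("checkpointLocation", "/tmp/checkpoints/stage1")
    .outputMode("append")
    .toTable("stage1")
)

# Query 2: apply the next stateful operation on top of the
# materialized table, so each query can be debugged in isolation.
stage2_input = spark.readStream.table("stage1")
```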

  2. It's not enough to just add the watermark to your dataframes; you also have to "use" the watermarked column (usually the event-time column) in your join conditions and groupBy clauses.

See the code example here for joins: https://spark.apache.org/docs/3.5.1/structured-streaming-programming-guide.html#inner-joins-with-optional-watermarking

When talking about joins, note that there are different requirements for the various join types (inner joins are easier than outer joins, etc.), as sketched below.
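
As a rough sketch of what "using" the watermarked event-time columns in an inner join condition looks like, loosely following the docs example (table and column names are hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

impressions = (
    spark.readStream.table("impressions")
    .withWatermark("impression_time", "2 hours")
)
clicks = (
    spark.readStream.table("clicks")
    .withWatermark("click_time", "3 hours")
)

# The watermarked event-time columns appear in the join condition
# as a time-range constraint, so Spark can bound the join state.
joined = impressions.join(
    clicks,
    F.expr("""
        click_ad_id = impression_ad_id AND
        click_time >= impression_time AND
        click_time <= impression_time + interval 1 hour
    """),
    "inner",
)
```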

And check the example here for a watermark/window aggregation: https://spark.apache.org/docs/3.5.1/structured-streaming-programming-guide.html#handling-late-data-and-watermarking

By the way, you don't necessarily have to use a window in your groupBy clause; you can also group by the event-time column itself (the one you used in the watermark). But you must go for one of the two options, or your watermark has zero effect in that aggregation.
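
A minimal sketch of the two options (all names hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

watermarked = (
    spark.readStream.table("events")
    .withWatermark("event_time", "10 minutes")
)

# Option A: group by a window over the watermarked event-time column.
by_window = watermarked.groupBy(
    F.window("event_time", "5 minutes"), "device_id"
).count()

# Option B: group by the watermarked event-time column itself.
by_event_time = watermarked.groupBy("event_time", "device_id").count()
```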

An effective way of checking whether your watermarks work in groupBy cases is to start an incremental DLT pipeline refresh (not a full refresh!) and look at the number of processed/written records. It should only process the streaming increment, not the total number of records. If you see the same (or a growing) number of records in each DLT update, you are doing something wrong: DLT is then automatically switching to "complete" mode, which you usually don't want in a streaming workload.
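
For reference, a correctly watermarked streaming aggregation in a DLT pipeline might look roughly like this (table and column names are made up; `spark` is provided by the DLT runtime):

```python
import dlt
from pyspark.sql import functions as F

@dlt.table
def hourly_counts():
    return (
        spark.readStream.table("events")
        .withWatermark("event_time", "10 minutes")
        .groupBy(F.window("event_time", "1 hour"), "user_id")
        .count()
    )
```

On an incremental refresh, the update for hourly_counts should then report only the newly arrived records, not the full history of "events".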
