79352610

Date: 2025-01-13 15:11:02
Score: 1.5
Natty:
Report link

I think using fixed-size windows with unbounded sources isn't ideal for this scenario, as you've discovered. The problem is that your secondary source's infrequent updates are lost whenever they don't fall within a window that also contains events from the main source. Simply upsampling the secondary source won't fix this fundamentally; it will just create many redundant copies of the same BigQuery data, increasing processing load without improving accuracy.

You can try using keyed windows based on a key that your main and secondary sources have in common. This should be the identifier you join on, and the Pub/Sub messages from both sources need to include it. If a BigQuery table update affects multiple records, the secondary source message should include all of the relevant keys.

Then use a global window for the secondary source. This means the secondary source's data will persist until it is explicitly cleared.
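
To make the intended behaviour concrete, here is a rough plain-Python simulation of the semantics, not actual pipeline code. It assumes each message carries a join key and an event timestamp. The names Event, join_with_latest and the "main"/"secondary" labels are made up for illustration. The dictionary plays the role of the globally windowed secondary data: the latest value per key stays available no matter how long ago it arrived.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional, Tuple


@dataclass(frozen=True)
class Event:
    source: str       # "main" or "secondary" (hypothetical labels)
    key: str          # the common join key carried by both sources
    timestamp: float  # event time in seconds
    payload: Any


def join_with_latest(
    events: Iterable[Event],
) -> Iterator[Tuple[Event, Optional[Any]]]:
    """Pair each main event with the latest secondary payload for its key."""
    # Stands in for the global window: entries persist until overwritten
    # or explicitly removed, regardless of how old they are.
    latest: Dict[str, Any] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.source == "secondary":
            latest[event.key] = event.payload
        else:
            # None means no secondary update has been seen for this key yet.
            yield event, latest.get(event.key)


events = [
    Event("secondary", "user-1", 0.0, {"tier": "gold"}),
    Event("main", "user-1", 5.0, "click"),
    # An hour later, far outside any short fixed window around the update.
    Event("main", "user-1", 3600.0, "purchase"),
    Event("main", "user-2", 3601.0, "click"),
]
results = [(e.payload, side) for e, side in join_with_latest(events)]
```

In this sketch the "purchase" event still picks up the secondary update from an hour earlier, which is exactly what a fixed window would have dropped. In a real pipeline you would also need to decide how and when stale entries get cleared.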

Also, I figured this article might be helpful to you.

Reasons:
  • Blacklisted phrase (1): this article
  • Long answer (-0.5):
  • No code block (0.5):
  • Low reputation (0.5):
Posted by: jggp1094