79447737

Date: 2025-02-18 09:37:02
Score: 4
Natty:
Report link

I am setting up a CDC pipeline that uses Kafka and Debezium to capture changes on the source table (MySQL). I then use PySpark to clean the data and insert it into the destination table. Suppose I need to stop the PySpark script for 2 days; when I turn it back on, it no longer picks up the data that arrived during those 2 days. Is there a solution for this issue?

# Read the Debezium change events for the MySQL table as a stream.
# With startingOffsets = "latest" and no checkpoint, every fresh start
# jumps to the newest offsets, so records produced while the job was
# stopped are never read.
# Kafka consumer properties must carry the "kafka." prefix; note that
# Spark tracks its progress via checkpoints, not the consumer group id.
df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", topic) \
    .option("startingOffsets", "latest") \
    .option("kafka.group.id", consumer_group_id) \
    .load()

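From the Structured Streaming guide I understand that startingOffsets only applies on the very first run, and that setting a checkpointLocation on the sink lets restarts resume from the last committed offsets. Below is a minimal sketch of what I think the restart-safe version looks like; the broker address, topic name, output path, and checkpoint path are hypothetical placeholders:

# Minimal sketch (assumed approach): enable checkpointing so a restart
# resumes from the last committed Kafka offsets instead of skipping ahead.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cdc-cleaning").getOrCreate()

kafka_bootstrap_servers = "localhost:9092"    # hypothetical broker
topic = "dbserver1.inventory.orders"          # hypothetical Debezium topic

df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", kafka_bootstrap_servers) \
    .option("subscribe", topic) \
    .option("startingOffsets", "earliest") \
    .load()
# "earliest" only matters on the first run; on later restarts Spark ignores
# it and resumes from the offsets stored in the checkpoint directory.

# ... cleaning transformations on df would go here ...

query = df.writeStream \
    .format("parquet") \
    .option("path", "/data/cdc/clean") \
    .option("checkpointLocation", "/data/cdc/checkpoints") \
    .start()
query.awaitTermination()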

Reasons:
  • Blacklisted phrase (3): Is there a solution
  • Blacklisted phrase (0.5): I need
  • Long answer (-0.5):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (1):
Posted by: Hảo Nguyễn Thị Phương