79550790

Date: 2025-04-02 13:40:27
Score: 1

Reposting my comment as an answer: I found a solution to my problem in this question. I can specify an explicitly defined schema when reading the JSON data from an RDD into a DataFrame:

    json_df: DataFrame = spark.read.schema(schema).json(json_rdd)
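
For context, schema here is an ordinary PySpark StructType. A minimal sketch of what it might look like (the version field is taken from the filter below; the payload field is purely an illustrative assumption, not from the original post):

    from pyspark.sql.types import StructType, StructField, StringType

    # Hypothetical schema for version 1.0.0 records; 'payload' is an assumed field
    schema_1_0_0 = StructType([
        StructField('version', StringType(), nullable=False),
        StructField('payload', StringType(), nullable=True),
    ])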

However, it seems that I'm now reading the data twice:

    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col

    def _read_specific_version(json_rdd, version, schema):
        # Parse the RDD with the given schema, then keep only that version's rows
        json_df: DataFrame = spark.read.schema(schema).json(json_rdd)
        return json_df.filter(col('version') == version)

    df_1_0_0 = _read_specific_version(json_rdd, '1.0.0', schema_1_0_0)
    df_1_1_0 = _read_specific_version(json_rdd, '1.1.0', schema_1_1_0)

Is there a more efficient way to do this? For example, does this exploit parallel execution, or am I enforcing sequential execution here? This may be a Spark newbie question.
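
For illustration, one possible mitigation (a sketch, not a confirmed fix): if json_rdd is otherwise recomputed from its source for each read, persisting it should let the second spark.read reuse the already-materialized strings instead of recomputing them:

    # Sketch: persist the RDD so both spark.read calls scan the cached
    # partitions instead of recomputing json_rdd's lineage from the source.
    json_rdd.cache()

    df_1_0_0 = _read_specific_version(json_rdd, '1.0.0', schema_1_0_0)
    df_1_1_0 = _read_specific_version(json_rdd, '1.1.0', schema_1_1_0)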

Reasons:
  • Long answer (-0.5)
  • Has code block (-0.5)
  • Contains question mark (0.5)
  • Self-answer (0.5)
  • Low reputation (1)
Posted by: pmaier-bhs