Reposting it as an answer: I found a solution to my problem in this question: I can specify an explicitly defined schema when reading the JSON data from an RDD into a DataFrame:
json_df: DataFrame = spark.read.schema(schema).json(json_rdd)
However, it seems that I'm now reading the data twice:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def _read_specific_version(json_rdd, version, schema):
    # Parse the JSON strings with the schema for this version,
    # then keep only the records that actually carry that version.
    json_df: DataFrame = spark.read.schema(schema).json(json_rdd)
    return json_df.filter(col('version') == version)

df_1_0_0 = _read_specific_version(json_rdd, '1.0.0', schema_1_0_0)
df_1_1_0 = _read_specific_version(json_rdd, '1.1.0', schema_1_1_0)
Is there a more efficient way to do this? For example, does this exploit parallel execution, or am I enforcing sequential execution here? Maybe a Spark newbie question.
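One mitigation I'm considering (a minimal sketch, assuming json_rdd is an RDD of JSON strings and that actions eventually run on both DataFrames): cache the source RDD so the two schema-specific parse passes reuse the same materialized partitions instead of recomputing the RDD's upstream lineage twice.

# Materialize the raw JSON strings once; both version-specific
# reads below should then reuse the cached partitions.
json_rdd.cache()

df_1_0_0 = _read_specific_version(json_rdd, '1.0.0', schema_1_0_0)
df_1_1_0 = _read_specific_version(json_rdd, '1.1.0', schema_1_1_0)

# ... run actions on df_1_0_0 / df_1_1_0 here ...

# Free the cached partitions once both DataFrames have been used.
json_rdd.unpersist()

As far as I understand, each DataFrame still triggers its own parse pass when an action runs, so the cache only saves recomputing whatever produced json_rdd, not the double parsing itself.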