Reposting it as an answer: I found a solution to my problem in this question: I can specify an explicitly defined schema when reading the JSON data from an RDD into a DataFrame:
json_df: DataFrame = spark.read.schema(schema).json(json_rdd)
However, it seems that I'm now reading the data twice:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def _read_specific_version(json_rdd, version, schema):
    # Parse the JSON strings with the schema for this version,
    # then keep only the records that actually carry that version.
    json_df: DataFrame = spark.read.schema(schema).json(json_rdd)
    return json_df.filter(col('version') == version)

df_1_0_0 = _read_specific_version(json_rdd, '1.0.0', schema_1_0_0)
df_1_1_0 = _read_specific_version(json_rdd, '1.1.0', schema_1_1_0)
Is there a more efficient way to do this? For example, does this exploit parallel execution, or am I enforcing sequential execution here? Maybe a Spark newbie question.
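One mitigation I'm considering (a minimal sketch, assuming json_rdd is an RDD of JSON strings and that actions eventually run on both DataFrames): cache the source RDD so the two schema-specific parse passes reuse the same materialized partitions instead of recomputing the RDD's upstream lineage twice.

# Materialize the raw JSON strings once; both version-specific
# reads below should then reuse the cached partitions.
json_rdd.cache()

df_1_0_0 = _read_specific_version(json_rdd, '1.0.0', schema_1_0_0)
df_1_1_0 = _read_specific_version(json_rdd, '1.1.0', schema_1_1_0)

# ... run actions on df_1_0_0 / df_1_1_0 here ...

# Free the cached partitions once both DataFrames have been used.
json_rdd.unpersist()

As far as I understand, each DataFrame still triggers its own parse pass when an action runs, so the cache only saves recomputing whatever produced json_rdd, not the double parsing itself.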