So, after looking into this for a few days, I was unable to get the Spark session to save my parquet file to S3 from the executors. This can probably be achieved via Structured Streaming, but that requires reading the data via the same streams (which is a bit too much refactoring for my use case).
I also tried using transformation functions like map (as suggested by Chris above), but I hit the same issue: you cannot use the Spark session inside transformations without collecting the dataset first (just like on the executors).
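To be explicit about what "collecting first" means: the rows have to be pulled back to the driver before anything that does not go through Spark can touch them. A minimal sketch, assuming a Dataset<DummyAvro> called dataset (the name is just for illustration):

// Materialize the dataset on the driver as a plain Java list.
List<DummyAvro> rows = dataset.collectAsList();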
With that I gave up on Spark and decided to implement a plain Java Parquet writer instead, which worked. It makes use of Hadoop, which I am personally not a big fan of, but other than that it turned out to work just fine.
I made use of this blog post to build a Parquet writer. Fair warning though: the way the writer is constructed in the blog is now deprecated, so you need to use a builder instead. For reference, this is how I defined my writer:
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

try (ParquetWriter<DummyAvro> parquetWriter = AvroParquetWriter.<DummyAvro>builder(new Path("s3a://bucket/key/example.parquet"))
        .withSchema(DummyAvro.getClassSchema())
        .withWriteMode(ParquetFileWriter.Mode.OVERWRITE)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .withConf(getConfig())
        .withPageSize(4 * 1024 * 1024)        // for compression
        .withRowGroupSize(16L * 1024 * 1024)
        .build()) {
    for (DummyAvro row : rows) {
        parquetWriter.write(row);
    }
}
Where rows is my list of DummyAvro records and getConfig() is a method defined as follows:
private Configuration getConfig() {
    Configuration conf = new Configuration();
    // Use the S3A filesystem implementation from hadoop-aws.
    conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
    conf.set("fs.s3a.endpoint", "s3-us-east-2.amazonaws.com");
    conf.set("fs.s3a.path.style.access", "true");
    conf.set("fs.s3a.connection.ssl.enabled", "true");
    conf.set("fs.s3a.connection.establish.timeout", "501000");
    // Credentials left blank here on purpose.
    conf.set("fs.s3a.access.key", "");
    conf.set("fs.s3a.secret.key", "");
    conf.set("fs.s3a.session.token", "");
    return conf;
}
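For completeness, a writer like this does not need the SparkSession at all, so it can also be driven from the executors, for example via foreachPartition. The sketch below is only to show the idea; dataset and writeRowsToS3 are illustrative names and not part of my actual code (writeRowsToS3 would wrap the try-with-resources block above, with the path passed in):

import org.apache.spark.TaskContext;
import org.apache.spark.api.java.function.ForeachPartitionFunction;

import java.util.ArrayList;
import java.util.List;

// Illustrative only: each partition buffers its rows and writes its own object.
dataset.foreachPartition((ForeachPartitionFunction<DummyAvro>) partition -> {
    List<DummyAvro> rows = new ArrayList<>();
    partition.forEachRemaining(rows::add);
    // Use a per-partition key so tasks do not overwrite each other's files.
    writeRowsToS3(rows, "s3a://bucket/key/part-" + TaskContext.getPartitionId() + ".parquet");
});

Dependency-wise, AvroParquetWriter comes from parquet-avro and S3AFileSystem from hadoop-aws, so both need to be on the classpath.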
This is certainly not a great way to go about doing this, but after a week of bashing my head against the wall I was all out of ideas.