I ended up building the missing PySpark functionality for arbitrary stateful functions myself, at first using Delta tables to keep the state information. This worked fine, but to get down to sub-second processing for our use cases we ended up with a more production-ready version that used a Redis cache in the background.
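
The general pattern can be sketched roughly as follows. This is not the actual implementation, only a minimal illustration under some assumptions: state lives in an external key-value store, each micro-batch is grouped by key, and a user-supplied function folds the new events into the previous state. `InMemoryStateStore`, `update_running_count`, and `process_batch` are hypothetical names invented for this sketch; the in-memory dict stands in for the Delta table or Redis cache so the example runs without Spark or Redis installed.

```python
import json
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Optional


class InMemoryStateStore:
    """Hypothetical stand-in for the external store (Delta table or Redis)."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[dict]:
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

    def put(self, key: str, state: dict) -> None:
        # Serialise to JSON, as one would before writing to a key-value cache.
        self._data[key] = json.dumps(state)


def update_running_count(state: Optional[dict], events: List[dict]) -> dict:
    """Example state function (an assumption): running count and sum per key."""
    state = state or {"count": 0, "total": 0}
    for event in events:
        state["count"] += 1
        state["total"] += event["value"]
    return state


def process_batch(
    rows: Iterable[dict],
    store: InMemoryStateStore,
    update_fn: Callable[[Optional[dict], List[dict]], dict] = update_running_count,
) -> Dict[str, dict]:
    """Fold one micro-batch of rows into the stored per-key state."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for row in rows:
        grouped[row["key"]].append(row)

    results: Dict[str, dict] = {}
    for key, events in grouped.items():
        new_state = update_fn(store.get(key), events)  # read old state, apply events
        store.put(key, new_state)                      # persist for the next batch
        results[key] = new_state
    return results


# Two consecutive micro-batches; state for key "a" carries over between them.
store = InMemoryStateStore()
process_batch(
    [{"key": "a", "value": 1}, {"key": "a", "value": 2}, {"key": "b", "value": 5}],
    store,
)
second = process_batch([{"key": "a", "value": 10}], store)
print(second)  # {'a': {'count': 3, 'total': 13}}
```

In a real streaming job, something like `process_batch` would presumably be called from the function passed to `foreachBatch`, with the store backed by Redis `get`/`set` calls instead of a dict. Keeping the store behind a small `get`/`put` interface is what makes it possible to swap Delta tables for a faster cache without touching the state function.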