To split each dataset row into two in Spark, use flatMap: it turns one input row into two output rows in a single pass, which is much faster than splitting the data and merging it back together later. Load your data, write a small function that splits a row in half, and let flatMap flatten the results; then convert the result back to a DataFrame for further use.
Here is the code snippet:
from pyspark.sql import SparkSession

# Start a Spark session
spark = SparkSession.builder.appName("SplitRows").getOrCreate()

# Sample data: each row holds one comma-separated string
data = [("a,b,c,d,e,f,g,h",), ("1,2,3,4,5,6,7,8,7,9",)]
df = spark.createDataFrame(data, ["value"])
df.show()
def split_row(row):
    # Split the string on commas and cut the field list at its midpoint;
    # for an odd number of fields the extra one goes to the second half
    parts = row.value.split(',')
    midpoint = len(parts) // 2
    # Return two one-column tuples: the first half and the second half
    return [(",".join(parts[:midpoint]),), (",".join(parts[midpoint:]),)]
# flatMap emits both halves of every row in a single pass
split_rdd = df.rdd.flatMap(split_row)

# Convert back to a DataFrame, keeping the original column name
result_df = spark.createDataFrame(split_rdd, ["value"])
result_df.show()
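For the sample data, the eight-field row splits into two four-field halves and the ten-field row into two five-field halves, so result_df.show() should print something like:

+---------+
|    value|
+---------+
|  a,b,c,d|
|  e,f,g,h|
|1,2,3,4,5|
|6,7,8,7,9|
+---------+

If you would rather avoid the round-trip through the RDD layer, the same split can be expressed with built-in column functions instead of flatMap. This is just a minimal sketch of that variant, assuming Spark 3.1+ (where slice accepts column arguments); the variable names arr, mid, halves, and alt_df are illustrative:

from pyspark.sql import functions as F

# Split each string into an array of fields and compute the midpoint
arr = F.split(F.col("value"), ",")
mid = (F.size(arr) / 2).cast("int")

# Build a two-element array holding both halves, then explode it into rows
halves = F.array(
    F.array_join(F.slice(arr, F.lit(1), mid), ","),
    F.array_join(F.slice(arr, mid + 1, F.size(arr) - mid), ","),
)
alt_df = df.select(F.explode(halves).alias("value"))
alt_df.show()

Staying in the DataFrame API keeps the whole job inside Catalyst's optimizer, but the flatMap version above is simpler to read and works on any Spark version.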