I encountered this question about a year ago, and Daniel Lee Alessandrini's answer helped me a lot. I improved his function by casting the timestamp columns to doubles, which gives millisecond precision (xxxxxxx.yyy) in my use cases, and it works fine in my environment (PySpark 3.4.1 and Pandas 2.1.3).
However, I haven't tried using nanosecond precision because my environment is still using Iceberg 1.3.1. For reference, see: Iceberg Specification - Primitive Types.
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T


def convert_to_pandas(spark_df):
    """
    Safely convert a Spark DataFrame to pandas.
    Ref: https://stackoverflow.com/questions/76072664
    """
    # Iterate over the schema and collect every timestamp column
    timestamp_cols = []
    for column in spark_df.schema:
        if (column.dataType == T.TimestampType()) or (column.dataType == T.TimestampNTZType()):
            # Remember the column name so it can be restored after conversion
            timestamp_cols.append(column.name)
            # Cast the timestamp to a double (epoch seconds with a fractional part)
            spark_df = spark_df.withColumn(column.name, F.col(column.name).cast("double"))
    # Convert to a pandas DataFrame and rebuild the timestamp columns
    pandas_df = spark_df.toPandas()
    for column_header in timestamp_cols:
        pandas_df[column_header] = pd.to_datetime(pandas_df[column_header], unit="s")
    return pandas_df
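For completeness, here is a minimal usage sketch; the SparkSession setup, column names, and sample timestamps below are made up purely for illustration:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: one ordinary string column and one timestamp column
sample_df = spark.createDataFrame(
    [("a", "2023-05-01 12:34:56.789"), ("b", "2023-05-02 01:02:03.004")],
    ["id", "event_time"],
).withColumn("event_time", F.to_timestamp("event_time"))

pdf = convert_to_pandas(sample_df)
print(pdf.dtypes)  # event_time comes back as datetime64[ns], with the millisecond part preserved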