Dealing with mixed date formats in large DataFrames is tricky because pd.to_datetime() with a single format won’t handle row-by-row variations, and dateutil.parser.parse() can be slow. You can combine fast vectorized parsing, row-level fallback, and validation. Here’s one approach:
```python
import pandas as pd
from datetime import datetime

# Example DataFrame
df = pd.DataFrame({
    'date': ['01/15/2023', '2023-02-20', '03/25/23 14:30:00', '2023/13/45', 'invalid']
})

# List of expected formats, tried in order
formats = [
    "%m/%d/%Y",
    "%Y-%m-%d",
    "%m/%d/%y %H:%M:%S",
    "%m/%d/%y",
]

def parse_date_safe(x):
    """Try multiple formats; return NaT if none match or the date is invalid."""
    for fmt in formats:
        try:
            dt = datetime.strptime(x, fmt)
        except (ValueError, TypeError):
            continue
        # Optional: validate business logic (e.g., a plausible year range)
        if 2000 <= dt.year <= 2030:
            return dt
    return pd.NaT

# Fast, vectorized first pass; values that don't match the inferred
# format become NaT (infer_datetime_format is deprecated and no longer needed)
dates = pd.to_datetime(df['date'], errors='coerce')

# Apply the slower, format-specific parser only to the rows that failed
mask = dates.isna()
if mask.any():
    dates.loc[mask] = df.loc[mask, 'date'].apply(parse_date_safe)

df['parsed_date'] = dates

# Report failures for data-quality tracking
failed_rows = df[df['parsed_date'].isna()]
print(f"Failed to parse {len(failed_rows)} rows:")
print(failed_rows)
```
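On very large frames, the row-wise apply fallback can itself become the bottleneck. A variant of the same idea keeps everything vectorized: run one pd.to_datetime pass per expected format, filling only the still-missing rows each time. A sketch, using the same formats list as above (parse_multi_format is an illustrative helper name, not a pandas API):

```python
import pandas as pd

formats = ["%m/%d/%Y", "%Y-%m-%d", "%m/%d/%y %H:%M:%S", "%m/%d/%y"]

def parse_multi_format(s: pd.Series) -> pd.Series:
    """Vectorized multi-format parsing: one to_datetime pass per format."""
    parsed = pd.to_datetime(s, errors='coerce')  # fast first pass
    for fmt in formats:
        mask = parsed.isna()
        if not mask.any():
            break  # everything parsed; skip remaining formats
        # Re-parse only the still-unparsed rows with this specific format
        parsed.loc[mask] = pd.to_datetime(s[mask], format=fmt, errors='coerce')
    return parsed

dates = parse_multi_format(pd.Series(
    ['01/15/2023', '2023-02-20', '03/25/23 14:30:00', 'invalid']
))
print(dates)
```

Each pass is a C-speed vectorized parse, so with a handful of formats this typically beats per-row strptime by a wide margin; the trade-off is that the custom year-range validation from parse_date_safe would need to be applied as a separate masking step afterwards.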
This approach first tries a fast, vectorized parse and falls back to slower row-wise parsing only for the rows that fail, so it handles mixed date formats efficiently while tracking parsing failures for data-quality purposes. (Note that the infer_datetime_format argument to pd.to_datetime is deprecated in recent pandas releases; strict format inference is now the default, so plain errors='coerce' is enough for the first pass.)
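If you can rely on pandas 2.0 or newer, to_datetime also accepts format='mixed', which infers the format of each element individually and can replace much of the manual fallback machinery, though it does per-element inference (slower than a single known format) and offers no hook for custom validation. A minimal sketch, with a version guard since the option does not exist in pandas 1.x:

```python
import pandas as pd

s = pd.Series(['01/15/2023', '2023-02-20', 'invalid'])

# format='mixed' requires pandas >= 2.0; guard for older versions
if int(pd.__version__.split('.')[0]) >= 2:
    parsed = pd.to_datetime(s, format='mixed', errors='coerce')
else:
    parsed = pd.to_datetime(s, errors='coerce')  # best effort on older pandas
print(parsed)
```

Be aware that per-element inference can resolve ambiguous strings like '03/04/2023' inconsistently with your expectations (dayfirst applies), so the explicit formats list above remains the safer choice when you know what formats to expect.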