79505967

Date: 2025-03-13 09:34:48
Score: 0.5

tl;dr

Remove .with_infer_schema_length(None)

details

I opened a ticket (https://github.com/pola-rs/polars/issues/21298) on the Polars GitHub repository and received the answer there, which I want to share here.

The most relevant comment: https://github.com/pola-rs/polars/issues/21298#issuecomment-2717919596

Interpretation of functionality for date/time inference parsing:

  • "infer schema" tries up to 188 string patterns, 1x or 2x

  • it does so on every field in scope

  • it does so for every row in scope, up to "infer_schema_length" rows

  • if "infer_schema_length" is not set, it defaults to 100 rows

  • if set to None, it processes (every field in) every row
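
To get a feel for what this means in practice, here is a rough back-of-the-envelope sketch in plain Rust (no Polars involved). The function `estimate_pattern_attempts` and the file dimensions are hypothetical, made up for illustration. It assumes every field in every scanned row is tried against all 188 patterns exactly once, which ignores the "1x or 2x" factor and any early exits, so treat the numbers as an order-of-magnitude estimate only.

```rust
/// Rough estimate of how many pattern attempts schema inference could make.
/// Hypothetical helper for illustration, not part of the Polars API.
fn estimate_pattern_attempts(
    total_rows: u64,
    fields: u64,
    infer_schema_length: Option<u64>,
    patterns: u64,
) -> u64 {
    // Some(n): scan at most n rows; None: scan every row in the file.
    let rows_scanned = match infer_schema_length {
        Some(n) => n.min(total_rows),
        None => total_rows,
    };
    rows_scanned * fields * patterns
}

fn main() {
    // Assumed file shape: 1,000,000 rows with 10 fields each.
    let (rows, fields, patterns) = (1_000_000, 10, 188);

    let default_len = estimate_pattern_attempts(rows, fields, Some(100), patterns);
    let scan_all = estimate_pattern_attempts(rows, fields, None, patterns);

    println!("infer_schema_length = Some(100): {default_len} attempts");
    println!("infer_schema_length = None:      {scan_all} attempts");
    println!("ratio: {}x", scan_all / default_len);
}
```

Under these assumptions the amount of work grows linearly with the number of rows scanned, so on a large file `None` costs far more than the default of 100 rows.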

Note that this inference is not cheap and can significantly impact performance.

After I removed the .with_infer_schema_length(None) lines from my Rust code, performance improved significantly, even for smaller files.

For reference: small file, unchanged code, release-compiled with rustc 1.85.0 (4d91de4e4 2025-02-17):

================
CPU 132%
user    0.112
system  0.030
total   0.108

Changed, small file:

================
CPU 265%
user    0.036
system  0.038
total   0.028

Changed, big file:

================
CPU 497%
user    0.145
system  0.047
total   0.039
Reasons:
  • RegEx Blacklisted phrase (1): I want
  • Long answer (-1):
  • Has code block (-0.5):
  • Self-answer (0.5):
  • Low reputation (0.5):
Posted by: kelko