It's hard to tell what can be done without knowing more details about the data transformations you're performing.
`csv` is (I believe) written in pure Python, whereas `pandas.read_csv` defaults to a C extension [0], which should be more performant. However, the fact that performance isn't improved in the `pandas` case suggests that you're not bottlenecked by I/O but by your computations -- profiling your code would confirm that.
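If you haven't profiled before, the standard library's `cProfile` is enough to see where the time goes. A minimal sketch (the `transform` function here is a hypothetical stand-in for your per-row computation):

```python
import cProfile
import io
import pstats

def transform(rows):
    # hypothetical stand-in for your per-row computation
    return [r * 2 for r in rows]

profiler = cProfile.Profile()
profiler.enable()
transform(range(100_000))
profiler.disable()

# Print the 10 most expensive calls by cumulative time
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(10)
print(out.getvalue())
```

The function at the top of that report is where your optimization effort should go.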
In general, during your data manipulations, `pandas` should be a lot more memory-efficient than `csv`, because instead of storing your rows as Python objects, it stores them in arrays behind the scenes.
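You can see the difference directly. A rough comparison, assuming rows of two numeric fields (the per-row dict sizes below don't even count the boxed values they contain, so this understates the gap):

```python
import sys

import pandas as pd

n = 100_000
# csv-module style: one Python dict per row
rows = [{"a": i, "b": float(i)} for i in range(n)]
df = pd.DataFrame(rows)

# Each dict carries object overhead; the DataFrame stores each column
# as one contiguous array
dict_bytes = sum(sys.getsizeof(r) for r in rows)
frame_bytes = int(df.memory_usage(deep=True).sum())
print(dict_bytes, frame_bytes)
```

On CPython the dict-per-row representation comes out several times larger than the columnar one.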
The general principle for performant in-memory data manipulation in Python is to vectorize, vectorize, vectorize. Every Python `for` loop, function call, and variable binding incurs overhead. If you can push these operations into libraries that convert them to machine code, you will see significant performance improvements. This means avoiding direct indexing of individual cells in your `DataFrame`s and, most importantly, avoiding tight `for` loops at all costs [1]. The problem is that this is not always possible or convenient when you have dependencies between computations on different rows [2].
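To make the contrast concrete, here is the same column sum done both ways; the vectorized form is typically one to two orders of magnitude faster because the loop runs in C rather than in the interpreter:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(100_000), "b": np.arange(100_000)})

# Slow: a Python-level loop touching every row
slow = [row.a + row.b for row in df.itertuples()]

# Fast: one vectorized expression, evaluated in compiled code
fast = df["a"] + df["b"]

assert (fast.to_numpy() == np.array(slow)).all()
```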
You might also want to have a look at `polars`. It is a dataframe library like `pandas`, but with a different, declarative API. `polars` also has a streaming API built around `scan_csv` and `sink_csv`, which may be what you want here. The caveat is that this streaming API is experimental and not yet well documented.
For a 2-5GB CSV file on an 8GB machine, though, I think you should be able to load the whole thing into memory, especially since the inefficiency of the CSV format is reduced once the data is converted to the in-memory Arrow format.
[0] In some cases `pandas` falls back to pure Python for reading, so you might want to make sure you're not falling into one of those cases.
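One way to check, I believe, is to watch for the `ParserWarning` that `pandas` emits when it silently switches engines; a regex separator is one option the C engine doesn't support:

```python
import io
import warnings

import pandas as pd

data = io.StringIO("a;b\n1;2\n")

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # A multi-character/regex sep is not supported by the C engine,
    # so pandas falls back to the python engine and warns about it
    df = pd.read_csv(data, sep="[;]")

print([str(w.message) for w in caught])
```

Passing `engine="c"` explicitly would instead raise an error, which is another way to find out whether your options are C-engine-compatible.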
[1] I find this is usually where Python performance issues are most apparent, because such computations run for every single row in your dataset. The first question with performance is often less "how do I my operations quicker?", but "what am I doing a lot of?". That's why profiling is so important.
[2] You can sometimes get around this with judicious use of shifting.
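For example, a computation that depends on the previous row (here, a first difference) can be expressed without a Python loop by aligning the series against a shifted copy of itself:

```python
import pandas as pd

s = pd.Series([3, 1, 4, 1, 5, 9])

# Each element minus the one before it, vectorized via shift;
# the first element has no predecessor, so it becomes NaN
diff = s - s.shift(1)
print(diff.tolist())  # → [nan, -2.0, 3.0, -3.0, 4.0, 4.0]
```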