I have been wondering about this problem for years. The answers to this question have been very useful for starting to understand what is going on. I think the ultimate fix will have to come from the Pandas project, but here is a workaround for now.
I tested out a few approaches using Jupyter Lab so I could see exactly how much memory the Python kernel was using.
TLDR: Use Approach 3; it is the most likely to guarantee that memory is freed correctly.
Note: the Jupyter Lab screenshots show memory usage in the middle of the bottom bar.
Approach 1
This approach is based on @hardi's answer. Here is the code:
import pandas as pd
import numpy as np
import gc

# Define a large number of rows and columns
num_rows = 10_000_000  # 10 million rows
num_cols = 50          # 50 columns

# Generate a large DataFrame with random float values (uses more memory)
large_df = pd.DataFrame(
    np.random.rand(num_rows, num_cols),  # floating-point data
    columns=[f"col_{i}" for i in range(num_cols)]
)

# Display DataFrame memory usage in megabytes (MB)
memory_usage_mb = large_df.memory_usage(deep=True).sum() / (1024**2)
print(f"Memory usage: {memory_usage_mb:.2f} MB")

# Release: delete the reference, run the garbage collector,
# then rebind the name to an empty DataFrame
del large_df
gc.collect()
large_df = pd.DataFrame()
This does work. Here is what the effect on memory looks like.
(Screenshot: releasing the data returns usage to 1.41 GB.)
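If you are not running in Jupyter Lab, you can take the same measurement programmatically. Here is a small sketch of my own using psutil (not part of the original approaches; the print_rss helper is hypothetical) to print the process's resident memory before and after the release:

import os
import gc
import psutil  # assumption: extra dependency, install with `pip install psutil`

def print_rss(label):
    # Resident set size (RSS) of the current Python process, in MB
    rss_mb = psutil.Process(os.getpid()).memory_info().rss / (1024**2)
    print(f"{label}: {rss_mb:.2f} MB")

# Assumes large_df from the snippet above is still defined
print_rss("before release")
del large_df
gc.collect()
print_rss("after release")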
However, this does not fully release memory in all cases.
Approach 2
I have a large data file that was not being fully released by the method above, so I also introduced a call to glibc's malloc_trim, which asks the allocator to return freed memory to the operating system. This works on Linux; Windows and macOS users will have to find an equivalent for their platform.
Here is the code:
import pandas as pd
import gc
import ctypes  # used to call malloc_trim; works on Linux

large_data_file = 'your_directory/file.pkl.zip'
large_df = pd.read_pickle(large_data_file)

# Release: drop the reference, run the garbage collector,
# then ask glibc to return freed memory to the OS
del large_df
gc.collect()
ctypes.CDLL("libc.so.6").malloc_trim(0)
This is what the memory usage looks like.
(Screenshot: releasing with gc.collect() lowers usage only to 6.85 GB.)
(Screenshot: releasing at the OS level with malloc_trim lowers usage all the way back down.)
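Since malloc_trim is glibc-specific, the raw ctypes call will fail on other platforms. Here is a minimal sketch of a guarded version (the trim_memory helper name and the platform check are my additions, not part of the original answers):

import ctypes
import ctypes.util
import gc
import platform

def trim_memory():
    # Run the collector first so freed blocks are available to trim
    gc.collect()
    if platform.system() == "Linux":
        libc_path = ctypes.util.find_library("c")  # typically resolves to libc.so.6
        if libc_path:
            # malloc_trim returns 1 if memory was released back to the OS
            return ctypes.CDLL(libc_path).malloc_trim(0)
    return 0  # malloc_trim is glibc-specific; no-op on Windows/macOS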
Approach 3
In my testing in a more complicated notebook with more operations, I found that Approach 2 did not release the memory. I don't know why, but the approach of holding the DataFrame in a dictionary, from @Anil P's answer, did work, so my recommendation is to use this approach when working with large datasets.
import pandas as pd
import gc
import ctypes  # used to call malloc_trim; works on Linux

large_data_file = 'your_directory/file.pkl.zip'

# Hold the DataFrame inside a dict so the reference can be dropped reliably
large_df = dict()
large_df['df'] = pd.read_pickle(large_data_file)

# Release: clear the dict to drop the only reference to the DataFrame,
# collect, then return freed memory to the OS
large_df.clear()
gc.collect()
ctypes.CDLL("libc.so.6").malloc_trim(0)
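If you use this pattern often, you can wrap it in a context manager so the release happens automatically when the block exits. This is a sketch of my own built on Approach 3 (the managed_dataframe name is hypothetical), assuming Linux for the malloc_trim call:

import ctypes
import gc
from contextlib import contextmanager
import pandas as pd

@contextmanager
def managed_dataframe(path):
    # Hold the DataFrame in a dict, as in Approach 3
    holder = {'df': pd.read_pickle(path)}
    try:
        yield holder  # inside the block, access the frame as holder['df']
    finally:
        holder.clear()  # drop the only reference to the DataFrame
        gc.collect()    # collect the now-unreachable object
        ctypes.CDLL("libc.so.6").malloc_trim(0)  # Linux-only

# Usage: avoid keeping other references to holder['df'] outside the block
with managed_dataframe('your_directory/file.pkl.zip') as holder:
    print(holder['df'].shape)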