I have been wondering about this problem for years now. The answers to this question have been very useful for starting to understand what is going on. I think the ultimate fix will have to come from the Pandas project itself, but here is a workaround for now.

I tested a few approaches in JupyterLab so I could see exactly how much memory the Python kernel was using.

TLDR: Use Approach 3; it is the most reliable way to ensure the memory is actually freed.

Note: The memory figures below were read from the JupyterLab status bar, which shows kernel memory usage in the middle of the bottom bar.
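If you are not using JupyterLab, you can read the kernel's memory usage programmatically instead. This is only a sketch, and it assumes the psutil package (not used in the answers above) is installed in the kernel's environment:

import os
import psutil  # assumption: psutil is installed in the kernel's environment

def kernel_rss_mb() -> float:
    """Resident set size of the current process (the kernel) in MB."""
    return psutil.Process(os.getpid()).memory_info().rss / (1024 ** 2)

print(f"Kernel memory usage: {kernel_rss_mb():.2f} MB")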

Approach 1:

This approach is based on @hardi's answer. Here is the code:

import pandas as pd
import numpy as np
import gc

# Define a large number of rows and columns
num_rows = 10_000_000  # 10 million rows
num_cols = 50          # 50 columns

# Generate a large DataFrame with random float values (uses more memory)
large_df = pd.DataFrame(
    np.random.rand(num_rows, num_cols),  # Floating-point data
    columns=[f"col_{i}" for i in range(num_cols)]
)

# Display DataFrame memory usage in megabytes (MB)
memory_usage_mb = large_df.memory_usage(deep=True).sum() / (1024**2)
print(f"Memory usage: {memory_usage_mb:.2f} MB")

# Delete the DataFrame, force a garbage collection pass,
# and rebind the name to an empty DataFrame
del large_df
gc.collect()
large_df = pd.DataFrame()

This does work. Here is what the effect on memory looks like.

Base kernel memory usage: 1.31 GB

Loading the data raises usage to 5.14 GB

Releasing the data returns usage to 1.41 GB

However, this does not fully release the memory in all cases.

Approach 2:

I am working with a large data file whose memory was not fully released by this method, so I also added a call to malloc_trim, which asks the allocator to return freed memory to the OS. This works on Linux; Windows and macOS users will have to find an equivalent for their platform (a guarded, cross-platform sketch follows the results below).

Here is the code:

import pandas as pd
import gc
import ctypes  # Works for Linux

large_data_file = 'your_directory/file.pkl.zip'

large_df = pd.read_pickle(large_data_file)

# ... work with large_df ...

del large_df                             # drop the reference
gc.collect()                             # run Python's garbage collector
ctypes.CDLL("libc.so.6").malloc_trim(0)  # ask glibc to return freed memory to the OS

This is what the memory usage looks like.

The DataFrame uses 10.2 GB

Releasing with gc.collect lowers usage only to 6.85 GB

Releasing at the OS level with malloc_trim brings usage all the way back down
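Since malloc_trim is specific to glibc, a small guard keeps the same code from failing on Windows or macOS. This is a sketch under that assumption, not part of the original answers; the helper name release_os_memory is mine:

import ctypes
import ctypes.util
import gc
import sys

def release_os_memory():
    """Run Python's garbage collector, then ask glibc to return freed pages to the OS (Linux only)."""
    gc.collect()
    if sys.platform.startswith("linux"):
        libc_path = ctypes.util.find_library("c")  # typically resolves to libc.so.6
        if libc_path:
            libc = ctypes.CDLL(libc_path)
            if hasattr(libc, "malloc_trim"):
                libc.malloc_trim(0)

On other platforms this simply runs gc.collect() and leaves returning memory to the OS up to the system allocator.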

Approach 3:

In my testing in a more complicated notebook with more operations, I found that Approach 2 did not release the memory. I don't know why, but the approach of holding the DataFrame in a dictionary, from @Anil P's answer, did work, so my recommendation is to use this approach when working with large datasets (a small reusable wrapper is sketched after the results below).

import pandas as pd
import gc
import ctypes  # Works for Linux

large_data_file = 'your_directory/file.pkl.zip'

# Hold the DataFrame in a dict so clearing the dict drops the reference
large_df = dict()
large_df['df'] = pd.read_pickle(large_data_file)

# ... work with large_df['df'] ...

large_df.clear()                         # drop every reference held by the dict
gc.collect()                             # run Python's garbage collector
ctypes.CDLL("libc.so.6").malloc_trim(0)  # ask glibc to return freed memory to the OS

Base memory usage: 1.38 GB

Memory usage with the data loaded: 10.08 GB

Releasing the memory brings it back down to 1.43 GB
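For completeness, here is one way to wrap the recommended pattern into a reusable helper. This is a sketch, not code from the answers above; the function name release_frames is mine and the file path is a placeholder:

import gc
import ctypes
import pandas as pd

def release_frames(frames: dict) -> None:
    """Clear every DataFrame held in the dict, collect garbage, and trim the heap (Linux/glibc)."""
    frames.clear()                           # drop every reference held by the dict
    gc.collect()                             # run Python's garbage collector
    ctypes.CDLL("libc.so.6").malloc_trim(0)  # return freed memory to the OS

# Usage:
frames = {'df': pd.read_pickle('your_directory/file.pkl.zip')}
# ... work with frames['df'] ...
release_frames(frames)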
