I noticed that with pandas==2.2.1 (and possibly other versions), the bad-line error is not raised unless engine='python' is explicitly set when reading the file.
Following the example provided by @sitting_duck, here is a minimal reproducible example:
import io
import pandas as pd
sim_csv = io.StringIO(
'''A,B,C
11,21,31
12,22,32
13,23,33,43 # Bad Line
14,24,34
15,25,35'''
)
Without engine (so pandas falls back to its default C parser) and with on_bad_lines='error':
with pd.read_csv(sim_csv, chunksize=2, on_bad_lines='error') as reader:
    for chunk in reader:
        print(chunk)
    A   B   C
0  11  21  31
1  12  22  32
    A   B   C
2  13  23  33
3  14  24  34
    A   B   C
4  15  25  35
With engine='python' and with on_bad_lines='error':
sim_csv.seek(0)
with pd.read_csv(sim_csv, chunksize=2, engine='python', on_bad_lines='error') as reader:
    for chunk in reader:
        print(chunk)
    A   B   C
0  11  21  31
1  12  22  32
[...] pandas.errors.ParserError: Expected 3 fields in line 4, saw 4
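If you want the chunked read to fail loudly and still keep the chunks parsed so far, one option is to wrap the loop in a try/except. The following is only a minimal sketch based on the behaviour observed above: read_chunks_strict is a hypothetical helper name (not part of pandas), and the inline CSV is a shortened copy of the example data.

```python
import io

import pandas as pd
from pandas.errors import ParserError


def read_chunks_strict(buffer, chunksize=2):
    """Hypothetical helper: read a CSV in chunks with the python engine and
    return the chunks parsed before any malformed row, plus the error (or None)."""
    chunks = []
    error = None
    try:
        # engine='python' is set explicitly, since that is the configuration
        # in which the bad-line error was observed above.
        with pd.read_csv(
            buffer, chunksize=chunksize, engine="python", on_bad_lines="error"
        ) as reader:
            for chunk in reader:
                chunks.append(chunk)
    except ParserError as exc:
        # Keep the exception so the caller can decide what to do with it.
        error = exc
    return chunks, error


# Same layout as the example above: the fourth line has one field too many.
sim_csv = io.StringIO(
    "A,B,C\n11,21,31\n12,22,32\n13,23,33,43\n14,24,34\n15,25,35"
)
chunks, error = read_chunks_strict(sim_csv)
print(f"chunks parsed before failure: {len(chunks)}")
print(f"error: {error}")
```

Returning the exception instead of re-raising it is just a choice for this illustration; in a real pipeline you may prefer to let the ParserError propagate.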