The Best Way to Read a Large CSV File in Python
TL;DR: We compare the performance of four different ways to read a large CSV file in Python so you can find the method that best suits your use case. Hint: all of the approaches make use of Python's generators in some way.
Why compare different ways of reading CSV files?
Reading a CSV file is a very common task as Python continues to grow in the data analytics community. Data keeps growing too, and it is now often the case that the data people are trying to work with will not fit in memory.
It’s also not always necessary to load all the data into memory. We can make use of generators in Python to iterate through large files in chunks or row by row.
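To make that idea concrete, here is a minimal sketch of a row-reading generator (iter_rows is just a hypothetical helper name, not part of the benchmarks below) that yields one parsed row at a time, so only a single line is ever held in memory:
def iter_rows(path):
    # Yield one parsed row at a time; only one line is held in memory
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

# Example usage: consume rows lazily without loading the whole file
# for row in iter_rows("/mnt/c/data/huge_data.csv"):
#     do_something(row)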
The experiment
We will generate a CSV file with 10 million rows of random big integers, 14 data columns wide (15 columns in the file once to_csv writes the index column). For me this file is approximately 1.3GB: not huge, but big enough for our tests.
Each approach will read the entire CSV and compute the sum of the column at index 2.
My machine is a Windows 10 Desktop with 16GB RAM using an AMD Ryzen 5 with 6 cores (12 logical).
Here is the script I used to generate huge_data.csv:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(10000000,14)))
df.to_csv("/mnt/c/data/huge_data.csv")
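If you want to sanity-check the result before running the benchmarks, a quick, optional way to confirm the generated file's size (assuming the same path) is:
import os

FILE_PATH = "/mnt/c/data/huge_data.csv"
# Report the file size in gigabytes
size_gb = os.path.getsize(FILE_PATH) / 1024 ** 3
print(f"huge_data.csv is {size_gb:.2f} GB")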
I then used the time module to time the execution of the entire script for each approach to reading a big CSV file.
Four ways to read a large CSV file in Python
Pure Python
This approach uses no additional libraries. Under the hood, the for row in csv_file loop iterates over the file object lazily, reading one line at a time.
Time: 12.13 seconds
import time

start = time.time()

FILE_PATH = "/mnt/c/data/huge_data.csv"

def read_data(filename, column_index, skip_header=True):
    with open(filename) as csv_file:
        total = 0
        if skip_header:
            next(csv_file)  # skip the header row
        for row in csv_file:
            try:
                r = row.split(",")
                total += int(r[column_index])
            except ValueError:
                print(f"Failed to convert {r[column_index]} to int")
        return total

data_total = read_data(FILE_PATH, 2)
print(data_total)
print(f"Done in {time.time()-start} seconds")
CSV Reader
Here we use the popular csv module to parse the open file and iterate over its reader, which also yields one row at a time. I'm not entirely sure why the performance took such a hit; it is likely the extra per-field work (quote handling, dialect rules) that csv.reader does, and it's worth digging into that overhead if you require high performance.
Time: 26.32 seconds
import csv
import time

start = time.time()

FILE_PATH = "/mnt/c/data/huge_data.csv"

def read_data(filename, column_index, skip_header=True):
    with open(filename) as csv_file:
        total = 0
        reader = csv.reader(csv_file)
        if skip_header:
            next(reader, None)
        for row in reader:
            try:
                total += int(row[column_index])
            except ValueError:
                print(f"Failed to convert {row[column_index]} to int")
        return total

data_sum = read_data(FILE_PATH, 2)
print(data_sum)
print(f"Done in {time.time()-start} seconds")
Pandas with chunksize
Here we use pandas, which makes for a very short script. If you already have pandas in your project, it probably makes sense to use this approach for simplicity.
We specify a chunksize so that pandas.read_csv() does not read the entire CSV into memory. Each chunk is a DataFrame itself. Note that using the reader as a context manager, as below, requires pandas 1.2 or newer.
Time: 8.81 seconds
import pandas as pd
import time

start = time.time()

FILE_PATH = "/mnt/c/data/huge_data.csv"

def sum_column(chunk, column_idx):
    # Each chunk is a DataFrame; sum the column at the given position
    return chunk.iloc[:, column_idx].sum()

chunksize = 10 ** 6
total_sum = 0
column_index = 2

with pd.read_csv(FILE_PATH, chunksize=chunksize) as reader:
    for chunk in reader:
        total_sum += sum_column(chunk, column_index)

print(f"Total: {total_sum}")
print(f"Done in {time.time()-start} seconds")
Multi-processing after splitting the file
This is the most complicated approach, but it has the best performance, assuming your use case allows the CSV file to be split.
You could split the file yourself before running the Python script, but here the script makes a subprocess call to do the splitting from Python. Another step to add would be removing the split files after the fact (see the cleanup sketch after the script).
This solution is the most scalable and follows a map-reduce style approach.
Time: 3.25 seconds (Winner)
import csv
from multiprocessing import Pool, cpu_count
from functools import partial
import os
import time
import subprocess

start = time.time()

FILE_PATH = "/mnt/c/data/huge_data.csv"
FILE_DIR = "/mnt/c/data"

# First split the huge file into multiple smaller files
num_cores = cpu_count()
subprocess.call(["split", "--lines=1000000", "--numeric-suffixes", FILE_PATH, "split"], cwd=FILE_DIR)

def read_data(filename, column_index, skip_header=True):
    with open(filename) as csv_file:
        # Only the first split file ("split00") contains the header row
        if skip_header or filename.endswith('00'):
            next(csv_file)
        for row in csv_file:
            try:
                r = row.split(",")
                yield int(r[column_index])
            except ValueError:
                print(f"Failed to convert {r[column_index]} to int")

def sum_file(file_name, column_index, skip_header):
    return sum(read_data(file_name, column_index, skip_header))

all_files = os.listdir(FILE_DIR)
file_names = list(filter(lambda name: name.startswith('split'), all_files))
file_paths = map(lambda name: os.path.join(FILE_DIR, name), file_names)

with Pool(processes=num_cores) as pool:
    partial_sums = pool.map(partial(sum_file, column_index=2, skip_header=False), file_paths)

print(f"Total: {sum(partial_sums)}")
print(f"Done in {time.time()-start} seconds")
Conclusion
As with any tool, there is rarely a single solution that is always best.
If you want pure Python, the first approach is viable. If you want simplicity and are already using pandas, go with the pandas approach. If you have truly massive data sets, try the multi-processing with file splitting solution.