Pandas — Optimize Memory and Speed Operation

Image: Memory Error in Jupyter Notebook
  • work on a project involves 4 GB RAM System and lot of memory intensive computation or data-set size that was large enough to hang your system ?
  • already have high quality server but want to optimize (reduce memory and execution time) your code ?
  • code in Python and want to know best practices ?

Table of contents

  1. See Hardware and System Information
    1.1. System, CPU, Memory, Swap, Disk, GPU Information
  2. Reduce DataFrame size
    2.1. Change in int datatype
    2.2. Change in float datatype
    2.3. Change from object to category datatype
    2.4. Convert to Sparse DataFrame
  3. Reduce DataFrame Loading Time
    3.1. Don’t Load all columns
    3.2. Read Large CSV File (Avoid Memory error and fast reading)
  4. Fast Pandas Operation
    4.1. Loop and Vectorization
    4.2. Indexing
    4.3. Parallelization
  5. References
Image 01: Memory usage per Datatype [1]
SegmentLocal
Image 02 — Numpy basic Datatypes [8]
  1. See Hardware and System Information

1.1 System, CPU, Memory, Swap, Disk, GPU Information

# ! pip install tabulate
# ! pip install gputil
import numpy as np
import pandas as pd
from time import time
import gc
import psutil
import platform
import sys
from datetime import datetime
import GPUtil
from tabulate import tabulate

def get_size(bytes, suffix="B"):
"""
Scale bytes to its proper format
e.g:
1253656 => '1.20MB'
1253656678 => '1.17GB'
"""
factor = 1024
for unit in ["", "K", "M", "G", "T", "P"]:
if bytes < factor:
return f"{bytes:.2f}{unit}{suffix}"
bytes /= factor

#### CODE -- https://www.thepythoncode.com/article/get-hardware-system-information-python ####

print("="*40, "Python Version", "="*40)
print (sys.version)
print (sys.version_info)

print("="*40, "System Information", "="*40)
uname = platform.uname()
print(f"System: {uname.system}")
print(f"Release: {uname.release}")
print(f"Version: {uname.version}")
print(f"Machine: {uname.machine}")
print(f"Processor: {uname.processor}")

# let's print CPU information
print("="*40, "CPU Info", "="*40)
# number of cores
print("Physical cores:", psutil.cpu_count(logical=False))
print("Total cores:", psutil.cpu_count(logical=True))
# CPU frequencies
cpufreq = psutil.cpu_freq()
print(f"Max Frequency: {cpufreq.max:.2f}Mhz")
print(f"Min Frequency: {cpufreq.min:.2f}Mhz")
print(f"Current Frequency: {cpufreq.current:.2f}Mhz")
# CPU usage
print("CPU Usage Per Core:")
for i, percentage in enumerate(psutil.cpu_percent(percpu=True, interval=1)):
print(f"Core {i}: {percentage}%")
print(f"Total CPU Usage: {psutil.cpu_percent()}%")

# Memory Information
print("="*40, "Memory Information", "="*40)
# get the memory details
svmem = psutil.virtual_memory()
print(f"Total: {get_size(svmem.total)}")
print(f"Available: {get_size(svmem.available)}")
print(f"Used: {get_size(svmem.used)}")
print(f"Percentage: {svmem.percent}%")
print("="*40, "SWAP", "="*40)
# get the swap memory details (if exists)
swap = psutil.swap_memory()
print(f"Total: {get_size(swap.total)}")
print(f"Free: {get_size(swap.free)}")
print(f"Used: {get_size(swap.used)}")
print(f"Percentage: {swap.percent}%")

# Disk Information
print("="*40, "Disk Information", "="*40)
print("Partitions and Usage:")
# get all disk partitions
partitions = psutil.disk_partitions()
for partition in partitions:
print(f"=== Device: {partition.device} ===")
print(f" Mountpoint: {partition.mountpoint}")
print(f" File system type: {partition.fstype}")
try:
partition_usage = psutil.disk_usage(partition.mountpoint)
except PermissionError:
# this can be catched due to the disk that
# isn't ready
continue
print(f" Total Size: {get_size(partition_usage.total)}")
print(f" Used: {get_size(partition_usage.used)}")
print(f" Free: {get_size(partition_usage.free)}")
print(f" Percentage: {partition_usage.percent}%")
# get IO statistics since boot
disk_io = psutil.disk_io_counters()
print(f"Total read: {get_size(disk_io.read_bytes)}")
print(f"Total write: {get_size(disk_io.write_bytes)}")

# GPU information
print("="*40, "GPU Details", "="*40)
gpus = GPUtil.getGPUs()
list_gpus = []
for gpu in gpus:
# get the GPU id
gpu_id = gpu.id
# name of GPU
gpu_name = gpu.name
# get % percentage of GPU usage of that GPU
gpu_load = f"{gpu.load*100}%"
# get free memory in MB format
gpu_free_memory = f"{gpu.memoryFree}MB"
# get used memory
gpu_used_memory = f"{gpu.memoryUsed}MB"
# get total memory
gpu_total_memory = f"{gpu.memoryTotal}MB"
# get GPU temperature in Celsius
gpu_temperature = f"{gpu.temperature} °C"
gpu_uuid = gpu.uuid
list_gpus.append((
gpu_id, gpu_name, gpu_load, gpu_free_memory, gpu_used_memory,
gpu_total_memory, gpu_temperature, gpu_uuid
))

print(tabulate(list_gpus, headers=("id", "name", "load", "free memory", "used memory", "total memory",
"temperature", "uuid")))
Image 03 — Hardware and System Information

2. Reduce DataFrame size

2.1 Change in int datatype

## Initializing minimum and maximum value of age
min_age_value , max_age_value = 1,150
## Number of rows in dataframe
nrows = int(np.power(10,7))
## creation of Age dataframe
df_age = pd.DataFrame({'Age':np.random.randint(low=1,high=100,size=nrows)})
## check memory usage before action
df_age.info(memory_usage='deep')
memory usage: 38.1 MB ## Range of "uint8"; satisfies range constraint of Age column
np.iinfo('uint8')
## Action: conversion of dtype from "int32" to "uint8"
converted_df_age = df_age.astype(np.uint8)
## check memory usage after action
converted_df_age.info(memory_usage='deep')
memory usage: 9.5 MB

2.2 Change in float datatype

## no. of documents
ncols = int(5*np.power(10,3))
## no. of search queries
nrows = int(5*np.power(10,4))
## creation of cosine similarity dataframe
df_query_doc = pd.DataFrame(np.random.rand(nrows, ncols))
print("No. of search queries: {} and No. of documents: {}".format(df_query_doc.shape[0],df_query_doc.shape[1]))
No. of search queries: 50000 and No. of documents: 5000## check memory usage before action
df_query_doc.info(memory_usage='deep')
memory usage: 1.9 GB## Action: conversion of dtype from "float64" to "float16"
converted_df_query_doc = df_query_doc.astype('float16')
## check memory usage after action
converted_df_query_doc.info(memory_usage='deep')
memory usage: 476.8 MB

2.3 Change from object to category datatype

## unique values of "days of week"
day_of_week = ["monday","tuesday","wednesday","thursday","friday","saturday","sunday"]
## Number of times day_of_week repeats
repeat_times = 7*np.power(10,6)
## creation of days of week dataframe
df_day_of_week = pd.DataFrame({'day_of_week':np.repeat(a=day_of_week,repeats = repeat_times)})
print("No of rows in days of week dataframe {}".format(df_day_of_week.shape[0]))
No of rows in days of week dataframe 49000000## check memory usage before action
df_day_of_week.info(memory_usage='deep')
memory usage: 2.9 GB## Action: conversion of dtype from "object" to "category"
converted_df_day_of_week = df_day_of_week.astype('category')
## check memory usage after action
converted_df_day_of_week.info(memory_usage='deep')
memory usage: 46.7 MB## check first two rows of dataframe
converted_df_day_of_week.head(2)

2.4 Convert to Sparse DataFrame

## number of rows in dataframe
nrows = np.power(10,7)
## creation of dataframe
df_dense =pd.DataFrame([[0,0.23,np.nan]]*nrows)
## check memory usage before action
df_dense.info(memory_usage='deep')
memory usage: 228.9 MB## Percentage of Non-zero and Non-NaN values in dataframe
non_zero_non_nan = np.count_nonzero((df_dense)) - df_dense.isnull().sum().sum()
non_zero_non_nan_percentage = round((non_zero_non_nan/df_dense.size)*100,2)
print("Percentage of Non-Zero Non-NaN values in dataframe {} %".format(non_zero_non_nan_percentage))
Percentage of Non-Zero Non-NaN values in dataframe 33.33 %## Action: Change of DataFrame type to SparseDataFrame
df_sparse = df_dense.to_sparse()
## check memory usage after action
df_sparse.info(memory_usage='deep')
memory usage: 152.6 MB

3. Reduce DataFrame Loading Time

3.1 Don’t Load all columns

t0 = time()
ufo = pd.read_csv('http://bit.ly/uforeports')
t1 = time()
print("Default Loading time {} seconds".format(t1-t0))
Default Loading time 2.86840422821045 secondst0 = time()
ufo = pd.read_csv('http://bit.ly/uforeports',usecols=['City','State'])
t1 = time()
print("Default Loading time {} seconds".format(t1-t0))
Default Loading time 0.8735167980194092 seconds

3.2 Read Large CSV File

  • Explicit pass appropriate dtypesfor each column while loading file and reduce the volume of the data. Example - Sometimes 64-bit precision is overkill and can use 32-bit, 2x memory savings are worth it.
  • Usecols parameter in read_csv i.e. only load required columns
  • Data compression - Here, I don't mean by zip or gzip file. Say your status column has two values: "AVAILABLE" and "UNAVAILABLE". Instead of storing them as string (~ 10 bytes per entry), store as boolean: True or False (1 byte per entry)
  • Dask — Dask provides multi-core execution on larger-than-memory datasets.
  • Modin — Modin transparently distributes the data and computation. Modin provides speed-ups of up to 4x on a laptop with 4 physical cores. Pandas able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine.
SegmentLocal
Image 04 — Pandas and Modin read_csv comparison [9]
SegmentLocal
Image 05 —Comparison among alternatives of pandas [11]
# # pip install modin[ray]
# import ray
# ray.init(num_cpus=4)
# import modin.pandas as pd
# export MODIN_OUT_OF_CORE=true
SegmentLocal
Image 06 — Pandas and Modin operations comparison [13]

4.1 Loop and Vectorization

SegmentLocal
Image 07 — Loop and Vectorization comparison [16]
SegmentLocal
Image 08 — Pandas Optimization [17]
df1 = pd.DataFrame({'A':list(range(10000000)),'B':list(range(0,20000000,2))})
df2 = pd.DataFrame({'A':list(range(10000000)),'C':list(range(0,40000000,4))})

t0 = time()
df = pd.merge(df1,df2,how='left',on='A')
t1 = time()
print("Time Taken in merge before indexing {}".format(t1-t0))

gc.collect()

t0 = time()
df1_ = df1.set_index('A')
df2_ = df2.set_index('A')
df_ = pd.merge(df1_,df2_, left_index=True, right_index=True)
t1 = time()
print("Time Taken in merge after indexing {}".format(t1-t0))
Time Taken in merge before indexing 10.060492992401123
Time Taken in merge after indexing 2.3176727294921875

4.3 Parallelization

SegmentLocal
Image 09 — Standard Vs Parallel comparison using Pandarallel[19]
  • Swifter — It applies any function to a pandas dataframe or series in the fastest available manner (Vectorize OR Dask Parallel Processing OR Pandas apply). This notebook contains speed comparison for swifter, Please check.
#########################################
#### PANDARALLEL CODE #################
#########################################
# from pandarallel import pandarallel
# pandarallel.initialize(nb_workers = _NO_CORES_USED_FOR_PARALLELIZATION,progress_bar = True)
# df.parallel_apply(func)
# df.groupby(args).parallel_apply(func)
#########################################
#### SWIFTER CODE #################
#########################################
# import pandas as pd
# import swifter
# df['Col'].swifter.apply(anyfunction)

5. References

  1. https://www.dataquest.io/blog/pandas-big-data/
  2. https://machinelearningmastery.com/sparse-matrices-for-machine-learning/
  3. https://stackoverflow.com/questions/39100971/how-do-i-release-memory-used-by-a-pandas-dataframe
  4. https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html
  5. https://cmdlinetips.com/2018/03/sparse-matrices-in-python-with-scipy/
  6. https://jakevdp.github.io/blog/2014/05/09/why-python-is-slow/
  7. https://pythonspeed.com/articles/pandas-load-less-data/
  8. https://numpy.org/devdocs/user/basics.types.html
  9. https://modin.readthedocs.io/en/latest/
  10. DASKhttps://docs.dask.org/en/latest/best-practices.html
  11. https://towardsdatascience.com/scaling-pandas-comparing-dask-ray-modin-vaex-and-rapids-c74c85a4e59c
  12. https://pythonspeed.com/datascience/
  13. Modin Vs Pandas https://www.kdnuggets.com/2019/11/speed-up-pandas-4x.html
  14. https://www.csvexplorer.com/blog/open-big-csv/
  15. https://www.kaggle.com/c/zillow-prize-1/discussion/37261
  16. https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6
  17. https://github.com/s-heisler/pycon2017-optimizing-pandas/blob/master/MinneAnalytics%20materials/Minneanalytics_talk_slides.pdf
  18. https://medium.com/bigdatarepublic/advanced-pandas-optimize-speed-and-memory-a654b53be6c2
  19. https://github.com/nalepae/pandarallel
  20. https://www.thepythoncode.com/article/get-hardware-system-information-python
  21. https://github.com/aakashgoel12/Play-DataStructure-Python-Data-Engineering/blob/master/pandas_optimize_memory_speed_operation.ipynb
  22. https://nbviewer.jupyter.org/github/aakashgoel12/Play-DataStructure-Python-Data-Engineering/blob/master/pandas_optimize_memory_speed_operation.ipynb

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store