How efficient are Python generators?

Assess the performance of generators compared to normal functions using the resource module.


Topics Covered

  1. Introduction
  2. Where should you use Python generators IRL?
  3. Python resource library
  4. Experimenting with return
  5. Experimenting with yield
  6. Plotting out the results in Matplotlib

Introduction

In my previous article on Python generators, I mentioned the following three advantages of using them.

  1. They make it easier to build iterators.
  2. They are memory-efficient, since they produce one item at a time.
  3. They can represent an infinite stream of data (see the short sketch below).
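
To make the third point concrete, here is a minimal sketch (the function name is my own, for illustration) of a generator representing an infinite stream:

def natural_numbers():
    # Yields 1, 2, 3, ... forever; no list could ever hold this
    n = 1
    while True:
        yield n
        n += 1

for num in natural_numbers():
    if num > 5:
        break
    print(num)  # prints 1 through 5, then stops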

A lot of my readers had the following comment.

Good job! What I miss is real life examples? When to use what? And why?

So, in this article, I will address the following points.

  1. A real-life example of using Python generators.
  2. How fast/efficient are generators compared to using return?

Where should you use Python generators IRL?

One of the most popular applications of generator functions is reading files that contain large volumes of data.

Generators perform lazy evaluation: they compute each item only when you ask for it, not at the time of instantiation.

This makes generators very useful when you have a very large data set to process: you can start consuming items immediately, while the rest of the data set is still being computed.
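
To see lazy evaluation in action, here is a minimal sketch (names are illustrative) showing that a generator does no work at instantiation and computes each item only on request:

def squares(n):
    for i in range(n):
        print('computing', i)  # runs only when the consumer asks for the next item
        yield i * i

gen = squares(3)   # nothing is computed or printed yet
print(next(gen))   # prints 'computing 0', then 0
print(next(gen))   # prints 'computing 1', then 1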

We are going to conduct the following experiment.

  1. We will use two datasets: the first file has 100 rows, whereas the second has 5 million rows.

  2. The first program will read all the rows into a list and then return it. For both files, we will measure the time it takes and the memory it consumes.

  3. The second program will use yield to read one line at a time and hand it back to the caller for printing. Again, we will measure the time taken and the memory consumed when using generators on both files.


Python resource library

The resource module is only available on Unix systems and won't work on Windows.

This module provides basic mechanisms for measuring and controlling system resources utilized by a program.

We will specifically be using the resource.getrusage function.

This function returns an object describing the resources consumed by either the current process or its children, as specified by the who parameter. We will pass the resource.RUSAGE_SELF symbol, which reports the resources consumed by the calling process: the sum of resources used by all threads in the process.
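
Here is a minimal sketch of the call we will be making (field semantics as documented in the Python standard library):

import resource

usage = resource.getrusage(resource.RUSAGE_SELF)
print(usage.ru_maxrss)  # peak resident set size (kilobytes on Linux, bytes on macOS)
print(usage.ru_utime)   # CPU time spent in user mode, in seconds
print(usage.ru_stime)   # CPU time spent in system (kernel) mode, in seconds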


Experimenting with return

The following code reads the entire file, stores it in memory, and prints every line of data inside a loop.

We will run the code below on both the 100-row file and the 5-million-row file.

import resource

filename = '<filename>'

def read_file(file_name):
    # Read the entire file into memory and return all lines as a list
    with open(file_name, 'r') as csv_file:
        return csv_file.readlines()

csv_data = read_file(filename)

for data in csv_data:
    print(data)

# Resources consumed by the calling process (all threads combined)
usage = resource.getrusage(resource.RUSAGE_SELF)
print('Peak Memory Usage =', usage.ru_maxrss)
print('User Mode Time =', usage.ru_utime)
print('System Mode Time =', usage.ru_stime)

Results for the file containing 100 rows

Peak Memory Usage = 9376
User Mode Time = 0.040309
System Mode Time = 0.012093999999999999

Results for the file containing 5 million rows

Peak Memory Usage = 943516
User Mode Time = 10.662542
System Mode Time = 14.784168


Experimenting with yield

The following code uses the yield keyword to read one line at a time and hand it back to the caller.

We will again run the code below on both the 100-row file and the 5-million-row file.

import resource

filename = '<filename>'

def read_file(file_name):
    # Yield one line at a time; the file is closed automatically
    # when the generator is exhausted
    with open(file_name, 'r') as csv_file:
        while True:
            data = csv_file.readline()
            if not data:
                break
            yield data

csv_data = read_file(filename)

for row in csv_data:
    print(row)

# Resources consumed by the calling process (all threads combined)
usage = resource.getrusage(resource.RUSAGE_SELF)
print('Peak Memory Usage =', usage.ru_maxrss)
print('User Mode Time =', usage.ru_utime)
print('System Mode Time =', usage.ru_stime)

Results for the file containing 100 rows

Peak Memory Usage = 9424
User Mode Time = 0.016108
System Mode Time = 0.008058

Results for the file containing 5 million rows

Peak Memory Usage = 9436
User Mode Time = 11.590708
System Mode Time = 14.579287

Plotting out the results in Matplotlib

Let's summarize the results in a table format and then plot them in Matplotlib.

No. of rows     | return statement                     | yield statement
100 rows        | Memory: 9376 KB, Time: 0.0523 secs   | Memory: 9424 KB, Time: 0.0242 secs
5 million rows  | Memory: 943516 KB, Time: 25.4 secs   | Memory: 9436 KB, Time: 26.1 secs

(ru_maxrss reports peak memory in kilobytes on Linux; the time shown is user + system CPU time.)

Figure: memory consumption comparison of return vs yield for different file sizes
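
The original figure isn't reproduced here, but a minimal Matplotlib sketch along these lines would regenerate the memory comparison (the numbers are the ru_maxrss values measured above; the chart styling is my own choice):

import matplotlib.pyplot as plt

labels = ['100 rows', '5 million rows']
return_mem = [9376, 943516]  # peak memory (KB) with return
yield_mem = [9424, 9436]     # peak memory (KB) with yield

x = range(len(labels))
width = 0.35

# Grouped bar chart: return vs yield, per file size
plt.bar([i - width / 2 for i in x], return_mem, width, label='return')
plt.bar([i + width / 2 for i in x], yield_mem, width, label='yield')
plt.xticks(list(x), labels)
plt.ylabel('Peak memory usage (ru_maxrss, KB)')
plt.title('Memory consumption: return vs yield')
plt.legend()
plt.show()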

While the return and yield versions perform similarly on the time front, memory consumption is where generators really outshine the return statement: peak memory stays essentially constant at roughly 9 MB instead of growing with the size of the file.
