Load your own libraries in pySpark on-the-fly


TL;DR Use SparkContext.addPyFile(path)

For a lot of our data science work at Chimnie, we use Apache Spark on AWS EMR via pySpark (that’s a lot of links!)

We have an extensive python library which is used in all of our pipelines, as well as for interactive R&D via Jupyter Notebooks. This library is in constant evolution, and we keep it in GitHub. Whenever a change is merged in GitHub, the library is zipped and uploaded to S3 automatically using a GitHub Action.
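As a rough sketch of that packaging step, the zip can be built with nothing but the standard library; the your_library name and the S3 path in the comment are placeholders, not our actual setup:

```python
import pathlib
import shutil

# Placeholder package layout; in CI this is the checked-out repo.
pkg = pathlib.Path("your_library")
(pkg / "helpers").mkdir(parents=True, exist_ok=True)
(pkg / "__init__.py").write_text("")
(pkg / "helpers" / "__init__.py").write_text("")

# Zip so the archive keeps your_library/ as its single root folder.
archive = shutil.make_archive(
    "your_library",          # produces your_library.zip
    "zip",
    root_dir=".",            # directory containing the package folder
    base_dir="your_library", # the package folder to include
)
print(archive)

# A GitHub Action would then upload it, e.g. with the AWS CLI:
#   aws s3 cp your_library.zip s3://your-bucket/libs/your_library.zip
```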

For quite some time, we were using a dedicated EMR cluster, and loaded this library from S3 during cluster initialization. This is fine for automated jobs, like the ones triggered by our pipelines. But if you want to use the library in a Jupyter Notebook for R&D, that meant that whenever a change was made to the library in the GitHub repo, the only way to have it available in a notebook was to restart the cluster, which could take more than ten minutes! Not the best developer experience.

Recently we migrated to EMR Serverless, which has proven to be more cost effective while lowering our maintenance needs compared to managing a dedicated EMR cluster. As part of this migration, we wanted to also improve the R&D workflow: to be able to have our python library updated in seconds within notebooks, without having to restart the cluster.

After far too much digging, we found the little method that solved our problem: SparkContext.addPyFile(path).

For example, in an EMR Studio notebook (where sc is the active SparkContext), pointing at wherever your zip lives in S3, you can just do:

sc.addPyFile("s3://your-bucket/libs/your_library.zip")
import your_library

And your_library will be ready to use right away, not only in the driver, but also in all the workers!

Just make sure to zip your library with a properly named root folder inside, and with __init__.py files in the root and in all subfolders.

An example structure could be:

your_library
├── __init__.py
└── helpers
    └── __init__.py
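You can sanity-check the layout locally using the same mechanism Spark relies on: Python can import straight from a zip placed on sys.path. A minimal sketch, using a throwaway package (the your_library name and its contents are made up for illustration):

```python
import pathlib
import shutil
import sys

# Build a throwaway package matching the layout above.
pkg = pathlib.Path("your_library")
(pkg / "helpers").mkdir(parents=True, exist_ok=True)
(pkg / "__init__.py").write_text("VERSION = '0.1'\n")
(pkg / "helpers" / "__init__.py").write_text("def greet():\n    return 'hello'\n")

# Zip it with your_library/ as the root folder inside the archive.
archive = shutil.make_archive("your_library_check", "zip",
                              root_dir=".", base_dir="your_library")

# Putting the zip itself on sys.path makes the package importable,
# which is essentially what addPyFile arranges on driver and workers.
sys.path.insert(0, archive)
shutil.rmtree(pkg)  # delete the source dir so the import must come from the zip

import your_library
from your_library import helpers
print(your_library.VERSION, helpers.greet())  # prints: 0.1 hello
```

If the zip is missing the root folder or an __init__.py, the import fails here exactly as it would on the cluster, which makes this a cheap pre-flight check.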

Official docs

Some examples from AWS