Full Development Lifecycle for PySpark Data Flows Using Databricks on AWS

Priyan Chandrapala
May 12, 2020
Read: 6 min

In this blog post, we will focus on creating the project skeleton for a PySpark job and its test framework, automating the build with GitLab CI, and deploying the jobs to production with Databricks Docker images on AWS. Phew! That's a lot!

This can be considered Part 2 of my earlier blog post, where we discussed the collaborative development of a PySpark/Keras data flow using Databricks Notebooks.

Now we will see how such a data flow can be made production-ready.




Install pip and virtualenv

```shell
python3 -m pip install --user --upgrade pip
python3 -m pip install --user virtualenv
```

Creating the Job and test case

The notebook we discussed in Part 1 can be broken down into at least two Python jobs. For simplicity's sake, we will consider only a small part of it, up to the first checkpoint discussed in Part 1. This job simply converts the CSV file to Parquet.
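As a rough sketch of what this job might look like (the function name and the usage paths are illustrative assumptions, not taken from the original repository):

```python
def csv_to_parquet(spark, input_path, output_path):
    """Read a CSV file with a header row and write it back out as Parquet."""
    df = spark.read.csv(input_path, header=True, inferSchema=True)
    df.write.mode("overwrite").parquet(output_path)

# Usage on a cluster (or locally with pyspark installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
#   csv_to_parquet(spark, "sample_data.csv", "out/sample_data.parquet")
```

Keeping the Spark session as a parameter (rather than creating it inside the function) is what makes the job easy to exercise from a test fixture later on.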


Let's also consider the test case. Without going into detail on every aspect of the test class, it performs a simple assertion to check that data exists in the Parquet file. Here, tmp_path_factory is a session-scoped fixture that can be used to create arbitrary temporary directories from any other fixture or test; it is provided automatically by the test framework. The Spark session is also a session-scoped fixture, which we create ourselves.

Test framework

Setting up the test framework is simple. Let's see the steps involved.

These are the contents of my requirements file for the test stage. Not all of these packages are used in this example, but you will most probably use Pandas and mocks in some of your test cases.

```
# requirements-test.in
```

A simple Makefile looks like the following.
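The original Makefile is shown as an image; as a rough sketch of what it might contain (the target names `freeze`, `run-tests` and `src_package` are taken from the post, but the recipes are assumptions that presume pip-tools is used for pinning):

```makefile
freeze:
	pip-compile requirements-test.in --output-file requirements-test.txt
	pip-compile requirements.in --output-file requirements.txt

run-tests:
	python -m pytest tests/

src_package:
	python setup.py bdist_wheel
```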

We define the fixture functions in conftest.py to make them accessible across multiple test files, like the Spark session used here.
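A minimal sketch of such a conftest.py, assuming pytest and pyspark are installed on the test machine (the local master setting is an assumption for running tests without a cluster):

```python
# conftest.py
import pytest


@pytest.fixture(scope="session")
def spark():
    # Imported lazily so test collection works even without Spark available
    from pyspark.sql import SparkSession

    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("unit-tests")
        .getOrCreate()
    )
    yield session
    session.stop()
```

Because the fixture is session-scoped, the (slow) Spark session is created once and shared by every test in the run.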

This is what the test package structure is going to look like. Notice that we have copied a small sample of the data into a file called sample_data.csv.

test package structure

Before running the test framework, we first need to generate the requirements files. Have another look at the freeze target in the Makefile. Running make freeze in a command shell will generate the test and production requirements files, with all the transitive dependencies pinned for the packages you require.

requirements file

The test suite needs to run within a virtual environment to isolate the packages and dependencies. The following commands create the virtual environment and install the test framework with all the package dependencies.
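Roughly, the commands look like this (the virtual environment name is an assumption; the install step presumes `make freeze` has already generated the requirements file):

```shell
# Create and activate an isolated virtual environment
python3 -m venv .venv
. .venv/bin/activate
# Then install the pinned test dependencies generated by `make freeze`:
#   pip install -r requirements-test.txt
```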


Then you can run the tests with make run-tests. If all the tests pass, the output should look like this.

tests output

Building and packaging the project

Since 2018, Python packaging has typically been done using the wheel format. We have created a setup.py to facilitate the build. Let's see the contents of this file:
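The original setup.py is shown as an image; a sketch of what it might contain (the package name, version and layout here are assumptions, not taken verbatim from the post):

```python
# setup.py
from setuptools import setup, find_packages

setup(
    name="pyspark-databricks-poc",
    version="0.1.0",
    description="CSV-to-Parquet PySpark job",
    packages=find_packages(exclude=["tests"]),
    python_requires=">=3.6",
)
```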

Running make src_package will create the build. Check the contents of the Makefile given above to see the list of commands executed. As a result of this step, we can now see the contents of the build and dist folders. The .whl file in dist is the packaged distribution.

build and dist folders

In addition, we can also see the .egg-info directory. It holds the metadata that lets us install the package above with pip install.

egg-info file

Now let us see how all this will be packaged into a Docker image.

Configuring and building the Databricks Docker image

We have selected a Docker image with the Databricks runtime so that all the Spark and ML dependencies are already included. The Docker build is done in two stages.

The first stage builds the Python dependencies by installing them from our requirements.txt file. Note that we don't have to install Java, Scala, or PySpark, because these distributions are already available in the Databricks runtime.

In the second stage, we first copy the packages built in the first stage, then copy our codebase into the Docker image to install our package. Now let us have a look at the Dockerfile.
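The original Dockerfile is shown as an image; a two-stage sketch might look like the following. The base image (`databricksruntime/standard` is Databricks's published runtime image on Docker Hub), the tag and the paths are assumptions that depend on the runtime version you target:

```dockerfile
# Stage 1: build wheels for our Python dependencies
FROM databricksruntime/standard:latest AS deps
COPY requirements.txt /tmp/requirements.txt
RUN pip3 wheel --wheel-dir=/tmp/wheels -r /tmp/requirements.txt

# Stage 2: install the dependencies and our packaged job
FROM databricksruntime/standard:latest
COPY --from=deps /tmp/wheels /tmp/wheels
COPY dist/*.whl /tmp/
RUN pip3 install --no-index --find-links=/tmp/wheels /tmp/*.whl
```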

To build this image locally, use the following command:

```shell
docker build -t pyspark-databricks-poc .
```

To run the Docker container locally and log in to it, use this command:

```shell
docker run -it pyspark-databricks-poc /bin/bash
```

Creating a CI/CD pipeline and deploying to a Databricks cluster

Now that we have done most of the heavy lifting in the codebase, it's time for some DevOps.

We use GitLab as an integrated repository and DevOps lifecycle management tool. It provides a GitHub-like source repository coupled with many other useful features, such as a built-in CI/CD tool, an artifact repository and a wiki.

GitLab CI was the intuitive choice for our CI/CD pipeline: it has all the features we were looking for, and it is already integrated with our source repository, so there is less extra work. Let us have a look at the stages defined in the YAML file.

We have three stages. Each stage runs within a Docker container executed by a GitLab CI runner.

The build stage has two parts: the first builds and packages the source code, and the second deploys it within the Docker container.
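The original .gitlab-ci.yml is shown as an image; a sketch of what it might contain (the job names, image tags and Makefile targets are assumptions; the image name `plc9/spark-poc` is taken from the post):

```yaml
# .gitlab-ci.yml
stages:
  - test
  - build
  - deploy

test:
  stage: test
  image: python:3.7
  script:
    - pip install -r requirements-test.txt
    - make run-tests

package:
  stage: build
  image: python:3.7
  script:
    - make src_package
  artifacts:
    paths:
      - build/
      - dist/

docker-build:
  stage: build
  image: docker:stable
  services:
    - docker:dind
  script:
    - docker build -t plc9/spark-poc .
    - docker push plc9/spark-poc

deploy:
  stage: deploy
  script:
    - ./deploy/create_cluster_and_submit_job.sh
```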

Docker deployment

The output of the stages is shown below.

Test: only if all the tests pass can we progress to the next stage of the pipeline.

testing before the next stage

Build stage 1: note that at the end of this stage, the output (the build/, dist/ and .egg-info/ folders) is copied to a temporary location to be staged again into the next step.

temporary location for build/ dist/ .egg-info/ folders

This is the second build stage, where the Docker image with the Databricks runtime is built with the Python dependencies and our source package installed. The image is then uploaded to Docker Hub.

Building the Docker image with the Databricks runtime and Python dependencies

Docker push plc9/spark-poc

In the deploy stage, we execute a curl command to invoke the Databricks REST API. This creates a cluster using the Docker image we pushed in the earlier stage as the base. The image contains both the Databricks runtime and our source package. In the image below, I have masked the authorisation tokens for both Databricks and Docker Hub.

We also submit the job as the “spark_python_task” within the same curl command. As a result, the job is executed as soon as the cluster is ready.
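The request might look roughly like the following, a sketch against the Databricks `runs/submit` endpoint. The instance URL, token variable, node type, Spark version and file paths are placeholders and assumptions; the image name is taken from the post:

```shell
curl -X POST "https://<your-databricks-instance>/api/2.0/jobs/runs/submit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -d '{
    "run_name": "csv-to-parquet",
    "new_cluster": {
      "spark_version": "6.x-scala2.11",
      "node_type_id": "i3.xlarge",
      "num_workers": 2,
      "docker_image": {
        "url": "plc9/spark-poc:latest",
        "basic_auth": {"username": "...", "password": "..."}
      }
    },
    "spark_python_task": {
      "python_file": "dbfs:/jobs/csv_to_parquet.py"
    }
  }'
```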


Finally, log in to the Databricks account and check the status of the cluster and the job.

cluster and job status on Databricks


At Infinite Lambda, we architect end-to-end integrated solutions with the Databricks platform on the cloud. We can also help with all aspects of technical implementation.
Please check out our latest service offering on Databricks:






