Full Development Lifecycle for PySpark Data Flows Using Databricks on AWS

Priyan Chandrapala
May 12, 2020
Read: 6 min

In this blog we will focus on creating the project skeleton for a PySpark job, setting up a test framework, automating the build with GitLab CI, and deploying the jobs in production with Databricks Docker images on AWS. Phew! That’s a lot!

This can be considered Part 2 of my earlier blog post, where we discussed the collaborative development of a PySpark/Keras data flow using Databricks Notebooks.

Now we will see how such a data flow can be made production-ready.




Install pip and virtualenv

[php]python3 -m pip install --user --upgrade pip
python3 -m pip install --user virtualenv[/php]

Creating the Job and test case

The notebook we discussed in Part 1 can be broken down into at least 2 Python jobs. For simplicity’s sake we will consider only a small part of it, up to the first checkpoint discussed in Part 1. This job will just convert the CSV file to Parquet.
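Since the job boils down to one read and one write, a minimal sketch of it could look like the following. The function names, the header and schema-inference options and the app name are assumptions made for illustration, not the exact code from the project.

```python
def convert_csv_to_parquet(spark, csv_path, parquet_path):
    """Read a CSV file that has a header row and write it out as Parquet."""
    # header/inferSchema are assumptions about the input file
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # Overwrite so that re-running the job is repeatable
    df.write.mode("overwrite").parquet(parquet_path)
    return df


def main(argv):
    """Hypothetical entry point: main([script, csv_path, parquet_path])."""
    # Imported here so the conversion logic can be imported without Spark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    convert_csv_to_parquet(spark, argv[1], argv[2])
```

Passing the session in as an argument keeps the conversion easy to test: the test suite can hand over its own local session, while the entry script would call main(sys.argv) on the cluster.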


Let’s also consider the test case. Without going into every aspect of the test class, it does a simple assertion to check that data exists in the Parquet file. Here tmp_path_factory is a session-scoped fixture that can be used to create arbitrary temporary directories from any other fixture or test; it is provided automatically by the test framework. The spark session is also a session-scoped fixture, one that we create ourselves.

Test framework

Setting up the test framework is simple. Let’s see the steps involved.

These are the contents of my requirements file for the test stage. Not all of these packages are used in this example, but you will most probably use Pandas and mocks for some of your test cases.

[php]# requirements-test.in[/php]

A simple Makefile.txt looks like the following.

We can define fixture functions in conftest.py to make them accessible across multiple test files, like the spark session defined here.
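A session-scoped Spark fixture in conftest.py could be sketched like this. The fixture name, the local[2] master and the app name are assumptions for illustration rather than the project’s actual settings.

```python
import pytest


@pytest.fixture(scope="session")
def spark_session():
    """Create one local Spark session and share it across the whole test run."""
    # Imported lazily so that collecting tests does not require Spark
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[2]")               # assumed: two local worker threads
        .appName("pyspark-job-tests")     # hypothetical app name
        .config("spark.ui.enabled", "false")
        .getOrCreate()
    )
    yield spark
    # Runs once after the last test of the session has finished
    spark.stop()
```

Session scope matters here because starting a Spark session takes several seconds; creating it once keeps the suite fast.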

This is what the test package structure is going to look like. Notice here we have copied a small sample of the data to a file sample_data.csv.

test package structure

Before running the test framework we first need to generate the requirements files. Have another look at the freeze target in the Makefile. Running make freeze in a command shell will generate the test and production requirements files. The newly generated files will list all the dependencies of the packages you require.

requirements file

The test suite needs to run within a virtual environment to isolate the packages and dependencies. The following commands will create the virtual environment and install the test framework with all the package dependencies.


Then you can run the tests with the command make run-tests. If all the tests pass, the output should look like this.

tests output

Building and packaging the project

Python packages are now commonly distributed as wheels, built with the wheel package. We have created a setup.py to facilitate the build. Let’s see the contents of this file:

Running make src_package will create the build. Check the contents of the Makefile given above to see the list of commands executed. As a result of this step, we can now see the contents of the build and dist folders. The .whl file in dist is the packaged distribution.
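A wheel is an ordinary zip archive, so a quick way to check what ended up in the package is to list its members. The helper below is a small sketch using only the standard library; the function name is hypothetical.

```python
import zipfile
from pathlib import Path


def list_wheel_contents(dist_dir):
    """Return {wheel file name: [archive member paths]} for each .whl in dist_dir."""
    contents = {}
    for wheel in sorted(Path(dist_dir).glob("*.whl")):
        # A .whl file can be opened like any other zip archive
        with zipfile.ZipFile(wheel) as archive:
            contents[wheel.name] = archive.namelist()
    return contents
```

Calling list_wheel_contents("dist") after the build should show the job modules alongside a .dist-info folder holding the package metadata.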

build and dist folders

In addition, we can also see the .egg-info folder. It holds the metadata that lets us install the package above with pip install.

egg-info file

Now let us see how all this will be packaged into a Docker image.

Configuring and building the Databricks Docker image

We have selected a Docker image with the Databricks runtime so that all the Spark and ML dependencies are already embedded. The Docker build is done in 2 stages.

The first stage installs the Python dependencies from our requirements.txt file. Note that we don’t have to install Java, Scala, or PySpark because these distributions are already available with the Databricks runtime.

In the second stage we first copy the packages we built in the first stage and then copy our codebase into the Docker image to install our package. Now let us have a look at the Dockerfile.

To build this image locally, use the following command:

[php]docker build -t pyspark-databricks-poc .[/php]

To run the Docker container locally and open a shell in it, use this command:

[php]docker run -it pyspark-databricks-poc /bin/bash[/php]

Creating a CI/CD pipeline and deploying to a Databricks cluster

Now we have done most of the heavy lifting with the codebase and it’s time for some DevOps.

We use GitLab as an integrated repository and DevOps lifecycle management tool. It provides a GitHub-like source repository coupled with many other cool features like a built-in CI/CD tool, artifact repository, wiki, etc.

GitLab CI was the intuitive choice for our CI/CD pipeline as it has all the features we were looking for and is already integrated with our source repository, so there is less extra work. Let us have a look at the stages defined in the YAML file.

We have 3 stages: test, build and deploy. Each of these stages will run within a Docker container executed by a GitLab CI Runner.

The build stage has two parts. The first part builds and packages the source code. The second part installs that package into the Docker image.

Docker deployment

The output of the stages is shown below.

Test: We can progress to the next stage of the pipeline only if all the tests pass.

testing before the next stage

Build stage 1: Note that at the end of this stage, the output (the build/, dist/ and .egg-info/ folders) will be copied to a temporary location to be staged again into the next step.

temporary location for build/ dist/ .egg-info/ folders

Build stage 2: This is the second build stage, where the Docker image with the Databricks runtime will be built with the Python dependencies and our source package installed. This image will then be uploaded to Docker Hub.

Building Docker image with the Databricks runtime with Python dependencies

Docker push plc9/spark-poc

In the deploy stage, we execute a curl command to invoke the Databricks REST API. This will create a cluster using the Docker image we pushed in the earlier stage as the base. This Docker image will have both the Databricks runtime and our source package. In the image below I have masked the authorisation tokens for both Databricks and Docker Hub.

We are also submitting the job in the “spark_python_task” within the same curl command. This will result in the job getting executed as soon as the cluster is ready.
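To make the shape of that request more concrete, here is a Python sketch of the same call. It assumes the Runs Submit endpoint of the Databricks Jobs API; the function names, run name, Spark version, node type and worker count are placeholders for illustration, not the values used in our pipeline.

```python
import json
import urllib.request


def build_run_payload(image_url, docker_user, docker_token, python_file):
    """Build the JSON body: a new cluster from our Docker image plus the job."""
    return {
        "run_name": "pyspark-databricks-poc",       # placeholder name
        "new_cluster": {
            "spark_version": "6.5.x-scala2.11",     # placeholder runtime
            "node_type_id": "i3.xlarge",            # placeholder AWS node type
            "num_workers": 1,
            "docker_image": {
                "url": image_url,
                "basic_auth": {"username": docker_user, "password": docker_token},
            },
        },
        # The job is submitted together with the cluster definition
        "spark_python_task": {"python_file": python_file, "parameters": []},
    }


def submit_run(host, token, payload):
    """POST the payload to the workspace; host and token come from CI variables."""
    request = urllib.request.Request(
        f"{host}/api/2.0/jobs/runs/submit",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping the payload in its own function makes it easy to inspect or unit test before any token is involved. As with the curl command, the tokens should come from masked CI variables rather than the repository.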


Finally, log in to the Databricks account and check the status of the cluster and the job.
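If you would rather have the pipeline check the status than look it up in the UI, the Jobs API also returns the state of a run as JSON. The sketch below only interprets such a response; the function name is hypothetical, and it assumes the state object carries life_cycle_state and result_state fields as returned by the runs/get endpoint.

```python
# Life cycle states after which a run will not change any more
FINISHED_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def run_outcome(run):
    """Map a runs/get response (already parsed into a dict) to a simple outcome."""
    state = run.get("state", {})
    if state.get("life_cycle_state") not in FINISHED_STATES:
        # Still pending, running or terminating (or no state reported yet)
        return "running"
    return "succeeded" if state.get("result_state") == "SUCCESS" else "failed"
```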

cluster and job status on Databricks


At Infinite Lambda we architect end-to-end integrated solutions with the Databricks Platform on the cloud. We will also help with all aspects of technical implementation.






