...

Full Development Lifecycle for PySpark Data Flows Using Databricks on AWS

Priyan Chandrapala
May 12, 2020
Read: 6 min

In this blog post we will focus on creating the project skeleton for a PySpark job and its test framework, automating the build with GitLab CI, and deploying the jobs to production with Databricks Docker images on AWS. Phew! That's a lot!

This can be considered Part 2 of my earlier blog post, where we discussed the collaborative development of a PySpark/Keras data flow using Databricks Notebooks.

Now we will see how such a data flow can be made production-ready.

 

production-ready-data-flow

Prerequisites

Install pip and virtualenv

python3 -m pip install --user --upgrade pip
python3 -m pip install --user virtualenv

Creating the Job and test case

The notebook we discussed in Part 1 can be broken down into at least two Python jobs. For simplicity's sake, we will only consider a small part of it, up to the first checkpoint discussed in Part 1. This will just convert the CSV file to Parquet.
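The job itself can be as small as the following sketch. The module path (jobs/csv_to_parquet.py), the argument names and the CSV read options are assumptions for illustration, not the exact code from the notebook.

# jobs/csv_to_parquet.py -- minimal sketch of the first job (names are illustrative)
import argparse

from pyspark.sql import SparkSession


def convert_csv_to_parquet(spark, input_path, output_path):
    """Read the CSV input and persist it as Parquet (the first checkpoint from Part 1)."""
    df = (
        spark.read
        .option("header", "true")
        .option("inferSchema", "true")
        .csv(input_path)
    )
    df.write.mode("overwrite").parquet(output_path)


def main():
    parser = argparse.ArgumentParser(description="Convert a CSV file to Parquet")
    parser.add_argument("--input-path", required=True)
    parser.add_argument("--output-path", required=True)
    args = parser.parse_args()

    spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()
    convert_csv_to_parquet(spark, args.input_path, args.output_path)
    spark.stop()


if __name__ == "__main__":
    main()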

 

Let's also consider the test case. Without going into every aspect of the test class, it performs a simple assertion to check that data exists in the Parquet file. Here tmp_path_factory is a session-scoped fixture, provided automatically by the test framework, that can be used to create arbitrary temporary directories from any other fixture or test. The Spark session is also a session-scoped fixture, but one that we create ourselves.
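A sketch of what such a test could look like is shown below; the file locations and function names are assumptions that mirror the job sketch above.

# tests/test_csv_to_parquet.py -- illustrative test case, not the original one
from jobs.csv_to_parquet import convert_csv_to_parquet


def test_csv_to_parquet_writes_rows(spark, tmp_path_factory):
    # spark comes from conftest.py; tmp_path_factory is provided by pytest itself
    output_dir = tmp_path_factory.mktemp("parquet_out")

    convert_csv_to_parquet(spark, "tests/resources/sample_data.csv", str(output_dir))

    # simple assertion: the Parquet output contains data
    result = spark.read.parquet(str(output_dir))
    assert result.count() > 0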

Test framework

Setting up the test framework is simple. Let's see the steps involved.

These are the contents of my requirements file for the test stage. Not all of these packages are used in this example, but you will most probably use pandas and mocks for some of your own test cases.

# requirements-test.in
pyspark~=2.4.2
mock==4.0.2
pytest~=5.2.2
pytest-mock~=1.11.2
pandas==0.25.3
wheel==0.34.2

A simple Makefile looks like the following.
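The original file is not reproduced in this extract; a sketch covering the targets referenced in this post (freeze, run-tests and src_package) could look like the one below, assuming pip-tools is used to pin the requirements files.

# Makefile -- illustrative sketch of the targets used in this post
# (recipes must be indented with tabs)
.PHONY: freeze venv run-tests src_package

freeze:
	pip-compile requirements.in --output-file requirements.txt
	pip-compile requirements-test.in --output-file requirements-test.txt

venv:
	python3 -m virtualenv .venv
	.venv/bin/pip install -r requirements-test.txt
	.venv/bin/pip install -e .

run-tests:
	.venv/bin/python -m pytest tests/

src_package:
	python3 setup.py bdist_wheel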

We can define fixture functions in conftest.py to make them accessible across multiple test files, like the Spark session defined here.
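A sketch of such a conftest.py, assuming a small local Spark session is sufficient for the tests:

# tests/conftest.py -- shared, session-scoped Spark fixture (sketch)
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    """Create a single local SparkSession for the whole test session."""
    session = (
        SparkSession.builder
        .master("local[2]")
        .appName("pyspark-databricks-poc-tests")
        .getOrCreate()
    )
    yield session
    session.stop()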

This is what the test package structure is going to look like. Notice that we have copied a small sample of the data to a file called sample_data.csv.

test package structure

Before running the test framework, we first need to generate the requirements files. Have another look at the freeze target in the Makefile. Running make freeze in a shell will generate the test and production requirements files, with all transitive dependencies pinned for the packages you require.

requirements file

The test suite needs to run within a virtual environment to isolate the packages and dependencies. The following commands will create the virtual environment and install the test framework with all its package dependencies.
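The exact commands are not shown in this extract; they could look like the following, assuming a .venv directory and an editable install of our package so the tests can import it.

python3 -m virtualenv .venv
source .venv/bin/activate
pip install -r requirements-test.txt
pip install -e .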

 

Then you can run the tests with the command make run-tests. If all the tests pass, the output should look like this.

tests output

Building and packaging the project

Python packages are nowadays built and distributed as wheels. We have created a setup.py to facilitate the build. Let's see the contents of this file:
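The original setup.py is not reproduced in this extract; a minimal sketch could look like this, where the package name and version are assumptions.

# setup.py -- minimal sketch; name and version are placeholders
from setuptools import find_packages, setup

setup(
    name="pyspark-databricks-poc",
    version="0.1.0",
    description="PySpark jobs packaged for deployment on a Databricks cluster",
    packages=find_packages(exclude=["tests", "tests.*"]),
    python_requires=">=3.7",
)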

Running make src_package will create the build. Check the contents of the Makefile above to see the list of commands executed. After this step we can see the build and dist folders. The .whl file in dist is the packaged distribution.

build and dist folders

In addition, we can also see the .egg-info directory. This contains the metadata that lets us install the package above with pip install.
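For example, once the wheel has been built it could be installed with a command along these lines; the exact filename depends on the name and version declared in setup.py.

pip install dist/pyspark_databricks_poc-0.1.0-py3-none-any.whl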

egg-info directory

Now let us see how all this will be packaged into a Docker image.

Configuring and building the Databricks Docker image

We have selected a Docker image with the Databricks runtime so that all the Spark and ML dependencies are already embedded. The Docker build is done in 2 stages.

The first stage will build the Python dependencies from our requirements.txt file. Note that we don't have to install Java, Scala or PySpark, because these distributions already come with the Databricks runtime.

In the second stage we first copy the packages we built in the first stage, and then copy our codebase into the Docker image to install our package. Now let us have a look at the Dockerfile.
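The original Dockerfile is not reproduced in this extract; a two-stage sketch could look like the one below. The databricksruntime/standard base image tag and the /databricks/python3 path to the image's Python environment are assumptions and should be checked against the Databricks runtime image you actually use.

# Dockerfile -- two-stage sketch, not the exact file from this project
FROM databricksruntime/standard:latest AS builder

# Stage 1: build wheels for the Python dependencies in requirements.txt
COPY requirements.txt /tmp/requirements.txt
RUN /databricks/python3/bin/pip wheel --wheel-dir=/tmp/wheels -r /tmp/requirements.txt

FROM databricksruntime/standard:latest

# Stage 2: copy the dependency wheels from the first stage plus our own
# packaged distribution, then install everything into the image's Python environment
COPY --from=builder /tmp/wheels /tmp/wheels
COPY dist/*.whl /tmp/wheels/
RUN /databricks/python3/bin/pip install /tmp/wheels/*.whl && rm -rf /tmp/wheels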

To build this image locally, use the following command:

docker build -t pyspark-databricks-poc .

To run the Docker container locally and log in to it, use this command:

docker run -it pyspark-databricks-poc /bin/bash

Creating a CI/CD pipeline and deploying to a Databricks cluster

Now that we have done most of the heavy lifting on the codebase, it's time for some DevOps.

We use GitLab as an integrated repository and DevOps lifecycle management tool. It provides a GitHub-like source repository coupled with many other cool features such as a built-in CI/CD tool, an artifact repository, a wiki and more.

GitLab CI was the intuitive choice for our CI/CD pipeline: it has all the features we were looking for and it is already integrated with our source repository, so there is less extra work. Let us have a look at the stages defined in the YAML file.
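The original .gitlab-ci.yml is not reproduced in this extract; a condensed sketch of the pipeline described below could look like this. The images, variable names and the deploy script are assumptions.

# .gitlab-ci.yml -- condensed sketch of the pipeline (details are illustrative)
stages:
  - test
  - build
  - deploy

run-tests:
  stage: test
  # placeholder image: local PySpark tests also need a JDK, so in practice
  # this would be an image with both Python 3 and Java 8
  image: python:3.7
  script:
    - make venv
    - make run-tests

build-package:
  stage: build
  image: python:3.7
  script:
    - pip install wheel
    - make src_package
  artifacts:
    paths:
      - build/
      - dist/
      - "*.egg-info/"

build-docker-image:
  stage: build
  needs: ["build-package"]
  image: docker:stable
  services:
    - docker:dind
  script:
    # DOCKER_HUB_USER and DOCKER_HUB_TOKEN are assumed CI/CD variables
    - docker login -u "$DOCKER_HUB_USER" -p "$DOCKER_HUB_TOKEN"
    - docker build -t plc9/spark-poc .
    - docker push plc9/spark-poc

deploy-job:
  stage: deploy
  image: alpine:latest
  script:
    - apk add --no-cache curl
    # hypothetical wrapper around the curl call shown further down in this post
    - ./deploy/submit_databricks_job.sh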

We have three stages. Each of these stages runs within a Docker container executed by a GitLab CI runner.

The build stage has two parts: the first builds and packages the source code, and the second builds the Docker image with the package installed.

Docker deployment

The output of the stages is shown below.

Test: The pipeline progresses to the next stage only if all the tests pass.

testing before the next stage

Build stage 1: Note that at the end of this stage, the output (the build/, dist/ and .egg-info/ folders) is copied to a temporary location so it can be staged again in the next step.

temporary location for build/ dist/ .egg-info/ folders

This is the second build stage, where the Docker image with the Databricks runtime is built with the Python dependencies and our source package installed. This image is then pushed to Docker Hub.

Building the Docker image with the Databricks runtime and Python dependencies

Docker push plc9/spark-poc

In the deploy stage, we execute a curl command to invoke the Databricks REST API. This will create a cluster using the Docker image we pushed in the earlier stage as the base. This Docker image contains both the Databricks runtime and our source package. In the image below I have masked the authorisation tokens for both Databricks and Docker Hub.

We also submit the job as the "spark_python_task" within the same curl command, which results in the job being executed as soon as the cluster is ready.
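The exact call is not reproduced in this extract. One way to do it is through the runs/submit endpoint of the Databricks Jobs API, roughly as sketched below; the instance URL, Spark version, node type, image tag, file paths and parameters are all placeholders to adapt.

curl -X POST "https://<databricks-instance>/api/2.0/jobs/runs/submit" \
  -H "Authorization: Bearer $DATABRICKS_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "run_name": "csv-to-parquet",
    "new_cluster": {
      "spark_version": "6.4.x-scala2.11",
      "node_type_id": "i3.xlarge",
      "num_workers": 1,
      "docker_image": {
        "url": "plc9/spark-poc:latest",
        "basic_auth": {
          "username": "<docker-hub-user>",
          "password": "<docker-hub-token>"
        }
      }
    },
    "spark_python_task": {
      "python_file": "dbfs:/jobs/csv_to_parquet.py",
      "parameters": ["--input-path", "s3://<bucket>/raw/data.csv",
                     "--output-path", "s3://<bucket>/parquet/"]
    }
  }'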

spark_python_task

Finally, log in to the Databricks account and check the status of the cluster and the job.

cluster and job status on Databricks

Summary

At Infinite Lambda, we architect end-to-end integrated solutions with the Databricks platform on the cloud and help with all aspects of technical implementation.
Please check out our latest service offering on Databricks:

Databricks_end-to-end-solutions

References

https://docs.pytest.org

https://packaging.python.org/tutorials/packaging-projects/

https://medium.com/@greut/building-a-python-package-a-docker-image-using-pipenv-233d8793b6cc

https://docs.databricks.com/dev-tools/api/latest/index.html
