While machine learning applications have been enjoying a peak in popularity in the last few years, companies still have a hard time integrating these innovative technologies with their cloud data platforms. Fortunately for Snowflake users, Snowpark, a recent addition to Snowflake’s capabilities, aims to solve this problem.
As a Snowflake Elite Partner, Infinite Lambda has extensive experience with the Data Cloud, and our engineers are early adopters of the latest additions to the ecosystem. Now, we are going to help you find your way around Snowpark for Python so you and your team can benefit from the framework's capabilities too.
Getting started with Snowpark for Python
Snowpark allows developers to use Scala, Java or Python to build secure and scalable data pipelines and ML workflows directly within Snowflake. This empowers companies to extract more value from their data without having to manage any additional infrastructure.
This article looks at some best practices for developing and deploying Snowpark for Python projects based on an example* machine learning (ML) use case. Note that to get the most out of this article, you need a basic understanding of Snowpark.
Find the complete code for the solution in this repository and feel free to explore, fork and play around with it.
*The example project is based on a template presented by Jeff Hollan of Snowflake at BUILD ‘22. You can access the full code of the template on GitHub.
In this scenario, multiple data engineers and/or data scientists collaborate on the development of a predictive ML model. Given a year as an input, the model should be able to predict the consumer spending in that particular year.
The Python implementation of the solution includes both user-defined functions and stored procedures. These are used to load data from a dataset on the Snowflake Marketplace, preprocess and transform it and, finally, train a simple linear regression model. Details about the implementation of this ML application are out of the scope of this article, but the complete code of the solution is available in the repository.
Our solution must fulfil the following requirements for development workflow and best practices:
- As developers should not need access to production data to be able to develop and test their models, the production and development environments on Snowflake must be isolated from each other;
- Team members must be able to work on and deploy new features and modifications without interfering with each other’s work;
- A single source of truth must define the production environment (database objects and artefacts) on Snowflake;
- Source version control must be used to track changes in the project;
- The Python code must be reusable, modularised, structured and managed according to existing Python development best practices and using established tools;
- The local environment should be reproducible, including dependencies;
- Every change made to the solution must be automatically built and tested before being deployed;
- The trained model must be made available both within Snowflake and as an artefact for external use;
- Multiple versions of the ML model must be available and usable in Snowflake;
- Credentials and other secrets must be stored in a secure way.
The structure of the complete solution is as follows:
- ci/Dockerfile: Dockerfile for the builder image used in our GitLab CI pipeline;
- setup: SQL scripts that we use to create the required Snowflake objects and call procedures for data preprocessing, model training or inference generation;
- snowpark_devops: The root Python package containing all the Python source code;
- tests: End-to-end and unit tests used in the testing phase;
- config.py: Snowflake credentials for local development;
- pyproject.toml: Poetry configuration – dependencies for the local environment are added here;
- poetry.lock: a dependencies lock file generated and managed by Poetry;
- requirements.other.txt: pure Python dependencies not available in the Snowflake Anaconda Channel;
- requirements.txt: dependencies available in the Snowflake Anaconda Channel;
- scratch.ipynb: Jupyter Notebook for local development and testing.
Local development environment and dependency management
Project collaborators often encounter difficulties setting up their local environment when starting to work on an existing project. This is a major source of frustration for the people involved and delays meaningful contributions to the project.
Furthermore, project members often have subtle differences in their local Python environments for various reasons, such as using a different operating system or platform or mismatching versions of project dependencies. This may lead to code that doesn’t work consistently for all developers.
These issues can be fixed by introducing a tool to manage the development environment in a deterministic, replicable way. We use a tool called Poetry to manage our local development environment and Python dependencies. This helps us ensure that the development environment that project collaborators use is reproducible and identical, so the same code produces the same results for all collaborators.
Poetry also allows us to easily build, package and publish the Python package and track all of our dependencies' versions. Dependencies are installed using the CLI and added to the pyproject.toml file. Each dependency's exact version is listed in a generated poetry.lock file, which is committed to version control and used to replicate the environment when installing the project with the Poetry CLI.
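To illustrate the workflow described above, here are the typical Poetry commands a collaborator would run; the package name is just an example of adding a dependency:

```shell
poetry install                          # recreate the environment from poetry.lock
poetry add snowflake-snowpark-python    # add a dependency to pyproject.toml
poetry run pytest                       # run commands inside the managed environment
```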
In addition to the Poetry configuration used to set up the local environment, we maintain two more files containing lists of dependencies. requirements.txt is used to install the project dependencies during the CI/CD process, and requirements.other.txt is needed to build packages unavailable in the Snowflake Anaconda Channel.
As with any software project, we need to test our code sufficiently to make sure it functions according to our requirements and to detect regressions. In the example at hand, there are two types of tests: Python tests and end-to-end tests.
We need to test our Python code to ensure the correctness of our Snowpark stored procedures and user-defined functions, as well as any other custom Python code our workflow uses. We leverage the industry-standard pytest framework to write and run our tests, which can be classic Python unit tests, testing the functionality of a function in isolation, or integration tests.
You can think of the tests implemented in our example project as a form of integration testing. This allows us to test our workflow's functionality on Snowflake and quickly detect any regressions introduced in our Python code.
Consider the following example:
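The sketch below shows what such a pytest test might look like; the `preprocess_spending` function, the test data and the `session` fixture are illustrative stand-ins for the actual code in the repository, and the fixture assumes credentials provided via config.py:

```python
import pytest
from snowflake.snowpark import Session

# Hypothetical preprocessing function from the project's root package
from snowpark_devops.preprocessing import preprocess_spending


@pytest.fixture(scope="module")
def session():
    # Credentials for local development come from config.py (not committed)
    from config import connection_parameters
    session = Session.builder.configs(connection_parameters).create()
    yield session
    session.close()


def test_preprocess_spending(session):
    # Build a small in-memory DataFrame to use as test input
    input_df = session.create_dataframe(
        [(2019, 100.0), (2019, 200.0), (2020, 300.0)],
        schema=["YEAR", "SPENT"],
    )
    result = preprocess_spending(input_df).collect()
    # We expect one aggregated row per year
    assert len(result) == 2
```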
Here, we are testing a Python function that preprocesses the data used for the model training. We can ensure it behaves as expected by calling it with test data and comparing the result to an expectation. Since the testing code makes use of the Snowpark DataFrame API, it executes remotely on Snowflake and can detect errors before we actually deploy the code.
Once our Snowpark application is deployed, we would like to verify the result of our ML workflow in a real-world scenario and validate any tables or views. To do this, we define assertions based on real SQL statements executed against our Snowflake account.
In our case, these are described as simple JSON objects, consisting of a name and an SQL query which should return an empty result. However, you can use any other tool or platform to validate your data, like Great Expectations for example.
There is a step in our CI/CD pipeline that runs these queries against our Snowflake instance and logs any unmet expectations.
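A minimal sketch of this idea is shown below. The assertion names and queries are made up for illustration; the runner takes a query-executing callable so that, in the pipeline, it can be wired to a live session (e.g. `lambda q: session.sql(q).collect()`), while here we demonstrate it with a stub:

```python
import json

# Hypothetical end-to-end assertions: each query should return an empty result
ASSERTIONS_JSON = """
[
  {"name": "predictions_not_empty",
   "query": "SELECT 1 WHERE NOT EXISTS (SELECT * FROM PREDICTIONS)"},
  {"name": "no_negative_spending",
   "query": "SELECT * FROM SPENDING WHERE TOTAL_SPENT < 0"}
]
"""


def run_assertions(assertions, run_query):
    """Run each assertion query and return the names of any that failed,
    i.e. returned a non-empty result."""
    failed = []
    for assertion in assertions:
        rows = run_query(assertion["query"])
        if rows:
            failed.append(assertion["name"])
    return failed


# Demonstrate with a stubbed query runner instead of a live Snowflake session
stub = lambda query: [] if "PREDICTIONS" in query else [("bad row",)]
print(run_assertions(json.loads(ASSERTIONS_JSON), stub))  # ['no_negative_spending']
```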
Another recommended practice when working with Snowpark for Python is to import the Snowpark functions and types modules using an alias. This makes it easy to differentiate between built-in Python functions and their respective Snowpark DataFrame counterparts (such as min and max). Furthermore, you can access all the functionality these modules provide without having to manage additional imports.
Consider the following:
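A short sketch of the aliasing convention; the table name is illustrative and `session` is assumed to be an existing Snowpark Session:

```python
from snowflake.snowpark import functions as F
from snowflake.snowpark import types as T

df = session.table("SPENDING")  # hypothetical table

# F.min and F.max are unambiguously the Snowpark column functions,
# leaving Python's built-in min and max untouched
summary = df.select(
    F.min(F.col("YEAR")).alias("FIRST_YEAR"),
    F.max(F.col("YEAR")).alias("LAST_YEAR"),
)

# The same applies to types, e.g. when declaring schemas
field = T.StructField("TOTAL_SPENT", T.FloatType())
```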
Workflow and environment separation
To support collaborative workflows, track changes to our project’s source code and manage parallel versions of it, we leverage Git and host our repository on the GitLab platform, which also allows us to define, execute and monitor CI/CD pipelines with GitLab CI.
In our Git workflow, there is a single long-living branch called main. The main branch serves as the single source of truth for the Snowflake production schema. This branch has appropriate protections and checks in place to prevent unstable code from affecting the production Snowflake environment. No project member can push directly to the main branch, and only repository maintainers have permission to merge incoming merge requests into it.
GitLab allows us to set up checks and protections for some of our repository’s branches. Let us use the following configuration:
- Developers can work on new features or fixes only in separate branches cut from the main branch;
- Maintainers can only merge a feature branch into the main branch after it passes all steps in the CI/CD pipeline and undergoes peer review;
- To set up branch protections, we utilise the options in Settings -> Repository -> Protected Branches.
We also restrict the creation of new tags to maintainers because this feature is utilised to release new model versions (Settings -> Repository -> Protected Tags).
Furthermore, we are going to store all secrets and credentials needed to build and deploy the project using GitLab variables. We can set these up in the Settings -> CI/CD -> Variables menu.
We want to keep the production versions of our ML workflow isolated from the development ones on the Snowflake side too. To do this, we maintain separate schemas for the main production environment and for features that are not yet ready for production deployment.
Access to these can then be restricted using Snowflake role-based access control. New schemas for feature branches of our code are automatically created during the CI/CD process.
Adding a new feature
In our example repository, a developer wants to add a function materialising a table of our ML model’s predictions to the Python code. So, they cut a branch called feature/forecast from the most recent state of the main branch. They work on the new feature, then commit and push the branch to the remote.
A new schema called FEATURE_FORECAST is created in the Snowflake database and the updated ML workflow is deployed to it. Once the developer is satisfied with their work, they can open a merge request to the main branch. After all checks succeed and the changes have been reviewed, a project maintainer can merge it, and the CI/CD pipeline automatically updates the production environment.
A new version of our ML workflow and model can be released simply by creating a new Git tag on the main branch:
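For example, a release could be cut like this (the version number is illustrative):

```shell
# Tag the current state of main and push the tag to trigger the release job
git checkout main
git pull
git tag v1.0.0
git push origin v1.0.0
```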
This triggers a CI/CD job exporting the model artefact to a file in an internal stage, named after the tag. Snowflake users can then register the exported model as a Snowflake UDF and utilise it for inferences by calling the UDF, using either Snowpark or regular Snowflake queries.
Here is an example:
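The sketch below shows one way to register and call such a UDF; the stage name, file path and model details are all hypothetical, and `session` is assumed to be an existing Snowpark Session:

```python
from snowflake.snowpark import functions as F
from snowflake.snowpark.types import FloatType, IntegerType

# Make the exported model file available to the UDF
session.add_import("@ml_models/v1.0.0/model.joblib")


def predict_spending(year: int) -> float:
    import sys
    import joblib
    # Snowflake extracts imports into this directory at runtime
    import_dir = sys._xoptions["snowflake_import_directory"]
    model = joblib.load(import_dir + "model.joblib")
    return float(model.predict([[year]])[0])


session.udf.register(
    predict_spending,
    name="PREDICT_SPENDING_V1",
    return_type=FloatType(),
    input_types=[IntegerType()],
    packages=["scikit-learn", "joblib"],
    is_permanent=True,
    stage_location="@ml_models",
    replace=True,
)

# Call it via the Snowpark DataFrame API ...
session.create_dataframe([[2025]], schema=["YEAR"]).select(
    F.call_udf("PREDICT_SPENDING_V1", F.col("YEAR"))
).show()

# ... or via a regular SQL query
session.sql("SELECT PREDICT_SPENDING_V1(2025)").collect()
```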
In addition, the serialised model will be available on the internal stage and you can export it to external services. You might want to do that to make it available via a RESTful API for instance.
Because the model is available on an internal stage, its deployment is not limited to Snowflake: we can opt for a custom microservice or an external platform to serve inferences.
Finally, let us go step-by-step through the GitLab CI pipeline that puts all of the pieces together.
Many of the steps in our pipeline send out SQL statements to Snowflake. To reduce code repetition and the CI/CD pipeline execution time, we have built a simple Docker image that contains all of the tools we need in our jobs.
GitLab CI allows us to run our jobs in containers using this image:
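An excerpt of what this looks like in .gitlab-ci.yml; the registry path is illustrative:

```yaml
# Run every job in our pre-built image unless a job overrides it
default:
  image: registry.gitlab.com/our-group/snowpark-devops/builder:latest
```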
We can build and push the image to any Docker registry configured in GitLab. After that, we just need to specify it in our jobs.
The preliminary stage in our pipeline parses a name for the new schema to be created from our Git branch name. This name will be accessible as an environment variable from all following jobs. This is necessary because some symbols that are allowed in Git branch names, such as a hyphen, are not permitted in Snowflake schema names. This way, the Git branch feature/my-new-feature becomes the Snowflake schema FEATURE_MY_NEW_FEATURE.
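In the pipeline itself this is a small shell step, but the transformation boils down to something like this (a sketch, not the exact implementation):

```python
import re


def branch_to_schema(branch_name: str) -> str:
    """Turn a Git branch name into a valid Snowflake schema name by
    replacing characters not allowed in unquoted identifiers with
    underscores and upper-casing the result."""
    return re.sub(r"[^A-Za-z0-9_]", "_", branch_name).upper()


print(branch_to_schema("feature/my-new-feature"))  # FEATURE_MY_NEW_FEATURE
```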
In our build stage, we install the requirements needed for testing and run our Python tests. If a test fails, the entire CI/CD pipeline fails. We also create the Snowflake schema for the branch if it does not exist yet. Finally, we build all manually managed packages that are unavailable in the remote environment and zip them together with our workflow code. The archive is then published as a pipeline artefact.
In the next stage, we are going to upload our Python code package to an internal stage and run the SQL script to register our stored procedures. Moreover, we are going to create any views or tables with data that our training procedure utilises.
By using the GitLab environments feature, we can define actions to be taken when a feature branch is deleted. In our case, a special job in this stage runs only when a branch is deleted, whether manually or as part of merging a merge request.
To avoid filling our Snowflake database with obsolete schemas, this job drops the schema for the specific branch after it has been deleted.
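A sketch of how this can be wired up in .gitlab-ci.yml; the job names, scripts and the `SNOWFLAKE_SCHEMA` variable are illustrative, and the pattern assumes the standard GitLab `on_stop`/`action: stop` mechanism:

```yaml
deploy_feature:
  stage: deploy
  environment:
    name: $CI_COMMIT_REF_SLUG
    on_stop: drop_schema   # run the cleanup job when the environment stops
  script:
    - ./deploy.sh

drop_schema:
  stage: deploy
  environment:
    name: $CI_COMMIT_REF_SLUG
    action: stop           # marks this as the environment's stop job
  when: manual             # triggered by GitLab on branch deletion
  script:
    - snowsql -q "DROP SCHEMA IF EXISTS $SNOWFLAKE_SCHEMA"
```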
We are almost ready, so great job keeping up so far.
The next step is to run the procedures specified in the second SQL script to train and export our model to an internal stage.
Release on Git tag
As we mentioned earlier in the article, we use Git tags to support model versioning. A special job in our pipeline runs when a new tag is pushed, writing the serialised model to an internal stage and releasing it as a pipeline artefact.
End-to-end testing phase
As a last step in our pipeline, we execute a shell script to run all of our end-to-end tests against our Snowflake account. Again, if this job fails, the entire pipeline will be marked as failed.
Following best practices for development and deployment, especially when it comes to the latest technology, has two major benefits. First, it facilitates effective delivery by streamlining collaboration for all team members. Second, it helps ensure continuity, making it easy for Infinite Lambda's clients to take over the solutions we deliver. This encourages client teams to adopt and adhere to high standards that help them make the most of their infrastructure in the long run.
Now that you are familiar with the best development and deployment practices for Snowpark for Python, feel free to explore the repository to dig deeper and experiment, or use it as a base for your next Snowpark project. If you have any questions, reach out to the team and we would be more than happy to help.
Make sure to explore more technical content on the Infinite Lambda blog.