Just a little recap of what we’ve published recently

Hugo Shi
Oct 13, 2021

This post is mostly for me, to organize in my head the various things Saturn Cloud has published.

Dask

Beginner’s Guide

I recently started a series of articles aimed at beginners that incorporates practical tips we’ve picked up working with customers.

Just Start with the Dask LocalCluster: There is a lot out there about different ways to deploy Dask. Dask can be deployed on Kubernetes with the dask-kubernetes project (which we use as a building block for Saturn), as well as directly onto most clouds with dask-cloudprovider. At certain scales, these deployment patterns make sense. I fundamentally believe in simplicity, which is why I argue that for most people, the Dask LocalCluster is the right way to go.
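To make the LocalCluster idea concrete, here is a minimal sketch rather than code from the article itself. The function names square and sum_of_squares are made up for illustration, and processes=False is my assumption to keep the workers as threads so the snippet runs anywhere without a main guard.

```python
from dask.distributed import Client, LocalCluster


def square(x):
    # Stand-in for any independent unit of work.
    return x * x


def sum_of_squares(n):
    # Assumption: processes=False runs the workers as threads inside this
    # process. Drop it to get separate worker processes on one machine.
    with LocalCluster(n_workers=1, threads_per_worker=2, processes=False) as cluster:
        with Client(cluster) as client:
            # Submit one task per input, then collect the results locally.
            futures = client.map(square, range(n))
            return sum(client.gather(futures))
```

Calling sum_of_squares(5) starts the cluster, runs the tasks, and shuts everything down again when the with blocks exit. Nothing here needs Kubernetes or a cloud account.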

If You Can Write Functions, You Can Use Dask: I think that Dask is heavily under-utilized. There is a lot of content out there about advanced use cases of Dask, such as machine learning and geospatial algorithms. However, most people can leverage Dask for embarrassingly parallel workloads with Dask Delayed (dask.delayed).
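As a rough illustration of the "if you can write functions" point, the sketch below wraps an ordinary function with dask.delayed and runs it over a list of inputs. The names process_item and process_all are hypothetical placeholders, not taken from the article.

```python
import dask
from dask import delayed


def process_item(x):
    # Any plain Python function works here; this one is a placeholder.
    return x * 2


def process_all(items):
    # Build one lazy task per item. Nothing runs until compute is called.
    tasks = [delayed(process_item)(item) for item in items]
    # dask.compute returns a tuple with one result per task, in order.
    return list(dask.compute(*tasks))
```

Because each call is independent, the tasks can run in parallel on whatever scheduler is active, whether that is the default threaded scheduler or a LocalCluster.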

Deep Learning

Speeding up Neural Network Training With Multiple GPUs and Dask and Combining Dask and PyTorch for Better, Faster Transfer Learning. These two articles cover parallel training with PyTorch and Dask using the dask-pytorch-ddp library that we developed specifically for this work. This approach tends to work quite well in practice, but some work is usually needed to load data across the cluster efficiently.

Computer Vision at Scale With Dask and PyTorch and The Future of Computer Vision with AI Pioneer Senseye. These two articles cover the classic example of using Dask for batch inference with PyTorch, including a case study of one of our first forays into Dask and deep learning, with Senseye.

Modeling Unstructured Data Using Snowflake and Saturn Cloud. This last piece, which we demoed at the Snowflake Build Summit, covers training and inference with Dask, leveraging Snowflake’s new unstructured data support.

Technical Choices

We have a few articles covering some technical choices related to Dask and optimizations.

Troubleshooting Dask GroupBy Aggregation Performance. This is one of the pieces that took me the longest to write, but it gives a good overview of how Dask groupby + agg works.

Should I Use Dask? As mentioned earlier, I fundamentally believe that simplicity is extremely important. If you can avoid parallel computation, you should.

Leveraging Snowflake and Dask. That article is a bit outdated, but it covers loading query shards into Dask dataframes.

Data Science Operations

We’ve also written some articles covering the business of making data science useful for companies.

Strategies for managing big data. You don’t always have to use Dask. This article covers practical strategies for dealing with big datasets in your company.

Host a Jupyter Notebook as an API. OK, this article covers quite a bit: running a notebook as a job using Papermill, as well as triggering data science jobs via a CI/CD pipeline.

Deploy Your Machine Learning Model. This is a three-article series on deploying ML models and building a dashboard out of them.

Deploying Data Pipelines at Saturn Cloud with Dask and Prefect. This article covers how we use Dask at Saturn for our internal production jobs.

Using Dask to Execute CLI Calls. We’ve worked with a few companies in the life sciences that need to execute shell scripts with Dask. That piece of documentation covers some practical advice.
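One way to picture this pattern, which is not necessarily the approach the documentation takes, is to wrap subprocess.run in a plain function and hand it to dask.delayed. The helpers run_command and run_commands and the EXAMPLE_COMMANDS list are hypothetical. The example commands call the current Python interpreter so that the sketch stays portable.

```python
import subprocess
import sys

import dask
from dask import delayed

# Hypothetical commands for illustration; swap in real CLI tools or scripts.
EXAMPLE_COMMANDS = [
    [sys.executable, "-c", "print('hello')"],
    [sys.executable, "-c", "print('world')"],
]


def run_command(args):
    # check=True raises CalledProcessError if the command exits non-zero.
    completed = subprocess.run(args, capture_output=True, text=True, check=True)
    return completed.stdout.strip()


def run_commands(commands):
    # One lazy task per command; Dask schedules them in parallel.
    tasks = [delayed(run_command)(cmd) for cmd in commands]
    return list(dask.compute(*tasks))
```

Calling run_commands(EXAMPLE_COMMANDS) returns each command's captured output in the same order as the input list.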
