What Is A Snowflake Stage?
This past week at work, I’ve been wrangling large datasets, trying to figure out the most efficient way to process and transform the data without overloading other teams’ servers.
Disclaimer: I’m not a data scientist, and SQL is not my strong suit. Any mistakes below are mine :)
So imagine you have a table like this:
First | Last | Favorite_color | Zip_code |
---|---|---|---|
Angela | Apple | red | 90210 |
Bruce | Banana | yellow | 02134 |
Claire | Cantaloupe | orange | 20008 |
… | … | … | … |
Zed | Zucchini | …zcolor | 12345 |
And let’s say that table has eleventy billion rows in it. The actual number is not important; for the purposes of our exercise, the important part is that it is “huge”.
Now let’s say you want to get the first name of everyone whose ZIP code matches a list of 10,000 arbitrary ZIP codes.*
You could do:
SELECT First FROM my_table WHERE Zip_code IN (10001, 10002, 10003 ... 20000)
but that is very slow! I don’t know how long it takes on this hypothetical table, but in my real-world scenario, a similar query took 20 seconds for 1,000 items in the IN clause. With the amount of data we need to pull from this table (significantly more than 10,000 rows), batching in groups of 1,000 would be prohibitively slow.
However, if you had your desired ZIP codes as a separate table, my_codes:
Desired_zip_codes |
---|
10001 |
10002 |
… |
20000 |
you could run a much faster query:
SELECT First FROM my_table INNER JOIN my_codes ON my_table.Zip_code = my_codes.Desired_zip_codes
This is much faster because... SQL magic?**
In my case, I had confirmed that a JOIN would be faster than an IN clause. So how do I get my list of codes into a JOIN-able format?
A colleague suggested using a Snowflake stage, which I had never heard of before. The basic documentation on stages explains how to create them. Essentially, a stage is a loading area for ingesting structured data that isn’t already in your database.
In my case I have a CSV uploaded to S3 with the data I want to join on. So I need to load that into a stage and then I can join against it.
So we create a stage:
CREATE STAGE my_stage URL='s3://mybucket/mypath'
There are a million other params you can pass in to tweak the behavior of said stage; see the CREATE STAGE docs for more. But that’s all it takes to create a very basic stage. Depending on how your database and AWS integrations are set up, you’ll probably need to add some permissions. This will be different for every setup, so I’ll spare you the gory details of what I did (it was mostly Terraform and pinging our data-platform team for help). But assuming you’ve created the stage and your database role has permission to read from it, you are now in business!
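For example, here’s a slightly fuller sketch with a file format and a storage integration baked in (my_s3_integration is a hypothetical name, and whether you need an integration at all depends on your setup):

CREATE OR REPLACE STAGE my_stage
  URL = 's3://mybucket/mypath'
  STORAGE_INTEGRATION = my_s3_integration  -- hypothetical; this is what grants Snowflake access to the bucket
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);  -- treat staged files as CSV and skip the header row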
Say you have a file in your S3 storage: mybucket/mypath/data.csv. It’s now as simple as:
SELECT $1, $2 FROM @my_stage/data.csv
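If you’ve defined a named file format, you can also point the query at it and cast the positional columns inline (my_csv_format is a placeholder name):

SELECT t.$1::NUMBER AS zip  -- $1 = first column in the file
FROM @my_stage/data.csv (FILE_FORMAT => 'my_csv_format') t;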
Snowflake doesn’t know that the first row of your CSV holds its column names (assuming your CSV has a header row at all, which is a whole other thing), which is why we have to use positional references. However, we can fix that by importing the CSV into a temporary (or permanent, depending on our use case) table:
CREATE TEMP TABLE mytemptable (ZIP number);
Let’s now say our ZIP code list is a one-column CSV:
COPY INTO mytemptable FROM @my_stage/data.csv;
SELECT First FROM my_table JOIN mytemptable ON mytemptable.ZIP = my_table.Zip_code;
As before, there are about a thousand different configurable options here. I especially like MATCH_BY_COLUMN_NAME, which, when paired with PARSE_HEADER = TRUE, automagically matches the column names in your CSV’s header to the columns in the table, even if the columns are in a different order.
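Putting those two options together, a sketch using the same placeholder names as above:

COPY INTO mytemptable
FROM @my_stage/data.csv
FILE_FORMAT = (TYPE = 'CSV' PARSE_HEADER = TRUE)  -- read column names from the CSV header
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;  -- map header names to table columns, ignoring case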
Note: MATCH_BY_COLUMN_NAME doesn’t get you a free pass on creating the temp table if you want to select by column name. Dynamically creating the temp table without knowing the CSV’s headers ahead of time is a separate post (or see “Generate column description” in my resources below).
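As a teaser, the pattern in the docs looks roughly like this (it assumes a named file format created with PARSE_HEADER = TRUE; my_csv_format is again a placeholder):

SELECT GENERATE_COLUMN_DESCRIPTION(ARRAY_AGG(OBJECT_CONSTRUCT(*)), 'table')
FROM TABLE(INFER_SCHEMA(
  LOCATION => '@my_stage/data.csv',
  FILE_FORMAT => 'my_csv_format'));  -- spits out a column list you can paste into CREATE TABLE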
With all that in place, I’ve now got a script that does everything I need: it selects the data I want from Snowflake, joining on my imported CSV! It’s still quite slow, taking a few minutes each run, but compared to the previous process, this is a huge improvement!
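For reference, the whole flow end to end (all names and paths are the placeholders from earlier, and the permissions setup is omitted):

-- 1. Point a stage at the S3 bucket
CREATE STAGE my_stage URL = 's3://mybucket/mypath';

-- 2. Land the CSV in a temp table
CREATE TEMP TABLE mytemptable (ZIP NUMBER);
COPY INTO mytemptable FROM @my_stage/data.csv;

-- 3. Join against it instead of using a giant IN clause
SELECT First
FROM my_table
JOIN mytemptable ON mytemptable.ZIP = my_table.Zip_code;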
Resources:
- Querying data in staged files
- Generate column description
- Create FILE FORMAT (very useful for specifying options on your incoming CSV)
*I realize this is a contrived example; believe me, my real-world use case is more realistic.
**I learned in the process of writing this blog post that a JOIN is not always faster than an IN, and ‘real’ data scientists and database admins have dedicated many words to analyzing which query is faster under which circumstances. (Don’t forget that most databases let you run something like EXPLAIN on a query before actually executing it, to see the query plan and get a sense of its cost.)
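In Snowflake that’s just EXPLAIN in front of the query, which returns the plan (not an actual runtime):

EXPLAIN
SELECT First
FROM my_table
JOIN mytemptable ON mytemptable.ZIP = my_table.Zip_code;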