@Andrielson
Created July 25, 2025 05:04

Implementing Vector Embedding Processing with Python UDFs in Snowflake

This document outlines the design and implementation of a Snowflake-native system for processing and generating vector embeddings using Python UDFs. The system replaced a distributed EC2-based solution, reducing operational overhead and enabling scalable, SQL-driven vector workflows within our analytics platform.

Problem statement

Our team needed to perform operations on embedding vectors (e.g., similarity scoring, normalization, distance calculations) across billions of records stored in our Snowflake database.

However, the existing workflow—relying on distributed Python scripts running on EC2 and ingesting CSVs from S3—was operationally brittle, inefficient, and difficult to maintain.

To replace it, I proposed implementing the operations as Python user-defined functions (UDFs) inside Snowflake, which improved the process significantly, cutting cost, processing time, and complexity.

Goals and Non-goals

Goals

  • Enable vector operations (e.g., normalization, cosine similarity) via SQL queries in Snowflake (a sketch follows this list)
  • Allow on-demand embedding generation using our own sentence-transformer models
  • Maintain performance suitable for batch processing and near-real-time scoring
  • Use native Snowflake capabilities (UDFs) to avoid infrastructure sprawl
  • Use our own embedding models via the Sentence Transformers library
  • Avoid overhead of managing Python environments with hundreds of dependencies
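
As an illustration of the first goal, cosine similarity maps naturally onto a vectorized Python UDF. The handler below is a minimal sketch, not one of our production functions; the two-vector signature and names are hypothetical:

import numpy as np
from pandas import DataFrame, Series
from _snowflake import vectorized

@vectorized(input=DataFrame)
def main(df: DataFrame):
	# Columns 0 and 1 each hold one input vector (ARRAY of FLOAT) per row.
	a = np.array(df[0].tolist(), dtype=float)
	b = np.array(df[1].tolist(), dtype=float)
	# Row-wise cosine similarity: dot(a, b) / (|a| * |b|).
	return Series((a * b).sum(axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)))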

Non-goals

  • Build a full-featured vector database
  • Use Snowflake's native vector functions

Technical Overview

  • Python UDFs were developed in a local project using uv for dependency management and pushed to a private GitHub repo. Snowflake fetches the Python modules from that repo via its external Git integration and deploys the functions, giving us a full CI/CD pipeline (the integration setup is sketched after the code below).
  • Each UDF handles a specific vector operation, such as:
    • parse_address_with_libpostal(address VARCHAR)
    • embed_address(address VARCHAR)
    • embed_organization(company_name VARCHAR)
  • Snowflake’s Anaconda integration was used to load dependencies (like postal, pandas, numpy, pymilvus, sentence-transformers).
  • Snowflake's AWS S3 integration was used to load external assets, such as our custom embedding models. The embedding UDF below imports its handler module from the Git stage and the zipped model from the @MODELS stage:
CREATE OR REPLACE FUNCTION PUBLIC.SENTENCE_TRANSFORMERS_ENCODE_ADDRESS_128("ADDRESS" VARCHAR)
RETURNS ARRAY(FLOAT)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
PACKAGES = ('pandas','sentence-transformers')
IMPORTS = (
	'@GITHUB.SNOWFLAKE_PYTHON/branches/main/user_defined_functions/sentence_transformers/encode_address_128.py',
	'@MODELS/address-v1-128dim.zip')
HANDLER = 'encode_address_128.main';

The handler module (encode_address_128.py) unzips the imported model under a file lock and loads it once per worker process:
import fcntl
import os
import sys
import threading
import zipfile
from pandas import DataFrame
from sentence_transformers import SentenceTransformer
from _snowflake import vectorized

@vectorized(input=DataFrame)
def main(df: DataFrame):
	# 'model' is initialized once at module import time (see the bottom of this file).
	embeddings = model.encode(df[0])
	return embeddings

# File lock class for synchronizing write access to /tmp.
class FileLock:
	def __enter__(self):
		self._lock = threading.Lock()
		self._lock.acquire()
		self._fd = open('/tmp/lockfile.LOCK', 'w+')
		fcntl.lockf(self._fd, fcntl.LOCK_EX)

	def __exit__(self, type, value, traceback):
		self._fd.close()
		self._lock.release()

# Get the location of the import directory. Snowflake sets the import
# directory location so code can retrieve the location via sys._xoptions.
IMPORT_DIRECTORY_NAME = "snowflake_import_directory"
import_dir = sys._xoptions[IMPORT_DIRECTORY_NAME]

# Get the path to the ZIP file and set the location to extract to.
zip_file_path = import_dir + "address-v1-128dim.zip"
extracted = '/tmp/models'

# Extract the contents of the ZIP. This is done under the file lock
# to ensure that only one worker process unzips the contents.
with FileLock():
	if not os.path.isdir(extracted + '/address-v1-128dim'):
		with zipfile.ZipFile(zip_file_path, 'r') as myzip:
			myzip.extractall(extracted)

model = SentenceTransformer(extracted + '/address-v1-128dim')
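
For completeness, the @GITHUB.SNOWFLAKE_PYTHON and @MODELS stages referenced in the IMPORTS clause are backed by integrations along the following lines. This is a sketch: the organization, URLs, and integration names are placeholders, and a private repo additionally needs GIT_CREDENTIALS:

-- Hypothetical setup for the @GITHUB.SNOWFLAKE_PYTHON repository stage.
CREATE OR REPLACE API INTEGRATION GITHUB_API_INTEGRATION
	API_PROVIDER = GIT_HTTPS_API
	API_ALLOWED_PREFIXES = ('https://github.com/my-org')
	ENABLED = TRUE;

CREATE OR REPLACE GIT REPOSITORY GITHUB.SNOWFLAKE_PYTHON
	API_INTEGRATION = GITHUB_API_INTEGRATION
	ORIGIN = 'https://github.com/my-org/snowflake-python.git';

-- Hypothetical setup for the @MODELS stage holding the zipped models.
CREATE OR REPLACE STAGE MODELS
	URL = 's3://my-bucket/models/'
	STORAGE_INTEGRATION = S3_MODELS_INTEGRATION;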

System Architecture

Before

sequenceDiagram
Snowflake ->> AWS S3: export CSV with addresses for embedding
AWS S3 ->> AWS EC2 producer: list all CSV files
AWS EC2 producer ->> AWS SQS: write to queue
AWS SQS ->> AWS EC2 consumer: read from queue
Note right of AWS EC2 consumer: stream CSV file from S3<br/>parse address with libpostal<br/>embed parsed address with<br/>sentence-transformers.
AWS EC2 consumer ->> AWS S3: export the libpostal JSON and vectors JSON in CSV files
AWS S3 ->> Snowflake: import the CSV files into new tables

After

Everything runs in Snowflake:

flowchart TD
	A[ORIGINAL ADDRESSES] --> B(LIBPOSTAL_PARSE_ADDRESS)
	B --> C[PARSED ADDRESSES]
	C --> D(SENTENCE_TRANSFORMERS_ENCODE_ADDRESS_128)
	D --> E[EMBEDDED ADDRESSES]
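
In SQL, the whole flow reduces to one statement along these lines. Table and column names are illustrative, and in production the encoder input is derived from the parsed address rather than the raw string:

CREATE OR REPLACE TABLE EMBEDDED_ADDRESSES AS
SELECT
	ADDRESS,
	PUBLIC.LIBPOSTAL_PARSE_ADDRESS(ADDRESS) AS PARSED_ADDRESS,
	PUBLIC.SENTENCE_TRANSFORMERS_ENCODE_ADDRESS_128(ADDRESS) AS EMBEDDING
FROM ORIGINAL_ADDRESSES;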

Detailed Design

Python script

from pandas import DataFrame
from postal.parser import parse_address
from _snowflake import vectorized

@vectorized(input=DataFrame)
def main(df: DataFrame):
	# parse_address yields (value, label) tuples; flip them into a {label: value} dict.
	return df.iloc[:, 0].apply(lambda address: {k: v for (v, k) in parse_address(address)})

Create UDF statement

CREATE OR REPLACE FUNCTION PUBLIC.LIBPOSTAL_PARSE_ADDRESS("ADDRESS" VARCHAR)
RETURNS OBJECT
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
PACKAGES = ('pandas','postal')
IMPORTS = ('@GITHUB.SNOWFLAKE_PYTHON/branches/main/user_defined_functions/libpostal/parse_address.py')
HANDLER = 'parse_address.main';

Usage example

Query:

SELECT PUBLIC.LIBPOSTAL_PARSE_ADDRESS('5203 NORTH RED MAPLE STREET, IRONS, MI 49644');

Output:

{
	"city": "irons",
	"house_number": "5203",
	"postcode": "49644",
	"road": "north red maple street",
	"state": "mi"
}

Trade-offs and Alternatives Considered

  • Considered pre-processing vectors externally and storing normalized versions in the DB, but this added maintenance burden and decreased flexibility.
  • Considered using Snowpark for Python but opted for simple UDFs due to lower overhead and easier integration into SQL pipelines.

Security, Scalability, and Performance

Security

  • UDFs run in Snowflake’s secure sandboxed Python environment.
  • Strict control over secrets and external access using Snowflake's EXTERNAL_ACCESS_INTEGRATIONS and SECRETS:
CREATE OR REPLACE FUNCTION ZILLIZ_CLOUD.DELETE_ENTITIES_BY_PRIMARY_KEY("COLLECTION_NAME" VARCHAR, "ID" VARIANT, "PARTITION_NAME" VARCHAR DEFAULT '')
RETURNS OBJECT
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
PACKAGES = ('pandas','pymilvus')
IMPORTS = ('@GITHUB.SNOWFLAKE_PYTHON/branches/main/user_defined_functions/zilliz/delete_entities_by_primary_key.py')
HANDLER = 'delete_entities_by_primary_key.main'
EXTERNAL_ACCESS_INTEGRATIONS = (ZILLIZ_CLOUD_API_ACCESS_INTEGRATION)
SECRETS = (
    'api_token' = ZILLIZ_CLOUD.API_TOKEN_SECRET, 
    'api_endpoint' = ZILLIZ_CLOUD.DEDICATED_CLUSTER_ENDPOINT_SECRET, 
    'db_name' = ZILLIZ_CLOUD.DEDICATED_CLUSTER_ENTITY_NAMES_DB_SECRET);
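
Inside the handler, those secrets are read through the _snowflake helper module. A minimal sketch of the client setup (pymilvus usage simplified):

import _snowflake
from pymilvus import MilvusClient

# The keys must match the names declared in the SECRETS clause above.
api_token = _snowflake.get_generic_secret_string('api_token')
api_endpoint = _snowflake.get_generic_secret_string('api_endpoint')
db_name = _snowflake.get_generic_secret_string('db_name')

client = MilvusClient(uri=api_endpoint, token=api_token, db_name=db_name)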

Scalability

  • Performance is acceptable for datasets with hundreds of millions of rows in batch mode.
  • Scales up easily by switching to a larger Snowflake warehouse, without the overhead of managing hundreds of AWS EC2 instances.

Performance Tuning

  • Vectorized UDFs batch thousands of rows into a pandas DataFrame per call instead of invoking the function once per row, as sketched below.
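
When a batch risks exhausting memory (e.g., very long inputs), the decorator also accepts a max_batch_size cap; the value below is illustrative:

from pandas import DataFrame
from _snowflake import vectorized

# Cap batches at 4096 rows; Snowflake may still send smaller batches.
@vectorized(input=DataFrame, max_batch_size=4096)
def main(df: DataFrame):
	# Handler body unchanged; only the batching behavior differs.
	return df[0]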

Testing Strategy

  • Created integration tests using a small Snowflake warehouse with data samples in a staging environment.

Deployment / Rollout Plan

  • Deployed using versioned UDFs (e.g., LIBPOSTAL_PARSE_ADDRESS_V1, V2, etc.) to allow for safe upgrades.
  • Deployed via GitHub Actions pipelines for testing and Snowflake operations, as sketched below.
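
In practice a deployment run boils down to refreshing the repository stage and re-issuing the CREATE statements, roughly:

-- Run by the GitHub Actions pipeline after a merge to main.
ALTER GIT REPOSITORY GITHUB.SNOWFLAKE_PYTHON FETCH;
-- ...followed by CREATE OR REPLACE FUNCTION statements for the new
-- versioned UDFs (e.g., PUBLIC.LIBPOSTAL_PARSE_ADDRESS_V2).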

Monitoring and Observability

  • Used Snowflake’s query history and profiling tools to observe UDF execution times (example query below).
  • Monitored for UDF failures or memory issues (e.g., vector too large, eval errors).
  • Logged vector dimensions and runtimes to Snowflake audit logs for later analysis.
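
For example, UDF execution times can be pulled from the account usage views with a query along these lines:

SELECT QUERY_ID, START_TIME, TOTAL_ELAPSED_TIME, ROWS_PRODUCED
FROM SNOWFLAKE.ACCOUNT_USAGE.QUERY_HISTORY
WHERE QUERY_TEXT ILIKE '%SENTENCE_TRANSFORMERS_ENCODE_ADDRESS_128%'
ORDER BY START_TIME DESC
LIMIT 100;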

Known Limitations and Future Work

  • Python UDFs currently have limited GPU access.
  • The Snowflake Anaconda channel lags behind the latest releases of several packages and offers a smaller selection than PyPI.