Scaling vLLM for Embeddings: 16x Throughput and Cost Reduction

Embeddings do what LLMs can’t alone — make data searchable, meaningful and ready for real-time use. From semantic search and recommendations to fraud detection, embedding models power the AI capabilities of Snowflake Cortex AI, a built-in suite of AI functions for agents, search and analytics.
To support these capabilities at scale, Snowflake processes trillions of tokens per month across both real-time and batch workloads. But when we benchmarked our embedding models on vLLM, we uncovered performance bottlenecks that left GPUs underutilized. After a deep round of system profiling and architectural tuning, we reworked how tokenization, batching and serialization flow through vLLM and observed:
16x and 4.2x higher embedding throughput for short (50-token) and long (512-token) sequences, respectively, compared to vLLM.
2.4x higher embedding throughput for short sequences, with performance parity for longer sequences, compared to Text Embeddings Inference (TEI).
In this blog, we walk through what caused the bottlenecks, how each optimization works and how these improvements are being rolled out across Snowflake Cortex AI.
We are also open sourcing all vLLM-specific improvements as part of Arctic Inference — our fast, cost-effective open source inference system for enterprise AI.
Bottlenecks: Prompt tokenization and vector serialization
We envisioned a future where Cortex Search, Snowflake’s semantic retrieval service, could operate at the scale of billions of documents. To support fast indexing and real-time retrieval across many workloads, embedding models need to run with high throughput and minimal overhead. Improving efficiency isn’t just about speed — it directly impacts GPU utilization and cost.
When we profiled our embedding models on vLLM, the throughput and GPU utilization were far worse than what a PyTorch-native implementation could achieve. To uncover the bottlenecks, we profiled embedding inference in vLLM using Python runtime traces. Figure 1 shows a flame graph from one run, where each span represents the duration of a function call. The embed() function, which handles GPU inference, accounts for just 10% of total compute time. The remaining 90% is spent on CPU tasks, revealing significant overhead.
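The post does not name the profiler behind Figure 1, but a minimal sketch of this kind of measurement with Python's built-in cProfile might look like the following; serve_embedding_batch is a hypothetical stand-in for the request path that tokenizes, embeds and serializes.

```python
import cProfile
import pstats


def serve_embedding_batch():
    # Hypothetical stand-in for the real request path:
    # tokenize prompts (CPU), run embed() (GPU), serialize the response (CPU).
    ...


profiler = cProfile.Profile()
profiler.enable()
serve_embedding_batch()
profiler.disable()

# Sort by cumulative time to see which functions dominate a request; tools such as
# snakeviz or flameprof can render the same stats as a flame graph.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```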

The flame graph attributed the CPU overhead to two sources: tokenization and data serialization. But why did they take so long?
Tokenization overhead is a bottleneck in vLLM. When vLLM receives embedding requests with prompts as raw strings, it first tokenizes the prompts on the CPU and then performs inference on the GPU. Because these steps run sequentially, the GPU sits idle until tokenization completes, creating "bubbles" in the GPU schedule.
Data serialization becomes a bottleneck when vLLM is deployed behind a gRPC frontend, a setup we use at Snowflake to support multi-tenant serving, multiple programming languages, and dynamic model swapping for surging demand. While this architecture improves flexibility and GPU utilization, it introduces latency when converting embedding outputs from a Python List[float32] to Protobuf's repeated float format, likely due to Python's Global Interpreter Lock (GIL) and the lack of SIMD vectorization in Python Protobuf.
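As a rough, self-contained illustration of the gap (not our production measurement), converting a float32 vector element by element, which is roughly what populating a repeated float field entails in Python, is far slower than handing off the same buffer as raw bytes:

```python
import timeit

import numpy as np

vec = np.random.rand(768).astype(np.float32)  # a typical embedding vector

# Populating a Protobuf "repeated float" field means touching every element as a
# Python float object, work that the GIL forces onto a single thread.
per_element = timeit.timeit(lambda: list(map(float, vec)), number=1_000)

# Serializing the same vector as raw bytes is one vectorized call into C.
raw_bytes = timeit.timeit(lambda: vec.tobytes(), number=1_000)

print(f"per-element: {per_element:.4f}s  raw bytes: {raw_bytes:.4f}s")
```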
With these discoveries, we came up with three optimizations to address the bottlenecks.
Optimization 1: Encode embedding vector as Little-Endian bytes
This optimization addresses the data serialization bottleneck in any gRPC service that wraps vLLM. We significantly reduced gRPC response latency by encoding the output list of floats (the embedding) as raw little-endian bytes.
Endianness determines how a computer stores multibyte data, such as floating-point numbers and integers, in memory. Little-endian means the least-significant byte (the "smallest" part) is stored at the lowest memory address. We chose little-endian because it is the native byte order on most instruction set architectures (ISAs), so encoding this way avoids the extra memory copy or byte-swapping step that big-endian would require, making it faster for our use case.
We further speed up this raw-bytes serialization with NumPy vectorization:
embedding_bytes = response.embedding_tensor.numpy().astype(dtype="<f4").tobytes()
Vectorization accelerates computation by applying operations to entire arrays at once, leveraging low-level optimizations and SIMD instructions. Vectorization in NumPy is built on optimized C libraries and avoids Python’s per-element overhead.
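Putting the pieces together, a minimal sketch of the round trip looks like the following; the variable names are illustrative, and in the real service the bytes travel in a gRPC response field rather than a local variable:

```python
import numpy as np

# Stand-in for a model output; in the gRPC frontend this comes from vLLM.
embedding_tensor = np.random.rand(768).astype(np.float32)

# Server side: encode as raw little-endian float32 bytes ("<f4") instead of a
# Protobuf repeated float field. tobytes() is a single vectorized copy in C.
embedding_bytes = embedding_tensor.astype("<f4").tobytes()

# Client side: reconstruct the vector with no per-element Python work and,
# on little-endian hosts, no byte swapping.
decoded = np.frombuffer(embedding_bytes, dtype="<f4")
assert np.array_equal(decoded, embedding_tensor)
```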
Optimization 2: Disaggregate tokenization and inference
This optimization addresses the tokenization bottleneck in vLLM.
When vLLM receives prompts as strings, it tokenizes the input strings on the CPU before launching inference on the GPU. Because these steps are sequential, the GPU remains idle during tokenization, creating “bubbles” in the schedule and limiting throughput.
To resolve this, we disaggregated tokenization and inference into a two-stage pipeline. Instead of sending raw text, we pretokenize inputs and pass token IDs directly to vLLM. This enables pipeline parallelism; tokenization and inference can run in parallel across different requests, even though they remain sequential within each one.
For example, in Figure 2, for a batch of three requests, denoted as r1, r2 and r3:
Tokenization for r2 can happen while inference for r1 is running
Tokenization for r3 can overlap with inference for r2

With this overlapping execution, we achieved higher embedding throughput by processing multiple requests in parallel while maintaining sequential order within each individual request.
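A minimal sketch of this two-stage pipeline on top of vLLM's offline API might look like the following; it assumes the LLM.embed entry point and TokensPrompt inputs, the model name and worker count are illustrative, and our production system implements the same idea behind a gRPC frontend:

```python
from concurrent.futures import ThreadPoolExecutor

from transformers import AutoTokenizer
from vllm import LLM
from vllm.inputs import TokensPrompt

MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
llm = LLM(model=MODEL, task="embed")


def tokenize_batch(texts):
    # Stage 1 (CPU): pretokenize so vLLM receives token IDs instead of raw strings.
    return [TokensPrompt(prompt_token_ids=ids) for ids in tokenizer(texts)["input_ids"]]


def run_pipeline(batches):
    # Stage 2 (GPU): while one batch is being embedded, later batches keep
    # tokenizing on worker threads (fast tokenizers release the GIL).
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(tokenize_batch, batch) for batch in batches]
        for future in futures:
            yield llm.embed(future.result())
```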
Optimization 3: Multiple identical models on one GPU
This optimization addresses underutilization that persists even after improving tokenization and serialization. Embedding models typically have fewer parameters and shorter runtimes than autoregressive LLMs, so a single instance often leaves GPU resources underused, for example while the GPU waits on memory transfers or kernel launches.
These idle resources mean we can run multiple replicas of the same model on a single GPU. The replicas serve inference requests concurrently, increasing inference throughput for the same number of GPUs.
Figure 3 illustrates the effects of multiple model replicas on one GPU. A GPU consists of many streaming multiprocessors (SMs), which include many cores. When a model runs, its operations, such as matrix multiply and softmax, get compiled into kernels. These kernels are dispatched to the GPU, where the scheduler assigns them to available SMs. Running multiple replicas of an embedding model on a single GPU increases throughput by better utilizing available GPU resources that would otherwise be idle due to CPU and data-transfer bottlenecks.
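The sketch below shows the general idea using vLLM's offline API rather than our production serving stack: each replica is a separate process on the same GPU, with gpu_memory_utilization capped so the copies fit side by side. The model name, replica count and memory fraction are illustrative.

```python
import multiprocessing as mp

from vllm import LLM

MODEL = "Snowflake/snowflake-arctic-embed-m-v1.5"


def serve_replica(replica_id, requests, results):
    # Each process loads its own copy of the model on the same GPU; capping
    # gpu_memory_utilization leaves room for the other replica's weights and activations.
    llm = LLM(model=MODEL, task="embed", gpu_memory_utilization=0.4)
    for batch in iter(requests.get, None):  # None is the shutdown signal
        results.put((replica_id, llm.embed(batch)))


if __name__ == "__main__":
    mp.set_start_method("spawn")  # avoid forking a process that may have touched CUDA
    requests, results = mp.Queue(), mp.Queue()
    replicas = [mp.Process(target=serve_replica, args=(i, requests, results)) for i in range(2)]
    for p in replicas:
        p.start()
```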

3x throughput in Snowflake Cortex AI
This section shows the incremental impact of the three optimizations — faster serialization, disaggregated tokenization and multi-replica execution — directly in Snowflake Cortex. The benchmark assumes the following setup:
| GPU type | A single A10G |
| --- | --- |
| Embedding model | snowflake-arctic-embed-m-v1.5 |
| Precision | FP16 |
| Input token size | 512 |
| Batch size | 96 |
| Baseline | Upstream vLLM v0.8.3, with a gRPC frontend in a Python runtime, deployed behind layers of Snowflake microservices |
With all optimizations active, we saw 3x throughput improvement in Snowflake Cortex AI. As shown in Figure 4, each optimization step contributed to higher throughput. The final configuration delivered a sustained throughput of 230,000 tokens per second.

At the time of publishing, Snowflake uses A10G GPUs to serve embeddings in production. There is an industry-wide supply shortage of more powerful GPUs, such as the H200, and we reserve them for models with hundreds of billions of parameters. We benchmarked the snowflake-arctic-embed-m-v1.5 model because it is the most popular embedding model on the Snowflake Cortex platform. The usage pattern of Snowflake customers skews toward long input sequences, so we used 512 tokens for benchmarking, which is the maximum input size that snowflake-arctic-embed-m-v1.5 accepts.
The throughput gains are rolling out across Cortex, powering faster indexing and lower-latency responses across products like Cortex Agents, Cortex Search and Cortex Analyst. The product experiences will feel more real-time than ever. Without any changes, Snowflake users will see their queries, such as SNOWFLAKE.CORTEX.EMBED_TEXT_768(), complete much faster.
Pushing the boundaries in open source: Up to 16x faster and more cost efficient
We extended our evaluation to benchmark open source performance on more powerful hardware, specifically the H200 GPU. By optimizing serialization, tokenization and GPU utilization, we increased throughput by 16x on short sequences and 4.2x on long sequences, while reducing cost per token by 16x on short sequences.
As shown in Figure 5, our system outperformed the vLLM baseline using an HTTP/JSON interface. Compared to Text Embeddings Inference (TEI), we delivered up to 2.4x higher throughput on short sequences and maintained parity on long sequences.1

The throughput improvements shown in Figure 5 translate into 16x cost savings, as shown in Figure 6.

Here, we show the cost per trillion tokens when using Arctic Inference versus vLLM on two GPUs: the A10G, a widely used cost-efficient option, and the H200, a newer high-end GPU with significantly higher hourly pricing.
By pushing embedding throughput on H200 GPUs, we can serve inference on the more expensive H200 at a lower cost than is possible with the cost-efficient A10G, challenging the assumption that a lower hourly price always means better price-performance.
To help others achieve similar gains, we are open sourcing all vLLM-specific improvements as part of Arctic Inference — a high-throughput plug-in for embedding inference in enterprise settings.
Acknowledgements
The AI Research team and the Cortex Platform team worked together to achieve the 16x improvement in embedding model throughput. This accomplishment would not have been possible without contributions from Flex Wang, Hyungtae Kim, Asim Shankar, Pawel Lis and Vincent Chan.
With leadership support from Seth Li, Yuxiong He, Mona Attariyan and Dwarak Rajagopal, the Research and Engineering organizations have partnered in many inference performance breakthroughs, and we can’t wait to share more of them soon.
1 The vLLM baselines shown in Figure 4 and Figure 5 have subtle differences. Figure 4 uses vLLM with Snowflake's in-house gRPC implementation, deployed behind layers of microservices, while Figure 5 uses vLLM's HTTP/JSON interface, which is a stronger baseline than the one in Figure 4.