Evaluating Multimodal vs. Text-Based Retrieval for RAG with Snowflake Cortex

In this blog post, we show how we improved Snowflake Cortex AI multimodal retrieval by treating each PDF page as a stand-alone image — allowing natural language queries to match both text and visuals. We walk through our results, highlighting when this works best (and when it doesn’t).
➡️ Try it yourself with our open source example
Why traditional RAG struggles with enterprise PDFs
Enterprise PDFs push traditional text-based retrieval systems to their limits. These documents combine long-form text, financial tables, technical diagrams and slide visuals — all packed into complex layouts that don’t play nicely with traditional retrieval-augmented generation (RAG) systems.
Traditional RAG pipelines break down in the face of rich, structured layouts for three reasons:
OCR is error-prone: Optical character recognition (OCR) is the process of converting text from images — such as scanned PDFs or photos — into machine-readable text. While Snowflake’s PARSE_DOCUMENT and other OCR tools are powerful, they can stumble on older scans, unconventional layouts or blurry text. Errors here ripple downstream, weakening both retrieval and generation quality.
Visual data is lost: Charts and diagrams often contain the most important insights — but since these are not extractable as text, they are often invisible to the model.
Workflow is complex: Multistep pipelines (OCR → chunk → embed → search) can be hard to operationalize. They require tuning, orchestration and infrastructure — a tall order for many enterprise teams.
These challenges inspired us to rethink retrieval from the ground up. What if, instead of extracting text, we treated PDFs as multimodal documents from the start — and matched queries directly against their visual and textual content?
Our multimodal approach: Searching PDFs as images
Instead of extracting text from PDFs using OCR, which often misses layout and visual context, we treat each page as a stand-alone image.
This preserves both the structure and content — including charts and tables — in a single snapshot. We then embed each image into the same vector space as natural language queries, enabling unified search across both text and visuals.
This design offers key advantages:
No need for OCR: No manual parsing – the full layout is preserved by default.
Visual awareness: Queries can match based on tables, diagrams or slide content — even when no clean text exists.
Efficiency: Each page uses a single embedding, reducing compute and latency costs.
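As a rough illustration of this design, the sketch below renders each PDF page as an image and scores pages against a query with a single vector per page. It assumes the pdf2image library (with Poppler installed); embed_page_image and embed_query are hypothetical stand-ins for whichever multimodal embedding model you choose, not a specific Cortex API.

```python
# Minimal sketch: one embedding per PDF page, matched against a query vector.
# Assumes pdf2image + Poppler; embed_page_image and embed_query are hypothetical
# stand-ins for your multimodal embedding model (not a specific Cortex API).
import numpy as np
from pdf2image import convert_from_path


def embed_page_image(image) -> np.ndarray:
    """Hypothetical: return a single vector for one page image."""
    raise NotImplementedError("plug in a multimodal embedding model here")


def embed_query(text: str) -> np.ndarray:
    """Hypothetical: embed a natural language query into the same space."""
    raise NotImplementedError("plug in the matching query encoder here")


# Render every page of the PDF as a stand-alone image (layout preserved).
pages = convert_from_path("manual.pdf", dpi=150)
page_vectors = np.stack([embed_page_image(p) for p in pages])


def top_pages(query: str, k: int = 5) -> np.ndarray:
    """Return the indices of the k pages most similar to the query."""
    q = embed_query(query)
    sims = (page_vectors @ q) / (
        np.linalg.norm(page_vectors, axis=1) * np.linalg.norm(q)
    )
    return np.argsort(-sims)[:k]
```

With a single vector per page, the index stays compact and query-time scoring is a simple similarity search, which is what makes this approach efficient at enterprise scale.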
While many multimodal systems use patch-based models (for instance, ColBERT-style late interaction), we focus on single-vector models for better efficiency at scale.
Models we evaluated
To compare retrieval effectiveness across document types, we tested several single-vector multimodal models:
Voyage Multimodal 3 (Snowflake Cortex functions, closed source)
GME-Qwen2-VL (2B and 7B, open source)
Nomic-Embed-Multimodal (3B and 7B, open source)
For baseline comparisons, we also evaluated text-only retrieval using OCR via PARSE_DOCUMENT. All text embeddings used Voyage Multilingual 2, a strong multilingual model, to ensure a fair comparison with the billion-scale multimodal models.
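For reference, here is a minimal sketch of that text-based baseline in Snowpark Python. The stage name (@docs), file path and connection parameters are placeholders, and the option names follow the public PARSE_DOCUMENT documentation; verify them against the current docs before relying on them.

```python
# Sketch of the OCR baseline: parse a PDF with PARSE_DOCUMENT before chunking
# and embedding the extracted text. The stage (@docs), file path and connection
# parameters are placeholders.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

parsed = session.sql("""
    SELECT TO_VARCHAR(
        SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
            @docs,
            'manuals/report.pdf',
            {'mode': 'OCR'}
        ):content
    ) AS page_text
""").collect()

print(parsed[0]["PAGE_TEXT"][:500])  # first 500 characters of extracted text
```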
This unified evaluation setup allowed us to directly compare retrieval power across both structured and unstructured enterprise content — from SEC filings to slide decks.
Existing benchmarks aren't realistic — so we built one
Most popular benchmarks for multimodal document retrieval don’t reflect how search works in real enterprise settings. They often focus on question answering, where the model is already given the correct document or page and asked to extract a specific answer.
For example: “What is 3M’s 2018 capital expenditure?” — along with a preselected cash flow statement. This tests whether the model can understand a page, not whether it can find that page in the first place.
Even retrieval-focused data sets such as ViDoRe (versions 1 and 2) operate over small collections — usually thousands of pages at most. But real-world enterprise search systems must be able to operate over millions of pages, spanning a wide range of formats and layouts.
To better evaluate how retrieval systems perform under these conditions, we built a custom benchmark using three types of enterprise documents, each chosen to address a specific challenge (see table 1):
Tech manuals (such as this one): Dense guides filled with diagrams, spec tables and nonlinear layouts that are difficult for traditional text-based methods. To evaluate different retrieval strengths, we split queries into two groups: one focused on charts, the other on text.
SEC financial filings: A large collection of quarterly reports, annual statements and regulatory documents. Building on our previous work, these are ideal for evaluating structured text retrieval, across long, table-heavy documents.
Presentation slides (SlideVQA): Visually rich decks where layout and graphics carry key information, ideal for testing multimodal retrieval.
This setup let us evaluate retrieval performance across a wide range of structured and visual document formats, reflecting realistic, large-scale conditions.
| Data set | # Queries | # Relevant pages per query | # Pages in the collection |
| --- | --- | --- | --- |
| Tech Manuals (Chart) | 143 | 1.0 | 27,000 |
| Tech Manuals (Text) | 69 | 1.3 | 27,000 |
| SEC Financial Filings | 495 | 3.9 | 2.3 million |
| SlideVQA | 2,215 | 1.3 | 52,000 |
Table 1. Statistics of data sets used in this study.
What we learned: The best retrieval method depends on the document
With this evaluation framework in place, we tested how both text-based and multimodal retrieval methods performed across different document types and retrieval challenges.
To measure performance, we used:
Mean reciprocal rank (MRR): The average, over all queries, of the reciprocal rank of the first correct result returned — higher is better (see the sketch after this list).
Embedding throughput: The number of document embeddings generated per second, which assesses system efficiency, including the time cost of OCR for text-based methods.
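For concreteness, here is a minimal sketch of how mean reciprocal rank is computed; the page IDs in the usage comment are made up.

```python
# Mean reciprocal rank (MRR): for each query, take the reciprocal of the rank
# of the first relevant page in the result list, then average over queries.
def mean_reciprocal_rank(ranked_results, relevant):
    # ranked_results: list of ranked page-id lists, one per query
    # relevant: list of sets of relevant page ids, one per query
    total = 0.0
    for ranking, rel in zip(ranked_results, relevant):
        rr = 0.0
        for rank, page_id in enumerate(ranking, start=1):
            if page_id in rel:
                rr = 1.0 / rank
                break
        total += rr
    return total / len(ranked_results)

# e.g. mean_reciprocal_rank([["p3", "p7"], ["p1"]], [{"p7"}, {"p1"}]) == 0.75
```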
Multimodal models outperform on visual-heavy documents
The result? On technical manuals (chart-based queries) and presentation slides, multimodal models consistently ranked the correct page higher and achieved faster embedding throughput. These models were able to capture layout and visual structure that text-only pipelines missed.

Text-based retrieval still leads for structured documents
However, for financial reports, such as SEC filings — which feature clean text and highly structured tables — traditional pipelines using OCR and chunked text embeddings still delivered the highest retrieval accuracy. In these cases, the structure of the text itself was more informative than visual layout.

This confirmed a key insight: Retrieval performance depends heavily on document type. Multimodal systems excel with layout-heavy content, while text-based approaches remain strong for structured, well-formatted documents.
No single modality wins everywhere
In some cases, results varied significantly by model, with text-based retrieval outperforming multimodal retrieval (or vice versa) depending on the setup.
To assess how this impacts the quality of generated answers, we ran a full RAG setup using Claude 3.5 Sonnet, a multimodal LLM capable of processing both text and images. For each query, we passed in content retrieved by text-only, multimodal or hybrid methods, then scored the model’s output for factual accuracy.
We used an LLM-as-a-judge system — an automated approach where another language model evaluates responses against human-verified correct answers. Final judgments were reviewed by humans for quality control.
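The snippet below is a hedged sketch of that judging step using Cortex COMPLETE from Snowpark Python. The prompt wording and the "claude-3-5-sonnet" model identifier are illustrative assumptions, not our exact evaluation harness.

```python
# Hypothetical LLM-as-a-judge step: ask a model to grade a generated answer
# against a human-verified reference answer. Prompt wording and model name
# are illustrative, not the exact setup used in our evaluation.
JUDGE_PROMPT = """You are grading an answer produced by a RAG system.
Question: {question}
Reference answer: {reference}
Candidate answer: {answer}
Reply with exactly one word: CORRECT or INCORRECT."""


def judge_answer(session, question, reference, answer, model="claude-3-5-sonnet"):
    prompt = JUDGE_PROMPT.format(question=question, reference=reference, answer=answer)
    row = session.sql(
        "SELECT SNOWFLAKE.CORTEX.COMPLETE(?, ?) AS verdict",
        params=[model, prompt],
    ).collect()[0]
    return row["VERDICT"].strip().upper().startswith("CORRECT")
```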

These results demonstrate that optimal retrieval approaches vary significantly by document type, with no universal solution across different enterprise content formats.
Combining the best of all worlds: Hybrid retrieval with Cortex Search
While multimodal models are strong at capturing layout and visual context, they — like text-based embeddings — often prioritize topical relevance. This can cause them to miss finer-grained matches, such as specific keywords, identifiers or phrases.
To address this, we built a hybrid retrieval strategy using Cortex Search, combining the strengths of multiple methods:
Multimodal embeddings: To capture layout and visual structure
Keyword search: For fast, high-precision filtering
Text-based reranking: To refine top results based on semantic relevance
This approach significantly improved Recall@5, especially for ambiguous or mixed-format queries. While it introduces some additional system complexity and compute cost, the improvements in answer quality made the tradeoff worthwhile for most enterprise use cases.
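As a rough example of what querying such a service looks like from Python using the snowflake.core API, the sketch below assumes a Cortex Search service named MULTIMODAL_DOC_SEARCH in DOCS_DB.PUBLIC with page_id and page_text columns; all of those names are placeholders, and the service itself would be built over your parsed and embedded pages.

```python
# Hedged sketch: query an existing Cortex Search service from Python.
# The database, schema, service and column names below are placeholders.
from snowflake.core import Root

root = Root(session)  # reuse an authenticated Snowpark session

service = (
    root.databases["DOCS_DB"]
    .schemas["PUBLIC"]
    .cortex_search_services["MULTIMODAL_DOC_SEARCH"]
)

response = service.search(
    query="quarterly capital expenditure trend",
    columns=["page_id", "page_text"],
    limit=5,
)

for hit in response.results:
    print(hit["page_id"])
```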

Together, these retrieval strategies form a flexible foundation for building high-quality RAG systems that work across the full spectrum of enterprise documents — from structured filings to layout-heavy slide decks.
Final takeaways
Should you use text embeddings or multimodal embeddings? It depends!
For text-heavy PDFs in clean, OCR-friendly formats (such as financial reports), text embeddings typically perform best.
For documents with complex layouts or heavy visual content (such as slides, manuals or charts), multimodal embeddings offer a clear advantage.
Snowflake Cortex Search supports all three approaches — text, multimodal and hybrid — out of the box, allowing teams to easily experiment, combine methods and scale high-quality retrieval across enterprise data.
Choosing the right retrieval method (or combination of them) can be the difference between a generic response and a precise, enterprise-grade answer.
Interested in trying it out?
Contact your account representative to join as an early user and get access to this feature. Then check out the open source Jupyter notebook Multimodal RAG with Cortex Search to help you get started.
It walks through:
PDF processing for both multimodal retrieval and OCR
Indexing and searching with Cortex Search
Running RAG-style prompting using retrieved results
Hit a snag or find something cool? Jump into the Snowflake Community Forum — we’re there to help and would love to see what you’re building.
Explore more from the Arctic Agentic RAG series
This post is Episode 2 in our Arctic Agentic RAG series, in which we explore innovations in retrieval-augmented generation for enterprise AI.
More episodes coming soon — stay tuned.
Contributors
Snowflake AI research: Puxuan Yu, Danmei Xu, Zhewei Yao, Bohan Zhai, Krista Muir and Yuxiong He