AIPython

Hybrid Retrieval and Semantic Search in RAG: Building Smarter Document Search Systems

2/22/2026
9 min read

Introduction: Why Your RAG System Is Failing at Retrieval

You've built a Retrieval-Augmented Generation (RAG) system. You picked a good LLM, fine-tuned the prompts, optimized the context window. But something's still off—your system gives confident-sounding answers that are subtly wrong, or it misses relevant information entirely.

Here's the uncomfortable truth: your retriever is the bottleneck, not your generator.

Most developers focus optimization efforts on the language model—tweaking prompts, experimenting with different model sizes, adjusting temperature and top-k parameters. But here's what research consistently shows: a mediocre retriever paired with a powerful LLM will always lose to an excellent retriever paired with a smaller LLM. If the right documents never reach the generator's context window, no amount of model sophistication can save you.

The core problem with traditional RAG is oversimplification: most systems rely on a single retrieval method—either keyword search (BM25) or dense vector similarity (embeddings). Each approach has fundamental blind spots:

  • Keyword search nails exact term matches but fails when users ask the same question using different vocabulary
  • Vector search captures semantic relationships beautifully but can hallucinate relevance for documents that merely sound related
  • Neither method alone handles the full complexity of real-world information retrieval

This is where hybrid retrieval enters the picture. By intelligently combining multiple retrieval strategies, you can overcome the limitations of any single approach and dramatically improve both retrieval accuracy and downstream answer quality.

This article walks you through the complete landscape of hybrid retrieval and semantic search techniques that will make your RAG system actually work.

Part 1: The Three Core Retrieval Indices

Before we blend approaches, let's understand what we're blending. There are three fundamental retrieval paradigms, each with distinct strengths and weaknesses.

BM25: The Keyword Foundation

BM25 (Best Matching 25) is the industry standard for lexical search. It's been around since 1994, and it works because it elegantly captures how documents relate to queries at the term level.

How BM25 works:

python
import math

# Simplified BM25 scoring
def bm25_score(query_terms, document, avg_doc_length,
               total_docs, docs_containing, k1=1.5, b=0.75):
    """
    Calculate BM25 score for a document given query terms.
    
    Args:
        query_terms: List of query words
        document: Dict with 'term_freq' (term -> count) and 'length'
        avg_doc_length: Average document length in the corpus
        total_docs: Total number of documents in the corpus
        docs_containing: Function mapping a term to its document frequency
        k1, b: Tuning parameters (standard values shown)
    """
    score = 0
    doc_length = document['length']
    
    for term in query_terms:
        # Inverse document frequency: penalizes common terms
        idf = math.log((total_docs - docs_containing(term) + 0.5) / 
                       (docs_containing(term) + 0.5) + 1)
        
        # Term frequency with length normalization
        term_freq = document['term_freq'].get(term, 0)
        numerator = term_freq * (k1 + 1)
        denominator = (term_freq + k1 * 
                      (1 - b + b * (doc_length / avg_doc_length)))
        
        score += idf * (numerator / denominator)
    
    return score

Why BM25 excels:

  • Exact keyword matching with sophisticated term weighting
  • Length normalization prevents bias toward longer documents
  • Works with zero training—just index and search
  • Computationally efficient (milliseconds for large corpora)
  • Transparent: you can understand why a document ranked highly

Why BM25 struggles:

  • Completely misses semantic relationships (synonyms get zero credit)
  • Query expansion required for coverage ("car" won't find "vehicle")
  • One typo kills matching
  • Rare technical terms get over-weighted despite low semantic relevance
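To make the synonym blind spot concrete, here's a minimal self-contained sketch (the toy corpus and scoring helper are illustrative, not part of the article's pipeline):

```python
import math

# Toy corpus: one term-frequency map per document
corpus = [
    {"the": 1, "car": 2, "broke": 1, "down": 1},       # talks about a car
    {"the": 1, "vehicle": 2, "stalled": 1},            # same topic, different word
]

def bm25(query_terms, doc, corpus, k1=1.5, b=0.75):
    """Score one document against the query with standard BM25."""
    n = len(corpus)
    avg_len = sum(sum(d.values()) for d in corpus) / n
    doc_len = sum(doc.values())
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)           # document frequency
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)    # penalize common terms
        tf = doc.get(term, 0)                              # 0 for absent terms
        score += idf * (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_len))
    return score

scores = [bm25(["car"], d, corpus) for d in corpus]
print(scores)  # the "vehicle" document scores exactly 0.0 for the query "car"
```

Because the term frequency of "car" in the second document is zero, every one of its query terms contributes nothing: a semantically perfect match is invisible to BM25.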

Dense Vector Search (KNN): The Semantic Approach

Dense vectors represent documents and queries as points in high-dimensional embedding space. Documents with similar meanings cluster together mathematically.

python
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize a pretrained sentence transformer
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode documents and query into dense vectors
documents = [
    "The cat sat on the mat",
    "A feline rested on the carpet",
    "The stock market crashed today"
]

query = "Where did the cat rest?"

# Get embeddings (384-dimensional vectors)
doc_embeddings = model.encode(documents)
query_embedding = model.encode(query)

# Calculate cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity([query_embedding], doc_embeddings)[0]

# Rank by similarity
ranked = sorted(zip(documents, similarities), 
                key=lambda x: x[1], reverse=True)
for doc, score in ranked:
    print(f"{score:.3f}: {doc}")

Output:

0.847: A feline rested on the carpet
0.823: The cat sat on the mat
0.124: The stock market crashed today

Why dense search excels:

  • Captures semantic meaning independent of exact vocabulary
  • Handles synonyms, paraphrases, and conceptual variations naturally
  • Works across languages (with multilingual models)
  • Generalizes to out-of-vocabulary terms
  • State-of-the-art performance on many benchmarks

Why dense search struggles:

  • Computationally expensive at scale: exact search compares the query against every vector, so large corpora need approximate nearest-neighbor (ANN) indices
  • Prone to semantic drift: similar-sounding documents about unrelated topics
  • Opaque—you can't easily explain which terms contributed to a match
  • Requires large vector storage (8KB+ per document with modern models)
  • Embedding quality depends entirely on training data; domain-specific embeddings often outperform general ones

Sparse Encoder Search (ELSER): The Semantic-Keyword Bridge

Sparse encoders like Elasticsearch's ELSER represent a fascinating middle ground. They're trained to expand query terms contextually while maintaining interpretability.

Think of ELSER as teaching the system that when someone searches for "car," documents mentioning "vehicle," "automobile," and "transportation" are semantically relevant—but instead of representing this as an opaque 384-dimensional vector, it expands the query itself.

python
# Conceptual example of sparse encoding
# (ELSER works internally, but here's the idea)

def sparse_encode_query(query_text):
    """
    Expand a query based on learned semantic relationships.
    """
    original_terms = query_text.lower().split()
    
    # ELSER learns these expansion patterns from training data
    expansion_map = {
        'car': ['vehicle', 'automobile', 'motor'],
        'effective': ['efficient', 'productive', 'successful'],
        'recent': ['recent', 'latest', 'current']
    }
    
    expanded = set(original_terms)
    for term in original_terms:
        if term in expansion_map:
            expanded.update(expansion_map[term])
    
    return expanded

# Query: "effective cars"
# Becomes: {"effective", "cars", "efficient", "productive", 
#           "successful", "vehicle", "automobile", "motor"}

Why sparse encoders excel:

  • Combines keyword matching precision with semantic understanding
  • Interpretable: you can see which expanded terms matched
  • Computationally efficient (sparse operations, not dense vectors)
  • Works well with existing keyword-based infrastructure (Elasticsearch, Solr)
  • Specialized dense-to-sparse training improves semantic matching

Why sparse encoders struggle:

  • Less mature technology than BM25 or dense vectors
  • Requires specialized infrastructure (Elasticsearch 8.0+)
  • Performance depends on expansion quality from training
  • Not yet as universally adopted as other methods
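One detail the set-expansion sketch above glosses over: a sparse encoding is not just a set of terms but a token-to-weight map, and relevance is a dot product over the tokens a query and document share. A sketch with hypothetical weights (real ELSER weights come from the trained model, not from this table):

```python
# Hypothetical sparse encodings: token -> learned weight.
# Absent tokens implicitly have weight zero, which keeps storage sparse.
query_vec = {"car": 1.8, "vehicle": 0.9, "automobile": 0.7}
doc_vec = {"vehicle": 1.4, "engine": 1.1, "repair": 0.8}

def sparse_dot(q, d):
    """Relevance = dot product over the tokens both vectors contain."""
    return sum(weight * d[token] for token, weight in q.items() if token in d)

score = sparse_dot(query_vec, doc_vec)
print(round(score, 2))  # only "vehicle" overlaps: 0.9 * 1.4 = 1.26
```

This is why sparse encoders stay interpretable: the score decomposes into per-token contributions you can inspect, unlike a dense cosine similarity.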

Part 2: Blended RAG Architecture in Action

Now that we understand each retrieval method's strengths, let's see why combining them dramatically improves results.

The Research Evidence: Concrete Performance Gains

Sawarkar et al. tested this empirically using their Blended RAG framework. Here's what they found:

Blended Retriever Architecture
Figure: The multi-stage blended retrieval pipeline combines BM25, dense (KNN), and sparse encoder indices through various multi-match query patterns before ranking fusion. Source: "Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers"

Notice the architecture shows multiple parallel retrieval pathways—this is the key insight. Rather than choosing one retrieval method, we run them simultaneously and intelligently combine their results.
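A standard way to do that combination is Reciprocal Rank Fusion (RRF): each document's fused score is the sum of 1/(k + rank) over every result list it appears in, so documents ranked well by several retrievers rise to the top. A minimal sketch (the two ranked lists are illustrative):

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse several ranked lists of doc ids into one ranking.

    Each appearance contributes 1 / (k + rank); k=60 is the common default.
    """
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # illustrative BM25 ranking
dense_hits = ["doc1", "doc5", "doc3"]   # illustrative dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)  # doc1 first: it ranks high in both lists
```

Note that RRF needs only ranks, never raw scores, which sidesteps the thorny problem of normalizing BM25 scores against cosine similarities.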

Let's look at the concrete numbers:

| Model/Pipeline | EM | F1 | Top-5 | Top-20 |
| --- | --- | --- | --- | --- |
| RAG-original | 28.12 | 39.42 | 59.64 | 72.38 |
| RAG-end2end | 40.02 | 52.63 | 75.79 | 85.57 |
| BlendedRAG | 57.63 | 68.40 | 94.89 | 98.58 |

Source: "Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers"

That's a 45% improvement in exact match (EM) over the previous state-of-the-art. Not a marginal optimization—a fundamental transformation.

Multi-Match Query Types: The Technical Foundation

The secret sauce is using different query formulations against the same indices. In Elasticsearch terminology, these are called "multi_match" query types. Each treats term relationships differently:

1. Best Fields (Optimal for ELSER)

python
# Best fields: aggregate scores within individual fields
# Useful when a term appearing fully in one field is better than spread across multiple

elastic_query = {
    "multi_match": {
        "query": "machine learning fundamentals",
        "type": "best_fields",
        "fields": ["title^2", "content", "summary"],  # title weighted 2x
        "operator": "or"  # documents matching ANY term ranked
    }
}

# A document with "machine learning fundamentals" in the title gets highest score
# Score: best match from any single field wins

2. Most Fields (Balanced approach)

python
# Most fields: appearance across different fields boosts relevance
# If "learning" appears in title AND content, it's weighted higher

elastic_query = {
    "multi_match": {
        "query": "machine learning",
        "type": "most_fields",
        "fields": ["title^2", "content", "metadata"],
        "operator": "and"  # documents matching ALL terms ranked higher
    }
}

# A document with "machine" in title + "learning" in content scores high

3. Cross Fields

python
# Cross fields: treats all fields as one when calculating term frequency

elastic_query = {
    "multi_match": {
        "query": "author Chollet",
        "type": "cross_fields",
        "fields": ["author^2", "content"]
    }
}

# Useful for author-book queries where term relevance spans fields

4. Phrase Prefix

python
# Phrase prefix: emphasizes phrase matches and prefix completion

elastic_query = {
    "multi_match": {
        "query": "deep learning optim",
        "type": "phrase_prefix",
        "fields": ["title", "content"]
    }
}

# Matches: "deep learning optimization" (phrase with prefix completion)
# Not: "optimization of deep learning" (wrong order)

The breakthrough insight: different indices perform better with different query types. BM25 performs optimally with best_fields, while dense search prefers most_fields, and sparse encoders excel with best_fields formulations.
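That pairing can be captured in a small dispatch table, so the retriever picks the right formulation per index automatically. A sketch assuming the field names and weights from the examples above (the mapping itself follows the pairing just described):

```python
# Map each retrieval index to the multi_match type it pairs best with:
# BM25 -> best_fields, dense -> most_fields, sparse/ELSER -> best_fields.
QUERY_TYPE_FOR_INDEX = {
    "bm25": "best_fields",
    "dense": "most_fields",
    "elser": "best_fields",
}

def build_multi_match(query, index_kind, fields=("title^2", "content")):
    """Build an Elasticsearch multi_match clause for the given index kind."""
    return {
        "multi_match": {
            "query": query,
            "type": QUERY_TYPE_FOR_INDEX[index_kind],
            "fields": list(fields),
        }
    }

q = build_multi_match("machine learning fundamentals", "bm25")
print(q["multi_match"]["type"])  # best_fields
```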

Practical Implementation: Building Your Hybrid Retriever

Let's build a working hybrid retrieval system using Python and Elasticsearch:

python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import numpy as np
from sentence_transformers import SentenceTransformer

class HybridRetriever:
    def __init__(self, es_host="http://localhost:9200"):
        self.es = Elasticsearch(es_host)  # ES 8.x client requires a full URL
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.index_name = "hybrid_documents"
        
    def create_index(self):
        """Create Elasticsearch index with all three retrieval types."""
        index_config = {
            "settings": {
                "number_of_shards": 1,
                "number_of_replicas": 0,
                "analysis": {
                    "analyzer": {
                        "default": {
                            "type": "standard",
                            "stopwords": "_english_"
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "id": {"type": "keyword"},
                    "title": {
                        "type": "text",
                        "analyzer": "default"
                    },
                    "content": {
                        "type": "text",
                        "analyzer": "default"
                    },
                    # Dense embedding for KNN search
                    "embedding": {
                        "type": "dense_vector",
                        "dims": 384,
                        "index": True,
                        "similarity": "cosine"
                    },
                    # Sparse vector for ELSER-style search
                    "elser_embedding": {
                        "type": "sparse_vector"
                    }
                }
            }
        }
        
        if self.es.indices.exists(index=self.index_name):
            self.es.indices.delete(index=self.index_name)
        
        self.es.indices.create(index=self.index_name, body=index_config)
    
    def index_documents(self, documents):
        """Index documents with all three retrieval modalities."""
        actions = []
        
        for doc in documents:
            # Generate dense embedding
            embedding = self.encoder.encode(doc['content']).tolist()
            
            action = {
                "_index": self.index_name,
                "_id": doc['id'],
                "_source": {
                    "id": doc['id'],
                    "title": doc['title'],
                    "content": doc['content'],
                    "embedding": embedding,
                    # In production, ELSER embedding would be generated server-side
                    "elser_embedding": {}
                }
            }
            actions.append(action)
        
        bulk(self.es, actions)
        self.es.indices.refresh(index=self.index_name)
    
    def hybrid_search(self, query, k=10):
        """
        Execute parallel searches using all three methods
        and combine results using Reciprocal Rank Fusion.
        """
        
        # 1. BM25 Search (keyword-based)
        bm25_query = {
            "multi_match": {
                "query": query,
                "type": "best_fields",
                "fields": ["title^2", "content"],
                "operator": "or"
            }
        }
        bm25_results = self.es.search(
            index=self.index_name,
            body={"query": bm25_query, "size": k}
        )
        
        # 2. Dense Vector Search (semantic)
        # In ES 8.x, kNN is a top-level search option, not a query clause
        query_embedding = self.encoder.encode(query).tolist()
        dense_results = self.es.search(
            index=self.index_name,
            knn={
                "field": "embedding",
                "query_vector": query_embedding,
                "k": k,
                "num_candidates": k * 3
            },
            size=k
        )
        
        # 3. Fuse the rankings with Reciprocal Rank Fusion (RRF):
        # each hit adds 1 / (60 + rank) to its document's fused score.
        # (An ELSER query would be fused the same way once sparse
        # embeddings are populated server-side.)
        fused = {}
        for results in (bm25_results, dense_results):
            for rank, hit in enumerate(results["hits"]["hits"], start=1):
                fused[hit["_id"]] = fused.get(hit["_id"], 0.0) + 1.0 / (60 + rank)
        
        return sorted(fused, key=fused.get, reverse=True)[:k]


Chalamaiah Chinnam

AI Engineer & Senior Software Engineer

15+ years of enterprise software experience, specializing in applied AI systems, multi-agent architectures, and RAG pipelines. Currently building AI-powered automation at LinkedIn.