CDRAG

CDRAG (Clustered Dynamic Retrieval-Augmented Generation): LLM-selected cluster retrieval for RAG — outperforms standard cosine retrieval on legal QA

Standard RAG systems retrieve the top-K most similar documents from an entire corpus using cosine similarity. While simple and effective, this approach is corpus-blind: it has no awareness of the semantic structure of the document collection.
 
This can be especially problematic for complex questions that must be interpreted or approached from multiple abstraction levels, such as legal ones. Cosine similarity retrieves documents that are lexically and semantically close to the query, but may under-retrieve documents that are relevant at a higher level of abstraction.
 
For example, a question about “whether a defendant’s silence can be used as evidence” might retrieve passages about silence in interrogation while missing higher-level doctrine on the burden of proof or the presumption of innocence. That context is essential for a complete legal answer but uses none of the same vocabulary.
 
More advanced retrieval architectures exist, such as hierarchical indexing, graph-based retrieval, and hybrid dense-sparse methods, each replacing or augmenting flat cosine similarity with a more structured retrieval strategy.
 
I developed a RAG architecture named CDRAG that addresses this same problem by:
1. Pre-clustering the (embedded) corpus into semantically coherent groups
2. Extracting LLM-generated keywords per cluster to summarise their content
3. At query time, asking an LLM to reason about which clusters are relevant and how many documents to draw from each. This way, the LLM can explore multiple abstraction levels or peripherally related topics
4. Performing cosine similarity retrieval within the selected clusters only
 
This allows the retrieval budget (top-K) to be allocated dynamically and intelligently across the document space rather than spread blindly across the full corpus.
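The query-time half of the steps above (3 and 4) can be sketched in a few lines. This is a minimal illustration, not the actual CDRAG code: the function names, the toy 2-D embeddings, and the hard-coded `allocation` dictionary (standing in for the LLM's cluster choice) are all assumptions made for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_flat(query_vec, docs, top_k):
    """Baseline: top-K by cosine similarity over the whole corpus."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[2]), reverse=True)
    return [doc_id for doc_id, _, _ in ranked[:top_k]]

def retrieve_from_clusters(query_vec, docs, allocation):
    """Take the top-n documents by cosine similarity inside each selected cluster.

    docs: list of (doc_id, cluster_id, embedding)
    allocation: {cluster_id: n_docs}, assumed to come from the LLM selection step
    """
    selected = []
    for cluster_id, n_docs in allocation.items():
        members = [d for d in docs if d[1] == cluster_id]
        ranked = sorted(members, key=lambda d: cosine(query_vec, d[2]), reverse=True)
        selected.extend(doc_id for doc_id, _, _ in ranked[:n_docs])
    return selected

# Hypothetical toy corpus: cluster 0 ~ "silence", cluster 1 ~ "burden of proof".
docs = [
    ("silence_1", 0, [1.0, 0.0]),
    ("silence_2", 0, [0.9, 0.1]),
    ("silence_3", 0, [0.8, 0.2]),
    ("burden_1", 1, [0.0, 1.0]),
    ("burden_2", 1, [0.2, 0.8]),
]
query = [1.0, 0.05]

flat = retrieve_flat(query, docs, top_k=3)
# Stand-in for the LLM's decision: two documents from cluster 0, one from cluster 1.
allocation = {0: 2, 1: 1}
clustered = retrieve_from_clusters(query, docs, allocation)

print(flat)       # all three hits come from the "silence" cluster
print(clustered)  # same budget, but one slot goes to "burden of proof"
```

With the same budget of three documents, the flat baseline spends every slot on the cluster closest to the query, while the allocated version reserves one slot for the more abstract cluster.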

Results

CDRAG was evaluated against a standard RAG baseline using top-K cosine similarity retrieval on 100 legal questions from the Legal RAG Bench dataset. Answers were scored by an LLM judge on 6 metrics (1–5 scale). CDRAG outperforms standard top-K RAG on 5 out of 6 metrics:

Faithfulness: +0.51 (12% improvement), the largest gain across all metrics
Overall quality: +0.34 (8% improvement)
Conciseness: marginally lower than standard RAG (4.31 vs 4.35)

The gains in faithfulness and overall quality suggest that routing queries to semantically relevant clusters produces more accurate and grounded answers.

Clustering

To create the clusters, I first embedded the documents in the corpus using the sentence-transformers/all-MiniLM-L6-v2 model, a lightweight Transformer based on a distilled version of BERT and optimized for producing semantically meaningful sentence embeddings. This model maps text into a dense vector space where semantically similar documents are positioned close together, making it well-suited for downstream clustering tasks.

I then applied agglomerative hierarchical clustering (AHC) to these embeddings. During experimentation, I found that using a relatively low distance threshold yielded more coherent groupings. Higher thresholds tended to merge loosely related topics into the same cluster, which in turn reduced the effectiveness of downstream retrieval, as the retrieval budget could not be allocated with sufficient topical precision. The resulting hierarchical structure is visualized in the dendrogram below.
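To make the effect of the distance threshold concrete, here is a small dependency-free sketch of threshold-based agglomerative clustering. The linkage (average) and the metric (cosine distance) are assumptions, since the post does not state which were used, and the 2-D vectors and threshold values are invented toy data. In practice a library routine would likely be used instead, for example scikit-learn's `AgglomerativeClustering`, which accepts a `distance_threshold` parameter.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def average_linkage(c1, c2, vectors):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(vectors, distance_threshold):
    """Merge the closest pair of clusters until no pair is closer than the threshold."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], vectors)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= distance_threshold:
            break  # remaining clusters are too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)

# Hypothetical embeddings: three tight topic pairs, the middle two loosely related.
vectors = [
    [1.0, 0.0], [0.95, 0.1],    # topic A
    [0.6, 0.6], [0.55, 0.65],   # topic B
    [0.0, 1.0], [0.1, 0.95],    # topic C
]

low = agglomerative(vectors, distance_threshold=0.1)
high = agglomerative(vectors, distance_threshold=0.3)
print(low)   # three clusters, one per topic
print(high)  # topics B and C are merged into one cluster
```

On this toy data the lower threshold keeps the three topics apart, while the higher one folds two loosely related topics together, which is the behaviour described above.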

AHC ultimately identified 20 distinct clusters. For each cluster, a language model was used to extract representative keywords that capture the underlying themes. An example cluster is shown below:

Keywords cluster 9:

  • beyond reasonable doubt
  • onus on the prosecution
  • Jury Directions Act 2015
  • circumstantial evidence
  • reasonable hypothesis consistent with innocence
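Keyword lists like this one are what the LLM sees at query time when it chooses clusters. The sketch below shows one way such a selection prompt could be assembled and the model's reply turned into a per-cluster budget. The prompt wording, the JSON reply format, the function names, the keywords for cluster 4, and the example reply are all hypothetical; only the cluster 9 keywords come from the example above.

```python
import json

def build_selection_prompt(question, cluster_keywords, top_k):
    """Assemble a plain-text prompt listing each cluster's keywords."""
    lines = [
        "You are selecting document clusters for retrieval.",
        f"Question: {question}",
        f"Total document budget: {top_k}",
        "Clusters:",
    ]
    for cluster_id, keywords in sorted(cluster_keywords.items()):
        lines.append(f"- cluster {cluster_id}: {', '.join(keywords)}")
    lines.append('Reply with JSON mapping cluster id to document count, e.g. {"9": 3}.')
    return "\n".join(lines)

def parse_allocation(reply, valid_clusters, top_k):
    """Parse the JSON reply, drop unknown clusters, and cap the total at top_k."""
    raw = json.loads(reply)
    allocation = {}
    remaining = top_k
    for key, count in raw.items():
        cluster_id = int(key)
        if cluster_id not in valid_clusters or remaining <= 0:
            continue
        take = min(int(count), remaining)
        if take > 0:
            allocation[cluster_id] = take
            remaining -= take
    return allocation

cluster_keywords = {
    9: [
        "beyond reasonable doubt",
        "onus on the prosecution",
        "Jury Directions Act 2015",
        "circumstantial evidence",
        "reasonable hypothesis consistent with innocence",
    ],
    # Invented cluster, for illustration only.
    4: ["right to silence", "police interview", "adverse inference"],
}

prompt = build_selection_prompt(
    "Can a defendant's silence be used as evidence?", cluster_keywords, top_k=5
)
# Hypothetical model reply: over budget and naming a cluster that does not exist.
reply = '{"9": 3, "4": 4, "42": 2}'
allocation = parse_allocation(reply, set(cluster_keywords), top_k=5)
print(allocation)
```

Validating the reply this way keeps the total number of retrieved documents within the top-K budget even when the model over-allocates or names a cluster that is not in the index.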
