AutoMeta RAG: Enhancing Data Retrieval with a Dynamic Metadata-Driven RAG Framework

Darshil Modi
Jul 29, 2024


In the ever-evolving field of data retrieval, Retrieval-Augmented Generation (RAG) frameworks are among the most popular tools for eliciting better, more accurate responses from LLMs. Traditional RAG frameworks, however, do not always yield good results, because semantic search over raw, unprocessed data performs poorly without additional information or metadata. In this approach, we will explore how tagging your raw data with metadata can enhance search. However, not everyone is a data scientist equipped with the tools and techniques to extract metadata from a large corpus. That is why this framework dynamically generates metadata using LLMs to build effective RAG systems.

Traditional RAG frameworks vs AutoMeta RAG

The major challenge with traditional RAG frameworks is their direct ingestion of raw data without any preprocessing. This often results in less efficient and less relevant retrieval, since semantic search mechanisms typically excel at word-level rather than paragraph- or chunk-level analysis. To address this, enriching data with metadata can dramatically narrow the search space, speeding up retrieval and improving relevance.

Consider searching for a black Nike T-shirt on Amazon. The results often include a wide array of products across different genders, age groups, sizes, and locations. However, if you apply filters such as gender, size, and price point, the results become much more relevant and arrive faster. Our goal with AutoMeta RAG is to mirror this level of precision in vector searches over larger datasets.

Many current frameworks use node- or graph-based systems to manage data, but these often fall short in preserving detailed information. When data is converted into a graph, subtle but critical details can be lost, leading to less accurate retrieval.

We focus on extracting metadata from the data, but we use that metadata only for filtering and search. Once we find the most relevant chunks, we still pass the original content to the LLM for the final response. Metadata merely helps us pinpoint the exact data; keeping the original content reduces information loss and generates better answers than knowledge graphs, which rely only on the relationships among entities for RAG.

AutoMeta RAG Framework in detail

This approach focuses on leveraging metadata not just as an ancillary feature but as a core component of the search and retrieval process. Here’s how our framework improves upon traditional methods:

  • Metadata Schema Suggestion: Initially, the framework analyzes the data alongside the questions users are likely to ask. This analysis leads to the generation of two metadata schemas:

— File-level metadata: consistent across all chunks within a file, providing a broad overview.

— Chunk-level metadata: unique to each chunk, allowing detailed and precise indexing.

  • Metadata Extraction and Storage: Once the schemas are defined, the framework extracts the relevant metadata for each chunk according to these schemas and stores it in the Qdrant vector DB alongside the data vectors (a sketch of these first two steps follows this list). This indexed metadata significantly refines the search process.
  • Inference and Retrieval: Once the schemas are in place and the metadata for each chunk has been generated and stored, the next critical step is to use this metadata efficiently at query (inference) time. We extract the unique values of every key-value pair across the chunk metadata JSONs into what we call the master JSON. The LLM uses it at inference time to extract a metadata filter from the user query.
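
Here is a minimal sketch of what the schema-suggestion and extraction-and-storage steps could look like. It is not the exact notebook code: the prompts and the helper names (suggest_schemas, extract_chunk_metadata) are illustrative, and it assumes the openai and qdrant-client Python packages.

```python
import json

from openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

llm = OpenAI()                     # assumes OPENAI_API_KEY is set
qdrant = QdrantClient(":memory:")  # swap for your Qdrant server URL

def suggest_schemas(sample_text: str) -> dict:
    """Step 1: ask the LLM to propose file-level and chunk-level schemas."""
    resp = llm.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Given this document sample and the questions users are likely "
                "to ask about it, return a JSON object with two keys, "
                "'file_level' and 'chunk_level', each mapping metadata field "
                f"names to short descriptions.\n\nSample:\n{sample_text}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)

def extract_chunk_metadata(chunk: str, schema: dict) -> dict:
    """Step 2: fill the chunk-level schema for a single chunk."""
    resp = llm.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": f"Fill this JSON schema from the text.\n"
                       f"Schema: {json.dumps(schema)}\nText: {chunk}",
        }],
    )
    return json.loads(resp.choices[0].message.content)

def embed(text: str) -> list[float]:
    """Embed a chunk; any embedding model works here."""
    out = llm.embeddings.create(model="text-embedding-3-small", input=text)
    return out.data[0].embedding

chunks = ["Black Nike T-shirt, 100% cotton, $25 ...", "..."]  # toy data
schemas = suggest_schemas(chunks[0])

qdrant.create_collection(
    collection_name="autometa_rag",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
points = []
for i, chunk in enumerate(chunks):
    meta = extract_chunk_metadata(chunk, schemas["chunk_level"])
    meta["text"] = chunk  # keep the original content alongside the metadata
    points.append(PointStruct(id=i, vector=embed(chunk), payload=meta))
qdrant.upsert(collection_name="autometa_rag", points=points)
```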

For example, suppose the master JSON has two keys, price and product_category:

{"price": ["0$-100$", "100$-500$", "500$-1000$", "1000$+"], "product_category": ["clothes, accessories", "mobiles, laptops", "grocery, essentials"]}

And the user query is: "What are some good options for a men's black T-shirt?"

The LLM will extract the key-value pair {"product_category": "clothes, accessories"} as the metadata filter to apply to the Qdrant DB.
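
A sketch of how the master JSON can be built and then used at inference time; the prompt wording and helper names are illustrative, reusing the `llm` client from the earlier sketch.

```python
import json
from collections import defaultdict

def build_master_json(all_chunk_metadata: list[dict]) -> dict:
    """Collect the unique values seen for every metadata key (scalar values assumed)."""
    master = defaultdict(set)
    for meta in all_chunk_metadata:
        for key, value in meta.items():
            if key == "text":
                continue  # the raw content is not a filterable field
            master[key].add(value)
    return {key: sorted(values) for key, values in master.items()}

def extract_query_filter(query: str, master_json: dict) -> dict:
    """Ask the LLM which key-value pairs from the master JSON apply to the query."""
    resp = llm.chat.completions.create(  # `llm` client from the earlier sketch
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                "Below are the allowed metadata keys and values. Return a JSON "
                "object containing only the pairs relevant to the user query, "
                "or {} if none apply.\n"
                f"Allowed values: {json.dumps(master_json)}\nQuery: {query}"
            ),
        }],
    )
    return json.loads(resp.choices[0].message.content)

master = {"price": ["0$-100$", "100$-500$", "500$-1000$", "1000$+"],
          "product_category": ["clothes, accessories", "mobiles, laptops",
                               "grocery, essentials"]}
filters = extract_query_filter(
    "What are some good options for a men's black T-shirt?", master)
# expected (model-dependent): {"product_category": "clothes, accessories"}
```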

During the retrieval phase, the framework extracts metadata from the user query based on the predefined schemas and uses it to filter the vector DB, efficiently identifying the most relevant chunks for RAG.

Filtering using Qdrant Search

Qdrant provides multiple ways to implement payload filtering, and that's just one of the many reasons I love Qdrant. Although the current implementation only supports filters that match exact key-value pairs, Qdrant offers other filter options that can further enhance the search. You can read more in Qdrant's filtering documentation.
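
For reference, this is what the exact-match payload filter looks like with the qdrant-client Python package, continuing the collection and helpers from the ingestion sketch above.

```python
from qdrant_client.models import FieldCondition, Filter, MatchValue

query = "What are some good options for a men's black T-shirt?"

# Exact key-value match, as used by the current implementation.
query_filter = Filter(must=[
    FieldCondition(key="product_category",
                   match=MatchValue(value="clothes, accessories")),
])

hits = qdrant.search(              # `qdrant` and `embed` from the earlier sketch
    collection_name="autometa_rag",
    query_vector=embed(query),
    query_filter=query_filter,     # only points whose payload matches survive
    limit=5,
)
# Qdrant also offers should/must_not clauses, Range conditions, and more.
```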

Preserving Original Content

Importantly, once the relevant chunks are identified, the original content, not just the metadata, is passed to the Large Language Model (LLM) for the final response. This ensures that while metadata helps pinpoint the exact data needed, the richness and completeness of the original data are preserved, preventing information loss.
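
In code terms, the final answer is generated from the retrieved payload text rather than from the metadata. Continuing the sketches above (the prompt wording is illustrative):

```python
# `hits` from the Qdrant search above; each payload still carries the raw chunk.
context = "\n\n".join(hit.payload["text"] for hit in hits)

answer = llm.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Answer the question using only this context:\n{context}\n\n"
                   f"Question: {query}",
    }],
).choices[0].message.content
```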

How to make this implementation cheaper

I understand that LLMs are costly, and using OpenAI to generate metadata for every chunk can empty your pockets. So I found a suitable alternative: NuMind's NuExtract model, a fine-tuned version of Phi-3 that specialises in extracting JSON from text given a JSON schema. NuExtract is only a 3B model, and a tiny version is also available for lightweight deployments, though it may reduce accuracy.
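
A rough sketch of running NuExtract locally with Hugging Face transformers; the prompt layout follows the format shown on the numind/NuExtract model card, but double-check the card before relying on the exact markers.

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Use "numind/NuExtract-tiny" for the lighter variant mentioned above.
tokenizer = AutoTokenizer.from_pretrained("numind/NuExtract", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("numind/NuExtract", trust_remote_code=True)

def nuextract(text: str, schema: dict) -> dict:
    """Extract `schema` from `text` using NuExtract's template-style prompt."""
    prompt = ("<|input|>\n### Template:\n" + json.dumps(schema, indent=4)
              + "\n### Text:\n" + text + "\n<|output|>\n")
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=256)
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)
    # The generated JSON follows the final <|output|> marker.
    return json.loads(decoded.split("<|output|>")[-1].strip())
```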

I would still suggest using GPT-4o or a similar LLM for the schema-suggestion step, as it is crucial for the entire pipeline to work.

Full code implementation

The full implementation notebook can be found on my GitHub. Although it is not the most optimised version and needs a lot of improvement, I believe it's a good starting point, and I would love to receive contributions from the AI community.

Future Enhancements

This framework still lacks evaluation metrics over a RAG benchmark dataset, and I leave that exercise to its users. However, here are a few enhancements on my to-do list that I plan to implement in the near future:

  1. Automatic data-typing of JSON from LLM responses: in some functions I still have to manually pass the JSON schema from previous outputs, which could be automated with json.loads/json.dumps.
  2. Support for multiple data-ingestion frameworks: currently only LlamaIndex's SimpleDirectoryReader is supported; this can be expanded to other readers from LangChain and LlamaIndex, as well as custom data readers.
  3. Support for other LLMs, including open-source alternatives.
  4. Parallelised metadata generation and Qdrant insertion to reduce memory consumption.
  5. Support for other vector DBs (although I love Qdrant!).

Connect with me on LinkedIn and let's discuss the next revolution in AI!
