Hierarchical semantic filtering using a vector database

Dec 8, 2024

Imagine having a complex dataset with multiple fields, and a way to run semantic search on one specific field at a time, in a nested manner, until you find the most relevant result!

Hierarchical filtering in semantic manner

The shift from keyword search to semantic search was really groundbreaking. But we are not there yet. Semantic search looks fascinating from a distance, but once you implement it on a large-scale database, you start seeing how hard it is to find the most relevant results.

What’s the problem?

The biggest problem I have encountered while working with vector search is that there is no way to tell the system which part of the data should get the highest priority during search. Let’s say you run the query “find me cheap hotels with swimming pool and free wifi near SF”, and the system returns hotels that have a “swimming pool” and “free wifi” but are not in SF. All of these results, despite being relevant to parts of your query, are useless to the user. Some would argue that the embedding model used to generate the vector embeddings has an attention mechanism that learns which part of the sentence matters more. While this is true at a general level, there is currently no way to make the system understand which part of the query matters most for a given use case.

Oh! Just filter them out

Well, there are a few methods to deal with such problems. One of the best known is “meta-data filtering”. I have been a very big fan of this feature. I also published my own framework around it called “Autometa RAG”.

However, the problem here is that metadata filtering works purely on keywords. For our example above, we could implement a filter “Location” = “San Francisco”, keep only the data points that match it, and then perform vector search on those. Cool, right?

But what if your data points don’t have a consistent structure? The location field may contain “SF”, “San Francisco Bay area” or even “Silicon valley”, and all of these data points are also relevant, correct? In this case, meta-data filtering will miss results that may be a good match for the user query.
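To make the failure concrete, here is a minimal sketch in plain Python. The hotel names are made up for illustration, and `keyword_filter` is a hypothetical helper that mimics an exact-match metadata filter. It is not any library's API.

```python
# Toy records; the location strings are free text, as in the example above.
hotels = [
    {"name": "Hotel A", "location": "San Francisco"},
    {"name": "Hotel B", "location": "SF"},
    {"name": "Hotel C", "location": "San Francisco Bay area"},
    {"name": "Hotel D", "location": "Silicon valley"},
    {"name": "Hotel E", "location": "New York"},
]


def keyword_filter(records, field, value):
    """Exact-match metadata filter: keep records whose field equals value."""
    return [r for r in records if r.get(field) == value]


matches = keyword_filter(hotels, "location", "San Francisco")
print([h["name"] for h in matches])  # only Hotel A; B, C and D are missed
```

Three of the four Bay Area hotels never reach the vector search step, simply because their location is spelled differently.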

So what is the way out?

Well, I figured out a work-around to this problem. We are going to use a combination of meta-data filtering and named-vector storage in Qdrant to make this work.

Let’s say we have a hotel-search dataset that contains the following fields: Hotel name, Description, Amenities, Location and Reviews. I chose this dataset because my journey with Information Retrieval started with Prof. Hamza Farooq, when he first introduced me to his semantic-similarity based hotel search system.

Let’s say you want to search over this dataset and your filtering priority is location first, then amenities, and then description and reviews. Our first step is to break each record in the dataset down into 4 individual vectors, one per field we search on, and then ingest them into the database.

For more details on how to implement named multi-vector storage using Qdrant, check this documentation. This feature allows us to break a single data point down into multiple vectors and store them under a single id.
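As a rough sketch of the idea, here is what one such data point could look like in plain Python: one id, one vector per field, and the original text kept in the payload. `toy_embed` is a hypothetical stand-in for a real embedding model (a hashed bag of words), the hotel record is made up, and the dictionary layout is only an illustration of the concept, not Qdrant's actual storage format.

```python
import math
import zlib

FIELDS = ("location", "amenities", "description", "reviews")
DIM = 8  # toy dimensionality; real embedding models use hundreds of dimensions


def toy_embed(text, dim=DIM):
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def build_point(point_id, record):
    """One point = one id, one named vector per field, raw text kept as payload."""
    return {
        "id": point_id,
        "vector": {field: toy_embed(record[field]) for field in FIELDS},
        "payload": dict(record),
    }


# Made-up hotel record, for illustration only.
hotel = {
    "name": "Bayview Inn",
    "location": "SF Bay",
    "amenities": "Pet Friendly, Free Wi-fi",
    "description": "Clean rooms with a stunning city view",
    "reviews": "Quiet, spotless and great staff",
}
point = build_point(1, hotel)
print(sorted(point["vector"]))
```

Keeping the raw field values in the payload matters later: those are the values the keyword filter will match against.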

The next step is to break the user query down into the same 4 fields so that we can match them. Let’s say our user query is “Looking for a hotel in San Francisco that allows pets, has clean rooms and a scenic view”. We first break this query down into one sub-query per field.
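Here is a rough sketch of what such a breakdown could look like. The field names (`location_query`, `amenities_query`, `description_query`, `reviews_query`) and the particular split of the sentence are assumptions made for illustration. `parse_breakdown` is a hypothetical helper that validates the JSON you would ask a language model to return. No model is called here; the response is written by hand.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class QueryBreakdown:
    location_query: str
    amenities_query: str
    description_query: str
    reviews_query: str


def parse_breakdown(raw_json):
    """Validate the JSON a model is asked to return and build a QueryBreakdown."""
    data = json.loads(raw_json)
    names = [f.name for f in fields(QueryBreakdown)]
    missing = [n for n in names if n not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return QueryBreakdown(**{n: str(data[n]) for n in names})


# Hand-written example of a possible model response for the query above.
raw = json.dumps({
    "location_query": "San Francisco",
    "amenities_query": "pet friendly",
    "description_query": "scenic view",
    "reviews_query": "clean rooms",
})
breakdown = parse_breakdown(raw)
print(breakdown.location_query)  # San Francisco
```

Each sub-query is then embedded separately and matched against the vector of the same name.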

This query breakdown can be easily implemented with a single OpenAI call. Now, let’s go to the next step. We first perform semantic search on the location field of our vectors and save the location values from the output payloads to a list.

The results show data points whose location is SF Bay, SF or San Jose, all of which are similar to our query location, “San Francisco”. Hence we should keep these locations in play while continuing our search on other fields like amenities and reviews.
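A minimal sketch of this first step, in plain Python with hand-made 2-d vectors standing in for real location embeddings (an assumption; real vectors come from an embedding model and the search itself would run inside the vector database). `search_locations` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# Hand-made "location embeddings": Bay Area places point roughly the same way.
points = [
    {"id": 1, "location_vec": [0.95, 0.10], "payload": {"location": "SF Bay"}},
    {"id": 2, "location_vec": [0.99, 0.05], "payload": {"location": "SF"}},
    {"id": 3, "location_vec": [0.80, 0.35], "payload": {"location": "San Jose"}},
    {"id": 4, "location_vec": [0.05, 0.99], "payload": {"location": "New York"}},
    {"id": 5, "location_vec": [0.20, 0.90], "payload": {"location": "Boston"}},
]
query_vec = [1.0, 0.0]  # stand-in for the embedding of "San Francisco"


def search_locations(points, query_vec, limit=3):
    """Rank points on the location vector and collect the distinct payload locations."""
    ranked = sorted(points, key=lambda p: cosine(p["location_vec"], query_vec), reverse=True)
    locations = []
    for p in ranked[:limit]:
        loc = p["payload"]["location"]
        if loc not in locations:
            locations.append(loc)
    return locations


locations = search_locations(points, query_vec)
print(locations)  # ['SF', 'SF Bay', 'San Jose']
```

The `limit` here decides how wide the net is cast; it is a tuning knob, and 3 is just the value used in this toy example.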

We use this list of locations for “meta-data filtering”, which is keyword-based filtering. We use the “MatchAny” condition here, so all the results that have location = SF Bay or SF or San Jose become our eligible candidates. And this is the work-around I was talking about. Notice how we used a combination of semantic search and meta-data filtering to perform nested semantic filtering!

Now we focus on amenities and perform vector search on the Amenities field, but with a meta-data filter on location = [SF Bay, SF, San Jose] that we got from the previous step. We get results similar to our amenities_query, i.e. Pet Friendly, Free Wi-fi, etc. Notice how we are narrowing down our search space and moving one step closer to the most relevant results.
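This step can be sketched in plain Python as a keyword filter followed by a vector ranking. `match_any` is a hypothetical helper written in the spirit of a match-any condition (the payload value must equal one of the allowed values), and the vectors are again hand-made stand-ins for real amenities embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def match_any(payload, key, allowed):
    """Keyword filter: payload[key] must equal one of the allowed values."""
    return payload.get(key) in allowed


hotels = [
    {"id": 1, "amenities_vec": [0.90, 0.20],
     "payload": {"location": "SF", "amenities": "Pet Friendly, Free Wi-fi"}},
    {"id": 2, "amenities_vec": [0.10, 0.95],
     "payload": {"location": "SF Bay", "amenities": "Casino, Valet parking"}},
    {"id": 3, "amenities_vec": [0.85, 0.30],
     "payload": {"location": "San Jose", "amenities": "Pets allowed, Wi-fi"}},
    {"id": 4, "amenities_vec": [0.95, 0.10],
     "payload": {"location": "New York", "amenities": "Pet Friendly, Free Wi-fi"}},
]
allowed_locations = ["SF Bay", "SF", "San Jose"]  # output of the location step
amenities_query_vec = [1.0, 0.0]  # stand-in for the embedded amenities_query


def filtered_search(points, vector_key, query_vec, filter_key, allowed, limit=2):
    """Apply the keyword filter first, then rank the survivors by vector similarity."""
    candidates = [p for p in points if match_any(p["payload"], filter_key, allowed)]
    ranked = sorted(candidates, key=lambda p: cosine(p[vector_key], query_vec), reverse=True)
    return ranked[:limit]


results = filtered_search(hotels, "amenities_vec", amenities_query_vec,
                          "location", allowed_locations)
print([p["id"] for p in results])  # [1, 3]
```

Hotel 4 has the closest amenities vector of all, but it never gets ranked because its location fails the filter.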

We again save these results in a list to use for meta-data filtering in the next step. We now focus on description and match our description_query, using meta-data filtering on amenities, because we know that these eligible candidates have both a location and amenities that satisfy our user query.

Bang on! These are the results that we were looking for. These hotels have stunning city views, they are pet friendly, have free Wi-Fi and are in and around San Francisco. Most of the conditions in the user query are met, and we have landed on the most relevant results in our dataset.
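Putting the stages together, the whole cascade can be sketched as a single loop. This is a simplified plain-Python illustration under stated assumptions, not the code from the repository: the vectors are hand-made, `hierarchical_search` is a hypothetical name, and the sketch keeps the filters from all earlier stages active (location and amenities together at the description step), which is one possible reading of the approach.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hierarchical_search(points, query_vecs, stages):
    """Run one filtered vector search per (field, limit) stage, in priority order."""
    filters = {}  # field -> payload values collected from earlier stages
    hits = []
    for field, limit in stages:
        # Keyword step: keep points whose payload matches every earlier stage.
        candidates = [
            p for p in points
            if all(p["payload"][f] in allowed for f, allowed in filters.items())
        ]
        # Semantic step: rank the survivors on this field's named vector.
        hits = sorted(
            candidates,
            key=lambda p: cosine(p["vectors"][field], query_vecs[field]),
            reverse=True,
        )[:limit]
        filters[field] = {p["payload"][field] for p in hits}
    return hits


def hotel(pid, loc, loc_vec, amen, amen_vec, desc, desc_vec):
    """Build a toy point with three named vectors and a text payload."""
    return {
        "id": pid,
        "vectors": {"location": loc_vec, "amenities": amen_vec, "description": desc_vec},
        "payload": {"location": loc, "amenities": amen, "description": desc},
    }


# Made-up hotels with hand-made vectors, for illustration only.
hotels = [
    hotel(1, "SF", [0.99, 0.05], "Pet Friendly, Free Wi-fi", [0.90, 0.20],
          "Stunning city views", [0.95, 0.10]),
    hotel(2, "SF Bay", [0.95, 0.10], "Pets allowed, Wi-fi", [0.85, 0.30],
          "Next to a highway", [0.10, 0.90]),
    hotel(3, "San Jose", [0.80, 0.35], "Casino, Valet parking", [0.10, 0.95],
          "Scenic hillside view", [0.90, 0.20]),
    hotel(4, "New York", [0.05, 0.99], "Pet Friendly, Free Wi-fi", [0.95, 0.10],
          "Skyline views", [0.97, 0.05]),
    hotel(5, "Boston", [0.20, 0.90], "Gym", [0.30, 0.80],
          "Harbour view", [0.90, 0.15]),
]
query_vecs = {"location": [1.0, 0.0], "amenities": [1.0, 0.0], "description": [1.0, 0.0]}
stages = [("location", 3), ("amenities", 2), ("description", 1)]

final = hierarchical_search(hotels, query_vecs, stages)
print([p["id"] for p in final])  # [1]
```

In this toy data, hotel 4 in New York shares its amenities text with hotel 1 and has the closest description vector, so filtering the last step on amenities alone would rank it first. Keeping the location filter active is what prevents that here.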

I know this is a complicated work-around, so I recommend re-reading the blog to get more clarity on what we just did. Also take a look at the diagram below.

Amazing, but where’s the code?

I got you, bro. Here you go:

GitHub - darshil3011/heirarchical-semantic-filtering (github.com): Perform nested semantic filtering on your vector database!

If you feel there is something out there that is more efficient, fulfils the purpose and can be used at production scale, feel free to add it in the comments. Qdrant, I know you are reading this. Please consider this a feature request, one with a lot of important use cases at scale.