Build your own ChatGPT like LLM without OpenAI APIs
My articles are usually titled “without APIs” because I believe you should be in control of what you have built. This doesn’t mean re-inventing the wheel, but you should be able to tweak your system to yield the results that best fit your use case.
While OpenAI’s GPT-3 model is phenomenal, the goal of this article is not to compete with any SOTA LLM but to understand how different aspects of NLP can be combined to answer user queries. We are going to tweak a bunch of pre-trained models to see how AI can fetch relevant information and articulate it as a response to a user's question.
Okay, enough beating around the bush, let's get into it.
Begin the search
Let's try to understand the concept with an analogy. Go back in time to when there were no search engines and you had to complete your physics assignment. To answer a question, you would first look up that topic in your textbook, then read it, understand it and rephrase it in a form that satisfactorily answers the assignment question.
For example, if the question was “How do day and night occur on the Earth?”
To answer this question, you must first know that day and night happen as a result of the rotation of the Earth. You would then find the topic in the textbook that describes how the Earth rotates, and articulate the answer as
“Day and night occur due to the Earth rotating on its axis. The Earth orbits the Sun once every 365 days and rotates about its axis once every 24 hours. The term ‘one day’ is determined by the time the Earth takes to rotate once on its axis and includes both daytime and night-time. It is daytime on the side of the Earth facing the Sun and night-time on the side facing away from it.”
So to summarise, we performed the steps below:
1) Correlated “day & night” with “rotation”
2) Searched for the topic “rotation of the earth”
3) Articulated the description into an answer
Don’t worry, you are reading the right article. In no time, we will move from the Earth, Moon and stars to embeddings and vectors!
Adding semantics to the search
I am sure you know how to implement keyword-based matching and search, but that’s not enough. We need to capture that “day and night” has something to do with the rotation of the Earth. To implement this, we use the power of semantics. Semantic search, in simple words, means that your search engine should understand that “great” and “awesome” are similar words.
Vectors & Embeddings
First, we convert the words into tokens and then vectorise them. Vectorisation is the process of converting text into numbers, which we technically call “embeddings”. Yes, we are there now! There are different types of embedding models that do this; the difference lies in how each maps words to numbers. If you want to understand in detail how embeddings are formed, watch this awesome StatQuest.
Semantic search algorithms
Once the words are converted into embeddings, we use a semantic search algorithm to find relevant results based on the user query.
For this, I have used FAISS, an open-source similarity-search library by Facebook. FAISS contains several methods for similarity search. It assumes that the words/instances are represented as vectors identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors similar to a query vector are those with the lowest L2 distance to, or the highest dot product with, the query vector. It also supports cosine similarity, since this is a dot product on normalised vectors. Although FAISS is a powerful library, you can use simpler approaches such as plain cosine similarity or nmslib.
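The "simpler approach" mentioned above, cosine similarity, can be computed directly with numpy for small collections — no index library needed:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 4.0, 6.0])   # same direction, double the length
sim = cosine_similarity(v, w)   # → 1.0 (up to floating-point error)
```

Because it ignores vector length, cosine similarity treats `v` and `w` as identical here — exactly the property that makes it useful for comparing embeddings.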
Step 1: Convert a document containing the AI Wikipedia article into chunks of text using langchain's text splitter
Step 2: Convert the chunks of text into a pandas dataframe
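Wrapping the chunks in a dataframe gives each one an integer index, which is what FAISS will hand back as search results in the later steps (the chunks here are placeholders):

```python
import pandas as pd

chunks = ["AI was founded as a discipline in 1956.",
          "Machine learning is a subfield of AI.",
          "Deep learning uses neural networks with many layers."]

# Row position doubles as the id that FAISS results will map back to.
df = pd.DataFrame({"text": chunks})
```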
Step 3: Convert the text into embeddings using a hugging-face transformer model and build a FAISS index on the vectors.
Step 4: A search function to map FAISS-extracted indices back to the chunks of text
Now let's ask a few questions and see what we are able to extract.
Note how it extracts relevant pieces of text related to the chronology of AI from the article we provided. Let's try one more.
Hurray! We are done with the first two steps:
1) Vectorization
2) Semantic search
We are able to extract relevant information, but we are still not answering our questions to the point. This is what search engines like Google do for us, but not what ChatGPT does. So now we need to articulate these pieces of text into answer form.
The next step is text generation.
Articulating the response
Now that we can extract the relevant pieces of text that answer a user query, we have built our own Google-like search engine. But we need to take it a step further to build something like ChatGPT, which returns information in dialogue form. To do this, we need a text generation model that can convert the relevant piece of text into answer form.
There are a few Q&A models out there that perform really well on the SQuAD dataset, but they can only extract short answers from a given piece of text without rephrasing them to fit the query. Hence I fine-tuned a BERT model on the ELI5 (Explain Like I'm 5 subreddit) dataset. If you want to know how I fine-tuned the model, connect with me on LinkedIn.
This model takes two inputs, a question and a context, and outputs a well-articulated answer. We already have the question (the user query), and we extracted a relevant supporting doc using semantic search. Our next job is to use this fine-tuned model to give us the final output. Below are the results the model produced when asked “When was AI invented?”, with the semantic-search results given as context.
Below are the results produced by the model when asked “How can a classifier be trained in an ANN?”, again with the semantic-search results given as context.
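The fine-tuned ELI5 model itself is not published, so this is only a sketch of the question-plus-context interface such a generative model exposes via hugging-face transformers. `google/flan-t5-small` is my substitute here purely to make the sketch runnable; its answer quality will differ from the article's model, and the prompt format is an assumption:

```python
from transformers import pipeline

# Substitute generative model; the article's fine-tuned checkpoint is private.
generator = pipeline("text2text-generation", model="google/flan-t5-small")

question = "When was AI invented?"
# Context stands in for the chunks returned by the semantic-search step.
context = ("The field of AI research was founded at a workshop held on the "
           "campus of Dartmouth College during the summer of 1956.")

prompt = f"question: {question} context: {context}"
answer = generator(prompt, max_new_tokens=64)[0]["generated_text"]
```

The key point is the shape of the interface: the retrieved chunks go in as context, and the model generates free-form text rather than just copying a span out of it.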
Putting it all together
So let's do a quick recap. We started with a document of text and divided it into chunks. Next, we used a hugging-face mpnet model to convert the chunks of text into embeddings and implemented semantic search on them. This let us pick the most relevant chunks of text for the user query. Lastly, we used a fine-tuned BERT model to rephrase the relevant chunk of text into answer form.
Disclaimer: This does not represent the inner workings of ChatGPT or other LLMs. This was my approach to building something similar using different NLP techniques.
Connect with Think In Bytes to learn how NLP-based AI pipelines can make your business processes more efficient, faster and better!