Once you have decided that you want to use GenAI for a business solution and have implemented a first iteration of your tool, you may soon discover that the model can miss key data points or even hallucinate. Without the augmentations I will talk about here, these could quickly become roadblocks. Here are two techniques to help resolve this issue.
Augmentations: when would you ever need them?

All of what I have talked about so far in previous insights is great for more generalised use cases and getting somewhat deterministic results, but what if you want even more control over the model's output? That's where model augmentation comes in, for which there are two main techniques: retrieval-augmented generation for dynamic, ever-changing data like stock market data or interest rates, and fine-tuning for static data like company policies or the laws of physics.
Retrieval-augmented generation (RAG)

When you want to ground a large language model (LLM) on your most up-to-date and accurate information from an external knowledge base, RAG is the technique to deploy.
A good analogy I heard from IBM explains RAG simply: “It’s the difference between an open-book and a closed-book exam. In a RAG system, you are asking the model to respond to a question by browsing through the content in a book, as opposed to trying to remember facts from memory.”
This is because RAG, as the name suggests, is essentially a look-up system, retrieving the relevant information for the user’s prompt from a defined set of sources rather than an open set such as the web. RAG relies on your information being stored in a vector database, and uses a semantic index to optimise looking up data via similarities and relationships. For instance, given an unlabelled list of fruits, you can ask for all the citrus fruits and the system can tap into its knowledge base to infer which fruits are categorised as citrus.
Essentially, you build a structured knowledge base and store it in a vector database; the LLM can then reference that knowledge based on your prompt. It is a bit like the model performing a Google search over your knowledge base to return relevant results (via the semantic index), then collating that data and generating a response like a normal, base LM.
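To make the retrieval step concrete, here is a minimal sketch of semantic look-up by vector similarity. It is not a real vector database: the "embeddings" are hand-assigned toy vectors standing in for the output of an embedding model, and the document strings are invented for illustration.

```python
import math

# Toy "embeddings": hand-assigned 3-dimensional vectors standing in for the
# output of a real embedding model (all values here are hypothetical).
KNOWLEDGE_BASE = {
    "Lemons are sour yellow citrus fruits.": [0.9, 0.1, 0.2],
    "Oranges are sweet citrus fruits.":      [0.8, 0.2, 0.3],
    "Bananas are soft tropical fruits.":     [0.1, 0.9, 0.2],
    "Our refund window is 30 days.":         [0.0, 0.1, 0.9],
}

def cosine_similarity(a, b):
    """Standard cosine similarity: how close two vectors point."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def retrieve(query_vector, top_k=2):
    """Return the top_k documents most similar to the query vector."""
    ranked = sorted(
        KNOWLEDGE_BASE.items(),
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [doc for doc, _ in ranked[:top_k]]

# A query like "which fruits are citrus?" would embed close to the citrus
# documents; this toy vector plays that role.
citrus_query = [0.85, 0.15, 0.25]
context = retrieve(citrus_query)

# The retrieved context is then stuffed into the prompt, grounding the LLM.
prompt = "Answer using only this context:\n" + "\n".join(context)
```

In a production RAG system the embedding, indexing and nearest-neighbour search are handled by dedicated tooling, but the shape of the pipeline (embed, retrieve, then generate with the retrieved context in the prompt) is the same.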
An example of when RAG would be useful is a financial advisor LLM that retrieves a client’s investment history before making recommendations. This data varies from customer to customer and, for any particular customer, changes over time, so it is not data you would want calcified in the model’s training phase. It is data you’d want available to the model, just not inside the model.
RAG shines when you want to maintain a breadth of general knowledge extended with additional domain-specific data to answer specific questions in an ever-changing context, i.e. when the data you’re using the AI tool on is dynamic. This is because RAG comes in post-training, so you do not need to retrain the model every time the data changes; the data used for the RAG system is easily updatable at any point in time and has lower data requirements compared to fine-tuning.
This also makes RAG more efficient and less expensive than other methods of augmentation, as the model does not need to be retrained every time there is an update to the data. And as I touched on earlier, RAG maintains the model’s base capabilities rather than overwriting them, so it is still able to handle more general prompts too.
Fine-tuning

Fine-tuning is great when you want consistent results based on static, fixed data. It allows for highly specialised use cases like a customer service chatbot where company policy remains static, and lets your chatbot avoid scandals like the one Air Canada suffered a few months back, when it had to honour a refund policy that did not exist because its customer service bot used an unconstrained LLM prone to hallucination.
An SLM (small language model) with fine-tuning, for instance, constrains the model’s own knowledge to the domain it has been trained on. In the customer service example, the knowledge would be constrained to the company’s policies. This avoids the possibility of hallucinations like Air Canada’s, as the idea that the company could even have such a refund policy is not in the model’s corpus.
To achieve this, unlike RAG, fine-tuning requires retraining the model, which can be quite expensive computationally and therefore fiscally, but it is still much cheaper and less resource-intensive than training a new model from scratch. Because of this, smaller models are preferred when fine-tuning is needed.
From IBM : “The intuition of fine-tuning comes from the fact that it's cheaper and easier to hone the capabilities of a pre-trained base model that has already acquired broad learnings relevant to the task at hand than it is to train a new model from scratch for that specific purpose. [...]
Leveraging prior model training via fine-tuning can reduce the amount of expensive computing power and labelled data needed to obtain large models tailored to niche use cases and business needs. For example, fine-tuning can be used to simply adjust the conversational tone of a pre-trained LLM or the illustration style of a pre-trained image generation model; it could also be used to supplement learnings from a model’s original training dataset with proprietary data or specialised, domain-specific knowledge.”
Fine-tuning is great for when the end task requires specialised knowledge over general abilities, and you cannot risk the model straying from your knowledge base. You can think of a fine-tuned model as being a highly eager human with a lot of knowledge on a specific topic, like HR policy, accounting, or even software engineering…
One popular fine-tuning technique is LoRA (Low-Rank Adaptation of Large Language Models), and the resulting sets of fine-tuned weights are themselves often called LoRAs. Apple Intelligence uses LoRAs for specific tasks like its Writing Tools: proofreading, summarising, tone adjustments, etc. Apple calls these Adapters.
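The arithmetic behind LoRA's efficiency is worth seeing once. Instead of updating a full d × k weight matrix, LoRA freezes the base weights and trains two small low-rank factors, B (d × r) and A (r × k), whose product is added on top. The dimensions below are illustrative, not tied to any particular model.

```python
# Sketch of LoRA's parameter savings. Dimensions are hypothetical: a square
# 4096x4096 weight matrix and a LoRA rank of 8.
d, k, r = 4096, 4096, 8

# Full fine-tuning updates every weight in the matrix.
full_finetune_params = d * k                 # 4096 * 4096 = 16,777,216

# LoRA trains only the two low-rank factors B (d x r) and A (r x k);
# the effective update is W' = W + B @ A, with W frozen.
lora_params = (d * r) + (r * k)              # 32,768 + 32,768 = 65,536

reduction = full_finetune_params / lora_params   # ~256x fewer trainable params
```

At a rank of 8 the trainable parameter count drops by a factor of roughly 256 for this one matrix, which is why LoRA adapters can be trained (and shipped, as Apple does) far more cheaply than fully fine-tuned models.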
RAG vs Fine-tuning: how do you choose?

Remember earlier when I talked about model sizes? The size of model you choose affects which augmentation makes sense. Typically, the smaller the model, the easier it is to fine-tune; the larger the model, the harder fine-tuning becomes and the more advantageous it is to use RAG.
For example, with large LMs like GPT-4, RAG is preferable as it avoids catastrophic forgetting, whereas fine-tuning would damage the model’s versatility and require expensive retraining.
For medium-sized models like Llama 8b and Phi 14b, either RAG or fine-tuning is viable: fine-tuning is preferable when the goal relies heavily on memorisation, like Q&A over documents, while RAG is preferred for domain-specific generation or classification that draws on retrieval of relevant knowledge.
For small models, fine-tuning is mostly the way to go: with minimal pre-training there is not much risk of losing more general capabilities, so small models are easier to retrain on domain-specific data to imbue knowledge directly.
A (basic) decision tree that simplifies the process of choosing what kind of GenAI tool you will likely need for your use case.
The decision tree above is a very high-level, low-fidelity way to assess what kind of tooling you would need to use GenAI for your use case. It does depend on a few other factors (that Loomery can help you identify), but it gives a general rule of thumb.
To help build a better understanding, below are some example use cases you could apply GenAI to.
A graph of GenAI use cases across small to large LMs, showing whether fine-tuning, RAG or no augmentation would be a good approach. Use cases on the y-axis require no augmentation; the further a use case sits from the y-axis, the more likely the augmentation on the x-axis will improve the model’s output.
Testing for groundedness

Language models are statistical, meaning they effectively use probabilistic predictions to determine the next token (a word, or part of a word) in the sequence that makes up the response (vs more “classical” symbolic AI). This means they can sometimes get facts wrong, or hallucinate, which is not ideal if you’re using an LM to parse and recall facts from a knowledge base.
The method for mitigating hallucinations is called “grounding”, and there are a few tools to help ground your GenAI out in the wild. Each returns a groundedness score from 0 (ungrounded, very loose on the facts) to 1 (grounded, reliably recalls facts).
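To illustrate the idea of a 0-to-1 groundedness score, here is a deliberately naive stand-in. Real grounding tools use LLM judges or entailment models; this toy version just scores the fraction of answer sentences whose words appear in the source documents, with an arbitrary support threshold. All strings and the threshold are invented for illustration.

```python
def groundedness_score(answer: str, sources: list[str]) -> float:
    """Toy groundedness metric: fraction of answer sentences mostly made of
    words that appear somewhere in the source documents. Real tools use far
    more sophisticated checks; this only sketches the 0-to-1 score shape."""
    source_words = set(" ".join(sources).lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    supported = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        overlap = len(words & source_words) / len(words)
        if overlap >= 0.5:   # arbitrary "supported" threshold for this sketch
            supported += 1
    return supported / len(sentences)

sources = ["The refund window is 30 days from the date of purchase."]

grounded = groundedness_score("The refund window is 30 days.", sources)       # 1.0
hallucinated = groundedness_score("Refunds are available for two years.", sources)  # 0.0
```

A grounded answer scores 1.0 and a fabricated one 0.0 here, mirroring how a real groundedness check would flag the Air Canada-style hallucination before it reached a customer.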
Depending on the outcome you want, i.e. how creative vs grounded you want your augmented model to be, you can then tweak parameters such as the temperature, top-p and the system prompt, and test again through trial and error.
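Temperature in particular has a simple mechanical meaning: the model's raw next-token scores (logits) are divided by the temperature before being turned into probabilities, so low temperatures sharpen the distribution (more deterministic) and high temperatures flatten it (more creative). A minimal sketch, with hypothetical logit values:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert logits into probabilities, scaled by temperature first.
    Subtracting the max is the standard trick for numerical stability."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # hypothetical scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.2)  # sharp: top token dominates
hot = softmax_with_temperature(logits, 2.0)   # flat: closer to uniform
```

With a cold temperature the highest-scoring token takes almost all the probability mass (near-deterministic, better for grounded recall), while a hot temperature spreads the mass across candidates (more varied, better for creative output). Top-p works on the same distribution, truncating it to the smallest set of tokens whose cumulative probability exceeds p.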
Thinking about what you need

As mentioned at the start, this post complements the blog post I wrote about use cases for base LMs with use cases for when you would want to go further and use augmented models.
Businesses that are thinking strategically about GenAI can gain a competitive productivity edge by leveraging the tools described here, both internally for ops and externally for enriching their CX. Loomery can help you not only strategise your GenAI capabilities, but also implement them. Get in touch with us today!