Techniques for training LLMs to be more compact and efficient have improved dramatically over the past 18 months - so much so that you can now run a GPT-4 quality model locally on your machine without an internet connection. This brings huge benefits: privacy, because your data is no longer being harvested by OpenAI or Google, and cost, because it is completely free to run on your own computer’s GPU.
Now, you may be thinking that a GPT-4 quality frontier language model couldn’t possibly run locally on your device, given the limits of both storage and GPU performance, and you would be right - but only when it comes to a large language model like GPT-4 itself.
Does size matter?
Enter the small language model (SLM). SLMs are designed to be far lighter than LLMs, focusing on the quality of the data they are trained on rather than its sheer quantity.
A common misconception is that all of a model’s training data needs to be stored in the model itself, i.e. the corpus used to train GPT-4 (the internet), so that it can be referenced when needed, but that is not the case. All the model needs are its weights.
In short, weights are just a bunch of floating point numbers stored in tensors (basically multidimensional lists of numbers), and their values determine the kind of output you get. The weights are shaped by the training data only while the model is being trained; once they are fixed, the training data no longer needs to be stored. After training, it’s just tokens in and tokens out like any other system, with statistics determining the most likely token to come next in the sequence, which is then decoded into the words we see in the response.
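To make that concrete, here’s a toy sketch in Python (with random, untrained numbers standing in for real weights) of what “tokens in, tokens out” means: the model is nothing more than weight tensors, and generation is picking the highest-probability next token from the scores those tensors produce.

```python
import torch

# A toy "model": the learned weights are nothing more than tensors of floats.
vocab = ["the", "cat", "sat", "on", "mat"]
torch.manual_seed(0)
embeddings = torch.randn(len(vocab), 8)   # one 8-number vector per token
weights = torch.randn(8, len(vocab))      # a weight matrix (random here; learned in a real model)

def next_token(token: str) -> str:
    """Tokens in, tokens out: score every candidate next token and pick the likeliest."""
    logits = embeddings[vocab.index(token)] @ weights  # raw score for each vocabulary entry
    probs = torch.softmax(logits, dim=-1)              # scores -> probabilities
    return vocab[int(torch.argmax(probs))]             # most statistically likely next token

print(next_token("cat"))  # a trained model would print "sat"; here the weights are random
```

Notice that nothing in this loop ever touches the training data; everything the model “knows” is baked into those tensors.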
Currently, two of the best SLMs are Meta’s Llama 3 and Microsoft’s Phi 3, both of which can run on a MacBook or an iPhone with very promising results.
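If you want to try one yourself, the easiest route on a Mac is a local runner such as Ollama. Here is a minimal sketch using its Python client (assuming you have installed Ollama and pulled the llama3 or phi3 model first; the prompt is just an example):

```python
import ollama  # pip install ollama; requires the Ollama app running locally

# Ask a locally running Llama 3 a question - no internet connection or API key
# is needed once the model has been downloaded.
response = ollama.chat(
    model="llama3",  # or "phi3" for Microsoft's model
    messages=[{"role": "user", "content": "Explain what a small language model is in one sentence."}],
)
print(response["message"]["content"])
```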
David and Goliath of language models
Graphic illustrating how the quality of the new Phi-3 models, as measured by performance on the Massive Multitask Language Understanding (MMLU) benchmark, compares to other models of similar size. (Image courtesy of Microsoft)
"What we’re going to start to see is not a shift from large to small, but a shift from a singular category of models to a portfolio of models where customers get the ability to make a decision on what is the best model for their scenario." - Sonali Yadav, principal product manager for Generative AI at Microsoft
This Microsoft article goes into much more detail on the use cases for SLMs working at “the edge”, on-device rather than contending with the latency of the cloud. But in short: imagine a language model built into a restaurant or shop kiosk, smart sensors in a factory or on a building site, traffic systems, and other IoT devices. There are also more technical advantages of SLMs over LLMs: they are far cheaper and quicker to train, and more customisable for specific applications.
The benefit this could bring is a step change in how automated our daily lives are; for instance, imagine a traffic system that actively and instantly makes the most efficient decisions for the town or city it’s in to maximise the flow of traffic. In fact, TfL has already run its own AI experiment, using recent advances in image recognition to make its train stations more efficient.
A museum or university could train its own SLM that runs on an on-site server, never touching the wider web, and acts like a ChatGPT-style information point on campus, answering visitor queries as a highly clued-up tour guide would.
And the recently announced Apple Intelligence is a highly context-aware SLM that runs locally on the latest Apple Silicon, touting privacy benefits as the user’s data never needs to leave the device. Even when a request does leave, Apple has security measures in place to keep the data safe from inspection, and that hand-off happens because of the device’s hardware limitations, not the size of the model. I go into more specific benefits of Apple Intelligence in my WWDC24 reflections.
However, SLMs do fall short of LLMs when it comes to processing large amounts of data, and when you need more advanced reasoning for data analysis and understanding of context. And when you need less personalised, more general information that might not be in the SLM’s training data, you would need to reach out to an LLM, much as you would a search engine.
As the quote from Microsoft states, SLMs and LLMs each bring their own advantages to the table, and these can be combined in ways that maximise the value to the user. To take Apple Intelligence as an example, your highly personalised SLM can infer your preferences for holiday destinations from your photo locations and make an informed suggestion for where you might like to go next, then use an LLM to find out more about the destination and what to do there. A combination of personalised and generalised assistance.
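One simple way to wire this up is a router that keeps personal queries on the local SLM and only reaches out to a cloud LLM for general knowledge. A hypothetical sketch (the routing flag, model names and prompts are placeholders; it assumes the ollama and openai Python clients and an OpenAI API key):

```python
import ollama                 # local SLM (e.g. Llama 3 or Phi 3) via Ollama
from openai import OpenAI     # cloud LLM fallback; reads OPENAI_API_KEY from the environment

cloud = OpenAI()

def answer(question: str, needs_general_knowledge: bool) -> str:
    """Hypothetical hybrid: private, on-device SLM first; cloud LLM only when needed."""
    if not needs_general_knowledge:
        local = ollama.chat(model="phi3", messages=[{"role": "user", "content": question}])
        return local["message"]["content"]
    remote = cloud.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return remote.choices[0].message.content

# The personal question stays on the device; the general travel query goes to the cloud.
print(answer("Suggest a holiday destination like the ones in my notes", needs_general_knowledge=False))
print(answer("What are the best things to do in Lisbon in spring?", needs_general_knowledge=True))
```

In a real product the routing decision would of course be learnt or rule-based rather than a hand-set flag, but the shape of the system is the same: personalised assistance locally, generalised assistance remotely.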
Shrinking a model to fit in your pocket
Three methods of cutting a model’s size down are pruning, quantisation and palettisation. In short, these methods compress a model to a size of your choosing; however, they are lossy, meaning you lose model accuracy the more you compress it - so of course, there is a trade-off.
For reference, Llama 3’s and Phi 3’s small models of around 8 billion (8B) parameters (which beat GPT-3.5 outright and even GPT-4 on a few measures) come in at between 8 and 9 GB in their “full resolution”, but you can quantise them down to half that size and still maintain a decent level of accuracy.
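As a rough rule of thumb (ignoring file overheads), a model’s size on disk is simply the number of parameters multiplied by the bits stored per weight, divided by eight. A quick back-of-the-envelope sketch, assuming an 8-billion-parameter model:

```python
# Rough size on disk: parameters * bits_per_weight / 8 bytes (ignoring overheads).
params = 8e9  # an 8-billion-parameter model such as Llama 3 8B
for bits in (16, 8, 4):
    size_gb = params * bits / 8 / 1e9
    print(f"{bits}-bit weights: ~{size_gb:.0f} GB")
# 16-bit: ~16 GB, 8-bit: ~8 GB, 4-bit: ~4 GB
```

Halving the bits per weight roughly halves the download, which is exactly the trade quantisation makes.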
From Apple's WWDC23 video on model compression.
Quantisation is how Apple fits its SLM onto an iPhone: it reduces the number of bits used to store each weight while maintaining accuracy by leveraging improved methods for storing tensor lookup tables - essentially a way to decompress the model at run time.
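Apple hasn’t published its exact implementation, but the idea behind a weight lookup table (palettisation) can be sketched in a few lines: replace every float weight with a small index into a shared palette of representative values, then rebuild the full tensor with a table lookup at run time. A toy sketch with made-up weights:

```python
import numpy as np

# Toy palettisation: store a small "palette" of representative values plus a
# low-bit index per weight, instead of a full float32 per weight.
rng = np.random.default_rng(0)
weights = rng.normal(size=1000).astype(np.float32)   # pretend these are a layer's weights

# Build a 16-entry palette (4 bits per index) from evenly spaced quantiles of the weights.
palette = np.quantile(weights, np.linspace(0, 1, 16)).astype(np.float32)

# Each weight is replaced by the index of its nearest palette entry.
indices = np.abs(weights[:, None] - palette[None, :]).argmin(axis=1).astype(np.uint8)

# "Decompressing" at run time is just a table lookup.
restored = palette[indices]

print(f"original: {weights.nbytes} bytes, palette + indices: "
      f"{palette.nbytes + indices.nbytes} bytes (uint8 indices here; packing to 4 bits halves that again)")
print(f"mean absolute error: {np.abs(weights - restored).mean():.4f}")
```

A 16-entry palette needs only 4 bits per index, so the weights shrink by roughly 8x compared to float32, at the cost of some precision - the lossy trade-off mentioned above.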
A new wave of opportunities in GenAI
One year ago, if you wanted to use a quality frontier LLM you had to have an internet connection and pay OpenAI £20/month for ChatGPT, and while local models were available (like Llama 2), they weren’t anywhere near the quality of GPT-4 and other cloud-based LLMs.
Today, things are different. Meta has managed to train a high-quality lightweight model (GPT-4 is rumoured to have 1,760 billion parameters, compared to Llama 3’s 8 billion!) that is small enough to run locally and that outperforms frontier models like GPT-4 on some measures (and comes close on most others!). This shows phenomenal potential for integrated on-device SLMs in the very near future, an outlook further vindicated by the advent of Apple Intelligence.
The possibilities are exciting, as this unlocks the potential for an assistant like Siri to execute actions in your apps for you, such as booking a train up to Scotland via an app like Trainline and doing all the research and price comparison for you, having already learnt your preferences from months and years of usage data: your “Personal Context”.
As this reality draws ever closer, Apple is making these features very easy for native Swift apps to adopt in the new releases of iOS and macOS coming later this year. Further to this, Apple also makes it easy to embed the other small models discussed earlier, like Phi and Llama, which I will cover in more detail in my next post, so stay tuned!
We have the tools and experience at Loomery to help you build products that take advantage of these next-generation features; the time to get on board is now, or you risk falling behind the market. Don’t hesitate to get in touch.