
Link: What On-Device Models Mean for Open-Source AI — The Information

Apple’s on-device model is a small language model with 3 billion parameters, the settings that determine how models respond to questions. (To put that in perspective, OpenAI’s GPT-4 is believed to have about 1 trillion parameters.) Apple’s model follows other small language models that can run on phones, such as Google’s Gemma and Microsoft’s Phi-2.

These releases didn’t come out of nowhere. Independent developers have been working for months to run open-source AI models, such as Meta Platforms’ Llama 3 and those from Mistral, on phones, laptops and other devices. These developers say they want more control over their models, partly out of a desire to ensure user privacy and partly because they’re daunted by the high cost of using larger proprietary models, like OpenAI’s.

Open-source models that can run on devices might be particularly attractive to businesses and users aiming to cut costs. On-device models are often faster than cloud-based models like OpenAI’s, and are usually cheaper as well, as my colleague Stephanie has written. They are particularly useful for working offline, such as on an airplane without WiFi.

Alex Cheema and Mohamed Baioumy of Exo Labs, a startup working to run models on devices, showed in April how they ran the smallest version of Llama 3, which has 8 billion parameters, on an iPhone. Models that large typically can’t run on phones because they require more memory than phones have.

Here’s how Cheema and Baioumy got Llama 3 to run on an iPhone: First, they ported Llama 3 to MLX, open-source software that Apple released in December for running models on Apple’s chips. This was necessary because most AI models are built using PyTorch, open-source software originally developed by Meta Platforms that isn’t optimized for Apple’s chips. Then, Cheema and Baioumy added a basic app with the ability to chat.
To get around the iPhone’s memory limit, the pair used a version of Llama 3 that had undergone quantization, a technique that compresses a model so it requires less memory, though it can make the model less accurate. (MLX recently improved its support for quantized models, so their accuracy doesn’t suffer as much.)

They also used techniques outlined by Apple researchers that load the weights, which tell a model how important certain information is, as the LLM needs them rather than all at once. There are similar methods for running larger open-source models on laptops.

Apps such as LM Studio and Ollama make on-device AI more accessible to both hobbyists and businesses, said Shubham Saboo, who writes a newsletter with tips and tutorials for using AI. In the same way that Microsoft Excel makes it easier to use spreadsheets, these apps allow people to use AI models.

Some businesses, especially those with sensitive financial, health or legal data, “don't feel comfortable using [AI] or sending confidential data to OpenAI or companies that host cloud services,” Saboo said. “This is a big savior.”
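To make the quantization idea above concrete, here is a minimal sketch of the basic technique: mapping float32 weights onto 8-bit integers with a shared scale factor. This is a generic illustration, not MLX’s actual quantization scheme; the matrix size and the `quantize`/`dequantize` helpers are invented for the demo.

```python
import numpy as np

# Generic sketch of weight quantization: compress float32 weights to int8
# with a shared scale factor. Illustrative only -- not MLX's actual scheme.

def quantize(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float32 weights onto the signed 8-bit levels [-127, 127]."""
    scale = float(np.abs(weights).max()) / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately reconstruct the original weights."""
    return q.astype(np.float32) * scale

weights = np.random.randn(1024, 1024).astype(np.float32)  # one toy weight matrix
q, scale = quantize(weights)

print(weights.nbytes // q.nbytes)  # int8 storage is 4x smaller than float32
# Rounding error is bounded by half a quantization step:
print(np.abs(dequantize(q, scale) - weights).max() <= scale / 2 + 1e-6)
```

The memory saving is exactly the point of the article: a 4x (or more, with 4-bit schemes) smaller model can fit in a phone’s RAM, at the cost of the small rounding error shown above.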
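The on-demand weight loading mentioned above can be sketched in the same spirit: keep each layer’s weights on disk and read them only when the forward pass reaches that layer, so peak memory is roughly one layer rather than the whole model. The file layout, layer count and tiny matrices here are made up for illustration; this is not the Apple researchers’ actual method.

```python
import os
import tempfile
import numpy as np

# Sketch of on-demand weight loading: each layer's weights stay on disk and
# are read only when the forward pass reaches that layer, so peak memory is
# roughly one layer instead of the whole model.

NUM_LAYERS = 4
DIM = 256

weight_dir = tempfile.mkdtemp()
for i in range(NUM_LAYERS):  # pretend these files hold a real model's layers
    w = (np.random.randn(DIM, DIM) / np.sqrt(DIM)).astype(np.float32)
    np.save(os.path.join(weight_dir, f"layer_{i}.npy"), w)

def forward(x: np.ndarray) -> np.ndarray:
    """Run the toy model, loading one layer's weights at a time."""
    for i in range(NUM_LAYERS):
        # Only this layer's weights are resident during the matmul.
        w = np.load(os.path.join(weight_dir, f"layer_{i}.npy"))
        x = np.tanh(x @ w)
    return x

out = forward(np.random.randn(1, DIM).astype(np.float32))
print(out.shape)  # (1, 256)
```

A real implementation would stream weights with memory-mapped files rather than separate `.npy` files per layer, but the principle is the same: trade disk reads for a smaller memory footprint.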


Yoooo, this is a quick note on a link that made me go, WTF? Find all past links here.