I've always been captivated by the idea of running large language models directly on user devices. There's something magical about running Llama 3.1 8B, one of the most advanced language models, on your computer or smartphone.
In this post, I'll introduce you to AQLM.rs, my latest pet project that brings Llama 3.1 8B to your browser using WebAssembly. This implementation is made possible by a compression algorithm from Yandex Research, which allows this advanced language model to run without a GPU directly in your browser.
You can try it out yourself on the project website. Now, let's dive into how it works.
Why choose the 8B model?
Running language models on user devices isn't new: models like Llama 3.2 1B and 3B were explicitly designed for low-power devices. The 8B Llama model, however, presents an ideal opportunity to showcase the capabilities of advanced compression algorithms in a browser environment.
To put this in perspective, let's look at the model's memory requirements: each parameter requires 16 bits in its uncompressed form, making the 8B model approximately 16 GB. Standard 4-bit compression methods like NF4 can reduce this to 4 GB.
Our extreme compression approach takes this further, using just 2 bits per parameter and compressing the model body by a factor of 8. The head layers and embeddings still use 4-bit and 8-bit compression, bringing the total compressed model size to around 2.5 GB.
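The arithmetic behind these figures is easy to check. The sketch below is only an illustration: the function names are mine, and it assumes every parameter is stored at the same bit width and that 1 GB means 10^9 bytes.

```rust
/// Approximate storage size in gigabytes (10^9 bytes) for a model with
/// `params` parameters, each stored with `bits_per_param` bits.
/// Assumption: a uniform bit width across the whole model.
fn model_size_gb(params: f64, bits_per_param: f64) -> f64 {
    // bits -> bytes -> gigabytes
    params * bits_per_param / 8.0 / 1e9
}

/// How many times smaller the weights become when going from
/// `from_bits` to `to_bits` per parameter.
fn compression_ratio(from_bits: f64, to_bits: f64) -> f64 {
    from_bits / to_bits
}
```

With 8 billion parameters, `model_size_gb(8e9, 16.0)` gives 16 GB, `model_size_gb(8e9, 4.0)` gives 4 GB, and `model_size_gb(8e9, 2.0)` gives 2 GB. The real compressed model is closer to 2.5 GB because the embeddings and head layers are kept at 4 and 8 bits rather than 2.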
This extreme compression saves space and improves performance.
Because inference speed is limited largely by how quickly the weights can be read from memory, reducing memory requirements directly translates to faster execution. Remarkably, our 2-bit compressed version of Llama 3.1 8B outperforms the uncompressed Llama 3.2 3B while occupying only half the space.
The matrix mathematics behind language models
At their core, large language models are collections of matrices. The primary computational workload involves matrix-vector multiplication, where compression methods focus their optimization efforts. These methods aim to create more compact matrix representations while minimizing quality loss.
In May 2024, the Yandex Research team, in collaboration with the Institute of Science and Technology Austria (ISTA) and King Abdullah University of Science and Technology (KAUST), published research on the PV-Tuning algorithm. The algorithm improves compression methods for large language models without modifying the compressed weight format. (I wrote about this in detail in my article on Medium.)
Our project uses AQLM as the base method, which is enhanced by PV-Tuning. AQLM employs additive vector quantization for compression. In 2-bit quantization, each parameter occupies only 2 bits instead of the original model's 16 bits, an 8x reduction. To achieve this, each matrix row is split into small groups of eight weights.
Each group is represented as the sum of two vectors, one taken from each of two 256-entry dictionaries (codebooks). An index into a 256-entry dictionary fits in 8 bits, so storing the two indices takes only 16 bits (2 × 8) for eight matrix elements, which gives our target of 2 bits per parameter.
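To make the scheme concrete, here is a minimal sketch of how one group of eight weights could be reconstructed from two one-byte indices. It is not the AQLM.rs code: the type and function names are hypothetical, the codebooks are plain in-memory tables, and extras that a real implementation may apply, such as scaling factors, are left out.

```rust
/// Number of weights reconstructed together (groups of eight, as in the text).
const GROUP_SIZE: usize = 8;
/// Entries per codebook; 256 entries means an index fits in one byte.
const CODEBOOK_SIZE: usize = 256;
/// Two codebooks are summed per group.
const NUM_CODEBOOKS: usize = 2;
/// Bits needed for one index into a 256-entry codebook.
const INDEX_BITS: usize = 8;

/// Hypothetical codebook layout: a table of candidate 8-element vectors.
/// Assumption: every codebook holds exactly CODEBOOK_SIZE entries.
type Codebook = Vec<[f32; GROUP_SIZE]>;

/// Reconstruct one group of eight weights as the sum of one vector
/// picked from each codebook by its one-byte index.
fn decode_group(codebooks: &[Codebook; NUM_CODEBOOKS], codes: [u8; NUM_CODEBOOKS]) -> [f32; GROUP_SIZE] {
    let mut weights = [0.0f32; GROUP_SIZE];
    for (book, &code) in codebooks.iter().zip(codes.iter()) {
        let entry = &book[code as usize];
        for (w, e) in weights.iter_mut().zip(entry.iter()) {
            *w += *e;
        }
    }
    weights
}

/// Index storage per weight: 2 indices * 8 bits / 8 weights = 2 bits.
fn index_bits_per_weight() -> usize {
    NUM_CODEBOOKS * INDEX_BITS / GROUP_SIZE
}
```

The codebooks themselves also take space, but they are shared across many groups, so their cost per parameter is small compared with the indices.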
Implementing in WebAssembly and Rust
WebAssembly has revolutionized browser-based programming, enabling development in virtually any language. After studying Rust at the Yandex School of Data Analysis, I fell in love with the language and had been waiting for the perfect opportunity to use it. This project gave me that chance, so I implemented the entire inference system in Rust.
I was delighted to discover that many fundamental LLM infrastructure libraries are written in Rust. Take Hugging Face's safetensors: the library behind the format is implemented in Rust. Whenever someone loads a safetensors model from Hugging Face in their Python code, they are running Rust under the hood. Similarly, OpenAI's tiktoken tokenizer, which new Llama models use, is also Rust-based.
Achieving multithreading in the browser
To optimize performance, I implemented multithreading using web workers, enabling bidirectional thread communication through message passing. The solution uses a model-parallel approach: matrices are divided by output dimension, with each worker handling its designated portion.
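The idea of splitting a matrix by output dimension can be sketched outside the browser. In the example below, native scoped threads stand in for web workers, the matrix is dense and row-major rather than compressed, and all names are hypothetical. It only illustrates the partitioning: each worker computes a contiguous slice of the output vector, and the slices are concatenated in order.

```rust
use std::thread;

/// Dense row-major matrix-vector product restricted to rows [start, end).
fn matvec_rows(matrix: &[f32], cols: usize, x: &[f32], start: usize, end: usize) -> Vec<f32> {
    (start..end)
        .map(|r| {
            let row = &matrix[r * cols..(r + 1) * cols];
            row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>()
        })
        .collect()
}

/// Split the output rows into contiguous chunks, one per worker, and
/// concatenate the partial results in worker order.
/// Assumption: native threads stand in for web workers here.
fn parallel_matvec(matrix: &[f32], rows: usize, cols: usize, x: &[f32], workers: usize) -> Vec<f32> {
    assert_eq!(matrix.len(), rows * cols);
    assert_eq!(x.len(), cols);
    let workers = workers.max(1);
    // Ceiling division so that every row is assigned to some worker.
    let chunk = (rows + workers - 1) / workers;
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|w| {
                let start = (w * chunk).min(rows);
                let end = ((w + 1) * chunk).min(rows);
                s.spawn(move || matvec_rows(matrix, cols, x, start, end))
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

Because each output element depends on a single row, the workers never need to exchange intermediate results; every worker needs only the input vector and its own rows.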
The most challenging aspect was orchestrating the interaction between workers and the main thread. To address this, I developed a custom RPC stack for workers with Rust-JavaScript interoperability. Let me explain the process.
When the main thread needs to multiply a vector by a matrix, the call goes through several steps. First, the main thread creates a request for each worker, serializes it, and passes it to the JavaScript runtime. JavaScript then forwards the request to the worker, where it is deserialized and processed, and the result is serialized.
Finally, the result returns through JavaScript to the main thread, which deserializes it. With this multithreaded setup, performance improved by about 2x.
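The round trip described above can be sketched in plain Rust. This is an illustration under stated assumptions, not the project's RPC stack: the `MatVecRequest` type, its byte layout (a little-endian `u32` id followed by little-endian `f32` values), and the helper names are all hypothetical, and the JavaScript hop between threads is left out.

```rust
/// Hypothetical request the main thread sends to one worker.
#[derive(Debug, PartialEq)]
struct MatVecRequest {
    request_id: u32,
    input: Vec<f32>,
}

/// Append f32 values to `out` as little-endian bytes.
fn encode_f32s(values: &[f32], out: &mut Vec<u8>) {
    for v in values {
        out.extend_from_slice(&v.to_le_bytes());
    }
}

/// Read little-endian f32 values; None if the length is not a multiple of 4.
fn decode_f32s(bytes: &[u8]) -> Option<Vec<f32>> {
    if bytes.len() % 4 != 0 {
        return None;
    }
    Some(
        bytes
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect(),
    )
}

impl MatVecRequest {
    /// Main-thread side: serialize the request before handing it off.
    fn to_bytes(&self) -> Vec<u8> {
        let mut buf = self.request_id.to_le_bytes().to_vec();
        encode_f32s(&self.input, &mut buf);
        buf
    }

    /// Worker side: rebuild the request from the received bytes.
    fn from_bytes(bytes: &[u8]) -> Option<Self> {
        if bytes.len() < 4 {
            return None;
        }
        let request_id = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
        let input = decode_f32s(&bytes[4..])?;
        Some(Self { request_id, input })
    }
}

/// Worker side: deserialize the request, multiply the input by this
/// worker's rows, and serialize the partial result (id, then values).
fn handle_request(request_bytes: &[u8], worker_rows: &[Vec<f32>]) -> Option<Vec<u8>> {
    let request = MatVecRequest::from_bytes(request_bytes)?;
    let partial: Vec<f32> = worker_rows
        .iter()
        .map(|row| row.iter().zip(&request.input).map(|(w, x)| w * x).sum::<f32>())
        .collect();
    let mut response = request.request_id.to_le_bytes().to_vec();
    encode_f32s(&partial, &mut response);
    Some(response)
}
```

On the main thread, the response would be decoded by reading the id from the first four bytes and passing the rest to `decode_f32s`. Echoing the request id lets the caller match responses to requests when several workers reply out of order.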
Try it yourself
https://www.youtube.com/watch?v=fPOHT4Zf_NA&embedable=true
Check out the demonstration video above for a quick preview. To experience it firsthand, visit the demo page.
Note that the initial loading takes several minutes. For best results, use English, since the model performs significantly better in it.
The project is open-source and available on GitHub. I'm happy to receive any feedback and suggestions on how to improve it further.