The other day I was trying to get Vicuna-13B running on my local machine using the llama.cpp port. While getting it to work on the CPU was not difficult, I had some problems getting the GPU acceleration to work. This is a short description of how I managed to get it running.
Compile the llama.cpp port
First, we need to compile the `llama.cpp` port for Apple Silicon, since the official releases and Docker images don’t support the M1/M2 processors. The following command can be found in the docs:
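A build along these lines should work; this is a sketch of the Metal-enabled cmake flow from the llama.cpp README of that era, so details like the `LLAMA_METAL` flag and the `build-metal` directory name may differ in newer releases:

```bash
# Clone llama.cpp and build it with Metal (Apple GPU) support enabled.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
mkdir build-metal && cd build-metal
cmake -DLLAMA_METAL=ON ..
cmake --build . --config Release
```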
Note: I used `cmake` for this step. If you haven’t installed it yet, you can do so using Homebrew by entering this command:
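Assuming Homebrew itself is already set up, that is simply:

```bash
brew install cmake
```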
Find a compatible model version
To run the model with `llama.cpp`, you need a GGMLv3-compatible version of it. I dug into the Hugging Face model repository and found compatible versions of Vicuna here: https://huggingface.co/TheBloke/stable-vicuna-13B-GGML
Make sure to download one of the 4-bit models for use on the GPU (e.g. `stable-vicuna-13B.ggmlv3.q4_0.bin`).
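One way to fetch the file from the command line is with curl; the URL below just follows Hugging Face’s standard `resolve/main` download path for the repository above, so adjust the filename if you pick a different quantization:

```bash
# Download the 4-bit quantized model into llama.cpp's models directory.
curl -L -o models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  https://huggingface.co/TheBloke/stable-vicuna-13B-GGML/resolve/main/stable-vicuna-13B.ggmlv3.q4_0.bin
```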
Run in interactive mode
From a GitHub discussion I found the following command to run the model in interactive mode:
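It looked roughly like the following; the model path matches the file downloaded above, and the sampling flags (`-c`, `--temp`, `--repeat_penalty`) are typical values from the llama.cpp examples rather than anything mandatory:

```bash
./build-metal/bin/main \
  -m ./models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  -ngl 1 \
  --color -c 2048 --temp 0.7 --repeat_penalty 1.1 \
  -i -ins
```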
Importantly, the `-ngl 1` parameter tells `llama.cpp` to run the model on the GPU.
Et voilà! We now have Vicuna-13B running on the Apple Silicon GPU, ready for us to chat with!