The other day I was trying to get Vicuna-13B running on my local machine using the llama.cpp port. While getting it to run on the CPU was not difficult, I had some trouble getting GPU acceleration to work. This is a short description of how I managed to get it running.

Compile the llama.cpp port
First, we need to compile the llama.cpp port for Apple Silicon, since the official releases and Docker images don’t support the M1/M2 processors. The following commands can be found in the docs:
mkdir build-metal
cd build-metal
cmake -DLLAMA_METAL=ON ..
cmake --build . --config Release
Note: I used cmake for this step. If you haven’t installed it yet, you can do so using Homebrew by entering this command:
brew install cmake
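To double-check that the build produced a working binary before moving on, you can print the help text of the resulting main executable (assuming the build-metal directory from the commands above):
./build-metal/bin/main -h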
Find a compatible model version
To run the model with llama.cpp, you need a GGMLv3-compatible version of it. I dug into the Hugging Face model repository and found compatible versions of Vicuna here: https://huggingface.co/TheBloke/stable-vicuna-13B-GGML
Make sure to download one of the 4-bit models for use on the GPU (e.g. stable-vicuna-13B.ggmlv3.q4_0.bin).
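For reference, this is roughly how the file can be fetched from the command line. The URL layout (resolve/main/ followed by the file name) is my assumption about how Hugging Face serves repository files, so double-check the exact file name on the model page:
mkdir -p models
curl -L -o models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  https://huggingface.co/TheBloke/stable-vicuna-13B-GGML/resolve/main/stable-vicuna-13B.ggmlv3.q4_0.bin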
Run in interactive mode
From a GitHub discussion I found the following command to run the model in interactive mode:
./build-metal/bin/main \
  -m ./models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  -t 4 -c 2048 -n 2048 -ngl 1 --color -i \
  --reverse-prompt '### Human:' \
  -p '### Human: What is the relation between llama and vicuna? ### Assistant:'
Importantly, the -ngl 1 parameter tells llama.cpp to run the model on the GPU.
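If you only want a single completion instead of an interactive chat session, the same flags work without -i and --reverse-prompt. Here is a minimal sketch that reuses the parameters from the command above (the shorter -n value and the prompt are just examples):
./build-metal/bin/main \
  -m ./models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  -t 4 -c 2048 -n 256 -ngl 1 \
  -p '### Human: What is the relation between llama and vicuna? ### Assistant:'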
Et voilà! We now have a working Vicuna-13B model running on an Apple Silicon GPU for us to chat with!