The other day I was trying to get Vicuna-13B running on my local machine using the llama.cpp port. While getting it to run on the CPU was not difficult, I had some trouble getting GPU acceleration to work. This is a short description of how I managed to get it running.

Compile the llama.cpp port
First, we need to compile the llama.cpp port for Apple Silicon, since the official releases and Docker images don’t support the M1/M2 processors. The following commands can be found in the docs:
mkdir build-metal
cd build-metal
cmake -DLLAMA_METAL=ON ..
cmake --build . --config Release
Note: I used cmake for this step. If you haven’t installed it yet, you can do so using Homebrew by entering this command:
brew install cmake
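To double-check that the build produced a working binary before moving on, you can print the help text of the resulting main executable (assuming the build-metal directory from the commands above):
./build-metal/bin/main -h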
Find a compatible model version
To run the model with llama.cpp, you need a GGMLv3-compatible version of it. I dug into the Hugging Face model repository and found compatible versions of Vicuna here: https://huggingface.co/TheBloke/stable-vicuna-13B-GGML
Make sure to download one of the 4-bit models for use on the GPU (e.g. stable-vicuna-13B.ggmlv3.q4_0.bin).
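For reference, this is roughly how the file can be fetched from the command line. The URL layout (resolve/main/ followed by the file name) is my assumption about how Hugging Face serves repository files, so double-check the exact file name on the model page:
mkdir -p models
curl -L -o models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  https://huggingface.co/TheBloke/stable-vicuna-13B-GGML/resolve/main/stable-vicuna-13B.ggmlv3.q4_0.bin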
Run in interactive mode
From a GitHub discussion I found the following command to run the model in interactive mode:
./build-metal/bin/main \
  -m ./models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  -t 4 -c 2048 -n 2048 -ngl 1 --color -i \
  --reverse-prompt '### Human:' \
  -p '### Human: What is the relation between llama and vicuna? ### Assistant:'
Importantly, the -ngl 1 parameter tells llama.cpp to run the model on the GPU.
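If you only want a single completion instead of an interactive chat session, the same flags work without -i and --reverse-prompt. Here is a minimal sketch that reuses the parameters from the command above (the shorter -n value and the prompt are just examples):
./build-metal/bin/main \
  -m ./models/stable-vicuna-13B.ggmlv3.q4_0.bin \
  -t 4 -c 2048 -n 256 -ngl 1 \
  -p '### Human: What is the relation between llama and vicuna? ### Assistant:'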
Et voilà! We now have a working Vicuna-13B model running on an Apple Silicon GPU for us to chat with!