Setting up Meta's Llama 2 with llama.cpp
Zuck our beloved • November 25, 2023
Introduction
Meta's Llama family of language models has become increasingly popular for text generation. Today it is possible to run these models very easily thanks to projects like llama.cpp, a port of Llama inference in C/C++ which makes it possible to run models using 4-bit integer quantization.
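To get a rough sense of why quantization matters: at 4 bits per weight, a 7B-parameter model needs about 7B × 0.5 bytes ≈ 3.5 GB for its weights, versus roughly 14 GB at 16-bit precision. That difference is what makes running it on an ordinary laptop practical.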
Preparing
First, figure out which model size you want to use; Llama 2 comes in 7B, 13B, and 70B parameter variants. I'll just use the 7B model.
Ideally we want to skip the quantization and conversion steps ourselves, so I will use the pre-quantized models from TheBloke's HuggingFace repository.
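For reference, here is roughly what those steps look like if you start from the original Hugging Face weights yourself (a sketch only; the paths are placeholders and the commands assume the conversion and quantization tools that shipped with llama.cpp around late 2023):

# Convert the original weights to a GGUF file at 16-bit precision
python3 convert.py models/llama-2-7b/

# Quantize the fp16 GGUF down to 4-bit (Q4_K_M)
./quantize models/llama-2-7b/ggml-model-f16.gguf \
  models/llama-2-7b/ggml-model-Q4_K_M.gguf Q4_K_M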
llama.cpp's README mentions tons of settings to play around with for your system, like building with Metal support for macOS users; the same goes for Nvidia users.
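For example, Nvidia users can build with cuBLAS-accelerated GPU support (this assumes the CUDA toolkit is installed and uses the Makefile flag from late 2023):

make LLAMA_CUBLAS=1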
Setting up
Clone the repository using git and cd into it:
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

Next, run the build command; again, check out the README for more build options:
# M1/M2 Mac users:
# LLAMA_METAL=1 make
make

Now, download your preferred model from HuggingFace into the models/ directory inside llama.cpp (GGML is no longer supported, so use GGUF files only). Note that other models based on Llama are also supported.
REPO_ID="TheBloke/Llama-2-7B-Chat-GGUF"
FILE="llama-2-7b-chat.Q3_K_L.gguf"
curl -L "https://huggingface.co/${REPO_ID}/resolve/main/${FILE}" -o models/${FILE}REPO_ID="TheBloke/Llama-2-7B-Chat-GGUF"
FILE="llama-2-7b-chat.Q3_K_L.gguf"
curl -L "https://huggingface.co/${REPO_ID}/resolve/main/${FILE}" -o models/${FILE}Congrats! 🎊
Using interactive mode
Use the main binary you built earlier to run the model interactively:
./main -m ./models/${FILE} \
  --color \
  --ctx_size 2048 \
  -n -1 \
  -ins -b 256 \
  --top_k 10000 \
  --temp 0.2 \
  --repeat_penalty 1.1 \
  -t 8

Tweak settings as you prefer.
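Briefly: --ctx_size sets the context window in tokens, --temp controls sampling randomness (lower is more deterministic), --top_k limits how many candidate tokens are sampled from, --repeat_penalty discourages the model from repeating itself, -n -1 keeps generating until the model stops on its own, and -t sets the number of CPU threads.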
Using the server API
You can do lots of things with llama.cpp's built-in server, like building Discord chatbots and other applications, or providing a simple OpenAI-compatible endpoint for others to use.
./server -m ./models/llama-2-7b-chat.Q3_K_L.gguf \
  --ctx_size 2048 \
  -t 8

Now you can use the OpenAI-compatible endpoint in the official libraries:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",  # "http://<Your api-server IP>:port"
    api_key="sk-no-key-required"
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

print(completion.choices[0].message)
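If you'd rather test the endpoint without Python, a plain curl request against the same chat completions route works too (a sketch, assuming the server is running locally on the default port 8080):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [
      {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
  }'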