llama.cpp
Run Local LLMs with llama.cpp
Maximum control over LLM inference. Build from source, run GGUF models on CPU, GPU, or mixed mode.
Build from Source
llama.cpp is a C/C++ inference engine that powers many local LLM tools (including Ollama and LM Studio) under the hood. Building from source gives you the latest optimizations.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make -j
# For NVIDIA GPU support (CUDA):
make -j GGML_CUDA=1
# For Apple Silicon (Metal):
make -j GGML_METAL=1
Download a GGUF Model
Models come in GGUF format with different quantization levels. Q4_K_M offers the best balance of quality and size for most users.
# Example: download Llama 3.1 8B Q4_K_M from HuggingFace
# Use huggingface-cli or wget
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
--local-dir models/
Run the CLI
Use llama-cli for interactive chat or llama-server for an OpenAI-compatible API.
# Interactive chat
./llama-cli -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
-c 8192 --chat-template llama3
# Start an OpenAI-compatible API server
./llama-server -m models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf \
--port 8080 -c 8192
GPU Offloading
Use -ngl (number of GPU layers) to control how many model layers run on the GPU. Set to a large number to offload everything, or a lower number to split between CPU and GPU when VRAM is limited.
# Full GPU offload (all layers)
./llama-server -m model.gguf -ngl 99
# Partial offload (20 layers on GPU, rest on CPU)
./llama-server -m model.gguf -ngl 20
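With llama-server running, any OpenAI-style client can talk to it. A minimal smoke test with curl against the default port 8080 might look like this (a sketch; the exact response fields depend on your server version):

```shell
# Query the OpenAI-compatible chat endpoint (assumes a server on port 8080)
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'
```

Because the API shape matches OpenAI's, you can also point existing SDKs at http://localhost:8080/v1 instead of hand-writing requests.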