{/* This page is auto-generated from the skill's SKILL.md by website/scripts/generate-skill-docs.py. Edit the source SKILL.md, not this page. */}

Llama Cpp

llama.cpp local GGUF inference + HF Hub model discovery.

Skill metadata

Source: Bundled (installed by default)
Path: skills/mlops/inference/llama-cpp
Version: 2.1.2
Author: Orchestra Research
License: MIT
Dependencies: llama-cpp-python>=0.2.0
Tags: llama.cpp, GGUF, Quantization, Hugging Face Hub, CPU Inference, Apple Silicon, Edge Deployment, AMD GPUs, Intel GPUs, NVIDIA, URL-first

Reference: full SKILL.md

ℹ️ Info

The following is the complete skill definition that Hermes loads when this skill is triggered. This is what the agent sees as instructions when the skill is active.

llama.cpp + GGUF

Use this skill for local GGUF inference, quant selection, or Hugging Face repo discovery for llama.cpp.

When to use

- Local GGUF inference with llama-cli, llama-server, or llama-cpp-python
- Choosing a quant for a model and memory budget
- Discovering GGUF repos and files on the Hugging Face Hub

Model Discovery workflow

Prefer URL workflows before reaching for the hf CLI, Python, or custom scripts.

  1. Search for candidate repos on the Hub:
     - Base: https://huggingface.co/models?apps=llama.cpp&sort=trending
     - Add search=<term> for a model family
     - Add num_parameters=min:0,max:24B or similar when the user has size constraints
  2. Open the repo with the llama.cpp local-app view:
     - https://huggingface.co/<repo>?local-app=llama.cpp
  3. Treat the local-app snippet as the source of truth when it is visible:
     - copy the exact llama-server or llama-cli command
     - report the recommended quant exactly as HF shows it
  4. Read the same ?local-app=llama.cpp URL as page text or HTML and extract the section under Hardware compatibility:
     - prefer its exact quant labels and sizes over generic tables
     - keep repo-specific labels such as UD-Q4_K_M or IQ4_NL_XL
     - if that section is not visible in the fetched page source, say so and fall back to the tree API plus generic quant guidance
  5. Query the tree API to confirm what actually exists (a Python sketch follows this list):
     - https://huggingface.co/api/models/<repo>/tree/main?recursive=true
     - keep entries where type is file and path ends with .gguf
     - use path and size as the source of truth for filenames and byte sizes
     - separate quantized checkpoints from mmproj-*.gguf projector files and BF16/ shard files
     - use https://huggingface.co/<repo>/tree/main only as a human fallback
  6. If the local-app snippet is not text-visible, reconstruct the command from the repo plus the chosen quant:
     - shorthand quant selection: llama-server -hf <repo>:<QUANT>
     - exact-file fallback: llama-server --hf-repo <repo> --hf-file <filename.gguf>
  7. Only suggest conversion from Transformers weights if the repo does not already expose GGUF files.
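
A minimal sketch of the tree-API step (5), assuming the response is a JSON list of entries with type, path, and size fields, which is the shape the URL above returns; verify against a live response:

import json
import urllib.request

def list_gguf_files(repo: str):
    """List GGUF files in a Hub repo, split into quantized checkpoints and extras."""
    url = f"https://huggingface.co/api/models/{repo}/tree/main?recursive=true"
    with urllib.request.urlopen(url) as resp:
        entries = json.load(resp)
    quants, extras = [], []
    for e in entries:
        if e.get("type") != "file" or not e["path"].endswith(".gguf"):
            continue
        name = e["path"].rsplit("/", 1)[-1]
        # mmproj-*.gguf projectors and BF16/ shards are not standalone quants.
        if name.startswith("mmproj-") or e["path"].startswith("BF16/"):
            extras.append((e["path"], e["size"]))
        else:
            quants.append((e["path"], e["size"]))
    return quants, extras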

Quick start

Install llama.cpp

# macOS / Linux (simplest)
brew install llama.cpp

# Windows
winget install llama.cpp

# Build from source (any platform with CMake)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release

Run directly from the Hugging Face Hub

llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0
llama-server -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q8_0

Run an exact GGUF file from the Hub

Use this when the tree API shows custom file naming or the exact HF snippet is missing.

llama-server \
    --hf-repo microsoft/Phi-3-mini-4k-instruct-gguf \
    --hf-file Phi-3-mini-4k-instruct-q4.gguf \
    -c 4096

OpenAI-compatible server check

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Write a limerick about Python exceptions"}
    ]
  }'
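
The same endpoint works with any OpenAI-compatible client. A minimal Python sketch, assuming the openai package is installed and llama-server is on its default port 8080; the model value is a placeholder, since llama-server serves whichever model it was started with:

from openai import OpenAI

# llama-server does not validate API keys, but the client requires one.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")

resp = client.chat.completions.create(
    model="local",  # placeholder name; ignored by llama-server
    messages=[{"role": "user", "content": "Write a limerick about Python exceptions"}],
)
print(resp.choices[0].message.content)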

Python bindings (llama-cpp-python)

pip install llama-cpp-python

# CUDA build
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Metal build (Apple Silicon)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

Basic generation

from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,     # 0 for CPU, 99 to offload everything
    n_threads=8,
)

out = llm("What is machine learning?", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])

Chat + streaming

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3",   # or "chatml", "mistral", etc.
)

resp = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    max_tokens=256,
)
print(resp["choices"][0]["message"]["content"])

# Streaming
for chunk in llm("Explain quantum computing:", max_tokens=256, stream=True):
    print(chunk["choices"][0]["text"], end="", flush=True)

Embeddings

llm = Llama(model_path="./model-q4_k_m.gguf", embedding=True, n_gpu_layers=35)
vec = llm.embed("This is a test sentence.")
print(f"Embedding dimension: {len(vec)}")

You can also load a GGUF straight from the Hub:

llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
    n_gpu_layers=35,
)

Choosing a quant

Use the Hub page first, generic heuristics second.
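
When the page gives no recommendation, one common fallback heuristic (an assumption here, not part of the skill source) is to pick the largest quant whose file fits the available RAM/VRAM with headroom for context; a minimal sketch using sizes from the tree API:

def pick_quant(quants, budget_bytes, headroom=1.2):
    """Pick the largest GGUF that fits a memory budget.

    quants: list of (path, size_in_bytes) tuples, e.g. from the tree API.
    headroom: rough multiplier for KV cache and runtime overhead (assumed value).
    """
    fitting = [(p, s) for p, s in quants if s * headroom <= budget_bytes]
    return max(fitting, key=lambda x: x[1]) if fitting else None

# Example: largest quant that fits in a 16 GiB budget
# best = pick_quant(quants, 16 * 1024**3)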

Extracting available GGUFs from a repo

When the user asks what GGUFs exist, return:

- the quant label for each checkpoint
- the exact .gguf filename
- the byte size reported by the tree API

Ignore unless requested:

- mmproj-*.gguf projector files
- BF16/ (or other full-precision) shard files

Use the tree API for this step:

https://huggingface.co/api/models/<repo>/tree/main?recursive=true
For a repo like unsloth/Qwen3-30B-A3B-GGUF, the local-app page can show quant chips such as UD-Q4_K_M, UD-Q5_K_M, UD-Q6_K, and Q8_0, while the tree API exposes exact file paths such as Qwen3-30B-A3B-UD-Q4_K_M.gguf and Qwen3-30B-A3B-Q8_0.gguf with byte sizes. Use the tree API to turn a quant label into an exact filename.
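
A minimal sketch of that label-to-filename step; the <model>-<QUANT>.gguf suffix convention is an assumption that holds for repos like the one above but should be checked per repo:

def resolve_quant(quants, label):
    """Find exact .gguf paths for a quant label like 'UD-Q4_K_M'.

    quants: list of (path, size) tuples from the tree API.
    A bare label like 'Q4_K_M' also suffix-matches 'UD-Q4_K_M' files,
    so all matches are returned for the caller to disambiguate.
    """
    suffix = f"-{label}.gguf"
    return [(p, s) for p, s in quants if p.endswith(suffix)]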

Search patterns

Use these URL shapes directly:

https://huggingface.co/models?apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&sort=trending
https://huggingface.co/models?search=<term>&apps=llama.cpp&num_parameters=min:0,max:24B&sort=trending
https://huggingface.co/<repo>?local-app=llama.cpp
https://huggingface.co/api/models/<repo>/tree/main?recursive=true
https://huggingface.co/<repo>/tree/main
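
These can also be assembled programmatically; a small sketch using only the query parameters shown above (the parameter names come from the URLs themselves, not from a documented search API):

from urllib.parse import urlencode

def search_url(term=None, max_size=None):
    """Build a Hub search URL for llama.cpp-compatible repos."""
    params = {"apps": "llama.cpp", "sort": "trending"}
    if term:
        params["search"] = term
    if max_size:
        # e.g. "24B", mirroring num_parameters=min:0,max:24B
        params["num_parameters"] = f"min:0,max:{max_size}"
    return "https://huggingface.co/models?" + urlencode(params)

print(search_url("qwen", "24B"))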

Output format

When answering discovery requests, prefer a compact structured result like:

Repo: <repo>
Recommended quant from HF: <label> (<size>)
llama-server: <command>
Other GGUFs:
- <filename> - <size>
- <filename> - <size>
Source URLs:
- <local-app URL>
- <tree API URL>
