Host Local Models on your PC with Kobold

Contributors

blitzen

Last Updated

05/06/2025

Host Local Models on your PC with Kobold

Hello hello! If you're here, you're probably asking, “What is this guy talking about?” Well, local LLMs are large language models that you run on your own computer or hardware instead of relying on someone else's servers. This guide primarily focuses on Windows 10/11, but Linux works as well.

Changelog

Guide version Date Details
0.3 05/06/2025 Split into 3 separate guides
0.2 05/06/2025 Added SillyTavern & Tailscale guides
0.1 05/05/2025 Initial draft

Check Your Hardware

  • RAM/VRAM: Press Ctrl + Shift + Esc > “Performance” tab.
    • VRAM: Under “GPU” (look for “Dedicated GPU Memory”)
    • RAM: Under “Memory”
  • Rule of Thumb:
    • 7B models need ~8GB RAM (use Q4/Q5 quantization)
    • 13B+ models need ~16GB+ RAM
    • Anything above that, you can probably extrapolate. (The ~8GB means RAM + VRAM combined if you offload to your GPU, and you also need to account for the context using up extra memory. See the rough sketch after this list.)
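
If you want a number instead of a rule of thumb, here's a minimal back-of-the-envelope sketch in Python. The bits-per-weight figures and the 1.2× overhead factor are my own assumptions, and it ignores context (KV cache) entirely, so treat the output as a starting point rather than a guarantee.

  # Rough "will it fit?" estimate for a quantized GGUF model.
  # Bits-per-weight values are approximate and the 1.2x overhead factor is an
  # assumption; context (KV cache) needs extra memory on top of this.
  BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

  def estimate_gb(params_billion, quant, overhead=1.2):
      weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # billions of params * bits / 8 ≈ GB
      return weights_gb * overhead  # overhead covers runtime buffers

  for size, quant in [(7, "Q4_K_M"), (8, "Q5_K_M"), (13, "Q4_K_M")]:
      print(f"{size}B @ {quant}: ~{estimate_gb(size, quant):.1f} GB")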

Download a Model

  • Where? HuggingFace (search for GGUF files)
    • Starter Picks:
      • 8B: Stheno 3.2 8B or Llama 3 8B
      • 12B: MN-Violet-Lotus-12B
  • Quantization: Use Q4_K_M, Q5_K_M, or higher (avoid anything lower, they’re kinda dumb)

Install KoboldCPP

  • Download KoboldCPP (personally, the easiest way I've found to run GGUF models)
  • Open koboldcpp.exe.
  • (If you don’t have a GPU, use LM Studio! There are guides out there specifically for it)

Configure KoboldCPP

  • Click Browse and select your GGUF model file.
  • Backend Settings:
    • NVIDIA GPU? Use CuBLAS.
    • AMD GPU? Use Vulkan.
    • No GPU? Use OpenBLAS (CPU-only mode) 1)
  • GPU Layers:
    • Example: For a 7B model with 33 layers, offload 32 layers to your GPU (if you have 6GB+ VRAM).
  • Pro Tip: Start with about 80% of your VRAM capacity (6GB VRAM ≈ 32 layers for this model, but layer size varies between models!). You can also use this helpful calculator, or the rough sketch below.
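
Here's a minimal sketch of that 80% rule, assuming you estimate per-layer size by dividing the GGUF file size by the model's layer count. The example numbers are illustrative, and in practice you can often push higher (I run 32/33 layers of an 8B Q5_K_M on 6GB, see the speed section below), so treat the result as a conservative first try.

  # Conservative starting point for GPU layer offload, based on the "80% of
  # VRAM" rule. Per-layer size is approximated as (GGUF file size / layer
  # count); tune up or down from whatever this prints.
  def layers_to_offload(vram_gb, model_file_gb, total_layers, budget=0.8):
      per_layer_gb = model_file_gb / total_layers
      return min(int(vram_gb * budget / per_layer_gb), total_layers)

  # Example: ~5.7 GB Q5_K_M 8B file, 33 layers, 6 GB of VRAM
  print(layers_to_offload(6, 5.7, 33))  # prints 27, a safe first try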

Tweak Settings

  • Context Size: Start at 4096 (increase if you have memory to spare; see the KV-cache sketch after this list for what context actually costs).
  • Faster Processing: Enable MMQ, FlashAttention, ContextShift, and FastForwarding
    • MMQ: Does the matrix math directly on the quantized weights instead of converting them first, which is more VRAM-friendly
    • FlashAttention: Computes attention in a smarter, more memory-efficient order: same results, less memory traffic, faster prompt processing
    • ContextShift: Avoids reprocessing the whole chat every turn. The first prompt is slow, but after that it only processes what changed (unless you edit an earlier message). Makes regenerating prompts way faster.
    • FastForwarding: Lets the model skip over reused tokens in the context that have already been processed
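
To get a feel for what context actually costs (and what the “Quantize KV cache” option mentioned later is shrinking), here's a rough sketch. The architecture defaults are the commonly published Llama 3 8B values (32 layers, 8 KV heads, head dim 128); that's an assumption on my part, so check your model's card and treat the output as ballpark.

  # Rough KV-cache size per context length. Defaults assume Llama 3 8B's
  # published shape (32 layers, 8 KV heads, head dim 128); other models differ.
  def kv_cache_mb(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
      per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
      return ctx_tokens * per_token / 1024**2

  for ctx in (4096, 8192, 16384):
      print(f"{ctx} ctx: ~{kv_cache_mb(ctx):.0f} MB at FP16, "
            f"~{kv_cache_mb(ctx, bytes_per_elem=1):.0f} MB quantized to 8-bit")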

Run the Model!

  • Click Launch. Once loaded, open http://localhost:5001 in your browser to chat (or hit the API directly; see the sketch after this list).
    • If you're encountering any issues with memory, e.g. “Failed to allocate memory”
      • Try: reducing GPU layers or context size, switching to a lower quantization, or swapping to a smaller model.
      • You can also try Quantize KV Cache in the “Tokens” tab, which compresses the context down to lower VRAM/RAM usage. (Same trade-off as model quantization: more compression = more quality loss, though here it's not really noticeable.)
    • If some models straight up just don't load (usually newer ones)
      • Try updating KoboldCPP to the latest version.
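
If the browser UI loads but you want to double-check the API side, here's a minimal Python sketch using only the standard library. The /api/v1/model and /api/v1/generate paths are the KoboldAI-style routes KoboldCPP exposes as I remember them; if they 404 on your build, the server's own API docs page should list the current routes.

  # Quick API sanity check against a running KoboldCPP instance.
  # Endpoint paths are the KoboldAI-style ones (assumption; check your build's
  # API docs if these 404). Standard library only, no pip installs needed.
  import json, urllib.request

  BASE = "http://localhost:5001"

  # Which model is loaded?
  with urllib.request.urlopen(f"{BASE}/api/v1/model") as r:
      print(json.load(r))

  # One short test generation
  req = urllib.request.Request(
      f"{BASE}/api/v1/generate",
      data=json.dumps({"prompt": "Say hi in five words:", "max_length": 32}).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as r:
      print(json.load(r))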

Janitor API Setup

  • Check Remote Tunnel in KoboldCPP.
  • In JAI:
    • API Endpoint: Paste the tunnel URL KoboldCPP gives you and add /chat/completions at the end.
    • API Key: Type anything (it’s ignored).
    • Model Name: Use the filename or anything you want (e.g. stheno-8b-q5_k_m). There’s a quick test sketch after this list if you want to verify the endpoint first.
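
Before pasting the URL into JAI, you can poke the same OpenAI-style endpoint yourself. This sketch assumes your KoboldCPP build exposes /v1/chat/completions (recent versions do) and uses localhost; for Janitor itself you'd use the remote tunnel URL instead. The model name and the “anything” API key are placeholders, exactly as described above.

  # Minimal test of the OpenAI-compatible endpoint Janitor will talk to.
  # Swap localhost for your tunnel URL when testing the remote path.
  import json, urllib.request

  url = "http://localhost:5001/v1/chat/completions"
  payload = {
      "model": "stheno-8b-q5_k_m",  # name it anything, KoboldCPP doesn't care
      "messages": [{"role": "user", "content": "Hello!"}],
      "max_tokens": 64,
  }
  req = urllib.request.Request(
      url,
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json",
               "Authorization": "Bearer anything"},  # key is ignored, per above
  )
  with urllib.request.urlopen(req) as r:
      print(json.load(r)["choices"][0]["message"]["content"])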

SillyTavern API Setup

  • Interested in using your Kobold-hosted model in SillyTavern?
  • This is assuming you have finished your SillyTavern setup. If not, check out this guide: Wiki-Link (link, goes there once it's up)
    • Go to the API Tab in SillyTavern and use these settings
      • API: Text Completion
      • API Type: KoboldCPP
      • API Key: Don't put anything here!
      • API URL: Paste in
         http://localhost:5001/v1

Okay... Can I see an example of how much speed I'd get?

My Personal Setup:

  • PC Specs: i5-11400H, RTX 3060 (6GB VRAM), 32GB DDR4 RAM.
    • Stheno 8B (Q5_K_M):
      • Offload 32/33 layers to GPU → processes 350 tokens/sec, generates 12–17 tokens/sec.
    • MN-Violet-Lotus-12B (Q6_K):
      • Offload 26/41 layers to GPU → processes 120 tokens/sec, generates 4 tokens/sec (slow but usable). See the worked example after this list for what these numbers mean in real wait time.
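
To translate those numbers into something tangible, here's the arithmetic for a first reply on the Stheno setup above, assuming a 4096-token prompt and a ~300-token response (both assumed sizes; ContextShift means later turns skip most of the prompt processing).

  # What the Stheno numbers above feel like in practice (first message only;
  # ContextShift avoids reprocessing the whole prompt on later turns).
  prompt_tokens, reply_tokens = 4096, 300   # assumed sizes
  prompt_speed, gen_speed = 350, 15         # tokens/sec, from the figures above

  wait = prompt_tokens / prompt_speed + reply_tokens / gen_speed
  print(f"~{wait:.0f} seconds for the first full reply")  # ≈ 32 s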

End Notes

If you have any questions, either DM me at @blitzen1122 on Discord, ping me on the JAI Discord server in #AI-Models, or just Google it!

1)
OR LM Studio