Host Local Models on your PC with Kobold

Contributors

blitzen

Last Updated

05/06/2025

Host Local Models on your PC with Kobold

Hello hello! If you're here, you're probably asking, “What is this guy talking about?” Well, local LLMs are large language models that you run on your own computer or hardware instead of relying on someone else's servers. This guide primarily focuses on Windows 10/11, but Linux works as well.

Changelog

Guide version Date Details
0.3 05/06/2025 Split into 3 separate guides
0.2 05/06/2025 Added SillyTavern & Tailscale guides
0.1 05/05/2025 Initial draft

Check Your Hardware

  • RAM/VRAM: Press Ctrl + Shift + Esc > “Performance” tab.
    • VRAM: Under “GPU” (look for “Dedicated GPU Memory”)
    • RAM: Under “Memory”
  • Rule of Thumb:
    • 7B models need ~8GB RAM (use Q4/Q5 quantization)
    • 13B+ models need ~16GB+ RAM
    • Anything above that, you can probably extrapolate. (The ~8GB means RAM + VRAM combined if you offload to your GPU, and you also need to account for the context using up extra memory. See the rough sketch after this list.)
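
If you want a number instead of a rule of thumb, here's a minimal back-of-the-envelope sketch in Python. The bits-per-weight figures and the 1.2× overhead factor are my own assumptions, and it ignores context (KV cache) entirely, so treat the output as a starting point rather than a guarantee.

  # Rough "will it fit?" estimate for a quantized GGUF model.
  # Bits-per-weight values are approximate and the 1.2x overhead factor is an
  # assumption; context (KV cache) needs extra memory on top of this.
  BITS_PER_WEIGHT = {"Q4_K_M": 4.8, "Q5_K_M": 5.7, "Q6_K": 6.6, "Q8_0": 8.5}

  def estimate_gb(params_billion, quant, overhead=1.2):
      weights_gb = params_billion * BITS_PER_WEIGHT[quant] / 8  # billions of params * bits / 8 ≈ GB
      return weights_gb * overhead  # overhead covers runtime buffers

  for size, quant in [(7, "Q4_K_M"), (8, "Q5_K_M"), (13, "Q4_K_M")]:
      print(f"{size}B @ {quant}: ~{estimate_gb(size, quant):.1f} GB")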

Download a Model

  • Where? HuggingFace (search for GGUF files)
    • Starter Picks:
      • 8B: Stheno 3.2 8B or Llama 3 8B
      • 12B: MN-Violet-Lotus-12B
  • Quantization: Use Q4_K_M, Q5_K_M, or higher (avoid anything lower, they’re kinda dumb)

Install KoboldCPP

  • Download KoboldCPP (personally, the easiest way I've found to run GGUF models)
  • Open koboldcpp.exe.
  • (If you don’t have a GPU, use LM Studio! There are guides out there specifically for it)

Configure KoboldCPP

  • Click Browse and select your GGUF model file.
  • Backend Settings:
    • NVIDIA GPU? Use CuBLAS.
    • AMD GPU? Use Vulkan.
    • No GPU? Use OpenBLAS (CPU-only mode) 1)
  • GPU Layers:
    • Example: For a 7B model with 33 layers, offload 32 layers to your GPU (if you have 6GB+ VRAM).
  • Pro Tip: Start with about 80% of your VRAM capacity (6GB VRAM ≈ 32 layers for this model, but layer size varies between models!). You can also use this helpful calculator, or the rough sketch below.
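
Here's a minimal sketch of that 80% rule, assuming you estimate per-layer size by dividing the GGUF file size by the model's layer count. The example numbers are illustrative, and in practice you can often push higher (I run 32/33 layers of an 8B Q5_K_M on 6GB, see the speed section below), so treat the result as a conservative first try.

  # Conservative starting point for GPU layer offload, based on the "80% of
  # VRAM" rule. Per-layer size is approximated as (GGUF file size / layer
  # count); tune up or down from whatever this prints.
  def layers_to_offload(vram_gb, model_file_gb, total_layers, budget=0.8):
      per_layer_gb = model_file_gb / total_layers
      return min(int(vram_gb * budget / per_layer_gb), total_layers)

  # Example: ~5.7 GB Q5_K_M 8B file, 33 layers, 6 GB of VRAM
  print(layers_to_offload(6, 5.7, 33))  # prints 27, a safe first try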

Tweak Settings

  • Context Size: Start at 4096 (increase if you have memory to spare; see the KV-cache sketch after this list for what context actually costs).
  • Faster Processing: Enable MMQ, FlashAttention, ContextShift, and FastForwarding
    • MMQ: Does the matrix math directly on the quantized weights instead of converting them first, which is more VRAM-friendly
    • FlashAttention: Computes attention in a smarter, more memory-efficient order: same results, less memory traffic, faster prompt processing
    • ContextShift: Avoids reprocessing the whole chat every turn. The first prompt is slow, but after that it only processes what changed (unless you edit an earlier message). Makes regenerating prompts way faster.
    • FastForwarding: Lets the model skip over reused tokens in the context that have already been processed
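
To get a feel for what context actually costs (and what the “Quantize KV cache” option mentioned later is shrinking), here's a rough sketch. The architecture defaults are the commonly published Llama 3 8B values (32 layers, 8 KV heads, head dim 128); that's an assumption on my part, so check your model's card and treat the output as ballpark.

  # Rough KV-cache size per context length. Defaults assume Llama 3 8B's
  # published shape (32 layers, 8 KV heads, head dim 128); other models differ.
  def kv_cache_mb(ctx_tokens, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
      per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem  # K and V
      return ctx_tokens * per_token / 1024**2

  for ctx in (4096, 8192, 16384):
      print(f"{ctx} ctx: ~{kv_cache_mb(ctx):.0f} MB at FP16, "
            f"~{kv_cache_mb(ctx, bytes_per_elem=1):.0f} MB quantized to 8-bit")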

Run the Model!

  • Click Launch. Once loaded, open http://localhost:5001 in your browser to chat (or hit the API directly; see the sketch after this list).
    • If you're encountering any issues with memory, e.g. “Failed to allocate memory”
      • Try: reducing GPU layers or context size, switching to a lower quantization, or swapping to a smaller model.
      • You can also try Quantize KV Cache in the “Tokens” tab, which compresses the context down to lower VRAM/RAM usage. (Same trade-off as model quantization: more compression = more quality loss, though here it's not really noticeable.)
    • If some models straight up just don't load (usually newer ones)
      • Try updating KoboldCPP to the latest version.
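
If the browser UI loads but you want to double-check the API side, here's a minimal Python sketch using only the standard library. The /api/v1/model and /api/v1/generate paths are the KoboldAI-style routes KoboldCPP exposes as I remember them; if they 404 on your build, the server's own API docs page should list the current routes.

  # Quick API sanity check against a running KoboldCPP instance.
  # Endpoint paths are the KoboldAI-style ones (assumption; check your build's
  # API docs if these 404). Standard library only, no pip installs needed.
  import json, urllib.request

  BASE = "http://localhost:5001"

  # Which model is loaded?
  with urllib.request.urlopen(f"{BASE}/api/v1/model") as r:
      print(json.load(r))

  # One short test generation
  req = urllib.request.Request(
      f"{BASE}/api/v1/generate",
      data=json.dumps({"prompt": "Say hi in five words:", "max_length": 32}).encode(),
      headers={"Content-Type": "application/json"},
  )
  with urllib.request.urlopen(req) as r:
      print(json.load(r))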

Janitor API Setup

  • Check Remote Tunnel in KoboldCPP.
  • In JAI:
    • API Endpoint: Paste the tunnel URL KoboldCPP gives you and add /chat/completions at the end.
    • API Key: Type anything (it’s ignored).
    • Model Name: Use the filename or anything you want (e.g. stheno-8b-q5_k_m). There’s a quick test sketch after this list if you want to verify the endpoint first.
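
Before pasting the URL into JAI, you can poke the same OpenAI-style endpoint yourself. This sketch assumes your KoboldCPP build exposes /v1/chat/completions (recent versions do) and uses localhost; for Janitor itself you'd use the remote tunnel URL instead. The model name and the “anything” API key are placeholders, exactly as described above.

  # Minimal test of the OpenAI-compatible endpoint Janitor will talk to.
  # Swap localhost for your tunnel URL when testing the remote path.
  import json, urllib.request

  url = "http://localhost:5001/v1/chat/completions"
  payload = {
      "model": "stheno-8b-q5_k_m",  # name it anything, KoboldCPP doesn't care
      "messages": [{"role": "user", "content": "Hello!"}],
      "max_tokens": 64,
  }
  req = urllib.request.Request(
      url,
      data=json.dumps(payload).encode(),
      headers={"Content-Type": "application/json",
               "Authorization": "Bearer anything"},  # key is ignored, per above
  )
  with urllib.request.urlopen(req) as r:
      print(json.load(r)["choices"][0]["message"]["content"])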

SillyTavern API Setup

  • Interested in using your Kobold-hosted model in SillyTavern?
  • This is assuming you have finished your SillyTavern setup. If not, check out this guide: Wiki-Link (link, goes there once it's up)
    • Go to the API Tab in SillyTavern and use these settings
      • API: Text Completion
      • API Type: KoboldCPP
      • API Key: Don't put anything here!
      • API URL: Paste in
         http://localhost:5001/v1

Okay... Can I see an example of how much speed I'd get?

My Personal Setup:

  • PC Specs: i5-11400H, RTX 3060 (6GB VRAM), 32GB DDR4 RAM.
    • Stheno 8B (Q5_K_M):
      • Offload 32/33 layers to GPU → processes 350 tokens/sec, generates 12–17 tokens/sec.
    • MN-Violet-Lotus-12B (Q6_K):
      • Offload 26/41 layers to GPU → processes 120 tokens/sec, generates 4 tokens/sec (slow but usable). See the worked example after this list for what these numbers mean in real wait time.
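
To translate those numbers into something tangible, here's the arithmetic for a first reply on the Stheno setup above, assuming a 4096-token prompt and a ~300-token response (both assumed sizes; ContextShift means later turns skip most of the prompt processing).

  # What the Stheno numbers above feel like in practice (first message only;
  # ContextShift avoids reprocessing the whole prompt on later turns).
  prompt_tokens, reply_tokens = 4096, 300   # assumed sizes
  prompt_speed, gen_speed = 350, 15         # tokens/sec, from the figures above

  wait = prompt_tokens / prompt_speed + reply_tokens / gen_speed
  print(f"~{wait:.0f} seconds for the first full reply")  # ≈ 32 s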

End Notes

If you have any questions, either DM me at @blitzen1122 on Discord, ping me on the JAI Discord server in #AI-Models, or just Google it!

1)
OR LM Studio