Running Qwen 3.6 27B Locally: A Quality Build for RTX 3090 Owners
Quantization compresses a model to fit in available memory. For years, the compression was uniform: every weight got the same treatment regardless of what it actually did. The result was smaller models that were also, predictably, worse.
Unsloth’s imatrix calibration changed the math. Before compressing, it runs the model on representative data and measures which weights most influence output quality. The resulting importance scores let it protect what matters and compress what does not. A 5-bit quantization built this way holds on to more quality than a standard 5-bit quantization by a meaningful margin.
The Qwen 3.6 27B UD-Q5_K_XL is 20 GB. It fits on an RTX 3090 with room for 128K of context. This guide sets it up.
What You Need
| Component | Requirement | Notes |
|---|---|---|
| GPU | 24GB VRAM | RTX 3090, 3090 Ti, 4090, or equivalent |
| RAM | 32GB minimum, 64GB preferred | Less than 32GB creates pressure at long context |
| Storage | SSD, 25GB free | The model is 20 GB. An HDD will not keep up. |
| OS | Windows 10/11 |
If your GPU has less than 24GB, this model will not fit. The 35B MoE variant at 12.4 GB is the better fit for 16GB cards. See the companion guide for that setup.
Why UD-Q5_K_XL
Standard GGUF quantization formats like Q5_K_M apply compression uniformly across the model, with some crude tiering that gives marginally better treatment to a handful of early layers. It works. It is also a blunt instrument.
The UD (Unsloth Dynamic) formats take a different approach. The calibration pass runs the model on real data and maps how sensitive each weight is to compression. Critical attention weights get Q8 treatment. Less sensitive feed-forward layers get Q4 or lower. The _XL suffix means more of the precision budget went to the most sensitive layers, at the cost of a slightly larger file than a plain UD-Q5_K_M.
The practical difference: UD-Q5_K_XL at 20 GB produces output noticeably closer to the Q6_K (20.5 GB) than a standard Q5_K_M would. You are not giving up half a bit of average precision equally everywhere. You are giving it up where it matters least.
For daily coding work, the quality difference over a standard quantization shows up in the places where precision matters: code that handles edge cases correctly the first time, explanations that do not drift in long conversations, outputs that stay coherent across the full 128K context.
What You Are Building
G:\qwen-local\ ├── llama.cpp-tq3\ # Inference engine (compiled) │ └── build\bin\Release\llama-server.exe ├── models\ │ └── unsloth\ │ └── qwen3.6-27b\ │ ├── Qwen3.6-27B-UD-Q5_K_XL.gguf (20.0 GB) │ └── mmproj-F16.gguf (500 MB, optional, for vision) ├── start_36_27b_ud_q5k_xl.bat ├── start_36_27b_ud_q5k_xl_vision.bat └── upgrade_llama.cpp-tq3.bat
I use G:\ for a dedicated SSD. Substitute your own drive letter throughout.
Phase 1: The Inference Engine
This guide uses llama.cpp-tq3, the same TurboQuant fork covered in the companion guide for the TQ3_4S models. If you followed that guide, you already have it built and can skip to Phase 2.
If you are starting fresh, the build process is covered in full in that guide. The short version: install Git, CMake, Visual Studio 2022 Build Tools with the C++ workload, and Miniconda3. Then open Anaconda Prompt (Miniconda3) from the Start menu and run:
G: cd qwen-local git clone https://github.com/turbo-tan/llama.cpp-tq3.git cd llama.cpp-tq3 "C:\Program Files\CMake\bin\cmake.exe" -B build -G "Visual Studio 17 2022" -A x64 ^ -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 ^ -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON ^ -DCMAKE_BUILD_TYPE=Release "C:\Program Files\CMake\bin\cmake.exe" --build build -j --config Release
-DCMAKE_CUDA_ARCHITECTURES=86 targets the RTX 3090. If you have a different GPU, find your compute capability number at developer.nvidia.com/cuda-gpus. The build takes about ten minutes and produces llama-server.exe at build\bin\Release\.
Standard llama.cpp also works. UD-Q5_K_XL is a standard GGUF and will run on any recent llama.cpp build with CUDA support. The performance figures in this guide were measured on llama.cpp-tq3.
Keeping it up to date
Save this as G:\qwen-local\upgrade_llama.cpp-tq3.bat and run it from Anaconda Prompt whenever you want to pull the latest build:
@echo off
title Upgrade llama.cpp-tq3
echo ========================================
echo Upgrading llama.cpp-tq3
echo ========================================
echo.
call conda activate qwen-local
cd /d G:\qwen-local\llama.cpp-tq3
if not exist ".git" (
echo ERROR: Not a git repository. Check that G:\qwen-local\llama.cpp-tq3 exists.
pause
exit /b 1
)
echo [1/3] Pulling latest changes...
git pull
echo [2/3] Clearing previous build...
rmdir /s /q build
echo [3/3] Building...
"C:\Program Files\CMake\bin\cmake.exe" -B build -G "Visual Studio 17 2022" -A x64 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON -DCMAKE_BUILD_TYPE=Release && "C:\Program Files\CMake\bin\cmake.exe" --build build -j --config Release
if %errorlevel% neq 0 (
echo.
echo ERROR: Build failed. Check output above.
pause
exit /b 1
)
echo.
echo ========================================
echo Done! llama-server.exe is ready.
echo ========================================
pause
The build deletes and rebuilds from scratch each time. This is intentional. Incremental builds after a pull occasionally produce subtle bugs.
Phase 2: Download the Model
Install Miniconda3 if you do not have it. Then open Anaconda Prompt (Miniconda3) and run:
conda create -n qwen-local python=3.11 -y conda activate qwen-local pip install "huggingface_hub[cli]"
The hf tool handles large resumable downloads. If your connection drops midway, running the same command again picks up where it left off.
Still in Anaconda Prompt with the qwen-local environment active, download the model:
mkdir G:\qwen-local\models\unsloth\qwen3.6-27b hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q5_K_XL.gguf ^ --local-dir G:\qwen-local\models\unsloth\qwen3.6-27b
If you want vision (image input), grab the encoder too:
hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf ^ --local-dir G:\qwen-local\models\unsloth\qwen3.6-27b
The vision encoder is 500 MB. Download it now if you might want it later. It cannot be added to a running server.
Phase 3: VRAM and the KV Cache
The model weights take 20.0 GB. The RTX 3090 has 24 GB. That leaves 4 GB for the KV cache and everything else touching the GPU.
The KV cache is the working memory the model builds as the conversation grows. It scales directly with context length. At 128K tokens with q4_0 compression, the KV cache uses approximately 2.5 GB, putting total VRAM around 22 GB. That fits, with about 2 GB of headroom.
The options
| KV Setting | VRAM at 128K | Quality | Notes |
|---|---|---|---|
-ctk q4_0 -ctv q4_0 | ~2.5 GB | Good | Default. Use this. |
-ctk q8_0 -ctv q8_0 | ~5 GB | Near-perfect | Fits at 64K context, not 128K |
With this model on a 3090, q4_0 is effectively the default setting. The weights consume too much VRAM to leave room for q8_0 KV at full context. q4_0 at 128K context is the right call. The quality loss from KV compression is not where the meaningful quality difference lies.
If the conversation runs long and you hit an out-of-memory error, reduce context rather than changing KV type:
set CONTEXT=98304
Dropping from 128K to 96K saves approximately 600 MB of KV cache and usually clears the OOM without restarting the server.
Phase 4: The Launch Script
Save this as G:\qwen-local\start_36_27b_ud_q5k_xl.bat. Open Notepad, paste the content below, and save with the .bat extension. Run it from Anaconda Prompt.
@echo off title Qwen 3.6 27B UD-Q5_K_XL - RTX 3090 echo ======================================== echo Qwen 3.6 27B UD-Q5_K_XL (Unsloth) echo Size: 20.0 GB echo Port: 1234 echo ======================================== echo. call conda activate qwen-local cd /d G:\qwen-local\llama.cpp-tq3\build\bin\Release :: === TUNABLE PARAMETERS === set CONTEXT=131072 set KEY_TYPE=q4_0 set VAL_TYPE=q4_0 set BATCH_SIZE=4096 :: =========================== echo Context: %CONTEXT% tokens echo KV Cache - Key: %KEY_TYPE%, Value: %VAL_TYPE% echo. llama-server.exe ^ -m G:\qwen-local\models\unsloth\qwen3.6-27b\Qwen3.6-27B-UD-Q5_K_XL.gguf ^ -ngl 99 ^ --host 0.0.0.0 --port 1234 ^ -c %CONTEXT% ^ -fa on ^ -ctk %KEY_TYPE% -ctv %VAL_TYPE% ^ -b %BATCH_SIZE% ^ --jinja ^ --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0 pause
| Parameter | Default | Range | Effect |
|---|---|---|---|
CONTEXT | 131072 | 98304 → 131072 | Lower saves ~600 MB KV on OOM |
KEY_TYPE | q4_0 | q4_0 at 128K | Switch to q8_0 only if you reduce CONTEXT to 65536 |
VAL_TYPE | q4_0 | q4_0 at 128K | Usually matches KEY_TYPE |
BATCH_SIZE | 4096 | 2048 → 8192 | Higher speeds up prompt processing |
-ngl 99 offloads all layers to the GPU. -fa on enables Flash Attention. --jinja activates the correct chat template handling. Leave --temp 0.6 --top-k 20 --top-p 0.95 alone. These are Qwen 3.6’s published sampling parameters.
Vision variant
Save this as G:\qwen-local\start_36_27b_ud_q5k_xl_vision.bat. Requires mmproj-F16.gguf downloaded to the same folder as the model. Run it from Anaconda Prompt.
@echo off title Qwen 3.6 27B UD-Q5_K_XL Vision - RTX 3090 echo ======================================== echo Qwen 3.6 27B UD-Q5_K_XL (Unsloth) + Vision echo Size: 20.0 GB echo Port: 1234 echo ======================================== echo. call conda activate qwen-local cd /d G:\qwen-local\llama.cpp-tq3\build\bin\Release :: === TUNABLE PARAMETERS === set CONTEXT=131072 set KEY_TYPE=q4_0 set VAL_TYPE=q4_0 set BATCH_SIZE=4096 :: =========================== echo Context: %CONTEXT% tokens echo KV Cache - Key: %KEY_TYPE%, Value: %VAL_TYPE% echo Vision: enabled echo. llama-server.exe ^ -m G:\qwen-local\models\unsloth\qwen3.6-27b\Qwen3.6-27B-UD-Q5_K_XL.gguf ^ --mmproj G:\qwen-local\models\unsloth\qwen3.6-27b\mmproj-F16.gguf ^ --image-min-tokens 1024 ^ -ngl 99 ^ --host 0.0.0.0 --port 1234 ^ -c %CONTEXT% ^ -fa on ^ -ctk %KEY_TYPE% -ctv %VAL_TYPE% ^ -b %BATCH_SIZE% ^ --jinja ^ --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0 pause
--image-min-tokens 1024 improves spatial accuracy for tasks like reading text from a screenshot. Without it, the model tends to describe images loosely rather than ground them precisely. Omit it for casual image chat where precise spatial grounding is not needed.
Phase 5: Connecting a Harness
The model runs as an OpenAI-compatible server on port 1234. Any client that supports a custom base URL works: OpenWebUI for browser-based chat, Continue.dev for VS Code integration, or OpenCode for a terminal-based coding session. The instructions below cover OpenCode running in WSL, which is how I use it.
5.1 Find your WSL gateway IP
From WSL2 (Kali Linux):
ip route show default | awk '{print $3; exit}'
This returns the IP address your WSL instance uses to reach the Windows host. The default is usually 172.19.64.1 but it can change between reboots. Use the actual output of that command in the config below.
5.2 Add the provider to OpenCode
In WSL2 (Kali Linux), edit ~/.config/opencode/opencode.json and add:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"qwen3.6-27b-q5xl": {
"npm": "@ai-sdk/openai-compatible",
"name": "Qwen 3.6 27B UD-Q5_K_XL (128K)",
"options": {
"baseURL": "http://172.19.64.1:1234/v1",
"toolParser": [{ "type": "raw-function-call" }, { "type": "json" }]
},
"models": {
"Qwen3.6-27B-UD-Q5_K_XL.gguf": {
"name": "Qwen 3.6 27B UD-Q5_K_XL",
"tool_call": true,
"modalities": {
"input": ["text", "image"],
"output": ["text"]
},
"limit": { "context": 131072, "output": 8192 }
}
}
}
},
"model": "qwen3.6-27b-q5xl/Qwen3.6-27B-UD-Q5_K_XL.gguf"
}
Replace 172.19.64.1 with the IP from the gateway check above. The modalities block enables image input in OpenCode. The vision bat file must be the one running for images to actually process. The modalities block alone does nothing if the server started without --mmproj.
5.3 Test the connection
Before opening OpenCode, confirm the server is reachable from WSL2 (Kali Linux):
curl http://172.19.64.1:1234/v1/models
You should see a JSON response listing the model filename. If you get a connection error, the server is not running or the IP is wrong.
Phase 6: Daily Use
- In Anaconda Prompt, run
start_36_27b_ud_q5k_xl.bat - Wait for “llama server listening” in that terminal window
- In WSL2 (Kali Linux) within Windows Terminal, run
opencode - Verify with
/modelthat it shows the UD-Q5_K_XL
Model loading takes 15–20 seconds. Time to first token ranges from about 7 seconds at the start of a fresh conversation to around 30 seconds as the context fills. The model prefills the entire context before generating the first output token. Longer history means longer wait before anything appears. This is expected behavior, not a sign that something is wrong.
Before each response, the model produces a brief reasoning phase visible as gray text in OpenCode. This is the Qwen3 hybrid thinking mode: a short internal plan before committing to an answer. It is not extended chain-of-thought reasoning. The output that follows tends to pick a direction and stay there.
Practical context limit
The theoretical context window is 128K tokens. The practical working window is roughly half that.
Past 45–55% fill, shown as a running percentage in the OpenCode interface, early context starts losing influence. The model does not lose those tokens outright, but its attention over them thins to the point where specific details become unreachable. Instructions loaded at the start of a session may need to be re-read explicitly before the model can act on them reliably.
There is no autocompact equivalent for local models. The context does not compress; it just fills. Treat the 50% mark as a signal to wrap up the current task and start a fresh thread rather than pushing into the second half of the window and noticing the degradation mid-task.
Monitoring VRAM
From Windows Terminal (PowerShell):
nvidia-smi -l 1
Or install nvitop once from Anaconda Prompt and use it as your standard monitor:
pip install nvitop nvitop
Consistently under 22 GB means you have reasonable headroom. Touching 23 GB means the next message may OOM. Reduce CONTEXT to 98304 in the bat file and restart the server.
Performance at a Glance
| Setting | VRAM (approx) | Context | Notes |
|---|---|---|---|
| 131K context, q4_0 | ~22 GB | 128K | Default configuration |
| 96K context, q4_0 | ~21.4 GB | 96K | OOM recovery. Adds ~600 MB headroom. |
| 64K context, q8_0 | ~22 GB | 64K | Better KV quality if 128K is not needed |
Generation speed on the RTX 3090 averages around 27 tokens per second, sometimes higher. Time to first token ranges from about 7 seconds on a fresh short context to around 30 seconds as the conversation fills toward 128K. If you are coming from TurboQuant models optimized for a different compression architecture, those numbers look slower. What UD-Q5_K_XL offers is not speed. It is quality per VRAM, and on that axis it is close to the best you can do on 24 GB.
Troubleshooting
| Problem | Fix |
|---|---|
| OOM on server startup | Another process is holding VRAM. Close games, browsers, other ML tools and try again. |
| OOM mid-conversation | Reduce CONTEXT to 98304 in the bat file and restart |
| CUDA error during generation | Context exceeded available VRAM mid-fill. Reduce CONTEXT. |
| Slow prompt processing | Increase BATCH_SIZE to 8192 |
| Incoherent or degraded output | Confirm --jinja is in the launch command |
| Model file not found | Check that the -m path in the bat matches the exact filename on disk |
| Vision input not working | Confirm --mmproj is in the bat, mmproj-F16.gguf exists at that path, and modalities block is in the OpenCode config |
| WSL cannot reach the server | Run ip route show default in WSL and update baseURL in opencode.json |
conda activate fails | Run conda init bash in WSL, restart the shell, then try again |
One Trade-Off Worth Naming
This is not the fastest model you can run on an RTX 3090. The 35B MoE in the companion guide generates faster and supports a larger context window. On raw throughput, it wins.
In extended coding sessions, that advantage evaporates into the over-thinking loops. After one or two user queries, the reasoning phase runs (even when turned off), and loops continually in a state of indecision. Code tends not to get written and when it is, it’s not very good. Explicit instruction in AGENTS.md or a system prompt to stop after five decision iterations do not help. The model goes into committee and stays there. Faster tokens are no use when they are all indecision.
Real coding does not happen in one shot. A session is a thread. It ends when context fills, and the next thread picks up from a handoff: a markdown file that captures what was done, what was decided, and what comes next. The next agent reads it and continues. The work is distributed across threads by design. Plenty of YouTube benchmarks test models by asking for a complete application in a single prompt. That measures something, but not this. The MoE might pass that test. In stepped thread use it compounds its indecision across sessions, building layers of unresolved thinking where a clean decision log belongs.
The UD-Q5_K_XL does not pretend to one-shot anything. It is built for thread-by-thread, step-by-step work: a harness that keeps state, conventions captured in project lore, handoff documents that carry context between threads. Given those tools, it is decisive, follows a thread to completion, and hands off cleanly.
UD-Q5_K_XL is the answer to a different question: what is sweet spot for the best output quality a human can comfortably run locally on 24 GB of VRAM? Imatrix scoring keeps the weights that matter. The result is close to Q6_K quality at Q5_K size. You will notice the difference on tasks where precision is the point: code that handles edge cases on the first pass, explanations that hold their logic across a long document, outputs that do not drift.
The calculation has not changed: your hardware, your terms, no subscription. UD-Q5_K_XL just makes that calculation work out to a better answer.