unsloth_qwen3.6_UD_Q5_K_XL

Running Qwen 3.6 27B Locally: A Quality Build for RTX 3090 Owners

Quantization compresses a model to fit in available memory. For years, the compression was uniform: every weight got the same treatment regardless of what it actually did. The result was smaller models that were also, predictably, worse.

Unsloth’s imatrix calibration changed the math. Before compressing, it runs the model on representative data and measures which weights most influence output quality. The resulting importance scores let it protect what matters and compress what does not. A 5-bit quantization built this way holds on to more quality than a standard 5-bit quantization by a meaningful margin.

The Qwen 3.6 27B UD-Q5_K_XL is 20 GB. It fits on an RTX 3090 with room for 128K of context. This guide sets it up.


What You Need

ComponentRequirementNotes
GPU24GB VRAMRTX 3090, 3090 Ti, 4090, or equivalent
RAM32GB minimum, 64GB preferredLess than 32GB creates pressure at long context
StorageSSD, 25GB freeThe model is 20 GB. An HDD will not keep up.
OSWindows 10/11

If your GPU has less than 24GB, this model will not fit. The 35B MoE variant at 12.4 GB is the better fit for 16GB cards. See the companion guide for that setup.


Why UD-Q5_K_XL

Standard GGUF quantization formats like Q5_K_M apply compression uniformly across the model, with some crude tiering that gives marginally better treatment to a handful of early layers. It works. It is also a blunt instrument.

The UD (Unsloth Dynamic) formats take a different approach. The calibration pass runs the model on real data and maps how sensitive each weight is to compression. Critical attention weights get Q8 treatment. Less sensitive feed-forward layers get Q4 or lower. The _XL suffix means more of the precision budget went to the most sensitive layers, at the cost of a slightly larger file than a plain UD-Q5_K_M.

The practical difference: UD-Q5_K_XL at 20 GB produces output noticeably closer to the Q6_K (20.5 GB) than a standard Q5_K_M would. You are not giving up half a bit of average precision equally everywhere. You are giving it up where it matters least.

For daily coding work, the quality difference over a standard quantization shows up in the places where precision matters: code that handles edge cases correctly the first time, explanations that do not drift in long conversations, outputs that stay coherent across the full 128K context.


What You Are Building

G:\qwen-local\
├── llama.cpp-tq3\                               # Inference engine (compiled)
│   └── build\bin\Release\llama-server.exe
├── models\
│   └── unsloth\
│       └── qwen3.6-27b\
│           ├── Qwen3.6-27B-UD-Q5_K_XL.gguf      (20.0 GB)
│           └── mmproj-F16.gguf                    (500 MB, optional, for vision)
├── start_36_27b_ud_q5k_xl.bat
├── start_36_27b_ud_q5k_xl_vision.bat
└── upgrade_llama.cpp-tq3.bat

I use G:\ for a dedicated SSD. Substitute your own drive letter throughout.


Phase 1: The Inference Engine

This guide uses llama.cpp-tq3, the same TurboQuant fork covered in the companion guide for the TQ3_4S models. If you followed that guide, you already have it built and can skip to Phase 2.

If you are starting fresh, the build process is covered in full in that guide. The short version: install Git, CMake, Visual Studio 2022 Build Tools with the C++ workload, and Miniconda3. Then open Anaconda Prompt (Miniconda3) from the Start menu and run:

G:
cd qwen-local
git clone https://github.com/turbo-tan/llama.cpp-tq3.git
cd llama.cpp-tq3

"C:\Program Files\CMake\bin\cmake.exe" -B build -G "Visual Studio 17 2022" -A x64 ^
  -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 ^
  -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON ^
  -DCMAKE_BUILD_TYPE=Release

"C:\Program Files\CMake\bin\cmake.exe" --build build -j --config Release

-DCMAKE_CUDA_ARCHITECTURES=86 targets the RTX 3090. If you have a different GPU, find your compute capability number at developer.nvidia.com/cuda-gpus. The build takes about ten minutes and produces llama-server.exe at build\bin\Release\.

Standard llama.cpp also works. UD-Q5_K_XL is a standard GGUF and will run on any recent llama.cpp build with CUDA support. The performance figures in this guide were measured on llama.cpp-tq3.

Keeping it up to date

Save this as G:\qwen-local\upgrade_llama.cpp-tq3.bat and run it from Anaconda Prompt whenever you want to pull the latest build:

@echo off
title Upgrade llama.cpp-tq3
echo ========================================
echo   Upgrading llama.cpp-tq3
echo ========================================
echo.

call conda activate qwen-local
cd /d G:\qwen-local\llama.cpp-tq3

if not exist ".git" (
    echo ERROR: Not a git repository. Check that G:\qwen-local\llama.cpp-tq3 exists.
    pause
    exit /b 1
)

echo [1/3] Pulling latest changes...
git pull

echo [2/3] Clearing previous build...
rmdir /s /q build

echo [3/3] Building...
"C:\Program Files\CMake\bin\cmake.exe" -B build -G "Visual Studio 17 2022" -A x64 -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON -DCMAKE_BUILD_TYPE=Release && "C:\Program Files\CMake\bin\cmake.exe" --build build -j --config Release

if %errorlevel% neq 0 (
    echo.
    echo ERROR: Build failed. Check output above.
    pause
    exit /b 1
)

echo.
echo ========================================
echo   Done! llama-server.exe is ready.
echo ========================================
pause

The build deletes and rebuilds from scratch each time. This is intentional. Incremental builds after a pull occasionally produce subtle bugs.


Phase 2: Download the Model

Install Miniconda3 if you do not have it. Then open Anaconda Prompt (Miniconda3) and run:

conda create -n qwen-local python=3.11 -y
conda activate qwen-local
pip install "huggingface_hub[cli]"

The hf tool handles large resumable downloads. If your connection drops midway, running the same command again picks up where it left off.

Still in Anaconda Prompt with the qwen-local environment active, download the model:

mkdir G:\qwen-local\models\unsloth\qwen3.6-27b

hf download unsloth/Qwen3.6-27B-GGUF Qwen3.6-27B-UD-Q5_K_XL.gguf ^
  --local-dir G:\qwen-local\models\unsloth\qwen3.6-27b

If you want vision (image input), grab the encoder too:

hf download unsloth/Qwen3.6-27B-GGUF mmproj-F16.gguf ^
  --local-dir G:\qwen-local\models\unsloth\qwen3.6-27b

The vision encoder is 500 MB. Download it now if you might want it later. It cannot be added to a running server.


Phase 3: VRAM and the KV Cache

The model weights take 20.0 GB. The RTX 3090 has 24 GB. That leaves 4 GB for the KV cache and everything else touching the GPU.

The KV cache is the working memory the model builds as the conversation grows. It scales directly with context length. At 128K tokens with q4_0 compression, the KV cache uses approximately 2.5 GB, putting total VRAM around 22 GB. That fits, with about 2 GB of headroom.

The options

KV SettingVRAM at 128KQualityNotes
-ctk q4_0 -ctv q4_0~2.5 GBGoodDefault. Use this.
-ctk q8_0 -ctv q8_0~5 GBNear-perfectFits at 64K context, not 128K

With this model on a 3090, q4_0 is effectively the default setting. The weights consume too much VRAM to leave room for q8_0 KV at full context. q4_0 at 128K context is the right call. The quality loss from KV compression is not where the meaningful quality difference lies.

If the conversation runs long and you hit an out-of-memory error, reduce context rather than changing KV type:

set CONTEXT=98304

Dropping from 128K to 96K saves approximately 600 MB of KV cache and usually clears the OOM without restarting the server.


Phase 4: The Launch Script

Save this as G:\qwen-local\start_36_27b_ud_q5k_xl.bat. Open Notepad, paste the content below, and save with the .bat extension. Run it from Anaconda Prompt.

@echo off
title Qwen 3.6 27B UD-Q5_K_XL - RTX 3090
echo ========================================
echo   Qwen 3.6 27B UD-Q5_K_XL (Unsloth)
echo   Size: 20.0 GB
echo   Port: 1234
echo ========================================
echo.

call conda activate qwen-local
cd /d G:\qwen-local\llama.cpp-tq3\build\bin\Release

:: === TUNABLE PARAMETERS ===
set CONTEXT=131072
set KEY_TYPE=q4_0
set VAL_TYPE=q4_0
set BATCH_SIZE=4096
:: ===========================

echo Context: %CONTEXT% tokens
echo KV Cache - Key: %KEY_TYPE%, Value: %VAL_TYPE%
echo.

llama-server.exe ^
  -m G:\qwen-local\models\unsloth\qwen3.6-27b\Qwen3.6-27B-UD-Q5_K_XL.gguf ^
  -ngl 99 ^
  --host 0.0.0.0 --port 1234 ^
  -c %CONTEXT% ^
  -fa on ^
  -ctk %KEY_TYPE% -ctv %VAL_TYPE% ^
  -b %BATCH_SIZE% ^
  --jinja ^
  --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0

pause
ParameterDefaultRangeEffect
CONTEXT13107298304 → 131072Lower saves ~600 MB KV on OOM
KEY_TYPEq4_0q4_0 at 128KSwitch to q8_0 only if you reduce CONTEXT to 65536
VAL_TYPEq4_0q4_0 at 128KUsually matches KEY_TYPE
BATCH_SIZE40962048 → 8192Higher speeds up prompt processing

-ngl 99 offloads all layers to the GPU. -fa on enables Flash Attention. --jinja activates the correct chat template handling. Leave --temp 0.6 --top-k 20 --top-p 0.95 alone. These are Qwen 3.6’s published sampling parameters.

Vision variant

Save this as G:\qwen-local\start_36_27b_ud_q5k_xl_vision.bat. Requires mmproj-F16.gguf downloaded to the same folder as the model. Run it from Anaconda Prompt.

@echo off
title Qwen 3.6 27B UD-Q5_K_XL Vision - RTX 3090
echo ========================================
echo   Qwen 3.6 27B UD-Q5_K_XL (Unsloth) + Vision
echo   Size: 20.0 GB
echo   Port: 1234
echo ========================================
echo.

call conda activate qwen-local
cd /d G:\qwen-local\llama.cpp-tq3\build\bin\Release

:: === TUNABLE PARAMETERS ===
set CONTEXT=131072
set KEY_TYPE=q4_0
set VAL_TYPE=q4_0
set BATCH_SIZE=4096
:: ===========================

echo Context: %CONTEXT% tokens
echo KV Cache - Key: %KEY_TYPE%, Value: %VAL_TYPE%
echo Vision: enabled
echo.

llama-server.exe ^
  -m G:\qwen-local\models\unsloth\qwen3.6-27b\Qwen3.6-27B-UD-Q5_K_XL.gguf ^
  --mmproj G:\qwen-local\models\unsloth\qwen3.6-27b\mmproj-F16.gguf ^
  --image-min-tokens 1024 ^
  -ngl 99 ^
  --host 0.0.0.0 --port 1234 ^
  -c %CONTEXT% ^
  -fa on ^
  -ctk %KEY_TYPE% -ctv %VAL_TYPE% ^
  -b %BATCH_SIZE% ^
  --jinja ^
  --temp 0.6 --top-k 20 --top-p 0.95 --presence-penalty 0.0

pause

--image-min-tokens 1024 improves spatial accuracy for tasks like reading text from a screenshot. Without it, the model tends to describe images loosely rather than ground them precisely. Omit it for casual image chat where precise spatial grounding is not needed.


Phase 5: Connecting a Harness

The model runs as an OpenAI-compatible server on port 1234. Any client that supports a custom base URL works: OpenWebUI for browser-based chat, Continue.dev for VS Code integration, or OpenCode for a terminal-based coding session. The instructions below cover OpenCode running in WSL, which is how I use it.

5.1 Find your WSL gateway IP

From WSL2 (Kali Linux):

ip route show default | awk '{print $3; exit}'

This returns the IP address your WSL instance uses to reach the Windows host. The default is usually 172.19.64.1 but it can change between reboots. Use the actual output of that command in the config below.

5.2 Add the provider to OpenCode

In WSL2 (Kali Linux), edit ~/.config/opencode/opencode.json and add:

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "qwen3.6-27b-q5xl": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Qwen 3.6 27B UD-Q5_K_XL (128K)",
      "options": {
        "baseURL": "http://172.19.64.1:1234/v1",
        "toolParser": [{ "type": "raw-function-call" }, { "type": "json" }]
      },
      "models": {
        "Qwen3.6-27B-UD-Q5_K_XL.gguf": {
          "name": "Qwen 3.6 27B UD-Q5_K_XL",
          "tool_call": true,
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "limit": { "context": 131072, "output": 8192 }
        }
      }
    }
  },
  "model": "qwen3.6-27b-q5xl/Qwen3.6-27B-UD-Q5_K_XL.gguf"
}

Replace 172.19.64.1 with the IP from the gateway check above. The modalities block enables image input in OpenCode. The vision bat file must be the one running for images to actually process. The modalities block alone does nothing if the server started without --mmproj.

5.3 Test the connection

Before opening OpenCode, confirm the server is reachable from WSL2 (Kali Linux):

curl http://172.19.64.1:1234/v1/models

You should see a JSON response listing the model filename. If you get a connection error, the server is not running or the IP is wrong.


Phase 6: Daily Use

  1. In Anaconda Prompt, run start_36_27b_ud_q5k_xl.bat
  2. Wait for “llama server listening” in that terminal window
  3. In WSL2 (Kali Linux) within Windows Terminal, run opencode
  4. Verify with /model that it shows the UD-Q5_K_XL

Model loading takes 15–20 seconds. Time to first token ranges from about 7 seconds at the start of a fresh conversation to around 30 seconds as the context fills. The model prefills the entire context before generating the first output token. Longer history means longer wait before anything appears. This is expected behavior, not a sign that something is wrong.

Before each response, the model produces a brief reasoning phase visible as gray text in OpenCode. This is the Qwen3 hybrid thinking mode: a short internal plan before committing to an answer. It is not extended chain-of-thought reasoning. The output that follows tends to pick a direction and stay there.

Practical context limit

The theoretical context window is 128K tokens. The practical working window is roughly half that.

Past 45–55% fill, shown as a running percentage in the OpenCode interface, early context starts losing influence. The model does not lose those tokens outright, but its attention over them thins to the point where specific details become unreachable. Instructions loaded at the start of a session may need to be re-read explicitly before the model can act on them reliably.

There is no autocompact equivalent for local models. The context does not compress; it just fills. Treat the 50% mark as a signal to wrap up the current task and start a fresh thread rather than pushing into the second half of the window and noticing the degradation mid-task.

Monitoring VRAM

From Windows Terminal (PowerShell):

nvidia-smi -l 1

Or install nvitop once from Anaconda Prompt and use it as your standard monitor:

pip install nvitop
nvitop

Consistently under 22 GB means you have reasonable headroom. Touching 23 GB means the next message may OOM. Reduce CONTEXT to 98304 in the bat file and restart the server.


Performance at a Glance

SettingVRAM (approx)ContextNotes
131K context, q4_0~22 GB128KDefault configuration
96K context, q4_0~21.4 GB96KOOM recovery. Adds ~600 MB headroom.
64K context, q8_0~22 GB64KBetter KV quality if 128K is not needed

Generation speed on the RTX 3090 averages around 27 tokens per second, sometimes higher. Time to first token ranges from about 7 seconds on a fresh short context to around 30 seconds as the conversation fills toward 128K. If you are coming from TurboQuant models optimized for a different compression architecture, those numbers look slower. What UD-Q5_K_XL offers is not speed. It is quality per VRAM, and on that axis it is close to the best you can do on 24 GB.


Troubleshooting

ProblemFix
OOM on server startupAnother process is holding VRAM. Close games, browsers, other ML tools and try again.
OOM mid-conversationReduce CONTEXT to 98304 in the bat file and restart
CUDA error during generationContext exceeded available VRAM mid-fill. Reduce CONTEXT.
Slow prompt processingIncrease BATCH_SIZE to 8192
Incoherent or degraded outputConfirm --jinja is in the launch command
Model file not foundCheck that the -m path in the bat matches the exact filename on disk
Vision input not workingConfirm --mmproj is in the bat, mmproj-F16.gguf exists at that path, and modalities block is in the OpenCode config
WSL cannot reach the serverRun ip route show default in WSL and update baseURL in opencode.json
conda activate failsRun conda init bash in WSL, restart the shell, then try again

One Trade-Off Worth Naming

This is not the fastest model you can run on an RTX 3090. The 35B MoE in the companion guide generates faster and supports a larger context window. On raw throughput, it wins.

In extended coding sessions, that advantage evaporates into the over-thinking loops. After one or two user queries, the reasoning phase runs (even when turned off), and loops continually in a state of indecision. Code tends not to get written and when it is, it’s not very good. Explicit instruction in AGENTS.md or a system prompt to stop after five decision iterations do not help. The model goes into committee and stays there. Faster tokens are no use when they are all indecision.

Real coding does not happen in one shot. A session is a thread. It ends when context fills, and the next thread picks up from a handoff: a markdown file that captures what was done, what was decided, and what comes next. The next agent reads it and continues. The work is distributed across threads by design. Plenty of YouTube benchmarks test models by asking for a complete application in a single prompt. That measures something, but not this. The MoE might pass that test. In stepped thread use it compounds its indecision across sessions, building layers of unresolved thinking where a clean decision log belongs.

The UD-Q5_K_XL does not pretend to one-shot anything. It is built for thread-by-thread, step-by-step work: a harness that keeps state, conventions captured in project lore, handoff documents that carry context between threads. Given those tools, it is decisive, follows a thread to completion, and hands off cleanly.

UD-Q5_K_XL is the answer to a different question: what is sweet spot for the best output quality a human can comfortably run locally on 24 GB of VRAM? Imatrix scoring keeps the weights that matter. The result is close to Q6_K quality at Q5_K size. You will notice the difference on tasks where precision is the point: code that handles edge cases on the first pass, explanations that hold their logic across a long document, outputs that do not drift.

The calculation has not changed: your hardware, your terms, no subscription. UD-Q5_K_XL just makes that calculation work out to a better answer.

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *