Running Qwen 3.6 Locally: A Setup Guide for RTX Owners
Something changed in the last few months. Local models reached a threshold: not perfect, not always faster than hosted alternatives, but good enough to be genuinely useful for long-context coding work on consumer hardware. Meanwhile, cloud AI terms keep shifting, and the direction is not toward individual developers running personal setups.
The calculation is simple: something that runs on your own machine, for the cost of electricity, with no terms to re-read. It does not have to be the best model available. It has to work.
We are apparently there now. It can only get better.
Qwen 3.6 35B MoE fits in 12.4 GB of VRAM, generates at around 111 tokens per second on an RTX 3090, and supports a 262K token context window. That is long enough to hold most codebases. Because of its Mixture of Experts architecture, it uses far less VRAM for the KV cache than a dense model of comparable size. This means the full 262K context actually fits, comfortably, on a single consumer card.
This post is a working recipe. I run this model on an RTX 3090, and this is exactly how I set it up.
What You Need
| Component | Spec | Status |
|---|---|---|
| GPU | RTX 3090 24GB VRAM | ✅ |
| RAM | 64GB | ✅ |
| Storage | SSD, 30GB+ free | ✅ |
If your GPU has at least 16GB VRAM, most of this still applies. You will need to reduce the context and KV cache settings in Phase 3 to fit your memory budget.
Why Qwen 3.6
Until recently, long context was an infrastructure problem. When a model processes a long conversation or a large codebase, it builds an attention cache: the working memory that lets it connect a reference on line 3 to a bug on line 4,000. That cache grows with every token. At 100K to 200K tokens, it becomes large enough that running it required dedicated server hardware with tens of gigabytes of fast memory. That meant cloud endpoints and someone else’s terms.
Several things had to change at once. Flash Attention rewrote how the attention computation works: instead of loading the full cache into memory at each step, it tiles the calculation, keeping memory use roughly flat as context grows. Quantization formats got smarter: early approaches compressed model weights indiscriminately and paid a quality penalty; newer formats identify which weights matter most and protect them, so a 12GB compressed model retains most of the quality of a full-precision original. The KV cache turned out to be compressible too, with near-lossless results at half the original size. Mixture of Experts architecture changed the basic math: a 35B MoE model only activates a fraction of its parameters per token, so the effective compute cost at inference is far smaller than the total parameter count suggests, and crucially, only the attention layers contribute to the KV cache, not the expert layers. Qwen 3.6 adds DeltaNet with Gated Attention on top, designed not just to store long contexts but to reason across them without the quality degradation that typically sets in past 100K tokens.
None of these are new ideas. What changed is that they arrived together, at this model size, in a format that fits on one consumer card.
| Feature | Qwen 3.5 | Qwen 3.6 35B MoE | What it means |
|---|---|---|---|
| Native context | 128K | 262K | Load an entire codebase |
| Model size | 16.5 GB | 12.4 GB | Fits on one consumer GPU |
| Speed | ~20 tok/s | ~111 tok/s | 5x faster |
| KV cache @ 262K | ~8 GB | ~2.7 GB | MoE advantage: fewer attn layers |
| Architecture | Standard | DeltaNet + Gated Attention | Reasons better at long context |
The KV cache difference is the number that matters most for your hardware. A dense model at 262K context needs roughly 8GB just for the cache. The 35B MoE needs 2.7GB — because only 10 of its 40 layers are attention layers. The rest are SSM layers that use a fixed-size recurrent state instead of a growing cache. That is why the full 262K context fits with headroom to spare on a 24GB card.
Project Structure
Here is what we are building. I use G:\ — a dedicated SSD I keep for coding projects — but anywhere with 30GB of fast storage works. Substitute your own drive letter throughout.
G:\qwen-local\ ├── llama.cpp-tq3\ # TurboQuant engine ├── models\ │ └── qwen3.6-35b\ # 35B MoE (12.4 GB) │ ├── Qwen3.6-35B-A3B-TQ3_4S.gguf │ └── mmproj-BF16.gguf # Vision support ├── start_35b.bat
Phase 1: Build the Engine (One-Time)
This phase compiles the TurboQuant engine from source. You do it once. Before you start, you need four things installed on Windows: Git, CMake, Ninja, and the Visual Studio 2022 Build Tools with the C++ workload. The CUDA Toolkit from NVIDIA is also required for GPU compilation. The winget commands below handle the first three. The Visual Studio Build Tools and CUDA Toolkit installs each require a few manual steps that are outside the scope of this post. If you need a walkthrough, see [companion post: Setting Up a C++/CUDA Build Environment on Windows, coming soon].
powershell
winget install --id Git.Git -e winget install --id Kitware.CMake -e winget install --id Ninja-build.Ninja -e
1.1 Open Anaconda Prompt (Miniconda3)
We use Miniconda3 rather than the system Python for two reasons. First, the hf command-line tool used to download models in Phase 2 needs to be installed somewhere, and a conda environment keeps it from touching your system Python. Second, the .bat launch scripts in Phase 4 call conda activate qwen-local directly, so conda needs to be accessible from the command line. Miniconda3 is the minimal distribution: just conda, Python, and pip, without the full Anaconda package set.
If you do not have Miniconda3, install it first: https://docs.anaconda.com/miniconda/
Once installed, open Anaconda Prompt (miniconda3) from the Start menu. Do not run it as Administrator.
1.2 Create Conda Environment
cmd
conda create -n qwen-local python=3.11 -y conda activate qwen-local
With the environment active, install the Hugging Face CLI:
cmd
pip install "huggingface_hub[cli]"
Hugging Face is where essentially all serious open-source model releases live. The hf CLI is the right way to pull files from it: the model in Phase 2 is about 12.4 GB, and the CLI handles resumable downloads automatically. If your connection drops halfway through, you run the same command again and it picks up where it left off. A browser download does not do that.
1.3 Clone and Build
Standard llama.cpp can run Qwen 3.6. We are not using standard llama.cpp.
llama.cpp-tq3 is a fork that adds TurboQuant, a quantization format built for modern attention architectures. Quantization is the process of compressing model weights to use less memory. Standard GGUF formats like Q4_K_M do this by packing weights into a fixed number of bits, with some crude tiering to give slightly higher precision to layers that seem more important. It works, but it is a blunt instrument: the compression is applied uniformly without real understanding of how the model uses those weights. TurboQuant is designed specifically for hybrid attention architectures like the one Qwen 3.6 uses. It compresses weights based on their actual role in the attention structure, which is why a TurboQuant model at 12GB retains more quality than a standard GGUF at the same size. More importantly for this setup, TurboQuant extends that same logic to the KV cache itself. Standard llama.cpp can compress the cache a little; TurboQuant’s tq3_0 cache format compresses it significantly further without the quality hit that standard formats would produce at the same size. Building takes about ten minutes. You do it once.
cmd
G: cd qwen-local git clone https://github.com/turbo-tan/llama.cpp-tq3.git cd llama.cpp-tq3 cmake -B build -G "Visual Studio 17 2022" -A x64 ^ -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 ^ -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON ^ -DCMAKE_BUILD_TYPE=Release cmake --build build -j --config Release
-DCMAKE_CUDA_ARCHITECTURES=86 targets the RTX 3090’s compute capability. If you have a different GPU, find your number at developer.nvidia.com/cuda-gpus.
Phase 2: Download the Model
The model is hosted on Hugging Face by YTan2000, who maintains the TurboQuant-quantized builds. The hf CLI handles the download and creates the target directory if it does not exist.
2.1 Download 35B MoE (12.4 GB)
cmd
hf download YTan2000/Qwen3.6-35B-A3B-TQ3_4S Qwen3.6-35B-A3B-TQ3_4S.gguf --local-dir G:\qwen-local\models\qwen3.6-35b
cmd
hf download YTan2000/Qwen3.6-35B-A3B-TQ3_4S mmproj-BF16.gguf --local-dir G:\qwen-local\models\qwen3.6-35b
The mmproj file adds vision support. It is small. Grab it now.
Phase 3: Tuning the KV Cache
This is the part most guides skip. It determines whether the setup actually works.
Every token you send the model gets stored in the KV cache. The K and V stand for keys and values: the two tensors the attention mechanism produces for each token and holds in memory so it can relate new tokens to everything that came before. They grow with every token you add. At 262K tokens, they would normally be large. But because the 35B MoE only has 10 attention layers out of 40 total, the cache stays remarkably small. At q8_0/q8_0 and full 262K context, it uses only 2.7GB.
The -ctk and -ctv flags set the compression format for cached keys and cached values respectively. The option to set them differently exists because keys and values have different statistical properties. Keys are more sensitive to quantization than values. But for this model, q8_0/q8_0 fits comfortably and is the recommended starting point.
KV Cache Options
| Setting | VRAM per 1K tokens | Quality | Speed | Best for |
|---|---|---|---|---|
-ctk f16 -ctv f16 | ~8 MB | Full | Baseline | Quality-obsessed |
-ctk q8_0 -ctv q8_0 | ~4 MB | Near-perfect | Fast | Daily driver (recommended) |
-ctk tq3_0 -ctv tq3_0 | ~2.5 MB | Excellent | Very fast | TurboQuant optimized |
-ctk q4_0 -ctv q4_0 | ~2 MB | Good | Fast | Max context headroom |
For the RTX 3090 with 35B MoE
Because the MoE KV cache is so compact, you have more headroom than you might expect:
| Goal | KV Cache | Expected Max Context | Notes |
|---|---|---|---|
| Start here | q8_0 | 262K | Fits with ~7GB headroom |
| Context-focused | tq3_0 | 262K+ | Extra headroom for large repos |
| Maximum headroom | q4_0 | 262K+ | If you need breathing room |
Start with q8_0. It is near-lossless and fits the full context window with room to spare on a 3090.
Phase 4: The Launch Script
A .bat file is a Windows batch script: a plain text file containing a sequence of commands that Windows runs in order when you double-click it. It is the Windows equivalent of a shell script. We use one here because starting the model server requires navigating to the right directory, activating the conda environment, and passing a specific set of flags to llama-server.exe. These are commands you would otherwise have to type correctly every time. The .bat file does all of that in one double-click and stays easy to edit when you want to change a setting.
This file lives at G:\qwen-local\. The tunable parameters at the top are the only things you should need to change day to day.
Changing parameters requires a restart. The model server loads its configuration at startup and holds it until the process exits. You cannot change context size, KV cache type, or any other setting while the server is running. The workflow is: stop the server, edit the .bat file in any text editor, save it, double-click to restart. To stop the server, press Ctrl+C in the terminal window. This sends a clean shutdown signal and you can watch the process confirm it has exited before starting fresh. Closing the window works too but can occasionally leave the process running briefly in the background, still holding VRAM, before Windows catches up.
start_35b.bat
batch
@echo off title Qwen 3.6 35B MoE - RTX 3090 echo ======================================== echo Qwen 3.6 35B MoE (TurboQuant) echo Speed: ~111 tok/s ^| Size: 12.4 GB echo Port: 1234 echo ======================================== echo. call conda activate qwen-local cd /d G:\qwen-local\llama.cpp-tq3\build\bin\Release :: === TUNABLE PARAMETERS === set CONTEXT=262144 set KEY_TYPE=q8_0 set VAL_TYPE=q8_0 set BATCH_SIZE=2048 set REASONING=on :: =========================== echo Context: %CONTEXT% tokens echo KV Cache - Key: %KEY_TYPE%, Value: %VAL_TYPE% echo. llama-server.exe ^ -m G:\qwen-local\models\qwen3.6-35b\Qwen3.6-35B-A3B-TQ3_4S.gguf ^ --mmproj G:\qwen-local\models\qwen3.6-35b\mmproj-BF16.gguf ^ -ngl 99 ^ --host 0.0.0.0 --port 1234 ^ -c %CONTEXT% ^ -fa on ^ -ctk %KEY_TYPE% -ctv %VAL_TYPE% ^ -b %BATCH_SIZE% ^ --jinja ^ --reasoning %REASONING% --reasoning-budget 2048 --reasoning-format deepseek pause
| Parameter | Value | Range | Effect |
|---|---|---|---|
CONTEXT | 262144 | 131072 → 262144 | Full context fits comfortably on a 3090 |
KEY_TYPE | q8_0 | q8_0, tq3_0, q4_0 | Down for more headroom; q8_0 recommended |
VAL_TYPE | q8_0 | q8_0, q5_0, q4_0 | Can be set lower than KEY_TYPE independently |
BATCH_SIZE | 2048 | 1024 → 4096 | Higher = faster prompt processing |
REASONING | on | on, off | Enables chain-of-thought; budget caps tokens |
A note on --host 0.0.0.0: this exposes the server to your local network, not just localhost. If you are on a network you do not fully control, change this to --host 127.0.0.1 to restrict access to the local machine only.
Phase 5: Recipes
Three configurations worth knowing. Start with Recipe 1. Only adjust when you have a specific reason.
Recipe 1: Default — Full Context, Full Quality
This is the configuration confirmed to run cleanly on an RTX 3090 with ~7GB of VRAM headroom. Use it unless you have a reason not to.
batch
:: === TUNABLE PARAMETERS === set CONTEXT=262144 set KEY_TYPE=q8_0 set VAL_TYPE=q8_0 set BATCH_SIZE=2048 set REASONING=on :: ===========================
Expected VRAM: ~16 GB. Context: full 262K. Speed: ~111 tok/s.
Recipe 2: Maximum Headroom
You are working with an unusually large repository or you want extra breathing room as the session grows long. Dropping the KV types frees meaningful VRAM at minimal quality cost.
batch
:: === TUNABLE PARAMETERS === set CONTEXT=262144 set KEY_TYPE=q8_0 set VAL_TYPE=q4_0 set BATCH_SIZE=2048 set REASONING=on :: ===========================
Expected VRAM: ~14 GB. Context: full 262K.
Recipe 3: Speed Priority
Tight iteration loop. You are running, breaking, asking, fixing, and the bottleneck is output latency. Smaller context keeps the cache lean and generation fast.
batch
:: === TUNABLE PARAMETERS === set CONTEXT=65536 set KEY_TYPE=q8_0 set VAL_TYPE=q8_0 set BATCH_SIZE=4096 set REASONING=off :: ===========================
Expected speed: 115–120 tok/s. Context: 64K.
Phase 6: Connecting a Harness
The model is now running as a local server on port 1234. You need something to talk to it.
A harness is the application layer between you and the model: it handles the conversation interface, tool use, file access, and anything else layered on top of raw inference. There are many. OpenWebUI is browser-based and good for chat. Continue.dev integrates directly into VS Code. My preference is OpenCode, which runs in the terminal and connects to any OpenAI-compatible endpoint — local or cloud — with minimal configuration overhead. One practical note: some AI coding harnesses inject large system prompts that quietly constrain what a model can do. OpenCode does not. When you are running a model on your own hardware, you want all of it.
If you prefer a different harness, the config in the next section still applies. Only the provider name and baseURL change.
6.0 Install OpenCode
OpenCode requires Node.js. If you do not have it, install it from nodejs.org — the LTS version is fine. Then from within your WSL shell:
bash
npm install -g opencode-ai
That’s it. Verify the install with opencode --version before continuing.
6.1 Create the Config
bash
mkdir -p ~/.config/opencode nano ~/.config/opencode/opencode.json
6.2 Find Your Model ID and Host IP
OpenCode matches model IDs against what llama-server actually reports. Before writing the config, confirm both values while the server is running.
Find the Windows host IP from WSL (this is the address WSL uses to reach your Windows machine — host.docker.internal does not work in WSL):
bash
cat /etc/resolv.conf
The nameserver line is your host IP. Note that this address can change when WSL restarts. If OpenCode stops connecting after a reboot, re-run this command and update the config.
Confirm the model ID:
bash
curl http://<your-host-ip>:1234/v1/models
The id field in the response is what you need — it will be the .gguf filename, something like Qwen3.6-35B-A3B-TQ3_4S.gguf. Use that exact string in the config below.
6.3 Add the Provider Config
Replace 172.x.x.x with your actual host IP, and replace the model key with the ID returned by the curl above. [Edit 2026.05.04: Added “modalities” to allow images to be viewed and understood! Thanks to u/aeroumbria on reddit. ]
json
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"qwen3.6-35b": {
"npm": "@ai-sdk/openai-compatible",
"name": "Qwen 3.6 35B MoE",
"options": {
"baseURL": "http://172.x.x.x:1234/v1",
"toolParser": [{ "type": "raw-function-call" }, { "type": "json" }]
},
"models": {
"Qwen3.6-35B-A3B-TQ3_4S.gguf": {
"name": "Qwen 3.6 35B MoE",
"tool_call": true,
"modalities": {
"input": ["text", "image"],
"output": ["text"]
},
"limit": { "context": 262144, "output": 8192 }
}
}
}
},
"model": "qwen3.6-35b/Qwen3.6-35B-A3B-TQ3_4S.gguf"
}
Note: OpenCode validates this config against its schema and will reject unknown keys. There is no way to add comments inside the JSON. Keep a separate notes file if you want to document your host IP history or model IDs.
6.4 Optional: Auto-Detect Script
Because the host IP can change on WSL restart, a small script can update the config automatically before launching OpenCode:
bash
#!/bin/bash
HOST_IP=$(grep nameserver /etc/resolv.conf | awk '{print $2}')
MODEL_ID=$(curl -s http://$HOST_IP:1234/v1/models | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['id'])")
CONFIG="$HOME/.config/opencode/opencode.json"
python3 -c "
import json
with open('$CONFIG') as f: c = json.load(f)
c['provider']['qwen3.6-35b']['options']['baseURL'] = 'http://$HOST_IP:1234/v1'
c['model'] = 'qwen3.6-35b/$MODEL_ID'
with open('$CONFIG', 'w') as f: json.dump(c, f, indent=2)
"
opencode
Save as ~/bin/oc.sh, make it executable with chmod +x ~/bin/oc.sh, and use it to launch OpenCode instead of calling it directly.
6.5 Running OpenCode from WSL
WSL2 (Windows Subsystem for Linux 2) lets you run a full Linux environment inside Windows. If you do not have it set up, the two commands below are all you need — WSL2 first, then Kali Linux as the distribution. Microsoft’s full setup guide is at learn.microsoft.com/en-us/windows/wsl/install; Kali’s WSL-specific notes are at kali.org/docs/wsl/wsl-preparations.
powershell
wsl --install --no-distribution wsl --install -d kali-linux
Restart when prompted. After that, wsl -d kali-linux drops you into a Kali shell.
I run OpenCode from within WSL — specifically wsl -d kali-linux — rather than from Windows directly. OpenCode’s Linux install is cleaner. Node, npm, and the shell utilities it depends on behave more naturally in Linux than on Windows. The model stays on Windows where the GPU is. OpenCode lives in Linux where the terminal is better.
To start a session:
bash
wsl -d kali-linux opencode
Then /model to confirm you are connected to the right instance.
Phase 7: Daily Use
Once it is running, the workflow is simple.
- Double-click
start_35b.bat - Wait for “llama server listening”
- Open PowerShell and drop into Kali:
wsl -d kali-linux - Run
opencode(oroc.shif using the auto-detect script) - Run
/modelto confirm the model is connected
Monitoring VRAM
nvidia-smi -l 1 works but takes up a full terminal window. Two better options:
nvitop — htop-style GPU monitor. Install once, run anywhere:
cmd
pip install nvitop nvitop
Shows VRAM, utilization, temperature, and running processes in a compact layout. This is the one to keep open in a corner.
Task Manager — zero setup. Ctrl+Shift+Esc → Performance → GPU. Shows dedicated GPU memory in a small, resizable window you can pin to a corner without occupying a terminal.
At Recipe 1 settings, expect to see around 16GB in use. If you are consistently sitting above 22GB, drop VAL_TYPE to q4_0 in the .bat file and restart.
Performance at a Glance
| Setting | Context | VRAM | Speed | KV Cache |
|---|---|---|---|---|
| Recipe 1 | 262K | ~16GB | ~111 tok/s | 2.7 GB |
| Recipe 2 | 262K | ~14GB | ~111 tok/s | ~1.8 GB |
| Recipe 3 | 64K | ~13GB | ~115 tok/s | 0.7 GB |
The KV cache numbers reflect the MoE advantage: only 10 of the model’s 40 layers are attention layers. The rest use a fixed-size recurrent state that does not grow with context.
Troubleshooting
| Problem | Fix |
|---|---|
Model not appearing in /model | Check model ID with curl http://<host-ip>:1234/v1/models |
| OpenCode not connecting | Re-check host IP via cat /etc/resolv.conf — it changes on restart |
| Out of memory | Reduce CONTEXT or drop VAL_TYPE to q4_0 |
| Slow generation | Increase BATCH_SIZE |
| Poor quality responses | Switch from q4_0 to q8_0 KV |
| OpenCode tool calls failing | Ensure --jinja flag is present in the .bat file |
q6_0 KV type not accepted | Not supported — use q5_0 or q5_1 instead |
You’re Ready
Start with Recipe 1. Monitor your VRAM. The 35B MoE is your only model and it is the right one. It’s smaller than the dense alternatives, faster, better at coding tasks, and comfortable at full 262K context on a 3090.
Not One Spinning Plate Dropped
This is not a single-click solution. You are managing multiple moving parts: compiling a specialized server, starting it before each session, and being comfortable enough in the terminal to navigate a few shell commands. There are enough plates spinning that it shouldn’t work, and yet it does. A local model that ranks with the best subscription offerings a few months ago, running on hardware you own, with a context window that holds entire codebases.
And it’s only going to get better from here.