Running Qwen 3.6 Locally: A Setup Guide for RTX Owners

Something changed in the last few months. Local models reached a threshold: not perfect, not always faster than hosted alternatives, but good enough to be genuinely useful for long-context coding work on consumer hardware. Meanwhile, cloud AI terms keep shifting, and the direction is not toward individual developers running personal setups.

The calculation is simple: something that runs on your own machine, for the cost of electricity, with no terms to re-read. It does not have to be the best model available. It has to work.

We are apparently there now. It can only get better.


Qwen 3.6 35B MoE fits in 12.4 GB of VRAM, generates at around 111 tokens per second on an RTX 3090, and supports a 262K token context window. That is long enough to hold most codebases. Because of its Mixture of Experts architecture, it uses far less VRAM for the KV cache than a dense model of comparable size. This means the full 262K context actually fits, comfortably, on a single consumer card.

This post is a working recipe. I run this model on an RTX 3090, and this is exactly how I set it up.


What You Need

ComponentSpecStatus
GPURTX 3090 24GB VRAM
RAM64GB
StorageSSD, 30GB+ free

If your GPU has at least 16GB VRAM, most of this still applies. You will need to reduce the context and KV cache settings in Phase 3 to fit your memory budget.


Why Qwen 3.6

Until recently, long context was an infrastructure problem. When a model processes a long conversation or a large codebase, it builds an attention cache: the working memory that lets it connect a reference on line 3 to a bug on line 4,000. That cache grows with every token. At 100K to 200K tokens, it becomes large enough that running it required dedicated server hardware with tens of gigabytes of fast memory. That meant cloud endpoints and someone else’s terms.

Several things had to change at once. Flash Attention rewrote how the attention computation works: instead of loading the full cache into memory at each step, it tiles the calculation, keeping memory use roughly flat as context grows. Quantization formats got smarter: early approaches compressed model weights indiscriminately and paid a quality penalty; newer formats identify which weights matter most and protect them, so a 12GB compressed model retains most of the quality of a full-precision original. The KV cache turned out to be compressible too, with near-lossless results at half the original size. Mixture of Experts architecture changed the basic math: a 35B MoE model only activates a fraction of its parameters per token, so the effective compute cost at inference is far smaller than the total parameter count suggests, and crucially, only the attention layers contribute to the KV cache, not the expert layers. Qwen 3.6 adds DeltaNet with Gated Attention on top, designed not just to store long contexts but to reason across them without the quality degradation that typically sets in past 100K tokens.

None of these are new ideas. What changed is that they arrived together, at this model size, in a format that fits on one consumer card.

FeatureQwen 3.5Qwen 3.6 35B MoEWhat it means
Native context128K262KLoad an entire codebase
Model size16.5 GB12.4 GBFits on one consumer GPU
Speed~20 tok/s~111 tok/s5x faster
KV cache @ 262K~8 GB~2.7 GBMoE advantage: fewer attn layers
ArchitectureStandardDeltaNet + Gated AttentionReasons better at long context

The KV cache difference is the number that matters most for your hardware. A dense model at 262K context needs roughly 8GB just for the cache. The 35B MoE needs 2.7GB — because only 10 of its 40 layers are attention layers. The rest are SSM layers that use a fixed-size recurrent state instead of a growing cache. That is why the full 262K context fits with headroom to spare on a 24GB card.


Project Structure

Here is what we are building. I use G:\ — a dedicated SSD I keep for coding projects — but anywhere with 30GB of fast storage works. Substitute your own drive letter throughout.

G:\qwen-local\
├── llama.cpp-tq3\                          # TurboQuant engine
├── models\
│   └── qwen3.6-35b\                        # 35B MoE (12.4 GB)
│       ├── Qwen3.6-35B-A3B-TQ3_4S.gguf
│       └── mmproj-BF16.gguf                # Vision support
├── start_35b.bat

Phase 1: Build the Engine (One-Time)

This phase compiles the TurboQuant engine from source. You do it once. Before you start, you need four things installed on Windows: Git, CMake, Ninja, and the Visual Studio 2022 Build Tools with the C++ workload. The CUDA Toolkit from NVIDIA is also required for GPU compilation. The winget commands below handle the first three. The Visual Studio Build Tools and CUDA Toolkit installs each require a few manual steps that are outside the scope of this post. If you need a walkthrough, see [companion post: Setting Up a C++/CUDA Build Environment on Windows, coming soon].

powershell

winget install --id Git.Git -e
winget install --id Kitware.CMake -e
winget install --id Ninja-build.Ninja -e

1.1 Open Anaconda Prompt (Miniconda3)

We use Miniconda3 rather than the system Python for two reasons. First, the hf command-line tool used to download models in Phase 2 needs to be installed somewhere, and a conda environment keeps it from touching your system Python. Second, the .bat launch scripts in Phase 4 call conda activate qwen-local directly, so conda needs to be accessible from the command line. Miniconda3 is the minimal distribution: just conda, Python, and pip, without the full Anaconda package set.

If you do not have Miniconda3, install it first: https://docs.anaconda.com/miniconda/

Once installed, open Anaconda Prompt (miniconda3) from the Start menu. Do not run it as Administrator.

1.2 Create Conda Environment

cmd

conda create -n qwen-local python=3.11 -y
conda activate qwen-local

With the environment active, install the Hugging Face CLI:

cmd

pip install "huggingface_hub[cli]"

Hugging Face is where essentially all serious open-source model releases live. The hf CLI is the right way to pull files from it: the model in Phase 2 is about 12.4 GB, and the CLI handles resumable downloads automatically. If your connection drops halfway through, you run the same command again and it picks up where it left off. A browser download does not do that.

1.3 Clone and Build

Standard llama.cpp can run Qwen 3.6. We are not using standard llama.cpp.

llama.cpp-tq3 is a fork that adds TurboQuant, a quantization format built for modern attention architectures. Quantization is the process of compressing model weights to use less memory. Standard GGUF formats like Q4_K_M do this by packing weights into a fixed number of bits, with some crude tiering to give slightly higher precision to layers that seem more important. It works, but it is a blunt instrument: the compression is applied uniformly without real understanding of how the model uses those weights. TurboQuant is designed specifically for hybrid attention architectures like the one Qwen 3.6 uses. It compresses weights based on their actual role in the attention structure, which is why a TurboQuant model at 12GB retains more quality than a standard GGUF at the same size. More importantly for this setup, TurboQuant extends that same logic to the KV cache itself. Standard llama.cpp can compress the cache a little; TurboQuant’s tq3_0 cache format compresses it significantly further without the quality hit that standard formats would produce at the same size. Building takes about ten minutes. You do it once.

cmd

G:
cd qwen-local
git clone https://github.com/turbo-tan/llama.cpp-tq3.git
cd llama.cpp-tq3

cmake -B build -G "Visual Studio 17 2022" -A x64 ^
  -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86 ^
  -DGGML_CUDA_FA=ON -DGGML_CUDA_GRAPHS=ON ^
  -DCMAKE_BUILD_TYPE=Release

cmake --build build -j --config Release

-DCMAKE_CUDA_ARCHITECTURES=86 targets the RTX 3090’s compute capability. If you have a different GPU, find your number at developer.nvidia.com/cuda-gpus.


Phase 2: Download the Model

The model is hosted on Hugging Face by YTan2000, who maintains the TurboQuant-quantized builds. The hf CLI handles the download and creates the target directory if it does not exist.

2.1 Download 35B MoE (12.4 GB)

cmd

hf download YTan2000/Qwen3.6-35B-A3B-TQ3_4S Qwen3.6-35B-A3B-TQ3_4S.gguf --local-dir G:\qwen-local\models\qwen3.6-35b

cmd

hf download YTan2000/Qwen3.6-35B-A3B-TQ3_4S mmproj-BF16.gguf --local-dir G:\qwen-local\models\qwen3.6-35b

The mmproj file adds vision support. It is small. Grab it now.


Phase 3: Tuning the KV Cache

This is the part most guides skip. It determines whether the setup actually works.

Every token you send the model gets stored in the KV cache. The K and V stand for keys and values: the two tensors the attention mechanism produces for each token and holds in memory so it can relate new tokens to everything that came before. They grow with every token you add. At 262K tokens, they would normally be large. But because the 35B MoE only has 10 attention layers out of 40 total, the cache stays remarkably small. At q8_0/q8_0 and full 262K context, it uses only 2.7GB.

The -ctk and -ctv flags set the compression format for cached keys and cached values respectively. The option to set them differently exists because keys and values have different statistical properties. Keys are more sensitive to quantization than values. But for this model, q8_0/q8_0 fits comfortably and is the recommended starting point.

KV Cache Options

SettingVRAM per 1K tokensQualitySpeedBest for
-ctk f16 -ctv f16~8 MBFullBaselineQuality-obsessed
-ctk q8_0 -ctv q8_0~4 MBNear-perfectFastDaily driver (recommended)
-ctk tq3_0 -ctv tq3_0~2.5 MBExcellentVery fastTurboQuant optimized
-ctk q4_0 -ctv q4_0~2 MBGoodFastMax context headroom

For the RTX 3090 with 35B MoE

Because the MoE KV cache is so compact, you have more headroom than you might expect:

GoalKV CacheExpected Max ContextNotes
Start hereq8_0262KFits with ~7GB headroom
Context-focusedtq3_0262K+Extra headroom for large repos
Maximum headroomq4_0262K+If you need breathing room

Start with q8_0. It is near-lossless and fits the full context window with room to spare on a 3090.


Phase 4: The Launch Script

A .bat file is a Windows batch script: a plain text file containing a sequence of commands that Windows runs in order when you double-click it. It is the Windows equivalent of a shell script. We use one here because starting the model server requires navigating to the right directory, activating the conda environment, and passing a specific set of flags to llama-server.exe. These are commands you would otherwise have to type correctly every time. The .bat file does all of that in one double-click and stays easy to edit when you want to change a setting.

This file lives at G:\qwen-local\. The tunable parameters at the top are the only things you should need to change day to day.

Changing parameters requires a restart. The model server loads its configuration at startup and holds it until the process exits. You cannot change context size, KV cache type, or any other setting while the server is running. The workflow is: stop the server, edit the .bat file in any text editor, save it, double-click to restart. To stop the server, press Ctrl+C in the terminal window. This sends a clean shutdown signal and you can watch the process confirm it has exited before starting fresh. Closing the window works too but can occasionally leave the process running briefly in the background, still holding VRAM, before Windows catches up.

start_35b.bat

batch

@echo off
title Qwen 3.6 35B MoE - RTX 3090
echo ========================================
echo   Qwen 3.6 35B MoE (TurboQuant)
echo   Speed: ~111 tok/s ^| Size: 12.4 GB
echo   Port: 1234
echo ========================================
echo.

call conda activate qwen-local
cd /d G:\qwen-local\llama.cpp-tq3\build\bin\Release

:: === TUNABLE PARAMETERS ===
set CONTEXT=262144
set KEY_TYPE=q8_0
set VAL_TYPE=q8_0
set BATCH_SIZE=2048
set REASONING=on
:: ===========================

echo Context: %CONTEXT% tokens
echo KV Cache - Key: %KEY_TYPE%, Value: %VAL_TYPE%
echo.

llama-server.exe ^
  -m G:\qwen-local\models\qwen3.6-35b\Qwen3.6-35B-A3B-TQ3_4S.gguf ^
  --mmproj G:\qwen-local\models\qwen3.6-35b\mmproj-BF16.gguf ^
  -ngl 99 ^
  --host 0.0.0.0 --port 1234 ^
  -c %CONTEXT% ^
  -fa on ^
  -ctk %KEY_TYPE% -ctv %VAL_TYPE% ^
  -b %BATCH_SIZE% ^
  --jinja ^
  --reasoning %REASONING% --reasoning-budget 2048 --reasoning-format deepseek

pause
ParameterValueRangeEffect
CONTEXT262144131072 → 262144Full context fits comfortably on a 3090
KEY_TYPEq8_0q8_0, tq3_0, q4_0Down for more headroom; q8_0 recommended
VAL_TYPEq8_0q8_0, q5_0, q4_0Can be set lower than KEY_TYPE independently
BATCH_SIZE20481024 → 4096Higher = faster prompt processing
REASONINGonon, offEnables chain-of-thought; budget caps tokens

A note on --host 0.0.0.0: this exposes the server to your local network, not just localhost. If you are on a network you do not fully control, change this to --host 127.0.0.1 to restrict access to the local machine only.


Phase 5: Recipes

Three configurations worth knowing. Start with Recipe 1. Only adjust when you have a specific reason.

Recipe 1: Default — Full Context, Full Quality

This is the configuration confirmed to run cleanly on an RTX 3090 with ~7GB of VRAM headroom. Use it unless you have a reason not to.

batch

:: === TUNABLE PARAMETERS ===
set CONTEXT=262144
set KEY_TYPE=q8_0
set VAL_TYPE=q8_0
set BATCH_SIZE=2048
set REASONING=on
:: ===========================

Expected VRAM: ~16 GB. Context: full 262K. Speed: ~111 tok/s.

Recipe 2: Maximum Headroom

You are working with an unusually large repository or you want extra breathing room as the session grows long. Dropping the KV types frees meaningful VRAM at minimal quality cost.

batch

:: === TUNABLE PARAMETERS ===
set CONTEXT=262144
set KEY_TYPE=q8_0
set VAL_TYPE=q4_0
set BATCH_SIZE=2048
set REASONING=on
:: ===========================

Expected VRAM: ~14 GB. Context: full 262K.

Recipe 3: Speed Priority

Tight iteration loop. You are running, breaking, asking, fixing, and the bottleneck is output latency. Smaller context keeps the cache lean and generation fast.

batch

:: === TUNABLE PARAMETERS ===
set CONTEXT=65536
set KEY_TYPE=q8_0
set VAL_TYPE=q8_0
set BATCH_SIZE=4096
set REASONING=off
:: ===========================

Expected speed: 115–120 tok/s. Context: 64K.


Phase 6: Connecting a Harness

The model is now running as a local server on port 1234. You need something to talk to it.

A harness is the application layer between you and the model: it handles the conversation interface, tool use, file access, and anything else layered on top of raw inference. There are many. OpenWebUI is browser-based and good for chat. Continue.dev integrates directly into VS Code. My preference is OpenCode, which runs in the terminal and connects to any OpenAI-compatible endpoint — local or cloud — with minimal configuration overhead. One practical note: some AI coding harnesses inject large system prompts that quietly constrain what a model can do. OpenCode does not. When you are running a model on your own hardware, you want all of it.

If you prefer a different harness, the config in the next section still applies. Only the provider name and baseURL change.

6.0 Install OpenCode

OpenCode requires Node.js. If you do not have it, install it from nodejs.org — the LTS version is fine. Then from within your WSL shell:

bash

npm install -g opencode-ai

That’s it. Verify the install with opencode --version before continuing.

6.1 Create the Config

bash

mkdir -p ~/.config/opencode
nano ~/.config/opencode/opencode.json

6.2 Find Your Model ID and Host IP

OpenCode matches model IDs against what llama-server actually reports. Before writing the config, confirm both values while the server is running.

Find the Windows host IP from WSL (this is the address WSL uses to reach your Windows machine — host.docker.internal does not work in WSL):

bash

cat /etc/resolv.conf

The nameserver line is your host IP. Note that this address can change when WSL restarts. If OpenCode stops connecting after a reboot, re-run this command and update the config.

Confirm the model ID:

bash

curl http://<your-host-ip>:1234/v1/models

The id field in the response is what you need — it will be the .gguf filename, something like Qwen3.6-35B-A3B-TQ3_4S.gguf. Use that exact string in the config below.

6.3 Add the Provider Config

Replace 172.x.x.x with your actual host IP, and replace the model key with the ID returned by the curl above. [Edit 2026.05.04: Added “modalities” to allow images to be viewed and understood! Thanks to u/aeroumbria on reddit. ]

json

{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "qwen3.6-35b": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Qwen 3.6 35B MoE",
      "options": {
        "baseURL": "http://172.x.x.x:1234/v1",
        "toolParser": [{ "type": "raw-function-call" }, { "type": "json" }]
      },
      "models": {
        "Qwen3.6-35B-A3B-TQ3_4S.gguf": {
          "name": "Qwen 3.6 35B MoE",
          "tool_call": true,
          "modalities": {
            "input": ["text", "image"],
            "output": ["text"]
          },
          "limit": { "context": 262144, "output": 8192 }
        }
      }
    }
  },
  "model": "qwen3.6-35b/Qwen3.6-35B-A3B-TQ3_4S.gguf"
}

Note: OpenCode validates this config against its schema and will reject unknown keys. There is no way to add comments inside the JSON. Keep a separate notes file if you want to document your host IP history or model IDs.

6.4 Optional: Auto-Detect Script

Because the host IP can change on WSL restart, a small script can update the config automatically before launching OpenCode:

bash

#!/bin/bash
HOST_IP=$(grep nameserver /etc/resolv.conf | awk '{print $2}')
MODEL_ID=$(curl -s http://$HOST_IP:1234/v1/models | python3 -c "import sys,json; print(json.load(sys.stdin)['data'][0]['id'])")

CONFIG="$HOME/.config/opencode/opencode.json"
python3 -c "
import json
with open('$CONFIG') as f: c = json.load(f)
c['provider']['qwen3.6-35b']['options']['baseURL'] = 'http://$HOST_IP:1234/v1'
c['model'] = 'qwen3.6-35b/$MODEL_ID'
with open('$CONFIG', 'w') as f: json.dump(c, f, indent=2)
"

opencode

Save as ~/bin/oc.sh, make it executable with chmod +x ~/bin/oc.sh, and use it to launch OpenCode instead of calling it directly.

6.5 Running OpenCode from WSL

WSL2 (Windows Subsystem for Linux 2) lets you run a full Linux environment inside Windows. If you do not have it set up, the two commands below are all you need — WSL2 first, then Kali Linux as the distribution. Microsoft’s full setup guide is at learn.microsoft.com/en-us/windows/wsl/install; Kali’s WSL-specific notes are at kali.org/docs/wsl/wsl-preparations.

powershell

wsl --install --no-distribution
wsl --install -d kali-linux

Restart when prompted. After that, wsl -d kali-linux drops you into a Kali shell.

I run OpenCode from within WSL — specifically wsl -d kali-linux — rather than from Windows directly. OpenCode’s Linux install is cleaner. Node, npm, and the shell utilities it depends on behave more naturally in Linux than on Windows. The model stays on Windows where the GPU is. OpenCode lives in Linux where the terminal is better.

To start a session:

bash

wsl -d kali-linux
opencode

Then /model to confirm you are connected to the right instance.


Phase 7: Daily Use

Once it is running, the workflow is simple.

  1. Double-click start_35b.bat
  2. Wait for “llama server listening”
  3. Open PowerShell and drop into Kali: wsl -d kali-linux
  4. Run opencode (or oc.sh if using the auto-detect script)
  5. Run /model to confirm the model is connected

Monitoring VRAM

nvidia-smi -l 1 works but takes up a full terminal window. Two better options:

nvitop — htop-style GPU monitor. Install once, run anywhere:

cmd

pip install nvitop
nvitop

Shows VRAM, utilization, temperature, and running processes in a compact layout. This is the one to keep open in a corner.

Task Manager — zero setup. Ctrl+Shift+Esc → Performance → GPU. Shows dedicated GPU memory in a small, resizable window you can pin to a corner without occupying a terminal.

At Recipe 1 settings, expect to see around 16GB in use. If you are consistently sitting above 22GB, drop VAL_TYPE to q4_0 in the .bat file and restart.


Performance at a Glance

SettingContextVRAMSpeedKV Cache
Recipe 1262K~16GB~111 tok/s2.7 GB
Recipe 2262K~14GB~111 tok/s~1.8 GB
Recipe 364K~13GB~115 tok/s0.7 GB

The KV cache numbers reflect the MoE advantage: only 10 of the model’s 40 layers are attention layers. The rest use a fixed-size recurrent state that does not grow with context.


Troubleshooting

ProblemFix
Model not appearing in /modelCheck model ID with curl http://<host-ip>:1234/v1/models
OpenCode not connectingRe-check host IP via cat /etc/resolv.conf — it changes on restart
Out of memoryReduce CONTEXT or drop VAL_TYPE to q4_0
Slow generationIncrease BATCH_SIZE
Poor quality responsesSwitch from q4_0 to q8_0 KV
OpenCode tool calls failingEnsure --jinja flag is present in the .bat file
q6_0 KV type not acceptedNot supported — use q5_0 or q5_1 instead

You’re Ready

Start with Recipe 1. Monitor your VRAM. The 35B MoE is your only model and it is the right one. It’s smaller than the dense alternatives, faster, better at coding tasks, and comfortable at full 262K context on a 3090.

Not One Spinning Plate Dropped

This is not a single-click solution. You are managing multiple moving parts: compiling a specialized server, starting it before each session, and being comfortable enough in the terminal to navigate a few shell commands. There are enough plates spinning that it shouldn’t work, and yet it does. A local model that ranks with the best subscription offerings a few months ago, running on hardware you own, with a context window that holds entire codebases.

And it’s only going to get better from here.

Leave a Reply

Your email address will not be published. Required fields are marked *