
Deep Dive: Building Your Own LLM Server on Windows 11 with Code Llama Coder

By Joe Stasio on May 5, 2025

1. Introduction

If you’ve been banging your head against slow hosted LLM APIs or waiting days for Awanllm to approve your account, it’s time to cut the cord. In this 1,000‑word deep dive we’ll build a dedicated LLM appliance on Windows 11—using WSL2, CUDA on WSL, and Hugging Face’s Text‑Generation‑Inference (TGI) framework to serve Code Llama 7B‑Instruct at blazing speeds. We’ll also hook it into your crawlers and bots for comment generation, trend classification, and automated posting.

2. Why Self‑Host?

  • Ultra‑low latency: Local GPU inference avoids cloud round‑trips.
  • Unlimited usage: No per‑month quota or surprise throttling—your 32 GB RAM and 8 GB VRAM are the only limits.
  • Cost efficiency: One‑time hardware vs. recurring $100s/month on cloud GPUs.
  • Privacy & control: All your data, prompts, and logs stay on your network.

3. Hardware Recap & eBay Budget Build

Your main workstation already packs an RTX 2070 Super, 32 GB of RAM, and a 10th‑gen Intel Core i5—ideal for an LLM host. If you want a second, dedicated mini‑ITX box, grab these used parts for roughly $250:

# eBay‑sourced mini‑ITX appliance build
CPU:  Intel Core i5‑10400 (6c/12t)   – ~$70
MB:   mini‑ITX board (LGA1200)       – ~$60
RAM:  32 GB DDR4 (2×16 GB)           – ~$80
SSD:  512 GB NVMe                    – ~$40
Case + SFX PSU kit                   – ~$50

4. Software Stack Overview

  1. Windows 11 + WSL2 (Ubuntu 22.04 LTS)
  2. CUDA on WSL driver from NVIDIA (exposes your RTX 2070 Super to WSL)
  3. CUDA Toolkit 12.2 inside Ubuntu via apt
  4. Python 3.10 venv with transformers, accelerate, bitsandbytes
  5. Hugging Face Text‑Generation‑Inference (TGI) server cloned & built locally
  6. Code Llama 7B‑Instruct quantized to 4‑bit (nf4) so it fits within the 2070 Super's 8 GB of VRAM

5. Step‑By‑Step Setup

5.1 Enable WSL2 & Install Ubuntu

powershell> wsl --install -d Ubuntu-22.04
# Reboot, then launch the Ubuntu app and set your UNIX user/pass
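
Once the Ubuntu shell is open, a quick check confirms you are actually on the WSL2 kernel (the exact version will differ):

uname -r    # WSL2 kernels report a "-microsoft-standard-WSL2" suffix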

5.2 Install NVIDIA “CUDA on WSL” Driver (Windows)

Download and run NVIDIA’s CUDA‑on‑WSL driver on the Windows side; do not install a separate Linux GPU driver inside WSL. Reboot when prompted.

5.3 Install CUDA Toolkit in WSL2

sudo apt update
sudo apt install -y cuda-toolkit-12-2   # assumes NVIDIA's CUDA apt repo for WSL-Ubuntu is already configured
nvcc --version    # expect release 12.2.x
nvidia-smi        # verify the GPU is visible inside WSL

5.4 Clone & Prep TGI Server

git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference
sudo apt install -y python3.10-venv build-essential libssl-dev gcc
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r server/requirements.txt
pip install -r server/requirements_gen.txt

5.5 Build & Install with CUDA Kernels

If the flash‑attn build stalls, increase the number of parallel compile jobs:

export MAKEFLAGS="-j6"

export CUDA_HOME=/usr/local/cuda-12.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export BUILD_EXTENSIONS=True
make install
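
After make install completes, a quick sanity check (assuming the venv is still active) confirms the launcher landed on your PATH:

text-generation-launcher --help    # should print the launcher's CLI options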

5.6 Launch Code Llama 7B‑Instruct

text-generation-launcher \
  --model-id meta-llama/CodeLlama-7b-Instruct-hf \
  --quantize bitsandbytes-nf4 \
  --device cuda:0 \
  --port 5000
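
Once the launcher reports the model is loaded, smoke‑test the endpoint from a second shell (the prompt below is just an example):

curl http://localhost:5000/generate \
  -X POST \
  -H "Content-Type: application/json" \
  -d '{"inputs": "Write a Python one-liner that reverses a string.", "parameters": {"max_new_tokens": 64}}'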

6. Automation & Resource Caps

In %USERPROFILE%\.wslconfig (edited on the Windows side), cap how much of the machine WSL2 can claim:

[wsl2]
memory=24GB
processors=6

Run wsl --shutdown so the new limits take effect, then use Task Scheduler to run wsl -d Ubuntu-22.04 -- ~/start-llm.sh at boot.
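
The start-llm.sh script itself isn't shown above, so here is a minimal sketch, assuming the repo and venv from section 5.4 live in your WSL home directory:

#!/usr/bin/env bash
# ~/start-llm.sh: launch the TGI server at boot (paths are assumptions; adjust to your layout)
cd ~/text-generation-inference
source .venv/bin/activate
export CUDA_HOME=/usr/local/cuda-12.2
export PATH=$CUDA_HOME/bin:$PATH
text-generation-launcher \
  --model-id meta-llama/CodeLlama-7b-Instruct-hf \
  --quantize bitsandbytes-nf4 \
  --port 5000 >> ~/llm-server.log 2>&1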

7. Integrating with Crawlers & Bots

Point your crawler or bot at the local endpoint with a few lines of Python:

import requests

def gen(article):
    # Ask the local TGI server for a short completion based on the article text
    resp = requests.post(
        "http://localhost:5000/generate",
        json={"inputs": article, "parameters": {"max_new_tokens": 32}},
    )
    # TGI's /generate endpoint returns an object with a "generated_text" field
    return resp.json()["generated_text"].strip()

8. Performance Tips

  • Quantize to nf4 to keep the 7B model comfortably inside 8 GB of VRAM
  • Batch requests: TGI continuously batches concurrent requests, so fire them in parallel rather than one at a time
  • Tune temperature (and top_p) per task: low for classification, higher for comment generation (see the request sketch after this list)
  • Add retrieval‑augmented generation (RAG) via Qdrant to ground outputs in your own content
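
For example, a low‑temperature classification request might look like this (prompt and parameter values are illustrative, not tuned):

curl http://localhost:5000/generate \
  -X POST -H "Content-Type: application/json" \
  -d '{"inputs": "Classify the topic of this headline: ...", "parameters": {"max_new_tokens": 16, "temperature": 0.2, "top_p": 0.9, "do_sample": true}}'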

9. Conclusion

You now own an LLM inference stack on Windows 11—fast, private, and unlimited.