1. Introduction
If you’ve been banging your head against slow hosted LLM APIs or waiting days for Awanllm to approve your account, it’s time to cut the cord. In this 1,000‑word deep dive we’ll build a dedicated LLM appliance on Windows 11—using WSL2, CUDA on WSL, and Hugging Face’s Text‑Generation‑Inference (TGI) framework to serve Code Llama 7B‑Instruct at blazing speeds. We’ll also hook it into your crawlers and bots for comment generation, trend classification, and automated posting.
2. Why Self‑Host?
- Ultra‑low latency: Local GPU inference avoids cloud round‑trips.
- Unlimited usage: No per‑month quota or surprise throttling—your 32 GB RAM and 8 GB VRAM are the only limits.
- Cost efficiency: One‑time hardware vs. recurring $100s/month on cloud GPUs.
- Privacy & control: All your data, prompts, and logs stay on your network.
3. Hardware Recap & eBay Budget Build
Your main workstation already packs an RTX 2070 Super, 32 GB of RAM, and a 10th-gen Intel Core i5, which is ideal for an LLM host. If you want a second dedicated mini-ITX box, grab these used parts for roughly $300:
```
# eBay-sourced mini-ITX appliance build
CPU:  Intel Core i5-10400 (6c/12t)   – ~$70
MB:   mini-ITX board (LGA1200)       – ~$60
RAM:  32 GB DDR4 (2×16 GB)           – ~$80
SSD:  512 GB NVMe                    – ~$40
Case + SFX PSU kit                   – ~$50
```
4. Software Stack Overview
- Windows 11 + WSL2 (Ubuntu 22.04 LTS)
- CUDA on WSL driver from NVIDIA (exposes the RTX 2070 Super to the Linux environment)
- CUDA Toolkit 12.2 inside Ubuntu via apt
- Python 3.10 venv with transformers, accelerate, bitsandbytes
- Hugging Face Text‑Generation‑Inference (TGI) server cloned & built locally
- Code Llama 7B‑Instruct quantized to 4-bit (nf4) so it fits within the card's 8 GB of VRAM (see the sketch below)
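For reference, the 4-bit nf4 quantization that TGI applies with --quantize bitsandbytes-nf4 can also be reproduced when loading the model directly with transformers in the venv above. This is only an illustrative sketch of what nf4 means, not part of the TGI serving path; it assumes bitsandbytes and accelerate are installed and uses the same model id we serve later:

```python
# Sketch: loading Code Llama 7B-Instruct in 4-bit nf4 directly with transformers.
# Roughly equivalent to what TGI does for us at serve time with bitsandbytes-nf4.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,              # 4-bit weights
    bnb_4bit_quant_type="nf4",      # normal-float 4-bit quantization
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "meta-llama/CodeLlama-7b-Instruct-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",              # places the quantized weights on the GPU
)
```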
5. Step‑By‑Step Setup
5.1 Enable WSL2 & Install Ubuntu
```powershell
wsl --install -d Ubuntu-22.04
# Reboot, then launch the Ubuntu app and set your UNIX username/password
```
5.2 Install NVIDIA “CUDA on WSL” Driver (Windows)
Download and run NVIDIA's CUDA-on-WSL driver on the Windows side; do not install a separate NVIDIA driver inside the WSL distro, since the Windows driver is passed through. Reboot when prompted.
5.3 Install CUDA Toolkit in WSL2
```bash
# Add NVIDIA's CUDA repository for WSL-Ubuntu first (see NVIDIA's CUDA-on-WSL
# install guide), then install the toolkit:
sudo apt update
sudo apt install -y cuda-toolkit-12-2
nvcc --version   # expect release 12.2.x
nvidia-smi       # verify the GPU is visible inside WSL
```
5.4 Clone & Prep TGI Server
```bash
git clone https://github.com/huggingface/text-generation-inference.git
cd text-generation-inference
sudo apt install -y python3.10-venv build-essential libssl-dev gcc
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r server/requirements.txt
pip install -r server/requirements_gen.txt
```
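Before building, it is worth confirming that the GPU is visible from Python inside the venv. A minimal check, assuming torch was pulled in by the requirements step above (if not, install it into the venv first):

```python
# Verify that CUDA-on-WSL works from the Python side before building TGI.
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))                # expect the RTX 2070 Super
    print("Compute capability:", torch.cuda.get_device_capability(0))  # 7.5 for Turing cards
```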
5.5 Build & Install with CUDA Kernels
The flash-attn kernels take a long time to compile; you can raise the number of parallel build jobs (at the cost of more RAM) with export MAKEFLAGS="-j6".
```bash
export CUDA_HOME=/usr/local/cuda-12.2
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
export BUILD_EXTENSIONS=True
make install
```
5.6 Launch Code Llama 7B‑Instruct
```bash
# If the model repo is gated, accept the license on Hugging Face and export
# HUGGING_FACE_HUB_TOKEN before launching.
text-generation-launcher \
  --model-id meta-llama/CodeLlama-7b-Instruct-hf \
  --quantize bitsandbytes-nf4 \
  --device cuda:0 \
  --port 5000
```
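Once the launcher reports that the model is loaded, a quick smoke test from the host confirms the endpoint answers. The /generate route and payload shape below follow TGI's standard API; adjust the port if you changed it:

```python
# Minimal smoke test against the freshly launched TGI server.
import requests

payload = {
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 48},
}
resp = requests.post("http://localhost:5000/generate", json=payload, timeout=120)
print(resp.status_code)
print(resp.json())  # expect a JSON body containing "generated_text"
```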
6. Automation & Resource Caps
In %USERPROFILE%\.wslconfig, cap how much of the machine WSL2 may take:
```
[wsl2]
memory=24GB
processors=6
```
Use Task Scheduler to run wsl -d Ubuntu-22.04 -- bash ~/start-llm.sh at boot, where start-llm.sh activates the venv and runs the launcher command above.
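Because the model takes a while to load after boot, anything started alongside it should wait for the server before sending work. A small readiness probe, assuming TGI's /health route (swap in a tiny /generate call if your build differs):

```python
# Poll the TGI server until it is ready to serve requests after boot.
import time
import requests

def wait_for_llm(url="http://localhost:5000/health", timeout_s=600):
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.RequestException:
            pass  # server not up yet
        time.sleep(10)
    return False

if __name__ == "__main__":
    print("LLM ready:", wait_for_llm())
```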
7. Integrating with Crawlers & Bots
```python
import requests

def gen(article):
    resp = requests.post(
        "http://localhost:5000/generate",
        json={"inputs": article, "parameters": {"max_new_tokens": 32}},
        timeout=120,
    )
    data = resp.json()
    # /generate returns an object for a single prompt, a list for batched input
    return (data[0] if isinstance(data, list) else data)["generated_text"].strip()
```
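The same helper covers the use cases from the intro. A sketch of trend classification and comment drafting built on gen() above; the prompt wording, label set, and example article text are placeholders to adapt to your own crawlers (raise max_new_tokens in gen() for longer comments):

```python
# Example uses of gen() in the bot pipeline: classification and comment drafting.
LABELS = ["ai", "webdev", "crypto", "other"]  # placeholder label set

def classify_trend(article):
    prompt = (
        f"Classify the following article into one of {LABELS}. "
        f"Reply with the label only.\n\n{article}\n\nLabel:"
    )
    return gen(prompt)

def draft_comment(article):
    prompt = f"Write a short, friendly comment reacting to this article:\n\n{article}\n\nComment:"
    return gen(prompt)

sample = "Example article text about a new open-source coding model ..."
print(classify_trend(sample))
print(draft_comment(sample))
```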
8. Performance Tips
- Quantize to 4-bit nf4 (--quantize bitsandbytes-nf4) to keep the 7B model inside the 8 GB of VRAM.
- Batch requests: TGI batches concurrent requests on the server, so send prompts in parallel rather than one at a time (see the sketch after this list).
- Tune sampling parameters such as temperature, top_p, and max_new_tokens per task.
- Add retrieval-augmented generation (RAG) with a vector database such as Qdrant for domain-specific context.
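As a sketch of the batching and sampling tips above: TGI groups whatever arrives concurrently, so the client side only needs to fire requests in parallel and pass per-task sampling parameters. The values below are illustrative starting points, not benchmarks:

```python
# Send several prompts in parallel so TGI's server-side batching can group them,
# and tune sampling per request via the "parameters" field.
from concurrent.futures import ThreadPoolExecutor
import requests

def generate(prompt, temperature=0.7, max_new_tokens=64):
    resp = requests.post(
        "http://localhost:5000/generate",
        json={
            "inputs": prompt,
            "parameters": {
                "do_sample": True,
                "temperature": temperature,
                "top_p": 0.9,
                "max_new_tokens": max_new_tokens,
            },
        },
        timeout=120,
    )
    data = resp.json()
    return (data[0] if isinstance(data, list) else data)["generated_text"]

prompts = ["Summarize: ...", "Classify: ...", "Write a comment about: ..."]
with ThreadPoolExecutor(max_workers=4) as pool:
    for text in pool.map(generate, prompts):
        print(text.strip())
```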
9. Conclusion
You now own an LLM inference stack on Windows 11—fast, private, and unlimited.