Why does my GPU matter more than RAM for LLMs?

LLM inference depends heavily on VRAM (GPU memory) because the model weights must sit in GPU memory for fast inference. You can offload some layers to system RAM, but this slows generation dramatically, often to 1-2 tokens per second versus 20-100+ with the whole model on the GPU. Enough VRAM to hold the entire model keeps text generation smooth and fast.

What is MoE and why does it matter for VRAM?

Mixture of Experts (MoE) models like Qwen 3.6-35B-A3B have many total parameters but only activate a fraction per token (e.g., 3B out of 35B). This means faster inference speed, but all parameters still need to be loaded into VRAM. MoE models give you the knowledge of a larger model at the speed of a smaller one — but you still need VRAM for the full parameter set.

🧠 AI Model Checker

Check if your hardware can run local AI models — LLMs, image generators, and more

How to Use the AI Model Checker

Running local AI models on your own hardware gives you complete privacy, no API costs, and unlimited usage. But not every computer can handle every model — the AI Model Checker helps you find out which LLMs and image generation models are compatible with your GPU and system RAM before you spend time downloading multi-gigabyte model files.

Understanding VRAM and RAM Requirements

Every AI model has two key memory requirements: VRAM (video memory on your GPU) and system RAM. VRAM is the most critical factor because the model weights must be loaded into GPU memory for fast inference. If your GPU doesn't have enough VRAM, the model will either fail to load or run extremely slowly by offloading to system RAM. System RAM serves as a secondary requirement — even with enough VRAM, you need sufficient system memory for the OS, model context, and other processes.
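The offloading trade-off can be made concrete with a small sketch. It estimates how many transformer layers fit on the GPU, under two simplifying assumptions that are not from the text above: all layers are the same size, and a fixed amount of VRAM (the hypothetical `reserve_gb`) is held back for context and runtime overhead. Real runtimes use more detailed accounting.

```python
def layers_on_gpu(model_gb: float, n_layers: int, vram_gb: float,
                  reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM; the rest would be offloaded to RAM.

    Assumes equally sized layers and a fixed reserve (both illustrative).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Hypothetical 14 GB model with 32 layers
for vram in (8, 24):
    on_gpu = layers_on_gpu(model_gb=14, n_layers=32, vram_gb=vram)
    print(f"{vram} GB VRAM: {on_gpu}/32 layers on GPU, {32 - on_gpu} offloaded")
```

Any layer that does not fit runs from system RAM, which is where the slowdown described above comes from.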

Quantization Explained: Q4 vs FP16

Quantization is a technique that reduces the numeric precision of model weights to save memory. FP16 (16-bit) stores each weight in 2 bytes, half the size of 32-bit (FP32) weights, and provides higher quality output but requires significantly more VRAM. A Q4 (4-bit) quantized model needs roughly 25% of the FP16 footprint, making it the most popular choice for local inference. For most users, Q4 quantized models offer the best balance of quality and performance: the quality loss is minimal (95-98% retention) while the memory savings are substantial.
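The arithmetic behind these ratios is simple enough to sketch. The helper below (a hypothetical name, not part of the checker) estimates memory for the weights alone as parameters times bits per weight. It ignores context memory and runtime overhead, and real 4-bit formats store extra scaling metadata, so actual files are somewhat larger than the Q4 figure here.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (10^9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Illustrative 7B-parameter model at three precisions
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B at {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

For a 7B model this gives about 14 GB at FP16 and about 3.5 GB at Q4, which is the one-quarter ratio described above.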

MoE Models: More Knowledge, Less Compute

Mixture of Experts (MoE) models like Qwen 3.6-35B-A3B and Gemma 4 27B MoE carry many total parameters but only activate a fraction per token. This means you get the knowledge depth of a larger model at the inference speed of a smaller one. However, all parameters still need to be loaded into VRAM — so MoE models need more memory than their "active" parameter count suggests, but less compute during generation.
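A minimal sketch of that split, using the 35B total and 3B active figures quoted for Qwen 3.6-35B-A3B: memory is driven by the total parameter count, while per-token compute follows the active count. The class and method names are illustrative, and the estimate covers weights only.

```python
from dataclasses import dataclass


@dataclass
class MoEModel:
    name: str
    total_params_b: float   # all experts, in billions; must fit in memory
    active_params_b: float  # parameters used per token, in billions

    def weight_memory_gb(self, bits_per_weight: float) -> float:
        """Weights-only memory estimate; driven by TOTAL parameters."""
        return self.total_params_b * bits_per_weight / 8

    def active_fraction(self) -> float:
        """Share of parameters that do work for each generated token."""
        return self.active_params_b / self.total_params_b


moe = MoEModel("Qwen 3.6-35B-A3B", total_params_b=35, active_params_b=3)
print(f"{moe.name} at Q4: ~{moe.weight_memory_gb(4):.1f} GB of weights")
print(f"Active per token: {moe.active_fraction():.0%} of parameters")
```

At Q4 the weights come to roughly 17.5 GB, even though only about 9% of the parameters are used for any one token.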

Apple Silicon Advantage

Apple's M-series chips use a unified memory architecture, where the GPU shares the same memory pool as the CPU. The M5 Max supports up to 128GB of unified memory with 614GB/s bandwidth, so most of that pool can act as VRAM for AI models (macOS keeps part of it for the system), far more than any consumer GPU offers. Even the base M5 with 16GB unified memory is capable of running smaller models. This makes Apple Silicon Macs exceptionally powerful machines for local AI inference, especially when paired with optimized frameworks like Ollama or llama.cpp with Metal acceleration.
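Memory bandwidth matters because generating each token requires reading the active model weights from memory once. A common rule of thumb, which is an assumption added here rather than something the checker states, treats bandwidth divided by bytes read per token as a rough upper bound on single-stream speed. The sketch below applies it to the 614GB/s figure above; real throughput is lower.

```python
def max_tokens_per_second(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Rough ceiling on generation speed when memory bandwidth is the limit.

    weights_read_gb is the size of the weights touched per token:
    the whole model for a dense model, only the active experts for MoE.
    """
    return bandwidth_gb_s / weights_read_gb


BANDWIDTH_GB_S = 614  # bandwidth figure quoted for the M5 Max

# Hypothetical dense 27B model at 4 bits per weight: 27 * 4 / 8 = 13.5 GB
dense_gb = 27 * 4 / 8
print(f"Dense 27B at Q4: at most ~{max_tokens_per_second(BANDWIDTH_GB_S, dense_gb):.0f} tokens/s")
```

The same formula suggests why MoE models generate faster: only the active parameters are read for each token, so the divisor shrinks.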

Step-by-Step Guide

1. Select your GPU: Choose your graphics card from the dropdown. If you're using an Apple Silicon Mac, select the appropriate M-series chip. If you don't know your GPU, choose "I don't know" and the checker will only use RAM requirements.
2. Select your system RAM: Choose your total system memory. This is the RAM installed in your computer, not GPU memory.
3. Click "Check My Hardware": The tool compares your specs against a database of popular AI models and shows which ones you can run.
4. Review results: Green (✅) means your hardware meets or exceeds recommended specs. Yellow (⚠️) means you meet minimum requirements but may experience slow performance. Red (❌) means your hardware is below minimum requirements.
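The green, yellow, and red logic described above can be sketched as follows. This is an illustration of the rule, not the tool's actual code: the function name, the dictionary keys, and the example thresholds are all hypothetical.

```python
from typing import Optional


def check_model(vram_gb: Optional[float], ram_gb: float, req: dict) -> str:
    """Classify hardware against one model's requirements.

    vram_gb=None models the "I don't know" GPU option: only RAM is compared.
    """
    def meets(level: str) -> bool:
        ram_ok = ram_gb >= req[f"{level}_ram_gb"]
        vram_ok = vram_gb is None or vram_gb >= req[f"{level}_vram_gb"]
        return ram_ok and vram_ok

    if meets("rec"):
        return "green"   # meets or exceeds recommended specs
    if meets("min"):
        return "yellow"  # meets minimum, may be slow
    return "red"         # below minimum requirements


# Hypothetical requirements for a single model entry
example_req = {"min_vram_gb": 6, "rec_vram_gb": 8, "min_ram_gb": 8, "rec_ram_gb": 16}
print(check_model(vram_gb=8, ram_gb=16, req=example_req))
print(check_model(vram_gb=6, ram_gb=16, req=example_req))
print(check_model(vram_gb=4, ram_gb=32, req=example_req))
```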

Tips for Running Local AI Models

Start with small models — If you're new to local AI, start with Qwen 3.5-4B or Phi-4-Mini. They run on modest hardware and still deliver impressive results.
Use Q4 quantization — Unless you have abundant VRAM, always opt for Q4 quantized models. The quality difference is negligible for most use cases.
Close other apps — Free up as much VRAM and RAM as possible before launching a model. Browsers with many tabs can consume several GB of RAM.
Try Ollama first — Ollama is the easiest way to get started with local LLMs. One install command, one run command, and you're chatting with a model.
Consider MoE for speed — If you have 16-24GB VRAM, MoE models like Gemma 4 27B MoE or Qwen 3.6-35B-A3B offer an excellent speed-to-quality ratio.
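Once Ollama is installed and a model has been downloaded, it can also be called from code through its local HTTP API. The sketch below uses only the Python standard library and assumes Ollama is running on its default port, 11434. The function names are illustrative, and the model name must be one you have already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default local address


def build_generate_request(model: str, prompt: str,
                           host: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, host: str = OLLAMA_URL) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt, host)) as resp:
        return json.loads(resp.read())["response"]
```

Calling `generate("<model-name>", "Explain VRAM in one sentence.")` returns the model's reply as a string. The call fails with a connection error if the Ollama server is not running.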

Frequently Asked Questions

Can I run local LLMs on an Apple Silicon Mac?

Yes. Apple Silicon Macs (M1/M2/M3/M4/M5) use unified memory, which means the GPU can use most of the system RAM as VRAM. This makes them surprisingly capable for running local LLMs. An M5 Pro with 64GB unified memory can comfortably run Qwen 3.6-27B or Gemma 4 27B MoE at Q4 quantization. Even the base M5 with 16GB unified memory handles smaller models well. For the best experience, use Ollama or MLX-optimized models.

What is quantization, and how do Q4 and FP16 differ?

Quantization reduces model precision to save memory. FP16 (16-bit) stores each weight in 2 bytes and provides better quality but requires much more VRAM. Q4 (4-bit) uses about 25% of the FP16 size, making it the most popular format for local inference. For most users, Q4 quantized models offer the best balance of quality and performance: quality retention is typically 95-98%, with minimal perceptible difference.

What is MoE and why does it matter for VRAM?

Mixture of Experts (MoE) models like Qwen 3.6-35B-A3B have many total parameters but only activate a fraction per token (e.g., 3B out of 35B). This means faster inference, sometimes 100+ tokens/second on consumer GPUs, but all parameters still need to be loaded into VRAM. MoE models give you the knowledge of a larger model at the speed of a smaller one, but you still need enough VRAM for the full parameter set.

What software do I need to run local AI models?

Popular options include Ollama (easiest, one-command install), llama.cpp (lightweight, cross-platform), LM Studio (GUI-based), and MLX (optimized for Apple Silicon). On NVIDIA GPUs, Ollama, llama.cpp, and LM Studio support CUDA acceleration. For image generation, ComfyUI and Automatic1111 are the leading tools. Most of these tools are free and open source.