Running large LLMs on small hardware: Gemma 4 12B on a VRAM-constrained Radeon laptop

Google released Gemma 4 12B today. I’m a huge fan of the Gemma model family: the models have improved with each iteration and consistently perform on par with larger ones. It didn’t run at first because it needs more VRAM than my laptop has, but there’s a workaround. Here are short instructions for running it with Ollama on Ubuntu:

1. Download the model. I prefer the Q5_K_M quantisation from the Unsloth Hugging Face repository, since the official Google repo lacks direct GGUF support. People say that 5-bit (Q5) is the “value for money” sweet spot [missing citation], specifically when paired with the K_M mix, which keeps critical tensors, such as some of the attention weights, at a higher precision while compressing less sensitive tensors harder to minimise file size.
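As a quick sanity check on what "5-bit" means in practice, here is a back-of-the-envelope sketch. It assumes that "12B" means roughly 12 billion parameters and uses the file size of roughly 8.4 GB mentioned later in this post. It also ignores GGUF metadata, so treat the result as an approximation. The function name is made up for illustration.

```python
def effective_bits_per_weight(file_size_bytes, n_params):
    """Average number of bits stored per parameter in a quantised file."""
    return file_size_bytes * 8 / n_params


# Assumptions: ~8.4 GB file (1 GB = 1e9 bytes) and ~12e9 parameters.
bpw = effective_bits_per_weight(8.4e9, 12e9)
print(f"{bpw:.1f} bits per weight")  # prints "5.6 bits per weight"
```

The average lands above 5 bits, which fits the idea that a K_M file mixes 5-bit tensors with some stored at a higher precision.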

2. Create a model file (mine is named gemma4-12b-it-Q5_K_M.modelfile):

# 1. Point to your local GGUF file
FROM ./gemma-4-12b-it-Q5_K_M.gguf
# 2. Set runtime parameters (Recommended for Gemma 4)
PARAMETER temperature 1.0
PARAMETER num_ctx 8192
PARAMETER num_gpu 30
# 3. Optional
#SYSTEM """You are a highly capable AI assistant. Answer concisely and accurately."""

		

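Without the num_gpu parameter, the model won’t load on my laptop. num_gpu sets how many of the model’s layers Ollama offloads to the GPU; the remaining layers stay in system RAM and run on the CPU. The model file is roughly 8.4 GB, and my initial attempt to load it crashed immediately with a fatal 500 Internal Server Error (cudaMalloc/ROCm0 out of memory) because Ollama tried to allocate every single layer in VRAM. You might have to experiment with different num_gpu values for your hardware.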
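To pick a starting point for that experimentation, a rough estimate can help. The sketch below assumes the weights are spread evenly across layers and that a fixed amount of VRAM must stay free for the context cache, compute buffers and the desktop. Both are simplifications. The layer count, VRAM size and reserve in the example are hypothetical placeholders, not measured values for Gemma 4 12B or my laptop.

```python
import math


def estimate_num_gpu(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Rough upper bound on how many layers fit in VRAM.

    Assumes equally sized layers; reserve_gb is a guessed allowance
    for the context cache, compute buffers and the desktop session.
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Hypothetical example: 8.4 GB model, 48 layers, 8 GB of VRAM.
print(estimate_num_gpu(8.4, 48, 8.0))  # prints 37
```

This only suggests where to start looking for a value that loads without crashing. It says nothing about which value is fastest, so it still needs to be checked on the real machine.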

3. Create the model with:
ollama create gemma4:12b-it-Q5_K_M -f gemma4-12b-it-Q5_K_M.modelfile

Closing notes

I don’t think there is a single perfect value for num_gpu. The largest value that doesn’t crash makes the model load more slowly; I assume the bottleneck is transferring data to VRAM. I also noticed that larger values reduce the number of CPU cores used, which had a measurable impact on a simple benchmark I ran. Believe it or not, although the highest value for num_gpu that didn’t crash the model was 40 (OK, I didn’t test every single value from 0 to 41), the benchmark ran fastest with a value of 10.
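If you want to repeat this kind of comparison, here is a minimal sketch of a sweep over num_gpu values. It is not the benchmark I ran. It assumes Ollama is listening on its default local port and overrides num_gpu per request through the options field of the /api/generate endpoint, so the Modelfile does not need editing between runs. Generation speed comes from the eval_count and eval_duration fields of the response. The function names and the prompt are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def build_payload(model, prompt, num_gpu):
    """Request body that overrides num_gpu for this one call."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_gpu": num_gpu},
    }


def tokens_per_second(response):
    """Generation speed; eval_duration is reported in nanoseconds."""
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def run_once(model, prompt, num_gpu, timeout=600):
    """Send one non-streaming generate request and return the parsed JSON."""
    data = json.dumps(build_payload(model, prompt, num_gpu)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)


def sweep(model, prompt, values):
    """Map each num_gpu value to tokens per second, or None on failure."""
    results = {}
    for n in values:
        try:
            results[n] = tokens_per_second(run_once(model, prompt, n))
        except Exception as exc:  # e.g. HTTP 500 when VRAM runs out
            results[n] = None
            print(f"num_gpu={n}: failed ({exc})")
    return results


# Example call (needs a running Ollama server):
# print(sweep("gemma4:12b-it-Q5_K_M", "Explain VRAM in two sentences.", [10, 20, 30, 40]))
```

A single run per value is noisy, and the first request after a change of num_gpu includes the model reload. Repeating each value a few times and discarding the first result would give a fairer picture.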