Q: What quantization of Llama 3.1 Nemotron 70B should I use on an NVIDIA GeForce RTX 2060 Super?

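The NVIDIA GeForce RTX 2060 Super has 8 GB of VRAM, well short of the roughly 42 GB that Q4_K_M needs, so the model runs only with most layers offloaded to system RAM. Within that constraint, Q4_K_M is the recommended variant, at an estimated ~3 tokens/sec.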

Q: How fast does Llama 3.1 Nemotron 70B run on the NVIDIA GeForce RTX 2060 Super?

Roughly 3 tokens/sec for Q4_K_M. Real speed depends on context length, backend (Ollama, llama.cpp, LM Studio), and KV cache size.
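
To see how context length drives KV cache size, here is a minimal sketch. It assumes the published Llama 3.1 70B architecture (80 layers, 8 key/value heads, head dimension 128), which Nemotron 70B is expected to share, and an fp16 cache. These defaults are assumptions, not values from this page, and real backends add their own overhead.

```python
def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Approximate KV cache size in bytes.

    The factor of 2 covers keys and values, stored per layer and per token.
    The architecture defaults are assumptions (Llama 3.1 70B config, fp16 cache).
    """
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_len


for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"context {ctx:>6}: ~{gib:.2f} GiB KV cache")
```

Under these assumptions the cache grows linearly with context, so a long context can claim a large share of an 8 GB card before any model weights are loaded.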

Q: What if the NVIDIA GeForce RTX 2060 Super is not enough for Llama 3.1 Nemotron 70B?

Consider upgrading to hardware with more memory, such as an Apple M4 Pro with 48 GB of unified memory, which meets the recommended 48 GB target. Alternatively, pick a smaller quantization to stay on your current card.

Q: Can I run Llama 3.1 Nemotron 70B on an NVIDIA GeForce RTX 2060 Super?

Sort of. The NVIDIA GeForce RTX 2060 Super can run Llama 3.1 Nemotron 70B (Q4_K_M) only by spilling layers to system RAM, which makes this a CPU + GPU hybrid setup. The card's 8 GB of VRAM is far below the 42 GB minimum, but 64 GB of system RAM is enough to hold the remaining layers. Expect significantly slower inference than on a GPU that fits the whole model.
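
A rough way to gauge the hybrid split is to divide the model size evenly across its layers and see how many fit in VRAM. The sketch below takes the 42 GB and 8 GB figures from the answer above. The 80-layer count and the 1.5 GB reserve for KV cache and runtime overhead are assumptions, and `gpu_layer_split` is a hypothetical helper, not part of any backend.

```python
def gpu_layer_split(model_gb=42.0, vram_gb=8.0, n_layers=80, reserve_gb=1.5):
    """Estimate how many layers fit on the GPU versus in system RAM.

    Treats every layer as the same size (an approximation) and holds back
    reserve_gb of VRAM for KV cache and runtime overhead (an assumed figure).
    """
    per_layer_gb = model_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    gpu_layers = min(n_layers, int(usable_gb / per_layer_gb))
    return gpu_layers, n_layers - gpu_layers


gpu, cpu = gpu_layer_split()
print(f"GPU layers: {gpu}, CPU layers: {cpu}")
```

In llama.cpp the resulting GPU layer count would be passed via `--n-gpu-layers` (`-ngl`). Treat the estimate as a starting point and lower it if the backend reports out-of-memory errors.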


Can I Run Llama 3.1 Nemotron 70B on NVIDIA GeForce RTX 2060 Super?


Every Llama 3.1 Nemotron 70B quantization on NVIDIA GeForce RTX 2060 Super

Upgrade options that fit Llama 3.1 Nemotron 70B better