LLM Playground

Model

Gemma 4 31B QAT

TP=2 · 220K ctx · BF16 KV

VRAM

—

CONTEXT WINDOW

—

Presets

Sampling

temperature ⓘControls randomness in the output. At 0, the model always picks the single most likely next word (deterministic). Higher values make it consider less likely words, producing more creative and varied responses. Most models default to 0.6-0.8. Try 0 for factual Q&A, 1.0+ for creative writing.

top_k ⓘLimits the model to choosing from only the K most probable next words. At top_k=20, it ignores every word outside the top 20 candidates. Lower values make output more focused and predictable. Higher values allow more variety. Works alongside top_p — whichever is more restrictive wins.

top_p ⓘNucleus sampling. Instead of a fixed count (top_k), this keeps the smallest set of words whose combined probability reaches P. At 0.95, the model considers words until their probabilities add up to 95%, then ignores the rest. Lower values = more focused. At 1.0, nothing is filtered. More adaptive than top_k because the pool size adjusts based on how confident the model is.

min_p ⓘFilters out any word whose probability is less than min_p times the most likely word's probability. At 0.05, if the top word has 40% probability, anything below 2% (40% x 0.05) is cut. Unlike top_k/top_p, this adapts to the model's confidence — when it's sure, the pool shrinks; when it's uncertain, more options stay open. Set to 0 to disable.

repeat_penalty ⓘDiscourages the model from repeating the same words or phrases. At 1.0, no penalty is applied. Values above 1.0 make repeated tokens less likely — useful for preventing loops where the model says the same thing over and over. Too high (above 1.5) can make responses awkward by forcing unnatural word choices.

Context & Generation

num_ctx ⓘThe context window — how many tokens of conversation history the model can see at once. Larger values let it remember more of the conversation, but use more VRAM. Ollama runs on a single RTX 3090 (~23 GB available after SmartSearch). A 27B model fits with moderate context; 32B models are tight. 1 token is roughly 3/4 of a word.

num_predict ⓘMaximum number of tokens the model will generate in a single response. Set to -1 for unlimited (it stops when the model decides it's done). Set to -2 to fill the entire remaining context window. A specific number like 500 hard-caps the response length.

think ⓘEnables extended thinking mode (supported by qwen3 and deepseek-r1). The model reasons through the problem step-by-step in a hidden "thinking" block before writing its visible response. Produces better answers for complex questions but generates more tokens and takes longer. The thinking text appears in a separate bubble above the response.

stream ⓘWhen on, tokens appear in the chat as they are generated — you see the response being written word by word. When off, the entire response arrives at once after the model finishes. Streaming feels faster (you can start reading immediately) but the total generation time is the same either way.

Performance

num_gpu ⓘControls how many of the model's layers run on GPU vs CPU. At -1 (default), Ollama distributes layers across all available GPUs automatically (this server has 3× RTX 3090). Set to 0 to force everything onto CPU (very slow but useful for testing). A specific number puts exactly that many layers on GPU and the rest on CPU.

num_batch ⓘHow many tokens of your prompt the model processes at once during the initial "prompt eval" phase (before it starts generating). Higher values process the prompt faster but use more memory. The default of 512 is a good balance. You would mainly change this if prompt evaluation is slow on very long inputs.

num_thread ⓘNumber of CPU threads used for the CPU portion of inference. At 0, Ollama auto-detects based on your processor. Only matters if some layers are running on CPU (either because the model doesn't fully fit in VRAM, or you set num_gpu to a specific number). More threads help up to your physical core count, then hurt.

◎

Connect and start chatting

Adjust parameters in the sidebar

◎

Connect and start chatting

Adjust parameters in the sidebar

LLM Playground

● LLM Playground

Model Manager