Model
VRAM
—
Presets
Sampling
temperature ⓘControls randomness in the output. At 0, the model always picks the single most likely next word (deterministic). Higher values make it consider less likely words, producing more creative and varied responses. Most models default to 0.6-0.8. Try 0 for factual Q&A, 1.0+ for creative writing.
top_k ⓘLimits the model to choosing from only the K most probable next words. At top_k=20, it ignores every word outside the top 20 candidates. Lower values make output more focused and predictable. Higher values allow more variety. Works alongside top_p — whichever is more restrictive wins.
top_p ⓘNucleus sampling. Instead of a fixed count (top_k), this keeps the smallest set of words whose combined probability reaches P. At 0.95, the model considers words until their probabilities add up to 95%, then ignores the rest. Lower values = more focused. At 1.0, nothing is filtered. More adaptive than top_k because the pool size adjusts based on how confident the model is.
min_p ⓘFilters out any word whose probability is less than min_p times the most likely word's probability. At 0.05, if the top word has 40% probability, anything below 2% (40% x 0.05) is cut. Unlike top_k/top_p, this adapts to the model's confidence — when it's sure, the pool shrinks; when it's uncertain, more options stay open. Set to 0 to disable.
repeat_penalty ⓘDiscourages the model from repeating the same words or phrases. At 1.0, no penalty is applied. Values above 1.0 make repeated tokens less likely — useful for preventing loops where the model says the same thing over and over. Too high (above 1.5) can make responses awkward by forcing unnatural word choices.
Context & Generation
num_ctx ⓘThe context window — how many tokens of conversation history the model can see at once. Larger values let it remember more of the conversation, but use more VRAM. On a 24GB GPU, qwen3:32b fits 100% on GPU up to ~17,664 tokens. Beyond that, it spills to CPU and slows down. 1 token is roughly 3/4 of a word.
num_predict ⓘMaximum number of tokens the model will generate in a single response. Set to -1 for unlimited (it stops when the model decides it's done). Set to -2 to fill the entire remaining context window. A specific number like 500 hard-caps the response length.
think ⓘEnables extended thinking mode (supported by qwen3 and deepseek-r1). The model reasons through the problem step-by-step in a hidden "thinking" block before writing its visible response. Produces better answers for complex questions but generates more tokens and takes longer. The thinking text appears in a separate bubble above the response.
stream ⓘWhen on, tokens appear in the chat as they are generated — you see the response being written word by word. When off, the entire response arrives at once after the model finishes. Streaming feels faster (you can start reading immediately) but the total generation time is the same either way.
Performance
num_gpu ⓘControls how many of the model's layers run on the GPU vs CPU. At -1 (default), Ollama puts as many layers on GPU as VRAM allows. Set to 0 to force everything onto CPU (very slow but useful for testing). A specific number like 30 puts exactly 30 layers on GPU and the rest on CPU — lets you manually balance the split.
num_batch ⓘHow many tokens of your prompt the model processes at once during the initial "prompt eval" phase (before it starts generating). Higher values process the prompt faster but use more memory. The default of 512 is a good balance. You would mainly change this if prompt evaluation is slow on very long inputs.
num_thread ⓘNumber of CPU threads used for the CPU portion of inference. At 0, Ollama auto-detects based on your processor. Only matters if some layers are running on CPU (either because the model doesn't fully fit in VRAM, or you set num_gpu to a specific number). More threads help up to your physical core count, then hurt.
◎
Connect to Ollama and start chatting
Adjust parameters in the sidebar