Generation parameters • Llamatik Documentation

Llamatik exposes runtime generation parameters through updateGenerateParams(...).

LlamaBridge.updateGenerateParams(
    temperature = 0.7f,
    maxTokens = 256,
    topP = 0.95f,
    topK = 40,
    repeatPenalty = 1.1f,
    contextLength = 4096,
    numThreads = 4,
    useMmap = true,
    flashAttention = false,
    batchSize = 512,
    gpuLayers = 0,
)

Sampling parameters#

temperature#

Controls randomness.

lower values: more deterministic
higher values: more varied and creative

maxTokens#

Sets the maximum number of tokens the model may generate. Use this to control response length and latency.

topP#

Nucleus sampling threshold. The model samples from the smallest set of likely tokens whose cumulative probability reaches topP.

topK#

Limits sampling to the K most likely next tokens.

repeatPenalty#

Discourages repeated phrases and loops. Useful for chat, summaries, and structured outputs.

Backend and memory parameters#

contextLength#

The model’s context window in tokens. Larger values allow longer conversations but require more memory. Must not exceed what the loaded model was compiled to support.

numThreads#

Number of CPU threads used during inference. Pass -1 to let the platform choose a sensible default. On mobile, matching the number of performance cores is a good starting point.

useMmap#

Whether to use memory-mapped I/O for the model weights. Enabled by default on most platforms. Turn it off if you encounter issues with certain file systems or need full memory ownership.

flashAttention#

Enables Flash Attention when supported by the backend. Can reduce memory usage and improve throughput on large context windows. Not supported on all platforms — verify with the underlying llama.cpp version in use.

batchSize#

The batch size used during prompt processing (the prefill phase). Larger values can improve throughput on long prompts at the cost of memory. A value of 512 works well for most use cases.

gpuLayers#

The number of transformer layers to offload to the GPU accelerator.

0 — all computation runs on CPU (default)
-1 — all layers are offloaded to GPU (Metal on iOS/macOS, CUDA or Vulkan on Android/JVM where supported)
N (positive integer) — exactly N layers are offloaded; the rest remain on CPU

Offloading more layers increases throughput significantly on devices with a capable GPU. Start with -1 to offload everything, then reduce if you hit memory limits.

This parameter requires a model reload to take effect. On WASM, gpuLayers is silently ignored — there is no GPU offload path in the WebAssembly target.

Tuning advice#

Start from moderate values and test on a fixed prompt set.
Change one parameter at a time.
For extraction or JSON, lean toward lower temperature.
For brainstorming or creative tasks, slightly higher temperature can help.
For long chat history, increase contextLength and consider enabling flashAttention.
If memory is tight on mobile, reduce batchSize and contextLength.
On Metal-capable iOS/macOS devices, set gpuLayers = -1 to offload all layers and gain significant throughput.
On Android with a CUDA or Vulkan-capable GPU, the same -1 value applies; fall back to a specific layer count if you hit OOM errors.