LlamaBridge is the main multiplatform entry point for GGUF-based language models in Llamatik. It exposes a compact API for:

  • text generation
  • retrieval-friendly embeddings
  • streaming tokens
  • JSON and JSON Schema constrained output
  • generation parameter tuning
  • KV cache reuse and session persistence

Under the hood, the actual implementation is platform-specific, but the public API stays the same.

Platform support#

Current support in the library code:

  • Android: supported
  • iOS: supported
  • JVM / Desktop: supported
  • WASM: text generation is supported; embeddings and KV session persistence are not yet available, and the synchronous generateContinue() is unsupported in worker-only mode

Model path helper#

fun getModelPath(modelFileName: String): String

This returns a platform-usable path for a model file.

Typical usage:

val modelPath = LlamaBridge.getModelPath("qwen2.5-0.5b-instruct-q4_k_m.gguf")

Use this when your app stores or ships models differently across platforms and you want one common entry point before initialization.

Embeddings API#

fun initEmbedModel(modelPath: String): Boolean
fun embed(input: String): FloatArray

initEmbedModel(modelPath)#

Loads a model for embeddings. Returns true on success.

val ok = LlamaBridge.initEmbedModel(modelPath)
check(ok) { "Failed to initialize embedding model" }

embed(input)#

Computes a vector representation of the input text. This is useful for:

  • semantic search
  • document retrieval
  • clustering
  • reranking pipelines
  • PDF RAG and other retrieval workflows

val vector = LlamaBridge.embed("How do coroutines work?")
println("Embedding size = ${vector.size}")

Important notes:

  • Use the same embedding model for all vectors inside one index.
  • Initialize the embedding model before calling embed.
  • On WASM, embeddings are currently not implemented.
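
To illustrate how these vectors are typically consumed in semantic search, here is a minimal cosine-similarity sketch for ranking documents against a query. The helper functions are illustrative and not part of LlamaBridge:

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors of equal length.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Vectors must come from the same embedding model" }
    var dot = 0f
    var normA = 0f
    var normB = 0f
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    return dot / (sqrt(normA) * sqrt(normB))
}

// Rank candidate texts against a query, most similar first.
fun rank(query: String, docs: List<String>): List<Pair<String, Float>> {
    val q = LlamaBridge.embed(query)
    return docs
        .map { it to cosineSimilarity(q, LlamaBridge.embed(it)) }
        .sortedByDescending { it.second }
}
```

In a real index you would embed documents once and store the vectors, rather than re-embedding on every query.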

Generation model initialization#

fun initGenerateModel(modelPath: String): Boolean

Loads a generation model and prepares it for inference.

val ok = LlamaBridge.initGenerateModel(modelPath)
check(ok) { "Failed to initialize generation model" }

Call this before any generation method.

One-shot generation#

fun generate(prompt: String): String

Generates a full response from a single prompt. This is the simplest API and a good default when each request is independent.

val answer = LlamaBridge.generate("Explain Kotlin Multiplatform in one paragraph.")
println(answer)

generate(...) starts a fresh generation flow. If you are building a chat interface and want continuity across turns, prefer the session APIs described below.

Context-aware generation#

fun generateWithContext(systemPrompt: String, contextBlock: String, userPrompt: String): String

Use this when you want clearer structure between instructions, external context, and user input.

val result = LlamaBridge.generateWithContext(
    systemPrompt = "You are a concise technical assistant.",
    contextBlock = "Project: Llamatik is a Kotlin Multiplatform library for on-device AI.",
    userPrompt = "Write a short summary for the README."
)

This is useful for:

  • RAG results
  • chat prompts with a fixed system role
  • structured prompting where you want deterministic sections
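
For the RAG case, the contextBlock is typically assembled from retrieved chunks before the call. A minimal sketch, where retrievedChunks, userQuestion, and the separator are illustrative, not part of the API:

```kotlin
// Join retrieved document chunks into a single context block (illustrative).
fun buildContextBlock(chunks: List<String>): String =
    chunks.joinToString(separator = "\n---\n")

val answer = LlamaBridge.generateWithContext(
    systemPrompt = "Answer only from the provided context.",
    contextBlock = buildContextBlock(retrievedChunks),
    userPrompt = userQuestion
)
```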

JSON generation#

fun generateJson(prompt: String, jsonSchema: String? = null): String
fun generateJsonWithContext(
    systemPrompt: String,
    contextBlock: String,
    userPrompt: String,
    jsonSchema: String? = null
): String

These methods instruct the model to return JSON output. If you provide a JSON Schema, generation is additionally constrained to that schema.

val schema = """
{
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "title": { "type": "string" },
    "priority": { "type": "integer" }
  },
  "required": ["title", "priority"]
}
""".trimIndent()

val json = LlamaBridge.generateJson(
    prompt = "Return one task object.",
    jsonSchema = schema
)

This is especially useful when parsing model output into Kotlin data classes.
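
For example, assuming kotlinx.serialization is on your classpath, the schema-constrained output above can be decoded into a typed value (the Task class mirrors the schema and is our own):

```kotlin
import kotlinx.serialization.Serializable
import kotlinx.serialization.json.Json

@Serializable
data class Task(val title: String, val priority: Int)

// ignoreUnknownKeys guards against extra fields when no schema is supplied.
val parser = Json { ignoreUnknownKeys = true }
val task: Task = parser.decodeFromString<Task>(json)
println("${task.title} (priority ${task.priority})")
```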

Streaming APIs#

fun generateStream(prompt: String, callback: GenStream)
fun generateStreamWithContext(systemPrompt: String, contextBlock: String, userPrompt: String, callback: GenStream)
fun generateJsonStream(prompt: String, jsonSchema: String? = null, callback: GenStream)
fun generateJsonStreamWithContext(
    systemPrompt: String,
    contextBlock: String,
    userPrompt: String,
    jsonSchema: String? = null,
    callback: GenStream
)

These are the streaming equivalents of the one-shot APIs above. They are the best choice for chat interfaces because the UI can render tokens incrementally.

LlamaBridge.generateStream(
    prompt = "Write a haiku about local AI.",
    callback = object : GenStream {
        override fun onDelta(text: String) = print(text)
        override fun onComplete() = println("\nDone")
        override fun onError(message: String) = println("Error: $message")
    }
)
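
If your app is coroutine-based, the callback API can be adapted to a cold Flow. This wrapper is a sketch assuming kotlinx.coroutines is available; it is not part of LlamaBridge:

```kotlin
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow

// Adapt the callback-based streaming API to a Flow of token deltas.
fun generateFlow(prompt: String): Flow<String> = callbackFlow {
    LlamaBridge.generateStream(prompt, object : GenStream {
        override fun onDelta(text: String) { trySend(text) }
        override fun onComplete() { close() }
        override fun onError(message: String) { close(IllegalStateException(message)) }
    })
    // Ask the native side to stop if the collector cancels.
    awaitClose { LlamaBridge.nativeCancelGenerate() }
}
```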

Lambda-friendly streaming helper#

fun generateWithContextStream(
    system: String,
    context: String,
    user: String,
    onDelta: (String) -> Unit,
    onDone: () -> Unit,
    onError: (String) -> Unit
)

A convenience wrapper around the callback-based context streaming API, for callers who prefer lambdas over implementing GenStream.

LlamaBridge.generateWithContextStream(
    system = "You are helpful.",
    context = "Product: Llamatik",
    user = "Write a short tagline.",
    onDelta = { print(it) },
    onDone = { println("\nDone") },
    onError = { println("Error: $it") }
)

KV cache and sessions#

fun sessionReset(): Boolean
fun sessionSave(path: String): Boolean
fun sessionLoad(path: String): Boolean
fun generateContinue(prompt: String): String

These methods are designed for multi-turn interactions.

generateContinue(prompt)#

Continues generation using the current KV cache instead of starting from scratch.

LlamaBridge.initGenerateModel(modelPath)

val first = LlamaBridge.generate("Explain Kotlin coroutines")
val second = LlamaBridge.generateContinue("Now show a short example")

sessionSave(path) and sessionLoad(path)#

Persist the current KV/session state to disk and restore it later.

LlamaBridge.sessionSave(sessionPath)

LlamaBridge.initGenerateModel(modelPath)
LlamaBridge.sessionLoad(sessionPath)

val resumed = LlamaBridge.generateContinue("Continue the explanation")

sessionReset()#

Clears the current session while keeping the model loaded. This is useful when you want to start a new conversation without paying the full model load cost again.
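
For example, to start a fresh conversation with the already-loaded model:

```kotlin
// Clear the KV/session state, keep the model in memory.
LlamaBridge.sessionReset()
val fresh = LlamaBridge.generate("Let's switch topics: what is structured concurrency?")
```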

Important notes:

  • Session files are only valid with the same model (and compatible runtime configuration) that produced them.
  • On WASM, session persistence is currently not implemented.
  • If no active session exists yet, generateContinue() behaves like a fresh generation.

Runtime controls#

fun nativeCancelGenerate()
fun updateGenerateParams(
    temperature: Float,
    maxTokens: Int,
    topP: Float,
    topK: Int,
    repeatPenalty: Float,
)
fun shutdown()

nativeCancelGenerate()#

Requests cancellation of the current generation. Useful when the user taps Stop or leaves the screen.
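
A typical pattern is to call it from a stop action while a stream is in flight (the handler name here is illustrative):

```kotlin
fun onStopClicked() {
    // Request cancellation of the in-flight generation.
    LlamaBridge.nativeCancelGenerate()
}
```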

updateGenerateParams(...)#

Updates the runtime generation parameters used by the native backend.

LlamaBridge.updateGenerateParams(
    temperature = 0.7f,
    maxTokens = 256,
    topP = 0.95f,
    topK = 40,
    repeatPenalty = 1.1f,
)

shutdown()#

Releases native resources when you are done with the bridge. In long-lived apps this is usually called when the feature or process is shutting down, not after every request.