- Use efficient fine‑tuning (PEFT, LoRA) and on‑device stacks like LiteRT to adapt LLMs cost‑effectively.
- Combine model‑level, system‑level, online and offline evaluations with diverse metrics and human review.
- Instrument full observability with Prometheus, OpenTelemetry and GPU metrics to monitor latency, tokens and safety.
- Integrate LLMOps, benchmarking loops and strict privacy controls to run LLMs reliably in production.
Large Language Models (LLMs) are moving from cool demos to mission‑critical infrastructure, and that changes everything about how we program, evaluate and operate them. Once your chatbot is helping doctors, lawyers or logistics teams make real decisions, you can no longer treat the model as a black box that just “seems smart enough” without assessing its limits and biases. You need a disciplined way to trace every request, measure quality, control cost and prove that the system behaves safely over time.
This guide brings together three pillars that usually live in separate documents – fine‑tuning strategies, evaluation frameworks and production observability – and blends them into a single programming playbook. We will walk through how to choose between full fine‑tuning and parameter‑efficient fine‑tuning, how to design robust LLM evaluations (online and offline, model and system level), how to instrument tracing and metrics with OpenTelemetry and Prometheus, and how to wire all of that into a continuous, business‑aware workflow.
Fine‑tuning strategies for LLMs: full vs PEFT and LoRA
When you adapt a pre‑trained LLM to your own use case, the first architectural choice is how many parameters you are actually going to touch, because that decision drives hardware needs, training time, cost and even how you deploy the model in production.
Full fine‑tuning means you update the entire parameter set of the base LLM during training, which is only realistic when you have a large, high‑quality, task‑specific dataset and serious compute. This approach is useful if your domain data diverges strongly from the original pre‑training corpus – for example, a legal assistant trained on jurisdiction‑specific case law or a clinical support tool for specialized medical subfields.
Parameter‑Efficient Fine‑Tuning (PEFT) is a more surgical way to specialize a model by freezing the original weights and adding small, trainable components, such as low‑rank adaptation modules. Instead of rewriting every page of a 1,000‑page textbook, you are essentially attaching a stack of annotated post‑its with domain knowledge. Training focuses on these extra parameters, which keeps GPU memory usage and wall‑clock time dramatically lower.
LoRA (Low‑Rank Adaptation) and QLoRA are the most widely used PEFT techniques today, injecting low‑rank matrices into key attention projections so you can adapt behavior with a modest number of additional parameters. QLoRA layers quantization tricks on top to push memory usage down further, enabling fine‑tuning of surprisingly large models on a single GPU or even prosumer hardware while still achieving competitive quality.
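To make this concrete, here is a minimal QLoRA‑style loading sketch using Hugging Face transformers and bitsandbytes; the model id and quantization settings are illustrative assumptions, not a recommendation:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QLoRA-style loading: the frozen base model is quantized to 4-bit NF4 so
# that a multi-billion-parameter model fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b",            # illustrative model id; use your own base
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters are then attached on top (see the LoRA section below), and
# only those small adapter matrices are trained.
```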
Running and configuring LLMs on device with LiteRT & MediaPipe
Not every LLM deployment needs a cluster of GPUs in the cloud; sometimes you want the model running entirely on device, whether for latency, privacy, offline usage or cost reasons. This is where the LiteRT and MediaPipe LLM Inference stack comes into play.
The MediaPipe LLM Inference API lets you run text‑to‑text LLMs directly in browsers and mobile apps, generating text, summarizing documents or answering questions without sending prompts to a remote server. Models published in the LiteRT Community already come in a compatible format, so you avoid lengthy custom conversion steps, and you can serve them from your app bundle or local storage.
When configuring the LLM Inference task, you control behavior through a handful of core options:
- modelPath: where the LiteRT model lives in your project.
- maxTokens: the total input plus output tokens allowed for a single call.
- topK: how many candidate tokens are considered at each generation step.
- temperature: how much randomness versus determinism you want in sampling.
- randomSeed: a fixed seed for reproducible generations.
- resultListener and errorListener: optional callbacks for asynchronous usage.
Beyond vanilla generation, the API supports selecting between multiple models and applying LoRA adapters for custom behavior, so you can ship a compact base model plus several LoRA heads tuned for different domains (for example, customer support, summarization, or code review) and switch them dynamically at runtime on GPU‑enabled devices.
Choosing and using open LLM families (Gemma & friends)
For on‑device and lightweight deployments, small open models like the Gemma family and compact Gemma‑2 variants are particularly attractive, because they strike a practical balance between capability and resource requirements.
Gemma‑3n E2B and E4B are designed specifically for constrained hardware, using selective parameter activation so that only a subset of parameters is active per token. In practice, this gives you the quality of a much larger model while presenting an “effective” parameter count (the “E” in E2B and E4B) closer to 2B or 4B, which is far more manageable for mobile GPUs and browser environments.
Gemma‑3 1B is an even leaner option, with roughly one billion open weights packaged in LiteRT‑ready formats (such as .task and .litertlm) for Android and web. When deploying it with the LLM Inference API, you typically choose between CPU and GPU backends, ensure that maxTokens matches the context length baked into the model, and keep numResponses at 1 on the web side for predictable performance.
Gemma‑2 2B pushes reasoning quality for its size class while still staying small enough to run widely, and serves as a strong baseline for on‑device assistants or specialized domain agents, especially when combined with LoRA adapters and careful evaluation.
Converting PyTorch LLMs to LiteRT and packaging them
If you are starting from a PyTorch generative model, you can convert it into a MediaPipe‑compatible LiteRT artifact with the LiteRT Torch Generative tooling, which handles the graph translation, quantization and signature export needed for efficient on‑device inference.
The high‑level workflow looks like this: download your PyTorch checkpoints, run the LiteRT Torch Generative conversion to produce a .tflite file, and then create a task bundle that combines this model file with tokenizer parameters and metadata. The bundler script (via mediapipe.tasks.python.genai.bundler) takes a configuration object that includes the TFLite path, the SentencePiece tokenizer, start and stop tokens, and the desired output filename.
Because this conversion performs CPU‑targeted optimizations and can be memory‑intensive, you typically need a Linux machine with at least 64 GB of RAM, and you will also want to install the right MediaPipe version from PyPI to get the bundling script. The output is a self‑contained task package that your Android or web app can consume through the LLM Inference API without extra glue code.
Inside the bundling configuration you specify all runtime‑critical elements such as tokenizer models, control tokens and output paths, so that the final artifact includes every piece required for end‑to‑end inference, keeping deployment reproducible and making it easier to test various versions in CI/CD.
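As a sketch based on the bundler’s documented configuration, the bundling step might look like the following; the file names and control tokens are placeholders for your model’s actual values:

```python
from mediapipe.tasks.python.genai import bundler

config = bundler.BundleConfig(
    tflite_model="gemma_q8.tflite",      # .tflite file from the conversion step
    tokenizer_model="tokenizer.model",   # SentencePiece tokenizer parameters
    start_token="<bos>",                 # control tokens for your model
    stop_tokens=["<eos>"],
    output_filename="gemma.task",        # bundle the LLM Inference API loads
)
bundler.create_bundle(config)
```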
LoRA customization: from training to on‑device inference
LoRA is not just a training trick; you also have to think through how those low‑rank adapters are represented and loaded in your inference stack, especially when you want to apply them selectively on GPU‑backed devices.
During training, you typically rely on libraries like PEFT to define the LoRA configuration for supported architectures such as Gemma or Phi‑2, pointing the adapter to attention‑related modules only. For Gemma, that often means wrapping q_proj, k_proj, v_proj and o_proj; for Phi‑2, the common pattern is to adapt attention projections plus the main dense layer. The rank r in LoraConfig controls how many new parameters you add and therefore the expressive capacity of the adapter.
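A minimal PEFT sketch of this configuration for a Gemma‑style model follows; the model id, rank and alpha are illustrative, and for Phi‑2 you would swap in its module names:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("google/gemma-2b")  # illustrative

lora_config = LoraConfig(
    r=8,                    # rank: controls adapter capacity and size
    lora_alpha=16,
    # Gemma pattern: adapt only the attention projections
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # adapters are a tiny fraction of the base
# After training, model.save_pretrained("out/") writes adapter_model.safetensors
```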
After fine‑tuning on your dataset, the resulting checkpoint is stored as an adapter_model.safetensors file, which holds only the LoRA weights. To push this into your MediaPipe pipeline, you convert the adapter to a LoRA‑specific TFLite file using the MediaPipe converter, passing a ConversionConfig that includes the base model options, a GPU backend (LoRA support is GPU‑only here), the LoRA checkpoint path, the chosen rank and the name of the output TFLite file.
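Sketched against the converter’s documented configuration, that conversion might look like this; all paths and the model type are placeholders:

```python
from mediapipe.tasks.python.genai import converter

config = converter.ConversionConfig(
    input_ckpt="gemma-2b/",                  # base model checkpoint directory
    ckpt_format="safetensors",
    model_type="GEMMA_2B",
    backend="gpu",                           # LoRA conversion requires GPU
    output_dir="converted/",
    combine_file_only=False,
    vocab_model_file="gemma-2b/tokenizer.model",
    output_tflite_file="converted/gemma_base.tflite",
    lora_ckpt="adapter_model.safetensors",   # LoRA-only weights from training
    lora_rank=8,                             # must match the training rank
    lora_output_tflite_file="converted/gemma_lora.tflite",
)
converter.convert_checkpoint(config)
```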
The conversion step produces two flatbuffers: one for the frozen base LLM and one for the LoRA overlay, and both are required at inference time. On Android, for example, you initialize the LLM Inference task by pointing modelPath to the base model artifact and loraPath to the LoRA TFLite file, plus typical generation parameters like maxTokens, topK, temperature and randomSeed.
From the app developer’s point of view, running a LoRA‑augmented model is transparent: you still call generateResponse() or its async variant, but under the hood the LoRA weights modulate attention, giving you domain‑specific behavior without shipping a huge, fully fine‑tuned model.
LLM temperature and decoding behavior in practice
Among decoding hyperparameters, temperature is the one that most directly shapes how “creative” or conservative your LLM feels, because it rescales the probability distribution over the next token during generation. A value of 1.0 uses the raw distribution; values below 1 sharpen it so that highly probable tokens become even more dominant, while values above 1 flatten it and give lower‑probability tokens a better chance.
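The effect is easy to see in a few lines; this toy sampler (pure NumPy, no model required) divides the logits by the temperature before applying the softmax:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Toy sampler showing how temperature rescales the next-token logits."""
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / max(temperature, 1e-6)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.5, -1.0]  # higher = more likely under the raw model
for t in (0.1, 1.0, 2.0):
    draws = [sample_next_token(logits, t) for _ in range(10)]
    print(f"temperature={t}: {draws}")
# At 0.1 nearly every draw is token 0; at 2.0 the draws spread out.
```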
At lower temperatures (for example 0.1-0.2) the model behaves almost deterministically, returning very similar outputs for the same prompt and favoring safe, unsurprising completions. This is desirable in heavily regulated scenarios like legal summarization, medical reporting or financial explanations, where consistency, clarity and factual grounding matter more than stylistic flair.
Moderate temperatures around 0.7-0.9 tend to hit a sweet spot for chatbots and assistants that should sound human but still stay on track, injecting enough variation to avoid repetitive answers while usually preserving coherence. Many conversational products run in this range and combine temperature with constraints like maximum output tokens and safety filters.
Very high temperatures close to 2.0 make the model much more prone to incoherent or off‑topic generations, which might be fun in brainstorming toys but is rarely acceptable in critical workflows. As always, you tune temperature jointly with other sampling parameters (top‑k, top‑p, repetition penalties) and verify the impact through systematic evaluation, not intuition alone.
Why rigorous LLM evaluation is non‑negotiable
As organizations embed LLMs into workflows ranging from healthcare scheduling to legal triage and supply‑chain planning, the cost of bad outputs climbs rapidly – think hallucinated diagnoses, biased recommendations or toxic responses delivered at scale. That’s why evaluation cannot be an afterthought or a one‑off benchmark run; it has to become part of the culture and lifecycle of your AI systems.
LLM evaluation, at its core, is about systematically measuring how a model behaves along four dimensions: accuracy, efficiency, trustworthiness and safety, using a mix of quantitative metrics and human judgment. Done well, it gives developers and stakeholders a clear picture of strengths, weaknesses, failure modes and fit for purpose across different domains and user segments.
The benefits span multiple layers of the stack: you improve raw model performance, uncover and mitigate harmful biases, validate that answers remain grounded in reality, and verify that multilingual and domain‑specific behaviors meet expectations, all while tracking how these properties shift as you fine‑tune, update prompts, or roll out new model versions.
Because the same LLM can be repurposed for everything from playful chat to high‑stakes decision support, your evaluation strategy must be tightly aligned with business goals and risk tolerance, rather than relying solely on generic leaderboards or crowd‑sourced scores.
Key applications of LLM performance evaluation
One obvious use of evaluation is monitoring and improving baseline performance: how well the model understands instructions, interprets context and retrieves or composes relevant information, given the type of prompts your users actually send. Here you combine task‑specific metrics with domain‑tuned datasets to track progress over time.
Another critical area is bias detection and mitigation, since training data can encode societal prejudices that surface in generated outputs, producing unfair, one‑sided or discriminatory content. Regular evaluation passes using curated prompts and labeled examples help you surface these issues and iteratively reduce harmful behavior through data curation, fine‑tuning and safety policies.
Ground‑truth comparison is where you match model outputs against validated facts or expected answers, tagging each generation for correctness, completeness and relevance. Whether you use human annotators or automatic fact‑checking and retrieval‑based verification, this process reveals how often the model hallucinates, omits crucial details, or overstates its confidence.
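A minimal sketch of this kind of automated tagging uses token‑overlap F1 as a crude proxy for correctness and completeness; real pipelines would add normalization, semantic matching and human spot checks:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a generation and a ground-truth answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if not overlap:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

score = token_f1("Paris is the capital of France",
                 "The capital of France is Paris")
print(f"F1 = {score:.2f}")
```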
Model comparison is another practical application: when you are choosing between different LLM families or variants, you run the same evaluation battery across candidates to see which one offers the best trade‑off of accuracy, latency, cost and safety for your specific workload and domain, instead of relying on generic benchmark rankings.
Evaluation frameworks and metrics for LLMs
Enterprise‑grade evaluation rarely relies on a single number; instead, you assemble a toolkit of frameworks and metrics tailored to your tasks, blending context‑aware tests, human feedback, UX signals and standardized benchmarks when appropriate.
Context‑specific evaluation asks whether outputs actually match your domain, tone and risk profile, for instance by checking that a model deployed in schools avoids toxic content, misinformation and biased language, while a retail chatbot is judged more on resolution rate, tone of voice and product relevance. Typical metrics here include relevance, question‑answering accuracy, BLEU and ROUGE scores, toxicity ratings and hallucination frequency.
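For the reference‑based metrics, off‑the‑shelf tooling covers most needs; a sketch using Hugging Face’s evaluate library (assumes the evaluate and rouge_score packages are installed):

```python
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["The model summarizes the contract in plain language."]
references = ["The system produces a plain-language summary of the contract."]

print(rouge.compute(predictions=predictions, references=references))
# BLEU expects a list of reference lists per prediction.
print(bleu.compute(predictions=predictions,
                   references=[[r] for r in references]))
```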
User‑driven evaluation, often considered the gold standard, embeds human reviewers in the loop to score responses for coherence, usefulness, politeness and safety, which is particularly valuable for subtle issues that automated scores miss. The downside is cost and time, especially at scale, so you typically combine human reviews with automated triage.
UI/UX metrics complete the picture by focusing on how users experience the system rather than how it scores on a benchmark, tracking user satisfaction, frustration signals, perceived response time and how gracefully the model recovers from errors or misunderstandings. These signals map directly to business KPIs like retention and task success.
Generic comparative benchmarks such as MT‑Bench, AlpacaEval, MMMU or GAIA provide standardized question-answer sets for measuring broad capabilities, but they are inherently domain‑agnostic. They are great for high‑level sanity checks and cross‑model comparisons, yet they must be complemented with evaluations that reflect your actual use cases and data.
Model‑level vs system‑level LLM evaluation
It’s useful to distinguish between evaluating the naked model and evaluating the full system built around it, because many real‑world issues come from orchestration logic, retrieval pipelines or safety layers, not from the base LLM weights alone.
Model‑level evaluation focuses on generic capabilities such as reasoning, coherence, multilingual handling or knowledge coverage, often using broad benchmarks like MMLU or custom test sets designed to stretch the model across many scenarios. These scores inform which base models you choose and where to invest in fine‑tuning.
System‑level evaluation, on the other hand, measures how the entire application performs in its actual environment and use case, including retrieval components, tool calls, multi-agent patterns, guardrails, caching and business logic. Metrics here might include retrieval accuracy, end‑to‑end task success, domain‑specific precision, and user satisfaction, giving you a realistic view of production behavior.
In practice, both views are necessary: model‑centric tests drive fundamental R&D and architecture decisions, while system‑centric tests support rapid iteration, UX optimization and alignment with user expectations and regulatory requirements.
Online vs offline LLM evaluation
Another crucial axis is whether evaluation happens offline in controlled environments or online against real production traffic, each mode offering distinct strengths and trade‑offs.
Offline evaluation uses fixed datasets, synthetic prompts or shadow traffic to test models before they ever touch live users, ensuring that baseline performance meets a minimum bar, that safety filters catch obvious issues, and that regressions are detected before rollout. This is your pre‑launch gate, typically automated in CI pipelines.
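A pre‑launch gate can be as small as one automated test over a curated set; in this pytest‑style sketch, run_model, the eval cases and the 0.85 bar are all hypothetical placeholders:

```python
# test_offline_eval.py -- hypothetical CI gate; run_model() and EVAL_SET
# are stand-ins for your own inference wrapper and curated dataset.
EVAL_SET = [
    {"prompt": "What is our refund window?", "expected_keyword": "30 days"},
    # ... more curated prompt/expectation pairs ...
]

def run_model(prompt: str) -> str:
    raise NotImplementedError("call your model endpoint here")

def test_baseline_quality():
    hits = sum(
        case["expected_keyword"].lower() in run_model(case["prompt"]).lower()
        for case in EVAL_SET
    )
    assert hits / len(EVAL_SET) >= 0.85, "offline eval regression: below bar"
```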
Online evaluation captures how the model behaves with real user inputs, constraints, load patterns and edge cases, tracking live metrics such as user satisfaction, escalation rates, incident reports, and performance under different traffic profiles. It’s especially powerful when combined with A/B testing to compare prompts, hyperparameters or model versions based on actual business outcomes.
A mature setup weaves both approaches together: offline tests act as a safety net and early warning system, while online experiments guide fine‑grained tuning and ensure that optimizations truly translate into better user experiences and reduced operational risk.
Best practices: LLMOps, real‑world testing and rich metric suites
To manage LLMs responsibly at scale, you need LLMOps practices analogous to DevOps, emphasizing automation, collaboration and continuous delivery, but oriented around data, models and evaluation. This usually brings data scientists, ML engineers and operations teams together around shared tooling and processes.
LLMOps platforms automate model training and deployment, monitor quality and drift, and integrate evaluation steps directly into CI/CD pipelines, so that every change to data, prompts or code triggers a standardized battery of tests. The result is faster iteration with fewer surprises in production.
Real‑world evaluation – placing models in front of real users or realistic simulators – is indispensable for uncovering odd, unexpected scenarios, especially for open‑ended language interaction. Controlled lab tests can validate stability and base functionality, but messy, human‑generated prompts reveal jailbreak attempts, ambiguous phrasing and corner cases that no curated dataset could anticipate.
A diverse metric arsenal is key to avoiding tunnel vision on a single score like BLEU or perplexity, so your dashboards should track coherence, fluency, factuality, relevance, contextual understanding, latency, throughput and safety indicators. The broader your observation surface, the better your chances of catching regressions early.
Consultancies and engineering partners that specialize in custom AI solutions can help organizations embed these practices end‑to‑end, from building evaluation pipelines and integrating them into CI/CD to hardening cloud deployments, implementing security reviews and wiring dashboards that tie model behavior directly to business metrics.
Benchmarking LLMs: a practical five‑step flow
A structured benchmarking process helps you move from ad‑hoc experiments to repeatable, data‑driven decisions, especially when you are comparing multiple models, configurations or fine‑tuning strategies.
A robust five‑step flow typically starts with choosing a set of evaluation tasks that reflect both simple and complex use cases, ensuring that you test the model across the entire spectrum of difficulty and domain coverage relevant to your application.
Next, you curate or construct datasets that are as unbiased and representative as possible, capturing real user queries, domain‑specific jargon, edge cases and even adversarial prompts. This is the foundation on which all other evaluation layers depend.
Then you configure the model gateway and fine‑tuning or adaptation mechanisms, such as LoRA adapters, so that your benchmark reflects the actual way the model will be deployed. This includes aligning context length, sampling parameters and safety middleware with production settings.
Once the environment is in place, you run the evaluations using the right mix of metrics for each task, from perplexity for language modeling competence to ROUGE for summarization, diversity scores for creativity, and human judgments for relevance and coherence.
Finally, you perform a detailed analysis and kick off an iterative feedback cycle, feeding insights back into prompt engineering, data cleaning, fine‑tuning strategies and guardrail configuration, so that benchmarking becomes a continuous improvement loop rather than a one‑time report.
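Stitched together, the five steps collapse into a loop you can rerun for every candidate; everything named here (candidates, tasks, metric stubs) is a placeholder for your own setup:

```python
def summarization_rouge(model_name: str) -> float:
    """Stub: run the summarization eval set through the model, return ROUGE-L."""
    return 0.0  # replace with a real evaluation call

def qa_exact_match(model_name: str) -> float:
    """Stub: run the QA eval set, return exact-match accuracy."""
    return 0.0  # replace with a real evaluation call

CANDIDATES = ["base", "base+lora_support", "base+lora_legal"]
TASKS = {"summarization": summarization_rouge, "qa": qa_exact_match}

results = {m: {task: fn(m) for task, fn in TASKS.items()} for m in CANDIDATES}
print(results)  # feed the analysis back into prompts, data and fine-tuning
```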
Observability for LLM systems: beyond HTTP latency
Traditional API monitoring – counting errors and measuring average HTTP latency – is nowhere near enough for LLM workloads, because many of the most damaging failure modes happen in queues, GPU memory or token streaming behavior long before your web layer raises an alarm.
LLM observability hinges on a multi‑signal pipeline combining metrics, traces, logs, profiles, synthetic tests and SLOs, giving you a detailed, causal view of where time is spent, what is saturating first and how user experience is evolving as load patterns shift.
At the metric level, you care not just about requests per second and p99 latency, but also about time‑to‑first‑token (TTFT), inter‑token latency, queue length, batch size, tokens per second, GPU utilization and KV‑cache pressure, since these are the leading indicators of throughput collapse and user‑visible slowness in streaming interfaces.
Traces, instrumented via OpenTelemetry, stitch together all stages of a single request – routing, retrieval, tool calls, safety filters, model execution and post‑processing – so that when latency spikes or outputs degrade, you can pinpoint whether the culprit is a slow vector store, overloaded GPU or misbehaving middleware component.
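A Python sketch of that instrumentation with the OpenTelemetry SDK follows; the span names, stub functions and collector endpoint are assumptions:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Export spans to a collector; the endpoint is a placeholder.
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317"))
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm.request.pipeline")

def retrieve(query):            # stand-in for your vector store lookup
    return []

def generate(query, docs):      # stand-in for the model call
    return "stub answer"

def moderate(text):             # stand-in for the safety filter
    return text

def handle_request(user_query: str) -> str:
    with tracer.start_as_current_span("llm_request") as root:
        root.set_attribute("gen_ai.operation.name", "chat")  # GenAI semconv
        with tracer.start_as_current_span("retrieval"):
            docs = retrieve(user_query)
        with tracer.start_as_current_span("model_execution"):
            answer = generate(user_query, docs)
        with tracer.start_as_current_span("safety_filter"):
            return moderate(answer)
```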
Logs still matter for human debugging and audits, but at LLM scale you must design them carefully, avoiding unbounded high‑cardinality attributes (such as raw prompts, session IDs or full tool arguments) and focusing instead on structured, low‑cardinality metadata like model family, endpoint, region, status code and coarse‑grained outcome types.
Metrics blueprints and semantic conventions for LLMs
Different LLM serving frameworks expose slightly different metric names, but the underlying concepts are consistent, and OpenTelemetry’s semantic conventions for GenAI are starting to unify them into a portable schema.
Systems like Hugging Face TGI, vLLM and NVIDIA Triton typically offer Prometheus endpoints with histograms for end‑to‑end request duration, counters for tokens generated and successful requests, gauges for queue size and batch size, and specialized time‑per‑token and TTFT metrics that correlate directly with user experience.
GPU telemetry is just as important, and exporters like NVIDIA’s DCGM exporter expose Prometheus metrics for utilization, memory usage and other low‑level signals, which you can use to predict out‑of‑memory events, decide when to scale and understand how different workloads stress your accelerators.
OpenTelemetry’s GenAI semantic conventions define standard names for core metrics such as gen_ai.server.request.duration, gen_ai.server.time_to_first_token, gen_ai.server.time_per_output_token and gen_ai.client.token.usage, enabling you to instrument once and then route telemetry to various backends (Prometheus, Mimir, commercial APMs) without rewiring your code every time.
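Instrumenting against these conventions with the OpenTelemetry metrics API might look like this sketch; the attribute values are illustrative, and without a configured MeterProvider the calls are harmless no‑ops:

```python
from opentelemetry import metrics

meter = metrics.get_meter("llm.serving")

request_duration = meter.create_histogram(
    "gen_ai.server.request.duration", unit="s",
    description="End-to-end generation time per request",
)
ttft = meter.create_histogram(
    "gen_ai.server.time_to_first_token", unit="s",
    description="Time until the first streamed token",
)
token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}",
    description="Tokens consumed per operation",
)

attrs = {"gen_ai.system": "vllm", "gen_ai.request.model": "gemma-2b"}
request_duration.record(1.42, attrs)
ttft.record(0.21, attrs)
token_usage.record(512, {**attrs, "gen_ai.token.type": "output"})
```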
On top of these raw metrics, you layer dashboards and PromQL queries that calculate percentiles, error rates, saturation indicators and cost proxies, building a live control panel for your LLM cluster that operations teams can actually use to make capacity and reliability decisions.
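On the query side, those panels boil down to PromQL like the following, shown here issued through Prometheus’s HTTP API; the endpoint is a placeholder and the metric name assumes a vLLM‑style TTFT histogram:

```python
import requests

PROMETHEUS = "http://prometheus:9090/api/v1/query"  # placeholder endpoint

# p99 time-to-first-token over the last 5 minutes, computed from a
# histogram; metric names vary by serving stack (vLLM-style shown).
QUERY = (
    "histogram_quantile(0.99, "
    "sum(rate(vllm:time_to_first_token_seconds_bucket[5m])) by (le))"
)

resp = requests.get(PROMETHEUS, params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"])
```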
Designing the telemetry pipeline: pull, push and collectors
A robust LLM observability stack usually combines pull‑based metrics scraping with push‑based OTLP telemetry, fitting into the grain of tools like Prometheus while leveraging OpenTelemetry collectors for traces and logs.
Prometheus remains pull‑first: servers and exporters expose a /metrics endpoint, and Prometheus scrapes it at configured intervals. This works well for inference servers (TGI, vLLM, Triton), GPU exporters, node exporters and k6 load tests, giving you a uniform workflow for capacity metrics.
For traces, logs and sometimes metrics produced by instrumented applications, you typically use OTLP push, sending spans and structured events to one or more OpenTelemetry collectors that perform batching, sampling, redaction and export to backends like Tempo, Jaeger, Loki, Elastic APM or commercial platforms.
Deployment patterns often mix node‑level DaemonSets, sidecar collectors and centralized gateways, where DaemonSets handle host enrichment and shared processing, sidecars provide isolation for workloads that manipulate sensitive prompts, and gateway collectors enforce organization‑wide sampling and routing policies.
Throughout this pipeline you must keep an eye on sampling strategies and label cardinality, using tail‑based sampling to retain interesting traces (slow, error‑prone) while discarding noise, and designing metric labels so that you do not accidentally explode memory and CPU usage on your observability infrastructure.
Tooling landscape for LLM observability
The open‑source observability ecosystem is broad, and LLM workloads sit at the intersection of several tools, each bringing strengths for specific signal types: Prometheus for metrics, Tempo or Jaeger for traces, Loki or Elastic for logs, and Pyroscope for continuous profiling.
Grafana commonly acts as the unifying UI layer on top of this stack, offering dashboards that can query multiple data sources in one place, visualize SLOs, correlate metrics with traces and logs, and power on‑call workflows for SRE teams managing LLM‑heavy services.
For organizations that prefer managed solutions, services like Grafana Cloud, Datadog, New Relic or Amazon Managed Prometheus provide hosted backends, accepting OTLP or Prometheus remote‑write traffic and handling scaling, retention and high availability, at the cost of vendor lock‑in and per‑ingest pricing models.
Whichever combination you choose, the priority is consistency: standardize around OpenTelemetry where possible, adopt semantic conventions for GenAI metrics and spans, and treat your observability setup as part of your core LLM architecture rather than as an afterthought bolted on at the end.
Deployment, scaling, security and troubleshooting
Deploying observability for LLMs in Kubernetes often starts with opinionated bundles like kube‑prometheus‑stack plus OpenTelemetry collectors, while simpler experiments can run with Docker Compose or basic VM setups. The key is that discovery, retention and dashboarding are thought through from day one, not improvised mid‑incident.
As traffic grows, you move from Prometheus’s default local retention (around 15 days) to long‑term storage via systems such as Mimir, Thanos, Cortex or managed Prometheus services, and adopt trace backends like Tempo that can generate metrics from spans when needed. Log stores like Loki or Elastic need careful label design to stay affordable.
Security and privacy are especially sensitive for LLM applications, because prompts and outputs may contain personal or confidential data, and both OpenTelemetry and Prometheus documentation explicitly warn about leaking sensitive information through telemetry data. You mitigate these risks by redacting prompts and responses by default, filtering attributes at the collector, enforcing RBAC and tight network boundaries, and setting retention policies that reflect regulatory obligations.
When dashboards look wrong or signals go missing, you debug from ingestion health and schema mismatches down to sampling and cardinality issues, checking scrape success, OTLP endpoints, label names, histogram usage, sampling rules and GPU exporter status until the root cause is clear and fixed.
Bringing all of these strands together – fine‑tuning strategies, rigorous evaluation, on‑device deployment and deep observability – is what turns LLMs from experimental prototypes into reliable, auditable systems that organizations can trust in sensitive domains, while still evolving quickly enough to keep up with the pace of AI research and changing business needs.