Every client conversation eventually turns to AI now. And for good reason — when LLMs are integrated thoughtfully, they can genuinely transform a product. When they’re bolted on carelessly, they produce a chatbot that hallucinates company policies and drives your support team mad.

Let’s talk about how to do this properly.

Start With the Use Case, Not the Technology

The most common mistake is starting from “We want to add AI to our product” without a clear answer to which problem it solves, and for which user.

LLMs are extraordinary at a specific set of tasks:

  • Transforming unstructured text into structured data
  • Drafting content given a context and constraints
  • Answering questions over a corpus of documents (RAG)
  • Classifying, routing, and triaging inputs
  • Generating and explaining code
  • Summarizing long-form content

They are not well-suited for:

  • Real-time data lookups (without tools)
  • Precise numerical computation
  • Anything requiring guaranteed accuracy (medical dosing, legal citations, financial calculations)

Define your use case tightly before writing a single line of integration code.

Retrieval-Augmented Generation (RAG) Is Usually the Answer

The most valuable LLM applications we’ve built share a common pattern: a user asks something, we retrieve relevant context from a knowledge base, we inject that context into the prompt, and the model answers grounded in real information.

This pattern, RAG, greatly reduces hallucination in domain-specific applications, though it does not eliminate it: the model can still misread or over-extend the context it is given. The model doesn’t need to “know” your product documentation, support articles, or internal knowledge base. It just needs the relevant documents at inference time.

A minimal RAG pipeline:

// embed, vectorDB, and llm are placeholders for your own embedding client,
// vector store, and model client. Run this inside an async function.

// 1. Embed the user's query
const queryEmbedding = await embed(userQuestion);

// 2. Find the most relevant chunks from your knowledge base
const relevantChunks = await vectorDB.search(queryEmbedding, { topK: 5 });

// 3. Build a prompt with that context
const prompt = `
  Answer based only on the provided context.

  Context:
  ${relevantChunks.map(c => c.text).join('\n\n')}

  Question: ${userQuestion}
`;

// 4. Generate the answer
const answer = await llm.complete(prompt);

This is simplistic — production RAG involves chunking strategy, re-ranking, metadata filtering, and hybrid search — but it illustrates the core idea.
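
To make step 2 less of a black box, here is a minimal sketch of what a top-K vector search does under the hood. It is a brute-force, in-memory version; the Chunk shape and the searchTopK name are our own assumptions for illustration, not any particular vector database's API, and a real store would use an approximate index rather than scoring every chunk.

```typescript
// Hypothetical shape for a stored chunk; a real vector DB has its own schema.
interface Chunk {
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as unrelated to everything.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force top-K: score every chunk, sort by score descending, keep the best K.
function searchTopK(query: number[], chunks: Chunk[], topK: number): Chunk[] {
  return chunks
    .map(chunk => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, topK)
    .map(scored => scored.chunk);
}
```

Re-ranking and metadata filtering slot in around this function: filter the candidate chunks before scoring, then re-order the top results with a more expensive model afterwards.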

Prompt Engineering Is Real Engineering

Prompts are not optional extras. A well-crafted system prompt is the difference between a helpful, on-brand assistant and a liability.

The things that actually matter:

Persona and tone. Be explicit about who the model is in this context. “You are a helpful customer support assistant for Acme Software. You are friendly, concise, and focused on solving the user’s problem.” This matters.

Scope constraints. Tell the model what it should NOT do. “Do not discuss competitor products. Do not make promises about upcoming features. If asked about pricing, direct the user to our pricing page.” Without constraints, you’ll get surprises.

Output format. If you need structured output, ask for it explicitly and use JSON mode or structured outputs. Parsing free-form LLM text in production is a fragile nightmare.
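
Even with JSON mode turned on, it is worth validating what comes back before the rest of your code touches it. The sketch below assumes a hypothetical ticket-triage task; the TicketTriage fields and category names are invented for illustration, and in a real project you might reach for a schema validation library instead of hand-written checks.

```typescript
// Hypothetical categories and output shape for a support-ticket triage task.
const CATEGORIES = ["billing", "bug", "how_to"] as const;
type Category = typeof CATEGORIES[number];

interface TicketTriage {
  category: Category;
  urgent: boolean;
  summary: string;
}

function isCategory(value: unknown): value is Category {
  return typeof value === "string" && (CATEGORIES as readonly string[]).includes(value);
}

// Returns a validated object, or null so the caller can retry or fall back.
function parseTriage(raw: string): TicketTriage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // the model did not return valid JSON at all
  }
  if (typeof data !== "object" || data === null) return null;

  const obj = data as Record<string, unknown>;
  const category = obj["category"];
  const urgent = obj["urgent"];
  const summary = obj["summary"];

  if (!isCategory(category)) return null;
  if (typeof urgent !== "boolean" || typeof summary !== "string") return null;

  return { category, urgent, summary };
}
```

Returning null rather than throwing keeps the decision with the caller: retry the request, fall back to a default route, or hand the ticket to a human.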

Few-shot examples. Show, don’t tell. Two to three examples of ideal inputs and outputs often outperform paragraphs of instructions.
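
Put together, these pieces can live in code rather than in a string someone pasted into a config file. The following is one possible sketch: the ChatMessage shape mirrors the role/content convention most chat-style APIs use, but the exact field names, the buildMessages helper, and the FewShotExample type are assumptions for illustration.

```typescript
// Generic chat message shape; check your provider's API for the exact fields.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// A hypothetical pair of ideal input and ideal output.
interface FewShotExample {
  input: string;
  idealOutput: string;
}

function buildMessages(
  persona: string,
  constraints: string[],
  examples: FewShotExample[],
  userInput: string
): ChatMessage[] {
  // Persona first, then scope constraints as an explicit list of rules.
  const systemPrompt = [persona, "Rules:", ...constraints.map(c => `- ${c}`)].join("\n");

  const messages: ChatMessage[] = [{ role: "system", content: systemPrompt }];

  // Few-shot examples are replayed as prior user/assistant turns.
  for (const example of examples) {
    messages.push({ role: "user", content: example.input });
    messages.push({ role: "assistant", content: example.idealOutput });
  }

  messages.push({ role: "user", content: userInput });
  return messages;
}

const messages = buildMessages(
  "You are a helpful customer support assistant for Acme Software. You are friendly, concise, and focused on solving the user's problem.",
  [
    "Do not discuss competitor products.",
    "Do not make promises about upcoming features.",
    "If asked about pricing, direct the user to our pricing page.",
  ],
  [{ input: "How much is the Pro plan?", idealOutput: "You can find current plans on our pricing page." }],
  "Can I export my data?"
);
```

Keeping the prompt in a function like this also means it can be versioned, reviewed, and tested like any other code.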

Streaming Is Table Stakes

If a response takes more than a couple of seconds to appear, users start to assume the app is broken. Stream responses using the API’s streaming endpoints, and show a visible typing indicator until the first tokens arrive.

In our experience, this single UX change does more for perceived quality than almost any other optimization.
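
Here is a rough sketch of the client-side logic, assuming the streamed response can be consumed as an async iterable of text fragments. Real SDKs expose their own stream types, so fakeTokenStream is only a stand-in, and the StreamUI hooks are hypothetical names for whatever your front end provides.

```typescript
// Stand-in for a streaming API response: any async iterable of text fragments.
async function* fakeTokenStream(): AsyncGenerator<string> {
  for (const token of ["Your ", "order ", "has ", "shipped."]) {
    yield token;
  }
}

// Hypothetical UI hooks; wire these to your own components.
interface StreamUI {
  setTyping(visible: boolean): void;
  render(textSoFar: string): void;
}

// Show the typing indicator until the first token, then render incrementally.
async function streamToUI(tokens: AsyncIterable<string>, ui: StreamUI): Promise<string> {
  let text = "";
  let waitingForFirstToken = true;
  ui.setTyping(true);
  try {
    for await (const token of tokens) {
      if (waitingForFirstToken) {
        ui.setTyping(false); // first token arrived, swap indicator for text
        waitingForFirstToken = false;
      }
      text += token;
      ui.render(text);
    }
  } finally {
    ui.setTyping(false); // also clears the indicator on empty or failed streams
  }
  return text;
}
```

The finally block matters more than it looks: without it, a dropped connection leaves the typing indicator spinning forever, which is worse than no indicator at all.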

Observability From Day One

You cannot improve what you cannot measure. From the moment you go to production:

  • Log every prompt and completion (with appropriate privacy controls)
  • Capture latency, token counts, and costs per request
  • Implement a simple thumbs up/down feedback mechanism
  • Review flagged responses weekly

LLM behavior in production is different from behavior in your dev environment. Real users ask questions you didn’t anticipate. You need to see what’s happening.
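
As a starting point, the log record for each call can be very small. The sketch below is one way to shape it; the field names, the Pricing type, and the prices used in the example are all placeholders, and the email redaction is only a token gesture toward privacy controls, not a complete PII solution.

```typescript
// Placeholder prices per million tokens; real prices vary by provider and change often.
interface Pricing {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Hypothetical log record for one LLM call.
interface LlmCallRecord {
  timestamp: string;
  model: string;
  prompt: string;
  completion: string;
  latencyMs: number;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
  feedback?: "up" | "down"; // filled in later if the user rates the response
}

// Very rough redaction of email addresses before anything is written to logs.
function redactEmails(text: string): string {
  return text.replace(/[^\s@]+@[^\s@]+\.[^\s@]+/g, "[email]");
}

function buildRecord(
  model: string,
  prompt: string,
  completion: string,
  latencyMs: number,
  promptTokens: number,
  completionTokens: number,
  pricing: Pricing
): LlmCallRecord {
  const costUsd =
    (promptTokens * pricing.inputPerMillion + completionTokens * pricing.outputPerMillion) /
    1_000_000;
  return {
    timestamp: new Date().toISOString(),
    model,
    prompt: redactEmails(prompt),
    completion: redactEmails(completion),
    latencyMs,
    promptTokens,
    completionTokens,
    costUsd,
  };
}
```

Write these records wherever your other application logs go. The weekly review is then a query for records where feedback is "down", sorted by date.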

Cost and Latency: The Real Constraints

Model pricing changes constantly, but the fundamental tradeoff holds steady: more capable models generally cost more and respond more slowly.

Design your architecture so different tasks route to appropriate models:

  • Simple classification or intent detection → use a fast, cheap model
  • Nuanced content generation or complex reasoning → use a capable model
  • User-facing real-time responses → optimize hard for latency
  • Async background processing → optimize for cost and quality

Don’t use a sports car to do grocery runs.
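
In code, this routing can start as nothing more than a lookup table. The sketch below is deliberately simple; the task kinds are our own labels and the model names are placeholders, not real model identifiers.

```typescript
// Our own labels for the kinds of work described above.
type TaskKind = "classification" | "generation" | "realtime_chat" | "background_batch";

interface ModelChoice {
  model: string; // placeholder name, swap in your provider's model ID
  stream: boolean; // stream only where a user is waiting on the response
}

const ROUTES: Record<TaskKind, ModelChoice> = {
  // Simple classification or intent detection: fast and cheap.
  classification: { model: "small-fast-model", stream: false },
  // Nuanced content generation or complex reasoning: capable model.
  generation: { model: "large-capable-model", stream: false },
  // User-facing real-time responses: optimize for latency and stream the output.
  realtime_chat: { model: "small-fast-model", stream: true },
  // Async background processing: nobody is waiting, so favor quality.
  background_batch: { model: "large-capable-model", stream: false },
};

function routeTask(kind: TaskKind): ModelChoice {
  return ROUTES[kind];
}
```

Because Record<TaskKind, ModelChoice> requires an entry for every task kind, adding a new kind without deciding which model handles it becomes a compile error rather than a production surprise.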


The Meta-Lesson

Every LLM integration project we’ve delivered well has had one thing in common: we spent as much time on the product thinking as on the technical implementation.

Why does the user need this? What does success look like? How do we handle failure gracefully? What are the trust and safety implications?

The models are remarkable. The hard work is wrapping them in a product experience that earns user trust over time.

If you’re exploring AI integration for your product and want to think through the right approach, let’s talk. This is genuinely our favorite type of problem.