<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[vandriichuk’s Substack]]></title><description><![CDATA[My personal Substack]]></description><link>https://substack.vandriichuk.com</link><image><url>https://substackcdn.com/image/fetch/$s_!cbfq!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F107ea15a-a7bd-425d-8f8e-01d83d4fe6b5_144x144.png</url><title>vandriichuk’s Substack</title><link>https://substack.vandriichuk.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 09 May 2026 13:39:05 GMT</lastBuildDate><atom:link href="https://substack.vandriichuk.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[vandriichuk]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[vandriichuk@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[vandriichuk@substack.com]]></itunes:email><itunes:name><![CDATA[vandriichuk]]></itunes:name></itunes:owner><itunes:author><![CDATA[vandriichuk]]></itunes:author><googleplay:owner><![CDATA[vandriichuk@substack.com]]></googleplay:owner><googleplay:email><![CDATA[vandriichuk@substack.com]]></googleplay:email><googleplay:author><![CDATA[vandriichuk]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[From Twilio Webhooks to LiveKit Agents: How a Client's Insistence Led to a Better Architecture]]></title><description><![CDATA["You were right. What do we do next?" 
How a client insisted on the wrong architecture &#8212; and why letting him was the right call]]></description><link>https://substack.vandriichuk.com/p/from-twilio-webhooks-to-livekit-agents</link><guid isPermaLink="false">https://substack.vandriichuk.com/p/from-twilio-webhooks-to-livekit-agents</guid><dc:creator><![CDATA[vandriichuk]]></dc:creator><pubDate>Thu, 26 Mar 2026 09:03:46 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!dw49!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dw49!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dw49!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!dw49!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 848w, https://substackcdn.com/image/fetch/$s_!dw49!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 1272w, 
https://substackcdn.com/image/fetch/$s_!dw49!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dw49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:8391836,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.vandriichuk.com/i/192184797?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dw49!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 424w, https://substackcdn.com/image/fetch/$s_!dw49!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 848w, 
https://substackcdn.com/image/fetch/$s_!dw49!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 1272w, https://substackcdn.com/image/fetch/$s_!dw49!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc844e75-dfd6-46ae-a771-5a24b50c41a4_2752x1536.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>There are calls you remember.</p><div class="subscription-widget-wrap-editor" 
data-attrs="{&quot;url&quot;:&quot;https://substack.vandriichuk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">vandriichuk&#8217;s Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This was one of them.</p><p>A few weeks ago, a client called and said the sentence I&#8217;d been expecting since the project started: &#8220;You were right. What do we do next?&#8221;</p><p>To explain why that mattered, I need to start from the beginning.</p><h3>How it started</h3><p>The client came with a specific ask: build a phone ordering AI for a pizza restaurant in New Jersey. Forty to sixty calls every evening. The owner was answering the phone himself instead of running the kitchen. Orders were getting lost during peak hours.</p><p>The technical spec was clear: Twilio Voice API, webhooks, Node.js. Standard stack. He&#8217;d seen similar setups before and knew what he wanted.</p><p>I had experience with this type of system. I told him upfront: webhook architecture introduces 1.5 to 2 seconds of latency per conversational turn. In a chat interface, that&#8217;s manageable. On a phone call, it feels like a dropped line. I recommended a streaming approach instead.</p><p>He listened and said: I understand your reasoning, but I want to see it for myself. Build it the way I&#8217;m asking.</p><h3>The first version: exactly what he asked for</h3><p>I could have pushed back harder. 
Instead I made a different call: if he needs to see it to believe it, let him see it. That&#8217;s more honest than winning an argument by authority.</p><p>I built the webhook pipeline exactly to spec. Twilio captures the call, sends audio to our server, Deepgram transcribes it, GPT generates a response, Amazon Polly converts it back to speech, TwiML returns it to the customer. Six services in sequence.</p><p>We ran the test calls. Here&#8217;s what a typical exchange looked like:</p><p><em>&#8220;I&#8217;d like two cheese pizzas.&#8221;</em></p><p><em>[1.9 seconds of silence]</em></p><p><em>&#8220;Great, two cheese pizzas. Anything else?&#8221;</em></p><p><em>&#8220;Yeah, and also&#8212;&#8221; [system cuts off mid-sentence]</em></p><p><em>[2.1 seconds of silence]</em></p><p><em>&#8220;I&#8217;m sorry, I didn&#8217;t catch that. Could you repeat?&#8221;</em></p><p>Pass rate: 68%. One in three orders had errors.</p><p>I sent him the call recordings and the numbers. No commentary.</p><h3>&#8220;You were right. What do we do next?&#8221;</h3><p>That&#8217;s the call I&#8217;ll remember.</p><p>Not because I&#8217;d been right. Because of what it changed. From that point on, the client stopped issuing technical directives and started asking questions. We started working like actual partners.</p><p>I suggested trying the OpenAI Realtime API &#8212; a bidirectional WebSocket that handles speech recognition, language model, and voice synthesis in a single streaming pipeline. No sequential processing. Natural barge-in support. The kind of UX that actually feels like a conversation.</p><p>The quality was immediately better. 96.6% order completion rate, zero barge-in issues, calls thirty seconds shorter. He listened to the test recordings and was happy.</p><p>Then I looked at the bill. $0.82 per call.</p><p>At fifty calls a day, that&#8217;s $1,230 a month &#8212; more expensive than hiring a part-time employee. 
His target was $0.08 to $0.12 per call.</p><p>I called him again. &#8220;Quality is great. Cost is seven to ten times your budget. Give me more time &#8212; I want to try a third approach.&#8221;</p><p>This time he said simply: &#8220;Okay. Do what you think is right.&#8221;</p><h3>The third version: what actually works</h3><p>The insight was straightforward: the Realtime API sells convenience &#8212; one pipeline, one API key. You&#8217;re paying for the integration, not the raw compute. If you unbundle the pipeline and pick the best provider for each component separately, you can get the same quality for significantly less.</p><p>I rebuilt the system on LiveKit: Deepgram for speech recognition with custom keyterms boosting the restaurant&#8217;s specific menu vocabulary, GPT-5.4-mini for the language model (not a reasoning model &#8212; voice AI needs speed, not deliberation), Cartesia for speech synthesis with low latency and emotion control.</p><p>The most important architectural decision: take flow control away from the LLM entirely.</p><p>Earlier versions let the model manage the conversation. Even with careful prompting, it would occasionally skip confirmation steps, ask for the customer&#8217;s name before confirming the order, or once add an item to the cart that didn&#8217;t exist on the menu. A prompt is an instruction. It&#8217;s not a contract.</p><p>The production system uses a deterministic phase machine for flow control &#8212; a pure function that takes the current order state and returns the current phase and what transitions are allowed. The model&#8217;s job is limited to one thing: understand what the customer said and call the right tool. Everything else is handled by code that&#8217;s testable and predictable.</p><p>Result: $0.096 per call. 100% order completion on the latest test rounds. 
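</p><p>A rough sketch of the phase machine described above, written as a pure function (the phase names and order-state shape here are illustrative assumptions, not the production code):</p>

```javascript
// Deterministic phase machine: a pure function of order state.
// Code, not the prompt, decides which tools the model may call next.
// Phase names and state shape are illustrative assumptions.
function getPhase(order) {
  if (order.items.length === 0) {
    return { phase: 'TAKING_ITEMS', allowedTools: ['addItem'] };
  }
  if (!order.confirmed) {
    return { phase: 'CONFIRMING', allowedTools: ['addItem', 'removeItem', 'confirmOrder'] };
  }
  if (!order.customerName) {
    return { phase: 'COLLECTING_NAME', allowedTools: ['setCustomerName'] };
  }
  return { phase: 'DONE', allowedTools: [] };
}
```

<p>Because the function is pure, every transition can be covered by plain unit tests, with no model in the loop.</p><p>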
Calls averaging 55 seconds instead of over two minutes.</p><h3>What I took away from this</h3><p>Sometimes the best way to move a client forward is to let them arrive at the conclusion themselves. Not because you need the satisfaction of being right &#8212; but because the experience changes something that an argument can&#8217;t.</p><p>After that first version, the client became an ally. We stopped spending energy on disagreement and put it into the work instead.</p><p>It cost a few weeks and three complete architecture rewrites. But it ended with a system in production, a client who trusts the process, and a working relationship that&#8217;s genuinely collaborative.</p><p>I&#8217;ve documented the full technical architecture &#8212; cost breakdown, component selection rationale, the phase machine design &#8212; in my guide to building AI systems in production. If you&#8217;re working on something similar, it&#8217;s here: <a href="https://gazolinpro.gumroad.com/l/ai_in_production_the_real_guide_rus_full?_gl=1*1quv151*_ga*MTA1NDA0MTE0OC4xNzQ3NDkzNDI4*_ga_6LJN6D94N6*czE3NzQ1MTM3NTMkbzcwOSRnMSR0MTc3NDUxNTU1NSRqNjAkbDAkaDA.">[Gumroad]</a></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.vandriichuk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">vandriichuk&#8217;s Substack is a reader-supported publication. 
To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[The Evolution of Data Ingestion in AWS Bedrock Knowledge Base]]></title><description><![CDATA[How I sped up indexing, removed blocking, and made the UX sane]]></description><link>https://substack.vandriichuk.com/p/the-evolution-of-data-ingestion-in</link><guid isPermaLink="false">https://substack.vandriichuk.com/p/the-evolution-of-data-ingestion-in</guid><dc:creator><![CDATA[vandriichuk]]></dc:creator><pubDate>Tue, 18 Nov 2025 13:22:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!xYNA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xYNA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xYNA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!xYNA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!xYNA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!xYNA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xYNA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2225832,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.vandriichuk.com/i/179244871?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!xYNA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!xYNA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!xYNA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!xYNA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f63378e-8546-4d12-9f2b-689e18ef2477_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>Let me start with why we decided to use AWS Bedrock Knowledge Base in the first place. Our original plan was to upload documents directly to Bedrock and work with them as-is. But there was a hard limit we couldn&#8217;t ignore: Bedrock does not accept files larger than 5 MB.</p><p>Our client needed to upload documents up to 50 MB. Splitting or recompressing them would only create more complexity. The cleaner solution was to use the Knowledge Base as a place to store and index large documents, and then let Bedrock work with the resulting chunks.</p><p>Once we made that shift, the main challenge became speed and stability of indexing. That led us to rethink the entire ingestion flow.</p><div><hr></div><h2>The old way: StartIngestionJob</h2><h3>Blocking behavior, full-bucket rescans, and a clunky UX</h3><p>The old flow looked simple but was far from efficient:</p><ol><li><p>Upload a file to S3.</p></li><li><p>Run StartIngestionJob.</p></li><li><p>The job rescans the entire bucket.</p></li><li><p>The KB stays blocked until the job finishes.</p></li></ol><p>The issues were obvious:</p><ul><li><p>The whole KB gets blocked during indexing.</p></li><li><p>The entire bucket is scanned even if you add just one new file.</p></li><li><p>Users have to wait 3&#8211;5 minutes.</p></li><li><p>No visibility into the status of individual files.</p></li><li><p>Documents get reindexed even if nothing changed.</p></li></ul><p>Fine for a prototype, painful for a real product.</p><div><hr></div><h2>The new way: IngestKnowledgeBaseDocuments</h2><h3>Fast, granular, non-blocking indexing</h3><p>Switching to IngestKnowledgeBaseDocuments changed 
everything. Now:</p><ol><li><p>Upload the file to S3.</p></li><li><p>Send it directly for indexing.</p></li><li><p>The KB stays available.</p></li><li><p>Only that specific file is indexed.</p></li></ol><p>The benefits are immediate:</p><ul><li><p>Multiple files can be indexed in parallel.</p></li><li><p>Indexing starts instantly, without waiting for a job.</p></li><li><p>The user can keep working.</p></li><li><p>Each file has its own status.</p></li><li><p>No full-bucket rescans.</p></li></ul><p>The system went from &#8220;heavy and slow&#8221; to something much closer to real-time.</p><div><hr></div><h2>Custom metadata: the backbone of multi-tenancy</h2><p>To support multiple users on a single KB index, each document gets a metadata block:</p><pre><code><code>{
  userId: "user-email@example.com",
  fileName: "report.pdf",
  fileHash: "sha256-hash",
  uploadedAt: "2024-01-15T10:30:00Z",
  fileSize: 5242880,
  contentType: "application/pdf"
}
</code></code></pre><p>This solves several problems at once:</p><ul><li><p>Every user sees only their own documents.</p></li><li><p>Filtering with userId + fileName keeps results precise.</p></li><li><p>We track who uploaded what and when.</p></li><li><p>fileHash prevents duplicate indexing.</p></li><li><p>Search becomes more relevant.</p></li></ul><p>Example filter:</p><pre><code><code>const filter = {
  andAll: [
    { equals: { key: "userId", value: { stringValue: "user@example.com" } } },
    { equals: { key: "fileName", value: { stringValue: "contract.pdf" } } }
  ]
};
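
// Applying the filter at query time (a sketch; assumes the
// @aws-sdk/client-bedrock-agent-runtime RetrieveCommand):
// const resp = await client.send(new RetrieveCommand({
//   knowledgeBaseId,
//   retrievalQuery: { text: userQuery },
//   retrievalConfiguration: { vectorSearchConfiguration: { filter } },
// }));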
</code></code></pre><div><hr></div><h2>Tricks that improved speed and UX</h2><h3>A. Batching: up to 10 documents per request</h3><pre><code><code>const batches = chunk(files, 10);
for (const batch of batches) {
  await ingestDocumentsBatch(batch);
}
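
// chunk() is assumed above but not shown in the original.
// One possible implementation: split an array into batches of at most `size` items.
function chunk(arr, size) {
  const out = [];
  let rest = arr;
  while (rest.length) {
    out.push(rest.slice(0, size));
    rest = rest.slice(size);
  }
  return out;
}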
</code></code></pre><p>Fewer API calls, faster throughput, AWS limits respected.</p><div><hr></div><h3>B. Caching file status</h3><p>Avoids reindexing a document if it&#8217;s already indexed:</p><pre><code><code>// keyed by fileHash
const cache = new Map&lt;string, {
  s3Key: string,
  indexed: boolean,
  checkedAt: number  // epoch ms from Date.now()
}&gt;();
</code></code></pre><p>TTL is one hour. Hash-based, so renaming files doesn&#8217;t matter.</p><div><hr></div><h3>C. Asynchronous status polling</h3><p>The user doesn&#8217;t wait:</p><pre><code><code>waitForDocumentsIndexed(s3Uris, { timeout: 60000 })
  .then(results =&gt; updateCache(results))
  .catch(err =&gt; scheduleRetry(err));
</code></code></pre><p>Polling every 2&#8211;3 seconds.</p><div><hr></div><h3>D. Retry with backoff</h3><p>Smooths out rate limit spikes:</p><pre><code><code>async function retryWithBackoff(fn, maxRetries = 3) {
  for (let i = 0; i &lt; maxRetries; i++) {
    try {
      return await fn();
    } catch (err) {
      // retry only on 429 (rate limit) while attempts remain
      if (err.statusCode === 429 &amp;&amp; i &lt; maxRetries - 1) {
        // exponential backoff with jitter: ~1s, ~2s, ~4s
        const delay = Math.pow(2, i) * 1000 + Math.random() * 1000;
        await new Promise(resolve =&gt; setTimeout(resolve, delay));
      } else {
        throw err;
      }
    }
  }
}
</code></code></pre><div><hr></div><h3>E. Cache cleanup</h3><p>Keeps the cache lean:</p><pre><code><code>function cleanExpiredCache() {
  const now = Date.now();
  for (const [hash, entry] of cache.entries()) {
    if (now - entry.checkedAt &gt; CACHE_TTL) {
      cache.delete(hash);
    }
  }
}
</code></code></pre><div><hr></div><h2>Full workflow: from upload to model response</h2><ol><li><p>The user uploads a PDF.</p></li><li><p>The file goes to S3.</p></li><li><p>We call IngestKnowledgeBaseDocuments.</p></li><li><p>KB starts indexing.</p></li><li><p>The user sees a &#8220;file is being processed&#8221; message.</p></li><li><p>Background polling checks status.</p></li><li><p>Once indexing is done, we update the cache.</p></li><li><p>For the next query, KB returns only the relevant chunks.</p></li><li><p>The model generates the final answer.</p></li></ol><p>Fast, predictable, no blocking.</p><div><hr></div><h2>Results after the migration</h2><h3>Before (StartIngestionJob)</h3><ul><li><p>3&#8211;5 minutes to index a single file.</p></li><li><p>KB completely blocked.</p></li><li><p>No user isolation.</p></li><li><p>Only job-level status.</p></li></ul><h3>After (IngestKnowledgeBaseDocuments)</h3><ul><li><p>30&#8211;60 seconds per file.</p></li><li><p>KB remains responsive.</p></li><li><p>Full multi-tenant metadata separation.</p></li><li><p>File-level status.</p></li><li><p>Batching up to 10 files.</p></li><li><p>Cache removes redundant work.</p></li></ul><p>Roughly a 5x speed improvement.</p><div><hr></div><h2>What comes next</h2><ul><li><p>Indexing metrics: latency, failures, distribution.</p></li><li><p>Webhooks for &#8220;file indexed&#8221; events.</p></li><li><p>Document versioning via metadata.</p></li><li><p>More filtering options: by size, date, type.</p></li><li><p>Predictive indexing for frequently used files.</p></li></ul><div><hr></div><h2>Key takeaways</h2><ol><li><p>The KB solves Bedrock&#8217;s file-size limit and handles documents up to 50 MB.</p></li><li><p>Direct ingestion removes blocking and boosts speed significantly.</p></li><li><p>Metadata enables true multi-tenant behavior.</p></li><li><p>Batching, caching, and async polling keep the UX smooth.</p></li><li><p>Backoff logic keeps the system stable during heavy load.</p></li></ol><p>If 
you rely on Bedrock KB, switching to direct ingestion will make the whole system feel lighter and more responsive.</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://substack.vandriichuk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://substack.vandriichuk.com/subscribe?"><span>Subscribe now</span></a></p><p></p>]]></content:encoded></item><item><title><![CDATA[From Naive RAG to Knowledge Graphs: Building Verifiable and Trustworthy AI Assistants]]></title><description><![CDATA[This technical deep dive explores the evolution of RAG, examining the limitations of traditional approaches and charting a path forward with knowledge graphs, focusing on LightRAG.]]></description><link>https://substack.vandriichuk.com/p/from-naive-rag-to-knowledge-graphs</link><guid isPermaLink="false">https://substack.vandriichuk.com/p/from-naive-rag-to-knowledge-graphs</guid><dc:creator><![CDATA[vandriichuk]]></dc:creator><pubDate>Wed, 22 Oct 2025 19:48:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!pufk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pufk!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!pufk!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!pufk!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!pufk!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!pufk!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pufk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1496210,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://substack.vandriichuk.com/i/176862654?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pufk!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!pufk!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!pufk!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!pufk!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a5de943-f6bd-4ab5-a823-0ed6a0e14aca_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>In the ever-evolving landscape of artificial intelligence, we&#8217;re witnessing a paradigm shift from simple prompt engineering to the more sophisticated discipline of <strong>context engineering</strong>. It&#8217;s no longer enough to craft the perfect query; the real challenge lies in furnishing Large Language Models (LLMs) with the right, verifiable information to generate trustworthy answers. This technical deep dive explores the evolution of Retrieval-Augmented Generation (RAG), examining the limitations of traditional approaches and charting a path forward with knowledge graphs, focusing on pioneering frameworks like Microsoft&#8217;s GraphRAG and its more agile counterpart, LightRAG.</p><h2><strong>Classic RAG: A Powerful, Yet Flawed, Solution</strong></h2><p>Retrieval-Augmented Generation (RAG) has become a standard technique for extending the capabilities of LLMs beyond their static training data, allowing them to access private or up-to-the-minute information.
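</p><p>To make this retrieve-and-augment loop concrete, here is a minimal sketch. The <code>embed</code> helper and the prompt template are toy stand-ins invented for illustration (a bag-of-letters vector and an f-string); a real system would call an embedding model and an LLM API instead:</p>

```python
# Minimal "naive" RAG loop: embed, rank by similarity, build augmented prompt.
import math

def embed(text: str) -> list[float]:
    # Toy bag-of-letters vector; a real system calls an embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha() and ord(ch) < 128:
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    # Retrieval: rank stored chunks by vector similarity to the query.
    qv = embed(query)
    return sorted(chunks, key=lambda c: cosine(qv, embed(c)), reverse=True)[:k]

def build_prompt(query: str, chunks: list[str]) -> str:
    # Augmentation: the retrieved chunks become the LLM's context.
    context = "\n".join(retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQ: {query}"
```

<p>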
The process is relatively straightforward: when a user asks a question, the RAG system first retrieves relevant snippets of information from a knowledge base and then feeds these snippets, along with the original query, to an LLM to generate an answer.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.vandriichuk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">vandriichuk&#8217;s Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p><strong>How Traditional RAG Works:</strong></p><ol><li><p><strong>Indexing:</strong> Documents are broken down into smaller pieces (chunks).</p></li><li><p><strong>Embedding:</strong> Each chunk is converted into a numerical vector (embedding) that represents its semantic meaning.</p></li><li><p><strong>Storing:</strong> These vectors are stored in a vector database.</p></li><li><p><strong>Retrieval:</strong> The user&#8217;s query is also converted into a vector, and the database is searched for the most similar document chunk vectors.</p></li><li><p><strong>Augmentation &amp; Generation:</strong> The retrieved chunks are added to the context of the user&#8217;s prompt, and the LLM generates an answer based on this augmented information.</p></li></ol><p>Despite its effectiveness in simple scenarios, this &#8220;naive&#8221; approach to RAG has significant drawbacks that become apparent as the 
complexity and scale of the knowledge base grow.</p><p><strong>The Problems with Classic RAG:</strong></p><ul><li><p><strong>Semantic Drift and Name Collisions:</strong> Vector search can retrieve semantically similar but contextually incorrect chunks. A query about &#8220;Project Alpha&#8221; might pull information about a client with the same name, leading to irrelevant results.</p></li><li><p><strong>Context Leakage and Security Vulnerabilities:</strong> A poorly configured RAG can inadvertently include sensitive information, like personal data or financial details, in its response. Furthermore, bad actors can exploit techniques like &#8220;prompt injection&#8221; by embedding malicious instructions within source documents.</p></li><li><p><strong>Lack of Relational Understanding:</strong> Naive RAG cannot grasp the complex relationships between pieces of information scattered across different documents. It fails to connect the dots between an employee going on vacation, their temporary replacement, and a project they were managing.</p></li><li><p><strong>Contradictory Information:</strong> If the retrieved chunks contain conflicting information, the LLM can become confused, leading to hallucinations or inaccurate answers.</p></li></ul><h2><strong>Enter Knowledge Graphs: Bringing Structure to Context</strong></h2><p>To overcome these limitations, the industry is turning to knowledge graphs. A knowledge graph is essentially a structured representation of information, where entities (people, projects, documents) are nodes, and the relationships between them are edges. 
Instead of treating information as isolated text snippets, knowledge graphs capture the intricate web of connections, enabling a deeper, more contextual understanding.</p><p><strong>Advantages of Knowledge Graphs in RAG:</strong></p><ul><li><p><strong>Improved Accuracy and Relevance:</strong> By querying the graph, a system can retrieve not just individual entities but also their related entities and relationships, providing a more comprehensive and relevant context for the LLM.</p></li><li><p><strong>Verifiability and Explainability:</strong> Answers generated using knowledge graphs can be traced back to the source entities and relationships in the graph, allowing users to verify the source of the information and understand how the system arrived at its conclusion.</p></li><li><p><strong>Reduced Hallucinations:</strong> By feeding the LLM structured and validated information, knowledge graphs significantly reduce the likelihood of the model &#8220;making up&#8221; facts.</p></li></ul><h2><strong>Microsoft&#8217;s GraphRAG: A Powerful but Costly Pioneer</strong></h2><p>One of the first major attempts to merge knowledge graphs with RAG is Microsoft&#8217;s GraphRAG framework. 
This comprehensive pipeline uses an LLM to automatically extract entities and relationships from unstructured text to build a knowledge graph. It then uses hierarchical clustering to group related entities into &#8220;communities&#8221; and generates summaries for each cluster at various levels of abstraction.</p><p><strong>How GraphRAG Works:</strong></p><ol><li><p><strong>Indexing:</strong> An LLM analyzes source documents to extract entities and relationships, creating a knowledge graph.</p></li><li><p><strong>Clustering:</strong> Community detection algorithms group closely related entities.</p></li><li><p><strong>Summarization:</strong> The LLM generates descriptive summaries for each community.</p></li><li><p><strong>Retrieval:</strong> When a query is made, the system searches for relevant communities via their summaries and then drills down into the graph for detailed context.</p></li><li><p><strong>Generation:</strong> The LLM uses the retrieved information to generate a comprehensive answer.</p></li></ol><p>While incredibly powerful and capable of answering complex, multi-hop questions, GraphRAG has a significant drawback: high cost.
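</p><p>The index-and-retrieve flow above can be sketched as follows. This is only an illustration of the shape of the pipeline: connected components and plain substring matching stand in for the LLM-driven extraction, community detection, and summarization that GraphRAG actually performs, and the triples are invented for the example:</p>

```python
# GraphRAG-style flow in miniature: triples -> communities -> summaries -> retrieval.
from collections import defaultdict

# Step 1: (entity, relation, entity) triples an LLM might extract from documents.
triples = [
    ("Petrov", "manages", "Project Alpha"),
    ("Project Alpha", "billed_to", "Acme Corp"),
    ("Ivanova", "reviews", "Budget 2025"),
]

def communities(triples):
    # Step 2: group connected entities (connected components stand in for
    # real community detection such as Leiden clustering).
    adj = defaultdict(set)
    for a, _, b in triples:
        adj[a].add(b)
        adj[b].add(a)
    seen, groups = set(), []
    for node in adj:
        if node in seen:
            continue
        stack, group = [node], set()
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(adj[n])
        seen |= group
        groups.append(group)
    return groups

def summarize(group, triples):
    # Step 3: a real system asks an LLM for a prose summary of the community.
    facts = [f"{a} {r} {b}" for a, r, b in triples if a in group and b in group]
    return "; ".join(sorted(facts))

def retrieve(query, triples):
    # Step 4: select communities whose summary mentions a query term.
    hits = []
    for group in communities(triples):
        summary = summarize(group, triples)
        if any(word in summary for word in query.split()):
            hits.append(summary)
    return hits
```

<p>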
The indexing process, particularly the summarization of clusters, requires a massive number of API calls to powerful LLMs, making it computationally expensive and resource-intensive.</p><h2><strong>LightRAG: The Agile and Cost-Effective Alternative</strong></h2><p>In response to the complexity and cost of GraphRAG, researchers at the University of Hong Kong developed LightRAG, a lightweight framework that retains the benefits of the graph-based approach while dramatically reducing the computational overhead.</p><p>LightRAG simplifies the architecture by jettisoning the multi-level hierarchical clustering and summarization in favor of an elegant, dual-level retrieval system.</p><p><strong>The LightRAG Architecture:</strong></p><ol><li><p><strong>Keyword Extraction:</strong> When a query is received, an LLM extracts two types of keywords:</p><ul><li><p><strong>Local Keys:</strong> Specific entities mentioned in the query (e.g., &#8220;employee Petrov&#8221;).</p></li><li><p><strong>Global Keys:</strong> Broader concepts and themes (e.g., &#8220;financial optimization&#8221;).</p></li></ul></li><li><p><strong>Parallel Search:</strong></p><ul><li><p>Local keys are used to search for specific nodes in the knowledge graph.</p></li><li><p>Global keys are used to search for broader, conceptual relationships.</p></li></ul></li><li><p><strong>Context Expansion:</strong> Instead of just returning the retrieved entities, LightRAG queries their &#8220;neighborhood&#8221; within the graph, pulling all their immediate neighbors and connections. 
This creates a compact yet highly meaning-dense subgraph that serves as the context.</p></li><li><p><strong>Incremental Updates:</strong> Unlike GraphRAG, which requires a full graph rebuild when new data is added, LightRAG supports incremental updates, making it far more efficient for dynamic knowledge bases.</p></li></ol><p>This approach drastically cuts down on the required API calls, making LightRAG hundreds of times cheaper than GraphRAG while, on some metrics, delivering even better answer quality.</p><h2><strong>Practical Considerations and Future Directions</strong></h2><p>The move to graph-based RAG systems is not a silver bullet. The success of any of these systems is heavily dependent on the quality of the input data. Organizations must invest in standardizing terminology, defining clear relationships, and establishing naming conventions for entities.</p><p><strong>Key Takeaways:</strong></p><ul><li><p><strong>Naive RAG:</strong> Fast and cheap, ideal for simple Q&amp;A bots and document search, but falls short on complex queries.</p></li><li><p><strong>Microsoft GraphRAG:</strong> Extremely powerful for deep analysis of complex datasets, but its cost and complexity make it overkill for most applications.</p></li><li><p><strong>LightRAG:</strong> Offers a balanced approach, combining the power of knowledge graphs with efficiency and cost-effectiveness, making it a promising solution for a wide range of tasks.</p></li></ul><p>As we move toward more sophisticated and autonomous AI agents, the need for reliable, verifiable, and context-aware systems will only grow. 
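</p><p>The dual-level lookup and one-hop context expansion described above can be sketched as follows. The toy graph, the hard-coded keyword lists, and names such as Sidorova are invented for illustration; in LightRAG itself an LLM extracts the local and global keys from the query:</p>

```python
# LightRAG-style retrieval in miniature: local keys hit named entities,
# global keys hit theme nodes, then each seed's one-hop neighborhood is pulled.
from collections import defaultdict

edges = [
    ("Petrov", "manages", "Project Alpha"),
    ("Sidorova", "substitutes_for", "Petrov"),
    ("Project Alpha", "theme", "financial optimization"),
]

# Undirected adjacency: each node maps to its (relation, neighbor) pairs.
adj = defaultdict(list)
for a, r, b in edges:
    adj[a].append((r, b))
    adj[b].append((r, a))

def retrieve(local_keys, global_keys):
    # Local keys seed specific nodes; global keys seed nodes linked to themes.
    seeds = {k for k in local_keys if k in adj}
    for theme in global_keys:
        seeds |= {nbr for _, nbr in adj.get(theme, [])}
    # Context expansion: collect every seed's immediate neighborhood.
    subgraph = set()
    for node in seeds:
        for rel, nbr in adj[node]:
            subgraph.add((node, rel, nbr))
    return subgraph

ctx = retrieve(local_keys=["Petrov"], global_keys=["financial optimization"])
```

<p>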
Graph-based RAG frameworks like LightRAG represent a significant step forward in building AI assistants we can trust with real-world business challenges, paving the way for a future where AI-generated answers are not just eloquent, but provably true.</p>]]></content:encoded></item><item><title><![CDATA[Navigating the AI Hype: A Realistic Roadmap for Aspiring LLM Specialists]]></title><description><![CDATA[The AI revolution is in full swing, promising a future transformed by intelligent machines.]]></description><link>https://substack.vandriichuk.com/p/navigating-the-ai-hype-a-realistic</link><guid isPermaLink="false">https://substack.vandriichuk.com/p/navigating-the-ai-hype-a-realistic</guid><dc:creator><![CDATA[vandriichuk]]></dc:creator><pubDate>Fri, 22 Aug 2025 19:38:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FzhV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FzhV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FzhV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!FzhV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!FzhV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!FzhV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FzhV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png" width="1024" height="1024" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:1998654,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://vandriichuk.substack.com/i/171685408?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FzhV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!FzhV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!FzhV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!FzhV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb692ae4a-7083-462d-b922-1214b9c9a2e5_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>The AI revolution is in full swing, promising a future transformed by intelligent machines. Every day, headlines tout new breakthroughs, and job boards overflow with &#8220;AI Specialist&#8221; roles. Yet, amidst this whirlwind of excitement and opportunity, a crucial question arises for aspiring professionals: how do you cut through the deafening hype to build a truly valuable, sustainable career in Large Language Models (LLMs)? If you&#8217;re ready to look beyond the buzzwords and commit to the real work, this guide is your starting point.</p><p>The world is buzzing with AI. From news headlines to job boards, &#8220;Artificial Intelligence&#8221; and &#8220;Large Language Models&#8221; (LLMs) are the phrases on everyone&#8217;s lips.
The opportunities seem boundless, and the allure of being at the forefront of this technological revolution is undeniable. With this surge in interest, we&#8217;ve also seen a rise in discussions around &#8220;vibe-driven coding&#8221; &#8211; the notion that a positive attitude and a surface-level understanding might be enough to break into the field.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://substack.vandriichuk.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">vandriichuk&#8217;s Substack is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>Let&#8217;s be clear: while enthusiasm is fantastic, building a serious career as an LLM specialist requires far more than just good vibes. It demands dedication, a deep understanding of complex concepts, and a structured approach to learning and skill development. It&#8217;s not a leisurely stroll in the park; it&#8217;s a challenging climb, often feeling like a journey through the &#8220;seven circles of development hell.&#8221;</p><p>But for those who are truly passionate, for those who see beyond the immediate hype and are ready to commit to mastering the intricacies of LLMs, the rewards &#8211; both intellectual and professional &#8211; are immense.
The demand for skilled LLM practitioners, engineers, and researchers who can build, deploy, and innovate with these powerful models is only growing.</p><h2><strong>Introducing the LLM Specialist Roadmap</strong></h2><p>Recognizing the need for a clear path through this often-daunting landscape, I&#8217;ve put together a comprehensive <strong><a href="https://vandriichuk.com/llm_roadmap.html">LLM Specialist Roadmap</a></strong>.</p><p>This isn&#8217;t a promise of overnight expertise. Instead, it&#8217;s a structured, phased guide designed to take you from the foundational basics to advanced, expert-level skills. It outlines:</p><ul><li><p><strong>Key Stages of Development:</strong> From Junior Practitioner to LLM Expert/Researcher.</p></li><li><p><strong>Essential Tools &amp; Technologies:</strong> Covering everything from Python and Git to specialized frameworks like PyTorch, TensorFlow, and Hugging Face.</p></li><li><p><strong>Critical Skills:</strong> Detailing the progression of necessary competencies, including prompt engineering, RAG systems, fine-tuning, model optimization, and cutting-edge research areas.</p></li><li><p><strong>Core Theoretical Knowledge:</strong> Emphasizing the importance of understanding linear algebra, neural networks, transformer architecture, and more.</p></li><li><p><strong>Practical Projects:</strong> Suggesting hands-on projects to solidify your learning at each stage.</p></li></ul><p>Think of it as your (brutally honest) guide through the inferno of LLM development. It&#8217;s designed to equip you with the knowledge and practical experience needed to not just &#8220;vibe&#8221; with AI, but to truly <em>build</em> and <em>innovate</em> with it.</p><h2><strong>Why a Roadmap? Why Now?</strong></h2><p>In a field evolving as rapidly as AI, having a structured learning path is crucial. It helps cut through the noise, focus on what truly matters, and build a solid foundation upon which to grow. 
This roadmap is for those who:</p><ul><li><p>Are serious about a career in LLM development.</p></li><li><p>Understand that real expertise requires effort and persistence.</p></li><li><p>Are looking for a clear, actionable plan to guide their learning journey.</p></li></ul><p>As I often say (perhaps paraphrasing a very tired developer), <em>&#8220;If the path is easy, you&#8217;re probably going the wrong way. True peaks require a climb.&#8221;</em> This roadmap is designed to help you navigate that climb, one challenging but rewarding step at a time.</p><h2><strong>Your Journey Starts Here</strong></h2><p>If you&#8217;re ready to move beyond the surface-level buzz and dive deep into the world of Large Language Models, I invite you to explore the <strong><a href="https://vandriichuk.com/llm_roadmap.html">LLM Specialist Roadmap</a></strong>.</p><p>It won&#8217;t be easy. It will demand your full attention and dedication. But if you&#8217;re up for the challenge, the path to becoming a sought-after LLM specialist is laid out for you.</p><p>(And if, after perusing the roadmap, you decide that perhaps tiger taming or professional napping is a more suitable career path, no judgment here. I warned you! 
&#128521;)</p>]]></content:encoded></item></channel></rss>