How We Made Our AI Chatbot 5x Faster Without Changing the AI Model

When we first launched our AI-powered chatbot on the 247 Restoration Specialists website, the vision was simple: give every visitor instant, intelligent answers about water damage, fire restoration, mold remediation — 24 hours a day, 7 days a week, no waiting on hold.

The reality? Users were staring at a spinning loader for up to 13 seconds before seeing a single word. Some were getting “Sorry, something went wrong” errors and giving up entirely.

We knew we had a problem. So we dug in, found the root causes, and fixed them. Here’s exactly what we did.


The Problem: Every Message Was Treated Like a Research Paper

Our chatbot uses a technique called RAG — Retrieval Augmented Generation. In plain English: before the AI answers your question, it first searches a knowledge base of our articles and content, finds the most relevant information, and uses that to generate a grounded, accurate response.

This is powerful. But our original implementation had a serious flaw — it ran the full research pipeline on every single message, no matter how simple.

Ask “What is 247RS?” and behind the scenes, the system was doing this:

  1. Query Rewrite — an AI model rephrases your question for better search results (~2 seconds)
  2. HyDE — the AI generates a “hypothetical answer” to find better matches (~2 seconds)
  3. Multi-Query Generation — the AI creates 3 alternate versions of your question (~2 seconds)
  4. Vector Search — 3 simultaneous database searches across thousands of embedded documents (~2 seconds)
  5. Reranking — all results are scored and sorted by relevance (~0.5 seconds)
  6. Answer Generation — Gemini finally writes the response (~3–5 seconds)

Total time before the user saw anything: 7 to 13 seconds.

For a simple greeting or a one-line question, that’s unacceptable.
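To make that concrete, here is a minimal sketch of the original serial flow. The helper names are hypothetical stand-ins for the real model and database calls, and the sleeps simply approximate the latencies listed above:

```python
import time

# Hypothetical stand-ins for the real model and database calls; the sleeps
# approximate the latencies listed above.
def rewrite_query(q):          time.sleep(2.0); return q
def hyde_answer(q):            time.sleep(2.0); return "hypothetical answer"
def multi_queries(q):          time.sleep(2.0); return [q + " (v1)", q + " (v2)", q + " (v3)"]
def vector_search(queries):    time.sleep(2.0); return ["chunk A", "chunk B"]
def rerank(q, docs):           time.sleep(0.5); return docs
def generate_answer(q, docs):  time.sleep(3.0); return "final answer"

def answer(question: str) -> str:
    # Every message, no matter how short, walked through all six steps in series.
    rewritten = rewrite_query(question)
    hyde = hyde_answer(rewritten)
    variants = multi_queries(rewritten)
    docs = vector_search([rewritten, hyde, *variants])
    ranked = rerank(question, docs)
    return generate_answer(question, ranked)
```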


Fix #1 — The Simple Query Fast-Path

The first thing we did was ask: does every question really need all of this?

The answer was no.

Short, conversational messages — “Hi”, “What do you do?”, “Do you work in Chicago?” — don’t need query rewriting, hypothetical answer generation, or multiple search variations. They just need a quick search and a direct answer.

We added a fast-path detector: any message of 6 words or fewer skips steps 1–3 entirely and jumps straight to search and generation.
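The detector itself is tiny. A minimal sketch (the constant name is ours; the threshold is the one we settled on):

```python
SIMPLE_QUERY_MAX_WORDS = 6

def is_simple_query(message: str) -> bool:
    # Short, conversational messages skip query rewriting, HyDE, and
    # multi-query generation (steps 1-3) and go straight to search.
    return len(message.split()) <= SIMPLE_QUERY_MAX_WORDS
```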

Result:

  • Simple questions went from 7–13 seconds → 2–3 seconds
  • No AI capability was lost — complex, detailed questions still run the full pipeline

Fix #2 — Fixing a Broken Pipeline Step (That Was Silently Failing)

During our investigation, we discovered something alarming: the Multi-Query Generation step was silently broken. It was failing every single time due to a code bug, but instead of throwing an error, it was quietly returning nothing and moving on.

This meant:

  • The system was wasting time attempting a step that always failed
  • We were never getting the benefit of multi-query search
  • The bug was completely invisible in normal logs

We fixed the underlying code error, and multi-query search now works correctly for complex questions.
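We won't reproduce the original bug here, but the anti-pattern is worth calling out: an overly broad exception handler that returns an empty result looks exactly like success. A minimal illustration (helper names hypothetical) of surfacing the failure instead of swallowing it:

```python
import logging

logger = logging.getLogger("rag.multi_query")

def generate_multi_queries(question: str) -> list[str]:
    # Hypothetical stand-in for the real LLM call.
    raise RuntimeError("simulated model failure")

def multi_queries_or_fallback(question: str) -> list[str]:
    # The broken version swallowed the exception and returned an empty list,
    # so the step "succeeded" with nothing to show for it. Logging the failure
    # and falling back to the original question keeps the pipeline alive while
    # making the breakage impossible to miss.
    try:
        return generate_multi_queries(question)
    except Exception:
        logger.exception("Multi-query generation failed for %r", question)
        return [question]
```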


Fix #3 — Reducing API Quota Pressure

Our original setup was firing up to 5 simultaneous embedding API calls per user message — one for each search query variant. This was hammering Vertex AI’s rate limits.

When we exceeded those limits, the API returned a 429 ResourceExhausted error. The chatbot had no retry logic, so it caught the error and immediately showed the user: “Sorry, something went wrong.”

We fixed this in two ways:

  • Reduced parallel searches from 3 down to 2 for complex queries (simple queries run just 1)
  • Added automatic retry logic — if Gemini or the embedding API hits a transient error, the system waits 1.5 seconds and tries again before ever showing an error to the user

Result: The mysterious intermittent errors disappeared entirely.
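For reference, here is a minimal sketch of the retry wrapper. The names are illustrative, and matching error strings is a simplification; in practice you would check the SDK's typed exceptions:

```python
import time

RETRY_DELAY_SECONDS = 1.5
TRANSIENT_MARKERS = ("429", "ResourceExhausted", "503")

def call_with_retry(fn, *args, attempts: int = 2, **kwargs):
    """Run an API call; if it fails with a transient error, wait briefly
    and try again before ever surfacing an error to the user."""
    for attempt in range(attempts):
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            is_transient = any(marker in str(exc) for marker in TRANSIENT_MARKERS)
            if not is_transient or attempt == attempts - 1:
                raise
            time.sleep(RETRY_DELAY_SECONDS)
```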


Fix #4 — Parallelizing Independent Steps

Even after the main search was done, two additional lookups were running one after the other:

  • Search for relevant positive feedback examples
  • Search for content from the specific page the user was on

These two searches had no dependency on each other — there was no reason one had to wait for the other. We switched them to run simultaneously using async parallel execution.
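With asyncio this is essentially a one-line change at the call site. A minimal sketch, with the two lookups stubbed out under hypothetical names:

```python
import asyncio

# Hypothetical stand-ins for the two independent vector lookups.
async def fetch_feedback_examples(query: str) -> list[str]:
    await asyncio.sleep(0.5)
    return ["relevant positive feedback example"]

async def fetch_page_context(page_url: str) -> list[str]:
    await asyncio.sleep(0.5)
    return ["content from the page the user is viewing"]

async def gather_extra_context(query: str, page_url: str):
    # Neither lookup depends on the other, so await them together:
    # total wall time becomes max(a, b) instead of a + b.
    feedback, page_context = await asyncio.gather(
        fetch_feedback_examples(query),
        fetch_page_context(page_url),
    )
    return feedback, page_context
```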

Result: Shaved another ~0.5–1 second off every response.


Fix #5 — Real-Time Streaming (The ChatGPT Effect)

This one changed the entire feel of the chatbot.

Previously, even when the pipeline was fast, users still had to wait for the complete answer to be generated before seeing a single word. For longer responses — detailed explanations of the restoration process, step-by-step guides, service area information — this meant 10–20 seconds of staring at a spinner.

We implemented Server-Sent Events (SSE) streaming, the same technology that makes ChatGPT feel alive.

Here’s what changed:

| Before | After |
| --- | --- |
| Spinner → wait → wall of text | Words appear one by one, instantly |
| 15 seconds of silence | Real-time “typing” effect |
| Users unsure if it’s working | Users engaged from the first word |

The moment Gemini starts generating a response, each word is sent to the browser immediately. Users see the answer being written in real time, just like a human typing.

We also added live status messages during the pipeline phase:

  • “Searching knowledge base…”
  • “Preparing answer…”

So even for complex questions where research takes a few seconds, users always know the system is actively working on their request.
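On the server side, this is an endpoint that emits Server-Sent Events as the model produces tokens. A minimal sketch, assuming a FastAPI backend, with the retrieval and model calls stubbed out (the endpoint path and helper names are illustrative):

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_answer(question: str):
    # Hypothetical stand-in for the streamed Gemini call: in reality each
    # chunk arrives from the model the moment it is generated.
    for word in "Water damage restoration starts with an inspection ...".split():
        yield word + " "

@app.get("/chat")
async def chat(question: str):
    async def event_stream():
        # Status events keep the user informed while retrieval runs.
        yield "event: status\ndata: Searching knowledge base…\n\n"
        # ... vector search and reranking happen here ...
        yield "event: status\ndata: Preparing answer…\n\n"
        async for chunk in stream_answer(question):
            # Each chunk is flushed to the browser immediately.
            yield f"data: {chunk}\n\n"
        yield "event: done\ndata: end\n\n"
    return StreamingResponse(event_stream(), media_type="text/event-stream")
```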


Fix #6 — 8x Faster Knowledge Base Building

Behind the scenes, our chatbot learns from the 247RS website by processing every article and blog post into a searchable vector database. This process — called ingestion — needed to run regularly to keep the bot up to date.

The old approach used contextual retrieval: for every chunk of text (a long article split into 5–10 pieces), it made a separate AI call to generate context. One article with 8 chunks = 8 Gemini calls.

We redesigned this to make one AI call per article, generate a summary, and prepend that summary to all chunks from that article.
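A minimal sketch of the new chunking step, with the splitter and summarizer simplified into stand-ins for the real ones:

```python
def split_into_chunks(text: str, size: int = 1200) -> list[str]:
    # Naive fixed-size splitter, just for illustration.
    return [text[i:i + size] for i in range(0, len(text), size)]

def summarize_article(text: str) -> str:
    # Hypothetical stand-in for the single Gemini summarization call per article.
    return text[:200]

def build_contextual_chunks(article_text: str) -> list[str]:
    # One model call per article instead of one per chunk; the article-level
    # summary is prepended to every chunk before embedding, so each chunk
    # still carries document-wide context.
    summary = summarize_article(article_text)
    return [f"{summary}\n\n{chunk}" for chunk in split_into_chunks(article_text)]
```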

Result:

  • ~8x fewer API calls during ingestion
  • Ingestion that took hours now completes much faster
  • Knowledge base stays current without burning through API quota

The Results

| Metric | Before | After |
| --- | --- | --- |
| Simple question response time | 7–13 seconds | 2–3 seconds |
| Intermittent errors | Frequent | Eliminated |
| Long answer experience | Wait → wall of text | Real-time streaming |
| Ingestion speed | Slow (1 AI call per chunk) | ~8x faster (1 AI call per article) |

Key Takeaway

The AI model itself — Gemini — was never the bottleneck. The bottleneck was how we were using it: too many unnecessary calls, no fast-path for simple queries, a broken pipeline step running silently, and no streaming to show users progress.

Sometimes the biggest performance wins aren’t about upgrading your AI — they’re about being smarter with the AI you already have.


The 247 Restoration Specialists chatbot is available 24/7 at 247restorationspecialists.com — ask us anything about water damage, fire restoration, or mold remediation.







