This article details the architecture and development of AskRich, a retrieval-backed chatbot designed to enhance technical screening by providing citation-backed answers from a candidate's portfolio. It explores the system's design, including a Cloudflare Worker at the edge, a LangGraph orchestrator, and a crucial feedback loop for continuous improvement of answer quality and retrieval effectiveness. The discussion also covers the implementation of a resilient rate limiting mechanism.
Read original on Dev.to

The AskRich chatbot addresses the challenge of insufficient detail in traditional resumes by enabling hiring teams to ask specific technical questions and receive answers grounded in a candidate's actual portfolio and writing, complete with verifiable citations. This core product decision, citation-backed output, noticeably improves the quality of follow-up questions and the overall utility of the conversation.
The system features a thin web client interacting with a retrieval-backed chat API. A Cloudflare Worker acts as an edge layer, handling requests with rate limiting and cache checks before forwarding to a LangGraph orchestrator. This orchestrator integrates a Content Index for retrieval and an LLM Response Layer for grounded generation, producing answers with citations for the UI.
Browser → POST /api/chat → Cloudflare Worker (rate limit + cache check)
                                   ↓
                         LangGraph Orchestrator
                          ↙                 ↘
               Content Index           LLM Response Layer
                (retrieval)           (grounded generation)
                          ↘                 ↙
                      Answer + Citations → UI Renderer

The Cloudflare Worker supports multiple runtime modes (upstream, local, openai) for flexible testing and routing without client redeployment, showcasing a thoughtful approach to operational agility in AI system development.
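The mode switch can be sketched as a small routing function inside the Worker. The three mode names come from the article; the function name and endpoint URLs below are illustrative placeholders, not the real configuration.

```typescript
// Hypothetical sketch of the Worker's runtime-mode routing.
// The mode names (upstream, local, openai) are from the article;
// everything else here is an assumption for illustration.
type RuntimeMode = "upstream" | "local" | "openai";

// Map each runtime mode to the backend the Worker forwards to.
function resolveTarget(mode: RuntimeMode): string {
  switch (mode) {
    case "upstream":
      // Deployed LangGraph orchestrator (placeholder URL).
      return "https://askrich-orchestrator.example.com/chat";
    case "local":
      // Local development backend (placeholder URL).
      return "http://localhost:8787/chat";
    case "openai":
      // Direct model call, bypassing retrieval, for comparison testing.
      return "https://api.openai.com/v1/chat/completions";
  }
}
```

Because the mode is resolved at the edge, switching backends is a Worker configuration change rather than a client redeploy, which is what makes A/B-style testing cheap.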
A critical component of AskRich is its structured feedback loop. It records events for every question, answer, and user interaction (thumbs-up/down), linked by stable event IDs. This allows for precise triage of low-rated answers, classifying failures into categories like Corpus gap, Retrieval/ranking issue, Prompt/format issue, or Out-of-scope. Changes are tested against baselines to ensure improvements without regressing citation clarity.
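The feedback loop described above can be sketched as a typed event record plus a triage query. The failure categories are the article's; the field names, types, and the `triageQueue` helper are assumptions for illustration, not the real schema.

```typescript
// Hypothetical event shapes for the feedback loop. The four failure
// categories come from the article; field names are assumptions.
type FailureCategory =
  | "corpus-gap"          // answer missing because the portfolio lacks the content
  | "retrieval-ranking"   // content exists but was not retrieved or ranked well
  | "prompt-format"       // right content, poorly presented or cited
  | "out-of-scope";       // question the system should decline

interface FeedbackEvent {
  eventId: string;             // stable ID linking question, answer, and rating
  question: string;
  answer: string;
  rating: "up" | "down";       // thumbs-up / thumbs-down from the user
  category?: FailureCategory;  // assigned during triage of low-rated answers
}

// Collect down-rated events that have not yet been classified,
// i.e. the triage backlog.
function triageQueue(events: FeedbackEvent[]): FeedbackEvent[] {
  return events.filter((e) => e.rating === "down" && e.category === undefined);
}
```

Stable event IDs are what make this workable: a rating arriving minutes after the answer can still be joined back to the exact question and retrieval results that produced it.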
Rate limiting is enforced at the Cloudflare Worker using a one-way hash of request context (IP + origin + user-agent) for client identification, avoiding persistent storage of raw IPs. Requests then pass through sequential guards: a minimum burst interval between requests and a per-client hourly quota. A key design decision is graceful degradation (fail-open) if the KV storage backing the limiter is unavailable, prioritizing availability over strict rate limiting in adverse conditions.
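A minimal sketch of that limiter, with an in-memory `Map` standing in for Workers KV: the hashed client key, the sequential burst and quota guards, and the fail-open branch are from the article, while the specific quota values, function names, and bucket layout are assumptions.

```typescript
import { createHash } from "node:crypto";

// Assumed limits for illustration; the article does not state the real values.
const HOURLY_QUOTA = 60;        // max requests per client per hour
const BURST_INTERVAL_MS = 2000; // minimum gap between consecutive requests
const HOUR_MS = 3_600_000;

// One-way hash of request context, so raw IPs are never persisted.
function clientKey(ip: string, origin: string, userAgent: string): string {
  return createHash("sha256").update(`${ip}|${origin}|${userAgent}`).digest("hex");
}

interface Bucket {
  count: number;       // requests seen in the current hourly window
  windowStart: number; // timestamp (ms) when the window opened
  lastSeen: number;    // timestamp (ms) of the most recent request
}

// `store` stands in for the KV namespace; passing null simulates KV being down.
function allowRequest(
  key: string,
  now: number,
  store: Map<string, Bucket> | null
): boolean {
  if (store === null) return true; // fail-open: prefer availability over strictness

  const bucket = store.get(key);
  if (!bucket || now - bucket.windowStart >= HOUR_MS) {
    // New client, or the hourly window has rolled over: start fresh.
    store.set(key, { count: 1, windowStart: now, lastSeen: now });
    return true;
  }
  // Guard 1: burst interval between consecutive requests.
  if (now - bucket.lastSeen < BURST_INTERVAL_MS) return false;
  // Guard 2: hourly quota.
  if (bucket.count >= HOURLY_QUOTA) return false;

  bucket.count++;
  bucket.lastSeen = now;
  return true;
}
```

The fail-open branch is the deliberate trade-off the article highlights: if KV is unreachable, the Worker serves traffic unlimited rather than rejecting legitimate users.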