The RAG Playbook

August 2, 2024

Table of Contents

  1. Path to Success
  2. RecSys in an LLM Sandwich
  3. The RAG Flywheel
  4. RAG Playbook
  5. Common Pitfalls
  6. Conclusion

Path to Success

  • Focus on experimentation rather than a vague "make it better"
  • "Look at the data"
    • Why am I looking at my data?
    • What am I looking for?
    • When I see the signals:
      • What can I do with this information to improve my application?
      • Is the juice worth the squeeze?
  • Doing the most obvious thing over and over

RecSys in an LLM Sandwich

RAG: RecSys in an LLM sandwich

Step 1

  • A chat interface or web UI takes in the original query
  • Route different queries to search engines using Planning, CoT, and Parallel Tool Use (see the sketch below)
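
A minimal sketch of the routing and parallel fan-out idea; the backend names and the route_and_search helper are hypothetical, not part of the original:

import asyncio

# Hypothetical search backends; each returns a list of result dicts.
async def search_documents(query: str) -> list[dict]:
  return [{"source": "documents", "text": f"document hit for: {query}"}]

async def search_meetings(query: str) -> list[dict]:
  return [{"source": "meetings", "text": f"meeting hit for: {query}"}]

ROUTES = {"documents": search_documents, "meetings": search_meetings}

async def route_and_search(query: str, targets: list[str]) -> list[dict]:
  # Fan the query out to the selected backends in parallel and merge the hits.
  tasks = [ROUTES[name](query) for name in targets if name in ROUTES]
  per_backend = await asyncio.gather(*tasks)
  return [hit for hits in per_backend for hit in hits]

# Example: a planning step decided this query needs both indexes.
hits = asyncio.run(route_and_search("prep for tomorrow's client call", ["documents", "meetings"]))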

Step 2

  • Retrieval, filtering, scoring, and ordering are performed

Step 3

  • Produce representations for the chunks
  • Choose a generation prompt for the LLM's reply

Step 4

  • Build a special UI to render responses beyond text streams
  • E.g. profile cards, pdf viewers, citations, feedback buttons
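
As a sketch, the response can be a structured payload rather than a raw text stream so the UI can render citations and feedback controls; the field names below are assumptions, not a prescribed schema:

from dataclasses import dataclass, field

@dataclass
class Citation:
  chunk_id: str
  title: str
  url: str

@dataclass
class RagResponse:
  answer: str                                              # the generated answer text
  citations: list[Citation] = field(default_factory=list)  # rendered as cards or pdf links
  feedback_token: str = ""                                  # attached to thumbs-up/down events for later analysis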

In Simple Code

def rag_app(question, messages, k):
  # Decompose the question into one or more searches plus a rewritten query
  searches, qualified_query = query_understanding(question, messages)
  # Pick the generation prompt that matches the rewritten query
  prompt = answer_prompt(qualified_query)
  # Execute every search, then score/order the hits and keep the top k
  results = [search.execute() for search in searches]
  results = order_results(results)[:k]
  # Produce the chunk representations the prompt will cite
  context = format_chunks(results)
  answer = answer_question(question, context, prompt)
  return answer
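
The helpers above are placeholders. A minimal sketch of order_results and format_chunks, assuming each hit is a dict with score and text keys (an assumption, not something the original specifies):

def order_results(results: list[list[dict]]) -> list[dict]:
  # Flatten the hits from every search and sort by retrieval score, best first.
  hits = [hit for batch in results for hit in batch]
  return sorted(hits, key=lambda hit: hit.get("score", 0.0), reverse=True)

def format_chunks(hits: list[dict]) -> str:
  # Produce a simple numbered representation the generation prompt can cite.
  return "\n\n".join(f"[{i}] {hit['text']}" for i, hit in enumerate(hits, start=1))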

The RAG Flywheel

  1. Initial implementation: Start with a basic RAG system setup
  2. Synthetic data generation: Create synthetic questions to test the system's retrieval abilities (see the sketch after this list)
  3. Fast evaluation: Conduct quick, unit test-like evaluations of basic retrieval capabilities (e.g. precision, recall, mean reciprocal rank), understanding why each metric matters
  4. Real-world data collection: Gather real user queries and interactions. Ensure feedback is aligned with business outcomes or correlated with important qualities that predict customer satisfaction
  5. Classification and analysis: Categorize and analyze user questions to identify patterns and gaps
  6. System improvements: Based on analysis, make targeted improvements to the system
  7. Production monitoring: Implement ongoing monitoring to track system performance
  8. User feedback integration: Continuously incorporate user feedback into the system to improve performance
  9. Iterate: Repeat the process, refining both the system and the process itself
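
A minimal sketch of the synthetic-question step, assuming a hypothetical llm callable that takes a prompt and returns a JSON list of question strings:

import json

def generate_synthetic_questions(chunks: list[dict], llm, n_per_chunk: int = 3) -> list[dict]:
  # For each chunk, ask the LLM for questions that the chunk should be able to answer.
  pairs = []
  for chunk in chunks:
    prompt = (
      f"Write {n_per_chunk} questions a user might ask that this passage answers. "
      f"Return a JSON list of strings.\n\nPassage:\n{chunk['text']}"
    )
    for question in json.loads(llm(prompt)):
      pairs.append({"question": question, "expected_chunk_id": chunk["id"]})
  return pairs

Each (question, expected_chunk_id) pair becomes a retrieval test case: the question goes in, and the source chunk should come back near the top of the results.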

RAG Playbook

Step 1: Address cold start

  • Generate synthetic data
  • Establish general baselines for:
    • Recall
    • Precision
    • Lexical search
    • Semantic search
    • Re-rankers
  • Use this baseline to test all other hypotheses (see the evaluation sketch below)
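
A minimal harness for these baselines, reusing the synthetic pairs above; retriever is any callable that returns a ranked list of chunk ids, so lexical search, semantic search, and re-rankers can be compared on the same data:

def evaluate_retriever(retriever, pairs: list[dict], k: int = 10) -> dict:
  # retriever(question) -> ranked list of chunk ids.
  hits, reciprocal_ranks = 0, []
  for pair in pairs:
    ranked = retriever(pair["question"])[:k]
    if pair["expected_chunk_id"] in ranked:
      hits += 1
      reciprocal_ranks.append(1.0 / (ranked.index(pair["expected_chunk_id"]) + 1))
    else:
      reciprocal_ranks.append(0.0)
  return {"recall@k": hits / len(pairs), "mrr": sum(reciprocal_ranks) / len(pairs)}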

Step 2: Instrument and observe user data to identify strengths and weaknesses in clusters

  • Topics and capabilities identified with domain experts
  • Identify and understand underperforming query segments
  • Convert offline analysis into online classifiers and query routers to monitor production data (a minimal classifier is sketched below)
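
One way to turn the offline analysis into an online classifier is a thin keyword (or LLM-based) tagging layer; the topic names and keywords below are hypothetical examples:

# Hypothetical topic -> keyword map derived offline with domain experts.
TOPIC_KEYWORDS = {
  "meeting_prep": ["meeting", "agenda", "prep"],
  "contract_search": ["contract", "clause", "agreement"],
}

def classify_query(query: str) -> str:
  # Tag each production query so per-topic volume, retrieval quality, and satisfaction can be monitored.
  lowered = query.lower()
  for topic, keywords in TOPIC_KEYWORDS.items():
    if any(word in lowered for word in keywords):
      return topic
  return "other"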

Step 3: Implement new routers, tools, and capabilities

  • Evaluate experiments against both offline and product metrics
  • Explore and exploit given the distribution of capabilities (a selection sketch follows this list)
  • Iterate and monitor continuously
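
One hedged reading of explore/exploit here is online selection among retrieval configurations based on observed feedback; a Thompson-sampling sketch with hypothetical configuration names:

import random

# Each configuration ("arm") keeps Beta(successes + 1, failures + 1) counts.
arms = {"semantic": [1, 1], "semantic+rerank": [1, 1], "lexical": [1, 1]}

def pick_config() -> str:
  # Sample each arm's posterior and use the best draw; exploration falls out of the sampling.
  return max(arms, key=lambda name: random.betavariate(*arms[name]))

def record_feedback(name: str, success: bool) -> None:
  arms[name][0 if success else 1] += 1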

Step 4: Rinse and repeat

  • Continuously conduct exploratory data analysis, partner with users and domain experts, and extend capabilities
  • Understand the process of evaluating candidate selection
  • Use topic modeling to surface new query clusters and capability gaps (see the clustering sketch below)
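
A rough topic-modeling sketch over production queries, assuming a hypothetical embed callable that maps a query string to a vector (scikit-learn's KMeans is just one clustering choice):

from collections import Counter

import numpy as np
from sklearn.cluster import KMeans

def cluster_queries(queries: list[str], embed, n_topics: int = 8) -> Counter:
  # Embed every query, cluster the vectors, and count cluster sizes to see
  # where query volume concentrates and which segments deserve attention.
  vectors = np.array([embed(q) for q in queries])
  labels = KMeans(n_clusters=n_topics, random_state=0, n_init=10).fit_predict(vectors)
  return Counter(int(label) for label in labels)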

Common Pitfalls

💡

Often people say they need complex reasoning because of a lack of user empathy and not knowing what users really want. Instead, we need to build high-specificity tooling that users care about.

Vicious Cycle

  • Vague metrics: Things "look better"? Things "don't feel right"?
  • No superpowers: Build generic solutions for broad problems (e.g. searching all personal data) instead of specific tasks (e.g. a meeting preparation tool). We want outcomes, not features.
  • Unactionable feedback: Focusing on feedback about LLM generations before having good search evaluations. Looking at this feedback does not produce recommendations we can use to improve the system over time.

Focus on the Right Metrics

💡

Set concrete goals for leading metrics: Focus on increasing the number of hypotheses tested per week. Through continual experimentation, build a stronger intuition for better retrieval methods and indices; it will feel boring, like counting pomodoro sessions.

  • Leading Metrics: Hard to Measure, Easy to Change

    • Track and predict future performance
    • Provide feedback on when and where to intervene
    • Often inputs to a system
    • Example: number of experiments run
    • Analogy: "count pomodoro sessions, track screen time"
  • Lagging Metrics: Easy to Measure, Hard to Change

    • Measure past outcomes
    • Often unresponsive and hard to change
    • Often outputs of a system
    • Example: application quality, churn, satisfaction
    • Analogy: "procrastinate less, do more work"

Absence Blindness

  • You don't fix what you can't see
  • Everyone sees: generation quality, latency
  • Not everyone sees: poor retrieval quality, bad representations

Conclusion

  • RAG systems require continuous iteration and improvement
  • Focus on experimentation and concrete metrics for success
  • Address cold start with synthetic data and baseline evaluations
  • Instrument and analyze user data to identify areas for improvement
  • Implement targeted tools and capabilities based on insights
  • Avoid common pitfalls like vague metrics and unactionable feedback
  • Prioritize leading metrics that drive actionable improvements

References:

  1. Jason Liu