RAG: Retrieval Evaluations

August 4, 2024

High-level Evaluation Strategy

  • Start with a basic RAG system setup
  • Create synthetic questions to test the system's retrieval abilities
  • Conduct quick, unit test-like evaluations to assess basic retrieval capabilities
  • Gather real user queries and interactions
  • Categorize and analyze user questions to identify patterns and gaps
  • Based on analysis, make targeted improvements to the system
  • Implement ongoing monitoring to track system performance
  • Continuously incorporate user feedback into the system
RAG Flywheel

Fundamentals of LLM Evaluations

⚠️ Stop using LGTM@K: "looks good to me" is not a valid evaluation metric.

| Aspect | Generation evals | Retrieval evals |
| --- | --- | --- |
| Testing for... | Factuality. Team lead: "Looks better now" (based on made-up evaluations that don't capture usage behavior) | Recall and precision@K across lexical search, semantic search, and re-rankers |
| Speed of tests | Slow (~1s to 10s per test) | Fast (~10ms to 800ms per test) |
| Cost of tests | $100s per run. Team lead: "Who spent $1,000 on OpenAI credits yesterday?" | Negligible per run |
| Frequency of tests | Infrequent | Frequent |
| Speed of test iterations | Slow (tests that could take minutes take hours) | Fast |
| Ease of test scalability | Difficult | Easy |
Generation Evals vs Retrieval Evals

Common Pitfalls

  • Relying only on generation evals implicitly assumes there are no issues in the retrieval steps
  • Datasets for generation evals are expensive to build and verify, so many retrieval issues go unnoticed
  • Developers often don't realize that retrieval alone may be powerful enough for several initial use cases, letting them start gathering feedback early and add generation afterwards

Comparison

Textbook analogy

  • Generation eval (tests generation and factuality):
    • Question: "What is the powerhouse of the cell?"
    • Potential desired answers:
      • "Mitochondria [6, 9, 13]"
      • "The powerhouse of the cell is mitochondria, from pages 6, 9, 13"
      • Definition of mitochondria + paragraph
  • Retrieval eval (tests recall and precision):
    • Question: "What page(s) can I find the answer to 'what is the powerhouse of the cell?'"
    • Answer: "6, 9, 13"

Contact information

  • Generation eval:
    • Question: "How do I contact Ryan?"
    • Answer:
      • "Ryan's email is [email protected]"
      • "Phone number: (123)-456-7890"
      • "You can contact Ryan on X @ryanxjhan"
  • Retrieval eval:
    • Question: "What document has the contact information for Ryan?"
    • Answer: "teamcontacts.docx"

Recall and Precision as Key Metrics

| | Recall | Precision |
| --- | --- | --- |
| Definition | % of relevant documents that are successfully retrieved: (relevant retrieved documents) ÷ (total relevant documents) | % of retrieved documents that are relevant to the query: (relevant retrieved documents) ÷ (total retrieved documents) |
| Why is this important to measure? | High recall means the system finds most of the relevant documents. This is especially important when facts are hard to find and spread across many documents. | High precision means the system retrieves mostly relevant documents. Weaker models are more easily confused by irrelevant context. |
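
As a concrete sketch of these definitions (the function names and inputs are illustrative, not from the original write-up), both metrics reduce to a little set arithmetic over retrieved and relevant chunk IDs:

def recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # Fraction of all relevant documents that were actually retrieved.
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids) & relevant_ids) / len(relevant_ids)

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Fraction of the top-k retrieved documents that are relevant.
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return len(set(top_k) & relevant_ids) / len(top_k)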

Case Study: AI-Generated Reports for Consultants

Situation

  • Consultants do 15-30 research interviews with experts
  • The consultants know which experts had specific perspectives about the products

Complication

  • The consulting team requested an AI-generated report
  • For certain fields, the report only quoted 3 out of the 6 relevant experts
  • The consulting team expected the AI report to retrieve all 6 quotes
  • As a result, the team lost confidence in the system

Solution

  • Shifted focus to ensuring 100% recall of all correct citations
  • Manually built question-chunk datasets for relevance

Impact

  • Improved recall from 50% to 90%+ through significant pre-processing work
  • Continued the sales process by building trust and customer-specific evaluations

💡 Key takeaways:

  • It's essential that pre-processing aligns with the anticipated queries in order to improve recall
  • Leverage customer interactions to motivate your test suites

Case Study: AI for Blueprint

Situation

  • Construction workers want to ask questions regarding blueprints

Complication

  • Using data labeled by a vision language model (VLM), we got only ~27% recall@75 for the image-specific search tool
  • We needed better image summary prompts to recover descriptions that go beyond basic image captioning

Approach

  • Tested VLM captioning and search over the generated descriptions
  • Used chain-of-thought (CoT) prompting, asking the language model to reason about the blueprint before describing it (a prompt sketch follows this list):
    • "How many rooms..."
    • "Describe the rooms..."
    • "How big was the room..."

Impact

  • Improved recall from ~27% to 85%+ through pre-processing of the blueprint datasets
  • Only 4 days were spent iterating on summarization prompts
    • 12 prompts and 3 models were tested
  • Presented to design partners to obtain real user data at this level of recall
  • ~20% of queries were related to counting objects in blueprints
    • This justified an investment in bounding-box models to attempt the counting

💡 Key takeaways:

  • Testing the ability of a single subsystem becomes very fast if we have the right baselines to test against
  • Experimenting with high-specificity prompts for synthetic summary generation goes a long way in improving recall of images, tables, documents and any other artifacts

Synthetic Data Generation

No User Queries

  • Use existing data to generate evals to get a benchmark for precision and recall
    • question
    • answer
    • chunk_id
  • Can be as simple as: assert eval.chunk_id in search(eval.question, limit=k).chunk_ids
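
A minimal runnable sketch of that assertion-style eval, assuming a hypothetical search callable that returns a list of retrieved chunk IDs and a small synthetic dataset of question/answer/chunk_id records:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SyntheticEval:
    question: str
    answer: str
    chunk_id: str

def recall_at_k(search: Callable[..., List[str]], evals: List[SyntheticEval], k: int = 10) -> float:
    # search(question, limit=k) is assumed to return the IDs of the retrieved chunks.
    hits = sum(1 for ev in evals if ev.chunk_id in search(ev.question, limit=k))
    return hits / len(evals)

The same loop doubles as a unit test: asserting recall_at_k(search, evals) stays above a threshold in CI catches retrieval regressions before they ever reach the generation step.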

Have User Queries

  • Use query data as few-shot examples for question generation
  • You can even generate chunks from queries and test whether they can be retrieved
  • Use an LLM as a ranker to produce weak ranking labels (see the sketch after this list)
    • Review the weak labels to get correct labels
    • Continue to test precision/recall metrics, or even ranking metrics
  • Ask yourself: given what I know about the user data, what kinds of questions could I not answer with this method?
  • Collect everything, track the questions, and create evals
    • Every demo, every user interview
    • Collect thumbs-up ratings and allow users to delete sources
    • Put it in a Google Sheet and just collect it; those labels are gold
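
A hedged sketch of the LLM-as-ranker step using the OpenAI chat API (the model name and prompt wording are assumptions); the scores it returns are weak labels and still need human review:

# Sketch: score each (question, chunk) pair with an LLM to produce weak relevance labels.
from openai import OpenAI

client = OpenAI()

def weak_relevance_label(question: str, chunk: str) -> float:
    # Returns a 0-1 relevance score; review these weak labels before treating them as ground truth.
    prompt = (
        "Rate how relevant the following chunk is to the question "
        "on a scale from 0 to 1. Reply with only the number.\n\n"
        f"Question: {question}\n\nChunk: {chunk}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works here
        messages=[{"role": "user", "content": prompt}],
    )
    return float(resp.choices[0].message.content.strip())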

Prompts for Question Generation

Bake as much domain knowledge as possible into these prompts, adapt them to different document types, and be specific.

You are tasked with generating relevant questions that a
user of a search product might ask. This task is crucial for
understanding user needs and improving the product's
functionality.

<product_description>
{{PRODUCT_DESCRIPTION}}
</product_description>

Now, examine the example text data:

<document>
{{DOCUMENT}}
</document>

Here are some examples of other queries users have asked:

<example_questions>
{{EXAMPLE_QUESTIONS}}
</example_questions>

To generate relevant questions:
1. Analyze the product description to understand its purpose and target users.
2. Examine the example text data to identify key information, patterns, or insights it might contain.
3. Consider the types of questions users might ask to extract valuable insight from the product.
4. Generate a diverse set of questions that cover different aspects of the data and product
functionality.

When formulating questions, consider the following:
- Information extraction: What specific data points might users want to retrieve?
- Pattern recognition: What trends or relationships in the data might be of interest?
- Comparative analysis: How might users want to compare different elements in the data?
- Contextual understanding: What questions might help users better understand the context of the
data?

Present your generated questions in the following format:
Before listing the questions, provide a brief explanation for each question, highlighting why it would
be relevant and valuable to a user of this product. Present these explanations in the following
format:

Prompts for Ranking Generation

You are tasked with evaluating the relevance of text
chunks to a given question. Your goal is to determine
which chunks contain information that could be useful in
answering the question, while preserving the original chunk
IDs for later processing.
Here is the question you will be using to evaluate the
relevance of the chunks:

<question>
{{QUESTION}}
</question>

You will be provided with a list of text chunks, each with a
unique ID. The chunks will be in the following format:
<chunks>
{{CHUNKS}}
</chunks>

For each chunk, you should:
1. Carefully read and understand its content.
2. Evaluate whether the information in the chunk is relevant to answering the question.
3. Provide a brief explanation of why the chunk is or is not relevant.
4. Assign a relevance score from 0 to 1, where 0 is completely irrelevant and 1 is highly relevant.
Present your evaluation for each chunk in the following format:

<evaluation>
<chunk_id>[Insert the original chunk ID here]</chunk_id>
<reasoning>[Your explanation of relevance or irrelevance]</reasoning>
<relevance_score>[Your score from 0 to 1]</relevance_score>
</evaluation>

Important notes:
- Maintain objectivity in your evaluations.
- Focus on the content's relevance to the question, not its quality or completeness.
- Be consistent in your scoring across all chunks.
- Do not modify or summarize the original chunk text.
- If a chunk contains partial relevance, explain which parts are relevant and which are not.
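
A small sketch of parsing the evaluation blocks this prompt asks for, so the scores can be joined back to chunk IDs and compared against retrieval results (the tag names follow the format above; the rest is an assumption):

# Sketch: extract (chunk_id, relevance_score) pairs from the LLM's XML-style output.
import re

EVAL_PATTERN = re.compile(
    r"<evaluation>\s*<chunk_id>(.*?)</chunk_id>.*?"
    r"<relevance_score>(.*?)</relevance_score>\s*</evaluation>",
    re.DOTALL,
)

def parse_relevance_labels(llm_output: str) -> dict[str, float]:
    # Maps chunk IDs to relevance scores; review these weak labels before trusting them.
    return {
        chunk_id.strip(): float(score)
        for chunk_id, score in EVAL_PATTERN.findall(llm_output)
    }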

Conclusion

  • RAG evaluation strategy: Set up basic system, create synthetic questions, conduct quick evaluations, gather real user queries, continuously improve.
  • Retrieval evals: Faster, cheaper, more scalable than generation evals.
  • Key metrics: Recall and precision.
  • Focus on retrieval issues, not just generation.
  • Synthetic data generation: Use existing data or user queries to create evals.

References:

  1. Jason Liu