Table of Contents
- High-level Evaluation Strategy
- Fundamentals of LLM Evaluations
- Generation Evals vs Retrieval Evals
- Recall and Precision as Key Metrics
- Case Study: AI-Generated Reports for Consultants
- Case Study: AI for Blueprint
- Synthetic Data Generation
- Prompts for Question Generation
- Prompts for Ranking Generation
- Conclusion
High-level Evaluation Strategy
- Start with a basic RAG system setup
- Create synthetic questions to test the system's retrieval abilities
- Conduct quick, unit test-like evaluations to assess basic retrieval capabilities (see the test sketch after this list)
- Gather real user queries and interactions
- Categorize and analyze user questions to identify patterns and gaps
- Based on analysis, make targeted improvements to the system
- Implement ongoing monitoring to track system performance
- Continuously incorporate user feedback into the system
RAG Flywheel
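The "quick, unit test-like evaluations" step can literally be a test suite. Here is a minimal sketch, assuming a `search` retriever like the one used later in this post; the eval pairs are made-up placeholders for your synthetic data:

```python
# Sketch: unit-test-style retrieval checks that run in seconds, so they can gate every change.
# `search` is a placeholder for your retriever; EVALS would come from the synthetic data
# generation described later in this post (the question/chunk_id pairs here are illustrative).
import pytest

EVALS = [
    ("What is the warranty period for the X100 pump?", "chunk_0042"),
    ("Who approved the Q3 budget revision?", "chunk_0187"),
]

@pytest.mark.parametrize("question,expected_chunk_id", EVALS)
def test_recall_at_10(question, expected_chunk_id):
    results = search(question, limit=10)
    assert expected_chunk_id in results.chunk_ids
```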
Fundamentals of LLM Evaluations
⚠️ Stop using LGTM@K: "looks good to me" is not a valid evaluation metric.
| Aspect | Generation evals | Retrieval evals |
|---|---|---|
| Testing for... | Factuality. Team lead: "Looks better now" (based on made-up evaluations that don't capture usage behavior) | Recall and precision@K across lexical search, semantic search, and re-rankers |
| Speed of tests | Slow: ~1s to 10s per test | Fast: ~10ms to 800ms per test |
| Cost of tests | $100s per run. Team lead: "Who spent $1000 on OpenAI credits yesterday?" | Negligible per run |
| Frequency of tests | Infrequent | Frequent |
| Speed of test iterations | Slow (tests that could take minutes take hours) | Fast |
| Ease of test scalability | Difficult | Easy |
Generation Evals vs Retrieval Evals
Common Pitfalls
- When we rely on generation evals alone, we assume there are no issues in the retrieval step
- Datasets for generation evals are expensive to build and verify, so many retrieval issues go unnoticed
- Developers often don't realize that retrieval alone may be powerful enough for several initial use cases, letting them start gathering feedback early and add generation afterwards
Comparison
| Examples | Generation eval (tests generation and factuality) | Retrieval eval (tests recall and precision) |
|---|---|---|
| Textbook analogy | | |
| Contact information | | |
Recall and Precision as Key Metrics
| | Recall | Precision |
|---|---|---|
| Definition | % of relevant documents that are successfully retrieved: (relevant retrieved documents) ÷ (total relevant documents) | % of retrieved documents that are relevant to the query: (relevant retrieved documents) ÷ (total retrieved documents) |
| Why is this important to measure? | If recall is low, the information users need (e.g., specific quotes or citations) never reaches the language model, no matter how good generation is | If precision is low, irrelevant chunks crowd the context window, adding noise and cost and making it easier for the model to cite the wrong source |
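For reference, both metrics are simple to compute once you have labeled query-chunk pairs. The sketch below assumes hypothetical `retrieved_ids` (ranked IDs from your search call) and `relevant_ids` (the labeled ground truth):

```python
def recall(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the labeled relevant chunks that appear in the retrieved set."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for cid in retrieved_ids if cid in relevant_ids)
    return hits / len(relevant_ids)


def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for cid in top_k if cid in relevant_ids)
    return hits / len(top_k)


# Example: 2 of the 3 labeled chunks were retrieved, and 2 of the top-5 results are relevant.
print(recall(["c1", "c7", "c3", "c9", "c4"], {"c1", "c3", "c5"}))               # ~0.67
print(precision_at_k(["c1", "c7", "c3", "c9", "c4"], {"c1", "c3", "c5"}, k=5))  # 0.4
```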
Case Study: AI-Generated Reports for Consultants
Situation
- Consultants do 15-30 research interviews with experts
- The consultants know which experts had specific perspectives about the products
Complication
- The consulting team requested an AI-generated report
- For certain fields, the report only found 3 out of the 6 people who were quoted
- The consulting team expected the AI report to retrieve all 6 quotes
- As a result, the team lost confidence in the system
Solution
- Shifted focus to ensuring 100% recall of all correct citations
- Manually built question-chunk datasets for relevance
Impact
- 90%+ recall, up from 50%, by doing significant pre-processing work
- Continued the sales process by building trust and customer-specific evaluations
💡 Key takeaways:
- It's essential that the pre-processing you do aligns with the anticipated queries in order to improve recall
- Leverage customer interactions to motivate your test suites
Case Study: AI for Blueprint
Situation
- Construction workers want to ask questions regarding blueprints
Complication
- Using VLM (vision language model)-labeled data, we only got ~27% recall@75 for the image-specific search tool
- We needed better image summary prompts to recover descriptions that go beyond basic image captioning
Approach
- Tested VLM captioning and search over the generated descriptions
- Used Chain of Thought (CoT): asked the language model to reason about the blueprint before describing it, e.g.:
- "How many rooms..."
- "Describe the rooms..."
- "How big was the room..."
Impact
- 85%+ recall, up from ~27%, through pre-processing of the blueprint datasets
- Only 4 days were spent iterating on summarization prompts
  - 12 prompts and 3 models were tested
- Presented to design partners to obtain real user data with this level of recall
- ~20% of queries were related to counting objects in blueprints
- Justified an investment to run bounding box models to attempt to count images
💡 Key takeaways:
- Testing the ability of a single subsystem becomes very fast if we have the right baselines to test against
- Experimenting with high-specificity prompts for synthetic summary generation goes a long way toward improving recall of images, tables, documents, and other artifacts (see the sketch below)
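As a rough illustration of the approach, here is what a CoT-style blueprint summarization call might look like. The prompt wording, the `gpt-4o` model choice, and the OpenAI client usage are assumptions for this sketch, not the actual prompts or stack from the case study:

```python
# Sketch: chain-of-thought style blueprint summarization with a vision-capable model.
# The guiding questions mirror the ones listed in the approach above; everything else is illustrative.
from openai import OpenAI

client = OpenAI()

COT_SUMMARY_PROMPT = """Before writing the final description, reason step by step about the blueprint:
1. How many rooms are shown, and what are they labeled?
2. Describe each room: purpose, notable fixtures, adjacent rooms.
3. How big is each room (use any dimensions or scale markings visible)?
Then write a dense, searchable summary that a construction worker's query could match."""

def summarize_blueprint(image_url: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model works here
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": COT_SUMMARY_PROMPT},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return response.choices[0].message.content
```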
Synthetic Data Generation
No User Queries
- Use existing data to generate evals and get a benchmark for precision and recall:
  - question
  - answer
  - chunk_id
- Can be as simple as: `assert eval.chunk_id in search(eval.question, limit=k).chunk_ids`
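A slightly fuller sketch of that loop, where `llm_generate_question` and `search` are hypothetical stand-ins for your own question-generation call and retriever:

```python
# Sketch: turn existing chunks into (question, chunk_id) evals, then measure recall@k.
# `llm_generate_question` and `search` are placeholders for your own LLM call and retriever.
from dataclasses import dataclass

@dataclass
class Eval:
    question: str
    chunk_id: str

def build_evals(chunks: dict[str, str]) -> list[Eval]:
    """Ask an LLM to write one question per chunk that only that chunk can answer."""
    return [Eval(question=llm_generate_question(text), chunk_id=cid)
            for cid, text in chunks.items()]

def recall_at_k(evals: list[Eval], k: int = 10) -> float:
    hits = sum(1 for e in evals if e.chunk_id in search(e.question, limit=k).chunk_ids)
    return hits / len(evals)
```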
Have User Queries
- Use query data as few-shot examples for question generation
- You can even generate chunks from queries and test whether they can be retrieved
- Use an LLM as a ranker to produce weak ranking labels, then review the weak labels to get correct labels (see the sketch after this list)
- Continue to test precision/recall metrics, or even ranking metrics
- Ask yourself: given what I know about the user data, what kinds of questions could I not answer with this method?
- Collect everything, track the questions, and create evals:
  - Every demo, every user interview
  - Collect thumbs-up ratings, allow users to delete sources
  - Put it all in a Google Sheet; just collect it. Those labels are gold.
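A minimal sketch of the weak-labeling step, assuming a hypothetical `llm_relevance_score` helper that wraps the ranking prompt shown later in this post; the CSV output is just one convenient place to collect labels for human review:

```python
# Sketch: use an LLM as a weak ranker over (query, chunk) pairs, then dump the labels
# to a CSV for human review. `llm_relevance_score` stands in for a call that uses the
# ranking prompt shown later in this post.
import csv

def weak_label(queries: list[str], chunks: dict[str, str], out_path: str = "weak_labels.csv") -> None:
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["query", "chunk_id", "llm_score", "human_score"])
        for query in queries:
            for chunk_id, text in chunks.items():
                score = llm_relevance_score(query, text)   # weak label in [0, 1]
                writer.writerow([query, chunk_id, score, ""])  # human_score filled in during review
```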
Prompts for Question Generation
Try to bake as much domain knowledge as possible into these prompts, change prompts based on document types, and be specific.
You are tasked with generating relevant questions that a
user of a search product might ask. This task is crucial for
understanding user needs and improving the product's
functionality.
<product_description>
{{PRODUCT_DESCRIPTION}}
</product_description>
Now, examine the example text data:
<document>
{{DOCUMENT}}
</document>
Here are some examples of other queries users have asked:
<example_questions>
{{EXAMPLE_QUESTIONS}}
</example_questions>
To generate relevant questions:
1. Analyze the product description to understand its purpose and target users.
2. Examine the example text data to identify key information, patterns, or insights it might contain.
3. Consider the types of questions users might ask to extract valuable insight from the product.
4. Generate a diverse set of questions that cover different aspects of the data and product
functionality.
When formulating questions, consider the following:
- Information extraction: What specific data points might users want to retrieve?
- Pattern recognition: What trends or relationships in the data might be of interest?
- Comparative analysis: How might users want to compare different elements in the data?
- Contextual understanding: What questions might help users better understand the context of the
data?
Present your generated questions along with a brief explanation for each one, highlighting why it would be relevant and valuable to a user of this product.
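One way to use a template like this is to render the placeholders and send the result to a chat model. The sketch below assumes the template text above is stored in a hypothetical `QUESTION_GEN_PROMPT` string and uses the OpenAI client as one option among many:

```python
# Sketch: render the question-generation template and call a chat model.
# QUESTION_GEN_PROMPT holds the template above; the model choice is illustrative.
from openai import OpenAI

client = OpenAI()

def generate_questions(product_description: str, document: str, example_questions: str) -> str:
    prompt = (QUESTION_GEN_PROMPT
              .replace("{{PRODUCT_DESCRIPTION}}", product_description)
              .replace("{{DOCUMENT}}", document)
              .replace("{{EXAMPLE_QUESTIONS}}", example_questions))
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable chat model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```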
Prompts for Ranking Generation
You are tasked with evaluating the relevance of text
chunks to a given question. Your goal is to determine
which chunks contain information that could be useful in
answering the question, while preserving the original chunk
IDs for later processing.
Here is the question you will be using to evaluate the
relevance of the chunks:
<question>
{{QUESTION}}
</question>
You will be provided with a list of text chunks, each with a
unique ID. The chunks will be in the following format:
<chunks>
{{CHUNKS}}
</chunks>
For each chunk, you should:
1. Carefully read and understand its content.
2. Evaluate whether the information in the chunk is relevant to answering the question.
3. Provide a brief explanation of why the chunk is or is not relevant.
4. Assign a relevance score from 0 to 1, where 0 is completely irrelevant and 1 is highly relevant.
Present your evaluation for each chunk in the following format:
<evaluation>
<chunk_id>[Insert the original chunk ID here]</chunk_id>
<reasoning>[Your explanation of relevance or irrelevance]</reasoning>
<relevance_score>[Your score from 0 to 1]</relevance_score>
</evaluation>
Important notes:
- Maintain objectivity in your evaluations.
- Focus on the content's relevance to the question, not its quality or completeness.
- Be consistent in your scoring across all chunks.
- Do not modify or summarize the original chunk text.
- If a chunk contains partial relevance, explain which parts are relevant and which are not.
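Because the output is structured, turning it into ranking labels is a small parsing job. A minimal sketch, assuming the model follows the format exactly:

```python
# Sketch: parse <evaluation> blocks produced by the ranking prompt into (chunk_id, score) pairs.
import re

EVAL_PATTERN = re.compile(
    r"<evaluation>\s*"
    r"<chunk_id>(?P<chunk_id>.*?)</chunk_id>\s*"
    r"<reasoning>.*?</reasoning>\s*"
    r"<relevance_score>(?P<score>[\d.]+)</relevance_score>\s*"
    r"</evaluation>",
    re.DOTALL,
)

def parse_rankings(llm_output: str) -> list[tuple[str, float]]:
    """Return chunk IDs sorted by the model's relevance score, highest first."""
    pairs = [(m.group("chunk_id").strip(), float(m.group("score")))
             for m in EVAL_PATTERN.finditer(llm_output)]
    return sorted(pairs, key=lambda p: p[1], reverse=True)
```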
Conclusion
- RAG evaluation strategy: Set up basic system, create synthetic questions, conduct quick evaluations, gather real user queries, continuously improve.
- Retrieval evals: Faster, cheaper, more scalable than generation evals.
- Key metrics: Recall and precision.
- Focus on retrieval issues, not just generation.
- Synthetic data generation: Use existing data or user queries to create evals.