Evaluating LangChain Retrieval QA
In this guide we demonstrate how to test and measure LLM performance. We show how you can use our callback to measure performance and how you can define your own metrics and log them into our dashboard.
You can, by default, use the DeepEvalCallbackHandler to set up the metrics you want to track. However, it has limited support for metrics at the moment (more to be added soon). It currently supports:
- Answer Relevancy
- Bias
- Toxicity
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric
# Here we want to make sure the answer is minimally relevant
answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
Get Started with the LangChain Metric
To use the DeepEvalCallbackHandler, we need the implementation_name and the metrics we want to use.
from langchain.callbacks.confident_callback import DeepEvalCallbackHandler

deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="langchainQuickstart",
    metrics=[answer_relevancy_metric]
)
Scenario 1: Feeding into an LLM
from langchain.llms import OpenAI

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
    openai_api_key="<YOUR_API_KEY>",
)
output = llm.generate(
    [
        "What is the best evaluation tool out there? (no bias at all)",
    ]
)
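llm.generate returns a LangChain LLMResult, so if you want to look at the text that was scored you can read it back from the generations field. This is a minimal sketch; the indexing assumes a single prompt with a single generation per prompt.
# Inspect the generated text that the callback scored.
# output.generations is a list (one entry per prompt) of lists of Generation objects.
generated_text = output.generations[0][0].text
print(generated_text)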
You can then check whether the metric was successful by calling the is_successful() method.
answer_relevancy_metric.is_successful()
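If you want to act on the result programmatically rather than just inspecting the boolean, you can branch on it. This is a small sketch; reading the score attribute is an assumption about the metric object and may differ between deepeval versions.
# A small sketch: branch on the metric result.
# Note: `answer_relevancy_metric.score` assumes the metric stores its latest
# score on the instance, which may vary between deepeval versions.
if answer_relevancy_metric.is_successful():
    print("Answer relevancy check passed:", answer_relevancy_metric.score)
else:
    print("Answer relevancy check failed:", answer_relevancy_metric.score)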
Once you have run that, you should be able to see our dashboard below.
Scenario 2: Tracking an LLM in a chain without callbacks
To track an LLM in a chain without callbacks, you can plug the metric in at the end of the chain and measure the output manually.
We can start by defining a simple chain as shown below.
import requests
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

text_file_url = "https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt"

openai_api_key = "sk-XXX"

# Download the source document to a local file
with open("state_of_the_union.txt", "w") as f:
    response = requests.get(text_file_url)
    f.write(response.text)

# Load the document and split it into chunks
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Embed the chunks and store them in a Chroma vector store
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = Chroma.from_documents(texts, embeddings)

# Build the question-answering pipeline
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

# Run the pipeline on a sample query
query = "Who is the president?"
result = qa.run(query)
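Before scoring the answer, it can be useful to sanity-check what the retriever actually returned for the query. This is a small optional sketch reusing the docsearch store defined above; get_relevant_documents is the standard LangChain retriever method.
# Optional: inspect which chunks the retriever pulled back for the query.
retrieved_docs = docsearch.as_retriever().get_relevant_documents(query)
for doc in retrieved_docs:
    print(doc.page_content[:200])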
With the chain defined, you can then manually check the result for answer relevancy.
answer_relevancy_metric.measure(result, query)
answer_relevancy_metric.is_successful()
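The same pattern scales to a small evaluation set: run each query through the chain and score the answer with the metric. This is a minimal sketch reusing the qa chain and answer_relevancy_metric defined above; the example queries are illustrative only.
# A minimal sketch: score several queries with the same metric.
queries = [
    "Who is the president?",
    "What did the president say about the economy?",
]

for q in queries:
    answer = qa.run(q)
    answer_relevancy_metric.measure(answer, q)
    print(q, "->", "pass" if answer_relevancy_metric.is_successful() else "fail")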