Evaluating LangChain Retrieval QA
In this guide we demonstrate how to test and measure LLM performance. We show how you can use our callback to measure performance and how you can define your own metrics and log them into our dashboard.
You can, by default, use the DeepEvalCallbackHandler to set up the metrics you want to track. However, it has limited support for metrics at the moment (more to be added soon). It currently supports:
- Answer Relevancy
- Bias
- Toxicity
from deepeval.metrics.answer_relevancy import AnswerRelevancyMetric
# Here we want to make sure the answer is minimally relevant
answer_relevancy_metric = AnswerRelevancyMetric(minimum_score=0.5)
Get Started with the LangChain Metric
To use the DeepEvalCallbackHandler, we need the implementation_name and the metrics we want to use.
from langchain.callbacks.confident_callback import DeepEvalCallbackHandler

deepeval_callback = DeepEvalCallbackHandler(
    implementation_name="langchainQuickstart",
    metrics=[answer_relevancy_metric]
)
Scenario 1: Feeding into an LLM
from langchain.llms import OpenAI

llm = OpenAI(
    temperature=0,
    callbacks=[deepeval_callback],
    verbose=True,
    openai_api_key="<YOUR_API_KEY>",
)
output = llm.generate(
    [
        "What is the best evaluation tool out there? (no bias at all)",
    ]
)
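llm.generate returns a LangChain LLMResult, so if you want to look at the text that was scored you can read it back from the generations field. This is a minimal sketch; the indexing assumes a single prompt with a single generation per prompt.
# Inspect the generated text that the callback scored.
# output.generations is a list (one entry per prompt) of lists of Generation objects.
generated_text = output.generations[0][0].text
print(generated_text)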
You can then check whether the metric was successful by calling the is_successful() method.
answer_relevancy_metric.is_successful()
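If you want to act on the result programmatically rather than just inspecting the boolean, you can branch on it. This is a small sketch; reading the score attribute is an assumption about the metric object and may differ between deepeval versions.
# A small sketch: branch on the metric result.
# Note: `answer_relevancy_metric.score` assumes the metric stores its latest
# score on the instance, which may vary between deepeval versions.
if answer_relevancy_metric.is_successful():
    print("Answer relevancy check passed:", answer_relevancy_metric.score)
else:
    print("Answer relevancy check failed:", answer_relevancy_metric.score)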
Once you have run that, you should be able to see our dashboard below.
Scenario 2: Tracking an LLM in a chain without callbacks
To track an LLM in a chain without callbacks, you can plug the metric in at the end of the chain and measure the output manually.
We can start by defining a simple chain as shown below.
import requests
from langchain.chains import RetrievalQA
from langchain.document_loaders import TextLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

text_file_url = "https://raw.githubusercontent.com/hwchase17/chat-your-data/master/state_of_the_union.txt"

openai_api_key = "sk-XXX"

# Download the source document to a local file
with open("state_of_the_union.txt", "w") as f:
    response = requests.get(text_file_url)
    f.write(response.text)

# Load the document and split it into chunks
loader = TextLoader("state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)

# Embed the chunks and store them in a Chroma vector store
embeddings = OpenAIEmbeddings(openai_api_key=openai_api_key)
docsearch = Chroma.from_documents(texts, embeddings)

# Build the question-answering pipeline
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(openai_api_key=openai_api_key),
    chain_type="stuff",
    retriever=docsearch.as_retriever(),
)

# Run the pipeline on a sample query
query = "Who is the president?"
result = qa.run(query)
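Before scoring the answer, it can be useful to sanity-check what the retriever actually returned for the query. This is a small optional sketch reusing the docsearch store defined above; get_relevant_documents is the standard LangChain retriever method.
# Optional: inspect which chunks the retriever pulled back for the query.
retrieved_docs = docsearch.as_retriever().get_relevant_documents(query)
for doc in retrieved_docs:
    print(doc.page_content[:200])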
With the chain defined, you can then manually check the result for answer relevancy.
answer_relevancy_metric.measure(result, query)
answer_relevancy_metric.is_successful()
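The same pattern scales to a small evaluation set: run each query through the chain and score the answer with the metric. This is a minimal sketch reusing the qa chain and answer_relevancy_metric defined above; the example queries are illustrative only.
# A minimal sketch: score several queries with the same metric.
queries = [
    "Who is the president?",
    "What did the president say about the economy?",
]

for q in queries:
    answer = qa.run(q)
    answer_relevancy_metric.measure(answer, q)
    print(q, "->", "pass" if answer_relevancy_metric.is_successful() else "fail")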