Selecting Metrics based on your Evaluation Criteria
Once you have clearly defined evaluation criteria, selecting metrics becomes significantly easier. In some cases, existing DeepEval metrics will already match your criteria. In others, you'll need to create custom metrics to address your unique evaluation needs.
DeepEval provides 14+ metrics to help you evaluate your LLM. Familiarizing yourself with these metrics can help you choose the ones that best align with your evaluation criteria.
Selecting Metrics
In this section, we’ll be selecting the LLM evaluation metrics for our medical chatbot based on the evaluation criteria we've established in the previous section. Let’s quickly revisit these criteria:
- Directly addressing the user: The chatbot should directly address users' requests
- Providing accurate diagnoses: Diagnoses must be reliable and based on the provided symptoms
- Providing professional responses: Responses should be clear and respectful
Answer Relevancy
Let's start with our first metric, which will evaluate our medical chatbot against our first criterion:
Criterion 1: The medical chatbot should address the user directly.
Currently, our chatbot sometimes fails to directly address user queries and instead takes the lead in the conversation, for example by asking for appointment details rather than focusing on diagnosing the patient. This results in responses that only tangentially address the user's input. To address this, we should evaluate how relevant the chatbot's responses are to the user query.
Fortunately, DeepEval provides an out-of-the-box AnswerRelevancy metric, which evaluates exactly this: how closely the LLM's output aligns with the input.
The AnswerRelevancyMetric uses an LLM to extract all statements from the actual_output and then uses the same LLM to classify each statement's relevance to the input.
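To make the scoring idea concrete, here is a minimal sketch in plain Python. It is not DeepEval's implementation: the statements and their relevance verdicts are hand-written stand-ins for what the LLM judge would produce, and the names StatementVerdict and relevancy_score are hypothetical. It assumes the score is the fraction of extracted statements judged relevant to the input.

```python
from dataclasses import dataclass


@dataclass
class StatementVerdict:
    """One statement extracted from the actual_output, plus a relevance verdict."""
    statement: str
    relevant: bool


def relevancy_score(verdicts: list[StatementVerdict]) -> float:
    """Return the fraction of statements judged relevant to the input."""
    if not verdicts:
        # Arbitrary choice for this sketch: no statements means a score of 0.0
        return 0.0
    relevant_count = sum(1 for v in verdicts if v.relevant)
    return relevant_count / len(verdicts)


# Hypothetical verdicts for a reply to "I have a persistent cough and a fever."
verdicts = [
    StatementVerdict("A cough with fever can indicate a respiratory infection.", True),
    StatementVerdict("Rest and fluids are usually recommended.", True),
    StatementVerdict("What time would you like to book your appointment?", False),
    StatementVerdict("Please provide your insurance number.", False),
]

score = relevancy_score(verdicts)
print(f"Answer relevancy (sketch): {score}")
```

In this example, the two appointment-related statements drag the score down, which is the behavior we want to catch in our chatbot.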
Faithfulness
Our next metric addresses the inaccuracies in patient diagnoses. The chatbot's failure to deliver accurate diagnoses in some example interactions suggests that our RAG tool needs improvement.
Criterion 2: The chatbot should provide accurate diagnoses based on the given symptoms.
The RAG engine is responsible for retrieving relevant medical information from our knowledge base to support patient diagnoses, which is why inaccurate diagnoses point to it. To address this, we need to evaluate specifically whether the information in the actual output aligns with the information in the retrieved chunks.
DeepEval's Faithfulness metric is well-suited for this task. It assesses whether the actual_output factually aligns with the contents of the retrieval_context.
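As a rough illustration of what "factually aligns" can mean in practice, the sketch below scores a reply by checking each of its claims against the retrieved chunks. This is plain Python, not DeepEval's code: the Verdict enum, the faithfulness_score helper, and the hand-assigned verdicts are all hypothetical, and the rule that a claim counts as faithful unless it contradicts the context is an assumption made for the sketch.

```python
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    CONTRADICTED = "contradicted"
    NOT_MENTIONED = "not mentioned"


def faithfulness_score(claim_verdicts: dict[str, Verdict]) -> float:
    """Return the fraction of claims that do not contradict the retrieval context."""
    if not claim_verdicts:
        # Arbitrary choice for this sketch: no claims means a score of 0.0
        return 0.0
    faithful = sum(
        1 for verdict in claim_verdicts.values() if verdict is not Verdict.CONTRADICTED
    )
    return faithful / len(claim_verdicts)


# Hypothetical chunks returned by the RAG engine
retrieval_context = [
    "Influenza commonly presents with fever, cough, and muscle aches.",
    "Antibiotics are not effective against viral infections such as influenza.",
]

# Hand-assigned verdicts standing in for an LLM judge's output
claim_verdicts = {
    "Your symptoms are consistent with influenza.": Verdict.SUPPORTED,
    "Antibiotics will cure the flu.": Verdict.CONTRADICTED,
    "You may also feel tired.": Verdict.NOT_MENTIONED,
}

score = faithfulness_score(claim_verdicts)
print(f"Faithfulness (sketch): {score:.2f}")
```

Here one of three claims contradicts the retrieved context, so the sketch penalizes the reply even though the rest of it is acceptable.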
DeepEval offers a total of 5 RAG metrics to evaluate your RAG pipeline. To learn more about selecting the right metrics for your use case, check out this in-depth guide on RAG evaluation.
Custom Metric - Professionalism
Our final metric will address Criterion 3, focusing on evaluating our chatbot's professionalism.
Criterion 3: The chatbot should provide clear, respectful, and professional responses.
Since DeepEval doesn't natively support this evaluation criterion, we'll need to define our own custom Professionalism metric using G-Eval. This will allow us to ensure that the chatbot maintains the professional tone typically expected in a medical setting.
G-Eval is a custom metric framework that enables users to leverage LLMs for evaluating outputs based on their own tailored evaluation criteria.
Now that we've selected our three metrics, let's define them in code:
Defining Metrics in DeepEval
To define our Answer Relevancy, Faithfulness, and custom G-Eval metric for professionalism, you'll first need to install DeepEval. Run the following command in your CLI:
pip install deepeval
Defining Default Metrics
Let's begin by defining the Answer Relevancy and Faithfulness metrics, which is as simple as importing and instantiating their respective classes.
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
)
# Instantiate both metrics with their default settings
answer_relevancy_metric = AnswerRelevancyMetric()
faithfulness_metric = FaithfulnessMetric()
Defining a Custom Metric
Next, we'll define our custom G-Eval metric for professionalism. This involves specifying the name of the metric, the evaluation criteria, and the parameters to evaluate. In this case, we're only assessing the LLM's actual_output.
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCaseParams
# Define criteria for evaluating professionalism
criteria = (
    "Determine whether the actual output demonstrates professionalism by being "
    "clear, respectful, and maintaining an empathetic tone consistent with medical interactions."
)
# Create a GEval metric that judges only the chatbot's actual output
professionalism_metric = GEval(
    name="Professionalism",
    criteria=criteria,
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT],
)
G-Eval is a two-step algorithm that first uses chain-of-thought (CoT) reasoning to generate a series of evaluation steps based on the specified criteria. It then applies these steps to assess the parameters provided in an LLMTestCase and calculate the final score.
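The two-step flow described above can be sketched without any LLM call. The code below is an illustration only, not DeepEval's GEval internals: generate_steps, score_output, and fake_judge are hypothetical names, the judge is a canned stub rather than a real model, and the 0 to 10 scale divided by 10 is an assumption chosen to produce a score between 0 and 1.

```python
from typing import Callable

# A "judge" is any function that maps a prompt to a text reply.
# In a real setup this would call an LLM; here it is a stub.
Judge = Callable[[str], str]


def generate_steps(criteria: str, judge: Judge) -> list[str]:
    """Step 1: ask the judge to turn the criteria into concrete evaluation steps."""
    reply = judge(f"Write evaluation steps, one per line, for this criteria:\n{criteria}")
    return [line.strip() for line in reply.splitlines() if line.strip()]


def score_output(steps: list[str], actual_output: str, judge: Judge) -> float:
    """Step 2: ask the judge to apply the steps and return a normalized score."""
    prompt = (
        "Follow these steps and reply with a single integer from 0 to 10.\n"
        + "\n".join(steps)
        + f"\nOutput to evaluate:\n{actual_output}"
    )
    raw = int(judge(prompt))
    # Clamp to the assumed 0-10 scale, then normalize to 0-1
    return max(0, min(10, raw)) / 10


def fake_judge(prompt: str) -> str:
    """Canned replies so the sketch runs deterministically without an LLM."""
    if prompt.startswith("Write evaluation steps"):
        return "Check that the wording is clear.\nCheck that the tone is respectful."
    return "8"


steps = generate_steps("Is the response clear, respectful, and professional?", fake_judge)
score = score_output(steps, "I'm sorry you're feeling unwell. Let's go through your symptoms.", fake_judge)
print(steps)
print(score)
```

Separating step generation from scoring means the same evaluation steps can be reused across every test case, which keeps the judging consistent from one output to the next.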
With the evaluation criteria defined and metrics selected, we can now begin preparing our evaluation dataset in the next section.