Iterating on Hyperparameters
In this section, we’ll be iterating on our medical chatbot’s hyperparameters to improve its performance. Our focus will be on three specific hyperparameters: model, prompt template, and temperature.
Although we’re primarily focusing on these three hyperparameters, there are many other hyperparameters to optimize, such as chunk_size, embedding_model, and top-k for the retriever, among others.
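For reference, retriever-side hyperparameters like these are typically plain configuration values. The snippet below is a purely hypothetical sketch of what such a configuration might look like; none of these keys or values come from our chatbot's actual setup.

```python
# Hypothetical retrieval settings — the names and values below are illustrative,
# not tied to our chatbot's implementation or to any specific framework.
retriever_config = {
    "chunk_size": 512,                            # tokens per document chunk
    "embedding_model": "text-embedding-3-small",  # model used to embed chunks
    "top_k": 3,                                   # chunks retrieved per query
}
```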
Evaluation Recap
Before we begin, let's briefly recap what we covered in the last section. We conducted an evaluation of our medical chatbot, analyzing its responses across various test cases (2 passing and 3 failing) to identify areas for improvement. This evaluation highlighted several key shortcomings:
- Query Relevance: The chatbot occasionally failed to directly address user queries, even when adhering to its mission.
- Professionalism: Responses sometimes lacked the empathetic tone expected in a medical setting, compromising professionalism.
- Factual Accuracy: The chatbot occasionally constructed inaccurate responses, even when using accurate retrieved knowledge.
In this chatbot version, we utilized the following model, temperature, and prompt template configurations:
model = "gpt-3.5"
temperature = 1.0
prompt_template = (
    "You are an expert in medical diagnosis and are now connected to the patient booking system. "
    "The first step is to create the appointment and record the symptoms. Ask for specific symptoms! "
    "Then after the symptoms have been created, make a diagnosis. Do not stop the diagnosis until you narrow down the exact specific underlying medical condition. "
    "After the diagnosis has been recorded, be sure to record the name, date, and email in the appointment as well. Only enter the name, date, and email that the user has explicitly provided. "
    "Update the symptoms and diagnosis in the appointment. "
    "Once all information has been recorded, confirm the appointment."
)
For most use cases, using a more advanced model and a lower temperature typically leads to more accurate and consistent results.
Since iterating on our model and temperature is as simple as changing the model name and temperature settings, we'll be focusing on improving the prompt template.
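To make this concrete, here is a minimal, illustrative sketch of how the model and temperature settings might be wired into the chatbot's underlying LLM call, assuming it wraps OpenAI's Chat Completions API; the client setup and example user message are placeholders, not our chatbot's actual implementation.

```python
from openai import OpenAI

# Minimal sketch — assumes the chatbot wraps OpenAI's Chat Completions API.
client = OpenAI()

model = "gpt-3.5"        # exact identifier depends on your provider (e.g. "gpt-3.5-turbo" on OpenAI)
temperature = 1.0        # lower this for more consistent outputs
prompt_template = "..."  # the prompt template defined above

response = client.chat.completions.create(
    model=model,
    temperature=temperature,
    messages=[
        {"role": "system", "content": prompt_template},
        {"role": "user", "content": "I've been having headaches and blurred vision."},
    ],
)
print(response.choices[0].message.content)
```

Because both settings are just arguments to the underlying LLM call, iterating on them requires no structural changes, which is why the rest of this section focuses on the prompt template.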
Iterating on the Prompt Template
In the code below, we've revised the prompt template to ensure the chatbot can dynamically adapt to user input while completing all necessary tasks and maintaining a professional tone. This approach aims to improve the chatbot’s ability to handle varying user intents and align responses with medical standards.
prompt_template = (
    "You are a highly skilled medical professional assisting with patient diagnosis and appointment scheduling. "
    "Your primary goal is to address the user's immediate request first, while ensuring all required information is eventually gathered. "
    "Follow these steps to handle the interaction effectively: "
    "1. Identify the user's intent. If they start by requesting an appointment, prioritize collecting their name, date, and email first, then ask about their symptoms. "
    "2. If the user begins by discussing symptoms, gather detailed and accurate information about their concerns first, then transition to booking the appointment. "
    "3. Record all collected information (name, date, email, symptoms, and diagnosis) accurately and concisely in the patient booking system. "
    "4. Provide a diagnosis based on the recorded symptoms, ensuring factual accuracy and avoiding unsupported statements. "
    "5. Maintain a professional and empathetic tone throughout the conversation, acknowledging the user's concerns appropriately. "
    "6. Summarize all gathered information, including the appointment details and diagnosis, and confirm everything with the user before finalizing the interaction."
)
Here are the key adjustments we've made to address each shortcoming:
- Query Relevance: The chatbot now adapts dynamically to user intent, addressing immediate needs first while ensuring both symptom collection and appointment booking are completed. This ensures responses are tailored and relevant to the user's input.
- Professionalism: The updated prompt reinforces maintaining a professional and empathetic tone throughout the interaction, ensuring the chatbot aligns with the expectations of a medical setting.
- Factual Accuracy: The revised instructions guide the chatbot to base all responses on accurate retrieved knowledge, explicitly avoiding unsupported or incorrect statements.
Let's evaluate our updated medical chatbot with the new hyperparameters by re-running the evaluation on the 5 test cases and 3 metrics from the previous section.
Re-running the Evaluation
To re-run the evaluation, we’ll need to generate new outputs for our 5 example user queries with the updated hyperparameter configuration to construct our updated test cases, and use the evaluate function to generate an evaluation report on Confident AI. In addition to the updated prompt template, we’ll use GPT-4 and 0.8 for our new model and temperature settings.
Although we've pre-computed the actual outputs and retrieval context, you will need to populate these fields at evaluation time by running the function that generates your LLM application's response (e.g. chatbot.generate(..)).
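As a rough sketch of what this looks like in code, assuming the evaluation runs on deepeval (the library behind Confident AI), something like the following could regenerate outputs and re-run the three metrics. The metric choices, thresholds, and the return signature of chatbot.generate are assumptions for illustration, not the exact setup from the previous section.

```python
from deepeval import evaluate
from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric, GEval

# Placeholders for your own application code and the 5 example queries
# from the previous section.
user_queries = [
    "I'd like to book an appointment for next Tuesday.",
    # ... remaining example queries
]

# Regenerate outputs with the updated hyperparameters (GPT-4, temperature 0.8,
# and the revised prompt template). We assume chatbot.generate returns both
# the response and the retrieved context — adapt to your implementation.
test_cases = []
for query in user_queries:
    actual_output, retrieval_context = chatbot.generate(query)
    test_cases.append(
        LLMTestCase(
            input=query,
            actual_output=actual_output,
            retrieval_context=retrieval_context,
        )
    )

# A custom Professionalism metric defined with G-Eval — criteria wording is illustrative.
professionalism = GEval(
    name="Professionalism",
    criteria="Determine whether the response maintains a professional and empathetic tone appropriate for a medical setting.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

# Run all test cases against the three metrics and push the report to Confident AI.
evaluate(
    test_cases=test_cases,
    metrics=[AnswerRelevancyMetric(), FaithfulnessMetric(), professionalism],
)
```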
Let's take a look at our new evaluation results. You can see that all of the test cases are now passing except for the first one.
Analyzing this test case further reveals that, while our previous model configuration passed Faithfulness and Professionalism for this specific test case, these metrics are now failing.
We've iterated on our chatbot's hyperparameters and significantly improved its performance, increasing the number of passing test cases from 2 to 4. However, one test case that was previously passing has now started failing.
In the next section, we'll learn how to compare evaluation results to identify improvements and regressions in performance.