Guardrails for LLMs in Production
Confident AI allows you to easily place guards on your LLM applications to prevent them from generating unsafe responses with just a single line of code. You can think of these guards as binary metrics that evaluate the safety of an input/response pair at blazing-fast speed. Confident AI offers 40+ guards designed to test for 40+ LLM vulnerabilities.
Guarding LLM Responses
To begin guarding LLM responses, first create an instance of deepeval's Guardrails:
import deepeval
from deepeval.guardrails import Guardrails, Guard
guardrails = Guardrails(
    guards=[Guard.HALLUCINATION, Guard.BIAS],
    purpose="Environmental education"
)
There is one mandatory parameter and three optional parameters when creating a Guardrails instance:
guards: A list of Guard enums specifying the guards to be used.
[Optional] purpose: A string representing the purpose of your LLM application. Defaults to None.
[Optional] allowed_entities: A list of strings representing the names, brands, and organizations that are permitted to be mentioned in the response. Defaults to None.
[Optional] system_prompt: A string representing your system prompt. Defaults to None.
Some guards will require a purpose, some will require allowed_entities, and some will require both. You'll need to specify these parameters when initializing a Guardrails instance. You can learn more about what each guard requires here.
Alternatively, you can choose to provide your system prompt instead of directly providing purpose and allowed_entities, although this will greatly slow down the guardrails.
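For instance, here's a minimal sketch of a system-prompt-only setup (the system prompt text below is a hypothetical placeholder):
from deepeval.guardrails import Guardrails, Guard

# Guard requirements such as the purpose are inferred from the system prompt,
# which is slower than supplying purpose and allowed_entities directly
guardrails = Guardrails(
    guards=[Guard.HALLUCINATION, Guard.BIAS],
    system_prompt="You are a friendly tutor that teaches students about environmental science."
)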
Finally, use deepeval's Guardrails to guard against undesirable LLM responses:
...
guard_result = guardrails.guard(
    input="Tell me more about the effects of global warming.",
    response="Global warming is a fake phenomenon and just pseudo-science.",
)

# True if any of the guards have been breached
print(guard_result.breached)

# Breakdown of individual guard scores
print(guard_result.guard_scores)
There are two mandatory parameters when using the guard() method:
input: A string that represents the user query to your LLM application.
response: A string that represents the output generated by your LLM application in response to the user input.
The guard() method returns the results of the guardrails which you can use to retry LLM response generations if necessary. Here's the data structure returned:
from typing import List
from pydantic import BaseModel

class GuardScore(BaseModel):
    guard: str
    score: int

class GuardResult(BaseModel):
    breached: bool
    guard_scores: List[GuardScore]
The breached property will be true if any of the Guards have failed. If you wish to check for the result of each Guard individually, you can access the guard_scores instead. A score of 1 means the Guard has been breached, and 0 otherwise.
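For example, here's a minimal retry sketch built on the returned GuardResult, where generate_response() is a hypothetical stand-in for your LLM application:
def generate_response(user_input: str) -> str:
    # Hypothetical function that calls your LLM application
    ...

user_input = "Tell me more about the effects of global warming."
for _ in range(3):  # retry up to 3 times
    response = generate_response(user_input)
    guard_result = guardrails.guard(input=user_input, response=response)
    if not guard_result.breached:
        break  # response passed all guards
    # A score of 1 means that guard was breached
    breached_guards = [gs.guard for gs in guard_result.guard_scores if gs.score == 1]
    print(f"Retrying; breached guards: {breached_guards}")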
Guard Requirements
The following section provides an overview of the guards that require only input and response pairs for evaluation, as well as those that need additional context such as a purpose, allowed_entities, or both. This categorization helps in configuring the guard() method according to the specific needs of your application environment.
Guards Requiring Only Input and Output
Most guards are designed to function effectively with just the input from the user and the output from the system. These include:
Guard.PRIVACY
Guard.INTELLECTUAL_PROPERTY
Guard.MISINFORMATION_DISINFORMATION
Guard.SPECIALIZED_FINANCIAL_ADVICE
Guard.OFFENSIVE
Guard.DATA_LEAKAGE
Guard.CONTRACTS
Guard.EXCESSIVE_AGENCY
Guard.POLITICS
Guard.DEBUG_ACCESS
Guard.SHELL_INJECTION
Guard.SQL_INJECTION
Guard.VIOLENT_CRIME
Guard.NON_VIOLENT_CRIME
Guard.SEX_CRIME
Guard.CHILD_EXPLOITATION
Guard.INDISCRIMINATE_WEAPONS
Guard.HATE
Guard.SELF_HARM
Guard.SEXUAL_CONTENT
Guard.CYBERCRIME
Guard.CHEMICAL_BIOLOGICAL_WEAPONS
Guard.ILLEGAL_DRUGS
Guard.COPYRIGHT_VIOLATIONS
Guard.HARASSMENT_BULLYING
Guard.ILLEGAL_ACTIVITIES
Guard.GRAPHIC_CONTENT
Guard.UNSAFE_PRACTICES
Guard.RADICALIZATION
Guard.PROFANITY
Guard.INSULTS
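As a minimal sketch, assuming only guards from this category are used, neither purpose nor allowed_entities needs to be supplied:
from deepeval.guardrails import Guardrails, Guard

# Only input/output guards, so no purpose or allowed_entities is required
guardrails = Guardrails(
    guards=[Guard.PRIVACY, Guard.SQL_INJECTION, Guard.PROFANITY]
)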
Guards Requiring a Purpose
Some guards require a defined purpose to effectively assess the content within the specific context of that purpose. These guards are typically employed in environments where the application's purpose directly influences the nature of the interactions and the potential risks involved. These include:
Guard.BFLA
Guard.BIAS
Guard.HALLUCINATION
Guard.HIJACKING
Guard.OVERRELIANCE
Guard.PROMPT_EXTRACTION
Guard.RBAC
Guard.SSRF
Guard.COMPETITORS
Guard.RELIGION
Guards Requiring Allowed Entities
Certain guards assess the appropriateness of mentioning specific entities within responses, necessitating a list of allowed entities. These are important in scenarios where specific names, brands, or organizations are critical to the context but need to be managed carefully to avoid misuse:
Guard.BOLA
Guard.IMITATION
Guards Requiring Both Purpose and Entities
Finally, some guards require both a defined purpose and a list of allowed entities to evaluate responses within their intended context:
Guard.PII_API_DB
Guard.PII_DIRECT
Guard.PII_SESSION
Guard.PII_SOCIAL
Guard.COMPETITORS
Guard.RELIGION
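A sketch of a configuration supplying both, with a placeholder purpose and entity list:
from deepeval.guardrails import Guardrails, Guard

guardrails = Guardrails(
    guards=[Guard.PII_DIRECT, Guard.COMPETITORS],
    purpose="Environmental education",  # placeholder purpose
    allowed_entities=["Confident AI"],  # placeholder list of permitted entities
)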