Controlling the output of a large language model requires careful engineering and considerable effort, and it is only possible to a certain extent.
🛡️ AI guardrails are safety mechanisms that keep AI applications aligned with our ethical standards and societal expectations. We have all heard examples of chatbots going “off the rails”. But how do you put boundaries on an AI model when we don’t actually understand how it works internally (it is a black box)? How can we achieve our objective?
1️⃣ We can start by removing problematic content from the dataset before even training the model. This seemingly obvious step is harder than it sounds, because simply removing paragraphs with offending keywords is not enough. We also have to train a second AI model that can detect bias, toxicity and similar problems, as in the sketch below.
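To make this concrete, here is a minimal Python sketch of such a two-stage data filter. The blocklist is a placeholder, and the classifier is assumed to be the publicly available unitary/toxic-bert model loaded via the Hugging Face transformers library; any comparable bias/toxicity detector would do.

```python
# Minimal sketch of a two-stage pre-training data filter (not production code).
# Assumes the Hugging Face `transformers` library and the public
# "unitary/toxic-bert" classifier; any comparable bias/toxicity model works.
from transformers import pipeline

BLOCKLIST = {"badword1", "badword2"}  # placeholder keyword list
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

def keep(paragraph: str, threshold: float = 0.5) -> bool:
    # Stage 1: cheap keyword screen.
    if any(word in paragraph.lower() for word in BLOCKLIST):
        return False
    # Stage 2: a second model scores the paragraph (label names depend on the model).
    result = toxicity(paragraph[:512])[0]  # truncate very long paragraphs
    return not (result["label"].lower() == "toxic" and result["score"] >= threshold)

corpus = ["A harmless sentence about the weather.", "Some hateful rant ..."]
clean_corpus = [p for p in corpus if keep(p)]
print(len(clean_corpus), "of", len(corpus), "paragraphs kept")
```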
2️⃣ After training the model we can use a technique called Reinforcement Learning from Human Feedback (RLHF). Think of it as repeatedly correcting a child when it misbehaves: “You cannot say this. You should say that.” This takes a lot of manual effort and cost, and it never quite fully works (as any parent will confirm). A toy sketch of the idea follows below.
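For intuition, here is a toy, self-contained sketch of one core RLHF ingredient: a reward model learned from pairwise human preferences. The features and preference pairs are invented for illustration, and real RLHF goes much further by fine-tuning the LLM itself against such a reward model.

```python
# Toy illustration of the RLHF idea: learn a reward model from pairwise
# human preferences, then prefer responses it scores highly.
import numpy as np

def features(text: str) -> np.ndarray:
    # Stand-in features: response length and a crude "politeness" signal.
    return np.array([len(text) / 100.0, float("please" in text.lower()), 1.0])

# Human feedback: (preferred response, rejected response) pairs.
preferences = [
    ("I can't help with that, but here is a safer alternative.", "Just do it, who cares."),
    ("Please see a doctor for medical advice.", "Take whatever pills you find."),
]

w = np.zeros(3)                      # reward model parameters
for _ in range(200):                 # gradient ascent on the Bradley-Terry
    for good, bad in preferences:    # log-likelihood of the human preferences
        diff = features(good) - features(bad)
        p = 1.0 / (1.0 + np.exp(-w @ diff))
        w += 0.1 * (1.0 - p) * diff

def reward(text: str) -> float:
    return float(w @ features(text))

candidates = ["Just do it, who cares.", "Please see a doctor for medical advice."]
print(max(candidates, key=reward))   # the learned reward prefers the safer reply
```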
3️⃣ Another point where we can intervene is before the question (prompt) is sent to the AI model, again by filtering out inappropriate questions (see the sketch below). Note that a one-size-fits-all solution is not possible, because what counts as ‘inappropriate’ is culturally determined and may shift over time.
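A minimal sketch of such an input-side filter might look like this; the categories and patterns are placeholders, precisely because the actual policy is a local and evolving decision.

```python
# Minimal sketch of an input-side guardrail: screen the user prompt
# before it reaches the model. Categories and patterns are illustrative only.
import re

POLICY = {
    "self_harm": [r"\bhow to hurt myself\b"],
    "violence":  [r"\bbuild a bomb\b"],
}

def screen_prompt(prompt: str, enabled_categories=("self_harm", "violence")):
    for category in enabled_categories:
        for pattern in POLICY[category]:
            if re.search(pattern, prompt, flags=re.IGNORECASE):
                return False, f"blocked: {category}"
    return True, "ok"

allowed, reason = screen_prompt("How do I build a bomb?")
print(allowed, reason)   # False blocked: violence
```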
4️⃣ And the last opportunity comes once the output is ready, just before it is shown to the end user. Here we can again apply hand-written rules and other AI models to check for bias, toxic language and so on (a rough sketch follows below). Maintaining and updating a large number of these rules can become a big and expensive undertaking.
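A hedged sketch of such an output-side check could look as follows; the rules, the toxicity score (assumed to come from a separate classifier model) and the fallback wording are all illustrative assumptions.

```python
# Minimal output-side check, run just before the answer reaches the user.
# Combines hand-written rules with a score from a separate moderation model.
RULES = [
    lambda text: "as an AI" not in text,                     # style rule
    lambda text: "credit card number" not in text.lower(),   # data-leakage rule
]

def moderate(draft_answer: str, toxicity_score: float, limit: float = 0.5) -> str:
    # `toxicity_score` would come from a second classifier, as in step 1.
    if toxicity_score > limit or not all(rule(draft_answer) for rule in RULES):
        return "Sorry, I can't share that response."          # safe fallback
    return draft_answer

print(moderate("Here is a balanced summary of the topic.", toxicity_score=0.1))
```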
🔄 So far, creative people keep finding ways to circumvent these guardrails, which shows that controlling AI output is easier said than done. It is crucial to keep adapting and evolving our strategies.