Feature Steering: The Illusion of Control and the Real Costs of Bias Manipulation
AI governance. Everybody seems to love it. And the new favorite buzzword is “feature steering.” Feature steering promises that, with a couple of minor tweaks, you can tune out the “social biases” hiding in your Large Language Model (LLM). But the idea that we can simply dial bias out of an AI model is a pipe dream. The reality: feature steering will make your model less reliable.
Anthropic appears to be going down that feel-good road with its paper, “Evaluating feature steering: A case study in mitigating social biases”. Feature steering, as described in the paper, dials a model’s internal features up or down to influence outputs prone to social bias. But, paradoxically, it makes those outputs less reliable.
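To make the mechanism concrete, here is a minimal numpy sketch of the geometric idea behind feature steering: nudge a hidden state along a feature direction so that downstream computation sees a shifted representation. Everything here is hypothetical; the feature vector is random and the dimensions are toy-sized, whereas the paper works with learned features inside a production model.

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 64  # toy model width

# Hypothetical "feature direction" -- in practice this would come from an
# interpretability method, not random data.
gender_bias_feature = rng.normal(size=HIDDEN_DIM)
gender_bias_feature /= np.linalg.norm(gender_bias_feature)

def steer(hidden_state: np.ndarray, feature: np.ndarray, strength: float) -> np.ndarray:
    """Nudge a hidden state along a feature direction.

    This is the core move of feature steering: add a scaled feature
    vector to the model's internal representation so that everything
    computed afterward sees a shifted activation.
    """
    return hidden_state + strength * feature

# A stand-in for one token's internal activation.
hidden_state = rng.normal(size=HIDDEN_DIM)

# Negative strength pushes the representation *away* from the feature.
steered = steer(hidden_state, gender_bias_feature, strength=-4.0)

# The shift along the feature equals the strength (the vector is unit norm).
print("shift along feature:", (steered - hidden_state) @ gender_bias_feature)
```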
Here’s why unleashing AI’s full potential, instead of putting it on a woke leash, will give outputs that are not only more accurate but also more nuanced and context-aware.
Complexity and the Pandora’s Box of Off-Target Effects
The Anthropic paper does admit an undeniable truth: you push one lever; the whole system moves. Jump in there and adjust the “gender bias” feature, and poof, there goes “age bias” too. You’ve unintentionally whipped up a cocktail you didn’t order. Because the features are interconnected, everything you tweak has ripple effects, what the 800-pound brains call “off-target effects.” Steer one bias and you’ve inevitably tweaked others, with unpredictable results (the sketch below shows the effect in miniature). Models should organically…
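A toy illustration of the off-target effect, under loudly hypothetical assumptions: two feature directions, “gender” and “age,” are built to share part of their direction, as learned features in real models often do. Steering only one of them still drags the other along.

```python
import numpy as np

rng = np.random.default_rng(1)
HIDDEN_DIM = 64

# Two hypothetical, partially overlapping feature directions.
gender = rng.normal(size=HIDDEN_DIM)
gender /= np.linalg.norm(gender)

noise = rng.normal(size=HIDDEN_DIM)
noise -= (noise @ gender) * gender   # keep only the part orthogonal to "gender"
noise /= np.linalg.norm(noise)

# Unit vector with cosine similarity 0.6 to "gender".
age = 0.6 * gender + 0.8 * noise

hidden = rng.normal(size=HIDDEN_DIM)

def expression(h: np.ndarray, feature: np.ndarray) -> float:
    """How strongly a hidden state expresses a feature (its projection)."""
    return float(h @ feature)

# Steer only the "gender" feature...
steered = hidden - 4.0 * gender

print("gender expression:", expression(hidden, gender), "->", expression(steered, gender))
print("age expression:   ", expression(hidden, age),    "->", expression(steered, age))
# The "age" projection shifts by -4.0 * 0.6 = -2.4 even though we never
# touched it: the off-target effect in miniature.
```

The point of the toy example is that the ripple is not a bug in anyone’s code; it falls straight out of the geometry whenever feature directions overlap.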