Goodfire and Training on Interpretability
Lesswrong
Goodfire is training AI on interpretability, a move critics warn could degrade interpretability techniques through optimization pressure, though the company claims to be managing these risks.
Lesswrong
Goodfire is training AI on interpretability, a move critics warn could degrade interpretability techniques through optimization pressure, though the company claims to be managing these risks.