Research talk: Mechanistic Interpretability for AI Safety: Insights and Contributions

28 Oct 2024 | 11.00 AM - 12.00 PM | Alumni, Current Students, Industry/Academic Partners

On August 21, Dr. Fazl Barez from the University of Oxford visited DTC and delivered a talk on the emerging field of Mechanistic Interpretability (MI). He explained that MI seeks to understand the internal workings of AI systems in order to reveal how and why they produce specific outputs. Dr. Barez discussed techniques such as activation patching, causal tracing, and sparse autoencoders, which researchers use to reverse-engineer model behaviours and identify the computational pathways responsible for them. Such insights into model mechanics can enhance AI safety by helping researchers better predict and control AI behaviour.
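
For readers unfamiliar with activation patching, one of the techniques mentioned in the talk, the following is a minimal sketch of the general idea and is not taken from Dr. Barez's presentation. It assumes a hypothetical two-layer toy network with hand-picked weights (W1, W2) and two made-up inputs. The model is run on a "clean" input and a "corrupted" input, each hidden unit from the clean run is copied into the corrupted run one at a time, and the fraction of the clean output that is restored is recorded.

```python
import numpy as np

# Hypothetical toy network with hand-picked weights, for illustration only.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])      # hidden = relu(W1 @ x), three hidden units
W2 = np.array([2.0, 0.0, 0.5])   # output = W2 @ hidden


def forward(x, patch=None):
    """Run the toy model; optionally overwrite hidden units given as {index: value}."""
    hidden = np.maximum(W1 @ x, 0.0)
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    return float(W2 @ hidden), hidden


# Assumed example inputs: the clean input triggers the behaviour, the corrupted one does not.
clean_x = np.array([1.0, 0.0])
corrupt_x = np.array([0.0, 1.0])

clean_out, clean_hidden = forward(clean_x)
corrupt_out, _ = forward(corrupt_x)


def patching_effects():
    """For each hidden unit, patch its clean activation into the corrupted run
    and return the fraction of the clean-vs-corrupted output gap that is restored."""
    effects = []
    for i in range(len(clean_hidden)):
        patched_out, _ = forward(corrupt_x, patch={i: clean_hidden[i]})
        effects.append((patched_out - corrupt_out) / (clean_out - corrupt_out))
    return effects


if __name__ == "__main__":
    for unit, effect in enumerate(patching_effects()):
        print(f"hidden unit {unit}: restored fraction = {effect:.2f}")
```

In this toy setting, a unit whose restored fraction is close to 1 carries most of the difference between the two runs, while a value close to 0 suggests the unit is not involved. Real studies apply the same idea to components of large models, such as attention heads or residual-stream positions.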