Research Talk: Mechanistic Interpretability for AI Safety: Insights and Contributions
Title: Mechanistic Interpretability for AI Safety: Insights and Contributions
Speaker: Dr. Fazl Barez
Description: On August 21, Dr. Fazl Barez from Oxford University visited DTC and gave seminars on two topics: “Mechanistic Interpretability for AI Safety: Insights and Contributions” and “Unlearning and Relearning in Large Language Models”. For the first topic, Dr. Barez introduced the emerging field of Mechanistic Interpretability (MI) and explained how it can help us understand, from a model's internal workings, how and why an AI system produces specific outputs. He covered techniques such as activation patching, causal tracing, and sparse autoencoders, which are used to reverse-engineer model behaviors and identify critical computational pathways.
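For readers unfamiliar with activation patching, the core idea can be sketched in a few lines of PyTorch: record an internal activation from a "clean" run, splice it into a "corrupted" run, and check whether the output shifts back toward the clean behavior. The toy two-layer model, inputs, and hook names below are illustrative assumptions for this sketch, not code from the talk.

```python
import torch
import torch.nn as nn

# Toy 2-layer MLP standing in for a transformer component (hypothetical model).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

clean_input = torch.randn(1, 8)      # input that elicits the behavior of interest
corrupted_input = torch.randn(1, 8)  # input where the behavior is absent

# 1. Record the hidden activation on the clean run.
cache = {}
def save_hook(module, inp, out):
    cache["hidden"] = out.detach()

handle = model[1].register_forward_hook(save_hook)
clean_logits = model(clean_input)
handle.remove()

# 2. Re-run on the corrupted input, patching in the clean activation.
def patch_hook(module, inp, out):
    return cache["hidden"]  # returning a tensor replaces this module's output

handle = model[1].register_forward_hook(patch_hook)
patched_logits = model(corrupted_input)
handle.remove()

corrupted_logits = model(corrupted_input)

# 3. If the patch moves the corrupted output toward the clean output, the patched
#    activation is causally implicated in the behavior.
print("effect of patch:", (patched_logits - corrupted_logits).abs().sum().item())
```

In practice the same hook-based patching is applied to individual attention heads or MLP layers of a trained language model, and the resulting output differences are used to map which components lie on the critical computational pathway.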
For the second topic, Dr. Barez discussed the importance and challenges of removing undesirable concepts from large language models, and of ensuring that these concepts are not relearned during subsequent fine-tuning. The presentation concluded with a discussion of the broader implications for AI safety, focusing on the challenges and opportunities in making AI systems more transparent, controllable, and aligned with human values. After the talk, there was a follow-up discussion on how NTU and Oxford University can collaborate in these two areas.