Research Talk: Mechanistic Interpretability for AI Safety: Insights and Contributions

21 Aug 2024, 03.30 PM - 04.30 PM | Audience: Current Students, Industry/Academic Partners

Title: Mechanistic Interpretability for AI Safety: Insights and Contributions

Speaker: Dr. Fazl Barez

Description: On August 21, Dr. Fazl Barez from Oxford University visited DTC and gave seminars on two topics: “Mechanistic Interpretability for AI Safety: Insights and Contributions” and “Unlearning and Relearning in Large Language Models”. In the first talk, Dr. Barez introduced the emerging field of Mechanistic Interpretability (MI) and explained how it can be used to understand how and why an AI system produces specific outputs, based on the model’s internal workings. He covered techniques such as activation patching, causal tracing, and sparse autoencoders, which are used to reverse-engineer model behaviors and identify critical computational pathways.
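To make the activation-patching idea concrete, here is a minimal sketch in PyTorch; the toy model and tensors are illustrative assumptions, not material from the talk. The core move is to cache an activation from a “clean” run and splice it into a “corrupted” run, then check how much of the clean behavior is restored:

```python
import torch
import torch.nn as nn

# Toy two-layer network standing in for a transformer block (illustrative only).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

clean_input = torch.randn(1, 4)
corrupt_input = torch.randn(1, 4)

# 1. Cache the hidden activation from the "clean" run.
cache = {}
def save_hook(module, inp, out):
    cache["h"] = out.detach()

handle = model[1].register_forward_hook(save_hook)
clean_out = model(clean_input)
handle.remove()

# 2. Re-run on the "corrupted" input, patching in the clean activation.
#    Returning a value from a forward hook replaces the module's output.
def patch_hook(module, inp, out):
    return cache["h"]

handle = model[1].register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

corrupt_out = model(corrupt_input)

# If patching this activation restores the clean output, the patched
# component is causally important for the behavior under study.
print("clean:  ", clean_out)
print("corrupt:", corrupt_out)
print("patched:", patched_out)
```

In practice the same procedure is applied at specific attention heads or MLP layers of a transformer, which is what lets researchers localize the components that carry a given behavior.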

In the second talk, Dr. Barez discussed the importance and difficulty of removing undesirable concepts from large language models in a way that prevents those concepts from being relearned, even after subsequent fine-tuning. The presentation concluded with a discussion of the broader implications for AI safety, focusing on the challenges and opportunities in making AI systems more transparent, controllable, and aligned with human values. After the talk, there was a follow-up discussion on how NTU and Oxford University could collaborate in these two areas.
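For readers unfamiliar with unlearning, a common baseline (not necessarily the method presented in the talk) is gradient ascent on a “forget” set; the toy model and data below are illustrative assumptions. Removals of this shallow kind are precisely what may resurface under later fine-tuning, which is the relearning problem the talk addressed:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)          # toy stand-in for an LLM (illustrative only)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Hypothetical "forget" set: examples expressing the concept to remove.
forget_x = torch.randn(8, 4)
forget_y = torch.randint(0, 2, (8,))

for step in range(10):
    opt.zero_grad()
    # Negate the loss so the optimizer performs gradient *ascent*,
    # pushing the model away from fitting the unwanted examples.
    loss = -loss_fn(model(forget_x), forget_y)
    loss.backward()
    opt.step()
```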