Measuring Mechanistic Interpretability at Scale Without Humans

Klindt, David, Zimmermann, Roland, Brendel, Wieland (March 2024) Measuring Mechanistic Interpretability at Scale Without Humans. In: ICLR 2024 Workshop on Representational Alignment.

Abstract

In today’s era, whatever we can measure at scale, we can optimize. So far, however, measuring the interpretability of units in deep neural networks (DNNs) for computer vision requires direct human evaluation and does not scale. As a result, the inner workings of DNNs remain a mystery despite the remarkable progress we have seen in their applications. In this work, we introduce the first scalable method to measure the per-unit interpretability of vision DNNs. The method requires no human evaluations, yet its predictions correlate well with existing human interpretability measurements. We validate its predictive power through an interventional human psychophysics study. We demonstrate the usefulness of this measure by performing previously infeasible experiments: (1) a large-scale interpretability analysis across more than 70 million units from 835 computer vision models, and (2) an extensive analysis of how units transform during training. We find an anticorrelation between a model's downstream classification performance and its per-unit interpretability, which is also observable during training. Furthermore, we see that a layer's location and width influence its interpretability.
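
The abstract states that the automated score is validated by its correlation with existing human interpretability measurements. As a rough, hypothetical illustration of that kind of validation (not the authors' implementation; the data and variable names below are assumptions), one could compare a machine-computed per-unit score against human ratings with a rank correlation:

```python
# Hypothetical sketch: checking agreement between an automated per-unit
# interpretability score and human interpretability ratings.
# The scores below are synthetic placeholders, not the paper's data.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

n_units = 100
machine_scores = rng.random(n_units)                               # automated score per unit
human_ratings = machine_scores + 0.3 * rng.normal(size=n_units)    # noisy human judgments

# Rank correlation between the automated measure and human evaluations.
rho, p_value = spearmanr(machine_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.1e})")
```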

Item Type: Conference or Workshop Item (Paper)
Communities: CSHL labs > Klindt lab
SWORD Depositor: CSHL Elements
Depositing User: CSHL Elements
Date: 2 March 2024
Date Deposited: 11 Apr 2024 16:04
Last Modified: 11 Apr 2024 16:04
URI: https://repository.cshl.edu/id/eprint/41508
