Three researchers, from the Massachusetts Institute of Technology (MIT) and Microsoft Research, have developed a mathematical framework to formally quantify and evaluate the understandability of explanations for machine-learning models.
Yilun Zhou, an electrical engineering and computer science graduate student in MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL), is the lead author, while Marco Tulio Ribeiro, a senior researcher at Microsoft Research, and Julie Shah, a professor of aeronautics and astronautics and director of the Interactive Robotics Group in CSAIL, are co-authors of the work.
The research is expected to be presented at the Conference of the North American Chapter of the Association for Computational Linguistics.
According to the team, the mathematical framework for machine-learning explainability can help researchers uncover insights about model behavior that would otherwise go unnoticed if they merely assessed a few individual hypotheses in an attempt to understand the complete model.
“With this framework, we can have a very clear picture of not only what we know about the model from these local explanations, but more importantly what we don’t know about it,” says Zhou.
ExSum (short for explanation summary) is a framework the researchers built that formalizes such claims about model behavior into rules that can be verified using quantifiable metrics. ExSum examines a rule across the entire dataset, not just the single instance for which it was constructed.
ExSum allows the user to evaluate a rule along three metrics: coverage, validity, and sharpness. Coverage refers to how widely the rule applies across the full dataset. Validity is the percentage of individual examples that agree with the rule. Sharpness describes how precise the rule is; a highly valid rule could be so general that it is useless for understanding the model.
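To make the three metrics concrete, here is a minimal Python sketch of how a rule of the form “negative words receive strongly negative attribution” might be scored over a toy dataset of word-attribution explanations. The rule format, word list, thresholds, and metric formulas (in particular the sharpness proxy) are illustrative assumptions, not the actual ExSum implementation or API.

```python
# Illustrative sketch only: metric names follow the article, but the exact
# definitions and all names here are assumptions, not the ExSum code itself.
from typing import List, Tuple

Explanation = List[Tuple[str, float]]  # (word, attribution score) per review

# Hypothetical rule: these negative words get attribution in [-1.0, -0.2].
RULE_WORDS = {"bad", "boring", "awful", "dull", "not"}
CLAIM_LOW, CLAIM_HIGH = -1.0, -0.2

def evaluate_rule(explanations: List[Explanation]):
    all_scores, covered_scores = [], []
    n_valid = 0
    for explanation in explanations:
        for word, score in explanation:
            all_scores.append(score)
            if word in RULE_WORDS:                    # the rule applies here
                covered_scores.append(score)
                if CLAIM_LOW <= score <= CLAIM_HIGH:  # the rule's claim holds
                    n_valid += 1

    # Coverage: how much of the dataset the rule says anything about.
    coverage = len(covered_scores) / len(all_scores) if all_scores else 0.0
    # Validity: of the covered cases, how many agree with the claim.
    validity = n_valid / len(covered_scores) if covered_scores else 0.0
    # Sharpness (one possible proxy): a narrow claimed interval, relative to
    # the full range of observed scores, is more informative than a wide one.
    full_range = max(all_scores) - min(all_scores) if all_scores else 0.0
    sharpness = (1.0 - (CLAIM_HIGH - CLAIM_LOW) / full_range) if full_range else 0.0

    return coverage, validity, sharpness

if __name__ == "__main__":
    toy_explanations = [
        [("the", 0.01), ("plot", 0.10), ("was", 0.02), ("boring", -0.55)],
        [("not", -0.30), ("a", 0.00), ("bad", -0.45), ("film", 0.05)],
        [("great", 0.60), ("acting", 0.25), ("dull", -0.15), ("ending", -0.05)],
    ]
    print(evaluate_rule(toy_explanations))
```

In this sketch, widening the claimed interval raises validity but lowers sharpness, which mirrors the trade-off the researchers describe: a rule can be made trivially valid by making it too vague to be useful.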
ExSum can also reveal surprising insights into a model’s behavior. When testing a movie review classifier, for example, the researchers were surprised to discover that negative words contribute to the model’s decisions in a sharper, more pointed way than positive words. According to Zhou, this could be because reviewers try to be courteous and less direct when criticizing a film.
Zhou plans to expand on this work in the future by applying the concept of understandability to other criteria and explanation forms, such as counterfactual explanations. For now, the team has concentrated on feature attribution methods, which describe the individual features a model used to reach a conclusion.
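For context, the sketch below shows one simple form of feature attribution: scoring each word by how much a model’s output changes when that word is removed. The toy lexicon-based “model” and the deletion-based method are assumptions made for illustration; they are not the specific attribution methods studied in the paper.

```python
# A minimal word-deletion attribution sketch over a stand-in sentiment model.
from typing import Dict, List, Tuple

LEXICON: Dict[str, float] = {"great": 1.0, "fun": 0.6, "bad": -0.8, "boring": -1.0}

def sentiment_score(words: List[str]) -> float:
    """Stand-in model: sum of lexicon weights for the words present."""
    return sum(LEXICON.get(w, 0.0) for w in words)

def attribute_by_deletion(words: List[str]) -> List[Tuple[str, float]]:
    """Attribution of each word = drop in score when that word is removed."""
    base = sentiment_score(words)
    attributions = []
    for i, word in enumerate(words):
        without = words[:i] + words[i + 1:]
        attributions.append((word, base - sentiment_score(without)))
    return attributions

if __name__ == "__main__":
    review = "not a bad film but the ending was boring".split()
    for word, score in attribute_by_deletion(review):
        print(f"{word:>8}: {score:+.2f}")
```

An ExSum-style rule would then be checked against attributions like these across an entire dataset, rather than against a single review.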