As explained in [@lundbergUnifiedApproachInterpreting], additive feature attribution methods are those methods that use an explanation model which is a linear function of binary variables, i.e.

$$g(z') = \phi_0 + \sum_{i=1}^{M} \phi_i z'_i$$

where $M$ is the number of simplified input features, $z'_i \in \{0, 1\}$, and $\phi_i \in \mathbb{R}$ models the effect that each feature has on the final outcome. The summation of all these effects approximates the output of the original opaque model.
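As a minimal sketch of the formula above, the snippet below evaluates an additive explanation model $g$ for hypothetical attribution values $\phi_i$ (the numbers are illustrative, not taken from any real model):

```python
import numpy as np

# Hypothetical attribution values for three simplified features.
phi_0 = 0.5                       # base value: expected model output
phi = np.array([0.2, -0.1, 0.3])  # per-feature effects (illustrative)

def g(z):
    """Additive explanation model: g(z') = phi_0 + sum_i phi_i * z'_i."""
    return phi_0 + phi @ z

# With all simplified features "present" (z'_i = 1), g approximates
# the opaque model's output; with none present, it returns phi_0.
print(g(np.ones(3)))
print(g(np.zeros(3)))
```

Toggling entries of `z` on and off shows how each $\phi_i$ contributes to, or subtracts from, the base value.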
In other words, unlike counterfactual methods, whose explanation is given by a counterexample, additive feature attribution methods explain the decision by assigning a score to each feature, showing which features contributed the most to the final prediction.
Note that simplified inputs $x'$ are related to the original input $x$ through a mapping $x = h_x(x')$ ($h_x$ being a mapping function specific to the current input $x$). This approach is shared by several methods.
For example, LIME [@ribeiroWhyShouldTrust2016] is an additive feature attribution method, since it locally approximates the prediction of the model around a given instance using a local linear explanation model.
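The LIME idea can be sketched without the `lime` library itself: perturb the instance, query the black box, weight samples by proximity, and fit a weighted linear surrogate whose coefficients act as attributions. Everything below (the black-box function, kernel width, sample count) is an illustrative assumption, not LIME's exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical opaque model standing in for any black box.
def black_box(X):
    return X[:, 0] * 2.0 + np.sin(X[:, 1]) + 0.5 * X[:, 0] * X[:, 1]

x = np.array([1.0, 0.5])  # instance whose prediction we want to explain

# 1. Sample perturbations in the neighborhood of x.
Z = x + rng.normal(scale=0.1, size=(500, 2))
y = black_box(Z)

# 2. Weight samples by proximity to x (Gaussian kernel, assumed width).
w = np.exp(-np.sum((Z - x) ** 2, axis=1) / 0.25)

# 3. Fit a weighted linear surrogate g(z) = phi_0 + sum_i phi_i * z_i
#    via least squares (sqrt-weight trick for the weighting).
A = np.hstack([np.ones((len(Z), 1)), Z]) * np.sqrt(w)[:, None]
b = y * np.sqrt(w)
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
phi_0, phi = coef[0], coef[1:]

print("local feature attributions:", phi)
```

The fitted coefficients approximate the local slopes of the black box around `x`, which is exactly the kind of local linear explanation model LIME produces.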
tags: ai-explainability