A Framework to Predict Human Judgement Multi-Dimensional Quality Scores for Text Summarization

Shapsum is a framework we used during the Natural Language Processing course in the MIDS program to predict human judgements of multi-dimensional quality scores for text summarization. Authors: Carolina Arriaga, Ayman Moawad, Abhi Sharma.


Text summarization is the task of producing a shorter version of a document. Summarization models have mainly been compared with one another on their ROUGE scores (Lin, 2004). The metric has been widely criticized because it only assesses content selection and does not account for other quality dimensions such as fluency, grammaticality, coherence, consistency and relevance (Ruder). Combined metrics such as BLEND or DPMFcomb incorporate lexical, syntactic and semantic metrics and achieve high correlation with human judgement in the MT and text generation context (Yu et al., 2015). However, none of these combined metrics has been tested on summaries, and in particular none has moved away from human scores based on Pyramid and Responsiveness. Our findings show that multiple metrics used in the summarization field are predictive of multi-dimensional quality evaluations from experts. We produced four saturated models using decision trees, together with the corresponding surrogate Shapley explanation models, to measure each metric's contribution to four dimensions of evaluation (fluency, relevance, consistency, coherence). We hope that our work can be used as a standard evaluation framework to compare summary quality between new summarization models.
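
To make the idea of measuring metric contribution with Shapley values concrete, here is a minimal sketch in plain Python. It is not the Shapsum implementation. The tree is a hand-written stand-in for one fitted decision tree (one such model per evaluation dimension would be needed), and the feature names `metric_b` and `metric_c`, the thresholds, and all numbers are hypothetical. Features outside a coalition are filled in from a small background sample, which is one common convention for Shapley explanations and an assumption here.

```python
from itertools import combinations
from math import factorial

# Hypothetical automatic-metric features; only ROUGE is named in the text above.
FEATURES = ["rouge", "metric_b", "metric_c"]


def toy_tree(x):
    """Hand-written stand-in for a fitted decision tree that predicts
    an expert score (for example fluency) from automatic metrics."""
    if x["rouge"] <= 0.4:
        return 2.0 if x["metric_b"] <= 0.5 else 3.0
    return 3.5 if x["metric_c"] <= 0.3 else 4.5


def coalition_value(model, x, background, subset):
    """Mean prediction when the features in `subset` take their values
    from x and all other features come from each background row."""
    total = 0.0
    for row in background:
        mixed = dict(row)
        for name in subset:
            mixed[name] = x[name]
        total += model(mixed)
    return total / len(background)


def shapley_values(model, x, background, features=FEATURES):
    """Exact Shapley values by enumerating every coalition of features.
    Cost grows exponentially, so this only suits a handful of features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        contribution = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude f.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                with_f = coalition_value(model, x, background, subset + (f,))
                without_f = coalition_value(model, x, background, subset)
                contribution += weight * (with_f - without_f)
        phi[f] = contribution
    return phi


# Hypothetical metric scores for a few reference summaries (background)
# and for the summary being explained.
background = [
    {"rouge": 0.2, "metric_b": 0.3, "metric_c": 0.1},
    {"rouge": 0.5, "metric_b": 0.7, "metric_c": 0.2},
    {"rouge": 0.3, "metric_b": 0.6, "metric_c": 0.6},
    {"rouge": 0.6, "metric_b": 0.4, "metric_c": 0.5},
]
summary = {"rouge": 0.55, "metric_b": 0.8, "metric_c": 0.45}

phi = shapley_values(toy_tree, summary, background)
baseline = coalition_value(toy_tree, summary, background, ())

for name, value in phi.items():
    print(f"{name}: {value:+.3f}")
print(f"baseline + contributions = {baseline + sum(phi.values()):.3f}")
print(f"model prediction         = {toy_tree(summary):.3f}")
```

The last two printed values agree because Shapley values distribute the gap between the average prediction and the prediction for this summary across the metrics. With real fitted trees and many metrics, a dedicated explanation library would replace the brute-force enumeration.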

© 2019. All rights reserved.