From Text Segmentation to Enhanced Representation Learning: A Novel Approach to Multi-Label Classification for Long Texts

Abstract

Multi-label text classification (MLTC) is an important task in natural language processing. Most existing models rely on high-quality text representations provided by pre-trained language models (PLMs), and therefore run into the PLMs' input length limit when dealing with long texts. In light of this, we introduce a comprehensive approach to multi-label long text classification. To address the input length limit, we propose a text segmentation algorithm that is guaranteed to produce the optimal segmentation. We incorporate external knowledge, label co-occurrence relations, and attention mechanisms into representation learning to enhance both text and label representations. We validate the effectiveness of our method through extensive experiments on various MLTC datasets, which also reveal the intricate correlations between texts and labels.

Publication
Findings of the Association for Computational Linguistics: EMNLP 2024