
Toward Text Data Augmentation for Sentiment Analysis

Barbon S.
2022-01-01

Abstract

A significant part of natural language processing (NLP) techniques for sentiment analysis is based on supervised methods, which are affected by the quality of data. Therefore, sentiment analysis needs to be prepared for data quality issues, such as imbalance and lack of labeled data. Data augmentation methods, widely adopted in image classification tasks, include data-space solutions to tackle the problem of limited data and enhance the size and quality of training datasets to provide better models. In this work, we study the advantages and drawbacks of text augmentation methods (easy data augmentation, back-translation, BART, and pretrained data augmentor) with recent classification algorithms (long short-term memory, convolutional neural network, bidirectional encoder representations from transformers, support vector machine, gated recurrent units, random forests, and enhanced language representation with informative entities) that have attracted sentiment-analysis researchers and industry applications. We explored seven sentiment-analysis datasets to provide scenarios of imbalanced datasets and limited data, to discuss the influence of a given classifier in overcoming these problems, and to provide insights into promising combinations of transformation, paraphrasing, and generation methods of sentence augmentation. The results revealed improvements from the augmented dataset, mainly for reduced datasets. Furthermore, when balanced by augmenting the minority class, the datasets were found to have improved quality, leading to more robust classifiers. The contributions of this article include the taxonomy of NLP augmentation methods and their efficiency over several classifiers from recent research trends in sentiment analysis and related fields.
2022
Published
https://ieeexplore.ieee.org/document/9543519

Use this identifier to cite or link to this document: https://hdl.handle.net/11368/3055528