Measuring the Quality of Semantic Data Augmentation for Sarcasm Detection - Centre de Recherche en Sciences et Technologies de l'Information et de la Communication - EA 3804
Journal article, International Journal of Intelligent Engineering and Systems, Year: 2023

Measuring the Quality of Semantic Data Augmentation for Sarcasm Detection

Abstract

Sarcasm is a form of figurative speech in which the intended meaning of a sentence differs from its literal meaning. Sarcastic expressions tend to confuse automatic NLP approaches in many application domains, which makes detecting them important. One challenge for machine learning approaches to sarcasm detection is the difficulty of acquiring ground-truth annotations: human-annotated datasets usually contain only a few thousand texts and are often imbalanced. In this paper, we propose two data augmentation pipelines for generating additional sarcastic data. The first is SMERT-BERT, a modified SMERTI pipeline that uses RoBERTa as the language model for the text infilling module. The second is SWORD (semantic text exchange by Word-Attribution), in which we modify the masking module of the SMERTI pipeline to use word-attribution values. Both approaches are combined with a SLOR (syntactic log-odds ratio) metric that filters the generated sarcastic data and keeps only the best-scoring sentences. Our experiments show that the SLOR filter makes a significant positive contribution to the augmentation process. In particular, we achieve the best results with the SMERT-BERT pipeline and a SLOR filter, improving the F-measure by 4.00% on the iSarcasm dataset compared to the baseline models.
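The abstract does not spell out how the SLOR filter is computed. As a rough illustration only, the sketch below uses the definition of SLOR commonly given in the literature: the language-model log-probability of a sentence minus its unigram log-probability, divided by the sentence length. The function names, the `(tokens, token_logprobs)` candidate format, and the `unk_logprob` fallback are assumptions made for this example, not the authors' implementation; in practice the per-token log-probabilities would come from whichever language model the pipeline uses.

```python
import math
from collections import Counter


def unigram_logprobs(corpus_tokens):
    """Estimate unigram log-probabilities from a flat list of corpus tokens."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {tok: math.log(c / total) for tok, c in counts.items()}


def slor(tokens, token_logprobs, unigram_lp, unk_logprob):
    """SLOR(S) = (log p_LM(S) - log p_unigram(S)) / |S|.

    token_logprobs holds one language-model log-probability per token
    (hypothetical input; how it is obtained is outside this sketch).
    Tokens missing from unigram_lp fall back to unk_logprob.
    """
    if not tokens or len(tokens) != len(token_logprobs):
        raise ValueError("tokens and token_logprobs must be non-empty and aligned")
    lm_logprob = sum(token_logprobs)
    uni_logprob = sum(unigram_lp.get(t, unk_logprob) for t in tokens)
    return (lm_logprob - uni_logprob) / len(tokens)


def filter_by_slor(candidates, unigram_lp, unk_logprob, keep):
    """Return the `keep` candidates with the highest SLOR score.

    Each candidate is a (tokens, token_logprobs) pair.
    """
    ranked = sorted(
        candidates,
        key=lambda c: slor(c[0], c[1], unigram_lp, unk_logprob),
        reverse=True,
    )
    return ranked[:keep]
```

Subtracting the unigram term and dividing by length keeps the score from penalising sentences merely for being long or for containing rare words, which is why SLOR is often preferred to raw log-probability as a fluency filter.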
Main file
IntJIntelligentEngSys_2023.pdf (828.98 KB)
Origin: Publisher files allowed on an open archive
License: CC BY NC SA - Attribution - NonCommercial - ShareAlike

Dates and versions

hal-04194530, version 1 (29-03-2024)

License

Attribution - NonCommercial - ShareAlike

Identifiers

  • HAL Id: hal-04194530, version 1

Cite

Alif Tri Handoyo, Aurélien Diot, Hidayaturrahman Hidayaturrahman, Derwin Suhartono, Bart Lamiroy. Measuring the Quality of Semantic Data Augmentation for Sarcasm Detection. International Journal of Intelligent Engineering and Systems, 2023, 6 (5), pp.79-91. ⟨hal-04194530⟩

Collections

URCA CRESTIC
