| Main Author: | |
|---|---|
| Format: | Digital resource |
| Language: | |
| Published: | Zenodo 2026 |
| Online Access: | https://doi.org/10.5281/zenodo.18366640 |
Table of Contents:
- <h1><span>Dataset Description</span></h1> <p><span>Data for "Based on Data Balancing and Model Improvement for Multi-Label Emotion Recognition".</span> <span>This repository contains the comprehensive data and experimental results supporting</span> <span>our study on multi-label emotion recognition using the GoEmotions dataset.</span> <span>The dataset and materials are shared under the CC-BY 4.0 license.</span></p> <h2><span>Core Dataset</span></h2> <ul> <li> <p><span>balanced_emotion_dataset.csv</span></p> <ul> <li> <p><span>Final balanced multi-label sentiment dataset used for training and evaluation.</span></p> </li> <li> <p><span>Renamed from final_balanced_df_output.csv.</span></p> </li> <li> <p><span>Columns: text, sentiment (list of emotion labels).</span></p> </li> </ul> </li> <li> <p><span>goemotions_original.csv</span></p> <ul> <li> <p><span>Original GoEmotions data after removing example_very_unclear.</span></p> </li> </ul> </li> <li> <p><span>sentiment140_auto_labels.csv</span></p> <ul> <li> <p><span>Sentiment140 tweets labeled into the 28 GoEmotions categories.</span></p> </li> <li> <p><span>Columns include text and model_labels.</span></p> </li> </ul> </li> <li> <p><span>gpt4mini_generated_texts.csv</span></p> <ul> <li> <p><span>GPT-4 mini generated texts with target emotion prompts.</span></p> </li> </ul> </li> </ul> <h2><span>Original Submission Data (Version 1)</span></h2> <h3><span>Data for Figures</span></h3> <ul> <li> <p><span>balanced_label_counts.csv</span></p> <ul> <li> <p><span>Renamed from fig2_balanced_label_counts.csv.</span></p> </li> <li> <p><span>Counts of each of the 28 emotion labels in the final balanced dataset.</span></p> </li> <li> <p><span>Columns: Sentiment Labels, Counts.</span></p> </li> </ul> </li> <li> <p><span>training_history.csv</span></p> <ul> <li> <p><span>Training history log for figures (loss and accuracy per epoch).</span></p> </li> <li> <p><span>Columns: epoch, accuracy, loss, val_accuracy, val_loss.</span></p> 
</li> </ul> </li> </ul> <h3><span>Source Code</span></h3> <ul> <li> <p><span>model_pipeline.ipynb</span></p> <ul> <li> <p><span>Renamed from model (1).ipynb.</span></p> </li> <li> <p><span>Full notebook for data processing, model training, and evaluation.</span></p> </li> </ul> </li> </ul> <h2><span>Updated Experimental Results (Version 2)</span></h2> <p><span>In response to reviewer feedback, we conducted ablation studies and baseline</span> <span>comparisons. The following ablation archives are included:</span></p> <ol> <li> <p><span>ablation_unbalanced_attn.tar.gz</span></p> <ul> <li> <p><span>CNN + BiLSTM + Attention on original unbalanced GoEmotions.</span></p> </li> </ul> </li> <li> <p><span>ablation_unbalanced_noattn.tar.gz</span></p> <ul> <li> <p><span>CNN + BiLSTM (no attention) on original unbalanced GoEmotions.</span></p> </li> </ul> </li> <li> <p><span>ablation_balanced_attn.tar.gz</span></p> <ul> <li> <p><span>CNN + BiLSTM + Attention on oversampled balanced GoEmotions.</span></p> </li> </ul> </li> </ol> <h3><span>Key Updates in Version 2</span></h3> <ul> <li> <p><span>Extended Training: all models trained for 34 epochs (no early stopping).</span></p> </li> <li> <p><span>Validation-Only Threshold Optimization: thresholds tuned on validation only.</span></p> </li> <li> <p><span>Comprehensive Metrics:</span></p> <ul> <li> <p><span>Subset accuracy</span></p> </li> <li> <p><span>Jaccard index</span></p> </li> <li> <p><span>Hamming loss</span></p> </li> <li> <p><span>Micro/Macro Precision, Recall, F1-score</span></p> </li> <li> <p><span>Macro AUC</span></p> </li> <li> <p><span>Per-label metrics for all 28 emotion categories</span></p> </li> </ul> </li> </ul> <h3><span>File Structure (inside each ablation archive)</span></h3> <ul> <li> <p><span>*_loss.png</span></p> </li> <li> <p><span>*_precision.png</span></p> </li> <li> <p><span>*_recall.png</span></p> </li> <li> <p><span>per_label_metrics_thr0.5.csv</span></p> </li> <li> 
<p><span>per_label_metrics_thr_opt.csv</span></p> </li> <li> <p><span>f1_thr05.png</span></p> </li> <li> <p><span>f1_thr_opt.png</span></p> </li> <li> <p><span>summary.json</span></p> </li> </ul> <h2><span>Quality Control and Audits</span></h2> <ul> <li> <p><span>sentiment140_audit.csv</span></p> <ul> <li> <p><span>Audit samples for Sentiment140 auto-labels.</span></p> </li> </ul> </li> <li> <p><span>gpt4mini_annotations.csv</span></p> <ul> <li> <p><span>Five-annotator labels with majority vote.</span></p> </li> </ul> </li> <li> <p><span>gpt4mini_audit.csv</span></p> <ul> <li> <p><span>Audit samples for GPT-4 mini generated texts.</span></p> </li> </ul> </li> </ul> <h2><span>Transformer Baseline (Version 3)</span></h2> <ul> <li> <p><span>transformer_baseline_train.py</span></p> <ul> <li> <p><span>DistilRoBERTa baseline training script.</span></p> </li> </ul> </li> <li> <p><span>transformer_baseline_requirements.txt</span></p> <ul> <li> <p><span>Python dependencies for the baseline.</span></p> </li> </ul> </li> <li> <p><span>transformer_baseline_summary.json</span></p> <ul> <li> <p><span>Overall metrics at threshold 0.5 and optimized thresholds.</span></p> </li> </ul> </li> <li> <p><span>transformer_baseline_per_label_thr0.5.csv</span></p> <ul> <li> <p><span>Per-label metrics at threshold 0.5.</span></p> </li> </ul> </li> <li> <p><span>transformer_baseline_per_label_thr_opt.csv</span></p> <ul> <li> <p><span>Per-label metrics under validation-tuned thresholds.</span></p> </li> </ul> </li> <li> <p><span>transformer_baseline_thresholds_opt.csv</span></p> <ul> <li> <p><span>Optimized thresholds per label.</span></p> </li> </ul> </li> </ul> <h2><span>Scripts</span></h2> <ul> <li> <p><span>data_balancing_pipeline.py</span></p> <ul> <li> <p><span>Data integration, filtering, and balancing logic.</span></p> </li> </ul> </li> <li> <p><span>cnn_bilstm_training.py</span></p> <ul> <li> <p><span>CNN + BiLSTM + Attention training and evaluation script.</span></p> </li> </ul> 
</li> <li> <p><span>transformer_baseline_train.py</span></p> <ul> <li> <p><span>Transformer baseline training script.</span></p> </li> </ul> </li> </ul> <h2><span>Notes</span></h2> <ul> <li> <p><span>Split protocol: 80/10/10 with MultilabelStratifiedShuffleSplit, random_state=42.</span></p> </li> <li> <p><span>Threshold optimization: per-label grid search from 0.05 to 0.95 (step 0.05).</span></p> </li> <li> <p><span>File names are normalized and do not include timestamps or parentheses.</span></p> </li> </ul> <h2><span>Citation</span></h2> <p><span>If you use this dataset in your research, please cite the paper associated with</span> <span>this repository.</span></p> <h2><span>Contact</span></h2> <p><span>For questions about the data or experiments, please contact the corresponding</span> <span>author.</span></p>
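<p><span>The per-label threshold search described in the Notes can be sketched as follows. This is a minimal pure-Python illustration, not the repository's code. Only the grid (0.05 to 0.95, step 0.05) and the validation-only tuning come from the description above. The function names <code>f1_binary</code>, <code>tune_thresholds</code>, and <code>apply_thresholds</code>, the use of per-label F1 as the selection criterion, and the tie-breaking rule (keep the lowest threshold) are assumptions made for illustration.</span></p>

```python
from typing import List, Sequence

# Grid stated in the Notes: 0.05, 0.10, ..., 0.95 (19 values).
# Built from integer steps and rounded to avoid floating-point drift.
GRID = [round(0.05 * k, 2) for k in range(1, 20)]


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for a single label; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def tune_thresholds(val_true: Sequence[Sequence[int]],
                    val_prob: Sequence[Sequence[float]]) -> List[float]:
    """Pick, for each label, the grid threshold with the best validation F1.

    Assumption: F1 is the selection criterion and ties keep the lowest
    threshold. Only validation data is passed in, never test data.
    """
    n_labels = len(val_true[0])
    best = []
    for j in range(n_labels):
        col_true = [row[j] for row in val_true]
        col_prob = [row[j] for row in val_prob]
        best_thr, best_f1 = GRID[0], -1.0
        for thr in GRID:
            preds = [1 if p >= thr else 0 for p in col_prob]
            score = f1_binary(col_true, preds)
            if score > best_f1:  # strict ">" keeps the lowest tied threshold
                best_thr, best_f1 = thr, score
        best.append(best_thr)
    return best


def apply_thresholds(prob: Sequence[Sequence[float]],
                     thresholds: Sequence[float]) -> List[List[int]]:
    """Binarize probabilities with one threshold per label."""
    return [[1 if p >= t else 0 for p, t in zip(row, thresholds)]
            for row in prob]
```

<p><span>In a workflow like the one described, <code>tune_thresholds</code> would be called once on validation probabilities, and the resulting list would then be passed unchanged to <code>apply_thresholds</code> on the test probabilities, so that the test split plays no part in threshold selection.</span></p>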
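<p><span>Three of the multi-label metrics listed under "Comprehensive Metrics" (subset accuracy, Jaccard index, Hamming loss) can be illustrated with a short pure-Python sketch. The function names are hypothetical, and two choices are assumptions rather than facts about the repository: the Jaccard index is averaged over samples, and a sample with no true and no predicted labels scores 1.0. The values in <code>summary.json</code> may follow a different averaging convention.</span></p>

```python
from typing import Sequence

Matrix = Sequence[Sequence[int]]  # rows = samples, columns = binary labels


def subset_accuracy(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of samples whose full label vector is predicted exactly."""
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return exact / len(y_true)


def hamming_loss(y_true: Matrix, y_pred: Matrix) -> float:
    """Fraction of individual (sample, label) cells that are wrong."""
    n_cells = len(y_true) * len(y_true[0])
    wrong = sum(1 for tr, pr in zip(y_true, y_pred)
                for t, p in zip(tr, pr) if t != p)
    return wrong / n_cells


def jaccard_index(y_true: Matrix, y_pred: Matrix) -> float:
    """Sample-averaged intersection over union of label sets.

    Assumption: an empty true set matched by an empty prediction scores 1.0.
    """
    total = 0.0
    for tr, pr in zip(y_true, y_pred):
        inter = sum(1 for t, p in zip(tr, pr) if t == 1 and p == 1)
        union = sum(1 for t, p in zip(tr, pr) if t == 1 or p == 1)
        total += inter / union if union else 1.0
    return total / len(y_true)
```

<p><span>Subset accuracy is the strictest of the three because a single wrong label fails the whole sample, whereas Hamming loss and the Jaccard index give partial credit, which is why they are usually reported together for multi-label problems.</span></p>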