Dataset Used for Fine-Tuning BERT and Friends
Emotions Dataset for NLP
About Dataset
The Emotions Dataset for NLP is a collection of documents annotated with their corresponding emotions, providing valuable resources for natural language processing (NLP) classification tasks. 📚🔍 It comprises lists of documents paired with emotion labels and is split into train, test, and validation sets to facilitate the development of machine learning models. 🛠️💻
Example
Each entry is a plain-text line containing a sentence and its emotion label, separated by a semicolon, e.g. (illustrative, not a real entry): `i am feeling great today;joy`.
Sizes of the Sets: 80-10-10 (%)
- Training set: 16,000
- Validation set: 2,000
- Test set: 2,000
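The split sizes above can be checked against the stated 80-10-10 proportions:

```python
# Sizes taken from the list above; proportions computed as percentages
sizes = {"train": 16_000, "validation": 2_000, "test": 2_000}
total = sum(sizes.values())
shares = {name: 100 * count / total for name, count in sizes.items()}
print(shares)  # {'train': 80.0, 'validation': 10.0, 'test': 10.0}
```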
Acknowledgements
This dataset is made available thanks to Elvis and the Hugging Face team. The methodology used to prepare the dataset is detailed in the following publication: CARER: Contextualized Affect Representations for Emotion Recognition.
Inspiration
The Kaggle Emotion Dataset serves as a valuable resource for the community, enabling the development of emotion classification models using NLP-based approaches. 🌟📊 Researchers and practitioners can leverage this dataset to explore a variety of questions related to sentiment analysis and mood identification, such as:
- What is the sentiment of a customer's comment?
- What is the mood associated with today's special food?
Labels
The dataset includes six emotion labels: Anger, Joy, Love, Fear, Sadness, and Surprise. Each document in the dataset is annotated with one of these emotions.
Limitations
However, all sentences in the dataset follow a narrow template, typically beginning with "I am ..." or "I ...". While this uniformity simplifies annotation, it may limit how well fine-tuned models generalize to inputs written in other styles. 🤔📝
Therefore, we also plan to explore alternative datasets in the future.
Note: You can find these datasets in /Emotion_Draw/Emotion_Draw/bert_part/data/raw/.
Preprocessing
Note: You can find an associated notebook in /Emotion_Draw/Emotion_Draw/bert_part/notebooks/data_creation.ipynb.
Data Processing for Training Dataset
This script explains the data processing steps for the training dataset.
Reading and Displaying the Dataset
Initially, the raw text file is read, and its contents are split into sentences and labels.
import pandas as pd

# Read the raw text file; each line holds "sentence;label"
with open('../data/raw/train.txt', 'r') as file:
    lines = file.readlines()

# Split each line on the semicolon separator
data = [line.strip().split(';') for line in lines]

# Create the DataFrame
train_df = pd.DataFrame(data, columns=['Sentence', 'Labels'])

# Display the DataFrame
train_df
The DataFrame is then created and displayed. Following this, exploratory data analysis (EDA) is conducted, including descriptive statistics, checking for missing values, examining data types, and identifying unique labels.
Dataset Exploration
# Descriptive statistics
train_df.describe()
# Check for missing values
train_df.isnull().sum()
# Data types of columns
train_df.dtypes
# Unique labels
train_df['Labels'].unique()
Next, label encoding is performed to convert categorical labels into numerical values.
Label Encoding
from sklearn.preprocessing import LabelEncoder

# Encode the categorical labels as integers
label_encoder = LabelEncoder()
train_df['Labels_Encoded'] = label_encoder.fit_transform(train_df['Labels'])
train_df
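To see which integer each emotion maps to, the fitted encoder's classes can be inspected; a minimal sketch on a hand-made label list (not the real data):

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative label list covering the six emotions (not taken from the dataset)
labels = ["joy", "sadness", "anger", "fear", "love", "surprise"]

encoder = LabelEncoder()
encoder.fit(labels)

# LabelEncoder assigns codes in alphabetical order of the class names
mapping = {label: int(code) for label, code in
           zip(encoder.classes_, encoder.transform(encoder.classes_))}
print(mapping)  # {'anger': 0, 'fear': 1, 'joy': 2, 'love': 3, 'sadness': 4, 'surprise': 5}
```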
Finally, the processed dataset is saved into a CSV file for further use in model training.
Saving Dataset into CSV
# Specify the file path where you want to save the CSV file
file_path = '../data/processed/train_data.csv'
# Save the DataFrame to a CSV file
train_df.to_csv(file_path, index=False)
Similar procedures are applied for validation and test sets. Adjustments to file paths and other configurations may be necessary based on your specific setup.
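Since the same steps repeat for each split, they can be collected into a small helper; a sketch, where `process_split` is a hypothetical name and the sample lines are illustrative:

```python
import pandas as pd

def process_split(lines):
    """Turn raw 'sentence;label' lines into a two-column DataFrame."""
    data = [line.strip().split(';') for line in lines]
    return pd.DataFrame(data, columns=['Sentence', 'Labels'])

# Illustrative lines in the dataset's format (not real entries)
sample = [
    "i am feeling great today;joy\n",
    "i am worried about tomorrow;fear\n",
]
df = process_split(sample)
print(df.shape)  # (2, 2)

# In practice this would run once per split, e.g.:
# for split in ('train', 'val', 'test'):
#     read f'../data/raw/{split}.txt', process, save the CSV
```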