Indonglish Dataset ✨

Dataset for a semantic task (sentiment analysis) 😃 😒 😐

📋 Paper

[Paper 53] Code-Mixed Sentiment Analysis using Transformer for Twitter Social Media Data

✍️ Citation

Laksmita Widya Astuti, Yunita Sari and Suprapto, “Code-Mixed Sentiment Analysis using Transformer for Twitter Social Media Data,” International Journal of Advanced Computer Science and Applications (IJACSA), 14(10), 2023. http://dx.doi.org/10.14569/IJACSA.2023.0141053

❓ About

This dataset was built from keywords derived from a sociolinguistic phenomenon observed among teenagers in South Jakarta. It targets the semantic task of sentiment analysis and uses three label categories: positive, negative, and neutral. The data was annotated by a panel of five annotators, each with expertise in language and data science.

📈 Data Generating Process

The data spans August 2020 to September 2022. In addition to keywords, the endpoint query also includes date-based filters. The dataset is split into three standard subsets: training, validation, and testing. Both the split and the evaluation follow the approach used for the IndoLEM dataset by Koto et al., including the same F1 calculation. The splits contain 3638 sentences for training, 399 for validation, and 1011 for testing, 5048 sentences in total.
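
As a quick sanity check on the split described above, the sketch below computes each subset's share of the total from the sentence counts stated in this README. The names `SPLIT_SIZES` and `split_proportions` are hypothetical and chosen for illustration only; they are not part of the dataset or any accompanying code.

```python
# Split sizes as reported in this README (sentence counts).
# SPLIT_SIZES and split_proportions are illustrative names, not part of the dataset.
SPLIT_SIZES = {"train": 3638, "validation": 399, "test": 1011}


def split_proportions(sizes):
    """Return each split's share of the total as a fraction between 0 and 1."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total = sum(SPLIT_SIZES.values())
proportions = split_proportions(SPLIT_SIZES)

print(f"total: {total} sentences")
for name, share in proportions.items():
    print(f"{name}: {SPLIT_SIZES[name]} sentences ({share:.1%})")
```

With these counts the split works out to roughly 72% training, 8% validation, and 20% testing.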