Project Description: In this project, you will perform sentiment analysis on Instagram/Twitter data using PySpark. The goal is to analyze the sentiments expressed in Instagram (or other Social Mediums) posts or comments and gain insights into the overall sentiment towards different topics, brands, or events.
- Obtain Instagram data by leveraging the Instagram API or using publicly available datasets that contain Instagram posts or comments. You can also scrape data from Instagram using libraries like BeautifulSoup or Selenium.
- Clean the text data by removing hashtags, emojis, special characters, and URLs.
- Perform text normalization techniques such as lowercasing, removing stopwords, and lemmatization.
- Build a data pipeline using PySpark to process and analyze the Instagram data efficiently. This includes loading the data, applying transformations, and performing analysis steps in a distributed manner.
- Conduct exploratory data analysis to gain insights into the data. This can involve analyzing word frequencies, post/comment lengths, user engagement metrics, or identifying popular hashtags or topics.
- Train a sentiment analysis model using PySpark's MLlib. You can use pre-trained models like VADER (Valence Aware Dictionary and sEntiment Reasoner) or train your own model using labeled data.
- Prepare a labeled dataset for training the sentiment analysis model. This can be done manually by labeling a subset of the Instagram data or by using pre-labeled datasets available online.
- Extract relevant features from the preprocessed text data that can contribute to sentiment analysis. This may include features like word frequencies, n-grams, or TF-IDF.
- Split the data into training and testing sets.
- Train the sentiment analysis model on the training data and evaluate its performance using appropriate evaluation metrics such as accuracy, precision, recall, or F1-score.
- Use Python libraries like Matplotlib, Seaborn, or Plotly to create visualizations that help in understanding the sentiment distribution, trends, or patterns in the Instagram data.
- Visualize sentiment scores over time, sentiment distribution across different posts or users, or sentiment analysis results for specific hashtags or events.
- Analyze the sentiment analysis results and draw meaningful insights from the Instagram data.
- Prepare a comprehensive report or presentation summarizing the findings, highlighting sentiment patterns, and providing recommendations based on the analysis.
- Instagram API documentation: https://developers.facebook.com/docs/instagram-basic-display-api
- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- Selenium: https://www.selenium.dev/
- PySpark documentation: https://spark.apache.org/docs/latest/api/python/
- Twitter Sentiment Analysis: https://github.com/Wazzabeee/twitter-sentiment-analysis-pyspark
- VADER sentiment analysis: https://github.com/cjhutto/vaderSentiment