This research project was associated with "[INF-DS-RMB] Research Module B: Projekt: Social Media and Business Analytics Project", Summer Semester 2022, for my Master of Science in Data Science at the University of Potsdam, Germany. The associated research paper can be found here.
To train, run inference with, or visualize the Hierarchical Attention Network (HAN) or LSTM models, follow the steps in the readme here.
Install the scrapetube and youtubesearchpython packages with pip; they are used to scrape YouTube video IDs:
pip install scrapetube
pip install youtubesearchpython
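The scripts below wrap these two libraries. As a quick orientation, this is roughly how they are typically used to pull the video IDs of a channel and to look up a channel by title; the channel ID and search query are placeholders, not values from this project:

```python
import scrapetube
from youtubesearchpython import ChannelsSearch

# Iterate over the video IDs uploaded by a channel (placeholder channel ID).
for video in scrapetube.get_channel(channel_id="UC_x5XG1OV2P6uZZ5FSM9Ttw"):
    print(video["videoId"])

# Look up a channel by title via YouTube search (placeholder query).
search = ChannelsSearch("BBC News", limit=1)
print(search.result()["result"][0]["id"])
```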
Also, create a Twitter API key and a YouTube API key for scraping data.
Add your Twitter credentials to utility/config_KEYS.yml.
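The scraping scripts read these credentials from that file. A minimal sketch of loading them with PyYAML, assuming the usual four Twitter credential fields (the exact key names in config_KEYS.yml may differ):

```python
import yaml  # pip install pyyaml

with open("utility/config_KEYS.yml") as f:
    keys = yaml.safe_load(f)

# Field names below are assumptions for illustration; use whatever
# names config_KEYS.yml actually defines.
consumer_key = keys["consumer_key"]
consumer_secret = keys["consumer_secret"]
access_token = keys["access_token"]
access_token_secret = keys["access_token_secret"]
```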
To scrape the data:
- Change the API key and tokens in config_KEYS.yml to your own Twitter API key and tokens.
- To create a list of news channels with their website links and countries, run
python 'Scrap MBFC website.py'
- To scrape each news channel's website for its YouTube channel name and Twitter handle, run
python '2. scrap_youtube_twitter.py'
- Review the Twitter handles manually with
python '2_1. review_twitter_handle.py'
- To find the Twitter handles of the remaining news channels, run an exhaustive search with:
python '3. Get_twitter_handle.py'
- To query the YouTube search page with each channel title and find its YouTube channel, run
python '4. Scrap_youtube_channel.py'
- To validate the given YouTube channel IDs (where only a channel username is available, it is resolved to a channel ID), run
python '5. scrap_youtube_id.py'
- An alternative method to get the channel ID from the username:
python '5.1 get_yt_id.py'
- To manually correct YouTube channel IDs, run
python '5.2 get_yt_id_manual.py'
- From each YouTube channel's playlists, get all videos published between 2021-01-01 and 2021-08-31.
python '6. get_yt_channel_playlists_videos.py'
- Using the video IDs scraped in step 10, scrape their comments (see the YouTube API sketch after this list).
python '7. get_yt_comments.py'
- A utility file to combine the right and right-center files:
python '7.1 combine_right_and_center_right_data.py'
- Convert the data from JSON to CSV:
python '8. json_to_csv.py'
- Get the subscription lists of all authors who have made comments, using the YouTube API:
python '9. get_authors_subscription.py'
- First annotation step: annotate users as liberal or conservative using their subscription data and a homogeneity score (a sketch of one possible scoring follows this list).
python '10. user_subscription_homogeneity_score.py'
- Create a separate dataframe to make it easier to annotate hashtags as used by liberals or conservatives:
python '11. create_df_hashtag_annotations.py'
- Create the first layer of annotated training data (built from users' subscription data); this is just a sample file for building our models:
python '12. create_data_subscription_training.py'
- Second annotation step: annotate users as liberal or conservative using the hashtags in their comments and the homogeneity score:
python '13. create_data_from_hashtags.py'
- Find conflicted users (users appearing in both left and right channels with different leanings) and remove them, then combine both datasets and save the result. Finally, take the samples where the user's leaning is known and generate annotated training data (a small conflict-removal sketch follows this list):
python '14. generate_training_data revisit.py'
- Preprocess the training dataset (created in steps 14 and 17) so it is suitable for training:
python '15. data_for_training_revisit.py'
- Preprocess the un-annotated dataset for inference.
python '16. null_comments.py'
- Create Plots.
python '17. plots.py'
- Remove conflicts from the inference results; this can only be run after the inference files exist:
python '18. remove_conflicts_inference.py'
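As referenced in step 7 above, comment scraping goes through the YouTube Data API. The sketch below is not the project's exact code, just the standard commentThreads request with a placeholder API key and video ID; step 9's subscription scraping uses the same client pattern with youtube.subscriptions().list and a channelId instead.

```python
from googleapiclient.discovery import build  # pip install google-api-python-client

# Placeholder API key and video ID; replace with your own values.
youtube = build("youtube", "v3", developerKey="YOUR_YT_API_KEY")

response = youtube.commentThreads().list(
    part="snippet",
    videoId="dQw4w9WgXcQ",
    maxResults=100,
    textFormat="plainText",
).execute()

for item in response["items"]:
    comment = item["snippet"]["topLevelComment"]["snippet"]
    print(comment["authorDisplayName"], comment["textDisplay"])
```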
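Steps 10 and 13 annotate users with a homogeneity score over their subscriptions or hashtags. The actual scoring is defined in those scripts; the function below is only one plausible formulation, shown to illustrate the idea, with made-up channel names:

```python
def homogeneity_score(subscriptions, left_channels, right_channels):
    """Illustrative only; the real scoring in
    '10. user_subscription_homogeneity_score.py' may differ.

    Returns a value in [-1, 1]: -1 if every matched subscription is a
    left-leaning channel, +1 if every matched one is right-leaning.
    """
    left = sum(1 for channel in subscriptions if channel in left_channels)
    right = sum(1 for channel in subscriptions if channel in right_channels)
    if left + right == 0:
        return None  # user has no overlap with the annotated channels
    return (right - left) / (right + left)

# Example: a user subscribed to two left-annotated channels and one right-annotated one.
print(homogeneity_score({"cnn", "msnbc", "foxnews"}, {"cnn", "msnbc"}, {"foxnews"}))  # -0.33...
```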
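Step 14's conflict removal boils down to dropping any author who ends up with more than one leaning after the left and right annotations are combined. A small pandas sketch under assumed column names ('author_id' and 'leaning' are not necessarily the columns the script uses):

```python
import pandas as pd

# Toy data with assumed column names.
left = pd.DataFrame({"author_id": ["a", "b", "c"], "leaning": "liberal"})
right = pd.DataFrame({"author_id": ["c", "d"], "leaning": "conservative"})

combined = pd.concat([left, right], ignore_index=True)

# An author annotated with more than one leaning is conflicted and dropped.
leanings_per_author = combined.groupby("author_id")["leaning"].nunique()
conflicted = leanings_per_author[leanings_per_author > 1].index
cleaned = combined[~combined["author_id"].isin(conflicted)]
print(cleaned)  # rows for authors a, b, and d remain
```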
utils.py contains all utility functions and variables.