This project trains a machine learning model to detect fake news articles using natural language processing techniques. It is implemented as a Jupyter Notebook and uses several Python libraries for data processing, model training, and visualization.
- Data collection from a GitHub repository
- Text preprocessing and feature extraction
- Machine learning model training (Logistic Regression)
- Model evaluation
- Visualization of word frequencies for fake and true news using word clouds
The project requires the following Python libraries:
- gitpython
- Unidecode
- nltk
- pandas
- numpy
- scikit-learn
- matplotlib
- Pillow
- wordcloud
You can install these dependencies with pip from a notebook cell (omit the leading ! when running the command in a terminal):
!pip install gitpython Unidecode nltk pandas numpy scikit-learn matplotlib Pillow wordcloud
The project uses the Fake.br-Corpus dataset, which is cloned from the following GitHub repository: https://github.com/roneysco/Fake.br-Corpus
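As a rough illustration of the data collection step, the sketch below clones the repository with GitPython and reads every text file in a folder. The helper names clone_corpus and read_articles are hypothetical and not taken from the notebook, and the default destination folder is an assumption.

```python
from pathlib import Path


def clone_corpus(dest="Fake.br-Corpus"):
    """Clone the corpus with GitPython unless `dest` already exists."""
    # Imported here so read_articles works even without gitpython installed.
    from git import Repo

    dest = Path(dest)
    if not dest.exists():
        Repo.clone_from("https://github.com/roneysco/Fake.br-Corpus", str(dest))
    return dest


def read_articles(folder):
    """Return the text of every .txt file in `folder`, sorted by file name."""
    return [
        path.read_text(encoding="utf-8")
        for path in sorted(Path(folder).glob("*.txt"))
    ]
```

After cloning, read_articles would be pointed at the folders holding the fake and true articles. The exact folder names inside the repository are not stated here, so check the cloned layout before hard-coding any path.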
- Data Collection: The script clones the Fake.br-Corpus repository and reads the fake and true news articles.
- Data Preprocessing: The text data is cleaned, normalized, and preprocessed.
- Feature Extraction: TF-IDF vectorization is used to convert text data into numerical features.
- Model Training: A Logistic Regression model is trained on the preprocessed data.
- Model Evaluation: The model's accuracy is calculated and printed.
- Visualization: Word clouds are generated to visualize the most frequent words in fake and true news articles.
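The preprocessing, feature extraction, training, and evaluation steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the function names normalize and train_and_score are hypothetical, accent stripping uses the standard library's unicodedata in place of Unidecode, NLTK-based steps such as stop-word removal are left out, and the 75/25 split and max_iter value are illustrative choices.

```python
import re
import unicodedata

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def normalize(text):
    """Lowercase, strip accents, and keep only letters and single spaces."""
    text = unicodedata.normalize("NFKD", text.lower())
    # Drop the combining marks left behind by NFKD decomposition.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def train_and_score(texts, labels, test_size=0.25, seed=0):
    """Fit TF-IDF + Logistic Regression; return (model, held-out accuracy)."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = make_pipeline(
        TfidfVectorizer(preprocessor=normalize),  # cleaning runs before tokenizing
        LogisticRegression(max_iter=1000),
    )
    model.fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))
```

Wrapping the vectorizer and classifier in a single pipeline means the TF-IDF vocabulary is learned from the training split only, so the held-out accuracy is not inflated by information from the test articles.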
The model's accuracy is printed at the end of the notebook. Two word cloud visualizations are generated:
- A green "thumbs up" shaped word cloud for true news articles.
- A red "thumbs down" shaped word cloud for fake news articles.
To use this project:
- Ensure all dependencies are installed.
- Open the Jupyter Notebook in an environment where the required libraries are available.
- The notebook will automatically download the necessary NLTK data and clone the Fake.br-Corpus repository.
- Run all cells to see the results and visualizations.
This project is for educational purposes and demonstrates basic techniques in natural language processing and machine learning. Detection accuracy may vary, and the model's predictions should not be treated as definitive without further validation.