This project focuses on classifying news articles in English and Hindi using machine learning and natural language processing (NLP). Models are trained on datasets obtained from Kaggle and saved for reuse. The application also includes a Streamlit interface for user interaction.
Language Support: English and Hindi news classification. Machine Learning Models: English: Multinomial Naive Bayes and Tokenization Hindi: Random Forest Classifier and Tokenization Streamlit Application: Interactive interface for classification. Datasets: Preprocessed English and Hindi datasets sourced from Kaggle.
MLNClassifier/
├── EngNewsClassify.ipynb # Jupyter notebook for training and saving English models
├── HinNewsClassify.ipynb # Jupyter notebook for training and saving Hindi models
├── NewsNLP.py # Python script for loading models and running the classifier
├── Multinomial.pkl # English Multinomial Naive Bayes model
├── tokenization.pkl # English Tokenizer
├── hindi_Rf.pkl # Hindi Random Forest model
└── hindiTokenization.pkl # Hindi Tokenizer
├── english_news.csv # English dataset (from Kaggle)
├── hindi_news.csv # Hindi dataset (from Kaggle)
└── README.md # Project documentation
Python 3.7 or higher Libraries: numpy, pandas, sklearn, nltk, streamlit
Clone the repository:
git clone https://github.com/AdityaC784/MLNClassifier.git
cd MLNClassifier
Install dependencies:
pip install -r requirements.txt
Open EngNewsClassify.ipynb and HinNewsClassify.ipynb in Jupyter Notebook.
Run the cells to train the models and save them as .pkl files in the directory.
Run the Streamlit app:
streamlit run NewsNLP.py
Use the Streamlit app to input news texts. The app will classify the news as relevant categories based on the language.
Multinomial.pkl: Trained Multinomial Naive Bayes classifier.
tokenization.pkl: Tokenizer for preprocessing English text.
hindiTokenization.pkl: Tokenizer for preprocessing Hindi text. [Kaggle Link](https://www.kaggle.com/datasets/amananandrai/ag-news-classification-dataset) for english both test.csv and train.csv used. [Kaggle Link](https://www.kaggle.com/datasets/shivamtaneja2304/inshorts-dataset-hindi)
Download and place them in the directory.
Extend support to more languages. Experiment with deep learning models. Add more categories for classification.