Shubh4545/Wikipedia_Article_Classification

Wikipedia Article Classification

Problem Statement

The problem statement is given here.

Goal

The end goal of this project is to develop a binary classification model that can classify Wikipedia articles as either featured or non-featured.

Installation

We use Google Colab as the IDE because it is a very powerful tool.

We use libraries such as pandas, numpy, seaborn, matplotlib, sklearn, and imblearn.

For dataset creation, we also use the Wikipedia API.

Usage

First, run the dataset creation step to create the dataset file.

Because the data is saved from the API, this step can be slow or can time out, depending on the server's response. Rerun it until all of the data has been saved.
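The rerun-until-complete workflow above can be made less painful by saving progress after every article and retrying on timeouts. The following is a minimal sketch, not the project's actual script: `collect_with_resume`, its parameters, and the `fetch` callable (assumed to wrap whatever Wikipedia API request the project makes) are all hypothetical names introduced for illustration.

```python
import json
import time
from pathlib import Path


def collect_with_resume(titles, fetch, out_path, max_attempts=5, delay=1.0):
    """Fetch each title once, saving progress so a rerun resumes where it stopped.

    `fetch` is a hypothetical callable (title -> JSON-serializable dict) that
    raises TimeoutError or ConnectionError when the server does not respond.
    """
    out_path = Path(out_path)
    # Load whatever an earlier (possibly interrupted) run already saved.
    saved = json.loads(out_path.read_text()) if out_path.exists() else {}
    for title in titles:
        if title in saved:
            continue  # already fetched in a previous run
        for attempt in range(1, max_attempts + 1):
            try:
                saved[title] = fetch(title)
                break
            except (TimeoutError, ConnectionError):
                if attempt == max_attempts:
                    raise  # give up; earlier articles are already on disk
                time.sleep(delay * attempt)  # wait a little longer each time
        # Write after every article so a crash loses at most one item.
        out_path.write_text(json.dumps(saved))
    return saved
```

With this shape, running the step again after a failure skips every article already in the output file instead of starting over.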

Clean Data

Before cleaning, the dataset looks like this:

dataset

We performed operations such as removing duplicates, removing NaN values, checking for missing values, and changing data types.
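The cleaning operations listed above map onto a handful of pandas calls. This is a small sketch on made-up rows; the column names (`title`, `word_count`, `featured`) are placeholders chosen for illustration and are not taken from the project's dataset.

```python
import pandas as pd

# Hypothetical raw rows; the column names are placeholders, not the project's schema.
raw = pd.DataFrame(
    {
        "title": ["A", "A", "B", "C"],
        "word_count": ["120", "120", "87", None],
        "featured": [0, 0, 1, 0],
    }
)

# Check how many values are missing in each column.
print(raw.isna().sum())

clean = (
    raw.drop_duplicates()              # remove exact duplicate rows
    .dropna()                          # drop rows with any missing value
    .astype({"word_count": "int64"})   # convert the text column to integers
    .reset_index(drop=True)
)
print(clean.dtypes)
```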

EDA

This is the most important part.

We checked the dependent feature and found that the data is highly imbalanced, with a class ratio of 98:2.

dependent
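A 98:2 split matters when reading accuracy numbers, because a model that always predicts the majority class is already correct 98% of the time. The sketch below illustrates this with made-up labels at that ratio; which class is the minority, and the label encoding, are assumptions made for the example.

```python
from collections import Counter

# Hypothetical labels at the 98:2 ratio described above (1 = minority class).
labels = [0] * 98 + [1] * 2

counts = Counter(labels)
majority_label, majority_count = counts.most_common(1)[0]

# A model that always predicts the majority class is right this often.
baseline_accuracy = majority_count / len(labels)
print(counts)
print(f"majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Comparing each model's accuracy against this baseline, and looking at ranking metrics such as AUROC, gives a fairer picture than accuracy alone.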

Model

We used Linear Regression, Random Forest, and SVM.

The results of the three models are:
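A comparison of this kind can be sketched with sklearn as follows. This is an illustration under stated assumptions, not the project's training code: the data is synthetic (`make_classification` with a roughly 98:2 split), the hyperparameters are defaults, and `LogisticRegression` is swapped in as the linear classifier because it outputs class probabilities directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with a roughly 98:2 class split (not the project's data).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.98, 0.02], random_state=0
)
# Stratify so the rare class appears in both the training and the test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "linear": LogisticRegression(max_iter=1000),  # assumption: stand-in linear classifier
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "svm": SVC(probability=True, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # score for the positive class
    scores[name] = {
        "train_acc": accuracy_score(y_train, model.predict(X_train)),
        "test_acc": accuracy_score(y_test, model.predict(X_test)),
        "auroc": roc_auc_score(y_test, proba),
    }

for name, result in scores.items():
    print(name, result)
```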

Result

Post Evaluation

The feature importance is:

imp
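One common way to obtain a ranking like this is the impurity-based `feature_importances_` attribute of a fitted Random Forest. The sketch below assumes that approach, which the README does not confirm, and uses synthetic data with hypothetical feature names `f0` to `f4`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not the project's features).
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X, y = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0, random_state=0
)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances are non-negative and sum to 1 across features.
ranked = sorted(
    zip(feature_names, forest.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```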

Conclusion

Based on the results, we can draw the following conclusions:

The Linear Regression model has the highest accuracy on the test set (98.48%), followed by the Random Forest model (94.67%). The SVM model has the lowest accuracy on the test set (94%).

On the training set, the Linear Regression model also has the highest accuracy (97.97%), followed by the SVM model (97%) and the Random Forest model (94.52%).

The AUROC score is a metric that indicates the quality of the model's ranking rather than the absolute accuracy. Both the Linear Regression and Random Forest models have high AUROC scores of 0.97, indicating that they can effectively separate the two classes. The SVM model has a slightly lower AUROC score of 0.94, indicating that it may not be as effective in ranking the samples.
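To make the ranking interpretation concrete, AUROC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score, with ties counted as half. The following sketch computes it directly from that definition; the function name `auroc` and the toy labels and scores are made up for illustration.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
example = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(example)  # 0.75
```

Because only the ordering of scores matters, AUROC is unaffected by the decision threshold, which is why it complements accuracy on imbalanced data.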

Overall, the Linear Regression and Random Forest models seem to perform well on this dataset, while the SVM model has room for improvement. However, it's important to note that the choice of model depends on the specific requirements of the problem at hand, and other factors such as interpretability, computational complexity, and scalability should also be considered.
