Document Clustering for Topic Modeling

Overview

This project demonstrates document clustering and topic modeling on the Twenty Newsgroups dataset. The objective is to apply K-means clustering to group similar documents and Latent Dirichlet Allocation (LDA) to discover the underlying topics in a large text corpus.

The analysis includes:

Document Clustering: Using K-means to partition the dataset into clusters.
Topic Modeling: Using LDA to uncover topics from the document corpus.
Visualization: Dimensionality reduction using PCA to visualize the clusters.

Project Components

Data

Dataset: The Twenty Newsgroups dataset, which contains approximately 20,000 newsgroup documents organized into 20 categories.

Techniques Used

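K-means Clustering: An unsupervised learning algorithm that partitions data into k clusters by assigning each point to its nearest cluster centroid and then recomputing each centroid as the mean of its assigned points.
Latent Dirichlet Allocation (LDA): A generative probabilistic model for topic modeling that represents each document as a mixture of topics and each topic as a distribution over words.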
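
To make the K-means description concrete, here is a minimal, dependency-free sketch of the assign-then-update loop (Lloyd's algorithm). It is not taken from the project notebook: the function names, the `seed` and `max_iters` parameters, and the random-sample initialisation are all illustrative assumptions, and in practice a library implementation would normally be used instead.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Illustrative Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: initialise centroids by sampling k of the input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iters):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy 2-D data with two well-separated groups.
toy_points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
```

On real documents the points would be high-dimensional feature vectors rather than 2-D coordinates, and the result can depend on the initial centroids, which is why library implementations typically run several initialisations.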

Installation

To get started with this project, follow these steps:

Clone the Repository

```
git clone https://github.com/rohanag03/Document-Clustering-Topic-Modeling.git
cd Document-Clustering-Topic-Modeling
```

Install Required Packages
```
pip install -r requirements.txt
```

Key Features

Data Preprocessing: Includes text tokenization, stopword removal, and stemming.
Feature Extraction: Uses TF-IDF vectorization for document representation.
Clustering: Implements K-means to group documents into clusters.
Topic Modeling: Applies LDA to extract topics from the document corpus.
Visualization: Uses PCA for dimensionality reduction to visualize document clusters.
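
As a rough illustration of the preprocessing and feature-extraction steps listed above, the sketch below tokenizes text, removes stopwords, applies a stemming step, and computes TF-IDF weights using only the standard library. Everything in it is an assumption made for illustration rather than the notebook's code: the stopword list is deliberately tiny, `crude_stem` is a toy suffix stripper standing in for a real stemmer, and the weighting is the plain `tf * ln(N / df)` variant (libraries often add smoothing and normalisation, which changes the numbers).

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to"}


def crude_stem(token):
    """Toy suffix stripper (hypothetical stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Lowercase, tokenize on letter runs, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document.

    weight = raw term count in the document * ln(N / document frequency).
    """
    tokenized = [preprocess(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        )
    return vectors


sample_docs = ["space shuttle launch", "space station", "hockey game"]
sample_vectors = tfidf(sample_docs)
```

In this variant a term that occurs in every document gets weight zero, and terms unique to one document (such as "shuttle" here) outweigh terms shared across documents (such as "space").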

Contributing

Contributions are welcome! If you have suggestions for improvements or would like to add new features, please follow these steps:

Fork the repository.
Create a new branch for your changes.
Commit your changes and push to your forked repository.
Open a pull request with a description of the changes you made.

License

This project is licensed under the MIT License. See the LICENSE file for more details.
