This project implements the K-means clustering algorithm on top of a MapReduce framework built from scratch, run in a distributed manner. The implementation is written in Python and uses gRPC for communication between the master, mappers, and reducers.
## Table of Contents

- Problem Statement
- Implementation Details
- Code Explanation
- How to Run the Program
- Sample Input and Output
- Evaluation
## Problem Statement

The goal of this project is to implement the MapReduce framework from scratch to perform K-means clustering on a given dataset in a distributed manner. The implementation should be done in Python and use gRPC for communication between the master, mappers, and reducers. The project should also handle fault tolerance and provide a detailed README file explaining the implementation details and how to run the program.
## Implementation Details

- The entire MapReduce framework is deployed on one machine, but each mapper, reducer, and the master are separate processes with distinct IP addresses (localhost) and port numbers.
- Mappers persist intermediate data, and reducers persist final output data in separate file directories in the local file system.
- The implementation uses gRPC for RPCs, and the code is written in Python.
The K-means algorithm is an iterative algorithm that partitions a dataset into K clusters. It proceeds as follows (a minimal single-process sketch follows the list):

1. Randomly initialize K cluster centroids.
2. Assign each data point to the nearest cluster centroid.
3. Recompute each cluster centroid as the mean of the data points assigned to it.
4. Repeat steps 2 and 3 for a fixed number of iterations or until convergence (i.e., until the cluster centroids no longer change significantly).
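For reference, the steps above can be written as a minimal single-process sketch, before any work is distributed. This assumes 2-D points and squared Euclidean distance; the function and parameter names are illustrative, not taken from this repository:

```python
import random

def kmeans(points, k, max_iters=100, threshold=1e-4):
    """Plain K-means over 2-D points, mirroring steps 1-4 above."""
    # Step 1: randomly initialize K centroids from the data points.
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid.
        clusters = {i: [] for i in range(k)}
        for p in points:
            cid = min(range(k), key=lambda i: (p[0] - centroids[i][0]) ** 2
                                              + (p[1] - centroids[i][1]) ** 2)
            clusters[cid].append(p)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for i in range(k):
            pts = clusters[i] or [centroids[i]]  # leave empty clusters in place
            new_centroids.append((sum(x for x, _ in pts) / len(pts),
                                  sum(y for _, y in pts) / len(pts)))
        # Step 4: stop once no centroid moves more than the threshold.
        if all(abs(ox - nx) <= threshold and abs(oy - ny) <= threshold
               for (ox, oy), (nx, ny) in zip(centroids, new_centroids)):
            return new_centroids
        centroids = new_centroids
    return centroids
```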
The MapReduce implementation consists of the following components; a sketch of one iteration's data flow appears after the list:
- Master: Responsible for running and communicating with the other components in the system.
- Input Split: Divides the input data into smaller chunks that can be processed in parallel by multiple mappers.
- Map: Applies the Map function to each input split to generate intermediate key-value pairs.
- Partition: Partitions the output of the Map function into smaller partitions based on the key.
- Shuffle and Sort: Sorts the intermediate key-value pairs by key and groups the values that belong to the same key.
- Reduce: Applies the Reduce function to each group of values that belong to the same key to generate the final output.
- Centroid Compilation: Parses the output generated by all the reducers to compile the final list of centroids.
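To make the data flow concrete, here is a sketch of how one iteration moves through these phases, simulated in a single process without gRPC or files. All names are illustrative; the actual processes exchange this data over RPCs and the file system:

```python
def nearest(p, centroids):
    """Index of the centroid closest to point p (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: (p[0] - centroids[i][0]) ** 2
                             + (p[1] - centroids[i][1]) ** 2)

def one_iteration(points, centroids, num_mappers, num_reducers):
    # Input Split: divide the points into roughly equal chunks, one per mapper.
    chunk = -(-len(points) // num_mappers)  # ceiling division
    splits = [points[i:i + chunk] for i in range(0, len(points), chunk)]

    # Map: each mapper emits (centroid_id, point) pairs for its split.
    intermediate = [(nearest(p, centroids), p) for split in splits for p in split]

    # Partition: route each pair to a reducer based on its key.
    partitions = {r: [] for r in range(num_reducers)}
    for cid, p in intermediate:
        partitions[cid % num_reducers].append((cid, p))

    # Shuffle and Sort + Reduce: group values by key, then average each group.
    new_centroids = {}
    for pairs in partitions.values():
        groups = {}
        for cid, p in sorted(pairs):
            groups.setdefault(cid, []).append(p)
        for cid, pts in groups.items():
            new_centroids[cid] = (sum(x for x, _ in pts) / len(pts),
                                  sum(y for _, y in pts) / len(pts))

    # Centroid Compilation: collect reducer outputs into an ordered list,
    # keeping the old centroid for any cluster that received no points.
    return [new_centroids.get(i, centroids[i]) for i in range(len(centroids))]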
## Code Explanation

The `kmeans.proto` file defines the gRPC services and messages used for communication between the master, mappers, and reducers. It includes the following:

- `Mapper` service: Defines the `Map` and `GetPartitionData` RPCs for the mapper.
- `Reducer` service: Defines the `Reduce` RPC for the reducer.
- Message definitions for requests and responses used in the RPCs.
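For example, the master can reach a mapper through the stubs generated from `kmeans.proto`. The snippet below is only a sketch: the `MapRequest` message and its `mapper_id` field are assumptions for illustration, not necessarily the repository's exact schema.

```python
import grpc
import kmeans_pb2        # generated by grpc_tools.protoc
import kmeans_pb2_grpc   # generated by grpc_tools.protoc

channel = grpc.insecure_channel("localhost:50051")
stub = kmeans_pb2_grpc.MapperStub(channel)
# Hypothetical request; the real message and field names live in kmeans.proto.
response = stub.Map(kmeans_pb2.MapRequest(mapper_id=0))
print(response)
```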
The `master.py` file contains the implementation of the master process. It includes the following functions:

- `run_master`: The main function that runs the K-means clustering algorithm using MapReduce.
- `initialize_centroids`: Randomly initializes the centroids from the input data points.
- `split_input_data`: Divides the input data into smaller chunks for parallel processing by mappers.
- `compile_centroids`: Compiles the final list of centroids from the reducer outputs.
- `has_converged`: Checks if the centroids have converged based on a threshold.
The master process coordinates the execution of mappers and reducers, handles fault tolerance, and performs centroid compilation.
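A rough sketch of what three of these helpers might look like (the signatures and the convergence rule are assumptions; the repository's code may differ):

```python
import random

def initialize_centroids(points, k):
    """Pick K distinct input points as the starting centroids."""
    return random.sample(points, k)

def split_input_data(num_points, num_mappers):
    """Return (start, end) index ranges into the input file, one per mapper."""
    chunk = -(-num_points // num_mappers)  # ceiling division
    return [(i, min(i + chunk, num_points)) for i in range(0, num_points, chunk)]

def has_converged(old, new, threshold=1e-4):
    """True if no centroid moved more than `threshold` in any coordinate."""
    return all(abs(o - n) <= threshold
               for oc, nc in zip(old, new) for o, n in zip(oc, nc))
```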
The `mapper.py` file contains the implementation of the mapper process. It includes the following functions:

- `Map`: Applies the Map function to each input split to generate intermediate key-value pairs.
- `read_input_split`: Reads the assigned input split from the input file.
- `map_data_points`: Maps each data point to the nearest centroid.
- `partition_data`: Partitions the mapped data based on the centroid ID.
- `save_partitioned_data`: Saves the partitioned data to separate files for each reducer.
The mapper process reads the input split, maps data points to centroids, partitions the data, and saves the partitioned data for reducers.
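A sketch of the core mapper steps, assuming 2-D points, squared Euclidean distance, and one partition file per reducer inside the mapper's directory under `Mappers` (the file names are illustrative):

```python
import os

def map_data_points(points, centroids):
    """Emit a (centroid_id, point) pair for each point's nearest centroid."""
    pairs = []
    for p in points:
        cid = min(range(len(centroids)),
                  key=lambda i: (p[0] - centroids[i][0]) ** 2
                                + (p[1] - centroids[i][1]) ** 2)
        pairs.append((cid, p))
    return pairs

def partition_data(pairs, num_reducers):
    """Route each pair to a reducer: centroid ID `cid` goes to reducer cid % R."""
    partitions = {r: [] for r in range(num_reducers)}
    for cid, p in pairs:
        partitions[cid % num_reducers].append((cid, p))
    return partitions

def save_partitioned_data(partitions, mapper_id):
    """Write one partition file per reducer under this mapper's directory."""
    out_dir = os.path.join("Mappers", f"M{mapper_id}")
    os.makedirs(out_dir, exist_ok=True)
    for r, pairs in partitions.items():
        with open(os.path.join(out_dir, f"partition_{r}.txt"), "w") as f:
            for cid, (x, y) in pairs:
                f.write(f"{cid},{x},{y}\n")
```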
The `reducer.py` file contains the implementation of the reducer process. It includes the following functions:

- `Reduce`: Applies the Reduce function to each group of values that belong to the same key to generate the final output.
- `shuffle_data`: Shuffles and sorts the intermediate data received from the mappers.
- `reduce_data`: Reduces the shuffled data to compute the new centroids.
- `save_reduced_data`: Saves the reduced data (new centroids) to a file.
The reducer process shuffles and sorts the intermediate data, reduces the data to compute new centroids, and saves the reduced data.
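A sketch of the reducer's three steps (the output file layout under `Reducers` is an assumption for illustration):

```python
import os

def shuffle_data(pairs):
    """Sort intermediate (centroid_id, point) pairs by key and group them."""
    groups = {}
    for cid, point in sorted(pairs, key=lambda kv: kv[0]):
        groups.setdefault(cid, []).append(point)
    return groups

def reduce_data(groups):
    """New centroid for each group = mean of the points assigned to it."""
    return {cid: (sum(x for x, _ in pts) / len(pts),
                  sum(y for _, y in pts) / len(pts))
            for cid, pts in groups.items()}

def save_reduced_data(new_centroids, reducer_id):
    """Persist this reducer's centroids so the master can compile them."""
    os.makedirs("Reducers", exist_ok=True)
    with open(os.path.join("Reducers", f"R{reducer_id}.txt"), "w") as f:
        for cid, (x, y) in sorted(new_centroids.items()):
            f.write(f"{cid},{x},{y}\n")
```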
## How to Run the Program

1. Make sure you have Python installed on your machine.

2. Install the required dependencies:

   ```
   pip install grpcio grpcio-tools
   ```

3. Generate the gRPC code:

   ```
   python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. kmeans.proto
   ```

4. Start the mapper processes by running the following command for each mapper, entering the port number when prompted (e.g., 50051, 50052, etc.):

   ```
   python mapper.py
   ```

5. Start the reducer processes by running the following command for each reducer, entering the port number when prompted (e.g., 60051, 60052, etc.):

   ```
   python reducer.py
   ```

6. Start the master process, entering the number of mappers, reducers, centroids, and iterations when prompted:

   ```
   python master.py
   ```

7. The program will execute the K-means clustering algorithm using MapReduce and display the progress and results in the console.

8. After the program finishes, the final centroids will be saved in the `centroids.txt` file.
## Sample Input and Output

The input data should be placed in a file named `points.txt` in the `Input` directory. Each line of the file should contain the coordinates of a data point in the format `x,y`.

The program will generate intermediate files in the `Mappers` and `Reducers` directories during execution. The final centroids will be saved in the `centroids.txt` file.
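For illustration, the first few lines of a `points.txt` could look like this (the values are made up):

```
1.0,2.0
1.5,1.8
5.0,8.0
8.0,8.5
1.2,0.6
9.0,11.0
```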