Topic proposal

Group A - Big Data Processing Topic Selection

We have decided to implement topic 5 - Clustering on Large Databases.

We have chosen this topic because all group members are also enrolled on the Machine Learning course and so should already have a reasonable understanding of the k-means clustering algorithm.

Our approach is to begin by applying k-means clustering to a small subset of features, such as reputation, user age, user location and creation date. This should provide an initial clustering of the users, which could then be built upon with more advanced analysis of other data, such as user posts, given enough time.
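As a rough illustration of this first step, the sketch below runs plain k-means (Lloyd's algorithm) on two numeric features. It is written in plain Python purely for local experimentation and is not the planned Spark/Scala implementation. The function names, the toy `users` rows (reputation, age) and the choice of `k = 2` are all hypothetical. Features are standardised first, on the assumption that otherwise a large-valued feature such as reputation would dominate the distance.

```python
import random


def standardise(rows):
    """Scale each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid dividing by zero.
    stds = [(sum((x - m) ** 2 for x in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Return (centroids, assignments) after at most `iterations` rounds."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    assignments = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignments


# Hypothetical toy rows: [reputation, age]
users = [[10, 19], [25, 21], [15, 20], [5200, 41], [4800, 38], [5100, 45]]
centroids, assignments = kmeans(standardise(users), k=2)
print(assignments)
```

On the full dataset the same two steps (assign, then update) would need to be distributed, which is where Spark comes in.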

To cover the required report content, we will do the following:

To meet the criteria for the literature review, we will research the best methods for applying k-means clustering to large amounts of data. Potential sources of information include books such as Pattern Classification (Richard O. Duda, Peter E. Hart, David G. Stork) and Andrew Ng's Machine Learning lectures on Coursera.

The clustering program will be implemented using Spark/Scala. If possible, we will test locally using a small subset of the data. The results will then be processed, potentially using MATLAB or Python, to generate plots and visualise the information.

As this is an unsupervised learning program, the results will depend on the implementation and on the output of the clustering. However, we might expect to be able to identify different classes of users, such as students, professionals and beginners. This will become clearer once initial results have been generated.
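One possible way to start interpreting the clusters is to compare the size and per-feature averages of each one. The sketch below is a minimal Python example under that assumption. The helper name `summarise_clusters`, the feature names and the toy rows are hypothetical, and the cluster labels are taken as given rather than produced by a real run.

```python
from collections import defaultdict


def summarise_clusters(rows, assignments, feature_names):
    """Return {cluster: {"size": n, feature: mean, ...}} for each cluster."""
    grouped = defaultdict(list)
    for row, cluster in zip(rows, assignments):
        grouped[cluster].append(row)
    summary = {}
    for cluster, members in sorted(grouped.items()):
        # Column-wise mean over the members of this cluster.
        means = [sum(col) / len(col) for col in zip(*members)]
        summary[cluster] = {"size": len(members), **dict(zip(feature_names, means))}
    return summary


# Hypothetical toy rows [reputation, age] with made-up cluster labels.
rows = [[10, 19], [20, 21], [5000, 40]]
labels = [0, 0, 1]
summary = summarise_clusters(rows, labels, ["reputation", "age"])
for cluster, stats in summary.items():
    print(cluster, stats)
```

A cluster with low average reputation and low average age might, for instance, correspond to students or beginners, although any such label would need checking against the actual results.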
