##MongoDB Data Mining Shell
MongoDB shell implementation of the data mining algorithms.
##Installation
git clone https://github.com/selvinsource/mongodb-datamining-shell.git
cd mongodb-datamining-shell
mongoimport --db mongodbdm --collection weatherData --type csv --headerline --file dataset/weatherData.csv
mongo mongodbdm --eval "var inputCollectionName = \"weatherData\", target = \"play\"" datamining/classification/oner.js
mongoimport --db mongodbdm --collection iris --type csv --headerline --file dataset/iris.csv
mongo mongodbdm --eval "var inputCollectionName = \"iris\", k = 3" datamining/clustering/kmeans.js
Follow this tutorial to compare the results to the Weka Data Mining Software.
##Documentation
Data mining or also called knowledge discovery is a set of activities aiming at analyzing large databases and extracting extra information meaningful for decision making or problem solving.
###Classification
Classification is one of the most common knowledge discovery task that consists in creating a model that predicts a target class based on explanatory variables.
####OneR
OneR is a simple yet accurate classification algorithm that produces a one level decision tree.
For a visual description of the algorithm see OneR pseudocode.
Its oner.js MongoDB implementation takes as input two parameters:
- inputCollectionName - the collection used as training dataset
- target - the target class of the collection
Usage:
mongo yourdatabase --eval "var inputCollectionName = \"yourcollection\", target = \"yourtargetclass\"" datamining/classification/oner.js
Example of a collection and its target class play: weather data.
Limitation:
- the target class must be a categorical variable with values Yes and No
- the explanatory variables must be categorical variables, numerical variables should be discretized in a small number of distinct ranges before running the algorithm
###Clustering
Clustering is the task of identifying and segmenting the instances into a finite number (k) of categories (clusters) which are not predefined (unlike classification).
####K-Means
K-Means is the classic clustering technique that partitions the instances into k clusters whereas k is predefined.
For an high level description of the algorithm see K-Means pseudocode.
Its kmeans.js MongoDB implementation takes as input two parameters:
- inputCollectionName - the collection used as training dataset
- k - the number of predefined clusters
Usage:
mongo mongodbdm --eval "var inputCollectionName = \"yourcollection\", k = numberofclusters" datamining/clustering/kmeans.js
Example of a collection: iris data.
Limitation:
- the variables must be all numerical
Note:
- If a field in the collection is called "class", this is excluded from the computation, instead it will be printed in the result with the assigned cluster
###References
- Hartigan, J. A. (1975) Clustering Algorithms, Probability & Mathematical Statistics, John Wiley & Sons Inc.
- Holte, R. C. (1993) Very simple classification rules perform well on most commonly used datasets, Machine Learning, 11, pp 63-91
- Selvaggio, V. (2011) Customer Churn prediction for an Automotive Dealership using computational Data Mining, MSc dissertation, City University London
- UCI Machine Learning Repository University of California, School of Information and Computer Science