- Modified from https://github.com/minoh0201/DeepMicro
- Added to bioconda
- Added wrapper for Galaxy
DeepMicro is a deep representation learning framework that exploits various autoencoders to learn robust low-dimensional representations from high-dimensional data and trains classification models on the learned representations.
~$ conda install deepmicro
- For GPU usage, install the GPU version of TensorFlow
~$ conda install tensorflow-gpu==1.13.1
Step 5: Run DeepMicro to print out its usage.
~$ python DM.py -h
Make sure you have already gone through the Quick Setup Guide above.
1. Copy your data under the `/data` directory. Your data should be a comma-separated file without header and index, where each row represents a sample and each column represents a microbe. We are going to assume that your file is named `UserDataExample.csv`, which is already provided.
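If you want to build your own file in this format, here is a minimal sketch (the file name `MyData.csv` is hypothetical; `UserDataExample.csv` already ships with the repository):

```python
import numpy as np
import pandas as pd

# Hypothetical 80-sample x 200-microbe abundance matrix (random values).
rng = np.random.default_rng(0)
data = rng.random((80, 200))

# DeepMicro expects no header row and no index column.
pd.DataFrame(data).to_csv("MyData.csv", header=False, index=False)

# Reload to confirm the expected shape survives the round trip.
reloaded = pd.read_csv("MyData.csv", header=None)
print(reloaded.shape)  # (80, 200)
```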
2. Check that your data can be successfully loaded and verify its shape with the following command.
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv
The output will show the number of rows and columns right next to `X_train.shape`. Our data `UserDataExample.csv` contains 80 rows and 200 columns.
Using TensorFlow backend.
Namespace(act='relu', ae=False, ae_lact=False, ae_oact=False, aeloss='mse', cae=False, custom_data='UserDataExample.csv', custom_data_labels=None, data=None, dataType='float64', data_dir='', dims='50', max_epochs=2000, method='all', no_clf=True, numFolds=5, numJobs=-2, patience=20, pca=False, repeat=1, rf_rate=0.1, rp=False, save_rep=False, scoring='roc_auc', seed=0, st_rate=0.25, svm_cache=1000, vae=False, vae_beta=1.0, vae_warmup=False, vae_warmup_rate=0.01)
X_train.shape: (80, 200)
Classification task has been skipped.
3. Suppose that we want to reduce the number of dimensions of our data from 200 to 20 using a shallow autoencoder. Note that the `--save_rep` argument will save your representation (the complete representation, not just the training set) under the `/results` folder.
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --ae -dm 20 --save_rep
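A quick way to sanity-check a saved representation is to load it back and inspect its shape. The snippet below first generates a stand-in file so it is self-contained; the actual file written by `--save_rep` lives under the `/results` folder, and its name depends on the model you ran.

```python
import numpy as np
import pandas as pd

# Stand-in for a representation saved by --save_rep: all 80 samples
# encoded into 20 latent dimensions (the real file lives under /results
# and its name depends on the model you ran).
np.savetxt("example_rep.csv", np.random.default_rng(0).random((80, 20)),
           delimiter=",")

# Load the representation and confirm every sample was encoded.
rep = pd.read_csv("example_rep.csv", header=None)
print(rep.shape)  # (80, 20)
```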
4. Suppose that we want to use a deep autoencoder with two hidden layers of 100 and 40 units, respectively, and a latent layer of size 20. We are going to inspect the structure of the deep autoencoder first (the `--no_trn` argument skips training).
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --ae -dm 100,40,20 --no_trn
It looks fine. Now, run the model and get the learned representation.
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --ae -dm 100,40,20 --save_rep
5. We can try a *variational autoencoder* and a *convolutional autoencoder* as well. Note that you can see a detailed description of each argument by using the `-h` argument.
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --vae -dm 100,20 --save_rep
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv --cae -dm 100,50,1 --save_rep
1. Copy your data file and label file under the `/data` directory. Your data file should be in comma-separated values (CSV) format without header and index, where each row represents a sample and each column represents a microbe. Your label file should contain a binary value (0 or 1) on each line, and its number of lines should equal the number of rows in your data file. We are going to assume that your data file is named `UserDataExample.csv` and your label file `UserLabelExample.csv`, both of which are already provided.
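A matching label file can be produced the same way; here is a sketch assuming 80 samples and a hypothetical file name `MyLabels.csv`:

```python
import numpy as np

# Hypothetical binary labels for an 80-sample data file: one 0 or 1 per
# line, with exactly as many lines as the data file has rows.
labels = np.random.default_rng(0).integers(0, 2, size=80)
np.savetxt("MyLabels.csv", labels, fmt="%d")

# Verify: 80 lines, each either "0" or "1".
with open("MyLabels.csv") as f:
    lines = f.read().splitlines()
print(len(lines), set(lines) <= {"0", "1"})  # 80 True
```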
2. Check that your data can be successfully loaded and verify its shape with the following command.
~$ python DM.py -r 1 --no_clf -cd UserDataExample.csv -cl UserLabelExample.csv
Our data `UserDataExample.csv` consists of 80 samples, each of which has 200 features. The data will be split into a training set and a test set (in an 8:2 ratio). The output will show the number of rows and columns of each set.
Namespace(act='relu', ae=False, ae_lact=False, ae_oact=False, aeloss='mse', cae=False, custom_data='UserDataExample.csv', custom_data_labels='UserLabelExample.csv', data=None, dataType='float64', data_dir='', dims='50', max_epochs=2000, method='all', no_clf=True, no_trn=False, numFolds=5, numJobs=-2, patience=20, pca=False, repeat=1, rf_rate=0.1, rp=False, save_rep=False, scoring='roc_auc', seed=0, st_rate=0.25, svm_cache=1000, vae=False, vae_beta=1.0, vae_warmup=False, vae_warmup_rate=0.01)
X_train.shape: (64, 200)
y_train.shape: (64,)
X_test.shape: (16, 200)
y_test.shape: (16,)
Classification task has been skipped.
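The 8:2 split itself can be sketched as follows (an illustration of the partitioning on random data, not DeepMicro's own code):

```python
import numpy as np

# 80 random samples with 200 features and binary labels (illustration only).
rng = np.random.default_rng(0)
X = rng.random((80, 200))
y = rng.integers(0, 2, 80)

# Shuffle the sample indices, then keep the first 80% for training.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))  # 64 training samples, 16 test samples
X_train, y_train = X[idx[:split]], y[idx[:split]]
X_test, y_test = X[idx[split:]], y[idx[split:]]

print(X_train.shape, X_test.shape)  # (64, 200) (16, 200)
```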
3. Suppose that we want to apply the SVM algorithm directly to our data without representation learning. Remove the `--no_clf` argument and specify the classification method with the `-m svm` argument (if you don't specify a classification algorithm, all three algorithms will be run).
~$ python DM.py -r 1 -cd UserDataExample.csv -cl UserLabelExample.csv -m svm
The result will be saved under the `/results` folder as `UserDataExample_result.txt`. The resulting file will grow as you conduct more experiments.
4. You can learn a representation first, and then apply the SVM algorithm to the learned representation.
~$ python DM.py -r 1 -cd UserDataExample.csv -cl UserLabelExample.csv --ae -dm 20 -m svm
4.1. You can reload a stored representation, and then apply the SVM algorithm to it.
~$ python DM.py -r 1 -cd UserDataExample.csv -cl UserLabelExample.csv --load_rep results/PCA_UserDataExample_rep.csv -m svm
5. You can repeat the same experiment by changing the seed for the random partitioning of the training and test sets. Suppose we want to repeat the classification task five times. You can do so by putting 5 into the `-r` argument.
~$ python DM.py -r 5 -cd UserDataExample.csv -cl UserLabelExample.csv --ae -dm 20 -m svm
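What `-r 5` amounts to can be sketched as five runs of the same experiment, each with a different seed for the train/test partition. The snippet below mimics this with scikit-learn on random data (an illustration only, not DeepMicro's internal code; the ROC AUC scoring matches the default `scoring='roc_auc'` seen in the Namespace output above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Random stand-in data: 80 samples, 200 features, binary labels.
rng = np.random.default_rng(0)
X = rng.random((80, 200))
y = rng.integers(0, 2, 80)

aucs = []
for seed in range(5):
    # Each repeat re-partitions the data with a different seed.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = SVC(probability=True, random_state=seed).fit(X_tr, y_tr)
    aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))

print(f"mean AUC over 5 repeats: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
```

On random data the mean AUC will hover around chance level; with real data and labels, the spread across repeats indicates how sensitive the result is to the particular train/test partition.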