Personal code for principal component analysis (PCA) and diffusion map examples. It was written to test both methods on some well-known types of data, but it would not take much to modify the source for use with another data set or distance metric.
$ make
This compiles a library containing the classes needed by the main program, and the main program links against it. The main program requires json-fortran. The library requires LAPACK to calculate the eigenvectors and eigenvalues of various matrices.
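To illustrate the kind of eigenvalue problem the library hands to LAPACK, here is a minimal PCA sketch in Python rather than the repository's Fortran. It uses NumPy's `eigh`, which itself calls LAPACK. The function name `pca` and the test data are hypothetical and only for illustration.

```python
import numpy as np

def pca(data):
    """Eigen-decompose the covariance matrix of data (n_samples, n_features).

    Returns eigenvalues in descending order and the matching eigenvectors
    as columns.
    """
    centered = data - data.mean(axis=0)       # remove the mean of each feature
    cov = np.cov(centered, rowvar=False)      # (n_features, n_features)
    evals, evecs = np.linalg.eigh(cov)        # symmetric solver, ascending order
    order = np.argsort(evals)[::-1]           # largest variance first
    return evals[order], evecs[:, order]

# Hypothetical test data: a point cloud stretched along the x axis.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3)) * np.array([5.0, 1.0, 0.2])

evals, evecs = pca(points)
# Project onto the two components with the greatest variance.
projected = (points - points.mean(axis=0)) @ evecs[:, :2]
```

The first column of `evecs` is the axis of greatest variance, which for this test data lies close to the x axis.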
Modify dmap.json, then run:
$ ./run dmap.json
You can also run principal component analysis using the following file:
$ ./run pca.json
bandwidth.json is for running the program iteratively over different bandwidth
values. See Figure S1 in this document for what I was going for with this. This
would be more helpful for analyzing simulation data, but the main program is not
set up for that.
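As a rough illustration of what a scan over bandwidths can look like (this is not the program's code), the sketch below uses one heuristic sometimes applied to diffusion maps: sum all entries of the Gaussian kernel matrix for each bandwidth and look for the range where the log of the sum grows roughly linearly with the log of the bandwidth. Whether the referenced figure shows exactly this quantity is an assumption, as are the kernel form `exp(-d^2 / bandwidth)`, the name `kernel_sum`, and the random test points.

```python
import numpy as np

def kernel_sum(dist, bandwidth):
    # Assumed kernel form: exp(-d^2 / bandwidth), bandwidth not squared.
    return np.exp(-dist**2 / bandwidth).sum()

# Hypothetical test data: 100 random points in the plane.
rng = np.random.default_rng(2)
points = rng.normal(size=(100, 2))
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))        # pairwise Euclidean distances

# Scan bandwidths over several orders of magnitude.
bandwidths = np.logspace(-3, 3, 13)
sums = np.array([kernel_sum(dist, b) for b in bandwidths])

for b, s in zip(bandwidths, sums):
    print(f"{np.log(b):8.3f}  {np.log(s):8.3f}")
```

For very small bandwidths the sum approaches the number of points (only the diagonal survives), and for very large bandwidths it approaches the number of points squared.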
The extras folder contains the source code of two programs to aid in generating
example data sets. No configuration files are provided, so you will need to edit
the source.
A few examples using this program.
Compare the swiss roll and punctured sphere results with those found in this
paper, specifically in Section 3.1. Note that my value of bandwidth is the
square of what they call sigma (I am not squaring the denominator of the
Gaussian kernel in my code).
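The relationship between bandwidth and sigma can be made concrete with a minimal diffusion map sketch in Python. This is not the repository's Fortran code. The exact kernel form is an assumption: the sketch uses `exp(-d^2 / bandwidth)` with no constant factor in the denominator, and the function names and test data are hypothetical.

```python
import numpy as np

def gaussian_kernel(dist, bandwidth):
    # Assumed form: the denominator is the bandwidth itself, not squared.
    return np.exp(-dist**2 / bandwidth)

def gaussian_kernel_sigma(dist, sigma):
    # Same kernel written with a squared sigma in the denominator.
    return np.exp(-dist**2 / sigma**2)

def diffusion_map(dist, bandwidth, n_components=2):
    kernel = gaussian_kernel(dist, bandwidth)
    row_sums = kernel.sum(axis=1)
    # D^-1/2 K D^-1/2 is symmetric and shares its eigenvalues with the
    # row-normalized Markov matrix D^-1 K.
    sym = kernel / np.sqrt(np.outer(row_sums, row_sums))
    evals, evecs = np.linalg.eigh(sym)
    order = np.argsort(evals)[::-1]
    evals, evecs = evals[order], evecs[:, order]
    # Convert back to right eigenvectors of the Markov matrix.
    psi = evecs / np.sqrt(row_sums)[:, None]
    # Skip the trivial first eigenvector (eigenvalue 1, constant entries).
    return evals[1:n_components + 1], psi[:, 1:n_components + 1]

# Hypothetical test data: 200 random points in 3D.
rng = np.random.default_rng(1)
points = rng.normal(size=(200, 3))
diff = points[:, None, :] - points[None, :, :]
dist = np.sqrt((diff**2).sum(axis=-1))

bandwidth = 4.0
sigma = np.sqrt(bandwidth)                    # bandwidth = sigma^2
same = np.allclose(gaussian_kernel(dist, bandwidth),
                   gaussian_kernel_sigma(dist, sigma))
evals, coords = diffusion_map(dist, bandwidth)
```

Under these assumptions, a bandwidth of 4.0 here corresponds to a sigma of 2.0 in the squared-denominator convention.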
Colors indicate where points lie relative to the axis of greatest variance.
Colors indicate original cluster.
Colors indicate where points lie relative to the center of the swiss roll.
Colors indicate where points lie relative to the axis that goes through the holes in the sphere.
The original data is from a molecular dynamics simulation I performed of a single octane molecule in water. I used the RMSD between each pair of simulation snapshots of the octane as the distance metric for the diffusion map calculation (1,000 snapshots total). For the principal component analysis I used the dihedral angles as the input coordinates. The colors indicate the radius of gyration of the octane. Compare these results with Figure S2.C from this paper's SI (PDF).
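A minimal sketch of a pairwise RMSD matrix and a radius of gyration calculation, in Python rather than the Fortran used in the repository. It assumes the frames have already been fitted to a common reference, so no per-pair alignment is done, and it weights all atoms equally in the radius of gyration. The array shapes, function names, and random coordinates are hypothetical.

```python
import numpy as np

def pairwise_rmsd(frames):
    """RMSD between every pair of frames.

    frames: (n_frames, n_atoms, 3), assumed already fitted to a reference.
    """
    n_frames, n_atoms, _ = frames.shape
    flat = frames.reshape(n_frames, -1)
    sq = (flat**2).sum(axis=1)
    # |a - b|^2 = |a|^2 + |b|^2 - 2 a.b, computed for all pairs at once.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (flat @ flat.T)
    np.maximum(d2, 0.0, out=d2)               # clamp rounding noise below zero
    np.fill_diagonal(d2, 0.0)                 # a frame is identical to itself
    return np.sqrt(d2 / n_atoms)

def radius_of_gyration(frames):
    # Unweighted radius of gyration (equal masses assumed).
    center = frames.mean(axis=1, keepdims=True)
    return np.sqrt(((frames - center)**2).sum(axis=2).mean(axis=1))

# Hypothetical stand-in for a trajectory: 50 frames of 8 atoms.
rng = np.random.default_rng(3)
frames = rng.normal(size=(50, 8, 3))

rmsd = pairwise_rmsd(frames)                  # distance matrix for the diffusion map
rg = radius_of_gyration(frames)               # one value per frame, usable for coloring
```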
The alkane branch has the modified code that performs these calculations. The
original simulation trajectory is too large to post here. To reproduce the data,
use this input file with GROMACS and run the simulation. Then use gmx trjconv
to remove the octane's translational and rotational motion by fitting, saving
only the octane's coordinates. Use the output coordinate file (xtc) as the input
for this analysis. By default the simulation will output 10,000 frames, so you
may want to reduce this for the diffusion map analysis, since it is very memory
intensive.
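To get a feel for the memory cost, the sketch below estimates the size of one dense frame-by-frame matrix and shows a simple way to thin the frames. It assumes 8-byte double precision values and a single stored matrix; the actual program may hold several such matrices at once. The function name and the stride of 10 are hypothetical.

```python
def dense_matrix_bytes(n_frames, bytes_per_value=8):
    # One dense n_frames x n_frames matrix (double precision assumed).
    return n_frames * n_frames * bytes_per_value

for n in (1_000, 10_000):
    print(f"{n:6d} frames: {dense_matrix_bytes(n) / 1e6:8.1f} MB per matrix")

# Keeping every 10th frame of a 10,000 frame trajectory leaves 1,000 frames.
stride = 10
kept = list(range(0, 10_000, stride))
```

With these assumptions, 10,000 frames need about 800 MB per matrix and 1,000 frames need about 8 MB. The thinning can also be done when writing the trajectory, since gmx trjconv has a -skip option that writes only every n-th frame.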