-
Notifications
You must be signed in to change notification settings - Fork 1
/
README
204 lines (166 loc) · 8.19 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
OPTICS : single-linkage point ordering
--------------------------------------
OPTICS is an algorithm to detect the single-linkage structure in data
collections. The algorithm detects, for each point, the number of
neighbours within a pre-defined radius 'epsilon' and, if the number
is larger than the predefined threshold 'minPoints', proceeds with the
closest neighbour point, otherwise with the next point in the list of
unprocessed points.
OPTICS does not provide a clustering structure per se, but the
point ordering together with the closest-neighbour information can be
used to extract a cluster structure. The algorithm is useful for large
data sets with overlapping areas of varying point density.
The code has been tested to work for >300000 data points.
The algorithm requires only two input para-
meters: a neighbourhood radius (epsilon) and a minimum
number of neighbour points (MinPts). If we set epsilon to
the largest distance in the dataset then each point is reachable
and this parameter has no influence on the results.
Briefly, the algorithm starts at a random data point, cal-
culates the distance to all points within the neighbourhood
radius (epsilon) and, if at least a minimal number of points
(MinPts) is encountered, it records the nearest neighbour
distance (Reachability Distance) and the smallest radius
that encircles MinPts objects (Core Distance). If less than
MinPts points fall within epsilon, the point is considered as
noise. The algorithm repeats the same procedure for the
nearest neighbour point and proceeds iteratively until all
data points have been visited, thereby generating an
ordered list. Setting epsilon to the largest distance implies that none of
the fragments is labelled 'noise' and each is included in
one cluster at least. This choice allows to scan afterwards
for clusters at any density.
The ordered list of Reachability Distances (RDs) can be drawn as
a comprehensive nearest neighbour distance plot (called Reachability Plot).
The output files are as follows:
- file 'output.dat' contains the ordered list of input data;
given are their location 'dataID' in the input data file, core distance 'CD' and
reachability distance 'RD'.
- file 'cluster.dat' contains the cluster structure of the data;
given are the cluster identifier 'id', the id of the 'parent' cluster,
'start' and 'end' in the ordered list and 'size' of the cluster,
its smallest core distance 'minCD' and the identifier 'minCDid' of that point.
- file 'center.dat' contains the cluster centres;
given are the centre's location 'index', identifier in the ordered list 'order',
cluster identifier 'clusterID', core distance 'CD',
reachability distance 'RD' and coordinates 'x' 'y' 'z'.
References
----------
Users publishing results obtained with the program should acknowledge its use by citation:
- Pandini, A, Fornili, A, Kleinjung, J
Structural alphabets derived from attractors in conformational space.
BMC Bioinformatics 11:97 (20 February), 2010.
Other references:
- Ankerst M, Breunig MM, Kriegel HP, Sander J: OPTICS: Ordering Points
To Identify the Clustering Structure.
SIGMOD Proceedings ACM SIGMOD International Conference on Management of Data,
June 1-3, 1999, Philadelphia, Pennsylvania, USA ACM PressDelis A, Faloutsos C,
Ghandeharizadeh S 1999, 49-60.
- Daszykowski M, Walczak B, Massart DL: Looking for natural patterns in analytical data.
2. Tracing local density with OPTICS.
Journal of chemical information and computer sciences 2002, 42(3):500-507.
Install / Uninstall
-------------------
Please read the general 'INSTALL' instructions.
There are several versions of the program that are invoked
upon compilation with 'configure --enable-data-<version>':
- optics_ang
for input data in angle coordinates
compile using the command './configure --enable-data-ang'
- optics_str
for input data in string format
compile using the command './configure --enable-data-str'
- optics_vec
for input data in vector (of arbitrary length) format
compile using the command './configure --enable-data-vec'
- optics_xyz
for input data in xyz (Cartesian) coordinates
compile using the command './configure --enable-data-xyz'
- optics_dist
for input data as distance matrix
compile using the command './configure --enable-data-dist'
The 'optics_ang' program requires the 'pcre' pattern matching library.
See http://www.pcre.org/ for installation options.
The 'optics_vec' program has several distance metrics, one of which is
enabled by a '#define' instruction at the top of the program.
Run 'configure --help' for more information.
For documentation execute 'doxygen doxygen.cfg' in the 'src' directory.
Documentation files are created in 'doc/html' and 'doc/latex'.
The latex documentation is completed by executing 'make pdf' in the
'doc/latex' directory, which creates 'refman.ps' and 'refman.pdf'.
Usage
-----
optics: perform OPTICS single-linkage point ordering
(C) 2008-2015 by Jens Kleinjung and Alessandro Pandini
This program is free software.
optics [--datafile ...] [OPTIONS ...]
--datafile <filename> (mode: mandatory, type: char, default: input.dat)
--outputfile <filename> (mode: optional, type: char, default: output.dat)
--clusterfile <filename> (mode: optional, type: char, default: cluster.dat)
--centerfile <filename> (mode: optional, type: char, default: center.dat)
--uniquefile <filename> (mode: optional, type: char, default: unique.dat)
--eps <epsilon cutoff> (mode: optional, type: float, default: 2.91)
--minpts <min number of neighbours> (mode: optional, type: int, default: 1)
--w <window size for string version> (mode: optional, type: int, default: 1)
--silent (mode: optional, no argument, default: off)
--cite (mode: optional , type: no_arg, default: off)
--version (mode: optional , type: no_arg, default: off)
--help
Use the configure options described in 'README Install/Uninstall'
to compile a specific verion.
Input formats
-------------
- optics_ang
- optics_str
- optics_vec
The input format is array of vectors of arbitrary length composed
of space separated floats.
- optics_xyz
- optics_dist
- optics_dist
The input format is a triangular matrix of distance values
in float format. See for example the 'tests/dist.dat' file.
The matrix contains the distances of the [i,j] point pairs
[0,1] [0,2] [0,3] [0,4] [0,5]
[1,2] [1,3] [1,4] [1,5]
[2,3] [2,4] [2,5]
[3,4] [3,5]
[4,5]
with values
5 3 2 8 1
7 4 7 2
4 2 4
7 3
2
White space and line breaks are irrelevant for the input format,
only the order of values in the above sense must be correct.
Return values
-------------
0 : Clean termination.
1 : Error.
-------------------------------------------------------------------------------
Copyright (C) 2008-2016 Jens Kleinjung
Copyright (C) 2008-2016 Alessandro Pandini
The development was kindly supported by grants from the
European Science Foundation "Frontiers of Functional Genomics" and
"Marie-Curie Intra-European Fellowships for Career development"
(Call identifier: PEOPLE-2007-2-1-IEF).
Further support was given by the MRC National Institute for Medical Research.
Availability
------------
The program is made available under the GNU Public License for academic
scientific purposes, under the condition that proper acknowledgement
is made to the authors of the program in publications resulting from the use
of the program.
License
-------
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.