TODO
- ERROR? ModalDecisionTrees.entropy([1,2],2) == ModalDecisionTrees.entropy([0,2],2)
- Question: without pruning, must purity always increase?
entl = ModalDecisionTrees.entropy([0,40],40)*40  # left child entropy, weighted by its 40 instances
entr = ModalDecisionTrees.entropy([1,3],4)*4     # right child entropy, weighted by its 4 instances
ent = ModalDecisionTrees.entropy([1,43],44)      # parent entropy (44 instances)
newent = (entl + entr)/44                        # weighted mean of the children's entropies
purity = -ent
newpurity = -newent                              # purity increases iff newent < ent
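A minimal sanity-check sketch, assuming the (counts, total) signature used above (this is NOT the package's implementation):
    # Hedged sketch: plain Shannon entropy from class counts.
    function shannon_entropy(counts, n)
        e = 0.0
        for c in counts
            c == 0 && continue
            p = c / n
            e -= p * log(p)
        end
        return e
    end
    shannon_entropy([1,2], 3)   # ≈ 0.6365, while shannon_entropy([0,2], 2) == 0.0
    # (note that the ERROR? line above passes n=2 while [1,2] sums to 3)
    # For the split above: parent ≈ 0.1084, weighted children ≈ 0.0511. By the
    # concavity of entropy the weighted child entropy never exceeds the parent's,
    # so purity cannot decrease at a split.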
Fixes:
☐ Be careful! parse_tree parses 1.65e-8 as 1.65.
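    # Hedged sketch of a likely fix: a numeric regex with an exponent part, so
    # "1.65e-8" is not cut at "1.65" (assumes parse_tree tokenizes numbers via regex).
    const FLOAT_RE = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
    parse(Float64, match(FLOAT_RE, "1.65e-8").match)   # 1.65e-8, not 1.65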
☐ Add tests for AbstractTrees: `using GraphRecipes, Plots; plot(TreePlot(dtree), method = :tree, nodeshape = :ellipse)`
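A hedged sketch of such a test, assuming the AbstractTrees interface is implemented for dtree's nodes:
    using AbstractTrees, Test
    # a pre-order traversal should visit at least the root,
    # and every node of a binary decision tree has 0 or 2 children
    @test !isempty(collect(PreOrderDFS(dtree)))
    @test all(n -> length(collect(children(n))) in (0, 2), PreOrderDFS(dtree))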
☐ Entropy: for regression, fix the variance-based loss and test it; finally understand how entropy works, and fix it!!! (check test-entropy.txt)
Cool ideas:
☐ Create package from scratch with PkgTemplates!
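A hedged sketch of the PkgTemplates flow (user name and plugin choice are placeholders):
    using PkgTemplates
    t = Template(;
        user = "github-username",   # placeholder: the actual GitHub account
        plugins = [Git(), GitHubActions(), Codecov(), Documenter{GitHubActions}()],
    )
    t("ModalDecisionTrees")   # generates the package skeleton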
☐ MLJ and ScikitLearn interfaces; restore optimizations for edge cases (remember optimize_tree_parameters!?)
☐ Smart MLJ interface with hyperparameter search and/or a data-caching method (for training many symbolic models on the same data)
☐ MLJ: Create non-deterministic DecisionTree model
☐ INTERFACE: build_forest: add an oob_error parameter (Val{Bool}?) (note that OOB computation is expensive)
☐ MODEL: Bagging and boosting. AdaBoost code: n_iterations defaulted to 10; model, coeffs = build_adaboost_stumps(train_labels, train_features, n_iterations); predictions = apply_adaboost_stumps(model, coeffs, test_features). [XGBoost](https://towardsdatascience.com/ensemble-learning-bagging-and-boosting-explained-in-3-minutes-2e6d2240ae21)
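A usage sketch mirroring DecisionTree.jl's AdaBoost API quoted above (train_labels, train_features, test_features are placeholders):
    using DecisionTree
    n_iterations = 10   # the default number of boosting rounds mentioned above
    model, coeffs = build_adaboost_stumps(train_labels, train_features, n_iterations)
    predictions = apply_adaboost_stumps(model, coeffs, test_features)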
☐ add conversion rules or promotion rules from ModalDataset to MultiDataset: https://docs.julialang.org/en/v1/manual/conversion-and-promotion/#Constructors-that-don't-return-instances-of-their-own-type
☐ minified fwd structures: once the fwd is computed, map each unique value to a UInt8/UInt16 value while maintaining the relative order. Once a tree is learnt, remap the values back to the old ones. Tables contain a lot of redundancy: perhaps use PooledArrays.jl, IndirectArrays.jl, or CategoricalArrays.jl for saving space, both for saving to file AND for learning (might be beneficial for caching)?
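A hedged sketch of the order-preserving remapping (minify is an illustrative name):
    # map each unique fwd value to a UInt8/UInt16 code, preserving relative order;
    # vals[Int(code)] recovers the original value after learning
    function minify(fwd::AbstractArray{T}) where {T}
        vals = sort!(unique(fwd))
        U = length(vals) <= typemax(UInt8) ? UInt8 : UInt16
        code = Dict{T,U}(v => U(i) for (i, v) in enumerate(vals))
        return map(v -> code[v], fwd), vals
    end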
☐ feature-importance.jl: mean decrease in impurity https://medium.com/the-artificial-impostor/feature-importance-measures-for-tree-models-part-i-47f187c1a2c3
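A hedged sketch of mean decrease in impurity; isleaf, feature, nsamples, impurity, and children are hypothetical node accessors, not this package's API:
    function mdi!(imp::Dict{Int,Float64}, node, ntotal::Int)
        isleaf(node) && return imp
        # impurity removed by this split, weighted by the node's sample count
        drop = impurity(node) * nsamples(node)
        for c in children(node)
            drop -= impurity(c) * nsamples(c)
            mdi!(imp, c, ntotal)
        end
        imp[feature(node)] = get(imp, feature(node), 0.0) + drop / ntotal
        return imp
    end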
☐ DATA: Instances with different channel sizes. Find an implementation of a multi-dimensional array where one can specify that, along one axis, subarrays have non-uniform size?! Highlight the difference between channelsize and maxchannelsize; define a non-uniform DimensionalDataset using ArraysOfArrays https://juliaarrays.github.io/ArraysOfArrays.jl/stable/; then fwd can simply be a gigantic table with Inf and -Inf for non-existing worlds!
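A hedged sketch of the ±Inf-padding idea (whether Inf or -Inf is the right sentinel depends on the min/max aggregator):
    function pad_to_maxchannelsize(instances::Vector{Matrix{Float64}}, sentinel = Inf)
        maxr = maximum(size(m, 1) for m in instances)
        maxc = maximum(size(m, 2) for m in instances)
        out = fill(sentinel, maxr, maxc, length(instances))   # maxchannelsize table
        for (i, m) in enumerate(instances)
            out[axes(m, 1), axes(m, 2), i] .= m               # copy the real channel
        end
        return out
    end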
☐ add a parameter controlling how many thresholds are (randomly) considered at each node. Check that this optimization doesn't degrade performance and that it improves training times
☐ test: create a proper test suite, and test reproducibility of the results
☐ min_samples_at_leaf should be treated as any other pruning parameter; certify that an algorithmic change won't affect performance, and make the change
☐ Rule extraction from DTs and RFs + compute each rule's conjunct metrics
☐ posthoc.jl: add Reduced Error Pruning and/or bottom-up pruning (i.e., from the leaves until no change), whereas now there's only top-down pruning (i.e., prune as soon as any pruning condition is met)?
Possible performance improvements:
☐ Check whether Interval2D dimensions are to be swapped
☐ Check whether @views in interpret_world improves the performance
☐ Check that we are mostly using concrete types in structs!
☐ CODE: Check type-stability with code_warntype!
- https://docs.julialang.org/en/v1/devdocs/reflection/#Intermediate-and-compiled-representations
- https://www.johnmyleswhite.com/notebook/2013/12/06/writing-type-stable-code-in-julia/ e.g. type-stable slicing with JuliennedArrays (from https://discourse.julialang.org/t/way-to-slice-an-array-of-unknown-number-of-dimensions/27246 ), EllipsisNotation https://github.com/ChrisRackauckas/EllipsisNotation.jl. Use @code_warntype as in https://discourse.julialang.org/t/why-selectdim-is-type-instable/25271/2 to double-check each function https://nextjournal.com/jbieler/adding-static-type-checking-to-julia-in-100-lines-of-code/
☐ CODE: try improving performance with @inbounds / @propagate_inbounds (and @boundscheck): "In general, bounds-check removal should be the last optimization you make after ensuring everything else works, is well-tested, and follows the usual performance tips." https://stackoverflow.com/questions/38901275/inbounds-propagation-rules-in-julia — in ExplicitModalDataset and when accessing large dicts/arrays
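A hedged micro-example of the pattern (not code from this package): validate once at the boundary, then elide checks in the hot loop:
    function sum_column(A::AbstractMatrix{Float64}, j::Integer)
        checkbounds(A, :, j)             # one explicit check up front
        s = 0.0
        @inbounds for i in axes(A, 1)    # i provably in range: checks elided
            s += A[i, j]
        end
        return s
    end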
Half-ideas:
☐ pruning: what about instance weights?
☐ Label: a label could also be Nothing?
☐ expand code so that the FWD parameter in the construction of ExplicitModalDataset is used
☐ Perhaps, min_samples_leaf, min_purity_increase & max_purity_at_leaf should be used to select the best_optimal decision that satisfies them, not for rejecting the best_optimal decision. This, however, breaks the pruning-based code optimizations.
☐ enumAccessibles(set) is not needed?!
☐ Perhaps use MLJ.ConfusionMatrix?
Renamings:
☐ pick one between samples & instances
############################################################################################
☐ CODE: adhere to style https://github.com/invenia/BlueStyle
☐ CODE: Add proper comments, docs and authors
☐ CODE: Fix naming style (e.g., the variables/features naming; order of structs' members)
☐ CODE: Custom pretty printing, instead of display_. Figure out print/show/display; see https://docs.julialang.org/en/v1/manual/types/#man-custom-pretty-printing and https://schurkus.com/2018/01/05/julia-print-show-display-dump-what-to-override-what-to-use/
☐ CODE: Improve output/logging
☐ CODE: DTree->ModalTree; DForest->ModalForest.
############################################################################################
☐ MODEL: Model may not have an answer? Problem with argmax when testing (e.g. RF), especially on unbalanced datasets. Better to randomly pick one of the best answers, then?
☐ MODEL: attach more info to the models: algorithm used, full namedtuple of training arguments, pre-processing pipeline. Remove cm and oob_error from random forest? OR also attach a structure that measures training performance. See what other libraries do
☐ MODEL: add info gain of the split to each internal node, purity to each node/leaf
☐ MODEL: [almost done!] Add tree parser and convenience methods for tree editing.
☐ MODEL: introduce post pruning (improve pre-pruning/post-pruning against a dataset and introduce validation/test set distinction)
############################################################################################
☐ MODALLOGIC.JL: OPTIMIZE: enumAccessibles and others for sets. But wait: maybe they are neither that useful nor needed?
☐ RCC3 (DC + PP + everything-else)
☐ MODALLOGIC.JL Add TESTs for enumAcc optimizations
############################################################################################
☐ DATA: _split!: enable learning in implicit form (allow training directly on InterpretedModalDataset) and study parallelization.
☐ DATA: Create class for labeled datasets, and remove slice_mf_dataset.
☐ ? DATA: test round_dataset_to_datatype/mapArrayToDataType in multimodal version
☐ DATA: Clean dataset creation pipeline
☐ DATA: add filters (e.g., derivatives, Edge detection, which is spatial https://juliahub.com/docs/ImageEdgeDetection/5h14T/0.1.0/#ImageEdgeDetection.jl-Documentation)!
☐ DATA: give names to features (e.g. create a meta-feature that gives features a name)
☐ DATA: the feat modal dataset and the matricial dataset may have different types. Use Base.return_types and promote_type to infer them? Attach to each feature its return type? IntervalFWD could have thresholds of a different type than the original dataset
############################################################################################
☐ INTERFACE: Support for learning on dataframes?
☐ INTERFACE: Add a symbol parameter for selecting a trade-off between time and space
############################################################################################
☐ function is_well_formed(tree::DTree), checking the consistency between the supporting instances of different nodes
☐ metrics: G-mean(precision,recall)
☐ length(unique(x)) == 1 -> allequal(x) in Julia 1.8
☐ what to do with NaNs? Perhaps a NaN means that a world must be ignored??? In that case, one could easily design features that ignore small worlds!
☐ Check whether Catwalk.jl can help with runtime-dispatch lookup time
☐ OPTIMIZE: yield is said to be fast with Continuables.jl. https://schlichtanders.github.io/Continuables.jl/dev/
☐ Decision Tree Lookahead (e.g. two levels of decisions in one)
☐ OPTIMIZE: Test whether it's better to use thresholds[:,i_instance] = map(aggr->aggr(values), aggregators)
☐ OPTIMIZE: check whether threading is actually better ( JULIA_EXCLUSIVE=1 julia -i -t30 scovid.jl -f 2>&1 | tee scan-covid-t30.out; JULIA_EXCLUSIVE=1 julia -i -t20 scovid.jl -f 2>&1 | tee scan-covid-t20.out; JULIA_EXCLUSIVE=1 julia -i -t16 scovid.jl -f 2>&1 | tee scan-covid-t16.out; JULIA_EXCLUSIVE=1 julia -i -t12 scovid.jl -f 2>&1 | tee scan-covid-t12.out; JULIA_EXCLUSIVE=1 julia -i -t8 scovid.jl -f 2>&1 | tee scan-covid-t8.out; JULIA_EXCLUSIVE=1 julia -i -t4 scovid.jl -f 2>&1 | tee scan-covid-t4.out ) &
☐ Profiling: @profile to certify the memory bottleneck
☐ OPTIMIZE: Test hybrid memoization: use memoization only x% of the times
############################################################################################
☐ Generalize World as a tuple of parameters ( https://stackoverflow.com/questions/40160120/generic-constructors-for-subtypes-of-an-abstract-type )
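A hedged sketch of the idea (GenericWorld is an illustrative name, not this package's type):
    struct GenericWorld{T<:Tuple}
        params::T
    end
    GenericWorld(args...) = GenericWorld(args)
    # e.g. an Interval-like world: GenericWorld(1, 4); Interval2D-like: GenericWorld(1, 4, 2, 5)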
☐ Generalize to beam-search and then beam-search-n?
☐ Use streaming algorithms for the computation of the variance and of the entropy!!! Actually entropy only uses O(n_classes) memory, while variance is more problematic https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
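A sketch of Welford's one-pass variance (O(1) memory), following the Wikipedia page linked above:
    function streaming_variance(xs)
        n = 0; mean = 0.0; m2 = 0.0
        for x in xs
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)   # uses the updated mean
        end
        return m2 / (n - 1)            # sample variance; m2 / n for population
    end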
☐ In general, remove all the gamma infrastructure. But maybe in the future it will make for useful code for an edge case? In that case, make sure it can't be cache-efficiently parallelized. https://github.com/JuliaArrays/TiledIteration.jl
☐ CHECKS: Add checks for enumAcc optimizations: compute one without optimizations and one with optimizations and check that they are equivalent
☐ OPTIMIZE: Try running with: --compile=min -O0, and then with -O3
☐ OPTIMIZE: write specific code for unpruned tree (essentially, skip checks at leaves)
☐ OPTIMIZE: compare IA3 and IA7 with IA
☐ Future: OPTIMIZE: Use Sys.total_memory() and Sys.free_memory() to limit load https://stackoverflow.com/questions/42599273/get-system-memory-information-from-julia#42610074
☐ Estimate best_threshold by interpolation of the loss?
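A hedged sketch of one way to do it: fit a parabola through three (threshold, loss) pairs and take its vertex as the candidate best_threshold:
    function interpolate_threshold(t1, t2, t3, l1, l2, l3)
        num = (t2 - t1)^2 * (l2 - l3) - (t2 - t3)^2 * (l2 - l1)
        den = (t2 - t1) * (l2 - l3) - (t2 - t3) * (l2 - l1)
        return t2 - num / (2 * den)    # caller should guard den ≈ 0 (collinear losses)
    end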
☐ OPTIMIZE: computation of softened operators takes too much time (optimize at least <A> in the temporal case?). Actually, this is currently not a big deal, if we don't train on InterpretedModalDataset.
☐ For temporal series, one may want timepoints that are scattered along the time axis. See if https://github.com/JuliaArrays/AxisArrays.jl can be of any help here; but note that one may need to know how to handle missings
☐ Optimize RCC5
☐ Parametrize world_types forcing things such as MINIMUM SIZE?
☐ Try adding ProgressMeter? https://github.com/timholy/ProgressMeter.jl
☐ Transducers instead of iterators https://juliafolds.github.io/Transducers.jl/dev/explanation/comparison_to_iterators/#comparison-to-iterators
☐ Try different optimization method: SAM https://arxiv.org/abs/2010.01412
☐ Future: ROC and PR curves in the binary classification context: with 0/1 labels, interpret 1-abs(class_value-score) as the confidence of the prediction (transform binary classification problems into regression problems with labels between 0 and 1?); i) in Random Forest, one could aggregate tree answers in a continuous manner and output confidence values; in this context, ROC and PR curves can be computed; ii) in Decision Tree there are other approaches. Also compute ROC-AUC and the PR curve ("The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets": https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0118432 )
☐ feature-importance.jl: for feature importance, one could first consider the propositional problem with the associated variables (avg V1, max V1, min V1, std V1, avg V2, etc...), and select features using the Mutual Information Score! [https://www.kaggle.com/ryanholbrook/mutual-information]
☐ Expressing the discrete regression case. Currently, Strings and Integers are interpreted as classes, while AbstractFloats are interpreted as numbers; this is what differentiates classification from regression. Perhaps the distinction should be made between Strings and Numbers, that is, the two cases Categorical and Numeric, with Numerics being either Continuous (like Float64) or Discrete (like Integer). But this can be dangerous because, if not coded right, Categorical cases indexed by Integers end up being considered Numerical cases.
Archive:
✔ Decision->Condition (careful: conflicts with Base.Condition) @done (23-06-28 23:26)
✔ MODALLOGIC.JL Synthesis of SoleLogic module @done (23-06-28 23:26)
✔ MODALLOGIC.JL: OPTIMIZE: add enum_acc_reprAggr for relationGlob, and for all worldtypes and relations @done (23-06-28 23:25)
✔ MODALLOGIC.JL: DIMENSIONAL/MODAL difference, An ontology interpreted over an N-dimensional domain gives rise to a Kripke structure/frame, but one can *generalize for a Kripke structure/frame in graph form*. @done (23-06-28 23:25)
✔ MODALLOGIC.JL computePropositionalThreshold, computeModalThreshold, test_decision seem to do similar things. Maybe uniform their style and naming @done (23-06-28 23:25)
✔ MODALLOGIC.JL Add a generalized solution for unions of relations: struct _UnionOfRelations{T<:NTuple{N,<:Relation} where N} <: Relation end;. _UnionOfRelations{typeof((RelationId,))} "A relation can be defined as a union of other relations. In this case, thresholds can be computed by maximization/minimization of the thresholds referred to the relations involved." (so you can rewrite RCC5 from RCC8 relations and RCC8 from IA, maybe it's cleaner in some cases?) @done (23-06-28 23:25)
✔ ? MODALLOGIC.JL: also remove enum_acc_repr. @done (23-06-28 23:25)
✔ ? MODALLOGIC.JL: Remove n_relations()? @done (23-06-28 23:25)
✔ CODE: Keep this: throw, error(), @error -> error() @done (23-06-21 16:16)
✔ n_instances -> ninstances @done (22-11-28 16:51)
✔ n_frames -> nmodalities @done (22-11-28 16:51)
✔ n_features -> nfeatures @done (22-11-28 16:51)
✔ use ProgressMeter: @showprogress @done (22-07-14 18:23)
✔ getCanonicalCondition->getFeature @done (22-05-20 22:48)
✔ Functio-symbolic: tree partitions according to logical rules, but occasionally on math relations (e.g., on leaves) @done (22-05-20 22:39)
✔ SCANNER.JL: different modalities with different ontologies: @done (22-02-05 23:59)
✔ SCANNER.JL: Train with less stringent parameters, and post-prune @done (22-02-05 23:59)
############################################################################################
✔ Add the use of an unbalanced test set, and make ConfusionMatrix compute the average recall @done (21-12-03 00:37)
✔ Write optimized ArrayOfOnes struct for W. Someone did this already and it looks ok: https://github.com/emmt/StructuredArrays.jl . Also try: https://github.com/JuliaArrays/FillArrays.jl @done (21-10-08 00:24)
✔ dataset-loading functions should tell about how they balanced, and how they randomized @done (21-10-07 23:35)
✔ Parallelize tree construction? Does this require manual copy of dynamic structures used in _split! ? @done (21-10-07 23:25)
✔ Add IA3 relations @done (21-10-07 23:24)
✔ Refactor and distinguish between testoperators and functions; and opt/aggregator/polarity @done (21-10-07 23:19)
✔ Add IA7 relations: Complete by writing enumAccBare, enumAcc, enum_acc_repr. @done (21-10-07 23:18)
✔ multi-modal extension! @done (21-10-07 23:16)
✔ rename _ninstances -> n_instances @done (21-10-07 23:14)
✔ clean modallogic.jl @done (21-05-06 12:55)
✔ timeit -> symbol @done (21-05-06 10:51)
✔ Clean InitialCondition and worldType code @done (21-05-06 10:36)
✔ check that missing is not in X or Y @done (21-05-06 10:18)
✔ Play with parameters: @done (21-04-03 20:04)
✔ Preliminary scan on logic-agnostic parameters @done (21-03-19 15:05)
✔ Fix purity=-entropy and losses @done (21-03-15 14:55)
✔ Add name of the labels to a dataset and confusion matrix. @done (21-03-15 12:14)
✔ Create RCC5 ontology by joining n/tpp(i) into pp(i) (proper part) and dc/ec into dr (disjointness). @done (21-03-15 11:58)
✔ Add Error catching+repeat with different log level, and keep going @done (21-03-15 11:58)
✔ Extend to >=alpha and <alpha; these operators are ordered, closed/not closed; maybe others are categorical, in general some are. Fix </>= thing. Fix is_closed/polarity @done (21-03-15 10:55)
✔ Fix 3 pre-pruning stop conditions: a) purity>purity_threshold; b) min_sample_{leaf,split}>=threshold{absolute,relative}; c) purity gain < min_purity_gain. What if purity is just card(max_class)/card(all classes)? Why (best_purity / nt)? If ... maybe min_purity_increase needs to become min_info_gain @done (21-03-14 23:03)
✔ Calculate per-class accuracy. @done (21-03-12 17:10)
✔ Create a dataset with many but tiny worlds. Even a completely random one. @done (21-03-12 14:07)
✔ > doubles the time but doesn't seem to improve performances. Is this a bug? Try to come up with a dataset that shows if it works. If it's not a bug, maybe we should consider parametrizing on whether this is to be used or not. @done (21-03-12 14:07)
✔ Rename TestOpLes to canonical_leq @done (21-03-12 14:05)
✔ Fix compute-threshold and enum_acc_repr so that it works with soft test operator as well! @done (21-03-12 14:03)
✔ Figure out why we reach "Uninformative split." when using Les only @done (21-03-10 17:21)
✔ Test 5x5 @done (21-03-08 00:59)
✔ Now move the extremes computation to the outer algorithm scope, so that it happens BEFORE the whole computation. @done (21-03-08 00:59)
✔ Improve the computation of the extremes leveraging the structure of IA Frames. @done (21-03-08 00:59)
✔ Soften > and <= to be >alpha and <=alpha. Make alpha an array to iterate over. Find an efficient way to compute it (note that with a few elements in the world, it can be made efficient with integer operations). @done (21-03-08 00:58)
✔ enumAcc to only use channelsize and not the whole channel @done (21-03-07 16:44)
✔ Optimize/fix topological enumerators. There may be some errors as well with interval relations (think of the TPPi thingy, that maybe happens elsewhere) @done (21-02-10 17:41)
✔ Fix the new optimization thingy. It fails, maybe need to swap min/max. @done (21-02-05 18:06)
✔ Parametrize on the test operators @done (21-02-02 17:52)
✔ Test topological manually @done (21-02-02 14:57)
✔ Fix print_world @done (21-02-02 14:07)
✔ Check the speedup with/without inbounds (3x3 and 5x5 cases) @done (21-02-02 13:56)
✔ verify new code. (test all datasets (avoid dataset[2] flattened)) @done (21-01-26 18:06)
✔ Test with no InitialCondition @done (21-01-26 18:06)
✔ Try different starting condition: start at the central pixel. @done (21-01-25 00:10)
✔ Note that we need to know the speedup for using the extremes array. Hide computation of the extremes. @done (21-01-24 14:32)
✔ Bring back the extremes, noting that this leads to constant propositional check. @done (21-01-24 14:32)
✔ Add > @done (21-01-23 01:19)
✔ Calculate confusion matrix @done (21-01-22 20:09)
✔ Try the two-dimensional case! @done (21-01-21 01:12)
✔ Use view instead of slicing @done (21-01-18 20:47)
✔ perhaps the domain should not be 20x3x3 but 3x3x20, because Julia is column-first @done (21-01-15 14:55)
✔ TODO const MatricialDomain{T,N} = AbstractArray{T,N} end @done (21-01-15 14:55)