# Scoring function of rarity simulation is borked #114
Thanks for your input! Very unfortunate that you have to deal with those bugs. You are right, … Anyway, now that this is fixed, it's time to talk :)

I think it's hard to get good results for analysing the scoring algorithm with your code example. Keep in mind that we compute the scores based on all memory of the process, the registers and the program counter, so there are a lot of (pseudo-)nondeterministic values in there. To get a discussion going solely based on the different variants of combining the counts into a score, I made a little playground to compare both implemented means: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fcde915148d6240bb2a0bb51566f7372

I tried to show a case where it actually makes a difference whether we compute harmonic or arithmetic means. The basic idea behind all of that is: for the arithmetic mean it's irrelevant if 2 states have 5 equal values or 5 states have 2 equal values, right? For both situations we would get 5 as the score for every state. But on the other side, because … On your idea with … That's my reasoning behind all of that. Hope that makes sense. 😅

@ckirsch do you also have some thoughts on that? Also Marcell @Blazefrost: ideas? Comments?
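The difference between the two means can be made concrete with a small standalone sketch in the spirit of the linked playground (function names are mine, not the crate's actual API): over a state's counter values, the arithmetic mean barely reacts to a single rare count, while the harmonic mean is pulled strongly toward it because of the non-linearity of 1/x.

```rust
// Illustrative sketch, not the project's actual scoring code.
fn arithmetic_mean(counts: &[u64]) -> f64 {
    counts.iter().sum::<u64>() as f64 / counts.len() as f64
}

fn harmonic_mean(counts: &[u64]) -> f64 {
    let n = counts.len() as f64;
    n / counts.iter().map(|&c| 1.0 / c as f64).sum::<f64>()
}

fn main() {
    // A state with one rare counter (1) among common ones (5):
    let mixed = [1, 5, 5, 5, 5];
    // A state with uniformly common counters:
    let common = [5, 5, 5, 5, 5];

    // Arithmetic: 4.2 vs 5.0 -- barely distinguishable.
    println!("arithmetic: {:.2} vs {:.2}", arithmetic_mean(&mixed), arithmetic_mean(&common));
    // Harmonic: ~2.78 vs 5.0 -- the single rare count dominates.
    println!("harmonic:   {:.2} vs {:.2}", harmonic_mean(&mixed), harmonic_mean(&common));
}
```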
We are competing with traditional fuzzers and symbolic execution (SE) engines; keep that in mind. We are probably able to explore more paths than SE but need guidance, which in turn needs to be better than what fuzzers are doing. Considering whole machine states is the promise here. At this stage of the project I suggest we implement what others propose (rarity) as well as other metrics that you think make sense on the source-code level. Ultimately, we need to demonstrate that we can find more bugs faster than any fuzzer or SE engine.
@ChristianMoesl: Thanks for the various fixes! Awesome! Those actually take care of most of the original issues I stumbled over in the scoring function. What remains is just the discussion of the semantic details.

For the sake of completeness I have added my proposed scoring functions to your playground; please see this extended playground for the implementation. I want to emphasize again that I believe … Yes, I totally agree that the non-linear nature of the inverse (1/x) does make a big difference and makes the … And yes, I agree that …

To try to summarize my argument: as long as we agree that …
One side note I forgot to mention are the following two equivalences: …

So in essence my proposal includes the two scoring functions already implemented; it does not alter their relative ordering either, it just inverts the interpretation of the ordering. But the proposed …
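The two equivalences themselves did not survive extraction, but given the claim that the proposal "just inverts the interpretation of the ordering", a plausible reading (my reconstruction, not from the thread) is: with the divisor `n` held constant, as in the current implementation, each proposed starred score is the reciprocal of one of the implemented means, so it ranks states in exactly the opposite order. A sketch under that assumption:

```rust
// Reconstruction sketch (assumed, not the crate's code): returns
// (arithmetic, harmonic, arithmetic*, harmonic*) for one state's
// counter values, all using the same constant divisor n.
fn scores(counts: &[u64], n: f64) -> (f64, f64, f64, f64) {
    let sum: f64 = counts.iter().map(|&c| c as f64).sum();
    let inv_sum: f64 = counts.iter().map(|&c| 1.0 / c as f64).sum();
    (
        sum / n,     // implemented `arithmetic`
        n / inv_sum, // implemented `harmonic`
        inv_sum / n, // proposed `arithmetic*` (with constant n)
        n / sum,     // proposed `harmonic*` (with constant n)
    )
}

fn main() {
    let n = 4.0; // pretend constant (number of states), as currently implemented
    for counts in [vec![1u64, 2, 3], vec![5, 5, 5], vec![2, 2]] {
        let (arithmetic, harmonic, arithmetic_star, harmonic_star) = scores(&counts, n);
        // Each implemented/proposed pair multiplies to 1, i.e. one is a
        // strictly decreasing function of the other: same ordering, flipped.
        assert!((arithmetic * harmonic_star - 1.0).abs() < 1e-9);
        assert!((harmonic * arithmetic_star - 1.0).abs() < 1e-9);
    }
}
```

Note that with the proposed per-state `n` this reciprocal relationship no longer holds exactly; it is what makes the starred variants more than a reinterpretation.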
@mstarzinger I got it! You are right in that we can implement that and measure that. But we should probably have a good idea why that would be interesting. I think what you propose with …

In general I like the idea of biasing scores based on the amount of counts, but I would intuitively like to positively bias the states with more counts (more memory touched). Because if more memory is touched, the program is probably way deeper in the execution path, right? The total amount of words touched is monotonically increasing on one execution path.
# [0.3.0](v0.2.0...v0.3.0) (2021-03-10)

### Bug Fixes

* ordering for harmonic mean (less is always better) ([#114](#114)) ([8ae2be8](8ae2be8))
* **rarity:** use right amount of values for mean calculation ([#114](#114)) ([c365767](c365767))
* broken build due to wrong Selfie repo URL ([883821b](883821b))
* enforce unique input variable names for Z3/Boolector ([#111](#111)) ([a0ed9d4](a0ed9d4))
* swapped handling of --iterations and --selection ([#113](#113)) ([c05c743](c05c743))
* **rarity:** Add warning message to ignored partial read ([43e8297](43e8297))
* **rarity:** detect REMU with 0 remainder as bug ([8e989cb](8e989cb))
* **rarity:** Fix filtering before enumerating skewing scoring counter index ([6e99b15](6e99b15))
* **rarity:** Fix rarity scoring ([ba7d0ed](ba7d0ed))
* **rarity:** implemented score algorithm exactly as in the paper ([afe4e9d](afe4e9d))
* **rarity:** push the raw input values only ([d6448bb](d6448bb))
* **rarity:** update dependencies to be compatible with Selfie ([0e86d8d](0e86d8d))
* **rarity:** Use correct format string flag for Format trait ([13cd96c](13cd96c))
* **rarity:** Use f64 for rarity score instead of u64 ([bd363e1](bd363e1))

### Features

* add coin-flip execution path exploration strategy ([081ec28](081ec28))
* create api for rarity simulation ([10aded4](10aded4))
* implement write syscall in engine ([#20](#20)) ([#109](#109)) ([d2fffde](d2fffde))
* **rarity:** adapt engine file for rarity simulation ([aa39116](aa39116))
* **rarity:** add benchmarks as submodule ([d4cb2be](d4cb2be))
* **rarity:** Add bytewise state comparator ([ef72317](ef72317))
* **rarity:** Add CLI parser config for rarity simulation ([7b9e001](7b9e001))
* **rarity:** CLI option to switch between harmonic and arithmetic mean ([90c4174](90c4174))
* **rarity:** detect bugs during concrete execution ([01ace2a](01ace2a))
* **rarity:** Implement copy-create ratio for newly created states ([00e3f10](00e3f10))
* **rarity:** implement search algorithm (breadth-first) ([09e5d64](09e5d64))
* **rarity:** Use harmonic mean as metric ([f30f41e](f30f41e))
The current version of the scoring function `compute_scores` seems to weigh non-rare states higher than rare ones. Instead of selecting states that explored rare value combinations, it prefers the ones where the value/address count is higher.

Let us consider the following example input. In this example I expect the simulation to fall into the true-branch in about 10% of all runs. Hence incrementing the memory location of `i` will be rarer than incrementing `j` with a high probability. I'd expect rarity simulation to select states that are looping in the true-branch. Note that the looping value of 150 was chosen so that a state needs to survive and be selected for exactly one iteration to reach an exit path (assuming the default step-size of 1000).

This however does not happen (when using the default `harmonic` mean), see the following run. I also dump the scores vector for debugging purposes and to prove that at least one state did fall into the true-branch (two in this case, the ones with the lower score).

When running the same example with the non-default `arithmetic` mean, the rarer state is preferred, because `score_with_mean` returns `Ordering::Less` in this case. See the following run as an example. Again notice the single state with a lower score (which incidentally does get selected this time).

Even though the `arithmetic` mean works out in this example, I still do believe both variants to be somewhat faulty. Here is my understanding of which scoring functions are currently implemented:

* `harmonic`: The harmonic mean over all (non-inverse) counters for a state, but with `n` being the number of states instead of the number of counters in the state.
* `arithmetic`: The arithmetic mean over all (non-inverse) counters for a state, but again with `n` being the number of states instead of the number of counters in the state. This time using `Ordering::Less`, which somewhat hides the issue.

Here is what I think would be a better alternative for what we could try:

* `sum`: The sum of inverse(!) counter values, no mean calculation. I do believe this is the scoring function described in the paper originally; we should at least add it to our portfolio, I think, even if we decide not to make it the default.
* `arithmetic*`: The arithmetic mean of the inverse(!) counter values, but importantly with `n` being the number of counters in the given state, otherwise we are just dividing by a constant and not normalizing with respect to the number of counters (i.e. the number of memory locations with a concrete value) in each individual state.
* `harmonic*`: The harmonic mean of the inverse(!) counter values, again with `n` set like above for the same reason.

Note that all three proposed scores use the intuitive `Ordering::Greater`, because higher scores represent rarer states in all three cases. The inverse was taken on the individual counter values before mean calculation, not by flipping the interpretation of the mean. I do believe this makes a difference in semantics. And it also makes interpretation of the score uniform: higher is rarer, higher is better.

WDYT about this alternative? Or am I completely off base here? All comments welcome!
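The three proposed variants can be sketched as follows (illustrative code, not the crate's actual `compute_scores` implementation; I'm assuming `counts[i]` holds how many states share the i-th (address, value) pair of the state being scored, so smaller counts mean rarer pairs and 1/count rewards rarity):

```rust
// `sum`: sum of inverse counter values, no mean calculation.
fn sum_of_inverses(counts: &[u64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c as f64).sum()
}

// `arithmetic*`: arithmetic mean of the inverses, with n being the
// number of counters in THIS state (not the number of states).
fn arithmetic_star(counts: &[u64]) -> f64 {
    sum_of_inverses(counts) / counts.len() as f64
}

// `harmonic*`: harmonic mean of the inverses, same n. Note that the
// harmonic mean of the values 1/c_i simplifies to n / sum(c_i).
fn harmonic_star(counts: &[u64]) -> f64 {
    counts.len() as f64 / counts.iter().sum::<u64>() as f64
}

fn main() {
    let rare = [1, 2];      // state touching rarely-seen value pairs
    let common = [5, 5, 5]; // state touching widely-shared value pairs
    for score in [sum_of_inverses, arithmetic_star, harmonic_star] {
        // Under all three variants a higher score means a rarer state,
        // so selection can uniformly use Ordering::Greater.
        assert!(score(&rare) > score(&common));
    }
}
```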