Useful malware features #5

So-Cool · 2016-05-20T12:49:41Z

The base set of ML features for binaries analysed by Cuckoo will be inspired by "Reviewer Integration and Performance Measurement for Malware Detection" by B. Miller et al. (available here).
They list a range of static and dynamic binary features, which seems a good starting point for this project:

static attributes:
- binary metadata,
- digital signing,
- heuristic tools,
- packer detection,
- portable executable format,
- static imports;

dynamic attributes:
- dynamic imports, mutexes, processes,
- filesystem operations,
- network operations,
- registry operations,
- Windows API calls.

Once implemented, they should be reviewed and revised with regard to their usability for this project.

So-Cool · 2016-07-28T11:42:45Z

A very basic implementation of the above features is complete. The load_features method in the ML class (modules/cuckooml/cuckooml.py) still requires some enhancements, though. All of them are explained in the comments and marked with a TODO flag.

ghost · 2016-09-28T07:54:06Z

Hi So-Cool,

I read your blog post on this issue:
"The problems I’m aware of are Windows API calls and filesystem operations. I could find overview of API calls in behavior->apistats but the paper mentions that the exact sequence can be extracted form the “raw Cuckoo sandbox output”. Is it located in strings in the JSONs? Any ideas how to extract it?"
-- http://honeynet.github.io/cuckooml/2016/06/19/static-features/

The dataset that you distributed in another blog post does not contain the API call sequences; I'm not sure why that is. But if you run a sample through Cuckoo, you can access the calls each process makes in the following way:

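    # "processes" and "calls" are lists in the report, so default to [].
    processes = self.report.get("behavior", {}).get("processes", [])
    for p in processes:
        apicalls = p.get("calls", [])
        for a in apicalls:
            # Name of the called API function (a string), or None if absent.
            api = a.get("api")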
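
As a standalone variant of the loop above, here is a minimal sketch that collects the ordered API names per process from an already-loaded report dictionary. The function name extract_api_sequences, the toy_report data, and the use of a "pid" key to identify processes are assumptions made for illustration; check them against your own report.json.

```python
def extract_api_sequences(report):
    """Return {pid: [api_name, ...]} for every process in a report dict."""
    sequences = {}
    for process in report.get("behavior", {}).get("processes", []):
        calls = process.get("calls", [])
        # Keep the original call order; skip entries without an "api" field.
        # Assumption: each process entry carries a "pid" key.
        sequences[process.get("pid")] = [c["api"] for c in calls if "api" in c]
    return sequences


# Toy data mimicking only the fields used above (not real sandbox output).
toy_report = {
    "behavior": {
        "processes": [
            {"pid": 1234, "calls": [{"api": "NtCreateFile"},
                                    {"api": "NtWriteFile"},
                                    {"api": "NtClose"}]},
            {"pid": 5678, "calls": []},
        ]
    }
}

print(extract_api_sequences(toy_report))
```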

I'm currently attempting to implement the paper's ideas on API call sequences and would be glad to discuss approaches to feature construction (how to represent an API call and a sequence of three calls) and to vectorization. Are you still working on CuckooML?

hgascon · 2016-09-28T11:37:46Z

@dueland you might want to check scikit-learn CountVectorizer

ghost · 2016-09-29T08:10:29Z

@hgascon thanks for the link. I propose the following:

- build a string representation of three consecutive API calls,
- hash it using hashlib.md5(),
- vectorize the hash with n-grams for n=3, with the approach taken in CuckooML:

    for ngram_api in self.__handle_ssdeep(str(features[i]["api_seq"])):
        my_features[i][":simp:impssdeep:" + ngram_api] = 1

Can you spot any shortcomings with that approach? And is this approach using what is known as the hashing trick?
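
For reference on the last question: the hashing trick usually means mapping each token to a column of a fixed-size vector via a hash taken modulo the vector length, rather than storing the token (or its hash) as a dictionary key. A minimal sketch of that idea follows; the helper names bucket_index and hashed_vector and the bucket count of 1024 are assumptions chosen for illustration, and collisions between tokens are simply accepted.

```python
import hashlib


def bucket_index(token, n_buckets=1024):
    """Map a token to a column index in [0, n_buckets)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned 32-bit integer.
    return int.from_bytes(digest[:4], "big") % n_buckets


def hashed_vector(tokens, n_buckets=1024):
    """Build a fixed-length count vector without keeping a vocabulary."""
    vector = [0] * n_buckets
    for token in tokens:
        vector[bucket_index(token, n_buckets)] += 1
    return vector


vec = hashed_vector(["NtCreateFile,NtWriteFile,NtClose",
                     "NtCreateFile,NtWriteFile,NtClose"])
print(len(vec), sum(vec))
```

The vector length stays the same for every sample, which is the main practical difference from a dictionary of named features.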

UPDATE:

- My supervisor discouraged using n-grams at all; instead, he suggested using the hash as the index.
- Instead of simply marking the hash as present (hash = 1), we are considering counting the occurrences of the hash (hash += 1).
- Instead of iterating through the list of API calls in steps of 3 to retrieve a sequence of API calls, an idea is to iterate one step at a time and construct each sequence from adjacent elements as before. The difference is that which sequences arise would no longer depend on where the steps of 3 happen to fall. An example:
  [a, b, c, d, e, f]
  approach 1: abc, def
  approach 2: abc, bcd, cde, def
- Use the non-cryptographic 32-bit xxhash instead of md5.
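
The two iteration approaches and the counting idea can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and md5 truncated to 32 bits stands in for xxhash so that the example needs only the standard library.

```python
import hashlib
from collections import Counter


def windows(calls, size=3, step=1):
    """Return tuples of `size` adjacent calls, advancing `step` at a time."""
    return [tuple(calls[i:i + size])
            for i in range(0, len(calls) - size + 1, step)]


def hash_window(window):
    """Hash one window to 8 hex digits (32 bits).

    Stand-in for xxhash: md5 truncated to its first 32 bits.
    """
    text = ",".join(window)
    return hashlib.md5(text.encode("utf-8")).hexdigest()[:8]


def count_features(calls, size=3, step=1):
    """Count how often each hashed window occurs (hash += 1)."""
    return Counter(hash_window(w) for w in windows(calls, size, step))


calls = ["a", "b", "c", "d", "e", "f"]
print(["".join(w) for w in windows(calls, step=3)])  # ['abc', 'def']
print(["".join(w) for w in windows(calls, step=1)])  # ['abc', 'bcd', 'cde', 'def']
print(count_features(["a", "b", "c", "a", "b", "c"]))
```

With a step of 1, a repeated pattern is counted wherever it starts, whereas a step of 3 only sees it when it happens to line up with the step boundaries.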

So-Cool · 2016-10-01T12:48:22Z

Sounds good. I also had an idea to build a transition network whose edge weights represent the number of transitions of each type seen so far, but it's probably a bit more complicated.
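
One possible reading of the transition-network idea, as a rough sketch: treat each API call as a node and count how often one call is directly followed by another. The function name transition_counts and the example sequences are assumptions for illustration, not part of CuckooML.

```python
from collections import Counter


def transition_counts(calls):
    """Count (current_call, next_call) pairs in one ordered call sequence."""
    return Counter(zip(calls, calls[1:]))


# Edge weights accumulate across samples by adding the counters together.
network = Counter()
for sample_calls in (["a", "b", "a", "b", "c"], ["a", "b", "c"]):
    network.update(transition_counts(sample_calls))

for (src, dst), weight in sorted(network.items()):
    print(src, "->", dst, weight)
```

Each (source, target) pair could then serve as a feature, with its weight as the feature value.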

So-Cool closed this as completed May 20, 2016

So-Cool reopened this May 20, 2016

So-Cool self-assigned this Jun 26, 2016

So-Cool added the feature label Jun 26, 2016
