Useful malware features #5

Open
So-Cool opened this issue May 20, 2016 · 5 comments
@So-Cool
Collaborator

So-Cool commented May 20, 2016

The ML features for binaries analysed by Cuckoo will be based on Reviewer Integration and Performance Measurement for Malware Detection by B. Miller et al. (available here).
The paper lists many kinds of binary features, both static and dynamic, which seems a good starting point for this project:

  • static attributes:
    • binary metadata,
    • digital signing,
    • heuristic tools,
    • packer detection,
    • portable executable format,
    • static imports;
  • dynamic attributes:
    • dynamic imports, mutexes, processes,
    • filesystem operations,
    • network operations,
    • registry operations,
    • Windows API calls.

Once implemented, they should be reviewed and revised with regard to their usability for this project.

@So-Cool So-Cool closed this as completed May 20, 2016
@So-Cool So-Cool reopened this May 20, 2016
@So-Cool So-Cool self-assigned this Jun 26, 2016
@So-Cool
Collaborator Author

So-Cool commented Jul 28, 2016

A very basic implementation of the above features is complete. The load_features method in the ML class (modules/cuckooml/cuckooml.py) still requires some enhancements, though. All of them are explained in the comments and marked with a TODO flag.

@ghost

ghost commented Sep 28, 2016

Hi So-Cool,

I read your blog post on this issue:
"The problems I’m aware of are Windows API calls and filesystem operations. I could find overview of API calls in behavior->apistats but the paper mentions that the exact sequence can be extracted form the “raw Cuckoo sandbox output”. Is it located in strings in the JSONs? Any ideas how to extract it?"
-- http://honeynet.github.io/cuckooml/2016/06/19/static-features/

The dataset that you distributed in another blog post does not contain the API call sequences; I'm not sure why that is. But if you run a sample through Cuckoo, you can access the calls each process makes in the following way:

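    # behavior -> processes is a list with one entry per monitored process
    processes = self.report.get("behavior", {}).get("processes", [])
    for p in processes:
        # "calls" holds the API calls logged for this process
        apicalls = p.get("calls", [])
        for a in apicalls:
            api = a.get("api")  # name of the called API function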
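
The same traversal can be written as a self-contained sketch, assuming the report has already been loaded into a plain dict with the behavior -> processes -> calls -> api layout shown above. The function name `api_sequences` and the toy report are made up for illustration and are not part of Cuckoo or CuckooML:

```python
def api_sequences(report):
    """Return one list of API names per process.

    Assumes the layout used above: report["behavior"]["processes"] is a
    list of dicts, each with a "calls" list whose items carry an "api" key.
    """
    sequences = []
    for process in report.get("behavior", {}).get("processes", []):
        calls = process.get("calls", [])
        # Skip calls without an "api" field instead of emitting None.
        sequences.append([c["api"] for c in calls if "api" in c])
    return sequences


# Toy report, invented for illustration only.
toy_report = {
    "behavior": {
        "processes": [
            {"calls": [{"api": "NtCreateFile"},
                       {"api": "NtWriteFile"},
                       {"api": "NtClose"}]},
            {"calls": []},
        ]
    }
}
print(api_sequences(toy_report))
```

Keeping one sequence per process avoids building triples that span two unrelated processes.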

I'm currently attempting to implement the ideas on API call sequences from the paper mentioned above, and I would be glad to discuss approaches for feature construction (how to represent an API call and a sequence of three calls) and how to vectorize the result. Are you still working on CuckooML?

@hgascon
Member

hgascon commented Sep 28, 2016

@dueland you might want to check scikit-learn CountVectorizer

@ghost

ghost commented Sep 29, 2016

@hgascon thanks for the link. I propose the following:

  • build a string representation of three consecutive API calls

  • hash it using hashlib.md5()

  • vectorize the hash with n-grams for n=3, using the approach taken in CuckooML:

        for ngram_api in self.__handle_ssdeep(str(features[i]["api_seq"])):
            my_features[i][":simp:impssdeep:" + ngram_api] = 1
    

Can you spot any shortcomings with that approach? And is this approach using what is known as the hashing trick?
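
A minimal sketch of the first two steps of the proposal (string representation of three consecutive calls, then an MD5 digest) might look like this. The function name `hash_triples` and the `;` separator are assumptions made for the example. The n-gram step via `__handle_ssdeep` is left out because its behaviour is not shown in this thread:

```python
import hashlib


def hash_triples(api_calls, sep=";"):
    """MD5 hex digest for each non-overlapping run of three API calls."""
    digests = []
    # Step through the list three calls at a time; a trailing run of
    # fewer than three calls is dropped.
    for start in range(0, len(api_calls) - 2, 3):
        triple = sep.join(api_calls[start:start + 3])
        digests.append(hashlib.md5(triple.encode("utf-8")).hexdigest())
    return digests


digests = hash_triples(["NtCreateFile", "NtWriteFile", "NtClose",
                        "NtOpenKey", "NtQueryValueKey", "NtClose"])
print(len(digests))
```

A separator that cannot occur inside an API name keeps different triples from producing the same string.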

UPDATE:

  • My supervisor discouraged using n-grams at all; instead, he suggested using the hash as the index.
  • Instead of simply marking the hash as present (hash = 1), we are considering counting the occurrences of the hash (hash += 1).
  • Instead of iterating through the list of API calls in steps of 3 to retrieve sequences, an idea is to iterate one step at a time and build each sequence from the adjacent elements as before. The difference is that the resulting sequences would no longer depend on an arbitrary starting offset. An example:
    [a, b, c, d, e, f]
    approach 1: abc, def
    approach 2: abc, bcd, cde, def
  • Use the non-cryptographic 32-bit xxhash instead of md5.
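
The ideas in the update can be sketched together as follows. This is only an illustration: `windows` and `count_features` are made-up names, and `zlib.crc32` from the standard library is used as a stand-in 32-bit non-cryptographic hash because xxhash is a third-party package:

```python
import zlib
from collections import Counter


def windows(calls, size=3, step=1):
    """Runs of `size` adjacent calls.

    step=size reproduces approach 1, step=1 reproduces approach 2.
    """
    return [tuple(calls[i:i + size])
            for i in range(0, len(calls) - size + 1, step)]


def count_features(calls, size=3):
    """Map the 32-bit hash of each sliding window to its occurrence count."""
    counts = Counter()
    for window in windows(calls, size=size, step=1):
        # crc32 is a stand-in here; the thread proposes 32-bit xxhash.
        key = zlib.crc32(";".join(window).encode("utf-8"))
        counts[key] += 1  # count occurrences rather than flagging presence
    return counts


calls = ["a", "b", "c", "d", "e", "f"]
print(windows(calls, step=3))  # approach 1
print(windows(calls, step=1))  # approach 2
print(count_features(["a", "b", "c", "a", "b", "c"]))
```

With a 32-bit hash, distinct sequences can collide on the same index, so counts for rare sequences may occasionally be merged.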

@So-Cool
Collaborator Author

So-Cool commented Oct 1, 2016

Sounds good. I also had an idea to build a transition network whose edge weights represent the number of transitions of a given type seen so far, but it's probably a bit more complicated.
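
One simple way to read the transition-network idea is a table of counts over pairs of directly consecutive calls. This is a sketch under that assumption, not an existing CuckooML feature, and `transition_counts` is a hypothetical name:

```python
from collections import Counter


def transition_counts(calls):
    """Count how often each API call is directly followed by another.

    Keys are (source, target) pairs, i.e. the edges of the network;
    values are the edge weights.
    """
    return Counter(zip(calls, calls[1:]))


seq = ["NtOpenFile", "NtReadFile", "NtOpenFile", "NtReadFile", "NtClose"]
weights = transition_counts(seq)
for (src, dst), weight in sorted(weights.items()):
    print(src, "->", dst, weight)
```

The resulting pair counts could be used directly as features, at the cost of one dimension per observed pair of calls.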
