<!DOCTYPE html>
<!-- saved from url=(0036)https://whyisyoung.github.io/BODMAS/ -->
<html lang="en-US">
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-45384067-4"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-45384067-4');
</script>
<!-- Google Tag Manager -->
<script>(function(w,d,s,l,i){w[l]=w[l]||[];w[l].push({'gtm.start':
new Date().getTime(),event:'gtm.js'});var f=d.getElementsByTagName(s)[0],
j=d.createElement(s),dl=l!='dataLayer'?'&l='+l:'';j.async=true;j.src=
'https://www.googletagmanager.com/gtm.js?id='+i+dl;f.parentNode.insertBefore(j,f);
})(window,document,'script','dataLayer','GTM-P7GQCGG');</script>
<!-- End Google Tag Manager -->
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="stylesheet" href="./css/bodmas.css">
<!-- Begin Jekyll SEO tag v2.7.1 -->
<title>BODMAS Malware Dataset</title>
<meta name="generator" content="Jekyll v3.9.0">
<meta property="og:title" content="BODMAS Malware Dataset">
<meta property="og:locale" content="en_US">
<link rel="canonical" href="https://whyisyoung.github.io/BODMAS/">
<meta property="og:url" content="https://whyisyoung.github.io/BODMAS/">
<meta property="og:site_name" content="BODMAS">
<meta name="twitter:card" content="summary">
<meta property="twitter:title" content="BODMAS Malware Dataset">
<script type="application/ld+json">
{"@type":"WebSite","headline":"BODMAS Malware Dataset","url":"https://whyisyoung.github.io/BODMAS/","name":"BODMAS","@context":"https://schema.org"}</script>
<!-- End Jekyll SEO tag -->
</head>
<body>
<!-- Google Tag Manager (noscript) -->
<noscript><iframe src="https://www.googletagmanager.com/ns.html?id=GTM-P7GQCGG"
height="0" width="0" style="display:none;visibility:hidden"></iframe></noscript>
<!-- End Google Tag Manager (noscript) -->
<header>
<!-- <div class="container">
<a id="a-title" href="https://whyisyoung.github.io/BODMAS/"> <h1>BODMAS</h1> </a>
<h2>The BODMAS Malware Dataset is created and maintained by Blue Hexagon and UIUC.</h2>
<section id="downloads">
<a href="https://github.com/whyisyoung/BODMAS" class="btn btn-github"><span class="icon"></span>View on GitHub</a>
</section>
</div> -->
</header>
<div class="container">
<section id="main_content">
<h1 id="bodmas-malware-dataset">BODMAS Malware Dataset</h1>
<section id="downloads">
<a href="https://github.com/whyisyoung/BODMAS" class="btn btn-github"><span class="icon"></span>View on GitHub</a>
</section>
<p id="notice">Update (10/09/2023) - Since Limin has graduated, please email his labmate Zhi Chen (zhic4@illinois.edu) and CC Dr. Gang Wang (gangw@illinois.edu) for all future requests.
</p>
<p id="notice">Update (12/15/2021) - Malware category information is available at <a href="https://drive.google.com/drive/folders/1Uf-LebLWyi9eCv97iBal7kL1NgiGEsv_?usp=sharing">Google Drive</a>
</p>
<p id="notice">Update (08/29/2021) - Source code is available at: <a href="https://github.com/whyisyoung/BODMAS">GitHub</a>
</p>
<p>BODMAS is short for <strong>B</strong>lue Hexagon <strong>O</strong>pen <strong>D</strong>ataset for <strong>M</strong>alware <strong>A</strong>nalysi<strong>S</strong>. We collaborate with <a href="https://bluehexagon.ai/">Blue Hexagon</a> to release a dataset containing timestamped malware samples and well-curated family information for research purposes. The BODMAS dataset contains 57,293 malware samples and 77,142 benign samples collected from August 2019 to September 2020, with carefully curated family information (581 families).</p>
<p>We extract the feature vectors using the <a href="https://lief.quarkslab.com/">LIEF</a> project (version 0.9.0), the same as the <a href="https://github.com/elastic/ember">Ember</a> dataset (details can be found <a href="https://github.com/elastic/ember/blob/master/ember/features.py">here</a>). Each sample is represented as a 2,381-dimensional feature vector, along with its label (<code class="language-plaintext highlighter-rouge">benign</code> or <code class="language-plaintext highlighter-rouge">malicious</code>) and, if it is malicious, its malware family. We also release the original binaries, for malware samples only.</p>
<p>Further details can be found in our paper “BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware” [<a href="https://liminyang.web.illinois.edu/data/DLS21_BODMAS.pdf" target="_blank">PDF</a>], Deep Learning and Security Workshop 2021 (co-located with IEEE Security and Privacy 2021).</p>
<p>If you end up building on this dataset as part of a project or publication, please include a reference to our paper:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@inproceedings{bodmas,
title = {BODMAS: An Open Dataset for Learning based Temporal Analysis of PE Malware},
author = {Yang, Limin and Ciptadi, Arridhana and Laziuk, Ihar and Ahmadzadeh, Ali and Wang, Gang},
booktitle = {4th Deep Learning and Security Workshop},
year = {2021}
}
</code></pre></div></div>
<h2 id="download">Download</h2>
<ol>
<li>The feature vectors and metadata are open to everyone. Download the data here: <a href="https://drive.google.com/drive/folders/1Uf-LebLWyi9eCv97iBal7kL1NgiGEsv_?usp=sharing">Google Drive</a>
<ul>
<li>feature vectors (~250 MB): <code class="language-plaintext highlighter-rouge">bodmas.npz</code></li>
<li>metadata (~12 MB): <code class="language-plaintext highlighter-rouge">bodmas_metadata.csv</code></li>
<li>Both files are sorted by timestamp in ascending order, so each feature vector corresponds to the row at the same position in the metadata file.</li>
</ul>
</li>
<li>We cannot release the original files of the benign software due to copyright considerations, but we do host the original binaries of the malware samples.
<p id="notice"> To avoid misuse, please read and agree to the following conditions before sending us emails. </p>
<ul>
<li>Please email <strike>Limin (liminy2@illinois.edu)</strike> Zhi Chen (zhic4@illinois.edu) and CC Gang (gangw@illinois.edu). Also, please include your Gmail address in the body so that we can add you to the Google Drive folder where the dataset is stored.</li>
<li>Do not share the data with anyone else (except your co-authors on the project). We are happy to share it with other researchers upon request.</li>
<li>Explain in a few sentences what you plan to do with these binaries. It does not need to be a precise plan.</li>
<li>If you are in academia, contact us using your institutional email and provide a webpage under the university domain that shows your name and affiliation.</li>
<li>If you are in an industrial research lab, send us an email from your company email account and introduce yourself and your company. Please attach a justification letter (in PDF format) on official letterhead. The letter needs to state clearly why this dataset is being requested.</li>
</ul>
<p>Please note that emails that do not follow these conditions may be ignored. We keep a public list of the organizations that have accessed these samples at the bottom of this page.</p>
</li>
</ol>
<h2 id="get-started">Get Started</h2>
<ul>
<li>
<p>To load the feature vectors, load <code class="language-plaintext highlighter-rouge">bodmas.npz</code> (a compressed NumPy archive) with the following code. Note that the feature values are unnormalized, which is fine for classifiers such as gradient-boosted decision trees, but you may need to normalize them before training an MLP classifier.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="n">filename</span> <span class="o">=</span> <span class="s">'bodmas.npz'</span>
<span class="n">data</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="n">X</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'X'</span><span class="p">]</span> <span class="c1"># all the feature vectors
</span><span class="n">y</span> <span class="o">=</span> <span class="n">data</span><span class="p">[</span><span class="s">'y'</span><span class="p">]</span> <span class="c1"># labels, 0 as benign, 1 as malicious
</span>
<span class="k">print</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">shape</span><span class="p">,</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="c1"># >>> (134435, 2381) (134435,)
</span></code></pre></div> </div>
</li>
<li>
<p><code class="language-plaintext highlighter-rouge">bodmas_metadata.csv</code> has three columns: the SHA-256 hash, the time the sample first appeared, and the malware family. <code class="language-plaintext highlighter-rouge">If the malware family is empty, then it’s a benign sample.</code></p>
</li>
<li>
<p>The top malware families (those with at least 1,000 samples) and their sample counts are as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. sfone: 4729
2. wacatac: 4694
3. upatre: 3901
4. wabot: 3673
5. small: 3339
6. ganelp: 2232
7. dinwod: 2057
8. mira: 1960
9. berbew: 1749
10. sillyp2p: 1616
11. ceeinject: 1169
12. gepys: 1124
13. benjamin: 1071
14. musecador: 1054
</code></pre></div>
</div>
</li>
</ul>
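<p>Because the dataset is timestamped and the feature values are unnormalized, a common first step is a time-based train/test split followed by standardization. The following is a minimal sketch of that step, not the authors' code. It assumes that the metadata file has one header row followed by the three columns in the order described above, and that the timestamps are ISO 8601 text. The function names <code class="language-plaintext highlighter-rouge">load_metadata</code>, <code class="language-plaintext highlighter-rouge">temporal_split</code>, and <code class="language-plaintext highlighter-rouge">standardize</code> are hypothetical helpers introduced for illustration.</p>

```python
import csv
from datetime import datetime

import numpy as np


def load_metadata(csv_path):
    """Read the metadata CSV into three parallel lists.

    Assumption: one header row, then three columns in the order
    SHA-256, first-seen timestamp (ISO 8601 text), malware family.
    """
    shas, times, families = [], [], []
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the assumed header row
        for sha, timestamp, family in reader:
            shas.append(sha)
            times.append(datetime.fromisoformat(timestamp))
            families.append(family)  # empty string means benign
    return shas, times, families


def temporal_split(X, y, times, cutoff):
    """Split by first-seen time: train before `cutoff`, test from it onward.

    `times` must be aligned row by row with X and y, and `cutoff` must be
    timezone-aware if (and only if) the entries of `times` are.
    """
    is_test = np.array([t >= cutoff for t in times])
    is_train = ~is_test
    return (X[is_train], y[is_train]), (X[is_test], y[is_test])


def standardize(X_train, X_test):
    """Z-score both splits using statistics from the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant features unscaled
    return (X_train - mean) / std, (X_test - mean) / std
```

<p>With the real files, one would call <code class="language-plaintext highlighter-rouge">load_metadata('bodmas_metadata.csv')</code> and pass the returned timestamps to <code class="language-plaintext highlighter-rouge">temporal_split</code> together with <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">y</code> from <code class="language-plaintext highlighter-rouge">bodmas.npz</code>. This relies on the two files sharing the same row order. The cutoff date is a free choice. Computing the mean and standard deviation on the training split alone keeps information from later samples out of the model.</p>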
<h2 id="organizations">Organizations That Requested Our Dataset</h2>
<ol>
<li>Simon Fraser University, Canada</li>
<li>Oracle Labs</li>
<li>Columbia University</li>
<li>Telkom University, Indonesia</li>
<li>University of Alberta, Canada</li>
<li>Orange Inc., France</li>
<li>Beijing Institute of Technology</li>
<li>College of Engineering Pune, India</li>
<li>University of Salerno, Italy</li>
<li>Shanghai Jiao Tong University</li>
<li>Southeast University</li>
<li>Beijing University of Posts and Telecommunications</li>
<li>Guizhou Normal University</li>
<li>Korea University</li>
<li>Guilin University of Electronic Technology</li>
<li>New York University</li>
<li>University of Chinese Academy of Sciences</li>
<li>University of the West of England (UWE) Bristol</li>
<li>University College Dublin, Ireland</li>
<li>Women Engineering College, Ajmer, India</li>
<li>Beijing University of Technology</li>
<li>Air University Islamabad, Pakistan</li>
<li>Eastern Connecticut State University</li>
<li>Yonsei University, South Korea</li>
<li>Arizona State University</li>
<li>Bandung Institute of Technology, Indonesia</li>
<li>University of Southampton, United Kingdom</li>
<li>Xidian University</li>
<li>University of Balamand, Lebanon</li>
<li>The University of Chicago</li>
<li>Xinjiang University</li>
<li>University of Turin, Italy</li>
<li>Punjab University College of Information Technology, Pakistan</li>
<li>Guangzhou University</li>
<li>Middle East Technical University, Turkey</li>
<li>Microsoft</li>
<li>Sana'a University, Yemen</li>
<li>HarfangLab, France</li>
<li>Purdue University Northwest</li>
<li>PSG College of Technology, India</li>
<li>University of Windsor, Canada</li>
<li>Georgia Tech</li>
<li>De Montfort University, United Kingdom</li>
<li>Ghent University, Belgium</li>
<li>Iowa State University</li>
<li>Macquarie University, Australia</li>
<li>Hongik University, South Korea</li>
<li>UiTM Shah Alam, Malaysia</li>
<li>Hanoi University of Science and Technology, Vietnam</li>
<li>Ain Shams University, Egypt</li>
<li>Open University of Catalonia, Spain</li>
<li>Amrita Vishwa Vidyapeetham, India</li>
<li>National University of Science and Technology, Zimbabwe</li>
<li>Nagoya University, Japan</li>
<li>Institute of Information Security, Japan</li>
<li>Heriot-Watt University, United Kingdom</li>
<li>Edinburgh Napier University, United Kingdom</li>
<li>Istanbul University-Cerrahpaşa, Turkey</li>
<li>Zhejiang University</li>
<li>Hanyang University, South Korea</li>
<li>Army Engineering University of PLA</li>
<li>Purdue University</li>
<li>University of Molise, Italy</li>
<li>SharpAI LLC</li>
<li>Silesian University of Technology, Poland</li>
<li>Florida State University</li>
<li>University of Bath, United Kingdom</li>
<li>National University of Computer and Emerging Sciences, Pakistan</li>
<li>Chungnam National University, South Korea</li>
<li>PeeploTech</li>
<li>Damietta University, Egypt</li>
<li>Queen's University Belfast, United Kingdom</li>
<li>Vilnius Tech, Lithuania</li>
<li>Indian Institute of Technology Roorkee, India</li>
<li>Beijing University of Civil Engineering and Architecture</li>
<li>University of Quebec in Outaouais, Canada</li>
<li>National Institute of Technology Raipur, India</li>
<li>University of Colorado Colorado Springs</li>
<li>University of Technology and Applied Sciences, Oman</li>
<li>University of Portsmouth, United Kingdom</li>
<li>Brno University of Technology, Czechia</li>
<li>Royal Holloway, University of London, United Kingdom</li>
<li>The University of Alabama in Huntsville</li>
<li>University of Portsmouth, United Kingdom</li>
<li>Wuhan University</li>
<li>Guizhou University</li>
<li>Amrita Vishwa Vidyapeetham, India</li>
<li>Birkbeck, University of London, United Kingdom</li>
<li>GoldenEye Inc</li>
<li>Huazhong University of Science and Technology</li>
<li>Sam Houston State University</li>
<li>Hoseo University, South Korea</li>
<li>East China University of Science and Technology</li>
<li>Xiamen University Malaysia</li>
<li>Pamantasan ng Lungsod ng Maynila, Philippines</li>
<li>Sichuan University</li>
<li>Nanjing University of Information Science and Technology</li>
<li>University of Information Technology, Ho Chi Minh City, Vietnam</li>
<li>Seoul National University of Science and Technology, South Korea</li>
<li>University of Science and Technology of China</li>
<li>Tsukuba University, Japan</li>
<li>University of Toronto, Canada</li>
<li>Charles Darwin University, Australia</li>
<li>Zoho Corporation, India</li>
<li>University of Cape Town, South Africa</li>
<li>Sivas University of Science and Technology, Turkey</li>
<li>University of Bari Aldo Moro, Italy</li>
<li>UET Lahore University of Engineering and Technology</li>
<li>Bandung Institute of Technology, Indonesia</li>
<li>Sungshin Women's University, South Korea</li>
<li>Budapest University of Technology and Economics, Hungary</li>
<li>University of Bari (islab-uniba), Italy</li>
<li>Dongguk University, South Korea</li>
<li>People's Public Security University, China</li>
<li>Fujian Normal University, China</li>
<li>Qassim University, Saudi Arabia</li>
<li>Sichuan University, China</li>
<li>Zhejiang Normal University, China</li>
<li>University of Minnesota</li>
<li>Amrita Vishwa Vidyapeetham, India</li>
<li>Indian Institute of Technology Jammu, India</li>
<li>Babes-Bolyai University of Cluj-Napoca, Romania</li>
<li>Texas A&amp;M University</li>
<li>Ho Chi Minh City University of Technology, Vietnam</li>
<li>AnxinSec, China</li>
<li>Czech Technical University in Prague, Czechia</li>
<li>Koç University, Turkey</li>
<li>Telkom University, Indonesia</li>
<li>ShanghaiTech University, China</li>
<li>University of Electronic Science and Technology of China, China</li>
<li>VNU-HCM University of Information Technology, Vietnam</li>
<li>Johns Hopkins University</li>
<li>Umm Al-Qura University, Kingdom of Saudi Arabia</li>
<li>Federal University of Parana, Brazil</li>
<li>University of Sannio in Benevento, Italy</li>
<li>German University in Cairo, Egypt</li>
<li>BRAC University, Bangladesh</li>
<li>University of Piraeus, Greece</li>
<li>ECIT, Queen's University Belfast, Northern Ireland</li>
<li>Nanjing University of Posts and Telecommunications, China</li>
<li>National University of Defense Technology, China</li>
<li>Numidia Institute of Technology, Algeria</li>
<li>George Washington University</li>
</ol>
<h2 id="contact">Contributors</h2>
<p><a href="https://liminyang.web.illinois.edu">Limin Yang</a>, Ph.D. from UIUC.</p>
<p>Arridhana Ciptadi, Blue Hexagon Inc.</p>
<p>Ihar Laziuk, Blue Hexagon Inc.</p>
<p>Ali Ahmadzadeh, Blue Hexagon Inc.</p>
<p><a href="https://gangw.cs.illinois.edu">Gang Wang</a>, Associate Professor at UIUC</p>
</section>
<br>
<p class="footer">Last updated:
<script>
t = new Date(document.lastModified).toLocaleDateString()
document.write(t);
</script>
</p>
</div>
<br> <br> <br>
</body></html>