GitHub - aleskandro/tcpdump-analyze-hadoop

Code challenge: analyzing big tcpdump logs

The code runs as a standalone python script, or with hadoop.

To run the code without hadoop:

$ cd src/
$ ./start.py </PATH/TO/TCPDUMP.LOG> [-flagsToSelectStatistics]

It will take a lot of time for large files

Hadoop

To get a faster way to quickly analyze the log it could be used hadoop as a distributed framework to process large data sets. The use of libraries as pandas could not permit, alone, to get Gigabytes (or hundred of GB) processed in a distributed way. So, considering the manipulation to be done simple (so no needs for pandas) but huge in terms of a huge dataset to analyze, this example was implemented with hadoop and python, using hadoop streaming APIs. Alternatives could be to use pandas, a db as prestoDb and the needed connectors in python to make the import of the data set and the manipulation: this was not considered as fast as hadoop and MapReduce way, in designing and development as first thing.

The code was tested in a Docker container derived from 'sequenceiq/hadoop-docker' hub.docker.com, using a single node hadoop server. To build and run the container (take care of permissions and SELinux):

    $ docker build -t myhadoop .
    $ docker run -v /ABSPATH/TO/THIS/REPO/src/:/code -it myhadoop /etc/bootstrap.sh -bash

In the container, to run the analyzer:

    # cd /code/
    # ./start.py -H /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.7.0.jar </PATH/TO/TCPDUMP.LOG> [-flagsToSelectStatistics]

Dependencies

python-argparse
start python packages (threading, sys, os...)
linux-coreutils: (cat, less, sort... )
[Optional] Hadoop

Usage

usage: start.py [-h] [-s] [-c] [-C] [-p] [-t] [-H HADOOP] logFile

Process tcpdump log files

positional arguments:
  logFile

optional arguments:

  -h, --help            show this help message and exit

  -s, --serversTopTen   Get a top ten of servers ordered by count of received
                        packets

  -c, --clientsTopTen   Get a top ten of clients ordered by sent bytes

  -C, --clientsTopTenByServers
                        Take a top ten of clients sending packets to servers,
                        for each server

  -p, --generalStatistics
                        Counts total packets and large or small packets as
                        percentage (large > 500b)

  -t, --timestamps      Enable stats on number of packets logged in a period
                        of 5 minutes
  -H HADOOP, --hadoop HADOOP
                        Enable hadoop mapreduce (take the hadoop streaming
                        .jar path as argument)

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
docs		docs
samples		samples
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Code challenge: analyzing big tcpdump logs

Hadoop

Dependencies

Usage

About

Releases

Packages

Languages

aleskandro/tcpdump-analyze-hadoop

Folders and files

Latest commit

History

Repository files navigation

Code challenge: analyzing big tcpdump logs

Hadoop

Dependencies

Usage

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages