xRay extracts your S3 metadata into two data formats:
- Parquet files that you can import into a Spark cluster for analytics
- An Elasticsearch bulk file that you can import into Elasticsearch
- Generate a binary for your system
$ git clone https://github.com/vardhanv/xray.git
$ cd xray
$ sbt universal:packageBin
$ cd target/universal
$ unzip xray-<version>.zip
$ cd xray-<version>/bin
$ ./xray --help
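For example, to index a bucket and generate Parquet output (a sketch; <your_bucket> is a placeholder, and the flags used are the ones documented under --help below):
$ ./xray --bucket <your_bucket> --parquet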
If you generate an Elasticsearch output file (assume it is named xray.out):
- Create an Elasticsearch cluster on AWS
- Upload the data into Elasticsearch (a spot-check query is sketched after this list)
$ curl --tr-encoding -XPOST 'http://<your_elastic_search_url>/_bulk' --data-binary @xray.out
- Now you can analyze the data in the AWS Elasticsearch / Kibana service
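- Optionally, spot-check that the upload worked by querying the cluster directly. This is a sketch; <index_name> is a placeholder for whatever index your xray.out bulk file targets
$ curl -XGET 'http://<your_elastic_search_url>/<index_name>/_search?size=5&pretty'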
If you generate a Parquet file, you can analyze it in a Spark cluster:
- Go to http://www.databricks.com
- Click on "Manage Account"
- Select Community Edition
- Create a cluster and wait for it to come online
- Create a table from the Parquet file (assume it is named "giab")
- Create a notebook - Workspace/Users/.../Create/Notebook/Language: Scala
> // In a Databricks notebook, sc (the SparkContext) is already defined
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> import sqlContext.implicits._
> import org.apache.spark.sql.functions._
> val df = sqlContext.table("giab")
> df.count
> df.show()
> // Deduplicated storage used, in TB
> val TB: Long = 1000000000000L
> val totalTB = df.select(col("Content_Length")).rdd.map(_(0).asInstanceOf[Long].toDouble / TB).reduce(_ + _)
> val totalTB_unique = df.dropDuplicates("ETag").select(col("Content_Length")).rdd.map(_(0).asInstanceOf[Long].toDouble / TB).reduce(_ + _)
> val totalSavings = totalTB - totalTB_unique
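To see which objects account for the duplication, a short follow-on sketch (it assumes the same df and the ETag / Content_Length columns used above) ranks ETags with more than one copy by the total bytes they occupy:
> // ETags that appear more than once, ordered by total bytes occupied
> val dupSpace = df.groupBy("ETag").agg(count("*").as("copies"), sum("Content_Length").as("total_bytes")).filter(col("copies") > 1).orderBy(desc("total_bytes"))
> dupSpace.show(10)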
$ ./xray --help
xRay 1.0
Usage: xRay [options]
-b, --bucket <value> target s3 bucket
-p, --parquet generate parquet file output
-l, --elastic generate elastic search output
-x, --number-obj:maxObj=objAtATime
optional, <x=y>, index "x" objects "y" at a time. defaults: x [1 billion], y:[1000]
-e, --ep-url <value> optional, endpoint. default = https://s3.amazonaws.com
-f, --profile <value> optional, aws profile. default = default. create using "aws --configure"
-r, --region <value> optional, s3 region.
-o, --output <value> optional, output file. default = xray.out
--help prints help text
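For example, to index a bucket in a specific region through a custom S3 endpoint and write Elasticsearch output (a sketch; the bucket, region, endpoint, and profile values are placeholders):
$ ./xray --bucket <your_bucket> --elastic --region <your_region> --ep-url https://<your_s3_endpoint> --profile <your_profile> --output xray.out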