
Video Scene Search

The contents of this readme can be seen as slides.

Video of demo at: https://youtu.be/bSqjk9qmxrU

Home page: 0:05 - https://youtu.be/bSqjk9qmxrU?t=5s
About page: 0:15 - https://youtu.be/bSqjk9qmxrU?t=15s
Finding credit scene example: 0:26 - https://youtu.be/bSqjk9qmxrU?t=26s
Histogram: 0:59 - https://youtu.be/bSqjk9qmxrU?t=59s
Finding four similar scenes: 1:18 - https://youtu.be/bSqjk9qmxrU?t=1m18s
Finding red color example: 1:54 - https://youtu.be/bSqjk9qmxrU?t=1m52s

Table of contents

  1. Introduction
  2. Data Pipeline
  3. Performance
  4. Challenges and Future Improvements

Introduction

Video Scene Search (VSS) lets you perform an image search for similar video scenes. The user uploads an image, and VSS returns video recommendations along with the time in each video where a similar scene was found.

Data Pipeline

Overview of Pipeline

  • Takes videos stored in HDFS and runs a Spark batch job on them
  • The Spark batch job extracts frames from each video, then for each frame calculates a hash and a histogram, and stores both in Cassandra
  • Users submit images they want to search for
  • The Flask front end also produces a hash and a histogram for the submitted image, then attempts to find the most similar image(s) in the existing Cassandra database
  • Similarity is determined by Hamming distance (between hashes) and cosine similarity (between histograms); see the sketch after this list
  • Information about the most similar frames, the videos they belong to, and the time at which they occur in each video is returned to the user
  • For the closest frame, similarity distances against all frames of the video it belongs to are also sent back to the user
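
As a rough illustration of the per-frame features and distance measures described above, here is a minimal sketch in Python (assuming the imagehash, Pillow, OpenCV, and NumPy packages; the actual implementation in this repo may use different libraries and parameters):

```python
import cv2
import imagehash
import numpy as np
from PIL import Image

def frame_features(frame_bgr):
    """Compute a perceptual hash and a color histogram for one frame (OpenCV BGR array)."""
    pil_img = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    phash = imagehash.phash(pil_img)  # 64-bit DCT-based perceptual hash
    hist = cv2.calcHist([frame_bgr], [0, 1, 2], None,
                        [8, 8, 8], [0, 256, 0, 256, 0, 256])  # 8x8x8-bin BGR histogram
    return phash, cv2.normalize(hist, hist).flatten()

def hamming_distance(hash_a, hash_b):
    """imagehash overloads '-' to return the Hamming distance between two hashes."""
    return hash_a - hash_b

def cosine_similarity(hist_a, hist_b):
    """Cosine similarity between two flattened histograms."""
    return float(np.dot(hist_a, hist_b) /
                 (np.linalg.norm(hist_a) * np.linalg.norm(hist_b) + 1e-10))
```

Frames whose hashes are a small Hamming distance from the query image's hash, or whose histograms have a high cosine similarity to its histogram, are treated as candidate matches.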

The image below depicts the underlying data pipeline and cluster size.

(Diagram: data pipeline and cluster size)

Data source

The data source consists of ~8 GB of YouTube videos (mostly trailers, ~12.5 hours of video), downloaded using the youtube-dl tool.
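
For reference, a minimal sketch of fetching such videos with youtube-dl's Python API (the URL below is a placeholder; the actual trailer list is not included in this README):

```python
import youtube_dl

# Placeholder URL; the real set of trailers used for this project is not listed here.
urls = ["https://www.youtube.com/watch?v=EXAMPLE_ID"]

options = {
    "format": "mp4",                      # prefer a single mp4 stream
    "outtmpl": "videos/%(id)s.%(ext)s",   # write each video into a local videos/ directory
}

with youtube_dl.YoutubeDL(options) as ydl:
    ydl.download(urls)
```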

Performance

  • Batch processing: approximately 38 minutes to hash 270,000 frames (every 5th frame was hashed)
  • Stream processing: for new image queries, approximately 50 seconds to return recommendations
  • Accuracy: not very good except for black-screen or credit scenes (the perceptual hashing algorithm is not a strong descriptor/fingerprint for an image)
  • Finding an image took about 70 seconds when using an all-pairs cosine similarity search over histograms (40 seconds when comparing only Hamming distances)

Challenges and Future Improvements

What doesn't work very well:

  • Need to reduce the search space. Doing an "all pairs" distance calculation through every row in the existing frames database to find the most similar frame is not scalable; a way is needed to search only a partition rather than the whole database
      • Find some way to cluster and partition the database
      • Need to speed up the similarity search
  • Currently, the project joins a DStream RDD against the large static RDD that holds all the existing frames. This join is extremely slow (about 45 seconds when the DStream RDD has 1 item and the static RDD has ~400k rows). dstreamRDD.join(staticRDD) and staticRDD.join(dstreamRDD) are equally slow. A bottleneck in the join (or something else, possibly work landing on a single node) may also be slowing everything down; a sketch of this pattern and of the broadcast alternative follows this list
  • Broadcasting the large static RDD (to avoid the join) was not done because it also does not scale: each additional video results in thousands of frames, and thus thousands of additional rows in the database, while there is a limit to what can be broadcast. Most forums say only about 2^31 rows can be broadcast (2 billion rows? or 2 GB?), due to a Java int size limit within Spark. At 30 frames per second, that is about 20,000 hours of video, or roughly 10,000 two-hour movies. See: http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-common-in-spark-to-broadcast-a-10-gb-variable-td2604.html Note that the number of movies made in history is >100k
  • The perceptual hash is not good: it works for finding duplicates, but not really for finding similarities. Alternatives to perceptual hashing that could be tried:
      • Color histograms, because the current perceptual hash discards all color information; it only quantifies "image frequency" using a discrete cosine transform
      • TensorFlow's pretrained Inception v3: run Inception v3 on each image to get a vector of detected/recognized objects, then use cosine similarity to compare images. See: http://stackoverflow.com/questions/34809795/tensorflow-return-similar-images. I briefly attempted this, but it took about 5 seconds per image to classify (i.e., to return the pool_3:0 tensor). Other users online have optimized this to about 1 second per image, but with ~300k images that would still take about a hundred hours (several days), and this project is only 4 weeks long
  • Other thoughts:
      • Perhaps use Elasticsearch instead of Spark to do the comparison when searching for the most similar image
      • For the batch phase, be able to process a single video in a distributed manner
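
To make the trade-off between the streaming join and the broadcast alternative concrete, here is a minimal sketch of both patterns (the socket source, the hash-bucket key, and the bucket_of helper are illustrative assumptions, not the repo's actual code):

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="vss-similarity-sketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Static pair RDD of (bucket, frame_record) built from the Cassandra frames table.
# Loading it is elided here; assume it is keyed the same way as the incoming queries.
static_frames = sc.parallelize([])  # placeholder

def bucket_of(query):
    return hash(query) % 64  # hypothetical bucketing of the query's perceptual hash

# DStream of incoming query images, keyed by the same bucket as the static table.
queries = ssc.socketTextStream("localhost", 9999) \
             .map(lambda line: (bucket_of(line), line))

# Current approach: join each micro-batch against the full static RDD (slow).
joined = queries.transform(lambda rdd: rdd.join(static_frames))
joined.pprint()

# Broadcast alternative: ship the static table to every executor once and look up locally.
# Only viable while the table stays under Spark's per-broadcast size limit discussed above.
frames_by_bucket = sc.broadcast(dict(static_frames.groupByKey().collect()))
matches = queries.map(lambda kv: (kv[1], list(frames_by_bucket.value.get(kv[0], []))))
matches.pprint()

ssc.start()
ssc.awaitTermination()
```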

Back to Table of contents
