This project is aimed at enabling automatic content summarization of videos by creating a representative mosaic of the video's object of focus. We intend to do this in a three-step process involving real-time object detection, salient object identification and image mosaicing.
With the advent of smartphones with high-quality cameras, video and image content is being generated at an unprecedented rate, rendering the task of manual content identification and classification virtually impossible. Content identification and classification tasks impact human lives in many ways. Content categorization enables delivery of personalized content based on user preferences. At the same time, it is instrumental in maximizing revenue for several industries. For instance, the advertising and marketing industry is a particularly visual world, with millions of images and videos displayed every day within websites, television programs and movies, in order to expose consumers to the latest trends and products.
The sheer volume of media existing today motivates the need for automated software systems that leverage state-of-the-art artificial intelligence (AI) and computer vision (CV) techniques for accomplishing the content cataloging task efficiently. AI- and CV-based software systems allow for visual product discovery and categorization, thereby reducing the reliance on manually entered, subjective, noisy product meta-data. Their ability to group similar products based on their visual affinity makes the process of categorization more objective, less noisy and far faster than methods that require human intervention. However, as with many other computer vision problems, there is no single approach that is universally considered the obvious or “best” method for categorizing content efficiently and effectively.
Another problem closely associated with video and image categorization is that of content summarization. There is a growing need for automatically generating aesthetically pleasing visualizations that are representative of categorized content and provide the viewer with a preview or overview of the actual content in the media. This project is aimed at efficient and effective summarization of video content by means of real-time object detection and image mosaicing.
Our proposed content summarization pipeline comprises three modules, namely Object Detection, Salient Object Identification and Image Mosaicing.
Object detection is the problem of finding and classifying a variable number of objects in an image. Object detection has proven to be a hard problem as compared to classification, since its output is variable in dimensions due to the inherent differences in object size and number across images. Object detection in videos is an even more challenging task than object detection in images primarily due to the motion and blur effects that result in detection failures on certain frames.
A traditional method for object detection uses Histogram of Oriented Gradients (HOG) features with a Support Vector Machine (SVM) for classification. It requires a multi-scale sliding window, which makes it slow. In recent years, deep learning has been a real game changer in computer vision. Deep learning models have virtually replaced classical techniques for the tasks of image classification and object detection and are currently an active area of research in computer vision. Several deep learning models that are state of the art in object detection are already available. These models are fast and provide high accuracy and detection efficiency. In this phase, we considered state-of-the-art frameworks such as R-CNN, R-FCN, YOLO and SSD for application-specific object detection.
After obtaining the output of the object detection module, we extracted information about the content and salient objects of the video based on a mixture of heuristics and learned decisions. In this phase, we determined the object of focus in the video by analyzing identified object categories, their frequency of occurrence and the mutual relationship between objects detected in each frame. The output of this module was a list of tags or categories and a salient object along with the best representative frame of its occurrence in the video.
Mosaicing is an old art technique where pictures or designs are formed by inlaying small bits of colored stone, glass, or tile. These small bits are visible close up, but the boundaries of the bits will blend and a single recognizable image will show at a distance. In the modern digital world, this ancient art form has been transformed and combined with new technologies. Instead of using pure-colored blocks, entire images can be used as tiles to make an overall picture.
After obtaining the output of the Object Detection and Salient Object Identification modules, we chose the representative image as our target image. Using other occurrences of the same object (as well as other significant objects within those frames), we reconstructed the target image in the form of a mosaic. This aesthetic visualization would be our final output that succinctly provides a visual preview of the video content.
The YOLOv2 model was selected for object detection since it provides a good trade-off between accuracy and processing speed. The model was trained on the COCO dataset for 80 classes. It consists of 23 convolutional layers, 2 route layers and 1 detection layer, with an input image size of 608x608. In particular, we chose the darkflow Python implementation, which uses the TensorFlow framework, for object detection in this project. For each of the commercials under consideration, we down-sampled the number of frames per second (fps) by a factor of 3, reducing 24 fps to 8 fps. While the network itself can potentially process up to 40 fps, in the absence of access to machines with GPUs, we ran the model on a 4-core CPU machine with a processing speed of about 1 fps. The key steps of the module are outlined below:
The annotated images presented below illustrate the performance of the object detection module on 4 commercials advertising a phone (Windows), a car (Mercedes Benz), a tennis racket (Wilson), and a laptop (Microsoft Surface) respectively. It is worth noting that several of these objects are correctly detected even in unfavorable visual conditions such as poor lighting, motion-blur, and significant obstruction by other objects. While the accuracy is remarkable, the speed of classification leaves much to be desired due to the limited computational resources available to us.
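To make the speed limitation concrete, the following minimal sketch reproduces the frame down-sampling arithmetic described for the detection module. The function names and the 30-second clip length are hypothetical and chosen only for illustration; the factor of 3, the 24 fps source rate and the roughly 1 fps CPU detection rate come from the description above.

```python
def downsample_frames(frame_indices, factor=3):
    """Keep every `factor`-th frame index (24 fps -> 8 fps when factor is 3)."""
    return list(frame_indices)[::factor]


def estimated_processing_seconds(video_seconds, source_fps=24, factor=3,
                                 detector_fps=1.0):
    """Estimate wall-clock detection time for a clip of the given duration."""
    total_frames = int(video_seconds * source_fps)
    kept_frames = len(downsample_frames(range(total_frames), factor))
    return kept_frames / detector_fps


# Hypothetical 30-second commercial: 720 source frames, 240 kept.
kept = downsample_frames(range(30 * 24))
print(len(kept))                         # 240
print(estimated_processing_seconds(30))  # 240.0
```

Under these assumptions, a 30-second clip would need about four minutes of detection time on the CPU, which is consistent with processing taking longer than the video itself.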
For identifying the object of focus in commercial advertisements, we started by filtering out objects identified with a confidence of less than 50%. Additionally, we filtered out object classes based on their frequency of occurrence in the video. Objects detected in only a handful of frames were successfully filtered out due to our thresholding strategy. The key steps of the module are outlined below:
The graphs presented below illustrate the binned detected-object-distribution by time (frame number) in 4 commercials advertising a phone (Windows), a car (Mercedes Benz), a tennis racket (Wilson), and a laptop (Microsoft Surface) respectively. The car advertisement was a basic case where the most frequently detected object was the car itself, and hence it did not require any further estimation for determining the salient object. However, in each of the other commercials the most frequent object category was person. A naive most-frequent estimation would erroneously label these advertisements as being people-centric. However, using our heuristics, each of these commercials was correctly labeled as being about the actual object class to which it belonged.
We used the classical approach of spatial-domain mosaicing to obtain an aesthetic mosaic representation of the selected target image. The images that Module 1 identified and stored as containing the salient object were used as tiles by this module to reconstruct the target image as a mosaic. The target image was divided into blocks, and for each block the best-matching tile was selected on the basis of color distribution and correlation. The resulting crude mosaic was then enhanced using several techniques. The key steps of the module are outlined below:
The images presented below illustrate the performance of the Mosaicing module on 4 commercials advertising a phone (Windows), a car (Mercedes Benz), a tennis racket (Wilson), and a laptop (Microsoft Surface) respectively. Each set of images shows the entire process of the mosaicing module: the input target image was converted into a crude mosaic, which was then processed with an anti-aliasing filter, followed by local reconciliation and finally resolution-based reconciliation to obtain the final image. It is worth noting that our module was able to generate a relatively good mosaic even for a target image with significant motion, or where the collected tile images did not have much variation. These mosaics have relatively good aesthetic appeal and are also computationally fast to generate.
Despite having remarkable accuracy, the object detection module was by no means perfect in its classification of objects. A major cause of these mis-classifications was found to be motion-blur and scene-transition blurring in videos that possessed slow or smooth fade-in/fade-out transitions between different scenes. This transition-blur caused the image to possess artifacts that were a combination of artifacts from separate scenes. Figures below illustrate misclassification due to motion-blur and scene-transition blur.
Another classification error occurred in peculiar circumstances wherein a flexible object took on a shape resembling a different entity. For instance, the figures below illustrate the misclassification of a human arm (shaped like an elephant's ears and trunk) as an elephant, the Mercedes logo mistaken for a clock, and a bag (shaped like a horse's snout) labeled as a horse.
In order to handle each of the aforementioned imperfections in classification, we suppressed unreliable detections during the salient object identification phase using a composite of frequency and confidence thresholding. Detected objects falling short of these thresholds were no longer considered candidates for being the salient object of focus.
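The composite thresholding can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's actual code: the record layout (`label`, `confidence`), the function name and the 20% frequency cutoff are hypothetical. Only the 50% confidence cutoff is taken from the description of the module.

```python
from collections import Counter


def filter_detections(frames, min_confidence=0.5, min_frame_fraction=0.2):
    """Return per-class frame counts after confidence and frequency thresholding.

    `frames` holds one list per frame; each detection is assumed to look like
    {"label": str, "confidence": float}.
    """
    frame_counts = Counter()
    for detections in frames:
        # Count each class at most once per frame, ignoring weak detections.
        labels = {d["label"] for d in detections
                  if d["confidence"] >= min_confidence}
        frame_counts.update(labels)

    # Drop classes that appear in too small a share of the frames.
    min_frames = min_frame_fraction * len(frames)
    return {label: n for label, n in frame_counts.items() if n >= min_frames}


person = {"label": "person", "confidence": 0.9}
frames = (
    [[person, {"label": "cell phone", "confidence": 0.8}]] * 6
    + [[person, {"label": "clock", "confidence": 0.3}]] * 3
    + [[person, {"label": "elephant", "confidence": 0.7}]]
)
print(filter_detections(frames))
```

In this made-up example the low-confidence clock and the single-frame elephant are both suppressed, leaving person (10 frames) and cell phone (6 frames) as candidates.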
The lack of computational power due to the absence of access to GPU-equipped machines was a common factor that limited our ability to improve certain undesirable outcomes. Classifying a video currently takes longer than the length of the video itself, which is undesirable for scalability reasons. This could be avoided if machines with GPUs were accessible to us, since the object identification task naturally lends itself to parallelization. The lack of computing power is also the primary reason we avoided repeated iterative training or tweaking of the network architecture, given the huge training and testing time overheads.
Since most advertisements feature humans in some capacity or other, people were often the most frequently detected object in commercials. Heuristically, we decided to disqualify people from being candidates for the object of focus. More generally, this heuristic can be extended to most living creatures, which are seldom advertised in commercials. It is worth noting that this strategy is applicable only in the presence of another object that has a significant frequency, even if that frequency is lower than that of the person class. This ensures that mobile phone commercials, despite featuring more humans than phones, are still classified correctly, while political campaign advertisements are accurately identified as being people-centric.
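A minimal sketch of this heuristic is given below. It assumes per-class frame counts are already available (for example, after thresholding); the function name, the `deprioritized` tuple and the 25% significance cutoff are hypothetical choices made for illustration.

```python
def pick_salient_object(frame_counts, total_frames,
                        deprioritized=("person",), min_fraction=0.25):
    """Pick the most frequent class, skipping deprioritized classes whenever
    a sufficiently frequent alternative exists."""
    if not frame_counts:
        return None
    min_frames = min_fraction * total_frames
    # Candidates: non-deprioritized classes with a significant frequency.
    candidates = {label: n for label, n in frame_counts.items()
                  if label not in deprioritized and n >= min_frames}
    # Fall back to all classes (including people) if nothing else qualifies.
    pool = candidates or frame_counts
    return max(pool, key=pool.get)


# Phone commercial: people appear more often, but the phone is significant.
print(pick_salient_object({"person": 10, "cell phone": 6}, 10))  # cell phone
# Campaign-style video: no other object is frequent enough.
print(pick_salient_object({"person": 10, "tie": 1}, 10))         # person
```

The fallback to the full set of classes is what lets a people-centric video still be labeled as such.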
Poor frame selection was another challenge, caused by using constant thresholds for brightness and sharpness across different videos. Each video has its own unique color scheme, lighting and background settings, and hence coming up with a single threshold value for a wide range of such advertisements was impossible. We adopted a statistical approach that leverages each video’s unique brightness and sharpness distributions. We computed thresholds for different parameters depending upon their importance in the target image selection. We fitted a Gaussian to the data accumulated over the frames of each video to determine model parameters, which were then used to compute the thresholds.
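One way such per-video thresholds might be computed is sketched below. It uses `statistics.NormalDist` from the Python standard library to fit a mean and standard deviation to each per-frame score. The rule `mean + k * stdev`, the `k` values and the sample scores are assumptions for illustration, since the exact threshold formula is not stated here.

```python
from statistics import NormalDist


def adaptive_threshold(values, k=0.5):
    """Fit a Gaussian to per-frame scores and return mean + k * stdev."""
    fitted = NormalDist.from_samples(values)
    return fitted.mean + k * fitted.stdev


def select_frames(brightness, sharpness, k_brightness=0.0, k_sharpness=0.5):
    """Return indices of frames whose brightness and sharpness both clear
    the thresholds fitted to this particular video."""
    b_min = adaptive_threshold(brightness, k_brightness)
    s_min = adaptive_threshold(sharpness, k_sharpness)
    return [i for i, (b, s) in enumerate(zip(brightness, sharpness))
            if b >= b_min and s >= s_min]


# Made-up per-frame scores for a five-frame clip.
brightness = [90, 100, 110, 120, 130]
sharpness = [10, 40, 20, 50, 30]
print(select_frames(brightness, sharpness))  # [3]
```

Using a different `k` per parameter is one simple way to weight parameters by their importance, because a larger `k` makes that criterion stricter.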
Imperfect tile matching was a major challenge in this module, primarily due to image noise and the resizing of tiles. We applied an anti-aliasing filter to the image and the tiles in order to smooth them before calculating the average color correlation between the tile image and the block of the target image. This smoothing helped reduce the errors in the matching process that were caused by certain high-frequency components and noise in the images.
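The idea of smoothing before matching can be illustrated with the simplified sketch below. It is not the project's implementation: it works on small single-channel grids stored as nested lists, uses a 3x3 mean filter as a stand-in for the anti-aliasing filter, and scores tiles by Pearson correlation alone. All function names are hypothetical.

```python
def box_blur(img):
    """Smooth a 2-D grid with a 3x3 mean filter (windows shrink at borders)."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[yy][xx]
                      for yy in range(max(0, y - 1), min(h, y + 2))
                      for xx in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


def correlation(a, b):
    """Pearson correlation of two equally sized grids (0.0 if either is flat)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0


def best_tile(block, tiles):
    """Index of the tile whose smoothed pixels correlate best with the block."""
    smoothed_block = box_blur(block)
    scores = [correlation(smoothed_block, box_blur(t)) for t in tiles]
    return max(range(len(tiles)), key=scores.__getitem__)


block = [[0, 50, 100]] * 3           # left-to-right gradient
tiles = [
    [[100, 50, 0]] * 3,              # reversed gradient
    [[10, 60, 110]] * 3,             # same gradient, slightly brighter
    [[50, 50, 50]] * 3,              # flat tile
]
print(best_tile(block, tiles))       # 1
```

Correlation ignores absolute brightness, so a practical matcher would combine it with a color-distribution term, as the module description indicates.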
Another problem was that dull mosaics were created due to the limited color diversity of the tile images. We tried to find the best-matching tile for each block of the target image, but because of the limited color variation in the tile images, the selected tile was often not a close enough match. We therefore tried to enhance the tile color by using color information from the local vicinity of the tile, as well as by taking into account the color loss that occurred due to resolution change and image resizing. This enhancement helped us generate mosaics that are more vibrant and appealing. An example of the improvement that we achieved can be seen in the image below:
Please feel free to reach out to us in case of any questions or comments.