Presentation abstracts

Introduction to Point Clouds and Recent Advances in Point Cloud Construction

Point clouds are sets of data points representing a 3D object. They are tied to computer vision through the problem of making a machine understand a 3D object observed first as a single 2D image or a set of 2D images. Humans naturally perceive 3D objects with the help of binocular vision and cues recognized by the visual cortex, but machines must be programmed to generate a 3D model of the objects they can distinguish. After the problem of object detection comes the problem of 3D recognition. This presentation will discuss the difficulty of generating accurate point clouds and some of the unique solutions produced by neural network research. It will begin with a brief background on point clouds and some discussion of generating them from images. Ideally, the presentation will include a demonstration of point cloud generation from a simple image using existing software or open-source code. The demonstration will be followed by the necessary background on neural networks. Finally, there will be a discussion of the methods described in the article "Dense 3D Point Cloud Reconstruction Using a Deep Pyramid Network," which proposes a deep pyramidal network to predict point clouds. The resulting architecture predicts successively higher-resolution point clouds from an initial point cloud, creating a more accurate reconstruction of the object and demonstrating an understanding of the object's form. While not comprehensive, this presentation should give the audience a brief introduction to 3D object recognition and tie it to the applications of neural networks, a pervasive topic in computer vision research.
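As a concrete illustration of the basic point cloud idea (not the deep pyramid network itself), the classical way to obtain a point cloud from a single depth image is pinhole-camera back-projection. This is a minimal sketch; the function name and the intrinsic parameters (fx, fy, cx, cy) are illustrative assumptions.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into an N x 3 point cloud
    using the pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]  # drop invalid (zero-depth) pixels

# A flat surface 2 m from the camera yields points all at z = 2.
depth = np.full((4, 4), 2.0)
cloud = depth_to_point_cloud(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
```

Learning-based methods like the deep pyramid network replace this geometric step with a network that predicts the 3D points directly from RGB input, then refines them to higher resolution.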


Mandikal, Priyanka and R. Venkatesh Babu. "Dense 3D Point Cloud Reconstruction Using a Deep Pyramid Network." CoRR abs/1901.08906, 2019.

Audio Signal Encoding for Cough and Environmental Sound Classification Using Convolutional Neural Networks

Cough detection has considerable clinical value, as it can provide an objective basis for the assessment and diagnosis of respiratory diseases. Motivated by the achievements of CNNs in recent years, in the first experiment the authors adopted five different ways of encoding audio signals as images and treated the results as input to CNNs, so that image processing techniques could be applied to analyze audio signals. To find the optimal encoding method, comparative experiments were performed on a medical dataset containing 70,000 audio segments from 26 patients. Experimental results show that the RASTA-PLP spectrum is the best method for encoding audio signals as images for the cough classification task, giving an average accuracy of 0.9965 over 200 iterations on test batches and an F1-score of 0.9768 on samples re-sampled from the test set. In the second experiment, a deep model consisting of two convolutional layers with max-pooling and two fully connected layers is trained on a low-level representation of audio data (segmented spectrograms) with deltas. The accuracy of the network is evaluated on three public datasets of environmental and urban recordings. The model outperforms baseline implementations relying on mel-frequency cepstral coefficients and achieves results comparable to other state-of-the-art approaches. The image processing-based method is therefore shown to be a promising choice for processing audio signals.
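To make the "encode audio as an image" step concrete, here is a minimal sketch of one common encoding, a log-magnitude spectrogram built from a windowed short-time FFT. This is not RASTA-PLP (which requires specialized perceptual filtering); frame length and hop size are illustrative assumptions.

```python
import numpy as np

def log_spectrogram(signal, frame_len=256, hop=128):
    """Encode a 1-D audio signal as a 2-D log-magnitude spectrogram
    'image' (frames x frequency bins) suitable as CNN input."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(spec)  # compress dynamic range for the CNN

# A 440 Hz tone sampled at 8 kHz concentrates energy in one frequency bin.
t = np.arange(16000) / 8000.0
sig = np.sin(2 * np.pi * 440 * t)
img = log_spectrogram(sig)
```

Each row of `img` is one time frame and each column one frequency bin, so standard 2-D convolutional layers can be applied directly.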

DeepTest: Automated Testing of Autonomous Cars

Advances in Deep Neural Networks have driven rapid development of DNN-controlled autonomous cars. By utilizing sensors such as cameras and LIDAR, a car can drive independently without any human intervention. Manufacturers that design such cars strive to improve them and to learn more about the different types of autonomous vehicles. Laws regulate autonomous vehicles during their fast-tracked testing process, but, despite these advances, Deep Neural Networks often have bugs that can lead to a potentially fatal collision. Most autonomous vehicle development is heavily dependent on manual collection of test data under a variety of conditions. To overcome this, several steps must be taken. Designing tests that automatically detect erroneous behaviors of autonomous vehicles that could lead to a fatal crash is crucial to solving this problem. Real-world changes in driving conditions, such as fog, rain, and blurring, must also be considered.
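The core idea of generating test inputs under changed driving conditions can be sketched with simple synthetic image transformations. DeepTest itself uses a richer transformation set; the two functions below are illustrative assumptions showing only brightness shift (lighting change) and box blur (defocus/rain).

```python
import numpy as np

def adjust_brightness(img, delta):
    """Simulate a lighting change by shifting pixel intensities."""
    return np.clip(img.astype(np.int16) + delta, 0, 255).astype(np.uint8)

def box_blur(img, k=3):
    """Simulate blur with a simple k x k mean filter (edge-padded)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge').astype(np.float64)
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(-pad, pad + 1):
        for dx in range(-pad, pad + 1):
            out += padded[pad + dy : pad + dy + img.shape[0],
                          pad + dx : pad + dx + img.shape[1]]
    return (out / (k * k)).astype(np.uint8)

img = np.full((8, 8), 100, dtype=np.uint8)  # stand-in camera frame
bright = adjust_brightness(img, 50)
blurred = box_blur(img)
```

A testing framework would feed both the original and transformed frames to the driving model and flag cases where the predicted steering behavior diverges.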

The Optimization of Objectness Estimation

Over the last few years, objectness detection and estimation has been a major topic of discussion in computer vision. Essentially, objectness detection can be described as the process by which objects are searched for within a given image. A "sliding window" method is often used as a search strategy, which is effective but can be quite time consuming. "Objectness" is usually a scalar value that represents how likely an image window is to cover an object within the image. In general, this process can be optimized by employing several methods to create an environment in which objects can be more easily detected. Of course, the effectiveness of these methods will depend on the image itself. However, processing time and memory usage can often be improved by these adjustments, even when they are small. In my presentation, I will discuss a couple of approaches and their attempts to improve and optimize objectness detection for different scenarios. One of these approaches uses "Superpixel Straddling" to measure objectness within an image. The results of this strategy will be discussed, along with a couple of changes that I believe could lead to further improvement.
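The sliding-window strategy described above can be sketched in a few lines. The scoring function here is a toy stand-in (mean window intensity), not Superpixel Straddling or any published objectness measure; window size and stride are illustrative assumptions.

```python
import numpy as np

def sliding_window_scores(image, win, stride, score_fn):
    """Slide a win x win window across the image and record an
    'objectness' score for each window position (top-left corner)."""
    h, w = image.shape
    scores = {}
    for y in range(0, h - win + 1, stride):
        for x in range(0, w - win + 1, stride):
            scores[(y, x)] = score_fn(image[y:y + win, x:x + win])
    return scores

# Toy score: mean intensity, so the window fully covering the bright
# square "object" scores highest.
img = np.zeros((16, 16))
img[4:12, 4:12] = 1.0
scores = sliding_window_scores(img, win=8, stride=4, score_fn=np.mean)
best = max(scores, key=scores.get)
```

The cost of this loop grows with the number of window positions, which is exactly why cheap, well-designed objectness scores (and coarser search strategies) matter for processing time.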


Hand Gesture Recognition for Human-Computer Interaction

In addition to verbal communication, non-verbal modes such as hand gestures are one of the key aspects of human-to-human interaction. Adding natural forms of interaction to computers and other devices would make them more convenient for different groups of people, irrespective of their knowledge of modern-day computing. For a long time, human-to-computer interaction has been limited to input devices such as the mouse and keyboard. Natural language processing, in recent years, has gradually expanded, showing the potential of natural modes of Human-Computer Interaction. To date, there have been several works on hand gesture recognition. Modern handheld devices like smartphones, tablets, and laptops, as well as some other electronic devices, are equipped with technologies that can recognize a few hand gestures performed while touching the device. Computer vision can improve current Human-Computer Interaction by running vision algorithms on captured images or video to understand what the user is signaling with hand gestures.
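As a minimal sketch of the lowest-level vision step in a gesture pipeline, frame differencing can flag the region where a hand is moving between two video frames. Real gesture recognizers add skin models, tracking, and classifiers on top of this; the function below and its threshold are illustrative assumptions.

```python
import numpy as np

def motion_mask(prev_frame, curr_frame, threshold=25):
    """Flag pixels whose intensity changed between consecutive frames --
    a crude first step toward locating a moving hand."""
    diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return diff > threshold

prev = np.zeros((6, 6), dtype=np.uint8)   # stand-in grayscale frames
curr = prev.copy()
curr[2:4, 2:4] = 200  # a "hand" appears in the center
mask = motion_mask(prev, curr)
```

The resulting boolean mask would then be segmented and passed to a gesture classifier in a full system.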

Generative Adversarial Networks: An Overview

Generative Adversarial Networks (GANs) usually consist of a pair of deep neural networks: a Generator network and a Discriminator network. The Generator learns to simulate samples from the complex, high-dimensional distribution of real-world data. Simultaneously, the Discriminator learns to distinguish real samples from the simulated samples produced by the Generator. Training GANs is thus a two-player game in which the Generator's goal is to fool the Discriminator by minimizing the difference between the distribution of the real-world data and that of the simulated data. During training, the Generator has no direct access to the real data, while the Discriminator has access to both real and simulated data. The training error from the Discriminator is passed back to the Generator, helping it learn to generate fake (simulated) data of better quality. The Generator wins when the Discriminator performs no better than random guessing at distinguishing real from fake data. Error back-propagation through these competing networks allows GANs to learn deep representations from limited amounts of labeled training data. GANs are an emerging technique for both unsupervised and semi-supervised learning tasks, with many applications in computer vision such as image synthesis, semantic image editing, style transfer, image super-resolution, image classification, object detection, and object recognition. In this presentation, I will provide a brief overview of different types (architectures) of GANs and their common applications in computer vision, including fully connected GANs, convolutional GANs, conditional GANs, bidirectional GANs, adversarial autoencoders, and global and local perception GANs.
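The two-player game described above is usually written as the minimax objective from Goodfellow et al. (2014), where the Discriminator D maximizes the value function and the Generator G minimizes it:

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\!\left[\log D(x)\right] +
  \mathbb{E}_{z \sim p_z(z)}\!\left[\log\left(1 - D(G(z))\right)\right]
```

At the equilibrium of this game, the Generator's distribution matches the data distribution and D(x) = 1/2 everywhere, i.e., the Discriminator is reduced to random guessing.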


Creswell, Antonia, Tom White, Vincent Dumoulin, Kai Arulkumaran, Biswa Sengupta, and Anil A. Bharath. "Generative Adversarial Networks: An Overview." IEEE Signal Processing Magazine, Special Issue on Deep Learning for Visual Understanding, 2017.

Goodfellow, Ian J., Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. "Generative Adversarial Nets." NIPS, 2014.

DLib: Overview and Comparison to OpenCV

This presentation provides information on what the DLib library offers and how DLib compares with and is used alongside OpenCV. First, I will discuss what DLib is and why someone would want to use it. DLib is an open source project, so one slide will explain the license DLib uses and one slide will explain the concepts and benefits of open source software. This leads into a brief spiel on why Linux is important in the computer vision world and, more broadly, in computing in general. I will recommend that students try using a virtual machine to get their hands dirty, and that they try DLib and OpenCV with Python. I will demonstrate a quick example of DLib with a facial landmark program, using a pre-trained facial landmark detector from the DLib library to estimate the locations of 68 (x, y) coordinates that map to structures on the face. This example is very brief. The code works well on front-facing faces of acceptable resolution that are not obscured by glasses, noise, or compression artifacts. Next, I will move on to the main experiment: DLib vs. OpenCV. I will use two different programs to apply HoG face detection in DLib and in OpenCV, and in this section I will analyze the computing performance and accuracy of each approach. For accuracy, we will analyze how the different variables that obscure a face in an image affect each approach. If time remains, I will take questions about anything covered in the slides.
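To give a feel for the descriptor underlying both detectors in the comparison, here is a simplified single-cell Histogram of Oriented Gradients computation. This is neither DLib's nor OpenCV's actual implementation (both add cell grids, block normalization, and a trained classifier); the function and bin count are illustrative assumptions.

```python
import numpy as np

def hog_cell_histogram(patch, n_bins=9):
    """HoG for one cell: bin gradient orientations (0-180 degrees),
    weighted by gradient magnitude, then L2-normalize."""
    gy, gx = np.gradient(patch.astype(np.float64))
    magnitude = np.hypot(gx, gy)
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bins = np.minimum((angle / 180.0 * n_bins).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b, m in zip(bins.ravel(), magnitude.ravel()):
        hist[b] += m
    return hist / (np.linalg.norm(hist) + 1e-6)

# A horizontal intensity ramp has gradients purely along x, so nearly
# all energy lands in the 0-degree orientation bin.
patch = np.tile(np.arange(8, dtype=np.float64), (8, 1))
hist = hog_cell_histogram(patch)
```

A full HoG face detector concatenates many such cell histograms over a detection window and scores the resulting vector with a linear classifier, which is where the two libraries' implementations (and speeds) diverge.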

Super Resolution for Synthetic Aperture Radar

Synthetic Aperture Radar (SAR) uses radar signal processing to form two-dimensional images of a scene. SAR imagery is prone to high noise during capture, both for RCS imaging and for air-to-ground image processing. Super resolution can recover hidden details and suppress this noise; it is useful for combining many low-resolution images to generate a high-resolution image. I will also tie my experience in the work field into this topic.
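The simplest multi-frame idea mentioned above, suppressing noise by combining many co-registered low-resolution captures, can be sketched as follows. Real SAR super resolution additionally involves sub-pixel registration and spectral processing; the scene, noise level, and frame count here are illustrative assumptions.

```python
import numpy as np

def multiframe_average(frames):
    """Average co-registered frames; zero-mean noise shrinks
    roughly as 1/sqrt(N) while the underlying scene is preserved."""
    return np.mean(np.stack(frames), axis=0)

rng = np.random.default_rng(0)
truth = np.zeros((32, 32))
truth[12:20, 12:20] = 1.0  # a bright reflector patch in the scene
frames = [truth + rng.normal(0, 0.5, truth.shape) for _ in range(64)]
avg = multiframe_average(frames)

single_err = np.abs(frames[0] - truth).mean()  # one noisy capture
avg_err = np.abs(avg - truth).mean()           # combined result
```

With 64 frames the residual noise is roughly an eighth of a single capture's, which is why stacking many low-resolution acquisitions is attractive before any resolution enhancement step.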

Human Motion Recognition for Computer Vision Enhanced RGB-D Sensors

High-resolution depth and visual (RGB-D) sensing has become available through widespread commercial technology. Niche camera systems like the Asus Xtion sensor, combined with computer vision, form the foundation of this platform. Assuming a stable scene filmed by the sensor, we can extract the baseline data sets needed for accurate human action/activity recognition. Classic object identification and tracking algorithms based on RGB-D images are not always consistent; this mainly occurs when the environment is disarranged or the lighting conditions fluctuate, both of which happen frequently in real-world situations. The scientific issue is the correlation between per-pixel depth and RGB information when one of them is missing or corrupted, which degrades decisions made from the fused but incomplete information. This can be mitigated through techniques that spatially calibrate and correlate the depth image with the RGB images.
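One minimal mitigation for corrupted per-pixel depth is to fill invalid readings from their valid neighbors before fusing with RGB. Production systems typically use calibrated, RGB-guided inpainting instead; the neighborhood-median approach, function name, and invalid-value convention below are illustrative assumptions.

```python
import numpy as np

def fill_missing_depth(depth, invalid=0.0, k=3):
    """Replace invalid depth pixels with the median of their valid
    k x k neighbors -- a simple patch for dropped sensor readings."""
    out = depth.copy()
    pad = k // 2
    h, w = depth.shape
    for y, x in zip(*np.where(depth == invalid)):
        y0, y1 = max(0, y - pad), min(h, y + pad + 1)
        x0, x1 = max(0, x - pad), min(w, x + pad + 1)
        neighbors = depth[y0:y1, x0:x1]
        valid = neighbors[neighbors != invalid]
        if valid.size:
            out[y, x] = np.median(valid)
    return out

depth = np.full((5, 5), 2.0)  # scene 2 m away
depth[2, 2] = 0.0             # one dropped/corrupted reading
fixed = fill_missing_depth(depth)
```

After this repair step, each depth pixel can be reliably paired with its spatially calibrated RGB counterpart for the downstream action recognition pipeline.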