Theses and dissertations



  Thesis   MENDONÇA, MARCELO. Introducing a self-supervised, superfeature-based network for video object segmentation. 129 p.

Abstract: Video object segmentation (VOS) is a complex computer vision task that involves identifying and separating the pixels in a video sequence based on regions, which can be either the background or foreground of the scene, or even specific objects within it. The task must be accomplished consistently throughout the sequence, ensuring that the same object or region receives the same label in all frames. Recent advances in deep learning techniques and high-definition datasets have led to significant progress in the VOS area. Modern methods can handle complex video scenarios, including multiple objects moving over dynamic backgrounds. However, these methods rely heavily on manually annotated datasets, which can be expensive and time-consuming to create. Alternatively, self-supervised methods have been proposed to eliminate the need for manual annotations during training. These methods utilize intrinsic properties of videos, such as the temporal coherence between frames, to generate a supervisory signal for training without human intervention. The downside is that self-supervised methods often demand extensive training data to effectively learn the VOS task without supervision. In this work, we propose Superfeatures in a Highly Compressed Latent Space (SHLS), a novel self-supervised VOS method that dispenses with manual annotations while substantially reducing the demand for training data. Using a metric learning approach, SHLS combines superpixels and deep learning features, enabling us to learn the VOS task from a small dataset of unlabeled still images. Our solution is built upon Iterative over-Segmentation via Edge Clustering (ISEC), our efficient superpixel method that provides the same level of segmentation accuracy as top-performing superpixel algorithms while generating significantly fewer superpixels. This is especially useful for processing videos, where the number of pixels increases over time.
Our proposed SHLS embeds convolutional features from the frame pixels into the corresponding superpixel areas, resulting in ultra-compact image representations called superfeatures. The superfeatures comprise a latent space where object information is efficiently stored, retrieved, and classified throughout the frame sequence. We conduct a series of experiments on the most popular VOS datasets and observe interesting results. Compared to state-of-the-art self-supervised methods, SHLS achieves the best performance on the single-object segmentation test of the DAVIS-2016 dataset and ranks in the top five on the DAVIS-2017 multi-object test. Remarkably, our method was trained with only 10,000 still images, standing out from the other self-supervised methods, which require much larger video-based datasets. Overall, our proposed method represents a significant advancement in self-supervised VOS, offering an efficient and effective alternative to manual annotations and significantly reducing the demand for training data.


  Thesis   FONTINELE, JEFFERSON. Paying attention to the boundaries in semantic image segmentation. 74 p.

Abstract: Image segmentation consists of assigning a label to each pixel in the image in such a way that pixels belonging to the same object must have the same label. The segmented area of an object must span all pixels up to the limits (boundaries) with the other objects. The boundary region can provide helpful information for the segmentation process, as it marks a discontinuity that can define a segment limit. However, segmentation methods commonly struggle to exploit boundary information and, consequently, to segment this region. This is mainly due to the proximity between image regions containing different labels. In view of that, we propose to investigate how to take boundary information into account when semantically segmenting an image object. Our first contribution is a graph-based image segmenter, called interactive dynamic programming (IDP)-expansion. This is a weakly-supervised method that requires a seed in each object targeted for segmentation in the image, subsequently minimizing an energy function to obtain the image labels. IDP-expansion uses dynamic programming to initialize an alpha-expansion algorithm over superpixels to improve boundary information in the segmentation process. Over the Berkeley segmentation data set, our experiments showed that IDP-expansion is 51.2% faster than a traditional alpha-expansion based segmentation. Although IDP-expansion has been shown to be faster, it suffers from two limitations: a mandatory seed initialization and the lack of semantic information. This further led us to develop a supervised convolutional neural network architecture to semantically explore boundary information. Our novel method, called DS-FNet, uses two streams integrated in an end-to-end convolutional network to combine segmentation and boundary information based on an attention-aware mechanism.
To evaluate DS-FNet, we initially conducted experiments on general-purpose (Pascal Context) and traffic (Cityscapes, CamVid, and Mapillary Vistas) image data sets, having the mean intersection over union (mIoU) as the reference metric. DS-FNet outperformed ten segmentation networks on the Pascal Context, Cityscapes, and CamVid data sets. On the Mapillary Vistas data set, DS-FNet achieved second place when compared to five other methods. A second round of experiments was performed to evaluate the generalization of our proposed method on challenging medical imaging data sets, containing several kidney biopsy whole slide images (WSIs). The data sets used to evaluate the second version of our network were HubMAP, WSI Fiocruz, and a subset of the Neptune data set, all considering glomerulus segmentation. After training DS-FNet only on the HubMAP data set, containing periodic acid-Schiff (PAS)-stained WSIs with only non-injured glomeruli, we found that our network was capable of segmenting glomeruli on WSIs stained by other methods (hematoxylin-eosin (HE), periodic acid-methenamine silver (PAMS), trichrome (TRI), and silver (SIL)). The results of these latter experiments show that our model is more robust than other models based on the U-Net architecture. All the experiments and analyses presented in this work demonstrated that the explicit and adequate consideration of boundary information improves the results over non-boundary segmentation methods.
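
The mIoU metric named in the abstract above can be illustrated with a minimal sketch. The abstract does not give the exact evaluation protocol, so the function name `mean_iou`, the flat-list input format, and the choice to skip classes absent from both maps are assumptions made for illustration only.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection over union for flat lists of per-pixel class labels.

    Assumption: classes that appear in neither map are skipped rather than
    counted as IoU 0 or 1; evaluation protocols differ on this point.
    """
    ious = []
    for c in range(num_classes):
        # Pixels labeled c in both maps (intersection) and in either map (union).
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six pixels, three classes.
truth = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
score = mean_iou(truth, pred, 3)
print(f"mIoU = {score:.3f}")
```

In this toy case the per-class IoU values are 1/3, 2/3, and 1/2, which average to 0.5.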


  Thesis   ABDALLA, KALYF. From modeling perceptions to evaluating video summarizers. 74 p.

Abstract: Hours of video are uploaded to streaming platforms every minute, with recommender systems suggesting popular and relevant videos that can help users save time in the searching process. Video summarizers have been developed to detect the video’s most relevant parts, automatically condensing them into a shorter video. Currently, evaluating this type of method is challenging since the metrics do not assess user annotations’ subjective criteria, such as conciseness. To address the conciseness criterion, we propose a novel metric to evaluate video summarizers at multiple compression rates. Our metric, called Compression Level of USer Annotation (CLUSA), assesses the video summarizers’ performance by matching the predicted relevance scores directly. To do so, CLUSA generates video summaries by gradually discarding video segments from the relevance scores annotated by users. After grouping the generated video summaries by compression rate, CLUSA matches them to the predicted relevance scores. To preserve relevant information in concise video summaries, CLUSA weighs the video summarizers’ performance in each compression range to compute an overall performance score. As CLUSA weighs all compression ranges even when user annotations do not span some compression rates, the baseline changes with each video summarization data set. Hence, the interpretation of the video summarizers’ performance score is not as straightforward as with other metrics. In our experiments, we compared CLUSA with other evaluation metrics for video summarization. Our findings suggest that all analyzed metrics evaluate video summarizers appropriately using binary annotations. For multi-valued ones, CLUSA proved to be more suitable, preserving the most relevant video information in the evaluation process.

  Dissertation   VIEIRA, GABRIEL. Gaze Estimation via Attention-Augmented Convolutional Networks. 105 p.

Abstract: Gaze estimation is highly relevant to applications in multiple fields, including but not limited to interactive systems, specialized human-computer interfaces, and behavioral research. Like many other computer vision tasks, gaze estimation has greatly benefited from the advancement of deep learning in the past decade. A number of large-scale data sets for appearance-based gaze estimation have been made public, and neural networks have been established as the core of the state-of-the-art approaches to this task. Currently, there is still room for improvement with regard to the network architectures used to perform appearance-based gaze estimation. One promising avenue to improve gaze estimation accuracy is to take into account the head pose information contained in facial images. A few published works do this by using the entire face image as the input to the network. One drawback to this strategy is that traditional convolutional neural networks are not able to form long-range spatial relationships within images. This is a significant factor in head pose estimation, given that the pose is determined by a combination of different features from eyes, nose, mouth, etc. To this end, we propose a novel approach that uses self-attention augmented convolution layers to improve the quality of the learned features. This is done by giving the CNN the ability to form long-range complex spatial relationships. We propose the use of a shallower residual network with attention-augmented convolutions, which we dubbed ARes-14. We show that by using ARes-14 as a backbone, it is possible to outperform deeper architectures by learning dependencies between distant regions in full-face images, creating better and more spatially-aware feature representations derived from the face and eye images before gaze regression.
An interesting side effect of our approach is that it can create more visually interpretable intermediary representations derived from the attention weights used by the self-attention layers, enabling interesting discussions about the learning process of the network. We dubbed our gaze estimation framework ARes-gaze, which uses our Attention-augmented ResNet (ARes-14) as twin convolutional backbones. In our experiments, results showed a decrease of the average angular error by 2.38% when compared to state-of-the-art methods on the MPIIFaceGaze data set, and achieved second place on the EyeDiap data set. It is worth noting that our proposed framework was the only one to reach high accuracy simultaneously on both data sets among the evaluated methods.
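
The angular error reported above can be illustrated with a short sketch. The abstract does not define the metric, so this assumes the common definition: the angle between the predicted and ground-truth 3D gaze direction vectors. The function name `angular_error_deg` and the sample vectors are hypothetical.

```python
import math


def angular_error_deg(g_pred, g_true):
    """Angle in degrees between two 3D gaze direction vectors.

    Assumption: gaze is represented as a (not necessarily unit) 3D vector,
    and the error is the angle between prediction and ground truth.
    """
    dot = sum(a * b for a, b in zip(g_pred, g_true))
    norm = math.sqrt(sum(a * a for a in g_pred)) * math.sqrt(sum(b * b for b in g_true))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


# Hypothetical prediction slightly off from a straight-ahead gaze.
print(angular_error_deg((0.05, 0.0, -1.0), (0.0, 0.0, -1.0)))
```

Averaging this quantity over all test samples gives a mean angular error of the kind used to compare gaze estimators.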

  Dissertation   ESTRELA, LEANDRO. DMT: Um dispositivo mecatrônico termográfico de baixo custo. 91 p.

Abstract: The discovery of infrared radiation and the development of technologies capable of detecting it made possible the emergence of thermography as a science. Thermography is a technique for graphically recording body temperatures, with the intention of distinguishing areas with different temperatures. Bodies with temperatures above -273 °C are capable of emitting infrared radiation. This characteristic makes it possible to study the behavior of temperature in different objects, structures, and surfaces over time. Applications of thermography span security and military uses, such as border surveillance, search and rescue, sea patrols, and coastal surveillance; wildlife, through studies of the thermal physiology of animals; health care and veterinary medicine, involving the diagnosis of various diseases and work-related injuries, as well as behavioral studies and diagnoses in animals; and engineering, in inspections of electrical and mechanical equipment, building inspections, conformity checks of air conditioning systems, and the management and maintenance of installations. This is not an exhaustive list. The biggest limitation for the development of studies and applications involving thermography is the high cost of the equipment. Pocket thermal cameras like the Fluke PTi120 and FLIR C2 cost an average of R$ 6,000.00 and R$ 3,000.00, respectively. Thermal cameras with additional technologies and features, like the FLIR T1020 HD and Fluke TiX 580, cost an average of US$ 41,500.00 and US$ 14,000.00, respectively. The price variation is related to the intended application of the equipment, its precision, camera resolution, sensor resolution, temperature measurement limits, and the embedded technologies produced by each manufacturer.
In this sense, the purpose of this project is to present the development of a low-cost thermographic mechatronic device (DMT) responsible for the production of thermal images of objects, covering the stages of construction of the physical structure, data acquisition, movement system, system electronics and control, electrical system, and graphic interface for device control and image formation. The DMT has an accuracy of ± 1 °C at ambient temperatures between 0 °C and 50 °C, has a useful reading area of 20 × 22 cm, produces images with 320 × 360 pixels, and is capable of reading objects with temperatures between 0 °C and 300 °C. The construction cost of this device was less than R$ 800.00, and it can be used for thermography studies of small bodies such as the human hand, small objects with heat variation, electronic circuits and components, and portable devices such as smartphones, lithium-ion batteries, or tablets.

  Thesis   ARAÚJO, POMPÍLIO. Intelligent drones to investigate criminal scenes. 99 p.

Abstract: A location associated with a committed crime must be preserved, even before criminal experts start collecting and analyzing evidence. Indeed, crime scenes should be recorded with minimal human interference. In order to help specialists accomplish this task, we propose an intelligent system for the investigation of a crime scene using a drone. Our system recognizes objects considered important evidence at the crime scene, defining the trajectories along which a drone performs a detailed search. Existing methods are not dedicated to seeking evidence at a crime scene or are not specific to unmanned aircraft. Our system is structured into three subsystems: (i) aircraft auto-location, called Air-SSLAM, which estimates the drone pose and provides coordinates to proportional-integral-derivative controllers for aircraft stabilization; (ii) controllers that act in pairs in each direction to keep the aircraft on a trajectory calculated from the initially detected evidence; and (iii) a new multi-perspective detector that analyzes multiple images of the same object in order to improve the reliability of object recognition. The goal is to make the drone fly through the paths defined by the objects recognized in the scene. Each subsystem was evaluated separately, as was the complete system. In the end, Air-SSLAM presented a translational average error (TAE) of 0.10, 0.20, and 0.20 meters in the X, Y, and Z directions, respectively, and the average accommodation time of the controllers was between 10 and 20 seconds. Our multi-perspective detection method increased the detection rate of the baseline detector by 18.2%. In our experiments, we showed that the more perspectives, the higher the accuracy for localizing the evidence in the scene. The evaluation of the complete system was performed in a simulator, as well as in a real-world environment.
The entire system, called Air-CSI, correctly identified all the objects in a controlled test scenario, taking an average of 13.6 perspectives to identify an object. After surveying the crime scene, Air-CSI produces a report containing a list of evidence, sketches, images, and videos, all collected during the investigation.
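
The proportional-integral-derivative controllers mentioned in the abstract above follow a well-known control law, sketched below for a single axis. This is a generic textbook form, not the thesis implementation: the class name `PID`, the gains, and the toy one-dimensional plant are all assumptions chosen for illustration.

```python
class PID:
    """Textbook discrete PID controller for one axis (illustrative only)."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def update(self, setpoint, measurement, dt):
        error = setpoint - measurement
        self.integral += error * dt
        # No derivative term on the first call, since there is no previous error.
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Toy plant (assumption): position changes by the commanded velocity each step.
pid = PID(kp=1.0, ki=0.0, kd=0.0)
position, target, dt = 0.0, 1.0, 0.1
for _ in range(100):
    position += pid.update(target, position, dt) * dt
print(f"final position: {position:.4f}")
```

In a setup like the one the abstract describes, one such controller per direction would receive the pose estimate as `measurement` and the trajectory point as `setpoint`.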


  Thesis   CERQUEIRA, R. A Multi-device sonar simulator for real-time underwater applications. 100 p.

Abstract: Simulation of underwater sonars allows the development and evaluation of acoustic-based algorithms without the real data beforehand, which reduces the costs and risks of in-field experiments. However, such applications require modeling acoustic physics while rendering data time-efficiently. Towards a high-fidelity virtualization with real-time constraints, this work presents a simulator able to reproduce the operation of two main types of imaging sonars: mechanical scanning imaging sonar and forward-looking sonar. The virtual underwater scenario is based on three components: (i) Gazebo handles the physical forces, (ii) OpenSceneGraph renders the ocean visual effects, and (iii) ROCK framework provides the communication layer between simulated components. Using this base, an underwater simulated scene can be acquired, then it is processed by a hybrid graphics pipeline to obtain the simulated sonar image. On GPU, shaders compute primary and secondary reflections by using selective rasterization and ray-tracing approach, where the computational resources are allocated to reflective surfaces only. Resulting reflections are characterized as two sonar rendering parameters: pulse distance and echo intensity, being all calculated over insonified objects in the 3D scene. Those sonar rendering parameters are then processed into simulated sonar data on CPU, in which the acoustic representation of the observable scene is composed and displayed. Sound-intrinsic features, such as noise, sound attenuation, reverberation, and material properties are also considered as part of the final acoustic image. Our evaluations demonstrated the effectiveness of our method to produce images visually close to those generated by real sonar devices. In terms of computation time, the achieved results enable the proposed simulator to feed underwater applications where online processing of acoustic data is a requirement.

  Dissertation   NEVES, G. Rotated multi-object detection from forward-looking sonar images. 133 p.

Abstract: The underwater world is a hazardous place for people, being even unreachable in some places. It is common to employ manned or unmanned underwater vehicles when human activities have to be performed underwater. In particular, oil and gas companies have used remotely operated vehicles (ROVs) to inspect and maintain submerged structures of subsea facilities. Because of the complexity and the high cost of ROV operations, some research has addressed autonomous underwater vehicles (AUVs) to perform inspection tasks under water. AUVs are typically equipped with perception sensors, such as optical cameras and sonars, that ultimately provide visual and acoustic information of underwater scenarios. With the goal of comprehending the surrounding environment, object detection over perception sensor data is a crucial task. Indeed, detected objects can be used for many applications of the AUV system, such as locating obstacles, providing landmarks for navigation, moving the vehicle with respect to a detected target object, and planning AUV trajectories. Although optical cameras are still important sensors for AUVs, their sensing capability is limited underwater, being only able to work at very short ranges and in low-turbidity water conditions. In contrast, sonars can cover larger operative ranges through the water and work in turbid water conditions. However, sonars provide noisy data with lower resolution and more difficult interpretation, thus making object recognition in sonar images an arduous task. With all this in mind, our work proposes a novel multi-object detection framework that outputs object position and rotation from sonar images. Two convolutional neural network-based architectures are proposed to detect and estimate rotated bounding boxes: an end-to-end system, called RBoxNet, and a pipeline comprised of two networks, called YOLOv2+RBoxDNet.
Both proposed approaches are structured from one of three representations of rotated bounding boxes regressed deep inside. To the best of our knowledge, there is no other work in the literature that estimates rotated bounding boxes in sonar images. Experimental analyses were performed by comparing several configurations of our proposed methods (by varying backbone, regression representation, and architecture) with other state-of-the-art methods over real sonar images. Results showed that RBoxNet presented the best tradeoff between accuracy and speed, reaching an averaged mAP@[.5,.95] of 90.3% at 8.58 frames per second (FPS), while YOLOv2+RBoxDNet was the fastest solution, running at 16.19 FPS, but with a lower averaged mAP@[.5,.95] of 77.5%. Both proposed methods were robust to variations of additive Gaussian noise, detecting objects even when the noise level is up to 0.10.
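
To make the notion of a rotated bounding box concrete, the sketch below converts one possible parameterization into corner points. The abstract does not specify its three representations, so the (center, width, height, angle) form used here is an assumption, and the function name `rbox_corners` is hypothetical.

```python
import math


def rbox_corners(cx, cy, w, h, theta):
    """Corners of a box centered at (cx, cy), rotated by theta radians.

    Assumption: the box is parameterized as (cx, cy, w, h, theta); this is
    one common choice, not necessarily one used in the dissertation.
    """
    c, s = math.cos(theta), math.sin(theta)
    # Corner offsets of the axis-aligned box, listed in a fixed order.
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each offset by theta, then translate to the box center.
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]


for corner in rbox_corners(10.0, 5.0, 4.0, 2.0, math.radians(30)):
    print(f"({corner[0]:.2f}, {corner[1]:.2f})")
```

A detector that regresses the five parameters can use a conversion like this to draw the box or to compute overlap with a ground-truth box.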


  Dissertation   CHAPARRO, LAURA. Configurando múltiplas câmeras RGB-D para captura de movimentos de pacientes hemiparéticos. 96 p.

Abstract: Human movement analysis studies how the position of the body varies in space over a given period of time. It is currently employed in various areas, such as game development, dance analysis, and film animation. Over time, movement analysis has also come to be applied in evaluating the progress of physiotherapy treatments. One medical condition that requires immediate initiation of physiotherapy is hemiparesis: a neurological condition of various causes that hinders movement on one side of the body. It refers to a reduction of motor strength, or partial weakness, that affects one arm and one leg on the same side of the body. It is a sequela of an encephalic vascular accident (EVA), commonly known as a stroke. To reduce the effects of hemiparesis, patients undergo physiotherapy treatments. To assess the progress of the treatments, physiotherapists use clinical scales based on subjective judgments. In general, movement analysis requires devices composed of multiple cameras, since a single camera cannot capture the details of the movement being carried out, especially when the user is not facing the camera. Using a multi-camera system requires a calibration process to obtain the parameters that unify the different coordinate systems of the individual cameras. There are currently devices made up of multiple cameras, such as OptiTrack, in which a set of cameras is arranged in a space and the patient must wear special clothing containing markers that are used to track movements. However, the large majority of devices for clinical use are intrusive, require a large physical space, and are expensive to acquire and maintain, and are therefore not easily accessible.
Following these ideas, the present work proposes the use of multiple RGB-D cameras of the Kinect type, with stereo calibration performed on pairs of cameras, in order to determine the configuration that minimizes the occlusion and self-occlusion problems inherent to the use of a single camera. Furthermore, the process of defining the location and position of the cameras aims to minimize interference between the devices used. To rate the performance of the system, a composite skeleton obtained by fusing the two skeletons supplied by the Kinects was used, based on the joint positions of the two upper limbs and the head. The results identify the capture angles and distances with the lowest occurrence of the problems mentioned above, with the skeleton resulting from the fusion being more robust than a single skeleton for evaluating the movement.

  Dissertation   RUIZ, M. A tool for building multi-purpose and multi-pose synthetic data sets. 103 p.

Abstract: This work proposes a new tool to generate multi-purpose synthetic data from multiple discretized camera viewpoints and predefined environmental conditions. To generate any data set, three steps are required: locate 3D models in the scene, set the discretization parameters, and finally run the generator. The set of rendered images provides data that can be used for geometric pattern recognition problems, such as depth estimation, camera pose estimation, 3D box estimation, 3D reconstruction, and camera calibration, and also a pixel-perfect ground truth for scene understanding problems (e.g., semantic segmentation, instance segmentation, and object detection, just to cite a few). In this paper, we also survey the most well-known synthetic data sets used in the area, pointing out the importance of using rendered images to train convolutional neural networks. When compared to similar tools, our generator contains a wide set of features that are easy to extend, besides allowing the building of sets of images annotated and organized in the well-known MSCOCO format, and thus ready for deep learning work. To the best of our knowledge, the proposed tool is the first to automatically generate synthetic data sets with such characteristics at large scale, allowing training and evaluation of supervised methods for all of the covered features. Code and data sets have been made available at


  Dissertation   BARROS, J. P. S. Reachable workspace a partir de múltiplos kinects para análise de movimento de pacientes hemiparéticos. 93 p.

Abstract: Stroke is a health problem for the world’s population, the second leading cause of death and one of the leading causes of physical disability in the world. One of the consequences of stroke is hemiparesis, which is characterized by weakness on one side of the body and reduction of the individual’s muscular capacity, and which requires the immediate onset of physiotherapeutic treatment. About 45-75% of adults who have had a stroke have difficulty using the upper limb in activities of daily living. Health professionals generally use clinical scales to evaluate the evolution of physiotherapeutic treatment. These scales are based on subjective assessments by the physiotherapist. Other tools, such as accelerometers and force platforms, are also used, but they need more physical space and are more sophisticated and intrusive in the preparation of patients. The present work proposes to use multiple low-cost and non-intrusive RGB-D (Kinect) cameras to objectively calculate the reachable workspace of the movements performed by the upper limbs of an individual, with the aim of obtaining a quantitative index for evaluating the physiotherapeutic treatment of hemiparetic patients. Although the index calculation is not the main goal of this research, obtaining the reachable workspace is an important step toward the index. A skeleton fusion strategy is applied to fuse the multiple skeletons obtained into a single skeletal representation, so that information lost by one camera can be recovered from another. This minimizes losses that occur due to occlusion, self-occlusion, or tracking limitations while obtaining the positions of the joints. An extrinsic calibration was performed between the cameras in order to unify the coordinate systems of each camera in a single coordinate system to obtain the fusion of the skeletons. The reachable workspace is computed from the composite skeleton.
For this, alpha-shapes are used to create a convex polygon from the obtained joint positions. The tracking rate of the composite skeletal joints and the tracking rate of the joints of each skeleton are calculated in order to determine the efficiency of the composite skeleton. The results show that skeleton fusion proves to be more robust for minimizing tracking failures than the use of individual cameras.

  Dissertation   SOUZA, L. C. COHAWES: Um novo descritor aplicado em reconhecimento facial. 96 p.

Abstract: Automatic face recognition seeks to identify a face in an image by matching it against images previously registered in a database. The recognition strategy should be robust enough to overcome the problems inherent in this task, such as low image quality, changes in lighting, or changes in facial expression. To confront such problems, methods for robust representation of objects in images are of primary importance. In order to contribute a new image representation to assist in the process of face identification, this work introduces a new image descriptor, called Correlation of Haar Wavelet Sums (COHAWES). This new descriptor calculates the correlation in regions of the image based on the sums of vertical and horizontal gradients of Haar wavelet filters. The descriptor is applied to face recognition and evaluated with the help of the nearest neighbor (NN), locally-constrained linear coding (LLC), k-nearest neighbor with sparse representation classifier (KNN-SRC), linear regression classification (LRC), orthonormal minimization ℓ2 (MO-L2), support vector machines (SVM), linearly approximated sparse representation-based classification (LASRC), sparse representation classifier with gradient projection for sparse reconstruction (SRC-GPSR), and sparse representation classifier with homotopy (SRC-Homotopy). The performance in face recognition using COHAWES was compared to that of other descriptors, namely dual-cross patterns (DCP), histogram of oriented gradients (HOG), Gabor wavelets, local binary patterns (LBP), a vector combining HOG, Gabor wavelets, and LBP (HOG + GABOR + LBP), and COHAWES-COV, which is a modification of COHAWES using covariance instead of correlation, using the previously listed classifiers. These comparisons were performed using the ORL, Yale, YaleB, FERET, and PubFig83 + LFW data sets, and COHAWES presented superior results in mean overall accuracy and in the majority of cases.
Evaluations in other image domains still need to be performed in order to analyze the application of COHAWES in other representation tasks.


  Thesis   FRANCO, A. C. S. On deeply learning features for automatic person image re-identification. 103 p.

Abstract: The automatic person re-identification (re-id) problem consists of matching an unknown person image to a database of previously labeled images of people. Among the several issues in this research field, person re-id has to deal with variations in person appearance and environment. As such, discriminative features to represent a person's identity must be robust regardless of those variations. Comparison between two image features is commonly accomplished by distance metrics. Although features and distance metrics can be handcrafted or trainable, the latter type has demonstrated more potential for breakthroughs in achieving state-of-the-art performance over public data sets. A recent paradigm that allows working with trainable features is deep learning, which aims at learning features directly from raw image data. Although deep learning has recently achieved significant improvements in person re-identification, as found in a few recent works, there is still room for learning strategies that can be exploited to increase the current state-of-the-art performance. In this work, a novel deep learning strategy is proposed, called here coarse-to-fine learning (CFL), as well as a novel type of feature, called convolutional covariance features (CCF), for person re-identification. CFL is based on the human learning process. The core of CFL is a framework conceived to perform cascade network training, learning person image features from generic-to-specific concepts about a person. Each network is comprised of a convolutional neural network (CNN) and a deep belief network denoising autoencoder (DBN-DAE). The CNN is responsible for learning local features, while the DBN-DAE learns global features, robust to illumination changes, certain image deformations, horizontal mirroring, and image blurring. After the convolutional features are extracted via CFL, they are wrapped in covariance matrices, composing the CCF.
CCF and flat features were combined to improve the performance of person re-identification in comparison with component features. The performance of the proposed framework was assessed comparatively against 18 state-of-the-art methods by using public data sets (VIPeR, i-LIDS, CUHK01 and CUHK03), cumulative matching characteristic curves and top ranking references. After a thorough analysis, our proposed framework demonstrated a superior performance.
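
The abstract does not say exactly how the convolutional features are wrapped in covariance matrices. As a rough illustration of the general idea of a covariance descriptor, the minimal sketch below treats each spatial position of a stack of feature maps as one observation and computes the sample covariance between channels. The function name, the nested-list input format and the toy values are assumptions made for this example, not the thesis implementation.

```python
def covariance_descriptor(feature_maps):
    """Sample covariance between channels of a stack of feature maps.

    feature_maps: list of d maps, each an H x W nested list of numbers
    (hypothetical input format). Returns a d x d symmetric matrix.
    """
    d = len(feature_maps)
    # Flatten every map so that index k refers to the same pixel in all channels.
    flat = [[value for row in fmap for value in row] for fmap in feature_maps]
    n = len(flat[0])
    means = [sum(channel) / n for channel in flat]

    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):
            total = sum(
                (flat[i][k] - means[i]) * (flat[j][k] - means[j])
                for k in range(n)
            )
            # Unbiased estimate; the matrix is symmetric by construction.
            cov[i][j] = cov[j][i] = total / (n - 1)
    return cov


# Toy example: two 2 x 2 feature maps, the second being twice the first.
maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
descriptor = covariance_descriptor(maps)
```

A descriptor of this kind has a fixed size (d x d) regardless of the image resolution, which is one common reason for using covariance matrices as region features.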

  Dissertation   SOUZA JUNIOR, L. O. O. Um estudo sistemático sobre detecção de impostor facial. 97 p.


  Dissertation   CANARIO, J. P. P. S. Uma abordagem deep learning para reconhecimento de expressões faciais. 82 p.

Abstract: Facial expressions are the result of changes in facial muscles in response to a person's internal emotional state. Building on Darwin's studies, Paul Ekman suggested the existence of seven basic universal facial expressions: joy, sadness, fear, disgust, contempt, surprise and anger, plus the neutral expression. Ekman also developed a tool, the Facial Action Coding System (FACS), to taxonomize all facial muscle movements and their intensities, allowing a deeper study of facial behavior. Since then, applications that automatically detect facial expressions have become pervasive in various fields, such as education, entertainment, psychology, human-computer interaction and behavior monitoring, to cite a few. Unlike the methods available in the literature, this work proposes a new approach to building a system for recognizing the seven basic facial expressions proposed by Ekman. The proposed method builds a conspicuity map of the main facial regions and trains on it via a convolutional neural network. To this end, two neural network architectures were developed; their training was validated through cross-validation, and the test results were evaluated by ROC curve analysis and confusion matrices. Experiments achieved an average accuracy of 90% on the extended Cohn-Kanade data set for the seven basic expressions, the best performance among the state-of-the-art methods compared. In absolute terms, the proposed approach obtained the best result in three of the seven facial expressions, which indicates a promising result.


  Dissertation   SANTOS, T. N. Detecção e rastreamento da mão utilizando dados de profundidade. 81 p.

Abstract: Natural interfaces have shown great importance in human-machine interaction, enabling applications that range from video games to the rehabilitation of patients undergoing physiotherapy. Camera-based hand tracking makes it possible to implement such interfaces, exploiting human gestures to control a computerized system without the need for physical contact. The method proposed in this work aims to detect and track hands using depth data. Since such data do not yield enough interest points (keypoints) for hand detection, an algorithm called Volume da Normal (Volume of the Normal) was proposed to enrich the description of the features present in these images; it is based on computing the volume of the normal vector of each pixel, assigning arbitrary values to the length of this vector. Hand tracking is based on the analysis of local descriptors of the depth image (processed by the Euclidean Distance Transform) and of a set of hand images after applying the Volume da Normal, using the Oriented FAST and Rotated BRIEF (ORB) algorithm for this purpose. A procedure for creating a kinematic model of the hand was proposed as an initial stage toward possible continuous finger tracking in future research. In the end, hand detection ran at 7.9 frames per second, reaching an average detection rate of 36.4% for poses from the training set and 38.15% for varied poses. For the detection of gestures performed from the training set, an average rate of 21.94% was achieved. For scenarios containing objects similar to the hand, the detector showed a precision of 14.72% with a standard deviation of 3.82%.

  Dissertation   SOUZA, T. T. L. Auto-calibração de câmeras de vídeo-vigilância por meio de informações da cena. 104 p.

Abstract: Surveillance cameras are commonly used in public and private security systems. When integrated with intelligent pattern recognition systems, this kind of equipment allows surveillance tasks to be automated. Camera calibration allows intelligent systems to use the 3D geometry of a scene as a tool to determine the position and size of a target object. Typical systems may contain a large number of static, heterogeneous cameras installed in different locations. Manually calibrating every camera in such a network requires intense human effort. This work proposes a framework for the auto-calibration of surveillance cameras that requires no human intervention in the calibration process. The framework uses scene clues and prior knowledge of the human height distribution to estimate the parameters needed for camera calibration, which include the camera position, orientation and internal properties. Evaluation of the framework indicates promising results. Based on our analysis, the proposed framework reaches an absolute error of less than 5 cm in human height estimation and an average error of less than 30 cm in length determination above the scene ground plane. Compared with similar methods, our framework is more efficient, using 80% fewer samples in the parameter convergence process, and is 40% more precise in camera parameter estimation.


  Dissertation   SILVA, G. J. O. Avaliação de técnicas de segmentação aplicadas a imagens de raio-X panorâmico dos dentes. 127 p.

Abstract: Medical images are fundamental sources of data in modern medicine. The field of computer vision has contributed substantially to research on extracting information from these images to support patient diagnosis. Through images taken by radiological examination (X-ray), the dentist can examine the entire tooth structure and, if necessary, build the patient's treatment plan. Among radiological examinations, panoramic radiography (panoramic X-ray) reveals dental irregularities such as impacted teeth, bone abnormalities, cysts, tumors, infections and fractures. However, the analysis depends on careful work by the professional and is not done automatically, mainly due to the difficulty of finding computational tools capable of extracting the desired features from these images. The objective of this study is to review image segmentation techniques, demonstrating the application of each one to a panoramic X-ray image dataset constructed for this work, seeking to isolate the teeth to facilitate analysis. Afterwards, a thorough evaluation of the performance of the segmentation algorithms is carried out to assess the efficiency of the results and to determine which segmentation technique is most suitable for the problem domain studied. To achieve this goal, the methodology commonly used in computer vision research is adopted. Qualitative analysis is performed during the early stages: the collection and annotation of the dataset (cataloging the images to be analyzed) and preprocessing (using techniques that prepare the images for subsequent steps). Quantitative and experimental analysis is then conducted during the image segmentation stage, in which algorithms separate the extracted features into those most likely to belong to the objects of interest and those least likely.
The work ends with a performance analysis of the segmentation algorithms using established computer vision metrics (precision, recall, F-score, accuracy and specificity), and assesses the efficiency of the results by analyzing the precision/recall curve and the F-score. The aim of this work is to assist dental health professionals in differentiating objects of interest (teeth versus non-teeth), contributing to the future construction of a model for the systematic analysis of possible dental anomalies and producing information that can serve as a basis for a second opinion in dental diagnosis.
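
For readers unfamiliar with the metrics named in this abstract, the following minimal sketch shows how precision, recall, F-score, accuracy and specificity are usually computed for a binary segmentation mask compared pixel by pixel against an annotated ground truth. The function name, the nested-list mask format and the toy masks are assumptions made for illustration; the dissertation's own evaluation code is not shown in the source.

```python
def segmentation_metrics(predicted, ground_truth):
    """Pixel-wise metrics for two binary masks of equal size.

    Both arguments are nested lists of 0/1 values (hypothetical format):
    1 marks a pixel labeled as the object of interest (e.g. tooth).
    """
    tp = fp = tn = fn = 0
    for pred_row, true_row in zip(predicted, ground_truth):
        for pred, true in zip(pred_row, true_row):
            if pred and true:
                tp += 1
            elif pred and not true:
                fp += 1
            elif not pred and true:
                fn += 1
            else:
                tn += 1

    def ratio(numerator, denominator):
        # Return 0.0 instead of failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f_score": ratio(2 * precision * recall, precision + recall),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "specificity": ratio(tn, tn + fp),
    }


# Toy 2 x 4 masks: 2 true positives, 1 false positive, 1 false negative.
predicted_mask = [[1, 1, 0, 0], [0, 1, 0, 0]]
annotated_mask = [[1, 0, 0, 0], [0, 1, 1, 0]]
metrics = segmentation_metrics(predicted_mask, annotated_mask)
```

Accuracy and specificity both reward correctly labeled background pixels, so on images where the background dominates they can stay high even when precision and recall are poor; this is one reason the precision/recall curve and F-score are often examined alongside them.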


  Dissertation   SANTOS, M. Road Detection in Traffic Analysis: A Context-aware Approach. 110 p.

Abstract: Correctly identifying the road area in an image is a crucial task for many traffic analyses based on computer vision. Despite that, most systems do not provide this functionality automatically; instead, the road area needs to be annotated through tedious and inefficient manual processes. This causes further inconvenience when one deals with many cameras, demanding considerable effort to set up the system. Besides, since traffic analysis is an outdoor activity, cameras are exposed to disturbances due to natural events (e.g., wind, rain and bird strikes), which may require recurrent system reconfiguration. Although there are some solutions intended to provide automatic road detection, they are not capable of dealing with common situations in the urban context, such as poorly structured roads or occlusions due to moving objects stopped in the scene. Moreover, in many cases they are restricted to straight roads (commonly freeways or highways), so that automatic road detection cannot be provided in most traffic scenarios. To cope with this problem, we propose a new approach to road detection. Our method is based on a set of innovative solutions, each intended to address a specific problem related to the detection task. In this sense, a context-aware background modeling method has been developed, which extracts contextual information from the scene in order to produce background models that are more robust to occlusions. From this point, segmentation is performed to extract the shape of each object in the image; this is accomplished by means of a superpixel method specially designed for road segmentation, which allows roads of any shape to be detected. For each extracted segment we then compute a set of features, whose goal is to support a decision tree-based classifier in the task of labeling the objects as road or non-road.
The formulation of our method, in which road detection is carried out by a combination of multiple features, makes it able to deal with situations where the road is not easily distinguishable from other objects in the image, as when the road is poorly structured. A thorough evaluation has indicated promising results in favour of this method. Quantitatively, the results point to 75% accuracy, 90% precision and 82% recall on challenging traffic videos captured in uncontrolled conditions. Qualitatively, the resulting images demonstrate the potential of the method to perform road detection in different situations, in many cases obtaining near-perfect results.


  Dissertation   SILVA FILHO, J. G. Multiscale Spectral Residue for Faster Image Object Detection. 115 p.

Abstract: Accuracy in image object detection has usually been achieved at the expense of a heavy computational load. Therefore, a trade-off between detection performance and fast execution commonly represents the ultimate goal of an object detector in real-life applications. Most images contain non-trivial amounts of background information, such as sky, ground and water. Running an object detector over a recurring background pattern can therefore consume a significant share of the total processing time. To alleviate this problem, search space reduction methods can help focus the detection procedure on more distinctive image regions. Among the several approaches to search space reduction, we explored saliency information to organize regions based on their probability of containing objects. Saliency detectors are capable of pinpointing regions that generate stronger visual stimuli based solely on information extracted from the image. The fact that saliency methods do not require prior training is an important benefit, which allows these techniques to be applied in a broad range of machine vision domains. We propose a novel method toward the goal of faster object detectors. The proposed method is grounded on a multi-scale spectral residue (MSR) analysis using saliency detection. For better search space reduction, our method enables fine control of the search scale, more robustness to variations in saliency intensity along an object's length, and a direct way to control the balance between search space reduction and the false negatives caused by region selection. Compared to a regular sliding window search over the images, in our experiments MSR reduced by 75% (on average) the number of windows to be evaluated by an object detector while improving or at least maintaining detector ROC performance. The proposed method was thoroughly evaluated on a subset of the LabelMe dataset (person images), improving detection performance in most cases.
This evaluation compared the detection performance of different object detectors, with and without MSR. Additionally, we evaluate how different object classes interact with MSR, using the Pascal VOC 2007 dataset. Finally, tests showed that the window selection performance of MSR scales well with image size. From the data obtained, we conclude that MSR can provide substantial benefits to existing sliding window detectors.
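
To make the idea of saliency-driven search space reduction concrete, the sketch below filters sliding windows by their mean saliency, assuming a saliency map has already been computed by some detector. It does not implement the spectral residue analysis itself, and the function name, the threshold rule and the toy map are assumptions made for this illustration only.

```python
def select_windows(saliency, window, stride, threshold):
    """Keep sliding windows whose mean saliency reaches a threshold.

    saliency: H x W nested list of values in [0, 1] (assumed precomputed).
    Returns (kept, total): the top-left corners of the retained windows
    and the number of windows a plain sliding-window search would visit.
    """
    height, width = len(saliency), len(saliency[0])
    kept, total = [], 0
    for row in range(0, height - window + 1, stride):
        for col in range(0, width - window + 1, stride):
            total += 1
            mean = sum(
                saliency[row + i][col + j]
                for i in range(window)
                for j in range(window)
            ) / (window * window)
            if mean >= threshold:
                kept.append((row, col))
    return kept, total


# Toy 4 x 4 saliency map with one salient 2 x 2 region in the lower right.
toy_map = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
kept, total = select_windows(toy_map, window=2, stride=1, threshold=0.5)
reduction = 1.0 - len(kept) / total  # fraction of windows the detector can skip
```

Raising the threshold discards more windows but increases the risk of dropping windows that contain an object, which is the balance between search space reduction and false negatives that the abstract refers to.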

  Dissertation   SILVA, C. A. P. Reconhecimento de Expressões Faciais Utilizando Redes Neurais. 101 p.

Abstract: The automatic analysis of facial expressions has drawn attention from researchers in different fields of study, such as psychology, computer science, linguistics and neuroscience. In recent decades, researchers have published many studies and introduced a large number of approaches and methods for human-face detection, recognition and analysis. Advances in image processing, computer vision and computing power have also contributed to this success. This work proposes a system for automatic human facial expression recognition. The proposed system classifies seven different facial expressions: happiness, anger, sadness, surprise, disgust, fear and neutral. The system was evaluated with two facial expression databases: MUG Facial Expression and Face and Gesture Recognition Research Network (FG-NET). These databases contain images with uniform and non-uniform backgrounds. They also contain images of people with individual differences such as beards, mustaches and glasses. The experimental results demonstrate that the proposed system achieves 97.62% accuracy for the seven defined facial expressions using artificial neural networks.

  Dissertation   SOBRAL, A. C. Classificação Automática do Estado do Trânsito Baseada em Contexto Global. 95 p.

Abstract: Intelligent vision systems for urban traffic surveillance have been frequently adopted. Traditional approaches are based on detecting and counting individual vehicles to perform traffic analysis. However, these approaches commonly fail, especially in crowded situations (e.g., heavy traffic congestion), because of the large occlusion of moving objects, which causes errors in vehicle counting and traffic analysis. Global approaches treat the crowd as a single entity. Properties such as crowd flow, density, speed, localization and direction can be extracted from crowd behavior analysis. This work proposes a method for highway traffic video classification based on a global approach. The method uses two crowd properties, average crowd density and average crowd speed, to classify traffic congestion into three classes: light, medium and heavy. These two properties are combined in a feature vector that is used to compose the training set. Experimental results show 94.50% accuracy on 254 highway traffic videos using artificial neural networks.
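
As a hedged illustration of classifying traffic from a two-dimensional feature vector (average crowd density, average crowd speed), the sketch below uses a nearest-centroid rule. This is a deliberately simple stand-in, not the artificial neural network used in the dissertation, and all names and numeric values are made-up toy data with both features assumed to be normalized to [0, 1].

```python
import math


def train_centroids(samples):
    """Average the feature vectors of each class.

    samples: list of ((density, speed), label) pairs (hypothetical format).
    Returns a dict mapping each label to its (density, speed) centroid.
    """
    sums = {}
    for (density, speed), label in samples:
        d_sum, s_sum, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (d_sum + density, s_sum + speed, count + 1)
    return {
        label: (d_sum / count, s_sum / count)
        for label, (d_sum, s_sum, count) in sums.items()
    }


def classify(centroids, feature):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(centroids[label], feature))


# Toy training set: low density and high speed suggest light traffic.
training_set = [
    ((0.1, 0.9), "light"), ((0.2, 0.8), "light"),
    ((0.5, 0.5), "medium"), ((0.4, 0.6), "medium"),
    ((0.9, 0.1), "heavy"), ((0.8, 0.2), "heavy"),
]
centroids = train_centroids(training_set)
label = classify(centroids, (0.85, 0.15))
```

The point of the sketch is only that each video is reduced to a single compact feature vector before classification, so no individual vehicle has to be detected or counted.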