Deep Semantic Segmentation in Autonomous Driving
Published in Mahmoud Hassaballah, Ali Ismail Awad, Deep Learning in Computer Vision, 2020
Hazem Rashed, Senthil Yogamani, Ahmad El-Sallab, Mohamed Elhelw, Mahmoud Hassaballah
Depth estimation using deep learning has been tackled through several approaches [66,67], which can be classified as supervised, semi-supervised, and unsupervised methods. Each approach has its pros and cons: supervised approaches generally achieve higher accuracy, but they require dense, accurate depth annotation, which in practice often means relying on synthetic datasets; unsupervised approaches have the flexibility to be trained without explicit depth annotations, but they usually provide lower accuracy. In [68], ego-motion information is used to exploit temporal information. The depth network takes only the target view as input and generates a pixel-wise depth map, while the pose network takes the target and source views as input and estimates the relative camera poses. These outputs are used to inverse-warp the source views and reconstruct the target view, so training proceeds in an unsupervised manner. View synthesis underpins this task: a target image can be synthesized given its pixel-wise depth map, together with the pose and visibility in a nearby view. Godard et al. [69] formulated depth estimation as an unsupervised learning problem in which epipolar geometry constraints are exploited to generate disparity images by training the network with an image reconstruction loss. This approach has the advantage of using only stereo information during training, while at inference only monocular images are needed.
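The following is a minimal sketch, not the authors' code, of the inverse-warping view-synthesis loss described above for the approach of [68]: target pixels are back-projected with the predicted depth, transformed by the predicted relative pose into the source camera, re-projected, and bilinearly sampled to reconstruct the target view, which is then compared photometrically. Function names, tensor shapes, and the choice of an L1 loss are assumptions for illustration.

```python
# Hypothetical sketch of view-synthesis-based unsupervised depth training.
import torch
import torch.nn.functional as F

def inverse_warp(src_img, tgt_depth, pose_src_from_tgt, K):
    """src_img: (B,3,H,W), tgt_depth: (B,1,H,W), pose: (B,4,4), K: (B,3,3)."""
    B, _, H, W = src_img.shape
    device = src_img.device

    # Homogeneous pixel grid of the target view, shape (3, H*W).
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)

    # Back-project target pixels to 3D camera coordinates using predicted depth.
    cam = torch.linalg.inv(K) @ pix.unsqueeze(0)              # (B,3,H*W)
    cam = cam * tgt_depth.reshape(B, 1, -1)

    # Transform the 3D points into the source camera frame and project them.
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src_cam = (pose_src_from_tgt @ cam_h)[:, :3]              # (B,3,H*W)
    src_pix = K @ src_cam
    src_pix = src_pix[:, :2] / src_pix[:, 2:].clamp(min=1e-6)

    # Normalize coordinates to [-1, 1] and sample the source view bilinearly.
    u = 2.0 * src_pix[:, 0] / (W - 1) - 1.0
    v = 2.0 * src_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    return F.grid_sample(src_img, grid, padding_mode="border", align_corners=True)

def photometric_loss(tgt_img, src_img, tgt_depth, pose, K):
    # Reconstruct the target view from the source view; penalize the difference.
    recon = inverse_warp(src_img, tgt_depth, pose, K)
    return (recon - tgt_img).abs().mean()
```

Because the supervision signal is purely this reconstruction error, no ground-truth depth is needed; the depth and pose networks are trained jointly by minimizing the photometric loss over adjacent frames.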
Learning global spatial information for multi-view object-centric models
Published in Advanced Robotics, 2023
Yuya Kobayashi, Masahiro Suzuki, Yutaka Matsuo
To represent multi-object scenes properly, both the properties of individual objects and their spatial arrangement should be specified. However, existing multi-view object-centric methods only explicitly model representations of individual objects, each of which carries its own spatial information separately; the relationships between objects are therefore not represented. This modeling can degrade novel view synthesis and segmentation, because the model must resolve occlusions and spatial ambiguity with virtually no prior knowledge about the objects' spatial relationships. This becomes especially serious when the number of available observation views is limited. In addition, this modeling can prevent the generation of physically plausible novel scenes, because objects must be placed independently, which often results in collisions and misplaced objects.