Object Classification
Published in Maheshkumar H. Kolekar, Intelligent Video Surveillance Systems, 2018
Recent developments in artificial intelligence and deep learning have paved the way for tasks involving multimodal learning. Visual question answering is one such challenge, requiring high-level scene interpretation from images combined with language modeling of the relevant question and answer. Image-caption generation is the problem of generating a descriptive sentence for an image. A quick glance at an image is sufficient for a human to point out and describe an immense amount of detail about the visual scene. The fact that humans can do this with remarkable ease makes it a very interesting and challenging problem for AI, combining aspects of computer vision, in particular scene understanding, with language modeling. However, this remarkable ability has proven to be an elusive task for our visual recognition models.
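To make the captioning task concrete, the sketch below shows one common encoder-decoder recipe: image features from a pretrained CNN condition a recurrent language model that emits the sentence one token at a time. This is a minimal illustration in PyTorch, not the chapter's specific architecture; all names and dimensions (CaptionModel, feat_dim, etc.) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CaptionModel(nn.Module):
    """Minimal encoder-decoder captioner: CNN features condition an LSTM language model."""
    def __init__(self, feat_dim=2048, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)    # project CNN features into embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)  # token embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # per-step vocabulary logits

    def forward(self, img_feats, captions):
        # Prepend the projected image feature as the first "token" of the sequence.
        img_tok = self.img_proj(img_feats).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                      # (B, T, E)
        seq = torch.cat([img_tok, words], dim=1)          # (B, T+1, E)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)                           # (B, T+1, V) logits
```

During training such a model is typically fed the ground-truth caption (teacher forcing); at inference the decoder generates one token at a time from its own predictions.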
A multimodal hybrid parallel network intrusion detection model
Published in Connection Science, 2023
Shuxin Shi, Dezhi Han, Mingming Cui
Existing DL-based IDSs lack flexibility and adaptability. End-to-end DL IDSs typically use only a single modality, which results in significant deviations in the detection results. This is a major issue in intrusion detection, as small deviations in detection can lead to more attacks and potential threats (Z. Wang et al., 2022). When auditing and analysing malicious traffic, network security researchers often make a comprehensive judgment based on several kinds of information, such as abnormal traffic statistics, traffic payload content, and the communication interaction process. Multimodal learning is a method that seeks to process and understand information from different modalities through ML and DL. It is more closely aligned with the general way humans understand the world than traditional ML methods that rely on a single modality.
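As a rough illustration of this idea, the sketch below fuses two of the information sources mentioned above, tabular flow statistics and raw payload bytes, in parallel branches before a shared classifier. This is a minimal PyTorch example under assumed input shapes, not the model proposed in the paper; all names and dimensions (MultimodalIDS, n_stats, payload_len) are hypothetical.

```python
import torch
import torch.nn as nn

class MultimodalIDS(nn.Module):
    """Toy parallel IDS: one branch per modality, fused before classification."""
    def __init__(self, n_stats=40, payload_len=256, n_classes=2):
        super().__init__()
        # Branch 1: tabular flow statistics (durations, packet counts, flags, ...).
        self.stats_net = nn.Sequential(nn.Linear(n_stats, 64), nn.ReLU())
        # Branch 2: raw payload bytes treated as a 1-D signal.
        self.payload_net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(8), nn.Flatten(),       # -> 16 * 8 = 128 features
        )
        self.classifier = nn.Linear(64 + 128, n_classes)

    def forward(self, stats, payload):
        a = self.stats_net(stats)                         # (B, 64)
        b = self.payload_net(payload.unsqueeze(1))        # (B, 128); add channel dim
        return self.classifier(torch.cat([a, b], dim=1))  # feature-level fusion
```

The point of the parallel design is that each modality gets an encoder suited to its structure, so a blind spot in one branch can be compensated by evidence from the other.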
A Multimodal Teaching Model of College English Based on Human–Computer Interaction
Published in International Journal of Human–Computer Interaction, 2023
Fei Qin, Qian Sun, Yongyan Ye, Le Wang
Multimodal machine learning uses machine learning technology to process and understand multi-source modal information. Currently, multimodal machine learning can be divided into five fields: multimodal representation learning, modal transformation, modal alignment, multimodal information fusion, and multimodal collaborative learning. This study aims to explore a multimodal teaching model based on HCI (Riva & Riva, 2020). Unimodal representation learning represents information as a numeric vector that a computer can process, or abstracts it further into a higher-level feature vector. Multimodal representation learning exploits the complementarity of multiple modalities to eliminate redundancy between them, so as to obtain better feature representations (Krasnokutska & Kovalchuk, 2017). Two important areas of research are joint representation and coordinated representation. Joint representation combines information from multiple modalities into a single multimodal vector space, while coordinated representation maps each modality to its own representation space, with the relationships between those spaces subject to certain constraints. Multimodal representation features are often used for information retrieval and classification.
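A small sketch may clarify the distinction between the two representation styles. Assuming PyTorch and illustrative dimensions, JointRepr fuses both modalities into one shared vector, while CoordinatedRepr keeps a separate encoder per modality and adds a cosine-similarity constraint that ties the two spaces together; the class names and sizes below are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Joint representation: fuse both modalities into ONE shared vector space.
class JointRepr(nn.Module):
    def __init__(self, dim_a=128, dim_b=128, dim_joint=64):
        super().__init__()
        self.fuse = nn.Linear(dim_a + dim_b, dim_joint)

    def forward(self, xa, xb):
        return torch.tanh(self.fuse(torch.cat([xa, xb], dim=-1)))  # (B, dim_joint)

# Coordinated representation: each modality keeps its OWN space; a similarity
# constraint (here, cosine) ties the two spaces together during training.
class CoordinatedRepr(nn.Module):
    def __init__(self, dim_a=128, dim_b=128, dim=64):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, dim)
        self.enc_b = nn.Linear(dim_b, dim)

    def forward(self, xa, xb):
        za, zb = self.enc_a(xa), self.enc_b(xb)
        # Constraint loss: paired samples should be close across the two spaces.
        loss = 1 - F.cosine_similarity(za, zb, dim=-1).mean()
        return za, zb, loss
```

Joint representations suit tasks where all modalities are present at once (e.g. classification over fused features), whereas coordinated representations support cross-modal retrieval, since each modality can be embedded and compared independently.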
A survey of multimodal deep generative models
Published in Advanced Robotics, 2022
Masahiro Suzuki, Yutaka Matsuo
We perceive various kinds of sensory information from the external world, such as vision, sounds, and smells. These different types of information are called modalities, and it is known that we develop a more reliable understanding of the world through multiple modalities, or multimodal information [1]. In recent years, multimodal learning [2], which aims to build models that make predictions based on such multimodal information, has been studied in the fields of artificial intelligence and machine learning. Multimodal learning is especially important for robots that need to operate properly in the real world, because they must make sense of the world based on the various types of information they receive through their onboard sensors [3].