Machine Learning
Published in Seyedeh Leili Mirtaheri, Reza Shahbazian, Machine Learning Theory to Applications, 2022
Seyedeh Leili Mirtaheri, Reza Shahbazian
Principal Component Analysis (PCA) is a dimensionality reduction method that is often used to reduce the dimensionality of large datasets by transforming a large set of variables into a smaller one that still contains most of the information in the original set. Reducing the number of variables of a dataset naturally comes at the expense of accuracy, but the trick in dimensionality reduction is to trade a little accuracy for simplicity. Because smaller datasets are easier to explore and visualize, they make analyzing data much easier and faster for machine learning algorithms, which no longer have extraneous variables to process. In a nutshell, the idea of principal component analysis is to reduce the number of variables of a dataset while preserving as much information as possible.
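As a rough illustration of this idea, the sketch below uses scikit-learn's PCA to reduce a 20-variable dataset to two principal components and reports how much of the original variance they retain; the synthetic data shape and the choice of two components are assumptions made only for the example.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Illustrative data: 500 samples with 20 original variables
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))

# PCA is sensitive to scale, so standardize first
X_std = StandardScaler().fit_transform(X)

# Keep only the two leading principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

print(X_reduced.shape)                # (500, 2)
print(pca.explained_variance_ratio_)  # share of variance retained by each component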
Hybridization Preprocessing and Resampling Technique-Based Neural Network Approach for Credit Card Fraud Detection
Published in Nedunchezhian Raju, M. Rajalakshmi, Dinesh Goyal, S. Balamurugan, Ahmed A. Elngar, Bright Keswani, Empowering Artificial Intelligence Through Machine Learning, 2022
Bright Keswani, Poonam Keswani, Prity Vijay, Ambarish G. Mohapatra
PCA is a dimensionality reduction method that extracts a set of features from a very large dataset and converts it into a low-dimensional dataset. In other words, it transforms a high-dimensional dataset with the intention of capturing almost all of its information. Increased data size is a problem not only for storage; processing it also becomes problematic for many traditional ML approaches. Nowadays, the data arriving from different sources has grown in size and contains numerous features [4]. Many of these features are redundant and convey the same piece of information. Such redundant features should therefore be removed so that the dataset contains fewer but more important and meaningful attributes. PCA examines the variance of each attribute in the dataset and combines the high-variance directions to form a new set of examples based on the original one while retaining most of its information. PCA linearly transforms the high-dimensional original data using the algebraic calculation of the principal components.
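The variance-based selection described above can be sketched as follows; this is a hedged illustration rather than the chapter's exact procedure, and the synthetic data with deliberately redundant columns is an assumption made for the example. Asking scikit-learn's PCA for 95% of the variance keeps only the informative directions and drops the redundant ones.

import numpy as np
from sklearn.decomposition import PCA

# Build a 20-column dataset whose last 15 columns are linear mixtures
# of the first 5, i.e. they carry redundant information
rng = np.random.default_rng(1)
base = rng.normal(size=(1000, 5))
X = np.hstack([base, base @ rng.normal(size=(5, 15))])

# A float n_components keeps just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_low = pca.fit_transform(X)
print(X.shape, "->", X_low.shape)  # far fewer columns than the original 20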
Machine Learning in IoT-Based Ambient Backscatter Communication System
Published in Bhawana Rudra, Anshul Verma, Shekhar Verma, Bhanu Shrestha, Futuristic Research Trends and Applications of Internet of Things, 2022
Shivani Chouksey, Tushar S. Muratkar, Ankit Bhurane, Prabhat Sharma, Ashwin Kothari
Dealing with strong direct-link interference from the RF source is one of the most difficult aspects of ambient backscatter. A simple approach is to treat the interference as noise and demodulate the backscattered information using energy detection. Using unsupervised learning, the elements of the energy set are divided into two clusters corresponding to the two types of transmission bits, 0 or 1, from the tag, and a representative point is calculated for the elements of the energy set in each cluster. When a machine learns without receiving supervised target outputs or rewards from its environment, it is said to be performing unsupervised learning. Given that the machine receives no feedback from its environment, it may seem impossible to say what it could be capable of learning. However, the framework of unsupervised learning rests on the assumption that the machine's purpose is to build representations of the input that can help in decision making, predicting future inputs, transmitting the inputs to another machine efficiently, and so on. Unsupervised learning can be defined as the recognition of patterns in the data above the level of pure unstructured noise. Clustering and dimensionality reduction are two simple instances of unsupervised learning. Almost all unsupervised learning research can be viewed as building a model of the probability of the data. Even though the machine is not supervised or rewarded, developing a model that gives the probability distribution of a new input based on previous inputs is a beneficial approach.
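A hedged sketch of the two-cluster energy-detection idea is given below; the energy levels, noise scale, and use of k-means are assumptions made for the illustration, not the chapter's exact algorithm. Received energies are clustered without labels, and the cluster with the lower centroid is mapped to bit 0, the other to bit 1.

import numpy as np
from sklearn.cluster import KMeans

# Synthetic received energies: backscattering bit 1 adds a small offset
# on top of the direct-link energy, plus noise (illustrative values)
rng = np.random.default_rng(42)
true_bits = rng.integers(0, 2, size=200)
energies = 1.0 + 0.3 * true_bits + 0.05 * rng.normal(size=200)

# Unsupervised split of the energy set into two clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(energies.reshape(-1, 1))

# Map the cluster with the smaller centroid to bit 0, the other to bit 1
low_cluster = np.argmin(kmeans.cluster_centers_.ravel())
decoded_bits = (kmeans.labels_ != low_cluster).astype(int)

print("bit error rate:", np.mean(decoded_bits != true_bits))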
Abnormal network packets identification using header information collected from Honeywall architecture
Published in Journal of Information and Telecommunication, 2023
Kha Van Nguyen, Hai Thanh Nguyen, Thang Quyet Le, Quang Nhat Minh Truong
LDA is also a dimensionality reduction technique. As the name implies, dimensionality reduction techniques reduce the number of dimensions (i.e. variables or features) in the dataset while retaining as much information as possible. Taking advantage of the LDA algorithm, we use it to represent the collected data on a 2D graph. After transforming the data with the LDA algorithm, we show the distribution of records in Figure 7. The green points (medium risk) account for the highest proportion and are distributed on the right side of the graph. The second-highest proportion is the unwarranted network flows occupying the upper left part. From the graph, it can be concluded that the 29 proposed features effectively separate the network attack data streams.
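As a rough sketch of this step (not the paper's exact pipeline), the fragment below projects 29-dimensional flow features onto two discriminant axes with scikit-learn's LinearDiscriminantAnalysis; the random data and the assumption of three risk classes (LDA needs at least three classes for a two-component projection) are illustrative.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Illustrative data: 300 flows, 29 header-based features, 3 hypothetical risk classes
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 29))
y = rng.integers(0, 3, size=300)

lda = LinearDiscriminantAnalysis(n_components=2)
X_2d = lda.fit_transform(X, y)   # 2D coordinates suitable for plotting
print(X_2d.shape)                # (300, 2)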
Automated categorization of student's query
Published in International Journal of Computers and Applications, 2022
Naveen Kumar, Hare Krishna, Shashi Shubham, Prabhu Padarbind Rout
The proposed platform's core has a model that categorizes students' queries. Text classification is not a new area; the authors of [6–17] have already explored it. This article uses text classification to classify students' queries. Query categorization can be divided into a series of tasks such as data collection, text preprocessing (data filtering, tokenization, stemming, stop-word removal, and vectorization), feature reduction (dimensionality reduction or feature selection), classification, performance evaluation, etc. There are many dimensionality reduction approaches, e.g. Principal Component Analysis (PCA), Singular Value Decomposition (SVD), Linear Discriminant Analysis, Isomap Embedding, Locally Linear Embedding, etc. SVD is better suited for sparse data (data with many zeros) [18]. Therefore, the proposed platform uses Singular Value Decomposition (SVD) for dimensionality reduction. Five different machine learning-based approaches, i.e. the Naïve Bayes (NB) classifier [19], Multi-Layer Perceptron with Back Propagation (MLP with BP) [20], K-Nearest Neighbours (KNN) [21], Support Vector Machine (SVM) [22], and Random Forest (RF) classifier [23], are used to categorize the queries. These classifiers are chosen because they have different natures; this helps determine which class of classifiers is suitable for query categorization. Ten-fold cross-validation is used to evaluate the performance of the classifiers on four different metrics, i.e. Accuracy, Precision, Recall, and F1-Measure. The results for various dimensions and folds are shown using box plots.
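A minimal sketch of the SVD-based reduction on sparse text features is shown below; the toy queries and labels are invented for the example, and a linear SVM stands in for any of the five listed classifiers. TruncatedSVD is used because it operates directly on the sparse TF-IDF matrix.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for student queries and their categories
queries = ["when is the exam", "how to pay fees", "syllabus for algorithms",
           "hostel room change request", "scholarship application deadline",
           "exam hall ticket download"]
labels  = ["exam", "fees", "academics", "hostel", "fees", "exam"]

model = make_pipeline(
    TfidfVectorizer(),             # sparse bag-of-words vectorization
    TruncatedSVD(n_components=3),  # SVD reduction applied to the sparse matrix
    SVC(kernel="linear"),          # one of several possible classifiers
)
model.fit(queries, labels)
print(model.predict(["when will exam results be announced"]))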
Deep machine learning for structural health monitoring on ship hulls using acoustic emission method
Published in Ships and Offshore Structures, 2021
Petros Karvelis, George Georgoulas, Vassilios Kappatos, Chrysostomos Stylios
An intuitive explanation of its effectiveness, as well as of the need for their complementary properties, can be provided by projecting the multidimensional data (200 features) into a three-dimensional space. The 'projection' was created using a variation of the Stochastic Neighbour Embedding (SNE) method (Hinton and Roweis 2002) called t-Distributed Stochastic Neighbour Embedding, t-SNE (Maaten and Hinton 2008). t-SNE is a non-linear dimensionality reduction technique that operates in two stages: first, it converts the high-dimensional Euclidean distances between data points into conditional probabilities that represent similarities in the high-dimensional space; second, it defines a probability distribution over the points in the low-dimensional map and minimises the mismatch (the Kullback-Leibler divergence) between the two distributions.
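A minimal sketch of such a projection, using scikit-learn's TSNE, is shown below; the synthetic 200-dimensional feature vectors and the default perplexity are assumptions made here, not values from the paper.

import numpy as np
from sklearn.manifold import TSNE

# Illustrative data: 400 samples with 200 features (e.g. acoustic-emission descriptors)
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 200))

tsne = TSNE(n_components=3, perplexity=30.0, init="pca", random_state=0)
X_3d = tsne.fit_transform(X)   # three-dimensional embedding for visualisation
print(X_3d.shape)              # (400, 3)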