High-Dimensional Data Analysis
Atanu Bhattacharjee in Bayesian Approaches in Oncology Using R and OpenBUGS, 2020
Base sequencing procedures are now well developed, and a large amount of genetic data is publicly available. This abundance of data challenges us to develop analytical tools for it: such extensive genetic data must be analyzed with advanced computational methodology coupled with statistical techniques. Similarly, microarrays provide gene expression information; commonly, tens of thousands of variables are obtained from a single experiment. Datasets with this many variables are known as high-dimensional data. Earlier, microarrays were used to measure gene expression in serum or tissue; currently, they are also used for DNA methylation expression. Tremendous progress in microarray experiments has been observed, and similar growth in statistical analysis methods has followed. The main challenge in high-dimensional data analysis is gene effect classification: the task is to filter out a few relevant variables from tens of thousands. The conventional statistical methodology is the unsupervised approach, but the direction has recently shifted from unsupervised to supervised approaches. The supervised approach helps to relate the characteristics of interest (Y) to the gene expression data (X), as sketched below.
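The supervised filtering described here can be illustrated with a minimal R sketch. The expression matrix `expr` and phenotype `y` below are simulated placeholders, not data from the chapter; each gene is simply ranked by a two-sample t-test against the outcome.

```r
## Minimal sketch: supervised gene filtering via per-gene t-tests.
## `expr` (genes x samples) and the binary phenotype `y` are simulated.
set.seed(1)
n_genes   <- 10000                     # "tens of thousands" of variables
n_samples <- 40
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
y    <- factor(rep(c("normal", "tumor"), each = n_samples / 2))

## Score each gene by how strongly its expression separates the classes (Y ~ X)
pvals <- apply(expr, 1, function(g) t.test(g ~ y)$p.value)

## Keep a handful of top-ranked genes out of tens of thousands
top_genes <- order(pvals)[1:20]
head(top_genes)
```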
Exploratory Data Analysis with Unsupervised Machine Learning
Altuna Akalin in Computational Genomics with R, 2020
Principal component analysis (PCA) is perhaps the most popular technique for examining high-dimensional data. There are multiple interpretations of how PCA reduces dimensionality. We will first focus on the geometrical interpretation, where the operation can be viewed as rotating the original dimensions of the data. For this, we go back to our example gene expression data set. In this example, we represent our patients with the expression profiles of just two genes, CD33 (ENSG00000105383) and PYGL (ENSG00000100504). This way we can visualize them in a scatter plot (see Figure 4.9).
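A minimal R sketch of this geometric view, using simulated stand-ins for the two genes (the book's actual data set and Figure 4.9 are not reproduced here): `prcomp()` returns the rotation that aligns the new axes with the directions of greatest variance.

```r
## Minimal sketch of PCA as a rotation, on simulated values for two genes.
set.seed(2)
CD33 <- rnorm(50)
PYGL <- 0.8 * CD33 + rnorm(50, sd = 0.4)  # correlated, as co-expressed genes often are
X <- scale(cbind(CD33, PYGL), center = TRUE, scale = FALSE)

pca <- prcomp(X)
pca$rotation                    # rotation matrix: new axes as combinations of the old
scores <- X %*% pca$rotation    # equivalently pca$x: data in rotated coordinates

plot(X, asp = 1, main = "Patients in CD33/PYGL expression space")
abline(0, pca$rotation[2, 1] / pca$rotation[1, 1], col = "red")  # direction of PC1
```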
Deep Learning to Diagnose Diseases and Security in 5G Healthcare Informatics
K. Gayathri Devi, Kishore Balasubramanian, Le Anh Ngoc in Machine Learning and Deep Learning Techniques for Medical Science, 2022
In addition, there are many data sources that can be used to enrich health data, including but not limited to genomics, health records, social media data, and environmental data. The main types of ML/DL that can be used in healthcare applications are:
- Unsupervised Learning
- Supervised Learning
- Semi-supervised Learning
- Reinforcement Learning

Unsupervised Learning: Unsupervised learning techniques are ML techniques that employ unlabelled data. They include clustering data points based on a similarity measure and dimensionality reduction to translate high-dimensional data into a lower-dimensional feature space (occasionally also referred to as feature selection), as sketched below.
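A minimal R sketch of the two unsupervised operations just named, dimensionality reduction and clustering, on simulated data (the 100-dimensional records below are purely illustrative):

```r
## Minimal sketch: unsupervised learning on unlabelled, high-dimensional data.
set.seed(3)
X <- rbind(matrix(rnorm(50 * 100, mean = 0), ncol = 100),
           matrix(rnorm(50 * 100, mean = 1), ncol = 100))

## Dimensionality reduction: map the 100-D points to a 2-D feature space
X2 <- prcomp(X)$x[, 1:2]

## Clustering: group points by a similarity measure (Euclidean distance)
cl <- kmeans(X2, centers = 2)
table(cl$cluster)
```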
Variable selection for mode regression
Published in Journal of Applied Statistics, 2018
Yingzhen Chen, Xuejun Ma, Jingke Zhou
Nowadays, with advances in the technology for collecting and storing data, high-dimensional data are very prevalent. Variable selection is one strategy for handling high-dimensional regression analysis. Penalized techniques have been proposed to conduct variable selection for conventional regression models by shrinking inactive coefficients to 0. For example, Tibshirani [8], Fan and Li [1], and Zhang [11] proposed LASSO, SCAD and MCP for mean regression, and Li and Zhu [7] and Wu and Liu [9] studied quantile regression via LASSO, SCAD and so on. However, variable selection for mode regression has rarely been studied. In this paper, we concentrate on the variable selection problem for mode regression by combining nonparametric kernel estimation with a sparsity penalty method. Theoretically, we explore and prove the asymptotic properties of the resulting estimator. Numerical simulations and real data analysis are also conducted to illustrate the finite-sample performance of the proposed procedure.
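To make the shrinkage idea concrete, here is a minimal R sketch of penalized variable selection with the LASSO of Tibshirani [8], using the `glmnet` package on simulated data. This illustrates mean regression only; it is not the paper's mode-regression estimator.

```r
## Minimal sketch: LASSO shrinks inactive coefficients to exactly 0.
library(glmnet)

set.seed(4)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 0.8, rep(0, p - 3))  # only 3 active coefficients
y <- X %*% beta + rnorm(n)

fit <- cv.glmnet(X, y)                  # cross-validated lasso path
coef(fit, s = "lambda.min")             # inactive coefficients are set to 0
```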
Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference scale-invariant test
Published in Journal of Applied Statistics, 2023
Liang Zhang, Tianming Zhu, Jin-Ting Zhang
The problem of testing the equality of mean vectors for high-dimensional data is frequently encountered in many contemporary statistical studies. One prominent aspect of high-dimensional data is that there are many measurements taken on only a few subjects; that is, the number of variables is much larger than the number of observations. For example, in DNA microarray data, thousands of gene expression levels are often measured on relatively few subjects. Our motivating example is the colon data set, which is well known and publicly available at http://microarray.princeton.edu/oncology/affydata/index.html. It contains 22 normal colon tissues and 40 tumor colon tissues, each having 2000 gene expression levels. It is of interest to check whether the normal colon tissues and the tumor colon tissues have the same mean expression levels. In this two-sample problem, the data dimension p = 2000 is much larger than the total sample size n = 62, and the covariance matrices of the two samples are probably not the same. Therefore, the classical Hotelling T² test is not applicable, since the pooled sample covariance matrix is singular when p > n.
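A minimal R sketch of why Hotelling's T² breaks down in this setting, on simulated data with the same dimensions as the colon data (not the data set itself): the pooled sample covariance matrix has rank at most n − 2 = 60, far below p = 2000, so it cannot be inverted.

```r
## Minimal sketch: the pooled covariance matrix is singular when p > n.
set.seed(5)
p <- 2000; n1 <- 22; n2 <- 40
X1 <- matrix(rnorm(n1 * p), n1, p)   # "normal" tissues
X2 <- matrix(rnorm(n2 * p), n2, p)   # "tumor" tissues

## Pooled sample covariance, as used by Hotelling's T^2
S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
qr(S)$rank                           # at most 60, far below p = 2000
```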
A generalized l2,p-norm regression based feature selection algorithm
Published in Journal of Applied Statistics, 2023
In many applications such as genetic data analysis, image processing and data mining, one often encounters very high-dimensional data. Some features of the high-dimensional data are related to the target task, while many features are redundant [23]. Therefore, dimension reduction has become an important stage of data preprocessing in such applications [12,13]. Feature selection and feature extraction are the two main dimension reduction methods [2,22]. Feature extraction transforms the original data into a new low-dimensional subspace, while a feature selection algorithm selects low-dimensional features from the original high-dimensional data according to certain processing rules. The latter retains the original representation of the data without changing the original features and is therefore interpretable, whereas the former cannot do this [23]. Over the years, research on feature selection has received more and more attention and has made considerable progress.
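A minimal R sketch contrasting the two routes on simulated data (no specific method from the paper is implied): feature extraction builds new, mixed features, while a simple correlation filter, used here only as a stand-in selection rule, keeps a named subset of the originals.

```r
## Minimal sketch: feature extraction vs. feature selection.
set.seed(6)
X <- matrix(rnorm(100 * 500), nrow = 100)  # 100 samples, 500 features
colnames(X) <- paste0("f", 1:500)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)      # target depends on f1, f2 only

## Feature extraction: new features are mixtures of all originals (less interpretable)
Z_extract <- prcomp(X)$x[, 1:10]

## Feature selection: keep a subset of the original, named features (interpretable)
scores   <- apply(X, 2, function(f) abs(cor(f, y)))  # a simple filter rule
selected <- names(sort(scores, decreasing = TRUE))[1:10]
selected                                   # original feature names are retained
```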
Related Knowledge Centers
- Cluster Analysis
- DNA Microarray
- Heaps' Law
- Newborn Screening
- Biclustering
- Bioinformatics
- Association Rule Learning
- Correlation Clustering