High-Dimensional Data Analysis
Atanu Bhattacharjee in Bayesian Approaches in Oncology Using R and OpenBUGS, 2020
Base sequencing procedures are now well developed, and a large amount of genetic data is publicly available. This abundance of data challenges us to develop analytical tools for it: such extensive genetic data must be analyzed with advanced computational methodology coupled with statistical techniques. Similarly, microarrays provide gene expression information; commonly, tens of thousands of variables are obtained from a single experiment. Datasets with this many variables are known as high-dimensional data. Earlier, microarrays were used to measure gene expression in serum or tissue; currently, they are also used for DNA methylation expression. Tremendous progress in microarray experiments has been observed, and similar growth in statistical analysis methods has followed. The main challenge in high-dimensional data analysis is gene effect classification: the task is to filter out a few relevant variables from tens of thousands. The conventional statistical methodology is the unsupervised approach, but the direction has recently shifted from unsupervised to supervised approaches. The supervised approach helps to relate the characteristics of interest (Y) to the gene expression data (X), as sketched below.
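The supervised filtering described here can be illustrated with a minimal R sketch. The expression matrix `expr` and phenotype `y` below are simulated placeholders, not data from the chapter; each gene is simply ranked by a two-sample t-test against the outcome.

```r
## Minimal sketch: supervised gene filtering via per-gene t-tests.
## `expr` (genes x samples) and the binary phenotype `y` are simulated.
set.seed(1)
n_genes   <- 10000                     # "tens of thousands" of variables
n_samples <- 40
expr <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
y    <- factor(rep(c("normal", "tumor"), each = n_samples / 2))

## Score each gene by how strongly its expression separates the classes (Y ~ X)
pvals <- apply(expr, 1, function(g) t.test(g ~ y)$p.value)

## Keep a handful of top-ranked genes out of tens of thousands
top_genes <- order(pvals)[1:20]
head(top_genes)
```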
Exploratory Data Analysis with Unsupervised Machine Learning
Altuna Akalin in Computational Genomics with R, 2020
Principal component analysis (PCA) is perhaps the most popular technique for examining high-dimensional data. There are multiple interpretations of how PCA reduces dimensionality. We will first focus on the geometrical interpretation, where the operation can be viewed as rotating the original dimensions of the data. For this, we go back to our example gene expression data set. In this example, we represent our patients with the expression profiles of just two genes, CD33 (ENSG00000105383) and PYGL (ENSG00000100504). This way we can visualize them in a scatter plot (see Figure 4.9).
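A minimal R sketch of this geometric view, using simulated stand-ins for the two genes (the book's actual data set and Figure 4.9 are not reproduced here): `prcomp()` returns the rotation that aligns the new axes with the directions of greatest variance.

```r
## Minimal sketch of PCA as a rotation, on simulated values for two genes.
set.seed(2)
CD33 <- rnorm(50)
PYGL <- 0.8 * CD33 + rnorm(50, sd = 0.4)  # correlated, as co-expressed genes often are
X <- scale(cbind(CD33, PYGL), center = TRUE, scale = FALSE)

pca <- prcomp(X)
pca$rotation                    # rotation matrix: new axes as combinations of the old
scores <- X %*% pca$rotation    # equivalently pca$x: data in rotated coordinates

plot(X, asp = 1, main = "Patients in CD33/PYGL expression space")
abline(0, pca$rotation[2, 1] / pca$rotation[1, 1], col = "red")  # direction of PC1
```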
Deep Learning to Diagnose Diseases and Security in 5G Healthcare Informatics
K. Gayathri Devi, Kishore Balasubramanian, Le Anh Ngoc in Machine Learning and Deep Learning Techniques for Medical Science, 2022
In addition, there are many data sources that can be used to enrich health data, including but not limited to genomics, health records, social media data, and environmental data. The main types of ML/DL that can be used in healthcare applications are:
- Unsupervised Learning
- Supervised Learning
- Semi-supervised Learning
- Reinforcement Learning

Unsupervised Learning: Unsupervised learning techniques are ML techniques that employ unlabelled data. They include clustering data points based on a similarity measure and dimensionality reduction to translate high-dimensional data into a lower-dimensional feature space (occasionally also referred to as feature selection), as sketched below.
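A minimal R sketch of the two unsupervised operations just named, dimensionality reduction and clustering, on simulated data (the 100-dimensional records below are purely illustrative):

```r
## Minimal sketch: unsupervised learning on unlabelled, high-dimensional data.
set.seed(3)
X <- rbind(matrix(rnorm(50 * 100, mean = 0), ncol = 100),
           matrix(rnorm(50 * 100, mean = 1), ncol = 100))

## Dimensionality reduction: map the 100-D points to a 2-D feature space
X2 <- prcomp(X)$x[, 1:2]

## Clustering: group points by a similarity measure (Euclidean distance)
cl <- kmeans(X2, centers = 2)
table(cl$cluster)
```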
Variable selection for mode regression
Published in Journal of Applied Statistics, 2018
Yingzhen Chen, Xuejun Ma, Jingke Zhou
Nowadays, with advances in the technology for collecting and storing data, high-dimensional data are very prevalent. Variable selection is one strategy for handling high-dimensional regression analysis. Penalized techniques have been proposed to conduct variable selection for conventional regression models by shrinking inactive coefficients to 0. For example, Tibshirani [8], Fan and Li [1], and Zhang [11] proposed LASSO, SCAD and MCP for mean regression, and Li and Zhu [7] and Wu and Liu [9] studied quantile regression via LASSO, SCAD and so on. However, variable selection for mode regression has rarely been studied. In this paper, we concentrate on the variable selection problem for mode regression by combining nonparametric kernel estimation with a sparsity penalty method. Theoretically, we explore and prove the asymptotic properties of the resulting estimator. Numerical simulations and real data analysis are also conducted to illustrate the finite-sample performance of the proposed procedure.
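To make the shrinkage idea concrete, here is a minimal R sketch of penalized variable selection with the LASSO of Tibshirani [8], using the `glmnet` package on simulated data. This illustrates mean regression only; it is not the paper's mode-regression estimator.

```r
## Minimal sketch: LASSO shrinks inactive coefficients to exactly 0.
library(glmnet)

set.seed(4)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -1.5, 0.8, rep(0, p - 3))  # only 3 active coefficients
y <- X %*% beta + rnorm(n)

fit <- cv.glmnet(X, y)                  # cross-validated lasso path
coef(fit, s = "lambda.min")             # inactive coefficients are set to 0
```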
Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference scale-invariant test
Published in Journal of Applied Statistics, 2023
Liang Zhang, Tianming Zhu, Jin-Ting Zhang
The problem of testing the equality of mean vectors for high-dimensional data is frequently encountered in many contemporary statistical studies. One prominent aspect of high-dimensional data is that there are many measurements taken on only a few subjects; that is, the number of variables is much larger than the number of observations. For example, in DNA microarray data, thousands of gene expression levels are often measured on relatively few subjects. Our motivating example is the colon data set, which is well known and publicly available at http://microarray.princeton.edu/oncology/affydata/index.html. It contains 22 normal colon tissues and 40 tumor colon tissues, each having 2000 gene expression levels. It is of interest to check whether the normal colon tissues and the tumor colon tissues have the same mean expression levels. In this two-sample problem, the data dimension p = 2000 is much larger than the total sample size n = 62, and the covariance matrices of the two samples are probably not the same. Therefore, the classical Hotelling T² test is not applicable, since the pooled sample covariance matrix is singular when p > n.
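A minimal R sketch of why Hotelling's T² breaks down in this setting, on simulated data with the same dimensions as the colon data (not the data set itself): the pooled sample covariance matrix has rank at most n − 2 = 60, far below p = 2000, so it cannot be inverted.

```r
## Minimal sketch: the pooled covariance matrix is singular when p > n.
set.seed(5)
p <- 2000; n1 <- 22; n2 <- 40
X1 <- matrix(rnorm(n1 * p), n1, p)   # "normal" tissues
X2 <- matrix(rnorm(n2 * p), n2, p)   # "tumor" tissues

## Pooled sample covariance, as used by Hotelling's T^2
S <- ((n1 - 1) * cov(X1) + (n2 - 1) * cov(X2)) / (n1 + n2 - 2)
qr(S)$rank                           # at most 60, far below p = 2000
```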
A generalized l2,p-norm regression based feature selection algorithm
Published in Journal of Applied Statistics, 2023
In many applications such as genetic data analysis, image processing and data mining, one often encounters very high-dimensional data. Some features of the high-dimensional data are related to the target task, while many features are redundant [23]. Therefore, dimension reduction has become an important stage of data preprocessing in such applications [12,13]. Feature selection and feature extraction are the two main dimension reduction methods [2,22]. Feature extraction transforms the original data into a new low-dimensional subspace, while a feature selection algorithm selects low-dimensional features from the original high-dimensional data according to certain processing rules. The latter retains the original representation of the data without changing the original features and is therefore interpretable, whereas the former cannot do this [23]. Over the years, research on feature selection has received more and more attention and has made considerable progress.
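A minimal R sketch contrasting the two routes on simulated data (no specific method from the paper is implied): feature extraction builds new, mixed features, while a simple correlation filter, used here only as a stand-in selection rule, keeps a named subset of the originals.

```r
## Minimal sketch: feature extraction vs. feature selection.
set.seed(6)
X <- matrix(rnorm(100 * 500), nrow = 100)  # 100 samples, 500 features
colnames(X) <- paste0("f", 1:500)
y <- X[, 1] - 2 * X[, 2] + rnorm(100)      # target depends on f1, f2 only

## Feature extraction: new features are mixtures of all originals (less interpretable)
Z_extract <- prcomp(X)$x[, 1:10]

## Feature selection: keep a subset of the original, named features (interpretable)
scores   <- apply(X, 2, function(f) abs(cor(f, y)))  # a simple filter rule
selected <- names(sort(scores, decreasing = TRUE))[1:10]
selected                                   # original feature names are retained
```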
Related Knowledge Centers
- Cluster Analysis
- DNA Microarray
- Heaps' Law
- Newborn Screening
- Biclustering
- Bioinformatics
- Association Rule Learning
- Correlation Clustering