High dimensional data – Knowledge and References

Reliable Biomedical Applications Using AI Models

Published in Punit Gupta, Dinesh Kumar Saini, Rohit Verma, Healthcare Solutions Using Machine Learning and Informatics, 2023

Shambhavi Mishra, Tanveer Ahmed, Vipul Mishra

Studies based on genomic sequencing and gene expression directed towards protein structure prediction fall under the biomedical sector. Several studies apply omics to genomics, but applications in biomedicine and bioinformatics can also be found. Omics covers fields such as proteomics, metabolomics, genomics, transcriptomics, and epigenomics. It also concerns protein–protein interactions (PPIs). The authors of [39] studied statistical learning frameworks that are integrated with multidisciplinary areas including biology, machine learning, and AI. In the literature, PCA, clustering methods, regularization-based methods, regression methods, and knowledge enhancement learning have all been investigated and analyzed. The limitations and strengths of multiple standard ML methods are also discussed. According to the research in [40], image data alone is insufficient for analyzing complicated disorders and obtaining an appropriate diagnosis. Alongside large, high-quality data sets, domain knowledge and multiple networks are also important. While high-dimensional data can yield better results, all three components are crucial for robust ML model training and validation. The authors of one study [41] looked at various AI-based approaches to analyzing different types of cancer.

Deep Learning to Diagnose Diseases and Security in 5G Healthcare Informatics

Published in K. Gayathri Devi, Kishore Balasubramanian, Le Anh Ngoc, Machine Learning and Deep Learning Techniques for Medical Science, 2022

Partha Ghosh

In addition, there are many data sources that can be used to enrich health data, including but not limited to genomics, health data, social media data, and environmental data. The following are the main types of ML/DL that can be used in healthcare applications: unsupervised learning, supervised learning, semi-supervised learning, and reinforcement learning.

Unsupervised Learning: Unsupervised learning techniques are ML techniques that employ unlabelled data. They include clustering, which groups data points based on a similarity measure, and dimensionality reduction, which maps high-dimensional data to a lower-dimensional feature space (occasionally also referred to as feature selection).
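As a rough illustration of dimensionality reduction on unlabelled data, the sketch below finds the first principal component by power iteration and projects each sample onto it. It is not taken from the chapter: the function names (`center`, `first_principal_component`, `project`) and the toy data are assumptions made for this example, and in practice a library implementation of PCA would normally be used.

```python
import math
import random


def center(rows):
    """Subtract the per-feature mean from every sample."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[r[j] - means[j] for j in range(p)] for r in rows]


def first_principal_component(rows, iterations=200, seed=0):
    """Approximate the leading principal direction by power iteration.

    Repeatedly applies X^T X (proportional to the sample covariance)
    to a unit vector and renormalises it.
    """
    X = center(rows)
    p = len(X[0])
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(p)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        scores = [sum(x[j] * v[j] for j in range(p)) for x in X]      # X v
        w = [sum(X[i][j] * scores[i] for i in range(len(X)))          # X^T (X v)
             for j in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:  # data has no variance along v; stop early
            break
        v = [c / norm for c in w]
    return v


def project(rows, direction):
    """Map each centred sample to a single coordinate along `direction`."""
    X = center(rows)
    return [sum(x[j] * direction[j] for j in range(len(direction))) for x in X]


if __name__ == "__main__":
    # Hypothetical 3-feature samples that vary only along the (1, 1, 0) direction.
    samples = [[1.0, 1.0, 0.0], [2.0, 2.0, 0.0], [3.0, 3.0, 0.0], [4.0, 4.0, 0.0]]
    pc1 = first_principal_component(samples)
    print(project(samples, pc1))
```

The projection replaces three features with one coordinate per sample. The sign of the component is arbitrary, so only the relative positions of the projected points are meaningful.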

High-Dimensional Data Analysis

Published in Atanu Bhattacharjee, Bayesian Approaches in Oncology Using R and OpenBUGS, 2020

Atanu Bhattacharjee

The base sequencing procedure is now well developed, and a large amount of genetic data is publicly available. This volume of data challenges us to develop analytical tools for such accumulated data. It is essential to analyze extensive genetic data with advanced computational methodology coupled with statistical techniques. Similarly, microarrays provide gene expression information. Commonly, tens of thousands of variables are obtained from a single experiment. Datasets with this many variables are known as high-dimensional data. Earlier, microarrays were used to measure gene expression in serum or tissue. Currently, they are also used to measure DNA methylation. Tremendous progress has been observed in microarray experiments, and similar growth in statistical analysis methods has followed. The main challenge in high-dimensional data analysis is gene effect classification: filtering out a few variables from tens of thousands is the task of gene classification. The conventional statistical approach is unsupervised, but the direction has now shifted from unsupervised to supervised approaches. The supervised approach helps relate the characteristics (Y) to the gene expression data (X).
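One simple way to picture supervised filtering of a few variables out of many is to rank each gene by how strongly its expression differs between the two classes of Y. The sketch below uses the absolute Welch t-statistic as the ranking score. This is an illustrative choice and not the Bayesian methodology of the book; the names `welch_t` and `top_genes` and the toy data are assumptions made for this example.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch two-sample t-statistic for groups a and b (each needs >= 2 values)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0.0:
        # No within-group variation: identical groups score 0, otherwise infinite.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def top_genes(X, y, k):
    """Return the indices of the k genes that best separate the two classes.

    X: list of samples, each a list of p expression values.
    y: list of 0/1 class labels, one per sample.
    """
    p = len(X[0])
    scored = []
    for j in range(p):
        group0 = [x[j] for x, label in zip(X, y) if label == 0]
        group1 = [x[j] for x, label in zip(X, y) if label == 1]
        scored.append((abs(welch_t(group1, group0)), j))
    # Largest score first; ties broken by gene index for reproducibility.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [j for _, j in scored[:k]]


if __name__ == "__main__":
    # Hypothetical data: 6 samples, 3 genes; only gene 1 differs between classes.
    X = [[1.0, 5.0, 2.0], [1.1, 5.2, 1.9], [0.9, 4.9, 2.1],
         [1.0, 9.0, 2.0], [1.1, 9.1, 2.1], [0.9, 8.8, 1.9]]
    y = [0, 0, 0, 1, 1, 1]
    print(top_genes(X, y, k=1))
```

With tens of thousands of genes, scores like these would normally be followed by a multiple-testing correction or by cross-validation, since some genes will rank highly by chance alone.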

Machine Learning-Based prediction of Post-Treatment ambulatory blood pressure in patients with hypertension

Published in Blood Pressure, 2023

Hyeonyong Hae, Soo-Jin Kang, Tae Oh Kim, Pil Hyung Lee, Seung-Whan Lee, Young-Hak Kim, Cheol Whan Lee, Seong-Wook Park

In our current study, the ML models for predicting post-treatment ambulatory BP level utilised features including clinical characteristics, initial 24-hour ABPM data, and initial and adjusted anti-hypertension medication. Both untreated patients and those already receiving BP-lowering therapy were enrolled in the analysis. Moreover, ABPM-derived mean 24-hour and daytime BPs were used as the endpoints to precisely assess the effectiveness of medical treatment. By applying kernels, support vector machines are automatically regularised to avoid overfitting with high-dimensional data. The predictive value of a similarity measure was assessed by using K-nearest neighbours. Among decision tree models with similar performance, CatBoost for gradient boosting on decision trees was used. Among the developed models, CatBoost best predicted the post-treatment ambulatory BP levels. The percentage differences between CatBoost-predicted vs. ABPM-measured mean 24-hour SBP and DBP at follow-up were 6.6% ± 5.7% and 6.8% ± 5.5%, respectively. The model also predicted the mean daytime BP with a difference of <7%. Even among high-risk patients with chronic renal diseases and diabetes mellitus, a consistent correlation between the CatBoost-predicted vs. ABPM-measured mean 24-hour BP at follow-up was observed. For patients with a high BP variability in whom achieving the target BP is challenging, the models accurately predicted post-treatment BP changes.
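The "percentage difference" figures above are reported as mean ± standard deviation. The sketch below shows one plausible way such a summary could be computed, assuming the per-patient difference is the absolute gap between predicted and measured BP, divided by the measured value. The excerpt does not state the exact definition, so this formula, the function names, and the toy values are all assumptions.

```python
from statistics import mean, stdev


def percent_differences(predicted, measured):
    """Absolute percentage difference for each (predicted, measured) pair."""
    return [100.0 * abs(p - m) / m for p, m in zip(predicted, measured)]


def summarize(predicted, measured):
    """Return (mean, sample standard deviation) of the percentage differences.

    Requires at least two pairs, because the sample standard deviation
    is undefined for a single value.
    """
    diffs = percent_differences(predicted, measured)
    return mean(diffs), stdev(diffs)


if __name__ == "__main__":
    # Hypothetical mean 24-hour SBP values in mmHg, not data from the study.
    predicted_sbp = [128.0, 135.0, 122.0, 141.0]
    measured_sbp = [120.0, 140.0, 125.0, 130.0]
    avg, sd = summarize(predicted_sbp, measured_sbp)
    print(f"{avg:.1f}% ± {sd:.1f}%")
```

Because the absolute value is taken, this metric measures the size of the error but not its direction. A signed version would be needed to detect systematic over- or under-prediction.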

Hot spot identification method based on Andrews curves: an application on the COVID-19 crisis effects on caregiver distress in neurocognitive disorder

Published in Journal of Applied Statistics, 2023

E. Skamnia, P. Economou, S. Bersimis, M. Frouda, A. Politis, P. Alexopoulos

For instance, psychometric data of a person might involve many different types of measurements, such as indices connected to personality, intelligence, anxiety, depression, neuro-psychology, etc., resulting in high-dimensional data. Identifying a number of people who share, more or less, the same characteristics can help experts to divide them into groups according to their responses in psychometric tests and to recognize good or bad practices. This may help to develop useful guidelines and treatment for future reference. In Section 5, an application of the proposed method is presented. More specifically, the proposed method is applied to data related to dyads (caregiver–patient) collected during the first confinement in Greece due to the SARS-CoV-2 pandemic. By applying the proposed method, the general aim is to detect one or more groups of caregivers with common characteristics. Such hot spots are identified for caregivers of people with neurocognitive disorder who managed to cope well with the mental effects of the COVID-19 pandemic and the burden related to symptoms of neurocognitive disorder. More details about this specific data set are included in Section 5.
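For readers unfamiliar with Andrews curves, the sketch below evaluates the standard definition, which maps a multivariate observation x = (x1, x2, x3, ...) to the function f(t) = x1/√2 + x2 sin t + x3 cos t + x4 sin 2t + x5 cos 2t + ... on [-π, π]. Observations with similar values produce similar curves. This shows only the curve itself and not the hot spot identification method proposed in the article; the function names and the example vector are assumptions made for this example.

```python
import math


def andrews_curve(x, t):
    """Evaluate the Andrews curve of observation x at angle t (radians)."""
    value = x[0] / math.sqrt(2.0)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # frequencies run 1, 1, 2, 2, 3, 3, ...
        # Odd positions use sine terms, even positions use cosine terms.
        value += xi * (math.sin(k * t) if i % 2 == 1 else math.cos(k * t))
    return value


def sample_curve(x, n_points=101):
    """Evaluate the curve on an evenly spaced grid over [-pi, pi]."""
    step = 2.0 * math.pi / (n_points - 1)
    return [andrews_curve(x, -math.pi + i * step) for i in range(n_points)]


if __name__ == "__main__":
    # Hypothetical standardised scores for one person on four psychometric indices.
    person = [0.5, -1.2, 0.3, 0.8]
    curve = sample_curve(person, n_points=5)
    print([round(v, 3) for v in curve])
```

In practice the variables are usually standardised first, because the ordering and scale of the coordinates affect the shape of the curve.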

Two-sample Behrens–Fisher problems for high-dimensional data: a normal reference scale-invariant test

Published in Journal of Applied Statistics, 2023

Liang Zhang, Tianming Zhu, Jin-Ting Zhang

The problem of testing the equality of mean vectors for high-dimensional data is frequently encountered in many contemporary statistical studies. One prominent aspect of high-dimensional data is that many measurements are taken on only a few subjects, that is, the number of variables is much larger than the number of observations. For example, in DNA microarray data, thousands of gene expression levels are often measured on relatively few subjects. Our motivating example is the colon data set, which is well known and publicly available at http://microarray.princeton.edu/oncology/affydata/index.html. It contains 22 normal colon tissues and 40 tumor colon tissues, each having 2000 gene expression levels. It is of interest to check whether the normal colon tissues and the tumor colon tissues have the same mean expression levels. In this two-sample problem, the data dimension p = 2000 is much larger than the total sample size n = 62, and the covariance matrices of the two samples are probably not the same. Therefore, the classical Hotelling p of