Exploratory Data Analysis with Unsupervised Machine Learning
Published in Altuna Akalin, Computational Genomics with R, 2020
Another related and perhaps more robust algorithm is called “k-medoids” clustering (Reynolds et al., 2006). The procedure is almost identical to k-means clustering, with a couple of differences: the cluster centers are actual data points (in our case, patients), and the quantity optimized in each iteration is the Manhattan distance to the medoid, whereas k-means minimizes the sum of squared (Euclidean) distances. Below we show how to use the k-medoids clustering function pam() from the cluster package.
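The differences from k-means can be made concrete with a small sketch. The excerpt uses R's pam() from the cluster package; the following is an illustrative pure-Python k-medoids loop, not the PAM implementation itself, and the toy data are invented. Medoids are actual data points, and each update picks the cluster member minimizing the summed Manhattan distance to its cluster.

```python
# Illustrative pure-Python k-medoids (the excerpt itself uses R's cluster::pam).
# Medoids are actual data points; the objective is the summed Manhattan
# distance of each point to its nearest medoid.

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def assign(points, medoids):
    # label each point with the index of its closest medoid
    return [min(range(len(medoids)), key=lambda j: manhattan(p, medoids[j]))
            for p in points]

def k_medoids(points, k, iters=10):
    medoids = list(points[:k])  # naive initialization: first k points
    for _ in range(iters):
        labels = assign(points, medoids)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                # new medoid: the member with minimal summed distance to its cluster
                medoids[j] = min(members,
                                 key=lambda c: sum(manhattan(c, p) for p in members))
    return medoids, assign(points, medoids)

# toy data: two tight groups of points ("patients" in the excerpt's setting)
data = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
medoids, labels = k_medoids(data, k=2)
# medoids are actual data points: here (1, 1) and (8, 8)
```

Because the centers are members of the data set, the result stays interpretable (each cluster is summarized by a real patient) and a single extreme value cannot drag a center away the way it drags a mean.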
Cluster Analysis
Published in Nusrat Rabbee, Biomarker Analysis in Clinical Trials with R, 2020
PAM clustering is similar to the k-means algorithm except that it works on medoids, which makes it more robust to outliers. The PAM algorithm searches for k representative objects (the medoids) in a data set and then assigns each object to the closest medoid to form clusters. Its aim is to minimize the sum of dissimilarities between the objects in a cluster and that cluster's center (medoid).
Analysis of DNA Microarrays
Published in John Crowley, Antje Hoering, Handbook of Statistics in Clinical Oncology, 2012
Shigeyuki Matsui, Hisashi Noma
Partitioning clustering algorithms produce a single collection of non-nested disjoint clusters for a prespecified number of clusters and initial partitioning. The k-means clustering (MacQueen 1967), k-medoids clustering (Kaufman and Rousseeuw 1990), and self-organizing maps (SOM) (Tamayo et al. 1999) are such algorithms that have been applied to microarray data. For a given number of clusters k and initial cluster centers, k-means clustering partitions the objects so that the sum of squared distances of each object to its closest cluster center is minimized. The k-medoids clustering uses medoids instead of centroids for the centers of clusters, which is more robust to outliers than k-means. SOM is a neural network procedure that can be viewed as a constrained version of k-means clustering that forces the cluster centers to lie in a discrete two-dimensional space to aid interpretation. An advantage of partitioning clustering is that, by utilizing prior information on the number of clusters, it reduces the risk of clustering on noise, a weakness of hierarchical clustering, although one does not typically know the number of clusters and the prior information can be incorrect. Partitioning clustering is less computationally demanding than hierarchical clustering, which is particularly advantageous for clustering thousands of genes. An important practical issue is how to choose the initial partitioning, which can largely impact the final result. It is generally recommended that a partitioning procedure be run repeatedly for different sets of initial cluster centers, and that the partition minimizing the within-cluster sum of squares be chosen for a given number of clusters.
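The recommendation above, running the procedure from several initial partitions and keeping the one with the smallest within-cluster sum of squares, can be sketched as follows. This is an illustrative pure-Python k-means with seeded random restarts, not code from the handbook, and the toy data are invented.

```python
import random

def dist2(a, b):
    # squared Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_once(points, k, rng, iters=20):
    centers = rng.sample(points, k)  # random initial centers drawn from the data
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    wss = sum(dist2(p, centers[l]) for p, l in zip(points, labels))
    return labels, wss

def kmeans_best_of(points, k, n_starts=5):
    # repeat from different initial centers; keep the smallest within-cluster SS
    runs = [kmeans_once(points, k, random.Random(seed)) for seed in range(n_starts)]
    return min(runs, key=lambda run: run[1])

data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, wss = kmeans_best_of(data, k=2)
```

Selecting the restart with minimal within-cluster sum of squares is exactly the tie-breaking rule the excerpt recommends for a fixed k.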
The interplay of gut microbiota between donors and recipients determines the efficacy of fecal microbiota transplantation
Published in Gut Microbes, 2022
Ruiqiao He, Pan Li, Jinfeng Wang, Bota Cui, Faming Zhang, Fangqing Zhao
We first clustered the gut microbiota of the patients by following a previously published tutorial.68 To decrease noise, a genus was discarded if its average abundance across all samples was below 1%. Samples were clustered by partitioning around medoids (PAM), and the optimal number of clusters was estimated using the Calinski-Harabasz index. Samples were projected into two dimensions and visualized through principal coordinates analysis (PCoA) by the “dudi.pco” function in the ade4 package in R. Cluster dissimilarity was measured by the “adonis” function in the vegan package in R. The dominant taxon in each enterotype was identified based on the significance level, fold change and relative abundance between enterotypes. Enterobacteriaceae was identified as the dominant taxon in the RCPT/E, because it was abundant and significantly enriched in this cluster (Wilcoxon test, q < 0.001). Four of the top five differential genera in the RCPT/E (Wilcoxon test, q < 0.001) were from Enterobacteriaceae. To make the dominant taxon comparable between RCPT/E and RCPT/B, the most abundant and significantly differential genus in Enterobacteriaceae (Wilcoxon test, q < 0.001) was used to represent the dominant genus in the subsequent analyses.
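The Calinski-Harabasz index used above to choose the number of clusters is the ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom. A minimal pure-Python sketch follows; the study itself used R packages, and the toy data and partitions here are invented for illustration.

```python
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def calinski_harabasz(points, labels):
    # CH = (between-cluster SS / (k - 1)) / (within-cluster SS / (n - k))
    n, clusters = len(points), sorted(set(labels))
    k = len(clusters)
    grand = tuple(sum(v) / n for v in zip(*points))
    within = between = 0.0
    for c in clusters:
        members = [p for p, l in zip(points, labels) if l == c]
        centroid = tuple(sum(v) / len(members) for v in zip(*members))
        within += sum(dist2(p, centroid) for p in members)
        between += len(members) * dist2(centroid, grand)
    return (between / (k - 1)) / (within / (n - k))

# invented toy data: the natural 2-cluster split scores far higher than a
# partition that misassigns one point
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = calinski_harabasz(data, [0, 0, 0, 1, 1, 1])
bad = calinski_harabasz(data, [0, 0, 1, 1, 1, 1])
```

In the study's workflow, the index would be computed for the PAM partition at each candidate number of clusters and the maximizing value chosen.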
Dysbiosis in a canine model of human fistulizing Crohn’s disease
Published in Gut Microbes, 2020
Ana Maldonado-Contreras, Lluís Ferrer, Caitlin Cawley, Sarah Crain, Shakti Bhattarai, Juan Toscano, Doyle V. Ward, Andrew Hoffman
For analysis, we considered only taxa that were detected in at least 10% of samples and had a relative abundance of 0.2% or greater in at least one sample. Community patterns were analyzed using partitioning around medoids with estimation of the number of clusters (PAMK) to find the optimal number of clusters, as performed previously,44 and visualized after multidimensional scaling (MDS). Clustering analyses and visualizations were performed in Phyloseq 1.26.1 and the R package cluster v1.4-1 to estimate microbiome patterns using PAMK with optimum average silhouette width.82,83 Briefly, the PAM algorithm is based on the search for “k” representative objects, or k-medoids, among the observations of the data set. In k-medoids clustering, each cluster is represented by one of the data points in the cluster; these points are called cluster medoids. After finding a set of k-medoids, clusters are constructed by assigning each observation to the nearest medoid. To estimate the optimal number of clusters, we used the average silhouette method: the PAM algorithm is run for different values of k, and the average cluster silhouette is calculated for each. A high average silhouette width indicates good clustering, so the optimal number of clusters k is the one that maximizes the average silhouette over the range of candidate values.
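The average-silhouette procedure described above can be sketched in a few lines. This is an illustrative pure-Python version, not the cluster-package code the authors used; the toy data and the candidate partition for each k are hypothetical (in practice each would come from running PAM with that k).

```python
import math

def avg_silhouette(points, labels, dist):
    # mean silhouette width s(i) = (b - a) / max(a, b) over all points, where
    # a = mean distance to own cluster, b = mean distance to nearest other cluster
    n = len(points)
    total = 0.0
    for i in range(n):
        same = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            continue  # convention: a singleton cluster contributes s(i) = 0
        a = sum(dist(points[i], points[j]) for j in same) / len(same)
        b = min(sum(dist(points[i], points[j])
                    for j in range(n) if labels[j] == c) / labels.count(c)
                for c in set(labels) - {labels[i]})
        total += (b - a) / max(a, b)
    return total / n

# hypothetical candidate partitions for each k
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = {2: [0, 0, 0, 1, 1, 1],
              3: [0, 0, 1, 2, 2, 2]}
best_k = max(candidates, key=lambda k: avg_silhouette(data, candidates[k], math.dist))
```

Here the natural two-cluster split yields the higher average silhouette width, so k = 2 is selected.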
Unsupervised classification of eclipsing binary light curves through k-medoids clustering
Published in Journal of Applied Statistics, 2020
Soumita Modak, Tanuka Chattopadhyay, Asis Kumar Chattopadhyay
k-medoids is a nonparametric, partitioning-based clustering method which can be applied to univariate time series such as LCs. We use the fast and efficient algorithm ‘PAM’ [19], executed via the built-in function ‘pam’ in R. It is based on the search for k medoids in the data set, where a medoid is the representative object of the cluster it belongs to: the data set of size N is partitioned into k mutually exclusive and exhaustive clusters, each represented by its medoid, i.e. the object of the cluster for which the sum of distances to all other objects of the cluster is minimal. Because it uses medoids, this method is robust against noise, outliers and sparsely distributed data [43]. It also allows any distance measure, chosen according to the nature of the given data.
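The medoid definition used here, the cluster member minimizing the summed distance to all other members, works for any dissimilarity, which is what lets PAM accept an arbitrary distance measure. A minimal pure-Python sketch with made-up toy series:

```python
def medoid(objects, dist):
    # the medoid is the object minimizing its summed distance to all objects
    return min(objects, key=lambda c: sum(dist(c, o) for o in objects))

# made-up short univariate series standing in for light curves; Manhattan
# distance here, but any dissimilarity function can be plugged in
series = [(0, 1, 2), (0, 1, 3), (5, 5, 5)]
manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))
m = medoid(series, manhattan)  # (0, 1, 3): smallest summed distance to the rest
```

Swapping in a correlation- or shape-based dissimilarity requires changing only the dist argument, not the clustering logic.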