Environmental Monitoring and Assessment – Normal Response Models
Published in Bayesian Applications in Environmental and Ecological Studies with R and Stan, 2023
Song S. Qian, Mark R. DuFour, Ibrahim Alameddine
The question answered by a discordancy test is whether a data point comes from the contaminated area of the Superfund site. A discordancy test is effective only when the number of “outliers” in the sample is small relative to the total sample size, but the exact number of contaminated observations is rarely known. When the number of contaminated samples is large, the problem of outlier detection becomes one of discriminating between two classes of data. From a practical viewpoint, evaluating the background concentration of a contaminant does not require identifying whether each data point came from a contaminated location. As such, a direct estimation approach should be far more effective than applying a hypothesis-testing-based approach to individual observations.
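The masking effect described above can be sketched numerically (a hypothetical illustration with made-up concentrations, not the authors' model): when contamination is heavy, a simple discordancy-style z-score rule flags nothing, while a robust direct estimate still recovers the background level.

```python
import statistics

# Hypothetical concentrations: background near 5, plus three contaminated samples.
data = [4.8, 5.1, 4.9, 5.3, 5.0, 4.7, 5.2, 12.4, 11.9, 13.1]

# Discordancy-style screening: flag points more than 2 SDs from the mean.
# The contaminated points inflate both the mean and the SD, masking
# one another, so nothing is flagged.
mu = statistics.mean(data)
sd = statistics.stdev(data)
flagged = [x for x in data if abs(x - mu) / sd > 2]

# Direct estimation: a robust location estimate (here simply the median)
# recovers the background level without classifying any individual point.
background = statistics.median(data)
```

Here `flagged` comes back empty even though 30% of the sample is contaminated, while `background` lands near 5, which is exactly the contrast the passage draws.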
Model Assessment
Published in Bayesian Thinking in Biostatistics, 2021
Gary L. Rosner, Purushottam W. Laud, Wesley O. Johnson
Data are not always well behaved. What does this mean? Often, one can model a set of data reasonably well with a parametric model like the normal or log-normal distribution. Sometimes, however, one or a few of the observations are erroneous. Perhaps someone entered the number 10.67 into the data file as 1.067 or 106.7. Sometimes the stipulated data collection protocol is violated. For example, an experiment on cloud seeding for producing rainfall (Cook and Weisberg [86]) specified randomized seeding, or not, on “suitable” days. Unfortunately, cloud seeding was done on one day that was not suitable, and an extraordinary amount of rainfall was recorded. “Mistakes” like these often result in what are called “outlier” observations. In some instances, outliers can have a major effect on the resulting data analysis, though this will not always be the case. It is important in any data analysis to check carefully for gross outliers. Outliers will not always be visible to the eye, however, especially if the data are multivariable. Here we present some methods for outlier detection, along with accommodation where appropriate.
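A decimal-shift entry error like the 10.67/106.7 example can often be caught with a robust screening rule. The following is a minimal sketch (hypothetical data, not from the book) using the median absolute deviation, which, unlike a mean-and-SD rule, is not distorted by the outlier it is trying to find:

```python
import statistics

# Hypothetical measurements; one value was mis-entered as 106.7
# instead of 10.67 (a shifted decimal point).
obs = [10.2, 9.8, 11.1, 10.5, 106.7, 10.9, 9.6, 10.4]

# Flag points whose distance from the median exceeds 3.5 robust
# standard deviations, where the scale is estimated by the median
# absolute deviation (MAD); 1.4826 makes the MAD consistent with
# the SD under normality.
med = statistics.median(obs)
mad = statistics.median([abs(x - med) for x in obs])
outliers = [x for x in obs if abs(x - med) / (1.4826 * mad) > 3.5]
```

Only the mis-entered 106.7 is flagged; the remaining points sit well within the robust band.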
High-Dimensional Data Analysis
Published in Atanu Bhattacharjee, Bayesian Approaches in Oncology Using R and OpenBUGS, 2020
The HighDimOut package implements outlier detection algorithms for high-dimensional data and makes their results comparable through an outlier-score unification scheme. One example of running this package is detailed.
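HighDimOut itself is an R package; as a language-neutral illustration of the unification idea (a sketch of Gaussian score scaling, not HighDimOut's actual implementation), raw outlier scores from different algorithms can be mapped onto a common [0, 1] scale so they become comparable:

```python
import math
import statistics

def unify(scores):
    """Map raw outlier scores onto [0, 1] via Gaussian scaling:
    standardize each score against the sample mean and SD, then
    pass it through the error function, clipping negatives to 0
    so that only unusually large scores register as outlying."""
    mu = statistics.mean(scores)
    sd = statistics.stdev(scores)
    return [max(0.0, math.erf((s - mu) / (sd * math.sqrt(2)))) for s in scores]

# Scores from any detector: three ordinary points and one extreme one.
unified = unify([1.0, 2.0, 3.0, 10.0])
```

After unification, the three ordinary scores map to 0 while the extreme score maps close to 1, regardless of the original score's units.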
Bootstrap Outlier Identification in Clinical Datasets for Lens Power Formula Constant Optimization
Published in Current Eye Research, 2023
Achim Langenbucher, Nóra Szentmáry, Alan Cayless, Jascha Wendelstein, Peter Hoffmann
To address these issues, we applied bootstrap outlier identification, a method well known in statistics, to our task of automated outlier detection, in order to clean up clinical datasets transferred to us for formula constant optimization.13 Waviness or multimodality in the Bootlier plot in this context indicates that the dataset contains at least one outlier that requires cleaning.13 We implemented a stepwise algorithm that eliminates extreme values from the tails of the distribution of the formula prediction error; at each cycle the Bootlier plot is derived to qualify the new dataset for formula constant optimization. As the termination criterion for the stepwise procedure we used the Bootlier Index established by Singh and Xie in 2003.13 This index measures the multimodality of the Bootlier plot: the higher the Bootlier Index, the wavier the Bootlier graph (as shown in Figure 1). In our study we used a cut-off of 0.001 for the Bootlier Index. However, this cut-off has to be adapted to the windowing when the kernel PDF is derived from the distribution of the mean minus the trimmed mean PE.
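The statistic underlying the Bootlier plot can be sketched as follows (a simplified illustration with made-up data, not the authors' implementation): bootstrap the "mean minus trimmed mean" of the sample and examine the distribution of that statistic; outliers make it markedly wider and multimodal, which is the waviness the plot visualizes.

```python
import random
import statistics

def bootlier_statistic(sample, n_boot=2000, trim=2, seed=1):
    """Bootstrap the 'mean minus trimmed mean' statistic whose kernel
    density the Bootlier plot shows. Each resample is drawn with
    replacement; the trimmed mean drops `trim` points from each tail."""
    rng = random.Random(seed)
    n = len(sample)
    stats = []
    for _ in range(n_boot):
        b = sorted(rng.choices(sample, k=n))
        stats.append(statistics.mean(b) - statistics.mean(b[trim:n - trim]))
    return stats

# Clean data versus the same data with one gross outlier appended.
clean = [float(i) for i in range(20)]
spread_clean = statistics.stdev(bootlier_statistic(clean))
spread_dirty = statistics.stdev(bootlier_statistic(clean + [100.0]))
```

With the outlier present, resamples that happen to contain it more times than the trimming removes shift the mean but not the trimmed mean, so the statistic's distribution spreads out and develops extra modes; `spread_dirty` is much larger than `spread_clean`.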
Multicriteria decision frontiers for prescription anomaly detection over time
Published in Journal of Applied Statistics, 2022
Babak Zafari, Tahir Ekin, Fabrizio Ruggeri
This work presents an integrated statistical learning and decision framework based on multicriteria concentration curves for drug prescription anomaly decision making over time. The model first utilizes a natural language processing algorithm (i.e. structural topic model) to detect prescription patterns. It then introduces different outlier detection approaches to detect anomalies for overall prescription patterns or certain drug groups such as opioids. The final results of this unsupervised model will be embedded into a visual tool along with decision frontiers determined based on different risk thresholds. The visual tool enables health care practitioners or auditors to assess the trade-offs among different criteria and identify audit leads to detect aberrant prescription behaviors. The proposed framework is modular and can be integrated in other decision support tools to help detect anomalies and generate investigation leads. It is also general enough to study prescription data over different time frames, such as monthly billings.
Evaluation of robust outlier detection methods for zero-inflated complex data
Published in Journal of Applied Statistics, 2020
M. Templ, J. Gussenbauer, P. Filzmoser
Statistical outlier detection methods are usually built around some sort of robust statistical estimator. Such estimators are characterized by not being strongly influenced by outliers, which enables them to produce reliable estimates even when extreme values are present in the data. The robustness of an estimator T is typically characterized by either the 'influence function' (IF) or the 'breakdown point' (BP). The IF describes the effect of a single outlier (or a very small amount of contamination) on an estimator T, and for estimation methods, including various outlier detection methods, one prefers estimators T with a bounded IF. Whereas the IF describes the influence on the estimator T of small amounts of contamination, the breakdown point specifies the minimal amount of contamination for which the estimator is no longer able to produce a useful estimate. The maximal achievable breakdown point is 50%.
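The breakdown behavior described above can be seen empirically. A minimal sketch with made-up data: the mean (breakdown point 0) is ruined by a single contaminated value, while the median tolerates contamination of up to half the sample.

```python
import statistics

# Ten hypothetical "clean" observations, all equal to 10.
clean = [10.0] * 10

def contaminate(k, value=1e6):
    """Replace k of the 10 observations with an extreme value."""
    return [value] * k + clean[k:]

# Track both estimators as the contaminated fraction grows from 0% to 40%.
mean_shift = [statistics.mean(contaminate(k)) for k in range(5)]
median_shift = [statistics.median(contaminate(k)) for k in range(5)]
```

A single contaminated value already drags the mean to roughly 100,000, whereas the median stays at 10 for every contamination level below 50%, illustrating why the median's breakdown point attains the 50% maximum.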