
Last Updated: 21 Feb 2025


Introduction

Raw data is almost always messy. It often contains errors ranging from incorrect entries to missing values. Failing to handle these errors can disrupt machine learning algorithms and statistical inference, leading to inflated variance and biased estimates. However, erroneous data often follows certain patterns. In Part 1, we discussed how missing data can be Missing At Random, Missing Completely At Random, or Missing Not At Random, all of which relate to the correlation structure of the data.

Outliers, on the other hand, fall into two main categories: casewise and cellwise outliers.

Casewise Outliers

Casewise outliers occur when an entire row in a dataframe is anomalous. This view assumes that most cases come from a certain model distribution, while a few deviate from it. These outliers are relatively easy to detect with measures such as Cook’s distance or Euclidean distance, and a large body of research on detecting them already exists.

Cellwise Outliers

Cellwise outliers occur when a single cell within a row is anomalous. These are harder to detect and more common in research data. They often result from data entry errors or recording issues. Research into detecting cellwise outliers is relatively recent, with significant advancements over the last decade by researchers like Rousseeuw and Raymaekers. In Part 3, we will explore these methods and introduce a new approach.


Types of Cellwise Outliers

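Cellwise outliers are tricky to detect because a contaminated value can look unremarkable when its variable is inspected on its own. It only stands out when the variables are considered jointly, either in the full multi-dimensional space or in a two-dimensional projection of it. Without actively looking for them, cellwise outliers can go unnoticed and compromise model performance and statistical inference.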
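To make this concrete, here is a minimal Python sketch (not code from this series). It assumes three standardized variables that share one pairwise correlation, set to an illustrative 0.9, so the squared Mahalanobis distance has a simple closed form. The helper `mahalanobis_sq` and both example rows are hypothetical.

```python
def mahalanobis_sq(row, rho):
    """Squared Mahalanobis distance from the origin, assuming zero means,
    unit variances, and the same pairwise correlation rho for all variables."""
    d = len(row)
    sum_sq = sum(v * v for v in row)
    sq_sum = sum(row) ** 2
    return (sum_sq - rho / (1.0 + (d - 1) * rho) * sq_sum) / (1.0 - rho)

RHO = 0.9             # illustrative correlation, an assumption of this sketch
CHI2_3DF_99 = 11.345  # 99th percentile of the chi-square distribution, 3 df

clean_row = [1.5, 1.5, 1.5]   # agrees with the positive correlation
odd_row = [1.5, 1.5, -1.5]    # third cell altered, yet only 1.5 SD from its mean

for name, row in [("clean", clean_row), ("contaminated", odd_row)]:
    d2 = mahalanobis_sq(row, RHO)
    print(name, round(d2, 2), "flagged" if d2 > CHI2_3DF_99 else "not flagged")
```

Every cell in both rows lies within 1.5 standard deviations of its mean, so a column-by-column check sees nothing unusual. Only the joint distance separates the two rows.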

We’ll explore different types of cellwise outliers using a three-dimensional dataset generated from a multivariate normal distribution. The plot below illustrates the dataset’s strong positive correlation among $X_1$, $X_2$, and $X_3$. This correlation makes cellwise outliers easier to detect, even though strongly correlated variables can themselves cause problems for linear models.
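A dataset of this kind can be simulated in a few lines. The sketch below is in Python and uses a shared latent factor to obtain a trivariate normal with unit variances and equal pairwise correlations. The sample size, the correlation of 0.9, the seed, and the function names are assumptions made for illustration, not the settings behind the plot.

```python
import math
import random

def simulate_correlated(n=500, d=3, rho=0.9, seed=1):
    """Draw n rows from a d-variate normal with zero means, unit variances,
    and pairwise correlation rho (all values here are illustrative)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)  # latent factor common to all columns
        rows.append([
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(d)
        ])
    return rows

def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

X = simulate_correlated()
cols = list(zip(*X))  # column-wise view of the rows
for j, k in [(0, 1), (0, 2), (1, 2)]:
    print(f"cor(X{j + 1}, X{k + 1}) = {pearson(cols[j], cols[k]):.2f}")
```

The latent-factor construction avoids a matrix factorization, but it only covers the equal-correlation case. A general covariance matrix would need a Cholesky factor or a library routine.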

Type A: Z-Score

A cell value is much larger or smaller than expected. We define $z$ as the number of standard deviations ($\text{SD}(X_i)$) that $X_{ij}$ deviates from the true value.
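One reading of this definition is that the contaminated cell equals the true value plus $z$ times the standard deviation of its column. The Python sketch below follows that reading. The function name `contaminate_zscore`, the cell position, and $z = 5$ are hypothetical choices, and the toy data are independent normals used only to keep the example short.

```python
import random
import statistics

def contaminate_zscore(rows, i, j, z):
    """Return a copy of rows in which cell (i, j) is shifted by z sample
    standard deviations of column j (a negative z shifts it downwards)."""
    column_sd = statistics.stdev([row[j] for row in rows])
    out = [list(row) for row in rows]  # copy so the clean data stays intact
    out[i][j] = rows[i][j] + z * column_sd
    return out

rng = random.Random(42)
data = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]

dirty = contaminate_zscore(data, i=10, j=2, z=5.0)
sd_x3 = statistics.stdev([row[2] for row in data])
shift_in_sd = (dirty[10][2] - data[10][2]) / sd_x3
print(f"cell (10, 2) moved by {shift_in_sd:.1f} SD")
```

Here the standard deviation is estimated from the uncontaminated column. If it were estimated after contamination, the outliers would inflate it.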

Type B: Mahalanobis Quantile

A cell value is replaced by a draw from $U[\min(X_i), \max(X_i)]$ such that the Mahalanobis distance of the affected row exceeds the 99th percentile of the Mahalanobis distances in the uncontaminated data.
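A simple way to produce such a cell is rejection sampling: draw from the uniform range of the column and keep the first draw that pushes the row past the cutoff. The Python sketch below works under stated assumptions, and it is not necessarily how the simulations in this series were run. It uses the known mean and covariance of an equal-correlation design (correlation 0.9) rather than estimates, and all function names are hypothetical.

```python
import math
import random
import statistics

RHO = 0.9  # assumed pairwise correlation (illustrative)

def mahalanobis_sq(row, rho=RHO):
    """Squared Mahalanobis distance under zero means, unit variances,
    and equal pairwise correlation rho."""
    d = len(row)
    sum_sq = sum(v * v for v in row)
    return (sum_sq - rho / (1.0 + (d - 1) * rho) * sum(row) ** 2) / (1.0 - rho)

def simulate(n, rng, rho=RHO, d=3):
    """Equal-correlation multivariate normal via a shared latent factor."""
    rows = []
    for _ in range(n):
        shared = rng.gauss(0.0, 1.0)
        rows.append([math.sqrt(rho) * shared
                     + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
                     for _ in range(d)])
    return rows

def contaminate_type_b(rows, i, j, rng, max_tries=1000):
    """Redraw cell (i, j) from U[min, max] of column j until the row exceeds
    the 99th percentile of the clean squared distances."""
    cutoff = statistics.quantiles([mahalanobis_sq(r) for r in rows], n=100)[98]
    column = [r[j] for r in rows]
    lo, hi = min(column), max(column)
    for _ in range(max_tries):
        candidate = list(rows[i])
        candidate[j] = rng.uniform(lo, hi)
        if mahalanobis_sq(candidate) > cutoff:
            return candidate, cutoff
    raise RuntimeError("no draw exceeded the cutoff")

rng = random.Random(7)
clean = simulate(500, rng)
new_row, cutoff = contaminate_type_b(clean, i=0, j=2, rng=rng)
print(f"cutoff={cutoff:.2f}  before={mahalanobis_sq(clean[0]):.2f}  "
      f"after={mahalanobis_sq(new_row):.2f}")
```

Because the replacement stays inside the observed range of its column, it is a cellwise outlier that a univariate range check would miss.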

Type C: Cellwise Replacement

A cell value is replaced by a specific value. This is based on the function generateData in the cellWise package (Raymaekers and Rousseeuw, 2022).
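The plain idea of cellwise replacement can be sketched in Python as below: in every column, a fixed fraction of randomly chosen cells is overwritten with one chosen value. This is not a port of generateData and makes no claim about how that function works internally. The names `replace_cells`, `fraction`, and `value`, and the settings 5% and 6.0, are hypothetical.

```python
import random

def replace_cells(rows, fraction, value, seed=0):
    """Return a contaminated copy of rows plus the set of (row, column)
    positions that were overwritten with `value`."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    out = [list(r) for r in rows]  # copy so the clean data stays intact
    replaced = set()
    k = round(fraction * n)        # number of cells replaced per column
    for j in range(d):
        for i in rng.sample(range(n), k):
            out[i][j] = value
            replaced.add((i, j))
    return out, replaced

rng = random.Random(3)
clean = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(100)]

dirty, replaced = replace_cells(clean, fraction=0.05, value=6.0)
rows_hit = {i for i, _ in replaced}
print(f"{len(replaced)} cells replaced across {len(rows_hit)} rows")
```

Since positions are drawn independently per column, a small share of contaminated cells can touch a much larger share of rows. This is one reason casewise methods struggle with this kind of contamination.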

Conclusion

We explored the difference between cellwise and casewise outliers, along with the different types of cellwise outliers and how they appear in real-world data. In Part 3, we will examine how our new algorithm compares to leading methods for detecting cellwise outliers.
