Outlier Detection and Removal using Z-score and IQR (Interquartile Range)

Introduction

Outliers are data points that deviate significantly from the rest of the data in a set. They can be caused by a variety of factors, such as data entry errors, measurement errors, or anomalies in the underlying process. Outliers can distort the results of data analysis and make it difficult to identify trends and patterns. Two common methods for outlier detection are the z-score and the interquartile range (IQR).

Z-score method to identify outliers

The z-score measures how many standard deviations a data point lies from the mean of the data set. A data point whose z-score has an absolute value of 3 or more is generally considered to be an outlier. To calculate the z-score for a data point, you can use the following formula:

z = (x - mean) / standard_deviation

where:

x is the data point.
mean is the mean of the data set.
standard_deviation is the standard deviation of the data set.

The code below shows how to detect and remove outliers.
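
A minimal sketch of the z-score approach, assuming NumPy is available. The function name remove_outliers_zscore, the sample values, and the use of the population standard deviation are illustrative assumptions, not details taken from this post.

```python
import numpy as np

def remove_outliers_zscore(values, threshold=3.0):
    """Split values into (kept, outliers) using the absolute z-score.

    Hypothetical helper for illustration; the threshold of 3 follows the
    rule of thumb described in the text.
    """
    data = np.asarray(values, dtype=float)
    mean = data.mean()
    std = data.std()  # population standard deviation (ddof=0), an assumption
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return data, data[:0]
    z_scores = (data - mean) / std
    is_outlier = np.abs(z_scores) >= threshold
    return data[~is_outlier], data[is_outlier]

# Illustrative data: 15 ordinary readings plus one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 14, 13, 12, 11, 12, 13, 10, 11, 1000]
cleaned, outliers = remove_outliers_zscore(sample)
print("Outliers:", outliers)
print("Cleaned: ", cleaned)
```

Note that the sample needs to be reasonably large for this to work: a single extreme value also inflates the mean and standard deviation, so in a very small data set no point can reach a z-score of 3.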


As you can see, the outlier 1000 has been removed from the data set.

IQR method to identify outliers



The interquartile range (IQR) method is another simple and effective way to identify outliers in a data set. It is based on the idea that most of the data points in a set fall within a certain range around the middle of the distribution, and that data points far outside this range are likely to be outliers.

To identify outliers using the IQR method, you first need to calculate the IQR. The IQR is the difference between the 75th percentile and the 25th percentile of the data set.

Once you have calculated the IQR, you can define the lower and upper bounds for non-outlying values:

Lower bound: 25th percentile - 1.5 * IQR
Upper bound: 75th percentile + 1.5 * IQR

Any data point that falls below the lower bound or above the upper bound can be considered an outlier.
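
The steps above can be sketched as follows, assuming NumPy and its default (linear interpolation) percentile method. The function name remove_outliers_iqr, the parameter k, and the sample values are illustrative assumptions.

```python
import numpy as np

def remove_outliers_iqr(values, k=1.5):
    """Split values into (kept, outliers) using the IQR rule.

    Hypothetical helper for illustration; k=1.5 is the multiplier
    used in the bounds described in the text.
    """
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])  # 25th and 75th percentiles
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    is_outlier = (data < lower) | (data > upper)
    return data[~is_outlier], data[is_outlier], (lower, upper)

# Illustrative data: 15 ordinary readings plus one extreme value.
sample = [10, 12, 11, 13, 12, 11, 10, 14, 13, 12, 11, 12, 13, 10, 11, 1000]
cleaned, outliers, bounds = remove_outliers_iqr(sample)
print("Bounds:  ", bounds)
print("Outliers:", outliers)
print("Cleaned: ", cleaned)
```

For this sample the 25th and 75th percentiles are 11 and 13, so the IQR is 2 and the bounds are 8 and 16. The value 1000 falls above the upper bound and is dropped.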

Conclusion

  • The z-score can be used to detect and remove outliers from data sets. This can improve the quality and reliability of data analysis by removing data points that deviate significantly from the rest of the data set. Because it relies on the mean and standard deviation, it works best when the data are roughly normally distributed.
  • The IQR method is a simple and effective way to identify outliers in a data set. It is relatively robust, because the quartiles it relies on are barely affected by extreme values. It also does not assume that the data follow a normal distribution.
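
To illustrate the robustness point, here is a small sketch comparing the two methods on a very short, made-up sample. It assumes NumPy; the data and variable names are hypothetical.

```python
import numpy as np

# Illustrative data: five ordinary readings plus one extreme value.
small = np.array([10, 12, 11, 13, 12, 1000], dtype=float)

# Z-score: the extreme value inflates both the mean and the
# standard deviation, which shrinks its own z-score.
z_scores = (small - small.mean()) / small.std()
flagged_by_z = small[np.abs(z_scores) >= 3]

# IQR: the quartiles come from the middle of the sorted data,
# so the extreme value has little influence on the bounds.
q1, q3 = np.percentile(small, [25, 75])
iqr = q3 - q1
flagged_by_iqr = small[(small < q1 - 1.5 * iqr) | (small > q3 + 1.5 * iqr)]

print("Flagged by z-score:", flagged_by_z)
print("Flagged by IQR:    ", flagged_by_iqr)
```

With only six points, the z-score of 1000 is about 2.24, so the z-score rule misses it, while the IQR rule still flags it.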
I hope you found this blog post helpful. Thank you for reading!










