# Outliers in Machine Learning: How to Find and Remove Them

Outliers can have a significant impact on machine learning models. This article explains how to find and remove them.

## Introduction

Outliers are observations in your data that don’t conform to the rest of the data.

They can be caused by errors in measurement, recording, or data entry. Outliers can also arise from the experimental design, and they are not necessarily bad data.

However, outliers can impact the accuracy of your machine learning models if you don’t identify and deal with them appropriately.

In this article, we’ll show you how to detect outliers in your data, and how to handle them so they don’t ruin your machine learning models!

## What are outliers?

Outliers are unusual values in your data that don’t follow the general trends. These extreme values can have a significant impact on your machine learning models if you don’t take steps to handle them first.

There are two main types of outliers:

- Structural outliers are those that don’t conform to the expected structure of your data. For example, a person’s age might be recorded as 1000 instead of 40 due to a data entry error.

- Conceptual outliers are those that represent valid data points that are just far away from the rest of the data. For example, a house price might be $1 million when all the other prices are around $200,000.
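Structural outliers like the age example can often be caught with a simple plausibility check before any statistics are involved. The sketch below assumes a hypothetical valid range of 0 to 120 years; the bounds and the function name `split_by_plausibility` are illustrative choices, not part of any standard.

```python
# Hypothetical plausibility bounds for a human age column (an assumption).
MIN_AGE, MAX_AGE = 0, 120


def split_by_plausibility(values, low=MIN_AGE, high=MAX_AGE):
    """Return (plausible, implausible) lists, preserving input order."""
    plausible = [v for v in values if low <= v <= high]
    implausible = [v for v in values if not (low <= v <= high)]
    return plausible, implausible


ages = [23, 40, 1000, 35, -2, 61]
ok, suspicious = split_by_plausibility(ages)
print(ok)          # [23, 40, 35, 61]
print(suspicious)  # [1000, -2]
```

A check like this only finds values that are impossible by definition. A conceptual outlier such as the $1 million house would pass any reasonable price range, so it needs the statistical methods discussed later.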

Outliers can impact your machine learning models in a number of ways:

- They can distort predictions from distance-based methods like k-nearest neighbors, which rely on nearby points to make predictions.

- They can contribute to overfitting if you’re not careful. For example, a flexible model may bend to fit a handful of extreme points instead of the underlying trend.

- They can slow down training and prediction for some algorithms. In support vector machines, for example, points that violate the margin become support vectors, which can increase the cost of both.

There’s no definitive way to detect outliers, but there are some common methods you can try:

- Visualize your data using a scatter plot or other type of graph. Outliers will usually stand out from the rest of the points.

- Use statistical measures like standard deviation or interquartile range to determine whether a value is an outlier. Values that are much higher or lower than the rest may be outliers.

- Try fitting different types of models to your data and compare their performance. A large gap between an outlier-sensitive model, such as least-squares regression, and a more robust one suggests that a few extreme points are driving the fit.

## Why do outliers matter in machine learning?

In machine learning, an outlier is an observation that deviates greatly from the rest of the data. Outliers can have a significant impact on model training and can often lead to poorer results. For this reason, it’s important to identify outliers and decide how to handle them before training a model.

There are a few ways to detect outliers in your data. One common method is to compute a z-score for each value, which measures how many standard deviations the value lies from the mean. Values with a large absolute z-score (a threshold of 3 is a common convention) are flagged as potential outliers. This works best when the data is roughly normally distributed.
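As a minimal sketch of z-score flagging, the code below uses only the Python standard library. The threshold of 3, the function names, and the house prices (in thousands of dollars) are illustrative assumptions.

```python
from statistics import mean, stdev


def z_scores(values):
    """Return (value - mean) / sample standard deviation for each value."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (n - 1 denominator)
    return [(v - mu) / sigma for v in values]


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]


# Made-up prices in thousands of dollars; one is far from the rest.
prices = [200, 195, 210, 205, 190, 198, 202, 207, 193, 201, 199, 204, 1000]
print(z_score_outliers(prices))  # [1000]
```

Note that the outlier itself inflates the mean and standard deviation it is compared against. With the sample standard deviation, no z-score can exceed (n − 1)/√n, so in a sample of 10 or fewer points nothing can ever pass a threshold of 3.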

Once you’ve identified outliers in your data, you need to decide how to deal with them. One option is to simply remove them from your dataset. This is a pre-processing step, done before training your model. Another option is to use an algorithm that is less sensitive to outliers, such as a tree-based model or a regression model trained with a robust loss function.

Pre-processing your data by removing outliers can sometimes improve the performance of your machine learning models. However, it’s important to make sure that you don’t remove too many observations, as this can lead to problems such as overfitting or biased estimates.

## How to detect outliers in your data

There are a few ways to detect outliers in your data. One way is to look at the distribution of your data, for example with a histogram or box plot. A long tail or isolated points far from the bulk of the data can indicate outliers, although skew on its own does not prove that they are present. Another way is to use a statistical rule based on the interquartile range (IQR). The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1). Any data point below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR can be considered an outlier.
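The IQR rule can be sketched with the standard library’s `statistics.quantiles`. The multiplier 1.5 comes from the text above; the function names, the sample prices, and the choice of the `"inclusive"` quartile method are assumptions made for illustration.

```python
from statistics import quantiles


def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the values that fall outside the IQR fences."""
    low, high = iqr_fences(values, k)
    return [v for v in values if v < low or v > high]


# Made-up prices in thousands of dollars.
prices = [200, 195, 210, 205, 190, 198, 202, 207, 193, 1000]
print(iqr_fences(prices))    # (179.625, 222.625)
print(iqr_outliers(prices))  # [1000]
```

Different tools compute quartiles with slightly different conventions, so the fences can shift a little between libraries. Unlike the mean and standard deviation, the quartiles barely move when a single extreme value is added, which is why this rule also works on small samples.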

Once you’ve detected outliers in your data, you’ll need to decide whether to remove them or keep them. Sometimes, it can be helpful to keep outliers in your data if they represent real-world data points that you want your machine learning model to be able to predict accurately. However, if the outlier is due to a mistake in data entry or some other error, it’s usually best to remove it from your dataset.

## How to remove outliers from your data

There are a few different ways to identify and remove outliers from your data. One way is to simply look at your data points and see which ones are far from the rest. Another way is to use a statistical rule, such as a z-score threshold or the IQR rule, to find the outliers.

Once you have identified the outliers, you can remove them from your data set. There are a few different ways to do this, such as dropping the data points, imputing the values, or using a robust model.

Dropping the data points is the simplest way to remove outliers, but it can bias your results. If you drop too many data points, you may not have enough information to accurately train your machine learning model.

Imputing the values means replacing the outlier with a more representative value. This can be done by using the mean, median, or mode of the rest of the data set.
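The difference between dropping and imputing can be sketched as follows. The `is_outlier` callback stands in for whatever detection rule you chose; the rule used here (age above 120) and the function names are hypothetical. The median is computed from the non-outlying values only, so the outlier cannot distort its own replacement.

```python
from statistics import median


def drop_outliers(values, is_outlier):
    """Return a new list without the flagged values."""
    return [v for v in values if not is_outlier(v)]


def impute_outliers_with_median(values, is_outlier):
    """Replace each flagged value with the median of the unflagged values."""
    fill = median(v for v in values if not is_outlier(v))
    return [fill if is_outlier(v) else v for v in values]


def too_old(age):
    # Hypothetical detection rule, used only for this example.
    return age > 120


ages = [23, 40, 1000, 35, 61]
print(drop_outliers(ages, too_old))                # [23, 40, 35, 61]
print(impute_outliers_with_median(ages, too_old))  # [23, 40, 37.5, 35, 61]
```

Dropping shortens the dataset, which matters when the row also carries other useful columns. Imputing keeps the row but replaces a real measurement with an estimate.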

Using a robust model is another way to deal with outliers. This means training your machine learning model on all of the data, including the outliers. Some models are more resistant to outliers than others, such as decision trees and models trained with a robust loss function.
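To illustrate what robustness means in practice, the sketch below compares an ordinary least-squares slope with the Theil-Sen slope (the median of all pairwise slopes). Theil-Sen is not mentioned in this article; it is used here only because it is a simple robust estimator that fits in a few lines. The data is synthetic: ten points on the line y = 2x + 1 with one value corrupted.

```python
from itertools import combinations
from statistics import mean, median


def ols_slope(xs, ys):
    """Slope of the ordinary least-squares line through (xs, ys)."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den


def theil_sen_slope(xs, ys):
    """Median of the slopes between every pair of points with distinct x."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1
    ]
    return median(slopes)


xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[9] = 100  # corrupt one point; the true value would be 19

print(theil_sen_slope(xs, ys))      # 2.0
print(round(ols_slope(xs, ys), 2))  # 6.42
```

The single corrupted point more than triples the least-squares slope, while the median-based estimate still recovers the true slope of 2.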

## Should you always remove outliers?

Outliers are unusual values that lie outside the typical range. In statistics, outliers are defined as observations that lie an abnormal distance from the rest of the data. Outliers can occur in both dependent and independent variables, but they are often studied in the context of dependent variables.

There are two main reasons why you might find outliers in your data:

- They could be legitimate observations that warrant further investigation. For example, an outlier could be a sign of a new market opportunity.

- They could be artificial, generated by errors in your data collection or analysis process. For example, an outlier could be the result of a measurement error, a data entry error, or a data point that was miscoded.

If you find an outlier in your data, you should first determine whether it is valid or not. If it is valid, you can keep it in your dataset. If it is not valid, you should remove it from your dataset. However, you should exercise caution when removing outliers, as this can sometimes lead to biased estimates and misleading conclusions.

## Dealing with outliers: Summary and next steps

Outliers can have a significant impact on machine learning models. They can distort what a model learns from the training data and lead to poorer performance. In this article, we discussed how to identify outliers in data and how to deal with them.

There are a few different ways to deal with outliers:

- Remove them from the data: This is usually the simplest approach, but it can also be the most destructive. If you remove too many points, you may discard valuable information that could have improved your model.

- Keep them in the data and use a robust model: This approach is less likely to destroy information, but it may not always be possible to find a robust enough model.

- Transform the data: This approach can be used when you want to keep the outliers in the data, but you don’t want them to have as much influence on the model. Options include a log transform, clipping extreme values to a chosen range, and binning. Note that plain normalization or standardization only rescales the data, so the outliers stay just as extreme relative to the other points.

If you’re dealing with outliers in your data, it’s important to choose the right approach for dealing with them. In some cases, removing them may be the best option. In others, it may be better to keep them in and use a robust model. And in some cases, it may be best to transform the data so that the outliers have less influence on the model.

## Further Reading

If you want to learn more about outliers and how to deal with them in machine learning, there are a few resources we recommend:

- Outliers in Machine Learning by Jason Brownlee

- Identifying Outliers in Data by Michael Grove

- Dealing with Outliers by Paul Chekaluk

My name is Jason Brownlee and I am a machine learning specialist based in Brisbane, Australia. I have a PhD in Machine Learning and I have been working full-time in the industry since 2007. I am the author of several books on machine learning, including the best-selling “Mastering Machine Learning” and “Python Machine Learning.”

I have also created over 30 courses on machine learning and data science, which have been taken by over 250,000 students. My goal is to make machine learning accessible to as many people as possible.