Seven types of data bias in machine learning
In a perfect world, artificial intelligence (AI) model predictions represent all target users equally; however, machine bias researchers will tell you that’s not necessarily the case. At least not yet. Read on to learn why it’s important to understand what data bias is, the different ways it can seep into your training data and how to avoid it in machine learning projects.
What is data bias?
Data bias in machine learning is a type of error in which certain elements of a dataset are more heavily weighted and/or represented than others. A biased dataset does not accurately represent a model’s use case, resulting in skewed outcomes, low accuracy levels and analytical errors.
Data bias versus data variance
While data bias is a representation or weighting error, data variance describes how much the function a machine learning model learns changes when it is trained on different data. Models with high variance are flexible, but they are prone to overfitting and low predictive accuracy on new data because they fit the particulars of their training data too closely.
The goal during AI model development is to reduce both data bias and data variance as much as possible in order to get more accurate outputs.
Why is understanding data bias important?
Biased algorithms don’t just hurt the accuracy of your model; they also raise issues of ethics, fairness and inclusion. It’s important that training data for machine learning projects reflects the real world, since the machine learns to do its job based on this data. Data bias can occur in a range of areas, from human reporting and selection bias to algorithmic and interpretation bias.
Resolving data bias in artificial intelligence technology means first determining where it is. It’s only after you know where a bias exists that you can take the necessary steps to remedy it, whether it be addressing missing data or improving your annotation processes. With this in mind, it’s extremely important to be vigilant about the scope, quality and handling of your data to avoid bias where possible.
Below, we’ve listed seven of the most common types of data bias in machine learning to help you analyze and understand where it happens, and what you can do about it.
Types of data bias
Though not exhaustive, this list covers common types of data bias in the field, along with examples of where each occurs.
Sample bias: Sample bias occurs when a dataset does not reflect the realities of the environment in which a model will run. An example is certain facial recognition systems trained primarily on images of white men. These models have considerably lower levels of accuracy with women and people of different ethnicities. Another name for this bias is selection bias.
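One rough way to spot sample bias is to compare each group's share of the training data with its share of the population the model is meant to serve. The sketch below assumes each training example carries a group tag; the function name `representation_gaps`, the group labels and the 50/50 target are all hypothetical values chosen for illustration.

```python
from collections import Counter


def representation_gaps(sample_groups, target_shares):
    """Return (share in sample - share in target) for each target group."""
    total = len(sample_groups)
    if total == 0:
        raise ValueError("sample_groups must not be empty")
    counts = Counter(sample_groups)
    return {
        group: counts.get(group, 0) / total - target
        for group, target in target_shares.items()
    }


# Hypothetical data: one demographic tag per training image.
sample_groups = ["men"] * 80 + ["women"] * 20
# Hypothetical target: the population the model is expected to serve.
target_shares = {"men": 0.5, "women": 0.5}

gaps = representation_gaps(sample_groups, target_shares)
for group, gap in gaps.items():
    # A positive gap means the group is over-represented in the sample.
    print(f"{group}: {gap:+.2f}")
```

A large positive or negative gap does not prove the model will be biased, but it is a cheap signal that more data for the under-represented group may be needed.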
Exclusion bias: Exclusion bias is most common at the data preprocessing stage. Most often it’s a case of deleting valuable data thought to be unimportant. However, it can also occur due to the systematic exclusion of certain information. For example, imagine you have a dataset of customer sales in the United States and Canada. Since 98% of the customers are from the United States, you choose to delete the location data, thinking it is irrelevant. However, this means your model will not pick up on the fact that your Canadian customers spend twice as much.
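Before deleting a column, it can help to check whether the target differs across its values. The following is a minimal sketch of that check for the customer example; the records, field names and spend amounts are invented for illustration, and `mean_by_group` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(rows, group_key, value_key):
    """Average value_key for each distinct value of group_key."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[group_key]].append(row[value_key])
    return {group: mean(values) for group, values in grouped.items()}


# Hypothetical records: 98% of customers are in the US, 2% in Canada.
rows = (
    [{"country": "US", "spend": 50.0}] * 98
    + [{"country": "CA", "spend": 100.0}] * 2
)

spend_by_country = mean_by_group(rows, "country", "spend")
ratio = spend_by_country["CA"] / spend_by_country["US"]
print(spend_by_country)
print(f"Canadian customers spend {ratio:.1f}x as much on average")
```

If the averages differ this much, the column carries signal even though one value is rare, so dropping it would hide the difference from the model.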
Measurement bias: This type of bias occurs when the data collected for training differs from that collected in the real world, or when faulty measurements result in data distortion. A good example of this bias occurs in image recognition datasets, where the training data is collected with one type of camera, but the production data is collected with a different camera. Measurement bias can also occur due to inconsistent annotation during the data labeling stage of a project.
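A simple way to look for this kind of mismatch is to compare a summary statistic of the same feature in training and production data. The sketch below expresses the difference in means in units of the training standard deviation. The brightness values are made up, the function name is hypothetical, and this is only one of many possible drift checks.

```python
from statistics import mean, stdev


def standardized_mean_shift(train_values, prod_values):
    """Difference in means, measured in training standard deviations."""
    spread = stdev(train_values)  # needs at least two training values
    if spread == 0:
        raise ValueError("training values have zero spread")
    return (mean(prod_values) - mean(train_values)) / spread


# Hypothetical average image brightness from two different cameras.
train_brightness = [100, 110, 120, 130, 140]
prod_brightness = [150, 160, 170, 180, 190]

shift = standardized_mean_shift(train_brightness, prod_brightness)
print(f"production brightness is shifted by {shift:.2f} training std devs")
```

A shift of several standard deviations suggests the production data was measured differently from the training data and deserves a closer look.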
Recall bias: This is a kind of measurement bias, and is common at the data labeling stage of a project. Recall bias arises when you label similar types of data inconsistently. This results in lower accuracy. For example, let’s say you have a team labeling images of phones as damaged, partially damaged or undamaged. If someone labels one image as damaged, but a similar image as partially damaged, your data will be inconsistent.
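Labeling consistency can be quantified by having two people label the same items and measuring how often they agree beyond what chance would produce. The sketch below computes Cohen's kappa by hand for the phone example; the two label lists are invented, and in practice a library implementation could be used instead.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected if each annotator labeled at random
    # according to their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two annotators for the same six phone images.
annotator_a = ["damaged", "damaged", "partially damaged",
               "undamaged", "undamaged", "partially damaged"]
annotator_b = ["damaged", "partially damaged", "partially damaged",
               "undamaged", "undamaged", "damaged"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(f"kappa = {kappa:.2f}")
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance; low values on a shared batch suggest the labeling guidelines need tightening.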
Observer bias: Also known as confirmation bias, observer bias is the effect of seeing what you expect to see or want to see in data. This can happen when researchers go into a project with subjective thoughts about their study, either conscious or unconscious. We can also see this when labelers let their subjective thoughts control their labeling habits, resulting in inaccurate data.
Racial bias: Though not data bias in the traditional sense, this still warrants mentioning due to its prevalence in AI technology of late. Racial bias occurs when data skews in favor of particular demographics. This can be seen in facial recognition and automatic speech recognition technology that fails to recognize people of color as accurately as it does white people.
Association bias: This bias occurs when the data for a machine learning model reinforces and/or multiplies a cultural bias. Your dataset may have a collection of jobs in which all men are doctors and all women are nurses. This does not mean that women cannot be doctors, and men cannot be nurses. However, as far as your machine learning model is concerned, female doctors and male nurses do not exist. Association bias is best known for creating gender bias.
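One way to surface this kind of association is to cross-tabulate two attributes and list the combinations that never appear together. The sketch below does this for the doctor and nurse example; the records and field names are hypothetical, and `missing_combinations` is an illustrative helper rather than a standard function.

```python
from collections import Counter
from itertools import product


def missing_combinations(records, key_a, key_b):
    """List (value_a, value_b) pairs that never co-occur in the records."""
    seen = Counter((r[key_a], r[key_b]) for r in records)
    values_a = sorted({r[key_a] for r in records})
    values_b = sorted({r[key_b] for r in records})
    return [pair for pair in product(values_a, values_b) if seen[pair] == 0]


# Hypothetical dataset in which gender perfectly predicts occupation.
records = (
    [{"gender": "man", "job": "doctor"}] * 40
    + [{"gender": "woman", "job": "nurse"}] * 40
)

unseen = missing_combinations(records, "gender", "job")
print(unseen)
```

Any pair in the result is a combination the model has never seen, which is a prompt to collect or rebalance data before training.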
How do I avoid data bias in machine learning projects?
The prevention of data bias in machine learning projects is an ongoing process. Though it is sometimes difficult to know when your machine learning algorithm, data or model is biased, there are a number of steps you can take to help prevent bias or catch it early. The list below is far from comprehensive, but it provides an entry-level guide for thinking about data bias in machine learning projects.
- To the best of your ability, research your users in advance. Be aware of your general use-cases and potential outliers.
- Ensure your team of data scientists and data labelers is diverse.
- Where possible, combine inputs from multiple sources to ensure data diversity.
- Create a gold standard for your data labeling. A gold standard is a set of data that reflects the ideal labeled data for your task. It enables you to measure your team’s annotations for accuracy.
- Make clear guidelines for data labeling expectations so data labelers are consistent.
- Use multi-pass annotation for any project where data accuracy may be prone to bias. Examples of this include sentiment analysis, content moderation, and intent recognition.
- Enlist the help of someone with domain expertise to review your collected and/or annotated data. Someone from outside of your team may see biases that your team has overlooked.
- Analyze your data regularly. Keep track of errors and problem areas so you can respond to and resolve them quickly. Carefully analyze data points before making the decision to delete or keep them.
- Make bias testing a part of your development cycle.
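Two of the steps above lend themselves to simple automated checks: scoring annotators against a gold standard, and testing model accuracy separately for each user group. The following is a minimal sketch of both, assuming labels are stored in dictionaries keyed by item ID and that each evaluation example carries a group tag. All names and data here are hypothetical.

```python
from collections import defaultdict


def accuracy_against_gold(annotations, gold):
    """Share of gold-standard items that an annotator labeled correctly."""
    scored = [item for item in gold if item in annotations]
    if not scored:
        raise ValueError("annotator labeled no gold-standard items")
    correct = sum(annotations[item] == gold[item] for item in scored)
    return correct / len(scored)


def accuracy_by_group(examples):
    """Accuracy per group from (group, true_label, predicted_label) tuples."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, truth, predicted in examples:
        totals[group] += 1
        hits[group] += int(truth == predicted)
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical gold standard and one annotator's labels.
gold = {"img1": "damaged", "img2": "undamaged",
        "img3": "partially damaged", "img4": "damaged"}
annotator = {"img1": "damaged", "img2": "undamaged",
             "img3": "damaged", "img4": "damaged"}
annotator_score = accuracy_against_gold(annotator, gold)
print(f"annotator accuracy on gold set: {annotator_score:.2f}")

# Hypothetical model predictions tagged with a user group.
examples = (
    [("group_a", 1, 1)] * 9 + [("group_a", 1, 0)] * 1
    + [("group_b", 1, 1)] * 6 + [("group_b", 1, 0)] * 4
)
group_accuracy = accuracy_by_group(examples)
print(group_accuracy)
```

Running a per-group check like this on every release makes a gap between groups visible early, and the gold-standard score gives a consistent baseline for comparing annotators.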
In closing, it’s important to be aware of the potential biases in machine learning for any data project. By putting the right systems in place early and keeping on top of data collection, labeling and implementation, you can notice bias before it becomes a problem, or respond to it when it appears.