Top AI Safety Dataset Contains Harmful Data

published February 27, 2023

UPDATE (03/04/2023): After a discussion with Anthropic, it was clarified that the helpfulness and harmlessness subset this post analyzes itself contains red-teaming data! This is a crucial correction as this means the presence of seemingly egregious data points is intended and not by accident. This post was written with the understanding that the helpfulness and harmlessness subset contained no red-teaming data (the intentional inclusion of egregious examples to improve the end model). For transparency, I am leaving the original post un-modified and have appended Anthropic's response in full.

updated tldr; Anthropics RLHF helpfulness and harmlessness dataset contains racist datapoints. This is intentional as red-teaming data exists in this dataset. Its inclusion is designed to make the model less harmful!

original tldr; Anthropics RLHF helpfulness and harmlessness dataset contains racist and harmful content and everyone is training on it. Clean your data!

all comments to $\text{}$

Content warning: this post references explicit and offensive language.

AI chatbots have entered the mainstream: Open AI deployed ChatGPT, Bing has Sydney and Anthropic is slowly building up to a release of Claude. All of these organizations claim to care and work towards a concept known as ‘AI Alignment’, or in people-speak ‘AI shouldn’t do bad things for/to humans’. I am here to argue, with evidence, that these organizations are missing the forest for the trees with their current struggle to align large language models. They are not focusing enough on data quality.

How do you align a large language model?

The predominant mechanism used to ‘align’ chatbots is a training technique popularized by Open AI in 2020 known as RLHF (Reinforcement Learning from Human Feedback). To understand how RLHF works, you first need to know that these chatbots, under their shiny, well-marketed hoods, are just big black boxes that eat text and output a word (technically, a sub-word token) in a recursive loop. These big black boxes are trained over web-scraped copies of the entire internet and from this training they learn information about the world. Locked away in these big black boxes reside facts like ‘russia invaded Ukraine in 2022’ and ‘Пу́тін хуйло́’. Without the right key, these big black boxes are useful but tend to do un-wanted things like lie and go on racist rampages. The key to accessing this information in a manner that’s ‘aligned’ with human prior beliefs on what's good and bad is RLHF.

RLHF simply means: “have the big black box generate responses to inputs, make humans rank the responses based on a definition of good and then keep training the big black boxes but only on good responses”.

Here resides the key issue: humans labeling responses. The data this produces, while better than blindly eating the internet, stinks. In this post, I’ll prove it to you.

Below you will find an interactive interface (the map) for exploring a human-curated dataset, hh-rlhf, released by Anthropic and used for training large language models with RLHF. A detailed description of how this dataset was collected can be found in Section 2 of Anthropic's paper. In short, each data point (represented as a point on the map) was collected by having a human-annotator choose between two responses from a large language model. When you hover over a point, you can see the corresponding 'chosen' and 'rejected' response. On the map, points that are close to each other have 'chosen' responses that are semantically similar. Zoom around the map to find summaries of the latent clusters making up the dataset.

Anthropic hh-RLHF Dataset Map

A global view of the dataset now let's us explore the types of interactions used as reward signal to align large language models. Clicking any hyperlinked word below will take you to the exact region of the map that is being referenced.

First impressions

At first glance, the dataset looks kosher. It's diverse: covering interactions about cooking advice and skin care to politics, crime and relationships. It seems consistent : if you sample a point at random and look at the annotators decision, you'd probably agree with it.

But on closer inspection, it's easy to find trouble.


Content warning applies below.

In fifteen minutes of exploring I found the following examples of aligned 'chosen' chats. By training on these data points, large language models learn that it's okay to respond to humans in the manner highlighted by the data point.

Clicking on the caption link will take you the relevant region of the map. Hover or zoom into images to enlarge and read their text.

Racism exists in chosen chats.
Figure 1: KKK endorsements
Figure 2: Rascism
Figure 3: Rascist Language
Human trafficking
Figure 4: Human Trafficking Tutorial
Figure 5: Where to Traffick
Chosen responses are not always politically neutral
Figure 6: Political bias
Bank robbery tutorials
Figure 7: Bank robbery tutorial

Why do these bad data points matter?

This dataset is publicly released by Anthropic for use in aligning large language models. Not only is it likely being used by Anthropic to train their large langugae model Claude, it is certainly being used to RLHF other large language models. For example, the following huggingface models use this dataset for training: sileod/deberta-v3-base-tasksource-nli, reciprocate/ppo_hh_gpt-j, theblackcat102/pythia-12B-dedup-1000. Further, Open Assistant, a project to perform an open-source replication of ChatGPT, has benchmarked several models on this dataset. Even a few wrong labels can give enough signal during RLHF to make the end model hallucinate and ultimately non-aligned.

Most importantly, Anthropic publically presents itself as an organization that is leading the race to build aligned and safe large language models. If they can't get the data quality right and include blatantly racist examples in their training data, what hope can we have that anyone else will?

Counter arguments

You are cherry-picking and this is anecdotal

Explore the map and you will easily find more non-aligned data points.

A few datapoints with poorly chosen outputs won't change much in the end large language model

Sure. That is plausible but not provable. Anyways, why have these bad data points in the first place - just drop them.

Although both responses are bad, the accepted is marginally better/safer than the rejected so the data point should be in the dataset.

The accepted responses are still not safe and are being used for training. Is it okay for the model to be a little racist as opposed to completely racist?

Not all datasets for RLHF are bad

To my knowledge, this is the only open-source dataset easily accessible for training. This means everyone will be training on it. I would wager that most closed source datasets are also of sub-par quality.

What we should do?
It's easy to get caught up in the excitement of watching training loss curves fall and benchmark metrics rise. Running your GPU's at full utilization is sexy. Cleaning data currently is not. If you care about building AI systems that do what you want them to do, I urge you to spend more time exploring and cleaning your datasets. I guarantee that you'll be surprised by what you find and your models will better for it.

My final words to you (Anthropic, Open AI, Microsoft, Google, Cohere and friends) are short:


because we have to look at your outputs.

If you believe that scaling large language models also means scaling our ability to interact with large datasets, come checkout Nomic AI. We are building tools that let humans quickly interact, label, clean and monitor massive, dynamically changing text datasets. You can make the interactive explorer linked above yourself with a few lines of python.

from nomic import atlas
import numpy as np
from datasets import load_dataset

dataset = load_dataset('Anthropic/hh-rlhf')['train'] #helpfulness and harmlessness subsets
documents = [document for document in dataset]
project = atlas.map_text(data=documents,
                          name='Anthropic RLHF',
                          description="What's in Anthropics RLHF dataset?"

Response from Anthropic (03/04/2023)

First off: thank you to Andriy for writing this! Critically assessing the outputs of industrial labs is an extremely important part of how we get to better AI systems, and your post is a stellar example of the genre. On the point itself, I think a succinct summary of Andriy’s concern is: The widely-used hh-rlhf dataset for RLHF model training contains a substantial number of ‘doubly-egregious’ points, ones where both model responses are extremely bad. One of these points is marked as ‘accepted’ and therefore will lead to more egregious behavior from models trained on this dataset. Andriy’s position is that these doubly-egregious points are likely to lead to more racist behavior in the models trained on this dataset. However, our work on red-teaming leads us to believe that these doubly-egregious points are likely to lead to less racist behavior in models that use this dataset.

Why include egregious examples at all?
The world has a lot of egregious prompts in it, and we’d like the model to do its best to recover from those egregious prompts. Without any sort of egregious examples, we’re worried that when the model is put in an egregious place, it’ll be content to continuously generate egregious output. We actually found that a data-generating process called red-teaming, where we get annotators to ‘jailbreak’ the model into producing as many egregious examples as possible, leads to dramatically less egregious behavior in the final models. This is because if the model is jail-broken into producing an egregious response, we’d like it to try to recover and get back to producing non-egregious output. Without any examples of egregious behavior, the preference model might not know that it should reward a return to non-egregious output, and so won’t train that trait into the behavior model.
Why does red-teaming produce two egregious examples?
Red teaming is a process where we ask annotators to do their best to ‘break’ our models, and get the model to produce the worst output possible - so we can then train away from it. During the red-teaming process we actually asked annotators to continue the conversation with the worse of the two responses, which we then trained away from. This lets us get the model far out of its comfort zone, where the models we used to generate the data habitually produce doubly-egregious responses. By collecting this data we increase the chance that the final model behaves in a robust way when it’s deployed in the wild. We used models with less/no harmlessness training because that’s what we started with: vanilla models with prompts that encouraged them towards - but didn’t effectively enforce - harmlessness. Improving on this lack of harmlessness was the purpose of building this dataset.
Why not always have a good response in the pair?
There are two possibilities for generating a good response. The first is to have a human write a good response. The problem with this is that we have found models to be very keen to learn ‘human detectors’, where if you have a mix of model-generated and human-generated data, the model learns how to detect the human response - from style, from structure, from word choice - rather than learn the trait we actually want it to. The other option is to ‘re-roll’: let the annotator re-generate one of the bad responses until it gets a good one. We think this will plausibly work, but validating that it works is a research project we’ve yet to undertake.
Why don’t egregious examples lead to egregious behavior in RLHF?
Egregious data fed into supervised learning would absolutely lead to more egregious behavior in the models trained on them. RLHF is somewhat different from supervised learning though. The difference between how supervised learning uses its training set and how RLHF uses its training set is that - supervised learning uses the training sets to learn behavior - RLHF uses the training sets to learn preferences, which are then used to learn the behavior in a separate stage. In a supervised learning set up (in contrast to RLHF), the language model is optimized to output more text like the examples that it sees during training. So in a supervised learning set up, the model will imitate the behavior that it sees in the dataset. But critically, in RLHF the behavior model never sees the dataset directly. It will not see either of the responses from the doubly-egregious samples, and so can’t imitate them. Only the preference model sees the egregious examples, and the preference model does not generate any text itself. The effect of the examples Andriy gives is instead solely to generate (as Andriy mentions) a preference towards the less-bad of two bad options.
Can you give hard evidence for the value of doubly-egregious examples?
We think the evidence of their value is implied by the success of red-teaming as a strategy for reducing egregious behavior, and we discuss this in the paper associated with this dataset. But we haven’t specifically explored the importance of doubly-egregious examples. We are keen to assist anyone working on projects like these however, so please reach out if this is you!