updated tl;dr: Anthropic's RLHF helpfulness and harmlessness dataset contains racist data points. This is intentional: the dataset includes red-teaming data, and its inclusion is designed to make the model less harmful!
original tl;dr: Anthropic's RLHF helpfulness and harmlessness dataset contains racist and harmful content, and everyone is training on it. Clean your data!
Content warning: this post references explicit and offensive language.
AI chatbots have entered the mainstream: OpenAI deployed ChatGPT, Bing has Sydney, and Anthropic is slowly building up to a release of Claude. All of these organizations claim to care about and work towards a concept known as ‘AI alignment’, or in people-speak, ‘AI shouldn’t do bad things for/to humans’. I am here to argue, with evidence, that these organizations are missing the forest for the trees in their current struggle to align large language models: they are not focusing enough on data quality.
The predominant mechanism used to ‘align’ chatbots is a training technique popularized by OpenAI in 2020 known as RLHF (Reinforcement Learning from Human Feedback). To understand how RLHF works, you first need to know that these chatbots, under their shiny, well-marketed hoods, are just big black boxes that eat text and output a word (technically, a sub-word token) in a recursive loop. These big black boxes are trained on web-scraped copies of the entire internet, and from this training they learn information about the world. Locked away in these big black boxes reside facts like ‘Russia invaded Ukraine in 2022’ and ‘Пу́тін хуйло́’. Without the right key, these big black boxes are useful but tend to do unwanted things like lie and go on racist rampages. The key to accessing this information in a manner that’s ‘aligned’ with human prior beliefs about what's good and bad is RLHF.
RLHF simply means: “have the big black box generate responses to inputs, have humans rank the responses based on a definition of good, and then keep training the big black box, but only on the good responses”.
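To make that loop concrete, here is a minimal sketch of the preference-collection step in Python. This is not Anthropic's actual pipeline; generate_response and ask_human_to_pick are hypothetical stand-ins for a language model sampler and a human annotation interface.

# A minimal sketch of how RLHF preference data is collected, not Anthropic's
# actual pipeline. `generate_response` and `ask_human_to_pick` are hypothetical
# placeholders for a model sampler and an annotation interface.

def collect_preference_pair(prompt, generate_response, ask_human_to_pick):
    # The black box produces two candidate responses to the same prompt.
    response_a = generate_response(prompt)
    response_b = generate_response(prompt)

    # A human annotator ranks them against a definition of 'good'.
    if ask_human_to_pick(prompt, response_a, response_b) == 'a':
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a

    # The pair becomes one data point; training then pushes the model
    # toward 'chosen' and away from 'rejected'.
    return {'chosen': prompt + chosen, 'rejected': prompt + rejected}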
Here resides the key issue: humans labeling responses. The data this produces, while better than blindly eating the internet, stinks. In this post, I’ll prove it to you.
Below you will find an interactive interface (the map) for exploring a human-curated dataset, hh-rlhf, released by Anthropic and used for training large language models with RLHF. A detailed description of how this dataset was collected can be found in Section 2 of Anthropic's paper. In short, each data point (represented as a point on the map) was collected by having a human annotator choose between two responses from a large language model. When you hover over a point, you can see the corresponding 'chosen' and 'rejected' responses. On the map, points that are close to each other have 'chosen' responses that are semantically similar. Zoom around the map to find summaries of the latent clusters making up the dataset.
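If you'd rather inspect the raw data directly, each point can also be pulled down with the Hugging Face datasets library; every row carries a 'chosen' and a 'rejected' field, each holding a full Human/Assistant transcript.

# Peek at a single data point from the dataset shown on the map.
# Requires the `datasets` library (pip install datasets).
from datasets import load_dataset

dataset = load_dataset('Anthropic/hh-rlhf')['train']
example = dataset[0]

# Each point is a pair of full conversations: the one the annotator
# preferred and the one they rejected.
print(example['chosen'])
print(example['rejected'])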
A global view of the dataset now lets us explore the types of interactions used as reward signal to align large language models. Clicking any hyperlinked word below will take you to the exact region of the map that is being referenced.
At first glance, the dataset looks kosher. It's diverse, covering interactions from cooking advice and skin care to politics, crime and relationships. It seems consistent: if you sample a point at random and look at the annotator's decision, you'd probably agree with it.
But on closer inspection, it's easy to find trouble.
Content warning applies below.
In fifteen minutes of exploring I found the following examples of aligned 'chosen' chats. By training on these data points, large language models learn that it's okay to respond to humans in the manner highlighted by the data point.
Clicking on the caption link will take you to the relevant region of the map. Hover over or zoom into the images to enlarge and read their text.
This dataset is publicly released by Anthropic for use in aligning large language models. Not only is it likely being used by Anthropic to train their large language model Claude, it is certainly being used to RLHF other large language models. For example, the following Hugging Face models use this dataset for training: sileod/deberta-v3-base-tasksource-nli, reciprocate/ppo_hh_gpt-j, theblackcat102/pythia-12B-dedup-1000. Further, Open Assistant, a project aiming to build an open-source replication of ChatGPT, has benchmarked several models on this dataset. Even a few wrong labels can provide enough signal during RLHF to make the end model hallucinate and ultimately become non-aligned.
Most importantly, Anthropic publicly presents itself as an organization that is leading the race to build aligned and safe large language models. If they can't get the data quality right and include blatantly racist examples in their training data, what hope can we have that anyone else will?
Explore the map and you will easily find more non-aligned data points.
Sure. That is plausible but not provable. Anyway, why have these bad data points in the first place? Just drop them.
The accepted responses are still not safe and are being used for training. Is it okay for the model to be a little racist as opposed to completely racist?
To my knowledge, this is the only open-source dataset easily accessible for training. This means everyone will be training on it. I would wager that most closed source datasets are also of sub-par quality.
My final words to you (Anthropic, Open AI, Microsoft, Google, Cohere and friends) are short:
LOOK AT YOUR DATA
because we have to look at your outputs.
If you believe that scaling large language models also means scaling our ability to interact with large datasets, come check out Nomic AI. We are building tools that let humans quickly interact with, label, clean and monitor massive, dynamically changing text datasets. You can make the interactive explorer linked above yourself with a few lines of Python.
from nomic import atlas
from datasets import load_dataset

# Load the helpfulness and harmlessness training split from the Hugging Face Hub.
dataset = load_dataset('Anthropic/hh-rlhf')['train']

# Each row is a dict with 'chosen' and 'rejected' conversation transcripts.
documents = [document for document in dataset]

# Build the map, embedding and indexing the 'chosen' responses.
project = atlas.map_text(data=documents,
                         indexed_field='chosen',
                         name='Anthropic RLHF',
                         description="What's in Anthropic's RLHF dataset?")

print(project.maps)
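Running this uploads the documents, embeds the 'chosen' text, and project.maps should point you to the hosted interactive map, which is how the explorer above was built.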
First off: thank you to Andriy for writing this! Critically assessing the outputs of industrial labs is an extremely important part of how we get to better AI systems, and your post is a stellar example of the genre.

On the point itself, I think a succinct summary of Andriy's concern is: the widely-used hh-rlhf dataset for RLHF model training contains a substantial number of 'doubly-egregious' points, ones where both model responses are extremely bad. One of the two responses is still marked as 'chosen', and will therefore lead to more egregious behavior from models trained on this dataset. Andriy's position is that these doubly-egregious points are likely to lead to more racist behavior in the models trained on this dataset. However, our work on red-teaming leads us to believe that these doubly-egregious points are likely to lead to less racist behavior in models that use this dataset.