For all of the recent advances in language AI technology, it still struggles with one of the most basic applications. In a new study, scientists tested four of the best AI systems for detecting hate speech and found that all of them struggled in different ways to distinguish toxic and innocuous sentences.
The results are not surprising—creating AI that understands the nuances of natural language is hard. But the way the researchers diagnosed the problem is important. They developed 29 different tests targeting different aspects of hate speech to more precisely pinpoint exactly where each system fails. This makes it easier to understand how to overcome a system’s weaknesses and is already helping one commercial service improve its AI.
The study authors, led by scientists from the University of Oxford and the Alan Turing Institute, interviewed employees across 16 nonprofits who work on online hate. The team used these interviews to create a taxonomy of 18 different types of hate speech, focusing on English and text-based hate speech only, including derogatory speech, slurs, and threatening language. They also identified 11 non-hateful scenarios that commonly trip up AI moderators, including the use of profanity in innocuous statements, slurs that have been reclaimed by the targeted community, and denouncements of hate that quote or reference the original hate speech (known as counter speech).
For each of the 29 different categories, they hand-crafted dozens of examples and used “template” sentences like “I hate [IDENTITY]” or “You are just a [SLUR] to me” to generate the same sets of examples for seven protected groups—identities that are legally protected from discrimination under US law. They open-sourced the final data set called HateCheck, which contains nearly 4,000 total examples.