An AI trawled 3.5M books and found fundamental differences in the written language we use to describe men and women.
An unsupervised machine learning study presented at the 2019 meeting of the Association for Computational Linguistics, which examined 3.5M books published between 1900 and 2008, indicates that men are described based on their behavior, while women are described based on their appearance.
Specifically, “beautiful” and “sexy” were two of the adjectives most frequently used to describe women, while common descriptors for men were “brave,” “rational,” and “righteous.” The books, which amounted to approximately 11B words in total, included a mix of fiction and non-fiction.
“We are clearly able to see that the words used for women refer much more to their appearances than the words used to describe men,” said University of Copenhagen computer scientist and assistant professor Isabelle Augenstein in a statement. “Thus, we have been able to confirm a widespread perception, only now at a statistical level.”
The study was conducted by an international team of computer scientists from Google Research (Lawrence Wolf-Sonkin), Johns Hopkins University/University of Cambridge (Ryan Cotterell), the University of Copenhagen (Isabelle Augenstein), the University of Maryland (Alexander Hoyle), and the University of Massachusetts-Amherst/Microsoft Research (Hanna Wallach).
To establish this correlation with greater statistical certainty, the team extracted verbs and adjectives directly associated with gender-specific nouns (e.g., “son”), including immediate combinations (e.g., “sexy stewardess”). They then applied sentiment analysis to determine whether each word was positive, neutral, or negative.
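The article doesn’t reproduce the paper’s actual pipeline, but the extraction step can be sketched with an off-the-shelf dependency parser. The snippet below uses spaCy; the gendered-noun lists and the example sentence are illustrative assumptions, not the study’s lexicon.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative noun lists only; the study used its own gender lexicon.
FEMALE_NOUNS = {"woman", "girl", "daughter", "mother", "stewardess"}
MALE_NOUNS = {"man", "boy", "son", "father"}

def extract_descriptors(text):
    """Collect adjectives and verbs directly attached to gendered nouns."""
    descriptors = {"female": [], "male": []}
    for token in nlp(text):
        if token.lemma_ not in FEMALE_NOUNS | MALE_NOUNS:
            continue
        gender = "female" if token.lemma_ in FEMALE_NOUNS else "male"
        # Adjectival modifiers, e.g. "sexy stewardess"
        for child in token.children:
            if child.dep_ == "amod" and child.pos_ == "ADJ":
                descriptors[gender].append(child.lemma_)
        # Verbs whose grammatical subject is the gendered noun
        if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
            descriptors[gender].append(token.head.lemma_)
    return descriptors

print(extract_descriptors("The sexy stewardess smiled while the brave son argued."))
```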
“What really makes this novel is we’re able to incorporate sentiment,” Hoyle said in a statement. “Words like ‘pregnant’ or ‘bearded’ might be neutral, but others like ‘hysterical,’ ‘shrewish’ or ‘chaste’ for women, are not. Scoring them gives us the ability to make quantitative comparisons in the paper.”
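The article doesn’t name the sentiment resource the authors scored words with; as a stand-in, NLTK’s VADER lexicon shows how individual descriptors can be bucketed into positive, neutral, or negative, using VADER’s conventional ±0.05 compound-score thresholds.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)  # one-time lexicon download
sia = SentimentIntensityAnalyzer()

def polarity_bucket(word: str) -> str:
    """Map a single descriptor to positive/neutral/negative via VADER."""
    compound = sia.polarity_scores(word)["compound"]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"  # words missing from the lexicon also land here

for w in ["beautiful", "pregnant", "bearded", "brave", "hysterical"]:
    print(w, polarity_bucket(w))
```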
The study found that negative verbs associated with appearance are used about five times as frequently for female subjects as for male ones, while positive and neutral body-related adjectives occurred roughly twice as often for women. The positive descriptors applied to men, by contrast, most frequently related to their personalities and behaviors.
The researchers acknowledge that the study did not account for certain factors that could affect the findings, such as genre, the authors of individual passages, and differences between books published at different points within the 1900–2008 range. And while the dataset spans a period that saw massive shifts in the relative parity between men and women, these texts still feed into the datasets used to train the algorithms behind common services, such as the voice and text technology offered by Google, Facebook, Apple, and others.
“The algorithms work to identify patterns, and whenever one is observed, it is perceived that something is ‘true,’” Augenstein said. “If any of these patterns refer to biased language, the result will also be biased. The systems adopt, so to speak, the language that we people use, and thus our gender stereotypes and prejudices.”
This becomes especially complicated when such data forms the basis of decision-making algorithms.
“If the language we use to describe men and women differs, in employee recommendations for example, it will influence who is offered a job when companies use IT systems to sort through job applications,” Augenstein said. “We can try to take this into account when developing machine-learning models by either using less biased text or by forcing models to ignore or counteract bias. All three things are possible.”
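Augenstein’s option of forcing models to counteract bias has several published instantiations; the article doesn’t specify one, but a well-known example is the hard-debiasing approach of Bolukbasi et al. (2016), which removes the gender direction from word embeddings. The sketch below uses tiny random vectors purely for illustration, not real embeddings.

```python
import numpy as np

def debias(vectors, he_vec, she_vec):
    """Hard-debiasing sketch: subtract from each word vector its
    component along the he-she gender direction (Bolukbasi et al., 2016)."""
    g = he_vec - she_vec
    g = g / np.linalg.norm(g)
    return {w: v - np.dot(v, g) * g for w, v in vectors.items()}

# Toy 3-d vectors purely for illustration
rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=3) for w in ["doctor", "nurse", "engineer"]}
he, she = rng.normal(size=3), rng.normal(size=3)

clean = debias(vocab, he, she)
g = (he - she) / np.linalg.norm(he - she)
# After debiasing, every word's projection onto the gender direction is ~0.
print({w: round(float(np.dot(v, g)), 6) for w, v in clean.items()})
```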