Hacker Factor
As demonstrated at:
Black Hat Briefings, USA 2006

Gender Guesser

The words you use can disclose identifying features. This tool attempts to determine an author's gender based on the words used.

Submitted text is evaluated based on two types of writing: formal and informal. Formal writing includes fiction and non-fiction stories, articles, and news reports. Informal writing includes blog and chat-room text. (Email can be formal, informal, or some combination.) You should view the results based on the appropriate type of writing.

Analyze

Type or paste a writing sample for gender analysis. Then click on "Analyze" to see the results. For best performance, use at least 300 words -- more words generally give more accurate results.
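As a rough illustration of the 300-word guideline, here is a minimal sketch of a length check that could run before analysis. The function names and the tokenization rule are assumptions made for this example; they are not taken from the Gender Guesser source.

```javascript
// Hypothetical helpers: names and the word-splitting rule are assumptions,
// not the code used by this page.
const MIN_WORDS = 300; // minimum sample size suggested above

// Count words, treating runs of letters (with optional internal
// apostrophes, as in "it's") as a single word.
function countWords(text) {
  const matches = text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g);
  return matches ? matches.length : 0;
}

// True when the sample meets the suggested minimum length.
function isSampleLongEnough(text) {
  return countWords(text) >= MIN_WORDS;
}
```

A shorter sample can still be scored; a check like this would only flag that the result is less reliable.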

     

Results


About Gender Guesser

In 2003, a team of researchers from the Illinois Institute of Technology and Bar-Ilan University in Israel (Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni) developed a method to estimate gender from word usage. Their paper described a Bayesian network where weighted word frequencies and parts of speech could be used to estimate the gender of an author. Their approach made a distinction between fiction and non-fiction writing styles.

A simplified version of this work was implemented as the Gender Genie (no longer available). The Gender Genie showed that fewer words were needed and that writing styles vary with the forum. For example, fiction and non-fiction differ from blogs (informal writing). Even though the genres differ, there are still gender-specific word frequencies.

This Gender Guesser system is heavily based on the Gender Genie. In particular, the word lists and weights are reproduced from the Gender Genie. The Gender Guesser extends the interpretation of informal writing to work on blogs and chat-room messages, and combines the formal writing styles (fiction, non-fiction, essays, news reports, etc.). It also looks for weak emphasis, which is used to distinguish European English from American English. In general, if the difference between the male and female weight values is not significant (a "weak" score), then the author could be European. This is because the weight matrix is biased toward distinguishing genders in American English.
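The weighted-word idea can be sketched in a few lines of JavaScript. This is an illustrative sketch only, not the code behind this page: the words and weights in the table are placeholders (not the Gender Genie values), the function names are hypothetical, and the 10% margin used to flag a "weak" score is an assumed threshold.

```javascript
// Placeholder weight table: the words and numbers are made up for
// illustration and are NOT the Gender Genie word lists or weights.
const EXAMPLE_WEIGHTS = {
  female: new Map([["with", 50], ["if", 45]]),
  male: new Map([["the", 20], ["a", 10]]),
};

// Split text into lowercase words (assumed tokenization rule).
function tokenize(text) {
  return text.toLowerCase().match(/[a-z]+(?:'[a-z]+)*/g) || [];
}

// Sum the female and male weights of every word in the text.
function scoreText(text, weights) {
  let female = 0;
  let male = 0;
  for (const word of tokenize(text)) {
    female += weights.female.get(word) || 0;
    male += weights.male.get(word) || 0;
  }
  return { female, male };
}

// Compare the two totals. A score is flagged "weak" when the male share
// is within weakMargin of an even 50/50 split (assumed threshold).
function classify(text, weights, weakMargin = 0.1) {
  const { female, male } = scoreText(text, weights);
  const total = female + male;
  if (total === 0) {
    return { female, male, verdict: "unknown", weak: true };
  }
  const maleShare = male / total;
  let verdict = "unknown";
  if (maleShare > 0.5) verdict = "male";
  else if (maleShare < 0.5) verdict = "female";
  return { female, male, verdict, weak: Math.abs(maleShare - 0.5) < weakMargin };
}
```

Under these assumptions, handling formal and informal writing would mean calling classify once per genre with a separate weight table for each, and reading the result that matches the type of writing submitted.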

(Oh yeah, and Gender Guesser is completely implemented in JavaScript. View the source to this page to see all of the code.)

A few quick notes:


Future

Hacker Factor is currently developing a variation of this system that can estimate an author's age, the type of English used (for narrowing down nationality), and whether the author uses English as a second language. For example, there are distinctions between American English, British English, Commonwealth, Canadian, Australian, and other dialects. Even within a major dialect such as American English, there are regionally specific subdialects. These variations are not limited to spoken words. Many of these dialect variations show up in the words we use.

This software is provided as open source, but it is not "free software". This software may not be used for commercial purposes, may not be redistributed on another web site, and may not be reproduced in a different medium in whole or in part without the explicit written permission of the author, Neal Krawetz. It is provided with no warranty expressed or implied, may not be accurate, and may not be suitable for any particular task or need. Use at your own risk. In jurisdictions where a warranty must be provided, this software cannot be used. (In non-legalese: You can look at how it works, but ask before you take.)

Copyright © 2006 Neal Krawetz, Hacker Factor Solutions. All rights reserved.