The words you use can disclose identifying features.
This tool attempts to determine an author's gender based on the words used.
Submitted text is evaluated based on two types of writing: formal and informal.
Formal writing includes fiction and non-fiction stories, articles, and
news reports.
Informal writing includes blog and chat-room text.
(Email can be formal, informal, or some combination.)
You should view the results based on the appropriate type of writing.
About Gender Guesser
In 2003, a team of researchers from
the Illinois Institute of Technology and Bar-Ilan University in Israel
(Shlomo Argamon, Moshe Koppel, Jonathan Fine, and Anat Rachel Shimoni)
developed a method to estimate gender from word usage.
Their paper
described a Bayesian network where weighted word frequencies and parts of
speech could be used to estimate the gender of an author.
Their approach made a distinction between fiction and non-fiction writing
styles.
A simplified version of this work was implemented as the
Gender Genie (no longer available).
They showed that fewer words were needed and that writing styles varied
based on the forum. For example, fiction and non-fiction differ from blogs
(informal writing). Even though the genres differ, there are still
gender-specific word frequencies.
This Gender Guesser system is heavily based on the Gender Genie.
In particular, the word lists and weights are reproduced from the Gender
Genie. The Gender Guesser extends the interpretation of informal writing
to work on blogs and chat-room messages, and combines formal writing styles
(fiction, non-fiction, essays, news reports, etc.). It also looks for weak
emphasis -- used to distinguish European English from American English.
In general, if the difference between male and female weight values is not
significant (a "weak" score), then the author could be European. This is
because the weight matrix is biased for distinguishing genders in
American English.
(Oh yeah, and Gender Guesser is completely implemented in JavaScript.
View the source to this page to see all of the code.)
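As a rough illustration of the weighted-word idea described above, here is a minimal JavaScript sketch. It is not the Gender Guesser source. The word lists, the weight values, the 60% cutoff for a "Weak" result, and the names tokenize and scoreText are all made up for this example.

```javascript
// Hypothetical word weights for illustration only. These are NOT the
// Gender Genie or Gender Guesser word lists or values.
const FEMALE_WEIGHTS = new Map([
  ["with", 10], ["she", 8], ["her", 8], ["me", 5], ["not", 5],
]);
const MALE_WEIGHTS = new Map([
  ["the", 5], ["a", 5], ["as", 10], ["said", 8], ["it", 5],
]);

// Assumed cutoff: if the leading side holds less than this share of the
// combined weight, the result is reported as "Weak".
const WEAK_THRESHOLD = 60;

// Lowercase the text and split it into runs of letters and apostrophes.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z']+/g) || [];
}

// Sum the weight of every listed word, then compare the two totals.
function scoreText(text) {
  let female = 0;
  let male = 0;
  for (const word of tokenize(text)) {
    female += FEMALE_WEIGHTS.get(word) || 0;
    male += MALE_WEIGHTS.get(word) || 0;
  }

  const total = female + male;
  if (total === 0) {
    // None of the listed words appeared, so there is nothing to compare.
    return { female, male, malePercent: null, verdict: "UNKNOWN" };
  }

  const malePercent = (100 * male) / total;
  // A tie falls to MALE only because of the >= comparison (arbitrary choice).
  const leader = male >= female ? "MALE" : "FEMALE";
  const leaderPercent = Math.max(malePercent, 100 - malePercent);
  const verdict = leaderPercent < WEAK_THRESHOLD ? "Weak " + leader : leader;
  return { female, male, malePercent, verdict };
}

console.log(scoreText("She said it with her").verdict);
console.log(scoreText("it the she").verdict);
```

In this sketch the verdict depends only on the ratio of the two totals, so repeating the same text doubles both totals and leaves the percentage unchanged.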
A few quick notes:
- The system generates a simple estimate (profiling).
While Gender Guesser may be 60% - 70% accurate, it is not 100% accurate.
This is better than random guessing (50%), but should not be interpreted as
"fact".
In particular, men should not be offended if it says they write like a girl, and women should not be offended if it says they write like a boy.
- People write differently in different forums.
For example, a single writing sample may appear MALE for informal writing
but test as FEMALE for formal writing.
Be sure to interpret the results based on the appropriate writing style.
(These notes, for example, are more informal/blog than formal/non-fiction.)
- Many factors can affect the interpretation of any single person's
writing.
The content, knowledge of the material, age of the author, nationality,
experience, occupation, and education level can all impact writing styles.
For example, a woman who has spent 20 years working in a male-dominated
field may write like her co-workers. Similarly, professional female
writers (and experienced hobbyists) frequently use male writing styles.
Gender Guesser does not take any of these factors into account.
- Email can blur the lines between formal and informal writing styles.
An informal email from a manager may have traces of formality, and a
formal email from a 12-year-old is likely to be informal compared to a
letter from a 40-year-old. Do not be surprised if email messages sent to
public forums test incorrectly -- when writing for an audience, people
commonly use informal words, phrases, and slang within a formal writing style.
- Quotations, block quotes, and included text usually carry the gender
from the initial author. Be sure to remove quoted text from any pasted
content. Also, significant changes from a copy-editor can result in a
different gender analysis. (A male editor may make a female author's news
article appear MALE or as a Weak MALE.)
- Lyrics, lists, poems, and prose are special writing styles.
This tool is unlikely to classify these texts correctly.
- The system needs a paragraph or two of text in order to observe word
repetition. A good sample should have 300 words or more. Fewer words can
lead to more variation in accuracy, and a single sentence is unlikely
to generate an accurate result.
Pasting the same text multiple times will not change the results!
- People tend to write with consistent styles. If the system
misclassifies a particular author, then other writings by the same author
will likely be misclassified the same way.
- And most importantly: This is an ESTIMATE.
Please do not email me about instances where it made the wrong determination.
(I've seen it generate incorrect results lots of times already.)
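Two of the notes above (remove quoted text, and use 300 words or more) can be checked before a sample is scored. The sketch below is one way to do that, under assumptions: it treats any line starting with ">" as quoted text, which only covers email and forum style quoting, and the names stripQuotedLines, countWords, and prepareSample are hypothetical, not part of Gender Guesser.

```javascript
// Suggested minimum sample size, taken from the note above.
const MIN_GOOD_SAMPLE = 300;

// Assumption: quoted material is marked with a leading ">" on each line.
// Inline quotations and block quotes in other formats are not detected.
function stripQuotedLines(text) {
  return text
    .split(/\r?\n/)
    .filter((line) => !/^\s*>/.test(line))
    .join("\n");
}

// Count runs of letters and apostrophes as words.
function countWords(text) {
  const words = text.match(/[A-Za-z']+/g);
  return words ? words.length : 0;
}

// Remove quoted lines, then report whether enough of the author's own
// words remain for a reasonable estimate.
function prepareSample(text) {
  const cleaned = stripQuotedLines(text);
  const wordCount = countWords(cleaned);
  return { cleaned, wordCount, longEnough: wordCount >= MIN_GOOD_SAMPLE };
}

const sample = prepareSample("> quoted line here\nmy own words\n> more quote");
console.log(sample.cleaned, sample.wordCount, sample.longEnough);
```

A caller could warn the user when longEnough is false instead of refusing to score, since shorter samples are less reliable rather than unusable.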
Future
Hacker Factor is currently developing a variation of this system that can
estimate an author's age, type of English used (for narrowing down
nationality), and whether the speaker uses English as a second language.
For example, there are distinctions between American English, British English,
Commonwealth, Canadian, Australian, and other dialects. Even within
a major dialect such as American English, there are regionally specific
subdialects. These variations are not limited to spoken words. Many of
these dialect variations show up in the words we use.
This software is provided as open source, but it is not "free software".
This software may not be used for commercial purposes,
may not be redistributed on another web site, and
may not be reproduced in a different medium in whole or in part
without the explicit written permission of the author, Neal Krawetz.
It is provided with no warranty expressed or implied, may not be accurate,
and may not be suitable for any particular task or need.
Use at your own risk.
In jurisdictions where a warranty must be provided, this software cannot
be used.
(In non-legalese: You can look at how it works, but ask before you take.)
Copyright © 2006 Neal Krawetz,
Hacker Factor Solutions.
All rights reserved.