Word salad

From Wikipedia, the free encyclopedia

Word salad[1] is a mixture of seemingly meaningful words that together signify nothing.[2]

Produced by humans

The term originated as the common name for schizophasia, a symptom of various mental illnesses. Schizophasia produces language that is not meaningful and might or might not be grammatical. "Salad" indicates that the words are tossed together randomly.

Word salad may also be a term of scorn.

  • In everyday usage, the term "word salad" may be used to express derision toward the speech or writing of a person or organization.
  • When applied to a physical theory, "word salad" is a derogatory description that labels the theory as senseless or utterly devoid of meaning.[citation needed]

Produced by technology

In computer science and linguistics, explicitly constructed word salad is a tool for demonstrating the difference between random utterance and coherent expression of thought. Software such as Dissociated Press in Emacs constructs interesting but meaningless word salad from large samples of coherent language by generating new, random documents that share some of the word or letter clustering properties of the sample. These word salads pass for natural language to the inattentive eye or ear, but are clearly meaningless when read or listened to with full attention. In the 21st century, spammers have begun using word salad construction to elude e-mail filtering and to attract web page indexing to spam;[3][4] this technique is referred to as Bayesian poisoning.
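The idea of generating text that shares the clustering properties of a sample can be sketched with a word-level Markov chain. This is a related technique chosen for illustration, not the actual algorithm used by Dissociated Press; the function names build_chain and generate and the order parameter are assumptions made for this example.

```python
import random
from collections import defaultdict

def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        key = tuple(words[i:i + order])
        chain[key].append(words[i + order])
    return chain

def generate(chain, length=20, seed=None):
    """Random-walk the chain and return exactly `length` words of word salad."""
    if not chain:
        raise ValueError("sample text is too short for the chosen order")
    rng = random.Random(seed)
    states = sorted(chain)            # sorted so a fixed seed gives a repeatable walk
    state = rng.choice(states)
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)  # .get avoids creating empty entries
        if not followers:
            # Dead end (state only occurs at the end of the sample): restart.
            state = rng.choice(states)
            out.extend(state)
            continue
        nxt = rng.choice(followers)
        out.append(nxt)
        state = state[1:] + (nxt,)
    return " ".join(out[:length])
```

Except where the walk restarts, every short run of words in the output also occurs in the sample, which is why such text can look natural at a glance while carrying no meaning overall.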

Word salad with spam e-mail

In response to the growing problem of spam e-mail, filtering tools that implement the widely used naive Bayes classifier became available starting around 2002; this approach is called Bayesian spam filtering. It uses the probabilities of individual words appearing in spam and in legitimate e-mail to classify incoming messages automatically.
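A minimal sketch of such a classifier follows. It is a toy under stated assumptions, not the implementation of any particular filter: the names train and spam_log_odds are invented for this example, tokens are split on whitespace, add-one smoothing is used, and the training set is assumed to contain at least one spam and one non-spam message.

```python
import math
from collections import Counter

def train(messages):
    """messages: iterable of (text, is_spam). Returns per-class word and message counts."""
    counts = {True: Counter(), False: Counter()}
    totals = {True: 0, False: 0}
    for text, is_spam in messages:
        counts[is_spam].update(text.lower().split())
        totals[is_spam] += 1
    return counts, totals

def spam_log_odds(text, counts, totals):
    """Log odds that `text` is spam; a positive value means spam is more likely."""
    vocab = set(counts[True]) | set(counts[False])
    n_spam = sum(counts[True].values())
    n_ham = sum(counts[False].values())
    # Prior odds from the class frequencies (assumes both classes are non-empty).
    score = math.log(totals[True] / totals[False])
    for word in text.lower().split():
        if word not in vocab:
            continue  # words never seen in training carry no evidence here
        # Add-one smoothing keeps unseen (word, class) pairs from zeroing the product.
        p_spam = (counts[True][word] + 1) / (n_spam + len(vocab))
        p_ham = (counts[False][word] + 1) / (n_ham + len(vocab))
        score += math.log(p_spam / p_ham)
    return score
```

Each word contributes independently to the score, so the classifier takes no account of word order or of whether the text is coherent.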

In response, spammers developed word salad to fool programs employing this method of classification.[5] By adding large amounts of random text somewhere in their message, spammers hope to confuse Bayesian classifiers into classifying the message as "ham e-mail" (non-spam e-mail). This technique is known as Bayesian poisoning, and may consist of random words from a dictionary, random sentences or paragraphs from various text corpora, or words targeted at a specific user.
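The effect of padding on a per-word score can be illustrated with a toy model. The log-odds table below is entirely hypothetical (positive values lean toward spam, negative toward non-spam), and score and poison are names invented for this sketch; a real filter learns such weights from its own training mail.

```python
# Hypothetical per-word log odds, standing in for what a trained filter might hold.
WORD_LOG_ODDS = {
    "cheap": 1.1,
    "pills": 1.1,
    "monday": -1.1,
    "meeting": -0.7,
    "agenda": -0.7,
}

def score(text):
    """Sum the per-word log odds; words unknown to the filter contribute nothing."""
    return sum(WORD_LOG_ODDS.get(word, 0.0) for word in text.lower().split())

def poison(message, padding_words):
    """Append padding words to a message, as a spammer might."""
    return message + " " + " ".join(padding_words)

spam = "cheap pills"
random_padding = poison(spam, ["zebra", "quux"])                   # words the filter has never seen
targeted_padding = poison(spam, ["monday", "meeting", "agenda"])   # words this user's mail favors
```

In this toy model, padding with words the filter has never seen leaves the score unchanged, while padding with words that are common in the recipient's legitimate mail pulls it below zero.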

By using actual text from some large corpus of legitimate English (the plays of Shakespeare, other etexts distributed by Project Gutenberg, random World Wide Web pages, Wikipedia, or the like), spammers attempt to get around algorithms that might detect the more primitive form of word salad.[6]

On its own, Bayesian poisoning by adding random words or paragraphs has generally been found to be ineffective, and may even improve spam filter accuracy, as discussed in the article on Bayesian poisoning. However, in combination with web bugs, it can be highly effective in determining which words help evade a particular user's spam filter.

Word salad for web page spam

Gyöngyi and Garcia-Molina state the problem of web page spam clearly:

"As more and more people rely on the wealth of information available online, increased exposure on the World Wide Web may yield significant financial gains for individuals or organizations. Most frequently, search engines are the entryways to the Web; that is why some people try to mislead search engines, so that their pages would rank high in search results, and thus, capture user attention."[3]

Letter salad

On an even smaller scale than word salad, some spammers misspell words to try to thwart Bayesian filters, for example writing Viagra as Via6ra, \/|/\Gr/\, or any one of a number of other variants (see Leet), or even using characters from international character sets. In the absence of spell-checking, a filter treats such strings as mere nonsense rather than red-flagging them as the words they stand for.
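One way a filter could counter this is to map common substitutions back to letters before scoring a token. The sketch below is an illustration only: the substitution table is a small hypothetical sample, not a complete or standard list, and normalize is a name invented for this example.

```python
# Hypothetical substitution table. Multi-character patterns come first so that
# "\/" and "/\" are consumed before any single-character rule can split them.
SUBSTITUTIONS = [
    ("\\/", "v"),
    ("/\\", "a"),
    ("|", "i"),
    ("6", "g"),
    ("0", "o"),
    ("1", "l"),
    ("3", "e"),
    ("@", "a"),
    ("$", "s"),
]

def normalize(token):
    """Lower-case a token and undo the substitutions listed above, in order."""
    token = token.lower()
    for pattern, letter in SUBSTITUTIONS:
        token = token.replace(pattern, letter)
    return token
```

With this table, both Via6ra and \/|/\Gr/\ normalize to viagra. The mapping is lossy, so it would also rewrite legitimate tokens that contain digits or symbols; a practical filter would probably keep the original token alongside the normalized one.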

Word salad filtering

Naive Bayes classifiers do not distinguish between word salad and actual text, because they consider words in isolation; this is what makes them naive. Algorithms for detecting word salad are clearly possible and not particularly difficult to implement.[citation needed] For the most part, they would be more computationally intensive than most rules used by spam filters as of 2006. A statistical approach based on Zipf's law of word frequency has potential for detecting simple word salad, as do grammar checking and natural language processing.[7] Statistical Markovian analysis, which tests whether short phrases are likely to occur in normal English sentences, is another approach that would be effective against completely random phrasing[7] but might be fooled by Dissociated Press techniques.[citation needed]
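The short-phrase idea can be sketched by measuring what fraction of a message's adjacent word pairs also occur in a reference corpus of normal text. This is a deliberately crude stand-in for the Markovian analysis mentioned above, not the method from the cited paper; the names bigrams and known_bigram_fraction and the toy reference corpus are assumptions made for this example.

```python
def bigrams(text):
    """Return the list of adjacent word pairs in `text`, lower-cased."""
    words = text.lower().split()
    return list(zip(words, words[1:]))

def known_bigram_fraction(message, reference_bigrams):
    """Fraction of the message's word pairs that also occur in the reference set."""
    pairs = bigrams(message)
    if not pairs:
        return 0.0
    known = sum(1 for pair in pairs if pair in reference_bigrams)
    return known / len(pairs)

# Toy reference corpus; a real detector would need a large sample of normal English.
reference = set(bigrams("the cat sat on the mat and the dog sat on the rug"))

coherent = known_bigram_fraction("the dog sat on the mat", reference)
shuffled = known_bigram_fraction("mat the on sat dog the", reference)
```

Here the coherent sentence scores 1.0 and the shuffled one 0.0. Text stitched together from runs of real sentences would also score highly, which is consistent with the caveat about Dissociated Press techniques.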

Notes

  1. ^ Encyclopedia Britannica: word salad
  2. ^ Lavergne 2006:384
  3. ^ a b Gyöngyi 2005
  4. ^ Lavergne 2006
  5. ^ For examples see Lavergne 2006:385, Figure 1.
  6. ^ A New Breed of Spam
  7. ^ a b Lavergne 2006:386

References

Gyöngyi, Zoltán; Garcia-Molina, Hector (2005). "Web spam taxonomy". Proceedings of the First International Workshop on Adversarial Information Retrieval on the Web (AIRWeb 2005), held at the 14th International World Wide Web Conference (WWW 2005), May 10-14, 2005, Nippon Convention Center (Makuhari Messe), Chiba, Japan. New York, N.Y.: ACM Press. ISBN 1-59593-046-9.

Lavergne, Thomas (2006). "Unnatural language detection" (PDF). RJCRI'06: Young Scientists' Conference on Information Retrieval: 383-388 (French?). Retrieved on 2009-03-01.
