Talk:Mojibake/Archive 1

This is an archive of past discussions. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.

French, and even English

I've seen this happen a lot with French; it gets turned into something like, for example, mÄ?'lange, for mélange.

And yes, even in English; sometimes the apostrophe goes ape. Since I can't back it up, though (it might have been an individual computer problem unrelated to encoding), I figured I'd mention it here first. —Preceding unsigned comment added by 217.235.74.134 (talk) 16:33, 2 December 2007 (UTC)

I agree with the English comment. The following characters have a very high level of mojibake in English:

… ’ ” “

Other punctuation and special characters used in English that are not included in the original ASCII set have the same problem. Peaceoutside (talk) 22:11, 10 June 2008 (UTC)

Indeed, the article mentions "æ" as a problem for a Nordic language but neglects to mention that it and "œ" are also going to be a problem for English. TristanDC (talk) 16:32, 29 June 2008 (UTC)
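
For illustration, the punctuation failures described in this thread can be reproduced with a short Python sketch. It assumes the most common cause, UTF-8 bytes being read as Windows-1252; other encoding pairs give different garbage. The helper name `garble` is hypothetical and not taken from the article.

```python
def garble(text: str) -> str:
    """Encode as UTF-8, then misread the bytes as Windows-1252 (assumed pair)."""
    # errors="replace" is needed because a few bytes (e.g. 0x9D, which occurs
    # in the UTF-8 form of the right double quote) are undefined in cp1252.
    return text.encode("utf-8").decode("cp1252", errors="replace")


# Typographic punctuation and an accented French word.
for sample in ["’", "“", "…", "mélange"]:
    print(repr(sample), "->", repr(garble(sample)))

# A plain ASCII apostrophe is encoded identically in both encodings.
print(repr("it's"), "->", repr(garble("it's")))
```

The plain ASCII apostrophe passes through unchanged, which fits the observation above that only characters outside the original ASCII set are affected.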

Meaning of "Mojibake"

In this context "bake" means "changed", not "ghost", and mojibake means "character change". So it happens not only in an international context but also between different character sets for the same language, for instance between euc-jp and shift_jis. --a japanese

Thanks for pointing that out. Feel free to be bold and correct! --Menchi 18:18, 3 Sep 2003 (UTC)

Sorry for dredging up, but it seems to be less accurate to say "bake" means simply "change". It roughly corresponds to "mutate", "deform", "disguise" or something like that. That's why the same word is able to have meanings such as "ghost" or "monster" etc. 125.197.200.77 (talk) 12:02, 3 September 2011 (UTC)
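
As a rough sketch of the same-language case raised in this thread, the Python snippet below encodes a Japanese string as EUC-JP and then misreads the bytes as Shift_JIS. The sample string and variable names are chosen here for illustration only.

```python
original = "文字化け"

# Bytes produced by one Japanese encoding...
euc_bytes = original.encode("euc_jp")

# ...misread as another Japanese encoding. errors="replace" keeps the
# sketch from raising if a byte sequence is invalid in Shift_JIS.
misread = euc_bytes.decode("shift_jis", errors="replace")

print(original, "->", misread)

# Decoding with the encoding that was actually used restores the text.
print(euc_bytes.decode("euc_jp") == original)
```

No foreign language is involved at any point: both encodings are Japanese, yet the text still comes out garbled.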

The meta of this page contains the tag: <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>. Should this be changed to utf-8? Would that help with the display of characters? Onco p53 11:36, 7 Nov 2004 (UTC)

Changing that wouldn't help with the display of characters because there aren't any character display problems (or at least mojibake problems) at the English Wikipedia site. On the contrary, because English Wikipedia articles are encoded in ISO-8859-1, not UTF-8, switching the meta like that would CAUSE mojibake problems.--69.214.232.105 06:27, 21 Apr 2005 (UTC)

It's worth noting that since this discussion the English Wikipedia has converted (wisely, IMHO) to UTF-8; the meta tag now reads: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> --babbage 19:28, 28 September 2006 (UTC)
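
The point made in the 2005 reply, that changing only the declared charset would cause mojibake rather than cure it, can be sketched in Python as follows. The sample word is an assumption chosen for illustration; the sketch only models a decoder that trusts the declaration.

```python
# How the page is actually stored on the server (assumed sample content).
page_bytes = "café".encode("iso-8859-1")

# Declared charset matches the stored bytes: the text is intact.
right = page_bytes.decode("iso-8859-1")

# Declared charset says UTF-8 but the bytes are still ISO-8859-1: the lone
# 0xE9 byte is not valid UTF-8, so it becomes U+FFFD (replacement character).
wrong = page_bytes.decode("utf-8", errors="replace")

print(repr(right))
print(repr(wrong))
```

The declaration and the stored bytes have to change together, which is what the later conversion to UTF-8 did.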

Speaking of which, one nitpick: "An improperly configured or poorly written web browser may not distinguish a page coded in EUC-JP and another in Shift-JIS if the coding scheme is not assigned explicitly" has a POV that web browsers should be expected to guess at the character set of a page that does not have the "proper" language encoding tags (either header or meta tags). I think this is incorrect. A better perspective is that the pages are lacking the proper encoding hint; the browser isn't doing anything wrong by defaulting to the language of the user. - dlm 17.184.102.113 16:51, 8 June 2007 (UTC)

I agree. How should we rephrase? — The Storm Surfer 18:00, 17 June 2007 (UTC)

I tried. --Alvestrand 19:03, 17 June 2007 (UTC)

Your edit looks good to me! =) dlm 17.184.102.113 15:57, 19 June 2007 (UTC)

"Problems in other languages" section

I think there's a misunderstanding here. The word mojibake is a part of the English language, that happens to be a loanword from the Japanese language, and the term doesn't specify the language which is suffering from encoding issues. So there are no "other languages" because "the language" is not specified for mojibake in the first place. Otherwise this page would be a simple redirect, since an article on its own would be without merit. --69.214.232.105 06:27, 21 Apr 2005 (UTC)

But the Japanese mojibake was coined in response to the problems of the Japanese (and all Asians, by extension). This is why "problems in other languages" begins "this problem is not unique to the Asian users..." after having just described the problem for Japanese and (off-handedly) Chinese. It has nothing to do with English. To eliminate this section, you need to rewrite the intro to be less specifically targeted to Japanese/Asian languages. In short, this article does specify "the language" in the first place, and it happens to be Japanese. JRM 22:33, 2005 Apr 24 (UTC)

Actually, to be perfectly fair to the other anon above, it doesn't "specify". If it "specified" the language of "Japanese" as what this term is used to describe display errors in, that would mean it would completely rule out the use for any language but Japanese - which is not even implied in the article. Technically, there is not even anything in the article that suggests that it can't be used - technically - to also describe problems with, say, Cyrillic, or with special variation characters like those found in French, Spanish, German or Romanian. The origins and original usage of this term (East Asian), are obvious and noted, partly through the discussion of the problem's history and partly through the simple act of labeling it a "loan word from Japanese" -- but there is actually no clarification whatsoever for the casual reader as to whether or not this applies ONLY to East Asian character display errors, or whether it, in English, applies to any character display error of this nature. This is actually kind of a problem for people like me, who have never heard this term before. It should be clarified considerably - preferably with references, if possible, to make sure it's not a small mistake on the part of one editor or another. We've already got two different users who have completely different interpretations of what the word describes, after all... and for all we know, either one could be right!

Additionally, there's also the simple fact that this section does not, in fact, read like it is describing "problems in other languages" as you imply; it sounds like it is merely listing other languages' names for the same basic concept, and thus sounds like it should be titled "In other languages" rather than "problems in...". 4.238.12.41 03:00, 3 November 2007 (UTC)

May we have a literal translation of krokozyabry please? Salleman 13:06, 1 Jun 2005 (UTC)

Mojibake vs. inappropriate choice of encoding

I removed the German example (characters stripped of diacritics), because I think it's not an instance of mojibake. In this case the text renders perfectly well according to its encoding—it's just that the encoding used is wrong, because it cannot contain all characters you want to use. What you typically get in such cases is not mojibake but a converted text consisting completely of "?????", boxes, or similar placeholders for unrepresentable characters—German is "lucky" that most characters can be converted with relatively little loss in the text overall.

What makes this different from mojibake is that a text in mojibake could be displayed correctly, if only you used the correct encoding and had the correct fonts. There are many problems that can arise with text handling on computers, but I don't think everything that can go wrong is mojibake.

The "problems in other languages" section does talk about mail servers stripping high bits and Outlook apparently ignoring encodings in some cases—these involuntary "conversions" could be considered mojibake, but you'd need to make clear the problems occur because programs are actively messing up the existing text, not just applying a wrong encoding when displaying. I don't think saving a German text in ASCII falls in the same category. JRM · Talk 13:23, 14 August 2005 (UTC)
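
The distinction drawn in this comment, recoverable mojibake versus a lossy conversion, can be sketched in Python. The sample word and the UTF-8 / ISO-8859-1 pair are assumptions picked for illustration.

```python
text = "Grüße"

# Mojibake: the bytes are intact, only the decoder is wrong.
mojibake = text.encode("utf-8").decode("iso-8859-1")

# Undo the wrong decoding and apply the right one: nothing was lost.
restored = mojibake.encode("iso-8859-1").decode("utf-8")

# Lossy conversion: ASCII cannot represent ü or ß, so each is replaced by
# "?" and the original characters cannot be recovered from the result.
lossy = text.encode("ascii", errors="replace").decode("ascii")

print(repr(mojibake))
print(restored == text)
print(repr(lossy))
```

In the first case choosing the correct encoding brings the text back; in the second no choice of encoding can, which is the argument for not calling it mojibake.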

Pronunciation

Is there evidence that it's actually pronounced as the Japanese word? How is it pronounced in the English-speaking world by English speakers? I'm sure someone's gone to a conference and used the word, and I'd put a decent bet that English speakers pronounce the last half like bake. --Prosfilaes 02:10, 4 November 2005 (UTC)

Possible, but not likelier than the possibility that they pronounce it like the original. The computer and language-savvy who likely borrowed this term first also know enough to pronounce it properly. If you read about this term, you are likely to catch either its pronunciation or the fact that it's Japanese. If you hear about it, I'd put a decent bet that you hear it from someone who pronounces it mo-jee-ba-kay, which, while still not Japanese, is not as perverted as "modgy-bake".

Of course, that is, if it's pronounced at all to any significant degree to settle on a pronunciation. It'll be some time before any dictionary lists it. JRM · Talk 18:27, 3 January 2006 (UTC)

It's not "properly". Nativized English words "should" be pronounced in English ways. I've read this term for years, but never seen a pronounciation. Maybe the otaku pronounce it as per the Japanese, but I would be surprised if most readers used anything but an English reading. Likewise "anime", not "ah-nime", is the pronounciation that has taken hold, even for something that is Japanese specific.--Prosfilaes 18:48, 3 January 2006 (UTC)

'Nativized English words "should" be pronounced in English ways.' The thing here is that "English ways" are rather fickle when it comes to pronunciation. (And spelling: why "pronunciation"? It makes no sense.) There's also the thing about us just not knowing. You're free to go around and spread the proper English way of pronouncing it, of course, and I'd be the last to insist on a pronunciation guide. JRM · Talk 23:10, 3 January 2006 (UTC)

In the case of anime, that loanword came from English, so I think the common pronunciation is appropriate. (Some people claim that the loanword came from French, but there is no such evidence, and anyway "animate" is Latin/Middle English in origin.) Of course loanwords are pronounced in the borrowing language as adapted to the sound system. The Japanese language butchers more than a few English words. Likewise, most Americans simply can't pronounce "karaoke" (which is mutilated to carry-okie), "futon" (becoming foo-tahn), or "tsunami" (becoming soo-nah-me). In fact, to correctly pronounce those words when talking to an English-speaker guarantees non-comprehension. Similarly, Hepburn romanization doesn't mimic the Japanese writing system, but is geared toward English-speakers, and that becomes the standard. Neither outcome (being too native or too non-native) is particularly desirable, but in the interests of propagating the term, the "bastardized" form is probably most appropriate. GMW 16:34, 8 February 2006 (UTC)

I'm American, and without looking it up, my pronunciation would be mo-jee-BAAK-ee. --Wulf 05:24, 25 March 2006 (UTC)

I've worked on international software quite a bit, and by far the most common pronunciation is MOE-jee-BAH-kee, with a bit more stress on the BAH than the MOE, or minor variations (BAHK-ee, BAA-gee). Even Japanese developers, designers, translators, managers, etc., when speaking English, usually pronounce it that way (if they're fluent in English and experienced in i18n, at least). --Falcotron 20:25, 24 September 2006 (UTC)

I for one, knowing mojibake has Japanese origin, would pronounce it MOE-jee-BAH-kee, but would expect it to morph into a verb, e.g., "My browser has a habit of mojibaking the Xinhua homepage." -- Jevanyn _talk 18:47, 11 November 2009 (UTC)

"'Nativized English words "should" be pronounced in English ways.' The thing here is that "English ways" are rather fickle when it comes to pronunciation. (And spelling: why "pronunciation"? It makes no sense." - Um, yes it does. I know a lot of things in English spelling seem stupid or odd thanks to years of borrowing romanizations and words from other sources (Souix Indians, anyone? A silent 'x' means that HAD to have been borrowed from French romanizations of the natives' term) and almost as many years of not particularly regulating itself (try reading Chaucer, ouch!)... but take it from somebody who had this RELENTLESSLY (albeit thankfully metaphorically!) beaten into her head as a child: that conjugation of the verb "to pronounce" is NOT supposed to be pronounced "pro-noun-cee-ay-shun". It is - as the British know quite well, I'm sure - supposed to be pronounced "pro-nun-cee-ay-shun". I know it doesn't seem intuitive from many an American dialectical accent standpoint, and no, ironically not every American even pronounces it correctly all of the time (understandably), but that's the official pronunciation, and it's not odd or unintuitive at all once you realize that that is exactly how the English (who cobbled together the language originally, let's not forget) STILL pronounce the word... well, if they're educated in the standard (Queen's English) they do, at any rate. Their particular accent makes it actually feel quite a bit more natural than it does in American accents, but there you go. Nobody ever said everything had to change between the dialects, even when the change would make sense. :P 4.238.12.41 03:54, 3 November 2007 (UTC)

That is utterly bizarre. There is no authority on how English is pronounced. No dialect is more or less correct than any other. Some dialects are used by rich people, true, but that just means rich people use them, not that the poor are wrong for speaking differently. And, finally, the British have absolutely no say in how anyone else pronounces anything. It is everyone's language. —Preceding unsigned comment added by 216.220.11.84 (talk) 17:11, 3 June 2008 (UTC)

Added IPA! Can I ask the users above to stay on topic please... this discussion is long to read as it is. Now, I've added IPA as /ˌmɔdʒi'bake/, which I think is a fair representation of the "common, Anglicised" pronunciation of the word. My reason for adding this is that I think many people will be tempted to pronounce it as "MODGY-bake" if they haven't come across the word before. So the IPA I suggest here is NOT meant to be the "correct, Japanese" way of pronouncing it, but rather a guideline as to how to pronounce the word in English. I understand there may be variants ([ˌmɔdʒi'bake], [ˌmɔdʒi'bakeɪ], [ˌmoʊdʒi'bake], [ˌmɔdʒi'bakə]...), also depending on how loyal one wants to be to the original Japanese, but I think mine is a fair "middle-of-the-road" version that is well pronounceable to non-Japanese speakers, while not moving too far away from the Japanese original - feedback welcome. By the way, if there are any Japanese-speakers out there, you are very welcome to add IPA for a "proper Japanese" pronunciation separately, e.g. "Mojibake, pronounced as /.../ in English, /.../ in Japanese". //Edit: just saw it's already there further down the article, sorry! --HAdG (talk) 19:37, 13 December 2008 (UTC)

Krzaki

I think there may have been an editing problem. There's nothing explaining why users tend to refer to Polish diacritical characters as krzaki (bushes). I sense there's some joke here, but I haven't enough information to understand what it is. I imagine the explanation was here at one time, but perhaps was inadvertently deleted. Could someone with a knowledge of current Polish language/popular culture drop an explanation in here? —CKA3KA (Skazka) 20:14, 11 February 2006 (UTC)

How do I "kill" mojibake?

I'm using Firefox 1.5 on Windows 2000 in the US, and anything in Japanese is displayed as a bunch of question marks... (e.g. PlayStation 3 looks like ?????????3) Is there a font or something I can download to stop this?

Please drop a note on my talk page if you reply, as I might forget to check back here.

Thanks,
Wulf 05:32, 25 March 2006 (UTC)

I would try going to that mojibake page using Internet Explorer. I believe that Internet Explorer will inform you of the problem and ask if you would like to download the proper language set for it. I believe that once it's installed it should work in Firefox as well as the rest of your applications. I'm not too sure though, as I don't run Windows very often. Good luck. Jecowa 17:42, 29 August 2006 (UTC)

I know I'm late but: First update your Firefox. Second, force the encoding to UTF-8 using View→Character Encoding→UTF-8. Third, make sure you have got a font with Japanese characters (Arial Unicode MS for example, AFAIK). --Ysangkok 12:58, 10 July 2007 (UTC)

Merge from Garbage characters

Support merge, same thing, only needs one article. However, since there are so many words for it in different languages (mojibake, krokozyabry, krzaki...) and no reason to prefer one over the others, the permanent location should be an English phrase like Garbage characters. —Keenan Pepper 14:45, 31 August 2006 (UTC)
Support I support a merger. Wikipedia:Naming conventions (common names) says to use the most common form of the name and suggests a google search to assist in determining which is more common. I searched Google and came up with 120,000 hits for "mojibake" and 117,000 hits for "garbage characters". It seems to me that the article title should be "mojibake" with "garbage characters" mentioned along with it in bold in the first sentence as it was slightly more common in a Google search, and since mojibake is already the more extensive article of the two. Jecowa 15:17, 31 August 2006 (UTC)
OTOH, restricting Google to pages in English gives 110,000 for "garbage characters" and 53,500 for "mojibake". -- JHunterJ 16:03, 31 August 2006 (UTC)
Support merge to Garbage characters name, per Keenan Pepper -- JHunterJ 16:03, 31 August 2006 (UTC)
Comment: Since Mojibake has a much longer history than Garbage characters, I'll merge Garbage characters into Mojibake, delete Garbage characters, and then move Mojibake to Garbage characters (or whatever name we decide) to preserve its history. Any objections? —Keenan Pepper 16:57, 31 August 2006 (UTC)
Support Maybe merge them into Mojibake and do a re-direct thingy-ma-bob from Garbage characters? Mindofzoo999 01:27, 12 September 2006 (UTC)
Comment. If the pages are merged, I think mojibake should be the name (with garbage characters as a redirect), because it's the closest thing to a standard industry term that there is, and because it's more specific. "Garbage characters" includes other things besides mojibake, such as line noise (as mentioned in the article), and maybe even cat-on-the-keyboard text. --Falcotron 20:38, 24 September 2006 (UTC)
Oppose Mojibake is a specific form of garbage characters. Mojibake is used in English, unlike krokozyabry and krzaki. Garbage characters also includes line noise and I'm sure people can come up with other examples. And frankly, there's nothing mergeable from garbage characters.--Prosfilaes 22:09, 24 September 2006 (UTC)

Bulgarian

The sentence on Bulgarian doesn't seem to make sense, and uses the non-word "phonounciation." I was going to fix it, but then I realized I didn't know what it was trying to say:

In Bulgarian, mojibake is often called maymunitsa (маймуница), meaning monkey's alphabet, named similarly to the phonounciation of kirilitsa (кирилица) or Cyrillic.

Does this mean maymunitsa was named in the same way as kirilitsa? Or that it's named because it's pronounced similar to kirilitsa (which seems unlikely)? Or that it's named as an attempt to read Cyrillic spelling as Latin? Or... I can't figure it out. Any ideas? --Falcotron 20:32, 24 September 2006 (UTC)

Simple it is not! The most widespread term amongst Bulgarian netizens is Shliokavitza. It's a play on the way Cyrillic is spelled in Bulgarian and on the early years when people substituted numbers for some unique Bulgarian letters, namely 6 for Sh and 4 for Ch. Check the articles Romanization_of_Bulgarian and Translit.

Please remove "maymunitsa"; whoever told you that was either trolling or did not know any better. --Irongrip (talk) 20:30, 4 June 2008 (UTC)

Cyrillic KOI8-R

Unlike the other encodings, KOI8-R was not "rendered unreadable" by the 8th-bit stripping (being specifically designed for that purpose with letters in non-alphabetic order); such stripping turned the message into transliterated Russian instead.

Roughly speaking. If you were to transliterate Russian, you would not come out with KOI8-R stripped; you would either use diacritics or multiple letters. There is no good single letter diacritic-less transliteration in Latin for ц (ts), or ш (sh), or ф (ph), and not enough useful letters in the Latin alphabet. So я becomes q and ш becomes {.--Prosfilaes 12:15, 27 October 2006 (UTC)
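
The effect under discussion can be checked with a small Python sketch that encodes text as KOI8-R and clears the top bit of every byte, as a 7-bit mail path would. The function name `strip_high_bit` is hypothetical.

```python
def strip_high_bit(text: str) -> str:
    """Encode as KOI8-R, clear bit 8 of every byte, read the result as ASCII."""
    stripped = bytes(b & 0x7F for b in text.encode("koi8_r"))
    return stripped.decode("ascii")


# A greeting, the two capitals mentioned above, and two awkward letters.
for sample in ["Привет", "Я", "Ш", "ц", "ф"]:
    print(sample, "->", strip_high_bit(sample))
```

Case comes out inverted (lowercase Cyrillic lands on uppercase Latin and vice versa), so the q and { mappings mentioned above correspond to the capital letters Я and Ш.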

Further elaboration

In the case of Cyrillic (Russian, Ukrainian, Serbian, Bulgarian etc.) it would be really helpful to have a matrix of visual examples of how things look when you have different combinations of original encoding vs. viewing. For example how does original CP-866 look when you try to read it under CP-1251, etc. etc. Obviously this matrix would need to be presented in some font-independent graphics bitmap.

More generally, how do you efficiently go from seeing various types of "mojibake" to a correct representation of what someone wrote? —The preceding unsigned comment was added by 76.168.216.138 (talk) 16:15, 27 January 2007 (UTC).
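
One possible way to generate such a matrix, and to work backwards from garbled text, is sketched below in Python. The list of encodings and the names `view_as`, `matrix` and `candidates` are assumptions for illustration; the sketch returns every text that round-trips without error and leaves the ranking of candidates (for example by letter frequency) to the reader.

```python
from itertools import product

# Cyrillic encodings to compare (an assumed, not exhaustive, list).
ENCODINGS = ["cp866", "cp1251", "koi8_r", "iso8859_5"]


def view_as(text: str, written: str, viewed: str) -> str:
    """Text written in one encoding as it appears when viewed in another."""
    return text.encode(written).decode(viewed, errors="replace")


def matrix(text: str) -> dict:
    """Map every (written, viewed) pair to the resulting display string."""
    return {(w, v): view_as(text, w, v) for w, v in product(ENCODINGS, repeat=2)}


def candidates(garbled: str) -> set:
    """Undo every (viewed, written) pair that round-trips without error."""
    found = set()
    for viewed, written in product(ENCODINGS, repeat=2):
        try:
            found.add(garbled.encode(viewed).decode(written))
        except UnicodeError:
            continue  # this pair cannot have produced the garbled text
    return found


for (written, viewed), shown in matrix("Привет").items():
    print(f"{written:>10} viewed as {viewed:<10} {shown}")

garbled = view_as("Привет", "cp866", "cp1251")
print("Привет" in candidates(garbled))
```

Because the output is ordinary Unicode text, a bitmap is only needed where the reader's own display cannot be trusted to show it.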

Merge/links/key words

This article needs to be connected to "nonsense" "gobbyldygook" (?) and other similar articles. It also needs to be searchable by "text" "nonsense" "typing" "encoding" etc. Can someone do the keywords? I dunno how. Also I will link it in other articles. Dudeman1st 08:34, 26 March 2007 (UTC)

Hebrew

Since no one else seems to have noticed, a while back somebody changed the part about what mojibake is usually called in Hebrew from sinit to jibrish. I have no idea which (if either) of these is correct, as they are both equally unsupported. The Storm Surfer 21:49, 11 April 2007 (UTC)

Is Mojibake really a loanword?

Internet Japan-o-philes are notorious for appropriating Japanese into their own context-specific lexicon. I don't really think that the word has made any inroads outside of that community. I can't imagine I could use the word without explaining myself, even among tech-savvy friends who understand the problem of mis-matched character sets. The word choice I'd lean to is garbage characters/garbage text or gibberish (note the Hebrew usage).

Calling it mojibake also makes for a confusing article, since the generic topic of garbage text has to be covered. Defining the problem in specific relation to a given language necessitates the indiscriminate list of "problems in other languages". Describing it in generic terms would obviate the need to explain that the problem exists in Russian, Chinese, Hebrew, etc.

Any opinions on a possible re-focusing of the article and a move to a more generic title? – Þ 03:38, 7 May 2007 (UTC)

We've just had this argument; is there any way you could read the discussion at the bottom of the page?--Prosfilaes 13:12, 18 June 2007 (UTC)

Note: The above comment was made before the discussion below and may well have precipitated it. -- Rick Block (talk) 14:07, 18 June 2007 (UTC)

Yeah, I don't know why I didn't notice that. Sorry.--Prosfilaes 16:10, 18 June 2007 (UTC)

Requested move

The following discussion is an archived discussion of the proposal. Please do not modify it. Subsequent comments should be made in a new section on the talk page. No further edits should be made to this section.

The result of the proposal was no consensus to move the page to Garbage characters, per the discussion below. It is not clear that the current title enjoys widespread support, but it is clear that a less-specific title for this specific phenomenon is unlikely to be accepted. I would suggest adding more prominent mentions of other foreign-language names to the lede, or making it more clear that this page is meant to be limited to Japanese-related cases. Dekimasu よ! 03:29, 3 June 2007 (UTC)

Mojibake → Garbage characters — Mojibake is currently not commonly used in English. Mojibake is not a Japanese exclusive concept. The phenomenon can happen to all non-Latin users. "Garbage characters" is easily understood. —Voidvector 01:56, 29 May 2007 (UTC)

Survey

Feel free to state your position on the renaming proposal by beginning a new line in this section with *'''Support''' or *'''Oppose''', then sign your comment with ~~~~. Since polling is not a substitute for discussion, please explain your reasons, taking into account Wikipedia's naming conventions.

Comment. It appears to be used at least sometimes in English, based on the references given. I'm not sure that the meaning of "garbage characters" is immediately clear or that it would be a more common search term. Dekimasu よ! 06:05, 29 May 2007 (UTC)

Oppose. I have always said either "mojibake" or "gibberish" as suggested by Þ above. I'm not wedded to calling it "mojibake," but I don't think that "garbage characters" is the right choice either. Amake 08:29, 29 May 2007 (UTC)

Oppose per Amake. Mojibake is too specific to the Japanese case/language, but "gibberish" is too general, applying to wide varieties of forms of nonsense, and not just this particular situation. I'm not sure if there's a single term in English that can be said to be the definitively "most correct" or "most common" term for this... I would not recognize the meaning of "garbage characters" upon first glance. LordAmeth 09:33, 29 May 2007 (UTC)

Oppose "Garbage characters" is a description (and a poor one at that). The term "mojibake" is indeed commonly used as discussed below. Bendono 13:01, 29 May 2007 (UTC)

Oppose. I suspect there is no succinct English word or phrase for this since it is almost by definition not a problem with English. If there is a term or phrase other than mojibake widely used in the industry I'd be OK with renaming, but garbage characters is certainly not it. The general concept is character rendering problems in electronic mail, which is kind of a mouthful but I'd expect would be an immediately recognizable topic to at least anyone in the industry. -- Rick Block (talk) 18:12, 29 May 2007 (UTC)

Oppose. While "garbage characters" is just an English phrase that can be understood from context, "mojibake", as currently used in many environments, is a specific term for a specific phenomenon - exactly what one would want to look up in Wikipedia. (That said, I wish someone could write about when the word first escaped from Japan...) --Alvestrand 21:19, 29 May 2007 (UTC)

Oppose as this article is specifically about the Japanese instance. I also generally agree with the other oppose comments above. ···日本穣? · Talk to Nihonjoe 22:34, 29 May 2007 (UTC)

Comment Mojibake is not a good name, but neither is "Garbage characters". See below. --129.78.64.102 05:00, 31 May 2007 (UTC)

Oppose Mojibake is a specific name for the phenomenon, and used in an English dictionary--the Jargon file.--Prosfilaes 14:33, 31 May 2007 (UTC)

Oppose Mojibake is a good specific term. "Garbage characters" is not. --Serge 02:57, 1 June 2007 (UTC)

Discussion

Regarding comments above about citations, I have looked at the external links in the article, all of which refer to a Japanese context. I would think that under those conditions the user would have at least some knowledge of Japanese.

I did a Google search of "Mojibake -japanese -文字化け -shift-jis -sjis -site:jp" on English-only pages, in an attempt to find uses of Mojibake outside of a Japanese context; it returned only 9,000 pages.

On a side note, I just did a search of Microsoft Knowledge Base (the largest multi-language IT help database I can think of); it uses "Garbled" or "Garbled characters" as the English translation. (Example pages: [1] [2] [3] [4], you can select the document language on the side) --Voidvector 09:53, 29 May 2007 (UTC)

The phrase "is garbled" is acceptable as a description. However, terming the phenomenon "garbled characters" is quite awkward. For people who work in internationalization (and occasionally localization), the English term used is in fact "mojibake". It may etymologically derive from the Japanese language, but it is now used to refer to the phenomenon produced in other languages as well. For those for whom the word is not part of their active vocabulary, all that can be done is to describe it with phrases such as "garbled text" etc. Note that these are descriptions and not actually accepted terms. Bendono 12:58, 29 May 2007 (UTC)

Re Google searches: The most common context I've heard "mojibake" in is "mojibake spam". A search for those 2 words on Google returns an estimated 91,000 entries; I guess most of these articles are not in Japanese. --Alvestrand 06:05, 30 May 2007 (UTC)

I wasn't able to find any useful result in searching for "mojibake spam", but based on the description, I believe it is called Bayesian poisoning in English.

Based on the responses, I see that an overwhelming majority of the watchers of this page support the current title. Nevertheless, I am not convinced that the word is accepted in English. None of the respondents supplied any article/documentation showing the word's usage outside of a Japanese context. My impression is that if a Russian, Polish, or Chinese person had started this article, it would have been named differently. --Voidvector 10:27, 30 May 2007 (UTC)

From the Google search I mentioned above: [5], [6], [7], [8]. I don't see much Japanese context.... --Alvestrand 13:59, 30 May 2007 (UTC)

I suspect several of the commenters don't watch this page but came here (like I did) because the move request was announced at Wikipedia:WikiProject Japan. This probably biases the comments, although I personally don't speak or read Japanese but have heard of this term because I work in an email related field. I asked Alvestrand to comment, since I know this topic is within his area of professional expertise. The Cold Fusion Developer's journal referenced in the article is an example of using this term for the general problem rather than specifically in the Japanese context. The term is also defined at [9]. -- Rick Block (talk) 14:28, 30 May 2007 (UTC)

This article is not specifically about "the Japanese instance" - and if it is, then it should be improved to reflect a more world-wide view. This is a common phenomenon and happens to many different character sets. "Garbage characters" is not a good name - no evidence of common usage.

Neither is "mojibake" -- ordinary readers have no idea what it means. Checking the sources, two of the external links are Japanese, so really there is only one internet article that provides any evidence of the usage of this term as an English word for the phenomenon generally, rather than the transliteration of a Japanese word used in Japan to refer to the phenomenon. As to google searches: it does show some usage of "mojibake" in a wider context, but so does a google search for, say, "luanma", the Chinese term, and I'm sure for other languages too.

I think something like "scrambled character display" or something more descriptive would be best. This is not only a technical issue - it is experienced by many people. Surely there are commonly used English terms for it? --PalaceGuard008 05:02, 31 May 2007 (UTC)

If you can provide a name for it, I might go along with a change, but "scrambled character display" is not a name--it's a description, and not even a great one. It could refer to a "character display", like a terminal, getting scrambled, and I wouldn't automatically understand it to be mojibake. Mojibake is clear and unambiguous. Not only that, it's used in English dictionaries, like the Jargon file entry, and most of the google searches for mojibake show it being used in English with this meaning.

Perhaps luanma is used in English, but the first page of google searches shows names and stuff I can't decipher. The second page shows one use with this meaning, in a text that was clearly not produced by a fluent speaker. This clearly contrasts with the use of mojibake by fluent English speakers for other fluent English speakers.--Prosfilaes 14:33, 31 May 2007 (UTC)

The difference between luanma and mojibake is one of degree: both are foreign terms, used by some English speakers (of various degrees of adroitness) to describe the same phenomenon. The difference is that mojibake is used more often in certain technical contexts, while luanma is used less often. That mojibake appears in the Jargon file only means that it is used in English - it doesn't mean it is a common name - and it isn't.

Here's a Google (english only) search of "luanma": http://www.google.com/search?as_q=luanma&hl=en&num=10&btnG=Google+Search&as_epq=&as_oq=&as_eq=&lr=lang_en&as_ft=i&as_filetype=&as_qdr=all&as_nlo=&as_nhi=&as_occt=any&as_dt=i&as_sitesearch=&as_rights=&safe=off

In particular, I think this page provides evidence of a fluent English speaker writing with considerable technical dexterity on the subject using the term "luanma".

The thing is, I can see that mojibake is used as a technical term in English in some contexts. However, that doesn't make it a common name among ordinary readers. The same goes for luanma, which is used less often still: it, too, is not a common name among ordinary readers. I'm not advocating a move to "luanma". But given the variety of expressions used, it is perhaps wise to adopt a descriptive name, simply so that ordinary readers will have some idea of what they are reading about.

The present state is neither convenient nor useful. I typed in a few names that people use for this phenomenon (based on anecdotal evidence) in Wikipedia (e.g. "jumbled characters", "encoding error", "random characters", "strange encoding", "decoding error", "character display error"), and none of them led to anything relevant at all. An ordinary reader trying to read up on the phenomenon is likely to get nowhere. Even if there is a redirect, they would encounter a completely alien name which makes no sense to them and which they are likely never to have heard of. --PalaceGuard008 02:26, 2 June 2007 (UTC)

That mojibake is in the Jargon file means that it's accepted as an English term by an English dictionary. Mojibake is an English word. Luanma is more arguable, but that's really moot. You typed in a few random descriptions of the issue and found nothing relevant. That's unfortunate, but not uncommon. Ordinary readers who are working with the concept of generalized integers may type that into the search box, but they don't get group theory. If there were a name for this that they are likely to have heard of, there wouldn't be such disagreement, but there isn't. If a user used any of the words you typed in that you called names for this phenomenon and asked for my help, my response would be "what exactly are you seeing?". "Jumbled characters" strikes me as a type of puzzle, and "encoding error" is far more broad than Mojibake, and "character display error" far more broad than that.

Mojibake is a name for it. That's what we need for this article, a name, not a description, and especially not a description that is vague.--Prosfilaes 15:25, 2 June 2007 (UTC)

With respect, Jargon file is not an "English dictionary". It's a "dictionary", or, better stated, a glossary, of technical terms. That does not reflect common usage at all. An "English dictionary" is something like the OED, which records the English language, not nerdy jargons. It's one online resource, and it reflects some usage for this term. Likewise, if you looked at the links supplied, so is Luanma - it is also used by some as a technical term.

That Diceros bicornis appears in a biology textbook does not mean it is a common name for what we might otherwise call a "rhino".

If we cannot agree on a common English name for the phenomenon, then at the very least mojibake should not be presented as a common English name, because it isn't. --PalaceGuard008 01:01, 3 June 2007 (UTC)

The above discussion is preserved as an archive of the proposal. Please do not modify it. Subsequent comments should be made in a new section on this talk page. No further edits should be made to this section.

"competing encodings"?

What is meant by "competing encodings"? Shinobu 13:54, 16 September 2007 (UTC)

In the section on Russian? I'd assume this means anything other than Windows CP-1251 (Microsoft's code page for Russian). It would be good to clarify this if anyone actually knows (something like "when replying to or forwarding messages created in any encoding other than Windows CP-1251"). -- Rick Block (talk) 16:34, 30 September 2007 (UTC)
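For illustration only, here is a minimal Python sketch of what a clash between two Cyrillic encodings looks like. It assumes that the "competing" encoding is KOI8-R, which the article does not confirm, and the sample word is arbitrary.

```python
# Assumption: the "competing" encoding is KOI8-R; the article does not say which one is meant.
original = "Привет"

# Bytes as written by a sender whose software uses KOI8-R...
koi8_bytes = original.encode("koi8_r")

# ...and misread by a receiver that assumes Windows CP-1251.
garbled = koi8_bytes.decode("cp1251")

# The result is still Cyrillic letters, but no longer the intended word.
print(garbled)

# Undoing the wrong decode recovers the text, because every byte value
# used here is defined in CP-1251 and so nothing was lost along the way.
repaired = garbled.encode("cp1251").decode("koi8_r")
print(repaired)
```

Because both encodings are single-byte and cover the same alphabet, the garbled text looks like plausible Cyrillic rather than obvious noise, which may be why the problem is easy to miss when replying to or forwarding a message.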

Pronunciation

Could a user knowledgeable on the matter please add a pronunciation insert, since this is a transliterated word? I assume it is mo-jee-bah-kay and not mo-jee-bayk, but I really have no idea! Thanks. Tolstoy143 - "Quos vult perdere dementat" 05:31, 30 September 2007 (UTC)

I'm not sure either, but I assume it's mo-jee-bah-kee. -- Rick Block (talk) 16:23, 30 September 2007 (UTC)

As someone currently taking Japanese (and who therefore has some knowledge of the romanization systems used for Japanese), I can confirm it would be more like mo-ji-bah-keh... if you were to pronounce it like the original Japanese, according to romanization. Many English speakers in the computer fields seem to use the term as well, though, as an English word, and I have no idea how they pronounce it. ;) —Preceding unsigned comment added by 4.238.12.41 (talk) 04:12, 3 November 2007 (UTC)

Non-IPA pronunciation information

As far as I can see, there's no pronunciation information in the article, just romanisations. The idea of romanisation is to spell words from other writing systems with Roman characters, not to provide any help in pronunciation (e.g. Kunreisiki). Perhaps pronunciation information would be useful in the article, but there isn't any at the moment, so is the cleanup-IPA really necessary? There isn't a pronunciation needed template (that I can find) for page headers, so I'll just remove it.--holizz 10:24, 18 October 2007 (UTC)

Globalize?

I don't see any particular need to expand this article to cover other cultures, or that the current article has too much emphasis on Japan. It could be rearranged to make the general discussion more generic, but I'm afraid that would make it harder to understand.--Prosfilaes 18:19, 18 October 2007 (UTC)

Mojibake/garbage characters/something else?

Does the article really need to be named Mojibake or by some English translation of it?

Because I mean, be realistic here: Mojibake isn't a loan word. It's Japanese. Some webpages and a jargon dictionary happened to use it because it sounds cool. A jargon dictionary isn't exactly qualified to declare new words, nor are a few webpages, nor is Wikipedia.

Excluding the etymology section, the article describes a phenomenon, not a word. So there's no need to even try to translate it to English and call it something really clunky like Garbage Characters.

How about something like Text Garbling or Garbled Text or Encoding Mismatch?

Or just call it "Gibberish (computing)".

rei 70.68.197.145 06:51, 19 October 2007 (UTC)

The least you could do is read the talk page. -Amake 07:09, 19 October 2007 (UTC)

Please see the discussion above about a proposed move (with the green background). -- Rick Block (talk) 13:52, 19 October 2007 (UTC)

Here's a thought (after reading the whole page, yes) - could we perhaps place the description of the phenomenon in the same article as whatever covers the more general topic of displaying foreign characters on computers using encoding whatchamacallits, and simply transwiki the foreign or foreign-derived words for it (including mojibake) to Wiktionary? It's pretty clear that such terms as mojibake and that one Chinese one are sometimes used in English, but there's not really very good evidence to support mojibake being the dominant or only official name for this problem in English... or for there really BEING an obviously dominant and official name for this in English. But, this wouldn't be a problem for Wiktionary as I recall - all you'd need to show is that it is sometimes used by English-speakers in some circles to refer to this kind of problem, and you can even list it in the English dictionary segment of the site! Unless I'm mistaken, you all are making this a lot harder than it needs to be, really. Take away the etymology section and the section on "in other languages", and you have a very, very tiny "article" - which could easily be trimmed a bit and merged into an article on the more general subject this applies directly to? I mean, it's not like this is so thorough it HAS to be its own article, really. 4.238.12.41 04:29, 3 November 2007 (UTC)

Encoding whatchamacallits? I don't think we can meaningfully evaluate the idea without the name of the article to merge it to. Furthermore, there are six paragraphs on the subject without any mention of names--including of course "in other languages", which is a relevant section--which makes a decent-size if short article.

BTW, there's no such thing as an official word in English. We are not French; we speak a wild language, unfettered by an Academy or formal regulation. There is good evidence that it is the dominant term--it's in an English-language dictionary, and it is used by fluent English speakers in published books, unlike the other terms.--Prosfilaes 13:17, 3 November 2007 (UTC)

I've come across the expression "letter snot" when discussing this, from mis-typing the phrase "letters not coming out right". -- Jevanyn _talk 18:54, 11 November 2009 (UTC)

Example

A good example, if people want one, is to open Notepad (I assume this would work in a simple text editor for a Mac or other Unix-based OS) and type "this program can break" without quotes. Then save it and open it again. If it works, it will display either boxes or some kind of Asian lettering (this is what happens for me, but I don't know which language it is). It will most likely display boxes. 67.201.144.50 (talk) 06:31, 25 May 2008 (UTC)

The Asian lettering is Chinese, and it's completely meaningless. Professor M. Fiendish, Esq. 00:30, 29 August 2009 (UTC)
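One plausible explanation, which this thread does not confirm, is that the editor guesses the saved ASCII file to be little-endian UTF-16 when it reopens it. A minimal Python sketch under that assumption reproduces the effect:

```python
text = "this program can break"

# 22 ASCII bytes: an even count, so they pair up cleanly into 16-bit units.
raw = text.encode("ascii")

# Assumption: the editor misdetects the file as little-endian UTF-16.
garbled = raw.decode("utf-16-le")

print(garbled)
print(len(garbled))  # 11, one character per pair of bytes

# Every resulting code point falls in the CJK Unified Ideographs block,
# which is why the output looks like (meaningless) Chinese.
print(all("\u4e00" <= ch <= "\u9fff" for ch in garbled))  # True
```

Whether boxes or ideographs appear would then depend only on whether the display font has glyphs for those code points.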

Illustration

The illustration really ought to be recursive :)

Wikipedia logo

Which puzzle piece in the W logo is the faulty Devanagari one? Mcswell (talk) 15:38, 30 December 2009 (UTC)

Mojibake image

This edit adds an image purported to represent mojibake. However, there is no mojibake there. The problem is that the selected font does not have all of the needed glyphs. In addition, the boxes are a feature of Firefox to help with this issue: inside, they show the Unicode code point of the non-displayable character. Looking these up manually will show that the text is completely there without any encoding issues. On a typical Windows system, font linking would compensate for this, but there is limited support on Linux. As this is a font issue and does not illustrate mojibake, I will remove it. Bendono (talk) 02:19, 7 April 2009 (UTC)

That's fine with me; I guess I misunderstood what mojibake refers to. rʨanaɢ talk/contribs 02:47, 7 April 2009 (UTC)

Then what are they supposed to be called?

If it isn't mojibake, what should I call it when a character isn't present in the font used to display it, and one of those squares, boxes, dots, question marks etc. is displayed in its place? --TiagoTiago (talk) 07:00, 23 July 2009 (UTC)

Technically, it's mojibake. Mojibake is a misread of letters in a foreign language that does not really use basic latin letters and instead uses other letters (such as squares, weird letters, etc). —Preceding unsigned comment added by Kanzler31 23:38, 26 May 2010‎ (UTC)

No, it isn't. You might have concluded that after reading the old definition, vague as it was. In essence, we're talking about mispresentation, not "misreading". I've made a new lead section with a "narrower" definition that rules this out, and even specifically warns not to confuse font/rendering problems with mojibake. When the symbol is missing from the font used to display it, what I've seen is boxes with the codepoint in hex, but the replacement character � is also said to be a valid substitute in those situations, according to its own article. These replacements are valid, in contrast to the pseudorandom replacements of totally unrelated symbols from a different language that mojibake can give you. But crucially it's a different failure, technically. Btw, don't forget to sign your answers, before I do it. 84.209.119.158 (talk) 09:17, 17 August 2014 (UTC)
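A missing glyph cannot be reproduced in a few lines, since it depends on the installed fonts, but the contrast between mojibake and a legitimate replacement character can be sketched in Python. The second half below shows a decoder substituting U+FFFD for a byte it cannot interpret, which is a related but separate situation from the font case; the sample word is just an illustration.

```python
word = "mélange"

# Mojibake: valid UTF-8 bytes decoded with the wrong encoding.
# Unrelated characters appear in place of the accented letter.
utf8_bytes = word.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)  # mÃ©lange

# Not mojibake: the decoder notices a byte that is invalid in UTF-8
# and substitutes the replacement character U+FFFD for it.
latin1_bytes = word.encode("latin-1")
replaced = latin1_bytes.decode("utf-8", errors="replace")
print(replaced == "m\ufffdlange")  # True
```

In the first case the software has no idea anything went wrong; in the second it knows and says so, which is the sense in which the replacement is "valid".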

Writing system vs. language

Languages and writing systems are conflated in the article as it stands (21 June 2010). I expanded the Russian section and mentioned Bulgarian and other languages using Cyrillic, since they (mostly) share encoding schemes and have virtually identical mojibake issues. Then I noticed parallel sections for Bulgarian and so forth.

Would it be better to restructure the article around writing systems? Anecdotally discussing languages individually misses the fact that there are major mojibake issues between writing systems (such as Latin vs. Cyrillic vs. Hangul vs. Chinese ....), then minor issues between their variants (Russian, Serbian and Ukrainian Cyrillic as well as extensions used in Mari, Tajik, Uzbek and so forth). LADave (talk) 21:37, 21 June 2010 (UTC)

German heading, but no German example?

The paragraph heading "Nordic languages and German" is misleading - all nordic languages are mentioned there, but not German... 84.63.240.214 (talk) 00:17, 14 December 2010 (UTC)

Contains Japanese text

I don't know if I want to put the {{Contains Japanese text}} template on this page, because it also has a link to the Mojibake page. I just want to get "community approval". Thanks. –Mnid (Let's talk about it!) 17:17, 15 August 2011 (UTC)

Arabic example.png

file:Arabic example.png has been nominated for deletion -- 65.92.180.137 (talk) 05:34, 4 March 2013 (UTC)

[10] Is it just my computer, or is the Arabic displayed correctly? Bennylin (talk) 09:06, 21 January 2014 (UTC)

"Kod Obmena Informatsiey" in plural?

My Russian is rusty but I think that, like in English, "information" has no plural in Russian. It should rather be "Kod Obmena Informatsii" in singular. Joe Forster/STA (talk) 12:09, 24 December 2015 (UTC)

Just fixed a whopper

In § Yugoslav languages, the first sentence is

Slovenian, Croatian, Bosnian, Serbian, the variants of the Yugoslav Serbo-Croatian language, add to the basic Latin alphabet the letters š, đ, č, ć, ž, and their capital counterparts Š, Đ, Č, Ć, Ž (only č/Č, š/Š and ž/Ž in Slovenian ...

and the last sentence was

For example, Windows 98/Me can be set to most non-right-to-left SBCS codepages including 1250, but only at install time.

Maybe fortunately, I'm not really a coder, so I didn't know what "SBCS" was supposed to stand for. But I am a linguist, and a fast reader, and it sure looked to me like a slightly reordered abbreviation for "Slovenian, Croatian, Bosnian, Serbian". When I clicked on it, though, I saw

SBCS, or Single Byte Character Set, is used to refer to character encodings that use exactly one byte for each graphic character.
[SBCS]

So I piped it to [[SBCS|single-byte]] instead. --Thnidu (talk) 04:12, 18 February 2017 (UTC)

Windows and Unicode

From the article:

Whereas Linux distributions mostly switched to UTF-8 (around 2004) for all uses of text, Microsoft Windows still uses codepages for text files that differ between languages.

(Template and links omitted.) I believe this is incorrect, as Windows NT uses Unicode internally, specifically UTF-16. Codepages are still supported, but they're distinctly legacy. I'm not sure about 95/98/Me but IIRC, the Registry is stored as flat UTF-16 text, which suggests they do too. Hairy Dude (talk) 13:49, 29 January 2018 (UTC)

That it uses UTF-16 internally doesn't tell us whether, when loading text files, it interprets them as UTF-8 or UTF-16, or whether it still assumes a code page for files without a BOM. Being a Linux user I can't test what it might actually do, but a quick search seems to indicate it's a mix of modern apps and APIs that assume "Unicode" and legacy apps and APIs that still assume a code page. Anomie ⚔ 14:17, 29 January 2018 (UTC)
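To make the BOM question concrete, here is a minimal Python sketch of a loader that honours a byte order mark and otherwise falls back to a code page. The function `decode_text_file` and the default `cp1252` are hypothetical choices for illustration; this is not a description of what any Windows application actually does.

```python
import codecs


def decode_text_file(data: bytes, legacy_encoding: str = "cp1252") -> str:
    """Hypothetical loader: use the BOM if there is one, else assume a code page."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No BOM: the loader has to guess, and a wrong guess produces mojibake.
    return data.decode(legacy_encoding)


with_bom = decode_text_file(codecs.BOM_UTF8 + "é".encode("utf-8"))
without_bom = decode_text_file("é".encode("utf-8"))

print(with_bom)     # é
print(without_bom)  # Ã©
```

The same UTF-8 bytes come out correctly or as mojibake depending only on whether the BOM is present, which is the case the comment above is asking about. The sketch ignores UTF-32, whose little-endian BOM begins with the same two bytes as the UTF-16 one.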

Modern Windows versions do not use UTF-8 in many places, one of them being the "Zipped folder" feature. It's easy to verify: create a file with a CJK filename (e.g. 维基百科.txt which means Wikipedia.txt), right-click it, Send to - Compressed (zipped) folder. A new zip file named 维基百科.zip will be created. This zip file works well in Windows, but does not work in programs that use UTF-8 as the default encoding (fun fact: the zip file format does not have a field to declare its encoding). As far as I have tested, if you use unzip(1), p7zip(1), or even the Python zipfile module to open this zip file in GNU/Linux, all you see is Î¬»ù°Ù¿Æ.txt, which is GB encoding decoded as Latin-1. ~wzyboy (talk) 03:19, 30 November 2018 (UTC)
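The garbled name quoted above can be reproduced directly in Python, without a zip file, by encoding the filename as GBK and decoding the bytes as Latin-1. This is a minimal sketch of the mismatch itself; it does not model how any particular unzip tool chooses its encoding.

```python
name = "维基百科.txt"

# The bytes as a Chinese-locale Windows system would store the name.
gb_bytes = name.encode("gbk")

# The same bytes read by a program that assumes Latin-1.
garbled = gb_bytes.decode("latin-1")
print(garbled)  # Î¬»ù°Ù¿Æ.txt

# Latin-1 maps every byte to a character, so the damage is reversible.
repaired = garbled.encode("latin-1").decode("gbk")
print(repaired)  # 维基百科.txt
```

Each Chinese character occupies two bytes in GBK, so the four characters become eight Latin-1 characters, while the ASCII extension survives unchanged.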

What was this

I once read an article on how English letters can be replaced with letters from other languages. Would be happy to see what that was ^いくらBraden1127 _イクラꅇ 22:16, 30 July 2018 (UTC)

NVM it's IDN homograph attack ^いくらBraden1127 _イクラꅇ 22:18, 30 July 2018 (UTC)

Mojibake doesn't make sense

The screenshot of Japanese Wikipedia with poor Windows encoding looks like an Uncyclopedia nonsense page. (see AAAAAAA! on Uncyclopedia, not mojibake) --109.201.34.216 (talk) 07:00, 21 September 2019 (UTC)