User:Curpsbot-unicodify


This is a bot being run by User:Curps, written using the pywikipedia framework. It is currently being run in a controlled way: it is manually launched to recurse through a particular category and its subcategories. For now it is concentrating on eastern European topics.


Yes, this is OK and will not cause problems: see the Village pump (technical) discussion on this.

The MediaWiki software now has a list of old browsers that cannot handle Unicode correctly, and presents these browsers with a "safe" version of the page to edit.

Note that ever since MediaWiki 1.5, literal Unicode characters have been added in hundreds of edits by ordinary users every day, every time they type any non-ASCII character (unless they go to the trouble of memorizing and typing the &# code for each such character, which is extremely unlikely).

To summarize, the bot leaves a page's rendered display (what readers see) entirely unchanged, and leaves its wiki markup (what editors see) looking exactly as it would if the page's entire edit history had taken place under MediaWiki 1.5.

What it does


The bot almost never makes any changes that are visible to page readers; nearly always, it only makes changes that are visible to editors. The only mild exceptions involve manual intervention (approval) by the bot operator, and these cases are noted in the edit summary.

  • when manual intervention is done by the bot operator for bogus non-printable &#<num>; for num=128–159 (0x80-0x9f)
  • when manual intervention is done by the bot operator to add missing semicolon to &<name> or &#<num>
  • when manual intervention is done by the bot operator for leftover Latin-1/2/? %NN (0xa0-0xff)
  • when links are changed to avoid visible underscores: Albert_Einstein → Albert Einstein

Conversion to literal Unicode


Character entity references


Some (not all) character entity references (&<name>;) are converted to literal Unicode characters; others are left unchanged. See #Entities set.

For instance:

&ecirc; → ê

Sometimes the semicolon is erroneously omitted. The bot attempts to detect this and suggests a repair, subject to manual approval by the bot operator.

However, some entities are not checked for missing semicolons because they would cause too many false positives in URLs of the form http://xxxxx.yyy?aaaa=....&bbbb=.... For example, &sect is not checked because the entity is rare while many URLs include variables such as "&section=".
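To make the idea concrete, here is a minimal sketch in Python of this kind of substitution. This is not the bot's actual code (the bot is a pywikipedia script); the CONVERT whitelist and the function name are purely illustrative, standing in for the real set described under "Entities set" below.

import re
from html.entities import name2codepoint

CONVERT = {"ecirc", "eacute", "oelig", "alpha"}   # illustrative subset only; see "Entities set"

def convert_entities(text):
    def repl(m):
        name = m.group(1)
        if name in CONVERT and name in name2codepoint:
            return chr(name2codepoint[name])
        return m.group(0)                          # everything else is left unchanged
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)

print(convert_entities("for&ecirc;t"))             # -> forêt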

Numeric character references


Some (not all) numeric character references (&#<num>; or &#x<hex>;) are converted to the corresponding literal Unicode characters; others are left unchanged, or they are converted to character entity references if the latter exist. See #Unicode ranges.

Note that sometimes numeric character references in the range 128–159 (0x80–0x9f) are found. These are undefined in Unicode (or rather are defined as non-printing control characters) and represent Windows-125x code points. The bot suggests various possible characters based on mappings of Windows-125x to Unicode, one of which can be manually selected by the bot operator.

For instance:

&#32993; → 胡
&#8211; → &ndash;
&#263; → ć
&#150; → [manual approval] → &ndash;
&#x259; → ə

Sometimes the semicolon is erroneously omitted. The bot attempts to detect this and suggests a repair, subject to manual approval by the bot operator.
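A rough sketch of the numeric-reference handling follows; again this is illustrative rather than the bot's actual code. The PREFER_NAME table, the "leave the 0x80–0x9F range for manual review" shortcut, and the function name are assumptions based on the behaviour described above, and the sketch skips the range whitelisting described under "Unicode ranges".

import re

PREFER_NAME = {0x2013: "&ndash;", 0x2014: "&mdash;", 0xA0: "&nbsp;"}   # kept as named entities
MANUAL_RANGE = range(0x80, 0xA0)                                        # Windows-125x territory

def convert_numeric_refs(text):
    def repl(m):
        hexpart, decpart = m.group(1), m.group(2)
        num = int(hexpart, 16) if hexpart else int(decpart)
        if num in MANUAL_RANGE:
            return m.group(0)          # leave for the semi-manual Windows-125x pass
        if num in PREFER_NAME:
            return PREFER_NAME[num]
        return chr(num)
    return re.sub(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));", repl, text)

print(convert_numeric_refs("&#32993; &#8211; &#263; &#x259;"))   # -> 胡 &ndash; ć ə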

"Percent" escape sequences


Some (not all) "percent" escape sequences are converted to literal Unicode characters, but only when these occur within [[ ]] or in the left part of [[ | ]]. See #Unicode ranges. The ones that are not converted to literal Unicode characters are converted to numeric character references (decimal) instead.

Nearly all of these "percent" escape sequences represent UTF-8. In some rare cases, Latin-1 or Latin-2 "percent" escape sequences are left over from pre–MediaWiki 1.5.[1] The bot suggests various possible characters based on mappings of Latin-x to Unicode, one of which can be manually selected by the bot operator.

Note: if the link begins with "http://", then some silly person put an external link inside two brackets instead of just one; as of September 21, 2005, the bot avoids modifying such links.

%LX when within [[ ]] or [[ | ]]  (UTF-8 escape, 0x0000-0x007F)

           where "L" is in (0, 1, 2, 3, 4, 5, 6, 7)
           where "X" is any hexadecimal digit

%MX%NX when within [[ ]] or [[ | ]]  (UTF-8 escape, 0x0080-0x07FF)

           where "M" is in (c, C, d, D)
                 "N" is in (8, 9, a, A, b, B)
                 "X" is any hexadecimal digit

%EX%NX%NX when within [[ ]] or [[ | ]]  (UTF-8 escape, 0x0800-0xFFFF)

           where "E" is in (e, E)
                 "N" is in (8, 9, a, A, b, B)
                 "X" is any hexadecimal digit

%KX when within [[ ]] or [[ | ]] and all of the above substitutions already done  (ISO Latin-1/2/? escape, 0x0080–0x00FF)

           where "K" is in (8, 9, a, A, b, B, c, C, d, D, e, E, f, F)
                 "X" is any hexadecimal digit

For instance:

[[aaa%C5%82aaa]] → [[aaałaaa]]
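As a sketch of the UTF-8 case only (ignoring the ISO Latin leftovers and the ASCII bracket/pipe exceptions described later, which the real bot handles separately), decoding a run of %NN bytes inside a link target might look like this; the helper name is hypothetical.

import re

def decode_link_escapes(link_target):
    def repl(m):
        raw = bytes(int(h, 16) for h in re.findall(r"%([0-9A-Fa-f]{2})", m.group(0)))
        try:
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            return m.group(0)     # likely an ISO Latin leftover; the real bot asks the operator
    return re.sub(r"(?:%[0-9A-Fa-f]{2})+", repl, link_target)

print(decode_link_escapes("aaa%C5%82aaa"))   # -> aaałaaa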

Underscores


The bot will sometimes:

  • change underscores to spaces in the "A" part of [[A|B]] or [[A]]
  • more generally, convert two or more consecutive spaces/underscores to a single space, as above
  • remove leading and trailing spaces/underscores in the "A" part of [[A|B]], but not in [[A]]

However it tries to avoid cluttering the edit history with trivial changes. Therefore it will only do the above if one of the following is true:

  • Other changes have already been made on the page (an edit will definitely be done)
  • Underscores visibly appear within the link, which works but looks a bit ugly (for instance Albert_Einstein vs. Albert Einstein)

Nevertheless, there is a flag that can be set to force this underscore and space processing even if the above conditions aren't fulfilled. On September 18, 2005, a run was requested with this flag set.[2] [3]


It does this for Image: links as well. Wikipedia seems to handle these exactly the same way as other links: underscores appear in the URL but spaces appear in the headline in the displayed page.

Changing underscores to spaces can cause a problem in Template: files, where it is possible to have template parameters (with underscores in their names) within a link, for example [[ {{{foo_bar}}} ]]. For the time being the bot avoids editing templates by default.
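A minimal sketch of the underscore/space normalization described above, assuming the link target has already been isolated from the surrounding [[ ]]; the names and the exact regular expression are illustrative, not the bot's actual code.

import re

def normalize_link_target(target, piped):
    target = re.sub(r"[ _]{2,}", " ", target)   # collapse runs of spaces/underscores
    target = target.replace("_", " ")           # single underscores become spaces
    if piped:
        target = target.strip()                 # leading/trailing blanks stripped only in [[A|B]]
    return target

print(normalize_link_target("Albert_Einstein", piped=False))   # -> Albert Einstein
print(normalize_link_target("  New__York  ", piped=True))      # -> New York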

Link simplification

If after all the above processing, a link of the form [[A|B]] ends up as [[B|B]] (that is, with A and B being completely identical character for character), then this is trivially simplified to [[B]].

More generally, [[A|B]] is simplified to [[B]] if A and B differ only trivially (first letter case-insensitive and disregarding leading and trailing blanks).

If A and B cannot be simplified, any leading and trailing blanks in the "A" part of [[A|B]] are removed; however, they are not removed in the "B" part of [[A|B]] or [[B]] (because we could have, for instance, text text[[ link]] text).
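A sketch of the "differ only trivially" test, under the assumption that only the first letter's case and surrounding blanks are disregarded, as described above; the function names are illustrative.

def trivially_equal(a, b):
    a, b = a.strip(), b.strip()
    if not a or not b:
        return False
    return a[0].lower() + a[1:] == b[0].lower() + b[1:]

def simplify(a, b):
    # [[A|B]] becomes [[B]] when the two parts differ only trivially
    return "[[%s]]" % b if trivially_equal(a, b) else "[[%s|%s]]" % (a.strip(), b)

print(simplify("albert Einstein", "Albert Einstein"))   # -> [[Albert Einstein]]
print(simplify("Physics", "physicists"))                # -> [[Physics|physicists]]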

A flag can be set to do further link simplification when certain conditions are fulfilled (such as the page at "A" being a redirect to B). This functionality is described at the /redirects sub-page. This is still in development and is currently turned off.

Entities set


The complete set of character entity references defined in HTML 4.01 is given in http://www.w3.org/TR/html401/sgml/entities.html

The bot only converts a subset of these to literal Unicode and leaves the rest unchanged. Currently, the ones that are converted include:

  • None of ASCII (&quot; &amp; &lt; &gt;)
  • All of Latin-1 except &nbsp; and &shy;
  • All of the small handful of Latin Extended-A entities: &OElig; &oelig; &Scaron; &scaron; &Yuml;
  • Latin Extended-B &fnof; (U+0192)
  • Both of the Spacing Modifier Letters entities: &circ; (U+02C6, not the same as ASCII ^) and &tilde; (U+02DC, not the same as ASCII ~).
  • All of Greek except &thetasym; (U+03D1), &upsih; (U+03D2), &piv; (U+03D6)
  • Some of General Punctuation: &lsquo; (U+2018), &rsquo; (U+2019), &sbquo; (U+201A), &ldquo; (U+201C), &rdquo; (U+201D), &bdquo; (U+201E), &dagger; (U+2020), &Dagger; (U+2021), &bull; (U+2022), &hellip; (U+2026), &prime; (U+2032), &Prime; (U+2033), &permil; (U+2030), &oline; (U+203E), but not &ensp; (U+2002), &emsp; (U+2003), &thinsp; (U+2009), &zwnj; (U+200C), &zwj; (U+200D), &lrm; (U+200E), &rlm; (U+200F), &ndash; (U+2013), &mdash; (U+2014), &frasl; (U+2044).
  • From Letterlike Symbols: &trade; (U+2122) but not &weierp; (U+2118), &image; (U+2111), &real; (U+211C), &alefsym; (U+2135)
  • From Arrows: &larr; (U+2190), &uarr; (U+2191), &rarr; (U+2192), &darr; (U+2193), &harr; (U+2194), &rArr; (U+21D2) but not &crarr; (U+21B5), &lArr; (U+21D0), &uArr; (U+21D1), &dArr; (U+21D3), &hArr; (U+21D4)
  • From Currency Symbols: &euro; (U+20AC)
  • From Mathematical Operators: &part; (U+2202), &radic; (U+221A), &infin; (U+221E), &cap; (U+2229), &asymp; (U+2248), &ne; (U+2260), &equiv; (U+2261), &le; (U+2264), &ge; (U+2265), but not others, including &minus; (U+2212)
  • From Miscellaneous Technical: None
  • From Geometric Shapes: &loz; (U+25CA)
  • From Miscellaneous Symbols: &spades; (U+2660), &clubs; (U+2663), &hearts; (U+2665), &diams; (U+2666)


Some of the above may be subject to change.

A flag can be set to cause &mdash; and/or &ndash; to be converted.

Unicode ranges


The bot only works on specific Unicode character ranges which are already in widespread use in the English Wikipedia (particularly as interwiki links but also elsewhere in articles), for which printable characters are commonly available on most operating systems. It will not convert characters outside of those ranges (for an example, see this bot edit [4]; the characters that aren't changed are precisely the ones that are not displayable, at least on my system).

Currently these ranges (see http://www.unicode.org/charts ) are:

  • ASCII printable (0x20–0x7E) only as %NN, not as &#<num>; or &<name>; (see below)
  • Latin-1 printable above 0xA0 (except &nbsp; and &shy;)
  • Latin-1 non-printable (0x80–0x9F) handled manually, see below
  • Latin Extended-A
  • Greek, Cyrillic, Hebrew, Arabic, Chinese, Japanese, Korean
  • Chinese Pinyin part of Latin Extended-B (third-tone)
  • Vietnamese part of Latin Extended-B (o-horn, u-horn)
  • Armenian, Georgian
  • Azerbaijani uppercase schwa (from Latin Extended-B) and lowercase schwa (from IPA)
  • Devanagari, Bengali, Gujarati, Tamil, Telugu, Kannada, Malayalam
  • Thai
  • Vietnamese part of Latin Extended Additional
  • Hebrew part of Alphabetical Presentation Forms (used for Yiddish)
  • Arabic Presentation Forms A and B
  • Halfwidth and Fullwidth Forms (double-width ASCII chars, often used with CJK)
  • Some punctuation and symbols beyond the Latin-1 set: euro, trade, numero sign, ndash, mdash, minus sign, left/right single/double quotes, daggers, bullet, prime and double prime, arrows, horizontal ellipsis
  • One combining diacritic: combining acute accent, also used as "stress mark"
  • Some but not all of the miscellaneous characters in WGL4; some are not converted if there is potential for confusion with another character (such as mathematical operator U+2206 INCREMENT which could be confused with uppercase Greek delta)
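Conceptually this amounts to a whitelist of code point ranges. A minimal sketch follows, with a deliberately tiny illustrative subset of the ranges listed above rather than the bot's full table.

ALLOWED_RANGES = [
    (0x00A0, 0x00FF),   # Latin-1 printable
    (0x0100, 0x017F),   # Latin Extended-A
    (0x0370, 0x03FF),   # Greek
    (0x0400, 0x04FF),   # Cyrillic
    (0x4E00, 0x9FFF),   # CJK Unified Ideographs
]

def convertible(codepoint):
    return any(lo <= codepoint <= hi for lo, hi in ALLOWED_RANGES)

print(convertible(0x0107))   # ć                  -> True
print(convertible(0xF06D))   # Private Use Area   -> False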

"Mainstream" characters only


Note that not all of the characters in the above ranges are converted. For instance, the Unicode Cyrillic range (U+0400 to U+04FF) includes many obsolete Cyrillic characters (such as yat) and Cyrillic characters for non-Slavic languages (such as Ossetian Cyrillic æ) which are not available in default fonts. In general, characters are only converted if glyphs are available for them in the default font, since characters that appear as "�" in the browser editor are difficult to work with.

mdash, ndash, minus sign


By default, the bot will not convert &mdash; &ndash; or &minus; (however, it will convert the numerical forms to the named forms, for instance: &#8211; → &ndash;). Similarly, it will convert &#160; → &nbsp;, and likewise for &shy;.

However, a flag can be set to turn on conversion of mdash and ndash. This is useful, for instance, in working on Wikipedia articles that cover Canadian election ridings, which incorporate dashes in their names.

Note the Unicode minus sign is not the same as the ASCII hyphen. Its glyph is the same width as the "+" sign (compare -40 and −40), and it will not line-break (so if you want −40 you never get the "−" at the end of one line and the "40" at the start of the next). The Unicode minus sign is available in the range of characters below any editing window: it's next to the ± sign, and not to be confused with the en-dash – which is next to the em-dash —.

ASCII printable


Numeric character references (&#<num>;) or character entity references (&<name>;) are not converted when they represent ASCII characters (e.g., &#39; &amp; &gt; &lt; &quot;). This is because such usage may be intended to avoid being interpreted as wiki markup: for instance [5]:

''Warspite''&#39;s 381 mm rounds

where &#39; is used instead of <nowiki>'</nowiki>, to display:

Warspite's 381 mm rounds

However, there is a special exception for &#32; (&#x20;) = SPACE, because this doesn't interfere with HTML markup and because for some reason it seems to occur somewhat often within Cyrillic (see for instance: [6]). So we do convert this one into a literal space.

However, printable ASCII (not control characters or DEL) is almost always converted when it occurs in the form of %NN in link page names, for instance:

[[New_York%2C_New_York_%28song%29]] → [[New York, New York (song)]]

The exceptions are for %5B ( [ ), %5D ( ] ) and %7C ( | ), which are converted to numeric character references instead because otherwise they would interfere with the [[ | ]] syntax. This is mostly hypothetical, since it's unlikely that these will ever occur in article titles.
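A sketch of the ASCII part of this processing, with the [ ] | exception included; the function name is hypothetical, and underscores are assumed to be handled by the separate underscore pass described earlier.

import re

def decode_ascii_escapes(name):
    def repl(m):
        code = int(m.group(1), 16)
        if code in (0x5B, 0x5D, 0x7C):      # [  ]  |  would break [[ | ]] syntax
            return "&#%d;" % code
        if 0x20 <= code <= 0x7E:            # printable ASCII
            return chr(code)
        return m.group(0)                   # non-ASCII bytes handled by the UTF-8 pass
    return re.sub(r"%([0-9A-Fa-f]{2})", repl, name)

print(decode_ascii_escapes("New_York%2C_New_York_%28song%29"))
# -> New_York,_New_York_(song)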

Latin-1 non-printable


The bot also detects numeric character references in the range 0x80 to 0x9F (decimal 128 to 159). These are non-printing characters in Unicode, but are used as printing characters in older character sets such as Windows-1250, Windows-1252, etc.

As part of the conversion to MediaWiki 1.5, bytes occurring in pre-1.5 content in the range 0x80 to 0x9F were assumed to be from Windows-1252 and converted accordingly, so they should not be seen in wiki text. However, numeric character references in this range were not converted in any way. When the bot finds these, it pauses and prompts the operator with possible characters based on the mappings from the various Windows-125x code pages to Unicode. [7] [8] [9]

Some examples: this old revision of the "Acute accent" article, with a &#158; in it (should be z-with-caron); this old revision of Nicolaus Copernicus with four &#151; in it (should be mdash). Another example was this old revision:[10], where there is a &#159; (U+009F). The text in question appears as "Walenty RoŸdzieński" on Windows because Y-umlaut is code 0x9F in Windows-1252, but it should be "Walenty Roździeński" since z-acute is code 0x9F in Windows-1250. The bot suggests substituting Y-umlaut (U+0178) or z-acute (U+017A) in place of &#159; (U+009F), and the operator selects the suitable choice after examining the article and its context. Another alternative would be Cyrillic small letter dzhe (U+045F) from Windows-1251.
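Generating the candidate list shown to the operator can be done with the standard Windows code page mappings. A minimal sketch, assuming only Windows-1250/1251/1252 are offered (the real bot may consult other Windows-125x code pages as well):

def windows_candidates(num):
    # For a numeric reference in 0x80-0x9F, list what that byte means in several
    # Windows code pages so the operator can pick the right character.
    candidates = {}
    for codec in ("windows-1250", "windows-1251", "windows-1252"):
        try:
            candidates[codec] = bytes([num]).decode(codec)
        except UnicodeDecodeError:
            pass                  # byte undefined in that code page
    return candidates

print(windows_candidates(0x9F))
# e.g. {'windows-1250': 'ź', 'windows-1251': 'џ', 'windows-1252': 'Ÿ'}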

ISO Latin escape sequences


Nearly all of the %NN escape sequences represent UTF-8. However, some are "leftover" ISO Latin escape sequences from the time when the English and some Scandinavian wikipedias used Latin-1, the Polish wikipedia used Latin-2, and so on. As described above, the ISO Latin escape sequences can be unambiguously distinguished from the UTF-8 escape sequences. However, manual intervention is needed in order to confirm whether any particular ISO Latin escape sequence represents Latin-1, Latin-2, etc. Most are Latin-1.

Private Use Area


The Unicode range U+E000 to U+F8FF is designated as a "private use area". No characters in this range are defined; rather, it is reserved for individual users and organizations to use with their own custom character sets and fonts, outside the scope of Unicode. Such characters should never appear in Wikipedia; however, there are some rare examples, such as the former version of Cædmon which used &#61549; (U+F06D) [11] (actually, this is the only example I've found so far; see [12] for how the mystery was solved).

The bot can't attempt any kind of fix here; it simply flags the occurrence in its log files. The only thing to do is to try a purely manual fix (Google search, contacting the original author, consulting the groff symbol font mapping, etc.).

Missing semicolons


The bot will try to detect missing final semicolons in character entity references (such as "D&eacutej&agrave vu" instead of "D&eacute;j&agrave; vu") and prompt the operator on whether to repair this. This requires manual intervention because of the possibility of false positives. In particular, & occurs often in URLs, in the form:

http://xxxxx.yyy?aaa=...&bbb=...&ccc=....
  • Some entities are not checked because they are substrings of another entity (for instance, "&sigma" is not checked because there would be a false positive with every occurrence of "&sigmaf")
  • Some entities are not checked because they are too short and likely to produce false positives (for instance, "&pi")


The bot will also try to detect missing final semicolons in numeric character references (such as "&#263" instead of "&#263;", or the hexadecimal equivalent with "&#x") and prompt the operator on whether to repair this. In this case, the possibility of false positives is greatly reduced compared to the previous case; however, manual intervention is still required to approve the change.
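A sketch of the numeric-reference case (the one with few false positives); the regular expression and function name are illustrative, and any proposed repair would still be presented to the operator rather than applied automatically.

import re

MISSING_SEMI = re.compile(r"&#(x[0-9A-Fa-f]+|[0-9]+)(?![0-9A-Fa-f;])")

def propose_semicolon_fixes(text):
    # Return (found, proposed repair) pairs for references lacking the final semicolon.
    return [(m.group(0), m.group(0) + ";") for m in MISSING_SEMI.finditer(text)]

print(propose_semicolon_fixes("abc &#263 def &#263; ghi"))
# -> [('&#263', '&#263;')]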

Right-to-left and bidirectional text


If the bot creates any Unicode characters that are RTL (right-to-left, such as Arabic or Hebrew), it will report this in the edit summary.

This section needs to be updated and modified. See my comments at the Village Pump under the heading "a proposed solution to Unicode bidirectional algorithm woes in the text editor".


If the RTL text has no LTR (left-to-right) characters embedded within it other than space and quote ('), it will go ahead and perform the conversion to Unicode. However, if any other LTR characters are embedded within two segments of RTL text, the bot will pause and ask the operator to confirm the change. If the change is confirmed, an extra report is made in the edit summary.

This is necessary because display issues can arise with bidirectional text. Even though all the underlying Unicode characters remain in the proper sequence, they may be displayed in an out-of-order sequence in the browser's editor, or sometimes even in the page as viewed by the reader. For discussion of these issues, see: http://www.w3.org/TR/html401/struct/dirlang.html
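A rough sketch of this check using Unicode bidirectional categories; it treats anything between two RTL characters that is neither RTL, a space, nor a quote as needing confirmation, which is a simplification of the rule described above, and the function names are illustrative.

import unicodedata

def is_rtl(ch):
    return unicodedata.bidirectional(ch) in ("R", "AL")

def needs_confirmation(text):
    # True if something other than space or ' sits between two RTL characters.
    rtl_positions = [i for i, ch in enumerate(text) if is_rtl(ch)]
    if len(rtl_positions) < 2:
        return False
    inner = text[rtl_positions[0]:rtl_positions[-1]]
    return any(not is_rtl(ch) and ch not in " '" for ch in inner)

print(needs_confirmation("אני מאמין"))          # -> False (only spaces between RTL runs)
print(needs_confirmation("אני מאמין (פיוט)"))   # -> True  (parentheses embedded)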

The most common case of this is an Arabic or Hebrew interwiki link that contains a disambiguation with parentheses. For instance, in the Ani Maamin article, the interwiki link was:

[[he:&#1488;&#1504;&#1497; &#1502;&#1488;&#1502;&#1497;&#1503; (&#1508;&#1497;&#1493;&#1496;)]]

When these numeric character references are converted to Unicode, the appearance (in the browser's editor or in the diffs [13]) displays as:

[[he:אני מאמין (פיוט)]]

when really it should display as:

[[he:אני מאמין (פיוט)]]

Note that this is only a display issue: the actual underlying Unicode characters are all in proper sequence, and the Hebrew interwiki link itself works fine and takes you to the correct page. The issue is that the browser display can't decide whether the final closing parenthesis should attach to the preceding Hebrew letter ("ט") and display as "(" as a right-to-left closing parenthesis, or whether it should attach to the following ASCII character "]" and display as ")" as a left-to-right closing parenthesis.

When embedded within article text — like this: אני מאמין (פיוט) — there may also be display issues, but in this case it is sufficient to enclose the text within <span dir="rtl"> … </span> to make it display properly: אני מאמין (פיוט).

In the case of Arabic or Hebrew interwiki links, I'll usually go ahead and manually approve the change: the convenience to Arabic- or Hebrew-speaking editors to be able to actually read the interwiki link (instead of dealing with &# soup) outweighs the single misplaced parenthesis. Other cases are handled on a case-by-case basis. In some especially complicated cases of embedding (for example Template:User ar-1) it will be preferable to leave the numeric character references rather than convert to Unicode.

Edit summaries

Edit summaries will look like this (in the most elaborate case):

(20 &#<num>; → Unicode (12 RTL chars created) • 4 &<name>; → Unicode • 6 &#<num>; → &<name>; • 2 UNDEF (&# 128–159) found: 2 fixed semi-manually, 0 not • 4 link(s): %NN changed • 2 link(s): _ → space • RTL-LTR-RTL found)

These need to be a bit cryptic because of a 237-character limit. If simplification of redirects was done, this will also be added to the edit summary.

Issues


I am currently manually checking the results of each bot edit.

As of September 3, 2005, the bot has made a couple of thousand edits. I will no longer check routine bot edits, but will still check any edits that involve RTL characters (Hebrew, Arabic) or %NN characters, or any edits that are out of the ordinary (such as [14]). These can be identified by the description in the edit summary.

As of September 14, 2005, I am no longer checking the %NN edits.

With regard to the redirect-simplifying functionality, I will check each such edit. This is still in development and is turned off by default.

See also

  • Wikipedia:Naming conventions (Unicode) (draft) (this shows Unicode charts, and serves as a guideline for what characters are commonly printable in default fonts)
  • User:Func/wpfunc/addipaextensions.js, which can be added to your [[User:YOU/monobook.js]] file (create if necessary) to allow direct insertion of special IPA characters, in much the same way that special Latin-alphabet characters can already be inserted, using a list at the bottom of the edit window. This idea can be adapted to Cyrillic, Greek, and so forth.