
User:Yurik/CaseCheckerBot

From Wikipedia, the free encyclopedia


CaseChecker bot was created to fix broken links and incorrectly named pages whose titles were accidentally written in a mix of Latin and Cyrillic letters.

Algorithm


The bot looks at all links on all pages, including red links, for mixed-script words: words that contain symbols from both the CYR and LAT lists. Once such a word is found, the bot attempts to work out the "proper" way to write it. For example, :ru:Cлово with a Latin C is automatically replaced with the all-Cyrillic ru:Слово, because the letter л exists only in the CYR list. Note that letters such as С and о can be written in both the CYR and LAT scripts, so on their own they do not determine the script. If an entire word could be written in either script, the bot checks whether either of the articles exists and whether they are redirects. If the bot cannot choose by itself, it asks an operator.
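The decision steps above can be sketched in Python. This is only an illustration under stated assumptions, not the bot's actual code: the names `script_of`, `is_mixed`, `candidates`, `fix_word` and the `page_exists` callback are hypothetical, the script test relies on Unicode character names rather than the bot's own CYR and LAT lists, and the redirect check is left out. The look-alike pairs are taken from the first group of the "For ALL wikis" table.

```python
import unicodedata

# Cyrillic -> Latin look-alike pairs, copied from the first group of the
# "For ALL wikis" table (written as escapes so the script is unambiguous).
CYR_TO_LAT = {
    "\u0406": "I", "\u0410": "A", "\u0412": "B", "\u0415": "E",
    "\u041a": "K", "\u041c": "M", "\u041d": "H", "\u041e": "O",
    "\u0420": "P", "\u0421": "C", "\u0422": "T", "\u04ae": "Y",
    "\u0425": "X",
    "\u0456": "i", "\u0430": "a", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0441": "c", "\u0443": "y", "\u0445": "x",
}
LAT_TO_CYR = {lat: cyr for cyr, lat in CYR_TO_LAT.items()}


def script_of(ch):
    """Assumption: classify a character by its Unicode name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CYRILLIC"):
        return "CYR"
    if name.startswith("LATIN"):
        return "LAT"
    return None  # digits, punctuation, other scripts


def is_mixed(word):
    """True if the word contains both Cyrillic and Latin letters."""
    return {script_of(ch) for ch in word} - {None} == {"CYR", "LAT"}


def candidates(word):
    """Single-script spellings reachable through the look-alike mapping."""
    as_cyr = "".join(LAT_TO_CYR.get(ch, ch) for ch in word)
    as_lat = "".join(CYR_TO_LAT.get(ch, ch) for ch in word)
    return [w for w in (as_cyr, as_lat) if not is_mixed(w)]


def fix_word(word, page_exists=lambda title: False):
    """Return the corrected word, or None if an operator must decide."""
    if not is_mixed(word):
        return word
    options = candidates(word)
    if len(options) == 1:          # only one script can spell the word
        return options[0]
    existing = [w for w in options if page_exists(w)]
    if len(existing) == 1:         # exactly one spelling has an article
        return existing[0]
    return None                    # ambiguous: ask the operator


# Latin "C" followed by Cyrillic "лово": only the Cyrillic spelling works.
print(fix_word("C\u043b\u043e\u0432\u043e"))
```

In this sketch a word made only of look-alike letters yields two candidates, so the result depends on `page_exists`; with no existing page, or with both, the function returns `None` to signal that an operator should decide.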

Additionally, the bot will rename all incorrect articles and redirects based on the same rules.

The bot keeps a whitelist (which could also be maintained per wiki) to prevent it from breaking good pages or giving repeated warnings.

Alphabets


This list is adapted from wiktionary:Appendix:Cyrillic script

ab: АаБбВвГгГьгьГәгәӶӷӶьӷьӶәӷәДдДәдәЕеЖжЖьжьЖәжәЗзӠӡӠәӡәИиКкКькьКәкәҚқҚьқьҚәқәҞҟҞьҟьҞәҟәЛлМмНнОоПпԤԥРрСсТтТәтәҬҭҬәҭәУуФфХхХьхьХәхәҲҳҲәҳәЦцЦәцәҴҵҴәҵәЧчҶҷҼҽҾҿШшШьшьШәшәЫыҨҩЏџЏьџьЬьӘ
be: АаБбВвГгДдЕеЁёЖжЗзІіЙйКкЛлМмНнОоПпРрСсТтУуЎўФфХхЦцЧчШшЫыЬьЭэЮюЯя
bg: АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЬЮЯабвгдежзийклмнопрстуфхцчшщъьюя
cv: АӐБВГДЕЁӖЖЗИЙКЛМНОПРСҪТУӲФХЦЧШЩЪЫЬЭЮЯаӑбвгдеёӗжзийклмнопрсҫтуӳфхцчшщъыьэюя
kk: АӘБВГҒДЕЁЖЗИЙКҚЛМНҢОӨПРСТУҰҮФХҺЦЧШЩЪЫІЬЭЮЯаәбвгғдеёжзийкқлмнңоөпрстуұүфхһцчшщъыіьэюя
mn: АБВГДЕЁЖЗИЙКЛМНОӨПРСТУҮФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмноөпрстуүфхцчшщъыьэюя
os: АБВГГъДДжДзЕЗИЙККъЛМНОППъРСТТъУФХХъЦЦъЧЧъЫабвггъддждзезийккълмноппърсттъуфххъццъччъы # Ӕӕ
ru: АБВГДЕЁЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдеёжзийклмнопрстуфхцчшщъыьэюя
sh: АаБбВвГгДдЂђЕеЖжЗзИиЈјКкЛлЉљМмНнЊњОоПпРрСсТтЋћУуФфХхЦцЧчЏџШш
sr: АаБбВвГгДдЂђЕеЖжЗзИиЈјКкЛлЉљМмНнЊњОоПпРрСсТтЋћУуФфХхЦцЧчЏџШш
uk: АаБбВвГ㥴ДдЕеЄєЖжЗзИиІіЇїЙйКкЛлМмНнОоПпРрСсТтУуФфХхЦцЧчШшЩщЬьЮюЯя

Mapping


Here is the mapping of all CYR and LAT characters that the bot uses. If your language has additional characters that are not listed, or a mapping to a Latin character is missing, please edit this page to add it. See Cyrillic alphabets for more info.

The first two groups cover the upper- and lower-case Cyrillic letters that are mapped or fairly common. In each group, the first row lists all Cyrillic characters; the second row repeats the character from the first row, but only if it is mapped; the third row gives the corresponding Latin character, if mapped; and the fourth row lists all Latin characters. If the two middle rows are empty in a column, the character is not mapped.

For ALL wikis

CYR: І АБВГДЕЖЗИЙКЛМНОПРСТУҮФХЦЧШЩЪЫЬЭЮЯ
     І А В  Е    К МНО РСТ Ү Х          
     I A B  E    K MHO PCT Y X          
LAT: I A B  E    K MHO PCT Y X          
cyr: і абвгдежзийклмнопрстуүфхцчшщъыьэюя
     і а    е        о рс у  х          
     i a    e        o pc y  x          
lat: i a b  e    k m o pcty  x          
CYR:     Ӓ ҪЀ  Ё   Ї      Ӧ        Ӑ        Ӗ       
         Ӓ ҪЀ  Ё   Ї      Ӧ        Ӑ        Ӗ       
         Ä ÇÈ  Ë   Ï      Ö        Ă        Ĕ       
LAT: ÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞĀĂĄĆĈĊČĎĐĒĔĖĘĚĜĞĠĢ
cyr:     ӓ ҫѐ  ё   ї      ӧ        ӑ        ӗ       
         ӓ ҫѐ  ё   ї      ӧ        ӑ        ӗ       
         ä çè  ë   ï      ö        ă        ĕ       
lat: àáâãäåçèéêëìíîïðñòóôõöøùúûüýþāăąćĉċčďđēĕėęěĝğġģ
CYR:                              Ҭ       Џ  Ӱ   
                                                 
                                                 
LAT: ĤĦĨĪĬĮijĴĶĹĻĽĿŁŃŅŇŊŌŎŐŒŔŖŘŚŜŞŠŢŤŦŨŪŬŮŰŲŴŶŸŹŻŽ
cyr:                              ҭ       џ  ӱ   
                                             ӱ   
                                             ÿ   
lat: ĥħĩīĭįIJĵķĺļľŀłńņňŋōŏőœŕŗřśŝşšţťŧũūŭůűųŵŷÿźżž
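
The four-row groups above can be read column by column. As a rough sketch of that reading (the function name `parse_block` is hypothetical, and the rows are passed without their `cyr:`/`lat:` prefixes or five-space indent), the first three rows of a group are enough to recover the Cyrillic-to-Latin pairs:

```python
def parse_block(all_cyr, mapped_cyr, mapped_lat):
    """Build a Cyrillic -> Latin dict from the first three rows of a group.

    A column is mapped only when the second row repeats the Cyrillic
    character of the first row and the third row holds a Latin character.
    """
    mapping = {}
    for col, cyr in enumerate(all_cyr):
        if cyr == " ":
            continue  # padding column, no Cyrillic character here
        src = mapped_cyr[col] if col < len(mapped_cyr) else " "
        dst = mapped_lat[col] if col < len(mapped_lat) else " "
        if src == cyr and dst != " ":
            mapping[cyr] = dst
    return mapping


# Lower-case rows of the first group, copied from the table above.
ALL_CYR    = "і абвгдежзийклмнопрстуүфхцчшщъыьэюя"
MAPPED_CYR = "і а    е        о рс у  х"
MAPPED_LAT = "i a    e        o pc y  x"

LOWER = parse_block(ALL_CYR, MAPPED_CYR, MAPPED_LAT)
print(len(LOWER))  # number of mapped lower-case letters in this group
```

Keeping the table as aligned text and deriving the dictionary from it means the page stays editable by hand, at the cost of being sensitive to column alignment.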

All other non-mapped symbols

LAT: DFGLNRUZ
lat: dfglnruz
lat: ßĸİıſʼn
CYR: ЂЃЄЉЊЋЌЍЎѶѸѼѾҊҌҎҐҒҔҖҘҚҜҞҠҢҤҦҨҰҲҴҶҸҼҾӁӃӅӇӉӋӍӘӚӜӞӠӢӤӨӪӬӮӲӴӶӸӺӼӾԀԂԄԆԈԊԌԎԐԒԔԖԘԞԠԢԤԦ
cyr: ђѓєљњћќѝўѷѹѽѿҋҍҏґғҕҗҙқҝҟҡңҥҧҩұҳҵҷҹҽҿӂӄӆӈӊӌӎәӛӝӟӡӣӥөӫӭӯӳӵӷӹӻӽӿԁԃԅԇԉԋԍԏԑԓԕԗԙԟԡԣԥԧ

Wiki-specific mapping


For all other wikis, these letters will not be auto-substituted.

     1 2 3 4 4 5 6
CYR: Ѕ Ј Ѵ Ԛ Ԝ Һ Ӕ
     Ѕ Ј   Ԛ Ԝ
     S J   Q W
LAT: S J V Q W   Æ
		    	  
cyr: ѕ ј ѵ ԛ ԝ һ ӕ
     ѕ ј   ԛ ԝ һ
     s j   q w h
lat: s j v q w h æ
  • 1: Macedonian (Slavic group)
  • 2: Macedonian, Montenegrin, Serbian (Slavic); Kildin Sami (Uralic)
  • 3: Ancient letter, no mapping
  • 4: Kurdish (Iranian group)
  • 5: Kurdish, Mongolian, Buryat, Kalmyk, Bashkir, Kazakh
  • 6: Ossetic letter that was not used because most fonts did not support it until a few years ago. The bot will force the LAT form for the next few years.

(Need to assign wiki code to each)
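
One possible way to layer these wiki-specific pairs on top of the common mapping is sketched below. The pairs come from the table above, keyed by footnote number. The assignment of wiki codes to footnotes (`NOTES_BY_WIKI`) is purely hypothetical, since the codes have not been assigned yet, and `mapping_for` is an illustrative name rather than part of the bot.

```python
# Look-alike pairs from the wiki-specific table, keyed by footnote number.
EXTRA_BY_NOTE = {
    1: {"\u0405": "S", "\u0455": "s"},                  # Cyrillic Dze
    2: {"\u0408": "J", "\u0458": "j"},                  # Cyrillic Je
    4: {"\u051a": "Q", "\u051b": "q",                   # Cyrillic Qa
        "\u051c": "W", "\u051d": "w"},                  # Cyrillic We
    5: {"\u04bb": "h"},  # only the lower-case Shha is mapped in the table
}

# HYPOTHETICAL assignment of wiki codes to footnotes, for illustration only.
NOTES_BY_WIKI = {"mk": [1, 2], "sr": [2], "ku": [4, 5]}


def mapping_for(wiki, base):
    """Return the common mapping extended with the wiki's extra pairs."""
    mapping = dict(base)  # copy so the shared mapping is not modified
    for note in NOTES_BY_WIKI.get(wiki, []):
        mapping.update(EXTRA_BY_NOTE[note])
    return mapping


common = {"\u0430": "a"}  # stand-in for the full common mapping
print(sorted(mapping_for("mk", common).values()))
```

A wiki that is not listed simply gets the common mapping back unchanged, which matches the rule that these letters are not auto-substituted elsewhere.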

Special cases


The "Palochka" is a very special Cyrillic character that some languages use. It looks identical to the Latin I and the Cyrillic І. I will need very clear rules (which wiki, under what circumstances, etc.) before I start detecting it as a Cyrillic symbol.

CYR: Ӏӏ