Wikipedia:WikiProject Red Link Recovery/Unlikely links

This page is for the discussion of the Unlikely Links tool, hosted on Toolforge at http://tools.wmflabs.org/tb-dev/unlikely/.


Ideas for future unlikeliness checks

  • Characters in the UTF-16 range may indicate corruption or untranslated foreign-language links
  • Anything that would trigger a rule from MediaWiki:Titleblacklist - if a page cannot be created for the target of a link, that link is suspect.
  • Mixes of language-specific characters - for example Icelandic and Romanian specific characters in the same red link
  • Badly formed template links

- TB (talk) 22:31, 15 December 2010 (UTC)
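
As an illustration of the "mixes of language-specific characters" idea above, here is a minimal Python sketch. The character sets are assumptions chosen for the example and are deliberately incomplete (some of these letters also occur in other languages); the names `LANGUAGE_CHARS`, `languages_hinted` and `is_mixed` are hypothetical and are not part of the tool.

```python
# Illustrative, deliberately incomplete character sets (an assumption,
# not data taken from the tool).
LANGUAGE_CHARS = {
    "Icelandic": set("\u00f0\u00fe\u00d0\u00de"),  # eth and thorn, both cases
    "Romanian": set("\u0219\u021b\u0218\u021a"),   # s-comma and t-comma, both cases
}


def languages_hinted(title):
    """Return the languages whose specific characters appear in the title."""
    chars = set(title)
    return {lang for lang, specific in LANGUAGE_CHARS.items() if chars & specific}


def is_mixed(title):
    """True when characters specific to two or more languages are present."""
    return len(languages_hinted(title)) >= 2
```

A title that trips `is_mixed` is only a candidate for review; it is not necessarily wrong.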

Common Double Letters/Triple Letters

ii is a reasonably common double letter - skiing, Hawaii, various star names - perhaps it should be excluded from uncommon double letters. welsh (talk) 04:54, 24 December 2010 (UTC)

I've removed double i's for now. The letters in use were chosen by counting all instances of double letters in article titles and selecting the least common 5. 'i' was indeed the most commonly present of the five selected. - TB (talk) 19:40, 24 December 2010 (UTC)

III is fairly common as well because it is 3 in Roman Numerals. III is used in the naming of films, kings, queens etc. John Cross (talk) 08:19, 9 September 2012 (UTC)

...and divisions in sports leagues ... John Cross (talk) 08:24, 9 September 2012 (UTC)

For sure. I've removed triple I's from the rule for now, but on the whole feel that this pattern has never been especially useful. - TB (talk) 15:43, 9 September 2012 (UTC)
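
The selection method described in this thread (count every double letter in article titles, then keep the least common five) can be sketched roughly as follows. This is an assumption about how such a count might be done, not the tool's actual code; `least_common_doubles` is a hypothetical name, and letters that never appear doubled are simply absent from the result.

```python
import re
from collections import Counter

# A letter immediately followed by itself, e.g. "ii" or "oo".
DOUBLE = re.compile(r"([a-z])\1")


def least_common_doubles(titles, n=5):
    """Return the n least frequent double letters seen in the given titles."""
    counts = Counter()
    for title in titles:
        # finditer is non-overlapping, so "iii" counts as one "ii".
        for match in DOUBLE.finditer(title.lower()):
            counts[match.group(1) * 2] += 1
    # Sort by frequency, then alphabetically to keep the result stable.
    ranked = sorted(counts.items(), key=lambda item: (item[1], item[0]))
    return [pair for pair, _ in ranked[:n]]
```

Roman numerals such as III show why the raw counts need a human sanity check before a pair is treated as unlikely.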

Namespaces

The tool does not display whether a page is in the Portal: or Template: space, but rather leaves it unmarked, which then defaults to Main:. It's easy to see what's going on by doing a what links here? on the redlink, but flagging the namespace would be better. welsh (talk) 14:11, 24 December 2010 (UTC)

Fixed. - TB (talk) 19:33, 24 December 2010 (UTC)
Thanks welsh (talk) 12:40, 8 January 2011 (UTC)

Slow

The suggestions from the tool are taking a long time to display - several minutes in some cases. Is there anything like a tweak to indexes that could fix this? For example, triple letters towards the end of the alphabet. welsh (talk) 12:40, 8 January 2011 (UTC)

Alas, a simple index won't do the job in this case. Currently, a list of all red links in the English-language Wikipedia is maintained and searched on demand for any matching a particular pattern (the patterns can be seen here). The list is too long to brute-force search quickly, and the patterns too varied to index effectively. The real solution is, I suppose, to store pre-calculated lists, as the RLRL tool does - however, in the longer term, I'm hoping to transform the tool into a more generalised 'red-link explorer', hence its simplistic design for now. I'll ponder the matter more - inspiration might strike yet ;) - TB (talk) 10:16, 13 January 2011 (UTC)
I've adjusted a few things to hopefully improve performance a bit. More to come. - TB (talk) 21:48, 27 May 2011 (UTC)
I noticed the refresh was faster even without knowing anything had been changed! Well done welsh (talk) 23:32, 27 May 2011 (UTC)

List rebuilt

Redlink list rebuilt, and a few tweaks made to the tool to make it deal more sensibly with large numbers of whitelisted entries. - TB (talk) 07:54, 26 April 2011 (UTC)

New pattern added - 'All uppercase'

New pattern added - 'All uppercase'. This shows red links that are ALL IN UPPER CASE, of course ;) - TB (talk) 17:36, 27 April 2011 (UTC)

Cool new set! Lots of whitelist candidates (ships, satellites, asteroids, international standards, radio stations...) but many positives too. welsh (talk) 06:57, 28 April 2011 (UTC)
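
For readers wondering what an 'All uppercase' check might look like, here is a small sketch. It assumes that digits, underscores and punctuation are ignored and that at least two letters are required; the tool's real rule is not shown on this page, and `is_all_uppercase` is a hypothetical name.

```python
def is_all_uppercase(title):
    """True when the title has at least two letters and all of them are upper case."""
    letters = [char for char in title if char.isalpha()]
    return len(letters) >= 2 and all(char.isupper() for char in letters)
```

As noted above, legitimate names such as standards and call signs would also match, which is where the whitelist comes in.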

New pattern added - 'Offensive words'

New pattern added - 'Offensive words'. This shows red links matching a small selection of offensive English-language words. - TB (talk) 21:21, 21 May 2011 (UTC)

Sorting lists

Sometimes, maybe just for variety or efficiency of editing, it would be good to see the lists sorted by Containing Article rather than bad link name. This would be particularly useful in the very long ALL UPPERCASE class. welsh (talk) 09:18, 22 May 2011 (UTC)

I quite agree - in general the facilities for navigating lists of unlikely links are pretty crude. I'll see if I can't graft on a more flexible set of tools, hopefully including the ability to sort and further filter lists. - TB (talk) 11:22, 22 May 2011 (UTC)

Which way forwards?

Okay, I've tried quite a few approaches to improving this tool but can find nothing that satisfies me, so I'm soliciting input on what folks want. My original intention was that it develop into a 'red link explorer' tool, allowing users to flexibly generate lists of red links of interest, hopefully for the purpose of fixing them. It turns out that there are a few showstoppers making this infeasible:

  1. The way the MediaWiki database is structured makes it time-consuming to generate a list of all red links (around 4 hours currently)
  2. Likewise, the database structure makes it very hard to maintain such a list - normally one could run through the hundreds of edits made each minute and add/remove red links to keep the list of all red links up to date. Not possible :(
  3. The list of red links is large enough that waving it past even a simple regular expression takes double-digit seconds. Running arbitrary user-generated queries is likely to be problem-prone.

So, a new vision is needed. Anyone ? - TB (talk) 20:33, 6 July 2011 (UTC)

New pattern added - 'Double disambiguation'

New pattern added - 'Double disambiguation'. This shows red links ending in two bracketed terms - for example 1906_Australasian_Championships_(tennis)_(tennis) - TB (talk) 14:46, 25 August 2011 (UTC)

Actually it would be nice if this ignored (tennis) (tennis) for the moment because it is being tested by a template that gets it right in the end. The template value can be modified to avoid it, but it's a waste of time compared to any other examples this finds. Mark Hurd (talk) 12:09, 26 December 2014 (UTC)
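
A rough sketch of the 'Double disambiguation' idea, including the kind of exception requested above. The regular expression is a guess at the shape of the rule rather than the tool's actual pattern, and the `ignore` parameter is a hypothetical way of skipping known cases such as (tennis) (tennis).

```python
import re

# Two bracketed terms at the very end of the title, separated by a space
# or underscore, e.g. "..._(tennis)_(tennis)".
DOUBLE_DISAMBIG = re.compile(r"[ _]\(([^()]+)\)[ _]\(([^()]+)\)$")


def is_double_disambiguation(title, ignore=()):
    """True when the title ends in two bracketed terms not listed in `ignore`."""
    match = DOUBLE_DISAMBIG.search(title)
    if match is None:
        return False
    # match.groups() is a (first_term, second_term) tuple.
    return match.groups() not in ignore
```

For example, passing `ignore={("tennis", "tennis")}` would suppress the template-generated links while still reporting every other pair.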

Target page

How about looking for links to "Target page name"? You get those when you click on the "redirect" icon in the edit box and don't change the text. I've fixed a few of those a few times. ospalh (talk) 19:04, 20 September 2011 (UTC)

Hi Ospalh. Nice idea - that's a new one by me, I tend to not use the javascripty goodies. The list you're after can be found using the normal "What Links Here" tool. Thinking this over, there are a few other similar "error indicator links" we should probably be checking periodically also:
Can you think of any more ? - TB (talk) 19:42, 20 September 2011 (UTC)

New set: Sabha constituencies

There are around 550 Lok Sabha constituencies, all of which AFAIK have pages. Spelling variations seem rife; I believe that most of the redlinks in this set should be fixable. - TB (talk) 12:04, 1 February 2012 (UTC)

Unfortunately there are a lot of Vidhan Sabha constituencies. We also have some obsolete Lok Sabha constituencies that we have not yet dealt with, I believe. All the best: Rich Farmbrough 23:22, 22 December 2014 (UTC).

New set: Co-ordinates

We have over 2800 red links containing geographic coordinates. The majority of these look to be poorly filtered automatically generated content. - TB (talk) 14:56, 23 April 2012 (UTC)

They are disambiguation for Burmese townships. All the best: Rich Farmbrough 03:35, 14 November 2014 (UTC).

New set: Mostly non-English characters

These are links consisting mostly of multibyte Unicode characters - Cyrillic, Greek, Armenian, Hebrew, Arabic and Syriac lettering, and Korean, Chinese, and Japanese ideographs mostly. Nearly 4500 red links match this at the time of writing; it looks like a mix of transwikied stuff and untranslated or only partly translated source. - TB (talk) 15:03, 23 April 2012 (UTC)
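
One way to approximate "consisting mostly of multibyte characters" is to count the characters that need more than one byte in UTF-8. The sketch below assumes a simple majority threshold and ignores spaces and underscores; both choices are assumptions, and `mostly_multibyte` is a hypothetical name.

```python
def mostly_multibyte(title, threshold=0.5):
    """True when more than `threshold` of the characters are multibyte in UTF-8."""
    chars = [char for char in title if char not in "_ "]
    if not chars:
        return False
    multibyte = sum(1 for char in chars if len(char.encode("utf-8")) > 1)
    return multibyte / len(chars) > threshold
```

A title with only an occasional accented letter stays below the threshold, so ordinary Latin-script names with diacritics are not reported.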

Updated

I've adjusted the way this tool works behind the scenes; it should now identify a larger number of red links in each category. As always, please shout if there are problems. - TB (talk) 10:28, 13 June 2012 (UTC)

New feature: Check redirect and article titles

It is now possible to use this tool to check for articles and redirects matching the various patterns. So for example, one can search for articles containing mismatched brackets in their titles, or redirects containing HTML entities. N.B.: not all the patterns are particularly relevant to article and redirect titles - as with red links, matching a given pattern does not necessarily make a page or redirect title incorrect. - TB (talk) 20:52, 17 October 2012 (UTC)
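
The two examples mentioned here, mismatched brackets and HTML entities, could be checked along the following lines. This is only a sketch under the assumption that round brackets are the ones of interest; the function names and the entity expression are illustrative, not taken from the tool.

```python
import re

# Named, decimal or hexadecimal entities such as &amp; &#39; or &#x27;
HTML_ENTITY = re.compile(r"&(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);")


def has_mismatched_brackets(title):
    """True when round brackets in the title do not pair up in order."""
    depth = 0
    for char in title:
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
            if depth < 0:  # a closing bracket with nothing open
                return True
    return depth != 0  # something was opened but never closed


def has_html_entity(title):
    """True when the title contains something shaped like an HTML entity."""
    return HTML_ENTITY.search(title) is not None
```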

New feature: Caching

The performance of this tool has never been great, and it has seemed particularly poor of late. To help mitigate this, I've added a layer of caching. You may find it slow (perhaps around 2 minutes) to bring up the first set of results for any given query, but it should be pretty quick on the same set for an hour or two after this. As is always the case with caching, oddities may occur - I'll be tidying things up over the next week or two. Cheers. - TB (talk) 18:27, 21 October 2012 (UTC)

It is now possible to manually clear cached data for a given check, forcing the tool to re-evaluate things from scratch. This of course takes a minute or two to do, but will ensure that edits made to the live Wikipedia are reflected in the results it shows, replag notwithstanding. - TB (talk) 18:20, 28 October 2012 (UTC)
In addition to this, I've increased the duration after which caches are automatically re-evaluated from 2 hours to 24. The poor toolserver's struggling enough as it is, and you can always use this new feature to force the cache to be re-evaluated sooner if you prefer - TB (talk) 18:23, 28 October 2012 (UTC)
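
The behaviour described in this section (results kept per check, expiring after a fixed period, with a manual clear) matches a simple time-to-live cache. The class below is a generic sketch of that technique and not the tool's implementation; `CheckCache`, its methods and the injectable `clock` are all hypothetical names.

```python
import time


class CheckCache:
    """Keeps one result per check and recomputes it once it is too old."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._entries = {}   # check name -> (time stored, result)

    def get(self, check_name, compute):
        """Return the cached result, calling compute() if missing or expired."""
        now = self._clock()
        entry = self._entries.get(check_name)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        result = compute()
        self._entries[check_name] = (now, result)
        return result

    def clear(self, check_name):
        """Forget the cached result so the next get() starts from scratch."""
        self._entries.pop(check_name, None)
```

The trade-off is the one discussed above: a longer lifetime spares the server, at the cost of results that lag behind live edits until the cache is cleared or expires.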

New feature: Sorting

It is now possible to sort the results of the various unlikely checks alphabetically or by title length. I've also tidied up the tools for scrolling through the lists a bit. - TB (talk) 20:38, 23 October 2012 (UTC)

Would it be possible to sort by "On page", which would highlight clusters of fixes? welsh (talk) 17:25, 27 October 2012 (UTC)
Tricky. A red link may exist on several pages; to accurately sort by page would mean listing the same red link several times - once against each page it appears on. It might be easier to provide a separate listing of pages containing two or more examples of a given type of unlikely title - let me see what can be done. - TB (talk) 17:37, 27 October 2012 (UTC)
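
The "separate listing of pages containing two or more examples" could be built by grouping (red link, containing page) pairs by page. The sketch below assumes the data is available as such pairs, which is an assumption about the input rather than something stated here; `pages_with_clusters` is a hypothetical name.

```python
from collections import defaultdict


def pages_with_clusters(link_pairs, minimum=2):
    """Group (red_link, page) pairs by page, keeping pages with enough links.

    Returns (page, sorted red links) tuples, largest cluster first.
    """
    by_page = defaultdict(set)
    for red_link, page in link_pairs:
        by_page[page].add(red_link)
    clusters = [
        (page, sorted(links))
        for page, links in by_page.items()
        if len(links) >= minimum
    ]
    # Biggest clusters first, then alphabetical by page for a stable order.
    return sorted(clusters, key=lambda item: (-len(item[1]), item[0]))
```

A red link that appears on several pages is then listed once under each of them, which is the duplication mentioned above.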

New features

Some cool new features! Can I suggest that you add the Replag onto the top of the screen, as that helps understand what's going on? Missing spaces near brackets is a mine of broken links - there were a lot of chemicals and some placenames as false positives, but they have been whitelisted and there are c. 8000 links worth looking at. Just a small bug - when you've used a sort key and mark something as whitelisted, the display reverts to the don't care sort rather than the one previously selected. Keep up the good work! welsh (talk) 09:21, 24 October 2012 (UTC)

Ta. Replag's a bit of a fiddly concept now that there's a caching layer involved. I've added the meter for now, but note that (for now at least) titles may not always be added/removed from a given unlikely set until its cache 'expires'. I'm hoping to have the tool do a little more work behind the scenes to maintain the caches in the next few days, at which point it'll behave more sensibly. I've been working through Missing spaces near brackets applied to redirects (rather than redlinks) and have come across at least three classes of problematic redirects that can be corrected. An excellent unlikely check all round. Using the whitelist/dewhitelist option no longer causes the tool to forget about sorting. - TB (talk) 13:15, 24 October 2012 (UTC)

There seems to be a set of fixed articles that get stuck and remain in the to-be-done category. See, for example, article 135th_Georgia_General_Assembly linking to Dick_Lane_(Georgia_politician. I have rebuilt the cache, waited, and re-edited the article without the problem going away. Not a major issue, just annoying! welsh (talk) 07:46, 11 November 2012 (UTC)

It looks like the toolserver's copy of the database has become slightly corrupted. There are plans afoot to rebuild it entirely in the next few days - this should solve the problem. Relevant info in JIRA and on toolserver-l - TB (talk) 09:01, 11 November 2012 (UTC)

Base dataset rebuilt

I've adjusted the way the main redlinks list is generated behind the scenes. This tool will show a few extra redlinks that it didn't know about before. - TB (talk) 22:07, 25 October 2012 (UTC)

New set: Extra spaces near quotes

A variation on the recently added sets to do with extra (or missing) brackets, only this time looking for extra spaces around quotes, for example the red link AD_"_Ilinden" on Delčevo. - TB (talk) 19:57, 29 October 2012 (UTC)
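
A hedged sketch of how extra spaces just inside a pair of quotes might be spotted. It pairs up straight double quotes from left to right and reports any quoted segment that begins or ends with a space or underscore; this pairing rule is an assumption, and the tool's own pattern may well differ.

```python
import re

# Text between one straight double quote and the next.
QUOTED = re.compile(r'"([^"]*)"')


def has_extra_space_in_quotes(title):
    """True when a quoted segment starts or ends with a space or underscore."""
    return any(
        segment != segment.strip(" _")
        for segment in QUOTED.findall(title)
    )
```

Pairing the quotes first avoids reporting the gap between two separate quoted phrases, as in a title of the form "Foo" and "Bar".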

Tool Labs migration

I have migrated this tool from the Toolserver to the new Tool Labs setup. All being well it should be faster and operate more reliably at its new home - http://tools.wmflabs.org/tb-dev/unlikely/. - TB (talk) 21:57, 24 May 2013 (UTC)

I've had a chance to fine-tune this tool to better suit the new setup on Tool Labs. It should now be faster than ever. As always, kindly let me know if anything is broken. - TB (talk) 19:23, 8 June 2013 (UTC)


New sets: File:, Template: and Category: in other languages

Three new sets. When content is imported/translated from another language, namespace tags are often left untranslated. These sets attempt to pick out such tags. - TB (talk) 22:23, 8 June 2013 (UTC)

New pattern: Upper case disambig terms

I've added a new pattern to this tool to detect disambiguation terms that have been capitalised where normally they would not be. For now it detects the limited set of terms listed below - an analysis of the current crop of red links shows these to be the most commonly mis-capitalised. Happy to amend the list if any of these are correct as they are, or if you know of any more problematic terms to add. - TB (talk) 11:24, 29 August 2014 (UTC)

  • Actor - Album - Artist - Band - Bible - Bishop - Book - City - Company - Director - Documentary - Film - Game - General - Governor - Magazine - Mayor - Mormon - Movie - Novel - Politician - Producer - Radio - Rapper - Rugby - Singer - Single - Song - Soundtrack
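
Using the term list above, a check of this kind might be sketched as follows. The assumption here is that only a single capitalised term in the final pair of brackets is of interest; how the tool treats multi-word terms is not stated, and `has_uppercase_disambig_term` is a hypothetical name.

```python
import re

TERMS = [
    "Actor", "Album", "Artist", "Band", "Bible", "Bishop", "Book", "City",
    "Company", "Director", "Documentary", "Film", "Game", "General",
    "Governor", "Magazine", "Mayor", "Mormon", "Movie", "Novel",
    "Politician", "Producer", "Radio", "Rapper", "Rugby", "Singer",
    "Single", "Song", "Soundtrack",
]

# A listed term, capitalised, as the whole of the trailing bracketed part.
UPPER_DISAMBIG = re.compile(r"\((?:" + "|".join(TERMS) + r")\)$")


def has_uppercase_disambig_term(title):
    """True when the title ends in one of the listed terms, capitalised."""
    return UPPER_DISAMBIG.search(title) is not None
```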

New pattern: Non-Olympic years

Red links to do with the Olympics, but that include a year in which the event was not held. All a bit speculative, but interesting nonetheless. - TB (talk) 21:39, 9 September 2014 (UTC)

Very useful, I have reduced this from 327 to 99 by whitelisting valid links (mostly relating to the Youth Olympic Festival and Olympic qualifying events). I have also fixed quite a bunch of the errors, unfortunately replag is running around 60,000 due to the labs outage I guess, so it's hard to see the progress. All the best: Rich Farmbrough 16:37, 12 November 2014 (UTC).
Good job; yes this pattern was always going to generate a fair number of false positives. I'll see if I can't amend it to filter out the Youth Olympic Festival and Olympic qualifying events automatically. Cheers. - TB (talk) 17:11, 12 November 2014 (UTC)
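
To make the idea concrete, here is a speculative sketch of a non-Olympic-year check. It assumes the simple calendar of Summer Games every four years from 1896 (with 1916, 1940 and 1944 cancelled) and Winter Games moving to the alternate cycle from 1994; it does not know about the 1906 Intercalated Games, youth events or qualifiers, so it would produce exactly the false positives discussed above. All names are hypothetical.

```python
import re

# A four-digit year from 1800 to 2099 that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(1[89]\d\d|20\d\d)(?!\d)")
CANCELLED = {1916, 1940, 1944}


def is_olympic_year(year):
    """Rough test for a year in which Summer or Winter Games are dated."""
    if year in CANCELLED:
        return False
    summer = year >= 1896 and year % 4 == 0  # also covers Winter 1924-1992
    winter = year >= 1994 and year % 4 == 2
    return summer or winter


def mentions_non_olympic_year(title):
    """True when an Olympic-related title names a year with no Games."""
    if "olympic" not in title.lower():
        return False
    return any(not is_olympic_year(int(y)) for y in YEAR.findall(title))
```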


Trailing dots

Would be good to exclude the following:

  • \bPLC.
  • \bLtd.
  • \bX.Y. (for all capitals X and Y)
  • \bX. Y. (for all capitals X and Y)
  • \bInc.
  • \bJr.
  • \bSr.
  • \bCo.
  • \be.V.
  • \be. V.

All the best: Rich Farmbrough 19:58, 19 December 2014 (UTC).

Good idea; done (mostly) and apologies for the delay in doing so. - TB (talk) 18:06, 28 July 2015 (UTC)
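
The exclusions requested above could be folded into a trailing-dot check roughly like this. It is a sketch only: underscores are converted to spaces first because `\b` treats an underscore as a word character, and the reply says the change was done "mostly", so the tool's real exclusion list may differ. The names are hypothetical.

```python
import re

# Endings that legitimately finish with a dot, per the list above.
EXCLUDED_ENDING = re.compile(
    r"\b(?:PLC|Ltd|Inc|Jr|Sr|Co|[A-Z]\. ?[A-Z]|e\. ?V)\.$"
)


def has_suspicious_trailing_dot(title):
    """True when the title ends in a dot that is not a known abbreviation."""
    text = title.replace("_", " ")
    return text.endswith(".") and EXCLUDED_ENDING.search(text) is None
```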

Database maintenance

This tool can now be applied to Wikipediae in languages other than English. As part of this process, the English-language database has been moved to a new location - this should not have caused any problems, but please let me know if anything is broken. I'll happily add any languages to the tool that may be useful, but will have to depend on requestors to sort out which rules might be applicable and to suggest new rules specific to differing languages. - TB (talk) 14:20, 3 October 2015 (UTC)

Malfunction report

Dear @Topbanana: Querying ptwiki now fails with

Table 'p50380g50491__unlikely_ptwiki_p.unlikely_patterns' doesn't exist

Querying enwiki, and lvwiki, works just fine. Perhaps it is related to the latest change... What do you think? --Usien6 msghis 18:22, 25 July 2016 (UTC)

All fixed - no idea what went wrong, the entire database seems to have vanished from the WP:tool labs server - either that or I forgot to set it up in the first place :) - TB (talk) 20:33, 1 September 2016 (UTC)

New pattern: Missing pipes

New pattern available that tries to catch the case when an unsuccessful attempt has been made to use a piped link to hide a disambiguation term in parentheses. Honestly, it was harder to parse that into English than it was to express as a regular expression. Just look at the results of the pattern - it'll make sense I'm sure... - TB (talk) 20:25, 16 February 2017 (UTC)
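
Since the regular expression itself is not shown, the following is purely a guess at one shape such a pattern could take: a title where the intended display text has been left attached after the bracketed term, as in "Mercury (planet) Mercury" written in place of a piped link. The expression and the name `looks_like_missing_pipe` are assumptions for illustration only.

```python
import re

# base, a bracketed term, then the same base repeated where the pipe
# and display text should have been.
MISSING_PIPE = re.compile(
    r"^(?P<base>.+?)[ _]\((?P<term>[^()]+)\)[ _]?(?P=base)$",
    re.IGNORECASE,
)


def looks_like_missing_pipe(title):
    """True when the title repeats its base name after a bracketed term."""
    return MISSING_PIPE.match(title) is not None
```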