User:Blevintron/Bot

This page is about my bot to combat link rot.

Purpose

To improve Wikipedia by identifying and marking broken links. To combat link rot by alerting the right user that a link has gone bad.

Haven't we solved this problem already?

No, there are tons of broken links on Wikipedia. Even if Wikipedia has other means of identifying and repairing broken links, they are not effective enough. Preliminary trials of this bot have discovered 10,789 broken links across 38,431 articles. That's more than 1 broken link per 4 articles. Those numbers exclude links which are already marked as broken.

Why is this different?

The bot tries to alert the right user when a link goes bad. The right user is the user who originally introduced that broken link to that article. This will be more effective than other approaches because the right user has shown interest in the article (by contributing in the past) and has shown some expertise (by adding a link).

The case studies section shows some examples of those messages.

Will that work? Won't that just be annoying?

There are two concerns relating to this approach:

  1. Will the actions of this bot be effective? Specifically, will the bot's actions encourage human editors to repair broken links?
  2. Are the solicitation messages necessary, or will they only annoy the human editors?

First, I should mention that the bot supports an opt-out feature (via {{bots |deny=BrokenLinkBot}} on articles or user pages), and that it has a strict limit on the number of messages it sends to one user in one day. The opt-out feature is advertised at the end of every solicitation message.

I hypothesize that the bot will be effective and will not be annoying. To answer the question concretely, we need an experiment. During normal operation, the bot collects information about the articles it has modified. Those results are tabulated and discussed at the experiment page. After a sufficient amount of experimental data has been collected, we can evaluate the effectiveness/annoyance of the bot and—if necessary—pursue a different approach.

Why not post broken links to a link rot group instead?

The problem with a volunteer link rot group is that it has (relatively) few volunteers to handle a huge number of broken links. It's a bottleneck—for a link rot group to work, every volunteer must tackle a large number of links. Further, these volunteers don't necessarily know much about the articles they try to fix. It can be very difficult for a volunteer to replace a link if that volunteer never knew what that link pointed to.

My plan spreads this work over a very large number of contributors, so each contributor has less work to do. Since these contributors added the link, we can assume that (1) they know about the subject matter, (2) they know what the link used to point to, and (3) they care enough about the article to contribute.

Why not contact the most recent, most active contributors to that article instead?

Not every article has recent or active contributors. Articles that have recent or active contributors probably have fewer broken links. The power of this bot is that it can work to fix broken links in the long tail of Wikipedia articles.

Community Discussions

There is a discussion of this bot on the village pump idea lab.

Technical Details

  • The bot is written in Ruby, using the standard libraries. The Wikipedia interface is hand-rolled. It runs on Linux; it could probably be ported to Mac or Win32 with a little effort.
  • The source code is released as open source.
  • I will host this bot myself.

Tasks

See also the sections about throttling.

TASK 0: Checking Links

This is a read-only task.

The bot maintains a pool of questionable links: those which are not clearly working or broken. Each link in that pool is checked at regular intervals until the bot is confident that the link is truly broken.

  • The bot carefully avoids links which have already been marked as broken, e.g. <ref>[http://ignored/] Some reference {{broken link}}</ref>
  • The bot carefully avoids links which already have an archive url specified, e.g. {{Cite web |url=http://ignored/ |archiveurl=http://something/}}

I have a working prototype of this task.
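As an illustration only (this is not the bot's actual source), the two exclusion rules above could be sketched with plain regular expressions. The helper name `candidate_links` and the exact patterns are assumptions; real wikitext has many more link and template forms than this handles.

```ruby
# Hypothetical helper: list external links in wikitext that still need checking.
# Assumption: links appear either as bracketed links or as |url= parameters.
def candidate_links(wikitext)
  # Drop citation templates that already name an archive copy.
  text = wikitext.gsub(/\{\{\s*cite[^{}]*\|\s*archiveurl\s*=[^{}]*\}\}/i, '')

  links = []
  # Bracketed external links: [http://example.org/ optional label]
  text.scan(/\[(https?:\/\/[^\s\]]+)[^\]]*\]([^<\n]*)/) do |url, trailing|
    # Skip links that are already followed by a {{broken link}} tag.
    next if trailing =~ /\{\{\s*broken link/i
    links << url
  end

  # url= parameters in the citation templates that remain.
  links.concat(text.scan(/\|\s*url\s*=\s*(https?:\/\/[^\s|}]+)/i).flatten)
  links.uniq
end
```

Each returned URL would then enter the pool of questionable links and be retried over time.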

TASK 1: Marking Links

The bot edits a Wikipedia article, adding the {{broken link}} template to broken links, citations, etc.

  • The bot carefully avoids links which have already been marked as broken, e.g. <ref>[http://ignored/] Some reference {{broken link}}</ref>
  • The bot carefully avoids links which already have an archive url specified, e.g. {{Cite web |url=http://ignored/ |archiveurl=http://something/}}

I have a working prototype of this task.
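A minimal sketch of the marking step, assuming the link appears as a bracketed external link. The helper name `mark_broken` and its keyword arguments are hypothetical; the tag layout follows the diffs in the case studies on this page.

```ruby
# Hypothetical helper: append a {{Broken link}} tag right after the bracketed
# link for `url`, unless such a tag already follows it.
def mark_broken(wikitext, url, date:, bot:)
  tag = "{{Broken link | date=#{date} | bot=#{bot} }}"
  # Match "[url ...]" exactly (the URL must end at whitespace or "]"),
  # but only when no {{broken link}} template comes next.
  pattern = /(\[#{Regexp.escape(url)}(?=[\s\]])[^\]]*\])(?!\s*\{\{\s*(?i:broken link))/
  wikitext.gsub(pattern) { "#{$1} #{tag}" }
end

before = 'death<ref>[http://www.towsontigers.com/johnnyu/index.asp]</ref>.'
after  = mark_broken(before, 'http://www.towsontigers.com/johnnyu/index.asp',
                     date: 'March 2012', bot: 'BrokenLinkBot/20120322.174726')
puts after
```

Because of the negative lookahead, running the helper a second time on its own output leaves the text unchanged.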

TASK 2: Soliciting Help for Link Repair from Human Contributors

The bot scans the revision history of the article to find which user first added that link to the article. It sends a polite message to that user's User_talk: page, alerting them that the link has gone bad, suggesting possible archive matches, and encouraging them to fix the broken link. Examples of these messages can be found in the case studies section.

I have a working prototype of this task.
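The history scan can be sketched as a search for the oldest revision whose text contains the URL. This is an assumption about the approach, not the bot's code: the `Revision` struct, `first_adder`, and `added_month` are hypothetical names, and the revisions are assumed to be already fetched and sorted oldest first.

```ruby
# Hypothetical record for one revision of an article.
Revision = Struct.new(:user, :timestamp, :text)

# Returns the oldest revision containing `url`, or nil if none does.
# `revisions` must be ordered oldest first.
def first_adder(revisions, url)
  revisions.find { |rev| rev.text.include?(url) }
end

# Formats the month used in messages such as "added it in February 2007".
def added_month(revision)
  revision.timestamp.strftime('%B %Y')
end
```

A link that was removed and later restored by someone else would still be attributed to the earliest contributor under this simple rule.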

TASK 3: Collect Statistics to Evaluate Effectiveness, Participation and Annoyance

This task will only modify the bot's user space and the operator's user space.

The bot will scan its own edits to determine if any of them were reverted. If so, it notifies the bot operator. The bot operator can use this to improve the bot. Similarly, it will analyze revision history of the article and of the users it has contacted. It will use this information to measure how effective it is at eliminating broken links; see the effectiveness experiment.

I have a working prototype of this task.

TASK 4: Uploading its source code, status, etc. to Wikipedia

The bot will upload its source code to its user page.

The bot will upload its status (running, software version, throttling parameters, etc.) to a user page.

I have a working prototype of this task.

Good Communication

Here are some examples of its communications in response to a few articles with broken links. I don't mean to pick on these articles or these authors; these were randomly selected by the bot.

The important aspects of these communications are:

  • They are polite;
  • They don't accuse people of posting broken links, but instead emphasize that links go bad over time;
  • They clearly advertise the opt-out feature;
  • When available, they suggest archived copies; and
  • They link to the relevant Wikipedia policies on link rot and bots.

Case Study: Johnny Unitas Stadium

Here's an example of what it would write in response to broken links in the page Johnny Unitas Stadium.

Edit Summary

Marked broken link http://www.towsontigers.com/johnnyu/index.asp; Marked broken link http://www.towsontigers.com/facilities/footballhouse.asp. Report problems to User_talk:Blevintron.

Message to User_talk:Thx2005, new section: Broken links in article 'Johnny Unitas Stadium'

Thank you for your contributions to the article 'Johnny Unitas Stadium'. Sadly, some of the links that you added have died. The article needs your help to repair link rot.

This link has died after you added it in February 2007:

This link has died after you added it in July 2007:

I'm just a bot, so I don't really know how to fix the problem. Could you please take a look? Thanks!

PS- if you don't want BrokenLinkBot to contact you, simply add {{bots |deny=BrokenLinkBot }} to your user page or your user talk page.

~~~~

Diffs

These are the changes to the article. Line wraps are introduced to make the diff readable, but are not inserted into the article.

 
 in 2002.  In fact, Unitas threw his last public pass at the re-opening of the
 facility (as Towson Stadium) just a few days before his 
-death<ref>[http://www.towsontigers.com/johnnyu/index.asp]</ref>.  His widow,
-Sandy, felt it appropriate to honor him by having the stadium named for him 
-instead, with fund-raising in his name taking the place of the money that a
-corporate naming would have supplied.
+death<ref>[http://www.towsontigers.com/johnnyu/index.asp] {{Broken link |
+date=March 2012 | bot=BrokenLinkBot/20120322.174726 }}</ref>.  His widow, Sandy,
+felt it appropriate to honor him by having the stadium named for him instead,
+with fund-raising in his name taking the place of the money that a corporate
+naming would have supplied.
 
 ==External links==
 *[https://admin.xosn.com/ViewArticle.dbml?DB_OEM_ID=21300&ATCLID=1511688 Towson
 Athletics - Johnny Unitas Stadium]
 *[http://www.towsontigers.com/facilities/footballhouse.asp Towson Athletics -
-Field House]
+Field House] {{Broken link | date=March 2012 | bot=BrokenLinkBot/20120322.174726
+}}
 

Case Study: Mohammed Ali Hammadi

Here's an example of what it would write in response to broken links in the page Mohammed Ali Hammadi.

Edit Summary

Marked broken link http://service.spiegel.de/cache/international/0,1518,391177,00.html; Marked broken link http://www.fbi.gov/pressrel/pressrel06/mostwantedterrorists022406.htm; Marked broken link http://www.fbi.gov/page2/feb07/rewards021207.htm. Report problems to User_talk:Blevintron.

Message to User_talk:DBaba, new section: Broken link in article 'Mohammed Ali Hammadi'

Thank you for your contributions to the article 'Mohammed Ali Hammadi'. Sadly, a link that you added has died. The article needs your help to repair link rot.

This link has died after you added it in September 2006:

I'm just a bot, so I don't really know how to fix the problem. Could you please take a look? Thanks!

PS- if you don't want BrokenLinkBot to contact you, simply add {{bots |deny=BrokenLinkBot }} to your user page or your user talk page.

~~~~

Message to User_talk:LDH, new section: Broken link in article 'Mohammed Ali Hammadi'

Thank you for your contributions to the article 'Mohammed Ali Hammadi'. Sadly, a link that you added has died. The article needs your help to repair link rot.

This link has died after you added it in February 2007:

I'm just a bot, so I don't really know how to fix the problem. Could you please take a look? Thanks!

PS- if you don't want BrokenLinkBot to contact you, simply add {{bots |deny=BrokenLinkBot }} to your user page or your user talk page.

~~~~

Message to User_talk:Steven Russell, new section: Broken link in article 'Mohammed Ali Hammadi'

Thank you for your contributions to the article 'Mohammed Ali Hammadi'. Sadly, a link that you added has died. The article needs your help to repair link rot.

This link has died after you added it in June 2006:

I'm just a bot, so I don't really know how to fix the problem. Could you please take a look? Thanks!

PS- if you don't want BrokenLinkBot to contact you, simply add {{bots |deny=BrokenLinkBot }} to your user page or your user talk page.

~~~~

Diffs

These are the changes to the article. Line wraps are introduced to make the diff readable, but are not inserted into the article.

 There has been speculation that his parole was granted as part of a covert
 prisoner swap, in exchange for the release of [[Susanne Osthoff]].  Taken
 hostage in Iraq a month prior, Osthoff was released the week of Hammadi's
 parole.<ref>[http://service.spiegel.de/cache/international/0,1518,391177,00.html
-Freed Osthoff Not Heading Home Yet]</ref>
+Freed Osthoff Not Heading Home Yet] {{Broken link | date=March 2012 |
+bot=BrokenLinkBot/20120322.174726 }}</ref>

 On February 24, 2006, he joined his accomplices on the FBI's Most Wanted
 Terrorists list, under the name '''Mohammed Ali Hamadei'''.<ref
 name="24threlease">[http://www.fbi.gov/pressrel/pressrel06/mostwantedterrorists022406.htm
 FBI Updates Most Wanted Terrorists and Seeking Information – War on Terrorism
-Lists], ''FBI national Press Release'', February 24, 2006</ref> 
+Lists] {{Broken link | date=March 2012 | bot=BrokenLinkBot/20120322.174726 }},
+''FBI national Press Release'', February 24, 2006</ref> 
 
 On February 12, 2007, the FBI announced<ref
 name="reward">[http://www.fbi.gov/page2/feb07/rewards021207.htm FBI 2007
-announcement of reward offer]</ref> a new $5 million reward for information
-leading to the recapture of Hammadi.
+announcement of reward offer] {{Broken link | date=March 2012 |
+bot=BrokenLinkBot/20120322.174726 }}</ref> a new $5 million reward for
+information leading to the recapture of Hammadi.

Case Study: Sean Kennard

Here's an example of what it would write in response to broken links in the page Sean Kennard.

Edit Summary

Marked broken link http://www.simc.jp/2004/index_p_e.html; Marked broken link http://www.chopin.org/ip.asp?op=2005; Marked broken link http://www.tomaszmagierski.com/gallery_SK.html. Report problems to User_talk:Blevintron.

Message to User_talk:Brandamber, new section: Broken links in article 'Sean Kennard'

Thank you for your contributions to the article 'Sean Kennard'. Sadly, some of the links that you added have died. The article needs your help to repair link rot.

These links have died after you added them in March 2009:

I'm just a bot, so I don't really know how to fix the problem. Could you please take a look? Thanks!

PS- if you don't want BrokenLinkBot to contact you, simply add {{bots |deny=BrokenLinkBot }} to your user page or your user talk page.

~~~~

Diffs

These are the changes to the article. Line wraps are introduced to make the diff readable, but are not inserted into the article.

 outside of Asia and Europe to win a prize in the 
 competition.<ref>"[http://www.simc.jp/2004/index_p_e.html Results of the 2nd 
-SIMC -Piano Section-]". [[Sendai International Music Competition]]. 2004.
-Retrieved on
+SIMC -Piano Section-] {{Broken link | date=March 2012 |
+bot=BrokenLinkBot/20120322.174726 }}". [[Sendai International Music
+Competition]]. 2004. Retrieved on
 2009-03-02.</ref><ref>"[http://japansclassic.com/news/040705/01.html Classical
 Music News]". Japan's Classical Music Artists. 2004-07-05. Retrieved on
 2009-03-02.</ref>  He went on from Curtis to study with [[Enrique Graf]] at the 
 After beginning studies with Graf, Kennard won various other prizes in piano
 competitions, including the [[Chopin Foundation of the United States|National
 Chopin Competition]],<ref>"[http://www.chopin.org/ip.asp?op=2005 7th National
-Chopin Competition]". [[Chopin Foundation of the United States]]. 2005-03.
-Retrieved
+Chopin Competition] {{Broken link | date=March 2012 |
+bot=BrokenLinkBot/20120322.174726 }}". [[Chopin Foundation of the United
+States]]. 2005-03. Retrieved
 2009-03-02.</ref><ref>"[http://www.usc.edu/dept/polish_music/news/apr05.html
 Polish Music Newsletter]". [[University of Southern California]]. 2005-04.
 Retrieved 2009-03-02.</ref> Iowa Piano
 the documentary "Pianists," relating the story of the American pianists as they
 travelled to Poland to participate in the 2005 [[International Frederick Chopin
 Piano Competition]].<ref>"[http://www.tomaszmagierski.com/gallery_SK.html
-Pianists: Defining Chopin]". Tomasz Magierski. 2006. Retrieved on
+Pianists: Defining Chopin] {{Broken link | date=March 2012 |
+bot=BrokenLinkBot/20120322.174726 }}". Tomasz Magierski. 2006. Retrieved on
 2009-03-02.</ref>

No Harm

Exclusions Compliant

This bot respects {{inuse}}, {{bots}}, {{nobots}}, and all their variants before editing articles or contacting users via User_talk: pages.

When checking links, this bot respects robots.txt on third party sites.
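For illustration, a simplified opt-out check might look like the following. It is a sketch under stated assumptions, not the bot's implementation: it only recognises {{nobots}} and the `deny=` form of {{bots}}, compares bot names case-insensitively, and ignores `allow=` lists, {{inuse}}, and other variants that a compliant bot also has to honor. The names `BOT_NAME` and `opted_out?` are hypothetical.

```ruby
BOT_NAME = 'BrokenLinkBot'

# Simplified sketch: true when the page text asks this bot to stay away.
def opted_out?(wikitext, bot_name = BOT_NAME)
  # {{nobots}} excludes every bot.
  return true if wikitext =~ /\{\{\s*nobots\s*\}\}/i

  # {{bots |deny=A,B}} excludes the listed bots; "all" excludes every bot.
  deny_lists = wikitext.scan(/\{\{\s*bots\s*\|\s*deny\s*=\s*([^}|]*)/i).flatten
  deny_lists.any? do |list|
    names = list.split(',').map { |name| name.strip.downcase }
    names.include?('all') || names.include?(bot_name.downcase)
  end
end

puts opted_out?('{{bots |deny=BrokenLinkBot }}')
```

The same check would run against both the article and the target User_talk: page before any edit or message.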

Declaring a link 'broken' with High Confidence

Here, 'broken' means that all trials of the link consistently yielded: a DNS lookup error, an HTTP Connect timeout, a connection refused error, an HTTP 404 error, or an HTTP 5xx error.

  • The bot carefully avoids links which have already been marked as broken, e.g. <ref>[http://ignored/] Some reference {{broken link}}</ref>
  • The bot carefully avoids links which already have an archive URL specified, e.g. {{Cite web |url=http://ignored/ |archiveurl=http://something/}}


A link must be checked several times (MIN_LINK_FAILURES) with sufficient time between checks (LINK_TRIAL_PERIOD) to ensure that a link is broken, not simply a temporary network or server failure. For instance, it will only declare a link broken if it failed to load 3 times over a period of 5 days.
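The decision rule can be sketched as follows. The constant values are the example figures from the text (3 failures over 5 days). The `Trial` struct, the outcome symbols, and the reading of LINK_TRIAL_PERIOD as the minimum span between the first and last check are assumptions made for this sketch.

```ruby
MIN_LINK_FAILURES = 3                 # example value from the text
LINK_TRIAL_PERIOD = 5 * 24 * 60 * 60  # five days, in seconds (example value)

# Hypothetical record of one check. `outcome` is :ok, :dns_error, :timeout,
# :refused, or an Integer HTTP status code.
Trial = Struct.new(:time, :outcome)

# True for the failure kinds listed above: DNS error, connect timeout,
# connection refused, HTTP 404, or any HTTP 5xx.
def failure?(outcome)
  case outcome
  when :dns_error, :timeout, :refused then true
  when Integer then outcome == 404 || (500..599).cover?(outcome)
  else false
  end
end

# A link is declared broken only if every trial failed, there were enough
# trials, and they were spread over a long enough period.
def confidently_broken?(trials)
  return false if trials.size < MIN_LINK_FAILURES
  return false unless trials.all? { |trial| failure?(trial.outcome) }

  times = trials.map(&:time)
  times.max - times.min >= LINK_TRIAL_PERIOD
end
```

A single successful load anywhere in the history keeps the link out of the broken set, which matches the requirement that all trials fail consistently.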

These checks are performed from a good vantage point on the internet: a prominent university in the USA. National Internet censorship should not have a significant impact on reputable sources.

Emergency Shutdown

The bot supports an emergency shutdown page on Wikipedia. To shut it down, edit that page and add the word 'shutdown'.
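One way such a check could gate every write is sketched below. The helper names are hypothetical, fetching the shutdown page is left out, and matching the word case-insensitively is an assumption.

```ruby
# True when the text of the shutdown page contains the word 'shutdown'.
# Assumption: the match is case-insensitive; a missing page counts as "run".
def shutdown_requested?(page_text)
  page_text.to_s.downcase.include?('shutdown')
end

# Runs the given block (an edit or a message) only if no shutdown is requested.
def guarded_edit(shutdown_page_text)
  return :halted if shutdown_requested?(shutdown_page_text)

  yield
  :done
end
```

Checking immediately before each edit, rather than once at startup, keeps the delay between a shutdown request and the bot stopping as short as possible.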

Throttling & Maxlag

The bot is carefully throttled for two reasons: first, to minimize load on Wikipedia servers, and second, to make it easier for human reviewers to keep up.

Some of the more important limits are listed here:

  • the rate at which Wikipedia is scraped during high-traffic (HIGH_TRAFFIC_SCRAPE_PERIOD) and low-traffic (LOW_TRAFFIC_SCRAPE_PERIOD) times.
  • the rate at which the bot edits articles during high-traffic (HIGH_TRAFFIC_EDIT_PERIOD) and low-traffic (LOW_TRAFFIC_EDIT_PERIOD) times.
  • the maximum number of links that will be corrected in a single edit to a single article (MAX_LINKS_PER_EDIT). This simplifies a reviewer's job, since diffs are smaller.
  • the minimum time between two edits to the same article (MIN_EDIT_PERIOD_PER_ARTICLE). This prevents high-frequency edit wars against humans or other bots.
  • the maximum number of edits to perform in a single calendar day (MAX_EDITS_PER_DAY).
  • the maximum number of User_talk: messages to send to a single user on a single calendar day (MAX_SOLICITATIONS_PER_USER_PER_DAY).
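As an example of how one of these limits might be enforced, here is a sketch of a per-user daily counter for MAX_SOLICITATIONS_PER_USER_PER_DAY. The class name and its in-memory storage are assumptions; a real bot would need to persist the counts across restarts.

```ruby
require 'date'

# Hypothetical limiter: counts messages sent per user per calendar day.
class SolicitationLimiter
  def initialize(max_per_user_per_day)
    @max = max_per_user_per_day
    @counts = Hash.new(0) # [user, date] => messages sent that day
  end

  # True if another message to `user` is still allowed on `today`.
  def allow?(user, today = Date.today)
    @counts[[user, today]] < @max
  end

  # Call after a message has actually been posted.
  def record(user, today = Date.today)
    @counts[[user, today]] += 1
  end
end
```

Keying the counter on the calendar date means the allowance resets at midnight without any cleanup step.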

The bot uses the max lag parameter.
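A rough sketch of what using `maxlag` involves: the parameter is added to each MediaWiki API request, and when the servers are lagged the API answers with a `maxlag` error and a Retry-After header instead of performing the request. The endpoint URL, the default of 5 seconds, and the helper names below are assumptions for illustration, not details taken from the bot.

```ruby
require 'uri'

API_ENDPOINT = 'https://en.wikipedia.org/w/api.php' # assumed endpoint

# Builds an API URL that carries the maxlag parameter.
def api_url(params, maxlag: 5)
  query = URI.encode_www_form(params.merge(maxlag: maxlag, format: 'json'))
  "#{API_ENDPOINT}?#{query}"
end

# Given a parsed JSON response and the Retry-After header value (or nil),
# returns the number of seconds to wait before retrying, or nil when the
# response is not a maxlag error.
def maxlag_wait(response_json, retry_after_header)
  return nil unless response_json.dig('error', 'code') == 'maxlag'

  (retry_after_header || 5).to_i
end

puts api_url({ action: 'query', titles: 'Johnny Unitas Stadium' })
```

The caller would sleep for the returned number of seconds and then repeat the same request.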