Wikipedia:WikiProject Red Link Recovery/Link matching script

From Wikipedia, the free encyclopedia
Note the description below is based on SQL dumps which have been discontinued. Current dumps are XML.

The script below details how the lists of suggested alternate targets for red links were created.

  1. Get yourself a mysql database. It's free to download, easy to set up and runs on most types of computer. The process should work on any reasonably recent version of the mysql software.
  2. Download the latest page and pagelinks tables from here. There are normally labelled enwiki-latest-page.sql.gz and enwiki-latest-pagelinks.sql.gz respectively.
  3. Unzip these files. Most file decompression utilities can do this, for example winzip or the freely available gzip.
  4. Import the unzipped tables into your mysql database. To do this start a mysql session and create a new database (CREATE DATABASE redlinks;) and make sure it's your current database (USE redlinks;). Now use source <filename> to import each of your files. This make take some time, the wikipedia database contains a lot of data.

You now have all the data needed to some quite nifty red link recovery on your PC, ready to use. For more information on this process, check out Wikipedia:Database download.

Now run this script:


DROP TABLE IF EXISTS crushed_art;
DROP TABLE IF EXISTS crushed_links;

CREATE TABLE crushed_art (
  id int(8) unsigned NOT NULL,
  title varchar(255) BINARY NOT NULL
);

CREATE TABLE crushed_links (
  id int(8) unsigned NOT NULL,
  orig varchar(255) BINARY NOT NULL,
  link varchar(255) BINARY NOT NULL
);

INSERT INTO crushed_art
SELECT page_id, page_title
FROM page
WHERE page_namespace = 0;

-- An index helps this next query a lot
ALTER TABLE page ADD INDEX ( page_namespace, page_title );

INSERT INTO crushed_links
SELECT pl_from, pl_title, pl_title
FROM pagelinks l
LEFT OUTER JOIN page t ON l.pl_namespace = t.page_namespace AND l.pl_title = t.page_title
INNER JOIN page f ON l.pl_from = f.page_id
WHERE f.page_namespace = 0
AND   l.pl_namespace = 0
AND   t.page_id IS NULL;


This creates two new tables (crushed_links and crushed_art) and copies all the data related to articles in the main wikipedia namespace - namespace 0.

The reason we've made copies of the data is because we're about to mangle it - or as I like to call the process, crushing. The goal of this process is to transform all of the red links and the article titles into simpler forms, then check if and of the simpler red links now match any of the simpler article titles. So for example if we are considering repeated instances of the letter 's', we might transform the article title Mississippi into Misisippi, Similarly we might transform a red link to Misssisippi into Misisippi. As the red link and title now match we might have found a good candidate to suggest!

Some example crushing operations are:

Having carried out some crushing operations that we believe might have produced useful matches, we now need to pick out these matches and carry out a little checking on them:


DROP TABLE IF EXISTS suggestions;

CREATE TABLE suggestions (
    title varchar(255) binary NOT NULL,
    link varchar(255) binary NOT NULL,
    suggestion varchar(255) binary NOT NULL,
    PRIMARY KEY ( title, link, suggestion )
);

ALTER TABLE crushed_links ADD INDEX( link );

REPLACE INTO suggestions
SELECT f.page_title, orig, t.page_title
FROM   page f, crushed_links, crushed_art, page t
WHERE  f.page_id = crushed_links.id
AND    f.page_namespace = 0
AND    t.page_id = crushed_art.id
AND    t.page_namespace = 0
AND    link = title;

Next it's often useful to carry out a bit of automated weeding out of suggestions we expect to be quite poor. These might include:

  • Suggestions to transform red links into the name of the article that contains them.
  • Red links that are very short after crushing. Short links are likely to match many spurious article titles.
  • Links that have already been suggested and marked as incorrect by project members - see the exception list
  • Suggestions that are repeated for many articles.

As a rule of thumb, I try to ensure that over 75% of all entries on a list of suggestions are valid, even if this means discarding a large number of entries that have a lesser but still significant chance of being correct. There are over 2 million red links out there - it's not worth trying to develop complex techniques when sheer numbers ensure that simple ones produce amply long lists.


DELETE FROM suggestions WHERE title = suggestion;
DELETE FROM suggestions WHERE length( suggestion ) < 5;
DELETE FROM suggestions
WHERE EXISTS (
  SELECT 1
  FROM exceptions
  WHERE suggestions.title = exceptions.title 
  AND   suggestions.link = exceptions.link 
  AND   suggestions.suggestion = exceptions.suggestion );

SELECT count(*), link, suggestion
FROM suggestions
GROUP BY link, suggestion
ORDER BY 1 desc
LIMIT 25;

To export your cleaned-up list of suggestions, the simplest thing to do is use mysql's tee command. This copies the output of your mysql session into a file. This file can then be opened in a text-editor (I like textpad) and prepared for cutting and pasting into wikipedia. Do be cautious of trying to cut and paste results from a command window under Microsoft Windows, this can lead to uncommon characters being lost.

A simple export of your suggestions table to a text file could be done by:


tee c:\suggestions.txt;

SELECT concat( '*[[', title, ']] links to  [[', link, ']], try [[', suggestion, ']]' )
FROM suggestions;

notee;

If your have mysql version 5.0 or newer, much tidier output can be had using the script below. Start the mysql client with the options --disable-column-names --silent to strip out any pretty textboxes. You may need to log in as your mysql sysadmin to have permission to create a stored procedure.


DROP PROCEDURE IF EXISTS report_single_suggestions;
DELIMITER //

CREATE PROCEDURE report_single_suggestions( group_size INT )
BEGIN
   DECLARE sug_pos, sug_base, done INT;
   DECLARE title, link, suggestion VARCHAR(255);
   DECLARE sug CURSOR FOR SELECT * FROM suggestions;
   DECLARE CONTINUE HANDLER FOR SQLSTATE '23000' SET done = 1;

   SET sug_pos = 0;
   SET sug_base = 0;
   SET done = 0;
   OPEN sug;

   REPEAT

     FETCH sug INTO title, link, suggestion;

     IF NOT done THEN
       IF sug_pos = 0 THEN
          SELECT concat( '=== ', sug_base, ' - ', sug_base + group_size - 1, ' ===' );
          SET sug_base = sug_base + group_size;
          SET sug_pos = group_size - 1;
       ELSE
          SET sug_pos = sug_pos - 1;
       END IF;

       SELECT concat( '*[[', title, ']] links to [[', link, ']], try [[', suggestion, ']]' );
     END IF;

  UNTIL done END REPEAT;
  
  CLOSE sug;
END;
//

DELIMITER ;

call report_single_suggestions( 10 );

Where a given method of crushing gives several suggestions per re link, a variation on this stored procedure can produce much neater output:


DROP PROCEDURE report_multiple_suggestions;
delimiter //

CREATE PROCEDURE report_multiple_suggestions( group_size INT )
BEGIN
  DECLARE l_count, l_moredata int;
  DECLARE l_line varchar(1024);
  DECLARE sug CURSOR FOR
         SELECT concat( '*[[', title, ']] links to [[', link, ']], try ', group_concat( concat( '[[', suggestion, ']]' ), ' ' ) )
         FROM suggestions
         GROUP BY title, link;  
  DECLARE EXIT HANDLER FOR NOT FOUND SET l_moredata = 0;

  SET l_count = 0;  
  SET l_moredata = 1;
  
  OPEN sug;
  
  WHILE l_moredata DO
  
    IF mod( l_count, group_size ) = 0 THEN
       SELECT concat( '=== ', l_count, ' - ', l_count + group_size - 1, ' ===' );    
    END IF;
    
    SELECT l_line;  

    SET l_count = l_count + 1;
    FETCH sug into l_line;
    
  END WHILE;
  
  CLOSE sug;

END;
//

delimiter ;

CALL report_multiple_suggestions( 10 );