User:GreenC/WorksByProject

For quick information on what this script ("bot") does, see the "Script tasks" list under Step 4 below.

"Works By" Project is a project to add {{Gutenberg author}}, {{Librivox author}} and {{Internet Archive author}} to the External Links section of eligible articles. It also migrates {{Worldcat id}} to {{Authority control}} described in Step 4 #3 below.

Background

There is general consensus for linking to Project Gutenberg, LibriVox and Internet Archive. However, most of the authors on these websites who have matching Wikipedia articles are not linked. For example, of the 4,401 LibriVox authors with a Wikipedia article, only a few hundred have a {{Librivox author}} template. Gutenberg fares better: of about 9,500 Gutenberg authors with a Wikipedia article, about 4,500 have a {{Gutenberg author}} template (still less than 50%). Internet Archive has hundreds of thousands of authors, but only a tiny fraction are linked from Wikipedia. The problem continues to get worse as these sites and Wikipedia grow.

For Project Gutenberg, the problem is nearly intractable because adding the template manually is extremely time consuming. Determining which Wikipedia article maps to which Gutenberg account can take 5 to 15 minutes of labor; for example, Charlotte Maria Tucker on Wikipedia maps to A. L. O. E. on Gutenberg. Going through the 21,000 names in the Gutenberg catalog (as of December 2014) and manually searching Wikipedia for articles under variant spellings, pseudonyms, disambiguations, etc. would take perhaps 6 to 12 months of nearly full-time work. Many cases take considerable mental and search effort to piece together a PG -> Wikipedia mapping using resources such as the PG book preface, Open Library data, etc. And once done, you would have to start over from scratch and do it again to account for newly added Wikipedia articles. It's endless. The end result is that in the 15 years of Wikipedia's existence, the majority of names are still not linked, despite a group attempt to do so that has since stalled (an admirable effort, considering it was done manually).

A solution

The only realistic solution is to automate as much as possible: software that creates databases mapping Wikipedia IDs to the external service IDs. So I wrote a suite of tools and procedures that combine automation where possible and manual labor where required. It's a huge project, but one doable by a single person and repeatable in the future.

Step 1: LibriVox

LibriVox is the easiest, since each author page on its website links back to Wikipedia. It is thus a simple matter of scraping every author page on LibriVox and pulling the Wikipedia ID into a database, creating a map of Wikipedia names to LibriVox IDs.
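
A minimal sketch of this scraping step, written in Python for illustration (the project's actual tooling is not reproduced here). The author page URL pattern and the assumption that each page carries a single outbound en.wikipedia.org link are illustrative assumptions:

<syntaxhighlight lang="python">
import re
from urllib.parse import unquote

import requests

# Matches an outbound link to an English Wikipedia article.
WIKI_LINK = re.compile(r'href="(https?://en\.wikipedia\.org/wiki/[^"#]+)"')

def librivox_to_wikipedia(author_urls):
    """Map LibriVox author page URLs to the Wikipedia article titles they link to.

    Pages without a Wikipedia link are skipped.
    """
    mapping = {}
    for url in author_urls:
        html = requests.get(url, timeout=30).text
        m = WIKI_LINK.search(html)
        if m:
            # "https://en.wikipedia.org/wiki/Jane_Austen" -> "Jane Austen"
            title = unquote(m.group(1).rsplit("/wiki/", 1)[1]).replace("_", " ")
            mapping[url] = title
    return mapping

# Usage (hypothetical author page URL):
# librivox_to_wikipedia(["https://librivox.org/author/96"])
</syntaxhighlight>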

 Done

Step 2: Internet Archive

Internet Archive is a monster website with millions of books. It is beyond the scope of this project (at this time) to create a database of every IA author, or to add every IA author to Wikipedia; but we can add the IA template to every page where we add the PG and LV templates. Thus I wrote {{Internet Archive author}} in Lua, which dynamically builds a standardized and optimized IA search string.
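
The template itself is a Lua module; the sketch below is in Python, purely to illustrate the kind of search string it builds. The creator/mediatype query fields and the search.php URL form are assumptions for illustration, not a copy of the module's output:

<syntaxhighlight lang="python">
from urllib.parse import quote

def ia_author_search_url(name, media="texts"):
    """Build an Internet Archive search URL for an author's works.

    `name` is the article-style name, e.g. "Mark Twain". The query searches
    the creator field in both "First Last" and "Last, First" order; the
    field names and URL form are assumptions for illustration.
    """
    parts = name.split()
    flipped = f"{parts[-1]}, {' '.join(parts[:-1])}" if len(parts) > 1 else name
    query = f'(creator:("{name}") OR creator:("{flipped}")) AND mediatype:({media})'
    return "https://archive.org/search.php?query=" + quote(query)

# e.g. ia_author_search_url("Mark Twain")
</syntaxhighlight>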

 Done

Step 3: Project Gutenberg

This is by far the most difficult step, because mapping PG names to Wikipedia article titles is difficult (but not impossible) to automate. The basic procedure is to download the PG catalog (catalog.rdf), which contains author names in the form "Smith, John (1900-2000)", though the data is very noisy and needs a lot of cleanup. From that, the name string is converted to "John Smith", the PG book titles are read into an array, and the birth-death dates into a variable. A bot then searches Wikipedia for a match using this data as well as data from Open Library. Some matches are made with high confidence; others are made with lower confidence and require manual intervention. In total the bot processed over 20,000 names, downloading more than 1 million Wikipedia pages over the course of about 350 hours. About 100 manual hours were spent cleaning up and verifying the data. The good news is that the bot remembers mistakes, so it can be run again in a few years and will find new mappings more quickly. Accuracy is very high: better than 99.8% with respect to false positives and 99.0% with respect to false negatives.
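
A minimal sketch of the name normalization step, assuming the catalog entry has already been cleaned up to the "Smith, John (1900-2000)" form described above (real catalog.rdf data is far noisier than this):

<syntaxhighlight lang="python">
import re

# Matches "Surname, Given names (birth-death)", with the dates optional.
PG_NAME = re.compile(
    r'^\s*(?P<last>[^,]+),\s*(?P<first>[^(]+?)\s*(?:\((?P<dates>[^)]*)\))?\s*$'
)

def normalize_pg_name(raw):
    """Convert a Project Gutenberg catalog name to Wikipedia-style order.

    "Smith, John (1900-2000)" -> ("John Smith", "1900-2000")
    Returns the raw string (and no dates) if it does not match the pattern.
    """
    m = PG_NAME.match(raw)
    if not m:
        return raw.strip(), None
    name = f"{m.group('first').strip()} {m.group('last').strip()}"
    return name, m.group('dates')

# e.g. normalize_pg_name("Tucker, Charlotte Maria (1821-1893)")
# -> ("Charlotte Maria Tucker", "1821-1893")
</syntaxhighlight>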

 Done

Step 4: AutoWiki Browser script

Once the mapping databases (PG and LV) are created, an AWB external script adds the templates to the articles. The script performs a number of other chores as well, including moving WorldCat IDs to {{Authority control}} and converting Google Books URLs to their equivalents on Internet Archive. It is a semi-automated process, with manual labor determining where in the External links section the templates are placed, among other things. It takes about 2 to 3 hours to process 100 names this way, and there are over 11,000 unique names in the PG and LV databases combined (after accounting for overlap).

Script tasks

The script produces the following suggestions; edits are manually checked before saving:

  • 1. Existing {{Gutenberg author}} templates are converted from {{Gutenberg author|id=Ernest_Bernbaum}} to {{Gutenberg author | id=Bernbaum+,Ernest}}. The latter name format is the one used in the Gutenberg database and is more accurate. Dates are omitted by default, since many birth-death dates in the Gutenberg catalog are wrong, and if/when they are corrected the change would break Wikipedia's search URL. However, if an existing Gutenberg template already uses dates, the dates are usually kept. The template options for {{Internet Archive author}} are many, and the script will usually offer several options at run-time, based on the number of books discovered using those options; the best results are then chosen manually. It also searches for pseudonyms by extracting names in bold from the lead section.
  • 2. Google Books URLs are converted to Internet Archive URLs when the book is available at Internet Archive. Example. The script extracts the Google Books ID from the URL, searches for that ID on Internet Archive, and if found replaces the Google URL with an IA URL, as sketched after this list. (Most books uploaded to IA from Google Books were done by Aaron Swartz, i.e. user "tpd" on IA, a few years before he died.)
  • 3. Any instance of {{Worldcat id}} is deleted if it can be replaced with {{Authority control}}, since the two templates are usually redundant. If the LCCN exists on Wikidata, then {{Worldcat id}} is deleted and nothing is done to {{Authority control}} (unless it doesn't exist, in which case it is added). If there is no LCCN on Wikidata, then the LCCN is copied from {{Worldcat id}} into {{Authority control}} (example). An exception: if the LCCN in {{Worldcat id}} differs from the LCCN on Wikidata (or in {{Authority control}}), the {{Worldcat id}} template is not deleted. If a {{Worldcat id}} template contains an ID other than LCCN (such as VIAF), the template is left unmodified. The full decision logic is sketched after this list. Further discussion.
  • 4. Normal AWB General Fixes.
  • 5. Convert <references/> to {{Reflist}}
  • 6. Any other manual work in the External Links and References sections.
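
A hedged sketch of the Google Books lookup in task 2, in Python. The archive.org advancedsearch.php JSON endpoint is real, but querying the bare Google Books ID as free text is an assumption for illustration; the actual script's query is not reproduced here:

<syntaxhighlight lang="python">
import re
import requests

# Pulls the "id" parameter out of a Google Books URL.
GBOOK_ID = re.compile(r'[?&]id=([A-Za-z0-9_-]+)')

def google_books_to_ia(google_url):
    """Return an Internet Archive details URL for the same scan, or None.

    Searching the bare Google Books ID as a free-text metadata query is an
    assumption made for this sketch.
    """
    m = GBOOK_ID.search(google_url)
    if not m:
        return None
    resp = requests.get(
        "https://archive.org/advancedsearch.php",
        params={"q": m.group(1), "fl[]": "identifier", "rows": 1, "output": "json"},
        timeout=30,
    ).json()
    docs = resp.get("response", {}).get("docs", [])
    return "https://archive.org/details/" + docs[0]["identifier"] if docs else None
</syntaxhighlight>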
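
And a sketch of the decision logic in task 3, expressed over already-extracted values; the function name and inputs are hypothetical stand-ins for what the AWB script actually does with the wikitext and Wikidata:

<syntaxhighlight lang="python">
def worldcat_decision(worldcat_lccn, wikidata_lccn, has_authority_control,
                      worldcat_has_other_ids):
    """Return the action to take for a {{Worldcat id}} template.

    Inputs are assumed to have been extracted already; None means absent.
    """
    if worldcat_has_other_ids:
        # IDs other than LCCN (e.g. VIAF) present: leave the template alone.
        return "keep {{Worldcat id}} unmodified"
    if worldcat_lccn and wikidata_lccn and worldcat_lccn != wikidata_lccn:
        # Conflicting LCCNs: do not delete, a human has to look.
        return "keep {{Worldcat id}} (LCCN conflict)"
    if wikidata_lccn:
        # LCCN already on Wikidata: {{Authority control}} will pick it up.
        action = "delete {{Worldcat id}}"
        if not has_authority_control:
            action += "; add {{Authority control}}"
        return action
    if worldcat_lccn:
        # No LCCN on Wikidata: carry it over locally.
        return "delete {{Worldcat id}}; copy LCCN into {{Authority control}}"
    return "keep {{Worldcat id}} unmodified"
</syntaxhighlight>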

 Done (June 26, 2015)

Template counts:
Unique articles processed: 11,903

Source code

Source code and database mappings are available:

  • LV2WP, code to map LibriVox author IDs to Wikipedia article names.
  • PG2WP, code to map Project Gutenberg IDs to Wikipedia article names.

-- GreenC 18:57, 30 January 2015 (UTC)

Further AWB jobs

Further AWB jobs, different from but related to the above.

rev.ia9w

In November 2015, Internet Archive upgraded its search engine from Lucene to Elasticsearch (ES). Since ES is based on Lucene the switch was mostly seamless; however, there was a change (for the better) in how accented characters and wildcards are handled. In addition, there was a bug in {{Internet Archive author}} that caused excessive false positives when using wildcards under certain search conditions. The template was therefore modified to add a new sopt=w option, which affects all existing instances where the title contains accented letters, or about 906 articles. Some fraction of those need new sname and/or sopt parameters. This job implements those changes. -- GreenC 03:49, 5 February 2016 (UTC)