Wikipedia:Proposed tools

This page is currently inactive and is retained for historical reference.
Either the page is no longer relevant or consensus on its purpose has become unclear. To revive discussion, seek broader input via a forum such as the village pump.

Tools, such as bots, semi-automated editing and administrative tools, and Toolserver tools with access to the Wikipedia database, regularly help with routine everyday tasks, either by automating them completely or by streamlining the workflow so that human input is needed only where it matters. They can also make possible certain tasks that would be impossible, or too tedious to be worthwhile, using ordinary website functions.

However, tool design is currently fragmented among many individuals, with limited public discussion. As a result, designs are often not well reviewed before implementation, it is difficult to recruit developers for complex tool development efforts, and creative contributors with tool ideas may have trouble finding people with the skills to make their ideas a reality. The purpose of this page is to propose new tool ideas, flesh out their high-level requirements and design, and recruit interested developers.

Please be bold and invite feedback even if you're not quite sure how your tool idea would work: this is a collaborative forum, and we can all work together to come up with good designs.

How to propose a new tool

Come up with a short name or terse description for your proposed tool.
Create a new subpage Wikipedia:Proposed tools/Your tool name.
Copy the following template wikitext into the subpage and fill out each field. If you don't know the answer to a question, leave it blank.
At the bottom of this page, transclude your proposed tool using {{Wikipedia:Proposed tools/Your tool name}}.
Direct any questions, disagreements, and reservations to the new subpage's discussion page.

In the future, proposed tools may be further categorized and structured as necessary.

== Name of tool ==

(one-sentence description of the tool, with a link to the tool development website, if one exists)

=== Problem ===

(description of the problem motivating this tool)

=== Requirements ===

(what does the tool need to do? do not include details about implementation here)

=== Interface design ===

(describe how you imagine the user interface might look; it can be web-based, GUI-based, console-based, or whatever you like)

=== List of interested developers ===

=== High-level architecture ===

(to be filled in by developers; what components will the tool have, and how will they interact?)

=== Implementation details ===

(to be filled in by developers; how will the tool be implemented? what technologies will be used and what implementation issues do you anticipate?)

=== Progress ===

(as the tool is developed, describe here how far along it is and what problems are being encountered)

Cvcheck

A copyright tool that checks for text copied from the web.

Problem

AFAIK there are currently only two WP tools available to check articles for the presence of text copied from the Web. Both have limitations.

User:CorenSearchBot runs as a background task on newly created articles. A particular article can also be run through it by adding its name to a queue; the bot states that it will process queued articles when it has a free moment. Its major limitation is that, because it's an automated task, it can't search Google or Google Books.

User:The Earwig's tool [1] is manually invoked. It searches Google, but not Google Books. It would not have caught the material that caused the recent flap [2]. (I don't know whether CSbot would have caught it either.) Its author is a student who has said they won't have time to improve its algorithm. It doesn't create permanent output. (I realize that might pose a maintenance problem.) I'm not sure, but from looking at the code [3], I think that if it finds one match, it adds that URL to an exclusion list. If true, this means that whoever tries to clean up the article will need to go on manually comparing the rest of the website to the article; it would be much more efficient to see every match.

Requirements

Check article sentences to see whether they were copied verbatim or nearly verbatim from websites (excluding known WP mirrors and public-domain sources) and books in Gprint. Create output listing, for each match, the article section title, the matching sentence or a good-sized sentence fragment, and the URL. Optional but useful: a second-pass option with checkboxes that would allow the user to exclude some of the matching websites, because even if the usual WP mirrors are automatically excluded, one often sees random sites that have scraped WP.

While it's under development, or maybe after it goes live too, dump its search strings out somewhere; we could then work out why it didn't find a match where we would have expected one and think of ways to further improve the algorithm. Novickas (talk) 15:23, 5 November 2010 (UTC)
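The per-match output and the exclusion pass described above could be sketched roughly as follows. This is only an illustration, not part of any existing tool: the `Match` record, the `EXCLUDED_HOSTS` set, the example host names, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Match:
    """One reported match (hypothetical record layout)."""
    section: str   # article section title
    fragment: str  # matching sentence or sentence fragment
    url: str       # where the matching text was found


# Hypothetical list of known mirror hosts; a real list would be maintained elsewhere.
EXCLUDED_HOSTS = frozenset({"mirror.example.org"})


def host_of(url):
    """Return the lowercase host name of a URL, or an empty string if it has none."""
    return (urlparse(url).hostname or "").lower()


def filter_matches(matches, excluded_hosts=EXCLUDED_HOSTS):
    """Drop matches whose URL is hosted on an excluded site."""
    return [m for m in matches if host_of(m.url) not in excluded_hosts]


def format_report(matches):
    """Render one line per match: section, fragment, URL."""
    return "\n".join(f"{m.section} | {m.fragment} | {m.url}" for m in matches)
```

A second pass, in which the user ticks additional sites to exclude, would then amount to calling `filter_matches` again with a larger set, for example `filter_matches(matches, EXCLUDED_HOSTS | {"scraper.example.net"})`.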

I've added my enthusiastic support for this idea at the talk page. Something that searches Google Books would be particularly helpful, if this is technically feasible. The checkbox idea would also be useful, although to keep reports manageable I would suggest one difference here: rather than listing complete results and then having a second pass through with a checkbox wherein specific results are excluded, I would propose a brief results page with a checkbox that allows a second pass presenting a complete comparison. (I'm also dreaming of the day when somebody can create a tool to allow me to directly compare two URLs--including old article revisions and current ones; two different Wikipedia articles; a Wikipedia article and an identified external source). --Moonriddengirl (talk) 11:31, 6 November 2010 (UTC)

Interface design

Console-based, like Earwig's tool.

List of interested developers

Dcoetzee 01:08, 7 November 2010 (UTC)
Flatscan (talk) I have MediaWiki API and JavaScript experience, but I may be able to help with side tasks. 05:48, 8 November 2010 (UTC)
VernoWhitney 18:17, 8 November 2010 (UTC)

High-level architecture

(to be filled in by developers; what components will the tool have, and how will they interact?)

Implementation details

(to be filled in by developers; how will the tool be implemented? what technologies will be used and what implementation issues do you anticipate?)

Progress

Just this morning I implemented a basic prototype of this that seems to do a pretty good job. It doesn't yet account for a lot of things like detecting close paraphrasing or eliminating common phrases or proper names, but a few people have tried it and given good feedback. See:

Duplication Detector tool on Toolserver
Demonstration: [4]
Comparing to a PDF: [5]

It's based on a simple n-gram search algorithm: the webpages are stripped down to text and split into sequences of words, then an index data structure is built from one of them by collecting, for each pair of words, all positions at which that word pair occurs. It then goes over the other document's sequence of words and, at each position, matches the current pair against every position at which that pair occurs in the indexed document, extending each match as far as possible. During the final listing it sorts the matches by number of words in descending order and eliminates any search results that are substrings of search results already listed. PDFs are simply filtered through the existing pdftotext tool first. Dcoetzee 17:23, 21 March 2011 (UTC)
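The matching steps described above can be illustrated with a minimal sketch. It is not the Duplication Detector's actual code: it assumes the pairs are adjacent words, starts from already-extracted plain text, and uses hypothetical names throughout (`tokenize`, `build_pair_index`, `find_matches`, and the `min_words` threshold).

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into a list of words (assumed tokenization)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_pair_index(words):
    """Map each adjacent word pair to every position at which it starts."""
    index = defaultdict(list)
    for i in range(len(words) - 1):
        index[(words[i], words[i + 1])].append(i)
    return index


def find_matches(text_a, text_b, min_words=3):
    """Return word runs shared by both texts, longest first."""
    a, b = tokenize(text_a), tokenize(text_b)
    index = build_pair_index(a)  # index one document by word pair
    candidates = set()
    for j in range(len(b) - 1):
        # Try every position in document A where the current pair of B occurs.
        for i in index.get((b[j], b[j + 1]), ()):
            length = 2
            # Extend the match as far as the two word sequences agree.
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= min_words:
                candidates.add(" ".join(b[j:j + length]))
    results = []
    # Longest first; ties broken alphabetically so the output is deterministic.
    for phrase in sorted(candidates, key=lambda p: (-len(p.split()), p)):
        # Skip any phrase that is a whole-word substring of one already listed.
        if not any(f" {phrase} " in f" {kept} " for kept in results):
            results.append(phrase)
    return results
```

Calling `find_matches(article_text, source_text)` returns the shared phrases with the longest first. The sketch leaves out the things the prototype is said not to handle yet either, such as close paraphrasing and filtering of common phrases or proper names.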