Wikipedia talk:Proposed tools/Cvcheck

From Wikipedia, the free encyclopedia

Enthusiastic support

Yes, please. We desperately need additional tools for copyright work to make processing cases quicker and help us cut down the thousands of articles backlogged for review at WP:CCI. --Moonriddengirl (talk) 23:35, 5 November 2010 (UTC)

Some feedback

Hey all, this looks interesting - I was not aware until now of the limitations of the existing tools. Some ideas:

  • I think we could effectively separate the problem of identifying candidate sources from the problem of comparing two URLs "in depth". This would make each task easier to do. Comparing a URL to a Google Books URL is difficult because Google Books deliberately makes it difficult to extract text from books.
  • Although Earwig's tool is console-based, I think the same could be implemented as a web-based Toolserver app as well. This would make it easier to share a report among multiple contributors using a link, and it could give pretty HTML result pages comparing things side by side. It'd also be nice to have a user-customizable list of Wikipedia mirrors, although this might be vandalised.
  • Should the tool always use the current version of the article? Should it use a specified version? Or should it find any copyvio matching any version? Or maybe matching any version after a given point?
  • Is it worthwhile to exclude quotations?
  • In developed articles, a difficult implementation problem is deciding which parts of the article to search for. This can be based in part on the history (on the theory that usually different authors introduce material from different sources). It's also important to avoid using phrases that are common. This would be facilitated by maintaining a statistical database giving frequency of occurrence of many common words and phrases. I'm open to other ideas on this.
  • For dealing with results that have already been reviewed and are not an issue, one way to do it might be to have a checkbox by each result and if checked the result is "greyed out" and/or "struck out". Then there could be a global checkbox at the top which hides all struck out items. This would allow review of complete results while also making it easy to hide results already reviewed by others.
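The common-phrase idea above could be sketched roughly as follows. This is only an illustration of the approach, not part of any existing tool: `distinctive_ngrams`, the phrase length, and the frequency threshold are all assumed names and parameters, and the background counts would in practice come from the statistical database the bullet describes.

```python
from collections import Counter

def distinctive_ngrams(text, background_counts, n=5, max_freq=2):
    """Return n-word phrases from `text` that are rare in a background corpus.

    Phrases that occur often in the background (common turns of phrase)
    are filtered out, leaving distinctive phrases worth searching for.
    `background_counts` maps phrase -> frequency in the reference corpus.
    """
    words = text.lower().split()
    ngrams = [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    return [g for g in ngrams if background_counts.get(g, 0) <= max_freq]
```

For example, a phrase like "the cat sat on the" with a high background frequency would be dropped, while an unusual run of words from the article would survive as a search candidate.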

Let me know what you think, thanks. :-) Dcoetzee 01:38, 7 November 2010 (UTC)

The MediaWiki API makes it easy to get sequential revisions, but the cost depends on what's being done to each revision. For example, a simple comparison against known problematic text is cheap, but running an external search on each revision would be expensive. Knowing the usefulness of a specific test would help us weigh its benefits against its costs. Flatscan (talk) 05:48, 8 November 2010 (UTC)
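Fetching sequential revisions through the MediaWiki API might look like the sketch below. The query parameters are the API's standard `action=query`/`prop=revisions` interface; the function names and the ten-revision limit are illustrative choices, and the network call itself is omitted here:

```python
import json
import urllib.parse

API = "https://en.wikipedia.org/w/api.php"

def revisions_url(title, limit=10):
    """Build a MediaWiki API query URL for a page's most recent revisions."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp|user",  # revision IDs, timestamps, authors
        "rvlimit": limit,
        "format": "json",
        "formatversion": 2,
    }
    return API + "?" + urllib.parse.urlencode(params)

def parse_revisions(payload):
    """Extract the revision list from a formatversion=2 JSON response."""
    page = json.loads(payload)["query"]["pages"][0]
    return page.get("revisions", [])
```

Fetching the built URL (e.g. with `urllib.request.urlopen`) and feeding the body to `parse_revisions` yields the revision metadata; the cheap-versus-expensive distinction in the comment above then comes down to what is run per revision in that list.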
It would be great to have a tool that could compare the current version or a specified version. I don't know if there's any value to running it against each revision, but being able to compare against an identified older revision would be useful. I would not exclude quotations, because sometimes those are the easiest way to find an infringed source. While close paraphrasing can bypass most manual detectors, infringers tend to reproduce quotes faithfully. (Frequently, the content was also quoted in the copied source.) I'm pretty tired--it's way past my bedtime, but I wanted to give some preliminary response now that I'm aware of it. Things have been crazy lately! Thanks much for considering it. :D --Moonriddengirl (talk) 04:18, 11 November 2010 (UTC)
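The point about quotations being reproduced faithfully suggests a simple heuristic a tool could use: pull direct quotations out of the article and use them as exact-match search queries. A minimal sketch, assuming straight double quotes and a hypothetical minimum length of four words:

```python
import re

def extract_quotes(text, min_words=4):
    """Pull direct quotations long enough to serve as exact-match queries.

    Short quotes ("hi there") are too common to be useful search terms,
    so only quotations of at least `min_words` words are kept.
    """
    candidates = re.findall(r'"([^"]+)"', text)
    return [q for q in candidates if len(q.split()) >= min_words]
```

A real implementation would also need to handle curly quotation marks and `{{quote}}`-style templates, which this sketch ignores.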

Any update?

This is an important tool, IMHO. I was wondering if anybody had an update on it. Thanks. - Hydroxonium (H3O+) 10:08, 3 February 2011 (UTC)