Wikipedia talk:Labels/Edit quality/Archive 2015

Kickoff

Hey folks. We're officially kicking off the Wikipedia:Labels campaign for edit quality. You should be able to load the interface by going to Wikipedia:Labels and clicking "Install the gadget". Once you've installed the gadget, the "Install the gadget" button will be replaced by the "campaigns" listing, where you can request worksets from "Edit quality (20k random sample, 2015)". The software is still a little rough around the edges, so we'll be on the lookout for your bug reports & feature requests throughout the week. I'll post progress reports here. Thanks for your help and let us know if you have any questions. --EpochFail (talkcontribs) 05:46, 9 May 2015 (UTC)[reply]

Installed, connected, tried a few. Note that when using Chrome, I had to allow cookies for labels.wmflabs.org. JoeSperrazza (talk) 06:10, 9 May 2015 (UTC)[reply]

Damaging?

I would like to verify that "damaging" means vandalism and spam only, and nothing else like grammar, formatting, MOS issues, bad sources, potential BLP issues, etc. Please let me know if this is not the case.- MrX 15:04, 9 May 2015 (UTC)[reply]

EpochFail or とある白い猫 (or anyone else who knows): I'm holding off on any additional labeling until someone can clarify what "damaging" means. I don't want to do this incorrectly. Many thanks.- MrX 22:01, 10 May 2015 (UTC)[reply]
Hi MrX. You've got it right. In a way, we're relying on your intuition for what is right. Generally, I encourage you to only mark edits as damaging if you would revert them. In the case of BLP issues, this is a judgement call, I know. It's your judgement that we're hoping to train the algorithms to replicate, so please feel free to rely on it. :) --EpochFail (talkcontribs) 16:33, 11 May 2015 (UTC)[reply]
Understood. Thanks for the clarification.- MrX 17:29, 11 May 2015 (UTC)[reply]
Hello MrX, indeed it is a judgement call. I want to give an example scenario for your consideration. Say you notice a newbie accidentally removing the final |} while updating the information in a table in an article. I would personally mark this as "damaging" but also "good faith". This is different from a user completely vandalizing the article, which is something I would mark as "damaging" and "bad faith". Do not worry too much about "making the wrong decision" - just stick to your best judgement even if it appears to be "too close to call". -- A Certain White Cat chi? 21:07, 11 May 2015 (UTC)

Suggestions

Done my first 50, and only found a couple that were borderline damaging.

  • Edits displayed included bot edits, WikiProject banners and user notifications from WMF, with very little chance of being classified as "damaging" - any chance to filter these out?
  • Could there be a middle button - to indicate "don't know / can't tell"?
  • Would it be possible, when opening an article by clicking on its title above the displayed edit, to open a new tab? At present, when you go back to the Labels interface, you often have to re-open the workset and find your place again.
  • This project looks a little like re-inventing the wheel, when ClueBot has been learning how to score edits for years. Have you considered tapping into the output from a similar exercise - the ClueBot Review interface - although I'm not sure that is being maintained any more? : Noyster (talk), 19:09, 9 May 2015 (UTC)[reply]
Hi Noyster, we need to have a pure random sample of edits classified in order to train the machine learning model. However, I agree that we should be able to know ahead of time that some edits are not damaging. I'll look into ways to pre-label those edits. For now, I hope they are trivial enough to just mark as not-damaging/good-faith.
As for the middle button, it's hard to train the bot to predict a "don't know" category. If you "don't know / can't tell", assume good-faith/non-damage as you would (probably?) when reviewing edits in the recent changes feed. However, I appreciate that not every edit is clearly non-damage and good-faith, and it would be nice to capture that. I've filed a bug to add an "unsure" checkbox to the form[1]. I also filed a bug for making articles open in a new tab. I'll ping here once I can get those deployed. --EpochFail (talkcontribs) 16:42, 11 May 2015 (UTC)[reply]
Noyster, I forgot to comment on ClueBot. Our work with WP:Labels is part of a larger project: m:R:Revision scoring as a service. You're right that ClueBot already does a lot of counter-vandalism work (reverting about 50% of vandalism within 10 seconds), but ClueBot is only one part of a larger quality control strategy. Regretfully, ClueBot NG only publishes its predictions via an IRC feed -- which is hard to consume with a JavaScript gadget. Further, ClueBot NG is only active on English Wikipedia. Our project is designed to work cross-wiki. While we run this campaign in English Wikipedia, there are parallel campaigns in Persian, Portuguese and Turkish Wikipedias.
But really, User:ClueBot NG is not the only machine-learned classifier for Wikipedia. There are also WP:Huggle and WP:Stiki (which catch the remaining 50% of vandalism). They have independently developed their own machine classifiers. Our goal is to provide the basic infrastructure that makes ClueBot/Huggle/STiki possible to everyone in a form that is easy to consume. Personally, I intend to use it for WP:Snuggle. We're not really re-inventing the wheel in that we're basing our work on the reams of scientific lit around classifying damage in Wikipedia (c.f. [2]). Our goal is more like building a wheel-builder. However, if you can point me to a recently generated open dataset of Wikipedian-labeled edits that flags damaging-but-good-faith edits separately from intentional damage, I'd be stoked to use it to supplement the labels we are gathering now. Regardless, I imagine no such dataset exists for trwiki, fawiki, ptwiki or the next Wikipedia we'd like to build a classifier for, so we still need this project -- if not this exact labeling campaign. --EpochFail (talkcontribs) 17:13, 11 May 2015 (UTC)[reply]
Thanks EpochFail for the detailed reply. Well, it may be worth noting that WP:STiki users are classifying around 2000 edits each day - identified by machine algorithms as higher-risk - into "Innocent", "Good-faith revert", and "Vandalism". Could Andrew West give you usable access to that output? Also interested to see your mention of Snuggle; if we could re-launch Snuggle and get lots of users for it, it could be of great value, both for identifying & keeping an eye on bad-faith editors, and for selectively welcoming those making really positive contributions: Noyster (talk), 20:30, 11 May 2015 (UTC)[reply]
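For readers wondering what "easy to consume" scores might look like in practice, here is a minimal Python sketch of fetching per-revision scores over HTTP. It is only an illustration: the endpoint URL, model names and response layout are assumptions based on how m:ORES is described, not a confirmed interface.

# Minimal sketch of consuming revision scores over HTTP (illustration only).
# The URL and response layout are assumptions, not a confirmed API contract.
import requests

def fetch_scores(revids, wiki="enwiki", models="damaging|goodfaith"):
    url = "https://ores.wikimedia.org/v3/scores/" + wiki
    params = {"models": models, "revids": "|".join(str(r) for r in revids)}
    response = requests.get(url, params=params, timeout=30)
    response.raise_for_status()
    return response.json()

print(fetch_scores([123456789]))  # inspect the raw JSON; the exact shape may differ

A gadget could issue the same request from the browser, which is what makes a scores-as-a-service approach easier to reuse than an IRC feed.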

Feedback

Going through the first 50 edits, I found that most were marked as non-damaging and made in good faith, partly because the sample included both bot edits and comments made to talk pages, user talk pages and even portal talk pages. So far I've been marking edits that I believe reduced the quality of an article as damaging, even if they were made in good faith, but I'm not certain if that's correct. I would recommend filtering out some edits, such as those made to talk pages, and adding an option to mark an edit as having no real effect upon an article. Additionally, can a non-damaging edit be made in bad faith, unless it's something like a WP:POINTy AfD nomination? Pishcal 23:51, 9 May 2015 (UTC)[reply]

Hi Pishcal, I'd like you to mark edits that you'd revert as "damaging". It seems like you might be describing that, but I'm worried that you might include contributions that are productive but imperfect (e.g. style issues, non-BLP issues and missing citations). Is that the case?
Also, re. non-damaging + bad-faith: indeed, it's unlikely to find something like this. I can hardly imagine an example. However, the way we present forms limits us from conditionally asking whether an edit is good-faith only when it is damaging. I suppose that we could replace the two form buttons with a three-button solution ("good", "damaging & good-faith", "damaging & bad-faith"). Do you think that would be more intuitive? --EpochFail (talkcontribs) 16:47, 11 May 2015 (UTC)[reply]

Status update: May 11th

Hey folks! We've gotten off to an exciting start. You can view basic stats of active campaigns by going to labels.wmflabs.org/campaigns/enwiki/?campaigns=stats, but for the moment, we don't have a pretty UI -- just the raw data -- so I'll take some time to talk through the numbers.

Out of 20,000 revisions, we already have 585 labels from 15 labelers. There are 950 revisions assigned and about 19k revisions to go before we complete the campaign. I've already received a set of feature requests that I'll be working to get deployed this week. There's been lots of discussion about what we should consider "damaging". A good rule of thumb is to mark an edit as "damaging" if you would revert it. --EpochFail (talkcontribs) 17:36, 11 May 2015 (UTC)[reply]

FYI: Updates deployed on 2015-05-12

See Wikipedia_talk:Labels#Deployment:_2015-05-12 --EpochFail (talkcontribs) 18:48, 12 May 2015 (UTC)[reply]

Status update: May 16th

Hey folks!

We've been making some good progress. It turns out that we're more than halfway there! A big part of the work was done by automatically labeling edits made by users with high privileges. This means that, for new worksets you request, you should see no bot/sysop/bureaucrat edits. I haven't messed with your open worksets. See my work-log entries where we worked out how to do this auto-labeling: 5/14 -- exploration and proposals & 5/16 -- auto-labeling the revisions.

OK. So what's our status then? You can always get a live update from the server, but as of right now, out of 20k revisions, we have labels for 10.8k. That means we're about 54% done! You should notice a higher percentage of damage in the work that remains, but it shouldn't be that severe. Rather than finding 1 or 2 damaging edits per set of 50, you should see 4-6 -- assuming I have the underlying proportions right.

Keep in mind that we're doing regular drives to get key features implemented and bugs fixed. See https://github.com/wiki-ai/wikilabels/issues. --EpochFail (talkcontribs) 16:38, 16 May 2015 (UTC)[reply]
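As an aside, here is a rough Python sketch of the kind of auto-labeling described above: look up whether a revision's author belongs to a trusted group via the MediaWiki API and, if so, mark the edit as non-damaging/good-faith. This is not the actual script used for the campaign; the label dictionary and the example revision ID are placeholders.

# Rough sketch of auto-labeling edits by trusted users (bots, sysops, bureaucrats).
# Illustration only -- not the campaign's actual script; the label dict is a placeholder.
import requests

API = "https://en.wikipedia.org/w/api.php"
TRUSTED_GROUPS = {"bot", "sysop", "bureaucrat"}

def user_groups(username):
    # Look up the user's groups via the MediaWiki API (list=users, usprop=groups).
    params = {"action": "query", "list": "users", "ususers": username,
              "usprop": "groups", "format": "json"}
    data = requests.get(API, params=params, timeout=30).json()
    users = data.get("query", {}).get("users", [])
    return set(users[0].get("groups", [])) if users else set()

def auto_label(rev_id, username):
    # Trusted authors get an automatic non-damaging/good-faith label;
    # everything else is left for human labelers.
    if user_groups(username) & TRUSTED_GROUPS:
        return {"rev_id": rev_id, "damaging": False, "goodfaith": True}
    return None

print(auto_label(660000000, "ClueBot NG"))  # placeholder revision ID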

Incompatibility with Xtools

Be aware of meta:User talk:Hedonil/XTools#CSS class is too generic, which makes the Wiki Labels interface disappear. Helder 19:51, 20 May 2015 (UTC)[reply]

We've hacked a workaround into place that should now be deployed. That means Wiki Labels should work even if you have XTools installed. If you have any issues, try refreshing the page. If that doesn't work, let us know. --Halfak (WMF) (talk) 18:21, 9 June 2015 (UTC)[reply]

Accuracy of the progress meter?

Hello, I recently completed 11 worksets, (labeled 550 edits). If I understand correctly, this goal of this campaign is to label 20,000 edits. On the project page, there is a sort of "progress meter", and from what I can tell, it hasn't changed since I started contributing, remaining at 55.8% completion. If I labeled 550 edits, shouldn't it have advanced by 2-3%? Now, I can think of a couple reasons why this could be wrong, the thing just may not update very often, I may have incorrectly remembered the value it listed when I first saw it, etc. I ask because I wonder if maybe I have done something wrong, i.e. there is some sort of step i missed required to "submit" the data. I don't want to be doing this work and have it end up being useless to the project because I did something silly. Thanks, SarrCat ∑;3 05:18, 10 June 2015 (UTC)[reply]

@Sarr Cat: Do you mean this? I'm updating it manually by copy-pasting the number of labels I get from https://labels.wmflabs.org/campaigns/enwiki/?campaigns=stats. Helder 18:28, 10 June 2015 (UTC)[reply]
@Helder: Ok, good to know, just wanted to make sure! Thanks for the response! SarrCat ∑;3 18:47, 10 June 2015 (UTC)[reply]
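For what it's worth, that manual copy-paste could be scripted along the lines of the Python sketch below. It is only a guess at how the stats endpoint responds: the field names in the parsing step are hypothetical and would need to be adjusted to whatever the server actually returns.

# Sketch of computing the campaign's completion percentage from the stats endpoint.
# The parsing below is hypothetical -- print the raw response and adjust field names.
import requests

STATS_URL = "https://labels.wmflabs.org/campaigns/enwiki/?campaigns=stats"
TARGET = 20000  # the campaign's 20k random sample

data = requests.get(STATS_URL, timeout=30).json()
print(data)  # inspect the real structure first
labeled = data["campaigns"][0]["labels"]  # hypothetical field names
print("{:.1f}% complete".format(100.0 * labeled / TARGET))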

Nearly there!

Hey folks,

It looks like we've reached 98.7%. This is the final push. If you get in quickly, you can grab one last workset before we've completed the campaign. Once we're done, I'll do some work to build a new damage detection model for the m:ORES service and report on what I find. Thanks for all your hard work. --EpochFail (talkcontribs) 13:19, 18 July 2015 (UTC)[reply]

Complete!

Hey folks,

It turns out that I had the last few tasks in my workset. When I finished them today, that completed the campaign. Thanks for all your help. I'll ping when new models are available for m:ORES. --EpochFail (talkcontribs) 21:34, 18 September 2015 (UTC)[reply]

Great news! I just got done training the first model based on the "damaging" question and we're doing well above the state of the art for realtime damage detection.
Accuracy: 0.8503646575572135

ROC-AUC: 0.9047936661709325

         False    True
-----  -------  ------
False     9197      68
True      2255     409

I'll be experimenting with building a model for "good-faith" and building models for the two other campaigns that have finished (Portuguese and Persian). Stay tuned. --EpochFail (talkcontribs) 13:33, 21 September 2015 (UTC)[reply]
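For anyone wanting to reproduce numbers like these on their own held-out labels, here is a generic scikit-learn sketch in Python. It is not the revscoring pipeline that produced the figures above, and the toy arrays are placeholders rather than the campaign data.

# Generic sketch of computing accuracy, ROC-AUC and a confusion matrix for a
# binary "damaging" classifier. Not the revscoring pipeline used above;
# y_true / y_prob are toy placeholders, not the campaign's data.
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

y_true = [False, False, True, False, True, False]  # human "damaging" labels
y_prob = [0.05, 0.20, 0.90, 0.40, 0.35, 0.10]      # model P(damaging) per edit
y_pred = [p >= 0.5 for p in y_prob]                # default 0.5 threshold

print("Accuracy:", accuracy_score(y_true, y_pred))
print("ROC-AUC:", roc_auc_score(y_true, y_prob))
print(confusion_matrix(y_true, y_pred, labels=[False, True]))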

(No indent since I'm talking to myself here.) OK. I have some models built. It looks like I was mistaken and the campaign for Persian hasn't finished.

English Wikipedia (damaging)
See above
English Wikipedia (goodfaith)
Accuracy: 0.6663034367141659

ROC-AUC: 0.9011881887479476

         False    True
-----  -------  ------
False      274    1870
True        54    9732
Portuguese Wikipedia (damaging)
Accuracy: 0.7300593000918734

ROC-AUC: 0.9100237346629988

         False    True
-----  -------  ------
False     8578      88
True      2568     739
Portuguese Wikipedia (goodfaith)
Accuracy: 0.8468220162031237

ROC-AUC: 0.9204251907951084

         False    True
-----  -------  ------
False      674    2600
True        60    8639

Now to get these models on ORES. --EpochFail (talkcontribs) 00:24, 22 September 2015 (UTC)[reply]

We got some results for Persian Wikipedia!

Persian Wikipedia (damaging)
Accuracy: 0.9261582204382004

ROC-AUC: 0.9460009136260886

         False    True
-----  -------  ------
False    10695      25
True      1104     134
Persian Wikipedia (goodfaith)
Accuracy: 0.9501588894463957

ROC-AUC: 0.9406481545363439

         False    True
-----  -------  ------
False       75     791
True        24   11068

Woot! Time to get these deployed. --EpochFail (talkcontribs) 14:06, 3 October 2015 (UTC)[reply]