User talk:Dpmuk/DpmukBOT

Hey, if you could actually get this working it would be beyond awesome. I have my doubts about the technical feasibility of the checks part, but even if the bot is purely doing housekeeping it would be a great help. Yoenit (talk) 17:37, 26 November 2010 (UTC)

I think the checks are doable; the problem is making sure the rate of the bot tagging copyright-violating diffs as clean is very low (preferably zero). I've already got code that I think is calling some of the cases correctly (those marked "(coded)"), although I've by no means checked them as well as I need to yet. Once I have code I'm happy with I intend to explain the algorithm in layman's terms as well as publishing the code to ensure what I'm doing makes sense. The housekeeping tasks are probably going to be harder for me - I'm quite used to manipulating text, which is all the checks require, while the housekeeping tasks require an understanding of how Wikipedia (specifically the API) works, which still requires me to look many things up. Dpmuk (talk) 23:46, 26 November 2010 (UTC)

P.S. Thanks for picking up the typo - doing things like that is one reason I don't get too involved in article writing. Dpmuk (talk) 23:47, 26 November 2010 (UTC)

Random thoughts

I'll just apologize in advance for rambling and sounding like a party-pooper (I really am a big fan of clerking bots), but I feel the need to share my ideas/concerns before I forget them. I also haven't gotten around to reading MRG's page for the background on your idea yet, so more apologies if I've overlooked something there.

As far as updating the main CCI page goes, I think the easiest way to handle it (and allow automation) would be to create every request on its own subpage to begin with, which would then be transcluded to the main page, similar to how SPI does it. Then opening a request should just involve moving the transclusion on the main page, setting a parameter (so it's collapsed like all other open investigations) and populating the CCI page around the existing request portion which is copied there currently. I think marking cases as accepted or rejected (and the corresponding movement to archives, populating, other listing) could probably be handled better with a script than a bot, since it involves active interaction and interest from a CCI clerk/admin and would need checking for subpages and other things which would be bad to overlook thinking the bot has already done them.

Figuring out which investigation pages are only articles could be touchy and I think there may have been one which included both articles and files. Also, sometimes the different subsections in a CCI matter. For example, some sockpuppet-full CCIs have different editing styles/interests and sources used for different socks which make it easier to work on one account's contribs at a time. Keeping other articles which have been cleaned at hand also makes it easy to go back and check where the copyvio was from for future articles, so I don't know that moving cleaned articles to a different section would be a net benefit. Also, I don't know if it's common practice, but I know that more than once I've marked an article Y before I've taken any action on it simply because I don't have time to finish the check but I want to make a clear note that I have confirmed at least some copyvio. VernoWhitney (talk) 17:02, 29 November 2010 (UTC)

Some of your bot checks should already be taken care of by the contribution surveyor. Practically speaking all CCI listings ignore diffs which add less than 100 bytes to the article size, which should account for very short additions and moving text around and maybe formatting changes, although I'm not sure exactly what you mean by formatting. It also checks the last 3 (I believe, I'd have to check the code) versions of the article for likely reversions (using article size, so it can have false positives but it's much more efficient than pulling article content for those revisions). Both of those are checkboxes and not mandatory, but in practice they're the de facto standard for CCI listings.
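The two surveyor filters described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the contribution surveyor's actual code: the `Revision` record, the function names, and the exact-size-match rule for spotting reversions are all hypothetical. Only the 100-byte threshold and the idea of comparing article sizes over a few earlier versions come from the description above, and the look-back of 3 is the figure recalled there, not a verified one.

```python
from dataclasses import dataclass
from typing import List

MIN_BYTES_ADDED = 100  # threshold described above
LOOKBACK = 3           # earlier versions to compare (as recalled above, unverified)


@dataclass
class Revision:
    """Hypothetical record for one revision in a page history (oldest first)."""
    user: str
    size: int  # page size in bytes after this revision


def is_small_addition(history: List[Revision], index: int) -> bool:
    """True if the revision at `index` added fewer than MIN_BYTES_ADDED bytes."""
    previous_size = history[index - 1].size if index > 0 else 0
    return history[index].size - previous_size < MIN_BYTES_ADDED


def is_likely_reversion(history: List[Revision], index: int) -> bool:
    """True if the resulting page size equals the size of one of the
    LOOKBACK versions that precede this revision's parent.

    Comparing sizes avoids fetching page text, but two different texts
    can have the same length, so false positives are possible.
    """
    if index < 1:
        return False
    start = max(0, index - 1 - LOOKBACK)
    earlier = history[start:index - 1]  # excludes the parent itself
    return any(rev.size == history[index].size for rev in earlier)
```

With a look-back of 3, a revert that undoes six earlier edits falls outside the window and is not flagged; widening the window catches more reverts at the cost of more accidental size matches.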

As far as checking to see if the contributions are copyvios, while pulling the changed text out of a diff is an extra step beyond just pulling up a single URL, the actual check sounds like what's been posted (sadly, little action yet) at Wikipedia:Proposed tools/Cvcheck and would be usable far beyond just CCIs. VernoWhitney (talk) 17:02, 29 November 2010 (UTC)

Right. I'll do my best to reply, although there are a couple of bits where I'm not sure what you mean.

On the main page / transclusion issue. As far as I can see from the current page, someone makes a request for a CCI and then an admin/clerk decides whether the case needs opening and opens it, using (for simple text cases) the contribution surveyor. What I'd intend to do is automatically produce the CCI page once it was marked as accepted by an admin/clerk. This would save the admin/clerk having to create the page manually. I fully intend for the bot to call the already existing contribution surveyor (no point re-inventing the wheel). As for accepting / rejecting, this would NOT be done by the bot - but I'd intend the follow-up actions (e.g. archiving, updating templates, etc.) to be done by the bot so as to save admins / clerks additional tedious work. I have concerns about transcluding case pages due to their size but exactly how that works is probably a reasonably minor issue that can be ironed out later.
As for the different types of CCI, I'd intend for the patrolling admin marking the case as accepted to mark it if it contained images. It may also be worth having a "special" marking for unusual cases which would ensure the bot treated it differently. Again I wouldn't expect the bot to be making such a decision. I'll think some more about this. Keeping different users in different sections would be easy although I appreciate this is only one specific example of your wider concern.
I'm not sure I get your "Keeping other articles which have been cleaned at hand ..." point. I'm not saying they would disappear from the page, merely that they would be in a different section, so they'd still be at hand. Could you explain further?
As for the checks, I propose being clever about them. When I say "very short additions" and "moving around" I mean very short additions after the other checks - so for example there was a reference addition and someone corrected a spelling (by adding a character) or reorganised the article at the same time. This would not be caught by a simple diff size check but clearly isn't a problem. By formatting I mean bolding etc. I can explain my current thoughts on an algorithm in more detail if you wish - something I was intending to do anyway once it was more developed. In future I can see this expanding to being clever about adding other tags, although this requires more thought so wasn't intended for a first version.
As for reverting it was cases like this that maybe confused me a bit about exactly what the surveyor did. I hadn't worked out exactly how to do this check yet and maybe instead of just looking back a set amount I need to look at who made previous edits and do something cleverer.
Yes, the copyvio check does sound very similar to what has been proposed. Part of the reason I haven't even thought about this bit yet is that I hadn't fully surveyed what was already out there. As I state above I don't intend to re-invent the wheel so I'd happily use this tool.
The checks could be done as part of the contribution surveyor but I think it's easier to keep them separate - especially if the developers are different.

Dpmuk (talk) 17:43, 29 November 2010 (UTC)
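The kind of check described in the reply above (discount reference additions and formatting such as bolding, then ask whether what remains is only a very short addition or text that was moved around) might be sketched like this. It is a guess at one possible approach, not the algorithm that has actually been coded: the function names, the regular expressions, and the `MAX_NEW_WORDS` cut-off are all hypothetical. Words are compared as an unordered bag, so text that merely moves within the article does not count as new.

```python
import re
from collections import Counter

MAX_NEW_WORDS = 3  # hypothetical cut-off for a "very short addition"

# <ref ... /> and <ref ...>...</ref>, including their contents
REF_PATTERN = re.compile(
    r"<ref\b[^>/]*/>|<ref\b[^>]*>.*?</ref>", re.DOTALL | re.IGNORECASE
)
# wiki bold/italic markup: runs of two to five apostrophes
BOLD_ITALIC_PATTERN = re.compile(r"'{2,5}")


def strip_ignorable_markup(wikitext: str) -> str:
    """Remove references and bold/italic markup before comparing text."""
    without_refs = REF_PATTERN.sub("", wikitext)
    return BOLD_ITALIC_PATTERN.sub("", without_refs)


def newly_added_words(old_wikitext: str, new_wikitext: str) -> Counter:
    """Words that occur more often in the new text than in the old one.

    Order is ignored, so reorganised text contributes nothing here.
    """
    old_words = Counter(strip_ignorable_markup(old_wikitext).split())
    new_words = Counter(strip_ignorable_markup(new_wikitext).split())
    return new_words - old_words


def looks_trivial(old_wikitext: str, new_wikitext: str) -> bool:
    """True if, after stripping, at most MAX_NEW_WORDS words were added.

    A real bot would need a far more conservative rule, since wrongly
    marking a copyright problem as clean is the costly mistake.
    """
    added = newly_added_words(old_wikitext, new_wikitext)
    return sum(added.values()) <= MAX_NEW_WORDS
```

An edit that adds a citation, bolds a word, fixes one spelling and swaps two sentences would add more than 100 bytes yet count as trivial under this rule, which is the case a plain size check misses.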

Thanks for the reply. I've been mulling over improvements to CCI for a while too, so I was thinking maybe we could shoehorn in some improvements there at the same time as firing up another bot. I'll try and be more positive now that I've got my first rush of negative thoughts out of the way. ^_^

I was apparently omitting some key details when I was talking about transcluding CCI pages earlier, sorry. Let me try and explain it better this time around. Currently the open investigations section has a bunch of {{CCI-open}} transclusions and the archive page uses {{CCI-closed}}. What I was thinking of was probably having a single template which acts as both of them do now (pretty much just linking to the CCI page), with the option of transcluding the entire CCI page. Every CCI page could be created containing only {{subst:CCI-request}} (preloaded, just like SPI pages), so the display on the main CCI page would be basically the same as now if the requests were completely transcluded via this new template. The bot could move the template on the main page and change the parameter so it's only linking to the CCI subpage and not transcluding it before the rest of the CCI instructions/contribs are populated and it becomes the overblown behemoth we've come to know and love.

I'm not sure why I thought bot-opened CCIs was a bad idea before - maybe just an issue of time lag between accepting it and being able to work on it; the time could add up if your bot is going to be checking the content of thousands of diffs before fully opening the CCI, but I suppose that should just be time that editors would've had to spend later anyways. Or maybe it was just the circumstances of some CCIs, but those could be handled by a "special/manual" tag. I'm not sure what the best way to handle that would be. The bot could react to a set of clerk notes like {{BAG Tools}}, or just react to a particular set of parameters in the template I've been thinking of.

Your idea of moving already cleared articles around within CCIs is so that unchecked ones are easier to find, to save time right? I was just saying that knowing what similar articles were copied from is sometimes helpful, and if the cleared articles were moved it would require going to the other section to find them instead of them being on the same screen, thus taking more time. There's also the issue of more edits and so more chance of edit conflicts and a busier page history making it more difficult to track if someone forgets to sign their action. This one's a minor quibble though, and just my opinion of course.

Your explanation of checks makes a lot more sense now. You may already have thought of it, but a check for {{Infobox}} and its spawn would also eliminate quite a few edits. Your Orange reversion example wasn't caught because it was a revert of 6 prior edits so the surveyor didn't look far enough back (the more versions it checks the higher likelihood of a false positive). One thing that the contribution surveyor doesn't do is group sequential edits by the same contributor (either for purposes of reversions as here or even just repeated edits by the CCI subject), so that could be something your bot could do at the same time if it's going through and checking all of the diffs anyways. VernoWhitney (talk) 20:27, 29 November 2010 (UTC)
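Grouping sequential edits by the same contributor, as suggested above, is simple to sketch. The `Edit` record and function name here are hypothetical, and the history is assumed to be ordered oldest first; nothing in this sketch reflects how the contribution surveyor or any existing bot stores revisions.

```python
from itertools import groupby
from typing import List, NamedTuple, Tuple


class Edit(NamedTuple):
    """Hypothetical record for one revision in a page history."""
    rev_id: int
    user: str


def group_consecutive_edits(history: List[Edit]) -> List[Tuple[str, List[int]]]:
    """Collapse each run of consecutive edits by one user into (user, rev_ids).

    groupby only merges adjacent items, so a user who edits, is
    interrupted by someone else, and edits again yields two runs.
    """
    return [
        (user, [edit.rev_id for edit in run])
        for user, run in groupby(history, key=lambda edit: edit.user)
    ]
```

Each run could then be checked as a single combined diff (first parent to last revision), both for the CCI subject's repeated edits and for a revert that undoes a whole run by one user.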

I don't mind people being negative if they're being honest as I'd prefer to have honest feedback. I'm not much of an article creator as I'm not the best at writing prose but I'd like to think I can code reasonably well so have been looking for a bot task to work on for a while. This one seemed a good one as it's (unfortunately) a reasonably obscure area so I'm unlikely to tread on many toes, but as it's an area I myself am new to I'm looking for advice from more experienced editors such as yourself.

Now I've seen your clarification I think we're thinking along broadly similar lines with respect to CCI sub-pages / transclusions etc. As I say I'll make a mock up once I've got the bot to a certain point and then we can discuss specifics more but I think I understand you well enough to at least make a start.

Ermm, I take your point about wanting to work on it straight away. However I'm guessing a sub-page takes at least a few minutes to set up so if the bot ran in that time you wouldn't lose any time. I think I need to get a mock up working properly to see how long the bot will take to run. One thing I have been thinking about is allowing the bot to be run manually (obviously with some sort of limit on how often) which would solve this problem - I think it would be technically feasible but I'm not 100% sure so would need to look into it.

Having never run a bot before I don't know how likely edit conflicts are so I'll wait until the mock up is up and running before deciding whether that's likely to be a problem. Even if the moving around idea isn't a starter (and as you say it probably hasn't had enough discussion yet) the very least I can see the bot doing is collapsing completed sections. I think I need to think about this aspect some more as I've got a couple of vague ideas kicking around my head that need thinking through properly.

Hadn't considered infoboxes. My one concern with infoboxes (which is also a concern with cite references but less so) is that they can include reasonably large amounts of text and so I think need to be checked for how much text they have in them. I definitely agree with you that it's worthwhile but this may have to wait until version 1.1 as I don't want to try to do too much in a first version - I'm already a little concerned that I've suggested too much as it is! Checking for adjacent diffs seems sensible especially if I do something clever checking for reversions so I'll try and include this. Dpmuk (talk) 00:01, 30 November 2010 (UTC)
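The concern above, that an infobox can hold a fair amount of text and so should be measured rather than skipped outright, could be approached along these lines. This is a rough sketch with hypothetical function names: it finds templates by matching double braces and then counts characters in parameter values with a naive split on "|", which over- or under-counts when values contain nested templates or piped links.

```python
import re
from typing import List


def extract_templates(wikitext: str, name_prefix: str) -> List[str]:
    """Return the full text of each template whose name starts with name_prefix."""
    pattern = re.compile(r"\{\{\s*" + re.escape(name_prefix), re.IGNORECASE)
    results = []
    pos = 0
    while True:
        match = pattern.search(wikitext, pos)
        if match is None:
            break
        depth = 0
        i = match.start()
        end = None
        # Walk forward, tracking {{ ... }} nesting until the opener is closed.
        while i < len(wikitext) - 1:
            pair = wikitext[i:i + 2]
            if pair == "{{":
                depth += 1
                i += 2
            elif pair == "}}":
                depth -= 1
                i += 2
                if depth == 0:
                    end = i
                    break
            else:
                i += 1
        if end is None:
            break  # unbalanced braces: stop rather than guess
        results.append(wikitext[match.start():end])
        pos = end
    return results


def value_chars(template_text: str) -> int:
    """Naively count characters in parameter values of one template."""
    inner = template_text[2:-2]
    total = 0
    for part in inner.split("|")[1:]:  # first part is the template name
        _, sep, value = part.partition("=")
        total += len(value.strip()) if sep else len(part.strip())
    return total
```

A bot could then treat an infobox edit as ignorable only when the count stays under some agreed limit, and leave anything larger for a human to check.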

Setting up a subpage and adding it to the open investigations usually only takes a couple of minutes beyond however long the contribution surveyor takes; rather more for multiple users and multiple pages due to the need to link all of the subpages and fix headers and the like. I was thinking your bot would take longer than that, but I'm probably running away with my assumptions here. As I imagined it your bot would get the list from the contribution surveyor, take the diffs and scan each of them for your checks, and then post the resorted and appended contribution list. Did you have a different plan in mind?

From what I can tell, bots don't get edit conflicts very often since they open and edit articles quickly, but they are just as likely to cause them for other editors working on the same page at that time. It should just depend on how often the bot scans the CCI for changes - if you were having it scan every 5 minutes or something it could be an issue; if it runs once a day (which may be enough given how sporadic work often is in CCIs) then edit conflicts are obviously not a big issue.

Another thought that occurred to me now that I've been reading over MRG's page is that articles are sometimes tagged with {{?}} in addition to the {{y}} and {{n}} for cases where material is presumptively removed or where it's all been removed already before the CCI, making determination moot.

I don't think you've suggested too much, but breaking out the ideas into separable tasks (e.g., opening/populating new investigations, checking diffs, moving listings around within CCIs) should make it easier to trial and get approved. I included a bunch of different ideas in Wikipedia:Bots/Requests for approval/VWBot and it just made the whole thing more drawn out than it had to be to get the basic tasks up and running. VernoWhitney (talk) 14:51, 30 November 2010 (UTC)