Wikipedia:Bots/Requests for approval/PrimeBOT 16

The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was

Approved.

PrimeBOT 16

Operator: Primefac (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 14:44, Monday, April 10, 2017 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): AWB

Source code available: AWB

Function overview: Update links to Cornell's Law School (i.e. fix dead links)

Links to relevant discussions (where appropriate): BOTREQ

Edit period(s): One time run

Estimated number of pages affected: ~5000, but I might have to load up ~9800 since Special:LinkSearch is acting up.

Exclusion compliant (Yes/No): yes

Already has a bot flag (Yes/No): yes

Function details: Cornell has changed all of their US Legal Code links. There are some odd edge cases, though, which will require two sets of near-identical regex (URLs ending in a non-zero number require an extra hyphen).

http\S*(law.cornell.edu\/uscode\/)(?:html\/)?(?:uscode)?([0-9]*)\/usc_sec_[0-9]*_0*(\w+)-+([a-z]*)?0*([0-9]*).*?html
→https://www.$1text/$2/$3$4-$5
http\S*(law.cornell.edu\/uscode\/)(?:html\/)?(?:uscode)?([0-9]*)\/usc_sec_[0-9]*_0*(\w+)-+([a-z]*)?0*.*?html
→https://www.$1text/$2/$3$4
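To illustrate how the capture groups map onto the new URL form, here is a minimal Python sketch. AWB uses .NET regular expressions, so this assumes Python's `re` treats the pattern the same way. The pattern is the first one above, copied as written. `convert` and `_replacement` are hypothetical helpers that fold the two replacement forms into one callback, adding the hyphen only when the trailing number (group 5) is non-empty; how the two rules were actually ordered in AWB is not stated here. The second sample URL is made up for illustration.

```python
import re

# First pattern from the request, copied as written (dots left unescaped).
OLD_LINK = re.compile(
    r"http\S*(law.cornell.edu\/uscode\/)(?:html\/)?(?:uscode)?([0-9]*)"
    r"\/usc_sec_[0-9]*_0*(\w+)-+([a-z]*)?0*([0-9]*).*?html"
)


def _replacement(m: re.Match) -> str:
    """Build the new-style URL from the captured pieces of an old-style one."""
    host, title, section = m.group(1), m.group(2), m.group(3)
    letters = m.group(4) or ""   # optional letter suffix, e.g. "aa"
    number = m.group(5) or ""    # optional non-zero trailing number, e.g. "11"
    new = f"https://www.{host}text/{title}/{section}{letters}"
    # Assumption: the hyphenated form is used only when a trailing number exists.
    return f"{new}-{number}" if number else new


def convert(text: str) -> str:
    """Rewrite every old-style Cornell US Code link found in text."""
    return OLD_LINK.sub(_replacement, text)


print(convert("http://www.law.cornell.edu/uscode/42/usc_sec_42_00000300--aa011-.html"))
# https://www.law.cornell.edu/uscode/text/42/300aa-11
print(convert("http://www.law.cornell.edu/uscode/17/usc_sec_17_00000107----000-.html"))
# https://www.law.cornell.edu/uscode/text/17/107
```

Leading zeros are dropped by the `0*` pieces, which is why `00000300` becomes `300` and `011` becomes `11`.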

Capture regex (and sample URLs) on Rubular.

Discussion

This looks like a worthwhile task. Has it been confirmed that all URLs have definitely been changed, and not just a few? (i.e. will it only fix dead links in the old format?) TheMagikCow (T) (C) 13:52, 11 April 2017 (UTC)

While I clearly did not check every link, when I went through the 5000ish similar links I kept an eye out for any odd outliers. As near as I can tell, the only links that have been changed are the ones specifically for the /uscode/ with this particular format (there are some that follow a "usc_sec_##a" format, but those appear unchanged for now). That being said, every link I checked that held to the above regex has been changed (which was about 40). So assuming that they didn't decide to arbitrarily change only half their links, this bot task will only be fixing links that are dead. Primefac (talk) 17:48, 11 April 2017 (UTC)

Yeah, that seems to be good grounds for the task. What I would suggest is that the bot finds the link and checks if it is alive (HTTP 200); if it is not, it tests the new link; if that is good, it saves the edit. I am not sure how this would work with your code, but I feel that this is the safest way of changing URLs. Thoughts? TheMagikCow (T) (C) 20:09, 11 April 2017 (UTC)
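The check-then-replace flow suggested here could be sketched roughly as follows. This is not something AWB does; `http_status` and `pick_link` are hypothetical names, and the status lookup is passed in as a parameter so the decision logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable


def http_status(url: str) -> int:
    """Return the HTTP status of a HEAD request, or 0 if the request fails outright."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code          # e.g. 404 for a dead link
    except OSError:
        return 0                 # DNS failure, timeout, refused connection, ...


def pick_link(old: str, new: str, status: Callable[[str], int] = http_status) -> str:
    """Keep the old link if it is alive; switch only if the new one answers 200."""
    if status(old) == 200:
        return old
    if status(new) == 200:
        return new
    return old                   # neither works: leave the page untouched


# Offline demonstration with a fake status table (made-up values).
fake = {"http://old.example/a": 404, "https://new.example/a": 200}
chosen = pick_link("http://old.example/a", "https://new.example/a",
                   lambda url: fake.get(url, 0))
print(chosen)
```

Leaving the link unchanged when both requests fail is a design choice of this sketch, so that a temporary outage does not trigger an edit.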

As far as I'm aware, AWB doesn't check if links work. I'm happy to pass this on to someone who can do such a check, but quite honestly I see no reason for them to not change all the links for a given pattern if they're updating their systems. Primefac (talk) 20:34, 11 April 2017 (UTC)

I'm just a little wary of false positives and changing links that are still ok. Has anybody else got an opinion on this? TheMagikCow (T) (C) 08:27, 13 April 2017 (UTC)

I'm not a bot operator, but I originally filed the request. If in some cases both new and old links work, it'd be wisest to go with the new format (and the short form IS the newer one, adopted in early 2012 according to archive.org). It protects us against any future changes Cornell is likely to make to invalidate the old format. sarysa (talk) 15:44, 13 April 2017 (UTC)

Looking at some of the links, the old ones are returning 404 and the new ones 200. Thus, I feel that my method would work; whether it is the best approach is certainly up for debate. Basically, is this extra safety net needed to catch false positives? Overall, I can't see too many false positives, so I don't feel it is a major issue with the code, but it would just be a nice feature. I will certainly not oppose this just because that extra check is not included. TheMagikCow (T) (C) 20:08, 13 April 2017 (UTC)

Can you verify that a URL like this:

http://web.archive.org/web/20160909/http://www.law.cornell.edu/uscode/42/usc_sec_42_00000300--aa011-.html

won't get modified? It's not just archive.org that uses the "long" format now; other archive sites do as well. It's needed to avoid link shortening, which policy prohibits to prevent spam abuse. Links like this will be preceded either by a "/", as in this example, or by a "?url=". More info at WP:List of web archives on Wikipedia -- GreenC 20:54, 14 April 2017 (UTC)

I have just tried that link at www.regex101.com, with the regex at the top. There was a match, so it looks like this will need fixing in the regex. I think there are also a few other archive websites used on wiki. TheMagikCow (T) (C) 17:31, 15 April 2017 (UTC)

At least 20 archive services. It should be possible: don't match if the string is preceded by "/" or "?url=" -- GreenC 17:45, 15 April 2017 (UTC)
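A minimal Python sketch of this "don't match if preceded by" idea, using negative lookbehinds. Two things here are assumptions rather than anything proposed in the thread: the open-ended `http\S*` prefix is replaced by `https?://` plus an optional `www.` (otherwise a match could begin at the archive host and run into the embedded URL), and the tail of the pattern is simplified to a detection-only form. `GUARDED` and `is_rewritable` are hypothetical names, and the `archive.example` URL is made up.

```python
import re

# Two separate lookbehinds, because Python requires each one to be fixed-width.
GUARDED = re.compile(
    r"(?<!/)(?<!\?url=)"                      # not embedded in an archive URL
    r"https?://(?:www\.)?law\.cornell\.edu/uscode/"
    r"(?:html/)?(?:uscode)?[0-9]*/usc_sec_[0-9]*_\S*?\.html"
)


def is_rewritable(text: str) -> bool:
    """True if text contains an old-style link that is not part of an archive URL."""
    return GUARDED.search(text) is not None


cornell = "http://www.law.cornell.edu/uscode/42/usc_sec_42_00000300--aa011-.html"

print(is_rewritable(cornell))                                          # True
print(is_rewritable("[" + cornell + " 42 U.S.C. 300aa-11]"))           # True
print(is_rewritable("http://web.archive.org/web/20160909/" + cornell)) # False
print(is_rewritable("https://archive.example/?url=" + cornell))        # False
```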

Yeah, sorry for the late reply, forgot about this (was busy when I got the ping). I'll work on some code. Primefac (talk) 17:49, 15 April 2017 (UTC)

Okay, it's not pretty, but I just modified the \S to only accept at most 20 chars between the http and the law.cornell.

http[\S]{0,20}(law.cornell.edu\/uscode\/)(?:html\/)?(?:uscode)?([0-9]*)\/usc_sec_[0-9]*_0*(\w+)-+([a-z]*)?0*([0-9]*)?.*?html

This will avoid any archive URLs while still allowing for a (slightly ridiculous) range of prefixes. Primefac (talk) 17:57, 15 April 2017 (UTC)
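The effect of the `{0,20}` bound can be checked with a short Python sketch (again assuming Python's `re` behaves like AWB's .NET engine for this pattern, which has not been verified here). The bound stops a match from beginning at the archive host, but a scanning search may still begin at the embedded `http`, so it seems worth testing against archive URLs before relying on the bound alone; pairing it with a lookbehind of the kind GreenC suggested would be one option.

```python
import re

# Amended pattern from the comment above, copied as written.
BOUNDED = re.compile(
    r"http[\S]{0,20}(law.cornell.edu\/uscode\/)(?:html\/)?(?:uscode)?([0-9]*)"
    r"\/usc_sec_[0-9]*_0*(\w+)-+([a-z]*)?0*([0-9]*)?.*?html"
)

# Archive URL quoted earlier in the discussion.
ARCHIVED = ("http://web.archive.org/web/20160909/"
            "http://www.law.cornell.edu/uscode/42/usc_sec_42_00000300--aa011-.html")

anchored = BOUNDED.match(ARCHIVED)    # only tries a match starting at position 0
scanned = BOUNDED.search(ARCHIVED)    # tries every start position in turn

print(anchored)                       # None: the archive prefix exceeds 20 chars
print(scanned.start(), scanned.group(0))
```

In this sketch `scanned` begins at the inner `http://www.`, meaning a find-and-replace pass over the page text would still touch the embedded Cornell URL.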

I can't think of anything major now, so I would suggest a trial to check the false positive rate and any other unforeseen issues. TheMagikCow (T) (C) 08:33, 22 April 2017 (UTC)

OK, let's see 250 edits. Approved for trial (250 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. SQL^{Query me!} 02:48, 8 May 2017 (UTC)
- Trial complete. Edits. Primefac (talk) 23:57, 9 May 2017 (UTC)

@Primefac: Please check this edit: Special:Diff/779608200 looks like some overly greedy expression may be in use? — xaosflux ^Talk 14:43, 27 May 2017 (UTC)

Xaosflux, it's not technically overly greedy, it's a typo in the text itself. The first URL ends in htm l10. I can amend the regex to find html? just in case that sort of thing happens elsewhere. Primefac (talk) 14:48, 27 May 2017 (UTC)
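One plausible reading of why a stray `htm l10` looks like greediness: the lazy `.*?` keeps extending until it finds a literal `html`, which may belong to a later URL on the same line. The Python sketch below compares the strict ending with the proposed `html?` on a made-up line of text; the sample text and the names `STRICT` and `AMENDED` are hypothetical, and the actual content of the diff is not reproduced here.

```python
import re

BODY = (r"http[\S]{0,20}(law.cornell.edu\/uscode\/)(?:html\/)?(?:uscode)?([0-9]*)"
        r"\/usc_sec_[0-9]*_0*(\w+)-+([a-z]*)?0*([0-9]*)?.*?")
STRICT = re.compile(BODY + "html")     # ending as used in the trial
AMENDED = re.compile(BODY + "html?")   # proposed ending: accepts "htm" or "html"

# Made-up text: a link truncated to ".htm", followed by an unrelated ".html" link.
TEXT = ("http://www.law.cornell.edu/uscode/17/usc_sec_17_00000107----000-.htm l10 "
        "and later http://example.org/page.html")

strict_match = STRICT.search(TEXT).group(0)
amended_match = AMENDED.search(TEXT).group(0)

print(strict_match == TEXT)   # True: the match runs on to the later ".html"
print(amended_match)          # stops at "...000-.htm"
```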

Approved. — xaosflux ^Talk 15:14, 3 June 2017 (UTC)

The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.