The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
Function details: Cornell has changed all of their US Legal Code links. There are some odd edge cases, though, which will require two sets of near-identical regex (URLs ending in a non-zero number require an extra hyphen).
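For illustration, the "two sets of near-identical regex" idea might look like the sketch below. The exact patterns aren't reproduced in this request, so the URL shapes here (the `usc_sec_` forms, the hyphen counts, and the `/uscode/text/` target) are hypothetical stand-ins for whatever the bot actually uses, not the real regexes.

```python
import re

# Hypothetical reconstruction: two near-identical patterns for the old-style
# links, one with an extra hyphen for sections ending in a non-zero digit.
BASE = r"https?://www\.law\.cornell\.edu/uscode/html/uscode(\d+)/usc_sec_\d+_0*(\d+)"
OLD_ZERO = re.compile(BASE + r"----000-\.html")      # section ends in 0
OLD_NONZERO = re.compile(BASE + r"-----000-\.html")  # non-zero: extra hyphen

NEW = r"https://www.law.cornell.edu/uscode/text/\1/\2"

def update(url: str) -> str:
    """Rewrite an old-style link to the new format; return it unchanged otherwise."""
    for pat in (OLD_NONZERO, OLD_ZERO):
        new_url, n = pat.subn(NEW, url)
        if n:
            return new_url
    return url
```

Both patterns share the same replacement; only the trailing hyphen run differs, which is why two near-identical regexes suffice.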
This looks like a worthwhile task. Has it been confirmed that all of the URLs have been changed, and not just a few? (i.e. will it only fix dead links in the old format?)
TheMagikCow (T) (C) 13:52, 11 April 2017 (UTC)
While I clearly did not check every link, when I went through the 5000-ish similar links I kept an eye out for any odd outliers. As near as I can tell, the only links that have been changed are the ones specifically for /uscode/ pages with this particular format (there are some that follow a "usc_sec_##a" format, but those appear unchanged for now). That being said, every link I checked that matched the above regex has been changed (about 40 in all). So assuming that they didn't decide to arbitrarily change only half their links, this bot task will only be fixing links that are dead.
Primefac (talk) 17:48, 11 April 2017 (UTC)
Yeah, that seems to be good grounds for the task. What I would suggest is that the bot finds the link and checks whether it is alive (HTTP 200); if it is not, it tests the new link, and if that one is good, it saves the edit. I am not sure how this would work with your code, but I feel that this is the safest way of changing URLs. Thoughts?
TheMagikCow (T) (C) 20:09, 11 April 2017 (UTC)
As far as I'm aware, AWB doesn't check if links work. I'm happy to pass this on to someone who can do such a check, but quite honestly I see no reason for them to not change all the links for a given pattern if they're updating their systems.
Primefac (talk) 20:34, 11 April 2017 (UTC)
I'm not a bot operator, but I originally filed the request. If in some cases both new and old links work, it'd be wisest to go with the new format (and the short form IS the newer one, adopted in early 2012 according to archive.org). It protects us against any future changes Cornell is likely to make that would invalidate the old format.
sarysa (talk) 15:44, 13 April 2017 (UTC)
Looking at some of the links, the old ones return 404 and the new ones return 200, so I feel that my method would work; whether it is the best approach is certainly up for debate. Basically, is this extra safety net needed to catch false positives? Overall I can't foresee many false positives, so I don't feel it is a major issue with the code; it would just be a nice feature. I will certainly not oppose this just because that extra check is not included.
TheMagikCow (T) (C) 20:08, 13 April 2017 (UTC)
Won't get modified? It's not just archive.org that uses the "long" format now; other archive sites do as well. The long format is needed to prevent link shortening, which is policy to prevent spam abuse. Links like this will be preceded either by a "/" (as in this example) or by "?url=". More info at WP:List of web archives on Wikipedia -- GreenC 20:54, 14 April 2017 (UTC)
I have just tried that link at regex101.com with the regex at the top. There was a match, so it looks like this will need fixing in the regex. I think there are also a few other archive websites used on-wiki.
TheMagikCow (T) (C) 17:31, 15 April 2017 (UTC)
OK, let's see 250 edits. Approved for trial (250 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. SQL (Query me!) 02:48, 8 May 2017 (UTC)
Xaosflux, it's not technically overly greedy; it's a typo in the text itself. The first URL ends in "htm l10". I can amend the regex to find "html?" just in case that sort of thing happens elsewhere.
Primefac (talk) 14:48, 27 May 2017 (UTC)
The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.