User talk:GreenC bot/Archive 5

You can stop the bot by pushing the stop button. The bot sees this and immediately stops running. Unless it is an emergency, please consider reporting problems to my talk page first.


Bad link

Hello, the link the bot just added to Jamey Johnson just turned out to be an HTTP 404 page. Graham87 07:55, 7 July 2020 (UTC)[reply]

It's a soft-404 in the wayback database which reports status 200. I added a {{cbignore}} to keep bots off the cite. -- GreenC 12:51, 7 July 2020 (UTC)[reply]

Archived from the original link to irrelevant page

Hello. The link was working fine at Fred Broussard before and still is, but the bot reformatted the link and added the "archived from the original" link to an irrelevant web page. The proper archived link is still there, but what was the point of that? I'm still rather new at this, so am I doing something wrong how I format archived links, or is the bot just doing something dumb? (or is this question dumb?) --DB1729 (talk) 03:21, 9 July 2020 (UTC)[reply]

When a |url= is dead, like http://www.databasefootball.com/teams/teamyear.htm?tm=PIT&yr=1955&lg=nfl (technically this is a soft 404 since it is still live but redirects to a wrong page), we add a web archive in the |archive-url= field and retain the original dead url in the |url= field. For example in this case, the web archive URL was incorrectly in the |url= field, so it was moved to the |archive-url= and the original URL was added to |url=. Since the original is a soft 404 and you would prefer it not be displayed/shown, you could set |url-status=unfit -- GreenC 03:30, 9 July 2020 (UTC)[reply]
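The field layout described above can be sketched in Python. This is only an illustration, not the bot's actual code: the function name `split_wayback` is hypothetical, and it assumes the usual Wayback Machine snapshot form `https://web.archive.org/web/<14-digit timestamp>/<original URL>`. Given a snapshot URL that was wrongly placed in |url=, it recovers the original URL and proposes the four citation fields.

```python
import re
from datetime import datetime

# Assumed snapshot form: /web/<14-digit timestamp><optional modifier>/<original URL>
WAYBACK_RE = re.compile(r"^https?://web\.archive\.org/web/(\d{14})[a-z_]*/(.+)$")

def split_wayback(url):
    """Return suggested citation fields for a Wayback snapshot URL, or None."""
    m = WAYBACK_RE.match(url)
    if m is None:
        return None  # not a recognisable Wayback snapshot URL
    stamp, original = m.groups()
    archive_date = datetime.strptime(stamp, "%Y%m%d%H%M%S").date().isoformat()
    return {
        "url": original,            # original source URL stays in |url=
        "archive-url": url,         # the web archive goes in |archive-url=
        "archive-date": archive_date,
        "url-status": "dead",       # or "unfit" if the original should be hidden
    }

fields = split_wayback(
    "https://web.archive.org/web/20160821072959/http://www.basketballireland.ie/leaguecup"
)
```

Other archive providers use different URL shapes (some with shorthand IDs), so a real tool would need one pattern per provider.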

OK, so in the future, when I link that web page again, put the archived, good url in the |archive-url= and set |url-status=unfit. No problem, but it sounds like I still need to put the original, dead url in the |url= field? --DB1729 (talk) 03:45, 9 July 2020 (UTC)[reply]

Yes. Normally |url-status=dead is standard, but some editors don't like the dead link being displayed so they use unfit, but that's sort of a hack since unfit was designed for URLs "unfit" for public display (porn etc). There's nothing wrong with a dead link being displayed, as it says "archived from the original". The idea is that, should the archive URL stop working, someone can still determine the original source URL (which seems obvious since it's part of the archive.org URL, but not all archive providers do that all the time; some use shorthand URLs). -- GreenC 03:54, 9 July 2020 (UTC)[reply]

Got it. Sounds like a pain but I will do it that way from now on. Unfortunately I've done a bunch of similar links the wrong way in the last week. Should help keep your bot busy. Thank you very much for your time and your very helpful explanations. --DB1729 (talk) 04:04, 9 July 2020 (UTC)[reply]

Weird uncommenting

This edit left three citations with open comments by removing the closing of the comment. I'm not sure what was meant here. Jerod Lycett (talk) 04:49, 10 July 2020 (UTC)[reply]

This is a bug, it is fixed. -- GreenC 14:06, 10 July 2020 (UTC)[reply]

Possible Issue

The bot added the website link to a newspaper article (via the newspaper's website) to an Internet Archive link and then marked the link as dead. Now, clearly, if I am using an Internet Archive link, then the link on the newspaper website is going to be dead. Since this is a Featured Article, dead links can't be afforded. Regardless of that point, this seems like something that should be fixed before the bot should continue. Adding dead links to a live Internet Archive link and then marking everything as dead seems counterintuitive. - Neutralhomer • Talk • 23:55 on July 14, 2020 (UTC) • #StayAtHome • #BlackLivesMatter

How citation templates work: add the source URL to the |url= field, in this case http://www.winchesterstar.com/article/0625radio .. if/when the source URL dies, add three additional web archive arguments: |archive-url=, |archive-date= and |url-status=dead. This is documented and is standard for millions of templates. In this case the template incorrectly had the web archive URL in the |url= field. There are certain (rare) conditions where that is desirable, for example when the citation is specifically citing the Wayback Machine itself, but that is not the case here. -- GreenC 01:30, 15 July 2020 (UTC)[reply]

I know, I co-wrote that article with a few people. This is the active link which you (and your bot) are linking over with a dead link and marking as dead. There is no other link so I am forced to use the Internet Archive. The Winchester Star link you (and the bot) are adding (if you look at it) is a simple 404 page. There's nothing there. You are adding a 404 page. So, unfortunately, I have to revert that for these two reasons.

Adding a dead link over a live one and adding a 404 page. Neither of these are helpful. Normally I wouldn't care, but this is a featured article, so I care. - Neutralhomer • Talk • 02:07 on July 15, 2020 (UTC) • #StayAtHome • #BlackLivesMatter

I hear what you are saying. It sounds like you don't understand what a web archive is, or how the template works. Archive.org is a web archive (<-- click to learn more what a web archive is) which is a special type of link. Examples are archive.org, archive.today, webcitation.org. There is a specially designated place just for web archive URLs, which is |archive-url=. Web archive URLs don't go in the |url= field. They belong in the |archive-url= field. The |url= field is for the original source URL. It doesn't matter the source URL is dead (404), we still maintain it in the citation (for reasons). Then set the |url-status=dead which tells the citation to display the archive URL and not the source URL. It is done like this in literally millions of citations, including every featured article. -- GreenC 03:49, 15 July 2020 (UTC)[reply]

I do understand what all those are. But the link you (and the bot) are adding isn't so much "dead" as it is "incorrect". A 404 error means it doesn't exist anymore and that's what is occurring. Using the Internet Archive link gives the same information, the same link, from the day it was published (June 25, 2016) and that link is "live" and "correct".

Now, because this is a Featured Article, none of the references can be marked as "dead". This is why I am riding this. I, and several others, worked far too hard on this article to lose the FA status over something petty like this. Normally, I wouldn't care, but I have a vested interest in this article. Blood, sweat, and tears...which is a little dramatic, but you get my point. I can't let a dead link and a dead link mention override a live Internet Archive link and potentially cost the article its Featured Article status. I just can't. - Neutralhomer • Talk • 01:11 on July 16, 2020 (UTC) • #StayAtHome • #BlackLivesMatter

Almost none of that is accurate though. Turning over to the community at Help_talk:Citation_Style_1#Dead_link_in_a_Featured_Articles. -- GreenC 01:58, 16 July 2020 (UTC)[reply]

Neutralhomer, where in Wikipedia:Featured article criteria or related pages does it say that dead links are not allowed when an archived version of the cited source is available? We at Wikipedia can't control whether a source is currently available on the web, just as we can't control whether a specific book or journal article is available at your local library. It may be that you, or another editor, has misremembered or misinterpreted some criterion. – Jonesey95 (talk) 02:20, 16 July 2020 (UTC)[reply]

@Neutralhomer: I've worked on dozens of FAs, and a few FLs, and most of them have links that have gone dead. I've added |archive-url= |archive-date= and |url-status= and moved on. None of them have been nominated for FAR/FLR on that basis. None. You're safe here, and there's nothing to worry about. If someone tries to nominate the article on that basis, let me know because I'll have a trout ready. Imzadi 1979 → 02:37, 16 July 2020 (UTC)[reply]

@Jonesey95: With all due respect, I would greatly appreciate if everyone would not assume I am a complete idiot when it comes to Wikipedia. I've been doing this for 14 years, I got this. As for "where", dead links in references are kinda one of those things that are frowned upon really anywhere. But in GA and FA articles, it's VERY much frowned upon. Look at the GAR, Peer Review, and FAC and you'll see what I mean. - Neutralhomer • Talk • 02:46 on July 16, 2020 (UTC) • #StayAtHome • #BlackLivesMatter

@Imzadi1979: That's because those editors don't update their articles and references. I do...constantly. A little OCD mixed with a BIG bunch of Autism makes the WINC (AM) article stay updated constantly. Even 6 years after it went to FA status. I updated most of the references during quarantine (had nothing else better to do) so they are good and up-to-date. Yes, out of date FAs can be nom'd for FAR/FLR. Doesn't happen often, but it has happened. I'd rather stay updated and have no dead links and my live links not marked dead and no 404 links added in their place, than tempt fate. - Neutralhomer • Talk • 02:53 on July 16, 2020 (UTC) • #StayAtHome • #BlackLivesMatter

Nevermind. Issue dealt with. I just found a better link outside of the Internet Archive which between May and now, I didn't know existed. Problem solved! - Neutralhomer • Talk • 03:07 on July 16, 2020 (UTC) • #StayAtHome • #BlackLivesMatter

Good detective work in finding that archive. I see that you unfixed one of my edits that put "Jr" in the right place, per MOS:JR ("When the surname is shown first, the suffix follows the given name, as Kennedy, John F. Jr.").

Also, I looked in those GAR/PR/FAC pages that you linked to, and I didn't find the word "dead" or relevant uses of "archive", so I was unable to see what you mean (sorry, I am somewhat literal sometimes). So again, where are the FA criteria or consensus discussions that forbid archived, non-working urls, which are present in zillions of articles? Dead links in references are not "frowned upon" as long as there is an archived page available. I don't think you have anything to worry about when there are dead urls in citations; please don't put the archive-url in the url parameter. – Jonesey95 (talk) 03:19, 16 July 2020 (UTC)[reply]

Maybe it's because I'm of a military background (ie: Navy brat), but it would be written as "Kennedy Jr., John F.", since the "Jr." is technically part of the last name.

As for the lack of "dead" links, you won't find any. As this link isn't truly dead, just a 404 page. This Internet Archive link (which goes to the same location) is correct and is not a 404.

Don't worry about being literal. I'm Autistic, I'm always literal. :) - Neutralhomer • Talk • 23:31 on July 16, 2020 (UTC) • #WearAMask • #BlackLivesMatter

A 404 error is a dead link. It means "page not found". And as for "Jr", MOS:JR makes it clear where it goes when the last name is written first. Happy editing! – Jonesey95 (talk) 05:18, 17 July 2020 (UTC)[reply]

Bad link II

The bot really shouldn't have made this edit to come out with this useless link, which is just a standard Wayback Machine error page. I've updated the relevant data and link. Graham87 06:00, 16 July 2020 (UTC)[reply]

Unfortunately this is an error in the Wayback database/api reporting the link available and status 200. Normally the bot then verifies it is working by checking the HTML of the page, but since it is a (supposed) PDF it bypasses the check. I'll see what can be done for edge cases like this. -- GreenC 13:31, 16 July 2020 (UTC)[reply]

I took a closer look at this and it actually does check PDFs. The problem is the Wayback page is doing some unusual JS redirect magic which I don't fully understand which is throwing off the API and the bots. I'll report it to Wayback. -- GreenC 14:13, 16 July 2020 (UTC)[reply]

Breaking edit

This edit broke things, because of the lack of closure on {{dead link}}. Jerod Lycett (talk) 11:19, 18 July 2020 (UTC)[reply]

Breaking again

It appears it removed the closing part of a comment left after a citation this time. Edit made.

Fixed. -- GreenC 14:04, 21 July 2020 (UTC)[reply]

The bug is recorded in the logs so I checked the prior 5 year history of Wayback and this was the only instance, having processed millions of articles. It required a very unusual combination of conditions. -- GreenC 14:14, 21 July 2020 (UTC)[reply]

trashed the citation template

With this edit the bot trashed a citation template. I assume that is because it was taking something from the (wholly unreadable) archive. Perhaps check everything that the bot is going to add that is new and don't make changes if the 'new' contains replacement chars?

—Trappist the monk (talk) 15:18, 24 July 2020 (UTC)[reply]

Another example of a similar issue.

—Trappist the monk (talk) 15:34, 24 July 2020 (UTC)[reply]

Trappist the monk : About a year ago WebCite began randomly rendering replacement characters on various pages, typically Cyrillic and other non-Latin sites, so there are a lot of archive pages that used to work and no longer do. I worked with a Russian programmer who checked each WebCite archive page, as listed in the IABot database, and gave it a percentage score for how badly mangled the page was, and using WaybackMedic replaced those with other archive providers where possible (but only in the IABot database). This should prevent IABot from adding any new mangled archives. Nothing has been done for Enwiki's existing corpus of webcite links, or any other language site (the Russian programmer did work on ruwiki). I should find that list and incorporate a check into WaybackMedic, so that when it comes across a suspect URL it replaces it with another provider.

In the mean time, to avoid adding mangled URLs in CS1|2, can you tell me which replacement characters generate this error? -- GreenC 16:21, 24 July 2020 (UTC)[reply]

cs1|2 sees the character itself: � (actually as a decimal representation of the three-byte percent encoded code-point, %EF%BF%BD, because Lua doesn't do hex – what were they thinking when they took that decision?). I do not know where the replacement character actually comes from. Did the bot get it from the source? Did the bot attempt to un-percent-encode the percent-encoded value assigned to the word attribute in this url's query string:

http://www.gramota.ru/slovari/dic/?pe=x&word=%EF%E0%E2%F1%E8%EA%E0%EA%E8%E9

that may be because there are ten %xx in the string and ten replacement characters in the url that the bot created:

https://www.webcitation.org/6AMFU7nts?url=http://www.gramota.ru/slovari/dic/?pe=x&word=��

Still, all that cs1|2 sees is the replacement char and it alarms on that.

—Trappist the monk (talk) 17:05, 24 July 2020 (UTC)[reply]

Turns out to be two unrelated problems. The bot was incorrectly percent decoding, which caused the replacement character to be displayed. The bug was logged over the bot's lifetime and it happened in 20 articles. 16 had already been repaired (about half by yourself), and the bot just did the remaining four (example). -- GreenC 21:04, 25 July 2020 (UTC)[reply]
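A minimal Python sketch of how the replacement character can arise from percent-decoding, and of the guard suggested earlier in this thread (don't write new text that contains replacement characters). The helper names are hypothetical and this is not the bot's code. It also assumes the `word` value is encoded as Windows-1251, which is a guess based on the site being Russian, not something stated in the thread.

```python
from urllib.parse import unquote

REPLACEMENT = "\ufffd"  # U+FFFD, percent-encoded in UTF-8 as %EF%BF%BD

def decode_query_value(value, encoding="utf-8"):
    """Percent-decode a value; undecodable bytes become U+FFFD."""
    return unquote(value, encoding=encoding, errors="replace")

def is_safe_to_write(url):
    """Reject URLs carrying a literal or percent-encoded replacement character."""
    return REPLACEMENT not in url and "%EF%BF%BD" not in url.upper()

word = "%EF%E0%E2%F1%E8%EA%E0%EA%E8%E9"
mangled = decode_query_value(word)                      # decoded as UTF-8: mangled
intended = decode_query_value(word, encoding="cp1251")  # assumed source encoding

base = "http://www.gramota.ru/slovari/dic/?pe=x&word="
ok_as_is = is_safe_to_write(base + word)         # left percent-encoded: fine
ok_decoded = is_safe_to_write(base + mangled)    # wrongly decoded: rejected
```

Leaving the query string percent-encoded avoids the problem entirely, since the bytes are never interpreted in the wrong encoding.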

Perhaps there are still lurking problems?

—Trappist the monk (talk) 23:29, 6 August 2020 (UTC)[reply]

The WebCite API returned a URL with a trailing "%" (erroneous) which uncovered a bug in my urldecode() since it expects 2 characters following. It now verifies there are 2 characters following otherwise "decodes" it as a literal %. Good fix to a core function. -- GreenC 01:14, 7 August 2020 (UTC)[reply]
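The fix described above can be sketched as follows. This is a hypothetical Python re-implementation, not the bot's actual urldecode(): a `%` is only decoded when two characters follow it, otherwise it is kept as a literal `%`. The additional check that both characters are hex digits is an assumption added here.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def urldecode_bytes(text):
    """Percent-decode text to bytes, keeping malformed '%' sequences literally."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        pair = text[i + 1:i + 3]  # may be shorter than 2 near the end of the string
        if ch == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            out.append(int(pair, 16))  # valid %XX escape
            i += 3
        else:
            out.extend(ch.encode("utf-8"))  # ordinary character or stray '%'
            i += 1
    return bytes(out)
```

Returning bytes rather than text leaves the choice of character encoding to the caller, which sidesteps the mis-decoding issue discussed earlier in this thread.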

Why is the bot changing https links to http?

Hi. I am puzzled why in this edit the bot changed two "https" links to "http". The site in the URLs, TechCrunch, uses only https and forces all http connections to reconnect as https. There is therefore no value in changing the URLs from https to http. Is there any reason why the bot is making this change? - Dyork (talk) 00:42, 25 July 2020 (UTC)[reply]

The URL was archived at WebCite in 2014, with http, presumably before TechCrunch had https. The bot detects that and updates both the archive and url fields to be in sync. However, it is not ideal if the site has since migrated to https. I added a new feature: if the url field is already https, the bot will not downgrade it to http. -- GreenC 03:54, 25 July 2020 (UTC)[reply]
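The no-downgrade rule can be sketched in a few lines of Python. The function name `choose_url` and its two arguments are hypothetical; this shows only the decision, not how the bot is actually structured.

```python
from urllib.parse import urlsplit

def choose_url(current_url, archived_original):
    """Pick the value for |url= when syncing it with the archived original URL."""
    current_scheme = urlsplit(current_url).scheme
    archived_scheme = urlsplit(archived_original).scheme
    if current_scheme == "https" and archived_scheme == "http":
        return current_url  # never downgrade an existing https link
    return archived_original  # otherwise keep |url= in sync with the archive

kept = choose_url("https://techcrunch.com/story", "http://techcrunch.com/story")
```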

Bot removed 2 working archiveurls

The bot removed two working archiveurls here and here. Is there an underlying reason? DaHuzyBru (talk) 00:58, 30 July 2020 (UTC)[reply]

The pages report 404:


  ./header 'https://web.archive.org/web/20160821072959/http://www.basketballireland.ie/leaguecup'

  HTTP/1.1 404 Not Found
  Server: nginx/1.15.8
  Date: Thu, 30 Jul 2020 02:26:46 GMT
  Content-Type: text/html; charset=UTF-8

I've never seen a working page with a 404 header. Would think it impossible. Like a reverse soft-404. Probably an error in the Wayback Machine. I tested some other basketballireland.ie URLs and they don't have the problem so maybe only these two. I'll add some {{cbignore}} so the bot won't mess with them and report it to Wayback. -- GreenC 02:44, 30 July 2020 (UTC)[reply]

Some falafel for you!

For your edit in the same article. ISL fan (talk) 07:01, 30 July 2020 (UTC)[reply]

broke |url-status= parameter value

This edit?

—Trappist the monk (talk) 19:35, 10 August 2020 (UTC)[reply]

The comment unwinds will be verified manually prior to uploading from here on out. Too much GIGO potential and there are not many of them. -- GreenC 13:49, 11 August 2020 (UTC)[reply]

wikicomment involved in bot confusion

This edit likely confused GreenC bot. I have reverted it, but the incident deserves some investigation why the bot made the edit and probably to improve its logic to prevent such a problem. Uncommenting wikitext is probably not a good idea. Regards, —EncMstr (talk) 17:29, 13 August 2020 (UTC)[reply]

I started manually checking these (see above) but something went wrong in the processes this run and missed the step. Should be fixed. -- GreenC 18:22, 13 August 2020 (UTC)[reply]

Fanmade episode of TTG!

GreenC bot, Greeny Titans GO! is a fake Teen Titans Go! episode. It's Brain Food episode for Teen Titans Go!, real 43rd episode. — Preceding unsigned comment added by 2600:1700:4210:2450:C05D:DBF:45B0:A10D (talk) 04:51, 21 August 2020 (UTC)[reply]

Bot removed opening HTML comment bracket

Hi Green, I'd like to report a glitch incorporated with this edit: [1]. The bot removed an opening <!-- from a HTML comment while removing an unused parameter, causing the comment to become live and causing some disturbance. --Matthiaspaul (talk) 09:22, 1 September 2020 (UTC)[reply]

Don't remove url-status=dead from citations

Hi Green,

Please don't let your bot remove |url-status= from citations if |archive-url= is not present. Both |url-status= and |archive-url= belong to |url=, but |url-status= does not depend on |archive-url= semantically. While it may be debatable whether |url-status=live makes sense without |archive-url= (depending on whether editors prefer things to be spelled out explicitly or rely on the implicit default), |url-status=dead/usurped definitely makes sense to be recorded in a citation even without |archive-url= (and regardless of whether the current version of the template shows some special behaviour for this combination or not).

On a different note, I welcome it when GreenC bot removes empty parameters from citations that are the result of empty template prototypes being copied into an article and never filled out; however, I hate it when the bot removes empty parameters deliberately inserted into citations to indicate that some important info is still missing. As a compromise to satisfy both parties as much as possible, perhaps you can improve the bot rule regarding the removal of empty parameters to do it only if a certain number of empty parameters has been found in a citation, and leave empty parameters alone if only a few of them are found. I think a threshold value of 4 or 5 would be a good compromise, still cleaning up junk from citations but not hindering editors in improving citations manually and communicating with each other through this "light-weight communication process".

Regarding both issues, see also: Help_talk:Citation_Style_1#Invalid_parameters_ignored_when_empty

Thanks.

--Matthiaspaul (talk) 16:44, 21 September 2020 (UTC)[reply]

The bot operates according to existing methods and documentation. If a URL goes dead, this condition is recorded with {{dead link}}. The |url-status=dead has never been used for that purpose. It exists to determine how |archive-url= is displayed. You may logically aspire for it to work the same way, but it does not, at this time. It does not display a visual [dead link] tag, it does not add to tracking categories, it does not have a |date= parameter for when the URL was tagged dead, it does not have a |bot= option, the bots that check for dead links do not check for it when looking for dead links, and the vast majority of users understand all of this. If you want to change how things work and address these issues, please go ahead. As for empty arguments, I would remove them doing manual edits, AWB edits or bot edits because I think it is the right thing to do. They are added clutter and expand the size of the text. I disagree that they should be used for personal work flow purposes or to try and control other editors into filling them in - no one can guess that is the purpose, be it human, AWB or bot. Anyone, bot or person, can remove empty parameters at any time because they have no bearing on how the template displays text, which is the only purpose of the template. The bot has been doing this for over 5 years and has removed countless empty archive-related arguments, reducing complexity, clutter and article size, which has been a benefit to the project as a whole. -- GreenC 21:08, 24 September 2020 (UTC)[reply]

Cross-database JOINs going away

I don't know if you've been following the Cloud mailing list, but your input on the Wiki Replicas 2020 Redesign thread would be appreciated. The TLDR is that cross-database joins, like the one used in the query for User:GreenC bot/Job 10, will no longer work starting in January/February 2021. --AntiCompositeNumber (talk) 20:52, 16 November 2020 (UTC)[reply]

AntiCompositeNumber, thanks for the heads up. I didn't write the query and have no idea how to solve this with SQL. It's a basic list intersection problem which Unix can solve in under 50MB eg. comm -12 <(sort --buffer-size=50M enwiki-list.txt) <(sort --buffer-size=50M commons-list.txt) .. the problem is retrieving the Commons list of 105,693,753 via sequential API requests could take upwards of 70 hours. There is commonswiki-20201101-all-titles.gz in the Dumps but only updated monthly. Do you know if SQL could download a list of all File: titles on Commons in a way that is fairly fast and direct to a file? Given that the rest is solved. -- GreenC 00:08, 17 November 2020 (UTC)[reply]
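For illustration, the same list intersection that `comm -12` performs on two sorted files can be sketched in Python as a streaming merge. The function name is hypothetical, and both inputs are assumed to be already sorted in the same order (Python compares strings by code point, which may differ from a locale-aware `sort`).

```python
def intersect_sorted(a, b):
    """Yield items present in both sorted iterables, holding one item of each in memory."""
    ia, ib = iter(a), iter(b)
    x, y = next(ia, None), next(ib, None)
    while x is not None and y is not None:
        if x == y:
            yield x
            x, y = next(ia, None), next(ib, None)
        elif x < y:
            x = next(ia, None)  # advance the side that is behind
        else:
            y = next(ib, None)

enwiki = ["August logo.png", "Example.jpg", "Zebra.png"]
commons = ["August logo.png", "Map.svg", "Zebra.png"]
shadows = list(intersect_sorted(enwiki, commons))
```

Because only the current item from each side is held, memory use is independent of list length; the expensive part remains obtaining the two lists.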

This is now Phab T267992. -- GreenC 00:50, 17 November 2020 (UTC)[reply]

@GreenC: You would still have to paginate the query, as a single query for 65 million files doesn't work well. We can use larger chunk sizes than the API though, with a few thousand records per page. https://public.paws.wmcloud.org/User:AntiCompositeBot/ShadowsCommonsQuery.ipynb is a quick implementation of that in Python. Unfortunately the biggest bottleneck isn't the database, but the network between the database and the application. --AntiCompositeNumber (talk) 19:04, 17 November 2020 (UTC)[reply]
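One common way to paginate such a query is keyset pagination: each chunk resumes after the last title seen. The Python sketch below is not the linked notebook's code. It uses an in-memory SQLite database as a stand-in for the replica, and the table name `files`, the column `title`, and the function name are all hypothetical.

```python
import sqlite3

def iter_titles(conn, chunk_size=1000):
    """Yield all titles in sorted order, fetching chunk_size rows per query."""
    last = ""  # assumes no title is the empty string
    while True:
        rows = conn.execute(
            "SELECT title FROM files WHERE title > ? ORDER BY title LIMIT ?",
            (last, chunk_size),
        ).fetchall()
        if not rows:
            return
        for (title,) in rows:
            yield title
        last = rows[-1][0]  # resume after the last title of this chunk

# Stand-in data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (title TEXT PRIMARY KEY)")
sample = ["G.png", "A.png", "C.png", "F.png", "B.png", "E.png", "D.png"]
conn.executemany("INSERT INTO files VALUES (?)", [(t,) for t in sample])
titles = list(iter_titles(conn, chunk_size=3))
```

Since the output arrives already sorted, it could be written straight to a file and fed to a list intersection without a separate sort step.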

Shadowscommons

Your bot adds {{ShadowsCommons}}, but will it also remove it again, for example on File:August logo.png? — Alexis Jazz (talk or ping me) 06:32, 24 November 2020 (UTC)[reply]

The removal is done manually once someone checks into the situation causing the shadow. I'm not too familiar with that side of it but there are people who monitor the tracking category Category:Wikipedia files that shadow a file on Wikimedia Commons -- GreenC 14:50, 24 November 2020 (UTC)[reply]

Why does this require humans? Your bot can tag it, and your bot could detect it if the Commons file is deleted and untag the file.. I mean, why not? — Alexis Jazz (talk or ping me) 07:18, 25 November 2020 (UTC)[reply]

The bot isn't designed that way, it uses a SQL query that removes files already tagged thus File:August logo.png is invisible to the bot. It would require a separate process to load pages in the tracking category and verify existing tags. At the moment the bot will need a lot of work before February to account for changes in the SQL servers so I will keep this in mind as a possible feature once I get into the code of it again. -- GreenC 14:41, 25 November 2020 (UTC)[reply]

Archiveurl deleted

Can you explain why the bot made this edit? It says it reformatted the archive url, but all it did was remove it completely. Bob1960evens (talk) 09:10, 30 November 2020 (UTC)[reply]

The |archive-url= field is meant for web archives such as archive.org or archive.today (full list at WP:WEBARCHIVES). Having a non-web-archive URL in the field prevents IABot from adding a proper archive URL when the URL dies, creating link rot. Thus the removal to free up the space. Typically it would have replaced it with a web archive, thus the reformat message, but in this case the URL is still live so there is nothing to do but delete the old URL. The edit summary isn't properly reflecting that. -- GreenC 13:35, 30 November 2020 (UTC)[reply]

Thanks. I obviously copied the wrong url into the archiveurl field, but have corrected that now. Bob1960evens (talk) 13:09, 1 December 2020 (UTC)[reply]

A live url doesn't mean that it supports the cited content that was archived, as some sites such as the CIA Factbook get updated and information is either added, deleted or simply changed, or that the updated version is reliable. The blind removal of the archived urls in this instance is unnecessarily creating problems that need to be sorted out by hand. M.Bitton (talk) 16:11, 25 January 2021 (UTC)[reply]

Capabilities of job 18 (cite%20note fixing)

I recently independently rediscovered this VisualEditor bug when editing, and saw that you had a script to fix it. Two questions:

I discovered it in the form "cite note" rather than "cite%20note", and a cursory search suggests that the former is (at least now) much more common than the latter. Could you adapt the bot to handle the former as well as the latter?
The edit period is noted as "manual". Are the remaining 40-or-so cases places where the bot couldn't deal with it (and therefore it would be worthwhile to fix by hand), or are they just artifacts of the bot not having run in a while?

Vahurzpu (talk) 04:30, 8 December 2020 (UTC)[reply]

Oh no. I didn't know about "cite note". The VE bug causing this creates endless varieties of syntax errors that have to be determined and programmed for, and new ones keep showing up, like whack-a-mole. The bot ran recently; the 40 remaining cases of "cite%20note" are ones the bot couldn't resolve. They may be beyond reasonable repair. -- GreenC 05:07, 8 December 2020 (UTC)[reply]

Bryanston.

Hello, Do you know please how to edit the typo in the Ronald Neame quotation in this article ? Thanks 79.73.43.190 (talk) 20:14, 11 February 2021 (UTC)[reply]

removing archives

Hi, in this edit, the bot removed citations with archived copies and replaced them with "better" URLs but stripped the archive-urls. I've since manually fixed these but this is not good behaviour at all. It would have been better for it to leave well enough alone. Could you please mod it so that it doesn't replace URLs where there are archived copies with ones where there aren't? (or where it's too limited to find them, at least?) —Joeyconnick (talk) 03:33, 12 February 2021 (UTC)[reply]