
Wikipedia:Bots/Requests for approval/BaranBOT 2

From Wikipedia, the free encyclopedia

Operator: DreamRimmer

Time filed: 14:01, Monday, May 27, 2024 (UTC)

Automatic, Supervised, or Manual: automatic

Programming language(s): Python

Source code available:

Function overview: Fix the URLs for the ECI election database.

Links to relevant discussions (where appropriate):

Edit period(s): Every six months

Estimated number of pages affected: 5050

Exclusion compliant (Yes/No): No

Already has a bot flag (Yes/No): No

Function details: The Election Commission of India (ECI) has moved all of its data (except for very recent elections) to a subdomain. As a result, URLs on more than 5,000 pages are now invalid and return a 404 error. This bot will replace URLs like https://eci.gov.in/files/file/11699-maharashtra-legislative-assembly-election-2019 with the new URL https://old.eci.gov.in/files/file/11699-maharashtra-legislative-assembly-election-2019. In other words, it replaces the prefix https://eci.gov.in/ with https://old.eci.gov.in/.
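A minimal Python sketch of the prefix swap described above. The names `OLD_PREFIX`, `NEW_PREFIX` and `replace_prefix` are hypothetical and are not taken from the bot's source, which is not linked in this request.

```python
# Hypothetical names for illustration; the bot's actual source is not shown here.
OLD_PREFIX = "https://eci.gov.in/"
NEW_PREFIX = "https://old.eci.gov.in/"


def replace_prefix(text: str) -> str:
    """Replace every occurrence of the old ECI prefix with the archived one."""
    # Already-converted URLs are left alone, because "https://old.eci.gov.in/"
    # does not contain the substring "https://eci.gov.in/".
    return text.replace(OLD_PREFIX, NEW_PREFIX)


example = "https://eci.gov.in/files/file/11699-maharashtra-legislative-assembly-election-2019"
converted = replace_prefix(example)
```

Running the function twice gives the same result as running it once. A plain substring replacement like this would also rewrite an ECI URL embedded inside a longer URL, so it is only a starting point.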

Discussion

Why every six months? Primefac (talk) 18:28, 27 May 2024 (UTC)

In India, elections are held in 5-6 states every year. As the elections approach or conclude, the ECI moves data from previous elections to this subdomain. This means that many URLs will become invalid after each year's elections. – DreamRimmer (talk) 22:19, 27 May 2024 (UTC)
Apologies if this is coming across as dense, just want to make sure I'm on the same page. Let's arbitrarily say that there's an election in July 2024, and the URL for those pages starts with https://eci.gov.in/ since it's a "recent election". At what point will that URL get archived to the https://old.eci.gov.in/ prefix? If it is archived after the subsequent election, why not just update the URL with the new election information along with the data it represents? Primefac (talk) 15:00, 6 June 2024 (UTC)
The problem is that I don't know when ECI moves older election results to the old.eci URL. The recent elections, held in November 2023 in six states, were six months ago. So far, the ECI has moved three sets of election data to the old.eci domain. This suggests that they archive election data within six to ten months. For now, we can fix all these broken links, but we might need to do this again for future elections. If the BRFA folks think it's unnecessary to do this regularly (every six months), it's fine to handle it once. I'll try to submit a new BRFA in the future, and we can continue regularly if needed. – DreamRimmer (talk) 14:01, 7 June 2024 (UTC)
Previous discussion Wikipedia:Link_rot/URL_change_requests#ECI_-_Election_Commission_of_India. Geoblocking is preventing outside-India bots and DreamRimmer has India IP access. DreamRimmer, to caution, there are many non-obvious problems that can arise when operating on URLs. Probably the biggest is archive URLs you don't want to modify. This PCRE regex should capture only non-archive URLs (untested):
(?<!/)(?<!\\?url=)https?://eci[.]gov[.]in/[^\\s\\]|}{<]*[^\\s\\]|}{<]*
Also verify the new URL is working before switching: do a header check and don't assume, since websites always have error rates, some higher than others. Other issues might arise; most problems will show up during the first 100 or so edits. Common trouble points are |url-status=, {{webarchive}} and {{dead link}}, as well as links that are square and bare. It might be too difficult to get all of these exactly right; if you can change the main |url= and square URLs and verify the new URL works, that will go a long way! -- GreenC 15:51, 8 June 2024 (UTC)
I would definitely be cautious to avoid any potential mistakes. – DreamRimmer (talk) 16:57, 14 June 2024 (UTC)
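A rough Python sketch of the two checks GreenC suggests: skip archive URLs, and confirm the new URL responds before switching. It assumes the double backslashes in the quoted pattern are string-level escaping (so `\\s` is read as `\s`), and it drops the repeated trailing character class, which does not change what is matched. The names `ECI_URL`, `rewrite_live_urls` and `head_ok` are hypothetical, and treating only HTTP 200 as "working" is an assumption. Like the quoted pattern, this is a sketch rather than a tested bot.

```python
import http.client
import re
import urllib.request

# Assumed reading of the quoted pattern: an ECI URL that is not preceded by
# "/" or "?url=", i.e. not embedded inside an archive URL.
ECI_URL = re.compile(r"(?<!/)(?<!\?url=)(https?)://eci[.]gov[.]in/([^\s\]|}{<]*)")


def head_ok(url: str, timeout: float = 10.0) -> bool:
    """Send a HEAD request and report whether the server answers 200."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Covers HTTP errors such as 404, DNS failures and timeouts.
        return False


def rewrite_live_urls(wikitext: str, is_alive=head_ok) -> str:
    """Swap in the old.eci.gov.in host only where the new URL checks out."""
    def swap(match: re.Match) -> str:
        new_url = f"{match.group(1)}://old.eci.gov.in/{match.group(2)}"
        # Keep the original link untouched if the new one does not respond.
        return new_url if is_alive(new_url) else match.group(0)

    return ECI_URL.sub(swap, wikitext)


sample = (
    "{{cite web |url=https://eci.gov.in/files/file/1 "
    "|archive-url=https://web.archive.org/web/2020/https://eci.gov.in/files/file/1}} "
    "[https://eci.gov.in/files/file/2 Results]"
)
# A stub checker stands in for the network call in this example.
rewritten = rewrite_live_urls(sample, is_alive=lambda url: True)
```

Passing the checker as a parameter keeps the rewrite logic testable offline. The sketch does not touch |url-status=, {{webarchive}} or {{dead link}}, which would need separate handling.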