Template talk:Webarchive/Archive 1
This is an archive of past discussions about Template:Webarchive. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
One template
Hi folks, {{webarchive}} is an old redirect to {{wayback}} that is no longer in use. I'm thinking of starting a new project in Lua to repurpose this template to aggregate {{wayback}}, {{webcite}}, {{memento}}, {{cite archives}} (and any others) into a single generic web archive template. It would also add new features listed below. This came out of the discussion at Wikipedia:Templates_for_discussion/Log/2016_October_7#Template:Cite_additional_archived_pages.
A number of problems were identified with the current methods:
- Lack of support for multiple archive links such as exists in {{cite archives}}. This is true in CS1|2 templates like {{cite web}}, and in templates like {{wayback}}.
- Lack of support for archives other than wayback and webcite. There are dozens of web archive services but we only have a few templates available, and it would be impractical to make a new template for every service.
- Confusion of methods and documentation across multiple web archival templates.
The solution:
- 1. Create a new template {{webarchive}} in Lua. It will take parameters like this: |url=https://web.archive.org/web/201609010000/http://example.com |date=9 September 2016
- It would have standard options like title and nolink; and new support for multiple URLs, e.g. |url2=
- It would detect the archive type by domain name, and make sure the rendering mirrors that of any existing template as closely as possible. So it would be possible to replace {{wayback}} with {{webarchive}} and the output would look the same and not break pages.
- It will have tracking categories for errors, which most templates currently lack.
- It will have CS1-style red inline error messages to alert editors to problems.
- A single documentation page, instead of different docs and methods for each template.
- 2. Create a bot to change all instances of old templates to the new template.
- 3. Retire/delete the old templates.
- 4. Start discussion with CS1|2 to see if they would be willing to add support for multiple archive links such as |archiveurl2=.
- 5. Check with IABot that it will be able to use this new template.
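The "detect the archive type by domain name" step in the proposal can be illustrated with a minimal sketch. It is written in Python rather than Lua purely for illustration; the `SERVICES` table and the function name `detect_service` are hypothetical and are not taken from the module.

```python
from urllib.parse import urlparse

# Hypothetical domain-to-service table. The list actually used by the
# module is not shown in this discussion.
SERVICES = {
    "web.archive.org": "Wayback Machine",
    "archive.org": "Wayback Machine",
    "webcitation.org": "WebCite",
    "archive.is": "archive.today",
    "archive.wikiwix.com": "Wikiwix",
}


def detect_service(archive_url: str) -> str:
    """Return a display name for the archive service, or 'unknown'."""
    host = (urlparse(archive_url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    # Try the full host first, then each parent domain in turn.
    parts = host.split(".")
    for i in range(len(parts)):
        candidate = ".".join(parts[i:])
        if candidate in SERVICES:
            return SERVICES[candidate]
    return "unknown"
```

A single lookup table like this is what lets one template stand in for several service-specific templates: supporting a new archive service means adding a row, not writing a new template.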
I've already converted {{wayback}} to Lua so have some experience. It will take some work to develop and test but I don't foresee any roadblocks technically, just time.
Linking here from the talk pages of relevant templates. Also @Evad37 and Cyberpower678:. -- GreenC 15:50, 9 October 2016 (UTC)
- IABot would need to be reconfigured to support this change.—cyberpowerChat:Limited Access 19:48, 9 October 2016 (UTC)
- Support per my comments at the TfD - Evad37 [talk] 01:19, 10 October 2016 (UTC)
- Question: Is the template output going to comply with Citation Style 1? Keep in mind that it seems CS1 is wrongly conflated with CS1 templates which are applications of the style and not the style itself. As a result, the style elements of CS1 are poorly defined, but general guidelines can be intimated. These regard: the order of displayable values, their visible interdependencies, separators, text formatting, terminal punctuation and static (pre-inserted) text. I am bringing this up because the output of templates like {{wayback}} which the proposed module will replace, does not comply with CS1. 65.88.88.127 (talk) 17:47, 18 October 2016 (UTC)
- The output of {{webarchive}} mirrors {{wayback}} and {{webcite}}. These templates were never designed to be CS1 because CS1 templates have their own support for |archiveurl=. There are also 85,000+ instances of the templates, so changing the output could break many pages. If we wanted a new style of output, as an optional argument switch, that could be done (or make the legacy style the option and the new style the default). For {{cite archives}}, it was a sort of workaround to CS1's lack of support for multiple archives, so the goal there is to get support for that in CS1 and retire {{cite archives}}. In the meantime, {{webarchive}} can support it with an extra option |format=. -- GreenC 21:55, 18 October 2016 (UTC)
- The question wasn't about compatibility with CS1 templates, it was about compatibility with CS1 as a style. The new template can be compatible with the style independent of CS1 templates, CS1 modules etc. The question is, will it? 65.88.88.126 (talk) 22:54, 18 October 2016 (UTC)
- I think the intention is:
- "no" by default (so as to reproduce existing templates' behaviour, except for {{cite archives}})
- "yes" with the |format= parameter, if placed at the end of a CS1-style citation (to reproduce {{cite archives}}'s CS1-style behaviour)
- that if/when CS1|2 module/template support is available, the {{cite archives}} behaviour would be deprecated and eventually removed, as it would be redundant to specifying the additional archives in CS1|2 templates directly
- Evad37 [talk] 02:45, 19 October 2016 (UTC)
- Exactly. -- GreenC 03:03, 19 October 2016 (UTC)
- Thank you. 65.88.88.126 (talk) 12:27, 19 October 2016 (UTC)
- Comment: I feel that the url parameter should be the original canonical URL. The *-date parameters could instead specify which archive service is being used, i.e. wayback-date, webcite-date, etc. The inclusion of multiple URLs (i.e. url2 etc.) should alert editors that the page was moved. How would you do this? Regards. – Allen4names (contributions) 18:30, 19 October 2016 (UTC)
- What you are proposing seems like a different template altogether. The design goals for this template are set out above, and they may not be trivial. Let's not unnecessarily complicate things. 72.43.99.130 (talk) 19:42, 19 October 2016 (UTC)
Proposed changes: |format= and |via=
Is it possible for |format= to acquire a more general role? The way I understand it, now it just signifies CS1 compliance. This may be a waste of a parameter. I recommend that "format" take any of the following values: wayback|webcite|cs1|cs2|memento, and then display the results accordingly. If that is too much work, then maybe start with the most heavily used styles? Additionally, I would like to ask that |via= be included, with a function/style similar to the one |via= has at CS1. Personally I think that the current nomenclature ("at Wayback" etc.) should not apply to items that are retrieved. Neither is the link at the related repository. An available copy is there; the link lives in the template code, and the retrieved item is likely on the user device cache. However, it is retrieved via the repository. 72.43.99.130 (talk) 20:00, 19 October 2016 (UTC)
Initial, rudimentary, observations.
As expected, today's version results in faster page loads compared to {{cite archives}} (real-world examples from Order of the Star in the East). Again compared to {{cite archives}}, and also as expected, it results in heavier resource use (larger argument size, visited nodes etc.). So far, so good. 65.88.88.75 (talk) 20:52, 20 October 2016 (UTC)
Module vs template
The module page still states that it is not ready for article space, but the related notice has disappeared from the template that invokes it. Which one is correct? 72.43.99.146 (talk) 02:11, 15 November 2016 (UTC)
- Fixed, thanks. -- GreenC 05:24, 15 November 2016 (UTC)
Portuguese Web Archive
Portuguese Web Archive is misspelled, i.e., "Archived August 1, 2016, at the Portuguese Web Archive". 79.76.183.103 (talk) 20:32, 20 January 2017 (UTC)
- Fixed. -- GreenC 05:26, 21 January 2017 (UTC)
|url= vs. |archiveurl=
It's very annoying that |url= is actually the generic web archive URL, not the deadlink. It's not possible to specify a specific snapshot (like is done with the |archiveurl= parameter in {{cite}}) using this template. This was proposed by many people in the TfD discussion, but no one commented on, or addressed it. Why does this parameter exist then?
Then |date= is actually the date the link died, not the date of the snapshot. This is inconsistent. |url= is the original URL, which is automatically linked via The Wayback Machine, so one would assume if the specific Wayback URL can't be used that the date parameter is for pointing automatically at the snapshot in question, but it's not, and it deals with the original URL's date. —Hexafluoride Ping me if you need help, or post on my talk 18:15, 10 February 2017 (UTC)
- |date= is the date of the snapshot. |url= is the archive URL not the dead url. There is no need for the dead url it's not a CS1|2 or citation template. The documentation page has examples how the template works. -- GreenC 19:04, 10 February 2017 (UTC)
- Note that the parameter naming issue was definitely addressed/discussed by several users at TfD – have another read of Wikipedia:Templates_for_discussion/Log/2016_October_24#Web_archive_templates – for example, note that RCraig09 even struck that part of their !vote following GreenC's response. If you want to propose changing parameter names, or adding parameter aliases, or whatever, you can make such a proposal, but the "no one commented on, or addressed it" claim is simply not accurate. - Evad37 [talk] 19:35, 10 February 2017 (UTC)
- I'm very sorry. I've got things mixed up. This should be at {{deadurl}} not here. This template works as intended. I was reading the documentation for {{deadurl}}, then read the TfD discussion and somehow the two wires crossed in my brain. —Hexafluoride Ping me if you need help, or post on my talk 19:38, 10 February 2017 (UTC)
Ranges for archive index?
Sometimes official sites change multiple times, so I'd like to see the archive index (for wayback machine citations) show a range of when a page was archived. For example one URL would be 1998-2001, another 2001-2005, and so on WhisperToMe (talk) 04:17, 9 July 2017 (UTC)
- Not sure if Internet Archive supports ranges, how would that URL look? It is possible to link to the index with '*':
{{webarchive |url=https://web.archive.org/*/http://en.wikipedia.org |date=* |title=Enwiki}}
- Enwiki at the Wayback Machine (archive index)
- -- GreenC 13:51, 9 July 2017 (UTC)
Batch referencing books in the public domain
Archive.org has become a de facto library for books in the public domain, and it's widely referenced throughout Wikipedia. It'd be great to have a tool/bot that goes through an article, and adds bibliographical references to all publications cited in the text and wikilinked to a publication article (book, etc), if there isn't one already. I wouldn't know how to code it, but the algorithm seems easy enough to implement:
- Go through all wikilinks
- If it's linked to a publication article, search archive.org using the query: https://archive.org/search.php?query=title%3A%28'title'%29+AND+creator%3A%28'author'%29+AND+mediatype%3A%28texts%29&sort=-downloads
- Pick first result, and, if a reference is not already there with the same author and title, add it in.
Easy enough? — WisdomTooth3 (talk) 23:53, 2 March 2018 (UTC)
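The query-building step of that algorithm could be sketched as follows. This is only an illustration in Python: the function name `archive_search_url` is hypothetical, the `'title'` and `'author'` placeholders in the URL above are assumed to stand for the bare title and author strings, and fetching or ranking results is not attempted.

```python
from urllib.parse import urlencode


def archive_search_url(title: str, author: str) -> str:
    """Build an archive.org search URL in the shape quoted above."""
    # Restrict to texts and match on title and creator, as in the quoted query.
    query = f"title:({title}) AND creator:({author}) AND mediatype:(texts)"
    # urlencode percent-encodes ':' '(' ')' and turns spaces into '+'.
    return "https://archive.org/search.php?" + urlencode(
        {"query": query, "sort": "-downloads"}
    )
```

For example, `archive_search_url("Walden", "Thoreau")` yields a URL whose query string begins `query=title%3A%28Walden%29+AND+creator%3A%28Thoreau%29`.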
Error category with no error text
László Nagy (canoeist) shows the "Webarchive template warnings" category, but I do not see any red error text, so it is difficult to figure out what to fix. – Jonesey95 (talk) 18:45, 1 December 2017 (UTC)
- There is no |date=, |date2=, etc. A date argument is required, one for each matching |url=, |url2=, etc. The template design decision was to silently let it go without a red warning since it can (usually) recover a date from the URL, but it still adds the page to the warning category. -- GreenC 21:36, 1 December 2017 (UTC)
- Jonesey95, my bot WP:WAYBACKMEDIC was able to fix most of these (missing dates). -- GreenC 19:17, 11 March 2018 (UTC)
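The pairing rule described above (each |url=, |url2=, ... needs a matching |date=, |date2=, ...) can be checked mechanically. The sketch below is in Python for illustration only; `missing_dates` is a hypothetical helper and does not reflect how the module or WP:WAYBACKMEDIC is actually written.

```python
import re


def missing_dates(args: dict) -> list:
    """Return the date parameter names that are absent for each url/urlN given."""
    missing = []
    for name, value in args.items():
        m = re.fullmatch(r"url(\d*)", name)
        if m and value.strip():
            # |url= pairs with |date=, |url2= with |date2=, and so on.
            date_name = "date" + m.group(1)
            if not args.get(date_name, "").strip():
                missing.append(date_name)
    return sorted(missing)
```

A page would land in the warning category whenever this list is non-empty, even if a date can still be recovered from the archive URL for display.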
Idea: Date ranges?
For archive indexes I'd like the option to display a range of dates. Many organizations change official website URLs multiple times, so I'd like to have a range of official site URLs with different dates according to when that URL was maintained.
Thanks! WhisperToMe (talk) 22:00, 22 April 2018 (UTC)
- Also there is the |addlarchives= option that will allow for multiple snapshots. -- GreenC 23:09, 22 April 2018 (UTC)
Archive wikiwix
Hello,
I am Pascal Martin from Linterweb. Since 2008 our Wikiwix service has archived more than 100 million French- and English-language source links on Wikipedia. Our system detects links on Wikipedia in real time and backs up the content of external links while respecting the noarchive tag.
So I am coming to you to offer to supply the Webarchive template with our archives, simply via http://archive.wikiwix.com/cache/?url=http://www.letelegramme.fr/ig/generales/regions/cotesarmor/coat-an-noz-le-chateau-retrouvera-son-eclat-01-08-2011-1386817.php
Please note that I am the manager of a small company; my goal is not to make money with archives but to propose an alternative for safeguarding content and to provide some big data for European research.
Indeed, from December we will deploy our technology to the entire corpus of the Wikimedia Foundation, in all languages.
We are hosted by the French University Network.
Sincerely, Pascal Pmartin (talk) 18:41, 4 November 2016 (UTC)
Pmartin, thanks for contacting us. I looked at Wikiwix a week ago and had some questions. English Wiki has tens of millions of external links and we use bots to automate most of the archiving.
- I could not find an API or method to determine if an archive is available. For example this http://archive.wikiwix.com/cache/?url=http://www.nowork.zzz returns a status code of 200 even though the page is not available. It should return 404 in this case, or whatever the status code of the original page was. Otherwise bots will not be able to verify if the page is available and working.
header "http://archive.wikiwix.com/cache/?url=http://www.nowork.zzz"
HTTP/1.1 200 OK
Date: Fri, 04 Nov 2016 19:19:08 GMT
Server: Apache/2.2.22 (Debian)
- There is no date. Is there a way to know when the page was archived? This is important as we keep track of archive dates since pages change over time.
- Will the link disappear? I recall reading that links on wikiwix.com are deleted if the link on Wikipedia is deleted. It's unclear which language of Wikipedia this is tracked or how stable the archive cache is.
- Archive.org has been adding all links from all languages so there is overlap and they have excellent API tools and reliability.
-- GreenC 19:34, 4 November 2016 (UTC)
- If it's only an API that you need, you could ask me, I will do it :)
- It's the date when the link appeared in Wikipedia, but we could add the date.
- And maybe Archive.org adds a bit too many links: it does not seem to respect the NOARCHIVE tag, which Wikiwix does. Example: https://web.archive.org/web/20150815000000*/https://www.facebook.com/cocacolafrance/
"User-agent: ia_archiver
Allow: /about/privacy
Allow: /ajax/pagelet/generic.php/PagePostsSectionPagelet
Allow: /full_data_use_policy
Allow: /legal/terms
Allow: /policy.php
User-agent: ia_archiver
Disallow: /"
Pmartin (talk) 19:57, 4 November 2016 (UTC)
Pmartin, that's great you're willing to make changes. Yes, I think we would need only two things: an API to check if a URL is available, and the date, for example http://archive.wikiwix.com/cache/?url=http://example.com&date=20160901120101 or a date retrievable via an API. It might also be good, when archiving a page, to save the status code (also retrievable via API). Very often pages are archived non-200 but that information is lost and there's no way to determine later if the page is any good. Archive.is has this problem, making it almost useless for bots since they have a high rate of non-200 pages. Wayback tells you the original page status, which is very important for maintaining links. Also, what is Wikiwix's policy on deleting links from the cache: are they deleted automatically if the original link is deleted from Wikipedia? -- GreenC 20:14, 4 November 2016 (UTC)
- A good choice for the API might be the Memento API. It is supported by many existing archive services. —RP88 (talk) 20:18, 4 November 2016 (UTC)
- I'm familiar with it and use it in bot. Unfortunately for whatever reason it's not always accurate, things get out of sync between what's actually in the archive database and what Memento says. This was particularly true with Webcite - emails to them went unanswered. The archive.is results are often 404 even though Memento reports 200. Most of the smaller archives like LOC and other national libraries work, but they don't usually have much content so rarely get hits. I think an API for Wikiwix could be very simple since they don't have multiple snapshots over time - a single date, page and URL. -- GreenC 20:33, 4 November 2016 (UTC)
- Green, I do not understand why you need an API, if it's just to update the template webarchive to take a link in another archive? --Pmartin (talk) 19:48, 7 November 2016 (UTC)
- Here is how the template looks:
{{webarchive |url= http://archive.wikiwix.com/cache/?url=http://www.letelegramme.fr/ig/generales/regions/cotesarmor/coat-an-noz-le-chateau-retrouvera-son-eclat-01-08-2011-1386817.php}}
- Produces: Archived (Date missing) at Wikiwix
- Unfortunately a date is required or it gives a red error. This mirrors {{citeweb}}, which requires |archiveurl= and |archivedate=. If a date is not provided, it too will give a red error message. There is broad consensus that archives should provide the date for source verification. Wikiwix could have the date in the top grey bar, where it says "This page is a cached version of this URL," then editors can manually type in the date into the template. The API is needed if you want bots to automate adding Wikiwix links to Wikipedia, which is recommended otherwise it won't get much usage from manual additions alone. -- GreenC 21:38, 7 November 2016 (UTC)
- FYI https://fr.wikipedia.org/wiki/Discussion_utilisateur:Pmartin#I_left_you_a_message.21 we are talking about "Exclusive Solution" of IABot --Pmartin (talk) 01:16, 24 November 2017 (UTC)
- Hi GreenC, I'm Johan, a technical resource at Linterweb. We added an API-like request to query data on archive.wikiwix.com about a URL: http://archive.wikiwix.com/cache/?url=http://www.linterweb.fr&apiresponse=1 . We also added the date at the bottom of the webpage content, and a long-form URL with the datetime inside, like http://archive.wikiwix.com/cache/20180329074145/http://www.linterweb.fr . Could you help us add Wikiwix as a known archiver in Module:Webarchive? --Johan linterweb (talk) 07:53, 30 March 2018 (UTC)
- @Johan linterweb: That's great news. The Webarchive Module itself shouldn't need an update (it already recognizes Wikiwix URLs), but other things need to be done:
- Search for all instances of Wikiwix URLs on en.wikipedia and convert to long form, e.g. http://archive.wikiwix.com/cache/20180329074145/http://www.linterweb.fr
- If in a citation template, add a |archivedate= (if not already)
- If in a webarchive template, add a |date= (if not already) - this will fix red errors
- Update WP:WAYBACKMEDIC so that it includes Wikiwix in its list of web archive services to search when looking for new archives. WMedic will be able to add new Wikiwix archives into Wikipedia.
- Update WP:WAYBACKMEDIC so that it does the first three steps automatically going forward so it can maintain the system should users add Wikiwix URLs in the short form.
- I'll start looking at this soon. -- GreenC 16:10, 30 March 2018 (UTC)
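The short-form to long-form conversion in the first step might look roughly like the sketch below. It is Python for illustration only; the two URL shapes are the ones quoted in this thread, while the helper names `parse_wikiwix` and `to_long_form` are hypothetical. The 14-digit timestamp is assumed to be supplied by the caller (for instance from the Wikiwix API response); the sketch does not fetch it.

```python
import re

# Long form: http://archive.wikiwix.com/cache/<14-digit timestamp>/<original URL>
LONG_FORM = re.compile(r"^https?://archive\.wikiwix\.com/cache/(\d{14})/(.+)$")
# Short form: http://archive.wikiwix.com/cache/?url=<original URL>
SHORT_FORM = re.compile(r"^https?://archive\.wikiwix\.com/cache/\?url=(.+)$")


def parse_wikiwix(url: str):
    """Return (timestamp, original_url); timestamp is None for short-form URLs."""
    m = LONG_FORM.match(url)
    if m:
        return m.group(1), m.group(2)
    m = SHORT_FORM.match(url)
    if m:
        return None, m.group(1)
    raise ValueError(f"not a Wikiwix cache URL: {url}")


def to_long_form(url: str, timestamp: str) -> str:
    """Rewrite a short-form URL to long form; long-form URLs pass through unchanged."""
    existing, original = parse_wikiwix(url)
    if existing is not None:
        return url
    if not re.fullmatch(r"\d{14}", timestamp):
        raise ValueError("timestamp must be 14 digits (YYYYMMDDhhmmss)")
    return f"http://archive.wikiwix.com/cache/{timestamp}/{original}"
```

Once a URL is in long form, the first eight digits of the timestamp give the value needed for |date= or |archivedate=.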
- @Johan linterweb: - Question: is it true that links are deleted from WikiWix if they are deleted from the French Wikipedia? This would be a problem at English Wikipedia, if the wikiwix link stopped working because of removal at the French Wikipedia. -- GreenC 16:29, 30 March 2018 (UTC)
- Hi GreenC, great news ! About your question, the answer is no, we don't delete links from our archives even if they are removed from frwiki (because everyone can use wikiwix to archive a weblink, not necessary an external link from frwiki). --Johan linterweb (talk) 11:43, 5 April 2018 (UTC)
Hello @Johan linterweb:, it's excellent that links are preserved. I'm working on updating WaybackMedic to add new WikiWix archives (for links marked with {{dead link}}) and the WikiWix API is returning many soft-404s. For example [1]. This will require manual checking, which means new additions would be small since it can't be fully automated. Do you know what percentage of WikiWix archives could be soft-404? Initial tests show it might be as high as 50% (when checking the API for links on Wikipedia marked with {{dead link}}). -- GreenC 14:57, 7 April 2018 (UTC)
- Hi GreenC, well we don't know how many archives could be soft-404, we didn't mark them as 404 pages in the past, from now we will. --Johan linterweb (talk) 08:49, 12 April 2018 (UTC)
- @Johan linterweb: I've once again blocked Wikiwix-bot on the IABot Management Interface for adding archives like http://archive.wikiwix.com/cache/20150618143943/http://business.financialpost.com/news/retail-marketing/dollarama-tests-market-in-latin-america-with-sourcing-deal to IABot's DB which are bad.—CYBERPOWER (Chat) 00:18, 9 April 2018 (UTC)
- In a span of 20 minutes, the bot added roughly 200 bad archives.—CYBERPOWER (Chat) 00:21, 9 April 2018 (UTC)
- Hi CYBERPOWER (Chat), as I said on Pmartin's page, we have fixed the problem (due to detection of the "noarchive" meta tag). It would be simpler to write to us on Pmartin's page about problems between Wikiwix and IABot; it's not related to Wikiwix integration in the webarchive module or template. --Johan linterweb (talk) 11:58, 12 April 2018 (UTC)
- Hi GreenC, we fixed archive wikiwix code to detect soft404 URLs (using patterns like you suggested) and restarted to push our archive URLs to IABot (on frwiki but it can get side effect on others projects). We hope the detection rate is quite good. We also ran a bot internally to fix all our archives datas.--Johan linterweb (talk) 08:17, 27 April 2018 (UTC)
- Ok. I know from experience with archive.is this filter method requires continual monitoring and updating and it's not 100% perfect. What is the rate of 404s that get through the filter? I'll be able to monitor the rate when I run WaybackMedic as I do manual verification. --GreenC 14:44, 27 April 2018 (UTC)
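For readers unfamiliar with the term, a soft-404 is a page that returns status 200 but whose content is an error page. A pattern-based filter of the kind discussed here could be sketched as below. The phrases in `SOFT_404_PATTERNS` are purely hypothetical examples; the patterns Wikiwix or WaybackMedic actually use are not given in this thread.

```python
import re

# Hypothetical example phrases only; real filters need ongoing tuning.
SOFT_404_PATTERNS = [
    re.compile(r"page\s+not\s+found", re.IGNORECASE),
    re.compile(r"\b404\b"),
    re.compile(r"no\s+longer\s+(available|exists)", re.IGNORECASE),
]


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """True when a 200 response carries text typical of an error page."""
    if status_code != 200:
        # A real error status is a hard failure, not a *soft* 404.
        return False
    # Only the start of the page is scanned, where titles and headings sit.
    text = body[:5000]
    return any(p.search(text) for p in SOFT_404_PATTERNS)
```

As noted in the reply above, a filter like this is never fully reliable: legitimate pages can mention "404", and error pages can use wording no pattern anticipates, which is why manual verification of a sample remains useful.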
Preview error please?
Would it be possible to make it so that this template cause an error or warning in Preview if an editor has put it inside <ref> tags? -- 109.78.242.41 (talk) 13:54, 22 July 2018 (UTC)
- Technically I don't think it's possible. And it's not an error - it's probably the majority of cases. -- GreenC 15:44, 22 July 2018 (UTC)
- How do you mean not an error? The top of the template page includes the warning: "This template is intended for external links. It is not designed for use as a citation template." -- 109.77.213.7 (talk) 12:14, 23 July 2018 (UTC)
- Ah true, but it's still often used inside citations because, for IABot, this is the only mechanism it can use for citations not in CS1|2 format. The warning would be more applicable if webarchive is used as the entire citation, in which case it should be converted to a CS1|2 template (which IABot also does). But if the citation is free-form this template is appropriate. The same applies when using |format=addlarchives, which is designed to be used following a CS1|2 citation. -- GreenC 13:49, 23 July 2018 (UTC)
Date format
How do you get the output date into the appropriate format, DMY, MDY or ISO, when there is no |date= specified? I was expecting a |df= parameter. Keith D (talk) 18:09, 19 August 2018 (UTC)
- |date= is required, like |archivedate= in CS1|2. Unlike CS1|2, if no date is provided it will try to figure it out from the URL and default to ISO, but no guarantees or options. There's no |df= parameter because there is only one argument that uses dates, |date=, which serves the dual purpose of specifying the date and the format for itself. -- GreenC 19:14, 19 August 2018 (UTC)
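The "dual purpose" idea, where the style of the |date= value also selects the output style and ISO is the fallback, can be sketched as follows. This is an illustrative Python sketch, not the module's code; `detect_format` and `render` are hypothetical names and only English month names are handled.

```python
import re
from datetime import date

MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def detect_format(date_text: str):
    """Return 'iso', 'dmy' or 'mdy' for the style of date_text, else None."""
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_text):
        return "iso"
    if re.fullmatch(r"\d{1,2} [A-Z][a-z]+ \d{4}", date_text):
        return "dmy"
    if re.fullmatch(r"[A-Z][a-z]+ \d{1,2}, \d{4}", date_text):
        return "mdy"
    return None


def render(d: date, fmt) -> str:
    """Render d in the requested style; anything unrecognised falls back to ISO."""
    month = MONTHS[d.month - 1]
    if fmt == "dmy":
        return f"{d.day} {month} {d.year}"
    if fmt == "mdy":
        return f"{month} {d.day}, {d.year}"
    return d.isoformat()
```

With no |date= there is no style to copy, so `render(d, None)` produces the ISO form, which matches the default described in the reply above.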
Importance of template
This chart shows this template to play an important role in Wikipedia infrastructure. What it says: 6% of all readers (excluding logged in users) click through to at least one external link. Of those external links, about 40-50% are going to web.archive.org .. there are 348636 transclusions of this template (representing about 300,000 wayback links) and about 2 million wayback links in total on enwiki. The math gets complicated but one can intuit it's serving a non-trivial portion of total external-link click-through on Wikipedia. -- GreenC 15:37, 3 September 2018 (UTC)
changes in Module:Webarchive/sandbox
Because of this discussion and hacks that I made at bn:Module:ওয়েব_আর্কাইভ, I started looking at Module:Webarchive/sandbox with an eye toward better/easier support for internationalization. Some of that I have already done (see the new table at serviceName()). With that change and other changes that I think make the code more understandable, I am more-or-less ready to work on i18n. For now, I have left most of the original code in the /sandbox so that it is easily available for reference.
I have run the /sandbox against all of the /testcases pages that I can find so I am pretty sure that as the code stands now, nothing is horribly wrong. However, there are some 'errors':
- Template:Webarchive/testcases/CiteArch – no errors – any tests in this page that are not also in /Production should be moved to /Production and this page deleted
- Template:Webarchive/testcases/Webcite – no errors – any tests in this page that are not also in /Production should be moved to /Production and this page deleted
- Template:Webarchive/testcases/Production
- A4.3, A4.4, and A4.6– there is a difference in how /sandbox detects and reports date errors; the live module can report an error for the first individual part of a date that is invalid, the /sandbox just reports that the date as a whole is invalid
- A4.5 – archive.org url dates in the form YYYYMM00000000 cause archive.org to return a snapshot that is presumably the last snapshot taken in the month MM of year YYYY. I think that this should be flagged as a date error because that snapshot might not be the right snapshot
- A8.1 – /sandbox does not include a comma in the rendering when the rendered date is an error message; the comma is reserved for mdy dates
- Module talk:Webarchive/testcases
- test_addlpages_5 – this test fails because the live module does not accept frame.args.url1 as an alias of frame.args.url
- test_z1_notdate_archiveis – fails for the same reason as A8.1 above
I have questions/comments about the /sandbox:
- 1. in the function dateFormat():
- 1.1 why is the min year 1900? Shouldn't the min year be rather more recent? the advent of the internet c. 1980, perhaps?
- 1.2 why is the max year 2200? Shouldn't the max year be the current year?
- 2. in the function decodeArchiveisDate():
- 2.1 is an archive.is short link guaranteed to never be digits-only? (always alpha or alphanumeric)
- 3. in the function createRendering():
- 3.1 for wayback and loc wayback when date is * shouldn't the code create an archive link like this: [<url> Archive index] instead of: [<url> Archive] index?
- 4. in function webarchive():
- 4.1 |nolink= is treated the same as |nolink=yes which, to me, is poor practice because an empty named parameter does not convey any meaning to an editor who is reading the wikisource (this is why cs1|2 does not support empty parameters); parameters that have no value should have no meaning
- 4.2 there are two 'bugs' tracked; what are these bugs? where do they exist? why haven't they been fixed? is this special code really needed?
- 4.3 the value assigned to |url= is inspected to see if the first bit is one of the two uri schemes http or //; if not, the code adds http:// to the beginning of the url; is this a good idea? if the url is malformed because of a typo (htp://...), adding http:// ahead of that will not fix it; is it not better to throw an error and halt?
- 4.4 for wayback and locwebarchives, when |date=*, why isn't that index value compared to the date extracted from the url? they should match should they not?
—Trappist the monk (talk) 15:50, 1 September 2018 (UTC) (edit conflict)
- More:
- in decodeWaybackDate() and decodeArchiveisDate() in the live module, timestamp length is checked; less than 4 is an error. If less than 14, timestamp is right-filled with '0's to 14 digits. Then only the first 8 digits of the 14-digit timestamp (year, month, day) are inspected. Shouldn't the initial length test be for a full 14-digit timestamp? If less than 14 and especially if there is an odd number of digits, then the timestamp does not uniquely identify a snapshot of the source and non-unique timestamps cause archive.org to return a snapshot that may or may not be correct.
- in decodeWaybackDate() trailing * characters are deleted from the timestamp without comment. A timestamp with a trailing * causes archive.org to show a calendar display. Removing the * doesn't necessarily guarantee that the 'corrected' timestamp now points to the intended snapshot. When this is done, the module should show an error message.
- —Trappist the monk (talk) 16:03, 2 September 2018 (UTC)
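The two readings of the timestamp check discussed above can be put side by side in a short sketch. It is written in Python for illustration and is not a port of decodeWaybackDate(); the function name and the `strict` flag are hypothetical. The lenient branch follows the description above (minimum 4 digits, right-fill with zeros to 14, inspect the first 8), while the strict branch follows the suggestion that only a full 14-digit timestamp with no trailing * should be accepted. In both branches a month or day of 00 is treated as unreadable, which is an assumption of this sketch and not necessarily what the live module does.

```python
import re
from datetime import date

# Captures the digit run after the host (optionally after "web/") and a trailing "*".
TIMESTAMP = re.compile(r"^https?://web\.archive\.org/(?:web/)?(\d+)(\*?)/")


def decode_wayback_date(url: str, strict: bool = False):
    """Return the snapshot date encoded in a Wayback URL, or None if unreadable."""
    m = TIMESTAMP.match(url)
    if not m:
        return None
    digits, star = m.group(1), m.group(2)
    if strict and (len(digits) != 14 or star):
        # Stricter reading: partial timestamps and calendar URLs are errors.
        return None
    if len(digits) < 4:
        return None
    # Lenient reading: pad to 14 digits, then look only at YYYYMMDD.
    digits = digits.ljust(14, "0")[:14]
    year, month, day = int(digits[0:4]), int(digits[4:6]), int(digits[6:8])
    try:
        return date(year, month, day)
    except ValueError:
        return None
```

The 12-digit example URL from the original proposal (`.../web/201609010000/http://example.com`) decodes to 1 September 2016 in lenient mode and is rejected in strict mode, which shows how many existing links a stricter check might flag.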
I was unprepared when the template started being used in other wikis. I never considered other languages when doing the original template mergers; at the time I just needed something working ASAP because IABot was adding {{wayback}} at high volume. So I'm glad you're doing this refactor as it will have a significant benefit: the template could be used wherever IABot is running (about 17 wikis and growing). I have a bot called "wam" that can do conversion of old templates like {{wayback}}. (I'm also learning from your code and incorporating idioms, methods and functions into my current template Module:Calendar date, still in progress.)
2 - With a list of about 5000 archive.is URLs in short form I just ran a script and sure enough there were 2 cases of all-digits. So it's rare (1:2500) but yes it can happen; the old code is in error.
3 - sure
4.1 - this feature was requested by an editor I think some templates do the same. If it is changed, existing instances will need to be flagged and fixed. Maybe create a temporary tracking category. I doubt there are many it can be done manually.
4.2 - these bugs are/were produced by IABot and fixable by WaybackMedic. I've run Medic through the tracking category at least once so any existing cases are fixed. I suppose you can remove it if you want though it might be a good idea to wait until I run Medic one more time so I can log cases and open a Phab with IABot, assuming it's not already fixed.
4.3 - this was meant to catch missing schemes, to normalize the URL. If a URL with no scheme is an error, I don't know, but certainly htp:/ would be - maybe there should be a normalization function.
4.4 - |date=* was meant to be used if the URL timestamp is '*' because |date= is required something needs to be there. But it looks like an error in the original code, it doesn't check what happens if the timestamp and |date= are not in agreement. I suppose it should treat these cases as the |date=* being an error and the timestamp being intended, not the other way round. Thus render as if |date= did not exist ie. try to determine the date from the timestamp, render it including a warning message about missing date. If the timestamp is "*" and there is a valid |date=, same situation of giving priority to the URL and date being in error though in this case it would render as if |date=*.
-- GreenC 15:56, 2 September 2018 (UTC)
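The precedence rule proposed in 4.4 (the URL timestamp wins over |date= when the two disagree about being an index) might be sketched like this. It is a Python illustration of the proposal only; the function name, the returned values and the warning strings are hypothetical and are not taken from the module.

```python
def reconcile_index_date(url_timestamp, date_param):
    """Return (value_to_render, warning) with the URL timestamp given priority.

    Only the '*' cases discussed above are covered; comparing two concrete
    dates against each other is out of scope for this sketch.
    """
    if url_timestamp == "*":
        # The URL points at the index page, so render as if |date=* was given.
        warning = None if date_param == "*" else "date does not match index url"
        return "*", warning
    if date_param == "*":
        # Render as if |date= were absent: take the date digits from the
        # timestamp and warn about the missing date.
        return url_timestamp[:8], "date missing"
    return date_param, None
```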
- 2 – archive.is short form detection tweaked:
{{webarchive/sandbox|url=https://archive.is/zKyrW |date1=11 May 2018}}
→ Archived 11 May 2018 at archive.today
{{webarchive/sandbox|url=http://archive.is/84274}}
→ Archived (Date missing) at archive.today
- —Trappist the monk (talk) 00:12, 3 September 2018 (UTC)
- 3 – code for this is in-place and tested but deferred for the time being – I want to minimize the number of live/sandbox comparison errors
- 4.1 – insource:/\{webarchive[^\}]*\| *nolink *= *[\|\}]/i finds three pages with empty |nolink= parameters – code for this is in-place and tested but deferred as above
- 4.2 – deferred
- 4.3 – url without scheme is an error because mediawiki doesn't make an active link from urls that don't have a valid scheme: [www.example.com]; apparently all of the archive services use http or https so I suppose that we could dream up some sort of mechanism to normalize obvious errors like 'htp'; the application of any such fixes should show an error message because the source data are broken
- 4.4 – comparison implemented
- —Trappist the monk (talk) 13:26, 3 September 2018 (UTC)
5.0 - I suppose, but Wayback often redirects 14-digit snapshots around - it's a dynamic changing database. Part of the design of Wayback, not a static system. So not sure how much to weigh getting the right snapshot, which may have been redirected anyway, against getting the user to a page that is working at all where they can try to discover the right page. The template is limited in what it can do here, but probably should assume good faith that a 4-digit year-only timestamp leads to something the user intended. Also, my bot fixes urls with 000's by discovering the underlying 14-digit timestamp Wayback is redirecting to.
6.0 - Same as #5 - there are no guarantees with Wayback. It is reason #37 why the WMF should be running its own archival service instead of relying on outsourced services. They are all broken in various ways for which we have no ability to fix, tune to our requirements, add new features or improve performance.
-- GreenC 23:50, 2 September 2018 (UTC)
- 5 – Perhaps Wayback and archive.is do redirect 14-digit timestamps; that is not the issue I'm addressing. The issue is timestamps in the wikisource that have fewer than 14 digits which compels these services to employ some sort of guessing mechanism so that they can display something other than a 404 error. As the live module stands right now, it gives no indication when a url has fewer than 14 digits. I have remedied that in the sandbox. Tested but currently disabled for Wayback; here is archive.is with an 8-digit timestamp:
{{webarchive/sandbox|url=https://archive.is/2016.08.28/http://example.com}}
→ Archived 2016-08-28(Timestamp length) at archive.today
- Such template instances are added to Category:Webarchive template warnings. The sandbox does not modify the timestamp to make it 14 digits.
- 6 – It seems to me that you are wanting to both have and, at the same time, consume your cake. In 5.0 above, you argue that we should "assume good faith that a 4-digit year-only timestamp leads to something the user intended" yet the module (both live and sandbox) unconditionally strips trailing * characters from Wayback timestamps which seems contrary to the "assume good faith" rubric (something that you have been at pains to reiterate). I think that instead of stripping the * character, the module should leave it in place and emit a warning as the sandbox now does for short archive.is timestamps.
- —Trappist the monk (talk) 12:13, 6 September 2018 (UTC)
- More:
- 7 – Error and warning messages are wrapped in a <span>...</span> tag that includes the same class attributes that apply to cs1|2 template errors:
[https://archive.is/2016.08.28/http://example.com Archived] 2016-08-28<span style="font-size:100%" class="error citation-comment"><sup>(Timestamp length)</sup></span> at [[archive.today]]
- I have added code to the sandbox to limit which namespaces will be categorized to the same namespaces categorized by cs1|2. This is tested but currently disabled – this change breaks most of the test cases because live module renders the categories in ~/testcases pages but, when enabled, this change to the sandbox does not so the comparisons fail.
- —Trappist the monk (talk) 12:13, 6 September 2018 (UTC)
- 5 - That is a good solution.
- 6 - If the timestamp is 14-digit and there is a *, it presents a contradiction and likely the user made a mistake. We could give authority to the 14-digit timestamp not the *. But if it's less than 14 digits, the other way around, give authority to the *. Or treat it not as a mistake, simply pass-through and let users decide, in which case the timestamp digits are ignored and it's treated as a 'index' or * page for |date= purposes. It just looked like a mistake to me.
- 7 - Good idea about limiting namespace given those pages clutter up the many tracking categories. I hope a way can be found so it works with the testcase pages.
- -- GreenC 13:57, 6 September 2018 (UTC)
- 6 – Just to be clear: the trailing * is only stripped as part of the date validation/decoding in decodeWaybackDate(); the link into archive.org is not modified so if it has the * in the wikisource, it will have the * in the rendered page. Neither the live nor the sandbox modules modify the url.
- 7 – I have enabled namespace limits in the sandbox and tweaked the code to except Module talk:Webarchive/testcases and Template:Webarchive/testcases/Production. The errors shown on these pages are described above except test A2.5 on Template:Webarchive/testcases/Production which shows a comparison error because the live module does not detect. Template:Webarchive/testcases/Webcite is not in the excepted list so this page shows errors because it is a testcases page that is not listed in the excepted pages list.
- —Trappist the monk (talk) 16:46, 6 September 2018 (UTC)
- Prior to the update to the live module this morning, I found this:
{{webarchive|url=https://web.archive.org/save/https://gaite-lyrique.net/en/event/red-bull-music-festival-paris |date=September 7, 2018 }}
- I have tweaked the sandbox to detect /save/ and emit an error message:
{{webarchive/sandbox|url=https://web.archive.org/save/https://gaite-lyrique.net/en/event/red-bull-music-festival-paris |date=September 7, 2018 }}
→ Error in Webarchive template: Timestamp not a number.
- —Trappist the monk (talk) 11:44, 9 September 2018 (UTC)
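A check along the lines described above, where a /save/ url is flagged because the path segment in the timestamp position is not a number, could look roughly like this in Python. It is a sketch under assumptions: the function name and returned strings are hypothetical, and Wayback timestamp suffixes such as id_ are deliberately not handled.

```python
from urllib.parse import urlsplit


def wayback_timestamp_problem(url):
    """Return a short problem description, or None when the timestamp slot looks fine."""
    # Path segments after the host; empty segments (from '//') are dropped.
    parts = [p for p in urlsplit(url).path.split("/") if p]
    if parts and parts[0] == "web":
        parts = parts[1:]
    if not parts:
        return "timestamp missing"
    candidate = parts[0].rstrip("*")
    if candidate == "":
        return None  # a bare '*' is an index (calendar) url, not an error here
    if not (candidate.isascii() and candidate.isdigit()):
        return "timestamp not a number"
    return None
```

Because /save/ occupies the position where the timestamp is expected, it falls out of the same digit test; no special-case string match is needed in this sketch.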
- Another type of error waybackmedic can fix, so long as it shows up in the warning category. Possibles include "/save/" or "/_embed/" or "/save/_embed/" -- GreenC 12:31, 9 September 2018 (UTC)
- What does /_embed/ do? In the case of /save/, a new snapshot is saved at archive.org each time the url link is clicked so a 'fatal' error message is appropriate. Is that kind of response appropriate for /_embed/?
- —Trappist the monk (talk) 12:46, 9 September 2018 (UTC)
- It's something useful, but I can't remember, for Wikipedia purposes it should warn. I've run the bot on the warning category and reduced it, the remaining need to be checked by hand, there might be some unknown problems there. There's no instance of the two bugs so the code can probably be deleted as bug resolved. -- GreenC 13:58, 9 September 2018 (UTC)
- I found this:
<iframe src="https://archive.org/embed/pacificmarinerev1821paci" width="560" height="384" frameborder="0" webkitallowfullscreen="true" mozallowfullscreen="true" allowfullscreen></iframe>
- from the share link here; not identical to _embed but I would surmise that they have or had similar function, which I suspect is not applicable to {{webarchive}} in the same way that /details/ and /stream/ (https://archive.org/stream/pacificmarinerev1821paci#page/n539) urls are not applicable. Given that, unless it can be shown that there is need for /_embed/ or /embed/ detection I don't think that we need to worry about these urls.
- —Trappist the monk (talk) 14:42, 9 September 2018 (UTC)
- I have looked through the remaining articles in Category:Webarchive template warnings. A goodly number of them are there because they use {{webarchiv}} which appears to be some other template but is currently a redirect to {{webarchive}}.
- —Trappist the monk (talk) 15:20, 9 September 2018 (UTC)
- Yeah something with frames embedding in a page. Still, if no timestamp it should warn. I've cleaned out most of the warn cat but new ones are rolling in from the new warnings you added, will give it a day or so and rerun later. The webarchiv is from German, it uses a different set of arguments, my bot can convert them except when they are copy-pasted with "webarchive" but using the "webarchiv" argument set, so I changed those to "webarchiv" manually and will rerun the bot on them later. Someone is also copy pasting total garbage related to those and I just cleaned up a dozen or so. -- GreenC 15:27, 9 September 2018 (UTC)
Other discussions
Hello @Trappist the monk:,
In function serviceName() it strips the hostname assuming "www" or "web", but there is a large variety as documented in wp:List of web archives on Wikipedia (in the "Hostname" field). This list is not complete and can change by remote providers (all unannounced and undocumented of course). I used mw.ustring.find() to check for what it includes rather than what to exclude.
It's not uncommon for timestamp years to range from 1890s to 2100. This is due to many factors mostly bot bugs and remote archive bugs. These archive URLs will often work despite not being literally accurate times. Also timestamps with a month of "15" etc, that are nonsensical, they in fact work on Wayback - it's a bug in their API that produces these timestamps. They end up redirecting to a sane timestamp and my bot WaybackMedic detects and fixes them when it runs across them (not easy as there are about 5 different redirect types on Wayback including Javascript) -- so ideally the template would still render the archive as intended, assuming good faith it is a working archive, but also leave a tracking category warning entry for bots to cleanup. -- GreenC 05:29, 2 September 2018 (UTC)
- I have tweaked serviceName() so that it first looks for a key in the services table that exactly matches host. Failing that, it scans through the keys in services with this:
host:find ('%f[%a]'..k:gsub ('([%.%-])', '%%%1'))
- since we're looking for ascii strings, no need to use mw.ustring.find(). find() uses a lua pattern: the '%f[%a]' prevents finding 'archive.org' in 'europarchive.org', for example; the k:gsub ('([%.%-])', '%%%1') escapes lua . character class and the - pattern item so that they are treated as plain characters.
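The two-step lookup described above (exact key match first, then a scan that only matches a key where a run of letters begins) can be approximated in Python as follows. This is a sketch for illustration: the SERVICES table is a hypothetical subset, and the regular-expression lookbehind stands in for Lua's '%f[%a]' frontier pattern rather than reproducing it exactly.

```python
import re

# Hypothetical subset of the services table; keys are hostnames.
SERVICES = {
    "archive.org": "Wayback Machine",
    "archive.today": "archive.today",
    "webcitation.org": "WebCite",
}


def service_key(host):
    """Return the matching key from SERVICES, or None when nothing matches."""
    host = host.lower()
    if host in SERVICES:
        return host
    for key in SERVICES:
        # (?<![a-z]) requires that the key is not preceded by a letter, so
        # 'archive.org' is not found inside 'europarchive.org'.
        # re.escape() plays the role of the gsub that escapes '.' and '-'.
        if re.search(r"(?<![a-z])" + re.escape(key), host):
            return key
    return None
```

As with the Lua find(), the match is not anchored at the end of the hostname, so this only guards against a key being embedded in a longer word on its left side.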
- I'm not convinced that nonsensical timestamps truly work. Yeah, the service may redirect to what it thinks is a sensible timestamp but, for us, sensible has a different meaning. We use these archive services to provide snapshots of ephemeral sources as they were at a specific time in support of statements made in our articles. The services care not a whit about that and will happily provide a snapshot of a 404 page if that just happens to be the snapshot targeted by the redirect. Nonsensical dates – before the c. 1980 advent of the internet and any future dates – should be flagged as such just as we flag bogus dates like this (added to Chukotka Autonomous Okrug with this edit by IA bot – bots and automated tools should not produce bogus urls):
- Archived (Timestamp date invalid) at the Wayback Machine
- Archived (Timestamp date invalid) at the Wayback Machine – sandbox
- timestamp 20131915452200 redirects to timestamp 20131204095123, the last snapshot of 2013; it may or may not be an accurate representation of the source on the day that the editor consulted it.
- Do you have answers for the others of my questions?
- —Trappist the monk (talk) 11:37, 2 September 2018 (UTC)
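The kind of date sanity check argued for above (reject impossible calendar dates, dates before roughly 1980, and future dates) could be sketched in Python like this. The function name, the returned string and the earliest_year default are assumptions for illustration; the cut-off year is taken from the "c. 1980" remark and is not a value confirmed from the module.

```python
from datetime import date


def timestamp_date_problem(timestamp, today, earliest_year=1980):
    """Return a problem description for the YYYYMMDD part, or None if plausible."""
    digits = timestamp[:8]
    if len(digits) < 8 or not (digits.isascii() and digits.isdigit()):
        return "timestamp too short"
    try:
        snapshot = date(int(digits[:4]), int(digits[4:6]), int(digits[6:8]))
    except ValueError:
        # Month 19, day 32 and similar impossible calendar dates end up here.
        return "timestamp date invalid"
    if snapshot.year < earliest_year or snapshot > today:
        return "timestamp date invalid"
    return None
```

Passing today in explicitly keeps the check deterministic, which matters for test cases that are compared between a live and a sandbox version.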
- Recently Cyberpower678 said he would be adding a timestamp verification routine as one spectacularly wrong timestamp crashed the bot. However I will say, most of these (and there are a lot) are perfectly good archive URLs and it would be unfortunate to block them out. WaybackMedic can and does fix them on a regular basis (not totally trivial BTW). Sometimes these bogus dates are what are returned to users who cut and paste URLs from the browser bar into the template. I think a little flexibility and good faith that these URLs go to the intended page and there are processes to fix them. -- GreenC 16:13, 2 September 2018 (UTC)
update of the live module
I have updated the live module from the sandbox. At the time of the update there were 219 articles in Category:Webarchive template warnings and 7 articles in Category:Webarchive template errors.
—Trappist the monk (talk) 11:03, 9 September 2018 (UTC)
- Great! My bot can fix most of the warnings category (mostly missing dates). I'll run it today and it will also discover any bug-fix cases discussed earlier. -- GreenC 12:20, 9 September 2018 (UTC)
Over the last little while I have been tweaking the sandbox so that everything that should require tweaking for another language is now in Module:Webarchive/data (except the code that looks for the '/sandbox' substring of the module name which needs must remain in the module). Lots of minor tweaks mostly regarding error reporting. You can see these in Module talk:Webarchive/testcases and Template:Webarchive/testcases/Production. Without objection I shall update the live module from the sandbox sometime before week's end.
—Trappist the monk (talk) 10:03, 10 October 2018 (UTC)
Date with space
I added a new testcase A2.11. Date has extra space - this is generating a warning. If not too difficult, it would be good if the date could be reduced to single-spacing before processing, to avoid the warning caused by a minor user typo. The other option is keep the warning so bots can fix it but if kept it shouldn't be silent to avoid confusing editors trying to find and fix things manually. -- GreenC 13:13, 10 September 2018 (UTC)
- Fixed in the sandbox.
- —Trappist the monk (talk) 14:37, 10 September 2018 (UTC)
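The normalization requested above, reducing runs of spaces in the date to a single space before processing, is small enough to show as a Python sketch. The function name is hypothetical, and this is not the sandbox's actual fix; it only collapses ordinary ASCII spaces.

```python
import re


def collapse_date_spaces(value):
    """Collapse runs of ASCII spaces to one space and trim both ends."""
    return re.sub(r" {2,}", " ", value).strip(" ")
```

For example, collapse_date_spaces("September  7, 2018") gives "September 7, 2018".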
- Thanks. If you don't mind my asking what is the purpose of the semi-colon at line end, it seems to be used some places and not others, but I never used it at all. Is it functional or style? -- GreenC 14:45, 10 September 2018 (UTC)
- The habits of too many decades writing C which requires the semicolon. I've also been using it as a flag to distinguish stuff that you wrote from stuff that I wrote. I also added semicolons to your stuff when I touched or reviewed it. Because Lua doesn't care, and I know that Lua doesn't care, I'm not always consistent in how I use semicolons.
- —Trappist the monk (talk) 14:56, 10 September 2018 (UTC)
- I C. Won't worry about semicolons then. I wrote a few applications in C in the 80s, now using Nim (programming language) which compiles to C - it's lazy to say you program in C without actually. -- GreenC 15:18, 10 September 2018 (UTC)
Date mismatch error
From Southern New England School of Law:
- Archived 2012-02-08 at the Wayback Machine
Appears to be caused by a "13" hour. Maybe the template should ignore invalid H:M:S values as this is what Wayback provides and there isn't a way to fix it other than finding a new timestamp or archive provider, which may or may not be possible. -- GreenC 13:26, 10 September 2018 (UTC)
- The module doesn't evaluate the time portion of the timestamp so '13' hour is not the source of the error (besides, time in the timestamp is a 24-hour clock). The whitespace between 2012 and }} in the template wikisource is the unicode character U+2009, thin space, which does not get removed by mw.text.trim() nor by MediaWiki's normal named parameter trimming. Unless we start seeing a rash of these characters showing up in the template source, I see no reason to 'fix' the module for this problem.
- —Trappist the monk (talk) 14:17, 10 September 2018 (UTC)
- Ah yes of course 13 is fine my mistake. This is the first thin space I've seen; agree it can be skipped. -- GreenC 14:41, 10 September 2018 (UTC)
Date, url parameter aliases
Please tweak this so that |archive-date=, |archivedate=, |archive-url=, and |archiveurl= also work, so it's less of a hassle to convert between templates. — SMcCandlish ☏ ¢ 😼 04:54, 3 December 2018 (UTC)
- No these were intentionally avoided. It's not a CS1|2 template, using CS1|2 arguments causes confusion leading to other CS1|2 arguments being mistakenly used (accessdate, deadurl etc), or both url and archiveurl being used at the same time because that is how CS1|2 works. The template takes 1 type of URL as described in the name: archive. Disambiguation of URL type creates confusion over what the template does. It would also introduce complexity at this point, many bots and tools have to be rewritten. -- GreenC 05:29, 3 December 2018 (UTC)
Language
Please can someone add a |language= parameter, with the appropriate markup, for pages that are not in English? Andy Mabbett (Pigsonthewing); Talk to Andy; Andy's edits 23:43, 28 December 2018 (UTC)
edit request: enable archive.is mirrors
In this Tweet, archive.today writes, "Please do not use http://archive.IS mirror for linking, use others mirrors [.TODAY .FO .LI .VN .MD .PH]. .IS might stop working soon."
Would it make sense to update the template so links like archive.vn, etc. would work? I tried this edit using .vn, which gave errors. Alternatively, the best practice might be to switch from "archive.is" everywhere to "archive.today". = paul2520 (talk) 15:07, 5 January 2019 (UTC)
- Yes .vn, .md, and .ph are new as of yesterday and should be added. Discussion about archive.today Wikipedia_talk:Using_archive.is#which_TLD_to_use?. -- GreenC 15:31, 5 January 2019 (UTC)
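One way to treat the mirrors as a single service is to keep their hostnames in one set and test membership, sketched here in Python. The TLD list is the one quoted from the tweet above plus .is; the names ARCHIVE_TODAY_HOSTS and is_archive_today are hypothetical, and the real module keeps its service data in its own table.

```python
# Hostnames built from the TLDs named in the request above.
ARCHIVE_TODAY_HOSTS = {
    "archive." + tld for tld in ("today", "is", "fo", "li", "vn", "md", "ph")
}


def is_archive_today(host):
    """True when host is archive.today or one of the listed mirrors."""
    host = host.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    return host in ARCHIVE_TODAY_HOSTS
```

Adding a future mirror then means adding one TLD to the tuple rather than touching the matching logic.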