User talk:Glrx/Archive 11


Lee de Forest/Gill

Just an FYI, the Mount Hermon Boys' School Wikipedia article lists it as in Gill; Mount Hermon, Massachusetts redirects to Gill, Massachusetts (which notes "The campus of Northfield Mount Hermon School is located in the Mount Hermon section of the town."). Their ZIP 01354 is Gill[1]. Towns (such as the one I live in) have sub-sections that have a different name; not sure if there is already consensus on how to handle this. Fountains of Bryn Mawr (talk) 20:29, 14 January 2017 (UTC)

The infobox on the right of the article says the school is in Mount Hermon.
The school's website says its address is Mount Hermon. I take that as authority.
A ZIP5 is worthless because not all towns have their own ZIP5. The ZIP5 determines which post office handles the mail. A friend of mine lives in one county, but his mail service comes from a town in the next county that is five miles away.
If you say "Gill, MA" to USPS, it returns Mount Hermon:
Glrx (talk) 21:06, 14 January 2017 (UTC)

Copyeditor's remark

I think you mean "rollback" (instead of "rollover") in your RFA !vote. Just sayin' it because it confused me and may others. Softlavender (talk) 22:44, 8 February 2017 (UTC)

Ooops, I did mean "rollback". Sorry. Glrx (talk) 22:50, 8 February 2017 (UTC)
re: this revert

Hello Glrx, I saw that you reverted my edit on FEM. What does "Not RS" stand for? I do not understand the meaning of this. Nicoguaro (talk) 04:12, 24 February 2017 (UTC)

It means not WP:RS, not a reliable source. Wikipedia does not consider self-published internet forums reliable sources. Glrx (talk) 05:23, 24 February 2017 (UTC)

A tag has been placed on Physics Girl requesting that it be speedily deleted from Wikipedia. This has been done under section G6 of the criteria for speedy deletion, because it is an orphaned disambiguation page which either

  • disambiguates two or fewer extant Wikipedia pages and whose title ends in "(disambiguation)" (i.e., there is a primary topic); or
  • disambiguates no (zero) extant Wikipedia pages, regardless of its title.

Under the criteria for speedy deletion, such pages may be deleted at any time. Please see the disambiguation page guidelines for more information.

If you think this page should not be deleted for this reason, you may contest the nomination by visiting the page and clicking the button labelled "Contest this speedy deletion". This will give you the opportunity to explain why you believe the page should not be deleted. However, be aware that once a page is tagged for speedy deletion, it may be removed without delay. Please do not remove the speedy deletion tag from the page yourself, but do not hesitate to add information in line with Wikipedia's policies and guidelines.  GILO   A&E  23:40, 26 February 2017 (UTC)

Incompetent tag as DAB when it is not. Then changed to redir to nonexistent article, but that CSD does not include "pages which are useful to the project such as user subpages and talk pages, talk page archives, information for a future article, Articles for Creation drafts, etc." What happened to WP:BEFORE? US News wrote about her.[2] It was also tagged while I was creating the redir target article Dianna Cowern. Glrx (talk) 00:01, 27 February 2017 (UTC)

DYK for Oroville Dam crisis

On 28 February 2017, Did you know was updated with a fact from the article Oroville Dam crisis, which you recently created, substantially expanded, or brought to good article status. The fact was ... that nine million fish were rescued from the Feather River Fish Hatchery during the 2017 Oroville Dam crisis (pictured)? You are welcome to check how many page hits the article got while on the front page (here's how, Oroville Dam crisis), and it may be added to the statistics page if the total is over 5,000. Finally, if you know of an interesting fact from another recently created article, then please feel free to suggest it on the Did you know talk page.

Mifter (talk) 12:02, 28 February 2017 (UTC)

Hello, I noticed that you reverted both my explanation and code for Boyer–Moore

Glrx, I have to say there is a bug in the original C code, and I modified it based on my understanding (based on unit tests in C make). My comments are also a necessary supplement to the Boyer–Moore algorithm, to illuminate the connection between KMP and Boyer–Moore and its possible variants.

So please don't revert it back to the original state. If you have any concerns or insights on it, please let me know.

— Preceding unsigned comment added by 101.251.252.38 (talkcontribs) 10:00, 4 March 2017 (UTC)

re: Apparently about my revert almost 4 months ago at Boyer–Moore string search algorithm
re: User:Andrew Helwer also reverted the editor.
The edits were reverted because the prose was too informal and code was commented out rather than properly edited. The description of BM should not rely on the reader already understanding KMP.
Glrx (talk) 19:07, 4 March 2017 (UTC)

Enigma Simulators Revert

Hello - just a quick question regarding the Enigma simulators listing. I appreciate that Wikipedia is not a directory, but what gives the other emulators/simulators the 'right' (for lack of a better term) to be listed, but not mine? If there are criteria set out, or details of why mine is not sufficient or does not meet certain expectations or requirements, that would be appreciated. Also, given that the list is marked as 'incomplete', surely you want some additional ones in the article?

Kind regards, --Chris Preston (talk) 19:20, 12 March 2017 (UTC)

re: my revert at Enigma machine
Many of the simulators in the list should not be there. Arduino and PIC projects probably are not accessible to most WP readers, and those projects are probably not mentioned in secondary sources. Wikipedia is not a place for advertising one's programming projects. If I had more time, I'd clean more out.
One WP metric is WP:DUE. Frode Weierud's simulators would be in; his website offers many relevant documents. His website is often cited in journal articles about the Enigma.
WP also does not like personal web pages as sources. WP prefers secondary sources because WP editors are not supposed to do WP:OR. WP wants some secondary source to judge a primary source's work. It's not good enough that something exists; WP wants some independent and reliable source to judge the work.
WP:COI implies that we cannot take an involved person's statements as truth. Your comments about how good your simulator is would not carry much weight.
What does your simulator offer that others do not? I would tend to keep simulators that simulated Enigmas that were not covered by other programs.
Daniel Palloks' Universal Enigma has broad coverage. In addition, IIRC, it has been recommended by published sources. It is certainly the simulator to which I would point someone. I've used his simulator to verify the accuracy of my Enigma simulator (which is not in the list) and to check results from other, published, simulators. Many programmers (even published researchers) get the stepping wrong and therefore have a defective simulator and dubious results.
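For reference, the stepping detail that trips up most implementations is the double-step anomaly: the middle rotor advances not only when the right rotor sits at its turnover notch, but also on the keypress after the middle rotor reaches its own notch. A minimal Python sketch of three-rotor stepping (the notch letters are the usual window letters quoted for rotors I–III; the starting positions and rotor order are just an illustrative assumption, and no real simulator's code is being reproduced here):

  # Sketch of three-rotor Enigma stepping, including the double-step
  # anomaly.  Notch letters are the commonly quoted window letters for
  # rotors I-III; everything else is an arbitrary example.
  NOTCH = {'I': 'Q', 'II': 'E', 'III': 'V'}

  def step(positions, order=('I', 'II', 'III')):
      """Advance (left, middle, right) window letters by one keypress."""
      left, middle, right = positions
      adv = lambda c: chr((ord(c) - ord('A') + 1) % 26 + ord('A'))
      if middle == NOTCH[order[1]]:     # middle sits at its own notch:
          left = adv(left)              # left and middle both step
          middle = adv(middle)          # (the middle rotor's "double step")
      elif right == NOTCH[order[2]]:    # normal carry from the right rotor
          middle = adv(middle)
      right = adv(right)                # right rotor steps on every keypress
      return (left, middle, right)

  # Classic check: ADU -> ADV -> AEW -> BFX; the middle rotor moves on
  # two consecutive keypresses.
  p = ('A', 'D', 'U')
  for _ in range(3):
      p = step(p)
      print(''.join(p))

A simulator whose output diverges from sequences like this one is usually mishandling exactly that carry logic.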
WP:OTHERSTUFFEXISTS is not a reason to add your project.
The overarching point: what is so important about your simulator that an encyclopedia should mention it?
Glrx (talk) 20:52, 12 March 2017 (UTC)

Hi Glrx,

Firstly, thank you for your response. I'm glad you said that most probably should not be there, and talked about accessibility. From my perspective, the only argument I would make for mine over *some* others is accessibility, as it's built for the web (to modern standards, i.e., HTML5), and not using a device-specific platform, or an increasingly niche platform such as Flash or Java - it potentially makes it more accessible to most users of WP. That said, as you rightly pointed out, my comments alone don't carry weight. As for the personal site, I did wonder if that would be a reason - and while there isn't any real content on my personal site to promote myself (it's just a collection of random things), I did consider giving this project its own domain, etc., for this reason and appreciate this completely.

I know full well about the stepping issues; I did end up working with an expert to get this working correctly originally. I do always check against the example you provided, and several other simulators, whenever a change to the codebase is made, to ensure errors haven't been unintentionally introduced. But your other points still apply here.

Thanks again for your honest and helpful response. It would be nice if the list were to be cleaned up going forward, if nothing else to reduce the likelihood of this kind of issue recurring. Chris Chris Preston (talk) 23:37, 12 March 2017 (UTC)

I deleted the request to add to the list of Enigma simulators.
I deleted four or five simulators that were no longer available, deficient, or sounded more in advertising.
Glrx (talk) 22:36, 20 March 2017 (UTC)

Ray-Chaudhuri: May confuse readers (BCH codes)

Don't you think the line "(mistakenly, in the case of Ray-Chaudhuri)" may confuse readers? So it's better that it be removed. What are BRH codes... as in the case of "Misunderstands: BRH codes"? Thanks.-Polytope4d (talk) 16:14, 3 April 2017 (UTC)

Well, it's alright. I hadn't seen the latest update.-Polytope4d (talk) 16:18, 3 April 2017 (UTC)

Invalid invocation of CITEVAR

Hi Glrx, I take issue with your reversion ([3]) of my edit in the Hacker's Delight article on the grounds of WP:CITEVAR. Please note that I am the only editor who added citations to this article so far, and I am also one of the major editors of the article (counted by volume added, number of edits or editing span). Therefore, per WP:CITEVAR it would be me to decide which citation style to use, not you, who did not contribute to the article at all. My preferred style is to use the cite/citation template framework. In articles which can benefit from it (that is, articles which already have reasonably well-formatted references, so it isn't a wasted effort), I also prefer references to be listed in the references section as they are much easier to maintain this way, in particular if the article grows and more references get added. If you want a different style to be used, please contribute to the article and raise the issue on the article's talk page first. --Matthiaspaul (talk) 19:54, 10 April 2017 (UTC)

The article had 2 existing inline references in the typical manner that such refs are used. The ref style for the article had been set and was not inconsistent. You converted those two references to indirect/out-of-line. Such a conversion is specifically to be avoided at WP:CITEVAR: "changing where the references are defined, e.g. moving reference definitions in the reflist to the prose, or moving reference definitions from the prose into the reflist."
Glrx (talk) 20:26, 10 April 2017 (UTC)
(edit conflict) Sorry, but this does not make sense at all. In an article this short nothing is set in stone - that would be bean-counting and actively hindering article development, which would be against our primary project goals. Anything that helps the contributors of the article to continue working on the article and further improve it is fine - and fine with me as well. Now, I am one of the contributors to the article, whereas you have not contributed anything so far. Still, you are hindering me in my work on the article. This is very unconstructive. However, if you still think you must invoke WP:CITEVAR, then I insist that you read it correctly. It clearly states:
"As with spelling differences, it is normal practice to defer to the style used by the first major contributor or adopted by the consensus of editors already working on the page, unless a change in consensus has been achieved. If the article you are editing is already using a particular citation style, you should follow it; if you believe it is inappropriate for the needs of the article, seek consensus for a change on the talk page. If you are the first contributor to add citations to an article, you may choose whichever style you think best for the article."
It happens that I am the contributor who added not only the first but all references, and I was in the very process of adding yet another reference when you decided to disturb me and waste precious time I would prefer to spend on article improvements. Please don't get me wrong, I certainly don't own the article and don't want to, but as you invoked WP:CITEVAR, this very guideline makes me the one to define the format. Also per WP:CITEVAR, consensus for changes that could be controversial should be sought on the talk page among those who already contributed to the article. You are not one of them, so per WP:CITEVAR I would not even have to consider your opinion. Being constructive, I still would have, had you raised any plausible argument for another style instead of acting as a roadblock. Unfortunately, you did not.
--Matthiaspaul (talk) 00:00, 11 April 2017 (UTC)
" If you are the first contributor to add citations to an article, you may choose whichever style you think best for the article." does not mean that you can come back months later and alter the existing style.
Glrx (talk) 20:26, 10 April 2017 (UTC)
Says who? This obviously is your private opinion, but it is not backed up by WP:CITEVAR. I would not normally care about it, but I happen to be the first and only contributor of the six references in the article, and I did not explicitly "set" any particular style. I am just continuing to use the same style I was using before, with the only difference that I moved the reference definitions into the References section (as was already the case with the refs in Further Reading - so much regarding consistent style), as this makes the article body much easier to read at the source code level and it makes the references much easier to maintain and keep their style in sync. Also, this isn't much different from invoking one reference multiple times - as soon as you do, a clean "inline" style cannot be maintained any more, anyway. All this is perfectly normal procedure in article development and nothing special.
In most articles, references get added in all sorts of rough "styles" at the beginning just to get something in at all first. Later, the references get filled with more and more information and most often they are converted to use cite template style. <references> tags typically get replaced by {{reflist}}. And in the better maintained articles, the refs are often moved into the References section for easier maintenance, although this seldom happens right from the start when the infrastructure is still developing. All this happens all over the place in peaceful and constructive cooperation of uncountable editors with very little discussion - simply because no discussions are needed when there is general consensus that any reference improvements are a good thing. And now you come along and are basically telling me I would not be allowed to edit and further improve the references I myself added? At the same time when the article is boldly tagged for reference improvement? Had I thrown in raw links as references, you would now try to disallow me from changing them to cite templates? I hope you recognize how absurd your argument is.
--Matthiaspaul (talk) 00:00, 11 April 2017 (UTC)

RfA

Hi. Just my personal opinion: WP:ORCP is a purely informal, no-obligation process - IMO, what it does and/or what its outcomes are/were should probably not be topics of discussion in an actual RfA. It's nevertheless an excellent initiative and one which I very much support. It's encouraged some editors to throw their hat in the ring with success, and saved others from excruciating embarrassment. Kudpung กุดผึ้ง (talk) 01:08, 11 April 2017 (UTC)

Wiktionary template placement

Hey, not that I mind on that specific article since you ended up adding the idiomatic usage, but in general there isn't a standard requirement for where the wiktionary template is placed. From the documentation: "The template may be placed anywhere, such as the External links section, the beginning of the article or in the article's etymology section if one exists." Opencooper (talk) 01:50, 21 May 2017 (UTC)

Fixing the intro is the better thing to do; putting a wikt box nearby does not fix the problem. Wikt is not a reliable source, so it does not belong at the top in a regular article despite what the template suggests (and the template really suggests EL); up top in a DAB would be OK because a DAB is more like a see also/external links section. The article's wikt entry was already down below, so moving it up top sounds in a style change. Glrx (talk) 20:00, 22 May 2017 (UTC)
Fair enough, best to address the issue directly. Though I didn't want to segue the article to a secondary topic since that's more what disambiguation pages are for. Opencooper (talk) 21:10, 22 May 2017 (UTC)

Perfect hash function: removed link

You have removed a link to the library I wrote. I wonder what the rules are for adding such links (when would it be OK to add a link). ThomasMueller (talk) 15:50, 19 June 2017 (UTC)

Editors with a conflict of interest need to be circumspect about adding their own work.
The link I removed was a duplicate; the GitHub library was cited as a reference.
I'm not sure, but it may be appropriate to revert the modified bits-per-key claim and restore the earlier value. The "given enough time" qualifier makes me queasy. See also WP:RS. Personal websites are not reliable sources. I'd prefer a secondary source that states the claim; I'll accept a refereed journal.
Glrx (talk) 16:10, 19 June 2017 (UTC)
Conflict of interest: I see, and kind of agree. However, I believe most (if not all) other links were added anonymously. If you want to remove the other link feel free (there is no published paper yet). Once a paper is there I guess adding a link is appropriate.
"Given enough time": this is in the "Hash, displace, and compress" paper, Conclusions (page 11): "... the CHD algorithm can be tuned ... to obtain more compact functions. ... it has the ability,..., to approximate the information theoretic lower bound...". Also, Rasmus Pagh (who did a major edit of this page and is a subject matter expert) told me by email that "it is well known that one can approach 1.44… bits/key arbitrarily given enough time". So I think it makes sense to keep this information. ThomasMueller (talk) 16:59, 19 June 2017 (UTC)
That was a problem I had with a quick scan of the paper. There seemed to be tests that showed low bits per key, but there was not a definitive statement of bpk. If I'd seen one, then I would have reverted. WP does not want editors reading papers and then drawing some conclusion. See WP:OR. WP wants statements that can be verified. Email communications are worse than personal websites. I don't know the material. It would be good to state an information theoretic lower bound if there is one, but such statements can be dangerous if there are huge costs to get there. Glrx (talk) 18:37, 19 June 2017 (UTC)
The paper doesn't have a definitive statement of bits per key for the MPHF, that's true. But in the conclusion (which is also part of the paper, and is reviewed as well), it says that the algorithm can approximate the theoretical lower bound. The paper also shows that it is possible, with today's computers, to easily reach 2.1 bits per key, and that it gets more and more expensive to reach lower bits per key. I understand emails are not valid references of course. By the way, the information theoretic lower bound is around 1.44 (that is already on Wikipedia, with a reference, I can add other references if needed). ThomasMueller (talk) 08:16, 20 June 2017 (UTC)
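For context, the ≈1.44 figure quoted above is just log2 e. A standard heuristic counting argument (sketched here for orientation only, with lower-order terms dropped; it is not taken from the CHD paper itself) goes: a uniformly random function from an n-key set onto {1, ..., n} is a bijection with probability n!/n^n, so describing a minimal perfect hash function takes roughly

  \[
    \log_2 \frac{n^n}{n!} \;\approx\; n \log_2 e \;\approx\; 1.4427\,n \text{ bits,}
  \]

i.e. about 1.44 bits per key, which is the information-theoretic lower bound the CHD paper says its functions can approach.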

Case opened

You were recently listed as a party to or recently offered a statement in a request for arbitration. The Arbitration Committee has accepted that request for arbitration and an arbitration case has been opened at Wikipedia:Arbitration/Requests/Case/Maglioladitis 2. Evidence that you wish the arbitrators to consider should be added to the evidence subpage, at Wikipedia:Arbitration/Requests/Case/Maglioladitis 2/Evidence. Please add your evidence by August 6, 2017, which is when the evidence phase closes. You can also contribute to the case workshop subpage, Wikipedia:Arbitration/Requests/Case/Maglioladitis 2/Workshop. For a guide to the arbitration process, see Wikipedia:Arbitration/Guide to arbitration. For the Arbitration Committee, Miniapolis 16:57, 23 July 2017 (UTC)

Please trim your evidence, which is currently three times the 500-word limit. For the Arbitration Committee, Miniapolis 23:30, 23 July 2017 (UTC)
Having received no request for a raise in the word limit and having seen no progress in reducing the length, I have now trimmed your evidence to 500 words. [4]. Ks0stm (TCGE)  If you reply here, please ping me by using {{re|Ks0stm}} in your reply.  00:22, 26 July 2017 (UTC)

RfA

Thanks for supporting my run for administrator. I am honored and grateful. Cullen328 Let's discuss it 20:47, 23 July 2017 (UTC)

Magioladitis evidence phase closing soon

It's scheduled to close in a few hours. For the Arbitration Committee, Miniapolis 17:04, 13 August 2017 (UTC)

Just to let you know

RE: Voynich manuscript edit

Just to let you know, "CreateSpace" is a branch of Amazon that allows people to self-publish their own works.--Mr Fink (talk) 20:48, 29 August 2017 (UTC)

Thanks. I googled it and saw "self-publish", so I figured most CS material would be similar to a blog and not WP:RS. I did not know it was Amazon, but that makes sense because the book was for sale there. Glrx (talk) 22:20, 29 August 2017 (UTC)
Yeah, Amazon made the ultimate "Poor Man's Vanity Press". --Mr Fink (talk) 22:58, 29 August 2017 (UTC)

Field-programmable gate array

RE: diff at Field-programmable gate array

Why did you remove my addition of an open-source section to the article? There wasn't an explanation.

Just wondering why?

— Preceding unsigned comment added by Back ache (talkcontribs) 12:30, 4 September 2017 (UTC)

The addition has several problems. My edit comment of "DUE?" is a reference to WP:DUE. The statement "An opensource set of tool's has recently appeared to address the previously closed nature of the chips and the software to program them" just implies the existence of such tools; it does not show that the tools are reasonable or accepted. Most FPGA manufacturers are providing free tools for their (proprietary) products. Furthermore, Wikipedia wants WP:SECONDARY sources that provide independent assessments. Moreover, Wikipedia wants reliable sources; YouTube is not a reliable source. On another front, WP's mission does not include being a how-to guide. Glrx (talk) 15:27, 4 September 2017 (UTC)

Voynich wiki

re: my revert at Talk:Voynich manuscript

I mentioned the wiki primarily to indicate an alternative place for the discussions (as 'WP is not a forum') and Original Research etc that the VM tends to attract (like some other topics). Jackiespeel (talk) 09:34, 12 September 2017 (UTC)

ANI notice

There is currently a discussion at Wikipedia:Administrators' noticeboard/Incidents regarding an issue with which you may have been involved. ―Mandruss  23:08, 11 October 2017 (UTC)

ANI archive

Revert of edit in "Replay attack"

You just reverted my edit in Replay attack. I added "lang=en" to the SVG file used as illustration, which you reverted with the comment "if lang=en is needed here, then something is wrong with the SVG file)". First, thanks for adding a comment - many users don't bother. However, I don't understand why you reverted the edit.

The "lang=en" is important to make sure the SVG file is rendered in English. It's true that in its current revision, the SVG file would be rendered in English even with the "lang=en", but that's only because there is fallback text without an associated language in the SVG file (i.e., a text tag without "systemLanguage" attribute). If someone edits the SVG file and drops the fallback text, or substitutes a different SVG file without one or with fallback text in a different language, the rendering will be wrong. So I think "lang=en" makes sense, both as safety against future changes, and to make explicit that we are selecting the English text. Even if it's not presently needed, it does not hurt either, so I think it makes sense to use it. Or do you see some specific problem with it?

I'd like to reinstate my edit. However, I'll wait a while for your response, since I don't want to start an edit war :-).

Sebastian (talk) 09:52, 21 November 2017 (UTC)

All switch-translated SVG files should have a fallback/default. The Replay attack illustration's fallback should be en because the diagram started out in English. Specifying the language is not needed when the language is the default.
Removing the fallback clauses from the SVG file would be an error, so that is not a reason to insist on lang=en. Somebody could also edit the SVG file and drop the English text completely, and that action would wreck the display even if there were an explicit en.
Leaving out lang=en does not display the SVG's default text; it displays as if lang=en were specified.
MediaWiki has unusual semantics for systemLanguage. In a sensible world, xx.wikipedia.org would default to lang=xx. Somebody using the xx.wikipedia.org site would expect images in lang=xx. Instead, MW defaults to lang=en on all Wikipedias. That default prevents MW from generating lots of identical PNGs for languages the SVG does not support.
If MW ever serves the SVG directly, the lang parameter will be trouble. The parameter is a substitute for the browser's user preference setting. If the SVG is served directly, then the browser (not the lang parameter) will determine the displayed image (and MW would need to build PNGs for different languages).
In the short term, specifying lang=en does nothing. In the long run, it confuses the issues.
Glrx (talk) 17:36, 21 November 2017 (UTC)
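To make the selection semantics concrete, here is a rough Python model of the behaviour described above (my own illustration, not MediaWiki or librsvg code): the first alternative in an SVG <switch> whose systemLanguage matches wins, and an alternative with no systemLanguage attribute acts as the catch-all fallback, which is why lang=en and no lang at all look identical as long as the English fallback is present.

  # Rough model of SVG <switch> language selection.  Each alternative is
  # (systemLanguage value or None, text); None marks the fallback <text>
  # element that has no systemLanguage attribute.  Tag matching is
  # simplified (real SVG also does prefix matching on subtags).
  def pick_text(alternatives, requested_lang):
      for system_language, text in alternatives:
          if system_language is None:
              return text                      # fallback always matches
          langs = [t.strip() for t in system_language.split(',')]
          if requested_lang in langs:
              return text                      # first matching alternative wins
      return None                              # no match and no fallback

  # Hypothetical alternatives for an English-first diagram:
  switch = [('de', 'Angreifer'), ('fr', 'Attaquant'), (None, 'Attacker')]
  print(pick_text(switch, 'de'))   # Angreifer
  print(pick_text(switch, 'en'))   # Attacker -- served by the fallback, so
                                   # requesting lang=en and omitting lang agree

Drop the fallback entry, as in the scenario Sebastian describes, and an English request returns nothing, which is the failure mode being discussed.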
Thanks for taking the time to explain the background. I did not know that "lang" defaults to "en" in Wikipedia and MediaWiki in general. I agree that there is no point in including a parameter with the default value. I'll edit the help pages for File to mention this. Sebastian (talk) 08:39, 28 November 2017 (UTC)
I wish that MW handled SVGs more sensibly, but I have little hope of that happening anytime soon.
One would expect the following mapping for the lang parameter in a File:Replay attack on hash.svg inclusion:
  1. no lang specified → src="//upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Replay_attack_on_hash.svg/300px-Replay_attack_on_hash.svg.png"
  2. lang=en → src="//upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Replay_attack_on_hash.svg/langen-300px-Replay_attack_on_hash.svg.png"
  3. lang=de → src="//upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Replay_attack_on_hash.svg/langde-300px-Replay_attack_on_hash.svg.png"
but 2 produces the same URL as 1. Go figure.
The result is also complicated by the default that librsvg uses when it renders.
If I want to see the default text in an SVG file, I use lang=tlh (Klingon) or lang=qqz (private language).
Glrx (talk)

ArbCom 2017 election voter message

Hello, Glrx. Voting in the 2017 Arbitration Committee elections is now open until 23.59 on Sunday, 10 December. All users who registered an account before Saturday, 28 October 2017, made at least 150 mainspace edits before Wednesday, 1 November 2017 and are not currently blocked are eligible to vote. Users with alternate accounts may only vote once.

The Arbitration Committee is the panel of editors responsible for conducting the Wikipedia arbitration process. It has the authority to impose binding solutions to disputes between editors, primarily for serious conduct disputes the community has been unable to resolve. This includes the authority to impose site bans, topic bans, editing restrictions, and other measures needed to maintain our editing environment. The arbitration policy describes the Committee's roles and responsibilities in greater detail.

If you wish to participate in the 2017 election, please review the candidates and submit your choices on the voting page. MediaWiki message delivery (talk) 18:42, 3 December 2017 (UTC)

CITEVAR

Don't. Just... don't. --SarekOfVulcan (talk) 20:23, 19 December 2017 (UTC)

Merry Christmas

And a Happy New Year. Thanks, Glrx, for all you do around Wikipedia. I hope your holiday season is a joyous one and the coming year brings many days of happiness and wonder. (By the way, if you don't celebrate Christmas then please take it as a Happy Hanukkah, Merry Makar Sankranti, Enlightening Bodhi Day, Merry Yule, Happy Tenno no tanjobi, or fill in whatever holiday is your preference.) Zaereth (talk) 00:48, 23 December 2017 (UTC)

@Zaereth: Thanks for the note and for what you do around Wikipedia. Season's greetings to you and yours, too. I hope Steve doesn't get jealous of reindeer this time of year. Glrx (talk) 19:33, 24 December 2017 (UTC)

Seasons' Greetings

...to you and yours, from Canada's Great White North! FWiW Bzuk (talk) 21:03, 24 December 2017 (UTC)

Thanks for the note, and may you have a great new year. Glrx (talk) 20:44, 26 December 2017 (UTC)

Request for feedback

Hello Glrx, season's greetings and a very merry Christmas to you. Hope all is well. If you might recall, at my Rfa early this year, you and other established editors had provided significant feedback on the areas where you felt I could improve. Over the past year, I've worked on the areas that were pointed out during the Rfa (including the Afd participation you had pointed out; I've quite often since then taken the lead in the discussions). I wanted to request you, whenever you might have time, to review my contributions and to provide further feedback if necessary. It would really be wonderful to hear from you. Happy holidays once more and wishes for a great new year. Warmly, Lourdes 08:35, 24 December 2017 (UTC)

@Lourdes: Thanks for the greetings, and may things be well with you. I don't know how to predict RfA outcomes, but you should wait several more months or even longer before running again. I took a brief look at your AfD !votes, and yes, you are not delaying your opinion to polish the stats, but I still sense problems. RS is not passing mention, and it must be at least regional rather than merely local. Even newspapers with global reputations publish reworked press releases; that does not make the PRs reliable. Articles that merely quote an interviewee are not secondary. I worry about the time you devote to your opinions: I doubt that politician Fawzia Peer ("a labour consultant by profession")[5] is the same as radiologist Fawzia Peer ("who became the first female radiographer in South Africa to have achieved a doctorate in radiography and presently specialising in nuclear medicine at the state-run Inkosi Albert Luthuli Central Hospital in Cato Manor")[6]; the subject pictures look different. I want AfD arguments that apply policy to the situation, and those arguments should include an explanation of the application rather than just linking some WP policy and listing hyperlinks. Other issues may surface after an RfA has run for a few days and editors have had a chance to sift through contributions and talk pages. Glrx (talk) 02:29, 2 January 2018 (UTC)
Glrx, first of all, thanks for taking the time out to review the contributions, especially at this time of the year. I've gone through your review and do accept that the second link is completely unrelated to the subject – I'm surprised how this missed my view and that of all the other commentators; I guess this is my once-in-a-year share of unexpected slip-ups. I've taken care to point out the two most relevant links in my follow-up comment in the same Afd. I also deeply understand what reliable sources can be accepted towards assessing notability and the viewpoint that interviews are predominantly primary sources; sometimes, in my opinion, interviews have significant introductory material that may assist in assessing whether the subject qualifies on WP:BASIC.
That said, I also completely accept and understand your viewpoints and expectations about editors providing explanations of the application rather than just linking policy and hyperlinks. In most of my Afd participations over the past year (for example, from November and December 2017: in Afds like 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11... and in most Afds over the past months), I've taken care to provide exhaustive explanations rather than just link policy and hyperlinks. It's perhaps in the straightforward cases where I have provided a concise comment like "Delete, fails GNG, as I could find no sources...".
I am confident you'll consider the effort I've taken as an editor in most of my Afds in context. Having said that, I'll necessarily keep the points you've raised at the top of my mind. If I were to run an Rfa in between, I'll put a note out for you and also mention your comments within the Rfa for the perusal of other editors. I'm not saying this just for the sake of saying it as I really mean it – I absolutely value the time you've taken out to review my contributions and I've significantly gained from your perspective. Love and wishes for the new year, look forward to interacting with you in the future too. Warmly as always, Lourdes 04:07, 2 January 2018 (UTC)

Undoing edit / call by macro expansion

Re: this edit

I just saw you reverted an edit of mine on evaluation strategies. Please don't undo an edit like that without providing sources. Call by macro expansion is not a term from the computer science literature, AFAIK; what is described there is essentially call by name. And the paragraph is wrong about hygienic macros, which exist in languages with a clear mix of call-by-need and call-by-value, like Scheme and Common Lisp. Nowhere man (talk) 15:59, 21 November 2017 (UTC)

You should raise your issues on the article talk page. Macro expansion is an evaluation strategy that uses substitution, and it is used by TeX. TeX has enormous problems with the user needing to know the call level for quoting expressions. Nobody claims that languages must use a uniform evaluation strategy. Glrx (talk) 17:54, 21 November 2017 (UTC)
I disagree with your reverting of my edit and wish to discuss it. I don't get why you would just plainly revert instead of correcting the text, because to my knowledge, TeX has nothing like hygienic macros and their very concept comes from a place where there's nothing like call by macro expansion as an evaluation strategy. Nowhere man (talk) 15:55, 2 January 2018 (UTC)

Deletion discussion about Hilda Clayton

Hello, Glrx,

I wanted to let you know that there's a discussion about whether Hilda Clayton should be deleted. Your comments are welcome at Wikipedia:Articles for deletion/Hilda Clayton .

If you're new to the process, articles for deletion is a group discussion (not a vote!) that usually lasts seven days. If you need it, there is a guide on how to contribute. Last but not least, you are highly encouraged to continue improving the article; just be sure not to remove the tag about the deletion nomination from the top.

Thanks,

Icewhiz (talk) 14:00, 7 January 2018 (UTC)

Berlekamp Welch article

I'm considering a rewrite, please take a look at Talk:Berlekamp–Welch_algorithm Rcgldr (talk) 04:45, 15 January 2018 (UTC)

distribution for polyphase merge sort - unsourced - but mathematically based proof of concept

Why would a source be needed when basic math can be used as a proof? I'll see if I can find a source for this, but I would think that a basic mathematical proof should be sufficient.

For the 3 file case, the pattern is a Fibonacci sequence as stated earlier in the article, and the basic equation for Fibonacci sequence can be stated as fib(i+1) = fib(i) + fib(i-1). a has n runs, where fib(i) < n < fib(i+1). Move n - fib(i) runs to b, "rewind" b, move fib(i+1) - n runs to c (do not "rewind" c). The math at this point:

runs remaining on a = n - (n - fib(i)) - (fib(i+1) - n) = n + fib(i) - fib(i+1) = n - fib(i-1)
runs on b = n - fib(i)
runs on c = fib(i+1) - n

Merge n-fib(i) runs from a and b, appending the n - fib(i) merged runs to c, "rewind" b, "rewind" c, but do not "rewind" a.

runs remaining on a = (n - fib(i-1)) - (n - fib(i)) = fib(i) - fib(i-1) = fib(i-2)
runs on b = 0
runs on c = (fib(i+1) - n) + (n - fib(i)) = fib(i+1) - fib(i) = fib(i-1)

The end result is an ideal distribution of runs for 3 files. Rcgldr (talk) 04:19, 5 April 2018 (UTC)
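A quick numeric check of the algebra above (my own sketch in Python, treating runs purely as counters; the derivation does not fix a Fibonacci base, so fib(1) = fib(2) = 1 and the sample values n = 20, i = 7 are assumptions for illustration):

  # Verify the 3-file argument above: start with n runs on a, where
  # fib(i) < n < fib(i+1), do the initial split plus one partial merge,
  # and check that the counts land on a perfect distribution
  # (fib(i-2), 0, fib(i-1)).  Runs are only counted, never sorted.
  def fib(k):
      a, b = 1, 1                     # fib(1) = fib(2) = 1
      for _ in range(k - 1):
          a, b = b, a + b
      return a

  def distribute_3_file(n, i):
      assert fib(i) < n < fib(i + 1)
      a, b, c = n, 0, 0
      b += n - fib(i);     a -= n - fib(i)       # move n - fib(i) runs to b
      c += fib(i + 1) - n; a -= fib(i + 1) - n   # move fib(i+1) - n runs to c
      merged = n - fib(i)                        # merge run pairs from a and b,
      a -= merged; b -= merged; c += merged      # appending merged runs to c
      return a, b, c

  n, i = 20, 7                        # fib(7) = 13 < 20 < 21 = fib(8)
  print(distribute_3_file(n, i))      # (5, 0, 8) = (fib(5), 0, fib(6))

Any n strictly between consecutive Fibonacci numbers gives the same shape, which is what the algebra shows in general.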

WP requires sources for difficult topics. Polyphase merge sort is a difficult topic. Knuth says that PPM Algorithm D is not optimal, "but it is not clear how to do much better than Algorithm D without considerable complication in such cases, especially if we do not know S in advance." (pp. 278–279.)
As for your "proof", it fails at "a has n runs". You are confusing records and runs. In general, S, the number of runs in a, is not known until the entire input tape has been read and processed. In the typical sorting problem, we may know the number of input records, but we do not know how those records will associate into runs.
Glrx (talk) 06:14, 5 April 2018 (UTC)
"... especially if we do not know S in advance" which implies that there are instances where "we do know S in advance". This could be a deliberate choice such as tracking record counts for datasets and using a fixed number of records per initial run during the initial distribution. My point here is the existence of an optimizing distribution algorithm if the number of runs is known in advance, and it should be included in the wiki article. I chose the 3 file case since the math is relatively simple. Rcgldr (talk) 08:49, 5 April 2018 (UTC)
But you need a source for the S is known in advance claim. In most sorting routines, little is known about the input other than it's a sequence of records. Knuth's statement also applies if S is known. Existence is not proof of optimality, and WP is not the place for editors to publish their insights into sorting algorithms. Glrx (talk) 15:54, 5 April 2018 (UTC)
I'll see if I can find one. The issue is that even if keeping track of record counts was common at some shops, finding documentation will be difficult because the time period is close to 50 years ago. Rcgldr (talk) 00:09, 6 April 2018 (UTC)
  • "But you need a source for the S is known in advance claim." I found a source that includes an optimal PPMS algorithm based on knowing the number of runs in advance: optimal polyphase sorting .pdf : "The first algorithm requires that the number of strings (runs) to be dispersed be known in advance; this algorithm is optimal." It's not until section 7 on page 43 of the article that "blind dispersion" (where the number of runs is not known in advance) is covered and is noted as being non-optimal. The key point here is that optimal PPMS requires knowing the number of runs in advance and there is a credible reference for this. A particular site's implementation for the storage and sorting of datasets could enforce knowing the number of runs in advance in order to keep PPMS optimal. I also assume that the backwards reading sorts would require knowing the number of runs in advance, but I haven't found a source that describes the algorithms involved, only that backwards reading sorts did exist. Rcgldr (talk) 04:42, 6 April 2018 (UTC)
  • Knuth covers practical considerations for tape sorting.[1] This includes keeping track of the number of records per tape (and therefore the number of runs if using a fixed number of records per run, except for the last run of a dataset). As mentioned in the section below, read backwards polyphase merge sort needs to know run count in advance in order to produce the proper run pattern. If the record / run count is not known, a N-1 way split could be done, followed by a N-1 way merge back to the original tape drive. This does one ordinary N-1 way merge sort pass. Another alternative would be to put the runs onto one or more tapes, and with a now known run count, do a partial distribution / merge to the remaining tapes (this could be complicated). I'll see if I can find a reference for this. Knuth also notes that the original tape should not be overwritten, so once it's read, the operator would replace the original tape with a working tape. This could occur as late as the second merge iteration with the algorithms I've mentioned. On the next page (338), Knuth compares some actual run times versus algorithms, and rewind time is a significant part of this, as well as the time the operator spends replacing the original tape to avoid overwriting it. Rcgldr (talk) 08:29, 19 May 2018 (UTC)
  • The reason I mentioned no dummy runs are needed when the initial number of runs is known in advance was for the readers that wouldn't realize that an initial setup that ends up with a "perfect" run distribution (of real runs) would not need any dummy runs. Clearly if there are fewer runs than the number of files - 1, then only n+1 files are used, except for the edge case where a single run is determined to be already sorted, in which case nothing is written and the original tape is already sorted, details that I didn't think needed mentioning. I'll leave the comment about no dummy runs out. Rcgldr (talk) 08:29, 19 May 2018 (UTC)
    @Rcgldr:. Thanks. There's a difference between optimal and perfect. One of my professors used to say, "certain statements require proof". I also think the article was going too far afield. In general, the number of runs are not known in advance, so the prong might be academically interesting but impractical; spending a lot of time on that prong may not be worthwhile. Glrx (talk) 15:10, 19 May 2018 (UTC)
In one of the references I cited (Knuth?), it was mentioned that in the days of tape sorting, at many data sites, the number of records on each tape in a library was kept track of (could be kept on the tape itself using the directory method I mentioned before), and with a fixed number of records per run on the initial runs, the number of runs would be known in advance. Other than being the fastest way to implement an internal 3 stack sort, polyphase merge sort is nearly obsolete, as most external sorts, such as Gnu sort for large text files, do a 16 way merge (17 files), way beyond the point where ordinary merge is faster than polyphase merge. Most of this is due to having 1GB or more of user space for an external sort, where each I/O operation involves a large amount of data, reducing the random access overhead. Rcgldr (talk) 21:25, 19 May 2018 (UTC)
@Rcgldr: and if somebody used a fixed number of records in a tapesort run, he'd be chastised for using twice as many runs as needed. Don't navigate from unknown positions. Glrx (talk) 21:40, 19 May 2018 (UTC)
Using a fixed number of records per run would make sense if the data was effectively random, like starting with a dataset sorted by name and sorting it by social security number. If there is the expectation of significant pre-ordering in a data-set, then the run size should vary, and yes the run count would not be known in advance. Rcgldr (talk) 23:46, 19 May 2018 (UTC)
It is appropriate to use variable length runs in sorting random data. The snowplow effect is too good to pass up. Glrx (talk) 23:54, 19 May 2018 (UTC)
Almost half of the distribution section in the other reference I cited is about optimal distribution when run count is known in advance, so at least that author thought it was important enough to include, and it's a citable reference. For an internal sort, such as the ones described in Wiki articles, the initial run size is often 1 record or 1 element. The merge sort article includes a section on natural merge sort, but most of it is about basic merge sort. Hybrid sorts like introsort or Timsort may take advantage of pre-ordering (although there's overhead when keeping track of variable run sizes). Rcgldr (talk) 00:09, 20 May 2018 (UTC)
PPMS is not about internal sorting. The game in external sorting is much different. Glrx (talk) 00:20, 20 May 2018 (UTC)
Unindent - the article is fine as is. Reordering the paragraphs might be a slight improvement (but not needed), mention ideal distribution due to ideal file size, or knowing number of runs in advance and performing a partial distribution to achieve ideal distribution. Then mention in the more common case when run count is not known in advance that dummy runs are used. I'm not sure if the optimal distribution using dummy runs when run count is known in advance is too complicated to explain for the article (it's 30 pages in that pdf file). I'm also not sure if it is still truly optimal or if it was only optimal at the time of publishing. The only reason I used that reference was to cite a reference where run count is known in advance. Knuth mentions keeping track of record counts when working with datasets, but runs counts could also be kept track of, if the performance improvement warrants it.Rcgldr (talk) 15:26, 20 May 2018 (UTC)
There's also the issue of what to do when multiple input files go empty (including all dummy runs) at the same time. In this case, it would be better to cycle output between the just emptied input files on the next iteration, or otherwise, all but one of the emptied input files is just left idle, as if the number of files was reduced. I'm not sure where this belongs in the article. Rcgldr (talk) 15:26, 20 May 2018 (UTC)
Side note - Derek Zave, who wrote the Stanford article, passed away in 1987. Donald Knuth is still alive at 80 years old and working on volumes 4, 5, 6, and 7 of The Art of Computer Programming. There's still an email address for The Art of Computer Programming for corrections, but staff filters much of that, and response time is stated to be around 6 months. The point here is that sources for algorithms from the 1960's and 1970's are being lost. I ran into the same issue when working on the Reed Solomon history section, but in that case, fortunately Dave Forney is still active and I was able to correspond with him via email to update the history section. Rcgldr (talk) 15:26, 20 May 2018 (UTC)
You are engaging in WP:OR and making assumptions that are troubling. My sense is that the number of runs is almost never known. There was an academic exercise that asked an academic question about optimal distributions, and led to a side investigation with an assumption that the number of runs was known. The existence of a paper does not imply the method is used or is practical. IIRC, Knuth's book remarks that the algorithm he gives with dummy runs is the one that is used; even if the number of runs is known, it may be too much trouble to deal with given its return. So then we get into WP:UNDUE territory. If known number of runs algorithms were a minor topic, then why is WP covering it at all? Or suggesting that KNofR is better. Your suggestion that systems kept track of the number of records, so they could keep track of the number of runs, is completely out of bounds. Where are the secondary sources (such as Knuth) that tell us that algorithms using known number of runs are used? WP does not want editors doing their own primary source literature surveys and WP:SYNthesizing conclusions. WP does not want editors propounding their own theories or ideas. WP also does not accept personal communications as a reliable source because those communications are not verifiable. Glrx (talk) 15:52, 20 May 2018 (UTC)
Considering that Derek Zave dedicated 30 pages of his Stanford paper optimal polyphase sorting .pdf on optimal distribution when the number of runs is known implies it was considered an important part of polyphase merge sort, regardless of the practicality. All I'm suggesting is to reorder the distribution section so that all of the idealized cases are listed first (run count just happens to be ideal, run count known in advance), followed by the more common case where run count is not known in advance. This would follow the flow of the article better, since the prior section just described distribution for an ideal 3 file case. Rcgldr (talk) 03:25, 21 May 2018 (UTC)
I did not rely on any personal communication or experience for the polyphase merge sort article. My communication with Dave Forney was for the Reed Solomon article. He supplied references to books I could use to confirm dates (I already had the book by Dr Weldon). He also noted that the Reed Solomon article was missing "original" view practical decoders, despite the fact that one of them, Berlekamp-Welch, already had its own WP article, so I added those to the Reed Solomon article. Rcgldr (talk) 03:25, 21 May 2018 (UTC)
You haven't responded to the issue of multiple input files being emptied at the same time (due to using dummy run counts). The issue is that although this can be mathematically demonstrated that it can occur, depending on the distribution algorithm, I haven't found a reliable source as "proof" that this occurs or how to deal with it. Seems the article should at least mention this can happen, but without having to explain how to deal with the situation. The situation could just be ignored, leaving a file with nothing but dummy runs idle until all of the dummy runs are exhausted, which may not be optimal, but it would work. Rcgldr (talk) 03:25, 21 May 2018 (UTC)
You are doing OR when you look at Zave's TR and declare that it is significant. A crackpot could devote 30 pages to why the moon is made of green cheese, but that does not make the paper significant. Where is the secondary source that tells us PPMS using runs known in advance were relevant. Even if a method is not used in practice, it can still be a reasonable research topic. Plenty of MS and PhD theses are written about esoteric ideas.
Knuth cites Derek Zave's work on pages 279 (PPMS distribution) and 682 (answer to a PPMS question) of TAOCP chapter 3, second edition. Rcgldr (talk) 18:13, 21 May 2018 (UTC)
I'll have to look. I'm pretty sure K's discussion says most imps don't bother trying to finesse the distribution because it is hard and there's little gain. I wouldn't be surprised to find that K is Zave's thesis advisor. Glrx (talk) 19:10, 21 May 2018 (UTC)
I haven't perused the RS article in a long time, but Nageh and I had quite a discussion about the original view of RS. There was no practical decoder at the time. The BW decoder used the non-original polynomial coefficient view. I think much later there was a practical decoder for the original view. In any event, there's nothing wrong with talking to the principals, and that practice should be commended. As I said, I haven't studied the article, so I cannot comment on its current state.
When was this, before 1986? Berlekamp-Welch (1986) is an original view (values) decoder, based on linear algebra, Gao (2002) is another original view decoder, based on extended Euclid algorithm. Both are covered along with examples in the current RS article. Perhaps you were thinking of Berlekamp-Massey (1969) a BCH view (coefficient) decoder? Rcgldr (talk) 18:13, 21 May 2018 (UTC)
You are right, I was thinking of Berlekamp–Massey (which is the only one I've looked at in detail). I've got a 1970's mindset. Glrx (talk) 19:10, 21 May 2018 (UTC)
I'm not sure what you're talking about, and I haven't waded through everything again. In Knuth's algorithm description, the dummy runs are used to fake a perfect distribution, so multiple files do not exhaust at the same time except on the final merge. IIRC, Knuth also recommends using the dummy runs first, so real runs remain until all the dummies are gone. Glrx (talk) 16:16, 21 May 2018 (UTC)
It may have been due to a poor choice of distribution algorithm; ignore this for now until I can find an example of this again. I don't recall where I read this. Rcgldr (talk) 18:13, 21 May 2018 (UTC)
OK. Glrx (talk) 19:10, 21 May 2018 (UTC)
  • "PPMS is not about internal sorting" - What is your basis for making this claim? PPMS is a sorting algorithm. It's use as an external sort algorithm is mostly legacy now, and it's practical as an internal sort, other than it requires 2 working arrays to implement a 3 array PPMS (in my testing 3 array PPMS is about 5% faster than 2 array 2 way bottom up merge sort). Rcgldr (talk) 05:52, 21 May 2018 (UTC)
  • The primary use of PPMS was as an external sort. It uses sequential reads and writes. Its structure acknowledges a limited number of drives. Internal sorts do not have a limitation on the number of drives.
I agree that the primary use of PPMS was as an external sort. For internal sorts, there's a performance related limit to the number of virtual drives, if trying to improve performance by taking advantage of register based variables, such as 8 pointers used to keep track of start and end of runs in a 4-way merge sort, which is faster than a 2 way merge sort on a system with 16 registers, but slower on a system with 8 registers. There's also the programming challenge of a 3 stack (LIFO) sort, where PPMS is fastest (in this case the number of files is arbitrarily chosen). Rcgldr (talk) 18:13, 21 May 2018 (UTC)
Today, I think internal sorts are uninteresting. Computers are fast, and I seldom have a lot to sort. A couple years ago I was doing internal sorts of about 50,000 1 kB items (50 MB, big deal), but the speed of the sort was never an issue, and qsort() was good enough. I've even done linear search on 1000 entry tables. Last night I was doing some database queries that took 9 seconds, but they were looking at a lot of records. For people who worry about the speed of internal sorting, I think issues such as cache coherency are significant. At least that's what BLAS libraries worry about. A friend used to work for VISA, and he talked about running huge external sorts at night. I think there's an annual sorting challenge: fastest sort, lowest power sort, .... Glrx (talk) 19:10, 21 May 2018 (UTC)
  • After reading the article again, the section about distribution is fine as is. No need to reorder it. Rcgldr (talk) 06:00, 21 May 2018 (UTC)

References

Polyphase merge sort - general comments

This is mostly a historical article, since modern computers have so much memory that 16-way merges (17 working files per merge step) are common, such as Gnu sort for large text files. Rcgldr (talk) 09:49, 5 April 2018 (UTC)

Some aspects of tape based sorting are not covered. I recall some implementations of polyphase merge sort took advantage of the ability of tape drives to read backwards, but it's not even mentioned in the article (and yes, I'm that old). It's somewhat complicated. I've written a 3 "stack" polyphase merge sort to get an idea of what is involved. The merging needs to alternate between ascending and descending after each merge step (after each merged run is produced). Unless there's some clever algorithm I'm unaware of, the run count needs to be known in advance for the distribution, which has to create the initial ascending or descending runs in order for the final merge to result in a single ascending sorted run. I had the impression that tape based sorting was competitive and resulted in some fairly complicated algorithms being developed, especially when 4 or more tape drives were involved, but such algorithms were later lost as computers evolved beyond the need for tape-based sorts. Rcgldr (talk) 09:49, 5 April 2018 (UTC)

To reduce the number of variables needed for sorting, tape drives could use single file marks to indicate end of run and double file marks to indicate end of data (note old tape drives could not sense end of data, so double file mark was a common method). IBM hard drives in the 1960's used variable block size, so a small record on a hard drive could be used similar to a file mark on a tape drive. Rcgldr (talk) 09:49, 5 April 2018 (UTC)

The old tape drives had the ability to create something like a directory. There was a gap command that would erase something like 3 inches of tape. 10 gap commands could be used to erase 30 inches of tape, then a file mark written, followed by the data. Then the tape could be rewound and the directory info (like record count) could be written. As mentioned above, there was no sensing for end of data or gaps, so to read such a tape, the program would read the directory record(s), then space forward file mark to skip the gap erased tape and the initial file mark to position to the first data record on a tape. I used this method on mini-computers back in the 1970's. Rcgldr (talk) 09:49, 5 April 2018 (UTC)

I was a bit surprised by the speed of an internal (all-RAM) polyphase merge sort. On my system (Intel 3770K at 3.5 GHz, Windows 7 Pro 64-bit, Visual Studio 2015), a 3-array polyphase merge sort is about 5% faster than a 2-way bottom-up merge sort, in spite of the fact that the polyphase moves about 5% more elements (and that I used 2^24 elements, a 2-way "friendly" count), perhaps due to cache effects. The 3-"stack" polyphase merge sort is about as fast as a 2-way bottom-up merge sort, but my "stacks" are implemented as pointers to elements in an array. All of these sorts are pointer-based C++ code. I tested with 2^24 (16,777,216) 64-bit unsigned integers, and the sorts take about 1.5 seconds on my system. Rcgldr (talk) 09:49, 5 April 2018 (UTC)

It's an algorithm that deserves to be covered. The k-way merge is common, but I think a DB used PPMS a few years ago (and has stopped using it since).
The main change is the size of memory and the ability to read or write large amounts of data in a single I/O, which reduces the relative overhead of random access. The external sorting article mentions using 100 MB of memory and 10 MB buffers; GNU sort might use 1 GB of memory on a 64-bit system, but the main factor is that GNU sort defaults to a 16-way merge, well beyond the point (8 working files) where an ordinary merge sort is faster than PPMS.
Yes, there were backwards-reading sorts, but I don't think they lasted long. IIRC, Knuth mentions them, and I tried some backward tape reads back in the day. The real problem was that reading a tape took 5 minutes and rewinding it took 1 minute, so saving the rewind time didn't gain much. Tape-based algorithms were competitive when disks were small, tape was cheap, and the input file fit on a single tape (no mount/unmount during the sort). When disks got big, it was easier to load the file to disk, sort it on disk, and write it to the output tape. System configurations were also changing. One mainframe I used had 11 200-MB disks but only two tape drives. The system's sort/merge package used disks.
That was why I mentioned that much of polyphase merge sort is historical, at least for tape sorts. The backwards-reading sorts were part of what could be called the competitive era of tape-based sorting. As mentioned previously, the algorithms involved are probably lost by now. Some of these may have been used in commercial products where the algorithms were never published.
Knuth covers backwards-reading sorts.[1] The read time to rewind time ratio was more like 2.5 to 1 for common 9-track tapes from the 1960's and 1970's. The runs on each tape alternate between ascending and descending. For the final output tape, the pattern starts with an ascending run; for the rest of the tapes, with a descending run. This would imply that the run count was known in advance (more on this in the other section).
IBM 3400 (ca. 1970) rewind time is 60 seconds or less. https://www-03.ibm.com/ibm/history/exhibits/storage/storage_3420.html For the 125 inch-per-second drive, the minimum full-tape read time is 3.84 minutes (ratio 3.84), and for the 200 ips drive, the minimum full-tape read time is 2.4 min (ratio 3.2). That timing assumes no start/stops, but PPMS cannot spin all the tapes at full speed. For 4 tapes, the write tape will go at full speed, but the 3 input tapes will average 1/3 speed due to the output bottleneck. Glrx (talk) 02:22, 10 April 2018 (UTC)
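For reference, the read times above follow from assuming a standard 2400 ft reel (the reel length is my assumption; the drive speeds and rewind time are from the IBM page linked above):

 \text{tape length} = 2400\,\text{ft} \times 12\,\tfrac{\text{in}}{\text{ft}} = 28{,}800\,\text{in}
 t_{125\,\text{ips}} = \tfrac{28{,}800}{125} = 230.4\,\text{s} \approx 3.84\,\text{min}, \qquad
 t_{200\,\text{ips}} = \tfrac{28{,}800}{200} = 144\,\text{s} = 2.4\,\text{min}

The quoted ratios are each read time divided by that drive's rewind time.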
Typo, that should have been 3.5 to 1, not 2.5 to 1, which matches your numbers. I'm thinking the write tape's rewind time is the pacing rewind time (since it has the most data compared to the input tapes), assuming the rewinds (the output tape and one input tape) are done in parallel.
Update: based on IBM specs, and assuming 1600 bytes per block, tape capacity would be 20 MB at 800 BPI and 35 MB at 1600 BPI. IBM 2314 drives held 29 MB, with CDC SMD drives holding 40 MB and 80 MB in 1974. At my first job (1973), there were 10 of the CDC SMD 40 MB hard drives and 1 tape drive connected to a single HP 2100-series 16-bit mini-computer, which was part of a multi-computer / multi-tasking database system for about 150 pharmacies. The drives were physically larger than the mini-computer. There was a staged power-up sequencer for the 10 hard drives to prevent circuit overload, which was cool to watch (the system was shut down around 2 am each night for backup).
Using a filemark to mark end-of-run is a kiss of death; filemarks take too much tape. It is simpler to just notice the step-down in key value at the end of a run, or to use a sentinel. In-core comparisons are cheap (especially considering the slow I/O rates back then).
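A minimal sketch of the step-down test (my illustration; the key type and the in-memory buffer are assumptions, since on tape this check would be done as records are read):

 // Illustration only: end-of-run detected by a step-down in key value,
 // rather than by a file mark or a small-block separator.
 #include <cstddef>
 #include <cstdint>
 
 // Returns the length of the nondecreasing run starting at keys[0];
 // the run ends where a key is smaller than the one before it.
 size_t runLength(const uint64_t* keys, size_t n) {
     if (n == 0) return 0;
     size_t i = 1;
     while (i < n && keys[i] >= keys[i - 1]) ++i;
     return i;    // keys[0..i-1] form one run
 }

The sentinel variant appends a key guaranteed to be smaller than any real key, so the step-down test also terminates the final run without a separate bounds check.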
Or, in the common case of sorting fixed-size records (typically 80 or 120 bytes) written in blocks of 20 to 64 or more records, a small block could be used to indicate end of run. I seem to recall a maximum block size on the early tape drives, but not the reason (perhaps related to error detection / correction?). Filemarks were still needed since, as mentioned above, there was no sensing of end of data on tape drives in those days. Again, the IBM 360-era hard drives had variable block sizes, so the same method of using a small block as a separator could be used.
I don't like tape partitions even when the tape drive explicitly supports them (e.g., some of the 3M rubber band cartridges). We insisted on using ANSI-labeled 10.5-inch tape because it gave automatic identification and assignment when the tape was mounted. Record blocking is very important; just because one can write an 80-byte record doesn't mean one should. IRGs should be a small percentage of the tape.
Tape partitions still exist on current high-end tape drives like Linear Tape-Open. They needed something to replace the "legacy" method of using long gaps to provide a post-write directory area on the tape. I forgot to mention record blocking in my prior comments about the initial distribution. The directories were used as a type of tape-based file system in some environments, with multiple files on a single tape, and the directory held a list of file names, record counts, record sizes, blocking factor, date, time, ... . Blocking isn't a space issue on modern tape drives, since they group records into large internal fixed-size blocks on the tape, with optional compression, but blocking is still an issue for I/O overhead.
To me, big sorts are external sorts and all about buffer management and I/O speed. Sorting 16 Mrec / 128 MB isn't that interesting (your cache is 8 MB). A friend of mine works for a company that spends hours sorting terabytes of credit card transactions every night. I expect modern sorting algorithms to actively manage the cache with prefetch and discard. I haven't looked at I/O speeds recently. Twenty-five years ago, we had a tough time getting an IBM 7200 RPM disk to deliver a sustained 6 MB/s with direct control of the hardware (i.e., no operating system nonsense). I think a RAID rebuild on a SATA 150 drive would run at 250 GB/hr (70 MB/s). I've been tempted to get a 4 GB/s SSD just to try the read speed. The k-way merge has an output I/O bottleneck.
Seagate specs for the 3.5-inch 7200 rpm Barracuda are a bit over 200 MB/s on the outside cylinders (Western Digital doesn't make it easy to find transfer rate specs). Actual benchmarks show an overall rate from the system side of 50 MB/s to 70 MB/s. It's been about the same for at least the last 7 years; apparently they've reached a practical limit on bit density. Hard drives (Seagate, Western Digital, ...) can be "destroked" so that the inner cylinders are not used, reducing capacity and increasing the average transfer rate. I don't know if this feature can be accessed by consumers or is limited to bulk purchasers (the commands to do this are vendor-unique). Seagate "Nytro" SATA SSDs are over 500 MB/s and 98K IOPS (random-access I/O operations per second). The PCIe (plugs into the motherboard) SSDs have 8 GB/s rates and 975K IOPS.
Glrx (talk) 17:44, 5 April 2018 (UTC)
Rcgldr (talk) 01:28, 6 April 2018 (UTC)
Rcgldr (talk) 01:50, 10 April 2018 (UTC)
  • @Glrx: The article doesn't mention what to do when multiple input files are emptied at the same time, such as 2 of 4 files, or the circumstances that can lead to this. In this case, the output can cycle between the emptied files, similar to an ordinary merge sort, so that when the next input file is emptied, it's back to having data on N-1 files, unless multiple input files are emptied at the same time again. As for the circumstances that lead to this: a poor distribution made up for with dummy runs, or (although unlikely) using an out-of-sequence key as the end-of-run indicator, in which case two (or more) runs could appear to be a single run. I don't know if a detail like this should be included in the article. Rcgldr (talk) 08:15, 19 May 2018 (UTC)
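A rough sketch of the "cycle between the emptied files" idea (my own illustration; the per-file bookkeeping structure is hypothetical and only the run counts matter here):

 // Illustration only: when more than one input file runs dry on the same merge
 // step, rotate the output among the empty files until only one is empty again,
 // then the normal polyphase pattern resumes.
 #include <vector>
 
 struct WorkFile { long runs = 0; };   // hypothetical per-file run count
 
 // Choose the next output file: the next empty file after the last output,
 // so repeated calls cycle through all currently empty files.
 int nextOutput(const std::vector<WorkFile>& files, int lastOutput) {
     const int n = static_cast<int>(files.size());
     for (int step = 1; step <= n; ++step) {
         const int i = (lastOutput + step) % n;
         if (files[i].runs == 0) return i;
     }
     return lastOutput;   // no empty file; should not happen at a phase boundary
 }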

Knuth on tape sorts

I hadn't read far enough into TAOCP volume 3, second edition, to realize it was published in 1998. Partway through the external sorting section, a bit after describing a balanced merge sort using tapes, Knuth notes a transition away from tape drives to disk drives during the 1980's, and that by the late 1990's tape sorting had "become of little relevance to current needs", but he also states that it "reflects some of the best research done in computer science in its early years" and is "too nice to be discarded", which is why he decided to keep the tape-based algorithms in his book, and is why I wanted to preserve some of this stuff in the Wiki article. As for the trend in disk versus tape, LTO-8 12 TB (native; 30 TB claimed with compression) tape drives cost about $5000 (USA), while SATA 4 TB hard drives cost about $100. 2 TB SSDs are around $800 or more, with the PCIe versions costing even more. I'm not sure there's much point to using SSDs for something like a 12 TB (5 drives including 2 for ECC) RAID cluster (the transfer rate is maxed, but there would be some random-access benefit), and a SATA-based 12 TB RAID would cost about $500. Rcgldr (talk) 03:30, 22 May 2018 (UTC)

I think most companies have shifted away from tape drives, and the market has shifted away from high performance tape drives. In the 70s and 80s, a 9-track drive was expensive, but the reel of tape was cheap. The tape could be read in 5 minutes (we had that discussion recently). The problem was it cost money to keep the operator there switching tapes. We were using cheap tape for backup, but there were problems with manpower and the number of tape mounts that were required. We shifted to using an Exabyte 2GB drive and never went back. It took an hour to write one tape, but that didn't matter because we only needed one tape to do an incremental for all our systems. It was so convenient that even weekly fulls were not a big deal. We ended up buying 4 Exabyte drives as our disk space increased. But that was a shift in the market; we cared about the amount of storage, but we didn't care much about read/write speed. It was for backup. Now I run a magnetic RAID and back up to disk; I also have some files in the cloud.
There are still tape advocates, but the sales message has shifted from economical price per byte to archival storage. You can write a tape and supposedly read it 30 years later. Well, maybe, if one has a working tape drive thirty years hence — and a working interface for the drive. If the bits are valuable, they need to be copied to contemporary technology. I copied all my card decks to disk a long time ago, but I still have some programs on punched tape.
I think some WP article suggests using SSD for sorting, but that just seems like a bad idea; SSDs wear out, and sorting applications will cycle the memory hard.
Glrx (talk) 05:13, 22 May 2018 (UTC)
For data centers with a large budget, IBM did make libraries back in the 1970's and 1980's for the old reel-type tapes that could pull tapes from the "library" and "mount" them onto auto-loading tape drives. I don't recall if this could be triggered solely by JCL (job control language) or if the operator had to allow the request, which may have depended on the job site. There are libraries for the current cartridge-type tapes. I haven't kept up with what mainframes use today for "nearline" storage. I seem to recall that some libraries use RAID (hard drive) clusters for nearline storage.