Talk:Random binary tree/GA1

GA Review

Article (edit | visual edit | history) · Article talk (edit | history) · Watch

Nominator: David Eppstein (talk · contribs)

Reviewer: Czarking0 (talk · contribs) 02:28, 28 March 2024 (UTC)[reply]

Hello and thanks for your contribution. At a glance this seems like a cool article. I will attempt to provide a though and well paced review. I am placing this table below too keep track of the criteria but feel free to suggest another format.

Rate	Attribute	Review Comment
1. Well-written:
	1a. the prose is clear, concise, and understandable to an appropriately broad audience; spelling and grammar are correct.	I am not sure if there is a guideline on this but it seems like this section could really be served by having the mathematical notation written as like a derivation. "By linearity of expectation, the expected number of ancestors of x {\displaystyle x} equals the sum, over other keys y {\displaystyle y}, of the probability that y {\displaystyle y} is an ancestor of x {\displaystyle x}. A key y {\displaystyle y} is an ancestor of x {\displaystyle x} exactly when y {\displaystyle y} is the first key to be inserted from the interval [ x , y ] {\displaystyle [x,y]}. Because each key in the interval is equally likely to be first, this happens with probability inverse to the length of the interval." "Its length, for a tree with n {\displaystyle n} nodes, is almost surely " this seems like a very odd place for an "almost surely ". After reading the page on this I understand. In my professional life where I in part work as a statistician just say that it occurs with probability 1 which I find more clear, but I recognize that this language may be the norm here. "Expected number of leaves" Wanted to highlight how well written this section is "(or extended binary trees with n {\displaystyle n} internal nodes and n + 1 {\displaystyle n+1} external nodes) " I would remove this
	1b. it complies with the Manual of Style guidelines for lead sections, layout, words to watch, fiction, and list incorporation.	Does the stuff about treaps really belong in the lead section? Overall it feels like the lead could be better but I recommend we come back to this point after making the other edits. It may be worthwhile to break out critical Galton-Watson trees into a subsection. Or maybe just a paragraph break before this sentence: "Another application of critical Galton–Watson trees (in the version where the root must be internal) arises in"
2. Verifiable with no original research:
	2a. it contains a list of all references (sources of information), presented in accordance with the layout style guideline.	"However, only the weaker upper bound log 3 ⁡ n + o ( log ⁡ n ) {\displaystyle \log _{3}n+o(\log n)} has actually been proven" I am weary of WP pages claiming that something has never been proven since that is usually hard to verify. Do you have a source for that?
	2b. reliable sources are cited inline. All content that could reasonably be challenged, except for plot summaries and that which summarizes cited content elsewhere in the article, must be cited no later than the end of the paragraph (or line if the content is not in prose).	Looks good, read through several of them. Byline checked all.
	2c. it contains no original research.	Looks good
	2d. it contains no copyright violations or plagiarism.	Earwig looks good
3. Broad in its coverage:
	3a. it addresses the main aspects of the topic.	"By choosing the priorities to be independent random real numbers in the unit interval" Is it important that the distribution is uniform? You can tell me to read the sources as an answer. Why does the reader want to know about radix trees? What are they used in?
	3b. it stays focused on the topic without going into unnecessary detail (see summary style).	I was not left feeling like I had questions. I will double check through the sources but I think the author did a good job
	4. Neutral: it represents viewpoints fairly and without editorial bias, giving due weight to each.	No issues
	5. Stable: it does not change significantly from day to day because of an ongoing edit war or content dispute.	No issues
6. Illustrated, if possible, by media such as images, video, or audio:
	6a. media are tagged with their copyright statuses, and valid non-free use rationales are provided for non-free content.	The author has taken the time to generate these great images and dedicate them to the public domain!
	6b. media are relevant to the topic, and have suitable captions.	Super great
	7. Overall assessment.	Article looks good I left a couple notes to the author on what might make it better but I do not think they make a difference for GA. I appreciate this deeply knowledgeable author writing a very clear article on what is an important subject. I especially appreciate the images.

Some replies:

Re 1a: "Derivation": you mean a formal proof? Really? Or am I misunderstanding this point? "Almost surely": should have been with high probability (that is, probability tending to 1 in the limit of large $n$ rather than probability 1 even for fixed $n$ ; fixed. Re "(or extended...)": removed.

After reading it again I changed my mind and I think it is fine the way it is. Done Czarking0 (talk) 01:51, 29 March 2024 (UTC)[reply]

Re 1b: There is only one sentence about treaps in the lead. The paragraph it is in is mostly not about treaps. It is intended as a brief summary of the long "from random permutations" section, most of which is motivated by the average-case analysis of insertion-only binary search trees without any rebalancing. Treaps are a trick to get that same analysis to work for worst-case inputs with both insertions and deletions; I think they are worth mentioning in the lead, but not for more than a sentence. Re critical Galton-Watson trees: it's difficult to split out the p=1/2 cases from the p<1/2 and p>1/2 cases in the analysis, but I added a paragraph break before the algorithmic application as suggested, and added subsection headers to the whole section.

Done Czarking0 (talk) 01:51, 29 March 2024 (UTC)[reply]

Re 2a and "only the weaker upper bound": removed "only" and "actually" to avoid the implication that this is the last word on the subject (although I think it is).

I like this change and I think it is fine the way it is. The problem I am trying to avoid is when an article claims something that was true at the time of writing but a new proof is found later then years go by with no change. That is probably outside the context of a GAN.

Re 3a: "Is it important that the distribution is uniform?" (for treap priorities) No. It is important that they be independent and (likely) distinct, but uniformity doesn't matter. That's why I omitted it from this part. Some of the later works cited in the last paragraph of this section address the issue that getting computer representations of uniform reals to enough precision to make sure they're all distinct needs a lot of bits of randomness per number (and then all these bits need to be stored in the treap), leading to unnecessary inefficiencies. Instead they use more complicated schemes to save bits, but then they don't get trees whose distribution perfectly matches the uniform insertion order distribution.

Just to be clear, and because I find this surprising, I would still be correct if I replaced random with exponentially distributed or beta distributed (to maintain the unit interval) such that it read: "By choosing the priorities to be independent exponentially distributed real numbers in the unit interval, and by maintaining the Cartesian tree structure using tree rotations after any insertion or deletion of a node, it is possible to maintain a data structure that behaves like a random binary search tree." I wanted to make sure I am on the same page with you in that it is not just that it does not need to be uniform or even approximately uniform. Czarking0 (talk) 01:51, 29 March 2024 (UTC)[reply]

Ok after reading more I get it. Since "behaves like" is talking about the structure of the tree and not making claims about the priorities themselves it does not really matter how the priorities are assigned. The treap article covers this well. I think we are

Done here. Czarking0 (talk) 03:13, 29 March 2024 (UTC)[reply]

Re: "Why does the reader want to know about radix trees?": If the reader cares about probability distributions over random trees, the radix tree gives you a predetermined number of internal nodes, unlike the trie. If the reader cares about data structures, they need to know about both tries and radix trees as both are standard ideas in this area. If the reader cares about algorithms, both tries and radix trees model the behavior of certain binary-representation-based divide-and-conquer sorting algorithms on random inputs, but different algorithms. The trie models a sorting algorithm that divide on the bits of the binary numbers given, in strict succession from most significant to least significant. The radix tree models a sorting algorithm that finds and then divides on the next bit that separates some inputs. My personal interests are in data structures and algorithms, but actually my own motivation for including the section on tries and radix trees in this article was less about algorithms and more from the point of view of probability distributions over random trees: they give a quite different distribution than the other ones described (more strongly balanced, especially in the radix tree case) and I thought that made an interesting contrast to the other distributions.

Great reply, I think the article would be served a little bit by saying some of this. There is already radix tree but an uninformed reader may not know that they want to click on that page just from reading this. Something like, "Radix trees provide a baseline for comparing data structures. In divide-and-conquer algorithms, radix trees model the behavior on random inputs" I am not sure if those statements are 100% accurate I am just compressing your reply. Citations will likely be needed unless your sources, which I will review shortly, have some info on that. Czarking0 (talk) 01:51, 29 March 2024 (UTC)[reply]

Yes, I suspect citations motivating this as a subtopic of random trees more generally (rather than just talking about the data structures and their analysis) are likely to be sparse. —David Eppstein (talk) 06:04, 29 March 2024 (UTC)[reply]