Talk:Reservoir sampling

Algorithm Description

I've improved the basic algorithm description by rewriting it using an ALGOL variant (for all its syntactic oddness, Algol is best for presenting algorithms). It probably needs more cleanup, but at least it is now clear why it is definitely O(n)! --Donal Fellows (talk) 06:54, 11 April 2010 (UTC)[reply]

"Algorithm R"[edit]

This is not Vitter's name. It is the name given by Donald Knuth in Volume 2 of The Art of Computer Programming. In the 1969 original edition it is cited to a 1962 paper; in the 1981 second edition it is credited to Alan G. Waterman without a citation. Astarabadi (talk) 11:58, 7 April 2011 (UTC)[reply]

Missing information

I think that, according to Vitter's paper, this algorithm should include a bound such that "t > thresh, where thresh is equal to T x n" (a direct quote from the paper). Such a restriction must be in place for a sufficiently large data set, since the frequency of picking up new items decreases toward 0.

In my opinion the article should discuss the techniques and analysis presented in Vitter's paper in more detail. --Ltsampros2 (talk) 09:38, 4 July 2012 (UTC)[reply]

Algorithm Z

In doi:10.1145/3147.3165, Vitter presents Algorithm Z, which outperforms R when skipping records is possible, and is optimal under this condition. Paradoctor (talk) 14:06, 20 July 2014 (UTC)[reply]

Algorithm A-Chao

I think there is a slight error in this algorithm. This line

WSum = WSum + S[i].Weight

Should be placed before the probability is calculated. Consider the stream 1, 2, 4, 8, etc. For item N, the sum of all the earlier items will be WSum = 2^N - 1, whereas S[N].Weight = 2^N, so p = 2^N/(2^N - 1) > 1, which means that item will always be selected even though that clearly should not be the case. — Preceding unsigned comment added by TowerOfBricks (talkcontribs) 19:27, 2 February 2016 (UTC)[reply]
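
For illustration, here is a minimal Python sketch with the running weight sum updated before the acceptance probability is computed, which keeps p <= 1 even for rapidly growing weights such as the stream above. The function and variable names are my own, not from the article, and this sketch does not address the separate concern about the first k items raised below.

    import random

    def weighted_reservoir(stream, k):
        # stream yields (item, weight) pairs; keeps a reservoir of size k
        reservoir = []
        wsum = 0.0
        for i, (item, weight) in enumerate(stream):
            wsum += weight              # update the running sum before computing p
            if i < k:
                reservoir.append(item)  # fill the reservoir with the first k items
            else:
                p = weight / wsum       # now p <= 1 even if weights grow like 2^N
                if random.random() < p:
                    reservoir[random.randrange(k)] = item
        return reservoir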

This issue seems to have been addressed in diff. I wasn't sure whether I should just remove this paragraph or leave it in place with a notice. — Merlight (talk) 14:36, 18 June 2018 (UTC)[reply]

However, I suspect there's a bigger issue with this section -- the pseudo-code doesn't implement the referenced algorithm. The first loop fills the reservoir with the initial k items of the population. Before considering item k+1 in the second loop, inclusion probabilities of all items in the reservoir are equal to 1. Thus the argument "The trick here is that, if the probabilities of all items in the reservoir are already proportional to their weights, ..." holds only when the first k items of the population have equal weights, which cannot be guaranteed. The pseudo-code apparently skips section "2.1 Normalizing the first-order probabilities" of Chao's paper, and doesn't handle cases when the probabilities of items' presence in the reservoir are not proportional to their weights. — Merlight (talk) 14:36, 18 June 2018 (UTC)[reply]

I agree this is a big issue with the algorithm. I added a cautionary note, but maybe the algorithm should be completely removed, since it's barely useful in its current state.
There are other "strictly proportional" algorithms that don't require any normalization, such as the Cyclic method described in Rao's paper. Maybe Chao could be replaced by that. Thomasda (talk) 05:26, 20 July 2022 (UTC)[reply]

Sample Size 10 Example

The sample size 1 example was too much of a special case and did little to elucidate the algorithm in general.

Almost any other sample size would be an improvement, so I modified the example to work for sample size 10.

Algorithm A-Res: random(0,1)

I think there is a slight error in the pseudocode of Algorithm A-Res.

r = Random(0,1) ^ (1/S.Weight)  // important: inclusive range

According to the authors, the random(0,1) should produce a uniform value in (0,1), that is, excluding 0 and 1. The intuition is probably that drawing either endpoint would render the item weight irrelevant. Gstamatelat (talk) 12:22, 21 November 2017 (UTC)[reply]

You are correct, this is explicitly stated in e.g. Efraimidis (2015)[1]: "In A-ES, each item of the population V independently generates a uniform random number u_i in (0,1) and calculates a key k_i = u_i^(1/w_i)." The interval (a,b) excludes both endpoints. -- Lorenzh (talk) 10:29, 5 November 2019 (UTC)[reply]
I was referring to the comment. Gstamatelat (talk) 14:04, 28 November 2019 (UTC)[reply]
I know, I was agreeing with you that the comment in the pseudocode was wrong. It should be fixed now. -- Lorenzh (talk) 11:48, 4 December 2019 (UTC)[reply]
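
For reference, here is a minimal Python sketch of A-Res under the reading above: u is drawn from the open interval (0,1), and, as also noted in the "Serious issues" section below, the k items with the largest keys are kept. The helper structure and names are mine, not from the article or the paper.

    import heapq
    import random

    def a_res(stream, k):
        # stream yields (item, weight) pairs with weight > 0
        heap = []  # min-heap of (key, item); the root holds the smallest retained key
        for item, weight in stream:
            u = random.random()
            while u == 0.0:            # random.random() is [0, 1); redraw to stay in (0, 1)
                u = random.random()
            key = u ** (1.0 / weight)
            if len(heap) < k:
                heapq.heappush(heap, (key, item))
            elif key > heap[0][0]:     # beats the smallest retained key
                heapq.heapreplace(heap, (key, item))
        return [item for _, item in heap]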

References

  1. ^ Efraimidis, Pavlos S. (2015). "Weighted Random Sampling over Data Streams" (PDF). Algorithms, Probability, Networks, and Games. Lecture Notes in Computer Science. 9295: 183–195. doi:10.1007/978-3-319-24024-4_12.

Proof of uniformity of 'Algorithm R'

At least one person has found the proof that this sampling method is indeed uniform unclear (and I agree). That person asked a question which I answered here: https://cs.stackexchange.com/questions/87631/reservoir-sampling-algorithm-probability/87633#87633 I think adding this myself would be considered 'Original research', so I won't do that, but I think the answer is pretty good and could, after a rewrite (the style there isn't encyclopedic), replace the current proof. ShearedLizard (talk) 10:30, 23 February 2018 (UTC)[reply]
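
For what it's worth, the core of the linked argument is a short induction, sketched here in plain terms (not proposed article text). Claim: after processing i >= k items, each item seen so far is in the reservoir with probability k/i. The first k items are all kept, so the base case holds. For the inductive step, item i+1 enters the reservoir with probability k/(i+1), and an item already in the reservoir survives that round with probability

    1 - (k/(i+1)) * (1/k) = i/(i+1),

so its overall probability becomes (k/i) * (i/(i+1)) = k/(i+1), as required.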

Serious issues with this article

There are some serious issues with this article, including factual mistakes.

  • The problem definition is badly stated, to the point of being misleading
  • "Algorithm R" is very slow and should not be used. Vitter's paper, which is cited as the source, says as much[1] and contains a much faster (in fact, optimal) algorithm. However, a much simpler, also optimal algorithm was given by Li[2]
  • This directly implies that the section "Fast Approximation" should be deleted. I don't know why it exists in the first place if simple and exact algorithms have been known since the mid-80's (Vitter) / 90's (Li). Maybe the author of the article cited therein should also be notified :)
  • Algorithm A-Res is stated incorrectly: it requires selecting the items with the largest keys, not those with the smallest.
  • The "example implementation" is a second statement of Algorithm R in python and doesn't add any value beyond the pseudocode already given.

I'm working on a comprehensive overhaul of this article; however, I'm rather new to encyclopedic writing (my background is in CS research and I have published on the topic of random sampling, including reservoir sampling, but I do not intend to cite my own papers). -- Lorenzh (talk) 10:22, 5 November 2019 (UTC)[reply]

I finished with this faster than expected and have just committed my revised version of the article. Please let me know if there are any issues. -- Lorenzh (talk) 15:03, 5 November 2019 (UTC)[reply]
Thank you for your contributions. I gave it a quick glance and did not spot any issues. I did not check what you removed, so you might have removed stuff other editors may consider relevant.
"I'm rather new to encyclopedic writing": consider checking out WP:MOS then, it might help. It's a good reference when you are wondering what is the right way of writing something. "(...) but I do not intend to cite my own papers", thank you very much for this. This could very easily violate WP:SELFCITE. However, if you think something you wrote is the best citation for something written here, you should cite it anyway. BernardoSulzbach (talk) 15:57, 5 November 2019 (UTC)[reply]
Thanks! I've removed the approximation section (which is superseded by the optimal algorithm of Li (1994)) and the distributed section, which had a naive algorithm that doesn't meet the goal of continuously maintaining a sample and makes some unrealistic assumptions (exact splitting of the input). A revised distributed section might be a good addition, however, there are multiple different machine models (distributed streaming model of Cormode et al, mini-batch model which is more realistic but is used by fewer publications), the optimization criterion is typically different (number of messages, with time and space being secondary) and describing all that would probably be beyond the scope of this article. I guess I could add a few papers as Further Reading. Lorenzh (talk) 16:32, 5 November 2019 (UTC)[reply]
Oh, OK. Briefly describing what was done and citing the publication works well, as quite often a description of what was achieved and of what is known is all the reader needs. You can also cite good literature surveys, which can reduce the number of cites you have to do yourself. BernardoSulzbach (talk) 16:58, 5 November 2019 (UTC)[reply]

References

  1. ^ Vitter, Jeffrey S. (1 March 1985). "Random sampling with a reservoir" (PDF). ACM Transactions on Mathematical Software. 11 (1): 37–57. CiteSeerX 10.1.1.138.784. doi:10.1145/3147.3165.
  2. ^ Li, Kim-Hung (4 December 1994). "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))". ACM Transactions on Mathematical Software. 20 (4): 481–493. doi:10.1145/198429.198435.

Algorithm L ("An Optimal Algorithm")

The text says

> This algorithm computes three random numbers for each item that becomes part of the reservoir, and does not spend any time on items that do not.

I do not think that is correct. If you imagine a very small k (i.e. 1 or 2) and a very large n, and consider that the "skip" amount only depends on k, it will do more skips (and more random() calls) for larger n. The big-O complexity also uses n in its calculation, suggesting that it _does_ depend on n, not just k, and _does_ spend time on items that don't end up in the reservoir - i.e. items that were temporarily in the reservoir, but were later replaced.

Is the algorithm wrong, or the text? (And if the algorithm is wrong, could we not calculate the skip to be still geometrical, but also relative to n, ...hand waves... and get an actual "only do work on selected items" algorithm?) — Preceding unsigned comment added by 216.94.43.6 (talk) 21:30, 27 July 2020 (UTC)[reply]
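
For concreteness, here is a minimal Python sketch of Algorithm L in the spirit of the article's pseudocode (geometric jumps); the iterator handling, the redraw helper, and the names are my own. It may help to see that skipped items are only read, while random numbers are drawn only when an item actually enters the reservoir.

    import math
    import random

    def _rand_open():
        # uniform draw in the open interval (0, 1); random.random() is [0, 1)
        u = random.random()
        while u == 0.0:
            u = random.random()
        return u

    def algorithm_l(stream, k):
        stream = iter(stream)
        reservoir = []
        try:
            for _ in range(k):
                reservoir.append(next(stream))      # fill with the first k items
        except StopIteration:
            return reservoir
        w = math.exp(math.log(_rand_open()) / k)
        while True:
            skip = math.floor(math.log(_rand_open()) / math.log(1.0 - w))
            try:
                for _ in range(skip):
                    next(stream)                    # skipped items: read, but no random draws
                item = next(stream)
            except StopIteration:
                return reservoir
            reservoir[random.randrange(k)] = item   # this item (temporarily) enters the reservoir
            w *= math.exp(math.log(_rand_open()) / k)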

That is unavoidable (and in fact necessary) by the definition of reservoir sampling. Of course the algorithm has to spend time on any item that enters the reservoir at some point in time, even if it is later replaced. How else would it present a valid sample in the meantime (i.e., before the item is replaced)? Recall that the task is to provide a sample without replacement of all items seen so far, at any time, and not just at the end. In fact you can prove that the expected number of items that enter the reservoir is something like O(k * (1 + log(n/k))). That's why the running time of the algorithm is optimal. If you just want a sample without replacement of all items, with all items presented at once, then you don't need reservoir sampling, you need an algorithm for sampling without replacement. If you want an efficient algorithm for that, have a look at Section 3 of https://arxiv.org/pdf/1610.05141.pdf (disclosure: I'm one of the authors of that paper). tl;dr neither is wrong. --Lorenzh (talk) 12:20, 4 August 2020 (UTC)[reply]
Ah, so the wording just threw me off then. "becomes part of the reservoir" could be "temporarily becomes part ..." or something like that. (As for sampling without replacement (i.e. given a known population size), I _think_ you should be able to "similarly" calculate a random skip distance (of the correct probability) for each selected item, and only do `k` steps, but the math gets hairy. Might be an interesting direction to research...) — Preceding unsigned comment added by 216.94.43.6 (talk) 21:24, 14 September 2020 (UTC)[reply]
When you have the entire input to start with, the challenges are completely different from those in reservoir sampling. But it is possible to achieve (expected) linear time. I described a parallel algorithm that accomplishes this with expected linear work in Section 6 of https://arxiv.org/abs/1903.00227 but it's quite complex. For uniform (unweighted) sampling without replacement you can use random variates from a hypergeometric distribution to split the work recursively (obtaining a divide-and-conquer algorithm), but the same doesn't work for weighted sampling because a set of items can't simply be modelled by their number and/or a simple sum of weights any more. --Lorenzh (talk) 15:55, 12 November 2020 (UTC)[reply]
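
To make the hypergeometric splitting idea above concrete, here is a small Python/NumPy sketch for the uniform (unweighted) case. This is only the basic recursion, not the parallel algorithm from the linked paper, and the names are mine.

    import numpy as np

    rng = np.random.default_rng()

    def sample_indices(lo, hi, k):
        # k indices drawn uniformly without replacement from range(lo, hi)
        n = hi - lo
        if k == 0:
            return []
        if k == n:
            return list(range(lo, hi))
        mid = lo + n // 2
        # how many of the k samples land in the left half [lo, mid)?
        k_left = rng.hypergeometric(mid - lo, hi - mid, k)
        return sample_indices(lo, mid, k_left) + sample_indices(mid, hi, k - k_left)

The recursion only descends into ranges that still need samples.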