Talk:Good–Turing frequency estimation

This needs a description of the actual technique used by Good and Turing, which is notable by its absence from this article. -- The Anome 00:25, 27 October 2006 (UTC)

You are right, and this has now been (partly) rectified. Encyclops 01:17, 11 February 2007 (UTC)

Good Turing Smoothing

Jurafsky and Martin's authoritative work "Speech and Language Processing" (Chapter 6, section 6.3) mentions a Good-Turing Smoothing algorithm which seems to have some relevance to this article but, as far as I can tell, doesn't entirely fit the formulas described here. Does anyone know how or if it is related, and whether it should be incorporated here or in a separate article? Gijs Kruitbosch 00:41, 11 May 2007 (UTC)

Derivation of the estimates?

According to the article:

The first step in the calculation is to find an estimate of the total probability of unseen objects. This estimate is $p_{0}=N_{1}/N$

The next step is to find an estimate of probability for objects which were seen r times, this estimate is $p_{r}={\frac {(r+1)S(N_{r+1})}{NS(N_{r})}}$

Where do these estimates come from? What prior probabilities are being assumed? -- Jheald 10:13, 4 March 2007 (UTC)
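As a concrete reading of the two formulas quoted above, here is a minimal Python sketch. It assumes that $S$ is the identity (no smoothing of the $N_{r}$ counts), which is a simplification for illustration only and not the smoothing the article describes. The function name `good_turing_unsmoothed` and the sample data are made up for this example.

```python
from collections import Counter
from fractions import Fraction


def good_turing_unsmoothed(observations):
    """Return (p0, p) for the quoted formulas, taking S as the identity.

    p0 is the estimated total probability of all unseen objects, N_1 / N.
    p[r] is the estimated probability of one object that was seen r times,
    (r + 1) * N_{r+1} / (N * N_r).
    ASSUMPTION: S(N_r) = N_r, i.e. no smoothing is applied.
    """
    counts = Counter(observations)           # object -> times seen
    n_total = sum(counts.values())           # N, the sample size
    if n_total == 0:
        raise ValueError("need at least one observation")
    freq_of_freq = Counter(counts.values())  # r -> N_r
    p0 = Fraction(freq_of_freq[1], n_total)
    p = {
        # A Counter returns 0 for a missing r + 1 without inserting it.
        r: Fraction((r + 1) * freq_of_freq[r + 1], n_total * n_r)
        for r, n_r in freq_of_freq.items()
    }
    return p0, p


# Hypothetical sample: "a" seen 3 times, "b" twice, "c" and "d" once each.
p0, p = good_turing_unsmoothed(list("aaabbcd"))
```

With the identity in place of $S$, the most frequent count always gets probability zero because $N_{r+1}=0$ there, which is one reason some smoothing of the $N_{r}$ is needed in practice.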

Do we define $p_{r}$ in the article? Depending on the definition, this formula is either correct or wrong. -- Preceding unsigned comment added by Srchvrs (talk • contribs) 03:41, 11 February 2013 (UTC)

It is not a Bayesian scheme that starts with prior probabilities. Why it works is somewhat of a mystery and perhaps still an open research problem (although Orlitsky, in the first cited reference [1], claims to have figured it out; I haven't read his work).

Using the number of species that have been seen only once, $N_{1}$, as an estimator of the number of species that have not been seen yet seems reasonable, but I don't know a formal justification. If there are a lot of species that we have observed only once, then we must still be in the early stages of species discovery, so there are probably a lot of species out there that we have not seen yet.

The second formula is even less intuitive. It has to do with 'adjusting' the observed counts in a certain manner. Perhaps it would help to add some discussion, but I don't feel qualified to do that. Encyclops 16:34, 4 March 2007 (UTC)

The formulas and procedure don't precisely agree with my understanding of Turing's original work, which was based on a general theory of distributions of distributions of ... which so far as I know was never published and may still be classified.

I agree that it is not a Bayesian scheme as such.

A reasonable intuitive estimate for the likelihood of any single specified species which has not yet been seen (e.g. a purple ball) would seem to be $1/(2N)$, reasoning that in the absence of any other information, the current set of observations is on the border between having seen one of that species and having seen none of that species. Indeed, this assumption has been heavily used by cryptanalysts when they know the full set of species in advance. (The interesting thing about Good–Turing is that the full set of species is not known in advance.) So the formula $p_{0}=N_{1}/N$ amounts to saying that the expected number of unseen species is $2N_{1}$, which is interesting in its own right (and still needs justification, which I'm not prepared to provide).

Somewhere at home I have a copy of the original Biometrika paper, which should be cited in the article as the initial publication concerning this topic. -- DAGwyn (talk) 21:41, 14 February 2008 (UTC)
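To spell out the arithmetic behind the $2N_{1}$ remark above: if each individual unseen species is given probability $1/(2N)$ and the unseen species together are given $p_{0}=N_{1}/N$, then dividing the total by the per-species value gives the implied number of unseen species. The symbol $U$ for that number is introduced here only for illustration.

```latex
U \cdot \frac{1}{2N} = p_{0} = \frac{N_{1}}{N}
\quad\Longrightarrow\quad
U = \frac{N_{1}}{N} \cdot 2N = 2N_{1}
```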

In case it helps you find it, here is a reference that might be what you are looking for: I. J. Good: The population frequencies of species and the estimation of population parameters. Biometrika, 40:237--264, Dec 1953. Encyclops (talk) 23:12, 14 February 2008 (UTC)

The formal similarity between the formula being asked about here and that of Robbins' empirical Bayes methods is obvious. Perhaps that can answer the question. I'll look at this more closely and then opine further. Michael Hardy 22:56, 4 May 2007 (UTC)

Example plot request

The following is copied from the Stats project talk page. Melcombe (talk) 14:23, 2 December 2010 (UTC)

"Instead we plot [...]"

Shouldn't there be a plot? Please add some illustrations or adjust the text. Thank you! --Peni (talk) 14:01, 1 December 2010 (UTC)

P.S. Source: Good-Turing smoothing without tears, William A. Gale, Journal of Quantitative Linguistics, 1995. --Peni (talk) 15:39, 1 December 2010 (UTC)

Be bold. --Qwfp (talk) 17:27, 1 December 2010 (UTC)

(end copy)

The Novel by Robert Harris

Has anyone read the novel by Robert Harris? Is it relevant to the topic of this article or should the reference be removed? The comment "The book, though fiction, is criticised by people who were at Bletchley Park as bearing little resemblance to the real wartime Bletchley Park" makes me doubt that the author has any valid technical or historical points to offer on the subject at hand. Encyclops (talk) 16:54, 30 March 2011 (UTC)

N_r is still zero, so Z_r is zero... I think this is a mistake

$N_{r}$ is still zero, so $Z_{r}$ is zero... I think this is a mistake. Maybe $Z_{r}$ is the average of $Z_{q}$ and $Z_{t}$? -- Preceding unsigned comment added by 84.110.184.55 (talk) 09:16, 9 April 2014 (UTC)
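For reference, a minimal sketch of the averaging transform, assuming the definition in Gale and Sampson's paper: $Z_{r} = N_{r} / (0.5\,(t - q))$, where $q$ and $t$ are the nearest lower and higher counts with nonzero $N$, with $q = 0$ for the smallest $r$ and $t = 2r - q$ for the largest. Under that reading, $Z_{r}$ is only ever computed for $r$ with $N_{r} > 0$, so the zero case raised above would not arise. The function name `averaging_transform` and the example counts are made up for illustration.

```python
def averaging_transform(freq_of_freq):
    """Map {r: N_r} to {r: Z_r}, only for the r values with N_r > 0.

    ASSUMPTION: Z_r = N_r / (0.5 * (t - q)), with q and t the neighbouring
    r values that have nonzero N_r; q = 0 for the first r, and
    t = 2r - q for the last r.
    """
    rs = sorted(r for r, n_r in freq_of_freq.items() if n_r > 0)
    z = {}
    for i, r in enumerate(rs):
        q = rs[i - 1] if i > 0 else 0                    # previous nonzero r
        t = rs[i + 1] if i + 1 < len(rs) else 2 * r - q  # next nonzero r
        z[r] = freq_of_freq[r] / (0.5 * (t - q))
    return z


# Hypothetical frequency-of-frequency table with gaps at r = 4 and r = 6..9.
example = {1: 120, 2: 40, 3: 24, 5: 10, 10: 1}
z = averaging_transform(example)
```

In this sketch a gap such as $r = 4$ never receives a $Z_{r}$ of its own; it only widens the interval $t - q$ used for its neighbours.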

source for Sampson Gale

There's a tag calling for a non-primary source for the Sampson/Gale work - how about http://www.cs.dartmouth.edu/~lorenzo/teaching/cs134/Archive/Spring2010/milestone/20100511-134-milestone-cooley/node5.html 109.158.8.39 (talk) 18:23, 24 June 2014 (UTC)

Distribution of population frequencies

The current version of the article states: "The assumption of Good–Turing estimation is that the number of occurrence for each species follows a binomial distribution." and cites [1].

However, I cannot find any support for this statement in the source given.

On the contrary, Good himself states in his original publication of the estimator:

"The methods of the first six sections of the present paper are largely independent of the distributions of population frequencies." [2]

References

[1] "The Good–Turing estimate" (PDF). Computer Science (course guide). CS 6740. Ithaca, NY: Cornell University. 2010.
[2] Good, Irving J. "The population frequencies of species and the estimation of population parameters." Biometrika 40.3-4 (1953): 237-264.