Talk:Entropy estimation


Article does not reflect the literature

I believe the articles referenced here were added by their own authors, which I guess means this article should have a banner of some sort; I just don't know which one. I don't know enough about the subject to clean this page up, but it seems to me that important authors in this field are Joe (1989) and Hall (1987). And no, my last name is neither Joe nor Hall: I was in primary school in the eighties.

Regards, Jelmer Wind.

References:
Hall, P. (1987). On Kullback-Leibler loss and density estimation, Ann. Statist., 15, 1491-1519.
Joe, H. (1989). Estimation of entropy and other functionals of a multivariate density, Ann. Inst. Statist. Math., 41, 683-697.


Dear Jelmer - thanks for your comment. It's not considered polite to assume "bad faith" (see Wikipedia:Assume good faith). All of the references on this article were added by me, and I have had no involvement in writing any of them. If you feel able to improve the article yourself, please do --mcld (talk) 09:54, 20 August 2009 (UTC)

an analytic solution?

Maybe I'm reading this wrong, but it sounds like this paper gives an analytic solution:

http://dukespace.lib.duke.edu/dspace/bitstream/handle/10161/2458/D_Paisley_John_a_201005.pdf?sequence=1

(pages 26-32)

<math>\operatorname{E}[H \mid A] = \psi(\alpha_0 + 1) - \sum_i \frac{\alpha_i}{\alpha_0}\,\psi(\alpha_i + 1), \qquad \alpha_0 = \sum_i \alpha_i,</math>

where <math>\psi</math> is the digamma function and <math>\alpha_i</math> is the number of observations of outcome <math>i</math> plus the Dirichlet prior (usually 1).

Kevin Baas (talk) 20:32, 14 June 2015 (UTC)

I wrote a little computer program to test the formula, and the values look right (a minimal sketch of such a program is included after the values below). Here's what it gives for a binomial distribution with various observations (listed in the braces):

{0,0}: 0.721347520444483

{0,1}: 0.7213475204444812

{0,2}: 0.6612352270741079
{1,1}: 0.8415721071852273

{0,3}: 0.6011229337037343
{1,2}: 0.841572107185227

{0,4}: 0.5490256127827456
{1,3}: 0.8095122173876965
{2,2}: 0.8896619418815277

{200,200}: 0.998207835026435
{0,200}: 0.04201675094320411
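
Here is a minimal Python sketch of the formula as I'm reading it (an illustration, not code from the thesis), assuming a symmetric Dirichlet prior of 1 and entropy measured in bits; it reproduces the values above:

<syntaxhighlight lang="python">
from math import log
from scipy.special import digamma

def expected_entropy_bits(counts, prior=1.0):
    """Expected entropy (in bits) of the unknown categorical distribution,
    given observed counts and a symmetric Dirichlet prior."""
    alphas = [c + prior for c in counts]   # posterior Dirichlet parameters
    total = sum(alphas)
    nats = digamma(total + 1) - sum(a / total * digamma(a + 1) for a in alphas)
    return nats / log(2)                   # convert nats to bits

print(expected_entropy_bits([0, 0]))      # 0.7213475...
print(expected_entropy_bits([1, 1]))      # 0.8415721...
print(expected_entropy_bits([200, 200]))  # 0.9982078...
</syntaxhighlight>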

Kevin Baas (talk) 16:16, 15 June 2015 (UTC)

I think you are reading the thesis correctly--there is an analytic solution for the given class of Dirichlet processes. There are other processes, like Gaussian processes, that allow for analytic forms for entropy, too. But this article is about estimating entropy from empirical data, where the assumption of a particular distribution or process may not be valid. --Mark viking (talk) 18:08, 15 June 2015 (UTC)
I believe the thesis is also about estimating entropy from empirical data, where the assumption of a particular distribution or process may not be valid. All estimates of entropy necessarily require one to assume a particular process/distribution, or at least prior probabilities thereof. That is, they all require a Bayesian prior (see Prior probability), and possibly even priors on the model (as in whether it is an exponential distribution, normal distribution, etc.). The thesis in question is of course no exception to this rule (there cannot be any exceptions - that's why it's a rule). Note it does not assume any particular parameters to the distribution. If the parameters of the posterior distribution were known, and it were known to be a categorical distribution, then the entropy would just be

<math>H = -\sum_i p_i \log p_i,</math>
but that is not what this solution is. This solution, like the other solutions in this article, is for when the <math>p_i</math>'s, i.e. the "distribution", are unknown, and all we have is a small finite sample of empirical observations, A.
The thesis shows that the expectation of H, given the empirical data A, under the possibly invalid assumption that the prior on the distribution is Dirichlet (though presumably it's known that it has discrete support), is:

<math>\operatorname{E}[H \mid A] = \psi(\alpha_0 + 1) - \sum_i \frac{\alpha_i}{\alpha_0}\,\psi(\alpha_i + 1), \qquad \alpha_0 = \sum_i \alpha_i.</math>
Note this is an expectation, not a certain result. More generally, the probability that H = h is given by the far more ominous integral:

<math>p(H = h \mid A) = \int_{\Delta} \delta\!\left(h - H(\theta)\right)\, p(\theta \mid A)\, d\theta,</math>

where the integral runs over the probability simplex <math>\Delta</math> and <math>p(\theta \mid A)</math> is the Dirichlet posterior,
which to my knowledge remains unsolved.
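
One way to get a feel for that full distribution, rather than just its expectation, is Monte Carlo: draw distributions from the Dirichlet posterior and histogram their entropies. A minimal sketch (an illustration, not code from the thesis), assuming the same symmetric prior of 1 and entropy in bits:

<syntaxhighlight lang="python">
import numpy as np

def entropy_bits(p):
    """Shannon entropy (in bits) of a probability vector p."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def posterior_entropy_samples(counts, prior=1.0, n_samples=100_000, seed=0):
    """Draw theta from the Dirichlet posterior Dir(counts + prior) and
    return the sampled entropies H(theta)."""
    rng = np.random.default_rng(seed)
    alphas = np.asarray(counts, dtype=float) + prior
    thetas = rng.dirichlet(alphas, size=n_samples)
    return np.array([entropy_bits(t) for t in thetas])

samples = posterior_entropy_samples([1, 1])
print(samples.mean())                      # ~0.8416, matching the closed-form expectation
print(np.quantile(samples, [0.05, 0.95]))  # wide spread: H itself is far from certain
</syntaxhighlight>

The sample mean matches the closed-form expectation above; the quantiles show how uncertain H itself still is.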
Kevin Baas (talk) 18:26, 15 June 2015 (UTC)
Also note that this is not to be confused with the differential entropy of the Dirichlet distribution itself, which is:

<math>H[\operatorname{Dir}(\alpha)] = \log B(\alpha) + (\alpha_0 - K)\,\psi(\alpha_0) - \sum_{i=1}^{K} (\alpha_i - 1)\,\psi(\alpha_i),</math>

where <math>B(\alpha)</math> is the multivariate beta function, <math>K</math> is the number of categories, and <math>\alpha_0 = \sum_i \alpha_i</math>,
which is neither the same equation nor an expectation.
Kevin Baas (talk) 20:01, 15 June 2015 (UTC)
Please don't shout. Entropy estimation does not require a Bayesian formulation. Here is a review paper of nonparametric entropy estimation methods, with most of them frequentist in formulation. Has the thesis above been published in peer-reviewed form? If not, it doesn't really qualify as a reliable source by WP standards. --Mark viking (talk) 20:37, 15 June 2015 (UTC)
Didn't mean to shout, just emphasizing. I understand that not all entropy estimation methods start with a Bayesian formulation. Nonetheless, they are all necessarily Bayesian, since all probabilities are necessarily conditional probabilities. This holds true for models that use a frequentist interpretation; it holds true regardless of the interpretation. A Bayesian formulation just acknowledges it explicitly.
I don't know if it's been published in peer-reviewed form. That's one of the things I was hoping to clarify - whether it meets WP:RS. (I can't vouch for steps 1.39->1.40 or 1.40->1.41, but all the other steps look right, and running it on some sample values looks right.) I'm hoping it can get into the article, because I know I spent a long time searching for something like this, and I can imagine other people working in AI/machine learning would be very interested in an analytic solution to this problem. I wanted to get others' opinions on whether it's right first, though. (Peer reviewed does not equal correct, nor applicable.) It being a dissertation, presumably it was at least reviewed by his professors, and presumably he graduated... maybe it's a question for WP:RS? Kevin Baas (talk) 20:51, 15 June 2015 (UTC)
I understand the Bayesian orthodoxy and I understand the frequentist orthodoxy. In my opinion, maintaining neutrality (one of our five pillars) means representing methods of both doctrines where possible. Regarding sources, I know from my work at AfD that theses are not considered reliable sources for the purposes of notability. WP:SCHOLARSHIP is the relevant section for theses. It points out that they are primary sources with varying levels of scrutiny. I wonder if the thesis has been cited anywhere? Citations would lend credence. Here is a list of the author's publications; perhaps one of these has the highlights of his thesis. Wikipedia:Reliable sources/Noticeboard is the place to ask about RS. If the thesis or a subsequent paper is found to be a RS, I think it would certainly be reasonable to include in the article. --Mark viking (talk) 21:19, 15 June 2015 (UTC)
Thanks! I have posted it there: Wikipedia:Reliable_sources/Noticeboard#Entropy_Estimation_-_Machine_Learning_with_Dirichlet_and_Beta_Process_Priors:_Theory_and_Applications. So far I've gotten feedback from one person who says it's considered reliable but primary, so it's admissible as long as the content added is not meta (is a claim from the source and not about it). So... I'm starting to think of how to add it to the article. I'm thinking of adding a paragraph or two to the section "Estimates based on expected entropy", or adding a small new section, something like "Analytic solution for processes with Dirichlet priors", with a short paragraph or two. It seems to me like the current paragraph in "Estimates based on expected entropy" might need modification too, especially the second sentence, since it is now demonstrably false (the analytic solution is of course not limited to small biases and variances). Kevin Baas (talk) 18:30, 17 June 2015 (UTC)
If it is good enough for the noticeboard, then I support adding it, too. Yes, the "Estimates based on expected entropy" section seems somewhat promotional in tone to me and could use some work. The second sentence is simply wrong--even in the frequentist realm, there are finite-size scaling and bootstrap approaches that take into account sample size and estimate precision. Adding the thesis work to that section would be reasonable. Another approach would be to add it to the "Bayesian estimator" section. Both this and the NSB estimator are Bayesian approaches with types of Dirichlet priors. Might be interesting to compare their assumptions and domains of applicability. --Mark viking (talk) 19:32, 17 June 2015 (UTC)
I guess I didn't really see that section before. Seems a bit overlapping, since the most common loss function used reduces the Bayesian estimator to just the expectation (E[theta|X]). "The NSB estimator uses a mixture-of-Dirichlet prior, chosen such that the induced prior over the entropy is approximately uniform" - I went for the same thing - using a uniform prior over the entropy - but to get it, I just multiplied by the derivative of the entropy (e.g. -Math.log(theta/(1-theta)) for the Bernoulli distribution); a small sketch of that reweighting is below. But I digress. I wonder how to present the three things now, given that the "Bayesian estimator" contains "estimates based on expectation" as a special case (namely, using a naive prior and MSE as the loss function). Kevin Baas (talk) 20:43, 17 June 2015 (UTC)
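For concreteness, here is a minimal Python sketch of that reweighting for a Bernoulli model (an illustration of the idea, not code from the thesis or the NSB paper): the weight <math>|dH/d\theta| = |\log((1-\theta)/\theta)|</math> is the (unnormalized) density on <math>\theta</math> that induces an approximately uniform prior on the entropy.

<syntaxhighlight lang="python">
import numpy as np

def binary_entropy_bits(theta):
    """Shannon entropy (in bits) of a Bernoulli(theta) distribution."""
    return -(theta * np.log2(theta) + (1 - theta) * np.log2(1 - theta))

def entropy_uniform_weight(theta):
    """Unnormalized prior density on theta whose induced prior on H(theta)
    is uniform: p(theta) proportional to |dH/dtheta| = |log((1-theta)/theta)|."""
    return np.abs(np.log((1 - theta) / theta))

# Quick check on a fine grid: the weighted histogram of entropy values is
# roughly flat, i.e. the induced prior over H is approximately uniform.
theta = np.linspace(1e-6, 1 - 1e-6, 200_001)
h = binary_entropy_bits(theta)
w = entropy_uniform_weight(theta)
hist, _ = np.histogram(h, bins=10, range=(0.0, 1.0), weights=w)
print(hist / hist.sum())  # approximately [0.1, 0.1, ..., 0.1]
</syntaxhighlight>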
FWIW I agree with the preceding discussion, and specifically with points made by Mark viking. (1) The "Estimates based on expected entropy" section definitely seems promotional, so it should either be drastically reworked or simply deleted. (2) It is possible to interpret almost any estimator from a Bayesian perspective, but it's false to conclude that means "they are all necessarily Bayesian" (a method is more than just its choice of probability distributions), and when editing the article there's no need to impose a Bayesian perspective as primary. (3) It seems OK to add the thesis paper, and particularly helpful if the article discusses the choice of prior. Note that Dirichlet processes are a convenient and widely useful generative model for categorical data, but far from the only generative model (see Pitman-Yor etc) - so I would say that although a Dirichlet-based method is generic enough to go in this article, it's worth being clear that it is a method for a specific class of model. Best wishes --mcld (talk) 09:21, 23 June 2015 (UTC)