Talk:Ensemble learning

Pseudo-code

The batches of pseudo-code on this page are a prime example of why formulae work better than pseudo-code for most readers. Writing a verbal description in Courier does not improve readability; in fact, it worsens it. — Preceding unsigned comment added by 129.67.26.155 (talk) 16:04, 10 March 2014 (UTC)[reply]

Rename

Would anyone be opposed to renaming this article ensemble learning? I feel that it would be more succinct.—3mta3 (talk) 16:06, 8 September 2009 (UTC)[reply]

I agree. It fits with the scholarpedia article [1] and is in wide use, though Ensemble methods is also common. EverGreg (talk) 10:30, 10 September 2009 (UTC)[reply]
That's a good point (personally I don't really like the term "learning"), but I feel that it distinguishes this topic from other types of ensembles.—3mta3 (talk) 15:43, 10 September 2009 (UTC)[reply]


Gating and model selection

Though I agree that cross validation doesn't sound like an ensemble method, it's not so straightforward with gating. Specifically, the scholarpedia article uses gating as another word for mixture of experts [2]. This method makes a weighted linear combination of the different predictors. If the weights are 1 for one of the models and 0 for all others, the mixture of experts method reduces to simple model selection. In this way, "model selection" can be seen as a special case of an ensemble method, so it's not correct to say that model selection is not an ensemble method.
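For illustration, here is a minimal numpy sketch of that point: a gated (weighted linear) combination of predictors reduces to plain model selection when the gate puts all of its weight on one model. The two "experts" and the data below are made up purely for this example.

<syntaxhighlight lang="python">
import numpy as np

# Toy regression data and two hand-made "expert" predictors (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=200)

experts = [
    lambda X: 3 * X[:, 0],          # crude linear expert
    lambda X: np.sin(3 * X[:, 0]),  # expert that happens to match the signal
]

def gated_prediction(X, weights):
    """Weighted linear combination of the experts' outputs (a mixture of
    experts with a fixed gate). `weights` should be non-negative and sum to 1."""
    preds = np.column_stack([f(X) for f in experts])
    return preds @ np.asarray(weights)

# Soft gate: a genuine ensemble of the two experts.
mse_soft = np.mean((gated_prediction(X, [0.5, 0.5]) - y) ** 2)

# One-hot gate (1 for one expert, 0 for the rest): reduces to model selection.
mse_select = np.mean((gated_prediction(X, [0.0, 1.0]) - y) ** 2)

print(f"soft gate MSE = {mse_soft:.3f}, one-hot gate MSE = {mse_select:.3f}")
</syntaxhighlight>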

However, it's probably better to wait with this fine point until we've expanded this article enough to present all the basics of ensemble learning. EverGreg (talk) 12:52, 11 September 2009 (UTC)[reply]

That's a good point, perhaps the article needs a section contrasting ensemble methods with model selection.—3mta3 (talk) 15:14, 21 September 2009 (UTC)[reply]

If an ensemble of models is tested using cross-validation-selection across a set of several problems, it will obtain a higher overall score than any one of the models could obtain by itself across the same set of problems. Cross-validation-selection therefore meets the definition of an ensemble, because it creates a synergism from the models that it contains. Of course, this synergism only happens when there are multiple problems involved. The same goes for gating. So I propose that we add a section called "model selection" (with gating and cross-validation-selection sub-sections), and state that all of these techniques technically count as ensembles, but only when there are multiple problems involved.--Headlessplatter (talk) 19:46, 30 September 2009 (UTC)[reply]
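A hedged sketch of that scenario, using scikit-learn (the two toy "problems" and candidate models below are arbitrary choices for illustration): cross-validation-selection may pick a different model per problem, so summed across several problems it can out-score any single fixed model.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification, make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Two different "problems"; the best single model differs between them.
problems = [
    make_moons(n_samples=300, noise=0.25, random_state=0),              # non-linear boundary
    make_classification(n_samples=300, n_features=20, random_state=0),  # mostly linear
]
candidates = [LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)]

total_per_model = np.zeros(len(candidates))
total_cv_select = 0.0
for X, y in problems:
    scores = [cross_val_score(m, X, y, cv=5).mean() for m in candidates]
    total_per_model += scores       # score of each fixed model on this problem
    total_cv_select += max(scores)  # selection keeps the best model per problem

# Note: reusing the same CV scores for selection and reporting is optimistic;
# a fair comparison would score the selected model on held-out data.
print("best single model, summed over problems:", total_per_model.max())
print("cross-validation-selection, summed:", total_cv_select)
</syntaxhighlight>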


Stacking

No explanation of how the models are to be combined after using CV is given in the stacking section. Does stacking refer to any method of using CV to weight the members of the ensemble? That was not my understanding. I was told that: "Stacking actually trains a model on the outputs of the base models so it learns how to combine the base outputs." However, this is not what is said in the stacking section. How exactly does stacking work, and can we update the section appropriately? Jlc46 (talk) 22:51, 26 March 2010 (UTC)[reply]

My understanding is consistent with your explanation, and not the article's. I am in favor of someone re-doing this section. The stacking section is written in very different language than the other sections, and is hard to follow. For example, the expression "bake-off" is not described, referenced, or linked. Also, it seems to be constructed in a bottom-up manner, while the other sections use a top-down approach to describing their methods.--67.166.113.28 (talk) 13:13, 3 October 2011 (UTC)[reply]
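While the section is being redone, here is a minimal sketch of stacking as described in the quoted explanation: out-of-fold predictions from the base models become the inputs of a second-level (meta) model that learns how to combine them. This is only an illustration (scikit-learn's StackingClassifier packages the same idea), not the article's wording.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
base_models = [DecisionTreeClassifier(max_depth=3, random_state=0),
               LogisticRegression(max_iter=1000)]

# Cross-validation supplies *out-of-fold* base-model predictions, so the
# meta-model never sees a prediction made on a point the base model was
# trained on.  This is where CV enters stacking.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models
])

# The meta-model is trained on the base outputs: it learns the combination
# rather than using fixed weights.
meta_model = LogisticRegression().fit(meta_features, y)

# At prediction time, base models (refit on all the data) feed the meta-model.
for m in base_models:
    m.fit(X, y)
stacked_pred = meta_model.predict(
    np.column_stack([m.predict_proba(X)[:, 1] for m in base_models])
)
</syntaxhighlight>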

Bayes optimal

Is this statement accurate? "On average, no other ensemble can outperform it, so it is the ideal ensemble". It follows by the no free lunch theorem that averaged over all data sets, no predictor is better than any other. I'm guessing we mean to say something slightly different here? EverGreg (talk) 21:42, 7 July 2010 (UTC)[reply]

The optimal Bayes classifier is a theoretical construct. It's not a learning algorithm per se - it assumes that you already know exactly all the probability distributions in its formula, so there's no training phase, and so no-free-lunch theorems simply do not apply. In practice, when you try to estimate those distributions from a training set, you get approximations, and so optimality is gone. -- X7q (talk)
I haven't heard anyone viewing it as an ensemble though, so maybe we have a case of original research here. -- X7q (talk) 22:24, 7 July 2010 (UTC)[reply]
I separated Bayesian Model Averaging (BMA) from the Bayes Optimal Classifier (BOC) section. This seemed to be a natural division since BMA is a practical technique and BOC is only theoretical. There are many references in the literature to BMA as an ensemble technique. I also have not seen BOC directly referred to as an ensemble, but I cannot see how BMA could be an ensemble and BOC not. Further, BOC makes predictions by combining multiple hypotheses, which sounds like an ensemble to me. Is practical implementation a requirement of ensembles? Perhaps a simple solution may be to dodge this issue by creating a separate article for BOC, and link to it from the BMA section. On the other hand, I argue that a discussion of BOC is useful in an article about ensembles, since much of ensemble theory revolves around trying to approximate the BOC.--128.187.80.2 (talk) 17:31, 8 July 2010 (UTC)[reply]
"The Bayes Optimal Classifier is an optimal classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it, so it is the ideal ensemble[10].". There is the possibility for confusion here. Clearly, Bayes optimality, in general, refers to the minimum possible error rate achievable in a problem domain. A classifier which achieves the Bayes Error rate might be said to be a, (perhaps one of many) Bayes Optimal classifier. In the case with 1D, two class, data which exhibited perfect Gaussian distributions for the class probabilities such a Bayes Optimal Classifier would be achievable with a suitably chosen threshold.

The Bayes optimal classifier being referred to here is, within a finite and potentially small hypothesis space, a combination method for making the optimal prediction from that finite set of hypotheses. Perhaps a better term might be "Bayes Optimal Combiner".
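For concreteness, the combination rule under discussion (as given in Mitchell's Machine Learning, 1997, and in the article's section) takes the form below, where C is the set of classes, H the hypothesis space and T the training data:

<math display="block">
y = \underset{c_j \in C}{\operatorname{arg\,max}} \sum_{h_i \in H} P(c_j \mid h_i)\, P(T \mid h_i)\, P(h_i)
</math>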

The statement that "On average, no other ensemble can outperform it, so it is the ideal ensemble[10]." seems very likely to be wrong. The performance of the ensemble is constrained by the hypothesis space. I can conceive of a very poor hypothesis space in which, even using the Bayes optimal combination method, the resulting ensemble could easily be outperformed by some majority-vote ensemble built on a far superior hypothesis space.

Perhaps the statement should be along the lines of ... given a finite set of hypotheses, no other ensemble can exceed the performance of the Bayes Optimally combined ensemble.

As noted in the Wikipedia article on "Mathematical optimization", "optimal" means finding the element(s) of a solution set that maximize or minimize the value of an objective function. Something that is "best" ("optimal") for one purpose (objective function) may be terrible (even "worst") for some other objective(s).
A current example of something that is great for one purpose but terrible for another is the capacity of the US healthcare system: The US has been closing hospitals for years, because they were not operating close enough to capacity, and local demand didn't seem to justify the expense. Now, in the era of COVID, people die because we don't have the capacity. Many say we've optimized the wrong thing. For more on this, see, e.g., the Wikipedia articles robust optimization and multi-attribute utility.
The section in this article on "Bayes optimal classifier" does not use the word "objective function" but does provide the math, which clearly contains the objective function that is optimized over alternatives. DavidMCEddy (talk) 15:47, 8 September 2020 (UTC)[reply]

BMA not defined

In the current version, there is no definition of BMA. Just BPA, then suddenly Bayesian Model Averaging turns up in the next sentence, and BMA is used in the next section. Looks like some confusion caused by the editing mentioned above. Biomenne (talk) 15:24, 27 December 2015 (UTC)[reply]

Super Learner

Should anything related to the Super Learner be mentioned? Mark van der Laan et al. (2007) developed the theoretical background for stacking & renamed it "Super Learner". Merudo (talk) 15:58, 3 November 2016 (UTC)[reply]

"Bayesian parameter averaging" (BPA) and "Bayesian model combination" (BCC) vs. "Bayesian Model Averaging" (BMA)[edit]

As of 2018-05-26 this article includes sections on "Bayesian parameter averaging" and "Bayesian model combination". In the next few days, I hope to find time to combine these two sections into one on "Bayesian model averaging", with "Bayesian model combination" as a subsection, for the following reasons:

First, there is a huge literature on "Bayesian model averaging" dating back at least to a 1995 paper by Adrian Raftery, developed further in the 1999 Hoeting et al. paper, popularized by Burnham and Anderson (2002), and developed and critiqued in numerous papers and books since then. This includes more than 10 packages in the Comprehensive R Archive Network for R (programming language) to help people use these methods.

Second, I'm not aware of any literature on "Bayesian parameter averaging". In particular, I cannot find it mentioned in the paper cited for it: Hoeting, J. A.; Madigan, D.; Raftery, A. E.; Volinsky, C. T. (1999). "Bayesian Model Averaging: A Tutorial". Statistical Science. 14 (4): 382–401. doi:10.2307/2676803. JSTOR 2676803.

Third, the literature on "Bayesian model combination" is, in my judgment, far from adequate: The current sections on BPA and BCC cite two papers on BCC, neither of which seems to have appeared in a standard refereed academic journal:

Google Scholar identified for me other conference papers on "Bayesian model combination", but nothing comparable to the depth of literature available on "Bayesian model averaging", which, despite that depth, does not even have its own section in this article as of 2018-05-26.

If you see a problem with this, please provide appropriate citations while explaining your concerns about this. Thanks, DavidMCEddy (talk) 02:53, 27 May 2018 (UTC) [minor revision DavidMCEddy (talk) 15:16, 27 May 2018 (UTC)][reply]

It's not clear to me that Bayesian Model Averaging (BMA) belongs on the Ensemble Learning page. See section 16.6 of Kevin Murphy's Machine Learning: A Probabilistic Perspective (or 18.2.2 of the updated version, Probabilistic Machine Learning: An Introduction, set to be published in 2022). The section title is "Ensembling is not Bayes model averaging", and it opens with "It is worth noting that an ensemble of models is not the same as using Bayes model averaging over models (Section 4.6)".
A further reference is given to a paper by T. Minka titled "Bayesian Model Averaging is not Model Combination".
The essential reason they are different is that ensembling/model combination considers a larger hypothesis space - a more detailed, mathematical explanation can be found in both above sources.
From Murphy:
"The key difference is that in the case of BMA, the weights p(m|D) sum to one, and in the limit of infinite data, only a single model will be chosen (namely the MAP model). By contrast, the ensemble weights wm are arbitrary, and don’t collapse in this way to a single model."
As someone new to ensemble learning, I was confused by the inclusion of BMA in this article. Rmwenz (talk) 01:42, 7 August 2022 (UTC)[reply]
That quote from Murphy is not entirely correct. Murphy is correct that the weights all sum to one, but it may not be true that "only a single model will be chosen" "in the limit of infinite data". That claim is correct if you use BIC rather than AIC AND the "true model" is among the list of candidates, as noted by Burnham and Anderson (2002, p. 211). However, if the "true model" is NOT among the list of candidates, which seems likely in many cases, then AIC reportedly produces a more complicated model but better predictions than BIC, which could converge to the candidate closest to the "true model" while delivering less accurate predictions. See Gerda Claeskens; Nils Lid Hjort (2008), Model selection and model averaging, Cambridge University Press, Wikidata Q62568358, ch. 4.
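To make the weight behaviour concrete, here is a small sketch (my own illustration, not taken from Murphy or from Claeskens & Hjort) of the standard BIC approximation to BMA posterior model weights, w_m proportional to exp(-BIC_m/2): the weights sum to one, and a modest BIC gap already concentrates almost all of the weight on one model, which is the collapsing behaviour discussed above.

<syntaxhighlight lang="python">
import numpy as np

def bma_weights_from_bic(bic_values):
    """Approximate posterior model probabilities p(m|D) from BIC scores,
    w_m proportional to exp(-BIC_m / 2), assuming equal prior model weights."""
    bic = np.asarray(bic_values, dtype=float)
    rel = np.exp(-0.5 * (bic - bic.min()))  # subtract the min for numerical stability
    return rel / rel.sum()                  # sums to one, unlike free ensemble weights

# Illustrative BIC values for three candidate models; a gap of ~10 BIC units
# already pushes almost all of the posterior weight onto the best model.
print(bma_weights_from_bic([1000.0, 1004.0, 1012.0]))
# -> roughly [0.88, 0.12, 0.002]
</syntaxhighlight>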
Beyond that, in the four years since I created this Talk page section on '"Bayesian parameter averaging" (BPA) and "Bayesian model combination" (BCC) vs. "Bayesian Model Averaging" (BMA)' in 2018, the discussion of Bayesian parameter averaging seems to have been removed from this article. Moreover, the discussion of '"Bayesian model combination" (BCC)' seems to rest on one conference paper, not in any refereed journal nor more substantive scholarly work. In 2018 I could not find any more credible source for BCC. I wonder if it should be removed entirely? BCC does NOT seem to meet Wikipedia's Template:Notability criterion.
But to your question about whether Bayesian Model Averaging (BMA) belongs on the Ensemble learning page: I think the two are clearly related, and BMA should at least be mentioned in this article.
There is ample material to justify a separate article on BMA. If someone knowledgeable about BMA wrote such an article, I would support shrinking the section of this article on BMA, perhaps from its current 5 modest sized paragraphs to 2, while citing that other article. However, without a separate article on BMA, I would not support shortening this brief section on BMA. DavidMCEddy (talk) 04:27, 7 August 2022 (UTC)[reply]

2020 update

I've completely rewritten the section on "Bayesian model averaging", using some of the previous references where the claims sounded plausible relative to everything else I know about this, and citing several other sources. I've also deleted the section on "Bayesian model combination", because it doesn't make sense to me and is based on a single conference paper that was probably not refereed -- which may help explain why it doesn't make sense to me. I'm parking the deleted material here to make it easy for someone to access it if they think I'm deleting something that should be in this article:

MATERIAL DELETED:

Despite the theoretical correctness of this technique, early work showed experimental results suggesting that the method promoted over-fitting and performed worse compared to simpler ensemble techniques such as bagging;<ref>{{cite conference |first=Pedro |last=Domingos |title=Bayesian averaging of classifiers and the overfitting problem |conference=Proceedings of the 17th [[International Conference on Machine Learning|International Conference on Machine Learning (ICML)]] |pages=223––230 |year=2000 |url=http://www.cs.washington.edu/homes/pedrod/papers/mlc00b.pdf }}</ref> however, these conclusions appear to be based on a misunderstanding of the purpose of Bayesian model averaging vs. model combination.<ref>{{citation |first=Thomas |last=Minka |title=Bayesian model averaging is not model combination |year=2002 |url=http://research.microsoft.com/en-us/um/people/minka/papers/minka-bma-isnt-mc.pdf }}</ref> Additionally, there have been considerable advances in theory and practice of BMA. Recent rigorous proofs demonstrate the accuracy of BMA in variable selection and estimation in high-dimensional settings,<ref>{{cite journal |last1=Castillo |first1=I. |last2=Schmidt-Hieber |first2=J. |last3=van der Vaart |first3=A. |title=Bayesian linear regression with sparse priors |journal=[[Annals of Statistics]] |volume=43 |issue=5 |pages=1986–2018 |year=2015 |doi=10.1214/15-AOS1334 |arxiv=1403.0735 |s2cid=88513264 }}</ref> and provide empirical evidence highlighting the role of sparsity-enforcing priors within the BMA in alleviating overfitting.<ref>{{cite journal |last1=Hernández-Lobato |first1=D. |last2=Hernández-Lobato |first2=J. M. |last3=Dupont |first3=P. |title=Generalized Spike-and-Slab Priors for Bayesian Group Feature Selection Using Expectation Propagation |journal=Journal of Machine Learning Research |volume=14 |issue= |pages=1891–1945 |year=2013 |doi= |url=http://www.jmlr.org/papers/volume14/hernandez-lobato13a/hernandez-lobato13a.pdf }}</ref> === Bayesian model combination === Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weightings drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all of the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. The results from BMC have been shown to be better on average (with statistical significance) than BMA, and bagging.<ref>{{cite conference |author=Monteith, Kristine|author2=Carroll, James |author3=Seppi, Kevin |author4= Martinez, Tony. |url=http://axon.cs.byu.edu/papers/Kristine.ijcnn2011.pdf |title=Turning Bayesian Model Averaging into Bayesian Model Combination |conference=Proceedings of the International Joint Conference on Neural Networks IJCNN'11 |year=2011 |pages=2657–2663 }}</ref> The use of Bayes' law to compute model weights necessitates computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model-space, but such is rarely possible. 
Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily complex method for doing model selection. The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution. The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
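For readers who come across this parked text, here is a minimal sketch of the mechanism it describes (model weightings sampled from a uniform Dirichlet over the simplex and scored against cross-validated predictions); the data and base models below are placeholders, not anything taken from the deleted section's sources.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=400, n_features=15, random_state=1)
models = [LogisticRegression(max_iter=1000),
          DecisionTreeClassifier(random_state=1),
          GaussianNB()]

# Out-of-fold class-1 probabilities from each base model.
probs = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1] for m in models
])

rng = np.random.default_rng(1)
best_w, best_acc = None, -1.0
# Sample candidate weightings from a uniform Dirichlet over the simplex and
# keep the best-scoring combination -- a crude stand-in for averaging over
# the space of ensembles that the deleted text describes.
for w in rng.dirichlet(np.ones(len(models)), size=200):
    acc = np.mean((probs @ w > 0.5) == y)
    if acc > best_acc:
        best_w, best_acc = w, acc

print("best sampled weighting:", np.round(best_w, 2), "accuracy:", round(best_acc, 3))
</syntaxhighlight>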

Bayes optimal classifier

As of 2018-05-27 the section on "Bayes optimal classifier" ends with the following:

Unfortunately, the Bayes Optimal Classifier cannot be practically implemented for any but the most simple of problems. There are several reasons why the Bayes Optimal Classifier cannot be practically implemented:
  1. Most interesting hypothesis spaces are too large to iterate over, as required by the argmax.
  2. Many hypotheses yield only a predicted class, rather than a probability for each class as required by the term P(c_j|h_i).
  3. Computing an unbiased estimate of the probability of the training set given a hypothesis (P(T|h_i)) is non-trivial.
  4. Estimating the prior probability for each hypothesis (P(h_i)) is rarely feasible.

This section cites only one source, and that's a 1997 book on Machine Learning. I take issue with each of these four points:

  1. While it may have been true in 1997 that "Most interesting hypothesis spaces are too large to iterate over," that's less true today for two reasons: Most importantly, compute power has increased dramatically since then, to the point that many problems that were computationally infeasible then are now done routinely at a cost that's too cheap to meter. A rough estimate of the increase in compute power since then is provided by Moore's law, which says that computer performance doubles every 18-24 months. By this estimate, compute power available for a given budget has doubled more than 10 times in the more than 20 years since 1997; ten doublings is a factor of 1,024. Secondly, improvements in algorithms make it possible to do more and better computations with the same hardware -- or to do computations today that were not done before.
  2. There may be many "hypotheses" (or algorithms) that "yield only a predicted class". However, many if not all procedures based on the theory of probability could produce a probability.
  3. Many statistical procedures maximize a likelihood function, and many more maximize a Bayesian posterior, which is proportional to a likelihood times a prior. This would seem to contradict the claim that "Computing an unbiased estimate of the probability of the training set given a hypothesis (P(T|h_i)) is non-trivial."
  4. In most practical problems, the data provide more information than the prior, to the point that Bayesian methods routinely (though not always) use uninformative priors.

I therefore plan to delete this portion of that section. If you think it belongs there, please provide a credible reference. DavidMCEddy (talk) 18:37, 27 May 2018 (UTC)[reply]

Multiple classifier systems

@Boghog: May I ask for more discussion of why you deleted a block of text following, "The broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner[citation needed]"?

The block deleted was as follows:

The broader term of ''multiple classifier systems'' also covers hybridization of hypotheses that are not induced by the same base learner{{citation needed|date=December 2017}}. This methodology and phenomenon has also been described by another term, wisdom of the crowds, which emerged from multiple DREAM biomedical data science challenges.<ref>{{Cite book|url=http://worldcat.org/oclc/1052026078|title=Prediction of overall survival for patients with metastatic castration-resistant prostate cancer: development of a prognostic model through a crowdsourced challenge with open clinical trial data.|last=.|first=Guinney, Justin. Wang, Tao. Laajala, Teemu D. Winner, Kimberly Kanigel. Bare, J Christopher. Neto, Elias Chaibub. Khan, Suleiman A. Peddinti, Gopal. Airola, Antti. Pahikkala, Tapio. Mirtti, Tuomas. Yu, Thomas. Bot, Brian M. Shen, Liji. Abdallah, Kald. Norman, Thea. Friend, Stephen. Stolovitzky, Gustavo. Soule, Howard. Sweeney, Christopher J. Ryan, Charles J. Scher, Howard I. Sartor, Oliver. Xie, Yang. Aittokallio, Tero. Zhou, Fang Liz. Costello, James C. ,|oclc=1052026078}}</ref><ref>{{Cite journal|last=Eduati|first=Federica|last2=Mangravite|first2=Lara M|last3=Wang|first3=Tao|last4=Tang|first4=Hao|last5=Bare|first5=J Christopher|last6=Huang|first6=Ruili|last7=Norman|first7=Thea|last8=Kellen|first8=Mike|last9=Menden|first9=Michael P|date=2015-10|title=Erratum: Prediction of human population responses to toxic compounds by a collaborative competition|url=http://dx.doi.org/10.1038/nbt1015-1109a|journal=Nature Biotechnology|volume=33|issue=10|pages=1109–1109|doi=10.1038/nbt1015-1109a|issn=1087-0156}}</ref><ref>{{Cite journal|last=Bansal|first=Mukesh|last2=Yang|first2=Jichen|last3=Karan|first3=Charles|last4=Menden|first4=Michael P|last5=Costello|first5=James C|last6=Tang|first6=Hao|last7=Xiao|first7=Guanghua|last8=Li|first8=Yajuan|last9=Allen|first9=Jeffrey|date=2014-11-17|title=A community computational challenge to predict the activity of pairs of compounds|url=http://dx.doi.org/10.1038/nbt.3052|journal=Nature Biotechnology|volume=32|issue=12|pages=1213–1222|doi=10.1038/nbt.3052|issn=1087-0156}}</ref>

The first reference cited seems to exist, though as a Lancet article, not a book. More importantly for an article on "ensemble learning", the abstract to this Lancet article says they used a "penalised Cox proportional-hazards model". That doesn't sound like "ensemble learning" to me.

I was unable to access easily the second "Erratum" reference.

The third article, "A community computational challenge to predict the activity of pairs of compounds", talks about "SynGen". I was not able to find a clear and concise definition of that method in skimming that article.

The term "Wisdom of [the] crowd(s)" appears in the first and third article, but seem to refer to crowdsourcing data analysis, not restricted to anything that might relate to "multiple classifier systems" nor to "ensemble learning" more generally.

If User:47.188.75.82 feels this deletion is inappropriate, it is incumbent on him / her to explain better how these references relate to "ensemble learning" in general and "multiple classifier systems" more specifically. S/he should also document that in more detail here -- and preferably use a more standard citation template, at least for the first reference. DavidMCEddy (talk) 14:28, 26 December 2018 (UTC)[reply]

Hi. The main reason I deleted it was that user 47.188.75.82 was adding citations to multiple articles co-authored by Tao Wang, which suggests there may be a conflict of interest. Many of the citations seemed to be shoehorned into the text and, as you point out, have a dubious connection to the subject of the article. Finally, all the sources were primary, whereas secondary are preferred. I will take a closer look at the above sources. Boghog (talk) 15:24, 26 December 2018 (UTC)[reply]
I replaced the erratum source with the original source in the following. Also two of the three citations are coauthored by Tao Wang. Finally all are primary, and per WP:MEDRS, secondary are preferred. Boghog (talk) 15:39, 26 December 2018 (UTC)[reply]
The broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner[citation needed]. This methodology and phenomenon has also been described by another term, wisdom of the crowds, which emerged from multiple DREAM biomedical data science challenges.[1][2][3]

References

  1. ^ Guinney J, Wang T, Laajala TD, Winner KK, Bare JC, Neto EC, et al. (January 2017). "Prediction of overall survival for patients with metastatic castration-resistant prostate cancer: development of a prognostic model through a crowdsourced challenge with open clinical trial data". primary. The Lancet. Oncology. 18 (1): 132–142. doi:10.1016/S1470-2045(16)30560-5. PMID 27864015.
  2. ^ Eduati F, Mangravite LM, Wang T, Tang H, Bare JC, Huang R, et al. (September 2015). "Prediction of human population responses to toxic compounds by a collaborative competition". primary. Nature Biotechnology. 33 (9): 933–40. doi:10.1038/nbt.3299. PMID 26258538.
  3. ^ Bansal M, Yang J, Karan C, Menden MP, Costello JC, Tang H, et al. (December 2014). "A community computational challenge to predict the activity of pairs of compounds". primary. Nature Biotechnology. 32 (12): 1213–22. doi:10.1038/nbt.3052. PMID 25419740.
@Boghog: Thanks for your diligence. Scammers are infinitely creative. DavidMCEddy (talk) 16:02, 26 December 2018 (UTC)[reply]

Boosting

This line really needs a citation and is, as far as I know, theoretically & empirically unsupported: "boosting has been shown to yield better accuracy than bagging, but it also tends to be more likely to over-fit the training data." Boosting is most certainly robust to overfitting - see the margin-based analysis of boosting.

  • Schapire, Robert; Freund, Yoav (2012), Boosting: Foundations and Algorithms, The MIT Press, ISBN 978-0-262-01718-3
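A quick empirical companion to the margin argument (my own toy example, not taken from Schapire & Freund): the sketch below is a way to watch whether AdaBoost's test error actually degrades as more boosting rounds are added; empirically it often does not, which is the behaviour the margin-based analysis explains.

<syntaxhighlight lang="python">
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = AdaBoostClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)

# staged_predict yields the ensemble's predictions after 1, 2, ... boosting
# rounds, so we can watch training and test error as weak learners are added.
train_err = [np.mean(p != y_tr) for p in clf.staged_predict(X_tr)]
test_err = [np.mean(p != y_te) for p in clf.staged_predict(X_te)]

for t in (10, 100, 300):
    print(f"rounds={t:3d}  train error={train_err[t-1]:.3f}  test error={test_err[t-1]:.3f}")
</syntaxhighlight>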