Talk:C4.5 algorithm

Advantages and Disadvantages

All machine learning algorithms can overfit, and C4.5 is no exception. It would be useful to point out when C4.5 works well and where it does not.

Since C4.5 relies so heavily on information gain (IG), the difference in class priors between the training and test sets is a MAJOR factor in how well the algorithm performs on new (test) data. This is crucial to using C4.5 and should be stated clearly in the article. Agreed? -- Preceding unsigned comment added by 173.8.132.62 (talk) 05:40, 25 May 2013 (UTC)
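To make the point about class priors concrete, here is a minimal sketch of entropy and information gain for a discrete attribute. It is not taken from the article or from Quinlan's code; the function names and the toy samples are made up for illustration. Both samples use the same P(attribute | class), and only the class prior differs.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(values, labels):
    """Entropy reduction obtained by partitioning `labels` on a discrete attribute."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder

# Hypothetical toy data: P(a | +) = 0.75 and P(a | -) = 0.25 in both samples.
balanced_values = ["a"] * 3 + ["b"] * 1 + ["a"] * 1 + ["b"] * 3
balanced_labels = ["+"] * 4 + ["-"] * 4      # 50% positive

skewed_values = ["a"] * 9 + ["b"] * 3 + ["a"] * 1 + ["b"] * 3
skewed_labels = ["+"] * 12 + ["-"] * 4       # 75% positive

gain_balanced = information_gain(balanced_values, balanced_labels)
gain_skewed = information_gain(skewed_values, skewed_labels)
print(gain_balanced, gain_skewed)
```

The attribute scores differently under the two priors even though its relationship to each class is unchanged, which is the effect described above. Whether this is the dominant factor in practice is a separate question that would need a source.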

Untitled

I believe I've considerably cleaned it up. However, I don't usually edit Wikipedia pages, and it could probably still be brought up to Wikipedia standards. Weston.pace 22:47, 24 May 2007 (UTC)

I agree. I hope the original author decides to edit; this reads like notes taken during a lecture. Fonny 13:18, 17 April 2007 (UTC)

C4.5 is considered pretty important in the machine learning community (at least, I see it a lot in papers about text classification, usually being beaten by support vector machines on particular datasets), so this page is noteworthy. However, the English is terrible. I'm not familiar enough with the software/algorithm(s) involved to edit it seriously. Whenning 05:23, 3 February 2007 (UTC)

Discussion of improvements

This article and the restatement of ID3_algorithm should be replaced with an explanation of the improvements offered by C4.5. From the article:

Choosing an appropriate attribute selection measure.
Handling training data with missing attribute values.
Handling attributes with differing costs.
Handling continuous attributes.

What does C4.5 do in each of these areas? Details about the implementation are useless! --Beefyt 21:55, 24 April 2007 (UTC)
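As a partial answer for two of the four items, here is a rough sketch of the gain ratio (gain divided by the entropy of the attribute's own values) and of a binary threshold split for a numeric attribute. This is a simplification under stated assumptions, not C4.5 itself: the function names are invented, candidate thresholds are taken as midpoints between adjacent sorted values, and missing values and attribute costs are not handled.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a sequence of hashable items."""
    total = len(items)
    return -sum((c / total) * math.log2(c / total) for c in Counter(items).values())

def gain_ratio(values, labels):
    """Information gain divided by split information (entropy of the attribute values)."""
    total = len(labels)
    partitions = {}
    for v, y in zip(values, labels):
        partitions.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(p) / total * entropy(p) for p in partitions.values())
    split_info = entropy(values)
    return gain / split_info if split_info > 0 else 0.0

def best_threshold(xs, labels):
    """Return (threshold, gain) for the best binary split x <= threshold.

    Assumption: candidates are midpoints between adjacent distinct sorted values.
    """
    pairs = sorted(zip(xs, labels), key=lambda p: p[0])
    base = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)
    return best

labels = ["+", "+", "-", "-"]
# An ID-like attribute with one value per row is penalized relative to a binary one.
print(gain_ratio(["1", "2", "3", "4"], labels), gain_ratio(["a", "a", "b", "b"], labels))
print(best_threshold([1.0, 2.0, 3.0, 4.0], labels))
```

Both attributes in the first example separate the classes perfectly, so plain information gain cannot tell them apart; dividing by the split information is what favors the two-valued attribute.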

J48 algorithm

J48 redirects to this page, though this article doesn't mention it. Are there significant differences between C4.5 and J48, or is J48 no more than a Java implementation of C4.5? Thanks, Simeon87 (talk) 12:16, 28 May 2008 (UTC)

J48 is a Java implementation of C4.5 Release 8, the last free release of the algorithm before Quinlan moved to C5.0. Jdoucett (talk) 20:41, 6 November 2011 (UTC)

Pseudocode doesn't describe the C4.5 algorithm

What does the last base case mean? ("Instance of previously-unseen class encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.") How is that even possible? A more detailed explanation is required.

The listed pseudocode is only a basic decision tree induction algorithm, not the C4.5 algorithm (if you check the source, you'll see that Kotsiantis didn't describe the C4.5 algorithm in any listing). I'm not sure whether the intent was just to list a general decision tree induction algorithm; if so, it would be helpful to say so explicitly. --B.eberhardinger (talk) 09:42, 24 June 2011 (UTC) Benedikt Eberhardinger

The pseudocode also does not describe how C4.5 recursively prunes branches. I will update the pseudocode now. Gabefair (talk) 16:15, 7 September 2015 (UTC) Still working on it. Gabefair (talk) 22:44, 20 September 2015 (UTC)

C5.0: neutrality (and factual accuracy) disputed

The section on C5.0 seems to be copied from Quinlan/RuleQuest's ad for the C5.0 program. It also confuses algorithm and implementation by claiming that C5.0 is faster than C4.5. I've labeled it {{npov}} and have requested a source; if none is found, then a rewrite may be in order. Qwertyus (talk) 20:43, 9 August 2011 (UTC)

Although I don't have a source for this, as a practitioner in the area and a user of both C4.5 and C5.0, I can vouch that the advertised features are widely accepted within the machine learning community. C5.0 is algorithmically faster, as well as being a faster implementation than C4.5, since it makes different splits, generates different trees, and does feature selection prior to generating the tree. In spite of this, C5.0 generates models with very similar performance to those produced by C4.5, so the two are often treated as interchangeable in the literature, even though they are not. Jdoucett (talk) 20:45, 6 November 2011 (UTC)

Proposed ("he claims", and omitting marketing talk):

Quinlan went on to create C5.0 and See5 (C5.0 for Unix/Linux, See5 for Windows) which he markets commercially. He claims a number of improvements over C4.5 [1][third-party source needed]: speed, memory efficiency, smaller trees, support for boosting, weighting the cost of misclassification and winnowing.

Source for a single-threaded Linux version of C5.0 is available under the GPL.

I don't know what the state of affairs was when you wrote that, but looking at Quinlan's page now, he's got C5.0 posted under the GNU GPL, so it's no longer commercial. -- Preceding unsigned comment added by 98.247.50.244 (talk) 00:32, 5 December 2014 (UTC)

References

[1] Is See5/C5.0 Better Than C4.5?

C5.0: NPOV: recommend deletion

I had never heard of "c5.0" prior to this article. Has anyone else? CERTAINLY "c5.0" does not deserve 10% of this article and should, AT MOST, be reduced to a footnote about alternatives. -- Preceding unsigned comment added by 76.21.11.140 (talk) 02:19, 5 June 2013 (UTC)

Paragraph about significance

There should be a paragraph about the significance of the C4.5 algorithm and its usage in data mining. I will try to add some info whenever I can, but this is just a suggestion for improvement. --Diaa abdelmoneim (talk) 20:12, 26 March 2013 (UTC)
