Talk:Isolation forest

Page contents not supported in other languages.
From Wikipedia, the free encyclopedia

Computational complexity of Isolation Forest[edit]

It very unlikely that Isolation Forest (IF) is the world fastest anomaly detection algorithm. In essence, IF is at least O(n), and will likely be much slower than simple statistical detectors such as the following: 1: Compute empirical mean and standard deviation, 2: Label all observations as anomalies that violate the 3 sigma rule. — Preceding unsigned comment added by Mbt19 (talkcontribs) 08:35, 26 November 2021 (UTC)[reply]

indeed, there is the proposal to just subsample a few objects, and use the distance to the closest as anomaly score: Mahito Sugiyama, Karsten M. Borgwardt:

Rapid Distance-Based Outlier Detection via Sampling. NIPS 2013: 467-475. I bet this is usually faster. 93.132.170.165 (talk) 22:05, 6 July 2022 (UTC)[reply]

Poor introductory example[edit]

The introductory sample with the web server in "Fig. 1" is misleading. This is longitudinal data, but isolation forest is for point data, ignoring time. 93.132.170.165 (talk) 22:05, 6 July 2022 (UTC)[reply]

I added a new example, please check if this is appropriate.Mbt19 (talk) 10:37, 8 July 2022 (UTC)[reply]

Repeated self-promotion, cite spam, and inaccurate content[edit]

It appears that a primary contributor has WP:ACTUALCOI and is unfortunately not improving the quality from a technical point of view. I am seeing unnecessary references being included (both 2008 publications of Isolation forest: The conference AND the journal publication whose content is very similar), self-promotion (the name of the primary author of isolation forest is explicitly mentioned in the lead text, the history section AND in "Open Source Implementation" section), removal of perfectly fine technical content by other authors, as well as nonsensical promotion of the method itself. For example, in the lead section the author added Isolation Forest is fast because it splits the data space randomly, using randomly selected attribute and randomly selected split point. This is nonsensical, because random splitting has no relation with an algorithm's execution speed. The true reason why isolation forest CAN BE fast (if O(1) number of trees are built) is because each split is binary and hence only logarithmically many splits are needed to build a full tree. Also No density estimation is performed in the algorithm is a pointless statement, because no tree-based data mining algorithm does that. The entire lead section does not really make sense anymore since it focuses on distinguishing isolation forest from other data mining methods rather than describing what the method is and how it works. Finally, statements such as Isolation Forest is fast, ... is the world fastest anomaly detector have been repeatedly added to the article. Isolation forest is NOT fast. Its runtime is linear with a moderate constant overhead that scales with the number of trees. Whether linear runtime is "fast" depends on the context and cannot be claimed in general. Other anomaly detectors with linear runtime can have much less constant overhead, e.g., copula-based outlier detection when the data are low-dimensional.Mbt19 (talk) 18:34, 12 October 2023 (UTC)[reply]

This is a recurring problem with small articles on obscure algorithms. Feel free to strip out anything that is inaccurate, poorly sourced, or overly promotional. Suriname0 (talk) 19:51, 12 October 2023 (UTC)[reply]