Talk:Isolation forest

This article is within the scope of WikiProject Computer science, a collaborative effort to improve the coverage of Computer science related articles on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has been rated as Low-importance on the project's importance scale.

This article is within the scope of WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.

This article has been rated as Low-importance on the WikiProject Statistics importance scale.

The following Wikipedia contributor may be personally or professionally connected to the subject of this article. Relevant policies and guidelines may include conflict of interest, autobiography, and neutral point of view.

drtonyliu (talk · contribs) This user has contributed to the article.

Individuals with a conflict of interest, particularly those representing the subject of the article, are strongly advised not to directly edit the article. See Wikipedia:Conflict of interest. You may request corrections or suggest content here on the Talk page for independent editors to review, or contact us if the issue is urgent.

Computational complexity of Isolation Forest

It is very unlikely that Isolation Forest (IF) is the world's fastest anomaly detection algorithm. In essence, IF is at least O(n) and will likely be much slower than simple statistical detectors such as the following: (1) compute the empirical mean and standard deviation; (2) label as anomalies all observations that violate the three-sigma rule. — Preceding unsigned comment added by Mbt19 (talk • contribs) 08:35, 26 November 2021 (UTC)
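For illustration, the two-step detector described above can be sketched in a few lines of Python. This is only a minimal sketch: the function name `three_sigma_anomalies`, the parameter `k`, and the use of the population standard deviation are assumptions made for the example, not something taken from the article or its sources.

```python
import statistics


def three_sigma_anomalies(values, k=3.0):
    """Return the indices of values more than k standard deviations from the mean.

    Hypothetical helper for illustration; two linear passes over the data.
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    if std == 0:
        return []  # all values identical, nothing can violate the rule
    return [i for i, v in enumerate(values) if abs(v - mean) > k * std]


# Twenty identical readings and one extreme value.
readings = [10.0] * 20 + [100.0]
flagged = three_sigma_anomalies(readings)
```

Both passes are linear in the number of observations, and the constant factor is small because no trees or samples have to be built.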

Indeed, there is a proposal to subsample just a few objects and use the distance to the closest sampled object as the anomaly score: Mahito Sugiyama, Karsten M. Borgwardt: Rapid Distance-Based Outlier Detection via Sampling. NIPS 2013: 467-475. I bet this is usually faster. 93.132.170.165 (talk) 22:05, 6 July 2022 (UTC)
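A rough sketch of the sampling idea mentioned above, assuming Euclidean distance and a single random sample drawn once. The function name `sampled_nn_scores`, its parameters, and the choice to skip a point's own copy when it happens to be in the sample are assumptions for this example and may differ from the cited paper.

```python
import math
import random


def sampled_nn_scores(points, sample_size=20, seed=0):
    """Score each point by its distance to the nearest point of a small random sample.

    Hypothetical helper for illustration; larger scores suggest outliers.
    """
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    size = max(2, min(sample_size, len(points)))
    sample_idx = rng.sample(range(len(points)), size)
    scores = []
    for i, p in enumerate(points):
        # Assumption: a sampled point is not compared with itself.
        nearest = min(math.dist(p, points[j]) for j in sample_idx if j != i)
        scores.append(nearest)
    return scores


cluster = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5)]
scores = sampled_nn_scores(cluster + [(10.0, 10.0)], sample_size=3)
most_anomalous = scores.index(max(scores))
```

With n points and a sample of size s, the cost is on the order of n times s distance computations, which stays linear in n when s is a small constant.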

Poor introductory example

The introductory example with the web server in "Fig. 1" is misleading. It shows longitudinal data, but isolation forest is designed for point data and ignores time. 93.132.170.165 (talk) 22:05, 6 July 2022 (UTC)

I added a new example; please check whether it is appropriate. Mbt19 (talk) 10:37, 8 July 2022 (UTC)

Repeated self-promotion, cite spam, and inaccurate content

It appears that a primary contributor has WP:ACTUALCOI and is unfortunately not improving the article's quality from a technical point of view. I am seeing unnecessary references (both 2008 publications of isolation forest are cited: the conference paper AND the journal publication, whose content is very similar), self-promotion (the name of the primary author of isolation forest is explicitly mentioned in the lead text, in the history section AND in the "Open Source Implementation" section), removal of perfectly fine technical content by other authors, and nonsensical promotion of the method itself. For example, in the lead section the author added "Isolation Forest is fast because it splits the data space randomly, using randomly selected attribute and randomly selected split point." This is nonsensical, because random splitting has no relation to an algorithm's execution speed. The true reason why isolation forest CAN BE fast (if a constant, O(1), number of trees is built) is that each split is binary, and hence only logarithmically many splits are needed to build a full tree. Also, "No density estimation is performed in the algorithm" is a pointless statement, because no tree-based data mining algorithm does that. The entire lead section no longer makes sense, since it focuses on distinguishing isolation forest from other data mining methods rather than describing what the method is and how it works. Finally, statements such as "Isolation Forest is fast, ... is the world fastest anomaly detector" have been repeatedly added to the article. Isolation forest is NOT fast. Its runtime is linear, with a moderate constant overhead that scales with the number of trees. Whether linear runtime is "fast" depends on the context and cannot be claimed in general. Other anomaly detectors with linear runtime can have much less constant overhead, e.g., copula-based outlier detection when the data are low-dimensional. Mbt19 (talk) 18:34, 12 October 2023 (UTC)

This is a recurring problem with small articles on obscure algorithms. Feel free to strip out anything that is inaccurate, poorly sourced, or overly promotional. Suriname0 (talk) 19:51, 12 October 2023 (UTC)
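To make the discussion of random binary splits in this thread concrete, here is a simplified sketch of how the path length of a point in random isolation trees could be computed. It is not the reference implementation: the names `path_length` and `mean_path_length`, the depth cap of ceil(log2 n) on the full data set, and the omission of subsampling and of the usual path-length normalization are all assumptions made to keep the example short.

```python
import math
import random


def path_length(point, data, rng, depth=0, max_depth=None):
    """Count the random binary splits applied before `point` is isolated.

    Hypothetical helper; `data` is a list of equal-length tuples.
    """
    if max_depth is None:
        # Assumption: cap the depth at ceil(log2 n), so each path costs O(log n) splits.
        max_depth = math.ceil(math.log2(max(len(data), 2)))
    if len(data) <= 1 or depth >= max_depth:
        return depth
    dim = rng.randrange(len(point))  # randomly selected attribute
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth  # cannot split on this attribute; stop early in this sketch
    split = rng.uniform(lo, hi)  # randomly selected split point
    if point[dim] < split:
        side = [row for row in data if row[dim] < split]
    else:
        side = [row for row in data if row[dim] >= split]
    return path_length(point, side, rng, depth + 1, max_depth)


def mean_path_length(point, data, n_trees=100, seed=0):
    """Average the path length over several independent random trees."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees


data = [(float(i),) for i in range(16)] + [(1000.0,)]
outlier_len = mean_path_length((1000.0,), data)
inlier_len = mean_path_length((8.0,), data)
```

In this sketch the isolated value 1000.0 is typically separated after a single split, while a point in the middle of the cluster needs several. The total cost grows with the number of trees, which is the constant overhead discussed in the thread.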