Wikipedia talk:STiki/Feature development

Greetings! The purpose of this page is to discuss the development of features to be included in improvements to STiki's anti-vandalism algorithm. In this context, features are things which the STiki server can use to calculate the probability that a given edit is vandalism.

This page does not need to operate as a standard talk page and should perhaps be more of an "outline". When you make a new feature proposal, try to provide: (1) intuition about *why* it might be useful, and why it isn't something already captured by another feature; (2) example API queries that show how the evidence would be produced; (3) any scalability concerns; and (4) the feature type (real-valued, integer, nominal, binary).

Consider also that these features must be calculated not only in a live fashion, but also in hindsight for training purposes. Take a feature like "How many external links are there on a page?" In an online fashion, it is a simple API query. To answer in hindsight, one would have to fetch an old version of the page and parse out the links.
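For concreteness, the two lookups for the external-link example might look something like the following (the revision ID below is a made-up placeholder):

    Live: external links on the current version of a page
    https://en.wikipedia.org/w/api.php?action=query&prop=extlinks&titles=Example&ellimit=max&format=xml

    Hindsight: fetch the wikitext of an old revision, then parse the links out of it
    https://en.wikipedia.org/w/api.php?action=query&prop=revisions&revids=123456789&rvprop=content&format=xml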

Features currently implemented by STiki

Taken from:

  • feature_set.java in the core_objects directory of the STiki source;
  • feature_builder.java in the learn_frontend directory of the STiki source; and
  • adtree_model.java in the learn_adtree directory of the STiki source.

Content-based

Non-language based

  • Byte change in article length - BYTE_CHANGE; used

Text-based (content-based, not specific to any natural-language features)

  • Maximum number of repetitions of a single character added - NLP_CHAR_REP; used
  • Percentage of edit that is upper-case - NLP_UCASE; used
  • Percentage of edit that is alphabetical - NLP_ALPHA; used
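For reference, a minimal sketch of how these three might be computed over the text added by an edit (illustrative only, not the actual STiki implementation in feature_set.java):

    // NLP_CHAR_REP: longest run of a single repeated character.
    static int maxCharRepetition(String added) {
        int max = 0, run = 0;
        char prev = '\0';
        for (char c : added.toCharArray()) {
            run = (c == prev) ? run + 1 : 1;
            prev = c;
            if (run > max) max = run;
        }
        return max;
    }

    // NLP_UCASE / NLP_ALPHA: fraction of upper-case / alphabetic characters.
    static double fraction(String added, boolean upperOnly) {
        if (added.isEmpty()) return 0.0;
        int hits = 0;
        for (char c : added.toCharArray())
            if (upperOnly ? Character.isUpperCase(c) : Character.isLetter(c))
                hits++;
        return ((double) hits) / added.length();
    }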

Language-based

  • Number of offensive words (as judged by an internal list) added - NLP_DIRTY; used

User-based

  • Whether the user is an IP - IS_IP; used
Note that this will need updating for IPv6; I just reverted my first IPv6 vandalism (spotted via STiki/CBNG catching a subsequent edit as vandalism, on top of the first). Allens (talk | contribs) 02:56, 17 June 2012 (UTC)
  • Reputation of the user (time-decayed # of "offending edits") - REP_USER; not currently used
I suspect that time since last rollback of this user (below) is substituting for it - perhaps time-decay should be limited? Allens (talk | contribs) 10:16, 25 May 2012 (UTC)
What counts as an offending edit? Is this just an edit that has been reverted? Is it specifically edits that have been reverted by STiki? Yaris678 (talk) 11:40, 25 May 2012 (UTC)
In the context of the paper, OE = vandalism undone by human users with the rollback right who used a standard edit summary (manual rollback, STiki, Huggle, TW) West.andrew.g (talk)
We'll get a lot further with this feature, I think, by normalizing it against the author's edit count.
Total or recent? Allens (talk | contribs) 18:10, 25 May 2012 (UTC)
TBD, I suppose. West.andrew.g (talk)
So is that based purely on edit summaries? Or does the code look for a null diff? I'm guessing it is the first (more efficient; theoretical potential for abuse, but it's very unlikely and not devastating if it does happen).
Obviously if it were based on looking for null diffs it would be relatively easy to expand to non-standard reverts... but I know you go for efficiency every time.  :-)
I often use a non-standard edit summary in STiki. Does that mean my reverts aren't counted? The summary will still contain the WP:STiki link and be marked as minor if I'm pressing the vandalism button.
Yaris678 (talk) 13:24, 25 May 2012 (UTC)
The decision to use the edit summary to glean this was because I wanted to write a pure metadata approach (EUROSEC'09). Indeed, then I could make claims that I could process 250+ edits a second while the people who were looking at language properties were doing 5 edits/second. I think the API now makes available a hash code for each page version, and using these as the basis for *revert* detection would work very well. There is some question whether this is best though, re: revert/rollback. Some reverts could be good faith, for instance, whereas rollback shouldn't be.
Unless it's a self-rollback; when I goof, I do those instead of self-reverts to reduce server load. Allens (talk | contribs) 15:11, 25 May 2012 (UTC)
Easy to detect. Self-rollbacks do not contribute to reputation. West.andrew.g (talk)
Your reverts are still counted, as STiki writes directly to the database and doesn't have to deal with this edit summary bit. However, non-standard rollback comments using any other tool might slip by (regardless, the feature is not brought into force by the ADTree model currently installed). West.andrew.g (talk)
STiki updates the database without going via edit summaries - logical and efficient.
Ah yes. Missed the "not currently used". Is there a particular reason? Was it found to be not very effective? I agree that dividing by number of edits could help. Especially for IPs, which could be used by multiple people. I'll have a think about this one. I think it could be very powerful if done right. Yaris678 (talk) 15:22, 25 May 2012 (UTC)
If it is implemented but "not used", it is because the ADTree training phase found the feature to have insufficient info-gain for inclusion. In other words, it couldn't find a way to use the feature to its advantage. Note this doesn't mean it was completely worthless (it obviously encodes some value), but we limit the depth of the tree to avoid over-training. West.andrew.g (talk)
It looks like you're not trusting non-rollbackers to indicate offensive edits, unless they're STiki users? I suspect this plus not using bot reverts (I agree on that part) is reducing the frequency of counted OEs sufficiently that the time-decay removes useful information. For learning vandalism vs non-vandalism, I can see excluding non-rollbacker, non-STiki users, but quite so stringent a criterion may not be needed for, call them, REP_USER_FULL and TS_RBU_FULL. Allens (talk | contribs) 18:10, 25 May 2012 (UTC)
This is exactly how it is done. The paper uses stringent criteria. The online system is non-stringent. It includes basically all rollbacks and reverts that follow standard edit summaries (and non-standard ones with STiki). Is it robust? No. If normalized, should it work in practice? Yes.

I am apparently confused; I had thought otherwise given the following code in rollback_handler.java:

		// Bot rollbacks skip the permission check entirely; for human
		// editors, the revert only counts if they hold the rollback right.
		if(!rb_type.equals(RB_TYPE.BOT)){
			boolean rb_perm = api_xml_user_perm.has_rollback(
					api_retrieve.process_user_perm(cur_rev_md.user));
			if(!rb_perm) // If user not permissioned, abandon
				return (new pair<Long, RB_TYPE>(-1L, RB_TYPE.NONE));
Oops: bots are included, non-rollbackers are not. I could go either way on the rollback point. Maybe even make autoconfirmed the cutoff. West.andrew.g (talk)
  • Time since user's first edit - TS_R; used
  • Time since the user's last edit was rolled back for vandalism (or -1 if none rolled back) - TS_RBU; used
Note that, while recent (less than 12-13 minutes ago) incidents are counted far more by the current tree, it also does make a difference if the user has ever committed vandalism that's been rolled back. Allens (talk | contribs) 18:10, 25 May 2012 (UTC)
Note also that this is affected by the same criteria as REP_USER for determining what is an admissible rollback/revert. Allens (talk | contribs) 19:59, 25 May 2012 (UTC)
I was thinking about the possible normalisation of REP_USER... but the only way I could think to make it work (given the time decay and the fact that multiple edits can be reverted at once... after a delay) involved having a pretty large database (or else doing a lot of API calls). I think the answer may be an improved version of TS_RBU. Maybe call it TSE_RBU...
TSE_RBU = min(TS_RBU,a*E_RBU)
where E_RBU is the number of edits the user has made since the last revert and a is some constant. Perhaps choose a = 3 hours... this means that if an IP edits rarely, but when it does it is always vandalism, STiki is ready... but also a user can't just do a load of null edits and quickly gain a reputation.
You could measure E_RBU from the standard feed that you get all edits from... you just need to do the work in recording it. You could do something where you give up recording it after 100 edits, and just set E_RBU to a large value like 1E30.
Similarly, if someone has never had an edit reverted, it probably makes sense to set TS_RBU to a number like 1E30.
Yaris678 (talk) 13:03, 26 May 2012 (UTC)
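A minimal sketch of that proposal, assuming a is expressed in the same time units as TS_RBU (all names here are hypothetical, not STiki source):

    // Hypothetical sketch of the proposed TSE_RBU feature.
    static final long A_SECS = 3 * 60 * 60;   // a = 3 hours, as suggested above
    static final double NEVER = 1e30;         // sentinel for "never reverted"

    static double tseRbu(double tsRbu, long editsSinceLastRevert) {
        if (tsRbu < 0)                        // no admissible revert on record
            return NEVER;
        return Math.min(tsRbu, (double) A_SECS * editsSinceLastRevert);
    }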
  • Reputation of the user's country, if available (user is an IP) - REP_COUNTRY; used
How frequently does this return ""? Only for satellite internet users (or IP addresses that STiki shouldn't be seeing), or more frequently than that? Your source claims 99.5% accuracy, but how's their coverage? Allens (talk | contribs) 18:10, 25 May 2012 (UTC)
That is correct. Exceedingly rarely. I think that my system treats the "" case as its own unique country code, though, so these folks are getting in on the reputation logic. Same with registered users; if I remember correctly, they map to their own country code (which is extraordinarily well behaved). West.andrew.g (talk) 18:31, 25 May 2012 (UTC)
Actually, it gives a -1 "reputation" for both, resulting in the tree treating REP_COUNTRY as missing. Huh. Looks like countries with reputations below 0.068 are actually treated as being less likely to commit vandalism than registered users. Wonder what countries those are... It occurs to me that something may need to be done to prevent country reputations from countries with very few edits from influencing matters too much - those are going to fluctuate up and down a lot, even with normalization by number of edits. (Admittedly, there's also going to be very few edits to be affected by such fluctuations, so it may not be worth the trouble!) Allens (talk | contribs) 19:07, 25 May 2012 (UTC)

Article-based

  • Reputation of the article (an indication of how often it's vandalized) - REP_ARTICLE; used
  • Reputation of categories associated with an article - currently commented out
I'm guessing this didn't work? Allens (talk | contribs) 10:16, 25 May 2012 (UTC)
Really darn inefficient in its current form. There are some categories out there which are simply too large, e.g., "Living People". Any re-inclusion would have to blacklist things like that one. West.andrew.g (talk)
Was there indeed a negative correlation (or, more likely, first a positive then a negative correlation - as in, a certain optimal category size or size range) between category size and usefulness? "Wikipedia good articles" and "Wikipedia featured articles" are likely to be helpful; the latter is argued for by the analysis in [1] of edits to featured articles.
Never measured as a function of category size. Note per EUROSEC'09 that administrative categories were filtered out to focus on topical ones (both categories you describe are administrative, and administrative categories tend to be huge). West.andrew.g (talk)
  • Time since last page edit - TS_LP; used
Perhaps, as well as this, something like the mean or median of times between the last 10-20 page edits, plus time since the possible-vandalism edit divided by the mean/median would be helpful? The former may be a better way to figure out how frequently edited the page is, and the latter as a way of telling whether people are likely to have seen the possible-vandalism and not reverted it. (I wasn't sure if this belonged here or down below with new features.) Allens (talk | contribs) 18:27, 25 May 2012 (UTC)
Hrm. I realized that the second idea would result in all probabilities changing while the edits are in the queue - yuck! Allens (talk | contribs) 23:41, 25 May 2012 (UTC)
The intuition here was that if edit y was 10 seconds after edit x, it was probably an immediate response to something that happened in x. What generates the most immediate responses on Wikipedia? Vandalism reversion. Thus edit y was likely good (and reverting prior vandalism). West.andrew.g (talk)

Other

  • Time of day, if determinable (user is an IP) - TOD; used (before/after 3 PM)
  • Day of the week, if determinable (user is an IP) - DOW; not currently used
  • Length of comment left - COMM_LENGTH; used
Does the comment include the section header (only applicable for registered users) and/or any automatically-generated comments? Allens (talk | contribs) 10:16, 25 May 2012 (UTC)
It does include the section header in the *current* version. However, per PAN-CLEF '11 the better way to do this is three features: (1) raw comment length, (2) comment length w/o section header, and (3) was this a section-wise edit? West.andrew.g (talk)
Exactly what I thought; I'm not sure if the back of my brain was recalling the PAN-CLEF '11 paper, or if great minds think alike :-}. I would be surprised if all three wound up being simultaneously useful - the info-gain ranking in the paper is of each characteristic independent of the others, right? Otherwise, the ordering would be different for with/without ex-post-facto criteria. Allens (talk | contribs) 17:03, 25 May 2012 (UTC)
Yep, independent info-gain, thus they could all be measuring the exact same thing and not capturing novel sets. West.andrew.g (talk)
Note also the possibility of picking up automatically-generated ones: they start with [[WP:AES|←]]. Allens (talk | contribs) 21:35, 22 June 2012 (UTC)

Feature ideas: Content-based

Non-language features

  • Ratio of new article size to old article size. From CICLING'11.
  • Ratio of byte change to mean(old article size, new article size). Should help spot wholesale deletions of most of an article. Allens (talk | contribs) 02:02, 26 May 2012 (UTC)
  • Did this edit convert a functional wikilink to a redlink or vice-versa? Parsing for wikilinks, at least those outside of images or templates, is far easier than parsing for external links, incidentally. Allens (talk | contribs) 23:56, 29 May 2012 (UTC)
Admittedly, one difficulty with this idea is telling whether a page was present at a particular time - this may require parsing the page addition and deletion logs. Allens (talk | contribs) 11:15, 30 May 2012 (UTC)
  • A couple from an article looking at editor interactions:[2]
    • "The number of whitespace-delimited tokens added or removed from the page, δ, i.e., its change in size"
    • "the number of tokens whose content was changed, θ"
"Token changes include character insertions, removals, and additions as well as adding and removing tokens." "[T]hese parameters are largely independent of the page size itself, with Pearson correlations of ρ(δ, size) = 0.147 and ρ(θ, siz e) = 0.142, which makes them suitable for classification across all revisions."
They're able to differentiate between "additions", "removals", and "edits" using the relationships between these - abs(δ)/θ and the sign of δ were the most important measures. A modification using \W instead of whitespace could be of interest. One problem with the current diff, however, is its tendency to detect a movement of content (even with as little difference as the insertion/deletion of an extra return) as a large-scale change in content. I had thought of the measure of length(deleted text) + length(added text), but this problem with the diffs dissuaded me. I'm not sure how they're dealing with this difficulty. Allens (talk | contribs) 23:47, 30 May 2012 (UTC)
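A quick sketch of δ as quoted above (θ needs a token-level diff, so only the size-change half is shown; illustrative only, not the paper's code):

    // delta: change in the number of whitespace-delimited tokens.
    static int delta(String before, String after) {
        return countTokens(after) - countTokens(before);
    }

    static int countTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) return 0;
        return trimmed.split("\\s+").length;
    }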

Textual features

  • Length of longest string of letters in changed text, edit summary, or username without a vowel; for username, may wish to use length of longest string of characters. Note that "y" should probably be counted as a vowel for these purposes - see "rhythm". I have noticed a tendency for vandals who are simply hitting the keyboard to go for significant stretches without a vowel, and to use such usernames as sdfs# and similarly unreadable edit summaries. (A sketch appears after this list.) Allens (talk | contribs) 22:42, 25 May 2012 (UTC)
  • Amount of change in wiki markup/syntax ({, [, <, etc., I'm guessing?). From PAN-CLEF'11.
  • How about some extensions onto this? For instance, change in the ratio of references (ref tags + sfn tags, citation templates, etc.) to mean(text before change, text after change). This helps reveal if, for instance, someone is removing referenced (bad) or unreferenced (OK) material. Allens (talk | contribs) 23:18, 25 May 2012 (UTC)
  • BTW, this is down twice in the PAN-CLEF'11 table #4 (as #25 and #33). Allens (talk | contribs) 23:27, 25 May 2012 (UTC)
  • Longest stretch between word boundaries (I'm guessing \b?). From PAN-CLEF'11.
  • Should this be the longest stretch in added text or the longest stretch in the text after the additions? Allens (talk | contribs) 23:18, 26 May 2012 (UTC)
  • Maximum length of deletions that start, end, or both inside of words. Quite often, vandals will just select a random section on their screen and delete it. (Probably do as two features - one with starting in the middle of a word, another with ending in the middle of a word.) The reason for using length is that a small deletion in the middle of a word may well be a typo correction. Allens (talk | contribs) 01:01, 26 May 2012 (UTC)
  • Changes in abs([# of {] - [# of }]), abs([# of <] - [# of >]), abs([# of '['] - [# of ']']). Vandals (and AGF cases!) frequently mess up wiki syntax. (See the sketch after this list.) Allens (talk | contribs) 01:46, 26 May 2012 (UTC)
  • Ratio of numerical characters ?changed? to all characters. From CICLING'11.
  • Kullback-Leibler divergence of character distribution. From CICLING'11.
I am thinking that this would have P as before and Q as afterward? How is the complete lack of a character in P or Q handled? This could be particularly important with the introduction of non-US-ASCII characters by people, for instance, signing their names. Allens (talk | contribs) 19:19, 31 May 2012 (UTC)
  • Previous length of the article. From CICLING'11.
  • Maximum number of non-alphabetical (or perhaps non-lower-case) characters in insertions into the middle of words. Helps detect random insertions inside words. Allens (talk | contribs) 14:16, 26 May 2012 (UTC)
  • Insertion/deletion of an entry (prefaced with an asterisk) on a disambiguation page with that entry lacking a wikilink (including a cross-wiki link). Allens (talk | contribs) 19:19, 31 May 2012 (UTC)
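Minimal sketches of the two proposals flagged above (vowel-free runs and bracket imbalance); the regex and method names are mine, not STiki's:

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Longest run of letters containing no vowel ("y" counted as a vowel,
    // so the character class below is the consonants minus y).
    static int longestVowelFreeRun(String text) {
        Matcher m = Pattern.compile("[b-df-hj-np-tv-xz]+",
                Pattern.CASE_INSENSITIVE).matcher(text);
        int max = 0;
        while (m.find())
            max = Math.max(max, m.group().length());
        return max;
    }

    // abs(#open - #close) for one bracket pair; the feature would be the
    // change in this value from the old revision to the new one.
    static int imbalance(String text, char open, char close) {
        int net = 0;
        for (char c : text.toCharArray()) {
            if (c == open) net++;
            else if (c == close) net--;
        }
        return Math.abs(net);
    }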

Language features

Regarding many of the below, it is important to consider how best to distinguish cases in which there are already, for instance, "inappropriate" words in the article and more are being introduced, from ones in which they are not present in the article already. It is inadvisable in most cases to give a complete exemption to changes that "only" add more "inappropriate"/unusual features to an article, since a previous vandal (even the same vandal, depending on how multiple changes from one user/IP are treated) may have added such material in the first place. Also significant is that, as I understand it, the tree in use will only include any one variable once on a given branch - as in, it will not check a variable again if it has previously done so. I therefore suggest using two variables for each measure:

  • The first variable is a simple -1/0/+1: -1 if "inappropriate" words or whatever have decreased, 0 if they have remained the same, and +1 if they have increased. (Alternatively, if -1 is hardwired in as a "missing value", use -1 for no change, 0 for a decrease, and 1 for an increase.) An alternative would be [# new]/[new page size] - [# old]/[old page size], treating blanking of the page as a change of 0.
  • The second variable should indicate the significance of the change from the prior state. log(([# new]+1)/([# old]+1)) is my current leading candidate; the absolute value may need to be used if -1 is hardwired in as a "missing value". The division of new by old (with a +1 to prevent division by 0) causes the "new" state to be considered relative to the "old" one, with additions to a "pristine" page counting much more than additions to a page already containing such "inappropriate" words or whatever; the log acts to make it symmetric (vandalism reversion will look like the exact opposite of vandalism). Allens (talk | contribs) 19:53, 31 May 2012 (UTC)

Any thoughts? Allens (talk | contribs) 19:53, 31 May 2012 (UTC)
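A sketch of the proposed pair, given counts of the term in question before and after the edit (hypothetical names; the "missing value" complication is ignored here):

    // Variable 1: direction of change (-1 decrease, 0 no change, +1 increase).
    static int direction(int oldCount, int newCount) {
        return Integer.signum(newCount - oldCount);
    }

    // Variable 2: log(([# new]+1)/([# old]+1)); symmetric about zero, so a
    // revert of vandalism scores the exact negative of the vandalism itself.
    static double magnitude(int oldCount, int newCount) {
        return Math.log((newCount + 1.0) / (oldCount + 1.0));
    }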

Inappropriate words

  • The following dirty-word regexes are from Perl's Regexp::Common (see CPAN for a copy). The first is of words that are inappropriate in almost all cases. The second is of words that may or may not be appropriate, and actually includes the first. Apologies for the language...
  • (?:poop?|pee|piss(?:.??take|e(?:rs|[srd])|ing|y)?|quims?|shit(?:t(?:e(?:rs|[dr])|ing|y)|e(?:rs|[sdry])|ing|[se])?|t(?:urds?|wats?)|wank(?:e(?:rs|[rd])|ing|s)?|a(?:rs(?:e(?:.??hole|[sd])|ing|e)|ss(?:.??holes?|ed|holes?|ing))|b(?:ull(?:.??shit(?:t(?:e(?:rs|[dr])|ing)|s)?|\Wshit(?:t(?:e(?:rs|[dr])|ing)|s)?|shit(?:t(?:e(?:rs|[dr])|ing)|s)?)|low(?:.??jobs?))|c(?:ock(?:.??suck(?:ers?|ing))|rap(?:p(?:e(?:rs|[rd])|ing|y)|s)?|u(?:nts?|m(?:ing|ming|s)))|dick(?:.??head|ed|ing|less|s)|f(?:uck(?:ed|ing|s)?|art(?:e[rd]|ing|[sy])?|eltch(?:e(?:rs|[rsd])|ing)?)|ha(?:rd.??on|lf(?:.??a[sr])sed)|m(?:other(?:.??fuck(?:ers?|ing))|uth(?:a(?:.??fuck(?:ers?|ing|a+))|er(?:.??fuck(?:ers?|ing)))|erde?))
  • (?:p(?:ork|r(?:onk|icks?)|uss(?:ies|y)|iss(?:.??take|\Wtake|take|e(?:rs|[srd])|ing|y)?)|quims?|root(?:e(?:rs|[rd])|ing|s)?|s(?:od(?:d(?:ed|ing)|s)?|punk|crew(?:ed|ing|s)?|h(?:ag(?:g(?:e(?:rs|[dr])|ing)|s)?|it(?:t(?:e(?:rs|[dr])|ing|y)|e(?:rs|[sdry])|ing|[se])?))|t(?:urds?|wats?|its?)|wank(?:e(?:rs|[rd])|ing|s)?|a(?:rs(?:e(?:.??hole|[sd])|ing|e)|ss(?:.??holes?|ed|ing))|b(?:on(?:e(?:rs|[sr])|ing|e)|u(?:gger|ll(?:.??shit(?:t(?:e(?:rs|[dr])|ing)|s)?|\Wshit(?:t(?:e(?:rs|[dr])|ing)|s)?|shit(?:t(?:e(?:rs|[dr])|ing)|s)?))|a(?:stard|ll(?:e(?:rs|[dr])|ing|s)?)|lo(?:ody|w(?:.??jobs?)))|c(?:ock(?:.??suck(?:ers?|ing)|\Wsuck(?:ers?|ing)|suck(?:ers?|ing)|s)?|rap(?:p(?:e(?:rs|[rd])|ing|y)|s)?|u(?:nts?|m(?:ing|ming|s)))|d(?:ongs?|ick(?:.??head|ed|ing|less|s)?)|f(?:uck(?:ed|ing|s)?|art(?:e[rd]|ing|[sy])?|eltch(?:e(?:rs|[rsd])|ing)?)|h(?:ump(?:e(?:rs|[rd])|ing|s)?|a(?:rd.??on|lf(?:.??a[sr])sed))|m(?:other(?:.??fuck(?:ers?|ing))|uth(?:a(?:.??fuck(?:ers?|ing|a+)|\Wfuck(?:ers?|ing|a+)|fuck(?:ers?|ing|a+))|er(?:.??fuck(?:ers?|ing)|\Wfuck(?:ers?|ing)|fuck(?:ers?|ing)))|erde?))
The first might also be used to check edit summaries and usernames. Allens (talk | contribs) 22:34, 25 May 2012 (UTC)
A couple of slight alterations will need to be made to the above for Java (namely doubling each \ in the string literal). Substitution of "i" with "[i1]" and "o" with "[o0]" is probably preferable; a sketch follows. Allens (talk | contribs) 12:18, 28 May 2012 (UTC)
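For instance (illustrative only), the opening fragment of the first regex would end up in Java along these lines, with the leetspeak substitutions applied; any \W or \b elsewhere in the lists would likewise become \\W or \\b inside the string literal:

    import java.util.regex.Pattern;

    Pattern dirtyFragment = Pattern.compile(
        "(?:p[o0][o0]p?|pee|p[i1]ss(?:.??take|e(?:rs|[srd])|[i1]ng|y)?)",
        Pattern.CASE_INSENSITIVE);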
Some others that may be of interest:
"User talk:" AND "\\[\\[User[^:]*:" - both because of later insertions sometimes interrupting (same for most other AND cases)
"Example\\.jpg\\]\\]" AND "File:Example\\.jpg" (latter especially for gallery cases)
"Heading text" AND "== Heading"
"Bulleted list item" AND "\\* Bulleted"
coo+l
!!!+ - has to be at least 3 to avoid tables
poo+p? - replaces earlier
hi+ - short, do only with \b around
yo+ - ditto
hell+o+ - replaces earlier
balls
yes
boo+bs?
[:;]-?\\) - replaces earlier
\\(-?[:;]
:D
"was\\s*here" - replaces earlier
"your\\s*m[uo]m"
"tu madre"
ha(?:ha)+ - replaces earlier
he(?:he)+
''Italic AND "Italic text" - replaces earlier
'''Bold AND "Bold text" - replaces earlier
lo+l[sz]? AND "l(?:ol)+[sz]?" - replaces earlier
put[ao]s*
hola
hoes
"doo+\\s*doo+"
caca
"f\\W*[uv]\\W*c\\W*k" - replaces earlier; includes disguised
f\\*\\*[k*] - catch attempts at censorship
jizz+
vaginas? - would better go into "non-vulgar sex-related words", actually
"AND" ones: note that both aren't going to match any single piece of text. One problem with the existing code is that ones like the current ":-\\)" are unlikely to match with "\b"s surrounding them. I suggest replacing "\b" with "(?:\b|\s)" for those cases (or preceding with (?:\b|(?<=\s)) and following with (?:\b|(?=\s)), using lookbehind/lookahead to avoid "taking up" spaces that may be used by other parts of the regex). The above are a compilation of variations I've seen in using STiki, mostly with the CBNG queue. Allens (talk | contribs) 15:24, 26 May 2012 (UTC)[reply]
  • Change in number of offensive terms normalized by the change in article size. From PAN-CLEF'11.
How is this normalization done, given that the change can be 0? Frequency of new minus frequency of old? Allens (talk | contribs) 22:57, 25 May 2012 (UTC)
  • Change in number of English pronouns normalized by the change in article size (From PAN-CLEF'11; CICLING'11 specifies first person and second person only, not sure if PAN-CLEF'11 does).
  • Change in number of "colloquial words w/high bias", normalized by the change in article size. From CICLING'11.
Some of the "peacock terms" and similar listed at WP:PEACOCK and nearby may belong here. Allens (talk | contribs) 15:29, 26 May 2012 (UTC)[reply]
  • Change in number of "non-vulgar sex-related words", normalized by the change in article size. From CICLING'11.
It may be a good idea to check if the article's title or any of the article's categories matches "\\bsex" (or others, depending on what terms are in question), and reduce or eliminate (fix at 0) this term if so. Allens (talk | contribs) 15:24, 26 May 2012 (UTC)
  • Change in number of "miscellaneous typos/colloquialisms", normalized by the change in article size. From CICLING'11.
I would personally suggest separating typos and colloquialisms (the latter includes contractions, I assume?). Allens (talk | contribs) 15:29, 26 May 2012 (UTC)

Other language-based

  • Typically infoboxes have a "(?:name|title|above) = [name of article]" at their beginning. Changes away from this by vandals are not infrequent (and thus changes in reverse tend to be fighting vandalism). "\\{\\{\\s*(?:Infobox|Geobox)\\s*(?:[^|]+|\\|)+?(?:name|title|above)\\s*=\\s*[quoted name of article]\\s*[|}]" (case insensitive) should recognize the "good" version. This would be +1 if the change was towards this, -1 if it was away, and 0 if unchanged (either both negative or both positive). Allens (talk | contribs) 23:34, 26 May 2012 (UTC)
  • Look for the insertion/deletion of bigrams not found in English; an approximate regex (excluding anything found in my local words file or a list of unusual loanwords) would be "(?:bx|c[jvx]|dx|f[qx]|g[qx]|hx|j[cfgqsvwxz]|k[qx]|mx|p[xz]|q[bcgjkmnptvxyz]|sx|v[bfhjmpqtwx]|wx|x[jx]|z[jqx])". (I would not do this case-insensitively, to avoid picking up as many proper nouns.) This should help pick up both random insertions and insertions in other languages. (Of course, pages in categories indicating they have non-English text, or with titles containing such bigrams, should be exempt.) Allens (talk | contribs) 11:16, 27 May 2012 (UTC)
  • For pages not in categories indicating non-English text or with titles with unusual characters, (an increase in) characters neither in US-ASCII nor common mathematical symbols nor IPA characters should be considered bad. Allens (talk | contribs) 22:44, 27 May 2012 (UTC)

Feature ideas: User-based

  • According to the EUROSEC'10 paper, a good post-facto filter would be whether an edit is by a privileged user. If the total editcount is already being fetched, group membership can be obtained in the same list=users query. The information can then be cached for some period of time. Alternatively, an initial listing of users with privileges can be downloaded, then a regular list=logevents query can be made for the gblrights/usergroups entries in the gblrights log to update it; not sure if this might miss group upgrades/downgrades by stewards, though. I note that information on rollback group membership is already being retrieved for some users when determining offensive-edit rollbacks; this could be sped up by using the same cache. (Gathering this information should be done even if rollbacker is not required for determining offensive edits, since training should use such privilege indications - and that needs to be done with data as of the time the edit took place, to avoid training on future data.) Whether an account is autoconfirmed/confirmed or not may also be a significant flag, although length of time since first edit and editcount may adequately encompass this. Allens (talk | contribs) 18:57, 25 May 2012 (UTC)
    • I agree on all fronts, but think autoconfirmed might be especially good. I seem to remember from one of my more infamous wiki-spam papers that "autoconfirmed" doesn't even exist in the database; it's just something the system computes (from edit count and registration time) whenever it needs to determine/output it. Of course, things might have changed over the past year or two. West.andrew.g (talk)
    • From the API's documentation, autoconfirmed is actually not a group - it's a right. Weird. I guess they did it that way to allow "confirmed" to also add the right. Yuck - rights data is only available via allusers. I think calculating it on the fly (then keeping the 1s in the database) is the way to go. Allens (talk | contribs) 19:39, 25 May 2012 (UTC)
  • From the STiki talk page: Whether an IP is in Category:Shared IP addresses from educational institutions may well be of interest, especially if expanded from the current single IPs (see below for more on this possibility). Admittedly, it also includes some college and university addresses - but only ones from which there has been vandalism previously, as far as I can tell.
  • Grouping IP reputations: Given dynamic IPs, it may be best to broaden what's counted for reputation, school-IP, or repeated-vandalism purposes. The most obvious way to do this is clumping into /24 and /16 - just take off the last number or two (a sketch appears at the end of this section). More complex ways, probably to wait a bit, are looking the address up in IRR - see http://www.irr.net/docs/overview.html for an overview of the main one, including downloadable databases - and/or whois - I believe downloading ARIN's database is possible, and that should cover a high proportion of cases. One can then group addresses by what ISP/school/whatever "owns" them, what organization maintains them ("mnt-by" field; usually an ISP), and how they're supposed to be routed (the latter being grouping them by AS number). I note that it should be of interest to see if any particular ISP has a higher rate of vandalism... This is a long-term idea; one thought is for me to write a Perl script to translate between the IRR/ARIN/etc databases and whatever STiki's back-end databases run on. Allens (talk | contribs) 21:18, 25 May 2012 (UTC)
This will need updating for IPv6 addresses, incidentally.
  • Whether an IP is IPv6, since most anti-vandalism mechanisms have probably not been updated yet to catch IPv6 anonymous user vandalism. Allens (talk | contribs) 14:57, 18 June 2012 (UTC)
  • Number of user edits over the past month, week, day, or hour (in order of preference). From PAN-CLEF'11.
This will need to have a top limit of 500 for efficiency; it may be best to do this restricted to NS0. Allens (talk | contribs) 14:57, 18 June 2012 (UTC)
  • Number of user edits ever divided by time since first edit. From PAN-CLEF'11.
  • User's talk page size. From PAN-CLEF'11.
This has the disadvantage of requiring an additional API query, although given list=search, not one returning the full page. Allens (talk | contribs) 23:07, 25 May 2012 (UTC)
  • Number of warnings on the user's talk page. From PAN-CLEF'11.
This requires downloading and partially parsing the user's talk page. Allens (talk | contribs) 23:07, 25 May 2012 (UTC)
Or you could do it by looking at the edit history of the user's talk page. If ClueBot NG has ever edited your talk page that would say something.
To go any further than that, you'd need to parse the edit summaries... which may or may not be easier than parsing the talk page itself.
It does cover for talk page blanking though.
Yaris678 (talk) 12:26, 28 May 2012 (UTC)
I'd say it's probably easier than parsing the talk page - not as much other text to search through. Good point about talk page blanking. Allens (talk | contribs) 14:15, 28 May 2012 (UTC)
  • Whether this user was also the previous editor of the page. From PAN-CLEF'11.
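As promised above, a minimal sketch of the /24 and /16 grouping (IPv4 only; method names are mine):

    // Collapse an IPv4 address into /24 and /16 keys by dropping the last
    // one or two octets, for grouped reputation lookups.
    static String slash24(String ipv4) {
        return ipv4.substring(0, ipv4.lastIndexOf('.'));   // "a.b.c"
    }

    static String slash16(String ipv4) {
        String s = slash24(ipv4);
        return s.substring(0, s.lastIndexOf('.'));         // "a.b"
    }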

Feature ideas: Article-based

  • How similar is this edit to the last reverted/rolled-back offensive edit on the same article? (Perhaps not counting BOT reverts/rollbacks, to avoid false positives, and perhaps further restricting to rollbackers/sysops?) Vandals whose edits are very quickly reverted tend to repeat the same or similar vandalism, or at least vandalize the same article again. Similarity can be judged by a number of criteria, some overlapping:
  • Hash of resulting article versus hash of article after last offensive edit - I'm not seeing this in the MediaWiki API documentation except for images, though
  • Same size edit (might want to count more if same size is significantly away from 0, as in counting 1243 more than 3 - less chance for coincidence)
  • Similar or greater deletion (if previous was a significant deletion relative to the article's size - with some minimum for stubs - and this one is the same or greater)
  • Same user
  • Similar IP user (I will be putting into the user-based section some ideas on how to determine IP similarity)
  • Same external links added (can get this from the external links processor)
  • Time since previous offensive edit's reversion/rollback (this is distinct from TS_LP when the reversion/rollback wasn't the most recent edit before the possible offensive edit)
Most if not all of the above can be gotten from already-available data. Each of the above could be used as a separate input to the tree, and/or some sort of combining could be done, particularly for the ones that include each other. Allens (talk | contribs) 20:34, 25 May 2012 (UTC)
  • Article's total number of edits divided by its age (time since creation) - the article's edit density. From PAN-CLEF'11.
  • Article's total number of edits. From PAN-CLEF'11.
  • Article's age (time since creation). From PAN-CLEF'11.
  • Whether the article's title or categories match "\\b[Ss]chool" - especially of interest in combination with whether the edit is coming from an educational IP address. Allens (talk | contribs) 12:50, 26 May 2012 (UTC)[reply]
  • Whether the article's title or categories match "\\b(?:[Aa]lbum|[Ss]ong|South\s+Park\\b)"; primarily for use in combination with "inappropriate word" detection, as pointed out on the WP Talk:STiki page. Allens (talk | contribs) 19:58, 31 May 2012 (UTC)[reply]

Feature ideas: From 3rd parties

Note that WikiTrust is not currently working very well (as of 25 May 2012).

  • See what abuse filters (edit filters) were set off by an edit, by adding "abuselog" to "list" with "query". This may need to be limited to abuse filters that aren't changing much (and whose changes can be publicly tracked, as in not using private abuse filters). One interesting one is deletion of an entire section. Allens (talk | contribs) 01:52, 26 May 2012 (UTC)
  • Use the ClueBot NG score as a feature for the STiki algorithm. This has been discussed before as a way to create a combined queue. Yaris678 (talk) 11:50, 28 May 2012 (UTC)[reply]
    • This would undoubtedly rank as a very high feature. But I have two thoughts: (1) Could the feature be so influential that it essentially makes the "STiki (metadata)" queue a mirror of the CBNG one? Part of the attractiveness of the metadata queue right now is that it captures a distinct subset of the problem space. (2) We would require two ADTrees in the case of CBNG failure. There is a guarantee we can calculate every other feature discussed on this page assuming the en.wp API is functioning. But CBNG can go down separately from that, and we wouldn't want that to ruin our logic. West.andrew.g (talk) 14:50, 28 May 2012 (UTC)
      • I agree with both of these points, which is why it would have to be done as a separate queue. You have the slot for that queue waiting in the interface! Yaris678 (talk) 16:08, 28 May 2012 (UTC)
      • Admittedly I am getting away from features here, but... Another way to create a combined queue would be to multiply the odds of vandalism generated by the two queues by each other (odds_combined ∝ odds_STiki × odds_CBNG). That would be much more computationally efficient than running a parallel ADTree. Strictly speaking, the results should be divided by the expected value of the odds if the two measures are parallel, or square-rooted if they are orthogonal... but we are only using the result for ranking purposes, so it doesn't matter that we don't know the extent of orthogonality. Yaris678 (talk) 17:26, 28 May 2012 (UTC)
      • A third way to create a combined queue... which takes advantage of your exact points about ClueBot NG scores being very informative and the STiki approach capturing a distinct subset... Use a regression tree, as opposed to an ADTree. Make it calculate the odds of vandalism as a linear function of the odds derived from the ClueBot NG probability, the coefficients of the linear function being output by a decision tree informed by metadata etc. Yaris678 (talk) 17:08, 29 May 2012 (UTC)
  • Compare the edit to User:Lupin/badwords as happens in User:Lupin/Anti-vandal tool. This looks similar to some of the ideas that Allens has suggested but it has been worked on for a while so is quite extensive. Yaris678 (talk) 17:53, 28 May 2012 (UTC)
Good thought. Lupin's badwords list is rather mixed in and of itself, though (and the STiki server is already looking for, for instance, repeated characters via a hardwired function) - "chlamydia" should count a lot less than "poo", for instance. I'm also uncertain how stable it would be as a feature if it was used directly. Dividing it into categories is preferable:
  • to allow the tree to figure out how much to weigh each part of it (which may be affected by the other inputs to the tree);
  • to allow for existing legitimate text on the page in one category but not another; and
  • to allow for using different categories on edit summaries and/or usernames (for instance, the smileys are fine in edit summaries).
The regexes on it can also be considerably combined for greater efficiency of parsing. Allens (talk | contribs) 19:18, 28 May 2012 (UTC)
Categorising is an interesting issue. Perhaps more of an art than a science. You don't want too many categories because if you get ones that are never/rarely used then the machine learning algorithm won't be able to learn. Here's a suggested split:
  1. Words that are almost never encyclopedic.
    • e.g. The harshest swear words, smilies.
      • Although smilies are OK in edit summaries. Allens (talk | contribs) 11:18, 30 May 2012 (UTC)
      • Yes. I think it would make sense to double up on the categories. Have (say) 4 categories for edit content and 4 categories for edit summary. The two sets of categories would be very similar but there would be some differences. Smilies are an example; messed-up parentheses are another example. Yaris678 (talk) 13:39, 30 May 2012 (UTC)
  2. Words that are more likely to be non-encyclopedic than encyclopedic
    • e.g. Medical terminology for genitalia, peacock terms
  3. Other words common in vandalism
    • e.g. Words that can be innocent but are also slang for something less innocent. Maybe even something like "bad" would be included.
  4. Non-word features common in vandalism
    • e.g. Messed up parentheses.
I was also thinking of cleverer ways to do it... but I ended up with something that seemed a bit like CBNG... so I thought why not just use the CBNG score? (Already suggested above)
Yaris678 (talk) 17:32, 29 May 2012 (UTC)
  • Possibly improve the revert detection by using the algorithm described in this paper by Flöck, Vrandecic and Simperl mentioned in this Signpost article. This would improve TS_RBU and REP_USER. I have quickly scanned it and it looks impressive. I'm guessing it will take considerably longer than the current edit-summary-based method... but the authors do seem to be taking steps to improve the speed. Yaris678 (talk) 10:37, 30 May 2012 (UTC)
Good thought. I'm not sure how to access the MD5 hash mentioned; the API doesn't say anything about it. Allens (talk | contribs) 11:18, 30 May 2012 (UTC)
Its the "sha1" parameter of "prop=revisions" and "rvprop" of the API that you are looking for. Why they didn't the use the word "hash" somewhere in the region to make it more searchable is beyond me. This was integrated last year after a lot of technical mailing list traffic about the possibility. West.andrew.g (talk) 12:32, 30 May 2012 (UTC)[reply]
  • Another way of improving detection/classification of reverts, but still based on edit summaries. meta:WSoR_datasets/reverted talks about D_LOOSE and D_STRICT. It's a bit vague but I get the impression that these are regexes that determine, from an edit summary, if an edit is likely to be a revert of vandalism. I am guessing that D_STRICT has more false negatives and fewer false positives than D_LOOSE. Both could be used to inform revert-based features. I would ask User:Steven (WMF), or possibly User:EpochFail, if they can tell you the regexes. Yaris678 (talk) 13:13, 18 December 2012 (UTC)

Feature ideas: Other

  • Comment length without section header (PAN-CLEF '11, Allens)
  • Whether edit was sectional or full-article (PAN-CLEF '11, Allens)
  • For IPs, whether the edit was done on a weekday or weekend, according to the country and the day of the week. Weekend should be useful for this. (I may additionally count Fridays after 5:00 PM as weekend in western countries.) Everything needed for this information is already being gathered. Allens (talk | contribs) 23:35, 25 May 2012 (UTC)

References

  1. ^ Jones, John (April 2008). "Patterns of Revision in Online Writing: A Study of Wikipedia's Featured Articles". Written Communication. 25 (2): 262–289. doi:10.1177/0741088307312940.
  2. ^ Jurgens, David; Lu, Tsai-Ching (2012). Temporal Motifs Reveal the Dynamics of Editor Interactions in Wikipedia (PDF). ICWSM. Retrieved 30 May 2012.