User talk:TerryE/Supplemental SPI analysis

From Wikipedia, the free encyclopedia

I am watching this page so feel free to add comment or critique on this paper. -- TerryE (talk) 18:49, 6 May 2010 (UTC)[reply]

Comments by Arthur Rubin[edit]

Arthur posted the following comment on a related SPI, and since the SPI page does not lend itself to threaded discussion, I've duplicated his comments here:

"It should be pointed out that K-S tests generally may deduct some correlation if there is any pattern, such as U posting from work and D from home in the same time zone, even if there is no other pattern. If you wish to E-mail me your full analysis, I'll comment on it, in addition to the fact that if your analysis could produce "positives" on one of many analyses, the probability of a "false positive" is the probability that any of the analyses produce a significant result, even those analyses similar to, but not discussed in, the overall analysis."
Arthur, my write-up is in this cover article. If you are familiar with Perl then you can read the script and run the analysis yourself. I can send you the actual intermediate files / spreadsheets if you want; just PM me so I know where to send them. If you want extra intermediate results then I will need to modify the code, rerun the script myself and send you the results.
I would query your first statement. The 2-sided KS test investigates the hypothesis that two data samples are drawn from the same distribution, and the KS statistic allows you to put a confidence level on accepting or rejecting that hypothesis. As my paper describes, the two samples are a base sample, generated from the "turnaround intervals" between the two users (each datum is one such interval), and a reference sample, generated by shifting one user's post times by an integral number of weeks. No assumptions are made other than that this time-shift doesn't fundamentally alter the distribution, and in particular no assumptions are made about actual posting patterns.
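For anyone unfamiliar with the mechanics: the 2-sided KS statistic is simply the largest vertical distance between the two samples' empirical CDFs. A minimal sketch in Python (the actual analysis was a Perl script, so this is an illustrative re-implementation rather than the original code, and the function name is mine):

```python
import bisect

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the largest vertical
    distance between the empirical CDFs of the two samples."""
    xs, ys = sorted(xs), sorted(ys)
    d = 0.0
    for v in xs + ys:
        fx = bisect.bisect_right(xs, v) / len(xs)  # ECDF of sample 1 at v
        fy = bisect.bisect_right(ys, v) / len(ys)  # ECDF of sample 2 at v
        d = max(d, abs(fx - fy))
    return d

# Identical samples give D = 0; completely separated samples give D = 1.
```

For significance, D is compared against a critical value that scales as sqrt((n+m)/(n·m)); equivalently, the scaled statistic D·sqrt(n·m/(n+m)) is compared to the tabulated cα value, which is the scale on which figures like 0.6-0.8 and 3.28 below live.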
The efficacy of this approach depends on the validity of the assumption that this time-shift will consistently leave the underlying distribution invariant if the users are independent. That assumption does need to be challenged: it could fail in certain circumstances, e.g. if a user was working shifts and his / her posting times therefore varied fundamentally from week to week. This is why I tried a number of controls. In practice the test worked well for these, with a cα of 0.6-0.8, which is a pretty good result, showing that the test discriminates for controls. I am trying another batch of 10 as I write this.
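The reference-sample construction itself is mechanically simple: shift one user's timestamps by a whole number of weeks, which preserves that user's time-of-day and day-of-week rhythm while destroying any real-time coordination with the other user. A sketch (names and units are my own illustration, assuming Unix-epoch seconds):

```python
WEEK = 7 * 24 * 3600  # one week in seconds

def shift_weeks(times, k):
    """Shift every post time by k whole weeks (k may be negative).
    The user's weekly posting pattern (time-of-day, day-of-week)
    is unchanged, since the shift is an exact multiple of a week."""
    return [t + k * WEEK for t in times]
```

Because the shift is a multiple of WEEK, each timestamp's position within the week (t % WEEK) is invariant, which is exactly the property the control relies on.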
However, the 3.28 figure for U vs D is just off the end of the scale. Unlike the controls that I've tried, these are fundamentally different distributions. If you look at the data, the main reason is that in the base case there is a minimum 52 mins and typically up to 90 mins greater difference in these turnaround times than you would expect. Shift one of the data sets by 1-3 weeks either way and this gap vanishes. Only when you sync the weeks up do we have the situation that if D goes off the air it's a minimum of 52 minutes later before U starts posting, and vice versa. This, plus the strange and convenient juxtaposition of events, really needs explaining.
The X data is more tenuous. << comment removed >>
My main reason for developing this approach was as a general analytic technique. I suspect that this type of sleeper / parallel identity is a lot more common than we would care to admit. I was tossing around how to introduce this as a possible approach without pinning it down to specific users which is why I made the user IDs in the paper anonymous when I wrote it.
Anyway, have a browse of the article and if you want the data, let me know your preferred format: tab-separated text, Excel, Calc, ... -- TerryE (talk) 23:27, 6 May 2010 (UTC)[reply]

Needs more maths![edit]

Seriously, it’s much easier to build a model if you start thinking in terms of mathematical symbols. Then most of the controversies can be easily spotted and eliminated. In your case you start by stating that there are two individuals, U and D, who post their messages at times u_1, u_2, … and d_1, d_2, … respectively. Now you state that the time intervals between posts “follow some weekly pattern” and you assume that this weekly pattern is constant from week to week. Then you collect 7 weeks' worth of data and attempt to estimate the empirical distribution functions and compare them between individuals U and D. Actually, the K-S test skips the estimation part and goes directly to the comparison step.

Unfortunately what you are actually estimating is not what you think you are estimating. If you have “some weekly pattern”, it means that at any given time of the week the distribution of the time intervals may be arbitrary, and in fact you have at most 7 data points to estimate it. So when you use the K-S test, you are pooling all the data together, which is equivalent to disregarding the aforementioned “weekly pattern”.

From a general standpoint, the K-S test measures whether two distributions are identical. I don't see a sound reason why a person and his sock puppet would have identical distributions; I would expect them to be different. Say, one username may be used to perform routine WP edits, such as minor spelling corrections, copy-edits, tending to your favorite articles, etc., while the sock puppet would be used for altogether different activities: disruptive edits, vandalism, flaming the talk pages, and so on. These two different activities may have different time patterns to them. What you really need to test is not the equality of distributions, but rather their dependence or independence.  // stpasha »  01:29, 7 May 2010 (UTC)[reply]
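One way to make the "dependence, not equality" point concrete is a surrogate-data test: keep each user's marginal weekly pattern fixed, destroy any dependence by week-shifting one user's timestamps, and ask how extreme the observed statistic is relative to the shifted surrogates. A hypothetical sketch in Python (the statistic, names, and parameters here are my own illustration, not anything from the paper):

```python
import random

WEEK = 7 * 24 * 3600  # one week in seconds

def surrogate_pvalue(stat, a_times, b_times, n_surrogates=199, max_shift=26, seed=1):
    """Estimate how often a week-shifted surrogate (dependence destroyed,
    weekly pattern preserved) scores at least as high as the observed data.
    `stat` is any coordination statistic where larger means more coordinated."""
    rng = random.Random(seed)
    observed = stat(a_times, b_times)
    shifts = [k for k in range(-max_shift, max_shift + 1) if k != 0]
    hits = 0
    for _ in range(n_surrogates):
        k = rng.choice(shifts)
        surrogate = [t + k * WEEK for t in b_times]
        if stat(a_times, surrogate) >= observed:
            hits += 1
    return (hits + 1) / (n_surrogates + 1)  # add-one so p is never exactly 0
```

A small p-value then means the real data are more "coordinated" than almost every surrogate in which the users are, by construction, independent.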

Hi, thanks for the feedback. Unfortunately for some reason I missed this on my watchlist. Yes, the article could have been better written; it was a rough draft that I somewhat hastily decided to move into WP after certain events. In retrospect, this was a mistake. However I do want to comment on some of the general criticism of the method itself in slow time. The analytic technique is a variant of one that I've worked with elsewhere professionally. I would prefer to discuss general users, say A and B (U and D were two specific users). Yes, I am collecting time interval data for the transition interval, but for this test the data was collected over a 2-year period. I agree that the (2-sample) KS test applies to the comparison of two samples and not to the methods for measurement of individual data within those samples. As you say, if I am comparing two roles then the events associated with these respective roles would reflect the nature of those roles, which is why I can't make general assumptions about posting patterns within roles. What I am looking at is an estimate of the transition time between the two roles, that is, their hand-off interval if they are two coordinated individuals or the role-switch if it is one individual executing the two roles. My base dataset is a set of "hand-offs" from A to B taken over a two-year period, say, and I want to test the distribution of this hand-off. However, there isn't a sound basis for predicting a parametric model for this, so can I derive an alternative estimator? I want to discuss the approach in more detail and avoid commenting on specific dataset pairs. Give me a week or so to think about this and I'll post back some supporting data for a general population. -- TerryE (talk) 10:12, 14 May 2010 (UTC)[reply]
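The hand-off dataset described above can be extracted mechanically: merge the two users' timestamps into one event stream, and record an interval at every A-to-B transition (the gap between A's last post and B's first following post). A sketch of that extraction, again in Python rather than the original Perl, with names of my own choosing:

```python
def handoff_intervals(a_times, b_times):
    """Return the A-to-B hand-off intervals: for each point where the
    event stream switches from user A to user B, the gap between A's
    last post and B's first following post."""
    events = sorted([(t, 'A') for t in a_times] + [(t, 'B') for t in b_times])
    intervals = []
    last_a = None
    prev = None
    for t, who in events:
        if who == 'B' and prev == 'A':
            intervals.append(t - last_a)  # A just went quiet, B just started
        if who == 'A':
            last_a = t
        prev = who
    return intervals
```

Swapping the argument order gives the B-to-A hand-offs, so both directions of the "52-minute gap" observation can be examined from the same merged stream.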