Wikipedia:WikiProject Wikidemia/Quant/Security

Security Models for Readership Statistics

To begin with the most general definition of usage data, we would like to be able to know:

De-identifying Pageviews by individual, per article, per unit of time

Following the human subjects de-identification method, we must obscure the IP addresses of users in some way. A simple cryptographic hash function may be sufficient. It would prevent direct access to the access history of a single individual, but unless we increase the entropy of the rest of the available data, there are some cases in which security can be breached and an attacker can establish a user's identity.

  • An attacker could use data from isolated articles (which are rarely accessed) to potentially identify contributors. For instance, if an article receives very few readers (say, fewer than 1 visit/day), and an edit occurs in close proximity to one of those visits, then it would be possible to deduce a relationship between a hashed IP and an editor, and then use the hashed IP value to obtain further information about the user's browsing habits. Attacks might focus on user pages; it seems pretty likely that the most frequent visitor to a user page is the user associated with that page. (Can we verify this somehow?)
  • Furthermore, such an attack becomes more damaging the longer the hash function used to sanitize the data remains static. An attacker could then ascertain the identity associated with a hashed IP with greater certainty, and apply this knowledge to a broader swath of the data.
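
The first attack above can be made concrete on toy data. The sketch below is illustrative only: the record layout, the key, the 60-second window, and the names `hash_ip`, `link_editors` and `browsing_history` are all assumptions, not any actual Wikimedia log format or tool. It uses a keyed hash (HMAC-SHA-256) rather than a bare hash, since the IPv4 address space is small enough to enumerate; the linkage attack works against either.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical key, assumed to be known only to whoever publishes the data.
SECRET_KEY = b"example-key-not-a-real-secret"


def hash_ip(ip, key=SECRET_KEY):
    """Return a keyed pseudonym for an IP address (HMAC-SHA-256, hex)."""
    return hmac.new(key, ip.encode("ascii"), hashlib.sha256).hexdigest()


def link_editors(views, edits, max_views=1, window=60):
    """Guess which hashed IP belongs to which editor.

    views: list of (hashed_ip, article, unix_time) from the released log
    edits: list of (editor, article, unix_time) from the public edit history
    max_views: an article counts as "isolated" only if the log holds
               at most this many views of it
    window: maximum distance in seconds between a view and an edit
    """
    views_by_article = defaultdict(list)
    for hashed_ip, article, ts in views:
        views_by_article[article].append((ts, hashed_ip))

    guesses = {}
    for editor, article, edit_ts in edits:
        article_views = views_by_article.get(article, [])
        if not article_views or len(article_views) > max_views:
            continue  # page never viewed in the log, or too busy to be useful
        nearby = {h for ts, h in article_views if abs(ts - edit_ts) <= window}
        if len(nearby) == 1:  # exactly one candidate reader near the edit
            guesses[editor] = next(iter(nearby))
    return guesses


def browsing_history(views, hashed_ip):
    """Everything the released log reveals about one pseudonym, in time order."""
    return sorted((ts, article) for h, article, ts in views if h == hashed_ip)
```

Given one view of a rarely read article 30 seconds before an edit to it, `link_editors` returns that reader's pseudonym for the editor, and `browsing_history` then lists every other page the same pseudonym visited while the key stayed the same.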

NAT'ed/Firewalled users

With respect to this concern (and generally with respect to the need to obtain and represent consistent user identities using IP addresses):

  • I'm also interested to know how many users of Wikipedia are NAT'ed/Firewalled/DHCP'ed, and whether an IP address is truly a meaningful representation of identity. I feel that it almost certainly is in the short term (for most non-NAT cases), but the consistency of the relationship between an IP address and an individual decreases greatly with time.
  • To estimate the mean TTL of a user/IP address relationship, we could look at the number of IPs the average registered user uses in a given period of time. It would at least be a representative datapoint. Perhaps this is something we could request from the WM admins as part of this discussion.
  • That said, we probably should try to base the security model(s) on relatively certain aspects of use of Wikipedia. In all foreseeable future cases, users will be able to be identified in some way by the IP address that they use to access WP, even if that time period is very short. I merely mean to note that if the TTL of user/IP relationships is long, we may want to suggest a periodic change of deidentification hash function. This may be in order in any case.
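
As a rough sketch of the last two points: the first function below computes the mean number of distinct IPs per registered user per period, and the second derives a pseudonym that changes every epoch, so the same IP cannot be followed from one epoch to the next without the key. The record layout, the names `mean_ips_per_user` and `epoch_pseudonym`, the master key, and the choice of HMAC-SHA-256 are all assumptions made for illustration.

```python
import hashlib
import hmac
from collections import defaultdict
from statistics import mean


def mean_ips_per_user(records, period_seconds):
    """Mean number of distinct IPs seen per (user, period).

    records: iterable of (user, ip, unix_time) for logged-in activity.
    Raises statistics.StatisticsError if `records` is empty.
    """
    ips = defaultdict(set)
    for user, ip, ts in records:
        period = int(ts // period_seconds)
        ips[(user, period)].add(ip)
    return mean(len(seen) for seen in ips.values())


def epoch_pseudonym(ip, ts, master_key, epoch_seconds):
    """Pseudonym for `ip` that is stable within an epoch and changes between epochs."""
    epoch = int(ts // epoch_seconds)
    # Derive a separate key for each epoch from the (hypothetical) master key.
    epoch_key = hmac.new(master_key, str(epoch).encode("ascii"), hashlib.sha256).digest()
    return hmac.new(epoch_key, ip.encode("ascii"), hashlib.sha256).hexdigest()
```

If the first figure stays close to 1 over long periods, user/IP relationships are long-lived, which would argue for a shorter `epoch_seconds`. The cost is that no analysis can follow a reader across an epoch boundary.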

Thus, further efforts to sanitize the data

  1. Adding Gaussian noise to timestamps.
    1. We might have to establish and publish information about the function used.
    2. Depending on the variance of the noise, this may prevent certain kinds of analysis, such as:
      • the process of user response to news events,
      • or (more problematically) tracing the access paths of users to see how they browse the site,
      • and how this varies with the type of page they first reach (i.e., its size, "pagerank"/in-degree, number of editors, etc.),
      • (and, more speculatively, the real-world event which might have drawn them to the page, and the way that access patterns change with the type of news event that occurs),
      • looking at the probability that a visitor contributes to an article with respect to these other factors.
    3. Depending on its variance, Gaussian noise would also frustrate smaller-scale observations of usage, such as in infrequently accessed articles, and perhaps force researchers to focus only on the subset of important articles on the site.
  2. Hashing BOTH user IP and article name/ID.
    1. The aim here is to increase the complexity of the pattern-matching attack mentioned previously.
    2. Certain identifying qualities of an article would have to remain for some statistics to be drawn from the doubly obscured dataset. Some of these parameters, such as page age, size, number of contributors, and link connections to other pages, could be used to break the security measure by matching against the Wikipedia data dump. (I haven't checked, but revealing some of these characteristics may not meaningfully decrease the computational complexity of a pattern-matching attack.)
    3. Doing so would frustrate study of relationships between (previously mentioned) out-of-band events and in-band user behavior.
  3. Removing information about pages with a small number of accesses/edits.
    1. In this manner, the noise can be of lower variance and still be effective.
    2. Or the noise could be unnecessary.
    3. Users gain security because their behavior is partly obscured by their presence in a "crowd".
    4. However, this frustrates analysis of readership behavior on smaller pages, which may seriously limit studies that would like to verify claims about all of Wikipedia.
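
Two of the measures above, timestamp noise and suppression of rarely accessed pages, can be sketched in a few lines. This is only a minimal illustration: the record layout `(pseudonym, article, unix_time)`, the function names, and the parameters `sigma_seconds` and `min_views` are assumptions, and suitable values would have to be chosen (and probably published) by whoever releases the data.

```python
import random
from collections import Counter


def drop_rare_articles(records, min_views):
    """Keep only records for articles with at least `min_views` views in the log."""
    records = list(records)
    counts = Counter(article for _who, article, _ts in records)
    return [r for r in records if counts[r[1]] >= min_views]


def add_timestamp_noise(records, sigma_seconds, seed=None):
    """Shift every timestamp by zero-mean Gaussian noise with the given standard deviation."""
    rng = random.Random(seed)  # a fixed seed is for reproducible examples only
    return [
        (who, article, ts + rng.gauss(0.0, sigma_seconds))
        for who, article, ts in records
    ]
```

Applying `drop_rare_articles` first and `add_timestamp_noise` second reflects the point made above: once the isolated pages are gone, a smaller `sigma_seconds` may be enough, or the noise may not be needed at all.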



Readership per article, per unit of time

This is a more tenable security model for release of readership data to the public (or researchers).

Simply removing consistent references to individuals precludes the aforementioned pattern-matching attack. It also prevents every analysis which attempts to link access patterns across pages by identifying individuals.
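
A minimal sketch of this release format, assuming the raw log is a sequence of `(who, article, unix_time)` records and that counts are published per hour (both assumptions, as is the name `aggregate_views`):

```python
from collections import Counter


def aggregate_views(records, bucket_seconds=3600):
    """Count views per (article, bucket start time), discarding who made them."""
    counts = Counter()
    for _who, article, ts in records:
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        counts[(article, bucket_start)] += 1
    return dict(counts)
```

Nothing in the output ties one view to another, which is what rules out following a reader from page to page.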

One could still use the link topology of the pages to look at the relationship between changes in the access stream of a given page and the access rate of the pages linked to it, or linked to pages which link to it (etc.). This might be quite interesting. However, we won't be able to talk about individuals, and it will be impossible to look at the behavior of various classes of users.
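
One simple form of this analysis is to correlate a page's view counts with those of the pages it links to. The sketch below assumes equal-length lists of per-bucket counts for each article and a link table taken from the data dump; the names `pearson` and `neighbour_correlations` and both input layouts are hypothetical, and plain Pearson correlation is just one possible measure.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series, or None if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


def neighbour_correlations(series, links, article):
    """Correlate an article's view counts with those of each page it links to.

    series: {article: [views per time bucket]}, all lists the same length
    links:  {article: [articles it links to]}
    Linked pages with no view series are skipped.
    """
    return {
        target: pearson(series[article], series[target])
        for target in links.get(article, [])
        if target in series
    }
```

Shifting one series by a bucket or two before correlating would be a natural next step for asking whether a spike on one page precedes spikes on its neighbours.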

Furthermore, just having this data is preferable to no data. Security is relatively tight. Because the access history of each page is released separately, with nothing linking a visit to one page with a visit to another, this model prevents every attack I can think of which could be used to trace individual behavior. (Correct me if I'm wrong/if you can think of something which would still be open).