User:Enterprisey/AIV analysis/Appendix

This page contains random details about the AIV analysis so that you can more thoroughly check my work.

Trimming the overlap

The September 2023 analysis generated two files, https://apersonbot.toolforge.org/aiv-analysis/2022-09-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json and https://apersonbot.toolforge.org/aiv-analysis/2023-02-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json. These overlapped by about a month because I started the second job on February 1 to catch the change made to {{IPvandal}}. I removed the overlapping cases from the first file and uploaded the result (linked at the end of this section). Here's the Python session where I did the filtering:

aiv-analysis $ python
Python 3.10.10 (main, Mar  5 2023, 22:26:53) [GCC 12.2.1 20230201] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> a=json.load(open('2022-09-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json'))
>>> len(a)
17826
>>> b=json.load(open('2023-02-01T00:00:00Z--2023-09-01T00:00:00Z--cases.0.json'))
>>> len(b)
22559
>>> next(case['report']['aiv_removal_revid'] for case in a)
1107803846
>>> next(case['report']['aiv_removal_revid'] for case in b)
1136759073
>>> a[-1]['report']['aiv_removal_revid']
1142517023
>>> b_revids = set(case['report']['aiv_removal_revid'] for case in b)
>>> a2=[case for case in a if case['report']['aiv_removal_revid'] not in b_revids]
>>> len(a2)
14728
>>> a2[-1]['report']['aiv_removal_revid']
1136750374
>>> json.dump(a2, open('2022-09-01T00:00:00Z--2023-02-01T00:00:00Z--cases.0.json', 'w'))

As you can see, the task was straightforward: I built a set of the AIV removal revids in b, dropped every case in a whose revid was in that set to make a2, and wrote a2 to the new file. This removed 17826 - 14728 = 3098 overlapping cases.
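
For anyone who would rather rerun the filtering as a script than retype the session, here is a minimal sketch of the same steps. The function names trim_overlap and trim_files are made up for illustration and are not part of the analysis code; the only thing assumed about the data is the case['report']['aiv_removal_revid'] field used in the session above.

```python
import json


def trim_overlap(earlier_cases, later_cases):
    """Return the cases in earlier_cases whose AIV removal revid
    does not appear anywhere in later_cases."""
    later_revids = {case["report"]["aiv_removal_revid"] for case in later_cases}
    return [
        case
        for case in earlier_cases
        if case["report"]["aiv_removal_revid"] not in later_revids
    ]


def trim_files(earlier_path, later_path, out_path):
    """Load two case files, trim the overlap from the earlier one,
    write the result to out_path, and return how many cases were kept."""
    with open(earlier_path) as f:
        earlier_cases = json.load(f)
    with open(later_path) as f:
        later_cases = json.load(f)
    trimmed = trim_overlap(earlier_cases, later_cases)
    with open(out_path, "w") as f:
        json.dump(trimmed, f)
    return len(trimmed)
```

Calling trim_files with the two downloaded file names and the new file name should reproduce a2, assuming the files are unchanged since the session was run.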

Note that there is no gap between the two resulting files. To verify this, start at the last revid that I printed for a2, Special:Diff/1136750374, and step forward to the next instance of removed text. That is Special:Diff/1136759073, which is, as expected, the first revid that I printed for b.
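
Part of this check can be done offline. The sketch below (check_boundary is a hypothetical helper, not something from the analysis code) confirms that the trimmed file and b share no removal revids and that every revid in the trimmed file is lower than every revid in b. It assumes that revids increase over time. It cannot show that no removal edit was missed between the two files; for that you still need to step through the diffs as described above.

```python
def check_boundary(trimmed_cases, later_cases):
    """Summarize how two case lists meet at their boundary."""
    trimmed_revids = {c["report"]["aiv_removal_revid"] for c in trimmed_cases}
    later_revids = {c["report"]["aiv_removal_revid"] for c in later_cases}
    last_trimmed = max(trimmed_revids)
    first_later = min(later_revids)
    return {
        # Number of removal revids present in both lists (0 means no overlap).
        "shared_revids": len(trimmed_revids & later_revids),
        "last_trimmed": last_trimmed,
        "first_later": first_later,
        # True if the trimmed list ends strictly before the later list begins.
        "ordered": last_trimmed < first_later,
    }
```

If the files match the session above, running this on a2 and b should report last_trimmed as 1136750374, provided the cases in a2 are stored in increasing revid order.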

The resulting file is https://apersonbot.toolforge.org/aiv-analysis/2022-09-01T00:00:00Z--2023-02-01T00:00:00Z--cases.0.json.