Wikipedia Vandalism


The following is a recently discovered entry I was planning on making back in 2006, when I had largely stopped posting to CQ. It refers to some code I wrote for detecting number changes in Wikipedia articles. It discusses grand plans I made for a version 2, which naturally never materialized. What I did end up writing is available here

The article references screenshots I never took, and ends abruptly with a parenthetical instruction to myself. I have no intention of fleshing it out, and am treating this as an archaeological find.

So, enjoy/scorn/ignore/what have you:

According to a Wikipedia article dated Sep 7, Abraham Lincoln died on April 15, 2005. This appeared in revision 74381589 of the article, and is shown in the screenshot below. About 3 minutes later, this act of vandalism was reverted, correcting the date to April 15, 1865. A version of the article from Sep 1, revision 73277864, listed Lincoln as the 6th president of the United States. That vandalism was reverted 46 seconds later.

These examples, along with being humorous, illustrate a widespread problem faced by Wikipedia: the assumption of good faith vs. internet users' class-clown mentality. Can we all hold hands and work together in harmony? Of course not, but if the number of inspired cooperators is high enough, the background noise of experimental malice can be somewhat quieted. This reminds me of a scene in Kim Stanley Robinson's "The Years of Rice and Salt": when a fire broke out in a Chinese city, the townsmen joined together to battle it in a series of bucket brigades, while in the background the police were busy catching and executing the looters who were stealing from the empty houses.

So what is a member of Wikipedia's "Counter-Vandalism Unit"? The bucket brigade putting out the fires? The police executing citizens with prejudice? The looters? [[The truth is, you're the weak, and I'm the tyranny of evil men. But I'm trying, Ringo. I'm trying real hard to be the shepherd.]]

Anyway, I stumbled across these examples of numeric vandalism because I was looking for something to do. I found a "Bot request" for a bot that would check page edits to see if only numbers had been changed. A user checking for vandalism might glance at such an edit and see nothing wrong -- no insertion of profanity, no maps replaced with pictures of penises, no references to the GNAA, etc. Just a little number change that could easily be overlooked as a legitimate correction. So the solution is to write a bot that flags number changes from new-ish or anonymous users that have not been reverted.

There are three basic problems with this.

1 - A bot can find numbers that have changed, but can't judge legitimacy.

A number change from a new user can be a correction. For example, revision 74324984 of the article on Alabama came from an anonymous user, and at first glance appears to be a real correction to land area measurements. Was it legitimate or vandalism?

2 - A number change that stays after another revision may or may not be vandalism.

Popular articles are updated, vandalized, and reverted very frequently. What should a bot do when it finds a revision with a number change followed by a revision with other changes? Do we assume good faith or assume that the second guy didn't see the vandalism? Should a bot care about a revert back to a potentially vandalized version?

3 - A revert doesn't always look like a revert.

How should a bot check for a nonstandard revert? E.g., I want to add some information about Honest Abe not mentioned in his article, and while I'm in there I change him from the 6th to the 16th president, assuming a typo. The article now has other differences in addition to the number change. Should the bot see the number change as a revert and ignore the prior vandalism, or assume the vandalism is still outstanding?

As I started my first attempt to attack the problem, I became aware of the sheer scale of the English Wikipedia project. A compressed archive of the text and revision histories of all the articles currently weighs in at 5 gigs. There is also an RSS feed of recent changes that shows hundreds of updates per minute.

I decided to use the large archive file, a static file I can play with locally. The version I downloaded is compressed in the 7-Zip format, which I had not previously heard of. Fortunately, there is a Linux version of the extractor that lets you pipe to standard out. Also fortunate: the file I chose to work with, enwiki-20060906-pages-meta-history.xml.7z, failed to extract properly, leaving only a small subset of the English Wikipedia articles and weighing in at a mere 20 megs.
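The extraction can then pipe straight into the checker, something like this (numcheck.pl is a placeholder name for my script; -so is 7-Zip's switch for writing extracted data to standard out):

  7za e -so enwiki-20060906-pages-meta-history.xml.7z | perl numcheck.pl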

The extracted file is in a very readable XML format, and is easy to parse. Here is a sample section of the file from Sep 6:

  <page>
    <title>AaA</title>
    <id>1</id>
    <revision>
      <id>233181</id>
      <timestamp>2001-02-06T20:07:40Z</timestamp>
      <contributor>
        <username>JimboWales</username>
        <id>479</id>
      </contributor>
      <comment>*</comment>
      <text xml:space="preserve">The use of this page is now discouraged.

</text>
    </revision>
    <revision>
      <id>10030713</id>
      <timestamp>2002-02-25T15:43:11Z</timestamp>
      <contributor>
        <ip>Conversion script</ip>
      </contributor>
      <minor />
      <comment>Automated conversion</comment>
      <text xml:space="preserve">#REDIRECT [[A]]
</text>
    </revision>


Since the file I will ultimately be using is very large, I didn't want to attempt a full XML parse. Instead, I relied on the fact that the file has very little tag overlap; that is, most tag names uniquely identify your position in the document structure. The above example can be parsed thusly:

page - title:
page - id:
page - revision1 - id:
page - revision1 - timestamp:
page - revision1 - contributor - username:
page - revision1 - contributor - id:
page - revision1 - comment:
page - revision1 - text:
page - revision2 - id:
page - revision2 - timestamp:
page - revision2 - contributor - ip:
page - revision2 - minor:
page - revision2 - comment:
page - revision2 - text:

So the only tag that implies more than one possible meaning is the <id> tag. Otherwise, every time I see <title>, I know I'm in a new page; every time I see <text>, I know I'm looking at a revision's text; etc. <id> is the only line where I need to know what section of the document I'm in. The revision number is the only ID I care about, and its <id> tag always immediately follows <revision>, so rather than doing full XML parsing, I can get away with tracking the current and previous lines as two string (scalar) variables, which should give a speed advantage.

Here is the parsing method I chose, using perl:

$line = '';
$intext = 0;
while (<>) {
    # Track the previous line, but not while inside a <text> block.
    $lastline = $line unless $intext;
    $line = $_;
    if ($line =~ m!</revision>!) {
        spitit() if matchcrit();   # report this revision if it matches
        clearmost();               # reset per-revision variables
    }
    if ($intext) {
        # Accumulate multi-line revision text until the closing tag.
        $current .= $line;
        $intext = 0 if $line =~ m!</text>!;
    } else {
        $title = $1 if $line =~ m!<title>([^<]+)</title>!;
        if ($line =~ m!<id>([^<]+)</id>!) {
            # <id> is ambiguous; the revision ID is the one that
            # directly follows a <revision> line.
            my $tmp = $1;
            $revision = $tmp if $lastline =~ /revision/;
        }
        $author    = $1 if $line =~ m!<ip>([^<]+)</ip>!;
        $author    = $1 if $line =~ m!<username>([^<]+)</username>!;
        $timestamp = $1 if $line =~ m!<timestamp>([^<]+)</timestamp>!;
        $comment   = $1 if $line =~ m!<comment>([^<]+)</comment>!;
        if ($line =~ /<text/) {
            # A <text> element can open and close on the same line.
            $intext = $line =~ m!</text>! ? 0 : 1;
            $current = $line;
        }
    }
}

I have three subroutines being called: spitit, which prints a summary of changes; matchcrit, which compares the text of this revision with the previous revision; and clearmost, which resets variables to the empty string so one revision doesn't reuse the last revision's values. The rest of the code updates variables as it sees them.
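For illustration, the helpers might look something like the sketch below. This is not the code I actually wrote: the digit-skeleton comparison in matchcrit is just one plausible criterion, and it assumes clearmost stashes each revision's text in $previous for the next comparison.

# Illustrative sketch only; $previous holds the prior revision's text.
sub matchcrit {
    return 0 unless defined $previous && defined $current;
    my %old = map { $_ => 1 } split /\n/, $previous;
    for my $new (grep { !$old{$_} } split /\n/, $current) {
        next unless $new =~ /\d/;              # only care about lines with numbers
        (my $newskel = $new) =~ s/\d+/#/g;     # mask the digits
        for my $oldline (keys %old) {
            (my $oldskel = $oldline) =~ s/\d+/#/g;
            return 1 if $oldskel eq $newskel;  # same line, different numbers
        }
    }
    return 0;
}

sub spitit {
    # (the real version also printed the changed +/- lines)
    print "Article: $title\nTimestamp: $timestamp\nRevision: $revision\n",
          "Contributor: $author\nComment: $comment\n\n";
}

sub clearmost {
    $previous = $current;   # keep this revision's text for the next diff
    ($revision, $author, $timestamp, $comment, $current) = ('') x 5;
}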

After playing with some sample data, the three problems from above became very apparent. I settled on matching criteria I liked and found some interesting hits, like the Abe Lincoln ones above, along with a lot of what appear to be legitimate updates, plus changes to pixel sizes on images, changes to what size image a link points to, etc. Here's a sample of the output of my initial run:

------------------------------------------------------------

Article: Alabama
Timestamp: 2006-09-07T13:03:51Z
Revision: 74324984
Contributor: 212.147.51.217
Comment: Fixed total area, land area figures off by 100,000 sq. mi.

Changes:
- TotalArea = 135,775 |
+ TotalArea = 32,561 |
- TotalAreaUS = 152,423 |
+ TotalAreaUS = 52,423 |
- LandArea = 131,442 |
+ LandArea = 31,521 |
- LandAreaUS = 150,750 |
+ LandAreaUS = 50,750 |

------------------------------------------------------------

Article: Abraham Lincoln
Timestamp: 2006-09-07T19:15:51Z
Revision: 74381767
Contributor: 67.67.42.253
Comment:

Changes:
- | term_end=[[March 15]], [[1865]]
+ | term_end=[[March 15]], [[1861]]

------------------------------------------------------------

Article: Andre Agassi
Timestamp: 2006-09-07T15:52:20Z
Revision: 74348851
Contributor: 82.198.250.12
Comment:

Changes:
- |retired= [[September 3]], [[2006]]
+ |retired= [[September 3]], [[1999]]

------------------------------------------------------------

Article: Alaska
Timestamp: 2006-09-07T22:25:29Z
Revision: 74415298
Contributor: 24.12.2.95
Comment:

Changes:
- | Name = Alaska
+ | Name = Alaska9

------------------------------------------------------------

Article: Alaska
Timestamp: 2006-09-07T22:47:24Z
Revision: 74419077
Contributor: Jarfingle
Comment: rvv

Changes:
- | Name = Alaska9
+ | Name = Alaska

------------------------------------------------------------

Alabama seems not to be vandalism. Whether or not the numbers are correct is uninteresting, and I'm not going to look them up in another source. The comment is specific, and the changes do not involve simply removing the leading 1 from all the numbers. If this kind of subtle vandalism is abundant, I would be surprised... and unable to spot it programmatically. The goal is to spot impulse vandals, and this doesn't qualify.

The Abraham Lincoln change is vandalism, but is subtle. Unless you knew ahead of time the year his "term ended", or the fact that he couldn't have been assassinated earlier than 1863 (four score and seven years after 1776), this might slip by. Fortunately we are aided by the fact that the contributor is anonymous and the comment is blank, so the revision is more "high risk".

Andre Agassi is easy to spot as vandalism, as he has recently been in the news, making him a bigger target. He's in the news for retiring, so he was obviously still playing after 1999. A script has no way to correlate that information, but it could still flag the change as high risk because of the anonymous contributor and blank comment.

Alaska is obvious: a number appended to a name that previously contained none, an anonymous contributor, and no comment. Equally obvious is the revert, commented and contributed by a registered user.

So after analyzing the output, it becomes clear that the biggest problem is from anonymous users who don't leave comments, which is completely unsurprising. My "version 2" of the checker script would flag those as high risk, perhaps sorting the output so those are closer to the top. The low risk ones I could leave off entirely, or maybe have a flag in the script to display them on demand.

That leaves the biggest challenge of all: how to decide whether a high-risk change has really been reverted. In the same spirit as the high/low risk split, I can attempt to write code that decides "likely/unlikely". The Alaska case is simple: the changes in the two revisions are exact opposites, a true revert. The Alabama case is more difficult. Compare these two revisions, one from above and a new one:

------------------------------------------------------------

Article: Alabama
Timestamp: 2006-09-07T13:03:51Z
Revision: 74324984
Contributor: 212.147.51.217
Comment: Fixed total area, land area figures off by 100,000 sq. mi.

Changes:
- TotalArea = 135,775 |
+ TotalArea = 32,561 |
- TotalAreaUS = 152,423 |
+ TotalAreaUS = 52,423 |
- LandArea = 131,442 |
+ LandArea = 31,521 |
- LandAreaUS = 150,750 |
+ LandAreaUS = 50,750 |

------------------------------------------------------------

Article: Alabama
Timestamp: 2006-09-07T18:12:58Z
Revision: 74371031
Contributor: Anivron
Comment: Neither the prior edit nor the measures before were entirely correct; data now from [[List of U.S. states by area|here]]

Changes:
- TotalArea = 32,561 |
+ TotalArea = 135,765 |
- TotalAreaUS = 52,423 |
+ TotalAreaUS = 52,419 |
- LandArea = 31,521 |
+ LandArea = 131,426 |
- LandAreaUS = 50,750 |
+ LandAreaUS = 50,744 |
- WaterArea = 14,333 |
+ WaterArea = 4,338 |
- WaterAreaUS = 1,673 |
+ WaterAreaUS = 1,675 |
- PCWater = 3.19 |
+ PCWater = 3.20 |

------------------------------------------------------------


Let's ignore for now that we don't think revision 74324984 was vandalism. The same "number space" was altered, but the new values in the later revision are not the pre-74324984 numbers. This was a correction, not a revert. In addition, more numbers were altered. Is it necessary to keep a hash of revision numbers and their respective changes for a final analysis? If so, will that bring the script's speed to a crawl? Am I ultimately looking for pages containing any potential vandalism that has been overlooked, or am I only trying to see if the current version is high-risk?
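If I do keep such a hash, then spotting a true revert like the Alaska pair reduces to a cheap comparison. A sketch, with a hypothetical layout where each change records its removed and added lines:

# %changes maps revision => { removed => \@lines, added => \@lines }
sub is_true_revert {
    my ($prev, $this) = @_;
    # A true revert removes exactly what the prior change added,
    # and restores exactly what the prior change removed.
    return join("\n", @{ $this->{removed} }) eq join("\n", @{ $prev->{added} })
        && join("\n", @{ $this->{added} })   eq join("\n", @{ $prev->{removed} });
}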

I'm not going to be able to write strong AI in my free time, so the end script will not catch all possible numeric vandalism. After mulling these problems over, I've decided to approach the problem from a slightly different direction: every revision to an article carries risk. Using tactics similar to SpamAssassin's, I can weight potential risks. Off the top of my head, here are some weights I could apply (a rough sketch of the scoring follows the list):

Contributor is anonymous +1
Comments are blank +1
Only one line is changed +1
Change is true revert -2 (changes are exactly opposite of previous change)
New number is current year +1
New number is more digits +1
New number is "rounder" +1 (e.g., 312 becomes 100)
New number is repeating digit +2 (e.g., 312 becomes 999)
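In code, the scoring might come out something like this. The subroutine and its inputs are hypothetical, and the "rounder" and repeating-digit tests are first guesses at how to formalize those rules:

# Hypothetical scorer for a single number change.
sub risk_score {
    my ($author, $comment, $old_num, $new_num, $nlines, $is_revert) = @_;
    tr/,//d for $old_num, $new_num;   # "135,775" -> "135775"
    my $score = 0;
    $score += 1 if $author =~ /^\d+\.\d+\.\d+\.\d+$/;   # anonymous (IP) contributor
    $score += 1 if $comment eq '';                      # blank comment
    $score += 1 if $nlines == 1;                        # only one line changed
    $score -= 2 if $is_revert;                          # exact opposite of prior change
    $score += 1 if $new_num eq (localtime)[5] + 1900;   # new number is the current year
    $score += 1 if length $new_num > length $old_num;   # more digits than before
    $score += 1 if $new_num =~ /^[1-9]0+$/;             # "rounder" (e.g., 312 -> 100)
    $score += 2 if $new_num =~ /^(\d)\1+$/;             # repeating digit (e.g., 312 -> 999)
    return $score;
}

Sorting the final report by descending score would then float the riskiest changes to the top, per the "version 2" plan above.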

With these hastily slung-together weights, here is how the revisions above score:

Alabama "figures off by 100,000" change: 1
Alabama "Neither the prior edit nor the measures before": 2
Abraham Lincoln: 3
Andre Agassi: 3
Alaska: 3
Alaska revert: -1

Not bad. The three changes I'm most suspicious of have the highest scores. A potential problem, if I continue with this idea, is memory usage. If I create a hash of changes with their revision numbers, authors, comments, article text, and risk scores, and then sort everything, I run the risk of using too much memory. (# of revisions here)
