Friday was a bit of a headache. I woke up confronted with ~150 trackback spams on my Movable Type 2.51 blog. The day ended with a painful 'upgrade' to MT 3.0D and ultimately necessitated some band-aid code written in Java and Perl. This post is a brief (!) recap of the events and some details on how the code helped fix some problems (and the code itself, of course). I'm posting this in case it helps someone else stuck in a similar morass.
The trackback spamming itself is unremarkable. It was bound to happen sooner or later. Trackback pings by their nature shouldn't be fettered by authentication or validation systems (such as
typekey or
sapience) and are thus vulnerable. Trackback throttling can help. I think the real fix is a change in the very fabric of blogs and the rest of the Internet, but that's a post for another time.
Still, it caused me to finally decide to upgrade my MT installation to 3.0, since that would make it easier to delete the pings. Following the directions, I copied all the 3.0 upgrade files over the current 2.51 equivalents, and since I was using the Berkeley DB for data storage rather than MySQL, I did "not need to run any special upgrade scripts". However, I was no longer able to log into my MT admin page. I found an error in the log that said something along the lines of an incompatibility in the Data::Dumper module. This led me to believe that the 3.0 data interface was having problems reading the existing data, probably, I reasoned, due to a version mismatch.
This is where I made my second stupid move (I'll get to the first in a moment). I started to second-guess whether I had actually started at version 2.51. Didn't I upgrade to 2.61 a couple of months ago? Yes, yes I did, I remember it. Surely that's the problem. The upgrade directions for a 2.61->3.0 upgrade call for running one of those upgrade scripts, which must convert the data to the 3.0 format. So I ran mt-upgrade26.cgi and watched as the 'Upgrading blah blah' messages scrolled past, finishing with a satisfying exclamation of "Upgrade completed successfully!"
Ah sweet, sweet success. Sure enough, I could now log into my MT admin panel. The next step was to click on the Trackback menu option so I could mass delete all those spam pings. My satisfaction quickly dissipated in a flurry of error messages. More Data::Dumper issues. In fact, I found that although I could now log in, I could access almost none of the actual blog data: templates, comments, posts. All of it as inaccessible as the
Voynich Manuscript.
At a loss, I started to double-check my 2.61 conclusion, and of course it came up wrong. I actually -had- been at 2.51. The MT installation I had upgraded to 2.61 a couple of months ago was Mie's blog
Kokochi. And here's where I realized what my first stupid move was.... (all together now!) ...I hadn't backed up my Berkeley DB data directory. I had made an export of my blog and I had backed up the actual blog web directory with all the shtml files, but I had completely forgotten about the Berkeley DB directory where all of the juicy MT content is really stored. So I no longer even had the option of restoring my system back to its original state, because running that upgrade26 script had altered the Berkeley DB data to the 3.0 format.
At this point I could see only one way out. I had to do a fresh install of MT 3.0 (not an upgrade) and then import my old blog data into a completely new blog. This was relatively easy. I was able to restore my old web directory and connect it to the old Berkeley DB, and although I couldn't rebuild or edit posts, I could get the templates open, which was all I really needed to copy over to the new MT 3.0 installation (which was using a different Berkeley DB data directory, by the way).
So now I had a brand new MT 3.0D blog, but the problem was the URLs. Previously my blog post IDs were around the 1250 range, so the permanent URL for my most recent posts was something like http://AkuAKu.org/archives/001254.shtml. Now that I had imported all my posts into a new blog, all their IDs got renumbered, so the same post that used to be http://AkuAKu.org/archives/001254.shtml was now something like http://AkuAKu.org/archives/000456.shtml. SUX0R! I have a number of posts that are linked to from other websites, and all of those links were now completely b0rk3d. I had just contributed to web entropy.
MT 3.0 has a new feature for creating the URL for individual posts that uses the title and date of the post instead of the post ID number. I turned this on, so the post titled "Art of T-shirt Folding", which had the 2.51 URL http://AkuAku.org/archives/001217.shtml and then the 3.0 URL http://AkuAku.org/archives/000424.shtml, now had the URL
http://www.akuaku.org/archives/2004/05/art_of_tshirt_f.shtml. The benefits of this naming system are that the URL now tells you something about the post, and that if I were to export/import into a new blog again, the URL would stay the same. Of course, that wasn't helping with the export/import situation I had already created.
So this is where things got hacky. I wrote some code that examined the Apache HTTP access logs for AkuAku and determined which posts had a referer line (meaning someone at another web site had linked to them). Then, for each post that had a referral, the code would parse the old static HTML page to determine the posting date and title. Using this information it could construct the new URL. The output of this process was an
Apache RewriteRule directive that told Apache to take any request for the old URL and forward it to the new URL:
RewriteRule ^/archives/001217\.shtml$ http://www.akuaku.org/archives/2004/05/art_of_tshirt_f.shtml [R=301,L]
Hundreds of my old posts had a referer line somewhere in the log, but I decided to only enter rewrite directives for the ones that seemed most popular, in order not to bog down the Apache server with lots of rewrite pattern checks. So I only included the 100 most popular, which ranged from that T-shirt Folding post with 313 unique referrals down to around 12 referrals for the #100 post. Actually, I have no idea if that number is too many or too few, but it seems to be working OK.
The code is pretty simple in principle, but naturally the implementation was not a smooth road. I started with Perl and the
Apache::ParseLog module. I couldn't get it to work, and I still have no idea why; it would not load the custom combined log for my AkuAku virtual host.
The format of the Apache log is pretty simple though (space-delimited columns, of which I only needed the 7th and 11th: the URL requested and the referer URL, respectively), so I decided to skip the fancy CPAN module and just do a normal Perl line split to get the data. This worked fine, but I wanted to store the data in a hash table where the key is the requested URL and the value is an array of all the different unique referers for that post. Did I mention I'm not really a Perl programmer? I was unable to create that simple data structure in Perl. Maybe it can't be done, maybe I'm an idiot.
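In hindsight, I believe a hash of array references is what I was groping for. Something like the following sketch would probably do it (where 'access_log' stands in for wherever your combined log actually lives):

    #!/usr/bin/perl -w
    use strict;

    # Map each requested URL to an array of the unique referers that hit it.
    my %refs;
    open(my $log, '<', 'access_log') or die "can't open access_log: $!";
    while (my $line = <$log>) {
        my @cols = split ' ', $line;
        next unless @cols > 10;                       # skip malformed lines
        my ($url, $referer) = ($cols[6], $cols[10]);  # 7th and 11th columns
        $referer =~ tr/"//d;                          # the referer column is quoted
        push @{ $refs{$url} }, $referer
            unless grep { $_ eq $referer } @{ $refs{$url} || [] };
    }
    close $log;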
I -am- a Java programmer. So I switched modes and wrote a simple Java app that would do what I wanted, outputting a line consisting of the number of referrals and the URL for each requested URL. I only counted referrals that did not appear to come from yahoo/google/msn.com and other search engines, since I figured the search engines would soon reindex my site and fix their links on their own. I was really only concerned with fixing links coming from more static web sites (like other blogs).
This output was piped into a Unix "sort -n" command, which sorted the lines by number of referrals. I saved the sorted output into a data file.
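(My real version was the Java app linked at the bottom of this post, but the counting-and-filtering step boils down to something like this in Perl, continuing from the %refs sketch above; the engine list here is just illustrative:)

    # Print "count URL" for each post, skipping search-engine referers,
    # in a format that "sort -n" can chew on.
    my @engines = qw(google yahoo msn);
    for my $url (keys %refs) {
        my @external = grep {
            my $r = $_;
            !grep { index($r, $_) >= 0 } @engines;
        } @{ $refs{$url} };
        printf "%d %s\n", scalar(@external), $url if @external;
    }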
I wrote a Perl program that would read the sorted file, take the old URL, and construct a new URL by parsing the original static archives/*.shtml file for the title and date. I took educated guesses on how to construct the new URL from the title/date, because I quickly got bored of trying to locate the actual code MT 3.0 was using to do this. It was a trial-and-error method of comparing my output to the actual new URLs for a few dozen posts. After determining a new URL, the Perl code wrote the appropriate Apache RewriteRule directive to an output file. I copied the top 100 of those into the AkuAku section of my $APACHE_HOME/conf/httpd.conf file and finally washed my hands of the whole stupid affair.
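(stage2.pl is linked below, but the heart of it is the guessed-at naming scheme, which amounts to something like this. The strip-punctuation-then-underscore order and the 15-character cutoff are my reverse-engineered guesses, not MT's actual dirify code:)

    # My guess at MT 3.0's title-based file name: lowercase the title,
    # drop punctuation ("T-shirt" -> "tshirt"), turn spaces into
    # underscores, and keep only the first 15 characters.
    sub make_slug {
        my ($title) = @_;
        my $slug = lc $title;
        $slug =~ s/[^a-z0-9 ]//g;
        $slug =~ s/ +/_/g;
        return substr($slug, 0, 15);
    }

    # The title and date come from parsing the old static page (the exact
    # regexes depend on your templates). With those in hand, each line of
    # output is a RewriteRule:
    my ($old_id, $title, $year, $month) = ('001217', 'Art of T-shirt Folding', '2004', '05');
    printf "RewriteRule ^/archives/%s\\.shtml\$ http://www.akuaku.org/archives/%s/%s/%s.shtml [R=301,L]\n",
        $old_id, $year, $month, make_slug($title);

That spits out exactly the directive shown earlier, one per popular post, ready to paste into httpd.conf.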
If, buddha help you, you find yourself needing to do something similar, I'm posting the code I used below. I'm releasing it as-is. It worked for me, but it has some things coded into it that might be specific to my installation (like the format of the Apache log file, and perhaps some directory paths), so you might need to tweak it for yourself.
AnalyzeRefererLog.java
AnalyzeRefererLog.class
stage2.pl
Oh yeah, and in the end I was able to delete all the trackback spams pretty easily. Fuck you, spammer.