Christian Semmler

Archiving a vBulletin forum using HTTrack and Netlify

Apr 7, 2023
/assets/images/blog/blog-post-8.jpg
Rymdreglage at work in Terminator 2 - 30 Years

For many years, I maintained and hosted a vBulletin forum that I created to provide a platform for discussion and a central repository of information for anyone interested in the vaultmp project I was working on. Development has since ceased due to time constraints on my part, and eventually, in 2015, I took the forums offline.

Almost 15 years ago, I rather randomly decided that multiplayer functionality in the PC version of Fallout 3 would be pretty awesome, so I started going down the rabbit hole of developing a mod that would implement precisely that: allowing players worldwide to connect and gather in an instance of our favorite post-apocalyptic video game. The details of this endeavour are worthy of a separate blog post in the future and shall not be covered today.

The project attracted a significant number of users and other interested parties. A YouTube video demonstrating my first (very poor) attempt at player synchronization reached over a million views, with thousands of comments and people trying to contact me to ask whether there was a way for them to contribute (the original video in question is gone because my channel was banned over unrelated, alleged copyright infringements; here’s a re-upload).

A brief history

To accommodate the volume of interest, I purchased a vBulletin 4 forum license and set up the system on one of my web servers. At the time, vBulletin was a popular forum software employed by many large-scale sites, some hosting enormous communities. It runs on the LAMP stack, which many web developers from the 2000s and early 2010s are likely familiar with.

Initially, the forums mainly served as a platform for me to share my development progress. Over time, thousands of users registered and contributed their ideas and experiences. As the scope of the project kept expanding, I added wiki functionality to centralize all information regarding the mod and provide documentation on how to use the software and its facilities.

Due to the complex and time-intensive nature of the project, development eventually ceased around 2014, since my full-time job as a software engineer didn’t leave me enough time to make any further meaningful progress on the mod. In 2015, I took down the forums and all of their resources; the constant, necessary upkeep of the system proved to be too much of a nuisance (spam bots were rampant, and exploits in the vBulletin software kept popping up).

Two weeks ago, casually reminiscing about the past, I felt an itch to revisit the old forum and its contents, mainly out of nostalgia. Fortunately, I had made a full backup of the system and all of its components, so I dug up the tar.gz archive, extracted it on a Linux VM, and set about putting things back together.

Replicating the LAMP stack

Expectedly, the server software necessary to run the forums is not readily available on modern Linux distros anymore, at least not from the default package sources. vBulletin 4 requires the Apache 2 web server in combination with PHP 5, which is long EOL. The easiest way to recreate the setup I needed was an appropriate Docker image. A quick search yielded the mattrayner/lamp repository, which was exactly what I was looking for.

Following the repository’s instructions, I managed to spin up an instance with this command:

docker run -i -t -p "80:80" -p "3306:3306" -v ${PWD}/app:/app -v ${PWD}/mysql:/var/lib/mysql mattrayner/lamp:latest-1404-php5

The app folder contained the vBulletin application, and the mysql volume allowed me to persist the MySQL database between subsequent container instantiations of the image.

Once the container was up, I had to re-create the MySQL database from the SQL dump I found in my backup. This was straightforward using the MySQL CLI:

mysql --host 0.0.0.0 --port 3306 -u admin -pJeaZPlt5eTFF < ./vaultmp.sql

The next step was to fix the vBulletin configuration to point to the new database. Moreover, since the forum assumed it would run on a certain domain (vaultmp.com), I adjusted my local hosts file so that vaultmp.com would resolve to localhost. This was easier than fixing the various database tables containing forum posts that referenced the site with absolute URLs using that domain.
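On a typical Linux system, the corresponding entry in /etc/hosts looks roughly like this (covering both the bare and the www variant of the domain, since the forum uses both):

127.0.0.1    vaultmp.com www.vaultmp.com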

Once all of that was done, I opened up http://localhost and voilà, I was greeted with the familiar landing page of my old forums!

Creating a static, read-only copy

For a vanishingly brief moment, I contemplated simply hosting this stack publicly again, albeit with forum functionality disabled so that no account registrations or post creation would be possible, and being done with it. However, that would have brought back all the old problems mentioned above, and probably amplified them, since most of the software involved is no longer maintained and is thus likely riddled with known security issues and other vulnerabilities.

Clearly, it was preferable to generate a replica of the forums consisting only of static HTML and other assets, avoiding the need for a backend that dynamically generates the pages, so I started researching various web archiving technologies. My goal was to reinstate the forums at the original domain, viewable by all, with the experience being as close to the original as possible.

The first archiving method I came across was based around Web ARChive (WARC) files. This file format is also used by the amazing Internet Archive, so it seemed like an obvious choice. Various tools exist to create, manipulate and view these files. The caveat, though, was that WARC files cannot easily be used to host a replica without running intermediary viewing software.

Other options included plain wget to crawl the site, WAIL, HTTrack and a few others. After evaluating them, I settled on HTTrack, the main reasons being its apparent maturity and the fact that creating full, offline copies of websites is its core purpose.

Preparing the site for HTTrack

The main issue with turning a dynamic site like a forum (with thousands of posts) into a completely static one is that it inevitably generates a very large number of files. Each variation in the parameters of a particular URL ends up as its own HTML page. For instance, a single thread of posts may be sorted, viewed and filtered in different ways and combinations, and a thread may span multiple pages. A single thread with 10 pages can easily generate several hundred files because of that (see the examples below). This is the price I had to pay with my approach; while it’s certainly possible to reduce the redundancy again with some effort (essentially reintroducing some sort of dynamic composition of HTML fragments), I opted to keep it simple.
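To make this concrete, here are a few illustrative URL variants that all render (parts of) the same thread; the parameter names below are my own examples in the spirit of typical vBulletin 4 URLs, not taken from the actual crawl:

/showthread.php?373-Some-thread-title
/showthread.php?373-Some-thread-title/page2
/showthread.php?373-Some-thread-title&highlight=hamachi
/showthread.php?p=1234&viewfull=1

Each of these ends up as a separate HTML file in the static copy.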

While working out a functioning set of instructions for HTTrack, I came across various bugs and issues that needed to be fixed:

Infinite URL recursion

The wiki part of the forums was created using VaultWiki. The particular version installed was quite outdated (from today’s perspective) and came with various issues. One of them manifested itself shortly after I started crawling the site with HTTrack for the first time: some URLs on the dynamically generated wiki were improperly formatted (using a mix of extraneous URL encoding and HTML escaping).

When the HTTrack crawler followed these URLs, the server erroneously accepted them without rewriting and returned a page. This page in turn improperly formatted its embedded URLs again based on the current page URL, adding further escape sequences, which HTTrack (correctly) identified as unique, new URLs. For example:

/showwiki.php?title=OnPlayerDisconnect&amp;do=history&amp;action=revision&amp;oldid=551

would lead to:

/showwiki.php?title=OnPlayerDisconnect&amp;amp;do=history&amp;amp;action=revision&amp;amp;oldid=551

and so on (notice the spurious amp; sequences). This caused HTTrack to never terminate (at least not until resource exhaustion or until running into URL length limits). Eventually, I identified the bugs in the VaultWiki plugin code and templates and managed to get rid of this issue.

Issues with minified CSS

The first iteration of the HTTrack output looked odd on some pages: some images referenced in the forums’ CSS were missing. After some digging, it turned out that HTTrack didn’t properly parse the CSS. The reason, which took a few hours to figure out, was that the CSS was minified. I’ve documented my findings in this GitHub issue and hopefully will get around to fixing the bug in the library at some point.

For this archive, I opted to simply disable minification of the CSS, which resolved the issue.

Excluding superfluous pages

As mentioned before, generating static pages from a dynamic website can cause quite a bit of bloat in terms of HTML files. In the case of this vBulletin forum, it turned out that some areas yielded way too many files for my taste, most notably member.php, which, due to its multiple tabs and paging functionality, ended up being the source of several hundred HTML files for a single user. I ended up at least partially excluding some of those URLs to minimize the volume of generated pages.

Furthermore, I excluded all pages relating to the newreply.php endpoint, for which there was at least one URL for each post made in the forum. Since creating content such as replying to posts won’t be possible anyway, this is not much of a loss.

Final HTTrack command

The final command ended up being this:

httrack -qwC2%Ps2u1%s%u%I0p3DaK0R10H0%kf2A99999999%c20%f#f -n -#L5000000 -N "%h%p/%n%Q.%t" -F "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)" -%F "" -%l "en, *" http://www.vaultmp.com -O1 "/home/foxtacles/vaultmp-archive" -* +vaultmp.com/* +www.vaultmp.com/* -*/newreply.php* -*/member.php*page=*

HTTrack supports a large number of options, all of which are rather well documented in its manual. A complete crawl of the entire site took about 5 hours and generated just under 45,000 files in total. Note that this is the result for a small forum containing only about 8,000 posts. The final web pages looked pretty much perfect to me; visually, there was no difference from the original.

Hosting on Netlify (for free)

Now that I had a static copy of my forums, the question of where to host them remained. My website and this blog, Blaubart, are generated using Jekyll and hosted on Netlify, a development platform that includes build, deploy, and serverless backend services for web applications. Deploying a completely static set of HTML files and a bunch of other assets appeared to be trivial, so I gave it a shot and was pleasantly surprised that it worked out of the box!
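There’s nothing special about the deployment itself. Assuming the Netlify CLI is installed and the site has already been created and linked, pushing the HTTrack output folder (named as in the command above) boils down to something like this:

npm install -g netlify-cli
netlify login
netlify deploy --prod --dir=vaultmp-archive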

In addition to that, I’m currently able to host my sites for free, since I stay under the allotted maximum of 100GB of monthly traffic on the platform’s Starter plan, which is pretty neat.

A caveat is that Netlify imposes a limit of 54,000 files per directory. This may become a problem when archiving sites larger than mine, although you could probably employ various HTTrack options to distribute the artifacts across a folder structure that stays under this limit.

Redirecting old URLs to new ones

One final issue remained: since the output of HTTrack naturally does not retain the original URL structure of the dynamic website, all URLs relating to the site have changed. This would not be a concern for a visitor landing on the archived forums’ root page; however, a large number of the original URLs linking deep into the forum are still floating around the Internet and are regularly hit by crawlers (and probably a few humans) to this day.

Fortunately, HTTrack’s rewrite rules are predictable, so it’s technically possible to dynamically redirect someone using one of the old URLs to the new one. In this case, I used the user-defined naming option %h%p/%n%Q.%t. The essential piece of information here is %Q, which represents the MD5 hash of the query/search parameters of the original URL being crawled. This means that the following URL:

/showthread.php?373-How-to-great-a-server-master-server-with-hamachi-tut&highlight=hamachi

will be converted automatically into:

/showthread4917b12cf1e1fc38438a27f32597f9e7.html?373-How-to-great-a-server-master-server-with-hamachi-tut&highlight=hamachi

Note the MD5 hash embedded in the HTML file name. Barring a few exceptions for assets other than HTML pages, this rule holds generally true.
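As a quick sanity check, the mapping can be reproduced with the same Deno standard-library helpers that the edge function below relies on. This is just an illustrative sketch using the example URL from above:

import { crypto, toHashString } from "https://deno.land/std@0.182.0/crypto/mod.ts";

// Reproduce HTTrack's %h%p/%n%Q.%t naming scheme: hash the query string
// (without the leading "?") with MD5 and splice the hex digest into the file name.
const oldUrl = new URL(
  "http://vaultmp.com/showthread.php?373-How-to-great-a-server-master-server-with-hamachi-tut&highlight=hamachi",
);

const digest = await crypto.subtle.digest(
  "MD5",
  new TextEncoder().encode(oldUrl.search.substring(1)),
);

// Per the redirect example further below, this prints
// "showthread4917b12cf1e1fc38438a27f32597f9e7.html"
console.log(oldUrl.pathname.replace(/^\//, "").replace(/\.php$/, `${toHashString(digest)}.html`));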

Netlify has a feature called Edge Functions, which is based on the Deno runtime and allows one to execute (mostly) arbitrary JavaScript/TypeScript when a route is requested. This functionality is in beta as of this writing, but it seemed stable enough to me, so I came up with this function to fix up all the old URLs:

import { crypto, toHashString } from "https://deno.land/std@0.182.0/crypto/mod.ts";
import type { Config } from "https://edge.netlify.com";

export default async (request: Request) => {
  const url = new URL(request.url);

  // These endpoints originally returned files such as RSS feeds and should not be rewritten

  if (
    url.pathname.startsWith('/external') || 
    url.pathname.startsWith('/blog_external') ||
    url.pathname.startsWith('/index73aa617507c5d84c945576bfd4233665.php') ||
    url.pathname.startsWith('/showwikidddd7f2a88159222edcdf9e9ca46c38d.php')
  )
    return;

  // We did not archive member pages with the `page` parameter, so we remove it

  if (url.pathname == '/member.php' && url.search) {
    url.search = url.search.replace(/([&?])page=\d+/g, (_match, p1) => (p1 === '?' ? '?' : ''));
  }

  // Due to several bugs in the VaultWiki software, some mangled URLs were generated and subsequently indexed.

  if (url.pathname == '/showwiki.php' && url.search) {
    url.search = url.search.replace(/amp;/g, '');
    url.search = url.search.replace(/%20/g, '+');
  }

  const hash = url.search ? toHashString(
    await crypto.subtle.digest(
      "MD5",
      new TextEncoder().encode(url.search.substring(1)),
    ),
  ) : '';

  url.pathname = url.pathname.replace(/\.php$/, `${hash}.html`);

  console.log(`Redirecting ${request.url.replace(url.origin, '')} to ${url.href.replace(url.origin, '')}`);
  return Response.redirect(url, 301);
};

export const config: Config = {
  path: ['/*.php', '/*.php?*'],
};

This function executes for every URL that uses one of the original vBulletin PHP endpoints and rewrites it, taking into account the special cases outlined earlier. Here’s an example of the function in action:

(old URL) http://vaultmp.com/showthread.php?373-How-to-great-a-server-master-server-with-hamachi-tut&highlight=hamachi

redirects to:

(new URL) https://legacy.vaultmp.com/showthread4917b12cf1e1fc38438a27f32597f9e7.html?373-How-to-great-a-server-master-server-with-hamachi-tut&highlight=hamachi
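For completeness, here is roughly how I would lay out the project; the function file name and the exact structure are my assumptions, based on Netlify’s default convention of picking up edge functions from netlify/edge-functions relative to the project root:

netlify/
  edge-functions/
    legacy-redirect.ts    (the edge function shown above)
vaultmp-archive/          (the HTTrack output, served as the publish directory)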

Result

I’m quite happy with the final outcome! Completing this project took me almost 4 days, but now all the pages are properly archived and live at legacy.vaultmp.com, and almost all of the original, previously broken URLs are functioning again. There’s no maintenance overhead and no worries about an outdated, dynamic backend running somewhere. Plus, at least for now, hosting the archive doesn’t cost me a dime! My nostalgia itch has been thoroughly scratched.

Thank you for reading!