Smartermail, Spamassassin, Virtuozzo VPSs

A follow up to Smartertools answers the cry on the fight against spam with smartermail 4.0.

Alot of clients have been asking about how we’re handling spamassassin with Smartermail 4.0. It’s no secret that spamassassin on a windows server runs horribly slow. If more than a handful of domains are involved I have no doubt that spamassassin would cripple the server if not fail completely. However I also believe that greylisting is the more effective component in the smartertools anti-spam arsenal and will reduce spam to a fraction of what it would be with just spamassassin alone.

So there’s a ton of interest in farming out spamassassin to a Linux vps. Why, you ask? Well quite simply spamassassin runs like a mad cow on steroids on a Linux server. Okay maybe I’m exaggerating but it’s a ton faster. Plus as hard as it is to admit it, being a die hard windows geek, it was developed on Linux and the community support for is still very much linux so it just runs better. Fortunately, smartertools (under the leadership of Tim Uzzanti, formerly of Crystaltech and my two superhero-style developer home-boys Grady W and Bryon G) saw ahead and knew this could be a problem. What did they do? They devised smartermail to support not only a remote spamassassin processing server on linux but if need be a farm of spamassassin processing servers. By going with a linux install of spamassassin you’ll gain the added support of the spamassassin community (also linux geeks er um developers .. ehh linux developer, geek … same thing 😉 ).

What’s so great about Spamassassin on Linux?

Out of the box spamassassin isn’t very effective. Okay, it’s good but not nearly as good as it should be. To really take advantage of spamassassin you’ll want to add a few functions:

DCC, DCC is the Distributed Checksum Clearinghouse. Basically your server creates a checksum from messages you receive compares this checksum to a distributed database of checksums to decide if the message is spam or not and then scores it accordingly. Basically you and a bunch of other mail server operators are teaming together to create a distributed, constantly updated database of spam and non-spam messages. Very cool.
Vipul’s Razor, is similar to DCC but uses the Cloudmark Spamnet network (my understanding is it’s the same database that backs their commercial services).
Pyzor, Similar to Razor, Pyzor is a completely free database and client written in .. you guessed .. python. It was developed out of fear that the Razor database being commercial may be ripped away from the opensource community at some point.

Now, these three tools will slow down your message processing (around 2-10 seconds generally and you should set a timeout so that they don’t hold up email too long) but they really add some power behind Spamassassin.

You now have evolved from the rules only processing of spamassassin into a rules processing system combined with a series of independent distributed message clearinghouses. I should note that if you have any volume whatsoever DCC is going to want you to setup your own DCCD (which we have setup currently but are still beta testing smartermail 4.0 before rolling out completely).

Why Rules? Don’t the Spammers Know These Rules too?

So now you have the default rules (around 91 I believe) and the clearinghouses. But what good are the rules right? I mean afterall if I have them the spammers have them too. Now enter the SpamAssassin Rules Emporium (SARE) a series of frequently updating rules that you can download at various times updating your rules using a tool like sa-update. This means your rules are constantly evolving just like the spammers are. Now we got kerosene on the fire. We have a set of consistently changing rules (which you’ll want to pick from carefully remember these could be touchy and some rules may flag good mail as bad) and a series of Independent distributed message clearinghouses.

A note about rules from SARE: There are different levels of rules, some that when tested against a mail test database picked up only spam messages but not all of the spam messages, some that picked up more spam messages but flagged a few good emails as spam too and finally some that picked up all the spam messages but flagged more ham as spam. It’s really up to you to decide what’s safe and what’s now.

Which rules do you deploy? Our own testing has shown that greylisting filters 90% of the spam and that spamassassin does a good job of flagging almost all of those that get through greylisting with just the safe level of rules employed. We have about 501 tests we run each message through currently and it takes between 1.2 and 5 seconds without the distributed database checks, with the database checks it takes 1.2 seconds to 20 seconds. Now our system hasn’t been fully optimized and tweaked yet but it’s getting there.

Rules and DCC what else does Spamassassin Give me?

So now we have a constantly updating database of rules, a way to compare our messages to a distributed database of email signatures to see if others have flagged them as spam and… here’s the coolest part. You know those annoying image emails you get selling viagra or stocks? That you can’t for the life of you figure out how to filter? Well spamassassin has OCR (object character recognition) plugins available that will read these messages and then review the text to see if it’s truly spam. This is VERY cool! But as the cat and mouse game goes, have you noticed that your image spam is becoming colorful now? Strange backgrounds? Multi-colored text? You know all those tricks we perform with CAPTCHA to keep bots from registering on our forms? Yeah the spammers are using those techniques in spam messages now (the rat bast*rds).

The Spam Fighting Duo becomes a powerful Dynamic Trio!

Spamassassin is very cool and Smartermail has gotten even cooler. Now enters the final member of our Team of Superhero Techno-tools, SWSoft‘s Virtuozzo. Virtuozzo is a OS virtualization VPS engine. What’s this mean? Hardware virtualization systems like Microsoft Virtual Server and VMWare have a overhead (reported on the order of 20%) due to virtualizing the hardware. This means 4 VPSs on a single server will only deliver the processing power of the single box at 80%. With hardware virtualization you gain a great deal of flexibility in being able to run mixed guest operating systems on a host system (IE, running Linux and Windows VPS’s on a Windows Host machine) but you pay for that with a performance loss (most argue with today’s processing power it’s an acceptable loss but you decide for yourself).

With OS virtualization you are still very much virtualized but you run the same Guest OS as the Host OS so you can’t run Linux on windows. But guess what? You aren’t getting bottlenecked as you are in HW virtualization. Now Virtuozzo gets even cooler. You get all the raw power, plus now that you’re using the same OS at the Host and across all of your guest OS’s they can actually share common memory and diskspace. So the 2GB of diskspace you’d normally lose in a 10GB VPS partition isn’t lost at all. You only give up any diskspace for files that differ from the host machine’s version (for instance if you created your own bind binary it and it’s necessary libraries would be unique to your vps and use your diskspace and memory allotment of your VPS servers) I believe this is around 100 to 200MB on average.

Next you get something called Virtuozzo templates. These are ready made application, operating system and in some cases full VPS machine templates that are shared across multiple VPS virtual engines (VE’s or VPSs if you will). So now you can have a series of very similar VEs (vps’s) running on a single hardware node all sharing resources. This means although your apps and virtual machine is very much separated and secure you’re not running all of the overhead of the guest operating system on your virtual machine and you’ll gain performance over a HW virtualized system. Our own informal testing showed this to be a great benefit and very much worth the tradeoffs between HW and OS virtualization for a hosted application and webhosting platform.

So why Virtuozzo for our spamassassin VEs?

The performance difference between HW virtualization and OS virtualization. HW virtualization is great, adds alot of functionality that you may or may not need and will get the job done but OS virtualization is the only way to go in a production hosting environment that demands maximum performance, reliability and scalability.
Shared OS resources reducing the need for redundant processes and diskspace waste. Allowing for more VPSs per HW node and thus lower cost.
The ability to create templates of a working VPS design and then replicate it across hundreds of VPS’s within a matter of minutes (I didn’t really get into that but it’s extremely cool)
The ability to patch a single VPS and then create a template for this patch and replicate it automatically across all VPSes.
The ability to move a VPS from one HW node to another HW node with near zero downtime (again extremely cool)
Finally, it’s a platform we’ve already adopted and have been using for about 3 years now and are extremely familiar with it and find it quite popular in the hosting industry.

I know there’s already been a ton of work on a VMWare image in the smartertools community and this is without question trail blazing efforts. For many servers the ready built solution is a clear winner. I mean afterall how many admins are going to have a Virtuozzo Linux HW node sitting around? Please don’t think I’m downplaying this solution or the great benefit this donation to the community has been, it’s a very very clever solution. But I honestly believe the more practical solution is a dedicated Linux VPS. Under high loads any mail server is going to slow down and require maximum disk I/O. Dedicated some of this disk I/O to a VPS engine on the same machine (using HW virtualization no less) is going to come at a cost and potentially not provide the performance required.

Side Note: Early on our shared mail servers were using SATA raid arrays. SATA drive I/O is known to burst to SCSI levels but won’t sustain those levels. As a result we had no choice but to move from SATA to SCSI and that was the only difference between the two configurations. Disk I/O is king in a mail server and fast drives and plenty of them in a RAID array is the only way to go for a mail server. Giving up some of this disk I/O to a collocated VPS scares me in our own environment. Your environment is probably much different and may or may not have the same issue but that’s for you to decide.

We’re creating these VPS engines so that we can offer not only a farm of Spamassassin servers for our shared hosting mail servers that we’re able to dynamically add additional nodes to quickly, but provide dedicated managed Spamassassin VPSs to our dedicated hosting clients and potentially mailserver admins worldwide regardless of where their mail servers reside.

Think about it, a plug and play spam fighting solution. This may not be an original Applied Innovations “Innovation” (that distinction goes to: someone_else )but it’s definitely one we’ve taken to the next level and that my friend is just why our company is named Applied Innovations, it’s not just a name, it’s what we do.

The Applied Innovations Spamassassin VPS solution is currently available in beta mode. It will be fully available following the completion of our beta testing. If you’re an Applied Innovations dedicated hosting client and need a spamassassin managed VPS online today, let us know and we’ll quote you a price.