Data Processing and Interpretation

You might be wondering how the chart data for the Anti-Spam Resources is compiled. Where does it come from? What kind of senders does it represent? Here is a general overview of how to read the chart data and an introduction to our methods of classifying and processing data.

Data Interpretation

Each blacklist chart shows the accumulated weekly statistics for the last eight weeks. The "SPAM HITS" bar is a percentage derived from the number of spam-mails that a particular blacklist correctly classified as spam. The "HAM HITS" bar is a percentage derived from the number of ham-mails (non-spam) that a particular blacklist incorrectly classified as spam.

Example: Assume blacklist "alpha" correctly tagged 70% of the spam received (spam hits) and incorrectly tagged 0.1% of non-spam mail (ham hits), while blacklist "beta" correctly tagged 90% of the spam received (spam hits) and incorrectly tagged 10% of non-spam mail (ham hits). You may conclude that "beta" blocks more spam than "alpha"; however, "beta" also incorrectly classifies a hundred times more desired mail as spam than "alpha" does.
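
The two percentages can be written down as a small calculation. The following Python sketch is an illustration only: the function name hit_rates and the mail counts are hypothetical, chosen so that they reproduce the percentages in the example, and are not taken from our database.

```python
def hit_rates(spam_listed, spam_total, ham_listed, ham_total):
    """Return (spam_hits, ham_hits) as percentages.

    spam_listed: spam mails the blacklist tagged (correct hits)
    ham_listed:  ham mails the blacklist tagged (false positives)
    """
    spam_hits = 100.0 * spam_listed / spam_total if spam_total else 0.0
    ham_hits = 100.0 * ham_listed / ham_total if ham_total else 0.0
    return spam_hits, ham_hits


# Hypothetical counts: 9000 spam mails and 1000 ham mails were tested.
alpha = hit_rates(spam_listed=6300, spam_total=9000, ham_listed=1, ham_total=1000)
beta = hit_rates(spam_listed=8100, spam_total=9000, ham_listed=100, ham_total=1000)

# How many times more ham "beta" tags as spam compared with "alpha".
ham_ratio = beta[1] / alpha[1]
```

Comparing only the first value of each pair favours "beta"; the second value shows the price paid for it in wrongly tagged ham.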

Data Classification

Intra2net’s core database is fed by a cluster of reporting-servers located in Central Europe. We use the real mailstream of a few selected Intra2net business customers; no spam traps, dead email addresses or similar methods are involved. All mails are classified automatically according to the mail-server subfolder in which they are located. No additional steps are required, just everyday user interaction.

UNDEFINED	All mails located at the top level of "inbox" or tagged in the "spam suspect" subfolder
SPAM	All mails located in the "spam" subfolder
HAM	All mails located in any other subfolder
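
The folder rules above can be sketched as a small Python function. This is a minimal illustration, not our production code: the function name classify, the "/" path separator and the lower-case folder names are assumptions made for the example.

```python
def classify(folder):
    """Map a mailbox folder path to UNDEFINED, SPAM or HAM.

    Assumes paths such as "inbox", "inbox/spam" or "inbox/customers/acme".
    """
    parts = [p.strip().lower() for p in folder.split("/") if p.strip()]
    # Assumption: an empty path is treated like the top level of the inbox.
    if not parts or parts == ["inbox"]:
        return "UNDEFINED"
    if parts[-1] == "spam suspect":
        return "UNDEFINED"
    if parts[-1] == "spam":
        return "SPAM"
    # Any other subfolder counts as ham.
    return "HAM"
```

For instance, classify("inbox/spam") yields SPAM, while a mail sorted into a hypothetical "inbox/customers" subfolder yields HAM.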

Users are trained to move undetected spam mails from the top level of "inbox" to the "spam" subfolder and to sort out the "spam suspect" subfolder occasionally. Because mails are moved to the appropriate subfolder instead of being deleted, only a minimal change in behaviour is required of the user. Deleted or collected (POP3) emails are ignored, as is backscatter. The automatic spam filter is set to medium so that it does not hit ham-mails accidentally.

Data Processing

All incoming mails are duplicated to a dedicated reporting-server, so all protocol information is preserved for running network-based tests. All tests on the reporting-server are processed completely independently of the main mail-server. Unique IDs are used to match each mail with its classification, which is done by user interaction as described above. Reporting-servers report only final test results to our core database server.
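
As a rough illustration of how test results and user classification can be brought together through a unique ID, consider the Python sketch below. The function name join_by_id, the dictionary layout and the decision to leave UNDEFINED mails out of the counts are assumptions made for this example; they do not describe the actual database schema.

```python
def join_by_id(test_results, classifications):
    """Pair each mail's blacklist test result with its user classification.

    test_results:    mail ID -> True if the blacklist listed the sender
    classifications: mail ID -> "SPAM", "HAM" or "UNDEFINED"
    Returns a dict: label -> [mails listed, mails total].
    """
    counts = {"SPAM": [0, 0], "HAM": [0, 0]}
    for mail_id, listed in test_results.items():
        # A mail without a known classification is treated as UNDEFINED.
        label = classifications.get(mail_id, "UNDEFINED")
        if label not in counts:
            continue  # assumption: UNDEFINED mails are not counted
        counts[label][1] += 1
        if listed:
            counts[label][0] += 1
    return counts
```

Every mail is counted once under its own ID, so the test can run long before the user has sorted the mail into a subfolder.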

On average, the core database shows a volume of 90% spam and 10% ham-mails, in line with the worldwide mailstream. However, we do have a high bulk mail volume (mailing lists). Our experience has shown that some network-based tests have problems differentiating between legitimate and unsolicited bulk mail. We do not count IP addresses, just single mails; it makes no difference how many mails we receive from the same IP address. Keep in mind: we have no influence on users accidentally classifying ham-mails as spam or vice versa, nor do we have access to the content or body of customer mails, for privacy reasons.

If you have any questions or comments about anything here or about the Anti-Spam Resources, please don't hesitate to contact us.

Visit the Blacklist Monitor main page for more blacklist statistics.