Website obfuscation
Posted: 25 May 2006 at 01:52:27
I've been so busy lately, but tonight I took some time to get caught up on the Utah Open Source Planet and, I must say, there was lots of good stuff to read. Thanks to y'all for sharing your knowledge. You rock.
I thought I'd pick on one of my favorite UOSSP bloggers, Aaron Toponce, but not in a negative way. I read his semi-recent entry about using HTML entities to obfuscate web site data in an attempt to foil robots -- particularly robots intent on harvesting e-mail addresses and other information.
Some years ago, I implemented this technique on several sites, personal and professional. It seemed to make sense that the average spammer/data-harvester was not going to implement the code necessary to de-entity-ize the content in search of e-mail addresses. In retrospect, however, I think that's a poor assumption.
See, spammers have money and they give their money to poor souls who will write code for money and, in many cases, have the smarts to pull it off. So, semi-smart coders tasked with maximizing the pool of e-mail addresses gleaned from a vast array of websites will very quickly implement techniques to foil the simplest of data obfuscation techniques. Converting text to HTML entities has got to be one of the first obfuscation techniques they are faced with circumventing.
After that, they probably implement simple OCR techniques to glean data from sites that convert all their e-mail addresses into text rendered as image files.
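To give a sense of how little work that circumvention takes, here is a minimal sketch of the harvester's side. It is my own illustration, not code from any real bot: decode_numeric_entities and harvest_addresses are hypothetical names, only numeric entities are handled (named ones like &amp; are ignored), and the address pattern is deliberately loose.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: undo numeric HTML entities, decimal and hex only.
sub decode_numeric_entities {
    my ($html) = @_;
    $html =~ s/&#(\d+);/chr($1)/ge;               # &#65;  -> A
    $html =~ s/&#x([0-9a-f]+);/chr(hex($1))/gie;  # &#x41; -> A
    return $html;
}

# Hypothetical helper: return everything that looks like an e-mail address.
# The pattern is an assumption for illustration, not a full validator.
sub harvest_addresses {
    my ($html) = @_;
    my $text = decode_numeric_entities($html);
    return $text =~ /([\w.+-]+\@[\w-]+(?:\.[\w-]+)+)/g;
}

# An entity-obfuscated mailto link for foo@example.com.
my $page = '<a href="mailto:&#102;&#111;&#111;&#64;&#101;&#120;&#97;'
         . '&#109;&#112;&#108;&#101;&#46;&#99;&#111;&#109;">mail me</a>';

print "$_\n" for harvest_addresses($page);
```

Two substitutions and one pattern match are enough to undo the obfuscation, which is why I doubt it slows down anyone who is paid to get past it.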
That said, this HTML entity-based obfuscation technique is better than nothing, right? Because spammers like their pools of e-mail addresses to be fresh, it usually only takes a couple of weeks to see whether an anti-spam technique results in a significant reduction of incoming spam, so it's easy to verify your technique is working. When we implemented the HTML entity-based obfuscation technique, there was a decrease in the amount of spam, but there was still plenty of spam.
If you're interested in playing with ways of automating the process of converting text data to a string of HTML entities, check out the HTML::Entities Perl module -- part of the comprehensive HTML::Parser distribution of modules.
Once you have this installed, you can do something like this:
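perl -MHTML::Entities -ne 'print encode_entities($_, "\040-\377")'
For the Perl head-scratchers, this is a one-liner that loads the HTML::Entities module, uses -n to wrap a loop around reading from STDIN or a filename parameter, and prints the result of the encode_entities() function call for each line of input read. The second argument is the set of characters to encode; Perl reads those backslash escapes as octal, so "\040-\377" covers everything from a space (decimal 32) through character 255. Hit Control+D to get out of it.
[foo] /home/fozz 19 % perl -MHTML::Entities \
-ne 'print encode_entities($_, "\040-\377")'
Aaron Toponce
&#65;&#97;&#114;&#111;&#110;&#32;&#84;&#111;&#112;&#111;&#110;&#99;&#101;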
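If you would rather have this in a script than a one-liner, here is a rough sketch that builds a complete mailto link. To keep it dependency-free I have swapped encode_entities() for a hand-rolled sprintf, so it always emits decimal numeric entities and never named ones. The names entity_encode and obfuscated_mailto are hypothetical, and foo@example.com is a placeholder address.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn every character into a decimal numeric entity.
# This stands in for HTML::Entities::encode_entities() in this sketch.
sub entity_encode {
    my ($text) = @_;
    return join '', map { sprintf('&#%d;', ord $_) } split //, $text;
}

# Hypothetical helper: build a mailto link with both the href and the
# visible link text encoded.
sub obfuscated_mailto {
    my ($address) = @_;
    my $encoded = entity_encode($address);
    my $scheme  = entity_encode('mailto:');
    return qq{<a href="$scheme$encoded">$encoded</a>};
}

print obfuscated_mailto('foo@example.com'), "\n";
```

Browsers decode the entities before they look at the href, so the link still works for people; the raw page source just contains no literal address.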
When it was clear the HTML entity-based obfuscation simply did not have what it takes to win against increasingly smart harvesting bots, we deployed a CAPTCHA solution using the Authen::Captcha Perl module for our clients that really needed/wanted to publish e-mail addresses on their websites. This solution has worked out much better and, paired with educating users about the risks of leaving their e-mail addresses on websites, we've seen more significant decreases in incoming spam.