Wednesday 29 May 2013

Preventing spamBots From Harvesting or Scraping Email Addresses From Web Pages

Spam has become a worldwide epidemic, and prevention is the current focus. A day will come when a cure is sought more aggressively than a bandage, but for now companies are making tons of money selling us filters and spam prevention kits. Our approach at the PHP Kemist is to byte back with encrypted or obfuscated email addresses that spamBots cannot scrape or harvest from your web pages.

spamBots are robots that spider sites looking for email addresses by searching for patterns that match an email address. They perform the function of crawlers when they find their targets as they collect the email addresses from pages that contain pattern matches. spamBots started out as simple programs that were fed lists of web addresses and methodically worked through all available links seeking email addresses to collect. The general public and most webmasters were unaware of this process and were loading web pages with email addresses to provide customers with more methods of easy contact with store owners and business representatives. Unfortunately, this lack of awareness bred the modern age of the intelligent spamBot.

As spamBots became common knowledge and the Online community grew angry about the spam they were receiving, companies moved in to provide solutions for a price. Their anti-spam solutions work anywhere from poorly to really well. Our own anti-spam through Go Daddy reduced spam to less than 1%, which has been extremely easy to manage. However, with anti-spam measures comes the responsibility to check filtered messages for incorrectly tagged email, which must be tagged as “good,” else you may start to lose email. The training process for anti-spam is fairly easy, but requires methodical and regular diligence.

While anti-spam looked great to many companies and email users, webmasters continued to provide an easily accessible list of company emails through web page publications. spamBot programmers became more savvy and spamBots grew more powerful. Third world countries got into the game using Internet bars and quickly found a new source of revenue. Without getting into the world of scams and Ponzi schemes, email rapidly matured from simple communications into a seething pit of crap from which we had to carefully pluck our good email messages. The problem continues to grow while software companies remain reluctant to fix it, since they are cashing in on temporary solutions.

So, what is the cure to the problem, one might ask? There are many aspects of the solution that require effort from different members of the software companies and Online communities. From the perspective of the webmaster, our part of the cure is to stop providing spamBots with the food that keeps them alive. Stop placing email addresses in easily accessible locations with simple formats. There are a few simple strategies webmasters can use to prevent spam and spamBots. How these strategies are applied can vary greatly from webmaster to webmaster, but they typically fall from the webmaster’s hands to the programmer’s hands.

1. Email is a string of characters that creates a recognizable pattern. If we break the expected pattern, spamBots may overlook the email address and move on. A common and simple method of pattern breaking is Unicode character replacement, where characters are written as HTML numeric character references. Browsers interpret these references efficiently and convert them back into the original characters. Take the email address bob@hates-spam.com and perform Unicode replacement and that email address becomes:

&#98;o&#98;&#64;hates-spam&#46;com

The letter b was replaced with the Unicode string &#98;, which is not picked up by at least 97% of spamBots. Simply replacing the @ symbol with the &#64; string obfuscates the email address from most pattern matching spamBots. Add the replacement of the period with &#46; and you have a string of characters that spamBots are likely to ignore.
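
If you would rather not look up the numbers by hand, the replacement can be scripted. Here is a minimal sketch in JavaScript, assuming decimal numeric character references (such as &#64; for the @ symbol). The function names encodeAsEntities and encodeSelected are our own inventions for illustration and are not part of any library.

```javascript
// Hypothetical helper: write every character of an address as a decimal
// numeric character reference (&#NNN;). Browsers render each reference
// as the original character.
function encodeAsEntities(address) {
  return Array.from(address)
    .map((ch) => "&#" + ch.codePointAt(0) + ";")
    .join("");
}

// Hypothetical helper: encode only the characters listed in charsToEncode
// and leave everything else readable.
function encodeSelected(address, charsToEncode) {
  return Array.from(address)
    .map((ch) =>
      charsToEncode.includes(ch) ? "&#" + ch.codePointAt(0) + ";" : ch
    )
    .join("");
}

// Replace the letter b, the @ symbol, and the period.
console.log(encodeSelected("bob@hates-spam.com", "b@."));

// Replace every character for the most thorough pattern break.
console.log(encodeAsEntities("bob@hates-spam.com"));
```

You would run this once and paste the output into your page source in place of the plain address. Encoding every character costs a few more bytes, but it leaves nothing for a simple pattern match to catch.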

2. Dynamic character replacement is a more advanced method of obfuscation. Using the JavaScript programming language, your web page can dynamically generate email address links and trigger the mailto command from the browser. This method is slightly more advanced than Unicode character substitution, but is still pretty easy to integrate into any website.

The first component of dynamic character replacement is creating an array of the characters in the email address in the page. Replacement characters can be of any convention, but we’ll use numeric replacements for this example. Let’s consider the alphabet, with the letter A being the first letter. We use the number 1 to replace the letter A, 2 for B, 3 for C, etc. We then create a segment of JavaScript code in the email address hyperlink that triggers the JavaScript character replacement process, and use our array of character substitutes to reconstitute our intended email address.

JavaScript methods may be deployed globally using a complete alphanumeric-symbolic substitution array, or a reduced set on a per-page basis. This choice is one made by the programmer based on web server performance, extensibility for the number of email addresses to be used, etc. Regardless of the scope of deployment, the JavaScript character replacement method dynamically replaces the obfuscated mailto hyperlink segment with the correct alphanumeric-symbolic characters, then triggers the mailto command, resulting in a normal and expected email address launch.
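
To make the numeric substitution idea concrete, here is a rough sketch under a few assumptions. The letters follow the 1 for A, 2 for B convention described above. The digits and symbols tacked on after Z, and the names SUBSTITUTES, obfuscate, reconstitute, and buildMailto, are our own choices for illustration only.

```javascript
// Substitution table: a character's 1-based position in this string is its
// numeric replacement (a=1, b=2, c=3, ...). The digits and symbols after "z"
// are an assumption made for this sketch.
const SUBSTITUTES = "abcdefghijklmnopqrstuvwxyz0123456789@.-_";

// Run once, offline, to produce the numbers you publish in the page.
function obfuscate(address) {
  return Array.from(address.toLowerCase()).map((ch) => {
    const code = SUBSTITUTES.indexOf(ch) + 1;
    if (code === 0) {
      throw new Error("No substitute defined for character: " + ch);
    }
    return code;
  });
}

// Runs in the browser to turn the numbers back into the address.
function reconstitute(codes) {
  return codes.map((code) => SUBSTITUTES.charAt(code - 1)).join("");
}

// Builds the link target only at the moment it is needed.
function buildMailto(codes) {
  return "mailto:" + reconstitute(codes);
}

const codes = obfuscate("bob@hates-spam.com");
console.log(codes.join(","));
console.log(buildMailto(codes));

// In a page, the link's click handler could do something like:
//   window.location.href = buildMailto(codes);
```

Only the list of numbers and the lookup table appear in the page source, so a bot would have to actually run the script to see the address.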

3. Less sophisticated methods of JavaScript character handling can be used, such as reconstituting an email address in chunks. The JavaScript code may have the pieces of the email address broken into multiple objects such as “bob” in the first, the @ symbol in the code, “hates” and “-” and “spam” in another set of objects, and the “.com” in another. When the user clicks the mailto link, JavaScript assembles the pieces on-the-fly and triggers the mailto event.
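
The chunk approach might look something like the sketch below. The object layout and the names parts, assembleAddress, and assembleMailto are assumptions for illustration. The point is simply that the full address never appears in one piece in the page source.

```javascript
// The address is stored in separate pieces, never as one string.
const parts = {
  user: "bob",
  domainWords: ["hates", "-", "spam"],
  tld: ".com",
};

// The @ symbol is produced by code (character code 64) instead of being
// typed next to the other pieces.
function assembleAddress(p) {
  return p.user + String.fromCharCode(64) + p.domainWords.join("") + p.tld;
}

function assembleMailto(p) {
  return "mailto:" + assembleAddress(p);
}

console.log(assembleMailto(parts));

// In a page, the link's click handler could do something like:
//   window.location.href = assembleMailto(parts);
```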

4. One of the more efficient methods of blocking email address recognition is using Flash media. Search engines and spamBots alike are not yet capable of interpreting the content of a Flash movie. Flash movies can have rather small dimensions and fit into your web site design efficiently. The file size can be extremely small and the movie can still be extremely functional. In most cases, a text block would not look clean or acceptable with small Flash movies inline. But in sections of the layout where the email address can reside on its own, Flash is a great solution to protect your email address.

5. Don’t publicize your company email addresses at all! There are two ways to protect email addresses. The first is to provide a temporary contact email address in your website (hopefully obfuscated) that can be changed periodically should spam start showing up. Email addresses are super cheap and easy to manage, so use them more often. The email address you publish is not likely your personal email address, and is likely only going to be used by a web user while looking at your website. Once you establish communications with that customer, you are likely to provide direct contact information, including a more personal email address.

6. The next method of not using your email address is to provide an Online form, which submits the content to you behind the scenes, but via email. The form can provide an easier method of contact and communication for the customer, if you use dropMenus and a preselected list of options specific to your products and services. In addition, there are advantages to this method of using contact forms, as your web server can differentiate between the selected subjects and send email to different team members. Let’s say your Subject dropMenu had three options: Request A Brochure, Ask A Question, Voice A Problem. These three Subject options would surely be delivered to different team members, which allows them to be more efficient at responding to the sender.
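
Routing by subject can be as simple as a lookup table on the server. The sketch below assumes the three Subject options from the example. The recipient addresses, the fallback address, and the names ROUTES and recipientFor are hypothetical placeholders.

```javascript
// Map each Subject option to the team member who should receive it.
// All addresses here are placeholders, not real mailboxes.
const ROUTES = {
  "Request A Brochure": "brochures@example.com",
  "Ask A Question": "questions@example.com",
  "Voice A Problem": "support@example.com",
};

// Anything unexpected goes to a general inbox instead of being trusted.
const DEFAULT_RECIPIENT = "info@example.com";

function recipientFor(subject) {
  return Object.prototype.hasOwnProperty.call(ROUTES, subject)
    ? ROUTES[subject]
    : DEFAULT_RECIPIENT;
}

console.log(recipientFor("Ask A Question"));
console.log(recipientFor("Something a bot made up"));
```

Because the recipient is chosen on the server from a fixed list, the visitor never sees any of the real addresses, and a tampered Subject value cannot redirect the email anywhere you did not intend.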

Caveats include formBot abuse to send spam. formBots are special spamBots that cruise the Internet looking for contact forms to submit spam. This is another area where webmasters have created the problem by not integrating security programming into their forms, assuming everyone will play nicely. Get your programmer involved in ANY aspect of communication with the user, especially a programmer that understands web security well. Web Security Experts are a special breed of programmers, separate from regular programmers, and a world apart from webmasters.

formBots look for contact forms and submit their spam to you, sometimes in bucket loads. If the formBot is successful sending one form submission, why not send a few thousand? This converts the formBot function from simple spam to attempted DoS (Denial of Service) as you’ll be so inundated with contact form submissions that you won’t easily find your real form submissions to respond to. By simply adding some healthy PHP Programming to cleanse and validate form submissions, you can prevent the formBots from successfully submitting anything at all. You can also track the IP Address used to attack your system, blacklist the IP Address, report abuse to their network provider, etc. Don’t just filter communications on one field; filter them all and reject any submission that meets your criteria for spam or abuse.
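
Here is a rough idea of what checking every field can look like. We talk about PHP above, but this sketch is written in server-side JavaScript (Node.js) to stay in one language with the other examples; the same checks translate directly to PHP. The field names, length limits, link limit, and rate limit are all assumptions you would tune for your own form.

```javascript
const MAX_PER_WINDOW = 3;          // assumed: submissions allowed per IP
const WINDOW_MS = 10 * 60 * 1000;  // assumed: within a 10 minute window
const recentByIp = new Map();      // IP address -> recent submission times

function validateSubmission(fields, ip, now = Date.now()) {
  const errors = [];
  const rawName = String(fields.name ?? "");
  const rawEmail = String(fields.email ?? "");
  const name = rawName.trim();
  const email = rawEmail.trim();
  const message = String(fields.message ?? "").trim();

  // Line breaks in single-line fields are a common header-injection trick.
  if (/[\r\n]/.test(rawName) || /[\r\n]/.test(rawEmail)) {
    errors.push("line break in a single-line field");
  }
  if (name.length === 0 || name.length > 100) {
    errors.push("name is missing or too long");
  }
  // Deliberately simple format check, not a full address validator.
  if (!/^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(email)) {
    errors.push("email address is not well formed");
  }
  if (message.length === 0 || message.length > 5000) {
    errors.push("message is missing or too long");
  }
  // Messages stuffed with links are usually spam.
  const linkCount = (message.match(/https?:\/\//gi) || []).length;
  if (linkCount > 2) {
    errors.push("too many links in message");
  }

  // Track how often this IP address has submitted recently.
  const recent = (recentByIp.get(ip) || []).filter((t) => now - t < WINDOW_MS);
  recent.push(now);
  recentByIp.set(ip, recent);
  if (recent.length > MAX_PER_WINDOW) {
    errors.push("too many submissions from this IP address");
  }

  return { ok: errors.length === 0, errors };
}

const result = validateSubmission(
  { name: "Bob", email: "bob@hates-spam.com", message: "Please send a brochure." },
  "198.51.100.7"
);
console.log(result);
```

A rejected submission gives you the IP Address and the reasons for the rejection, which is the raw material for blacklisting or an abuse report. The in-memory Map is only for illustration; a real site would keep that history somewhere that survives a restart.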

7. Blogs, chatrooms, and other Online social websites allow signups. To sign up you must provide an email address. This is a valid and healthy method for those websites to prevent formBots from signing up bucket loads of fake users, all presenting spam information in their profiles, and similar. You are supposed to receive an email asking you to verify your address and accept the membership, else the membership never activates. All of this is healthy and expected.

So, you have yourself a social website membership and you want to share with others. That’s really nice of you and we appreciate your information, assuming it’s correct. But, did you stop to consider whether that website posts your email address as part of your identity? Have you checked to see if that website sells their email lists to other companies as part of their revenue stream? Most reputable social websites do not distribute or post your email address, but there is a simple method for being sure.

Since almost all web hosts that sell email service provide far more email addresses than you need, create some fake ones for this purpose. Go Daddy basic web hosting accounts costing as little as $2.80 per month supply as many as 100-500 email addresses for one domain name. Let’s say you’re signing up for Charlie Chick-Chocks Rotisserie Grill website because he has a free giveaway or a newsletter with coupons. His website is www.charliesgrill.com and you need to provide an email address that is real, and from which you can validate your account. Create an email address in your email service called charliesgrill@yourdomain.com, where yourdomain.com is whatever your domain name really is. Now you can sign up, verify your email address, and get your cool coupons for greasy ribs or whatever turns your gut. If Charlie sells the list of email addresses to Pamela’s Pink Panties store, and she starts sending you pantyhose emails, you’ll quickly recognize who sold your email address by the address she sends to.
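
If you sign up for a lot of sites, naming the addresses by hand gets tedious. Here is a small sketch of the idea, assuming you name each address after the first part of the site’s domain. The function names aliasFor and leakSource, and the sender domain used in the demo, are made up for illustration.

```javascript
// Build a per-site address such as charliesgrill@yourdomain.com.
function aliasFor(siteUrl, myDomain) {
  // new URL() needs a scheme, so add one if the visitor-style URL lacks it.
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(siteUrl)
    ? siteUrl
    : "http://" + siteUrl;
  const host = new URL(withScheme).hostname.toLowerCase().replace(/^www\./, "");
  // Use the first label of the host name, letters and digits only.
  const label = host.split(".")[0].replace(/[^a-z0-9]/g, "");
  return label + "@" + myDomain;
}

// Given the address a message was sent to and the sender's domain, report
// which site's address leaked, or null if the sender is the site itself.
function leakSource(recipient, senderDomain) {
  const label = recipient.split("@")[0].toLowerCase();
  return senderDomain.toLowerCase().includes(label) ? null : label;
}

const alias = aliasFor("www.charliesgrill.com", "yourdomain.com");
console.log(alias);
console.log(leakSource(alias, "pamelas-store.example"));
```

Taking only the first label is a simplification. A site reached through a subdomain other than www would need a smarter rule, but for the typical www address it does the job.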

If you create MANY fake email addresses, which we do for this exact purpose, you might forward all of them into a bulk email account where you can easily read them from one location. Odds are you won’t be using those addresses for regular correspondence, so a bulk email account is a best practice.

We’ve created hundreds of fake email addresses to prevent our real addresses from being distributed, as well as to test sites for the redistribution issue. Not a single email address has circled back via the wrong domain name (sender), which makes us feel either safe with these sites, or that the method worked. The sending company could easily filter out email addresses that contain references to their domain or company name, but heck, that’s really what we wanted, right?

Whatever creative solution you choose to concoct, just make sure you avoid making life easy for spamBots and formBots! We need to work together to kick spamBots in the pants and take money away from spamBot programmers.



Source: http://blog.phpkemist.com/2008/02/23/preventing-spambots-from-harvesting-or-scraping-email-addresses-from-web-pages/
