Spam in blogs


Spam in blogs (also called simply blog spam, comment spam, or social spam) is a form of spamdexing. (Note that the term blogspam has another meaning as well, namely a post by a blogger who creates no-value-added posts in order to submit them to other sites.) It is done by posting (usually automatically) random comments or promotions of commercial services to blogs, wikis, guestbooks, or other publicly accessible online discussion boards. Any web application that accepts and displays hyperlinks submitted by visitors may be a target.

Adding links that point to the spammer's web site artificially increases the site's ranking in search engines where the popularity of a URL contributes to its implied value, such as Google Search with its PageRank algorithm. An increased ranking often results in the spammer's commercial site being listed ahead of other sites for certain searches, increasing the number of potential visitors and paying customers.

History

This type of spam originally appeared in Internet guestbooks, where spammers repeatedly filled a guestbook with links to their own site and no relevant comment, in order to increase search engine rankings. If an actual comment is given, it is often just "cool page", "nice website", or keywords of the spammed link.

In 2003, spammers began to take advantage of the open nature of comments in blogging software such as Movable Type by repeatedly posting comments to various blog entries that provided nothing more than a link to the spammer's commercial web site. Jay Allen created a free plugin, called MT-BlackList,[1] for the Movable Type weblog tool (versions prior to 3.2) that attempted to alleviate this problem. Because of the prevalence of such spam, many blogging packages now have built-in methods of preventing or reducing its effect, although spammers have developed tools to circumvent them. Many spammers use special blog spamming tools such as trackback submitters to bypass comment spam protection on popular blogging systems like Movable Type, WordPress, and others.

Other typical comment content includes comments stolen from other websites, generic praise such as "nice article", remarks about imaginary friends, plagiarised passages from books, unfinished sentences, nonsense words (usually to defeat a minimum comment length restriction), or the same link repeated.

Application-specific implementations

Particularly popular software products such as Movable Type and MediaWiki have developed or included anti-spam measures, as spammers focus more attention on those platforms because of their prevalence on the Internet. Whitelists and blacklists that prevent certain IPs from posting, or that prevent people from posting content matching certain filters, are common defences, although most software tends to use a combination of the different techniques documented below.

The goal of every potential solution is to allow legitimate users to continue to comment (and often even to add links to their comments, since relevant or related links are considered by some to be a valuable aspect of any comments section) while preventing link spam and irrelevant comments from ever becoming visible to the site's owner and visitors.

Possible solutions

Banning malicious IP addresses

Early spam management relied on whitelists and blacklists that prevent known malicious IP addresses from posting. This was usually a reactive technique and, because of the large number of IP addresses that generate comment spam, a particularly ineffective solution by itself.
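As a simple illustration, an IP-based blacklist check reduces to a set membership test. The following Python sketch is hypothetical: the addresses are placeholders from documentation ranges, and a real deployment would also need to manage additions, expiry, and whitelist exceptions.

# Minimal sketch of an IP blacklist check (hypothetical addresses and names).
BLACKLISTED_IPS = {"203.0.113.7", "198.51.100.23"}   # placeholder addresses

def is_blocked(ip_address):
    # Reject the submission outright if the poster's IP is on the blacklist.
    return ip_address in BLACKLISTED_IPS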

Disallowing multiple consecutive submissions

Legitimate users rarely reply to their own comments, yet spammers typically do.[2] Checking that a comment is not a reply to an earlier comment from the same IP address will significantly reduce flooding. This, however, proves problematic when multiple users behind the same proxy wish to comment on the same entry. Blog spam software may attempt to circumvent this check by faking IP addresses or by using open proxies to post similar spam from many different IP addresses.[3]
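A minimal sketch of such a check, assuming the blog software can supply the earlier comments on the entry (newest last) as a list of dictionaries with an "ip" field, might look like this:

def is_consecutive_from_same_ip(new_comment_ip, existing_comments):
    # Flag the submission if it would directly follow another comment
    # from the same IP address on the same entry.
    return bool(existing_comments) and existing_comments[-1]["ip"] == new_comment_ip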

Blocking by keyword

Blocking specific words from posts has been argued to be one of the simplest, most effective, and least intrusive ways to reduce spam. Much spam can be routinely blocked with few false positives simply by banning the names of popular pharmaceuticals and casino games.[citation needed]

This can remain effective in the long term, because it is not beneficial for spammers to obfuscate keywords as "vi@gra" or similar: keywords must be readable and indexable by search engine crawlers to be effective.[citation needed]

Unsophisticated implementations of this may lead to instances of the Scunthorpe problem.
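A hedged sketch of keyword blocking is shown below; the banned terms are purely illustrative, and matching on whole words rather than raw substrings avoids some Scunthorpe-style false positives:

import re

# Illustrative keyword list; real lists are much longer and maintained over time.
BANNED_KEYWORDS = ["viagra", "cialis", "texas holdem", "roulette"]
BANNED_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in BANNED_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def contains_banned_keyword(comment_text):
    # Whole-word matching reduces Scunthorpe-style false positives
    # compared with naive substring search.
    return BANNED_PATTERN.search(comment_text) is not None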

nofollow

Google announced in early 2005 that hyperlinks with the rel="nofollow" attribute[4] would not be crawled or influence the link target's ranking in the search engine's index. The Yahoo and MSN search engines also respect this attribute.[5]

Using rel="nofollow" in theory negates the attractiveness of posting a link on a site that implements it. Much weblog software now marks reader-submitted links this way by default (with no option to disable it without code modification). More sophisticated server software could omit the nofollow for links submitted by trusted users, such as those registered for a long time, on a whitelist, or with high karma. Some server software adds rel="nofollow" to pages that have been recently edited but omits it from stable pages, under the theory that stable pages will have had offending links removed by human editors.
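A rough sketch of how software might add rel="nofollow" to untrusted links is given below. The regular expression is deliberately simplistic (a real implementation should use an HTML parser), and the "trusted" flag is an assumption standing in for whitelist or karma checks:

import re

def mark_links_nofollow(comment_html, trusted=False):
    # Trusted posters (long-registered, whitelisted, high karma) keep plain links.
    if trusted:
        return comment_html
    # Add rel="nofollow" to anchor tags that do not already carry a rel attribute.
    return re.sub(r'<a\b(?![^>]*\brel=)', '<a rel="nofollow"', comment_html,
                  flags=re.IGNORECASE)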

Other websites with high user participation, such as Slashdot, use improvised nofollow implementations, for example adding rel="nofollow" only for potentially misbehaving users. Potential spammers posting as users can be identified through various heuristics, such as the age of the registered account and other factors. Slashdot also uses the poster's karma as a determinant in attaching a nofollow tag to user-submitted links.

rel="nofollow" has come to be regarded as a microformat.

Critics of the nofollow tag argue that unsophisticated spam software will not attempt to determine whether a particular site implements nofollow before posting, as there is no real perceived cost in submitting ineffective spam. The attribute therefore does not prevent spam comments or the nuisance they cause to a site's viewers; it merely negates the PageRank benefit the spammer is expecting.

Validation (reverse Turing test)

A method of blocking automated spam comments is to require validation before the contents of the reply form are published. The goal is to verify that the form is being submitted by a real human being and not by a spam tool, which is why this has been described as a reverse Turing test. The test should be one that a human being can easily pass but an automated tool would most likely fail.

Many forms on websites take advantage of the CAPTCHA technique, displaying a combination of numbers and letters embedded in an image which must be typed into the reply form to pass the test. To keep out spam tools with built-in text recognition, the characters in the images are customarily misaligned, distorted, and noisy. A drawback of many older CAPTCHAs is that the expected responses are usually case-sensitive, while the corresponding images often do not allow capital and small letters to be distinguished; this should be taken into account when devising a set of CAPTCHAs. Such systems can also prove problematic for blind people who rely on screen readers; some more recent systems allow for this by providing an audio version of the characters. A simple alternative to CAPTCHAs is validation in the form of a password question, providing a hint to human visitors that the password is the answer to a simple question like "The Earth revolves around the... [Sun]".
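The "password question" variant can be implemented with very little code. The sketch below assumes hypothetical form field names and uses the article's own example question:

VALIDATION_QUESTION = "The Earth revolves around the..."
VALIDATION_ANSWER = "sun"

def passes_validation(form_data):
    # Accept the comment only if the human-oriented question was answered.
    answer = form_data.get("validation_answer", "")
    return answer.strip().lower() == VALIDATION_ANSWER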

One drawback to be taken into consideration is that any validation requiring an additional form field may become a nuisance, especially to regular posters. One self-published article noted a decrease in the number of comments once such validation was in place.[6] It can be difficult to explain to less technical users why they need to complete the step, and it can cause them some confusion. The technique may also cause problems for users who struggle to interpret the question (for example, because of language issues).

Disallowing links in posts

There is negligible gain from spam that does not contain links, so virtually all spam posts contain links, usually an excessive number of them. It is therefore reasonably safe to require a Turing test only when a post contains links and to let all other posts through: this makes the site an uninteresting spam target and forces the spammer to adjust their tools to bypass the link filter (for example, by writing "www dot google dot com / .."). While this is highly effective, spammers do frequently send gibberish posts (such as "ajliabisadf ljibia aeriqoj") to probe for a moderation system. These gibberish posts will not be labelled as spam and, although they bring the spammer no real benefit, will still clog up a blog's comments section.

Garbage submissions may, however, also result from poorly coded spam bots that do not parse the website's HTML form fields first but instead send generic POST requests against pages ("comments=", "message=", etc.) in the hope of triggering the desired action. It sometimes happens that a "content" or "forum_post" POST variable is set and received by the blog or forum software while other expected fields are missing, in which case the software rejects the submission and it never appears as a comment.
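A sketch of the link-gated challenge described above follows; the pattern also tries to catch simple obfuscations such as "www dot example dot com", and the return values are placeholders for whatever the blog software does with accepted and challenged posts:

import re

LINK_PATTERN = re.compile(r"https?://|www\.|\bdot\s+(com|net|org)\b", re.IGNORECASE)

def handle_submission(comment_text):
    if LINK_PATTERN.search(comment_text):
        return "challenge"   # contains links: require a CAPTCHA before accepting
    return "accept"          # link-free comments pass straight through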

Redirects

Instead of displaying a direct hyperlink submitted by a visitor, a web application could display a link to a script on its own website that redirects to the correct URL. This will not prevent all spam, since spammers do not always check for link redirection (see the same criticisms of nofollow above), but it effectively prevents the links from increasing the spammer's PageRank, just as rel="nofollow" does. An added benefit is that the redirection script can count how many people visit external URLs, although it increases the load on the site (usually by only a small amount).

Redirects should be server-side to avoid accessibility issues related to client-side redirects. This can be done via the .htaccess file in Apache.
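A minimal sketch of such a counting redirect script is shown below, using Flask purely as an assumed framework; it issues a server-side 302 redirect as recommended above and keeps a simple in-memory visit counter:

from collections import Counter
from flask import Flask, abort, redirect, request

app = Flask(__name__)
click_counts = Counter()   # in-memory counter, purely for illustration

@app.route("/out")
def outbound():
    target = request.args.get("url", "")
    if not target.startswith(("http://", "https://")):
        abort(400)                      # refuse anything that is not an absolute URL
    click_counts[target] += 1           # count visits to the external URL
    return redirect(target, code=302)   # server-side redirect, no client-side tricks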

Another way of preventing PageRank leakage is to make use of public redirection or dereferral services such as TinyURL. For example,

<a href="http://example.com/alias_of_target" rel="nofollow">Link</a>

where 'alias_of_target' is the alias of the target address.

Note, however, that this prevents users from being able to view the target of a link before clicking it, thus interfering with their ability to ignore websites they know to be spam. TinyURL now offers a preview feature to help avoid this situation.

Distributed approaches

Distributed approaches to addressing link spam are relatively new. One of the shortcomings of link spam filters is that most sites receive only one link from each domain that is running a spam campaign. If the spammer varies IP addresses, there is little to no distinguishable pattern left on the vandalized site. The pattern, however, is visible across the thousands of sites that were hit quickly with the same links.

A distributed approach, such as the free LinkSleeve,[7] uses XML-RPC to communicate between the various server applications (such as blogs, guestbooks, forums, and wikis) and the filter server, in this case LinkSleeve. The posted data is stripped of URLs and each URL is checked against URLs recently submitted across the web. If a threshold is exceeded, a "reject" response is returned and the comment, message, or posting is discarded; otherwise, an "accept" message is sent and the blog software acts accordingly.
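A sketch of calling such a filter server over XML-RPC from Python is shown below. The LinkSleeve endpoint and method name are given as the service has historically documented them, but they should be treated as assumptions and verified before use:

from xmlrpc.client import ServerProxy

def is_spam_according_to_linksleeve(comment_text):
    proxy = ServerProxy("http://www.linksleeve.org/slv.php")   # assumed endpoint
    # LinkSleeve has historically returned 1 for "accept" and 0 for "reject".
    return proxy.slv(comment_text) == 0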

A more robust distributed approach is Akismet, which takes an approach similar to LinkSleeve's but uses API keys to assign trust to nodes and has wider distribution as a result of being bundled with the 2.0 release of WordPress.[8] Its developers claim over 140,000 blogs contributing to the system. Akismet libraries have been implemented for Java, Python, Ruby, and PHP, but its adoption may be hindered by its commercial use restrictions. In 2008, Six Apart therefore released a beta version of its TypePad AntiSpam software, which is compatible with Akismet but free of the latter's commercial use restrictions.

Project Honey Pot has also begun tracking comment spammers. The Project uses its network of thousands of traps installed in over one hundred countries to watch what comment-spamming web robots post to blogs and forums. Data is then published on the top countries for comment spamming, as well as the top keywords and URLs being promoted. The Project's data is made available for blocking known comment spammers through http:BL, and various plugins have been developed to take advantage of the http:BL API.
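An http:BL lookup is an ordinary DNS query, so a sketch needs only the standard library. The query format (access key plus reversed IP against dnsbl.httpbl.org) and the meaning of the response octets follow Project Honey Pot's published description, but both should be checked against the current API documentation before relying on them:

import socket

def is_known_comment_spammer(ip_address, access_key):
    reversed_ip = ".".join(reversed(ip_address.split(".")))
    query = f"{access_key}.{reversed_ip}.dnsbl.httpbl.org"
    try:
        answer = socket.gethostbyname(query)
    except socket.gaierror:
        return False                              # not listed in http:BL
    _, days, threat_score, visitor_type = (int(octet) for octet in answer.split("."))
    return bool(visitor_type & 4)                 # bit 2 set: known comment spammer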

RSS feed monitoring

Some wiki and blog software provides an RSS feed of recent changes or comments. If a website owner or moderator adds that feed to their news reader, they can usually identify and remove offending spam quickly (by routinely monitoring and refreshing the feed) before other viewers see it.

Response tokens

Another filter available to webmasters is to add a hidden variable to their comment form containing a session token which uniquely identifies the instance of the form for that page. The primary protection afforded by this mechanism comes from enforcing a one-to-one correspondence between each request to get the form and each request to submit it. This is impossible to do with IP addresses, since they are shared by users behind a proxy, firewall, or NAT (e.g., multiple users sitting in the same internet cafe, library, senior citizens' center, managed care home, or club) and may change frequently, even between related requests (e.g. AOL and other enterprise-scale proxies, or anonymizing services such as Tor). When the form is eventually submitted, the server can use the token to validate the post (and at least infer that the page was loaded by the viewer). If the token is unrecognised, the server can send back the form along with a new token, requiring the user to resubmit. A duplicate token with duplicate content can safely be silently discarded. Additionally, spammers may not actually load the comment form for an entry; inserting a unique code for each request into the comment form and verifying it on receipt of the HTTP POST will significantly increase the number of steps required to spam multiple entries.[2]
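A minimal sketch of issuing and validating such a token follows; the in-memory set stands in for whatever session or database storage the blog software actually uses, and the field name "token" is a placeholder:

import secrets

issued_tokens = set()   # stands in for real session or database storage

def issue_token():
    # Called when rendering the comment form; the token goes in a hidden field.
    token = secrets.token_urlsafe(32)
    issued_tokens.add(token)
    return token

def validate_submission(form_data):
    token = form_data.get("token", "")
    if token not in issued_tokens:
        return False                 # unknown token: resend the form with a new one
    issued_tokens.discard(token)     # enforce one-to-one use of each issued token
    return True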

Given a valid token, the server can then flag as suspicious, for example, postings that use different IP addresses for loading the comment form and for the actual submission, many postings all using the same IP address, or postings that took unusually short or long periods of time to compose. These can then be subjected to additional scrutiny, such as challenging the poster with a CAPTCHA, queuing for human review, or outright rejection.

This method is effective against spammers who spoof their IP address in an attempt to conceal their identities or to appear to be many more distinct users than the number of IP addresses simultaneously under their control, since the token can only be returned if it was received by the spammer in the first place. It has been suggested that flagging posts based on changing IP addresses is effective against spammers abusing the distributed anonymous proxy Tor.[2]

AJAX

Some blog software, such as Typo, allows the blog administrator to accept only comments submitted via AJAX XMLHttpRequests and to discard regular form POST requests. This causes the accessibility problems typical of AJAX-only applications, such as failing in browsers without JavaScript support or where it has been explicitly disabled.
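On the server side, one common (though easily spoofed) way to distinguish XMLHttpRequest submissions is the X-Requested-With header that most JavaScript libraries set; the sketch below assumes the request headers are available as a dictionary:

def is_ajax_submission(request_headers):
    # Most JavaScript libraries send "X-Requested-With: XMLHttpRequest".
    return request_headers.get("X-Requested-With", "").lower() == "xmlhttprequest"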

Although this technique largely prevents spam, it is a form of security by obscurity and can easily be defeated if it becomes popular enough, since it essentially is just a different encoding of the same form data.

See also

References

External links