Spam in Chinese is problematic for traditional content-filtering anti-spam engines for several reasons:
- Chinese characters are “double-byte” in legacy encodings such as GB2312 and Big5, as opposed to the “single-byte” characters of most Western languages. One byte can represent only 256 distinct values, which isn’t nearly enough for a script with thousands of characters, so each character needs a second byte. Most content-filters were designed to work on single-byte text, and choke when it comes to double-byte.
- There are no spaces between words in Chinese. A word may be made up of several Chinese characters, but those same characters can also combine with their neighbors to form other words with different meanings. As a result, a spam filter may “read into” a message certain phrases that were never intended. A reader of Chinese will figure out the meaning based on the context; content-based spam filters with dictionaries of good & bad words are not that smart.
- Chinese can be written vertically, unlike most other languages, which are written horizontally. Content-filters are typically designed to scan words from left to right, and then down. Vertical writing will simply appear as gibberish to a content-filter that is scanning it left to right. (I won’t even get into the right-to-left versus left-to-right issue, since Hebrew and Arabic are written right to left…another thing altogether.)
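To make those three points a bit more concrete, here is a minimal Python sketch. It is a toy and not how any real anti-spam engine works: the helper names, the sample message and the keyword are all made up for illustration, and the vertical layout is simulated by arranging a plain-text message in columns that read top to bottom, starting from the right.

```python
# Toy illustration only: every name and sample string here is hypothetical.

def bytes_per_char(ch: str, encoding: str) -> int:
    """Number of bytes one character occupies in the given encoding."""
    return len(ch.encode(encoding))


def naive_match(text: str, keyword: str) -> bool:
    """Dictionary-style check: is the keyword anywhere in the text?"""
    return keyword in text


def to_vertical(text: str, height: int) -> str:
    """Lay text out in columns read top to bottom, rightmost column first."""
    columns = [text[i:i + height] for i in range(0, len(text), height)]
    # Pad the last column with ideographic spaces so every column has `height` cells.
    columns = [col.ljust(height, "\u3000") for col in columns]
    rows = []
    for r in range(height):
        # The first column of the message ends up at the right-hand edge.
        rows.append("".join(col[r] for col in reversed(columns)))
    return "\n".join(rows)


def line_scan(message: str, keyword: str) -> bool:
    """Scan left to right, one line at a time, the way a simple filter would."""
    return any(keyword in line for line in message.splitlines())


SAMPLE = "代开发票请致电"   # made-up sample text, not taken from the actual outbreak
KEYWORD = "发票"            # "invoice/receipt", a made-up dictionary entry

if __name__ == "__main__":
    # 1. Double-byte: one Chinese character takes two bytes in GB2312.
    print(bytes_per_char("a", "ascii"), bytes_per_char("中", "gb2312"))

    # 2. No spaces: "发展中国家" ("developing countries") contains the
    #    substring "中国" ("China") even though it is not a word in that phrase.
    print(naive_match("发展中国家", "中国"))

    # 3. Vertical layout: the keyword is found in the horizontal message
    #    but not when the same message is laid out in columns.
    vertical = to_vertical(SAMPLE, height=4)
    print(vertical)
    print(line_scan(SAMPLE, KEYWORD), line_scan(vertical, KEYWORD))
```

The point of the third check is that nothing about the message changed except its layout, yet a scanner that only reads along rows no longer sees the keyword.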
A while back, Commtouch’s CTO Amir Lev wrote a paper for Virus Bulletin that delves into the issue of international spam, and how different languages and even cultures affect spam filtering around the globe. That was nearly two years ago, and at that time, he wrote that:
It is worth mentioning that Japanese and Chinese can also have a vertical orientation; however this layout is typically not used for computers, since it is not practical.
At that time, in 2006, we had not seen spam written vertically. Now I may be getting paranoid, but are spammers reading our old, esoteric journal articles for ideas? Because… this week Commtouch has identified an outbreak of vertical Chinese spam!
Check out this example – the entire message is written vertically, and that number running down the right-hand edge is the business’s phone number. BTW that’s another example of how spam varies around the world – most western spammers wouldn’t dream of including a phone number (they’d start getting calls from those boiler room telemarketers all the time, oh yeah, and perhaps from the FBI…)
By the way, what are they selling? I don’t know Chinese, so I checked with one of our BizDev Asia representatives, and this is the response I got: “oh you know, the usual stuff, nothing new here, receipts, customs, import, export…” What, no Viagra?
Incidentally, Commtouch typically has great results filtering spam in Asian languages, since the patented RPD technology is language- and content-independent. Where some anti-spam technologies have rooms full of language experts sifting through piles and piles of spam, or generating dictionaries of “spammy” words for every language in the world, RPD is based on identifying recurring patterns in bulk-sent email, regardless of the language of the message. This ability to excel at filtering in multiple languages has served Commtouch well, netting the company many partners throughout Asia.
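For a rough feel of what “content-independent” can mean, here is a deliberately simplistic Python sketch. To be clear, this is not RPD, whose internals aren’t described here; it is just an assumed toy approach that fingerprints the raw bytes of each message and flags any fingerprint seen many times, without ever decoding or reading the text. The function names and the threshold are hypothetical.

```python
# Toy sketch of language-independent bulk detection. NOT Commtouch's RPD:
# exact-hash fingerprinting is an assumption chosen for illustration.
import hashlib
from collections import Counter
from typing import Iterable, Set


def fingerprint(raw: bytes) -> str:
    """Hash the raw bytes; the text is never decoded or interpreted."""
    return hashlib.sha256(raw).hexdigest()


def bulk_fingerprints(messages: Iterable[bytes], threshold: int) -> Set[str]:
    """Return fingerprints that occur at least `threshold` times."""
    counts = Counter(fingerprint(m) for m in messages)
    return {fp for fp, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    spam = "代开发票请致电".encode("utf-8")   # made-up sample, sent in bulk
    ham = "你好".encode("utf-8")              # a one-off message
    traffic = [spam] * 5 + [ham]
    flagged = bulk_fingerprints(traffic, threshold=3)
    print(fingerprint(spam) in flagged, fingerprint(ham) in flagged)
```

An exact hash like this would be trivially defeated by spammers who vary each copy slightly, so treat it only as an illustration of why the message’s language, encoding or orientation doesn’t matter to this style of filtering.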