Thursday, December 26, 2013

Baidu IME, Simeji (not really) sending keystrokes to outside servers

UPDATE: Much ado about nothing

The NHK piece I watched this morning turns out to have been total crap and essentially a staged sending of a password. My apologies for being duped. I should have seen through the bullshit, and I'll explain why below.

But first, the security company that was featured has posted a clarification on their blog. Both the Baidu IME and Simeji do cloud conversion of Japanese text, that is, conversion of 2-byte (full-width, 全角文字) hiragana to kanji. To do this, they send all 2-byte text to the cloud, and the company claims the text is sent even when the option is turned off. So, yes, this would seem to be a bug. If the cloud option is off, nothing should be done in the cloud.

However, it does not send standard (single byte) text at all.

Credit card numbers and passwords are always in single-byte text, which means that neither the Baidu IME nor Simeji would have sent them, and the clarification explains just that:
Baidu IME , Simejiでは、全角入力の場合のみ情報が送信されています。クラウド入力Offの場合でも入力文字列を送信していました。パスワードなど半角入力のみの場合は送信されていません。クレジット番号や電話番号も変換しなければ送られません。
The Baidu IME and Simeji only send information when text is entered as two-byte text. This happens even when cloud input is turned off. Passwords that are in single-byte text are not sent. Credit card and phone numbers are also not sent if they do not require conversion. [Emphasis in original]
What this last sentence says is that if you enter the numbers in two-byte text, e.g., １２３４, then they will be sent, since they are a conversion candidate.
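
To make the single-byte versus two-byte distinction concrete, here is a minimal Python sketch. It is only a model of the behavior described in the clarification, not the IME's actual logic; the function name is_conversion_candidate is hypothetical, and the assumption is that any text containing a full-width or wide character counts as conversion input.

```python
import unicodedata


def is_conversion_candidate(text: str) -> bool:
    """Return True if text contains any full-width or wide character.

    Assumption: this stands in for "two-byte text" in the clarification.
    East Asian Width 'F' covers full-width forms such as １２３４, and
    'W' covers hiragana, katakana, and kanji.
    """
    return any(unicodedata.east_asian_width(ch) in ("F", "W") for ch in text)


# Half-width (ASCII) digits, as typed into a password or card number field.
print(is_conversion_candidate("1234"))

# Full-width digits, as typed with the IME in Japanese input mode.
print(is_conversion_candidate("１２３４"))

# NFKC normalization folds full-width digits back to their ASCII forms.
print(unicodedata.normalize("NFKC", "１２３４") == "1234")
```

Under this model, a password typed in half-width characters never qualifies, while the same digits typed in full-width do.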

Here's the thing: no one ever enters passwords or credit card numbers as two-byte text, so the cases where they would have been sent are essentially zero. You cannot enter a credit card number as two-byte text on any e-commerce site (partially for this reason).

The staged password theft

Getting to how all this got started: they used a phrase in Japanese that was essentially "1234 is a password," entered as １２３４はパスワードです。(or something like that). The camera then zooms in on a computer monitor that is capturing and displaying the Baidu IME's communication with the cloud server, and they show 1234 being sent. At the time, I was thinking, "Who uses 2-byte text for passwords?"

And the answer is:

A security company being broadcast on national TV uses 2-byte text for a password when that is the only way to trigger the reaction they want, even when it's a totally impossible situation. The whole thing was staged. NHK is usually much better than this.

Original post

According to NHK [J], the Baidu IME for PCs is sending all keystrokes, plus application and computer information, to outside servers even when the settings are explicitly set to not send information.

It's less clear what is happening with the Simeji Android IME. I haven't used it in years, since adamrocker sold it to Baidu. With Simeji, it could just be that it is set to send and receive data by default, as opposed to always sending data regardless of preference settings. Either way, I recommend avoiding it in favor of the Google Japanese IME (insert NSA joke here).

This is very different from other IMEs

Of course, all IMEs have the ability to send data back home. This allows for new words to be added as they become commonly used and for general improvements to input*. The differences here are major. According to the NHK article, both Google and Just Systems (maker of ATOK) send anonymized usage statistics with explicit permission from the user. That is, sending information is opt-in.

Baidu, on the other hand, does the exact opposite. Data are sent by default. Data are not anonymized. Raw text input is sent. You cannot effectively opt out: even if you do, data are sent anyway.

* I'd argue that Swype, while I initially praised its Japanese input, does not actively collect any information about how Japanese is input. None of these suggestions or bugs has been implemented or fixed. I feel like I basically had to teach Japanese grammar to the Swype keyboard, but with any complex sentence structure, forget about swyping in Japanese.


  1. Ken Yasumoto-Nicolson, December 26, 2013 at 11:20 AM

    Does it include password information, I wonder?

    BTW: Is Japanese Swype quicker (or could it be quicker, if it worked) than 10-key with flicking? Flicking also has the benefit of being able to work with one thumb while hanging onto a train strap.

  2. It's still possible that Simeji was unfairly dragged into this. This morning on NHK, they showed the raw capture of what was being sent from the desktop IME, and it was sending EVERYTHING.

    For me, Swype in Japanese would be quicker than flick because I don't flick. But I bet for a proficient flicker, flicking will be fastest for Japanese.

  3. I knew it! Never trusted any Chinese software.
    Right now I am using Open WNN on my Android and am very happy with it!
    As for desktop, well, I am guilty of using Windows and being spied on...
    But soon, I will probably switch to Linux too. Does anyone know a better IME than mozc? Anthy sucked last time I used it.

  4. Winter holidays are coming up and I finally got some time to do some long-awaited maintenance on my internet setup. When reading this article I wondered if you're running SIP through a Softbank ADSL line?

    I used to run my VoIP line through an OCN Hikari line, and switched to a Softbank line after the missus got an iPhone. I considered the OCN latency abroad bad enough that we probably wouldn't notice the latency difference anyway. After the switch, everything seemed to work okay-ish, up until a month or two ago, when suddenly my SIP phone stopped registering itself. The account works on my LTE Docomo connection, and also on my work's DSL connection. Softbank's routing and Docomo/OCN's routing seem to be different, and surely the 33 hops on the Softbank line compared to 25 hops on OCN don't make it better, but it used to work... Do you know if Softbank is blocking SIP?

    Apart from SIP, I was trying to set up a PPTP connection from work to home. Although I can't confirm it, it seems that the GRE protocol is also being blocked on Softbank's DSL line. As a workaround I created an SSTP connection that runs on port 443, making it difficult to block. Well, the Internet here is definitely something I won't be adjusting to...

  5. SIP is with OCN as ISP on NTT East FLET'S Hikari (IPv4 fiber to the building) with a Cisco phone adapter, and then my Xperia A with CSipSimple on either wifi or LTE. I have an 03 number ("hikari denwa") and a US number with Callcentric.

    Actually, running through an Android SIP client adds noticeable latency, and I see it on both wifi and LTE, so it's not just the mobile connection. The problem is that the quality is highly variable on mobile, as you can imagine. I don't know if you would want to rely on SIP over mobile.

  6. Joost van Steenderen, December 27, 2013 at 1:18 PM

    Thanks for explaining your setup. I definitely don't prefer running SIP over LTE, but at the moment I don't have any other options, as the setup doesn't work on Softbank's ADSL. Guess I'll have to switch back to fiber after this contract is finished.

  7. Once you learn flicking, it works really well. But the first month of switching was painful.
