Link: GPT-4o’s Chinese token-training data is polluted by spam and porn websites

Of the 100 results, only three of them are common enough to be used in everyday conversations; everything else consisted of words and expressions used specifically in the contexts of either gambling or pornography. The longest token, lasting 10.5 Chinese characters, literally means “_free Japanese porn video to watch.” Oops.

--

Yoooo, this is a quick note on a link that made me go, WTF? Find all past links here.