Characters, data, stable power may give China the edge over US in AI stakes
By Xin Zhou, Ben Armour  |  May 23, 2024
China and the US are locked head-to-head in a struggle for primacy in global AI. Many factors will come into play in deciding the outcome, but China’s intrinsic advantages in data, energy, and the nature of the Chinese language itself may prove decisive, two editors of The Yuan argue.

HONG KONG/LONDON - China and the United States are the unquestioned top dogs in the global artificial intelligence (AI) arena, with the scrap between them growing fiercer by the day. Seemingly innocuous forces are, however, quietly tipping the scales. The idiosyncrasies of the Chinese language and China’s edge in data and in abundant, stable power resources are slowly but steadily forging the sword for the country’s eventual triumph in the AI melee.

These three elements - the first an heirloom of cultural wisdom, the other two a fertile bed to spur the growth of AI - are mighty weapons for China to wield in the coming AI contest.

China’s data hoard: unearthing cultural gems

China has the world’s largest cohort of internet users, and the data traffic they generate each day is a river in full spate, much like the vast streams whose hydropower makes up almost one-fifth (and counting) of total national energy generation. Together these rich sources - endless data and boundless power - present a veritable cornucopia to nourish AI development. The richness and complexity of Chinese information provide a huge space for AI learning unrivaled in its length and breadth. 

Chinese, an antediluvian language with profound cultural connotations and a unique evolutionary trajectory, also holds signal advantages in natural language processing:

“Text normalization is a method for standardizing text to prepare it for the tokenization, vectorization and classification steps. With [English], the first step would be to convert all text to lowercase. Because Chinese characters are not capitalized to begin with, there’s no need for that data cleaning step. Next comes stemming or lemmatization. Compared to English, there is also no concept of a stem in Chinese. Therefor

