My thoughts on the issue of English-centric AI

Viktoriya Tigipko is one of the most recognized names in the Eastern European VC community and is a native of Ukraine.
She has run TA Ventures, a pre-seed and seed stage VC, since 2010. Additionally, she founded iClub (an angel network) and WTech (a community for women in tech), and is Chair of the Board at the Ukrainian Startup Fund.
Guest Author: Viktoriya Tigipko

Contents

Most of the information that AI is being trained on is in English
Products based on these AI models will also be more English-focused
The problem is that the business case is weak
So what can be done?
Wrapping up

I recently read a terrific article by fellow Ukrainian Artur Kulian and wanted to add my two cents. The article is “Why is AI English-centric, and why is it a very big problem?”

The gist of the article is that English-centric AI engines could have very harmful effects on culture.

By ‘AI engine’ we are referring to things like ChatGPT (OpenAI), Llama (Meta), Gemini (Google), and Claude (Anthropic).

And after I read it I was thinking to myself: “Wow! He is right! As time goes on, this is going to be more and more problematic!”

Most of the information that AI is being trained on is in English

You see, AI is trained on loads and loads of information. And as Artur mentions, most of that information is written in English.

As of August 2024, ~50% of all websites are written in English, with Spanish a distant second at just 5.9%.

This means that AI will be a lot dumber on any topic whose source material is not written in English. And the reality is that many cultures around the world have a lot of accumulated information and knowledge that is not written in English.

Beyond being ‘dumber’, in many cases the AI will simply spit out incorrect information. So people using these models in other languages will be at a pretty big disadvantage.

Products based on these AI models will also be more English-focused

There are startups all over the place building products based on these models: tools for both consumers and businesses, such as SaaS.

Just have a look at how many startups in the latest Y Combinator batches are AI-driven companies.

If a model performs much better in English, then of course the companies built on top of it will have a major advantage if they focus on the English-speaking market.

And since most of the top startups these days are leveraging AI, this raises some important questions to ponder.

For example, does this mean that tech startups focusing on the English-speaking market will have a major advantage for the foreseeable future?

And if so, does that mean we can expect a higher and higher percentage of the most innovative companies to come from markets that focus on English speakers?

The problem is that the business case is weak

One thing you might be thinking to yourself is… “Well, if the models are fed lots of information in English, then can’t you just translate all of that into every other language? Then the information available in those other languages would be equivalent.”

And you’d be right. You could hypothetically do this.

But there are a few problems with that.

First, you’d need to worry about the quality of the translation. Machine translation can often lose nuances, idioms and cultural references.

Second, computational cost. Translating a massive dataset is very resource-intensive and often the business case just isn’t there.

Plus, there are a number of other barriers. So the reality is that many of the large English-language datasets fed to these models are simply never translated into most other languages.

So what can be done?

Well there are probably some folks out there with a lot more expertise in this area than me, but let me have a stab at some of the things that I think would make sense from my perspective.

First, we need to collect more large, high-quality datasets in other languages. For example, I am from Ukraine.

What can be done to ensure that some of the most valuable datasets that exist only in Ukrainian are included in the popular models?

And how do we then ensure that there is a business case around that? Because when things are ‘for profit’ they tend to happen faster and at a larger scale.

Second, I think open source is key. Llama is openly available (Meta releases its weights under its own license), whereas engines like ChatGPT and Gemini are not.

The teams behind these open source models should cooperate with contributors around the world who have access to these localized datasets, and incorporate them.

I think of this a bit like what Wikipedia has achieved. Wikipedia is in 300+ languages and in many of those languages it is very comprehensive.

How did it achieve this?

Simple. By incorporating lots of contributors. And by ‘lots’ I mean that there are currently over 47 million Wikipedia accounts, of which ~113k have made a contribution in the past month.
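To put those two figures in perspective, here is a rough back-of-the-envelope calculation in Python. It uses only the numbers quoted above (treating "over 47 million" as exactly 47 million, which is an assumption), and the function name is just for illustration.

```python
# Figures quoted in the paragraph above (approximate, as stated in the article).
total_accounts = 47_000_000    # "over 47 million Wikipedia accounts" (assumed exact here)
active_last_month = 113_000    # "~113k" made a contribution in the past month


def active_share(active: int, total: int) -> float:
    """Return active contributors as a fraction of all accounts."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


share = active_share(active_last_month, total_accounts)
print(f"Active share of accounts: {share:.2%}")
print(f"Roughly 1 in {round(1 / share)} accounts contributed last month")
```

On these numbers, only about 0.24% of accounts (roughly 1 in 416) contributed in the past month. So what matters is probably less the size of the total user base and more having a steady core of active contributors in each language.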

Wrapping up

What I do not want to see is certain countries beginning to lag behind because of this language disadvantage. Because I can totally see how this problem will compound over time.

Rather, I’d love to see more action taken now to ensure there is a level playing field.

At TAV we invest in startups accelerated by AI. I hesitate to say ‘AI startups’ because at this point I think pretty much all startups can and should be accelerated by AI. And if they aren’t, then they are at high risk of being out of business in the future.

We are particularly interested in verticals that can be accelerated by AI. Things like healthcare & biotechnology, autonomous systems and robotics, natural language processing and conversational AI, and fintech (fraud detection, trading, etc.).

The key elements within these startups that we look for are things like solid adoption, data availability, and potential to disrupt the market.

It is also important, in my view, that these startups keep a close eye on some of the challenges that AI faces. Things like data privacy, security and ethical issues.

For example, we like to see that companies, even in the early stages, have and adhere to an ethical code of doing business.

If you’re interested in TA Ventures then please don’t hesitate to visit our website and reach out to the person that you feel is most appropriate to your inquiry: https://taventures.vc/team/.
