Why CD Projekt used AI to localise Cyberpunk 2077

There are a lot of big names associated with the upcoming Cyberpunk 2077.

The eagerly anticipated sci-fi RPG is being made by Polish studio CD Projekt of The Witcher 3: Wild Hunt fame. The Matrix and John Wick star Keanu Reeves plays Johnny Silverhand in the game, while punk legends Refused and pop sensation Grimes are among the acts featured on the soundtrack.

One name you’d be forgiven for not recognising is Jali Research, a facial animation company based in Toronto, Canada, that has helped CD Projekt with the localisation of Cyberpunk 2077, in a manner of speaking.

The outfit spun out of the University of Toronto, founded by PhD student Pif Edwards along with Academy Award-winning animator and director Chris Landreth and professors Eugene Fiume and Karan Singh.

Edwards was pursuing a PhD in computer science and initially wanted to focus on facial animation, but he ended up studying speech because, as he puts it, it “turns out when people are expressing, they’re almost always talking.” Unhappy with the tools for handling speech and animation available at the time, he decided to build his own.

CD Projekt turned to Jali after reading a paper the Canadian outfit submitted to SIGGRAPH, the annual computer graphics conference, in 2016. The paper focused on procedural speech.

For 2015’s The Witcher 3, CD Projekt used algorithms to handle facial animation for voiceovers in eight different languages. This was successful up to a point, but for Cyberpunk 2077 the Polish firm had loftier goals: it wanted lip-syncing for ten languages: English, German, Spanish, French, Italian, Polish, Brazilian Portuguese, Russian, Mandarin and Japanese.

For Cyberpunk 2077, CD Projekt and Jali used a combination of machine learning and rule-based artificial intelligence. The former drives what Jali calls the ‘alignment’ phase, a process that figures out what sounds are actually being made when someone speaks.

“Let’s say we have an audio file of someone saying ‘Hello’,” Jali co-founder and CTO Pif Edwards explains.

“Where does the ‘H’ start and stop? Then where are the ‘e’, ‘l’ and ‘o’ sounds? We mark that information up for a specific language, then train a machine-learning process using this data to recognise what sounds are being made.

“After a while, you can give it a brand new line of dialogue that it has never seen before and it will predict where the boundaries between sounds are and how long each of these phonemes is.”
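Jali’s actual pipeline isn’t public, but the alignment output Edwards describes is essentially a list of phoneme intervals with start and end times. Here’s a minimal Python sketch of that idea, just to make the ‘Hello’ example concrete; the ARPAbet phoneme symbols, timings and helper names are illustrative assumptions, not Jali’s code or API:

```python
from dataclasses import dataclass

@dataclass
class PhonemeInterval:
    phoneme: str   # ARPAbet symbol, e.g. "HH", "EH", "L", "OW"
    start: float   # onset within the audio clip, in seconds
    end: float     # offset within the audio clip, in seconds

# Hand-labelled alignment for a clip of someone saying "Hello" --
# the kind of per-language training data Edwards describes marking up.
# The boundaries here are invented for illustration.
hello_alignment = [
    PhonemeInterval("HH", 0.00, 0.08),
    PhonemeInterval("EH", 0.08, 0.21),
    PhonemeInterval("L",  0.21, 0.33),
    PhonemeInterval("OW", 0.33, 0.52),
]

def phoneme_durations(alignment: list[PhonemeInterval]) -> dict[str, float]:
    """Return how long each phoneme is held -- the timing signal a
    lip-sync system would map onto mouth shapes (visemes)."""
    return {iv.phoneme: round(iv.end - iv.start, 3) for iv in alignment}

print(phoneme_durations(hello_alignment))
# {'HH': 0.08, 'EH': 0.13, 'L': 0.12, 'OW': 0.19}
```

Once a model is trained on enough of this labelled data, it can produce the same interval structure for a brand new line of dialogue, which is the prediction step Edwards describes above.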

You can read the full piece on GamesIndustry.biz
