The RetroBeat: Dragon Quest XI finally turns me into a fan

I love Japanese role-playing games like the Final Fantasy and Persona series, but I just never got into Dragon Quest. But I’ve finally corrected this oversight.

I had a few false starts with the classic franchise. When a remake of Dragon Quest IV came out for the Nintendo DS in 2008, I picked it up. This was right around the time I was marathoning my way through Final Fantasy, playing the first 10 games in the franchise in 6 months. After so much Final Fantasy, I thought I’d enjoy checking out the other major JRPG institution. But I barely got anywhere with Dragon Quest IV. I don’t have a great reason; the game just didn’t grab me. If there was one big thing, I just couldn’t get into the first-person battles. Not seeing my party members on the battlefield just felt jarring.

Not long after that, Dragon Quest IX came out for the DS in 2010. This time, Square Enix made a Dragon Quest for the DS from scratch. And I actually played this one for a few hours. This time, I could actually see my characters in battle, swinging their little swords and casting their spells. But Dragon Quest IX was a more open, less narrative-focused RPG. That was a pretty big jump for me, especially after playing all of those story and character-heavy Final Fantasy games.

Fast-forward to 10 years later, and my JRPG experience has extended beyond the confines of Final Fantasy. I have now played games like Suikoden 2, The Legend of Heroes: Trails in the Sky, and Phantasy Star IV. My appreciation for the genre extends beyond comparisons to Final Fantasy.

So when everyone started raving about Dragon Quest XI, I became curious. But it still took me a while to jump in. After my past experience with the franchise, I had convinced myself that I just wasn’t a “Dragon Quest guy.” But when Dragon Quest XI came to Switch, and I finally had some free time after the rush of holiday and early-year game reviews, I decided to give it a go.

Above: Gotta kill slimes in a Dragon Quest.

Image Credit: Square Enix

Questing for glory

I’m so glad that I did. About 75 hours later, Dragon Quest XI is now one of my favorite RPGs of all time. It was everything that people said it was: charming, beautiful, and delightfully old-school with a modern approach. But Dragon Quest XI was also many things I wasn’t expecting. It has the best voice acting I’ve ever heard in a JRPG, avoiding the obnoxious anime tropes that so many other games of this kind go for. And while Dragon Quest XI uses a simple, traditional turn-based battle system, it still has a lot of depth. Every fight feels meaningful. You can grind your way through many JRPGs by just smashing a single button, telling every party member to use a basic attack every turn. Dragon Quest XI had me using a much wider set of abilities and spells than a normal JRPG.

Now, I’m not here to review Dragon Quest XI (managing editor Jason Wilson already has, and you should read it). Suffice it to say that it’s a fantastic JRPG and you need to play it if you have any affinity for the genre.

But aside from the tremendous experience of playing that one game, I finally feel ready to dive into the rest of this franchise. Already, I’m eyeing the recent Switch port of Dragon Quest III, which it looks like many people think is the best of the NES-era games in the series. I’m also excited to try some of the games that I’m hearing are among the best in the franchise, like Dragon Quest V and Dragon Quest VIII.

It feels like a whole new JRPG world is open to me. I’m not sure that I’m going to go back and play every Dragon Quest game, like I once did for Final Fantasy. But after beating Dragon Quest XI, all I want to do is get deeper into the franchise.

I am now a Dragon Quest guy.

The RetroBeat is a weekly column that looks at gaming’s past, diving into classics, new retro titles, or looking at how old favorites — and their design techniques — inspire today’s market and experiences. If you have any retro-themed projects or scoops you’d like to send my way, please contact me.

Facebook’s voice synthesis AI generates speech in 500 milliseconds

Facebook today unveiled a highly efficient, AI text-to-speech (TTS) system that can be hosted in real time using regular processors. It’s currently powering Portal, the company’s brand of smart displays, and it’s available as a service for other apps, like VR, internally at Facebook.

In tandem with a new data collection approach, which leverages a language model for curation, Facebook says the system — which produces a second of audio in 500 milliseconds — enabled it to create a British-accented voice in six months as opposed to over a year for previous voices.

Most modern AI TTS systems require graphics cards, field-programmable gate arrays (FPGAs), or custom-designed AI chips like Google’s tensor processing units (TPUs) to run, train, or both. For instance, a recently detailed Google AI system was trained across 32 TPUs in parallel. Synthesizing a single second of humanlike audio can require outputting as many as 24,000 samples — sometimes even more. And this can be expensive; Google’s latest-generation TPUs cost between $2.40 and $8 per hour in Google Cloud Platform.

TTS systems like Facebook’s promise to deliver high-quality voices without the need for specialized hardware. In fact, Facebook says its system attained a 160 times speedup compared with a baseline, making it fit for computationally constrained devices. Here’s how it sounds:

VB Transform 2020 Online – July 15-17: Join leading AI executives at the AI event of the year.
Register today and save 30% off digital access passes.

“The system … will play an important role in creating and scaling new voice applications that sound more human and expressive,” the company said in a statement. “We’re excited to provide higher-quality audio … so that we can more efficiently continue to bring voice interactions to everyone in our community.”


Facebook’s system has four parts, each of which focuses on a different aspect of speech: a linguistic front-end, a prosody model, an acoustic model, and a neural vocoder.

The front-end converts text into a sequence of linguistic features, such as sentence type and phonemes (units of sound that distinguish one word from another in a language, like p, b, d, and t in the English words pad, pat, bad, and bat). As for the prosody model, it draws on the linguistic features, style, speaker, and language embeddings (i.e., numerical representations that the model can interpret) to predict sentences’ speech-level rhythms and their frame-level fundamental frequencies. (“Frame” refers to a window of time, while “frequency” refers to melody.)

Style embeddings let the system create new voices including “assistant,” “soft,” “fast,” “projected,” and “formal” using only a small amount of additional data on top of an existing training set.  Only 30 to 60 minutes of data is required for each style, claims Facebook — an order of magnitude less than the “hours” of recordings a similar Amazon TTS system takes to produce new styles.

Facebook’s acoustic model leverages a conditional architecture to make predictions based on spectral inputs, or specific frequency-based features. This enables it to focus on information packed into neighboring frames and train a lighter and smaller vocoder, which consists of two components. The first is a submodel that upsamples (i.e., expands) the input feature encodings from frame rate (187 predictions per second) to sample rate (24,000 predictions per second). A second submodel similar to DeepMind’s WaveRNN speech synthesis algorithm generates audio a sample at a time at a rate of 24,000 samples per second.
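To make the frame-rate-to-sample-rate step concrete, here is a minimal sketch in Python. It is not Facebook's upsampler, which is a learned neural submodel. It swaps in plain nearest-neighbor repetition, and the function name `upsample_nearest` is hypothetical. It only shows the bookkeeping involved in stretching 187 feature frames per second into 24,000 per-sample conditioning vectors.

```python
def upsample_nearest(frames, frame_rate, sample_rate):
    """Stretch frame-level features to one entry per audio sample.

    Assumption: nearest-neighbor repetition stands in for the learned
    upsampling network described in the article.
    """
    n_samples = len(frames) * sample_rate // frame_rate
    last = len(frames) - 1
    # Sample n falls inside frame floor(n * frame_rate / sample_rate).
    return [frames[min(n * frame_rate // sample_rate, last)]
            for n in range(n_samples)]


# One second of placeholder "features": frame indices 0..186.
one_second = list(range(187))
per_sample = upsample_nearest(one_second, frame_rate=187, sample_rate=24000)
print(len(per_sample))  # one conditioning entry per output sample
```

Because 24,000 is not an exact multiple of 187, each frame ends up covering either 128 or 129 samples, which is why the index is computed per sample rather than by a fixed repeat count.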

Performance boost

The vocoder’s autoregressive nature — that is, its requirement that samples be synthesized in sequential order — makes real-time voice synthesis a major challenge. Case in point: An early version of the TTS system took 80 seconds to generate just one second of audio.
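The numbers in the article line up if the 160 times speedup is measured against this early 80-second version. That is an assumption here, since the article does not name the baseline explicitly. A quick check in Python, using a hypothetical helper `real_time_factor`:

```python
def real_time_factor(compute_seconds, audio_seconds):
    """Seconds of compute per second of audio; below 1.0 is faster than real time."""
    return compute_seconds / audio_seconds


early_rtf = real_time_factor(80.0, 1.0)   # early version: 80 s per 1 s of audio
speedup = 160                             # figure quoted by Facebook
optimized_rtf = early_rtf / speedup       # assumed baseline: the early version
optimized_ms = optimized_rtf * 1000       # milliseconds per second of audio

print(optimized_rtf, optimized_ms)
```

Under that assumption, the optimized system needs half a second of compute per second of audio, matching the 500-millisecond figure in the headline.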

The nature of the neural networks at the heart of the system allowed for optimization, fortunately. All models consist of neurons, which are layered, connected functions. Signals from input data travel from layer to layer and slowly “tune” the output by adjusting the strength (weights) of each connection. Neural networks don’t ingest raw pictures, videos, text, or audio, but rather embeddings in the form of multidimensional arrays like scalars (single numbers), vectors (ordered arrays of scalars), and matrices (scalars arranged into one or more columns and one or more rows). A fourth entity type that encapsulates scalars, vectors, and matrices — tensors — adds in descriptions of valid linear transformations (or relations).

With the help of a tool called TorchScript, Facebook engineers migrated from a training-oriented setup in PyTorch, Facebook’s machine learning framework, to a heavily inference-optimized environment. Compiled operators and tensor-level optimizations, including operator fusion and custom operators with approximations for the activation function (mathematical equations that determine the output of a model), led to additional performance gains.

Another technique called unstructured model sparsification reduced the TTS system’s inference complexity, achieving 96% unstructured sparsity (where only 4% of the model’s variables, or parameters, are nonzero) without degrading audio quality. Pairing this with optimized sparse matrix operators on the inference model led to a 5 times speed increase.
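For a sense of what 96% unstructured sparsity means, here is a rough Python sketch. It uses one-shot magnitude pruning, a simpler stand-in for whatever training-time procedure Facebook used, and the function name `prune_by_magnitude` is hypothetical.

```python
def prune_by_magnitude(weights, sparsity):
    """Zero the smallest-magnitude entries until `sparsity` of them are zero.

    Assumption: one-shot magnitude pruning, not Facebook's actual method.
    """
    cells = [(abs(w), r, c)
             for r, row in enumerate(weights)
             for c, w in enumerate(row)]
    cells.sort()                                # smallest magnitudes first
    n_zero = round(len(cells) * sparsity)
    pruned = [list(row) for row in weights]     # leave the input untouched
    for _, r, c in cells[:n_zero]:
        pruned[r][c] = 0.0
    return pruned


# A 10-by-10 toy matrix with distinct magnitudes 1..100 and mixed signs.
toy = [[(r * 10 + c + 1) * (-1) ** c for c in range(10)] for r in range(10)]
sparse = prune_by_magnitude(toy, 0.96)
nonzero = sum(1 for row in sparse for w in row if w != 0)
print(nonzero)  # number of surviving weights
```

"Unstructured" means the survivors can sit anywhere in the matrix, which is what makes them awkward to store and multiply efficiently without specialized sparse operators.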

Blockwise sparsification, where nonzero parameters are restricted to blocks of 16-by-1 and stored in contiguous memory blocks, significantly reduced bandwidth utilization and cache usage. Various custom operators helped attain efficient matrix storage and compute, so that compute was proportional to the number of nonzero blocks in the matrix. And knowledge distillation, a compression technique where a small network (called the student) is taught by a larger trained neural network (called the teacher), was used to train the sparse model, with a denser model as the teacher.
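The idea that compute scales with the number of nonzero blocks can be sketched in a few lines of Python. This is an illustration only: it assumes "16-by-1" means 16 consecutive rows in a single column, it assumes the row count is a multiple of 16, and the names `to_blocks` and `block_matvec` are hypothetical rather than Facebook's custom operators.

```python
BLOCK = 16  # block height; assumed to mean 16 rows by 1 column


def to_blocks(dense):
    """Keep only the 16-by-1 column blocks that contain a nonzero value."""
    blocks = []
    for top in range(0, len(dense), BLOCK):
        for col in range(len(dense[0])):
            values = [dense[r][col] for r in range(top, top + BLOCK)]
            if any(values):
                blocks.append((top, col, values))  # values stay contiguous
    return blocks


def block_matvec(blocks, x, n_rows):
    """Multiply by x, touching only the stored blocks."""
    y = [0.0] * n_rows
    for top, col, values in blocks:
        xc = x[col]                  # one input value is reused 16 times
        for i, v in enumerate(values):
            y[top + i] += v * xc
    return y


# 32-by-4 toy matrix with exactly two nonzero blocks.
dense = [[0.0] * 4 for _ in range(32)]
for r in range(16):
    dense[r][2] = 1.0
    dense[16 + r][0] = 2.0
blocks = to_blocks(dense)
y = block_matvec(blocks, [1.0, 2.0, 3.0, 4.0], n_rows=32)
print(len(blocks), y[0], y[16])
```

The loop in `block_matvec` runs once per stored block, so a matrix with few nonzero blocks costs proportionally less, and each block reads a single input value and 16 contiguous weights, which is friendlier to caches than scattered single entries.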

Finally, Facebook engineers distributed heavy operators over multiple processor cores on the same socket, chiefly by enforcing nonzero blocks to be evenly distributed over the parameter matrix during training and segmenting and distributing matrix multiplication among several cores during inference.
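The segment-and-distribute step can be pictured with a small Python sketch. It splits a matrix-vector product into contiguous row ranges and hands each range to a worker. The names `matvec_rows` and `parallel_matvec` are hypothetical, and Python threads are used purely to show the structure: in standard CPython they will not speed up pure-Python arithmetic, unlike the native multicore operators the article describes.

```python
from concurrent.futures import ThreadPoolExecutor


def matvec_rows(matrix, x, start, stop):
    """Compute the output entries for rows start..stop-1."""
    return [sum(w * xi for w, xi in zip(matrix[r], x))
            for r in range(start, stop)]


def parallel_matvec(matrix, x, n_workers=4):
    """Split the rows into near-equal chunks, one per worker."""
    n_rows = len(matrix)
    step = max(1, -(-n_rows // n_workers))      # ceiling division
    ranges = [(s, min(s + step, n_rows)) for s in range(0, n_rows, step)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda rng: matvec_rows(matrix, x, *rng), ranges))
    return [v for part in parts for v in part]  # stitch chunks back in order


matrix = [[r + c for c in range(3)] for r in range(10)]
x = [1, 2, 3]
result = parallel_matvec(matrix, x, n_workers=4)
print(result)
```

Equal-sized row chunks only balance the load if each chunk holds a similar amount of nonzero work, which is presumably why the nonzero blocks were forced to spread evenly across the parameter matrix during training.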

Data collection

Modern commercial speech synthesis systems like Facebook’s use data sets that often contain 40,000 sentences or more. To collect sufficient training data, the company’s engineers adopted an approach that relies on a corpus of hand-generated speech recordings — utterances — and selects lines from large, unstructured data sets. The data sets are filtered by a language model based on their readability criteria, maximizing the phonetic and prosodic diversity present in the corpus while ensuring the language remains natural and readable.
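One way to picture diversity-driven line selection is a greedy coverage loop, sketched below in Python. Everything here is an assumption for illustration: letter pairs stand in for phonetic units, the language model's readability filter is not modeled at all, and `units` and `select_lines` are hypothetical names, not Facebook's pipeline.

```python
def units(sentence):
    """Crude stand-in for phonetic units: adjacent letter pairs."""
    letters = [ch for ch in sentence.lower() if ch.isalpha()]
    return {a + b for a, b in zip(letters, letters[1:])}


def select_lines(candidates, budget):
    """Greedily pick the line that adds the most unseen units each round."""
    chosen, covered = [], set()
    pool = list(candidates)
    for _ in range(min(budget, len(pool))):
        best = max(pool, key=lambda s: len(units(s) - covered))
        if not units(best) - covered:
            break                     # nothing new left to gain
        chosen.append(best)
        covered |= units(best)
        pool.remove(best)
    return chosen


script = select_lines(["aaaa", "abab", "the quick brown fox"], budget=2)
print(script)
```

A real system would score candidates on actual phoneme and prosody coverage and would discard lines that are unnatural to read aloud before any of this selection happens.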

Facebook says this led to fewer annotations and edits for audio recorded by a professional voice actor, as well as improved overall TTS quality; by automatically identifying script lines from a more diverse corpus, the method let engineers scale to new languages rapidly without relying on hand-generated datasets.

Future work

Facebook next plans to use the TTS system and data collection method to add more accents, dialects, and languages beyond French, German, Italian, and Spanish to its portfolio. It’s also focusing on making the system even lighter and more efficient than it is currently so that it can run on smaller devices, and it’s exploring features to make Portal’s voice respond with different speaking styles based on context.

Last year, Facebook machine learning engineer Parthath Shah told The Telegraph the company was developing technology capable of detecting people’s emotions through voice, preliminarily by having employees and paid volunteers re-enact conversations. Facebook later disputed this report, but the seed of the idea appears to have germinated internally. In early 2019, company researchers published a paper on the topic of producing different contextual voice styles, as well as a paper that explores the idea of building expressive text-to-speech via a technique called joint style analysis.

Here’s a sample:

“For example, when you’re rushing out the door in the morning and need to know the time, your assistant would match your hurried pace,” Facebook proposed. “When you’re in a quiet place and you’re speaking softly, your AI assistant would reply to you in a quiet voice. And later, when it gets noisy in the kitchen, your assistant would switch to a projected voice so you can hear the call from your mom.”

It’s a step toward what Amazon accomplished with Whisper Mode, an Alexa feature that responds to whispered speech by whispering back. Amazon’s assistant also recently gained the ability to detect frustration in a customer’s voice as a result of a mistake it made, and apologetically offer an alternative action (e.g., offer to play a different song). That capability is the fruit of emotion recognition and voice synthesis research begun as far back as 2017.

Beyond Amazon, which offers a range of speaking styles (including a “newscaster” style) in Alexa and its Amazon Polly cloud TTS service, Microsoft recently rolled out new voices in several languages within Azure Cognitive Services. Among them are emotion styles like cheerfulness, empathy, and lyrical, which can be adjusted to express different emotions to fit a given context.

“All these advancements are part of our broader efforts in making systems capable of nuanced, natural speech that fits the content and the situation,” said Facebook. “When combined with our cutting-edge research in empathy and conversational AI, this work will play an important role in building truly intelligent, human-level AI assistants for everyone.”

Turtle Beach sees boost in headset demand as gamers play during the pandemic

If you’ve been playing online games or doing a lot of Zoom calls during the pandemic, chances are you need a good headset. And those aren’t easy to find.

And that’s the opportunity at hand for Turtle Beach, the San Diego, California-based maker of gaming accessories such as headsets, mice, keyboards, and other peripherals.

Turtle Beach CEO Juergen Stark said in an interview with GamesBeat that gamers hunkering down are making sure they have good equipment. Add to that the popularity of battle royale games like Call of Duty: Warzone, which require quality headsets with good microphones, as well as the need to talk to people on business video calls, and you have a kind of perfect storm, Stark said.

Market researcher NPD said March sales grew 12% to a historic level from a year earlier for gaming accessories, and they found that Turtle Beach had the No. 1 headset again in March, with the Xbox One Ear Force Stealth 600 Wireless Headset.

I talked to Stark about this trend after the company reported its results for the first quarter ended March 31. Revenues in the quarter were $35 million, with a net loss of $3.6 million. Stark said that Turtle Beach recently acquired Roccat to boost its accessories business, and the company has been investing heavily in R&D to develop new lines of business.

Still, Stark said that the coronavirus-related growth prompted the company to raise its second-quarter revenue estimate to a range of $42 million to $47 million. The company also raised its full-year revenue estimate to a range of $224 million to $234 million. That means that Stark expects a real boost because of the trends we mentioned, and he expects Turtle Beach to remain the No. 1 player in market share for gaming headsets in the U.S. market.

Here’s an edited transcript of our interview.

Above: Juergen Stark is CEO of Turtle Beach.

Image Credit: Turtle Beach

Juergen Stark: January and February were down year over year for the whole market. The market in March shot up 39% year over year. This is using U.S. console headset sales data. We shot up more than 50%, because we also gained more than 500 basis points of share in March. Demand, in a way we’ve never seen in the past, even during the Fortnite year, just went through the roof.

GamesBeat: For this quarter, was it just the last two weeks of March that were affected by higher demand, or was it more like the whole quarter?

Stark: No, it started in roughly mid-March. It might have been when stay-at-home orders went into effect, basically. It might have been some of the second week of March versus exactly on March 15, but it wasn’t even all of March.

GamesBeat: For the full year, you’re now forecasting better sales as well, right?

Stark: Yeah, largely driven by a significantly increasing Q2. We’ve increased Q2 because the demand we started seeing in March has continued into Q2 here, through today. That doesn’t mean it’s ending today. It just keeps going.

GamesBeat: Is this shifting almost entirely to digital sales for you, versus brick-and-mortar?

Stark: No, it’s still a lot of retail sales. Wal-Mart and Target have largely remained open, and they’re big customers of ours. Best Buy and GameStop, although it took GameStop a little longer, they both shifted to touchless pickup. And then obviously everyone has increased, including our own website, e-commerce sales significantly.

We’re surprised at how quickly and how well retail has adjusted to, in some cases, closed doors and all these social distancing measures. March this year was the highest March sales of console gaming headsets in the U.S. in history, even higher than the March during the Fortnite year, despite all of the constraints at retail.

GamesBeat: It looks like there’s a challenge here because you still have a net loss on the quarter. You’re projecting a net loss for the whole year as well. It feels like that’s a little inconsistent. You have this great demand, but still a forecast for losing money?

Stark: We’re investing about $9 million this year to drive our entry into PC gaming accessories. Expanding the portfolio and driving the Roccat brand. We’re investing for future growth. That’s going very well. If we didn’t have that, then net income would be positive this year. That’s the reason for it. Even now, with the increase, the higher end of our guidance range has us in positive profitability. It’s not inconsistent. It’s like other companies that are investing to grow in the future.

GamesBeat: Is the business still a pretty competitive one?

Stark: It’s always been competitive. We’ve led our category for more than 10 years. We have higher market share than the next three players combined in console headsets. Despite that, we’ve even grown share in March, as I mentioned.

Above: Gaming headsets are in high demand.

Image Credit: Turtle Beach

GamesBeat: Was that referencing U.S. retail?

Stark: Yes, it’s U.S. retail.

GamesBeat: Are you expecting more changes in market share happening in Q2, or for the year?

Stark: Typically we’ve been above 40% in market share for many years. Last year our market share was well into the 40s. We gained some share in March here. I expect that our share performance will be quite good in April. That will probably flow through to what will, again, be good share performance for the year. Whether our share is up or down a few percent for the year, I think the most important thing is that we’re in the 40s, and the next closest competitor is in the high teens. We’re so far ahead of everyone else, whether we go up and down a few percentage points, it affects our revenues, but it doesn’t affect the fact that we’re by far the leading player and have been for more than 10 years.

GamesBeat: Do you feel like people are replacing headsets that didn’t work, or are they just getting them for the first time? Are they using them for work? What’s some of the user feedback you’re hearing?

Stark: It’s a bit hard to judge, but I would say that given the strong increase in demand, it’s coming from all of the above. It’s typically less headsets that don’t work or break. What happens is people upgrading their headsets, getting the next better one. Moving up from earbuds or a passive headset to an amplified headset, or from an amplified headset to a wireless headset. One thing we’ve always done well is–almost in $20 increments, you can get a lot more functionality. The upgrading of headsets among the active installed base of users has always been the primary driver of market sales every year. It’s not new gamers.

Now, in this environment there might actually be some new gamers who are coming in, or new headset users. Nintendo’s Animal Crossing, for example, has voice chat. That could be attracting headset users. The other thing is that we’ve seen guidance for not sharing headsets. It’s also possible that kids in a home where they were sharing a headset are now wanting to get their own headsets. That could be one of the drivers. And then the last thing is, we’ve definitely heard anecdotally that people are actively using our gaming headsets for their at-home learning, videoconferencing with teachers, and working from home.

GamesBeat: When it comes to the kind of gear they’re getting, are microphones a priority as well, versus just headphones?

Stark: We sell headsets, as opposed to headphones. All of our headsets come with mics. They’re being used for two-way communication, which is why they work well with Zoom and that stuff, which has also taken off in the last month and a half here.

The DeanBeat: Supercell CEO’s 10 takeaways from 10 years of mobile games

Supercell has been one of the most successful game companies in history, with games like Clash of Clans and Clash Royale generating billions of dollars in revenue.

Helsinki-based Supercell has had five games played by more than a billion unique players. The company’s founder and CEO, Ilkka Paananen, reflected on its 10-year history in a blog post yesterday. He disclosed that the company has 4.3 billion player accounts. If each player had a Supercell game on three devices, that adds up to about 1.3 billion unique players.

With just 328 employees, Supercell makes a ridiculous $4.76 million per employee in revenue, or 3.96 million users per employee. I would guess that makes it the most capital-efficient company in the game industry. That’s clearly why Tencent invested in the company at a very high valuation. But Supercell doesn’t brag about this side of its business.

“When we started the company, we were inspired by companies like Blizzard, Nintendo, and Pixar,” said Supercell CEO Ilkka Paananen in the post. “All of these companies have been able to create successful entertainment products that are loved by millions all over the world. And most importantly, they have been able to do so consistently over decades, in Nintendo’s case, for more than a hundred years.”

He wrote the post as a way of giving back and sharing learnings with other game companies, and Paananen tapped the rest of the staff for ideas. Paananen said that inspiration has remained constant, as is the company’s dream: To create games that are played for years and remembered forever. He said this worked for Supercell, but each company must develop its own culture.

My take: I’ve added my own reaction to what Paananen has said in each of the points below. It’s a great document for companies that are aspiring to be as impactful as Supercell, and yet I have this thought in the back of my head. It’s not jealousy, as I don’t run a game company. I just wonder if other game companies would react by saying these rules are for companies that have billions in cash and wonderfully successful employees and teams. That is, it’s a luxury to be able to run a company this way. That’s unfair, as I don’t see any negative intentions here, like bragging, just a sincere intention to help. So I don’t think anyone should dismiss these ideas. But it is a kind of backdrop, as Supercell is a rare company and what Paananen says is true. Supercell’s rules work for Supercell, not everyone.

Above: Supercell has a lean team of 328 people.

Image Credit: Supercell

1. Always play the infinite game

Supercell built itself around the idea of creating games that as many people as possible play for years and that are remembered forever. Teams don’t launch games that they do not believe have a realistic shot at reaching the dream. The most important factor that the teams look at when they test their games in the beta phase is how long players keep playing the game. Retention figures matter most here.

“This means that we don’t only kill games in beta, we kill good games,” Paananen wrote. “Great examples would be Smash Land and more recently Rush Wars – both very polished fun games that received positive feedback and excitement early on. But they weren’t games that people would play for years, so their development teams decided to kill them. They decided that instead of developing their games further, their time is best spent developing a new and better game.”

Beyond that, the teams think about what the right long-term call would be for the players and the community. This shows in how they treat game updates, content creation, and community events. The teams try their best not to think about the next quarter or the next year, but about the next decade. The Clash of Clans team worked on some big design and technical issues for over two years before the efforts bore fruit. This would never have worked in a company that only cared about quarterly results.

My take: This rule puts Supercell in an exceedingly rare position in the game industry. Most game companies don’t have the luxury of killing games that don’t fit a long-term goal. Blizzard had a similar culture from its founding nearly three decades ago, and it killed a lot of games, just as Supercell has. But quite often game companies have to think about the next quarter, either because they are low on cash or they have to worry about public company investors selling off stock if games are delayed. But if you didn’t have to worry about the short term, Paananen is right. This is what you should think about in the long term.

Above: Supercell’s cute characters.

Image Credit: Supercell

2. Great teams make great games. Not necessarily great individuals

In 2010, Supercell started with the idea that the “best people make the best games.” However, years later, the company realized that it is not about having the best individuals, but having the best teams. So it changed its guiding sentence to reflect that.

“For me personally, the biggest surprise over the last ten years has been how incredibly difficult it is to put together a great team that can release a new hit game,” Paananen said.

The things that have to fall into place include making the game match player interest at the exact time it is released, having talented individuals with very different capabilities and ways of thinking, and getting them to work together well in a psychologically safe environment with complete trust. Paananen said there is no silver bullet for solving this challenge, and it is subject to a lot of trial and error. It starts with a core of two to four people who are a “tight and well-functioning core team.” Even two such people is a “magical pair.”

My take: I think it’s true that games have had some Hollywood envy in the past and have put some leaders into the spotlight as the brilliant makers of games. Hideo Kojima comes to mind. But it really is teams that make the best games, and teams that have worked together for a long time are the ones that are really valuable.