Saturday, 24 January 2026

SpeakSmart AI by Google's AI Studio looks good, but is it really?

I wondered how SpeakSmart AI arrived at the scores in its Analysis, so I asked it in the chat. After 24 seconds, I got this reply:

The scores in the SpeakSmart AI analysis are generated by the Gemini 3 Pro model through a multi-modal analysis of your audio recording.

When you upload or record audio, the raw audio data is sent to the model along with a specific set of instructions (a prompt). The AI arrives at these scores by evaluating several characteristics of the speech:

1.      Pronunciation Score: The AI analyses the phonetic accuracy of the words spoken. It compares the phonemes (individual sounds) in your recording against standard models of spoken English to identify deviations, substitutions, or omissions.

2.      Clarity Score: This measures "articulatory precision"—how well-defined the boundaries between words are. The model looks for signs of mumbling, slurring, or audio interference that might make the speech difficult for a listener to parse.

3.      Pacing Score: The model evaluates the tempo and rhythm of your speech. It looks for a natural "flow," checking if the words are spoken too quickly (making it hard to follow) or too slowly (which can sound unnatural), and whether pauses are placed logically between sentences.

4.      Overall Score: This is a holistic assessment derived from the combination of the metrics above, reflecting the general effectiveness and professional quality of the communication.

Because Gemini is a "native multi-modal" model, it doesn't just transcribe the text and then grade it; it "listens" to the nuances of the audio signal itself—including tone, stress, and intonation—to provide feedback that goes beyond just the words used.

But I wasn't 100% convinced, so I used it to evaluate my own recordings in Catalan over a period of time, hoping that it would show a clear improvement. It didn't.

I refuse to believe that my speaking hasn't improved over these years.

Then I made a chance mistake: I asked for the same recording to be evaluated again and was disconcerted to see that the Analysis was different and, what's more, that the Strengths, General Advice and Specific Improvements were also all different.
 
So it seems that the scores are quite arbitrary. The Strengths always covered three from a total of four different points, but they were all useful comments. The General Advice always started with a positive comment and then showed two of the three variations of points to concentrate on.

Similarly, under Specific Improvements there was only one point covered by all three versions. Two covered the same problem with past tenses, but the third didn't. Two of them corrected a misused word from Spanish, which the third ignored. There were five other points that were covered in one or other version of the feedback, and the most remarkable correction was pointing out that the famous Catalan actor is in fact called Juanjo Puigcorbé and not Juan Puigcorbé!

In the Transcript, there were also a number of differences. The first version included five words that had been corrected, including the misused word from Spanish, which had been rendered as the Catalan equivalent and so didn't appear in the list of Specific Improvements.

This was all done in the Catalan adaptation of the SpeakSmart AI app, so I decided to check whether the same thing happens with a recording in English in the original app.

I submitted the same recording, made in class by a pre-intermediate student, three times.

The first and most important problem is the arbitrary nature of the Analysis Results. They are not helpful at all and may give a totally false assessment of the student’s pronunciation.

Analysis Results

                First time   Second time   Third time
Overall             45           52            40
Pronunciation       50           55            45
Clarity             50           50            40
Pacing              35           45            35
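To put a number on how much the scores moved between the three runs, here is a minimal Python sketch. The figures are copied from the three Analysis Results; the names `runs` and `summarise` are my own and have nothing to do with how the app works internally.

```python
# Scores from the three Analysis Results for the same English recording,
# in the order: first, second, third evaluation.
runs = {
    "Overall":       [45, 52, 40],
    "Pronunciation": [50, 55, 45],
    "Clarity":       [50, 50, 40],
    "Pacing":        [35, 45, 35],
}


def summarise(scores):
    """Return (lowest, highest, spread, mean) for one metric."""
    lowest, highest = min(scores), max(scores)
    average = round(sum(scores) / len(scores), 1)
    return lowest, highest, highest - lowest, average


summary = {metric: summarise(scores) for metric, scores in runs.items()}

for metric, (lowest, highest, spread, average) in summary.items():
    print(f"{metric:<14} min={lowest} max={highest} spread={spread} mean={average}")
```

For this recording, every metric moves by 10 to 12 points between evaluations of the identical audio.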

The third Transcript was very different from the first two, as it included all the ums and uhs and so read like a verbatim transcript. Here is an extract:

"Hana was driving her car and suddenly one men across the... across round. Uh, after that, eh, she eh stop... the car just... [Spanish interaction: She stopped the car?]... Hana stopped eh the car and eh... she eh... see a man... uh, and the man eh was Jamie."

In the Strengths, only one similar phrase appeared in all three versions: “You successfully conveyed the main plot points of the story”, together with some reference to self-correction.

In General Advice all three versions insisted on not translating (word-for-word) from Spanish, and on using the correct past forms of irregular (or regular) verbs. They each included one or two other useful suggestions.

The Specific Improvements were very similar and covered, albeit in different orders: 

  • the pronunciation of the -ed ending in ‘stopped’
  • the use of the false friend ‘history’ instead of ‘story’
  • the fact that ‘explained’ is followed by ‘to’ if the person is included
  • the correct past tense of ‘see’ is ‘saw’
  • the use of the preposition ‘across’ instead of the verb ‘cross’

Despite the variations, the feedback was all valid and potentially useful, with the exception of the Analysis Results, which it would be better to remove.

Conclusions

I will challenge Google's AI Studio about the arbitrariness of the scores in the two versions of the app and prompt it to make the transcript verbatim. After these changes, I will repeat the test for consistency with another recording.
