I wondered how SpeakSmart AI arrived at the scores in its Analysis, so I asked it in the chat:
After 24 seconds, I got this reply:
The scores in the SpeakSmart AI analysis are generated by the Gemini 3 Pro model through a multi-modal analysis of your audio recording. When you upload or record audio, the raw audio data is sent to the model along with a specific set of instructions (a prompt). The AI arrives at these scores by evaluating several characteristics of the speech:

1. Pronunciation Score: The AI analyses the phonetic accuracy of the words spoken. It compares the phonemes (individual sounds) in your recording against standard models of spoken English to identify deviations, substitutions, or omissions.

2. Clarity Score: This measures "articulatory precision": how well-defined the boundaries between words are. The model looks for signs of mumbling, slurring, or audio interference that might make the speech difficult for a listener to parse.

3. Pacing Score: The model evaluates the tempo and rhythm of your speech. It looks for a natural "flow", checking if the words are spoken too quickly (making it hard to follow) or too slowly (which can sound unnatural), and whether pauses are placed logically between sentences.

4. Overall Score: This is a holistic assessment derived from the combination of the metrics above, reflecting the general effectiveness and professional quality of the communication.

Because Gemini is a "native multi-modal" model, it doesn't just transcribe the text and then grade it; it "listens" to the nuances of the audio signal itself, including tone, stress, and intonation, to provide feedback that goes beyond just the words used.
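In other words, the whole analysis boils down to a single multi-modal request: the audio file plus a scoring prompt. The app's actual prompt, model name and output format are not public, so the sketch below is only an illustration of what such a call could look like with Google's google-generativeai Python SDK; the prompt wording, the file name and the "gemini-1.5-pro" model are my own stand-ins, not the app's real values.

```python
# Rough sketch of the pipeline described above: raw audio plus a scoring
# prompt sent to the model in one multi-modal request. All names here are
# assumptions, not SpeakSmart AI's actual implementation.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

SCORING_PROMPT = (
    "Listen to this recording of spoken English and score it from 0 to 100 "
    "for Overall, Pronunciation, Clarity and Pacing. Then list Strengths, "
    "General Advice and Specific Improvements."
)

# Upload the raw audio via the File API, then pass it to the model
# together with the instructions.
audio = genai.upload_file("student_recording.mp3")
model = genai.GenerativeModel("gemini-1.5-pro")  # stand-in model name
response = model.generate_content([SCORING_PROMPT, audio])
print(response.text)
```

If the app really works this way, the scores are simply part of the model's generated text rather than the output of a fixed measurement, which matters for what comes next.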
But I wasn't 100% convinced, so I used it to evaluate my own recordings in Catalan over a period of time, hoping that it would show a clear improvement. But it didn't, and I refuse to believe that my speaking hasn't improved over these years.
Then a chance mistake revealed something: I asked for the same recording to be evaluated again and was disconcerted to see that the analysis was different. What's more, the Strengths, General Advice and Specific Improvements were all different too.
So, it seems that the scores are quite arbitrary. The Strengths always covered three out of a total of four different points, but they were all useful comments. The General Advice always started with a positive comment and then showed two of the three variations of points to concentrate on.
Similarly, under Specific Improvements there was only one point covered by all three; two covered the same problem with past tenses, but the third didn't. Two of them corrected a misused word from Spanish, which the third ignored. There were five other points that were covered in one or other version of the feedback, and the most remarkable correction was pointing out that the famous Catalan actor is in fact called Juanjo Puigcorbé and not Juan Puigcorbé!
In the Transcript, there were also a number of differences. The first version included five words that had been corrected, including the misused word from Spanish, which had been rendered as the Catalan equivalent and so didn't appear in the list of Specific Improvements.
This was all done in the Catalan adaptation of the SpeakSmart AI app, so I decided to check whether the same thing happens with a recording in English in the original app.
I evaluated the same recording, made in class by a pre-intermediate student, three times.
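To make that comparison repeatable, a loop like the one below could send the identical file three times and print each set of results. Again, this is only a sketch: score_recording, the prompt and the file name are my own inventions, reusing the hypothetical Gemini call shown earlier.

```python
# Consistency test sketch: score the same audio file three times and
# compare the runs. Any variation then comes from the model, not the
# recording. Everything here is assumed, not the app's real code.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")

def score_recording(audio_path: str, prompt: str) -> str:
    """Hypothetical wrapper around the single Gemini call sketched above."""
    audio = genai.upload_file(audio_path)
    model = genai.GenerativeModel("gemini-1.5-pro")  # stand-in model name
    return model.generate_content([prompt, audio]).text

PROMPT = ("Score this spoken-English recording from 0 to 100 for Overall, "
          "Pronunciation, Clarity and Pacing, then give feedback.")

for run in range(1, 4):
    print(f"--- Run {run} ---")
    print(score_recording("pre_intermediate_story.mp3", PROMPT))
```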
The
first and most important problem is the arbitrary nature of the Analysis
Results. They are not helpful at all and may give a totally false
assessment of the student’s pronunciation.
| Analysis Results | First time | Second time | Third time |
| --- | --- | --- | --- |
| Overall | 45 | 52 | 40 |
| Pronunciation | 50 | 55 | 45 |
| Clarity | 50 | 50 | 40 |
| Pacing | 35 | 45 | 35 |
The third Transcript was very different from the first two, as it included all the ums and uhs and so read like a verbatim transcript. Here is an extract:
"Hana was driving her car
and suddenly one men across the... across round. Uh, after that, eh,
she eh stop... the car just... [Spanish interaction: She stopped the
car?]... Hana stopped eh the car and eh... she eh...
see a man... uh, and the man eh was Jamie.
In the Strengths, only one phrase was common to the three versions: "You successfully conveyed the main plot points of the story", together with some reference to self-correction.
In the General Advice, all three versions insisted on not translating (word for word) from Spanish, and on using the correct past forms of irregular (or regular) verbs. Each included one or two other useful suggestions.
The Specific Improvements were very similar and
covered, albeit in different orders:
- the pronunciation of the -ed ending in ‘stopped’
- the use of the false friend ‘history’ instead of ‘story’
- the fact that ‘explained’ is followed by ‘to’ if the person is included
- the correct past tense of ‘see’ is ‘saw’
- the use of the preposition ‘across’ instead of the verb ‘cross’
Despite the variations, the feedback points were all valid and potentially useful, with the exception of the Analysis Results, which it would be better to remove.
Conclusions
I will challenge Google's AI Studio about the arbitrariness of the scores in the two versions of the app, and specify in the prompt that the transcript should be verbatim. After these changes, I will repeat the consistency test with another recording.