Saturday, 24 January 2026

SpeakSmart AI by Google's AI Studio looks good, but is it really?

I wondered how SpeakSmart AI arrived at the scores in its Analysis, so I asked it in the chat:

After 24 seconds, I got this reply:

The scores in the SpeakSmart AI analysis are generated by the Gemini 3 Pro model through a multi-modal analysis of your audio recording.

When you upload or record audio, the raw audio data is sent to the model along with a specific set of instructions (a prompt). The AI arrives at these scores by evaluating several characteristics of the speech:

1.      Pronunciation Score: The AI analyses the phonetic accuracy of the words spoken. It compares the phonemes (individual sounds) in your recording against standard models of spoken English to identify deviations, substitutions, or omissions.

2.      Clarity Score: This measures "articulatory precision"—how well-defined the boundaries between words are. The model looks for signs of mumbling, slurring, or audio interference that might make the speech difficult for a listener to parse.

3.      Pacing Score: The model evaluates the tempo and rhythm of your speech. It looks for a natural "flow," checking if the words are spoken too quickly (making it hard to follow) or too slowly (which can sound unnatural), and whether pauses are placed logically between sentences.

4.      Overall Score: This is a holistic assessment derived from the combination of the metrics above, reflecting the general effectiveness and professional quality of the communication.

Because Gemini is a "native multi-modal" model, it doesn't just transcribe the text and then grade it; it "listens" to the nuances of the audio signal itself—including tone, stress, and intonation—to provide feedback that goes beyond just the words used.

But I wasn't 100% convinced, so I used it to evaluate recordings of my own Catalan made over a period of time, hoping that it would show a clear improvement. It didn't.

I refuse to believe that my speaking hasn't improved over these years.

Then, by chance, I made a mistake: I asked for the same recording to be evaluated again and was disconcerted to see that the Analysis was different and, what's more, that the Strengths, General Advice and Specific Improvements were all different too.
 
So it seems that the scores are quite arbitrary. The Strengths always covered three from a total of four different points, but they were all useful comments. The General Advice always started with a positive comment and then covered two of the three variations of points to concentrate on.

Similarly, under Specific Improvements only one point was covered by all three versions. Two versions covered the same problem with past tenses, but the third didn't. Two of them corrected a misused word from Spanish, which the third ignored. Five other points were covered in one or other version of the feedback, and the most remarkable correction was pointing out that the famous Catalan actor is in fact called Juanjo Puigcorbé and not Juan Puigcorbé!

There were also a number of differences in the Transcript. In the first version, five words had been corrected in the transcript itself, including the misused word from Spanish, which was rendered as its Catalan equivalent and so didn't appear in the list of Specific Improvements.

This was all done in the Catalan adaptation of the SpeakSmart AI app, so I decided to check whether the same thing happens with a recording in English in the original app.

I submitted the same recording, made in class by a pre-intermediate student, three times.

The first and most important problem is the arbitrary nature of the Analysis Results. They are not helpful at all and may give a totally false assessment of the student’s pronunciation.

Analysis Results

Score            First time    Second time    Third time
Overall          45            52             40
Pronunciation    50            55             45
Clarity          50            50             40
Pacing           35            45             35
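To put a number on how much the scores move between runs, here is a minimal Python sketch that works out the lowest score, highest score, range and mean for each metric. The figures are the ones reported above; the names `runs` and `score_spread` are my own and have nothing to do with how SpeakSmart AI works internally.

```python
from statistics import mean

# Scores for the same recording, in the order: first, second, third run.
runs = {
    "Overall":       [45, 52, 40],
    "Pronunciation": [50, 55, 45],
    "Clarity":       [50, 50, 40],
    "Pacing":        [35, 45, 35],
}


def score_spread(scores):
    """Return (lowest, highest, range, mean) for one metric across runs."""
    lowest, highest = min(scores), max(scores)
    return lowest, highest, highest - lowest, round(mean(scores), 1)


# One summary tuple per metric.
summary = {metric: score_spread(scores) for metric, scores in runs.items()}

for metric, (lowest, highest, spread, average) in summary.items():
    print(f"{metric:<14} low={lowest} high={highest} range={spread} mean={average}")
```

With only three runs this is a rough indication rather than a proper reliability measure, but a range of 10 points or more on an identical recording is hard to reconcile with a meaningful score.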

The third Transcript was very different from the first two as it included all the ums and uhs and so was like a verbatim transcript. Here is an extract:

"Hana was driving her car and suddenly one men across the... across round. Uh, after that, eh, she eh stop... the car just... [Spanish interaction: She stopped the car?]... Hana stopped eh the car and eh... she eh... see a man... uh, and the man eh was Jamie.

In the Strengths, there was only one similar phrase: “You successfully conveyed the main plot points of the story” together with some reference to self-correction.

In General Advice all three versions insisted on not translating (word-for-word) from Spanish, and on using the correct past forms of irregular (or regular) verbs. They each included one or two other useful suggestions.

The Specific Improvements were very similar and covered, albeit in different orders: 

  • the pronunciation of the -ed ending in ‘stopped’
  • the use of the false friend ‘history’ instead of ‘story’
  • the fact that ‘explained’ is followed by ‘to’ if the person is included
  • the correct past tense of ‘see’ is ‘saw’
  • the use of the preposition ‘across’ instead of the verb ‘cross’

Despite the variations between the three versions, all the feedback was valid and potentially useful, with the exception of the scores in the Analysis Results, which it would be better to remove.

Conclusions

I will challenge Google's AI Studio about the arbitrariness of the scores in the two versions of the app and prompt it to make the transcript verbatim. After these changes, I will repeat the test for consistency with another recording.

Friday, 23 January 2026

Can you adapt this to be in Catalan and for the CATALAN language?

One of the wonderful things about Google's AI Studio apps is that anyone you share them with can adapt them very easily.

I asked, "Can you adapt this to be in Catalan and for the CATALAN language?"

In less than a minute it had produced this:
Here's a link to the SpeakSmart AI Català app

I allowed these two
I then uploaded an mp3 file, but SpeakSmart AI also accepts:

opus, oga, flac, ra, webm, weba, wav, ogg, m4a, mp3, mid, amr, aiff, wma, au, aac
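If you collect recordings from a whole class, it can save time to check the file types before uploading. The following is a small Python sketch that assumes the list above is complete and that the file extension is all that matters; the function name `is_accepted` is my own invention, not part of the app.

```python
from pathlib import Path

# Extensions copied from the list shown by SpeakSmart AI above.
ACCEPTED_EXTENSIONS = {
    "opus", "oga", "flac", "ra", "webm", "weba", "wav", "ogg",
    "m4a", "mp3", "mid", "amr", "aiff", "wma", "au", "aac",
}


def is_accepted(filename: str) -> bool:
    """Return True if the file's extension is in the accepted list.

    The comparison ignores case, so "talk.MP3" counts as mp3.
    """
    extension = Path(filename).suffix.lstrip(".").lower()
    return extension in ACCEPTED_EXTENSIONS


# Hypothetical file names, for illustration only.
for name in ["anecdotes.mp3", "class_recording.M4A", "holiday.mov"]:
    print(name, "->", "OK" if is_accepted(name) else "convert first")
```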

SpeakSmart took a minute or two to produce this feedback:

3 anècdotes 28/01/2016

General: 72
Pronunciació: 75
Claredat: 78
Ritme: 65

Transcripció

"Una vegada, fa molts anys, vaig anar al teatre... crec que era al Mercat de les Flors. No? Hi havia un teatre... una miqueta... alternatiu. No clàssic, però clàssic com els teatres del centre de Barcelona. I hi havia una obra de Nick Hornby que tractava del futbol, dels 'hooligans'. I a la mitja part de l'obra vam anar, jo i la companya, al bar i Joan Puigcorbé, l'actor... estava allà al bar. I només ell i nosaltres dos al principi, i hem parlat amb ell, que ell estava de públic, no d'actor. Molt agradable. Una altra vegada, la meva dona... anàvem passejant per la costa de Tarragona, a prop de la Punta de la Móra. Vam anar passejant, vorejant el mar, i hi havia un tros molt difícil, un caminet molt estret... unes roques... i ens hem trobat una tanca. I per salvar-nos hem saltat la tanca i ha arribat un... de seguretat... de la Punta de la Móra, que és un càmping privat, no? I ens ha dit: 'Aquí no pots creuar la tanca, home! És que és perillós aquí'. I al final ens han deixat passar. Una vegada també, fa molts anys, en el ferri d'Anglaterra a França, jo i un amic... més de 50 anys... era un ferri de nit. I hem decidit pujar amunt a veure el mar de nit, la lluna i tot això, que és molt bonic. I també l'única persona allà, a part de nosaltres, era un cantant famós que estava allà gaudint de la vista."

Punts forts

  • ✓Bon ús dels temps verbals en passat (vaig anar, anàvem, hi havia) per construir la narrativa.
  • ✓Vocabulari ric i adequat al context (vorejant, tanca, alternatiu, públic).
  • ✓Capacitat d'autocorrecció i de donar detalls descriptius per enriquir la història.

💡Consells generals

Tens una bona capacitat narrativa i un vocabulari extens. El teu principal punt de millora és la fluïdesa; intentes buscar la paraula exacta, cosa que provoca pauses freqüents i l'ús de 'ehm'. Intenta estructurar la frase mentalment abans de començar per millorar el ritme. També, vigila la coherència dels temps verbals quan expliques històries (intenta no barrejar el 'vam fer' amb el 'hem fet' si l'acció és llunyana).

Millores específiques

Paraula/Frase: la meva dona... anàvem
Problema: Concordança de subjecte
Suggeriment: La meva dona i jo anàvem / Amb la meva dona anàvem

Paraula/Frase: ens hem trobat / hem saltat
Problema: Barreja de temps verbals (Perfet vs Perifràstic en narració)
Suggeriment: ens vam trobar / vam saltar (per mantenir la coherència amb 'vam anar')

Paraula/Frase: un de seguretat
Problema: Precisió lèxica
Suggeriment: un vigilant de seguretat

Paraula/Frase: Joan Puig Corbé
Problema: Nom propi (probable confusió)
Suggeriment: Juanjo Puigcorbé

© 2024 SpeakSmart AI - Català. Creat amb Google Gemini 3.


Wednesday, 21 January 2026

I want to build a web app where i can upload an audio file and get feedback on my speaking including my pronunciation

SpeakSmart AI

Try it here yourself

Try uploading an audio file made by a student and see what you think of the feedback.


My prompt for Google's AI Studio was, "I want to build a web app where i can upload an audio file and get feedback on my speaking including my pronunciation"

Sunday, 18 January 2026

You can upload all these file types to Gemini to get feedback on your speaking, including your pronunciation: MP3, WAV, AAC, M4A, OGG, FLAC, MP4, MOV, AVI, WMV, WebM


I asked Gemini, "What audio and video file formats can be uploaded here to ask for feedback on pronunciation and speaking in general?"

To help you and your students use this workflow effectively, here is a breakdown of the supported formats.

Since your focus is on analyzing speech patterns and pronunciation, audio files are usually the most efficient, but video files also work well if the student wants to show their facial expressions or mouth movements.

Supported Audio Formats

These are the most common formats for voice recordings:

  • MP3 (The standard format for most phones and voice recorders)
  • WAV (High quality, but larger file sizes)
  • AAC / M4A (Common for iPhone "Voice Memos")
  • OGG 
  • FLAC

Supported Video Formats

If a student records themselves on camera, these formats are supported:

  • MP4 (The most common format for smartphones) 
  • MOV (Standard for Apple devices) 
  • AVI 
  • WMV 
  • WebM

Key Technical Limits

  • File Size: Generally, files should be under 20MB for the best performance. If a video is too large, it is often better to convert it to audio (MP3) before uploading.
  • Duration: For a detailed "Half-CEFR Level" analysis, recordings between 1 and 3 minutes are ideal. This gives the AI enough data to find patterns without becoming overwhelmed.

Pro-Tip for your Students:

If a student is using a smartphone, I recommend the following:

  1. iOS: Use the Voice Memos app (saved as M4A).
  2. Android: Use the built-in Voice Recorder (usually saved as MP3 or M4A) 
  3. Environment: Remind them to record in a quiet room. Background noise (like a TV or wind) can cause the AI to create an incorrect transcript, which will ruin the bracket [ ] error-finding step.

More details on file sizes

Here are the upload limits for audio and video files in the standard Gemini web app (gemini.google.com) and mobile app:

Video

  • Maximum File Size: 2 GB per file.

  • Length Limit:

    • Free Version: Up to 5 minutes total duration per prompt.

    • Gemini Advanced: Up to 1 hour total duration per prompt.

Audio

  • Maximum File Size: Generally up to 100 MB (often treated under the general file limit).

  • Length Limit:

    • Free Version: Up to 10 minutes total duration per prompt.

    • Gemini Advanced: Up to 3 hours total duration per prompt.

Key Notes:

  • You can upload up to 10 files in a single prompt.

  • The limits apply to the total duration in a single conversation turn (e.g., if you are on the free plan, you could upload two 2.5-minute videos, but not two 3-minute videos).
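The "total duration per prompt" rule is easy to get wrong, so here is a short Python sketch of the arithmetic. It assumes the figures Gemini gave above are accurate, which I have not verified and which may change; the table `DURATION_LIMITS_MIN` and the function `fits_in_one_prompt` are hypothetical names of my own, not anything provided by Gemini.

```python
# Total duration allowed per prompt, in minutes, as stated in Gemini's reply above.
# These are Gemini's own figures and are treated here as assumptions.
DURATION_LIMITS_MIN = {
    ("video", "free"): 5,
    ("video", "advanced"): 60,
    ("audio", "free"): 10,
    ("audio", "advanced"): 180,
}

# Gemini's reply also says at most 10 files can go in a single prompt.
MAX_FILES_PER_PROMPT = 10


def fits_in_one_prompt(kind, plan, durations_min):
    """Return True if the files can all be uploaded in one prompt.

    kind is "audio" or "video", plan is "free" or "advanced", and
    durations_min is a list with the length of each file in minutes.
    """
    if len(durations_min) > MAX_FILES_PER_PROMPT:
        return False
    return sum(durations_min) <= DURATION_LIMITS_MIN[(kind, plan)]


# The example from the Key Notes: two videos on the free plan.
print(fits_in_one_prompt("video", "free", [2.5, 2.5]))  # total 5 minutes
print(fits_in_one_prompt("video", "free", [3, 3]))      # total 6 minutes
```

The sketch only checks duration and the number of files; the file size limits would need a separate check.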