Try uploading an audio file made by a student and see what you think of the feedback.
My prompt for Google's AI Studio was, "I want to build a web app where i can upload an audio file and get feedback on my speaking including my pronunciation"
I asked Gemini, "What audio and video file formats can be uploaded here to ask for feedback on pronunciation and speaking in general?"
To help you and your students use this workflow effectively, here is a breakdown of the supported formats.
Since your focus is on analyzing speech patterns and pronunciation, audio files are usually the most efficient, but video files also work well if the student wants to show their facial expressions or mouth movements.
Supported Audio Formats
These are the most common formats for voice recordings:
MP3 (The standard format for most phones and voice recorders)
WAV (High quality, but larger file sizes)
AAC / M4A (Common for iPhone "Voice Memos")
OGG
FLAC
Supported Video Formats
If a student records themselves on camera, these formats are supported:
MP4 (The most common format for smartphones)
MOV (Standard for Apple devices)
AVI
WMV
WebM
Key Technical Limits
File Size: Generally, files should be under 20MB for the best performance. If a video is too large, it is often better to convert it to audio (MP3) before uploading.
Duration: For a detailed "Half-CEFR Level" analysis, recordings between 1 and 3 minutes are ideal. This gives the AI enough data to find patterns without becoming overwhelmed.
Pro-Tip for your Students:
If a student is using a smartphone, I recommend the following:
iOS: Use the Voice Memos app (saved as M4A).
Android: Use the built-in Voice Recorder (usually saved as MP3 or M4A)
Environment: Remind them to record in a quiet room. Background noise (like a TV or wind) can cause the AI to create an incorrect transcript, which will ruin the bracket [ ] error-finding step.
I've been reading Jeff Jordan's article on input, noticing, intake and other dubious constructs. And I hadn't realised before that the distinction between input and intake isn't, on its own, really enough. I think the word exposure is worth using. It's exposure to language. Then there is input and then there is intake (if there is noticing in between input and intake, so be it), but I was also reminded of something that Kunihide Sakai, a Japanese specialist in extensive reading, talked about regarding 'swells'.
He said that 'swells' are just the sounds you hear around you, and it's strange because he was talking about extensive reading. But what he was saying was that as babies learn a language, all they're aware of is this sort of washing of sounds around them. That cannot be input; it is exposure. It is a potential stimulus—it isn't a stimulus unless something happens as a result of it. It's sort of background noise, shall we say, with no attempt at interpreting that there is meaning involved in it.
This is a rather complicated way of expressing the ideas, but what I mean is that there is exposure to a raft of noises and sounds. Obviously, babies will focus their attention on particularly strident sounds because these will stand out. They will begin to detect repeated sounds that make them apply their attention to these repeated sounds just because they're interesting—not because they're necessarily attaching meaning to them—but simply because they're repeated sounds in the environment. They gain prominence because they've been identified as the same sound, and these could be words like "baby" or sounds like a clicking of the tongue.
• I think that paying attention to any sounds will be the beginning of some form of learning, which the baby is undertaking from the environment—whether it be spoken language or any other sounds in the environment.
• Should it only be sounds, or is it perhaps also things they see? I think it is appropriate to say that noticing is one of the parts of this learning process. It could be noticing visual stimuli as well as auditory stimuli.
A baby begins to learn about the world around them by paying attention to things that stand out, things that draw their attention for whatever reason. I think this is prior to considering it as language—identifying that some of the noises made by human beings in the environment are attempts to attract the child's attention is perhaps the first realization of intentionality. In this way, they begin to understand that people are trying to achieve something with these sounds they're making.
The first step for a baby must be to experiment with making sounds themselves, with the intention of getting the attention of the people around them. None of this is clearly language—it is just responding to the environment. Language learning and eventually language output is a much slower process. It must depend on beginning to recognize associations between the sounds they hear and physical realities, objects, and things in the world around them.
For example:
• People saying "doggy, doggy, doggy"
• The child, being interested in this hairy animal that's looking at them
• It all has to do with attention and its relation to objects, sounds, and then the intentionality of the person who is saying "doggy, doggy, doggy"
• The child is realizing that "doggy, doggy, doggy" is being said with the intention of drawing their attention to what they later discover is called a dog
That's a rather complex exposition of my thoughts on the matter.
It has nothing to do directly with second language acquisition, but it helped me clarify my own thoughts on exposure, attention, intentionality, input, intake as precursors to L1 language acquisition.
I didn’t know this worked and when Frank Andrade posted about it, I immediately uploaded an mp3 file on my Android mobile phone. Gemini gave me feedback, but I then asked, “Is your feedback on my pronunciation based on the sounds I produced or on typical errors learners make with these sounds?”
The reply began, “My feedback is a combination of both: I analyse the specific acoustic signals in your recording while also referencing common error patterns to provide meaningful context.
How the Analysis Works
When you upload an audio file, I use advanced speech processing to break down your speech into individual sounds (phonemes) and compare them against standard models.
1. Direct Sound Analysis: I analyse the physical properties of your voice—such as pitch, duration, and intensity—to identify exactly where your pronunciation deviates from the reference. For example, if you shorten a vowel that should be long, I detect that specific timing difference.
I then decided to test it again on my iPad, so I uploaded a different file and asked, “Can you give me feedback on my speaking including my pronunciation?”
Next, I asked, “Can you make a corrected version of what I wanted to say?”
The third prompt was, “Can you make a more colloquial version of this?”
The fourth and final prompt was, “Can you make a new version based on the corrected one but half a CEFR level up?”
My reflections on the replies from Gemini:
I was very impressed by the feedback on the two students’ pronunciation, although it found three very common errors made by Spanish and Catalan learners of English.
The feedback was very detailed, maybe too detailed. Maybe the prompt should be, “Can you give me some brief feedback on my speaking including my pronunciation?”
Under Grammatical Accuracy, Gemini said, “Since you are telling a story in the past, try to keep all verbs in the past tense.” However, when asked to produce a corrected version, it was all in the present simple and included the comment, “In English, when we summarize the plot of a book or a movie, we almost always use the present tense.”
The more colloquial version was great, but at B2/B2+ it is a lot higher than the student’s B1+ level as assigned by Gemini.
The version half a CEFR level up was definitely overcomplicated for a student at her level and would certainly end up with her asking things like, “What does frosty mean?” Gemini claimed it was at B2 level, but it was actually at C2, which is about two levels higher rather than half a level higher!
What to do next:
I wanted to try the same prompts with other GenAI tools like ChatGPT, Claude, Copilot, Perplexity and DeepSeek.
Claude, Copilot and DeepSeek simply won’t accept uploading mp3 files.
ChatGPT allowed me to upload the same mp3 file, but then left me waiting for a few hours after saying, “Please wait a moment while I listen and prepare everything.” And when I complained, it asked me to upload the file again as it was not ‘available’ any more. I was told to upgrade or wait 21 hours to be able to upload the file again without paying. It also suggested that I should upload 1-2-minute-long files next time. Mine was just over 3 minutes long. So I think we can discount ChatGPT on a free account.
Perplexity was the only one that allowed me to upload an mp3 file and immediately gave me feedback on it, but it admitted that its feedback on pronunciation was not based on the audio, but on the transcript and typical Spanish errors. A further problem with Perplexity, using the webpage or the app, is that there is no ‘Read Aloud’, so students can’t listen to anything. As I was using Microsoft’s Edge browser, I tried the ‘Read Aloud’ incorporated in it, but it was not very satisfactory, sounding very robotic in comparison to the ‘Read Aloud’ available with Gemini.
This is all a great pity, as the different versions of the transcriptions made by Perplexity were much closer to what I had been hoping for. For example, the feedback was B2/B2+, which for a B1+ student is probably alright, compared with Gemini, which was B2+/C1. This happened with the different versions, apart from the corrected versions, which were very similar at B1/B1+.
I want to see what subtle changes can be made to the prompts to get feedback that the student could engage with more easily.
1. Can you show me where my errors are without telling me how to correct them?
2. Can you give me some brief feedback at my level on my speaking including my pronunciation?
3. Can you make a corrected version of what I wanted to say?
4. Can you make a more colloquial version of this without increasing the sophistication more than half a CEFR level?
5. Can you make a new more sophisticated version based on the corrected one but less than half a CEFR level up?
Ideally, students can be given a choice of which prompts to use, along with help on how to engage with the feedback in different ways.
As you will see, Eric Curts has produced and shared a wide range of other resources for language learning, but for the moment let me just show you a short extract from the instructions for NotebookLM:
The Indestructible Moss Piglet
[COVER PAGE]
Title: The Indestructible Moss Piglet
Visual Description: A stunning 16:9 wide composition. A split background: the bottom left is a lush, green, textured moss forest painted in soft gouache strokes. The top right fades into a deep, velvety dark blue space with soft white splatter stars. In the center, bridging both worlds, is Terry the Tardigrade (a chubby, squishy, microscopic water bear, painted in soft pastel teal and mint green with visible brush strokes, eight stubby legs, and a round snout). He looks happy and is floating as if swimming. The title is written in playful, hand-cut paper letters with soft edges.
Style Note: Soft Gouache & Pastel, textured collage, visible brush strokes, no hard black outlines, dreamy atmosphere.
Teachers of English are lucky that, as Eric says, there are hundreds of professionally written, illustrated and published readers, but in many less studied languages there are far fewer.
For example, in Catalan, although there are some excellent publishers of graded readers, there are nothing like as many as there are for English. So, I wondered if Eric Curts’ readers could be repurposed for learners of Catalan.
Here is the result of asking NotebookLM (Studio - Slide Deck), “Can you make a version in Catalan? Using the same graphics”. It produced a pdf file like the original, but I am sharing it as a 30-second video crediting Eric Curts.
One of the problems with the app at this stage is that it doesn't have a consistent character on every page.
I think I've seen this scene before, but possibly not with a scientist in it.
And now you see the problem? Umm, the scientist is now a woman. It was a man in both the previous scenes.
It always seems to be pouring with rain outside this cafe.
Trying to post the downloaded web page from a storyline, I pasted the HTML code for the downloaded webpage into the HTML view in Blogger. I had to remove one parameter, overflow: hidden;, as this was preventing the page from scrolling. If you want to see the four pages, click on the picture. If you want to listen to my recordings, click on the 'play' icon. I also changed the colour in the caption to black: caption { color: #000000; }
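As a sketch, the two CSS tweaks described above look like this. (I'm assuming here that the overflow: hidden; rule sat on the page body; in your own export it may be attached to a different element, so search the pasted HTML for "overflow".)

```css
/* Rule to delete (or comment out) so the page can scroll inside Blogger.
   Assumed location: the body rule in the exported stylesheet. */
/* body { overflow: hidden; } */

/* Caption text colour changed to black, as described above. */
caption {
  color: #000000;
}
```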