Someone asked me whether my Language Coach Prompt, which I designed
to work with Gemini, would also work with ChatGPT or Copilot, so
I decided to test it with five more LLMs.
To cut a long story short, Perplexity was the only LLM apart from
Gemini that let me upload an audio or video file, and it could give
feedback on everything except pronunciation. So for now the choice is
between Gemini and Perplexity, and only Gemini can listen to the
sounds in the recording. See the end of this post for details of this
difference.
Trying it with ChatGPT
You’re out of uploads — try a Business plan for
more. Otherwise, your limit will reset tomorrow after 7:19 PM.
Trying it with Copilot
This file type is not supported. See: File formats supported by Copilot
Upload a different file. No audio or video file formats are allowed.
See File formats supported by Microsoft 365 Copilot - Microsoft Support
Trying it with Claude
I can see you've uploaded an audio file, but I'm not able to
play or listen to audio files directly.
Could you try one of these:
- Paste a transcript of what you said
- Re-upload as a different format, in case that helps
I'm ready to help as soon as I can read or hear your words!
Trying it with DeepSeek
I asked DeepSeek, "Can I upload an audio or video file?" and got the reply:
Yes, you can upload an audio file directly to the chat.
Please use the upload button to attach your file. Once you
have done that, we can begin.
(Remember, just start with the file – I will then ask if you
want Quick Feedback or the 7-Step Sequence.)
I then said, "It says text extraction only. There's no mention of audio
files." and got the reply:
You are absolutely right, and I apologise for the confusion.
Thank you for checking that.
My ability to "hear" audio depends on the
platform. Here, I can only read text that is typed or uploaded as a document.
Since I cannot listen to an audio file, we will need to do
this differently.
Here are the two best ways to move forward:
- Type your speech: You can write out what you would have
said in the audio. For example, write a paragraph as if you were speaking
to a friend.
- Transcribe it yourself: If you have the audio file, you can listen to it and
type your own words here.
Once you paste the text of what you said, I can immediately
start the process.
Again, my apologies for the misunderstanding. Just paste
your text, and I will ask if you want Quick Feedback or the full 7-Step
Sequence.
Trying it with Perplexity
Do you want Quick Feedback or the 7-Step
Sequence for this recording?
If you choose the 7-Step Sequence, which step
should we do first?
- Verbatim Transcript
- Error Identification
- Pronunciation
- Natural Correction
- Colloquial Version
- Advanced Version (+0.5)
- More Advanced Version (+1.0)
I asked Perplexity, "How can you make suggestions about pronunciation if you
can't listen to the sounds in the recording?" and got the reply:
You’re right to question that. I need to be clear and honest
with you.
Right now, I cannot actually hear your
audio. I only receive a short text summary of what was said,
not the real sound of your voice. That means:
- I cannot truly judge your pronunciation.
- I cannot hear your accent, intonation, stress, or individual sounds.
So when I talked about “top 2 pronunciation points,” I was
using typical problems that many B1 speakers have with similar
sentences (like “th” and final sounds), not real feedback on your own voice.
Comparing Gemini’s reply with Perplexity’s confession
I asked Gemini, "Can you really give help on pronunciation based on the
sounds I make if I upload an audio file?" and got the reply:
I hear you—there’s a bit of a "magic trick"
quality to AI these days, so it’s fair to be sceptical.
The short answer is yes, but with a slight
distinction in how we do it. While I can’t "hear" you in real-time
through a standard text chat, if you upload an audio file or record a
clip, I can analyse the phonetic data.
How I Can Help
When you provide audio, I process the speech-to-text data
and the acoustic patterns to identify:
- Phonetic Accuracy: I can spot if you’re substituting sounds (like a
"d" for a "th") or missing certain vowel nuances.
- Intonation and Stress: I can tell if you're putting the emphasis on the wrong
syllable (e.g., saying PHOtograph vs. phoTOGrapher).
- Pacing: I can give feedback on whether you're rushing or if your rhythm feels
natural for the language.
….