A week ago I offered to adapt my 9-step prompt, which students can use to get feedback on their speaking and which uses the CEFR scale and the Global Scale of English, to work with the IELTS scale. In many countries the IELTS scale is used more often than the CEFR scale.
Here is a table showing how grades on the IELTS scale
correspond to the other two scales:
It also shows the levels of the different versions ChatGPT
produced from the original transcript of “Bargain Sweater”.
Unfortunately, after experimenting with seven different
variations of my 9-step prompt with various differences between steps, I began
to suspect that ChatGPT had no idea about how to rate texts or create them at
different points on the IELTS scale. I wanted to test this.
Daniel Tena Calderon had experimented with using Gemini to do something very
similar to what I’m trying to do with my 9-step prompt, so I decided to see how
the two, ChatGPT and Gemini, compared using the same transcript and the same prompt.
I asked each of the LLMs to grade the five versions on the
IELTS, CEFR and GSE scales and they both asserted that they had done what they
had been asked to do: produce texts that go from the original to half a level
up and then one level further and one level further again.
| | ChatGPT | Check | Gemini | Check |
|---|---|---|---|---|
| Original | 4.5-5.0 | 3.5-4.0 | 4.0-4.5 | 3.5-4.0 |
| Corrected | 5.0-5.5 | 3.5-4.0 | 5.0 | 4.0 |
| ½ IELTS level up | 5.5-6.0 | 4.0-4.5 | 5.5-6.0 | 4.5 |
| 1 more IELTS level up | 6.5 | 3.5-4.0 | 6.5-7.0 | 6.5-7.0 |
| 1 further IELTS level up | 7.0 | 4.5 | 7.5-8.0 | 6.5-7.0 |
| Differences | | 9.5 too high | | 3.5 too high |
Gemini appeared to grade the original more severely, but
appeared to reach higher levels later in the sequence.
I haven’t got a tool to independently check if the LLMs were
right, but I have used an average of the CEFR grades given by Text Inspector
and Pearson’s Text Analyzer as a reliable way to grade texts. I then used the
table above to predict the level on the IELTS scale and I’ve added a Check
column with the results for each step for each GenAI tool.
Gemini was much closer to the checked levels and ChatGPT was
so far out that I think it is fair to say that ChatGPT simply doesn’t
understand how to grade or create texts on the IELTS scale.
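The two totals in the Differences row can be reproduced with a short calculation. What follows is only a minimal sketch in Python. It assumes each range is represented by its lower bound, which is an assumption rather than something stated above, although it does give the 9.5 and 3.5 totals. The helper names `lower_bound` and `total_overestimate` are made up for illustration.

```python
def lower_bound(band: str) -> float:
    """Return the lower end of a band such as '4.5-5.0', or the value itself for '6.5'."""
    return float(band.split("-")[0])


# (grade claimed by the LLM, checked grade) for the five versions, copied from the table.
chatgpt = [
    ("4.5-5.0", "3.5-4.0"),  # Original
    ("5.0-5.5", "3.5-4.0"),  # Corrected
    ("5.5-6.0", "4.0-4.5"),  # 1/2 IELTS level up
    ("6.5", "3.5-4.0"),      # 1 more IELTS level up
    ("7.0", "4.5"),          # 1 further IELTS level up
]
gemini = [
    ("4.0-4.5", "3.5-4.0"),
    ("5.0", "4.0"),
    ("5.5-6.0", "4.5"),
    ("6.5-7.0", "6.5-7.0"),
    ("7.5-8.0", "6.5-7.0"),
]


def total_overestimate(rows):
    """Sum how far each claimed grade sits above the checked grade (lower bounds only)."""
    return sum(lower_bound(claimed) - lower_bound(checked) for claimed, checked in rows)


print("ChatGPT:", total_overestimate(chatgpt))  # 9.5
print("Gemini:", total_overestimate(gemini))    # 3.5
```

Using the midpoint of each range instead would give slightly different totals (9.25 and 3.75), but the gap between the two tools stays about the same size.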
Before writing this, I made a screen recording making the
same point and illustrating it with the texts created by the two LLMs. As
usual, I used Clipchamp to make a couple of edits and to generate the subtitles
file, which I then used to create subtitles on YouTube. I experimented with
various formats of the video and subtitles and uploaded the four formats to YouTube:
· Wide screen with subtitles to the right
· Portrait with subtitles at the bottom