Friday, 12 September 2025

ChatGPT simply doesn’t understand how to grade or create texts on the IELTS scale

A week ago I offered to adapt my 9-step prompt, which students use to get feedback on their speaking and which uses the CEFR scale and the Global Scale of English (GSE), so that it works with the IELTS scale. In many countries the IELTS scale is used more often than the CEFR scale.

Here is a table showing how grades on the IELTS scale correspond to the other two scales:

It also shows the levels of the different versions ChatGPT produced from the original transcript of “Bargain Sweater”.

Unfortunately, after experimenting with seven different variations of my 9-step prompt with various differences between steps, I began to suspect that ChatGPT had no idea about how to rate texts or create them at different points on the IELTS scale. I wanted to test this.

Daniel Tena Calderon had experimented with using Gemini to do something very similar to what I’m trying to do with my 9-step prompt, so I decided to see how the two LLMs, ChatGPT and Gemini, compared when given the same transcript and the same prompt.

I asked each of the LLMs to grade the five versions on the IELTS, CEFR and GSE scales and they both asserted that they had done what they had been asked to do: produce texts that go from the original to half a level up and then one level further and one level further again.

 

| | ChatGPT | Check | Gemini | Check |
|---|---|---|---|---|
| Original | 4.5-5.0 | 3.5-4.0 | 4.0-4.5 | 3.5-4.0 |
| Corrected | 5.0-5.5 | 3.5-4.0 | 5.0 | 4.0 |
| ½ IELTS level up | 5.5-6.0 | 4.0-4.5 | 5.5-6.0 | 4.5 |
| 1 more IELTS level up | 6.5 | 3.5-4.0 | 6.5-7.0 | 6.5-7.0 |
| 1 further IELTS level up | 7.0 | 4.5 | 7.5-8.0 | 6.5-7.0 |
| Differences | | 9.5 too high | | 3.5 too high |

Gemini appeared to grade the original more severely than ChatGPT did, but to reach higher levels later in the sequence.

I haven’t got a tool to independently check if the LLMs were right, but I have used an average of the CEFR grades given by Text Inspector and Pearson’s Text Analyzer as a reliable way to grade texts. I then used the table above to predict the level on the IELTS scale and I’ve added a Check column with the results for each step for each GenAI tool.

Gemini was much closer to the checked levels and ChatGPT was so far out that I think it is fair to say that ChatGPT simply doesn’t understand how to grade or create texts on the IELTS scale.
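For anyone who wants to check the arithmetic, here is a minimal sketch of one way to arrive at the totals in the Differences row. The post does not say how those totals were calculated, so the method is an assumption: for each of the five versions, take the lower end of the grade the LLM gave, subtract the lower end of the Check grade, and add up the results. The names `GRADES` and `total_overestimate` are made up for this illustration.

```python
# Grades copied from the table above as (low, high) IELTS band ranges.
# A single grade such as 6.5 is written as (6.5, 6.5).
GRADES = {
    "ChatGPT": {
        "claimed": [(4.5, 5.0), (5.0, 5.5), (5.5, 6.0), (6.5, 6.5), (7.0, 7.0)],
        "check":   [(3.5, 4.0), (3.5, 4.0), (4.0, 4.5), (3.5, 4.0), (4.5, 4.5)],
    },
    "Gemini": {
        "claimed": [(4.0, 4.5), (5.0, 5.0), (5.5, 6.0), (6.5, 7.0), (7.5, 8.0)],
        "check":   [(3.5, 4.0), (4.0, 4.0), (4.5, 4.5), (6.5, 7.0), (6.5, 7.0)],
    },
}


def total_overestimate(claimed, check):
    """Sum of (claimed lower bound - check lower bound) over all versions.

    Comparing lower bounds is an assumption, not something stated in the post.
    """
    return sum(c[0] - k[0] for c, k in zip(claimed, check))


for tool, grades in GRADES.items():
    total = total_overestimate(grades["claimed"], grades["check"])
    print(f"{tool}: {total} bands too high in total")
```

Under that assumption the totals come out at 9.5 for ChatGPT and 3.5 for Gemini, which matches the Differences row. Comparing the upper ends or the midpoints of the ranges gives slightly different totals, but the gap between the two tools stays about the same.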

Before writing this, I made a screen recording making the same point and illustrating it with the texts created by the two LLMs. As usual, I used Clipchamp to make a couple of edits and to generate the subtitles file, which I then used to create subtitles on YouTube. I experimented with various formats of the video and subtitles and uploaded the four formats to YouTube:

· Wide screen with subtitles to the right
· Portrait with subtitles at the bottom
· With red subtitles on top of the screen recording
· With subtitles you can turn on or off