When OpenAI rolled out the latest version of ChatGPT in March, one particular statistic put the legal profession back on its heels: The artificial intelligence technology outperformed nine out of 10 human test-takers on the bar exam.
But a Ph.D. candidate at the Massachusetts Institute of Technology now says that GPT-4’s reported 90th percentile bar exam performance has likely been overstated, and that the chatbot actually lands in the neighborhood of the 68th percentile of real test-takers, a conclusion the original researchers reject.
The percentile debate centers on how the researchers who first looked at GPT-4’s bar exam performance calculated its score percentile, Eric Martinez wrote in a new paper titled “Re-Evaluating GPT-4’s Bar Exam Performance.”
GPT-4’s Uniform Bar Exam score of 297 would have landed the program in the 90th percentile among those who took Illinois’ February 2019 bar exam, the benchmark cited by the original researchers. But Illinois’ July bar exam would have yielded a more accurate comparison because the February exam typically draws a larger share of repeat takers who previously failed and who tend to score lower, Martinez wrote.
Measured against a recent July exam in Illinois, GPT-4 would have scored in the 68th percentile, he concluded.
“The fact that GPT-4’s reported ‘90th percentile’ capabilities were so widely publicized might pose some concerns that lawyers and non-lawyers may use GPT-4 for complex legal tasks for which it is incapable of adequately performing,” Martinez wrote.
Chicago-Kent law professor Daniel Martin Katz and Michigan State law professor Michael James Bommarito, who conducted the original research alongside two others from legal AI company Casetext, said this week that they stand by their conclusions and the 90th percentile finding.
However, Katz and Bommarito said they plan to “correct points of confusion and misunderstanding that have arisen in public discourse” in the upcoming final version of their research paper. The draft version published in March focuses on GPT-4’s overall score, with the percentile conversion only appearing in a footnote.
OpenAI did not immediately respond to requests for comment Wednesday.
The gap in pass rates between the February and July bar exams can be dramatic. For example, the pass rate on Illinois’ most recent July exam was 68%, compared with 43% for February.
Martinez, Bommarito and Katz all agree that converting GPT-4’s Uniform Bar Exam score into a percentile is complicated by the fact that the National Conference of Bar Examiners, which designs the exam, does not publicly release score distributions, nor do states on a regular or consistent basis.
Katz and Bommarito said that their 90th percentile conclusion is conservative, because they threw out GPT-4’s high essay scores and because they used results from before the Covid-19 pandemic for comparison. Anecdotal evidence suggests that law student learning suffered amid the pandemic, they said.