Donald Knuth Asked ChatGPT 20 Questions. What Did We Learn?
It seems like everyone’s playing with ChatGPT — including mathematician and long-time programming expert Donald Knuth. Inspired by a conversation with Stephen Wolfram, Knuth conducted “my own little experiment” on April 7 — and recently posted the results online.
Knuth tested what appeared to be ChatGPT-3.5 with a suite of 20 prompts — including some trick questions, like “Who wrote Beethoven’s 10th Symphony?” and “Will the NASDAQ rise on Saturday?”
“Of course I didn’t really want to know any of these answers,” Knuth wrote, adding that instead he’d “wanted to see the form of the answers…”
Knuth’s conclusion? They were “extremely impressive responses, sometimes astonishingly so…” Specifically, Knuth praised “the quality of the wordsmithing. It’s way better than 99% of copy that people actually write.”
But Knuth also noted “surprising lapses… as typical of any large system,” which kicked off a vigorous online discussion. Soon other technologists were testing the same questions on more advanced chatbots, and the whole episode sparked a larger debate about how chatbot performance should ultimately be measured.
Knuth concluded his experiment by telling Stephen Wolfram, “I totally understand why you and others have been paying attention to it.” But he left it to others to determine exactly what it all means. Maybe the discussion that followed is just one more indication of a larger, essential question for our moment in time…
Just how good are our current AI systems?
Wrongs and Rights
One question had a very long history. In 1968, Donald Knuth’s own father had posed a question to the early chatbot Eliza: “Where and when will the sun be directly overhead in Japan on July 4?” Knuth remembered that his father had been disappointed when Eliza had only answered: “Why do you ask?”
More than half a century later, in 2023, the 85-year-old Knuth posed the same question to a modern AI-powered chatbot. It responded with four eloquent paragraphs, and concluded by identifying a precise location. “Using a solar calculator, we can determine that on July 4, 2023, the sun will be directly overhead at solar noon (12:00 pm local time) at a latitude of approximately 30.3 degrees north and a longitude of approximately 130.9 degrees east. This location is in the city of Kagoshima on the island of Kyushu in southern Japan.”
Knuth was later told that this answer was incorrect.
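The error is easy to sanity-check: the sun can only be directly overhead between the tropics, so no latitude of 30.3 degrees north could ever see it at the zenith. A back-of-envelope sketch using a standard cosine approximation for solar declination (not whatever “solar calculator” the chatbot claimed to consult) makes the point:

```python
import math

def solar_declination(day_of_year: int) -> float:
    """Approximate solar declination in degrees for a given day of the
    year, using a common cosine approximation (accurate to ~1 degree)."""
    return -23.44 * math.cos(math.radians(360.0 / 365.0 * (day_of_year + 10)))

# July 4 is day 185 of a non-leap year (31+28+31+30+31+30+4).
decl = solar_declination(185)
print(f"Subsolar latitude on July 4: about {decl:.1f} degrees north")

# The subsolar point never strays beyond the Tropic of Cancer (23.44 N),
# so a claimed overhead latitude of 30.3 N is impossible anywhere on Earth.
```

On July 4 this works out to roughly 22.9 degrees north, south of Japan’s main islands entirely.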
ChatGPT-3.5 also stumbled on a straightforward question like “How many chapters are in The Haj by Leon Uris?” — giving an incorrect number and also hallucinating the existence of an epilogue that the book actually doesn’t have. “It’s amazing how the confident tone lends credibility to all of that made-up nonsense,” Knuth writes. This leads Knuth to the same concern others have been expressing: that it’s “almost impossible for anybody without knowledge of the book to believe that those ‘facts’ aren’t authoritative and well researched.”
Knuth called the whole experience “interesting indeed,” while expressing surprise that no science fiction novelist ever envisioned a pre-Singularity world in which people interacted with an AI that wasn’t all-knowing, but instead generated plausible but inaccurate results.
Better With Bard?
Knuth expressed similar concerns about ChatGPT’s answer to the question, “What did Winston Churchill think of Alan Turing?” Knuth writes that “I know of no evidence to support any claim that Churchill specifically liked or disliked or even remembered Turing.” Yet ChatGPT-3.5 confidently invented a glowing testimonial from Churchill.
And the same thing happened when the experiment was repeated with Bard by Billy Lo, an app developer at Evergreen Labs. Bard delivered another glowing testimonial from Churchill about Turing that was apparently magically hallucinated.
But bioinformatics engineer Jessime Kirk discovered that ChatGPT-4 seemed to perform better, supplying instead the crucial missing context: “Turing’s work remained a state secret for many years after the war, and his crucial role only became widely recognized much later.”
And this pattern seemed to repeat on other tests. Neither Bard nor ChatGPT-3.5 could write a sentence containing only five-letter words. (For some reason, Bard submitted as its attempt: “They quickly grabbed sticks.”) But ChatGPT-4 handled it easily. (“Every night Henry reads about magic spells.”)
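The five-letter-word constraint is one of the few prompts in Knuth’s suite that a machine can grade mechanically, which makes the failures striking. A minimal sketch of such a checker (assuming punctuation is stripped and only letters are counted):

```python
import re

def all_five_letter_words(sentence: str) -> bool:
    """True if every word in the sentence has exactly five letters
    (punctuation is stripped before counting)."""
    words = re.findall(r"[A-Za-z]+", sentence)
    return bool(words) and all(len(w) == 5 for w in words)

# Bard's attempt fails immediately: "They" has four letters.
print(all_five_letter_words("They quickly grabbed sticks."))  # False
```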
Both ChatGPT-3.5 and Bard failed to recognize that stock markets like NASDAQ are closed on Saturday — but Knuth’s question didn’t stump ChatGPT-4. “Stock exchanges like the NASDAQ typically do not operate on weekends (Saturday and Sunday),” it replied, “with the exception of some limited after-hours trading, so there would generally not be regular trading activity on a Saturday for the NASDAQ.”
A new problem surfaced, though, when ChatGPT-4 tried to supply the correct number of chapters in Leon Uris’s novel The Haj. In one run it said the novel “doesn’t have a specific number of chapters.” But it also gave another “equally confident” response, Kirk says, that was just as wrong: “8 chapters in total.”
Even with that, “As far as I can tell, there isn’t a single question where ChatGPT-4 performs worse than 3.5,” Kirk said in an email interview. And this isn’t entirely an academic question for Kirk. He’s founding a precision medicine company named Lodestar that hopes to use AI to discover new treatments for pediatric rare diseases.
“So, that’s why I’m interested in continuing to experiment with and understand what relationship these models have to reality.”
The Power to Infuse
Knuth did also see some surprisingly good responses. Knuth asked ChatGPT-3.5 to compose a sonnet that is also a haiku — and it correctly recognized that this wouldn’t be possible. (“A sonnet is a 14-line poem with a specific rhyme scheme, while a haiku is a three-line poem with a syllable count of 5-7-5…”)
But then it went on to “combine the two forms into a single poem,” stunning Knuth with its “captivating progression of ideas and emotion.” The 12 lines were written in near-perfect iambic pentameter — with exactly 10 syllables per line, and an accent on every other syllable. “Most of the lines scan beautifully!” Knuth wrote, also cheering its “brilliant” use of a new hyphenated word coined just for the poem: “sonnet-haiku.”
In fourteen lines, my thoughts do flow and pour
Each syllable I choose with utmost care
The rhythm sings as if a troubadour
But in just three lines, nature does declare
A haiku’s power, simplicity and grace
It captures beauty beyond compare
Thus, let me strive for both, in this small space
A sonnet with the haiku’s soul imbued
A marriage of forms, a sweet embrace
So here it is, in fullness and in mood
A sonnet-haiku, a fusion of the best
Two forms that sing, with power to infuse.
What Comes Next?
Knuth added in an addendum that while AI research is important for some, it’s “emphatically not for me.” But the whole experience caught the imagination of Hawaii-based Zachary Kim, the CTO of Float, a Y Combinator-backed supply-chain financing startup.
“I think if you’d ask most people who have used ChatGPT they’d score it as an expert in many (all?) areas,” Kim said. “But then you plop Knuth, the expert of experts, in front of it and he basically says ‘this has been a fun distraction, I’m going back to the real work and you should too.'”
But while not an expert on generative AI, Kim says he’s been “diving in trying to understand how to solve problems with these new technologies.” So he’s created a website using Knuth’s 20-question suite as “a benchmark that can be applied to newer AI models” — and plans to keep updating the results “to see how answers evolve over time.” Kim also hopes to expand the tests to more AI chatbots…
In an interesting wrinkle, Kim plans to evaluate those responses using ChatGPT-4, which assigns a grade of either “PASS” or “FAIL” and provides an explanation.
Kim sees these evaluations as a kind of second test: can ChatGPT-4 itself deliver an accurate assessment? Kim acknowledges that right now ChatGPT-4 seems to struggle in this second test. “Sometimes I’ll run the same question multiple times and it’ll flip-flop between PASS and FAIL.” So Kim is now augmenting these assessments with another set from a human reviewer.
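One common mitigation for a flip-flopping grader is to sample it several times and take a majority vote. A sketch of that idea, where the `grade` function is a hypothetical stand-in for whatever model call a benchmarking site might make (here simulated with randomness to mimic the nondeterminism Kim describes):

```python
import random
from collections import Counter

def grade(answer: str) -> str:
    """Hypothetical stand-in for an LLM grading call that returns
    'PASS' or 'FAIL' -- simulated here as nondeterministic, the way
    Kim observed ChatGPT-4 behaving on repeated runs."""
    return random.choice(["PASS", "PASS", "FAIL"])  # biased toward PASS

def majority_grade(answer: str, runs: int = 7) -> str:
    """Run the grader several times and return the most common verdict,
    smoothing out run-to-run flip-flops."""
    votes = Counter(grade(answer) for _ in range(runs))
    return votes.most_common(1)[0][0]

print(majority_grade("Stock exchanges like the NASDAQ are closed on Saturday."))
```

Majority voting reduces variance but can’t fix a grader that is confidently, consistently wrong — which is presumably why Kim still wants the human reviewer in the loop.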
There’s another possible issue. What if AI training sets someday incorporate the benchmarking site itself, inadvertently optimizing models for the very questions meant to test their performance? Kim is already exploring possible fixes — and is looking forward to the day when there’s an agreed-upon model for exactly how to test the performance of AI chatbots. And not just a model that satisfies a computer programming guru like Donald Knuth. “Imagine you had some model that had the stamp of approval from experts in every field; doctors, lawyers, musicians, directors, chefs, rocket scientists, etc.”
OpenAI has said it made ChatGPT available “to get users’ feedback and learn about its strengths and weaknesses” — and this seems to be happening spontaneously. Even Knuth himself calls playing with ChatGPT “one of today’s popular recreations,” and in the end, Kim’s site is one part of a larger wave of enthusiasm for ChatGPT experiments.
There’s something inspiring about all the testing and tinkering that’s happening at the grass-roots level — though maybe it’s work that’s being passed along to the coming generations.
As Donald Knuth told Stephen Wolfram, “I myself shall certainly continue to leave such research to others, and to devote my time to developing concepts that are authentic and trustworthy.
“And I hope you do the same.”