Do the math: ChatGPT sometimes can't, expert says

Paulo Shakarian’s son wanted to have some fun with the natural-language processing tool ChatGPT recently, so he generated a fictitious movie script where Arnold Schwarzenegger fights Jean-Claude Van Damme.

Welcome to the world of artificial intelligence.

ChatGPT, which was designed by OpenAI, a small San Francisco company, is different from other large language models in that it allows the general public to experiment with it directly.

Want to know what to do for your child’s birthday? Ask ChatGPT.

Want poetry written in the style of William Shakespeare? ChatGPT will do that for you.


But Shakarian, an associate professor at Arizona State University who runs Lab V-2 in the Ira A. Fulton Schools of Engineering — the lab examines challenges in the field of artificial intelligence — is not as sold on ChatGPT’s capability of higher-level reasoning. In a paper that was accepted to the Association for the Advancement of Artificial Intelligence for its spring symposium, Shakarian detailed results of a study in which he tested ChatGPT on 1,000 mathematical word problems.

“Our initial tests on ChatGPT, done in early January, indicate that performance is significantly below the 60% accuracy for state-of-the-art algorithm for math word problem-solvers,” Shakarian said. “We are conducting a new experiment as OpenAI has stated that they have released a new version of ChatGPT with improvements in solving math problems.”

ASU News talked to Shakarian about the paper and ChatGPT’s uses as a product.

Editor's note: The following interview has been edited for length and clarity.

Question: For those not familiar with ChatGPT, what would you say it is and does?

Answer: It’s designed around a concept called next word prediction, where, when you ask it something, it’s going to predict what the related words are based on a corpus of (text and speech) data. It uses an underlying technology called the Transformer. This piece is critical because earlier technology … could only give good answers for, say, very short questions as opposed to something longer and more conversational.
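To make "next word prediction" concrete, here is a deliberately tiny sketch. It is not how ChatGPT works internally: ChatGPT uses a Transformer trained on a very large corpus, while this toy only counts which word most often follows another in a one-sentence corpus. The function names and the sample sentence are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigrams(text):
    """Count, for each word, which words follow it in the corpus."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        following[current][nxt] += 1
    return following


def predict_next(following, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    candidates = following.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# Made-up corpus, far smaller than anything a real model is trained on.
corpus = "the cat sat on the mat and the cat slept"
model = train_bigrams(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once
```

The toy only looks at one preceding word, which is roughly the kind of short-context limitation that earlier approaches had and that the Transformer was designed to move past.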

Q: So, what can someone do with ChatGPT?

A: I think the practical applications are probably going to be more in the creative and artistic space, as well as entertainment, where accuracy is not the most important thing. For things like a creative writing project, it could be really interesting. There was a recent story by a New York Times reporter who had a very long and strange conversation with the chat feature, where the thing went a little bit off the rails. But on the flip side of that, having something that appears sentient … does give an impression there’s someone on the other end, and some people might find entertainment value in that. That said, there could be ethical implications with such uses as well, as these models can appear almost human and gain the trust of a user. However, designers have very little control over what they communicate to such a trusting and possibly vulnerable individual. These problems are related to ones of social engineering.

Q: What are the limitations of ChatGPT?

A: One really well-known limitation is that the information in it only goes until the end of 2021. The reason for this is that ChatGPT uses what’s called a trained model, which means there’s a corpus of data used to train it. At some point that data has to stop, and it stopped at the end of 2021. So if you add new data, you usually have to start from scratch in this process. That’s significant because estimates for the computational cost, just the cost of computers and electricity … are somewhere in the neighborhood of four to five million dollars. So doing that is very expensive, which is why the limit on the data that goes into it is significant.

Q: So I couldn’t ask it about anything that happened in 2022, right?

A: Right. Now, what has happened recently is Microsoft has announced using similar models created by OpenAI to power Bing (Microsoft’s web search engine). Instead of giving you a response, you type in your prompt and behind the scenes it’s generating search queries, and then taking those search results and putting them back into the language model and using that to give you an answer.

Q: Sounds like Google.

A: It is, except it’s using the language models as layers to communicate between the human and the search engine. Let’s say you have a query around buying a car, and you have specifications about the size of the vehicle because maybe you have a small garage or something. Where before you might have to do some research to identify the sizes of various vehicles, and then do another set of searches to identify which ones meet the criteria, what happens with the new Bing is you just have one prompt that goes in and it’s using the language model to do a bunch of different searches all at once. Then it combines them to give you an end result.
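The flow Shakarian describes (one prompt in, several searches behind the scenes, one combined answer out) can be sketched roughly as follows. Every function here is a hypothetical stand-in with canned output; none of it is Bing's or OpenAI's actual code or API.

```python
def generate_queries(prompt):
    # Hypothetical stand-in for the language model turning a prompt into searches.
    return [f"{prompt} dimensions", f"{prompt} reviews"]


def search(query):
    # Hypothetical stand-in for a search engine; returns one canned snippet.
    return [f"result for '{query}'"]


def compose_answer(prompt, snippets):
    # Hypothetical stand-in for the model writing a reply from the snippets.
    return f"Answer to '{prompt}' based on {len(snippets)} snippets."


def answer(prompt):
    """Run every generated query, pool the results, and compose one reply."""
    snippets = []
    for query in generate_queries(prompt):
        snippets.extend(search(query))
    return compose_answer(prompt, snippets)


print(answer("small car"))
```

Note that nothing in this loop checks whether the composed reply actually agrees with the snippets that were retrieved.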

Q: So it’s a quicker process, essentially.

A: Yeah. From the search engine perspective, that’s where there could be some advantages. But there are also some serious drawbacks because the language model, both in the creation of the queries and in compiling the results together, makes no distinction between, say, adding in an extra sentence to make something more readable versus adding in an extra sentence with some false information that just sounds related to the topic. Because of that, people who have been experimenting on this have noticed that it has factual errors in the results; and by factual, I mean discrepancies between the final results and what the search engine actually found. So these are some of the problems that these companies will need to overcome.

Q: What were you trying to find out with your paper and what did the results tell you?

A: When ChatGPT first came out, there were all kinds of comments about how it was bad at math. There is a line of research in the field of natural language processing where people have studied how to create algorithms to solve mathematical word problems. Take a word problem that a junior high student would see that would maybe lead to a system of equations, nothing too bad, like two trains going at different speeds (to the same place). You can use algebra to solve those simultaneous equations. One key aspect about these math word problems is that they require multiple steps of inference. This simply means that once you take a look at the problem, there’s kind of a translation step, which is taking the words and turning them into equations. These are all multiple steps we’ve done in high school, and we wanted to see if ChatGPT could correctly do these steps. What we can conclude is that one of the limitations with ChatGPT is it’s just not capable of doing good multistep logical inference. And this makes sense because the underlying technology really wasn’t designed for that.
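As an illustration of the two steps, translation and then algebra, consider a made-up train problem (the numbers are not from Shakarian's study): train A travels at 60 mph, train B leaves one hour later at 80 mph, and we ask when B catches up. Translating the words gives d = 60t and d = 80(t - 1), which rearrange to d - 60t = 0 and d - 80t = -80. The sketch below solves that pair with Cramer's rule; the helper name is invented for this example.

```python
from fractions import Fraction


def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x + b1*y = c1 and a2*x + b2*y = c2 using Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("no unique solution")
    x = Fraction(c1 * b2 - c2 * b1, det)
    y = Fraction(a1 * c2 - a2 * c1, det)
    return x, y


# Translation step (assumed example): d - 60t = 0 and d - 80t = -80.
distance, hours = solve_2x2(1, -60, 0, 1, -80, -80)
print(distance, hours)  # 240 miles, 4 hours after train A departs
```

The arithmetic is the easy part for a program like this; the step that requires inference is reading the sentence and deciding which equations to write down.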


Scott Bordow
sbordow@asu.edu