One effective way to make a large language model (LLM) fit for purpose and grounded in an organization's data is fine-tuning. But companies often report that refining a model comes at a cost: once tuned, some models "forget" how to perform certain tasks, or other tasks they had already learned.
Research from the University of Illinois Urbana-Champaign proposes a new method for retraining models that avoids "catastrophic forgetting," in which a model loses some of its prior knowledge. The work focuses on two vision-language models that generate responses from images: LLaVA and Qwen 2.5-VL.
The approach encourages enterprises to train only narrow parts of the model, avoiding the significant computational cost of retraining the entire model. The team argues that catastrophic forgetting is not true memory loss, but rather a side effect of bias drift.
“Training a new LMM can cost millions of dollars and weeks of time and emit hundreds of tons of CO2, so it is an urgent problem to find ways to update existing models more efficiently and effectively,” the team wrote in their paper. “Guided by this result, we are investigating tuning formulations that preserve learning while limiting shifts in the output distribution.”
The researchers focused on the multi-layer perceptron (MLP), the internal decision-making component of the model.
Catastrophic forgetting
The researchers first wanted to confirm the existence, and identify the cause, of catastrophic forgetting in the models.
To do this, they created a set of target tasks for the models to perform, then fine-tuned and evaluated the models to determine whether significant forgetting occurred. As the process continued, however, the researchers found that the models regained some of their abilities.
“We also noticed a surprising result: model performance decreased significantly on held-out benchmarks after training on the counting task, and mostly returned to normal after training on PathVQA, another specialized task that is not well represented in the benchmarks,” they said. “Meanwhile, when performing forgetting mitigation experiments, we also tried to separately tune only the self-attention projection (SA Proj) or MLP layers, motivated by the finding that tuning only the LLM was generally better than tuning the full model. This led to another very surprising result – that tuning only the self-attention projection layers led to very good learning of the target tasks without any drop in held-out task performance, even after training on all five target tasks sequentially.”
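In practice, "tuning only the self-attention projection layers" means freezing every parameter except the attention projections. The sketch below illustrates one way to select those parameters by name; the parameter names follow common Hugging Face conventions for LLaVA/Qwen-style decoder layers and are illustrative, not taken from the paper's code.

```python
# Sketch: selecting only self-attention projection (SA Proj) parameters
# to leave trainable, freezing everything else. Names below follow common
# Hugging Face decoder-layer conventions and are hypothetical examples.

SA_PROJ_KEYS = ("q_proj", "k_proj", "v_proj", "o_proj")

def trainable_sa_proj(param_names):
    """Return the parameter names that stay trainable when tuning only
    the self-attention projections; all other parameters are frozen."""
    return [n for n in param_names
            if any(key in n for key in SA_PROJ_KEYS)]

param_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
]

trainable = trainable_sa_proj(param_names)
print(trainable)  # only the four self_attn projection weights
```

With a real framework, the same filter would typically set `requires_grad = False` on every parameter whose name does not match, so the optimizer only updates the attention projections.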
The researchers said they believe that "what appears to be forgetting or interference after fine-tuning on a narrow target task is actually a bias in the output distribution resulting from a shift in the task distribution."
Narrow retraining
This discovery turned out to be the key to the experiment. The researchers noted that tuning the MLP increases the likelihood of "generating numerical tokens and a highly correlated decline in task accuracy." This showed that a model forgetting some of its knowledge is a temporary issue, not a long-term one.
“To avoid biasing the output signal distribution, we tune the MLP gating/up projections while keeping the down projection frozen, and find that this allows for learning similar to full MLP tuning, with little forgetting,” the researchers said.
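The recipe described above splits the MLP into parts that are tuned and a part that stays frozen. A minimal sketch of that split, again using hypothetical Hugging Face-style parameter names (the paper's actual code may differ):

```python
# Sketch: tune the MLP gate/up projections while keeping the down
# projection frozen, per the recipe described in the quote above.
# Parameter names are hypothetical Hugging Face-style examples.

def mlp_freeze_plan(param_names):
    """Map each MLP parameter name to whether it stays trainable:
    gate/up projections are tuned, the down projection is frozen."""
    plan = {}
    for name in param_names:
        if "down_proj" in name:
            plan[name] = False   # frozen: limits output-distribution drift
        elif "gate_proj" in name or "up_proj" in name:
            plan[name] = True    # tuned: carries the new task learning
    return plan

mlp_params = [
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
    "model.layers.0.mlp.down_proj.weight",
]

plan = mlp_freeze_plan(mlp_params)
for name, trainable in plan.items():
    print(name, "-> trainable" if trainable else "-> frozen")
```

The design intuition from the paper is that the down projection writes directly into the residual stream, so freezing it keeps the model's output token distribution stable while the gate/up projections still absorb the new task.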
This allows for a simpler and more repeatable method of tuning the model.
By tuning a narrow segment of the model rather than retraining it wholesale, enterprises can reduce computational costs. It also gives them greater control over output drift.
However, the research covered only two models, both of them vision-language models. The researchers noted that, due to limited resources, they were unable to run the experiment on other models.
Even so, they suggest their findings may extend to other LLMs, including those working across different modalities.
