Transformer

Scaling LLMs for Long Texts with FocusLLM

In traditional Transformer architectures, computational complexity grows quadratically (O(L²)) with sequence length, making long sequences resource-intensive to process. This high demand for resources makes it impractical to extend the context length directly. Even when fine-tuned on longer sequences, LLMs often struggle with extrapolation, failing to perform well on sequences longer than those seen during training.
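To make the quadratic cost concrete, here is a tiny illustrative PyTorch sketch (not taken from the FocusLLM paper): the self-attention score matrix for a sequence of length L has L × L entries, so doubling the context quadruples the work. The projection-free single-head setup below is a simplifying assumption.

```python
import torch

def attention_scores(x: torch.Tensor) -> torch.Tensor:
    """Toy single-head self-attention scores for x of shape (L, d)."""
    q, k = x, x                                 # illustrative: no learned projections
    scores = (q @ k.T) / x.shape[-1] ** 0.5     # (L, L) matrix -> O(L^2) time and memory
    return scores.softmax(dim=-1)

x = torch.randn(1_024, 64)
print(attention_scores(x).shape)                # torch.Size([1024, 1024])

# Doubling the sequence length quadruples the number of score-matrix entries.
for L in (1_024, 2_048, 4_096):
    print(L, "->", L * L, "entries")
```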


Paper Skimming

Test-Time Training — Is It an Alternative to Transformer?

This research paper shows that TTT-Linear outperforms Mamba and Transformer baselines on contexts as long as 32k tokens (see the card below). A self-supervised loss applied to each test sequence reduces the likelihood of information being forgotten in long sequences. Will Test-Time Training (TTT) solve the problem of forgetting information in long sequences? As for the algorithm: …
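The excerpt cuts off before the algorithm itself, so what follows is only a minimal, hypothetical sketch of the test-time-training idea, not the paper's TTT-Linear recipe: the layer's memory is a small linear model W that takes one gradient step on a self-supervised reconstruction loss for every token it processes at test time. The corruption scheme, learning rate, and zero initialization are placeholder assumptions; the actual paper learns its own views and projections.

```python
import torch

def ttt_layer(tokens: torch.Tensor, lr: float = 0.1) -> torch.Tensor:
    """Sketch of a TTT-style layer; tokens has shape (L, d)."""
    L, d = tokens.shape
    W = torch.zeros(d, d, requires_grad=True)    # inner linear model = the layer's memory
    outputs = []
    for t in range(L):
        x = tokens[t]
        corrupted = 0.5 * x                      # placeholder corruption for the self-supervised task
        loss = ((corrupted @ W - x) ** 2).mean() # reconstruct the clean token from the corrupted view
        (grad,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            W -= lr * grad                       # one gradient step per test token
        outputs.append(x @ W)                    # read out with the updated memory
    return torch.stack(outputs)

print(ttt_layer(torch.randn(8, 16)).shape)       # torch.Size([8, 16])
```

Because the memory is updated by an explicit learning rule rather than by appending keys and values, the per-token cost stays constant instead of growing with context length, which is what makes this family of layers interesting for long sequences.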


Research Highlights