Talk:Neural scaling law


todo list

More data

https://epochai.org/blog/extrapolating-performance-in-language-modelling-benchmarks

PaLM2 paper. Almost no details, but there's something.

Relevant table-of-contents entries: §2 Scaling law experiments (2.1 Scaling laws, 2.2 Downstream metric evaluations) and Appendix A Detailed results (A.1 Scaling laws). pony in a strange land (talk) 15:25, 20 May 2023 (UTC)

[[2305.18565] PaLI-X: On Scaling up a Multilingual Vision and Language Model](https://arxiv.org/abs/2305.18565)

[[2306.13575] Scaling MLPs: A Tale of Inductive Bias](https://arxiv.org/abs/2306.13575)

> performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet); the lack of inductive bias is compensated for. MLPs mimic the behaviour of their modern counterparts faithfully, though some components of the learning setting surprisingly exhibit stronger or unexpected behaviours.

Trading training and inference costs

[[2104.03113] Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)

> Training compute and inference compute (MCTS steps) can be traded off against each other: 10x more MCTS steps buys roughly the same playing strength as 10x more training compute.
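
A toy sketch of that tradeoff, assuming playing strength is roughly log-linear in both training and inference compute (which is what the board-games paper suggests); the coefficients are made up purely for illustration:

```python
# Toy model: Elo as a log-linear function of training compute and MCTS steps.
# Coefficients are illustrative only, not fitted values from the paper.
import math

def elo(train_flops: float, mcts_steps: float,
        a: float = 300.0, b: float = 300.0, c: float = -5000.0) -> float:
    """Hypothetical Elo from the two compute budgets."""
    return a * math.log10(train_flops) + b * math.log10(mcts_steps) + c

base = elo(1e15, 100)
# With a == b, 10x more MCTS steps buys the same Elo as 10x more training compute.
print(elo(1e15, 1000) - base)   # +300 Elo from 10x inference compute
print(elo(1e16, 100) - base)    # +300 Elo from 10x training compute
```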

Figures 6 and 7 of the AlphaCode paper: https://arxiv.org/pdf/2203.07814.pdf

Scaling by data quality

[[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)

> the scaling of error versus dataset size

> faster than power law scaling, possibly even exponential scaling, if we have a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size
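
A minimal numerical sketch of the claimed contrast, assuming the baseline error follows a power law in dataset size while ideal pruning yields exponential decay; alpha and c below are illustrative constants, not the paper's fitted values:

```python
# Contrast baseline power-law error scaling with the (possibly) exponential
# scaling the paper argues an ideal pruning metric could unlock.
import math

def err_power_law(n: int, alpha: float = 0.3) -> float:
    """Baseline: test error falls as a power law in dataset size n."""
    return n ** (-alpha)

def err_ideal_pruning(n: int, c: float = 1e-4) -> float:
    """Hypothetical best case with ideal pruning: exponential decay in n."""
    return math.exp(-c * n)

for n in (10_000, 100_000, 1_000_000):
    print(n, err_power_law(n), err_ideal_pruning(n))
```

The point of the toy numbers: past some dataset size the exponential curve drops far below any power law.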

If phi-1 replicates, incorporate it too: https://arxiv.org/abs/2306.11644 (pipeline summarized below; rough sketch after the list).

  1. started with The Stack (a 3 TB collection of code) and text from StackOverflow
  2. used an LLM to select 6B "high-quality" tokens from (1)
  3. used GPT-3.5 to generate 1B tokens of text similar to textbooks
  4. trained a small (1.3B parameter) model ("phi-1") on (2) and (3)
  5. used GPT-3.5 to generate text similar to textbook exercises
  6. fine-tuned phi-1 on (5)
  7. tested phi-1 on HumanEval to evaluate its programming ability
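
A runnable toy sketch of that pipeline; every function and constant here is a hypothetical stub standing in for the real data-selection and training code, scaled down so it executes instantly:

```python
# Sketch of the phi-1 pipeline summarized in the list above. All functions are
# hypothetical stubs, not a real API; token counts are scaled down drastically.

def select_high_quality(corpus, target_tokens):
    # Step 2: an LLM-based filter would keep "textbook quality" text;
    # here we simply truncate to the token budget as a placeholder.
    return corpus[:target_tokens]

def generate_synthetic(n_tokens, kind):
    # Steps 3 and 5: GPT-3.5 would generate textbook-like text or exercises.
    return [f"<synthetic {kind} token>"] * n_tokens

def train(n_params, data):
    # Steps 4 and 6: stand-in for pretraining / fine-tuning the 1.3B-parameter model.
    return {"params": n_params, "seen_tokens": len(data)}

raw = ["<Stack / StackOverflow token>"] * 10_000             # step 1 (scaled down)
pretrain = select_high_quality(raw, 6_000) + generate_synthetic(1_000, "textbook")
phi1 = train(1.3e9, pretrain)                                # step 4
phi1 = train(phi1["params"], generate_synthetic(500, "exercise"))  # step 6
# Step 7 would report pass@1 on HumanEval; omitted here.
```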

RL scaling

Scaling laws for reward model overoptimization
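
If that one goes in, the headline functional forms (as I recall them; worth double-checking against the paper before citing) relate the gold reward to the square root of the KL divergence between the optimized policy and the initial policy:

```latex
% Recalled functional forms from "Scaling Laws for Reward Model Overoptimization";
% d is the square root of the KL divergence from the initial policy, and the
% alpha/beta coefficients depend on reward-model size.
\[
  d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}, \qquad
  R_{\mathrm{BoN}}(d) = d\,(\alpha_{\mathrm{BoN}} - \beta_{\mathrm{BoN}}\, d), \qquad
  R_{\mathrm{RL}}(d) = d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)
\]
```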

Gato? RoboCat?

Theoretical explanations

[1]

[2]

Bottou, L.; Bousquet, O. (2011). "The Tradeoffs of Large-Scale Learning". In Optimization for Machine Learning.

[3]
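
If [2] is used, the headline relation worth stating in the article (as I read the paper) is that the parameter-scaling exponent is set by the intrinsic dimension of the data manifold:

```latex
% Sharma & Kaplan (ref [2]): loss falls as a power law in model size N, with the
% exponent roughly 4 divided by the intrinsic dimension d of the data manifold.
\[
  L(N) \propto N^{-\alpha}, \qquad \alpha \approx \frac{4}{d}
\]
```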


References

  1. ^ Hutter, Marcus (2021-02-01). "Learning Curve Theory".
  2. ^ Sharma, Utkarsh; Kaplan, Jared (2022). "Scaling Laws from the Data Manifold Dimension". Journal of Machine Learning Research. 23 (9): 1–34. ISSN 1533-7928.
  3. ^ Allen-Zhu, Zeyuan; Li, Yuanzhi (2024-04-08). "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws". doi:10.48550/arXiv.2404.05405. Retrieved 2024-04-25.