Talk:Neural scaling law


todo list

More data

https://epochai.org/blog/extrapolating-performance-in-language-modelling-benchmarks

PaLM2 paper. Almost no details, but there's something.

Relevant table-of-contents entries: §2 Scaling law experiments (2.1 Scaling laws, 2.2 Downstream metric evaluations) and Appendix A Detailed results (A.1 Scaling laws). pony in a strange land (talk) 15:25, 20 May 2023 (UTC)

[[2305.18565] PaLI-X: On Scaling up a Multilingual Vision and Language Model](https://arxiv.org/abs/2305.18565)

[[2306.13575] Scaling MLPs: A Tale of Inductive Bias](https://arxiv.org/abs/2306.13575)

> performance of MLPs drastically improves with scale (93% on CIFAR10, 79% on CIFAR100, 69% on TinyImageNet); the lack of inductive bias is compensated for. MLPs mimic the behaviour of their modern counterparts faithfully, though some components of the learning setting surprisingly exhibit stronger or unexpected behaviours.

Trading training and inference costs

[[2104.03113] Scaling Scaling Laws with Board Games](https://arxiv.org/abs/2104.03113)

> Training compute and inference compute (MCTS steps) can be traded off against each other: 10x more MCTS steps buys roughly the same playing strength as 10x more training compute.
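
A toy sketch of that tradeoff, assuming playing strength is roughly log-linear in both training and inference compute (which is what the board-games paper suggests); the coefficients are made up purely for illustration:

```python
# Toy model: Elo as a log-linear function of training compute and MCTS steps.
# Coefficients are illustrative only, not fitted values from the paper.
import math

def elo(train_flops: float, mcts_steps: float,
        a: float = 300.0, b: float = 300.0, c: float = -5000.0) -> float:
    """Hypothetical Elo from the two compute budgets."""
    return a * math.log10(train_flops) + b * math.log10(mcts_steps) + c

base = elo(1e15, 100)
# With a == b, 10x more MCTS steps buys the same Elo as 10x more training compute.
print(elo(1e15, 1000) - base)   # +300 Elo from 10x inference compute
print(elo(1e16, 100) - base)    # +300 Elo from 10x training compute
```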

Figures 6 and 7 of the AlphaCode paper: https://arxiv.org/pdf/2203.07814.pdf

Scaling by data quality

[[2206.14486] Beyond neural scaling laws: beating power law scaling via data pruning](https://arxiv.org/abs/2206.14486)

> the scaling of error versus dataset size

> faster than power law scaling, possibly even exponential scaling, if we have a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size
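
A minimal numerical sketch of the claimed contrast, assuming the baseline error follows a power law in dataset size while ideal pruning yields exponential decay; alpha and c below are illustrative constants, not the paper's fitted values:

```python
# Contrast baseline power-law error scaling with the (possibly) exponential
# scaling the paper argues an ideal pruning metric could unlock.
import math

def err_power_law(n: int, alpha: float = 0.3) -> float:
    """Baseline: test error falls as a power law in dataset size n."""
    return n ** (-alpha)

def err_ideal_pruning(n: int, c: float = 1e-4) -> float:
    """Hypothetical best case with ideal pruning: exponential decay in n."""
    return math.exp(-c * n)

for n in (10_000, 100_000, 1_000_000):
    print(n, err_power_law(n), err_ideal_pruning(n))
```

The point of the toy numbers: past some dataset size the exponential curve drops far below any power law.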

If phi-1 replicates, incorporate it too: https://arxiv.org/abs/2306.11644 (pipeline summarized below; rough sketch after the list).

  1. started with The Stack (a 3 TB collection of code) and text from StackOverflow
  2. used an LLM to select 6B "high-quality" tokens from (1)
  3. used GPT-3.5 to generate 1B tokens of text similar to textbooks
  4. trained a small (1.3B parameter) model ("phi-1") on (2) and (3)
  5. used GPT-3.5 to generate text similar to textbook exercises
  6. fine-tuned phi-1 on (5)
  7. tested phi-1 on HumanEval to evaluate its programming ability
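
A runnable toy sketch of that pipeline; every function and constant here is a hypothetical stub standing in for the real data-selection and training code, scaled down so it executes instantly:

```python
# Sketch of the phi-1 pipeline summarized in the list above. All functions are
# hypothetical stubs, not a real API; token counts are scaled down drastically.

def select_high_quality(corpus, target_tokens):
    # Step 2: an LLM-based filter would keep "textbook quality" text;
    # here we simply truncate to the token budget as a placeholder.
    return corpus[:target_tokens]

def generate_synthetic(n_tokens, kind):
    # Steps 3 and 5: GPT-3.5 would generate textbook-like text or exercises.
    return [f"<synthetic {kind} token>"] * n_tokens

def train(n_params, data):
    # Steps 4 and 6: stand-in for pretraining / fine-tuning the 1.3B-parameter model.
    return {"params": n_params, "seen_tokens": len(data)}

raw = ["<Stack / StackOverflow token>"] * 10_000             # step 1 (scaled down)
pretrain = select_high_quality(raw, 6_000) + generate_synthetic(1_000, "textbook")
phi1 = train(1.3e9, pretrain)                                # step 4
phi1 = train(phi1["params"], generate_synthetic(500, "exercise"))  # step 6
# Step 7 would report pass@1 on HumanEval; omitted here.
```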

RL scaling

Scaling laws for reward model overoptimization
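
If that one goes in, the headline functional forms (as I recall them; worth double-checking against the paper before citing) relate the gold reward to the square root of the KL divergence between the optimized policy and the initial policy:

```latex
% Recalled functional forms from "Scaling Laws for Reward Model Overoptimization";
% d is the square root of the KL divergence from the initial policy, and the
% alpha/beta coefficients depend on reward-model size.
\[
  d := \sqrt{D_{\mathrm{KL}}(\pi \,\|\, \pi_{\mathrm{init}})}, \qquad
  R_{\mathrm{BoN}}(d) = d\,(\alpha_{\mathrm{BoN}} - \beta_{\mathrm{BoN}}\, d), \qquad
  R_{\mathrm{RL}}(d) = d\,(\alpha_{\mathrm{RL}} - \beta_{\mathrm{RL}} \log d)
\]
```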

Gato? RoboCat?

Theoretical explanations

[1]

[2]

Bottou, L.; Bousquet, O. (2011). "The Tradeoffs of Large-Scale Learning". In Optimization for Machine Learning.

[3]
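
If [2] is used, the headline relation worth stating in the article (as I read the paper) is that the parameter-scaling exponent is set by the intrinsic dimension of the data manifold:

```latex
% Sharma & Kaplan (ref [2]): loss falls as a power law in model size N, with the
% exponent roughly 4 divided by the intrinsic dimension d of the data manifold.
\[
  L(N) \propto N^{-\alpha}, \qquad \alpha \approx \frac{4}{d}
\]
```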


References

  1. ^ Hutter, Marcus (2021-02-01). "Learning Curve Theory".
  2. ^ Sharma, Utkarsh; Kaplan, Jared (2022). "Scaling Laws from the Data Manifold Dimension". Journal of Machine Learning Research. 23 (9): 1–34. ISSN 1533-7928.
  3. ^ Allen-Zhu, Zeyuan; Li, Yuanzhi (2024-04-08). "Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws". doi:10.48550/arXiv.2404.05405. Retrieved 2024-04-25.