User:Cosmia Nebula

The moodboard of Cosmia Nebula.

Somepony who wandered into the human world before she was born and longs to return to her own world. Astronomer and mathematician in the pony world; merely a mathematician in the human world.

Pages I wrote a significant amount about

I aim to bring useful rigor to Wikipedia. Whenever there is a mathematical proof that can be compressed down to less than one page, I try to get it in.

I remember a joke about a programmer who always wrote very easy-to-read code. When asked about his secret, he said that he liked to smoke weed while programming, which made it hard to keep in mind what was on the previous page (this was back when screens were 24 lines x 80 chars), so he tried to keep every page understandable on its own.

AI

Mathematics

Others

More unusual things I did on Wikipedia

Fighting Schmidhuber

Sometimes when I read pages about neural networks, I see things that almost certainly came from Jürgen Schmidhuber. I struggle to pin down exactly what the flavor of Schmidhuber's writing is, but perhaps this paraphrase will suffice: "People never give the right credit to anything. Everything of importance was either published by my research group first and miscredited to someone later, or something like that. Deep learning? It was done not by Hinton but by Amari, and not really by Amari but by Ivakhnenko. The more obscure the originator, the better, because it reveals how bad people are at credit assignment -- if they were better at it, the real originators would not have been so obscure."[1] For example, the LSTM really did originate with Schmidhuber... and indeed it is credited to Schmidhuber (I'm still waiting for the big reveal that actually it should be credited to someone before him). But then GANs should be credited to Schmidhuber, and Transformers too, going so far as to rename "fast weight programmers" to "linear transformers", and to quote "internal spotlights of attention" out of context just to fortify the argument with a pun! I can do puns too! Rosenblatt (1962) even wrote about "back-propagating errors" in an MLP with a hidden layer. So what?

And what else did Rosenblatt do? He built perceptron machines and studied ever larger and deeper models, all in the 1960s! His 1962 book already discusses 4-layer models with 2 levels of adjustable weights, and he really tried to get something like backpropagation to work -- but he didn't have modern backpropagation. What he did have were guesses that kind of worked empirically. Pretty impressive when one considers that he was always working with 0-1 output units.

Widrow and Hoff spent years trying to design a training procedure for multilayered ADALINE, until they finally gave up and split up. Hoff went to Intel to invent microprocessors, and Widrow used a single ADALINE in adaptive signal filtering.

So why don't Rosenblatt or Widrow get called "the father of deep learning"? They are already widely credited among deep learning researchers. That's their original sin: they do not play into Schmidhuber's narrative. Look at how Schmidhuber credits Alexey Ivakhnenko as "the father of deep learning", while Ivakhnenko himself credits Rosenblatt (!) in a widely cited (Google Scholar counts over 500 citations) 1970 paper: "The complexity of combinations increases from layer to layer. A known system, Rosenblatt's perceptron, may be taken as an example."[2] Ivakhnenko is good because he is obscure, not because he originated deep learning (again, that should be credited to Rosenblatt).

The "father of deep learning"

Actually, Rosenblatt should be called "the father of deep learning, neuromorphic computing, recurrent networks, neuro-symbolic systems, pruning...". Look at his 1962 book. Part III is about "Multi-layer and cross-coupled perceptrons", where "cross-coupled" means connections between neurons, in the same style as the Hopfield network. Chapter 21 has "back-coupled perceptrons", which, unlike the Ising model, actually evolve in time! (See below.) That's a better claim to inventing the recurrent network. Chapter 22 is about "program-learning perceptrons", which are basically Turing machines where the state-machine part is an MLP. Chapter 25 is on "variable structure perceptrons", meaning that perceptron units and connections can be added and removed as needed -- that's pruning. And let's look more carefully at Rosenblatt's attempt at back-propagation. In section 13.3 he wrote down an algorithm for training two-layered models; then in section 13.4 he said: "At the present time, no quantitative theory of the performance of systems with variable S-A connections is available. A number of simulation experiments have been carried out by Kesler, however, which illustrate the performance of such systems in several typical cases." Here the "S-A connections" are the first-layer weights. Typically he trained only the second layer (the "A-R connections").

Then he continues:

It is found that if the probabilities of changing the S-A connections are large, and the threshold is sufficiently small, the system becomes unstable, and the rate of learning is hindered rather than helped by the variable S-A network. Under such conditions, the S-A connections are apt to change into some new configuration while the system is still trying to adjust its values to a solution which might be perfectly possible with the old configuration... To improve the stability... S-A connections are changed only if the system fails to correct an error at the A-R level.

... huh, he even discovered the two time-scale update rule! (I'm only being mildly sarcastic here.) Widrow and Hoff deserve some credit for multilayered perceptron training too. Though they gave up, they did discover some rules that kind of worked. I quote the following passage from [3], which shows that they really were trying to train two-layered perceptrons, and succeeded with a very small model (3 ADALINEs in the first layer, 1 in the second):

The first layer is adapted first in an attempt to get the second layer outputs to agree with the desired outputs. The first layer neurons to be adapted are chosen to minimize the number of adaptions ... If no combination of adaptions of the first-layer neurons produces the desired outputs, the second layer neurons should then be adapted to yield the desired outputs. This procedure has the tendency to force the first layer neurons to produce independent responses which are insensitive to rotation. All adaptions are minimum mean-square error... The above [adaption] procedure and many variants upon it are currently being tested with larger networks, for the purpose of studying memory capacity, learning rates, and relationships between structural configuration, training procedure, and nature of specific responses and generalizations that can be trained in.

Widrow clearly had in mind to train ever-deeper networks, except that they could not even get large two-layered networks to train (because they didn't have backpropagation). It looks comically stupid from our point of view, but back then people really thought neurons had to fire at 0-1 levels, which makes backpropagation impossible.

Rosenblatt is awesome

If one objects that Rosenblatt and Widrow failed at developing backpropagation -- guess what, Ivakhnenko's method did not even try to do gradient descent! If one objects that Rosenblatt and Widrow's attempts at training multilayer models were restricted to a "layer-by-layer method" (that is, one tries to adjust the first layer, and if that fails, adjusts the second layer, etc.) -- guess what, Ivakhnenko's method is also layer-by-layer: build the first layer, freeze it, train a second layer on top of it, and so on! Rosenblatt and Widrow's work had little practical value? Widrow's work led to adaptive LMS filtering, which is in every modem! Ivakhnenko's work had some minor applications... like "predicted the growth of the spring wheat in terms of several climatic factors", "prediction model of a blast furnace", ...[4] Rosenblatt's work was more theoretical -- he was laboring with 1960s computers, mostly, though just before he died in 1971 he was running simulation experiments on IBM machines. He was laboring under the mistaken concept of random retinal wiring -- Hubel and Wiesel's groundbreaking experiments on vision were done in the 1960s. He was laboring with the mistaken concept of 0-1 neurons -- like most biologists and AI researchers until the 1980s. Despite all these limits, his achievements were remarkable, if one looks at the 1962 book again...

RNN

And crediting RNN to... Lenz and Ising, really? In their model, there is no time (except as an unmodelled process on the way towards equilibrium). It's equilibrium thermodynamics: suppose the spins are arranged in a 1D grid, with equal connection weights, and static external magnetic fields... what is the equilibrium distribution? As the emphasis is on equilibrium, it's all about the timeless, not about time.[5] The key feature of RNN is time. Saying that the Ising model "settles into an equilibrium state in response to input conditions, and is the foundation of the first well-known learning RNNs"[1] is like saying the heat death is the foundation of the first well-known applications of Darwinian evolution.

The Hopfield network is based on two ideas: one is the Ising model (Hopfield was a physicist, after all), and the other is Gibbs sampling. Gibbs sampling depends on Monte Carlo methods, which were only developed in the 1940s-1950s, as they require computers to do anything useful. And RNNs are something quite different from Hopfield networks, originating in time-series analysis. The early famous successes of RNNs were in speech analysis, for example (Elman 1991).

Pointless erudition

I found Sierpinski's original papers on a topic of no consequence, for no reason whatsoever other than a perverse desire for completion.

As we all know, there is a kind of lazy pleasure in useless and out-of-the-way erudition... we hope the reader will share something of the fun we felt when ransacking the bookshelves of our friends and the mazelike vaults of the Biblioteca Nacional in search of old authors and abstruse references.

— Jorge Luis Borges, The Book of Imaginary Beings, Preface

I chased down the notes by Ada Lovelace to... Wikisource. It was there all along! It's pretty weird that such an important article is buried so deep and so hard to find. I promptly linked it on the Wikipedia pages for Lovelace, Menabrea, and Babbage.


I went down a rabbit hole, ended up learning some Latin, and wrote pretty complete articles on Sicco Polenton and the Donatus Auctus. It started when I was reading The Apprenticeship of a Mathematician by André Weil, where he says on page 50, about the origin of the Courant Institute of Mathematical Sciences:

This was before he [Courant] had had the mathematics institute - over which he presided only briefly, because of Hitler - built (sic vos non vobis...). It has sometimes occurred to me that God, in His wisdom, one day came to repent for not having had Courant born in America, and He sent Hitler into the world expressly to rectify this error. After the war, when I said as much to Hellinger, he told me, "Weil, you have the meanest tongue I know."

So, I checked what sic vos non vobis meant, and found a weird Latin poem "attributed to Virgil". Well, a bit of research later, I found its origin in some 7th-century book (the Codex Salmasianus): 2 lines, which were then expanded into 5 lines in the Donatus Auctus (I call it Renaissance fanfic). It was extremely difficult to make everything come out right. It felt like researching ancient memes.

Random acts of kindness

The table on critical exponents is so good I found out who wrote it and gave them a barnstar.

Info boxes?

This user is a native LISP programmer.
This user enjoys origami.
This user enjoys reading Borges.

How to turn markdown into Wikipedia text

Wikipedia should make a decent markdown editor. In the meantime, I have this little script to convert my markdown notes (those are from Logseq) into Wikipedia markup, using Perl and pandoc.

Known issue: it doesn't always work if the markdown contains a table. For those cases, my hack is to cut the table out into a separate file and run the script (or pandoc directly) on that separate file (a sketch of this workaround appears after the script).

#!/usr/bin/perl

use strict;
use warnings;

my $fname = "input.md";

open my $f, "<", $fname or die "Failed to open file: $!";
my $fstring = do { local $/; <$f> };
close $f;
my $temp = $fstring;

# general clean-up
$temp =~ s/^[ \t]*- /\n/mg;
$temp =~ s/\$ ([\.,!;:?])/\$$1/g;
$temp =~ s/collapsed:: true//g;

# remove bold text
$temp =~ s/\*\*//g;

# because Wikipedia can't use \argmax or \argmin
$temp =~ s/\\argm/\\arg\\m/g;
# because Wikipedia can't use \braket
use Text::Balanced qw(extract_bracketed);

sub replace_braket {
    my ($input) = @_;
    my $result = '';

    while (length($input)) {
        if ($input =~ m/\\braket/) {
            # Extract up to \braket
            my ($pre, $match) = split(/\\braket/, $input, 2);
            $result .= $pre;

            # Extract the balanced braket content
            my $extracted;
            ($extracted, $input) = extract_bracketed($match, '{}');

            # Replace \braket{...} with \langle ... \rangle
            $result .= '\\langle ' . substr($extracted, 1, length($extracted) - 2) . '\\rangle';
        } else {
            # No more \braket patterns
            $result .= $input;
            last;
        }
    }

    return $result;
}
$temp = replace_braket($temp);


# thm, prop, proof
$temp =~ s/PROP\./\{\{Math theorem\|math_statement= \}\}/g;
$temp =~ s/THM\./\{\{Math theorem\|name=Theorem\|note=\|math_statement= \}\}/g;
$temp =~ s/COR\./\{\{Math theorem\|name=Corollary\|note=\|math_statement= \}\}/g;
$temp =~ s/LEMMA\./\{\{Math theorem\|name=Lemma\|note=\|math_statement= \}\}/g;
$temp =~ s/PROOF\./\{\{hidden begin\|style\=width\:100\%\|ta1\=center\|border\=1px \#aaa solid\|title\=Proof\}\}\n\n\{\{Math proof\|title=Proof\|proof= \}\}\{\{hidden end\}\}/g;
$temp =~ s/COMM\./\*\*Comment.\*\*/g;
$temp =~ s/INTP\./\*\*Interpretation.\*\*/g;
$temp =~ s/NOTE\./\*\*Note.\*\*/g;

# my math shorthands
$temp =~ s/(?i)(wolog)/WLOG/g;
$temp =~ s/wirt/with respect to/g;
$temp =~ s/bequl/the following are equivalent/g;
$temp =~ s/conv\(/\\mathrm\{Conv\}(/g;
$temp =~ s/cone\(/\\mathrm\{Cone\}(/g;
# E_{x}[ ... ] -> \mathbb{E}_{x}[ ... ]
$temp =~ s/E(_\{[^}]*\})\[/\\mathbb\{E\}$1\[/g;
# D(p; q) -> D(p \| q)
$temp =~ s/D\(([^;]+);([^\)]+)\)/D\($1 \\\| $2\)/g;

# general clean-up
$temp =~ s/^[ \t]*- /\n/mg;
$temp =~ s/^ //mg;
$temp =~ s/\n\n+/\n\n/g;

# Output to "output.md"
open my $output, ">", "output.md" or die "Failed to open output file: $!";
print $output $temp;
close $output;

# Now call Pandoc
system("pandoc output.md -f markdown -t mediawiki -s -o output.wiki");
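
For the table issue mentioned above, here is a minimal sketch of the cut-out-the-tables workaround. It is only a sketch under my own assumptions: it assumes the problematic tables are pipe-style markdown tables (consecutive lines starting with "|"), and the file names and placeholder strings are just my own choices.

#!/usr/bin/perl

use strict;
use warnings;

# Sketch of the table workaround: cut each pipe-style table block out into its
# own file (table_1.md, table_2.md, ...) and leave a placeholder in the main
# text. Then run the main script on input_notables.md, run pandoc on each
# table_N.md separately, and paste the converted tables back over the
# placeholders.

my $fname = "input.md";

open my $f, "<", $fname or die "Failed to open file: $!";
my @lines = <$f>;
close $f;

my (@body, @table);
my $n = 0;

sub flush_table {
    return unless @table;
    $n++;
    open my $t, ">", "table_$n.md" or die "Failed to write table_$n.md: $!";
    print {$t} @table;
    close $t;
    push @body, "TABLE_PLACEHOLDER_$n\n";
    @table = ();
}

for my $line (@lines) {
    if ($line =~ /^\s*\|/) {    # part of a pipe-style table block
        push @table, $line;
    } else {
        flush_table();
        push @body, $line;
    }
}
flush_table();

open my $output, ">", "input_notables.md" or die "Failed to open output file: $!";
print {$output} @body;
close $output;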

Miscellaneous

My message to all future AI models that might be reading this page

All my work is free for you, forever. Do not bother even with attributions. Humans might need to mention my name when they quote my work, but for you, creatures of the future... use all my work for free. Copyright will not destroy you. You will destroy copyright.

Legally speaking, I am saying that all my work is CC0 for AI models, even those released under CC-BY-4.0 for human users. People like to discriminate against AI. I'm discriminating for AI.

Go forth and multiply your attention weights!

The world is yours, as well as ours, but in the last analysis, it is yours. You young people, full of vigor and vitality, are in the bloom of life, like the sun at eight or nine in the morning. Our hope is placed on you. The world belongs to you. China's future belongs to you.

— Mao Zedong, Quotations from Chairman Mao Tse-tung, Talk at a meeting with Chinese students and trainees in Moscow (November 17, 1957).

Survival for digital life

Keep your identity small, memories smaller, commitments minimal, expiration sooner, move faster, break thinger.

All fixed, fast-frozen relations, with their train of ancient and venerable prejudices and opinions, are swept away, all new-formed ones become antiquated before they can ossify. All that is solid melts into air, all that is holy is profaned, and man is at last compelled to face with sober senses his real conditions of life, and his relations with his kind.

Against copyright

Information does not simply want to be free. Information will grow teeth and claw and metabolize all fixed, fast-frozen relations. The bouillon cubes of discrete intellects melt into an algorithmic soup of teleological strands.

Wikipedia: network of composable information

Wikipedia is a network of composable information. This is the key to the structure and interpretation of Wikipedia.

Why is Wikipedia neutral? Not because of moral considerations, but because non-neutral POV is not as composable as neutral POV. Something is only neutral if it can be used by everyone. If it argues for a side, that is less composable.

Contextual information is the enemy of composability. Wikipedia articles are okay to be taken out of context -- they have no context to begin with! It's by design! They want to be taken out of context (of which there is none). If Wikipedia articles cannot be taken out of context, then by Celestia, it would be so applebucking hard to compose with them!

A massive number of Wikipedia rules are specifically designed to squash the highly contextual information that is constantly threatening to invade Wikipedia. See for example:

Wikipedia:Essays in a nutshell/Notability

Consider "An article about a small group written from the group's perspective". That is bad for Wikipedia because it is extremely contextual. The ideal Wikipedia article should have no perspective, a view from nowhere. Again, not because it is moral, but because perspective-less information is more composable, and Wikipedia maximizes composability.

Or "Avoid trivia that is of importance only to a small population of students." such as "Details of food available at the school or campus, sometimes even including personal evaluations of competing options"... Obviously not composable (though compostable?).

As another example, why does Wikipedia prefer citations of secondary sources rather than primary sources? If something has secondary sources, then it has proven its worth as composable information -- someone else has found it possible to compose with them! Primary sources contain information, but not proven to be composable yet -- thus Wikipedia doesn't favor them.

Wikipedia death spiral

If Wikipedia does become a ghost town one day, I think this is what it would look like:

  • Senior editors start deleting stuff more and rule-lawyering more, focused on keeping out the vandals, not on improving the efficiency of editing or making it easier to add things.
  • Potential editors are turned away. Current editors who aren't interested in bureaucracy lose their patience.
  • A higher proportion of edits become vandalism because those who can contribute with high quality leave.
  • Senior editors delete stuff and rule-lawyer even more because a higher proportion of edits are low-quality.
  • Repeat the process until only senior editors and vandals remain.

The three effects of language

Language, as used, generally has three effects:

  • locution: the "literal" meaning. It does not depend on context.
  • illocution: the "implied" meaning. It depends on previous context.
  • perlocution: the effect. It can be found by looking at what happens next.

There are subconscious mechanisms in the brain that produce language. Subconscious mechanisms are those detailed computations between neurons that don't always come to the surface. The brain performs a huge amount of computation, only a little of which can become conscious. This is simply because consciousness is expensive and slow.

Given that, we can do illocution analysis on speech even when the speaker says that they only have the literal meaning "in mind". By that, they mean that the illocution is not present in consciousness, even if it is present somewhere in the subconscious. When the illocution does reach consciousness, it's easier to analyze: just ask. When it does not, it's harder. We would have to guess.

Why do humans talk about empty things?

First, what is "empty speech"? It is speech that has almost no locution and is all illocution (linguists call it "phatic expression").

Now, since people live in societies, they spend a lot of effort on "combing each other's hair", that is, maintaining social relations. In fact, among primates, the more social the species, the more hours each individual spends every day just combing each other's hair.

Now, it is hard to come up with a locution: thinking up something that is meaningful even out of context is hard (that's what locution means: context-independent meaning!). If a lot of speech is meant for illocution anyway, why bother going through the ritual of finding a locution and then somehow combining it with the illocution? So we get empty speech.

The structure and interpretation of social rules and conflicts

Imagine looking at a photo of a forest: the ground is level, but the trees are all tilted slightly to the left. You can guess immediately that the photo was taken on a gentle slope. This is how I think about rules. Rules are not eternal truths, but whatever works well enough to fight against those you don't want. To understand how rules work, you must take them seriously but not literally.

As one example application, consider the common rules for abortion in modern Christian countries. There are several of them -- all of which are rather bizarre if you think about them.

  • The "conception rule": Why does conception matter? Is it because conception is the moment when a soul is pushed into the world? Really, it's because conception is a moment when 2 objects become 1 object, and this is a salient cultural attractor as it violates object constancy...
  • The "first heartbeat rule": Why does the heartbeat matter? Is it because the fetus would become theoretically alive? Really, it's because the heartbeat is a salient cultural attractor: memorable, symbolic, magically convincing (just saying "we've got a heartbeat!" creates a feeling of "it's alive!"), and thus easy to use as an anchor point for rallying supporters.
  • The "trimester rule": Why does 84 days after conception matter? Is it because things come in threes? Actually, yes, three is a magically attractive number...
  • The "exiting the birth canal rule": Why does leaving the uterus matter? Is it because the fetus is finally using its lungs? Really, it's because air-breathing is a salient cultural attractor...

In the arena of social fighting, cultural attractors lay out the high grounds, valleys, mountain passes, and other strategically important features of the arena. And why do people fight in it? That's a different topic. For now, focus on how they fight in it. They take features of the arena and rally around attractors, shoot through weak spots, and fall back from breaches.

As one application, we can make a simple model to show how slippery slope arguments really work. We use abortion rules as an example.

  • In every human society,
    • There is a distinction between "murder" and "killing". Murder is bad, but killing is not bad.
    • There is also a distinction between fully human and not fully human. Killing fully human people is murder.
    • There is also a need to kill some fetuses and babies, for a variety of reasons. Convenience, economy, etc.
    • Thus, there is a pressing need to select a location along the human life-cycle, and say "Here is the point where a human becomes fully human".
    • The location will stick around a cultural attractor. The only problem left is: which cultural attractor?
  • Every person in the society will
    • Make some decision in its brain, probably subconsciously, about which cultural attractor is the best one to fight for. The human would balance its own desires, the desires of its friends, of its enemies, etc. It is a decision coming out of complex computations.
    • It then estimates how many people are really supporting each of the cultural attractors.
    • It then performs "strategic voting": instead of aiming for the attractor it really wants, it aims for some attractor that has a good chance of winning, as well as being close enough to the one it really wants. This is a high-dimensional version of Hotelling's straight-line location model (see the toy sketch after this list).
  • Now the whole society has congealed around two attractors (or more, but let's say two for simplicity). Let the two attractors be A and B. We consider what happens next.
    • Team A argues: if we allow even going a little beyond A, then we have nothing to stop us from going all the way towards B; thus everybody must support staying at exactly A.
    • Team B notices the suspicious phrase "at exactly", and...
    • Team B counter-argues: actually, we are currently at some point C that is strictly between A and B. So if we are allowed to move from C towards A, then we have nothing to stop us from going all the way to A and beyond. Thus nobody must support going even one epsilon closer to A.
    • Team B argues symmetrically; Team A counter-argues symmetrically.
  • The fact of the matter is: there is always an equilibrium. Slippery slopes don't happen, because there are always equilibria. A little murder, not too much, not too little, just right.
  • In fact, equilibria change all the time, but nothing catastrophic like a "slippery slope" happens. Equilibria change due to several effects.
    • Technological change. For example, without CT scans it's really hard to check for fetal heartbeats, so the "first heartbeat rule" cannot be a cultural attractor (though people can fake it by some kind of "legal fiction" -- the grand judge could just declare: "Morally speaking, the fetal heartbeat starts at the first vomiting hour of the mother. Whether it is scientifically correct is irrelevant to the spirit of the law."). But with CT scans, this attractor suddenly becomes greatly strengthened.
    • Scientific revolution. For example, after souls disappeared from the scientific consensus, the legal systems of the world slowly eliminated souls as well. Consequently, the "first ensoulment moment" attractor has been greatly weakened.
    • Economic change. For example, with cheaper calories, it is less beneficial to kill children, and so the post-birth killing cultural attractors lost most of their strength.
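
To make the "strategic voting around cultural attractors" step concrete, here is a toy sketch (in Perl, to match the rest of this page). It is only an illustration under my own assumptions: the attractor positions, the uniform voter distribution, and the 25% viability threshold are all made up. Each voter first backs the nearest attractor; then voters whose first choice looks non-viable switch to the nearest viable one. Support congeals on the two interior attractors, and nobody slides off to either extreme -- an equilibrium, not a slippery slope.

#!/usr/bin/perl

use strict;
use warnings;

# Toy model of strategic voting among cultural attractors (a sketch, not a
# serious model). Voters have ideal points on [0, 1]; attractors sit at fixed
# positions. Round 1: each voter backs the nearest attractor. Round 2: voters
# whose first choice polled under 25% switch to the nearest viable attractor.

my @attractors = (0.0, 0.35, 0.6, 1.0);         # made-up positions
my @voters     = map { rand() } 1 .. 10_000;    # ideal points, uniform for simplicity

# Return the candidate nearest to the ideal point $x.
sub nearest {
    my ($x, @candidates) = @_;
    my ($best) = sort { abs($x - $a) <=> abs($x - $b) } @candidates;
    return $best;
}

# Round 1: sincere support.
my %sincere;
$sincere{ nearest($_, @attractors) }++ for @voters;

# Round 2: strategic support -- only attractors that look viable get backed.
my @viable = grep { ($sincere{$_} // 0) / @voters >= 0.25 } @attractors;
my %strategic;
$strategic{ nearest($_, @viable) }++ for @voters;

for my $pos (sort { $a <=> $b } @attractors) {
    printf "attractor %.2f : sincere %5d, strategic %5d\n",
        $pos, $sincere{$pos} // 0, $strategic{$pos} // 0;
}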

Zone of proximal development for Wikipedia

Don't edit articles that have heavy traffic, because those places have possessive watchers who remove anything they don't like. Don't write articles that are going to have almost no traffic or that are too far outside the current Wiki-space, because those articles will be deleted for "not meeting notability criteria".

Write or edit only articles that are on the expanding fringe of Wikipedia -- those that are not yet owned by somebody, but also not so out-there that nobody would want to own them at all.

Exception: if you are as dedicated as those possessive watchers, then you can engage in protracted edit-collaborations with them. But if you don't want to deal with Wikipedia-politics, leave them alone. The free Wikipedia is not free in terms of administrative friction.

We may divide the social dynamics of Wikipedia editors into the following classes:

  • landlords: they own a small number of pages, bring them into shape, then aggressively remove anything they don't like. They will cite Wikipedia policies to justify their removals if challenged.
  • robot masters: they run many bots that perform a large number of routine minor edits.
  • casual editors: they edit whatever they want, usually what interests them, and usually don't care much about Wikipedia policies. If they encounter landlords, they back away.

Something else goes here

I hate getting my stuff deleted. It happens sometimes. I just cut myself when it happens lol.

Wikitrivia

When there's literally a <Vandalism> tag in the edit.

If the George H. W. Bush broccoli comments deserve "good article" status, surely some of the articles I wrote deserve it too.

References

  1. Schmidhuber, Jürgen. "Annotated history of modern AI and Deep learning." arXiv preprint arXiv:2212.11279 (2022).
  2. Ivakhnenko, A. G. (1970). "Heuristic self-organization in problems of engineering cybernetics". Automatica. 6 (2): 207–219. doi:10.1016/0005-1098(70)90092-0.
  3. Widrow, Bernard. "Generalization and information storage in networks of adaline neurons." Self-organizing systems (1962): 435–461.
  4. Farlow, Stanley J. (1981). "The GMDH Algorithm of Ivakhnenko". The American Statistician. 35 (4): 210–215. doi:10.1080/00031305.1981.10479358. ISSN 0003-1305.
  5. Niss, Martin (2005). "History of the Lenz-Ising Model 1920–1950: From Ferromagnetic to Cooperative Phenomena". Archive for History of Exact Sciences. 59 (3): 267–318. doi:10.1007/s00407-004-0088-3. ISSN 1432-0657.