User:CTho/Electromigration

From my post here

Electromigration is an effect dominated by temperature and by the current density in a wire. It's caused by electrons bumping into metal atoms and moving them around. Higher temperature makes it easier to move the metal atoms. Higher current density worsens electromigration because it means more electrons are flowing through a given cross-section of wire.

If you increase the voltage, you'll be increasing the current flowing through the wires. Since there's more current, and there's not more metal in the chip, the current density goes up, which worsens electromigration.
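To put rough numbers on the two paragraphs above, here is a small sketch using Black's equation, the commonly cited empirical model for electromigration lifetime (MTTF proportional to J^-n * exp(Ea / kT)). It is not something from the original post. The exponent `n` and activation energy `ea_ev` below are illustrative assumptions, not measured values for any real process, and the function name is made up for this example.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def relative_lifetime(j_ratio, temp_c, ref_temp_c, n=2.0, ea_ev=0.8):
    """Electromigration lifetime relative to a reference operating point.

    Uses Black's equation, MTTF ~ J**-n * exp(Ea / (k*T)).
    j_ratio is (new current density) / (reference current density).
    n and ea_ev are assumed, illustrative values only.
    """
    t = temp_c + 273.15          # convert to kelvin
    t_ref = ref_temp_c + 273.15
    thermal = math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t - 1.0 / t_ref))
    return j_ratio ** (-n) * thermal


# Same temperature, 20% more current density (e.g. from a voltage bump):
print(relative_lifetime(1.2, 60.0, 60.0))
# Same current density, chip running 20 C hotter:
print(relative_lifetime(1.0, 80.0, 60.0))
```

Both results come out below 1.0, meaning a shorter expected lifetime than at the reference point. How much shorter depends entirely on the assumed `n` and `ea_ev`.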

Now, a little background: a given logic gate or transistor in a circuit generally only flips once per clock cycle. As it flips, a current flows, but once it's done flipping, there's very very little current (the transistors leak - they don't shut off completely - but for the purposes of this explanation, leakage is negligible). Each flip involves moving a little bit of charge, which involves a small current flowing for a brief time (say, 1 milliamp for 100 picoseconds).

A wire in a chip running at 1 GHz might see current flowing through it for, on average, 1 second out of every 10 seconds (100 picoseconds every 1000 picoseconds, over 10 seconds). If you overclocked that chip to 2 GHz, the wire would now see current flowing through it for 2 seconds out of every 10 seconds (100 picoseconds every 500 picoseconds, over 10 seconds). This will roughly double the rate of electromigration (assuming the temperature doesn't go up).
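The arithmetic above can be written out as a quick sketch. It uses the example figures from the text (1 milliamp for 100 picoseconds per flip, one flip per clock cycle); the function names are just made up for illustration.

```python
PULSE_CURRENT_A = 1e-3      # 1 milliamp while the gate is flipping
PULSE_LENGTH_S = 100e-12    # 100 picoseconds per flip


def duty_cycle(clock_hz, pulse_length_s=PULSE_LENGTH_S):
    """Fraction of the time current is flowing, assuming one flip per cycle."""
    return pulse_length_s * clock_hz


def average_current(clock_hz):
    """Time-averaged current in amps for a wire that conducts on every cycle."""
    return PULSE_CURRENT_A * duty_cycle(clock_hz)


for clock_hz in (1e9, 2e9):
    print(f"{clock_hz / 1e9:.0f} GHz: current flows {duty_cycle(clock_hz):.0%} of the time, "
          f"average {average_current(clock_hz) * 1e3:.2f} mA")
```

Doubling the clock doubles the fraction of time that current is flowing, which is where the "roughly double" figure comes from.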

In practice, most of the wires on the chip flip one way, then the other, then back, and so on, so if an atom gets bumped in one direction, another atom might get bumped back in the other direction when the signal flips the other way. This largely cancels out the electromigration, because the average current is close to zero. Some of the metal wires, however, conduct current predominantly in one direction (for example, wires supplying power to a circuit, or short stubs of wires within individual gates) (a picture might help here, but I can't find a decent drawing program for Linux). A wire supplying a ground connection (low voltage) to a transistor will only see current flowing when the signal switches to 0, and the current always flows the same way. These wires are the most vulnerable to electromigration, and the electromigration they experience increases linearly with clock frequency (as explained above).
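Here is a toy comparison of the two kinds of wire described above. It is only a sketch: the signal wire is assumed to carry equal and opposite charge on alternating flips, the ground wire is assumed to carry charge only on the flips to 0, and the charge per flip reuses the 1 milliamp for 100 picoseconds example from earlier.

```python
CHARGE_PER_FLIP_C = 1e-3 * 100e-12  # 1 mA for 100 ps, in coulombs


def net_charge(pulses):
    """Net charge moved through a wire; the sign encodes the direction."""
    return sum(pulses)


def signal_wire_pulses(flips):
    """Signal wire: charge goes one way on even flips, the other way on odd flips."""
    return [CHARGE_PER_FLIP_C if i % 2 == 0 else -CHARGE_PER_FLIP_C
            for i in range(flips)]


def ground_wire_pulses(flips):
    """Ground wire: conducts only on flips to 0 (odd flips), always the same way."""
    return [CHARGE_PER_FLIP_C if i % 2 == 1 else 0.0 for i in range(flips)]


flips = 1000
print("signal wire net charge:", net_charge(signal_wire_pulses(flips)))
print("ground wire net charge:", net_charge(ground_wire_pulses(flips)))
```

The signal wire's pulses cancel in pairs, while the ground wire's net charge keeps growing with the number of flips, so more flips per second means more one-way current.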

When a processor is pushed too far and it starts to produce errors in programs such as Prime95, would this be considered a precursor to electromigration? As in a slight decay and if it were allowed to continue the margins would decrease?

Processors produce errors when overclocked for a few reasons, but by and large, the errors you'll see are caused by logic paths not evaluating completely in the time allotted. No damage occurs from a path not evaluating completely*. If you want to imagine what's occurring, just consider a line of people passing numbers along. It takes a certain amount of time from when the first person hands off a number to when the last person receives it. If you're trying to synchronize a lot of things, you might require that they pass numbers down the line within 60 seconds (let's say they actually take 45 seconds to do it). Every 60 seconds, the first person passes on a number, and every 60 seconds, the last person shouts out the number he's currently holding. You could overclock the system by increasing how often the first person launches a new number and how often the last person shouts the number he's holding. At some point (say, 40-second intervals), the number won't make it all the way down the line before the last guy shouts his number...and he'll shout out an old number rather than the most recent one. The people don't try to pass numbers faster when you overclock them - they're passing just as fast whether you give them 1 second or an hour. The only difference is that you'll get a wrong answer if you try to go too fast.

Logic gates work the same way. In a given circuit, most of the gates don't "see" the clock frequency - they just switch at their own speed. If you give the circuit enough time to finish, you get the right answer. If you run it too fast, you get the wrong answer. (I'm sure a picture would have been better than this contrived example... hopefully it conveys the point.)

The point is, there isn't a sudden change in damage as you go from the speed where things are working great to the speed where they're not working (it's not like an engine, where as you increase the RPM you can hit a point where you're doing large amounts of damage pretty suddenly).

* Well, I can think of contrived situations where you could increase wear & tear, but let's ignore them.
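The line-of-people analogy can be sketched in a few lines of code. This is a toy model, not how a real timing analysis works: number `j` is assumed to be launched at time `j * clock_period` and to reach the last person `path_delay` seconds later no matter how fast the clock is, and the function names are invented for this example.

```python
def shouted_number(tick, path_delay, clock_period):
    """Newest number the last person holds at a given tick, or None if none has arrived."""
    # Number j arrives at j * clock_period + path_delay, so the newest arrival
    # at time tick * clock_period is found by floor division.
    newest = int((tick * clock_period - path_delay) // clock_period)
    return newest if newest >= 0 else None


def run_line(path_delay, clock_period, ticks=5):
    """Return (expected, shouted, correct) for each tick after the first launch."""
    results = []
    for tick in range(1, ticks + 1):
        expected = tick - 1  # the number launched one interval earlier
        shouted = shouted_number(tick, path_delay, clock_period)
        results.append((expected, shouted, shouted == expected))
    return results


print("60 s intervals:", run_line(path_delay=45, clock_period=60))
print("40 s intervals:", run_line(path_delay=45, clock_period=40))
```

At 60-second intervals every shouted number matches the expected one. At 40-second intervals the last person is always one number behind (or has nothing yet), even though nobody in the line changed how fast they pass numbers.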

When a processor is pushed too far and it starts to produce errors in programs such as Prime95, would this be considered a precursor to electromigration? As in a slight decay and if it were allowed to continue the margins would decrease? Meaning the processor would no longer work reliably at the same speed, etc.?

1) Electromigration can cause a short circuit. When an atom gets knocked out of place, it has to end up somewhere. For various reasons, in electromigration-prone spots, a lot of atoms will end up at the same spot. They eventually bloat the wire at that spot to the point where it creates a short circuit (it touches another wire). Until the moment of failure, there won't really be any warning. This image shows atoms piling up in specific locations, and you can see that the circuit is getting pretty close to having a short.

2) It can cause a wire to break. If many atoms get knocked out of the same area, eventually there won't be any left, and the wire won't conduct any more. This picture is a fantastic example. Theoretically, this process could slow a chip down before it fails...but there's a catch. Remember, the rate of electromigration depends on the current density in a wire. As atoms move away from a spot, the wire there gets narrower. About the same amount of current flows, though, so the current density goes up. This speeds up electromigration, and the wire thins more...which raises the current density...which thins the wire even faster...and the wire will break pretty quickly from that point. So, I'm going to claim, "Electromigration is not going to cause a chip to slow down" because I would expect a chip in that state to end up dying completely soon after the slowdown would start. I don't feel like working out the math to tell for sure, but some members here play with enough silicon that they might know off the top of their heads.
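For what it's worth, here is a toy model of the runaway thinning described in 2). It is a sketch under loud assumptions, not a real reliability calculation: the current through the wire is held constant, current density is taken as proportional to 1/width, and the width is assumed to shrink at a rate proportional to (current density)**n, with `n` picked arbitrarily. Units are normalized so the starting width and rate constant are both 1.

```python
def life_fraction_at_width(width_fraction, n=1.0, dt=1e-4):
    """Fraction of the wire's total life used up when it has thinned to width_fraction.

    Toy model (assumption): d(width)/dt = -(1 / width)**n, integrated with
    simple Euler steps until the width reaches zero (the wire breaks).
    width_fraction must be between 0 and 1.
    """
    width = 1.0
    t = 0.0
    t_reached = None
    while width > 0.0:
        if t_reached is None and width <= width_fraction:
            t_reached = t                 # first time the wire is this thin
        width -= dt * width ** (-n)       # thinner wire -> higher density -> faster loss
        t += dt
    return t_reached / t


# How much of its life has the wire used by the time it is half its original width?
print(life_fraction_at_width(0.5, n=1.0))
print(life_fraction_at_width(0.5, n=2.0))
```

In this toy model the wire has already used about three quarters of its life (for n = 1) or about seven eighths (for n = 2) by the time it has lost half its width, which fits the guess above that failure follows fairly soon after any noticeable thinning. Real wires, with redundant vias and changing temperature, will not follow this exactly.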

There are, however, other effects that would cause a stressed chip to slow down. A particularly nasty one right now is called "NBTI" for Negative Bias Temperature Instability. It's an effect whose exact physical mechanism isn't entirely understood, but the results (and the kinds of things that cause it to happen) are understood pretty well: as transistors age, they get harder to turn on, and when they're on, they don't conduct current as well. NBTI is strongly affected by both voltage and temperature, and causes a pretty significant slowdown nowadays. In bad operating conditions, you can cause a chip to slow down significantly in a matter of hours. I have numbers, but unfortunately can't share them. Manufacturers have to add margin nowadays to account for this - they have to make sure that the 3 GHz chip you buy today will still work 5 years from now, so the chip they sell at 3 GHz would have passed their tests at a higher speed and would overclock well when new. I would not be surprised if a Core 2 Duo (C2D) operated in normal conditions for 5 years overclocks poorly in the 5th year, or if an aggressively-overclocked C2D had to be run slower after a few years.

I don't know if electromigration is a big killer nowadays. Modern transistors are operated pretty close to what I would call their "breaking points", whereas it's relatively easy to design with electromigration in mind and minimize it. I would expect other effects to be more dominant (TDDB, time-dependent dielectric breakdown, for example). Maybe someone who does reliability analysis can share with us what effects are mainly responsible for chip death in the short term.