
What does the processor do while waiting for a main memory fetch?

Problem Detail: 

Assuming the L1 and L2 cache requests both result in misses, does the processor stall until main memory has been accessed?

I have heard about the idea of switching to another thread; if so, what is used to wake up the stalled thread?

Asked By : 102948239408

Answered By : Wandering Logic

Memory latency is one of the fundamental problems studied in computer architecture research.

Speculative Execution

Speculative execution with out-of-order instruction issue is often able to find useful work to fill the latency of an L1 cache miss, but usually runs out of useful work after 10 or 20 cycles or so. There have been several attempts to increase the amount of work that can be done during a long-latency miss. One idea was to try value prediction (Lipasti, Wilkerson, and Shen, (ASPLOS-VII):138-147, 1996). This idea was very fashionable in academic architecture research circles for a while but seems not to work in practice. A last-gasp attempt to save value prediction from the dustbin of history was runahead execution (Mutlu, Stark, Wilkerson, and Patt, (HPCA-9):129, 2003). In runahead execution you recognize that your value predictions are going to be wrong, but speculatively execute anyway and then throw out all the work based on the prediction, on the theory that you'll at least start some prefetches for what would otherwise be L2 cache misses. It turns out that runahead wastes so much energy that it just isn't worth it.
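For intuition, here is a minimal Python sketch of the runahead idea, under the simplifying assumption that "poison" can be tracked per register; the instruction encoding and all names are hypothetical, not from any real design:

```python
# Toy model of runahead execution (names are illustrative). When a
# load misses, execution continues with a "poisoned" result purely to
# discover future load addresses worth prefetching; all of that work
# is then thrown away.

def runahead(instructions, missed_reg):
    """Walk the stream in runahead mode, collecting the address
    registers of loads whose addresses are still valid; those become
    prefetches. Every result is discarded afterwards."""
    poisoned = {missed_reg}           # values derived from the miss
    prefetches = []
    for op, dst, srcs in instructions:
        if op == "load":
            addr_reg = srcs[0]
            if addr_reg in poisoned:
                poisoned.add(dst)     # address unknown: can't prefetch
            else:
                prefetches.append(addr_reg)  # start this miss early
                # (simplification: assume the prefetched load's own
                # result is usable; a real design may poison it too)
        elif any(s in poisoned for s in srcs):
            poisoned.add(dst)         # poison propagates to consumers
    return prefetches                 # the only output that survives

# r1 is the register the stalled load would have written.
prog = [
    ("add",  "r2", ["r1", "r3"]),  # depends on the miss: poisoned
    ("load", "r4", ["r5"]),        # independent: prefetch via r5
    ("load", "r6", ["r2"]),        # address is poisoned: no prefetch
]
print(runahead(prog, "r1"))        # ['r5']
```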

A final approach in this vein, which may be getting some traction in industry, involves creating enormously long reorder buffers. Instructions are executed speculatively based on branch prediction, but no value prediction is done. Instead, all the instructions that depend on a long-latency load miss sit and wait in the reorder buffer. But since the reorder buffer is so large, you can keep fetching instructions, and if the branch predictor is doing a decent job you will sometimes be able to find useful work much later in the instruction stream. An influential research paper in this area was Continual Flow Pipelines (Srinivasan, Rajwar, Akkary, Gandhi, and Upton, (ASPLOS-XI):107-119, 2004). (Despite the fact that the authors are all from Intel, I believe the idea got more traction at AMD.)
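A minimal sketch of that idea, again with a hypothetical toy instruction encoding: miss-dependent instructions are parked in a waiting buffer while independent ones keep issuing.

```python
# Toy sketch of the long-reorder-buffer / continual-flow idea
# (illustrative names only): instructions that depend on a missed
# load are parked instead of blocking issue, while independent work
# keeps flowing through the pipeline.

def issue(instructions, missed_regs):
    """Split the stream into work that can execute now and the
    dependent 'slice' that must wait for the outstanding miss."""
    blocked = set(missed_regs)    # registers whose values are pending
    executed, deferred = [], []
    for op, dst, srcs in instructions:
        if any(s in blocked for s in srcs):
            blocked.add(dst)                  # dependence propagates
            deferred.append((op, dst, srcs))  # park in waiting buffer
        else:
            executed.append((op, dst, srcs))  # issue immediately
    return executed, deferred  # deferred replays when the data returns

prog = [
    ("add", "r2", ["r1", "r3"]),  # r1 is the missed load's result
    ("mul", "r4", ["r3", "r5"]),  # independent: executes now
    ("sub", "r6", ["r2", "r5"]),  # transitively dependent: waits
]
now, later = issue(prog, ["r1"])
print(now)    # [('mul', 'r4', ['r3', 'r5'])]
print(later)  # the parked slice, replayed after the fill
```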

Multi-threading

Using multiple threads for latency tolerance has a much longer history, with much greater success in industry. All the successful versions use hardware support for multithreading. The simplest (and most successful) version of this is what is often called FGMT (fine-grained multi-threading) or interleaved multi-threading. Each hardware core supports multiple thread contexts (a context is essentially the register state, including registers like the instruction pointer and any implicit flags registers). In a fine-grained multi-threading processor each thread is processed in-order. The processor keeps track of which threads are stalled on a long-latency load miss and which are ready for their next instruction, and on each cycle it uses a simple FIFO scheduling strategy to choose which ready thread to execute. An early example of this on a large scale was Burton Smith's HEP processor (Burton Smith went on to architect the Tera supercomputer, which was also a fine-grained multi-threading processor). But the idea goes much further back, into the 1960s, I think.
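To make the scheduling concrete, here is a minimal Python sketch of an FGMT core's thread selection, with hypothetical names throughout. The memory_fill callback is what "wakes up" a stalled thread, which answers the question asked above: the returning memory fill marks the context ready again.

```python
# Toy fine-grained multi-threading scheduler (illustrative only).
# Each cycle, pick the next ready thread in FIFO order; a thread that
# misses is marked stalled, and the memory fill that returns later is
# what marks it ready again.

from collections import deque

class FGMTCore:
    def __init__(self, num_contexts):
        self.ready = deque(range(num_contexts))  # runnable thread ids
        self.stalled = set()                     # waiting on memory

    def step(self):
        """Issue one instruction from the next ready thread, if any."""
        if not self.ready:
            return None                 # every context stalled: bubble
        tid = self.ready.popleft()
        missed = self.execute_one(tid)  # run one instruction of `tid`
        if missed:
            self.stalled.add(tid)       # park until the fill returns
        else:
            self.ready.append(tid)      # rotate to the back of the FIFO
        return tid

    def memory_fill(self, tid):
        """Called when thread tid's outstanding miss completes; this
        is what 'wakes up' the stalled thread."""
        if tid in self.stalled:
            self.stalled.remove(tid)
            self.ready.append(tid)      # runnable again

    def execute_one(self, tid):
        return False  # placeholder: return True on a long-latency miss

core = FGMTCore(num_contexts=4)
for _ in range(8):
    core.step()  # round-robins threads 0,1,2,3,0,1,2,3
```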

FGMT is particularly effective on streaming workloads. All modern GPUs (graphics processing units) are multicore, where each core is FGMT, and the concept is also widely used in other computing domains. Sun's T1 was also multicore FGMT, and so is Intel's Xeon Phi (the processor that is often still called "MIC" and used to be called "Larrabee").

The idea of Simultaneous Multithreading (Tullsen, Eggers, and Levy, (ISCA-22):392-403, 1995) combines hardware multi-threading with speculative execution. The processor has multiple thread contexts, but each thread is executed speculatively and out-of-order. A more sophisticated scheduler can then use various heuristics to fetch from the thread that is most likely to have useful work (Malik, Agarwal, Dhar, and Frank, (HPCA-14):50-61, 2008). A certain large semiconductor company started using the term hyperthreading for simultaneous multithreading, and that name seems to be the one most widely used these days.
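As a rough sketch of what such a fetch heuristic might look like (hypothetical code, in the spirit of the classic ICOUNT policy of fetching from the thread with the fewest instructions in flight):

```python
# Toy SMT fetch-policy sketch (illustrative only): each cycle, fetch
# from the thread with the fewest instructions already in flight, on
# the theory that it is making progress rather than piling up behind
# a stalled load.

def pick_fetch_thread(in_flight):
    """in_flight maps thread id -> instruction count currently in the
    pipeline/reorder buffer for that thread."""
    return min(in_flight, key=in_flight.get)

# Thread 0 is backed up behind a miss; thread 1 is flowing.
print(pick_fetch_thread({0: 42, 1: 7}))  # 1
```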

Low-level microarchitectural concerns

I realized after rereading your comments that you are also interested in the signalling that goes on between processor and memory. Modern caches usually allow multiple misses to be outstanding simultaneously. This is called a lockup-free cache (Kroft, (ISCA-8):81-87, 1981). (But the paper is hard to find online, and somewhat hard to read. Short answer: there's a lot of book-keeping, but you just deal with it. The hardware book-keeping structure is called an MSHR (miss information/status holding register), which is the name Kroft gave it in his 1981 paper.)
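A minimal sketch of that book-keeping, with hypothetical names: each outstanding cache line gets one entry, later misses to the same line merge into it rather than issuing duplicate memory requests, and the fill wakes every merged waiter at once.

```python
# Toy MSHR file (miss information/status holding registers),
# illustrative only. A primary miss allocates an entry and sends one
# request to memory; secondary misses to the same line merge into
# that entry; the fill frees the entry and wakes all waiters.

class MSHRFile:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}  # line address -> list of waiting requests

    def miss(self, line_addr, waiter):
        """Record a miss; return True if a new memory request is needed."""
        if line_addr in self.entries:
            self.entries[line_addr].append(waiter)  # secondary: merge
            return False
        if len(self.entries) >= self.num_entries:
            raise RuntimeError("MSHRs full: the miss must stall")
        self.entries[line_addr] = [waiter]           # primary: allocate
        return True

    def fill(self, line_addr):
        """Memory returned the line: free the entry, wake all waiters."""
        return self.entries.pop(line_addr, [])

mshrs = MSHRFile(num_entries=8)
mshrs.miss(0x40, "load r4")   # True: send one request to memory
mshrs.miss(0x40, "load r7")   # False: merged with outstanding miss
print(mshrs.fill(0x40))       # ['load r4', 'load r7']
```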

Best Answer from StackExchange

Question Source : http://cs.stackexchange.com/questions/29487
