
[Solved]: Back invalidation to maintain inclusion in inclusive cache

Problem Detail: 

For an L2 cache that is strictly inclusive of the L1 cache, if a block to be evicted is also present in L1, then back invalidation is required to maintain the inclusion property.

I am interested in some details of how back invalidation is handled.

  1. When an L2 block is evicted, is a back invalidation command always sent to the L1, or is it sent only if the block is present in the L1 cache?

  2. If the back invalidation is sent only when the block is in L1, how does the L2 cache keep track of whether a given L2 block is also present in L1?

  3. When an L1 block is invalidated by back invalidation, what happens if the L1 block is dirty? Is it written back to main memory, bypassing L2 (assume write-back caches and that L2 is the last-level cache)?

Asked By : enc

Answered By : Paul A. Clayton

There is a broad range of design choices for how back invalidation is handled. Whether the L2 knows if a cache block is in L1 (and in which L1, instruction or data) is one implementation choice with three basic options: no knowledge, approximate but conservative knowledge, and perfect knowledge.

With no knowledge, the L2 cache must always probe both L1 caches (assuming a coherent L1 instruction cache, as in x86 generally and in some other processors, e.g., recent POWER processors). With approximate knowledge, back invalidation commands are filtered so that they are sent to a given L1 only if the block is likely to be present there. With perfect knowledge, only the necessary back invalidations are sent to L1.
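
To make the three options concrete, here is a minimal Python sketch of how an L2 eviction might select which L1 caches to probe under each option. All names are illustrative, not taken from any real design; the filter and presence structures are assumed to be maintained elsewhere.

```python
from enum import Enum

class Knowledge(Enum):
    NONE = 0         # L2 keeps no record of L1 contents
    APPROXIMATE = 1  # conservative filter: may over-report, never under-report
    PERFECT = 2      # exact per-block presence information

def probe_targets(policy, block, l1_caches, filter_may_contain, presence):
    """Return the L1 caches that must receive a back invalidation probe
    when `block` is evicted from L2."""
    if policy is Knowledge.NONE:
        # No information: probe every L1 (instruction and data) unconditionally.
        return list(l1_caches)
    if policy is Knowledge.APPROXIMATE:
        # Probe only the L1s that *might* hold the block; a false positive
        # costs a wasted probe, while a false negative would break inclusion.
        return [c for c in l1_caches if filter_may_contain(c, block)]
    # Perfect knowledge: probe exactly the L1s recorded as holding the block.
    return [c for c in l1_caches if block in presence[c]]
```

For example, with presence = {"l1i": set(), "l1d": {0x40}}, evicting block 0x40 under perfect knowledge probes only the data cache.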

Obviously, the no knowledge option is the most straightforward to implement, since there is no choice but to send the invalidation request. However, this can waste request bandwidth between L2 and L1 and may interfere with the availability of the L1 tags for memory accesses from the core. Replicating the tags or adding a read port would solve this issue; banking the tags would reduce the frequency of conflicts; adding a filter on the L1 side would also reduce accesses to the complete tags.

Approximate knowledge can substantially reduce the number of probes needed, reduces the amount of information that must be stored with the L2 cache, and makes communication about L1 evictions less critical. The latter factor can be exploited by scheduling the communication of this information to minimize overhead, including interference with required communication. Since back invalidation probes are filtered, their interference with normal operation of the core is reduced, giving less reason to increase the number of tag reads supported.

An approximate filter could use per-L2-block information (where techniques similar to those used to compress coherence directory entries could be applied) or could use a separate filter (such as a Bloom filter).
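
As one illustration of the Bloom filter variant, the following sketch (sizes and hash functions are arbitrary choices for the example) can report a block as present when it is not, costing only a wasted probe, but never the reverse. Since a plain Bloom filter does not support deletion, a real design would also need to cope with stale bits, e.g., by periodically rebuilding the filter.

```python
class L1PresenceFilter:
    """Approximate, conservative record of which blocks may be in an L1."""
    def __init__(self, bits=1024):
        self.bits = bits
        self.array = [False] * bits

    def _hashes(self, block_addr):
        # Two cheap, illustrative hash functions over the block address.
        yield block_addr % self.bits
        yield (block_addr * 2654435761) % self.bits  # Knuth multiplicative hash

    def insert(self, block_addr):
        # Called whenever a block is filled into the L1.
        for h in self._hashes(block_addr):
            self.array[h] = True

    def may_contain(self, block_addr):
        # Conservative: false positives possible, false negatives are not.
        return all(self.array[h] for h in self._hashes(block_addr))
```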

Perfect knowledge (or a very close approximation) avoids unnecessary invalidation requests to L1, but it effectively requires the equivalent of a coherence directory entry for each L2 cache block, and every eviction from L1 must be quickly communicated to L2. A simple implementation might use one bit per L1 cache to track the presence of the block in that cache (e.g., tracking presence in a single core's L1 instruction and data caches would use two bits per L2 block).
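
A sketch of this directory-like tracking, with illustrative names: conceptually one presence bit per L1 cache per L2 block, set when the L2 supplies a block to an L1 and cleared when that L1 reports an eviction.

```python
class L2PresenceBits:
    """Exact presence tracking: for each L2 block, the set of L1 caches
    holding it (equivalent to one bit per L1 per L2 block)."""
    def __init__(self):
        self.holders = {}  # block address -> set of L1 names

    def on_l1_fill(self, l1, block):
        # L2 supplies the block to an L1, so its presence bit is set.
        self.holders.setdefault(block, set()).add(l1)

    def on_l1_evict(self, l1, block):
        # Every L1 eviction must be reported promptly to keep the bits exact.
        self.holders.get(block, set()).discard(l1)

    def on_l2_evict(self, block):
        # Back invalidate exactly the current holders; no wasted probes.
        return self.holders.pop(block, set())
```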

Another, not entirely implausible, option for an L2 much larger than the L1 caches would be to replicate the entire L1 cache tags, possibly in a compressed format that uses the L2 set and way (excluding the bits used for indexing the L1 cache) in place of the full L1 tags.
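
A sketch of such a replicated, compressed tag store is given below; the structure and the linear search are purely illustrative (hardware would search only the one L1 set selected by the evicted block's address). Because inclusion guarantees that every L1-resident block is also in L2, the L2 (set, way) pair uniquely identifies a block in fewer bits than a full L1 tag.

```python
class ShadowL1Tags:
    """Replicated L1 tags kept at the L2, storing the L2 (set, way) of each
    L1-resident block instead of its full tag."""
    def __init__(self, l1_sets, l1_ways):
        # shadow[set][way] holds the (l2_set, l2_way) of the resident block,
        # or None if that L1 entry is invalid.
        self.shadow = [[None] * l1_ways for _ in range(l1_sets)]

    def on_l1_fill(self, l1_set, l1_way, l2_set, l2_way):
        self.shadow[l1_set][l1_way] = (l2_set, l2_way)

    def on_l1_evict(self, l1_set, l1_way):
        self.shadow[l1_set][l1_way] = None

    def find_l1_copies(self, l2_set, l2_way):
        # Locate any L1 entries holding the evicted L2 block. (Illustrative
        # full scan; hardware would index by the L1 index bits instead.)
        return [(s, w)
                for s, row in enumerate(self.shadow)
                for w, entry in enumerate(row)
                if entry == (l2_set, l2_way)]
```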

This information can be kept current by communicating to L2 the address of every L1 eviction (or, in the case of L1 tags replicated at the L2, just the set and way). To include this information in the message requesting the data that replaces the evicted block, the victim would have to be selected quickly. While this is not a substantial problem, it treats an L1 victim as if it were immediately replaced (i.e., an access to the victim block before the L2 cache has supplied the new block would be treated as a miss even though the data is still present).

This miss could be made merely conceptual by changing the victim choice in the rare case that the victim is accessed during the small window between victim selection and the response from L2 filling the L1 entry, but in that case an additional message must be sent to inform L2 of the change of victim.
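
The message formats implied by this scheme might look like the following sketch (field names are hypothetical): the victim's identity rides along on the miss request, and a separate, rarely used message corrects the L2's information when the victim choice changes.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MissRequest:
    miss_addr: int              # block the L1 wants from L2
    victim_addr: Optional[int]  # block the L1 will evict to make room, if any

@dataclass
class VictimChange:
    old_victim: int             # sent only in the rare case that the original
    new_victim: int             # victim is accessed before the fill returns

def handle_l1_miss(miss_addr, choose_victim, send_to_l2):
    # Select the victim early so its address can be piggybacked on the
    # request; from this point L2 treats the victim as already evicted.
    victim = choose_victim()
    send_to_l2(MissRequest(miss_addr, victim))
    return victim
```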

When the back-invalidated block is dirty in L1, it would typically be written back to memory (or sent to the requesting cache, in the case of a coherence invalidation caused by another cache attempting a write to that address; whether it is then also sent to memory is another implementation choice). However, when the eviction is not coherence-related, it would be quite possible to change the victim choice: the L2 might retain the dirty block, possibly requiring another back invalidation if the new victim is (or might be) in L1. Obviously, this same strategy could be used if the L1 informed the L2 cache that the block is a poor replacement choice (e.g., being more recently used than most other blocks in its L1 set).
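
The dirty-block cases just described can be sketched as a small decision tree; the callbacks below are placeholders for machine-specific actions, not a real interface.

```python
from dataclasses import dataclass

@dataclass
class L1Block:
    addr: int
    valid: bool = True
    dirty: bool = False

def on_back_invalidate(block, coherence_related, allow_victim_change,
                       forward_to_requester, write_back, pick_new_l2_victim):
    """Sketch of handling a back invalidation that hits a dirty L1 block."""
    if not block.dirty:
        block.valid = False       # clean copy: simply drop it
    elif coherence_related:
        # Another cache is writing this address: forward the dirty data to
        # the requester (whether it also goes to memory is a separate choice).
        forward_to_requester(block)
        block.valid = False
    elif allow_victim_change:
        # Capacity eviction only: L2 may instead pick a different victim and
        # retain the dirty block, avoiding the writeback entirely.
        pick_new_l2_victim(block.addr)
    else:
        write_back(block)         # e.g., to memory, since L2 is last-level
        block.valid = False
```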

Note that it is also possible to communicate information about L1 accesses back to the L2 cache so that an LRU-oriented L2 replacement policy would less often choose a block that is still in L1. While doing this for every L1 access, or even every L1 access that influences replacement state (with LRU replacement, an access to the most recently used block does not change the state), would be excessively expensive, partial information could be sent when convenient. E.g., extra information could be included with a miss-handling request, or extra messages could be sent to L2 when the interface is otherwise idle (using a separate queue of replacement-policy updates from which entries can be dropped if the interface is busy when more information is added to a full queue).
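
A sketch of such a droppable queue of replacement hints (capacity and names are illustrative): hints are sent only on otherwise-idle interface cycles, and the oldest hint is silently discarded when a new one arrives at a full queue, since losing a hint affects only replacement quality, never correctness.

```python
from collections import deque

class ReplacementHintQueue:
    """Best-effort queue of L1 access hints for the L2 replacement policy."""
    def __init__(self, capacity=8):
        self.hints = deque(maxlen=capacity)  # oldest entry dropped when full

    def note_access(self, block_addr):
        # Called on (a subset of) L1 accesses that influence replacement.
        self.hints.append(block_addr)

    def drain(self, interface_idle, send_to_l2):
        # Called when the L1-to-L2 interface would otherwise be idle.
        while interface_idle() and self.hints:
            send_to_l2(self.hints.popleft())
```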

Since back invalidations are relatively rare, the cost of unnecessary invalidation requests (i.e., when the block is not actually in a particular L1 cache) might be lower than the cost of reducing or completely avoiding them. On the other hand, with an L2 cache potentially shared among several L1 caches (e.g., an L2 cache shared by two cores might be inclusive of four L1 caches), avoiding the need to broadcast the invalidation to all the included L1 caches has some attraction.

These implementation details do not appear to be documented in general, so the above is derived from reasoning about the tradeoffs.

Question Source : http://cs.stackexchange.com/questions/29136
