Age | Commit message (Collapse) | Author |
|
Without this tweak, a prefetcher will happily prefetch data that will
promptly be invalidated and overwritten by a WriteInvalidate.
|
|
The cache's MemSidePacketQueue schedules a sendEvent based upon
nextMSHRReadyTime() which is the time when the next MSHR is ready or whenever
a future prefetch is ready. However, a prefetch being ready does not guarentee
that it can obtain an MSHR. So, when all MSHRs are full,
the simulation ends up unnecessiciarly scheduling a sendEvent every picosecond
until an MSHR is finally freed and the prefetch can happen.
This patch fixes this by not signaling the prefetch ready time if the prefetch
could not be generated. The event is rescheduled as soon as a MSHR becomes
available.
|
|
Previously the code commented about an unhandled case where it might be
possible for a writeback to arrive after a prefetch was generated but
before it was sent to the memory system. I hit that case. Luckily
the prefetchSquash() logic already in the code handles dropping prefetch
request in certian circumstances.
|
|
Re-organizes the prefetcher class structure. Previously the
BasePrefetcher forced multiple assumptions on the prefetchers that
inherited from it. This patch makes the BasePrefetcher class truly
representative of base functionality. For example, the base class no
longer enforces FIFO order. Instead, prefetchers with FIFO requests
(like the existing stride and tagged prefetchers) now inherit from a
new QueuedPrefetcher base class.
Finally, the stride-based prefetcher now assumes a custimizable lookup table
(sets/ways) rather than the previous fully associative structure.
|
|
Adds a new parameter that reserves some number of MSHR entries for demand
accesses. This helps prevent prefetchers from taking all MSHRs, forcing demand
requests from the CPU to stall.
|
|
This patch takes a clean-slate approach to providing WriteInvalidate
(write streaming, full cache line writes without first reading)
support.
Unlike the prior attempt, which took an aggressive approach of directly
writing into the cache before handling the coherence actions, this
approach follows the existing cache flows as closely as possible.
|
|
Prepare for a different implementation following in the next patch
|
|
This patch attempts to make the rules for data allocation in the
packet explicit, understandable, and easy to verify. The constructor
that copies a packet is extended with an additional flag "alloc_data"
to enable the call site to explicitly say whether the newly created
packet is short-lived (a zero-time snoop), or has an unknown life-time
and therefore should allocate its own data (or copy a static pointer
in the case of static data).
The tricky case is the static data. In essence this is a
copy-avoidance scheme where the original source of the request (DMA,
CPU etc) does not ask the memory system to return data as part of the
packet, but instead provides a pointer, and then the memory system
carries this pointer around, and copies the appropriate data to the
location itself. Thus any derived packet actually never copies any
data. As the original source does not copy any data from the response
packet when arriving back at the source, we must maintain the copy of
the original pointer to not break the system. We might want to revisit
this one day and pay the price for a few extra memcpy invocations.
All in all this patch should make it easier to grok what is going on
in the memory system and how data is actually copied (or not).
|
|
This patch cleans up the use of hasData and checkFunctional in the
packet. The hasData function is unfortunately suggesting that it
checks if the packet has a valid data pointer, when it does in fact
only check if the specific packet type is specified to have a data
payload. The confusion led to a bug in checkFunctional. The latter
function is also tidied up to avoid name overloading.
|
|
This adds a basic level of sanity checking to the packet by ensuring
that a request is not modified once the packet is created. The only
issue that had to be worked around is the relaying of
software-prefetches in the cache. The specific situation is now solved
by first copying the request, and then creating a new packet
accordingly.
|
|
|
|
This patch cleans up the packet memory allocation confusion. The data
is always allocated at the requesting side, when a packet is created
(or copied), and there is never a need for any device to allocate any
space if it is merely responding to a paket. This behaviour is in line
with how SystemC and TLM works as well, thus increasing
interoperability, and matching established conventions.
The redundant calls to Packet::allocate are removed, and the checks in
the function are tightened up to make sure data is only ever allocated
once. There are still some oddities in the packet copy constructor
where we copy the data pointer if it is static (without ownership),
and allocate new space if the data is dynamic (with ownership). The
latter is being worked on further in a follow-on patch.
|
|
This patch takes a first step in tightening up how we use the data
pointer in write packets. A const getter is added for the pointer
itself (getConstPtr), and a number of member functions are also made
const accordingly. In a range of places throughout the memory system
the new member is used.
The patch also removes the unused isReadWrite function.
|
|
|
|
WriteInvalidate semantics depend on the unconditional writeback
or they won't complete. Also, there's no point in deferring snoops
on their MSHRs, as they don't get new data at the end of their life
cycle the way other transactions do.
Add comment in the cache about a minor inefficiency re: WriteInvalidate.
|
|
Since WriteInvalidate directly writes into the cache, it can
create tricky timing interleavings with reads and writes to the
same cache line that haven't yet completed. This patch ensures
that these requests, when completed, don't overwrite the newer
data from the WriteInvalidate.
|
|
This patch takes a step towards an ISA-agnostic memory
system by enabling the components to establish the page size after
instantiation. The swap operation in the memory is now also allowing
any granularity to avoid depending on the IntReg of the ISA.
|
|
This patch adds a number of asserts to the cache, checking basic
assumptions about packets being requests or responses.
|
|
Add some missing initialisation, and fix a handful benign resource
leaks (including some false positives).
|
|
This patch changes the name of the Bus classes to XBar to better
reflect the actual timing behaviour. The actual instances in the
config scripts are not renamed, and remain as e.g. iobus or membus.
As part of this renaming, the code has also been clean up slightly,
making use of range-based for loops and tidying up some comments. The
only changes outside the bus/crossbar code is due to the delay
variables in the packet.
--HG--
rename : src/mem/Bus.py => src/mem/XBar.py
rename : src/mem/coherent_bus.cc => src/mem/coherent_xbar.cc
rename : src/mem/coherent_bus.hh => src/mem/coherent_xbar.hh
rename : src/mem/noncoherent_bus.cc => src/mem/noncoherent_xbar.cc
rename : src/mem/noncoherent_bus.hh => src/mem/noncoherent_xbar.hh
rename : src/mem/bus.cc => src/mem/xbar.cc
rename : src/mem/bus.hh => src/mem/xbar.hh
|
|
There are two primary issues with this code which make it deserving of deletion.
1) GHB is a way to structure a prefetcher, not a definitive type of prefetcher
2) This prefetcher isn't even structured like a GHB prefetcher.
It's basically a worse version of the stride prefetcher.
It primarily serves to confuse new gem5 users and most functionality is already
present in the stride prefetcher.
|
|
|
|
A small fix to ensure the return value is not ignored.
|
|
Static analysis unearther a bunch of uninitialised variables and
members, and this patch addresses the problem. In all cases these
omissions seem benign in the end, but at least fixing them means less
false positives next time round.
|
|
Support full-block writes directly rather than requiring RMW:
* a cache line is allocated in the cache upon receipt of a
WriteInvalidateReq, not the WriteInvalidateResp.
* only top-level caches allocate the line; the others just pass
the request along and invalidate as necessary.
* to close a timing window between the *Req and the *Resp, a new
metadata bit tracks whether another cache has read a copy of
the new line before the writeback to memory.
|
|
This patch fixes a bug in the cache port where the retry flag was
reset too early, allowing new requests to arrive before the retry was
actually sent, but with the event already scheduled. This caused a
deadlock in the interactions with the O3 LSQ.
The patche fixes the underlying issue by shifting the resetting of the
flag to be done by the event that also calls sendRetry(). The patch
also tidies up the flow control in recvTimingReq and ensures that we
also check if we already have a retry outstanding.
|
|
Previously, they were treated so much like loads that they could stall
at the head of the ROB. Now they are always treated like L1 hits.
If they actually miss, a new request is created at the L1 and tracked
from the MSHRs there if necessary (i.e. if it didn't coalesce with
an existing outstanding load).
|
|
If a set of LL/SC requests contend on the same cache block we
can get into a situation where CPUs will deadlock if they expect
a failed SC to supply them data. This case happens where 3 or
more cores are contending for a cache block using LL/SC and the system
is configured where 2 cores are connected to a local bus and the
third is connected to a remote bus. If a core on the local bus
sends an SCUpgrade and the core on the remote bus sends and SCUpgrade
they will race to see who will win the SC access. In the meantime
if the other core appends a read to one of the SCUpgrades it will expect
to be supplied data by that SCUpgrade transaction. If it happens that
the SCUpgrade that was picked to supply the data is failed, it will
drop the appended request for data and never respond, leaving the requesting
core to deadlock. This patch makes all SC's behave as normal stores to
prevent this case but still makes sure to check whether it can perform
the update.
|
|
This patch prunes unused values, and also unifies how the values are
defined (not using an enum for ALPHA), aligning the use of int vs Addr
etc.
The patch also removes the duplication of PageBytes/PageShift and
VMPageSize/LogVMPageSize. For all ISAs the two pairs had identical
values and the latter has been removed.
|
|
When a cacheline is written back to a lower-level cache,
tags->insertBlock() sets various status parameters. However these
status bits were cleared immediately after calling. This patch makes
it so that these status fields are not cleared by moving them outside
of the tags->insertBlock() call.
|
|
this patch implements a new tags class that uses a random replacement policy.
these tags prefer to evict invalid blocks first, if none are available a
replacement candidate is chosen at random.
this patch factors out the common code in the LRU class and creates a new
abstract class: the BaseSetAssoc class. any set associative tag class must
implement the functionality related to the actual replacement policy in the
following methods:
accessBlock()
findVictim()
insertBlock()
invalidate()
|
|
This patch squashes prefetch requests from downstream caches,
so that they do not steal cachelines away from caches closer
to the cpu. It was originally coded by Mitch Hayenga and
modified by Aasheesh Kolli.
|
|
This never actually worked since it was printing out only a word
of the cache block and not the entire thing and doubly didn't work
csprintf overrides the %#x specifier and assumes a char* array is
actually a string.
|
|
This patch fixes an assert condition that is not true at all
times. There are valid situations that arise in dual-core
dual-workload runs where the assert condition is false. The function
call following the assert however needs to be called only when the
condition is true (a block cannot be invalidated in the tags structure
if has not been allocated in the structure, and the tempBlock is never
allocated). Hence the 'assert' has been replaced with an 'if'.
|
|
This patch adds a filter to the cache to drop snoop requests that are
not for a range covered by the cache. This fixes an issue observed
when multiple caches are placed in parallel, covering different
address ranges. Without this patch, all the caches will forward the
snoop upwards, when only one should do so.
|
|
Forces the prefetcher to mispredict twice in a row before resetting the
confidence of prefetching. This helps cases where a load PC strides by a
constant factor, however it may operate on different arrays at times.
Avoids the cost of retraining. Primarily helps with small iteration loops.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
For systems with a tightly coupled L2, a stride-based prefetcher may observe
access requests from both instruction and data L1 caches. However, the PC
address of an instruction miss gives no relevant training information to the
stride based prefetcher(there is no stride to train). In theses cases, its
better if the L2 stride prefetcher simply reverted back to a simple N-block
ahead prefetcher. This patch enables this option.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch extends the classic prefetcher to work on non-block aligned
addresses. Because the existing prefetchers in gem5 mask off the lower
address bits of cache accesses, many predictable strides fail to be
detected. For example, if a load were to stride by 48 bytes, with 64 byte
cachelines, the current stride based prefetcher would see an access pattern
of 0, 64, 64, 128, 192.... Thus not detecting a constant stride pattern. This
patch fixes this, by training the prefetcher on access and not masking off the
lower address bits.
It also adds the following configuration options:
1) Training/prefetching only on cache misses,
2) Training/prefetching only on data acceses,
3) Optionally tagging prefetches with a PC address.
#3 allows prefetchers to train off of prefetch requests in systems with
multiple cache levels and PC-based prefetchers present at multiple levels.
It also effectively allows a pipelining of prefetch requests (like in POWER4)
across multiple levels of cache hierarchy.
Improves performance on my gem5 configuration by 4.3% for SPECINT and 4.7% for SPECFP (geomean).
|
|
The patch
(1) removes the redundant writeback argument from findVictim()
(2) fixes the description of access() function
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch adds the basic building blocks required to support e.g. ARM
TrustZone by discerning secure and non-secure memory accesses.
|
|
Adds very basic statistics on the number of tag and data accesses within the
cache, which is important for power modelling. For the tags, simply count
the associativity of the cache each time. For the data, this depends on
whether tags and data are accessed sequentially, which is given by a new
parameter. In the parallel case, all data blocks are accessed each time, but
with sequential accesses, a single data block is accessed only on a hit.
|
|
This patch enables tracking of cache occupancy per thread along with
ages (in buckets) per cache blocks. Cache occupancy stats are
recalculated on each stat dump.
|
|
Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
|
|
|
|
This patch makes it possible to once again build gem5 without any
ISA. The main purpose is to enable work around the interconnect and
memory system without having to build any CPU models or device models.
The regress script is updated to include the NULL ISA target. Currently
no regressions make use of it, but all the testers could (and perhaps
should) transition to it.
--HG--
rename : build_opts/NOISA => build_opts/NULL
rename : src/arch/noisa/SConsopts => src/arch/null/SConsopts
rename : src/arch/noisa/cpu_dummy.hh => src/arch/null/cpu_dummy.hh
rename : src/cpu/intr_control.cc => src/cpu/intr_control_noisa.cc
|
|
This patch removes the notion of a peer block size and instead sets
the cache line size on the system level.
Previously the size was set per cache, and communicated through the
interconnect. There were plenty checks to ensure that everyone had the
same size specified, and these checks are now removed. Another benefit
that is not yet harnessed is that the cache line size is now known at
construction time, rather than after the port binding. Hence, the
block size can be locally stored and does not have to be queried every
time it is used.
A follow-on patch updates the configuration scripts accordingly.
|
|
Make valgrind a little bit happier
|
|
This patch reorganizes the cache tags to allow more flexibility to
implement new replacement policies. The base tags class is now a
clocked object so that derived classes can use a clock if they need
one. Also having deriving from SimObject allows specialized Tag
classes to be swapped in/out in .py files.
The cache set is now templatized to allow it to contain customized
cache blocks with additional informaiton. This involved moving code to
the .hh file and removing cacheset.cc.
The statistics belonging to the cache tags are now including ".tags"
in their name. Hence, the stats need an update to reflect the change
in naming.
|
|
This patch removes the redundant cache builder class.
|
|
This patch changes the cache timing calculations such that the results
are aligned to clock edges.
Plenty stats change as a results of this patch.
|