Age | Commit message (Collapse) | Author |
|
|
|
The value that is not initialized has a bogus value that manifests when
using some debug-flags what makes the usage of tracediff a bit more
challenging.
In addition, while debugging with other techniques, it introduces the
problem of understanding if the value of a field is 'intended' or just
an effect of the lack of initialisation.
Change-Id: Ied88caa77479c6f1d5166d80d1a1a057503cb106
Signed-off-by: Giacomo Gabrielli <giacomo.gabrielli@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/c/13125
Maintainer: Nikos Nikoleris <nikos.nikoleris@arm.com>
Reviewed-by: Jason Lowe-Power <jason@lowepower.com>
|
|
Summary: Usage of const DynInstPtr& when possible and introduction of
move operators to RefCountingPtr.
In many places, scoped references to dynamic instructions do a copy of
the DynInstPtr when a reference would do. This is detrimental to
performance. On top of that, in case there is a need for reference
tracking for debugging, the redundant copies make the process much more
painful than it already is.
Also, from the theoretical point of view, a function/method that
defines a convenience name to access an instruction should not be
considered an owner of the data, i.e., doing a copy and not a reference
is not justified.
On a related topic, C++11 introduces move semantics, and those are
useful when, for example, there is a class modelling a HW structure that
contains a list, and has a getHeadOfList function, to prevent doing a
copy to an internal variable -> update pointer, remove from the list ->
update pointer, return value making a copy to the assined variable ->
update pointer, destroy the returned value -> update pointer.
Change-Id: I3bb46c20ef23b6873b469fd22befb251ac44d2f6
Signed-off-by: Giacomo Gabrielli <giacomo.gabrielli@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/c/13105
Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com>
Reviewed-by: Jason Lowe-Power <jason@lowepower.com>
Maintainer: Andreas Sandberg <andreas.sandberg@arm.com>
Maintainer: Jason Lowe-Power <jason@lowepower.com>
|
|
There are cases where the IEW adds a non-speculative instruction to
the IQ twice. This can happen if an instruction is flagged as
IsMemBarrier and IsNonSpeculative. Avoid adding non-speculative
instructions in the IEW to the IQ by checking if it has been added
already.
Change-Id: Ifcff676a451b57b2406ce00ed8dae19ed399515f
Signed-off-by: Andreas Sandberg <andreas.sandberg@arm.com>
Reviewed-by: Javier Setoain <javier.setoain@arm.com>
Reviewed-by: Giacomo Gabrielli <giacomo.gabrielli@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/8374
Reviewed-by: Jason Lowe-Power <jason@lowepower.com>
|
|
With the hierarchical RegId there are a lot of functions that are
redundant now.
The idea behind the simplification is that instead of having the regId,
telling which kind of register read/write/rename/lookup/etc. and then
the function panic_if'ing if the regId is not of the appropriate type,
we provide an interface that decides what kind of register to read
depending on the register type of the given regId.
Change-Id: I7d52e9e21fc01205ae365d86921a4ceb67a57178
Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com>
[ Fix RISCV build issues ]
Signed-off-by: Andreas Sandberg <andreas.sandberg@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/2702
|
|
Mimic the changes done on the architectural register indexes on the
physical register indexes. This is specific to the O3 model. The
structure, called PhysRegId, contains a register class, a register
index and a flat register index. The flat register index is kept
because it is useful in some cases where the type of register is not
important (dependency graph and scoreboard for example). Instead
of directly using the structure, most of the code is working with
a const PhysRegId* (typedef to PhysRegIdPtr). The actual PhysRegId
objects are stored in the regFile.
Change-Id: Ic879a3cc608aa2f34e2168280faac1846de77667
Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com>
Reviewed-on: https://gem5-review.googlesource.com/2701
Reviewed-by: Anthony Gutierrez <anthony.gutierrez@amd.com>
Maintainer: Andreas Sandberg <andreas.sandberg@arm.com>
|
|
The drain did not wait until stages were ready again. Therefore, as a
result of messages in the TimeBuffer being drain, the state after the
drain was not consistent and asserts fired in some places when the
draining happened after a stage got blocked, but before the notification
arrived to the previous stages.
Change-Id: Ib50b3b40b7f745b62c1eba2931dec76860824c71
Reviewed-by: Andreas Sandberg <andreas.sandberg@arm.com>
|
|
This patch adds probe points in Fetch, IEW, Rename and Commit stages as follows.
A probe point is added in the Fetch stage for probing when a fetch request is
sent. Notify is fired on the probe point when a request is sent succesfully in
the first attempt as well as on a retry attempt.
Probe points are added in the IEW stage when an instruction begins to execute
and when execution is complete. This points can be used for monitoring the
execution time of an instruction.
Probe points are added in the Rename stage to probe renaming of source and
destination registers and when there is squashing. These probe points can be
used to track register dependencies and remove when there is squashing.
A probe point for squashing is added in Commit to probe squashed instructions.
|
|
|
|
This patch enables instructions in LSQ to track two physical addresses for
corresponding two split requests. Later, the information is used in
checksnoop() to search for/invalidate the corresponding LD instructions.
The current implementation has kept track of only the physical address that is
referenced by the first split request. Thus, for checksnoop(), the line
accessed by the second request has not been considered, causing potential
correctness issues.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
The Request::UNCACHEABLE flag currently has two different
functions. The first, and obvious, function is to prevent the memory
system from caching data in the request. The second function is to
prevent reordering and speculation in CPU models.
This changeset gives the order/speculation requirement a separate flag
(Request::STRICT_ORDER). This flag prevents CPU models from doing the
following optimizations:
* Speculation: CPU models are not allowed to issue speculative
loads.
* Write combining: CPU models and caches are not allowed to merge
writes to the same cache line.
Note: The memory system may still reorder accesses unless the
UNCACHEABLE flag is set. It is therefore expected that the
STRICT_ORDER flag is combined with the UNCACHEABLE flag to prevent
this behavior.
|
|
|
|
IEW did not check the instQueue and memDepUnit to ensure
they were drained. This caused issues when drainSanityCheck()
did check those structures after asserting IEW was drained.
|
|
This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads which had conflicting stores),
when replayed would not respect the number of functional units (only respected
issue width). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
|
|
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessicarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider if it is stalled by the adjacent, downstream stage.
Forcing stalls due to later stages creates and results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). Only should have stalled at rename.
|
|
As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
|
|
Dispatch should not check LSQ size/LSQ stall for non load/store
instructions.
This work was done while Binh was an intern at AMD Research.
|
|
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
|
|
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
|
|
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
|
|
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
The probe patch is motivated by the desire to move analytical and trace code
away from functional code. This is achieved by the probe interface which is
essentially a glorified observer model.
What this means to users:
* add a probe point and a "notify" call at the source of an "event"
* add an isolated module, that is being used to carry out *your* analysis (e.g. generate a trace)
* register that module as a probe listener
Note: an example is given for reference in src/cpu/o3/simple_trace.[hh|cc] and src/cpu/SimpleTrace.py
What is happening under the hood:
* every SimObject maintains has a ProbeManager.
* during initialization (src/python/m5/simulate.py) first regProbePoints and
the regProbeListeners is called on each SimObject. this hooks up the probe
point notify calls with the listeners.
FAQs:
Why did you develop probe points:
* to remove trace, stats gathering, analytical code out of the functional code.
* the belief that probes could be generically useful.
What is a probe point:
* a probe point is used to notify upon a given event (e.g. cpu commits an instruction)
What is a probe listener:
* a class that handles whatever the user wishes to do when they are notified
about an event.
What can be passed on notify:
* probe points are templates, and so the user can generate probes that pass any
type of argument (by const reference) to a listener.
What relationships can be generated (1:1, 1:N, N:M etc):
* there isn't a restriction. You can hook probe points and listeners up in a
1:1, 1:N, N:M relationship. They become useful when a number of modules
listen to the same probe points. The idea being that you can add a small
number of probes into the source code and develop a larger number of useful
analysis modules that use information passed by the probes.
Can you give examples:
* adding a probe point to the cpu's commit method allows you to build a trace
module (outputting assembler), you could re-use this to gather instruction
distribution (arithmetic, load/store, conditional, control flow) stats.
Why is the probe interface currently restricted to passing a const reference:
* the desire, initially at least, is to allow an interface to observe
functionality, but not to change functionality.
* of course this can be subverted by const-casting.
What is the performance impact of adding probes:
* when nothing is actively listening to the probes they should have a
relatively minor impact. Profiling has suggested even with a large number of
probes (60) the impact of them (when not active) is very minimal (<1%).
|
|
|
|
IEW DPRINTF uses Decode debug flag, which appears to be a copying error. This
patch changes this to the IEW Debug flag.
|
|
This patch changes the IEW drain check to include the FU pool as there
can be instructions that are "stored" in FU completion events and thus
not covered by the existing checks. With this patch, we simply include
a check to see if all the FUs are considered non-busy in the next
tick.
Without this patch, the pc-switcheroo-full regression fails after
minor changes to the cache timing (aligning to clock edge).
|
|
Fixes the tick used from rename:
- previously this gathered the tick on leaving rename which was always 1 less
than the dispatch. This conflated the decode ticks when back pressure built
in the pipeline.
- now picks up tick on entry.
Added --store_completions flag:
- will additionally display the store completion tail in the viewer.
- this highlights periods when large numbers of stores are outstanding (>16 LSQ
blocking)
Allows selection by tick range (previously this caused an infinite loop)
|
|
Previously, the O3 CPU could stop in the middle of a microcode
sequence. This patch makes sure that the pipeline stops when it has
committed a normal instruction or exited from a microcode
sequence. Additionally, it makes sure that the pipeline has no
instructions in flight when it is drained, which should make draining
more robust.
Draining is controlled in the commit stage, which checks if the next
PC after a committed instruction is in microcode. If this isn't the
case, it requests a squash of all instructions after that the
instruction that just committed and immediately signals a drain stall
to the fetch stage. The CPU then continues to execute until the
pipeline and all associated buffers are empty.
|
|
The entire O3 pipeline used to be initialized from init(), which is
called before initState() or unserialize(). This causes the pipeline
to be initialized from an incorrect thread context. This doesn't
currently lead to correctness problems as instructions fetched from
the incorrect start PC will be squashed a few cycles after
initialization.
This patch will affect the regressions since the O3 CPU now issues its
first instruction fetch to the correct PC instead of 0x0.
|
|
DynInst is extremely large the hope is that this re-organization will put the
most used members close to each other.
|
|
Eliminates dead code in the O3 and Ozone CPU models that counted
software prefetch instructions separately for the ALPHA ISA only.
|
|
Enables the CheckerCPU to be selected at runtime with the --checker option
from the configs/example/fs.py and configs/example/se.py configuration
files. Also merges with the SE/FS changes.
|
|
This patch continues the unification of how the different CPU models
create and share their instruction and data ports. Most importantly,
it forces every CPU to have an instruction and a data port, and gives
these ports explicit getters in the BaseCPU (getDataPort and
getInstPort). The patch helps in simplifying the code, make
assumptions more explicit, andfurther ease future patches related to
the CPU ports.
The biggest changes are in the in-order model (that was not modified
in the previous unification patch), which now moves the ports from the
CacheUnit to the CPU. It also distinguishes the instruction fetch and
load-store unit from the rest of the resources, and avoids the use of
indices and casting in favour of keeping track of these two units
explicitly (since they are always there anyways). The atomic, timing
and O3 model simply return references to their already existing ports.
|
|
This patch adds the necessary flags to the SConstruct and SConscript
files for compiling using clang 2.9 and later (on Ubuntu et al and OSX
XCode 4.2), and also cleans up a bunch of compiler warnings found by
clang. Most of the warnings are related to hidden virtual functions,
comparisons with unsigneds >= 0, and if-statements with empty
bodies. A number of mismatches between struct and class are also
fixed. clang 2.8 is not working as it has problems with class names
that occur in multiple namespaces (e.g. Statistics in
kernel_stats.hh).
clang has a bug (http://llvm.org/bugs/show_bug.cgi?id=7247) which
causes confusion between the container std::set and the function
Packet::set, and this is currently addressed by not including the
entire namespace std, but rather selecting e.g. "using std::vector" in
the appropriate places.
|
|
Brings the CheckerCPU back to life to allow FS and SE checking of the
O3CPU. These changes have only been tested with the ARM ISA. Other
ISAs potentially require modification.
|
|
|
|
Change the way instructions are squashed on memory ordering violations
to squash the violator and younger instructions, not all instructions
that are younger than the instruction they violated (no reason to throw
away valid work).
|
|
Implemented a pipeline activity viewer as a python script (util/o3-pipeview.py)
and modified O3 code base to support an extra trace flag (O3PipeView) for
generating traces to be used as inputs by the tool.
|
|
If a split load fails on a blocked cache wbOutstanding can be decremented
twice if the first part of the split load succeeds and the second part fails.
Condition the decrementing on not having completed the first part of the load.
|
|
Instructions that load an address and are control instructions can
execute down the wrong path if they were predicted correctly and then
instructions following them are squashed. If an instruction is a
memory and control op use the predicted address for the next PC instead
of just advancing the PC. Without this change NPC is used for the next
instruction, but predPC is used to verify that the branch was successful
so the wrong path is silently executed.
|
|
|
|
At the same time, rename the trace flags to debug flags since they
have broader usage than simply tracing. This means that
--trace-flags is now --debug-flags and --trace-help is now --debug-help
|
|
|
|
|
|
Get rid of unused members and use base types rather than derrived values
where possible to limit amount of state.
|
|
|
|
Some ISAs (like ARM) relies on hardware page table walkers. For those ISAs,
when a TLB miss occurs, initiateTranslation() can return with NoFault but with
the translation unfinished.
Instructions experiencing a delayed translation due to a hardware page table
walk are deferred until the translation completes and kept into the IQ. In
order to keep track of them, the IQ has been augmented with a queue of the
outstanding delayed memory instructions. When their translation completes,
instructions are re-executed (only their initiateAccess() was already
executed; their DTB translation is now skipped). The IEW stage has been
modified to support such a 2-pass execution.
|
|
Updated patches from Rick Strong's set that modify performance counters for
McPAT
|
|
|
|
|
|
The squash inside the fetch unit should not attempt to remove them from the
branch predictor as non-control instructions are not pushed into the predictor.
|