Age | Commit message (Collapse) | Author |
|
|
|
This patch takes quite a large step in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr by adopting its use for all
Faults. There are no changes in behaviour, and the code modifications
are mostly just replacing "new" with "make_shared".
|
|
This patch transitions the o3 MemDepEntry from the ad-hoc
RefCountingPtr to the c++11 shared_ptr. There are no changes in
behaviour, and the code modifications are mainly replacing "new" with
"make_shared".
|
|
This changeset adds probe points that can be used to implement PMU
counters for CPU stats. The following probes are supported:
* BaseCPU::ppCycles / Cycles
* BaseCPU::ppRetiredInsts / RetiredInsts
* BaseCPU::ppRetiredLoads / RetiredLoads
* BaseCPU::ppRetiredStores / RetiredStores
* BaseCPU::ppRetiredBranches RetiredBranches
|
|
Commmitted by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
The Ozone CPU is now very much out of date and completely
non-functional, with no one actively working on restoring it. It is a
source of confusion for new users who attempt to use it before
realizing its current state. RIP
|
|
This patch optimises the passing of StaticInstPtr by avoiding copying
the reference-counting pointer. This avoids first incrementing and
then decrementing the reference-counting pointer.
|
|
The call paths for de-scheduling a thread are halt() and suspend(), from
the thread context. There is no call to deallocateContext() in general,
though some CPUs chose to define it. This patch removes the function
from BaseCPU and the cores which do not require it.
|
|
activate(), suspend(), and halt() used on thread contexts had an optional
delay parameter. However this parameter was often ignored. Also, when used,
the delay was seemily arbitrarily set to 0 or 1 cycle (no other delays were
ever specified). This patch removes the delay parameter and 'Events'
associated with them across all ISAs and cores. Unused activate logic
is also removed.
|
|
This patch does a bit of housekeeping on the string helper functions
and relies on the C++11 standard library where possible. It also does
away with our custom string hash as an implementation is already part
of the standard library.
|
|
This patch changes how faults are passed between methods in an attempt
to copy as few reference-counting pointer instances as possible. This
should avoid unecessary copies being created, contributing to the
increment/decrement of the reference counters.
|
|
Switch from a list to a data structure with better data layout.
|
|
Some places in O3 always iterated over "Impl::MaxThreads" even if a CPU had
fewer threads. This removes a few of those instances.
|
|
Put the packet type swizzling (that is currently done in a lot of places)
into a refineCommand() member function.
|
|
For X86, the o3 CPU would get stuck with the commit stage not being
drained if an interrupt arrived while drain was pending. isDrained()
makes sure that pcState.microPC() == 0, thus ensuring that we are at
an instruction boundary. However, when we take an interrupt we
execute:
pcState.upc(romMicroPC(entry));
pcState.nupc(romMicroPC(entry) + 1);
tc->pcState(pcState);
As a result, the MicroPC is no longer zero. This patch ensures the drain is
delayed until no interrupts are present. Once draining, non-synchronous
interrupts are deffered until after the switch.
|
|
Analogous to ee049bf (for x86). Requires a bump of the checkpoint version
and corresponding upgrader code to move the condition code register values
to the new register file.
|
|
This patch fixes the load blocked/replay mechanism in the o3 cpu. Rather than
flushing the entire pipeline, this patch replays loads once the cache becomes
unblocked.
Additionally, deferred memory instructions (loads which had conflicting stores),
when replayed would not respect the number of functional units (only respected
issue width). This patch also corrects that.
Improvements over 20% have been observed on a microbenchmark designed to
exercise this behavior.
|
|
O3 is supposed to stop fetching instructions once a quiesce is encountered.
However due to a bug, it would continue fetching instructions from the current
fetch buffer. This is because of a break statment that only broke out of the
first of 2 nested loops. It should have broken out of both.
|
|
The o3 cpu could attempt to schedule inactive threads under round-robin SMT
mode.
This is because it maintained an independent priority list of threads from the
active thread list. This priority list could be come stale once threads were
inactive, leading to the cpu trying to fetch/commit from inactive threads.
Additionally the fetch queue is now forcibly flushed of instrctuctions
from the de-scheduled thread.
Relevant output:
24557000: system.cpu: [tid:1]: Calling deactivate thread.
24557000: system.cpu: [tid:1]: Removing from active threads list
24557500: system.cpu:
FullO3CPU: Ticking main, FullO3CPU.
24557500: system.cpu.fetch: Running stage.
24557500: system.cpu.fetch: Attempting to fetch from [tid:1]
|
|
This patch adds a fetch queue that sits between fetch and decode to the
o3 cpu. This effectively decouples fetch from decode stalls allowing it
to be more aggressive, running futher ahead in the instruction stream.
|
|
The o3 pipeline interlock/stall logic is incorrect. o3 unnecessicarily stalled
fetch and decode due to later stages in the pipeline. In general, a stage
should usually only consider if it is stalled by the adjacent, downstream stage.
Forcing stalls due to later stages creates and results in bubbles in the
pipeline. Additionally, o3 stalled the entire frontend (fetch, decode, rename)
on a branch mispredict while the ROB is being serially walked to update the
RAT (robSquashing). Only should have stalled at rename.
|
|
As highlighed on the mailing list gem5's writeback modeling can impact
performance. This patch removes the limitation on maximum outstanding issued
instructions, however the number that can writeback in a single cycle is still
respected in instToCommit().
|
|
We currently generate and compile one version of the ISA code per CPU
model. This is obviously wasting a lot of resources at compile
time. This changeset factors out the interface into a separate
ExecContext class, which also serves as documentation for the
interface between CPUs and the ISA code. While doing so, this
changeset also fixes up interface inconsistencies between the
different CPU models.
The main argument for using one set of ISA code per CPU model has
always been performance as this avoid indirect branches in the
generated code. However, this argument does not hold water. Booting
Linux on a simulated ARM system running in atomic mode
(opt/10.linux-boot/realview-simple-atomic) is actually 2% faster
(compiled using clang 3.4) after applying this patch. Additionally,
compilation time is decreased by 35%.
|
|
|
|
Dispatch should not check LSQ size/LSQ stall for non load/store
instructions.
This work was done while Binh was an intern at AMD Research.
|
|
Check for free entries in Load Queue and Store Queue separately to
avoid cases when load cannot be renamed due to full Store Queue and
vice versa.
This work was done while Binh was an intern at AMD Research.
|
|
Using '== true' in a boolean expression is totally redundant,
and using '== false' is pretty verbose (and arguably less
readable in most cases) compared to '!'.
It's somewhat of a pet peeve, perhaps, but I had some time
waiting for some tests to run and decided to clean these up.
Unfortunately, SLICC appears not to have the '!' operator,
so I had to leave the '== false' tests in the SLICC code.
|
|
This patch removes the stat totalCommittedInsts. This variable was used for
recording the total number of instructions committed across all the threads
of a core. The instructions committed by each thread are recorded invidually.
The total would now be generated by summing these individual counts.
|
|
For the o3, add instruction mix (OpClass) histogram at commit (stats
also already collected at issue). For the simple CPUs we add a
histogram of executed instructions
|
|
Allow the specification of a socket ID for every core that is reflected in the
MPIDR field in ARM systems. This allows studying multi-socket / cluster
systems with ARM CPUs.
|
|
In the O3 LSQ, data read/written is printed out in DPRINTFs. However,
the data field is treated as a character string with a null terminated.
However the data field is not encoded this way. This patch removes
that possibility by removing the data part of the print.
|
|
O3CPU has a compile-time maximum width set in o3/impl.hh, but checking
the configuration against this limit was not implemented anywhere
except for fetch. Configuring a wider pipe than the limit can silently
cause various issues during the simulation. This patch adds the proper
checking in the constructor of the various pipeline stages.
|
|
A number of calls to isEmpty() and numFreeEntries()
should be thread-specific.
In cpu.cc, the fact that tid is /*commented*/ out is a bug. Say the rob
has instructions from thread 0 (isEmpty() returns false), and none from
thread 1. If we are trying to squash all of thread 1, then
readTailInst(thread 1) will be called because rob->isEmpty() returns
false. The result is end_it is not in the list and the while
statement loops indefinitely back over the cpu's instList.
In iew_impl.hh, all threads are told they have the entire remaining IQ, when
each thread actually has a certain allocation. The result is extra stalls at
the iew dispatch stage which the rename stage usually takes care of.
In commit_impl.hh, rob->readHeadInst(thread 1) can be called if the rob only
contains instructions from thread 0. This returns a dummyInst (which may work
since we are trying to squash all instructions, but hardly seems like the right
way to do it).
In rob_impl.hh this fix skips the rest of the function more frequently and is
more efficient.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch fixes violation of TSO in the O3CPU, as all loads must be
ordered with all other loads. In the LQ, if a snoop is observed, all
subsequent loads need to be squashed if the system is TSO.
Prior to this patch, the following case could be violated:
P0 | P1 ;
MOV [x],mail=/usr/spool/mail/nilay | MOV EAX,[y] ;
MOV [y],mail=/usr/spool/mail/nilay | MOV EBX,[x] ;
exists (1:EAX=1 /\ 1:EBX=0) [is a violation]
The problem was found using litmus [http://diy.inria.fr].
Committed by: Nilay Vaish <nilay@cs.wisc.edu
|
|
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch merely tidies up the CPU and ThreadContext getters by
making them const where appropriate.
|
|
Small fixes to appease recent clang versions.
|
|
The CheckerCPU model in pre-v8 code was not checking the
updates to miscellaneous registers due to some methods
for setting misc regs were not instrumented. The v8 patches
exposed this by calling the instrumented misc reg update
methods and then invoking the checker before the main CPU had
updated its misc regs, leading to false positives about
register mismatches. This patch fixes the non-instrumented
misc reg update methods and places calls to the checker in
the proper places in the O3 model.
|
|
With ARMv8 support the same misc register id results in accessing different
registers depending on the current mode of the processor. This patch adds
the same orthogonality to the misc register file as the others (int, float, cc).
For all the othre ISAs this is currently a null-implementation.
Additionally, a system variable is added to all the ISA objects.
|
|
|
|
|
|
snooped.
This patch add support for generating wake-up events in the CPU when an address
that is currently in the exclusive state is hit by a snoop. This mechanism is required
for ARMv8 multi-processor support.
|
|
This patch enables tracking of cache occupancy per thread along with
ages (in buckets) per cache blocks. Cache occupancy stats are
recalculated on each stat dump.
|
|
The probe patch is motivated by the desire to move analytical and trace code
away from functional code. This is achieved by the probe interface which is
essentially a glorified observer model.
What this means to users:
* add a probe point and a "notify" call at the source of an "event"
* add an isolated module, that is being used to carry out *your* analysis (e.g. generate a trace)
* register that module as a probe listener
Note: an example is given for reference in src/cpu/o3/simple_trace.[hh|cc] and src/cpu/SimpleTrace.py
What is happening under the hood:
* every SimObject maintains has a ProbeManager.
* during initialization (src/python/m5/simulate.py) first regProbePoints and
the regProbeListeners is called on each SimObject. this hooks up the probe
point notify calls with the listeners.
FAQs:
Why did you develop probe points:
* to remove trace, stats gathering, analytical code out of the functional code.
* the belief that probes could be generically useful.
What is a probe point:
* a probe point is used to notify upon a given event (e.g. cpu commits an instruction)
What is a probe listener:
* a class that handles whatever the user wishes to do when they are notified
about an event.
What can be passed on notify:
* probe points are templates, and so the user can generate probes that pass any
type of argument (by const reference) to a listener.
What relationships can be generated (1:1, 1:N, N:M etc):
* there isn't a restriction. You can hook probe points and listeners up in a
1:1, 1:N, N:M relationship. They become useful when a number of modules
listen to the same probe points. The idea being that you can add a small
number of probes into the source code and develop a larger number of useful
analysis modules that use information passed by the probes.
Can you give examples:
* adding a probe point to the cpu's commit method allows you to build a trace
module (outputting assembler), you could re-use this to gather instruction
distribution (arithmetic, load/store, conditional, control flow) stats.
Why is the probe interface currently restricted to passing a const reference:
* the desire, initially at least, is to allow an interface to observe
functionality, but not to change functionality.
* of course this can be subverted by const-casting.
What is the performance impact of adding probes:
* when nothing is actively listening to the probes they should have a
relatively minor impact. Profiling has suggested even with a large number of
probes (60) the impact of them (when not active) is very minimal (<1%).
|
|
Add some values and methods to the request object to track the translation
and access latency for a request and which level of the cache hierarchy responded
to the request.
|
|
This patch relaxes the check performed when squashing non-speculative
instructions, as it caused problems with loads that were marked ready,
and then stalled on a blocked cache. The assertion is now allowing
memory references to be non-faulting.
|
|
|
|
the current implementation of the fetch buffer in the o3 cpu
is only allowed to be the size of a cache line. some
architectures, e.g., ARM, have fetch buffers smaller than a cache
line, see slide 22 at:
http://www.arm.com/files/pdf/at-exploring_the_design_of_the_cortex-a15.pdf
this patch allows the fetch buffer to be set to values smaller
than a cache line.
|
|
Most other structures/stages get passed the cpu params struct.
|
|
Fix a problem in the O3 CPU for instructions that are both
memory loads and memory barriers (e.g. load acquire) and
to uncacheable memory. This combination can confuse the
commit stage into commitng an instruction that hasn't
executed and got it's value yet. At the same time refactor
the code slightly to remove duplication between two of
the cases.
|