Age | Commit message (Collapse) | Author |
|
|
|
In this case we need to throw away the TLB miss, not assume it was the
one we were waiting for.
|
|
We only support EABI binaries, so there is no reason to support OABI syscalls.
The loader detects OABI calls and fatal() so there is no reason to even check
here.
|
|
|
|
|
|
The ARM performance counters are not currently supported by the model.
This patch interprets a 'reset performance counters' command to mean 'reset
the simulator statistics' instead.
|
|
|
|
"executing" isnt a very descriptive debug message and in going through the
output you get multiple messages that say "executing" but nothing to help
you parse through the code/execution.
So instead, at least print out the name of the action that is taking
place in these functions.
|
|
Overall, continue to progress Ruby debug messages to more of the normal M5
debug message style
- add a name() to the Ruby Throttle & PerfectSwitch objects so that the debug output
isn't littered w/"global:" everywhere.
- clean up messages that print over multiple lines when possible
- clean up duplicate prints in the message buffer
|
|
|
|
In certain actions of the L1 cache controller, while creating an outgoing
message, the machine type was not being set. This results in a
segmentation fault when trace is collected. Joseph Pusudesris provided
his patch for fixing this issue.
|
|
The L1 cache controller file contains references to foo and goo queues, which
are not in use at all. These have been removed.
|
|
|
|
|
|
|
|
make sure instructions are able to commit before writing back to the RF
do not commit more than 1 non-speculative instruction per cycle
|
|
keep track of when an instruction needs the execution
behind it to be serialized. Without this, in SE Mode
instructions can execute behind a system call exit().
|
|
a lot of structures get allocated based off that MaxThreads parameter so this is an
effort to not abuse it
|
|
resources don't need to call getLatency because the latency is already a member
in the class. If there is some type of special case where different instructions
impose a different latency inside a resource then we can revisit this and
add getLatency() back in
|
|
each resource has a certain # of requests it can take per cycle. update the #s here
to be more realistic based off of the pipeline width and if the resource needs to
be accessed on multiple cycles
|
|
cleanup hanging pointers and other cruft in the destructors
|
|
---
need to delete the cache request's data on clearRequest() now that we are recycling
requests
---
fetch unit needs to deallocate the fetch buffer blocks when they are replaced or
squashed.
|
|
if a resource has a zero cycle latency (e.g. RegFile write), then dont allocate an event
for it to use
|
|
formerly, to free up bandwidth in a resource, we could just change the pointer in that resource
but at the same time the pipeline stages had visibility to see what happened to a resource request.
Now that we are recycling these requests (to avoid too much dynamic allocation), we can't throw
away the request too early or the pipeline stage gets bad information. Instead, mark when a request
is done with the resource all together and then let the pipeline stage call back to the resource
that it's time to free up the bandwidth for more instructions
*** inteface notes ***
- When an instruction completes and is done in a resource for that cycle, call done()
- When an instruction fails and is done with a resource for that cycle, call done(false)
- When an instruction completes, but isnt finished with a resource, call completed()
- When an instruction fails, but isnt finished with a resource, call completed(false)
* * *
inorder: tlbmiss wakeup bug fix
|
|
take away all instances of reqMap in the code and make all references use the built-in
request vectors inside of each resource. The request map was dynamically allocating
a request per instruction. The request vector just allocates N number of requests
during instantiation and then the surrounding code is fixed up to reuse those N requests
***
setRequest() and clearRequest() are the new accessors needed to define a new
request in a resource
|
|
this will allow us to reuse resource requests within a resource instead
of always dynamically allocating
|
|
we are going to be getting away from creating new resource requests for every
instruction so no more need to keep track of a reqRemoveList and clean it up
every tick
|
|
first change in an optimization that will stop InOrder from allocating new memory for every instruction's
request to a resource. This gets expensive since every instruction needs to access ~10 requests before
graduation. Instead, the plan is to allocate just enough resource request objects to satisfy each resource's
bandwidth (e.g. the execution unit would need to allocate 3 resource request objects for a 1-issue pipeline
since on any given cycle it could have 2 read requests and 1 write request) and then let the instructions
contend and reuse those allocated requests. The end result is a smaller memory footprint for the InOrder model
and increased simulation performance
|
|
This was making certain versions of gcc omit the function from the object file
which would break the build.
|
|
Get rid of RELEASE_NOTES since we no longer do releases, update some of the
information in README, and update the date in LICENSE.
|
|
Currently the wakeup function for the PerfectSwitch contains three loops -
loop on number of virtual networks
loop on number of incoming links
loop till all messages for this (link, network) have been routed
With an 8 processor mesh network and Hammer protocol, about 11-12% of the
was observed to have been spent in this function, which is the highest
amongst all the functions. It was found that the innermost loop is executed
about 45 times per invocation of the wakeup function, when each invocation
of the wakeup function processes just about one message.
The patch tries to do away with the redundant executions of the innermost
loop. Counters have been added for each virtual network that record the
number of messages that need to be routed for that virtual network. The
inner loops are only executed when the number of messages for that particular
virtual network > 0. This does away with almost 80% of the executions of the
innermost loop. The function now consumes about 5-6% of the total execution
time.
|
|
The size of the current instruction determines what the npc should be if
there's no branching.
|
|
Using the destination register directly causes the ISA parser to treat it as a
source even if none of the original bits are used.
|
|
In x86, 32 and 64 bit writes to registers in which registers appear to be 32 or
64 bits wide overwrite all bits of the destination register. This change
removes false dependencies in these cases where the previous value of a
register doesn't need to be read to write a new value. New versions of most
microops are created that have a "Big" suffix which simply overwrite their
destination, and the right version to use is selected during microop
allocation based on the selected data size.
This does not change the performance of the O3 CPU model significantly, I
assume because there are other false dependencies from the condition code bits
in the flags register.
|
|
This way a bad micropc will have to get all the way to commit before killing
the simulation. This accounts for misspeculated branches.
|
|
These faults can panic/warn/warn_once, etc., instead of instructions doing
that themselves directly. That way, instructions can be speculatively
executed, and only if they're actually going to commit will their fault be
invoked and the panic, etc., happen.
|
|
When redirecting fetch to handle branches, the npc of the current pc state
needs to be left alone. This change makes the pc state record whether or not
the npc already reflects a real value by making it keep track of the current
instruction size, or if no size has been set.
|
|
|
|
|
|
The patch changes the order in which L1 dcache and icache are looked up when
a request comes in. Earlier, if a request came in for instruction fetch, the
dcache was looked up before the icache, to correctly handle self-modifying
code. But, in the common case, dcache is going to report a miss and the
subsequent icache lookup is going to report a hit. Given the invariant -
caches under the same controller keep track of disjoint sets of cache blocks,
we can move the icache lookup before the dcache lookup. In case of a hit in
the icache, using our invariant, we know that the dcache would have reported
a miss. In case of a miss in the icache, we know that icache would have
missed even if the dcache was looked up before looking up the icache.
Effectively, we are doing the same thing as before, though in the common case,
we expect reduction in the number of lookups. This was empirically confirmed
for MOESI hammer. The ratio lookups to access requests is now about 1.1 to 1.
|
|
remove remnants of old way of instruction scheduling which dynamically allocated
a new resource schedule for every instruction
|
|
allow the pipeline and resources to use the cached instruction schedule and resource
sked iterator
|
|
resource skeds are divided into two parts: front end (all insts) and back end (inst. specific)
each of those are implemented as separate lists, so this iterator wraps around
the traditional list iterator so that an instruction can walk it's schedule but seamlessly
transfer from front end to back end when necessary
|
|
add a stage scheduler class to replace InstStage in pipeline_traits.cc
use that class to define a default front-end, resource schedule that all
instructions will follow. This will also replace the back end schedule in
pipeline_traits.cc. The reason for adding this is so that we can cache
instruction schedules in the future instead of calling the same function
over/over again as well as constantly dynamically alllocating memory on
every instruction to try to figure out it's schedule
|
|
first step in a optimization to not dynamically allocate an instruction schedule
for every instruction but rather used cached schedules
|
|
|
|
inst_buffer file isn't used , so remove it
|
|
pass/fail ops were used for testing but arent part of isa
|
|
|
|
|