gem5 - gem5

Age	Commit message (Collapse)	Author
2015-09-30	cpu: Add per-thread monitors	Mitch Hayenga
	Adds per-thread address monitors to support FullSystem SMT.
2015-09-15	cpu, o3: consider split requests for LSQ checksnoop operations	Hongil Yoon
	This patch enables instructions in LSQ to track two physical addresses for corresponding two split requests. Later, the information is used in checksnoop() to search for/invalidate the corresponding LD instructions. The current implementation has kept track of only the physical address that is referenced by the first split request. Thus, for checksnoop(), the line accessed by the second request has not been considered, causing potential correctness issues. Committed by: Nilay Vaish <nilay@cs.wisc.edu>
2015-08-07	base: Declare a type for context IDs	Andreas Sandberg
	Context IDs used to be declared as ad hoc (usually as int). This changeset introduces a typedef for ContextIDs and a constant for invalid context IDs.
2015-07-28	revert 5af8f40d8f2c	Nilay Vaish

2015-07-26	cpu: implements vector registers	Nilay Vaish
	This adds a vector register type. The type is defined as a std::array of a fixed number of uint64_ts. The isa_parser.py has been modified to parse vector register operands and generate the required code. Different cpus have vector register files now.
2015-05-15	misc: Appease gcc 5.1	Andreas Hansson
	Three minor issues are resolved: 1. Apparently gcc 5.1 does not like negation of booleans followed by bitwise AND. 2. Somehow the compiler also gets confused and warns about NoopMachInst being unused (removing it causes compilation errors though). Most likely a compiler bug. 3. There seems to be a number of instances where loop unrolling causes false positives for the array-bounds check. For now, switch to std::array. Potentially we could disable the warning for newer gcc versions, but switching to std::array is probably a good move in any case.
2015-05-05	mem, cpu: Add a separate flag for strictly ordered memory	Andreas Sandberg
	The Request::UNCACHEABLE flag currently has two different functions. The first, and obvious, function is to prevent the memory system from caching data in the request. The second function is to prevent reordering and speculation in CPU models. This changeset gives the order/speculation requirement a separate flag (Request::STRICT_ORDER). This flag prevents CPU models from doing the following optimizations: * Speculation: CPU models are not allowed to issue speculative loads. * Write combining: CPU models and caches are not allowed to merge writes to the same cache line. Note: The memory system may still reorder accesses unless the UNCACHEABLE flag is set. It is therefore expected that the STRICT_ORDER flag is combined with the UNCACHEABLE flag to prevent this behavior.
2015-03-02	cpu: o3 register renaming request handling improved	Rekai
	Now, prior to the renaming, the instruction requests the exact amount of registers it will need, and the rename_map decides whether the instruction is allowed to proceed or not.
2015-02-11	sim: Move the BaseTLB to src/arch/generic/	Andreas Sandberg
	The TLB-related code is generally architecture dependent and should live in the arch directory to signify that. --HG-- rename : src/sim/BaseTLB.py => src/arch/generic/BaseTLB.py rename : src/sim/tlb.cc => src/arch/generic/tlb.cc rename : src/sim/tlb.hh => src/arch/generic/tlb.hh
2015-01-25	sim: Clean up InstRecord	Ali Saidi
	Track memory size and flags as well as add some comments and consts.
2014-11-06	x86 isa: This patch attempts an implementation at mwait.	Marc Orr
	Mwait works as follows: 1. A cpu monitors an address of interest (monitor instruction) 2. A cpu calls mwait - this loads the cache line into that cpu's cache. 3. The cpu goes to sleep. 4. When another processor requests write permission for the line, it is evicted from the sleeping cpu's cache. This eviction is forwarded to the sleeping cpu, which then wakes up. Committed by: Nilay Vaish <nilay@cs.wisc.edu>
2014-10-16	arch: Use shared_ptr for all Faults	Andreas Hansson
	This patch takes quite a large step in transitioning from the ad-hoc RefCountingPtr to the c++11 shared_ptr by adopting its use for all Faults. There are no changes in behaviour, and the code modifications are mostly just replacing "new" with "make_shared".
2014-09-27	arch: Use const StaticInstPtr references where possible	Andreas Hansson
	This patch optimises the passing of StaticInstPtr by avoiding copying the reference-counting pointer. This avoids first incrementing and then decrementing the reference-counting pointer.
2014-09-03	arch, cpu: Factor out the ExecContext into a proper base class	Andreas Sandberg
	We currently generate and compile one version of the ISA code per CPU model. This is obviously wasting a lot of resources at compile time. This changeset factors out the interface into a separate ExecContext class, which also serves as documentation for the interface between CPUs and the ISA code. While doing so, this changeset also fixes up interface inconsistencies between the different CPU models. The main argument for using one set of ISA code per CPU model has always been performance as this avoid indirect branches in the generated code. However, this argument does not hold water. Booting Linux on a simulated ARM system running in atomic mode (opt/10.linux-boot/realview-simple-atomic) is actually 2% faster (compiled using clang 3.4) after applying this patch. Additionally, compilation time is decreased by 35%.
2014-05-09	cpu, arm: Allow the specification of a socket field	Akash Bagdia
	Allow the specification of a socket ID for every core that is reflected in the MPIDR field in ARM systems. This allows studying multi-socket / cluster systems with ARM CPUs.
2014-03-07	cpu: Make CPU and ThreadContext getters const	Andreas Hansson
	This patch merely tidies up the CPU and ThreadContext getters by making them const where appropriate.
2014-01-24	cpu: Add CPU support for generatig wake up events when LLSC adresses are ↵	Ali Saidi
	snooped. This patch add support for generating wake-up events in the CPU when an address that is currently in the exclusive state is hit by a snoop. This mechanism is required for ARMv8 multi-processor support.
2014-01-24	mem: per-thread cache occupancy and per-block ages	Dam Sunwoo
	This patch enables tracking of cache occupancy per thread along with ages (in buckets) per cache blocks. Cache occupancy stats are recalculated on each stat dump.
2013-10-17	cpu: Fix O3 uncacheable load that is replayed but misses the TLB	Ali Saidi
	This change fixes an issue in the O3 CPU where an uncachable instruction is attempted to be executed before it reaches the head of the ROB. It is determined to be uncacheable, and is replayed, but a PanicFault is attached to the instruction to make sure that it is properly executed before committing. If the TLB entry it was using is replaced in the interveaning time, the TLB returns a delayed translation when the load is replayed at the head of the ROB, however the LSQ code can't differntiate between the old fault and the new one. If the translation isn't complete it can't be faulting, so clear the fault.
2013-10-15	cpu: add a condition-code register class	Yasuko Eckert
	Add a third register class for condition codes, in parallel with the integer and FP classes. No ISAs use the CC class at this point though.
2013-07-18	mem: Set the cache line size on a system level	Andreas Hansson
	This patch removes the notion of a peer block size and instead sets the cache line size on the system level. Previously the size was set per cache, and communicated through the interconnect. There were plenty checks to ensure that everyone had the same size specified, and these checks are now removed. Another benefit that is not yet harnessed is that the cache line size is now known at construction time, rather than after the port binding. Hence, the block size can be locally stored and does not have to be queried every time it is used. A follow-on patch updates the configuration scripts accordingly.
2012-06-05	O3: Clean up the O3 structures and try to pack them a bit better.	Ali Saidi
	DynInst is extremely large the hope is that this re-organization will put the most used members close to each other.
2012-06-05	sim: Remove FastAlloc	Ali Saidi
	While FastAlloc provides a small performance increase (~1.5%) over regular malloc it isn't thread safe. After removing FastAlloc and using tcmalloc I've seen a performance increase of 12% over libc malloc when running twolf for ARM.
2012-03-19	gcc: Clean-up of non-C++0x compliant code, first steps	Andreas Hansson
	This patch cleans up a number of minor issues aiming to get closer to compliance with the C++0x standard as interpreted by gcc and clang (compile with std=c++0x and -pedantic-errors). In particular, the patch cleans up enums where the last item was succeded by a comma, namespaces closed by a curcly brace followed by a semi-colon, and the use of the GNU-extension typeof (replaced by templated functions). It does not address variable-length arrays, zero-size arrays, anonymous structs, range expressions in switch statements, and the use of long long. The generated CPU code also has a large number of issues that remain to be fixed, mainly related to overflows in implicit constant conversion (due to shifts).
2012-03-09	CheckerCPU: Make CheckerCPU runtime selectable instead of compile selectable	Geoffrey Blake
	Enables the CheckerCPU to be selected at runtime with the --checker option from the configs/example/fs.py and configs/example/se.py configuration files. Also merges with the SE/FS changes.
2012-02-24	CPU: Round-two unifying instr/data CPU ports across models	Andreas Hansson
	This patch continues the unification of how the different CPU models create and share their instruction and data ports. Most importantly, it forces every CPU to have an instruction and a data port, and gives these ports explicit getters in the BaseCPU (getDataPort and getInstPort). The patch helps in simplifying the code, make assumptions more explicit, andfurther ease future patches related to the CPU ports. The biggest changes are in the in-order model (that was not modified in the previous unification patch), which now moves the ports from the CacheUnit to the CPU. It also distinguishes the instruction fetch and load-store unit from the rest of the resources, and avoids the use of indices and casting in favour of keeping track of these two units explicitly (since they are always there anyways). The atomic, timing and O3 model simply return references to their already existing ports.
2012-02-12	mem: Add a master ID to each request object.	Ali Saidi
	This change adds a master id to each request object which can be used identify every device in the system that is capable of issuing a request. This is part of the way to removing the numCpus+1 stats in the cache and replacing them with the master ids. This is one of a series of changes that make way for the stats output to be changed to python.
2012-02-07	Faults: Turn off arch/faults.hh	Gabe Black
	Because there are no longer architecture independent but specialized functions in arch/XXX/faults.hh, code that isn't using the faults from a particular ISA no longer needs to be able to include them through the switching header file arch/faults.hh. By removing that header file (arch/faults.hh), the potential interface between ISA code and non ISA code is narrowed.
2012-01-31	Merge with head, hopefully the last time for this batch.	Gabe Black

2012-01-31	CheckerCPU: Re-factor CheckerCPU to be compatible with current gem5	Geoffrey Blake
	Brings the CheckerCPU back to life to allow FS and SE checking of the O3CPU. These changes have only been tested with the ARM ISA. Other ISAs potentially require modification.
2011-11-18	SE/FS: Get rid of includes of config/full_system.hh.	Gabe Black

2011-09-13	LSQ: Only trigger a memory violation with a load/load if the value changes.	Ali Saidi
	Only create a memory ordering violation when the value could have changed between two subsequent loads, instead of just when loads go out-of-order to the same address. While not very common in the case of Alpha, with an architecture with a hardware table walker this can happen reasonably frequently beacuse a translation will miss and start a table walk and before the CPU re-schedules the faulting instruction another one will pass it to the same address (or cache block depending on the dendency checking). This patch has been tested with a couple of self-checking hand crafted programs to stress ordering between two cores. The performance improvement on SPEC benchmarks can be substantial (2-10%).
2011-09-09	StaticInst: Merge StaticInst and StaticInstBase.	Gabe Black
	Having two StaticInst classes, one nominally ISA dependent and the other ISA dependent, has not been historically useful and makes the StaticInst class more complicated that it needs to be. This change merges StaticInstBase into StaticInst.
2011-08-14	O3: Add a pointer to the macroop for a microop in the dyninst.	Gabe Black

2011-08-07	Translation: Use a pointer type as the template argument.	Gabe Black
	This allows regular pointers and reference counted pointers without having to use any shim structures or other tricks.
2011-08-02	O3: Get rid of the raw ExtMachInst constructor on DynInsts.	Gabe Black
	This constructor assumes that the ExtMachInst can be decoded directly into a StaticInst that's useful to execute. With the advent of microcoded instructions that's no longer true.
2011-07-02	ExecContext: Rename the readBytes/writeBytes functions to readMem and writeMem.	Gabe Black
	readBytes and writeBytes had the word "bytes" in their names because they accessed blobs of bytes. This distinguished them from the read and write functions which handled higher level data types. Because those functions don't exist any more, this change renames readBytes and writeBytes to more general names, readMem and writeMem, which reflect the fact that they are how you read and write memory. This also makes their names more consistent with the register reading/writing functions, although those are still read and set for some reason.
2011-07-02	ExecContext: Get rid of the now unused read/write templated functions.	Gabe Black

2011-04-04	CPU: Remove references to memory copy operations	Ali Saidi

2011-04-04	O3: Tighten memory order violation checking to 16 bytes.	Ali Saidi
	The comment in the code suggests that the checking granularity should be 16 bytes, however in reality the shift by 8 is 256 bytes which seems much larger than required.
2011-02-11	O3: Enhance data address translation by supporting hardware page table walkers.	Giacomo Gabrielli
	Some ISAs (like ARM) relies on hardware page table walkers. For those ISAs, when a TLB miss occurs, initiateTranslation() can return with NoFault but with the translation unfinished. Instructions experiencing a delayed translation due to a hardware page table walk are deferred until the translation completes and kept into the IQ. In order to keep track of them, the IQ has been augmented with a queue of the outstanding delayed memory instructions. When their translation completes, instructions are re-executed (only their initiateAccess() was already executed; their DTB translation is now skipped). The IEW stage has been modified to support such a 2-pass execution.
2010-12-07	O3: Support squashing all state after special instruction	Ali Saidi
	For SPARC ASIs are added to the ExtMachInst. If the ASI is changed simply marking the instruction as Serializing isn't enough beacuse that only stops rename. This provides a mechanism to squash all the instructions and refetch them
2010-12-07	O3: Make all instructions that write a misc. register not perform the write ↵	Giacomo Gabrielli
	until commit. ARM instructions updating cumulative flags (ARM FP exceptions and saturation flags) are not serialized. Added aliases for ARM FP exceptions and saturation flags in FPSCR. Removed write accesses to the FP condition codes for most ARM VFP instructions: only VCMP and VCMPE instructions update the FP condition codes. Removed a potential cause of seg. faults in the O3 model for NEON memory macro-ops (ARM).
2010-11-08	ARM/Alpha/Cpu: Change prefetchs to be more like normal loads.	Ali Saidi
	This change modifies the way prefetches work. They are now like normal loads that don't writeback a register. Previously prefetches were supposed to call prefetch() on the exection context, so they executed with execute() methods instead of initiateAcc() completeAcc(). The prefetch() methods for all the CPUs are blank, meaning that they get executed, but don't actually do anything. On Alpha dead cache copy code was removed and prefetches are now normal ops. They count as executed operations, but still don't do anything and IsMemRef is not longer set on them. On ARM IsDataPrefetch or IsInstructionPreftech is now set on all prefetch instructions. The timing simple CPU doesn't try to do anything special for prefetches now and they execute with the normal memory code path.
2010-10-31	ISA,CPU,etc: Create an ISA defined PC type that abstracts out ISA behaviors.	Gabe Black
	This change is a low level and pervasive reorganization of how PCs are managed in M5. Back when Alpha was the only ISA, there were only 2 PCs to worry about, the PC and the NPC, and the lsb of the PC signaled whether or not you were in PAL mode. As other ISAs were added, we had to add an NNPC, micro PC and next micropc, x86 and ARM introduced variable length instruction sets, and ARM started to keep track of mode bits in the PC. Each CPU model handled PCs in its own custom way that needed to be updated individually to handle the new dimensions of variability, or, in the case of ARMs mode-bit-in-the-pc hack, the complexity could be hidden in the ISA at the ISA implementation's expense. Areas like the branch predictor hadn't been updated to handle branch delay slots or micropcs, and it turns out that had introduced a significant (10s of percent) performance bug in SPARC and to a lesser extend MIPS. Rather than perpetuate the problem by reworking O3 again to handle the PC features needed by x86, this change was introduced to rework PC handling in a more modular, transparent, and hopefully efficient way. PC type: Rather than having the superset of all possible elements of PC state declared in each of the CPU models, each ISA defines its own PCState type which has exactly the elements it needs. A cross product of canned PCState classes are defined in the new "generic" ISA directory for ISAs with/without delay slots and microcode. These are either typedef-ed or subclassed by each ISA. To read or write this structure through a Context, you use the new pcState() accessor which reads or writes depending on whether it has an argument. If you just want the address of the current or next instruction or the current micro PC, you can get those through read-only accessors on either the PCState type or the Contexts. These are instAddr(), nextInstAddr(), and microPC(). Note the move away from readPC. That name is ambiguous since it's not clear whether or not it should be the actual address to fetch from, or if it should have extra bits in it like the PAL mode bit. Each class is free to define its own functions to get at whatever values it needs however it needs to to be used in ISA specific code. Eventually Alpha's PAL mode bit could be moved out of the PC and into a separate field like ARM. These types can be reset to a particular pc (where npc = pc + sizeof(MachInst), nnpc = npc + sizeof(MachInst), upc = 0, nupc = 1 as appropriate), printed, serialized, and compared. There is a branching() function which encapsulates code in the CPU models that checked if an instruction branched or not. Exactly what that means in the context of branch delay slots which can skip an instruction when not taken is ambiguous, and ideally this function and its uses can be eliminated. PCStates also generally know how to advance themselves in various ways depending on if they point at an instruction, a microop, or the last microop of a macroop. More on that later. Ideally, accessing all the PCs at once when setting them will improve performance of M5 even though more data needs to be moved around. This is because often all the PCs need to be manipulated together, and by getting them all at once you avoid multiple function calls. Also, the PCs of a particular thread will have spatial locality in the cache. Previously they were grouped by element in arrays which spread out accesses. Advancing the PC: The PCs were previously managed entirely by the CPU which had to know about PC semantics, try to figure out which dimension to increment the PC in, what to set NPC/NNPC, etc. These decisions are best left to the ISA in conjunction with the PC type itself. Because most of the information about how to increment the PC (mainly what type of instruction it refers to) is contained in the instruction object, a new advancePC virtual function was added to the StaticInst class. Subclasses provide an implementation that moves around the right element of the PC with a minimal amount of decision making. In ISAs like Alpha, the instructions always simply assign NPC to PC without having to worry about micropcs, nnpcs, etc. The added cost of a virtual function call should be outweighed by not having to figure out as much about what to do with the PCs and mucking around with the extra elements. One drawback of making the StaticInsts advance the PC is that you have to actually have one to advance the PC. This would, superficially, seem to require decoding an instruction before fetch could advance. This is, as far as I can tell, realistic. fetch would advance through memory addresses, not PCs, perhaps predicting new memory addresses using existing ones. More sophisticated decisions about control flow would be made later on, after the instruction was decoded, and handed back to fetch. If branching needs to happen, some amount of decoding needs to happen to see that it's a branch, what the target is, etc. This could get a little more complicated if that gets done by the predecoder, but I'm choosing to ignore that for now. Variable length instructions: To handle variable length instructions in x86 and ARM, the predecoder now takes in the current PC by reference to the getExtMachInst function. It can modify the PC however it needs to (by setting NPC to be the PC + instruction length, for instance). This could be improved since the CPU doesn't know if the PC was modified and always has to write it back. ISA parser: To support the new API, all PC related operand types were removed from the parser and replaced with a PCState type. There are two warts on this implementation. First, as with all the other operand types, the PCState still has to have a valid operand type even though it doesn't use it. Second, using syntax like PCS.npc(target) doesn't work for two reasons, this looks like the syntax for operand type overriding, and the parser can't figure out if you're reading or writing. Instructions that use the PCS operand (which I've consistently called it) need to first read it into a local variable, manipulate it, and then write it back out. Return address stack: The return address stack needed a little extra help because, in the presence of branch delay slots, it has to merge together elements of the return PC and the call PC. To handle that, a buildRetPC utility function was added. There are basically only two versions in all the ISAs, but it didn't seem short enough to put into the generic ISA directory. Also, the branch predictor code in O3 and InOrder were adjusted so that they always store the PC of the actual call instruction in the RAS, not the next PC. If the call instruction is a microop, the next PC refers to the next microop in the same macroop which is probably not desirable. The buildRetPC function advances the PC intelligently to the next macroop (in an ISA specific way) so that that case works. Change in stats: There were no change in stats except in MIPS and SPARC in the O3 model. MIPS runs in about 9% fewer ticks. SPARC runs with 30%-50% fewer ticks, which could likely be improved further by setting call/return instruction flags and taking advantage of the RAS. TODO: Add != operators to the PCState classes, defined trivially to be !(a==b). Smooth out places where PCs are split apart, passed around, and put back together later. I think this might happen in SPARC's fault code. Add ISA specific constructors that allow setting PC elements without calling a bunch of accessors. Try to eliminate the need for the branching() function. Factor out Alpha's PAL mode pc bit into a separate flag field, and eliminate places where it's blindly masked out or tested in the PC.
2010-09-13	Faults: Pass the StaticInst involved, if any, to a Fault's invoke method.	Gabe Black
	Also move the "Fault" reference counted pointer type into a separate file, sim/fault.hh. It would be better to name this less similarly to sim/faults.hh to reduce confusion, but fault.hh matches the name of the type. We could change Fault to FaultPtr to match other pointer types, and then changing the name of the file would make more sense.
2010-08-23	CPU: Make Exec trace to print predication result (if false) for memory ↵	Min Kyu Jeong
	instructions
2010-08-23	ARM/O3: store the result of the predicate evaluation in DynInst or Threadstate.	Min Kyu Jeong
	THis allows the CPU to handle predicated-false instructions accordingly. This particular patch makes loads that are predicated-false to be sent straight to the commit stage directly, not waiting for return of the data that was never requested since it was predicated-false.
2010-08-23	CPU: Set a default value when readBytes faults.	Ali Saidi
	This was being done in read(), but if readBytes was called directly it wouldn't happen. Also, instead of setting the memory blob being read to -1 which would (I believe) require using memset with -1 as a parameter, this now uses bzero. It's hoped that it's more specialized behavior will make it slightly faster.
2010-08-13	CPU: Add readBytes and writeBytes functions to the exec contexts.	Gabe Black