Age | Commit message (Collapse) | Author |
|
The previous implementation did a pair of nested RMW operations,
which isn't compatible with the way that locked RMW operations are
implemented in the cache models. It was convenient though in that
it didn't require any new micro-ops, and supported cmpxchg16b using
64-bit memory ops. It also worked in AtomicSimpleCPU where
atomicity was guaranteed by the core and not by the memory system.
It did not work with timing CPU models though.
This new implementation defines new 'split' load and store micro-ops
which allow a single memory operation to use a pair of registers as
the source or destination, then uses a single ldsplit/stsplit RMW
pair to implement cmpxchg. This patch requires support for 128-bit
memory accesses in the ISA (added via a separate patch) to support
cmpxchg16b.
|
|
Although the cache models support wider accesses, the ISA descriptions
assume that (for the most part) memory operands are integer types,
which makes it difficult to define instructions that do memory accesses
larger than 64 bits.
This patch adds some generic support for memory operands that are arrays
of uint64_t, and specifically a 'u2qw' operand type for x86 that is an
array of 2 uint64_ts (128 bits). This support is unused at this point,
but will be needed shortly for cmpxchg16b. Ideally the 128-bit SSE
memory accesses will also be rewritten to use this support.
Support for 128-bit accesses could also have been added using the gcc
__int128_t extension, which would have been less disruptive. However,
although clang also supports __int128_t, it's still non-standard.
Also, more importantly, this approach creates a path to defining
256- and 512-byte operands as well, which will be useful for eventual
AVX support.
|
|
Result of running 'hg m5style --skip-all --fix-white -a'.
|
|
For historical reasons, the ExecContext interface had a single
function, readMem(), that did two different things depending on
whether the ExecContext supported atomic memory mode (i.e.,
AtomicSimpleCPU) or timing memory mode (all the other models).
In the former case, it actually performed a memory read; in the
latter case, it merely initiated a read access, and the read
completion did not happen until later when a response packet
arrived from the memory system.
This led to some confusing things, including timing accesses
being required to provide a pointer for the return data even
though that pointer was only used in atomic mode.
This patch splits this interface, adding a new initiateMemRead()
function to the ExecContext interface to replace the timing-mode
use of readMem().
For consistency and clarity, the readMemTiming() helper function
in the ISA definitions is renamed to initiateMemRead() as well.
For x86, where the access size is passed in explicitly, we can
also get rid of the data parameter at this level. For other ISAs,
where the access size is determined from the type of the data
parameter, we have to keep the parameter for that purpose.
|
|
The key parameter can be used to read out various config parameters from
within the simulated software.
|
|
These are packed single-precision approximate reciprocal operations,
vector and scalar versions, respectively.
This code was basically developed by copying the code for
sqrtps and sqrtss. The mrcp micro-op was simplified relative to
msqrt since there are no double-precision versions of this operation.
|
|
fild loads an integer value into the x87 top of stack register.
fucomi/fucomip compare two x87 register values (the latter
also doing a stack pop).
These instructions are used by some versions of GNU libstdc++.
|
|
Added explicit data sizes and an opcode type for correct execution.
|
|
This patch updates the x86 decoder so that it can decode instructions with vex
prefix. It also updates the isa with opcodes from vex opcode maps 1, 2 and 3.
Note that none of the instructions have been implemented yet. The
implementations would be provided in due course of time.
|
|
All x87 misc registers are implemented in an array of 64 bit values
but in real hardware the size of some of these registers is smaller.
Previsouly all 64 bits where incorrectly set and then later read. To
ensure correctness we mask the value in setMiscRegNoEffect to write
only the valid bits.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
Same exception is raised whether division with zero is performed or the
quotient is greater than the maximum value that the provided space can hold.
Divide-by-Zero is the AMD terminology, while Divide-Error is Intel's.
|
|
|
|
When running with the Exec flag, the mwait instruction attempted
to print out its source registers, which were never actually
initialized. This led to sporadic assertion failures when the
value stored there was invalid.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
Makes x86-style locked operations even more distinct from
LLSC operations. Using "locked" by itself should be
obviously ambiguous now.
|
|
This patch corrects the FXSAVE and FXRSTOR Macroops. The actual code used for
saving/restore the FP registers is in the file but it was not used.
The FXSAVE and FXRSTOR instructions are used in the kernel for saving and
loading the state of the mmx,xmm and fpu registers.
This operation is triggered in FS by issuing a Device Not Available Fault. The
cr0 register has a TS flag that is set upon each context change. Every time a
task access any FP related register (SIMD as well) if the TS flag is set to
one, the device not available fault is issued. The kernel saves the current
state of the registers, and restore the previous state of the currently running
task.
Right now Gem5 lacks of this capability. the Device Not Available Fault is
never issued, leading to several problems when different threads share the same
CPU and SMT is not used. The PARSEC Ferret benchmark is an example of this
behavior.
In order to test this a hack in the atomic cpu code was done to detect if a
static instruction has any FP operands and the cr0 reg TS bit is set. This
check must be done in the ISA dependent code. But it seems to be tricky to
access the cr0 register while executing an instruction.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch implements the simd128 ADDSUBPD instruction for the x86 architecture.
Tested with a simple program in assembly language which executes the
instruction. Checked that different versions of the instruction are executed
by using the execution tracing option.
Committed by: Nilay Vaish <nilay@cs.wisc.edu
|
|
Instead of counting the number of opcode bytes in an instruction and recording
each byte before the actual opcode, we can represent the path we took to get to
the actual opcode byte by using a type code. That has a couple of advantages.
First, we can disambiguate the properties of opcodes of the same length which
have different properties. Second, it reduces the amount of data stored in an
ExtMachInst, making them slightly easier/faster to create and process. This
also adds some flexibility as far as how different types of opcodes are
handled, which might come in handy if we decide to support VEX or XOP
instructions.
This change also adds tables to support properly decoding 3 byte opcodes.
Before we would fall off the end of some arrays, on top of the ambiguity
described above.
This change doesn't measureably affect performance on the twolf benchmark.
--HG--
rename : src/arch/x86/isa/decoder/three_byte_opcodes.isa => src/arch/x86/isa/decoder/three_byte_0f38_opcodes.isa
rename : src/arch/x86/isa/decoder/three_byte_opcodes.isa => src/arch/x86/isa/decoder/three_byte_0f3a_opcodes.isa
|
|
The data size used for actually writing the base value for the segment was the
default size, but really it should set the entire value without any possible
truncation.
|
|
The far pointer should be shifted right to get the selector value, not left.
Also, when calculating the width of the offset, the wrong register was used in
one spot.
|
|
Mwait works as follows:
1. A cpu monitors an address of interest (monitor instruction)
2. A cpu calls mwait - this loads the cache line into that cpu's cache.
3. The cpu goes to sleep.
4. When another processor requests write permission for the line, it is
evicted from the sleeping cpu's cache. This eviction is forwarded to the
sleeping cpu, which then wakes up.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch takes quite a large step in transitioning from the ad-hoc
RefCountingPtr to the c++11 shared_ptr by adopting its use for all
Faults. There are no changes in behaviour, and the code modifications
are mostly just replacing "new" with "make_shared".
|
|
The o3 cpu relies upon instructions that suspend a thread context being
flagged as "IsQuiesce". If they are not, unpredictable behavior can occur.
This patch fixes that for the x86 ISA.
|
|
This patch sets op class of two fp instructions: movfp and pop x87 stack
as IntAluOp since these instructions do not make use of the fp alu.
|
|
This patch encompasses several interrelated and interdependent changes
to the ISA generation step. The end goal is to reduce the size of the
generated compilation units for instruction execution and decoding so
that batch compilation can proceed with all CPUs active without
exhausting physical memory.
The ISA parser (src/arch/isa_parser.py) has been improved so that it can
accept 'split [output_type];' directives at the top level of the grammar
and 'split(output_type)' python calls within 'exec {{ ... }}' blocks.
This has the effect of "splitting" the files into smaller compilation
units. I use air-quotes around "splitting" because the files themselves
are not split, but preprocessing directives are inserted to have the same
effect.
Architecturally, the ISA parser has had some changes in how it works.
In general, it emits code sooner. It doesn't generate per-CPU files,
and instead defers to the C preprocessor to create the duplicate copies
for each CPU type. Likewise there are more files emitted and the C
preprocessor does more substitution that used to be done by the ISA parser.
Finally, the build system (SCons) needs to be able to cope with a
dynamic list of source files coming out of the ISA parser. The changes
to the SCons{cript,truct} files support this. In broad strokes, the
targets requested on the command line are hidden from SCons until all
the build dependencies are determined, otherwise it would try, realize
it can't reach the goal, and terminate in failure. Since build steps
(i.e. running the ISA parser) must be taken to determine the file list,
several new build stages have been inserted at the very start of the
build. First, the build dependencies from the ISA parser will be emitted
to arch/$ISA/generated/inc.d, which is then read by a new SCons builder
to finalize the dependencies. (Once inc.d exists, the ISA parser will not
need to be run to complete this step.) Once the dependencies are known,
the 'Environments' are made by the makeEnv() function. This function used
to be called before the build began but now happens during the build.
It is easy to see that this step is quite slow; this is a known issue
and it's important to realize that it was already slow, but there was
no obvious cause to attribute it to since nothing was displayed to the
terminal. Since new steps that used to be performed serially are now in a
potentially-parallel build phase, the pathname handling in the SCons scripts
has been tightened up to deal with chdir() race conditions. In general,
pathnames are computed earlier and more likely to be stored, passed around,
and processed as absolute paths rather than relative paths. In the end,
some of these issues had to be fixed by inserting serializing dependencies
in the build.
Minor note:
For the null ISA, we just provide a dummy inc.d so SCons is never
compelled to try to generate it. While it seems slightly wrong to have
anything in src/arch/*/generated (i.e. a non-generated 'generated' file),
it's by far the simplest solution.
|
|
With (upcoming) separate compilation, they are useless. Only
link-time optimization could re-inline them, but ideally
feedback-directed optimization would choose to do so only for
profitable (i.e. common) instructions.
|
|
|
|
|
|
|
|
|
|
This is an implementation of the x86 int3 and int immediate
instructions for long mode according to 'AMD64 Programmers Manual
Volume 3'.
|
|
Convert condition code registers from being specialized
("pseudo") integer registers to using the recently
added CC register class.
Nilay Vaish also contributed to this patch.
|
|
|
|
|
|
The x87 FPU supports three floating point formats: 32-bit, 64-bit, and
80-bit floats. The current gem5 implementation supports 32-bit and
64-bit floats, but only works correctly for 64-bit floats. This
changeset fixes the 32-bit float handling by correctly loading and
rounding (using truncation) 32-bit floats instead of simply truncating
the bit pattern.
80-bit floats are loaded by first loading the 80-bits of the float to
two temporary integer registers. A micro-op (cvtint_fp80) then
converts the contents of the two integer registers to the internal FP
representation (double). Similarly, when storing an 80-bit float,
there are two conversion routines (ctvfp80h_int and cvtfp80l_int) that
convert an internal FP register to 80-bit and stores the upper 64-bits
or lower 32-bits to an integer register, which is the written to
memory using normal integer stores.
|
|
X87 store instructions typically loads and pops the top value of the
stack and stores it in memory. The current implementation pops the
stack at the same time as the floating point value is loaded to a
temporary register. This will corrupt the state of the x87 stack if
the store fails. This changeset introduces a pop87 micro-instruction
that pops the stack and uses this instruction in the affected
macro-instructions to pop the stack after storing the value to memory.
|
|
The current implementation of the x87 never updates the x87 tag
word. This is currently not a big issue since the simulated x87 never
checks for stack overflows, however this becomes an issue when
switching between a virtualized CPU and a simulated CPU. This
changeset adds support, which is enabled by default, for updating the
tag register to every floating point microop that updates the stack
top using the spm mechanism.
The new tag words is generated by the helper function
X86ISA::genX87Tags(). This function is currently limited to flagging a
stack position as valid or invalid and does not try to distinguish
between the valid, zero, and special states.
|
|
This changeset actually fixes two issues:
* The lfpimm instruction didn't work correctly when applied to a
floating point constant (it did work for integers containing the
bit string representation of a constant) since it used
reinterpret_cast to convert a double to a uint64_t. This caused a
compilation error, at least, in gcc 4.6.3.
* The instructions loading floating point constants in the x87
processor didn't work correctly since they just stored a truncated
integer instead of a double in the floating point register. This
changeset fixes the old microcode by using lfpimm instruction
instead of the limm instructions.
|
|
The current implementation of fprem simply does an fmod and doesn't
simulate any of the iterative behavior in a real fprem. This isn't
normally a problem, however, it can lead to problems when switching
between CPU models. If switching from a real CPU in the middle of an
fprem loop to a simulated CPU, the output of the fprem loop becomes
correupted. This changeset changes the fprem implementation to work
like the one on real hardware.
|
|
This changeset fixes two problems in the FABS and FCHS
implementation. First, the ISA parser expects the assignment in
flag_code to be a pure assignment and not an and-assignment, which
leads to the isa_parser omitting the misc reg update. Second, the FCHS
and FABS macro-ops don't set the SetStatus flag, which means that the
default micro-op version, which doesn't update FSW, is executed.
|
|
Currently call and return instructions are marked as IsCall and IsReturn. Thus, the
branch predictor does not use RAS for these instructions. Similarly, the number of
function calls that took place is recorded as 0. This patch marks these instructions
as they should be.
|
|
Currently all the integer microops are marked as IntAluOp and the floating
point microops are marked as FloatAddOp. This patch adds support for marking
different microops differently. Now IntMultOp, IntDivOp, FloatDivOp,
FloatMultOp, FloatCvtOp, FloatSqrtOp classes will be used as well. This will
help in providing different latencies for different op class.
|
|
The 'lret' instruction reloads instruction pointer and code segment from the
stack and then pops them. But the popping part is missing from the current
implementation. This caused incorrect behavior in some code related to the
Fiasco OS. Microops are being added to rectify the behavior of the instruction.
Committed by: Nilay Vaish <nilay@cs.wisc.edu>
|
|
This patch implements ftan, fprem, fyl2x, fld* floating-point instructions.
|
|
This patch fixes the warnings that clang3.2svn emit due to the "-Wall"
flag. There is one case of an uninitialised value in the ARM neon ISA
description, and then a whole range of unused private fields that are
pruned.
|
|
|
|
|
|
|
|
Used as a command in full-system scripts helps the user ensure the benchmarks have finished successfully.
For example, one can use:
/path/to/benchmark args || /sbin/m5 fail 1
and thus ensure gem5 will exit with an error if the benchmark fails.
|
|
This patch implements the fnstsw instruction. The code was originally written
by Vince Weaver. Gabe had made some comments about the code, but those were
never addressed. This patch addresses those comments.
|
|
This patch implements the fsincos instruction. The code was originally written
by Vince Weaver. Gabe had made some comments about the code, but those were
never addressed. This patch addresses those comments.
|