Instruction-Level Simulation And Tracing
This is $Revision: 1.107 $, last updated $Date: 2004/06/08 17:36:44 $.
For an up-to-date version, please check www.xsim.com/bib.
WARNING: THIS PAGE IS STILL UNDER CONSTRUCTION.
Places that are known to have dubious or absent information are marked with
.
There is also a simulators mailing list.
To subscribe, write to <majordomo@xsim.com>.
For sample messages see
here.
Quick Index
The most important thing is what does it do?
If you are building or using a simulator you need to be concerned
at some level about the implementation.
But first you need to figure out what you want it to do.
Why aren't you using the real thing?
Do you want an accurate simulation?
If yes, use real hardware.
If no, make up the numbers.
NullSIM.
It's not as accurate, but it's cheaper and faster
than any other simulation tool.
It's the only universal simulator!
Tired of configuring your simulator to do exactly what you want?
Use NullSIM, with a familiar user interface
and predictable results!
Instruction-set simulators can execute programs written or compiled for
computers that do not yet exist, which no longer exist, or which are
more expensive to purchase than to simulate.
Simulators can also provide access to internal state that is invisible
on the real hardware, can provide deterministic execution in the face
of races, and can be used to ``stress test'' for situations that are
hard to produce on real hardware.
Instruction-level tracing can provide detailed information about the
behavior of programs; that information drives analyzers that analyze
or predict behavior of various system components and which can, in
turn, improve the design and implementation of everything from
architectures to compilers to applications.
Although simulators and tracing tools appear to perform different
tasks, in practice they do much the same work: both manipulate
machine-level details, and both use similar implementation techniques.
This web page is a jumping-off point for lots of work related to
instruction-level simulation and tracing.
Please contribute!
Please send comments, contributions, and suggestions to
`pardo@xsim.com'.
If you'd like to help edit this page, there is
lots that needs to be done;
your help is appreciated.
This web page also lists a few OS emulation tools.
Although these don't specifically fit the category of tools
covered by this page,
it's interesting to consider whether you could glue together a
processor emulator and an OS emulator and wind up with a whole
simulated system.
To date, whole simulated systems are built as integrated tools,
rather than being assembled modularly.
Some terminology:
- Simulation is recreating an environment in enough detail
that desired effects of a ``real'' system can be observed.
- Instruction-Set Simulation is simulating a processor at the
instruction-set level.
Instruction-set simulation is simulation that is detailed enough
to run executable programs intended for the machine being
simulated.
Both more-detailed and less-detailed simulation are possible:
for example, timing-accurate or RTL (register transfer level)
simulation is more detailed, whereas bus architecture or cluster
simulation is less detailed.
- Emulation is simulation that uses special hardware
assistance
[RFD 72],
[Tucker 65],
[Wilkes 69].
- The target machine is the one being simulated;
the host machine is the one where the simulation runs.
This terminology parallels retargetable compiler terminology.
However, there is no standard terminology where the simulation
framework is produced on yet a third platform.
That is, a target simulator which runs on special
host hardware often has the simulation software
compiled on a general-purpose machine.
Some have suggested ``generation host'' or ``ghost''
for the machine where the software is created,
which suggests that the place where the simulator runs is the
``runtime host'' or ``rhost'' (pronounced ``roast'').
See also the Glossary.
A list of tools, organized according to various interesting features.
See also a listing of tools ordered
alphabetically.
Interesting things about the tools include:
Simulation and tracing tools can perform a wide variety of tasks.
Here are some common uses:
- atr: address tracing
Classical ``address tracing'' gathers a list of instruction
and/or data memory references performed by a system.
There are many variations, such as
tracing only targets of control transfers
or tracing other resources.
- db: debugging
A simulator can help with debugging
because: it runs deterministically and repeatably;
it is possible to query system state without disturbing it;
the simulator can be backed up to an earlier checkpoint in order
to implement reverse execution
(``foo is twelve ... what was the value of bar
in the routine we just returned from?'');
and because a simulator can perform consistency checks that cannot
be done on real hardware.
- otr: other tracing and event counting
A generalization of address tracing is to
trace, count, or categorize events on any kind of processor or
system event or resource.
For example, a tool may collect
the common values of variables; register usage patterns;
interrupt or exception event counts, timing information, and so on.
- sim: (instruction set) simulation
Simulators commonly implement a processor architecture
that does not yet exist or that no longer exists.
Simulators can also implement other devices such as
memory, bus, I/O devices, user input, and so on.
- tb: tool building
Here, ``tool building'' is meant to encompass tools
that are used to build other tools,
for example, a tool that builds various tracing tools is a
tool-building tool, whereas a configurable cache simulator
is not.
The usual distinction is that a tool-building tool can be
extended
[NG87,
NG88]
using a general-purpose programming language
(e.g. C, C++, ...), whereas a configurable tool is programmed
with a less-powerful language e.g. a list of
cache size, line size, associativity, etc.
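To make the distinction concrete, here is a minimal sketch of a ``configurable tool'' in the sense above: a cache simulator that is programmed only by a list of cache size, line size, and associativity, and that consumes an address trace. The function name, the LRU replacement policy, and the sample trace are assumptions made for illustration; they are not taken from any tool listed on this page.

```python
from collections import OrderedDict

def simulate_cache(trace, cache_size, line_size, associativity):
    """Count (hits, misses) for a list of byte addresses, using LRU replacement."""
    num_sets = cache_size // (line_size * associativity)
    # One LRU-ordered dict of resident line tags per cache set.
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = misses = 0
    for addr in trace:
        line = addr // line_size
        index = line % num_sets
        tag = line // num_sets
        ways = sets[index]
        if tag in ways:
            hits += 1
            ways.move_to_end(tag)            # mark as most recently used
        else:
            misses += 1
            if len(ways) >= associativity:
                ways.popitem(last=False)     # evict the least recently used line
            ways[tag] = True
    return hits, misses

# A tiny, made-up address trace (byte addresses).
trace = [0, 4, 8, 64, 0, 128, 64]
print(simulate_cache(trace, cache_size=128, line_size=16, associativity=1))
print(simulate_cache(trace, cache_size=128, line_size=16, associativity=2))
```

The only ``program'' the user supplies is the parameter list. A tool-building tool would instead let the user replace the body of the loop with arbitrary code.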
In addition, some tools are used for
- No: Application errors
such as stores to random memory locations
may cause the simulation or tracing tool to fail
or produce spurious answers,
or may cause the application program to fail
in an unexpected (unintended) way or produce spurious answers.
- Some:
Certain kinds of errors are detected or serviced.
For example, application errors may be constrained
so that they can clobber application data in random ways
but that they cannot cause the simulation or tracing tool
to fail or produce erroneous results.
- Yes:
Application errors are detected and handled in some
predictable way.
Typically, ``predictable'' means that the error
model is the same as a reference for the target architecture.
- Yes*:
Selectable; turning on checking may slow execution.
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- No
- Y1:
multiplexes all target processors on a single host processor
- Y=:
same number of host and target processors
(to be precise, there should also be a ``Y-'' category
for several host processors per target processor).
- Y+:
can multiplex a large number of target processors
onto a potentially smaller number of host processors
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- No
- S: yes, but not all kinds.
For example, a tracing tool might execute the traced program
correctly but fail to trace signal handlers.
- Yes
(Detail)
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- d: device
- u: user
- s: system
Note: the system mode may be marked in parentheses,
e.g. (s),
indicating that the host processor does not have a distinct
system mode in hardware,
but the tool is intended to work with
(simulate, trace, etc.) operating system code.
Processor simulators typically implement either a full processor
or just the user-mode part of the instruction set.
A full simulation is more precise and allows analysis of
operating systems, etc.
However, it also requires implementing the processor's
protected mode architecture, simulated devices, etc.
An alternative is to implement just the user-mode portion
of the ISA and to implement system calls (transitions to
protected mode) using simulator code rather than by simulating
the operating system.
OS emulation is typically less accurate
THIS CATEGORY NOT YET ORGANIZED, SEE THE
SHADE PAPER.
- asm: assembly code
- exe: executable code, no symbol table information
- exe*: executable code, with symbol table information
- hll: high-level language
``Decompilation technology'' here refers to the process of analyzing a
(machine code)
fragment and, through analysis, creating some higher-level
information about the fragment.
For simulation and tracing tools, decompilation is typically simpler
than
static program decompilation,
in which the goal is to read a binary program and produce source code
for it in some high-level language.
Simulation and tracing ``has it easy'' in comparison because it is
possible to get by with a lower-level representation and also to punt
hard problems to the runtime, when more information is available.
Even so, executable machine code is difficult to simulate and trace
efficiently (within 2 orders of magnitude of the performance
of native execution) when using ``naive'' instruction-by-instruction
translation,
because lots of relevant information is unavailable statically.
For example, every instruction is potentially a branch target;
every word of memory is potentially used both as code and as data;
every mutable word of memory is potentially executed, modified
(at runtime), and then executed again; and so on.
Executable machine code is also inherently (target) machine-dependent
and thus lexing and parsing the machine code is a source of potential
portability problems.
(Note that
some tools use a high-level input, so that relatively little
analysis is needed to determine the original programmer's intent,
at least at a level needed to simulate the program with modest efficiency.)
The following is a list of tools and papers that show how to reduce
the overhead of analyzing each instruction;
how to reduce the number of times each instruction is analyzed;
how to perform optimistic analysis and recover when it's wrong;
and how to improve the abstraction of machine-dependent parts of the
tool.
A short list:
A slightly longer list:
The ``simulation technology'' is how the original machine instructions
(or other source representation) get translated into an executable
representation that is suitable for simulation and/or tracing.
Choices include:
- ddi: Decode-and-dispatch
interpretation: the input representation for an operation is
fetched and decoded each time it is executed.
- pdi: Predecode
interpretation:
the input form is translated into a form that is faster to
decode; that form is then saved so that successive invocations
(e.g. subsequent iterations of a loop) need only fetch and
decode the ``fast'' form.
Note that
- The translation may happen before program invocation,
during startup, or incrementally during execution; and
that the translated form may be discarded and regenerated.
- If the original instructions change, the translated
form becomes incoherent with the original
representation; a system that fails to update
(invalidate) the translated form before it is then
reexecuted will simulate the old instructions
instead of the new ones. For some systems (e.g., those
with hardware coherent instruction caches) such
behavior is erroneous.
- tci: Threaded code
interpretation:
a particularly common and efficient form of predecode
interpretation.
- scc: Static
cross-compilation:
The input form is statically (before program execution)
translated from the target instruction set to the host
instruction set.
Note that:
- All translation costs are paid statically, so runtime
efficiency may be very good.
In contrast, dynamic analysis and transformation costs
are paid during simulation, and so it may be necessary
to ``cut corners'' with dynamic translation in order to
manage the runtime cost.
Cutting corners may affect both the quality of
analysis of the original program and the quality of
code generation.
- Instructions that cannot be located statically
or which do not exist until runtime cannot be
translated statically.
- Historically, it is difficult to distinguish between
memory words that are used for instructions and those
that are used for data; translating data as
instructions may cause errors.
- Translating to machine code allows the use of the
host hardware's instruction fetch/decode/dispatch
hardware to help simulate the target's.
- Translating to machine code makes it easier to
translate clumps of target instructions;
most dispatching between target instructions is thus
eliminated.
- dcc: Dynamic Cross
Compilation:
Host machine code is generated dynamically, as the program
runs.
Note that:
- Translating ``on demand'' eases the problem of
determining what is code and what is data; a given
word may even be used as both code and data.
- Translating to machine code is often more expensive
than translating to other representations; both the
cost of generating the machine code and the cost of
executing it contribute to the overall execution time.
- Theoretical performance advantages from dynamic
cross-compilation may be overwhelmed by the host's
increased cache miss ratio due to dynamic
cross-compilation's larger code sizes
[Pittman 95].
- aug: Augmentation:
cross-compilation
where the host and target are the same machine.
Note that
- Augmentation is typically done statically.
- There is a fine line between having identical host and
target machines (augmentation) and having
nearly-identical machines in which just a few
features (e.g. memory references) are simulated, but
in which the bulk of instruction sets and encodings are
identical.
- emu: Emulation:
Where software simulation is sped up using hardware
assistance.
``Hardware assistance'' might include special compatibility
modes but might also include careful use of page mappings.
(See ``emulation''.)
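As an illustration of the first two choices, the following sketch contrasts decode-and-dispatch interpretation with predecode interpretation for a made-up 16-bit instruction set. The opcodes, the word layout (4-bit opcode, 4-bit register, 8-bit immediate), and all function names are hypothetical. The predecoded form stores a handler function per instruction, which is loosely in the spirit of threaded code, although real threaded-code systems usually store code addresses rather than Python callables.

```python
LOADI, ADDI, DJNZ, HALT = 0, 1, 2, 15   # hypothetical opcodes

def enc(op, rd=0, imm=0):
    """Pack a 16-bit word: 4-bit opcode, 4-bit register, 8-bit immediate."""
    return (op << 12) | (rd << 8) | imm

def run_ddi(program):
    """Decode-and-dispatch: every word is decoded each time it executes."""
    regs, pc = [0] * 16, 0
    while True:
        word = program[pc]                                        # fetch
        op, rd, imm = word >> 12, (word >> 8) & 0xF, word & 0xFF  # decode
        pc += 1
        if op == LOADI:
            regs[rd] = imm
        elif op == ADDI:
            regs[rd] = (regs[rd] + imm) & 0xFFFF
        elif op == DJNZ:                 # decrement, jump if not zero
            regs[rd] = (regs[rd] - 1) & 0xFFFF
            if regs[rd]:
                pc = imm
        elif op == HALT:
            return regs
        else:
            raise ValueError(f"bad opcode {op}")

def do_loadi(regs, pc, rd, imm):
    regs[rd] = imm
    return pc + 1

def do_addi(regs, pc, rd, imm):
    regs[rd] = (regs[rd] + imm) & 0xFFFF
    return pc + 1

def do_djnz(regs, pc, rd, imm):
    regs[rd] = (regs[rd] - 1) & 0xFFFF
    return imm if regs[rd] else pc + 1

def do_halt(regs, pc, rd, imm):
    return None                          # None stops the interpreter loop

HANDLERS = {LOADI: do_loadi, ADDI: do_addi, DJNZ: do_djnz, HALT: do_halt}

def run_pdi(program):
    """Predecode: translate each word once to (handler, rd, imm), then reuse."""
    decoded = [(HANDLERS[w >> 12], (w >> 8) & 0xF, w & 0xFF) for w in program]
    regs, pc = [0] * 16, 0
    while pc is not None:
        handler, rd, imm = decoded[pc]   # fetch the ``fast'' form only
        pc = handler(regs, pc, rd, imm)
    return regs

# r1 counts down from 5; each iteration adds 3 to r2.
program = [enc(LOADI, 1, 5), enc(LOADI, 2, 0), enc(ADDI, 2, 3),
           enc(DJNZ, 1, 2), enc(HALT)]
assert run_ddi(program) == run_pdi(program)
```

In this sketch the whole program is predecoded before execution. As noted above, the translation could equally happen during startup or incrementally, and the translated form must be invalidated if the original instructions change.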
Move an instruction from one place to another,
but execute with the same
host
and
target.
Compile instruction sequences from a target
machine to run on a
host
machine.
Simulation and tracing tools that perform execution
using interpretation;
the original executable code is neither preprocessed
(augmentation or static cross-compilation)
nor is it dynamically compiled to
host
code.
Statically
cross-compile instruction sequences from a
target
machine to run on some
host
machine.
Augmentation-based tracing tools run
host
instructions native,
but some instructions are simulated.
For example,
Proteus executes arithmetic and stack-relative memory reference
instructions native,
and simulates load and store instructions that may reference
shared memory.
Some tools rely on having multiple strategies
in order to achieve their desired functionality.
For the purposes here,
``untraced native execution''
counts as a translator.
- 1951: EDSAC Debug
(displaced execution, native execution)
- 1991: Dynascope
(interpretation, native execution)
- 1992: Accelerator
(static cross-compilation, interpretation)
- 1993: MINT
(dynamic cross-compilation, interpretation)
- 1993: Vest and mx
(static cross-compilation, interpretation)
- 1994: Executor
(interpretation, dynamic cross-compilation)
- 1994: SimICS
(interpretation, dynamic cross-compilation)
- 1995: FreePort Express
(static cross-compilation, interpretation;
uses Vest and mx technology)
Some tools/papers not listed under other headings.
THIS CATEGORY NOT YET ORGANIZED.
Generally, the closer the match between the
host
and the
target,
the easier it is to write a simulator,
and the better the efficiency.
Possible mismatches include:
- Byte or word size.
For example,
Kx10
simulates a machine with 36-bit words;
it runs on machines with 32-bit and 64-bit words.
- Numeric representation.
For example, whether integers are sign-magnitude,
one's complement, or two's complement.
Or, for example,
Vest,
which simulates all VAX floating-point formats
on a host machine that lacks some of the VAX formats.
- Which instruction combinations cause exceptions,
and how those exceptions are reported.
- Synchronization and atomicity.
In particular, the details may be messy
where the target machine synchronizes
implicitly and the host does so explicitly,
since all target operations that might
cause synchronization generally need to be treated as if they
do.
Note that target support for self-modifying code may be treated as a
special case of synchronization.
For example, target machines with no caches or unified instruction and
data caches will typically write instructions using ordinary store
instructions.
Therefore, all store instructions must be treated as potential
code-modifying instructions.
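A minimal sketch of what ``treat every store as a potential code-modifying instruction'' can mean for a predecoding simulator: each simulated store also discards any predecoded entry for the address it writes. The class name, the per-word cache granularity, and the decode format are assumptions for illustration; real tools often invalidate at page or block granularity instead.

```python
class PredecodedMemory:
    """Simulated memory with a per-word cache of decoded instructions."""

    def __init__(self, words):
        self.words = list(words)
        self.decoded = {}        # address -> decoded (op, rd, imm) tuple
        self.decode_count = 0    # how many times decoding was actually done

    def fetch_decoded(self, addr):
        entry = self.decoded.get(addr)
        if entry is None:
            word = self.words[addr]
            # Hypothetical layout: 4-bit opcode, 4-bit register, 8-bit immediate.
            entry = (word >> 12, (word >> 8) & 0xF, word & 0xFF)
            self.decoded[addr] = entry
            self.decode_count += 1
        return entry

    def store(self, addr, value):
        self.words[addr] = value & 0xFFFF
        # Any store may overwrite code, so drop the stale translation.
        self.decoded.pop(addr, None)

mem = PredecodedMemory([0x1203])
first = mem.fetch_decoded(0)     # decoded once
again = mem.fetch_decoded(0)     # served from the predecoded form
mem.store(0, 0x2405)             # the program overwrites its own instruction
fresh = mem.fetch_decoded(0)     # decoded again, reflecting the new word
```

Without the `pop` in `store`, the simulator would keep executing the old instruction, which is the incoherence described earlier for predecode interpretation.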
For timing-accurate simulation
(see Talisman
and RSIM),
some matches between the host and target can improve the efficiency,
but many do not.
THIS CATEGORY NOT YET ORGANIZED.
Some instruction-set simulators also perform timing simulation.
Timing is not strictly an element of instruction-set simulation, but is often
useful, since one major use for instruction set simulation is to
collect information for predicting or analyzing performance.
Important features of timing simulation include both the processor
pipeline and the memory system
(see Talisman
and RSIM).
There are many ways to measure performance.
Some common metrics include:
- host instructions executed per target instruction executed;
- host cycles executed per target instruction executed;
- relative wallclock time of host and target
Metrics that are more abstract have the advantage that they are
typically
simple to reason about
and applicable across a variety of implementations.
For example, host instructions may be counted relatively easily for each
of a variety of target instructions,
and the counts are relatively isolated from the structure of the caches
and microarchitecture.
Conversely, concrete metrics tend to more accurately reflect all related
costs.
For example, the effects of caches and microarchitectures are
included.
It is worth noting that few reports give enough information about the
measurement methodology to allow a valid comparison.
For example, if dilation is ``typically'' 20x, what is ``typical'', and
what is the performance for ``non-typical'' workloads?
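The metrics above are simple ratios, and a short sketch shows why a single ``typical'' number can hide a lot. The function names and the workload timings below are invented for illustration; they are not measurements of any tool.

```python
def host_per_target(host_instructions, target_instructions):
    """Abstract metric: host instructions executed per target instruction."""
    return host_instructions / target_instructions

def dilation(simulated_seconds, native_seconds):
    """Concrete metric: wallclock slowdown of simulation relative to native."""
    return simulated_seconds / native_seconds

def dilation_range(runs):
    """runs maps workload name -> (simulated_seconds, native_seconds)."""
    ratios = {name: dilation(sim, nat) for name, (sim, nat) in runs.items()}
    return min(ratios.values()), max(ratios.values()), ratios

# Made-up timings: the same tool looks 20x or 60x slower depending on workload.
runs = {"compute-bound": (40.0, 2.0), "store-heavy": (90.0, 1.5)}
low, high, per_workload = dilation_range(runs)
print(low, high, per_workload)
```

Reporting the range (and the workloads behind it) rather than one ratio is one way to make results comparable across reports.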
THIS CATEGORY NOT YET ORGANIZED.
The status of a tool:
- info:
only information is available
- nonprod:
the tool is available but is not a product
- product:
the tool is a commercial product
Longer writeups and cross-references.
Some of the tools here have bibliographic entries, home pages or
online papers, noted with ``See: ...''.
Many are also described and referenced in
the 1994 SIGMETRICS Shade paper, noted with
``See:
Shade''.
See here for a list of tools.
See:
The listed tools include:
- Gemulator [IBM PC] (Atari ST emulator)
- ST XFormer [Atari ST] (Atari 130XE emulator)
- PC XFormer 2 [IBM PC] (Atari 800 emulator)
- PC XFormer 3 [IBM PC] (Atari 130XE emulator)
See:
The listed tools include Apple II emulators:
- Apple 2000 [Amiga]
- AppleOnAmiga [Amiga]
- STM [Macintosh]
- YAE [Unix/X]
See:
The listed tools include Macintosh emulators:
- AMax [Amiga] (software + hardware)
- Emplant [Amiga] (software + hardware)
- ShapeShifter [Amiga]
- MAE [Unix/X]
See:
ATOM is built on top of
OM.
See: bib cite, Shade
As of 1994, Cerberus was being actively used and updated by
<csa@transmeta.com>,
who might be willing to provide information and/or code.
See:
Amiga
PET
VIC20
See
See:
Crusoe is an x86 emulator.
It both interprets x86 instructions and also
translates x86 instructions to a host VLIW instruction set;
translations are cached for reuse.
The host instruction set is not exported, only target instructions may
be executed.
A demonstration Crusoe executed both x86 and Java instructions.
Categories:
See:
See:
A prototype/research vehicle for decompiling DOS EXE binary files.
It uses digital signatures to determine library function calls and
the original compiler.
The EDSAC Debugger uses a tracing simulator that operates
by:
fetching the simulated instruction;
decoding it to save trace information;
checking to see if the instruction is a branch,
and updating the simulated program counter if it is;
else placing the instruction in the middle of the simulator loop
and executing it directly;
and then returning to the top of the simulator loop.
As an aside, the 1951 paper on the EDSAC debugger
contains a pretty complete description of a modern debugger...
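The loop described above can be sketched as follows. This is a loose model only: the instruction names, the tuple encoding, and the single accumulator are hypothetical and are not the EDSAC order code, and ``executing it directly'' is modeled by calling a native Python function rather than by placing the instruction in the simulator loop.

```python
# Non-branch operations that the simulator hands to the ``real machine''.
NATIVE = {
    "add": lambda acc, n: acc + n,
    "sub": lambda acc, n: acc - n,
}

def trace_run(program, max_steps=1000):
    """Run (op, operand) pairs, recording a trace entry for every instruction."""
    acc, pc, trace = 0, 0, []
    for _ in range(max_steps):
        op, operand = program[pc]              # fetch and decode
        trace.append((pc, op, operand))        # save trace information
        if op == "halt":
            break
        if op == "jump_if_neg":
            # Branches are simulated: only the simulated PC is updated.
            pc = operand if acc < 0 else pc + 1
        else:
            # Everything else is executed directly, then control returns
            # to the top of the simulator loop.
            acc = NATIVE[op](acc, operand)
            pc += 1
    return acc, trace

program = [("add", -3), ("add", 1), ("jump_if_neg", 1), ("halt", 0)]
acc, trace = trace_run(program)
print(acc, [pc for pc, _, _ in trace])
```

Simulating only the branches keeps the simulator in control of the program counter while letting the host do the real work for every other instruction.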
Categories:
See:
EEL reads object files and executables
and allows tools built on top of EEL to modify the machine code
without needing to be concerned with details of the underlying architecture or operating
system or with the consequences of adding or deleting code.
EEL appears as a C++ class.
EEL is provided with an executable,
which it analyzes, creating abstractions such as
executable (whole program), routines, CFGs, instructions and snippets.
A tool built on EEL then edits the executable by
performing structured rewrites of the EEL constructs;
EEL ensures that details of register allocation, branches, etc.
are updated correctly in the final code.
Categories:
See:
FLEX-ES (formerly OPEN/370) provides a System/390 on a Pentium.
It includes system-mode operation and runs 8 popular S/370 OSs.
On a 2-processor Pentium-II/400MHz,
it provides 7 to 8 MIPS on one processor and I/O functions on the other
processor.
They also sell installed systems (hardware/software turnkey systems).
Categories:
FLEX-ES home page.
FreePort Express
is a tool for converting Sun SPARC binaries to DEC Alpha AXP binaries.
See:
FreePort
Express web page
g88 is a portable simulator
that simulates both user and system-mode code.
It uses threaded code to achieve performance on the order
of a few tens of instructions per simulated instruction.
See:
g88 was written by
Robert Bedichek.
See:
Built on top of
OM.
``The Interpreter'' is a micro-architecture that is intended
for a variety of uses including emulation of existing or hypothetical
machines and program profiling.
An emulator is written in microcode;
microinstructions executed from the microstore give both
parallelism and fast execution.
Categories:
More detailed review:
- A brief (1,000 word) history of microprogramming.
- (pg. 715) Suggested applications: emulation of existing or hypothetical
machines; direct execution of high-level languages; tuning
the instruction set to the application (by iterative profiling
and instruction-set change).
- (pg. 715) ``Emulation is defined in this paper as the ability
to execute machine language programs intended for one machine
(the emulated machine) on another machine (the host machine).
Within this broad definition, any machine with certain basic
capabilities can emulate any other machine; however, a measure
of the efficiency of operation is generally implied when the
term emulation is used.
For example, ... a conventional computer [has poor] emulation
efficiency ... since for each machine language instruction of
the emulated machine there corresponds a sequence of machine
instructions to be executed on the host machine ... (called
simulation ... [Husson 70] and
[Tucker 65])
turns out to be
significantly more efficient on microprogrammable computers.
In a microprogrammed machine, the controls for performing the
fetching and execution of the machine instructions of the
emulated machine consist of a sequence of microinstructions
which allows the increased efficiency.''
In short, as Deutsch and Schiffman point
out, you get hardware support for instruction fetch and
decode, which are typically multi-instruction operations in
decode-and-dispatch interpreters.
- (pg. 717) Description of Interpreter features that help it
emulate a variety of machine architectures and instruction
encodings.
- (pg. 719) ``The basic items necessary to define a machine and
hence emulate it are:
- Memory structure (registers, stacks, etc.),
- Machine language format (0, 1, 2, 3 address)
including effective operand address calculation,
and
- Operation codes and their functional meaning
(instructions for arithmetic, branching, etc.).''
Note that you also need e.g. data formats, an exception model,
a device or other I/O model, ...
- (pg. 719) ``The process of writing an emulation therefore,
involves the following analysis and the microprogramming of
these basic items:
- Mapping the registers (stacks, etc.) from the emulated
machine onto the host machine.
- Analysis of the machine language format and addressing
structure.
- Analysis of each operation code defined in the machine
language.''
- (pg. 719-720) ``All of the registers of the emulated machine
must be mapped onto the host machine; that is, each register
must have a corresponding register on the host machine. The
most frequently used registers are mapped onto the registers
within the Interpreter Logic Unit (e.g., registers A1, A2,
A3). The remaining registers are stored either in the main
system memory or in special high speed registers depending on
the desired emulation speed [[Which, I assume, means ``do you
want R5 fast and R6 slow or R6 fast and R5 slow; it doesn't
make sense to me that they'd offer you slower emulation as a
feature --pardo]]. The machine language format may be 0, 1,
2, 3 address or variable length for increased code density,
and may involve indexing, indirection, relative addressing,
stacks and complex address conversions. Figure 14 shows the
general microprogram requirements (MPM Map) and operating
procedures for the emulation task.''
- Summary:
You're probably already familiar with the concepts in this paper.
The paper describes the overall structure of a classic
decode-and-dispatch interpreter;
this one happens to use microcode, but many of the
features are the same when using normal machine code.
The opportunity with microcode (which tends to be poorly
stated in all of these papers) is that writable microcode
allows the use of a machine with a very fast and very flexible
but very space-consuming instruction set; microcode makes such
an instruction set useful by providing a fast
mechanism for mapping that instruction set to a denser
representation (the one stored in primary memory).
In the particular case of emulation, much of the interpreter
can be written directly in the low-density machine code and
can take advantage of that code's
flexibility and performance without being hurt by the low
encoding density.
- See also:
[Rosin 69]
and
Deutsch's ST-80 VM,
written (largely) in Xerox Dorado microcode.
See:
This review/summary by Pardo.
MPtrace statically augments parallel programs written for the i386-based
Sequent Symmetry multiprocessor.
The instrumented programs are then run to generate
multiprocessor address traces.
See:
MPtrace was written by
David Keppel
and Eric J. Koldinger
under the supervision of
Susan J. Eggers
and
Henry M. Levy
Emulators:
See:
The New Jersey Machine Code Toolkit
lets programmers decode and encode machine instructions symbolically,
guided by machine specifications that describe mappings between symbolic
and machine (binary) forms.
It thus helps programmers write applications such as
assemblers, disassemblers, linkers, run-time code generators, tracing tools,
and other tools that consume or produce machine code.
Questions and comments can be sent to
`toolkit@cs.princeton.edu'.
See:
See:
Summary:
Virtual machines (VMs) provide greater flexibility and protection
but require the ability to run one operating system (OS) under the control of another.
In the absence of virtualization hardware, VMs are typically built
by porting the OS to run in user mode,
using a special kernel-level environment
or as a system-level simulator.
``Partial Emulation'' or a ``Lightweight Virtual Machine''
is an augmentation-based approach to system-level simulation:
directly execute most instructions,
statically rewrite and virtualize those instructions which are ``tricky''
due to running in a VM environment.
Compared to the other approaches, partial emulation offers
fewer OS modifications than user-mode execution
(user-mode Linux requires a machine description around 33,000 lines)
and higher performance than a full (all instructions) simulator
(Bochs is about 10x slower than native execution).
The implementation described here emulates all privileged instructions
and some non-privileged instructions.
One approach replaces each ``interesting'' instruction with illegal
instruction traps.
A second approach is to call emulation subroutines.
``Rewriting'' is done during compilation,
and the current implementation requires OS source code [EY 03].
The approach here must:
detect and emulate privileged and some non-privileged instructions;
redirect system calls and page faults to the user-level OS;
emulate an MMU;
emulate devices.
The implementation with illegal instruction traps
uses a companion process and debugger-type accesses
to simulate interesting instructions.
Otherwise,
the user-level OS and its processes are executed in a single host
process.
The ``illegal instruction trap'' approach inserts an illegal instruction
before each ``interesting'' instruction.
The companion process then skips the illegal instruction,
simulates the ``interesting'' instruction, then restarts the process.
It is about 1,500 lines of C code.
The ``procedure call'' approach is about 1,400 lines but is faster.
There are still out-of-process traps due to e.g., MMU emulation
(ala SimOS).
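The ``illegal instruction trap'' approach can be sketched with a toy model. Here instructions are just strings, the set of ``interesting'' mnemonics is made up, and the trap and the companion process are modeled inside a single loop; none of this is the code from [EY 03], which rewrites real IA-32 instructions during compilation.

```python
INTERESTING = {"iret", "mov_to_seg", "popf"}   # hypothetical mnemonics
TRAP = "illegal"                               # stands in for an illegal opcode

def rewrite(code):
    """Static pass: insert a trap before each interesting instruction."""
    out = []
    for instr in code:
        if instr in INTERESTING:
            out.append(TRAP)
        out.append(instr)
    return out

def run(code):
    """Execute rewritten code, logging how each instruction was handled."""
    log, pc = [], 0
    while pc < len(code):
        if code[pc] == TRAP:
            # Companion process: skip the trap, simulate the instruction
            # that follows it, then restart execution after that instruction.
            log.append(("emulated", code[pc + 1]))
            pc += 2
        else:
            log.append(("native", code[pc]))    # directly executed
            pc += 1
    return log

rewritten = rewrite(["add", "iret", "mov"])
print(rewritten)
print(run(rewritten))
```

In the ``procedure call'' variant, the rewrite step would instead replace each interesting instruction with a call to an emulation subroutine, avoiding the trap into another process.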
For IA-32, the ``interesting'' instructions are
mov, push, and pop instructions that
manipulate segment registers;
call, jmp, and ret instructions that
cross segment boundaries;
iret;
instructions that manipulate special registers;
and instructions that read and write (privileged bits of)
the flag register.
Not all host OSs have the right facilities to implement a partial
emulator.
Some target OS changes were needed.
For NetBSD, six address constants were changed
to avoid host OS conflicts,
and device drivers were removed.
For FreeBSD, BIOS calls were also replaced
with code that returned the needed values;
had they tried to run the BIOS,
the system would have needed to execute virtual 8086 mode.
User-level execution speed was similar to native.
For OS-intensive microbenchmarks, the
``illegal instruction trap'' implementation was at least 100x slower
than native (non-virtual) execution and slower than Bochs.
The ``procedure call'' approach was 3-5x faster,
but a little slower than Bochs and still 10x slower than VMware
which was in turn 4x-10x slower than native.
A test benchmark (patch) was 15x slower
using illegal instruction traps
and about 5x slower using procedure calls.
For comparison, VMware was about 1.1x slower.
The paper proposes using a separate host process
for each page table base register value
in order to reduce overhead for MMU emulation.
Categories:
Further reading:
``Running BSD Kernels as User Processes by Partial Emulation and
Rewriting of Machine Instructions'' [EY 03].
Simulates pipeline-level parallelism and memory system behavior.
See:
See:
Shade combines efficient instruction-set simulation
with
a flexible, extensible trace generation capability.
Efficiency is achieved by dynamically compiling and caching
code to simulate and trace the application program;
the cost is as low as two instructions per simulated instruction.
The user may control
the extent of
tracing
in various ways;
arbitrarily detailed application state information may be collected
during the simulation, but
tracing less translates directly into greater efficiency.
Current
Shade implementations run on SPARC systems and
simulate the SPARC (Versions 8 and 9)
and MIPS I instruction sets.
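The general idea of ``dynamically compiling and caching'' can be sketched as below. This is not Shade's implementation: the toy instruction set, the use of generated Python source as a stand-in for host machine code, and the cache keyed by target program counter are all assumptions for illustration, and no tracing code is generated.

```python
def translate_block(program, pc):
    """Generate and compile Python code for the straight-line run starting at pc."""
    lines = ["def block(regs):"]
    while True:
        instr = program[pc]
        op = instr[0]
        if op == "addi":                       # ("addi", rd, imm)
            lines.append(f"    regs[{instr[1]}] += {instr[2]}")
            pc += 1
        elif op == "djnz":                     # ("djnz", rd, target)
            lines.append(f"    regs[{instr[1]}] -= 1")
            lines.append(
                f"    return {instr[2]} if regs[{instr[1]}] != 0 else {pc + 1}")
            break
        elif op == "halt":                     # ("halt",)
            lines.append("    return None")
            break
        else:
            raise ValueError(f"unknown op {op!r}")
    namespace = {}
    exec("\n".join(lines), namespace)          # the ``compile'' step
    return namespace["block"]

def run(program, regs):
    """Run to halt; returns (regs, number of translations performed)."""
    cache, translations, pc = {}, 0, 0
    while pc is not None:
        block = cache.get(pc)
        if block is None:                      # translate on demand
            block = cache[pc] = translate_block(program, pc)
            translations += 1
        pc = block(regs)                       # each block returns the next PC
    return regs, translations

program = [("addi", 1, 5), ("addi", 2, 3), ("djnz", 1, 1), ("halt",)]
print(run(program, [0, 0, 0, 0]))
```

The loop body executes five times here but is translated only once per entry point, which is where the caching pays off.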
See:
Shade was written by
Bob Cmelik,
with help from
David Keppel.
SimICS is a multiprocessor simulator.
SimICS simulates both the user and system modes
of 88000 and SPARC processors and is used for
simulation, debugging, and prototyping.
See:
SimICS should soon be available under license.
Contact
Peter Magnusson.
SimICS is a rewrite of gsim,
which, in turn, was derived from
g88.
SimICS was written by
Peter Magnusson,
David Samuelsson,
Bengt Werner
and
Henrik Forsberg.
- Spectrum [Amiga]
- ZXAM [Amiga]
- KGB [Amiga]
- !MZX [Archimedes]
- !Speccy [Archimedes]
- Speculator [Archimedes]
- ZX-SPECTRUM Emulator [Atari]
- JPP [IBM PC]
- Z80 [IBM PC]
- SpecEm [IBM PC]
- SP [IBM PC]
- SPECTRUM [IBM PC]
- VGASpec [IBM PC]
- Elwro 800-3 Jr v1.0 [IBM PC]
- MacSpeccy [Macintosh]
- PowerSpectum [PowerMAC]
- xzx [Unix/X]
- xz80 [Unix/X]
SimOS emulates both user-mode and system-mode code for a MIPS-based
multiprocessor.
It uses a combination of direct-execution
(some OS rewrites may be required)
and dynamic cross-compilation
(no rewrites needed)
in order to emulate and, to some degree, instrument.
Sleipnir is an instruction-level simulator generator
in the style of yacc.
The configuration file is extended C, with special constructs
to describe bit-level encodings and common code
and support for generation of a threaded-code simulator.
For example, 0b_10ii0sss_s0iidddd
specifies a 16-bit pattern with constant values which must match
and named ``don't care'' fields i (split over two locations),
s, and d.
Sleipnir combines the various patterns to create an instruction decoder.
Named fields are substituted in action rules for an instruction.
For example,
add 0b_10ii0sss_s0iidddd { GP(reg[$d]) = GP(reg[$s]) + $^c }.
Here, ^ indicates sign-extension.
Threaded-code dispatch is implied.
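To make the pattern notation concrete, here is a small Python sketch that turns a pattern string such as 0b_10ii0sss_s0iidddd into a mask/value pair plus field extractors. It is not Sleipnir's generated code. The function names are hypothetical, and the sketch assumes that a field split over two locations is reassembled with its leftmost bits as the most significant bits.

```python
def compile_pattern(pattern):
    """Return (mask, value, fields) for a bit pattern like 0b_10ii0sss_s0iidddd.

    fields maps each field letter to its bit positions, most significant first.
    """
    if not pattern.startswith("0b"):
        raise ValueError("pattern must start with 0b")
    bits = pattern[2:].replace("_", "")
    mask = value = 0
    fields = {}
    for pos, ch in zip(range(len(bits) - 1, -1, -1), bits):
        if ch in "01":                   # constant bit: must match exactly
            mask |= 1 << pos
            value |= int(ch) << pos
        else:                            # named field bit
            fields.setdefault(ch, []).append(pos)
    return mask, value, fields

def decode(word, compiled):
    """Return a dict of field values if word matches the pattern, else None."""
    mask, value, fields = compiled
    if word & mask != value:
        return None
    result = {}
    for name, positions in fields.items():
        v = 0
        for pos in positions:            # gather bits, leftmost first (assumption)
            v = (v << 1) | ((word >> pos) & 1)
        result[name] = v
    return result

def sign_extend(v, width):
    """Interpret the low `width` bits of v as a two's-complement number."""
    sign = 1 << (width - 1)
    return (v ^ sign) - sign

ADD = compile_pattern("0b_10ii0sss_s0iidddd")
```

A decoder built this way would try each instruction's compiled pattern in turn (or merge them into a table) and hand the extracted fields to the matching action rule.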
For simple machines, Sleipnir can generate cycle-accurate simulators.
For more complex machines, it generates ISA-level simulators.
Threaded-code simulators are typically weak at VLIW simulation
and machines with some kinds of exposed latencies.
Threaded-code simulators typically simulate one instruction
entirely before starting the next,
but with VLIW and exposed latencies,
the effects of a single instruction are spread over the
execution of several instructions.
Sleipnir supports some kinds of exposed latencies
by running an after() function after each instruction.
Simulator code that creates values writes them into buffers,
and code in after() can copy the values as needed to
memory, the PC, and so on.
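The buffering idea can be sketched as follows. This is an illustrative Python model only, not Sleipnir output: the Machine class, the write_later helper, the per-value delay counter, and the two toy instructions are assumptions chosen to show how an after() hook can make a result visible one or more instructions late.

```python
class Machine:
    def __init__(self, nregs=8):
        self.regs = [0] * nregs
        self.pending = []        # entries are [instructions_left, reg, value]

    def write_later(self, reg, value, delay=1):
        """Buffer a result instead of writing the register immediately."""
        self.pending.append([delay, reg, value])

    def after(self):
        """Run after every instruction: commit buffered values that are due."""
        waiting = []
        for entry in self.pending:
            entry[0] -= 1
            if entry[0] <= 0:
                self.regs[entry[1]] = entry[2]
            else:
                waiting.append(entry)
        self.pending = waiting

def load_imm(reg, value):
    # Hypothetical load with one exposed delay slot (visible two instructions on).
    return lambda m: m.write_later(reg, value, delay=2)

def add_imm(dst, src, imm):
    # Ordinary instruction: result is visible to the very next instruction.
    return lambda m: m.write_later(dst, m.regs[src] + imm, delay=1)

def run(machine, program):
    for instr in program:
        instr(machine)           # simulate one instruction entirely...
        machine.after()          # ...then let after() apply delayed effects
```

With this model, an add placed directly after the load still reads the old register value, while the add after that sees the loaded value.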
In Norse mythology, ``Sleipnir'' is an eight-legged horse that could
travel over land and sea and through the air.
Reported machine description (MD) sizes, simulation speeds, and levels of accuracy
include the following.
``Speed'' is based on a 250 MHz MIPS R10000-based machine.
| Architecture | MD lines | Sim. speed | Accuracy |
| MIPS-I (integer) | 700 | 5.1 MIPS | ISA |
| M*Core | 970 | 6.4 MIPS | Cycle |
| ARM/Thumb | 2,812 | 3.6 MIPS | ISA |
| TI C6201 | 5,231 | 3.4 MIPS | Cycle |
| Lucent DSP1600 | 3,903 | 3.7 MIPS | Cycle |
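One way to read the speeds in the table is as host clock cycles spent per simulated instruction. The short Python calculation below does that conversion. It assumes that ``MIPS'' here means millions of simulated instructions per second of host time and that the 250 MHz host clock applies to every row; neither assumption is stated explicitly above.

```python
HOST_CLOCK_HZ = 250e6   # 250 MHz host, as stated for the ``Speed'' column

# Simulation speeds copied from the table, in millions of instructions/second.
SIM_SPEED_MIPS = {
    "MIPS-I (integer)": 5.1,
    "M*Core": 6.4,
    "ARM/Thumb": 3.6,
    "TI C6201": 3.4,
    "Lucent DSP1600": 3.7,
}

def host_cycles_per_instruction(mips, clock_hz=HOST_CLOCK_HZ):
    """Host clock cycles available for each simulated instruction."""
    return clock_hz / (mips * 1e6)

CYCLES = {name: host_cycles_per_instruction(mips)
          for name, mips in SIM_SPEED_MIPS.items()}
```

Under these assumptions the simulators spend roughly 40 to 75 host cycles per simulated instruction.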
SoftPC is an 8086/80286 emulator which runs on a variety of host
machines.
The first version implemented an 8086 processor core using an
interpreter.
It provided device emulators for EGA/VGA and Hercules
graphics, hard disks, floppies, and an interrupt controller.
In about 1986, Steve Chamberlain
developed a dynamic cross-compiler for the Sun 3/260.
The basic emulation structure is an array of bytes for simulated memory
and an ``action'' array, which is a same-size array of bytes.
There are then three arrays R, W, and
X for reads, writes, and execution;
each is subscripted by the ``action'' byte and contains a pointer to the
corresponding read, write, or execute action.
For example, a read of location 17 is implemented by
reading a = action[17], then branching to
R[a].
Similarly, executing location 17 is implemented by reading
a = action[17], then branching to X[a].
The default action is that each instruction is interpreted.
Each branch invokes the translator.
The translator (dynamic cross-compiler) generates a translation that
starts at the last branch and goes through the current branch.
SoftPC then records the current branch target,
which is the starting place for the next branch's translation.
SoftPC ``installs'' the translation by allocating a byte subscript
a, then it fills in the action table with the value
a and sets R[a] to act as a normal read;
W[a] to invalidate the corresponding translation; and
X[a] to point to the new translation.
For each byte ``covered'' by the translation,
the action table is set to a byte value that will invalidate the
translation.
For each translation, SoftPC also sets a back-pointer in a 256-entry
table so that when a particular translation is being invalidated
it is easy to find the location in the ``action'' table
which currently uses that translation.
There are thus a maximum of 256 translations at any time
(actually 254 due to reserved byte values).
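The action-table scheme can be modeled in a few lines of Python. This is a simplified sketch and not SoftPC's code: the class and method names are invented, byte values 0 and 255 are assumed to be the reserved ones, and every byte covered by a translation is marked with the translation's own subscript (the description above leaves the exact value for covered bytes open), with the execute handler checking that it was entered at the translation's start.

```python
class SoftPCSketch:
    INTERPRET = 0                        # assumed default action byte

    def __init__(self, size=64):
        self.mem = bytearray(size)       # simulated memory
        self.action = bytearray(size)    # one action byte per memory byte
        self.R = [self._read] * 256
        self.W = [self._write] * 256
        self.X = [self._interpret] * 256
        self.back = {}                   # subscript -> (start, length) back-pointer
        self.log = []

    # Default handlers, selected by action byte 0.
    def _read(self, addr, a):
        return self.mem[addr]

    def _write(self, addr, a, value):
        self.mem[addr] = value

    def _interpret(self, addr, a):
        self.log.append(("interpret", addr))

    # Every access reads a = action[addr], then dispatches through R, W, or X.
    def read(self, addr):
        a = self.action[addr]
        return self.R[a](addr, a)

    def write(self, addr, value):
        a = self.action[addr]
        self.W[a](addr, a, value)

    def execute(self, addr):
        a = self.action[addr]
        self.X[a](addr, a)

    def install(self, start, length, translation):
        """Allocate a subscript and route the covered bytes through it."""
        a = next((i for i in range(1, 255) if i not in self.back), None)
        if a is None:
            raise RuntimeError("all 254 translation slots are in use")
        self.back[a] = (start, length)
        for addr in range(start, start + length):
            self.action[addr] = a
        self.W[a] = self._write_and_invalidate   # R[a] stays a normal read

        def run(addr, a):
            if addr == start:
                translation()
            else:                        # entered mid-translation: interpret
                self._interpret(addr, a)
        self.X[a] = run
        return a

    def _write_and_invalidate(self, addr, a, value):
        start, length = self.back.pop(a)         # back-pointer finds the bytes
        for covered in range(start, start + length):
            self.action[covered] = self.INTERPRET
        self.W[a] = self._write
        self.X[a] = self._interpret
        self.mem[addr] = value
```

Because ordinary bytes keep action value 0, reads and writes to untranslated memory pay only one extra table lookup, while a store into translated code automatically discards the stale translation.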
The simulated system had up to 1MB RAM.
In about 1988 Henry ???
extended the system to use the low bit of the address as part of the
subscript, in order to expand the table to 512 translations.
This was used in the first Apple Macintosh target of SoftPC.
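One plausible reading of this extension, offered only as an assumption, is that the table subscript is formed from the action byte together with the low address bit, so that even and odd addresses select different entries:

```python
def table_index(action_byte, addr):
    """Combine an 8-bit action byte with the low address bit (hypothetical layout)."""
    return (action_byte << 1) | (addr & 1)

TABLE_ENTRIES = table_index(0xFF, 1) + 1   # size of the enlarged R/W/X tables
```

Under this reading the same action byte can name two different translations, one starting at an even address and one at an odd address, which doubles the table from 256 to 512 entries.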
SoftPC emulates many devices, including
EGA, VGA, and Hercules video;
disks, including floppies and hard disks;
the interrupt controller; and so on.
In about 1987, Steve Chamberlain
implemented an 8087 (FP coprocessor) that was not a faithful 8087
(e.g., did not provide full 80-bit FP) but which provided sufficient
accuracy to run common applications.
An Atari ST emulator
that runs on (at least) a Sun SPARC IPC under SunOS 4.1;
it emulates an MC68000, RAM, ROM, Atari ST graphics, keyboard, BIOS,
clock
and maybe some other stuff.
On a SPECint=13.8 machine
it runs on average at half the speed of a real ST.
See:
By: Marinos "nino" Yannikos.
T2 is a SPARCle/Fugu simulator that is implemented by
dynamically cross-compiling
SPARCle code to SPARC code.
It simulates both user and system mode code and was used for
doing program development before the arrival of SPARCle hardware.
The name T2 is short for ``Talisman-2''.
Note that, despite the similarity in names,
Talisman and T2 share little in
implementation or core features: the former uses a
threaded code implementation and
provides timing simulation of an m88k, while the latter uses
dynamic cross-compilation and provides fast simulation of a SPARCle.
See:
Talisman is a fast timing-accurate simulator
for an 88000-based multiple-processor machine.
Talisman provides both user-mode and system mode simulation
and can boot OS kernels.
Simulation is reasonably fast,
on the order of a hundred host instructions per simulated instruction.
Talisman also does low-level timing simulation and typically
produces estimated running times that are within a few percent
of running times on real hardware.
Note that e.g. turning off dynamic RAM refresh simulation
makes the timing accuracy substantially worse!
Built on top of
OM.
According to a
Microsoft
information release, "Windows x86" is a user-space x86 emulator
with an OS interface to 32-bit Microsoft Windows (tm).
According to a
Microsoft
information release, "Windows on Windows" is a user-space x86
emulator with an interface to 16-bit Microsoft Windows (tm).
See:
Wine is a Microsoft Windows(tm)
OS emulator for i*86 systems.
Most of the application's code runs native,
but calls to ``OS'' functions are transformed into calls into Unix/X.
Some programs require enhanced mode device drivers
and will (probably) never run under Wine.
Wine is neither a processor emulator nor a tracing tool.
Simulators
- 2500 A.D.
- Avocet Systems
(also compilers and assemblers).
- ChipTools
on a 33 MHz 486 matches the speed of a 12 MHz 8051
- Cybernetic Micro Systems
- Dunfield Development Systems
Low cost $50.00
500,000+ instructions/second on 486/33
Can interface to target system for physical I/O
Includes PC hosted "on chip" debugger with identical user
interface
- HiTech Equipment Corp.
- Iota Systems, Inc.
- J & M Microtek, Inc.
- Keil Electronics
- Lear Com Company
- Mandeno Granville Electronics, Ltd
- Micro Computer Control Corporation
Simulator/source code debugger ($79.95)
- Microtek Research
- Production Languages Corp.
- PseudoCorp
Emulators ($$$ - high, $$ - medium, $ - low priced)
- Advanced Micro Solutions $$
- Advanced Microcomputer Systems, Inc. $
- American Automation $$$ $$
- Applied Microsystems $$
- ChipTools (front end for Nohau's emulator)
- Cybernetic Micro Systems $
- Dunfield Development Systems $
plans for pseudo-ice using Dallas DS5000/DS2250
used together with their resident monitor and host debugger
- HBI Limited $
- Hewlett-Packard $$$
- HiTech Equipment Corp.
- Huntsville Microsystems $$
- Intel Corporation $$$
- Kontron Electronics $$$
- Mandeno Granville Electronics, Ltd
full line covering everything from the Atmel flash to the
Siemens powerhouse 80c517a
- MetaLink Corporation $$ $
- Nohau Corporation $$
- Orion Instruments $$$
- Philips $
DS-750 pseudo-ICE developed by Philips and CEIBO
real-time emulation and simulator debug mode
source-level debugging for C, PL/M, and assembler
programs 8xC75x parts
low cost - only $100
DOS and Windows versions available
- Signum Systems $$
- Sophia Systems $$$
- Zax Corporation
- Zitek Corporation $$$
(Contacts listed in FAQ below).
See:
A glossary of some terms used here and in the cited works.
See also Terminology.
- An application is
some code (program or program fragment) that is executed or
traced by one of the tools described here.
Note that an operating system is considered an application:
it is thus possible to speak distinctly of the
host
and
target
operating systems.
The target operating system
may itself be managing programs;
these are considered to be a part of the OS ``application''
and are referred to as ``user-mode parts of the application''.
- Emulation
is simulating a target
machine using both software and a host machine that has special
hardware to help speed the simulation.
See: [Tucker 65];
referenced by [Wilkes]
as the original definition.
- Fidelity
From Paul A. Fishwick <fishwick@cis.ufl.edu>:
``Simulation fidelity'' is usually captured under the title
``Validation'' within the simulation literature, and within
modeling literature in general.
A good place to start with validation is the proceedings of
the Winter Simulation Conference since the first part of the
proceedings is dedicated to tutorials and introductions.
Recently, Sargent had a tutorial on validation and you may
find others as well.
- The host machine is the
``real machine'' where the simulation or tracing is finally
run.
Compare to the target machine,
which is the machine that is being simulated or traced.
Note that the host and the target may be the same machine,
e.g. a V8 SPARC simulator that runs on a V8 SPARC.
See also virtual host.
There are many other terms that can and have been used
for host and target.
For example,
[Wilkes] refers to them
as the ``object machine'' and ``subject machine''.
- Static analysis,
optimization, etc.
is performed using the static code but no runtime data.
Compare to dynamic or runtime operations,
which may use program data
and which may be interleaved with program execution.
Note that static execution is possible,
but is limited to pieces that do not depend on program data
or places where data values are speculated
and a ``backup'' mechanism is available in case the speculation
was erroneous.
- The target machine is
the machine that is being simulated or traced.
The target machine may be old hardware (e.g. machines that no
longer exist), proposed hardware (e.g. machines that do not
yet exist), or machines that do currently exist, but for which
it is nonetheless valuable to perform simulation or tracing.
Compare to the host machine,
which is the real machine that actually executes the
simulation and tracing code.
Note that the host and target may be the same machine,
e.g. a V8 SPARC simulator that runs on a V8 SPARC.
There's reportedly an IBM paper that refers to the target
as the ``guest'' machine.
- The term virtual host
may be used when there are several levels of simulation and
tracing.
For example, SoftPC can run on a
SPARC and simulate an 8086; that simulated 8086 can then
execute Z80MU, which runs on an 8086
and simulates a Z80.
As far as Z80MU is concerned, it is running on an 8086 host;
the simulated 8086 provided by SoftPC is thus a virtual host
for Z80MU.
Note that the real host and the virtual host may be the same
machine.
For example, Shade runs on a
V8 SPARC and simulates a V8 SPARC, and so Shade can simulate
a V8 SPARC that is running Shade that is simulating a V8 SPARC
that is running an application.
- [ASH 86]
-
\bibitem{ASH:86}
Anant Agarwal,
Richard L. Sites
and Mark Horowitz,
``ATUM: A New Technique for Capturing Address Traces Using Microcode,''
Proceedings of the 13th International Symposium on Computer
Architecture (ISCA-13),
June 1986,
pp.~119-127.
- [AS 92]
-
\bibitem{AS:92}
Kristy Andrews
and
Duane Sand,
``Migrating a CISC Computer Family onto RISC via Object Code Translation,''
Proceedings of the Fifth International Conference on Architectural
Support for Programming Languages and Operating Systems (ASPLOS-V),
October 1992,
pp.~213-222.
- [BL 94]
-
\bibitem{BL:94}
Thomas Ball
and
James R. Larus,
``Optimally Profiling and Tracing Programs,''
ACM Transactions on Programming Languages and Systems,
(16)2,
May 1994.
- [Baumann 86]
-
\bibitem{Baumann:86}
Robert A. Baumann,
``Z80MU,''
Byte Magazine,
October 1986,
pp.~203-216.
- [Jeremiassen 00]
-
\bibitem{Jeremiassen:00}
Tor E. Jeremiassen,
``Sleipnir --- An Instruction-Level Simulator Generator,''
International Conference on Computer Design, pp.~23--31. IEEE, 2000.
- [Bedichek 90]
-
\bibitem{Bedichek:90}
Robert Bedichek,
``Some Efficient Architecture Simulation Techniques,''
Winter 1990 USENIX Conference,
January 1990,
pp.~53-63.
PostScript(tm) paper
[Link broken, please e-mail <pardo@xsim.com> to
get it fixed.]
- [Bedichek 94]
-
\bibitem{Bedichek:94}
Robert Bedichek,
``The Meerkat Multicomputer: Tradeoffs in Multicomputer Architecture,''
Doctoral Dissertation,
University of Washington Department of Computer Science and Engineering
technical report 94-06-06, 1994.
- [Bedichek 95]
-
\bibitem{Bedichek:95}
Robert C. Bedichek,
``Talisman: Fast and Accurate Multicomputer Simulation,''
Proceedings of the 1995 ACM SIGMETRICS Conference on
Measurement and Modeling of Computer Systems,
May 1995,
pp.~14-24.
- [BKLW 89]
-
\bibitem{BKLW:89}
Anita Borg,
R. E. Kessler,
Georgia Lazana
and
David W. Wall,
``Long Address Traces from RISC Machines: Generation and Analysis,''
Digital Equipment Western Research Laboratory Research Report 89/14,
(appears in shorter form as~\cite{BKW:90})
September 1989.
Abstract/paper.
- [BKW 90]
-
\bibitem{BKW:90}
Anita Borg, R. E. Kessler and
David W. Wall,
``Generation and Analysis of Very Long Address Traces,''
Proceedings of the 17th Annual Symposium on Computer Architecture (ISCA-17),
May 1990,
pp.~270-279.
- [Boothe 92]
-
\bibitem{Boothe:92}
Bob Boothe,
``Fast Accurate Simulation of Large Shared Memory Multiprocessors,''
technical report UCB/CSD 92/682,
University of California, Berkeley, Computer Science Division,
April 1992.
- [BDCW 91]
-
\bibitem{BDCW:91}
Eric A. Brewer,
Chrysanthos N. Dellarocas, Adrian Colbrook and
William E. Weihl,
``{\sc Proteus}: A High-Performance Parallel-Architecture Simulator,''
Massachusetts Institute of Technology technical report
MIT/LCS/TR-516,
1991.
- [BAD 87]
-
\bibitem{BAD:87}
Eugene D. Brooks III, Timothy S. Axelrod and Gregory A. Darmohray,
``The Cerberus Multiprocessor,''
Lawrence Livermore National Laboratory technical report,
Preprint UCRL-94914,
1987.
- [Chamberlain 94]
-
\bibitem{Chamberlain:94}
Steve Chamberlain, Personal communication, 1994.
- [CUL 89]
-
\bibitem{CUL:89}
Craig Chambers, David Ungar and Elgin Lee,
``An Efficient Implementation of {\sc Self}, a Dynamically-Typed
Object-Oriented Language Based on Prototypes,''
OOPSLA '89 Proceedings,
October 1989,
pp.~49-70.
- [CHRG 95]
-
\bibitem{CHRG:95}
John Chapin, Steve Herrod, Mendel Rosenblum and Anoop Gupta,
``Memory System Performance of UNIX on CC-NUMA Multiprocessors,''
ACM SIGMETRICS '95,
May 1995,
pp.~1-13.
ftp://www-flash.stanford.edu/pub/hive/numa-os.ps
- [CHKW 86]
-
\bibitem{CHKW:86}
Fred Chow, A. M. Himelstein, Earl Killian and L. Weber,
``Engineering a RISC Compiler System,''
IEEE COMPCON,
March 1986.
- [CG 93]
-
\bibitem{CG:93}
Cristina Cifuentes
and K.J. Gough,
``A Methodology for Decompilation,''
In Proceedings of the XIX Conferencia Latinoamericana de Informatica,
pp. 257-266,
Buenos Aires, Argentina, August 1993.
PostScript(tm) paper,
PostScript(tm) paper.
(Note: these papers may have moved to
here.)
- [CG 94]
-
\bibitem{CG:94}
Cristina Cifuentes
and
K.J. Gough
``Decompilation of Binary Programs,''
Technical report 3/94,
Queensland University of Technology, School of Computing Science,
1994.
PostScript(tm) paper
(Note: these papers may have moved to
here.)
- [CG 95]
-
\bibitem{CG:95}
C. Cifuentes and K. John Gough,
``Decompilation of Binary Programs,''
Software--Practice&Experience, July 1995.
PostScript(tm) paper
Describes general techniques and a 80286/DOS to C converter.
- [Cifuentes 93]
-
\bibitem{Cifuentes:93}
C. Cifuentes,
``A Structuring Algorithm for Decompilation'', Proceedings of the XIX
Conferencia Latinoamericana de Informatica, Aug 1993, Buenos Aires,
pp. 267 - 276.
PostScript(tm) paper
- [Cifuentes 94a]
-
\bibitem{Cifuentes:94a}
Cristina Cifuentes
``Interprocedural Data Flow Decompilation,''
Technical report 4/94,
Queensland University of Technology, School of Computing Science, 1994.
PostScript(tm) paper
(Note: these papers may have moved to
here.)
- [Cifuentes 94b]
-
\bibitem{Cifuentes:94b}
Cristina Cifuentes,
``Reverse Compilation Techniques,''
Doctoral dissertation,
Queensland University of Technology,
July 1994.
PostScript(tm) paper
(474MB).
- [Cifuentes 94c]
-
\bibitem{Cifuentes:94c}
C. Cifuentes,
``Structuring Decompiled Graphs,''
Technical Report 4/94, Queensland University of
Technology, Faculty of Information Technology, April 1994.
PostScript(tm)
- [Cifuentes 95]
-
\bibitem{Cifuentes:95}
C. Cifuentes,
``Interprocedural Data Flow Decompilation'', Journal of Programming Languages.
In print, 1995.
PostScript(tm) paper
- [Cifuentes 95b]
-
\bibitem{Cifuentes:95b}
C. Cifuentes,
``An Environment for the Reverse Engineering of Executable Programs''.
To appear: Proceedings of the Asia-Pacific Software Engineering
Conference (APSEC). IEEE. Brisbane, Australia. December 1995.
PostScript(tm) paper
- [Conte & Gimarc 95]
-
``Fast Simulation of Computer Architectures'',
Thomas M. Conte and Charles E. Gimarc, Editors.
Kluwer Academic Publishers, 1995.
ISBN 0-7923-9593-X.
See
here
for ordering information.
- [CDKHLWZ 00]
-
\bibitem{CDKHLWZ:00}
Robert F. Cmelik, David R. Ditzel, Edmund J. Kelly, Colin B. Hunter,
Douglas A. Laird, Malcolm John Wing and Gregorz B. Zyner,
``Combining Hardware and Software to Provide an Improved Microprocessor,''
United States Patent #US6031992.
Available as of 2000/03 via
http://www.patents.ibm.com/details?&pn=US06031992__
- [98]
-
US06011908
01/04/2000
Gated store buffer for an advanced microprocessor
Available as of 2000/03 via
- [98]
-
US05958061
09/28/1999
Host microprocessor with apparatus for temporarily
holding target processor state
Available as of 2000/03 via
- [Cmelik 93a]
-
\bibitem{Cmelik:93a}
Robert F. Cmelik,
``Introduction to Shade,''
Sun Microsystems Laboratories, Incorporated,
February 1993.
- [Cmelik 93b]
-
\bibitem{Cmelik:93b}
Robert F. Cmelik,
``The Shade User's Manual,''
Sun Microsystems Laboratories, Incorporated,
February 1993.
- [Cmelik 93c]
-
\bibitem{Cmelik:93c}
Robert F. Cmelik,
``SpixTools Introduction and User's Manual,''
Sun Microsystems Laboratories, Incorporated,
technical report TR93-6,
February 1993.
Html pointer
- [CK 93]
-
\bibitem{CK:93}
Robert F. Cmelik,
and
David Keppel,
``Shade: A Fast Instruction-Set Simulator for Execution Profiling,''
Sun Microsystems Laboratories, Incorporated, and the University of
Washington,
technical report
SMLI 93-12
and UWCSE
93-06-06,
1993.
Html pointer,
PostScript(tm) paper.
- [CK 94]
-
\bibitem{CK:94}
Robert F. Cmelik,
and
David Keppel,
``Shade: A Fast Instruction-Set Simulator for Execution Profiling,''
Proceedings of the 1994 ACM SIGMETRICS Conference
on Measurement and Modeling of Computer Systems
May 1994,
pp.~128-137.
Html pointer,
PostScript(tm) paper.
[Link broken, please e-mail <pardo@xsim.com> to
get it fixed.]
- [CK 95]
-
\bibitem{CK:95}
Robert F. Cmelik,
and
David Keppel,
``Shade: A Fast Instruction-Set Simulator for Execution Profiling,''
Appears as Chapter~2 of
``[Conte & Gimarc 95]'',
pp.~5-46.
- [CMMJS 88]
-
\bibitem{CMMJS:88}
R. C. Covington, S. Madala, V. Mehta, J. R. Jump and J. B. Sinclair,
``The Rice Parallel Processing Testbed,''
Proceedings of the 1988 ACM SIGMETRICS Conference on Measurement and
Modeling of Computer Systems,
1988,
pp.~4-11.
- [DLHH 94]
-
\bibitem{DLHH:94}
Peter Davies, Philippe LaCroute, John Heinlein and
Mark Horowitz,
``Mable: A Technique for Efficient Machine Simulation,''
Quantum Effect Design, Incorporated, and Stanford University
technical report CSL-TR-94-636