Glossary

A Brief Categorization

A list of tools, organized according to various interesting features. See also a listing of tools ordered alphabetically. Interesting things about the tools include:

Purpose of the tool
Supports buggy applications (that is: is the tool robust in the face of application errors?).
Supports dynamic instruction space modification (a.k.a. ``Dynamic Linking'', ``Runtime Code Generation'', or ``Self-Modifying Code'')
Supports multiple target processors
Supports multiple protection domains (address spaces)
Supports signals, exceptions and asynchronous events
Supports system-mode simulation or tracing
Implementation:
Timing simulation
Performance of the tool
Product status

Purpose Of The Tool

Simulation and tracing tools can perform a wide variety of tasks. Here are some common uses:

atr: address tracing
Classical ``address tracing'' gathers a list of instruction and/or data memory references performed by a system. There are many variations, such as tracing only targets of control transfers or tracing other resources.
db: debugging
A simulator can help with debugging because: it runs deterministically and repeatably; it is possible to query system state without disturbing it; the simulator can be backed up to an earlier checkpoint in order to implement reverse execution (``foo is twelve ... what was the value of bar in the routine we just returned from?''); and because a simulator can perform consistency checks that cannot be done on real hardware.
otr: other tracing and event counting
A generalization of address tracing is to trace, count, or categorize any kind of processor or system event or resource use. For example, a tool may collect the common values of variables, register usage patterns, interrupt or exception event counts, timing information, and so on.
sim: (instruction set) simulation
Simulators commonly implement a processor architecture that does not yet or no longer exists. Simulators can also implement other devices such as memory, bus, I/O devices, user input, and so on.
tb: tool building
Here, ``tool building'' encompasses tools that are used to build other tools. For example, a tool that builds various tracing tools is a tool-building tool, whereas a configurable cache simulator is not. The usual distinction is that a tool-building tool can be extended [NG87, NG88] using a general-purpose programming language (e.g. C, C++, ...), whereas a configurable tool is programmed with a less powerful language, e.g. a list of cache size, line size, associativity, etc.
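
As a rough illustration of classical address tracing, the sketch below interprets a made-up three-instruction accumulator machine and records every instruction fetch and data reference. The instruction format, the name run_traced, and the trace record layout are assumptions chosen for illustration; they are not taken from any tool listed here.

```python
# Hypothetical toy machine: each instruction is a tuple (opcode, operand).
# Trace records are (kind, address): "I" = fetch, "R" = read, "W" = write.

def run_traced(program, memory):
    """Interpret `program` and return (accumulator, address trace)."""
    trace = []
    acc = 0
    pc = 0
    while pc < len(program):
        trace.append(("I", pc))          # instruction fetch
        op, arg = program[pc]
        if op == "load":                 # acc <- memory[arg]
            trace.append(("R", arg))
            acc = memory[arg]
        elif op == "add":                # acc <- acc + memory[arg]
            trace.append(("R", arg))
            acc += memory[arg]
        elif op == "store":              # memory[arg] <- acc
            trace.append(("W", arg))
            memory[arg] = acc
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc += 1
    return acc, trace


memory = {100: 2, 101: 3, 102: 0}
program = [("load", 100), ("add", 101), ("store", 102)]
acc, trace = run_traced(program, memory)
```

A variation that traces only some resources, such as data writes, would simply filter which records are appended.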

In addition, some tools are used for:

os: operating system (OS) Emulation
Compare OS emulation ``as a purpose'' with simulators that emulate the OS for simplicity (see system-mode simulation or tracing).

Handles Application Bugs Robustly

No: Application errors such as stores to random memory locations may cause the simulation or tracing tool to fail or produce spurious answers, or may cause the application program to fail in an unexpected (unintended) way or produce spurious answers.
Some: Certain kinds of errors are detected or serviced. For example, application errors may be constrained so that they can clobber application data in random ways but cannot cause the simulation or tracing tool to fail or produce erroneous results.
Yes: Application errors are detected and handled in some predictable way. Typically, ``predictable'' means that the error model is the same as a reference for the target architecture.
Yes*: Selectable; turning on checking may slow execution.

Works with Self-Modifying Code

THIS CATEGORY NOT YET ORGANIZED, SEE THE SHADE PAPER.

No
Yes, but not all kinds
Yes

Multiple Processors

THIS CATEGORY NOT YET ORGANIZED, SEE THE SHADE PAPER.

No
Y1: multiplexes all target processors on a single host processor
Y=: same number of host and target processors (to be precise, there should also be a ``Y-'' category for several host processors per target processor).
Y+: can multiplex a large number of target processors onto a potentially smaller number of host processors
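
The ``Y1'' case can be pictured as round-robin interleaving of target processors on one host thread. The following is only a sketch under that assumption: each target processor is modelled as a Python generator, and the names target_cpu and run_round_robin are hypothetical.

```python
def target_cpu(cpu_id, steps, log):
    """Hypothetical target processor: each `yield` ends one simulated instruction."""
    for i in range(steps):
        log.append((cpu_id, i))          # "execute" instruction i on this CPU
        yield


def run_round_robin(cpus):
    """Interleave all target processors on the single host thread."""
    active = list(cpus)
    while active:
        still_running = []
        for cpu in active:
            try:
                next(cpu)                # run one instruction of this target CPU
                still_running.append(cpu)
            except StopIteration:
                pass                     # this target CPU has finished
        active = still_running


log = []
run_round_robin([target_cpu(0, 2, log), target_cpu(1, 3, log)])
```

Switching after every instruction is the simplest policy; a real tool would more likely switch after a larger quantum to reduce overhead.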

Support for Multiple Protection Domains

THIS CATEGORY NOT YET ORGANIZED, SEE THE SHADE PAPER.

No
Yes

Signals and Exceptions

THIS CATEGORY NOT YET ORGANIZED, SEE THE SHADE PAPER.

No
S: yes, but not all kinds. For example, a tracing tool might execute the traced program correctly but fail to trace signal handlers.
Yes

Support for System-Mode Code

THIS CATEGORY NOT YET ORGANIZED, SEE THE SHADE PAPER.

d: device
u: user
s: system

Note: the system mode may be marked in parentheses, e.g. (s), indicating that the host processor does not have a distinct system mode in hardware, but the tool is intended to work with (simulate, trace, etc.) operating system code.
Processor simulators typically implement either a full processor or just the user-mode part of the instruction set. A full simulation is more precise and allows analysis of operating systems, etc. However, it also requires implementing the processor's protected mode architecture, simulated devices, etc. An alternative is to implement just the user-mode portion of the ISA and to implement system calls (transitions to protected mode) using simulator code rather than by simulating the operating system. OS emulation is typically less accurate.
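
The user-mode-only alternative can be sketched as follows: a toy interpreter in which a ``syscall'' instruction is serviced directly by simulator (host) code rather than by simulated kernel instructions. The instruction set, the register names, and the system call numbers are all invented for this example.

```python
# Hypothetical user-mode-only simulator: system calls are serviced by host
# code in the simulator, not by simulating a target operating system.

def emulate_syscall(number, arg, output):
    """Service a target system call in host code. Call numbers are made up."""
    if number == 1:                      # hypothetical "write integer"
        output.append(str(arg))
        return 0
    if number == 2:                      # hypothetical "exit"
        return None                      # None tells the interpreter to stop
    return -1                            # unknown call: return an error code


def run_user_mode(program):
    """Run `program`; return the list of strings the target 'wrote'."""
    regs = {"r0": 0, "r1": 0}
    output = []
    pc = 0
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op == "li":                   # load immediate: li reg, value
            regs[args[0]] = args[1]
        elif op == "syscall":            # call number in r0, argument in r1
            result = emulate_syscall(regs["r0"], regs["r1"], output)
            if result is None:
                break
            regs["r0"] = result
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return output


program = [
    ("li", "r0", 1), ("li", "r1", 42), ("syscall",),   # write 42
    ("li", "r0", 2), ("syscall",),                     # exit
    ("li", "r0", 1), ("li", "r1", 7), ("syscall",),    # never reached
]
written = run_user_mode(program)
```

Note what is lost: no kernel instructions are executed or traced, so any cost or behavior inside the operating system is invisible to the tool.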

Input Representation

THIS CATEGORY NOT YET ORGANIZED, SEE THE SHADE PAPER.

asm: assembly code
exe: executable code, no symbol table information
exe*: executable code, with symbol table information
hll: high-level language

Implementation: Decompilation Technology

``Decompilation technology'' here refers to the process of analyzing a (machine code) fragment and, through analysis, creating some higher-level information about the fragment. For simulation and tracing tools, decompilation is typically simpler than static program decompilation, in which the goal is to read a binary program and produce source code for it in some high-level language. Simulation and tracing ``has it easy'' in comparison because it is possible to get by with a lower-level representation and also to punt hard problems to the runtime, when more information is available.

Even so, executable machine code is difficult to simulate and trace efficiently (within 2 orders of magnitude of the performance of native execution) when using ``naive'' instruction-by-instruction translation, because lots of relevant information is unavailable statically. For example, every instruction is potentially a branch target; every word of memory is potentially used both as code and as data; every mutable word of memory is potentially executed, modified (at runtime), and then executed again; and so on.

Executable machine code is also inherently (target) machine-dependent and thus lexing and parsing the machine code is a source of potential portability problems. (Note that some tools use a high-level input, so that relatively little analysis is needed to determine the original programmer's intent, at least at a level needed to simulate the program with modest efficiency.)

The following is a list of tools and papers that show how to reduce the overhead of analyzing each instruction; how to reduce the number of times each instruction is analyzed; how to perform optimistic analysis and recover when it is wrong; and how to improve the abstraction of machine-dependent parts of the tool.

A short list:

A slightly longer list:

Implementation: Simulation Technology

The ``simulation technology'' is how the original machine instructions (or other source representation) get translated into an executable representation that is suitable for simulation and/or tracing. Choices include:

ddi: Decode-and-dispatch interpretation: the input representation for an operation is fetched and decoded each time it is executed.
pdi: Predecode interpretation: the input form is translated into a form that is faster to decode; that form is then saved so that successive invocations (e.g. subsequent iterations of a loop) need only fetch and decode the ``fast'' form. Note that:
- The translation may happen before program invocation, during startup, or incrementally during execution; and that the translated form may be discarded and regenerated.
- If the original instructions change, the translated form becomes incoherent with the original representation; a system that fails to update (invalidate) the translated form before it is reexecuted will simulate the old instructions instead of the new ones. For some systems (e.g., those with hardware-coherent instruction caches) such behavior is erroneous.
tci: Threaded code interpretation: a particularly common and efficient form of predecode interpretation.
scc: Static cross-compilation: The input form is statically (before program execution) translated from the target instruction set to the host instruction set. Note that:
- All translation costs are paid statically, so runtime efficiency may be very good. In contrast, dynamic analysis and transformation costs are paid during simulation, and so it may be necessary to ``cut corners'' with dynamic translation in order to manage the runtime cost. Cutting corners may affect both the quality of analysis of the original program and the quality of code generation.
- Instructions that cannot be located statically or which do not exist until runtime cannot be translated statically.
- Historically, it is difficult to distinguish between memory words that are used for instructions and those that are used for data; translating data as instructions may cause errors.
- Translating to machine code allows the use of the host hardware's instruction fetch/decode/dispatch hardware to help simulate the target's.
- Translating to machine code makes it easier to translate clumps of host instructions; most dispatching between target instructions is thus eliminated.
dcc: Dynamic Cross Compilation: Host machine code is generated dynamically, as the program runs. Note that:
- Translating ``on demand'' eases the problem of determining what is code and what is data; a given word may even be used as both code and data.
- Translating to machine code is often more expensive than translating to other representations; both the cost of generating the machine code and the cost of executing it contribute to the overall execution time.
- Theoretical performance advantages from dynamic cross-compilation may be overwhelmed by the host's increased cache miss ratio due to dynamic cross-compilation's larger code sizes [Pittman 95].
aug: Augmentation: cross-compilation where the host and target are the same machine. Note that:
- Augmentation is typically done statically.
- There is a fine line between having identical host and target machines (augmentation) and having nearly-identical machines in which just a few features (e.g. memory references) are simulated, but in which the bulk of instruction sets and encodings are identical.
emu: Emulation: Where software simulation is sped up using hardware assistance. ``Hardware assistance'' might include special compatibility modes but might also include careful use of page mappings. (See ``emulation''.)

Dynamic Compilation: Displaced Execution

Move an instruction from one place to another, but execute with the same host and target.

1951: EDSAC Debug
1987: Shadow

Dynamic Compilation: Cross-Compilation

Compile instruction sequences from a target machine to run on a host machine.

1984: ST-80
1987: CRISP
1987: Mimic
1988: SoftPC
1988: SELF
1991: Shade
1993: MINT
1994: Executor
1994: IMS
1993: SimICS; in particular, ``Partial Translation''
1994: SimOS
1994: T2

Hardware Emulation

1986: ATUM
1987: CRISP
2000: Crusoe
1993: Migrant
1993: Tapeworm II
1993: WWT
1994: IMS

Interpreters

Simulation and tracing tools that perform execution using interpretation; the original executable code is neither preprocessed (by augmentation or static cross-compilation) nor dynamically compiled to host code.

1986: Z80MU
1987: Cerberus
1988: g88
1990: Spa
1991: SimICS
1991: Dynascope
1992: Accelerator
1992: GNU Simulators
1992: SPIM
1993: Cygnus
1993: Dynascope-II
1993: Executor
1993: MINT
1993: WWT
1994: Dynascope-II
1994: Talisman (also known as ``mg88'').
1994: Kx10
1994: Mable
1994: Mime

Static Cross-Compilation

Statically cross-compile instruction sequences from a target machine to run on some host machine.

1983: dis+mod+run
1986: Moxie
1987: Cerberus
1992: Accelerator
1993: Vest and mx
1994: FlashPort
1994: FreePort Express
1994: Migrant
1994: Pixie-II

Static Augmentation

Augmentation-based tracing tools run host instructions natively, but some instructions are simulated. For example, Proteus executes arithmetic and stack-relative memory reference instructions natively, and simulates load and store instructions that may reference shared memory.

1983: Simon
1986: Pixie
1988: RPPT
1989: MPtrace
1989: Titan tracing
1989: TRAPEDS
1991: Proteus
1991: Tango Lite
1992: FAST
1992: OM
1992: Purify
1993: ATOM (based on OM)
1993: Hiprof (based on OM)
1993: qp/qpt
1993: Third Degree (based on OM)
1993: WWT
1994: IDtrace
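
The idea behind static augmentation can be sketched with a toy ``machine'' whose programs are lists of tuples: the program is rewritten once, before execution, so that each memory reference is preceded by an inserted logging instruction, while every original instruction is left unchanged. The instruction names and the functions augment and run are hypothetical and do not describe how any listed tool works internally.

```python
def augment(program):
    """Statically rewrite `program`, adding a log op before each memory reference."""
    augmented = []
    for op, arg in program:
        if op in ("load", "store"):
            augmented.append(("log", (op, arg)))   # inserted instrumentation
        augmented.append((op, arg))                # original instruction, unchanged
    return augmented


def run(program, memory, log):
    """Execute a (possibly augmented) program on the toy machine."""
    acc = 0
    for op, arg in program:
        if op == "load":
            acc = memory[arg]
        elif op == "addi":
            acc += arg
        elif op == "store":
            memory[arg] = acc
        elif op == "log":
            log.append(arg)
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return acc


program = [("load", 10), ("addi", 5), ("store", 11)]
memory = {10: 1, 11: 0}
log = []
result = run(augment(program), memory, log)
```

The toy program has no branches. A real augmentation tool must also adjust branch targets and other addresses, because inserting instructions moves the original code.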

Multiple Strategies

Some tools rely on having multiple strategies in order to achieve their desired functionality. For the purposes here, ``untraced native execution'' counts as a translator.

1951: EDSAC Debug (displaced execution, native execution)
1991: Dynascope (interpretation, native execution)
1992: Accelerator (static cross-compilation, interpretation)
1993: MINT (dynamic cross-compilation, interpretation)
1993: Vest and mx (static cross-compilation, interpretation)
1994: Executor (interpretation, dynamic cross-compilation)
1994: SimICS (interpretation, dynamic cross-compilation)
1995: FreePort Express (static cross-compilation, interpretation; uses Vest and mx technology)

Other

Some tools/papers not listed under other headings.

Match Between Host and Target

THIS CATEGORY NOT YET ORGANIZED.

Generally, the closer the match between the host and the target, the easier it is to write a simulator, and the better the efficiency. Possible mismatches include:

Byte or word size. For example, Kx10 simulates a machine with 36-bit words; it runs on machines with 32-bit and 64-bit words.
Numeric representation. For example, whether integers are sign-magnitude, one's complement, or two's complement. Or, for example, Vest, which simulates all VAX floating-point formats on a host machine that lacks some of the VAX formats.
Which instruction combinations cause exceptions, and how those exceptions are reported.
Synchronization and atomicity. In particular, the details may be messy where the target machine synchronizes implicitly and the host does so explicitly, since all target operations that might cause synchronization generally need to be treated as if they do.
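
Two of these mismatches, word size and numeric representation, can be illustrated with a few lines of arithmetic. The sketch below assumes a 36-bit target word held in a wider host integer; the function names are hypothetical, and this is not a description of how Kx10 or Vest is implemented.

```python
WORD_BITS = 36
WORD_MASK = (1 << WORD_BITS) - 1


def add_wrap(a, b, bits=WORD_BITS):
    """Two's-complement add on `bits`-bit words: discard the carry out."""
    return (a + b) & ((1 << bits) - 1)


def to_signed(word, bits=WORD_BITS):
    """Interpret a `bits`-bit word as a two's-complement signed integer."""
    sign_bit = 1 << (bits - 1)
    return word - (1 << bits) if word & sign_bit else word


def add_ones_complement(a, b, bits=WORD_BITS):
    """One's-complement add: the carry out is added back in (end-around carry)."""
    mask = (1 << bits) - 1
    total = a + b
    total = (total & mask) + (total >> bits)
    return total & mask


wrapped = add_wrap(WORD_MASK, 1)                  # all ones plus one wraps to zero
minus_one = to_signed(WORD_MASK)                  # all ones is -1 in two's complement
ones_sum = add_ones_complement(WORD_MASK - 1, 2)  # (-1) + 2 in one's complement
```

Every simulated arithmetic operation pays for the extra masking or carry handling, which is one reason a closer host/target match tends to be more efficient.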

Note that target support for self-modifying code may be treated as a special case of synchronization. For example, target machines with no caches or unified instruction and data caches will typically write instructions using ordinary store instructions. Therefore, all store instructions must be treated as potential code-modifying instructions.
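
A minimal sketch of treating every store as potentially code-modifying: the simulator keeps a table of predecoded instructions and conservatively discards the entry for any address that is written. The Machine class, its instruction encoding (opcode in the high byte, operand in the low byte), and its method names are assumptions made for this example.

```python
class Machine:
    """Hypothetical simulator with a predecoded-instruction table."""

    def __init__(self, memory):
        self.memory = list(memory)
        self.predecoded = {}             # address -> (opcode, operand)
        self.decodes = 0                 # how many raw words were decoded

    def fetch(self, addr):
        """Return the decoded instruction at `addr`, decoding only on a miss."""
        if addr not in self.predecoded:
            word = self.memory[addr]
            self.predecoded[addr] = (word >> 8, word & 0xFF)
            self.decodes += 1
        return self.predecoded[addr]

    def store(self, addr, word):
        """Ordinary store; any store may overwrite code, so invalidate."""
        self.memory[addr] = word
        self.predecoded.pop(addr, None)


m = Machine([0x0105, 0x0203])
first = m.fetch(0)                       # decoded and cached
m.store(0, 0x0207)                       # the target overwrites its own code
second = m.fetch(0)                      # must see the new instruction
```

The check costs something on every store, even though almost no stores actually touch code; that overhead is the price of matching a target that keeps instructions and data coherent implicitly.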

For timing-accurate simulation (see Talisman and RSIM), some matches between the host and target can improve the efficiency, but many do not.

Timing Simulation

THIS CATEGORY NOT YET ORGANIZED.

Some instruction-set simulators also perform timing simulation. Timing is not strictly an element of instruction-set simulation, but is often useful, since one major use for instruction-set simulation is to collect information for predicting or analyzing performance. Important features of timing simulation include both the processor pipeline and the memory system (see Talisman and RSIM).

Performance

There are many ways to measure performance. Some common metrics include:

host instructions executed per target instruction executed;
host cycles executed per target instruction executed;
relative wallclock time of host and target.

Metrics that are more abstract have the advantage that they are typically simple to reason about and applicable across a variety of implementations. For example, host instructions may be counted relatively easily for each of a variety of target instructions, and the counts are relatively isolated from the structure of the caches and microarchitecture. Conversely, concrete metrics tend to reflect all related costs more accurately. For example, the effects of caches and microarchitecture are included.

It is worth noting that few reports give enough information about the measurement methodology in order to make a valid comparison. For example, if dilation is ``typically'' 20x, what is ``typical'', and what is the performance for ``non-typical'' workloads?
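
The three metrics listed above are simple ratios. The sketch below computes them from raw counts; the function name, the parameter names, and the sample numbers are all invented for illustration and are not measurements of any tool.

```python
def slowdown_metrics(host_instructions, host_cycles, target_instructions,
                     simulated_seconds, native_seconds):
    """Return the three ratios as a dict; all inputs are raw measured counts."""
    return {
        "host_instr_per_target_instr": host_instructions / target_instructions,
        "host_cycles_per_target_instr": host_cycles / target_instructions,
        "wallclock_dilation": simulated_seconds / native_seconds,
    }


# Made-up numbers for one hypothetical workload.
metrics = slowdown_metrics(
    host_instructions=2_000_000,
    host_cycles=3_000_000,
    target_instructions=100_000,
    simulated_seconds=5.0,
    native_seconds=0.25,
)
```

Reporting such ratios per workload, together with how the counts were obtained, is what makes a figure like ``20x'' comparable across tools.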

Product Status

THIS CATEGORY NOT YET ORGANIZED.

The status of the tool:

info: only information is available
nonprod: the tool is available but is not a product
product: the tool is a commercial product

From instruction-set simulation and tracing