tcg/README +57 −62

@@ -16,14 +16,18 @@ from the host, although it is never the case for QEMU.

 A TCG "function" corresponds to a QEMU Translated Block (TB).

-A TCG "temporary" is a variable only live in a given
-function. Temporaries are allocated explicitly in each function.
+A TCG "temporary" is a variable only live in a basic
+block. Temporaries are allocated explicitly in each function.

-A TCG "global" is a variable which is live in all the functions. They
-are defined before the functions defined. A TCG global can be a memory
-location (e.g. a QEMU CPU register), a fixed host register (e.g. the
-QEMU CPU state pointer) or a memory location which is stored in a
-register outside QEMU TBs (not implemented yet).
+A TCG "local temporary" is a variable only live in a function. Local
+temporaries are allocated explicitly in each function.
+
+A TCG "global" is a variable which is live in all the functions
+(equivalent of a C global variable). They are defined before the
+functions defined. A TCG global can be a memory location (e.g. a QEMU
+CPU register), a fixed host register (e.g. the QEMU CPU state pointer)
+or a memory location which is stored in a register outside QEMU TBs
+(not implemented yet).

 A TCG "basic block" corresponds to a list of instructions terminated
 by a branch instruction.

@@ -32,11 +36,11 @@ by a branch instruction.

 3.1) Introduction

-TCG instructions operate on variables which are temporaries or
-globals. TCG instructions and variables are strongly typed. Two types
-are supported: 32 bit integers and 64 bit integers. Pointers are
-defined as an alias to 32 bit or 64 bit integers depending on the TCG
-target word size.
+TCG instructions operate on variables which are temporaries, local
+temporaries or globals. TCG instructions and variables are strongly
+typed. Two types are supported: 32 bit integers and 64 bit
+integers. Pointers are defined as an alias to 32 bit or 64 bit
+integers depending on the TCG target word size.

 Each instruction has a fixed number of output variable operands, input
 variable operands and always constant operands.

@@ -44,14 +48,12 @@ variable operands and always constant operands.
 The notable exception is the call instruction which has a variable
 number of outputs and inputs.

-In the textual form, output operands come first, followed by input
-operands, followed by constant operands. The output type is included
-in the instruction name. Constants are prefixed with a '$'.
+In the textual form, output operands usually come first, followed by
+input operands, followed by constant operands. The output type is
+included in the instruction name. Constants are prefixed with a '$'.

 add_i32 t0, t1, t2 (t0 <- t1 + t2)
-sub_i64 t2, t3, $4 (t2 <- t3 - 4)
-

 3.2) Assumptions

 * Basic blocks

@@ -62,9 +64,8 @@ sub_i64 t2, t3, $4 (t2 <- t3 - 4)

 - Basic blocks start after the end of a previous basic block, at a
   set_label instruction or after a legacy dyngen operation.

-  After the end of a basic block, temporaries at destroyed and globals
-  are stored at their initial storage (register or memory place
-  depending on their declarations).
+  After the end of a basic block, the content of temporaries is
+  destroyed, but local temporaries and globals are preserved.

 * Floating point types are not supported yet

@@ -100,7 +101,7 @@ optimizations:
   is suppressed.

 - A liveness analysis is done at the basic block level. The
-  information is used to suppress moves from a dead temporary to
+  information is used to suppress moves from a dead variable to
   another one. It is also used to remove instructions which compute
   dead results. The later is especially useful for condition code
   optimization in QEMU.

@@ -113,47 +114,6 @@ optimizations:
   only the last instruction is kept.

-- A macro system is supported (may get closer to function inlining
-  some day). It is useful if the liveness analysis is likely to prove
-  that some results of a computation are indeed not useful. With the
-  macro system, the user can provide several alternative
-  implementations which are used depending on the used results. It is
-  especially useful for condition code optimization in QEMU.
-
-  Here is an example:
-
-  macro_2 t0, t1, $1
-  mov_i32 t0, $0x1234
-
-  The macro identified by the ID "$1" normally returns the values t0
-  and t1. Suppose its implementation is:
-
-  macro_start
-  brcond_i32 t2, $0, $TCG_COND_EQ, $1
-  mov_i32 t0, $2
-  br $2
-  set_label $1
-  mov_i32 t0, $3
-  set_label $2
-  add_i32 t1, t3, t4
-  macro_end
-
-  If t0 is not used after the macro, the user can provide a simpler
-  implementation:
-
-  macro_start
-  add_i32 t1, t2, t4
-  macro_end
-
-  TCG automatically chooses the right implementation depending on
-  which macro outputs are used after it.
-
-  Note that if TCG did more expensive optimizations, macros would be
-  less useful. In the previous example a macro is useful because the
-  liveness analysis is done on each basic block separately. Hence TCG
-  cannot remove the code computing 't0' even if it is not used after
-  the first macro implementation.
-
 3.4) Instruction Reference

 ********* Function call

@@ -241,6 +201,10 @@ t0=t1|t2

 t0=t1^t2

+* not_i32/i64 t0, t1
+
+t0=~t1
+
 ********* Shifts

 * shl_i32/i64 t0, t1, t2

@@ -428,3 +392,34 @@ to apply more optimizations because more registers will be free for
 the generated code.

 The exception model is the same as the dyngen one.
+
+6) Recommended coding rules for best performance
+
+- Use globals to represent the parts of the QEMU CPU state which are
+  often modified, e.g. the integer registers and the condition
+  codes. TCG will be able to use host registers to store them.
+
+- Avoid globals stored in fixed registers. They must be used only to
+  store the pointer to the CPU state and possibly to store a pointer
+  to a register window. The other uses are to ensure backward
+  compatibility with dyngen during the porting a new target to TCG.
+
+- Use temporaries. Use local temporaries only when really needed,
+  e.g. when you need to use a value after a jump. Local temporaries
+  introduce a performance hit in the current TCG implementation: their
+  content is saved to memory at end of each basic block.
+
+- Free temporaries and local temporaries when they are no longer used
+  (tcg_temp_free). Since tcg_const_x() also creates a temporary, you
+  should free it after it is used. Freeing temporaries does not yield
+  a better generated code, but it reduces the memory usage of TCG and
+  the speed of the translation.
+
+- Don't hesitate to use helpers for complicated or seldom used target
+  intructions. There is little performance advantage in using TCG to
+  implement target instructions taking more than about twenty TCG
+  instructions.
+
+- Use the 'discard' instruction if you know that TCG won't be able to
+  prove that a given global is "dead" at a given program point. The
+  x86 target uses it to improve the condition codes optimisation.

tcg/TODO +7 −24

-- test macro system
+- Add new instructions such as: andnot, ror, rol, setcond, clz, ctz,
+  popcnt.

-- test conditional jumps
+- See if it is worth exporting mul2, mulu2, div2, divu2.

-- test mul, div, ext8s, ext16s, bswap
-
-- generate a global TB prologue and epilogue to save/restore registers
-  to/from the CPU state and to reserve a stack frame to optimize
-  helper calls. Modify cpu-exec.c so that it does not use global
-  register variables (except maybe for 'env').
-
-- fully convert the x86 target. The minimal amount of work includes:
-  - add cc_src, cc_dst and cc_op as globals
-  - disable its eflags optimization (the liveness analysis should
-    suffice)
-  - move complicated operations to helpers (in particular FPU, SSE, MMX).
-
-- optimize the x86 target:
-  - move some or all the registers as globals
-  - use the TB prologue and epilogue to have QEMU target registers in
-    pre assigned host registers.
-
 - Support of globals saved in fixed registers between TBs.

 Ideas:

 - Move the slow part of the qemu_ld/st ops after the end of the TB.

-- Experiment: change instruction storage to simplify macro handling
-  and to handle dynamic allocation and see if the translation speed is
-  OK.
-
-- change exception syntax to get closer to QOP system (exception
+- Change exception syntax to get closer to QOP system (exception
   parameters given with a specific instruction).
+
+- Add float and vector support.
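The README text in this change describes a liveness analysis done per
basic block, used to remove instructions which compute dead results.
The following is a minimal Python sketch of that idea, not TCG's actual
implementation. The Op tuple, remove_dead_ops() and the globals g_a and
g_b are hypothetical names chosen for illustration; cc_dst is borrowed
from the TODO list. Instructions with side effects are not modelled.

```python
from typing import List, NamedTuple, Set, Tuple


class Op(NamedTuple):
    """One toy instruction: name, output variables, input variables."""
    name: str
    outputs: Tuple[str, ...]
    inputs: Tuple[str, ...]


def remove_dead_ops(block: List[Op], live_at_end: Set[str]) -> List[Op]:
    """Backward liveness pass over a single basic block.

    live_at_end holds the variables still needed after the block
    (in this toy model: the globals).
    """
    live = set(live_at_end)
    kept: List[Op] = []
    for op in reversed(block):
        if op.outputs and not any(out in live for out in op.outputs):
            continue  # every result is dead: drop the instruction
        live.difference_update(op.outputs)  # outputs are defined here
        live.update(op.inputs)  # inputs must be live before this op
        kept.append(op)
    kept.reverse()
    return kept


block = [
    Op("add_i32", ("t0",), ("g_a", "g_b")),      # t0 is never read
    Op("sub_i32", ("cc_dst",), ("g_a", "g_b")),  # overwritten below
    Op("add_i32", ("t1",), ("g_a", "g_b")),
    Op("mov_i32", ("cc_dst",), ("t1",)),
]
optimized = remove_dead_ops(block, live_at_end={"g_a", "g_b", "cc_dst"})
for op in optimized:
    print(op.name, ", ".join(op.outputs + op.inputs))
```

With this sample block only the last two instructions survive: the
first write to cc_dst is dropped because it is overwritten before any
read, which is the condition code case the README mentions.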
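The updated README states that the content of plain temporaries is
destroyed at the end of a basic block, while local temporaries and
globals are preserved. A small checker can make that rule concrete.
This is a sketch under stated assumptions, not part of TCG: the
function stale_temp_reads(), the tuple layout and the variable names
(t0, t1, loc0, g_a, g_b) are hypothetical, a basic block is taken to
end at a branch and to start at set_label, and label operands are
omitted.

```python
from typing import Iterable, List, Set, Tuple

# (name, outputs, inputs); label and branch targets are left out
Op = Tuple[str, Tuple[str, ...], Tuple[str, ...]]

BLOCK_END = {"br", "brcond_i32", "brcond_i64"}  # a branch ends a block
BLOCK_START = {"set_label"}  # a label starts a new block


def stale_temp_reads(ops: Iterable[Op],
                     plain_temps: Set[str]) -> List[Tuple[int, str]]:
    """Return (op index, temp) for each read of a plain temporary that
    was not written earlier in the same basic block."""
    defined: Set[str] = set()
    problems: List[Tuple[int, str]] = []
    for index, (name, outputs, inputs) in enumerate(ops):
        if name in BLOCK_START:
            defined.clear()  # values from the previous block are gone
        for var in inputs:
            if var in plain_temps and var not in defined:
                problems.append((index, var))
        defined.update(v for v in outputs if v in plain_temps)
        if name in BLOCK_END:
            defined.clear()  # the branch closes the current block
    return problems


ops = [
    ("mov_i32", ("t0",), ("g_a",)),
    ("mov_i32", ("loc0",), ("g_b",)),    # loc0 stands for a local temporary
    ("brcond_i32", (), ("t0", "g_b")),   # t0 is still valid here
    ("add_i32", ("t1",), ("t0", "loc0")),  # t0 did not survive the branch
    ("set_label", (), ()),
    ("mov_i32", ("g_a",), ("t1",)),      # t1 did not survive the label
]
problems = stale_temp_reads(ops, plain_temps={"t0", "t1"})
print(problems)
```

The two reads it reports are the ones that would need a local temporary
instead, matching the coding rule to use local temporaries only when a
value is needed after a jump.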