FRED, i.e., the Intel flexible return and event delivery architecture,
defines simple new transitions that change privilege level (ring
transitions).
The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.
In addition to these transitions, the FRED architecture defines a new
instruction (LKGS) for managing the state of the GS segment register.
The LKGS instruction can be used by 64-bit operating systems that do
not use the new FRED transitions.
WRMSRNS is an instruction that behaves exactly like WRMSR, with the
only difference being that it is not a serializing instruction by
default. Under certain conditions, WRMSRNS may replace WRMSR to improve
performance. FRED uses it to switch RSP0 in a faster manner.
Search for the latest FRED spec in most search engines with this search
pattern:
site:intel.com FRED (flexible return and event delivery) specification
The CPUID feature flag CPUID.(EAX=7,ECX=1):EAX[17] enumerates FRED, and
the CPUID feature flag CPUID.(EAX=7,ECX=1):EAX[18] enumerates LKGS, and
the CPUID feature flag CPUID.(EAX=7,ECX=1):EAX[19] enumerates WRMSRNS.
Add CPUID definitions for FRED/LKGS/WRMSRNS, and expose them to KVM guests.
Because FRED relies on LKGS and WRMSRNS, add that to feature dependency
map.
Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
Message-ID: <20231109072012.8078-2-xin3.li@intel.com>
[Fix order of dependencies, add dependencies from LM to FRED. - Paolo]
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
macOS 10.15 introduced the more efficient hv_vcpu_run_until() function
to supersede hv_vcpu_run(). According to the documentation, there is no
longer any reason to use the latter on modern host OS versions, especially
after 11.0 added support for an indefinite deadline.
Observed behaviour of the newer function is that as documented, it exits
much less frequently - and most of the original function’s exits seem to
have been effectively pointless.
Another reason to use the new function is that it is a prerequisite for
using newer features such as in-kernel APIC support. (Not covered by
this patch.)
This change implements the upgrade by selecting one of three code paths
at compile time: two static code paths for the new and old functions
respectively, when building for targets where the new function is either
not available, or where the built executable won’t run on older
platforms lacking the new function anyway. The third code path selects
dynamically based on runtime detected availability of the weakly-linked
symbol.
Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu>
Message-ID: <20240605112556.43193-7-phil@philjordan.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When interrupting a vCPU thread, this patch actually tells the hypervisor to
stop running guest code on that vCPU.
Calling hv_vcpu_interrupt actually forces a vCPU exit, analogously to
hv_vcpus_exit on aarch64. Alternatively, if the vCPU thread
is not
running the VM, it will immediately cause an exit when it attempts
to do so.
Previously, hvf_kick_vcpu_thread relied upon hv_vcpu_run returning very
frequently, including many spurious exits, which made it less of a problem that
nothing was actively done to stop the vCPU thread running guest code.
The newer, more efficient hv_vcpu_run_until exits much more rarely, so a true
"kick" is needed before switching to that.
Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu>
Message-ID: <20240605112556.43193-6-phil@philjordan.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When using x86 macOS Hypervisor.framework as accelerator, detection of
dirty memory regions is implemented by marking logged memory region
slots as read-only in the EPT, then setting the dirty flag when a
guest write causes a fault. The area marked dirty should then be marked
writable in order for subsequent writes to succeed without a VM exit.
However, dirty bits are tracked on a per-page basis, whereas the fault
handler was marking the whole logged memory region as writable. This
change fixes the fault handler so only the protection of the single
faulting page is marked as dirty.
(Note: the dirty page tracking appeared to work despite this error
because HVF’s hv_vcpu_run() function generated unnecessary EPT fault
exits, which ended up causing the dirty marking handler to run even
when the memory region had been marked RW. When using
hv_vcpu_run_until(), a change planned for a subsequent commit, these
spurious exits no longer occur, so dirty memory tracking malfunctions.)
Additionally, the dirty page is set to permit code execution, the same
as all other guest memory; changing memory protection from RX to RW not
RWX appears to have been an oversight.
Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu>
Reviewed-by: Roman Bolshakov <roman@roolebo.dev>
Tested-by: Roman Bolshakov <roman@roolebo.dev>
Message-ID: <20240605112556.43193-5-phil@philjordan.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
A bunch of function definitions used empty parentheses instead of (void) syntax, yielding the following warning when building with clang on macOS:
warning: a function declaration without a prototype is deprecated in all versions of C [-Wstrict-prototypes]
In addition to fixing these function headers, it also fixes what appears to be a typo causing a variable to be unused after initialisation.
warning: variable 'entry_ctls' set but not used [-Wunused-but-set-variable]
Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu>
Reviewed-by: Roman Bolshakov <roman@roolebo.dev>
Tested-by: Roman Bolshakov <roman@roolebo.dev>
Message-ID: <20240605112556.43193-3-phil@philjordan.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This patch adds the INVTSC bit to the Hypervisor.framework accelerator's
CPUID bit passthrough allow-list. Previously, specifying +invtsc in the CPU
configuration would fail with the following warning despite the host CPU
advertising the feature:
qemu-system-x86_64: warning: host doesn't support requested feature:
CPUID.80000007H:EDX.invtsc [bit 8]
x86 macOS itself relies on a fixed rate TSC for its own Mach absolute time
timestamp mechanism, so there's no reason we can't enable this bit for guests.
When the feature is enabled, a migration blocker is installed.
Signed-off-by: Phil Dennis-Jordan <phil@philjordan.eu>
Reviewed-by: Roman Bolshakov <roman@roolebo.dev>
Tested-by: Roman Bolshakov <roman@roolebo.dev>
Message-ID: <20240605112556.43193-2-phil@philjordan.eu>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
The calculation of FrameTemp is done using the size indicated by mo_pushpop()
before being written back to EBP, but the final writeback to EBP is done using
the size indicated by mo_stacksize().
In the case where mo_pushpop() is MO_32 and mo_stacksize() is MO_16 then the
final writeback to EBP is done using MO_16 which can leave junk in the top
16-bits of EBP after executing ENTER.
Change the writeback of EBP to use the same size indicated by mo_pushpop() to
ensure that the full value is written back.
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2198
Message-ID: <20240606095319.229650-5-mark.cave-ayland@ilande.co.uk>
Cc: qemu-stable@nongnu.org
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When OS/2 Warp configures its segment descriptors, many of them are configured with
the P flag clear to allow for a fault-on-demand implementation. In the case where
the stack value is POPped into the segment registers, the SP is incremented before
calling gen_helper_load_seg() to validate the segment descriptor:
IN:
0xffef2c0c: 66 07 popl %es
OP:
ld_i32 loc9,env,$0xfffffffffffffff8
sub_i32 loc9,loc9,$0x1
brcond_i32 loc9,$0x0,lt,$L0
st16_i32 loc9,env,$0xfffffffffffffff8
st8_i32 $0x1,env,$0xfffffffffffffffc
---- 0000000000000c0c 0000000000000000
ext16u_i64 loc0,rsp
add_i64 loc0,loc0,ss_base
ext32u_i64 loc0,loc0
qemu_ld_a64_i64 loc0,loc0,noat+un+leul,5
add_i64 loc3,rsp,$0x4
deposit_i64 rsp,rsp,loc3,$0x0,$0x10
extrl_i64_i32 loc5,loc0
call load_seg,$0x0,$0,env,$0x0,loc5
add_i64 rip,rip,$0x2
ext16u_i64 rip,rip
exit_tb $0x0
set_label $L0
exit_tb $0x7fff58000043
If helper_load_seg() generates a fault when validating the segment descriptor then as
the SP has already been incremented, the topmost word of the stack is overwritten by
the arguments pushed onto the stack by the CPU before taking the fault handler. As a
consequence things rapidly go wrong upon return from the fault handler due to the
corrupted stack.
Update the logic for the existing writeback condition so that a POP into the segment
registers also calls helper_load_seg() first before incrementing the SP, so that if a
fault occurs the SP remains unaltered.
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Resolves: https://gitlab.com/qemu-project/qemu/-/issues/2198
Message-ID: <20240606095319.229650-4-mark.cave-ayland@ilande.co.uk>
Fixes: cc1d28bdbe0 ("target/i386: move 00-5F opcodes to new decoder", 2024-05-07)
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Instead of directly implementing the writeback using gen_op_st_v(), use the
existing gen_writeback() function.
Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-ID: <20240606095319.229650-3-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
This will make subsequent changes a little easier to read.
Signed-off-by: Mark Cave-Ayland <mark.cave-ayland@ilande.co.uk>
Message-ID: <20240606095319.229650-2-mark.cave-ayland@ilande.co.uk>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DISAS_NORETURN suppresses the work normally done by gen_eob(), and therefore
must be used in special cases only. Document them.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
HLT uses DISAS_NORETURN because the corresponding helper calls
cpu_loop_exit(). However, while gen_eob() clears HF_RF_MASK and
synthesizes a #DB exception if single-step is active, none of this is
done by HLT. Note that the single-step trap is generated after the halt
is finished.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
PAUSE uses DISAS_NORETURN because the corresponding helper
calls cpu_loop_exit(). However, while HLT clear HF_INHIBIT_IRQ_MASK
to correctly handle "STI; HLT", the same is missing from PAUSE.
And also gen_eob() clears HF_RF_MASK and synthesizes a #DB exception
if single-step is active; none of this is done by HLT and PAUSE.
Start fixing PAUSE, HLT will follow.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
From vm entry to exit, VMRUN is handled as a single instruction. It
uses DISAS_NORETURN in order to avoid processing TF or RF before
the first instruction executes in the guest. However, the corresponding
handling is missing in vmexit. Add it, and at the same time reorganize
the comments with quotes from the manual about the tasks performed
by a #VMEXIT.
Another gen_eob() task that is missing in VMRUN is preparing the
HF_INHIBIT_IRQ flag for the next instruction, in this case by loading
it from the VMCB control state.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
If the required DR7 (either from the VMCB or from the host save
area) disables a breakpoint that was enabled prior to vmentry
or vmexit, it is left enabled and will trigger EXCP_DEBUG.
This causes a spurious #DB on the next crossing of the breakpoint.
To disable it, vmentry/vmexit must use cpu_x86_update_dr7
to load DR7.
Because cpu_x86_update_dr7 takes a 32-bit argument, check
reserved bits prior to calling cpu_x86_update_dr7, and do the
same for DR6 as well for consistency.
This scenario is tested by the "host_rflags" test in kvm-unit-tests.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
DR7.GD triggers a #DB exception on any access to debug registers.
The GD bit is cleared so that the #DB handler itself can access
the debug registers.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use decode.c's support for intercepts, doing the check in TCG-generated
code rather than the helper. This is cleaner because it allows removing
the eip_addend argument to helper_pause(), even though it adds a bit of
bloat for opcode 0x90's new decoding function.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Use decode.c's support for intercepts, doing the check in TCG-generated
code rather than the helper. This is cleaner because it allows removing
the eip_addend argument to helper_hlt().
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
ICEBP generates a trap-like exception, while gen_exception() produces
a fault. Resurrect gen_update_eip_next() to implement the desired
semantics.
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
When preparing an exception stack frame for a fault exception, the value
pushed for RF is 1. Take that into account. The same should be true
of interrupts for repeated string instructions, but the situation there
is complicated.
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
description:
loongarch_cpu_dump_state() want to dump all loongarch cpu
state registers, but there is a tiny typographical error when
printing "PRCFG2".
Cc: qemu-stable@nongnu.org
Signed-off-by: lanyanzhi <lanyanzhi22b@ict.ac.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20240604073831.666690-1-lanyanzhi22b@ict.ac.cn>
Signed-off-by: Song Gao <gaosong@loongson.cn>
(cherry picked from commit 78f932ea1f7b3b9b0ac628dc2a91281318fe51fa)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
Features check of CPUID_SSE and CPUID_SSE2 should use cpuid_features,
rather than cpuid_ext_features.
Signed-off-by: Xinyu Li <lixinyu20s@ict.ac.cn>
Reviewed-by: Zhao Liu <zhao1.liu@intel.com>
Message-ID: <20240602100904.2137939-1-lixinyu20s@ict.ac.cn>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit da7c95920d027dbb00c6879c1da0216b19509191)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
xsave.flat checks that "executing the XSETBV instruction causes a general-
protection fault (#GP) if ECX = 0 and EAX[2:1] has the value 10b". QEMU allows
that option, so the test fails. Add the condition.
Cc: qemu-stable@nongnu.org
Fixes: 892544317fe ("target/i386: implement XSAVE and XRSTOR of AVX registers", 2022-10-18)
Reported-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
(cherry picked from commit 7604bbc2d87d153e65e38cf2d671a5a9a35917b1)
Signed-off-by: Michael Tokarev <mjt@tls.msk.ru>
description:
loongarch_cpu_dump_state() want to dump all loongarch cpu
state registers, but there is a tiny typographical error when
printing "PRCFG2".
Cc: qemu-stable@nongnu.org
Signed-off-by: lanyanzhi <lanyanzhi22b@ict.ac.cn>
Reviewed-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Song Gao <gaosong@loongson.cn>
Message-Id: <20240604073831.666690-1-lanyanzhi22b@ict.ac.cn>
Signed-off-by: Song Gao <gaosong@loongson.cn>
This patch adds a new board attribute 'v-eiointc'.
A value of true enables the virt extended I/O interrupt controller.
VMs working in kvm mode have 'v-eiointc' enabled by default.
Signed-off-by: Song Gao <gaosong@loongson.cn>
Reviewed-by: Bibo Mao <maobibo@loongson.cn>
Message-Id: <20240528083855.1912757-4-gaosong@loongson.cn>
Ignore the "monitor" portion and treat them the same
as their base ASIs.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
VIS4 completes the set, adding missing signed 8-bit ops
and missing unsigned 16 and 32-bit ops.
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>