From 50d4fa86dbbf8f3cde7e81505222dd4653875c70 Mon Sep 17 00:00:00 2001 From: Laszlo Ersek Date: Fri, 14 Jun 2013 07:39:25 +0000 Subject: OvmfPkg: VirtioNetDxe: add technical notes Contributed-under: TianoCore Contribution Agreement 1.0 Signed-off-by: Laszlo Ersek Reviewed-by: Jordan Justen Reviewed-by: Stefan Hajnoczi git-svn-id: https://svn.code.sf.net/p/edk2/code/trunk/edk2@14404 6f19259b-4bc3-4df7-8a09-765794883524 --- OvmfPkg/VirtioNetDxe/TechNotes.txt | 355 +++++++++++++++++++++++++++++++++++++ 1 file changed, 355 insertions(+) create mode 100644 OvmfPkg/VirtioNetDxe/TechNotes.txt (limited to 'OvmfPkg') diff --git a/OvmfPkg/VirtioNetDxe/TechNotes.txt b/OvmfPkg/VirtioNetDxe/TechNotes.txt new file mode 100644 index 0000000000..9c1dfe6a77 --- /dev/null +++ b/OvmfPkg/VirtioNetDxe/TechNotes.txt @@ -0,0 +1,355 @@ +## @file +# +# Technical notes for the virtio-net driver. +# +# Copyright (C) 2013, Red Hat, Inc. +# +# This program and the accompanying materials are licensed and made available +# under the terms and conditions of the BSD License which accompanies this +# distribution. The full text of the license may be found at +# http://opensource.org/licenses/bsd-license.php +# +# THE PROGRAM IS DISTRIBUTED UNDER THE BSD LICENSE ON AN "AS IS" BASIS, WITHOUT +# WARRANTIES OR REPRESENTATIONS OF ANY KIND, EITHER EXPRESS OR IMPLIED. +# +## + +Disclaimer +---------- + +All statements concerning standards and specifications are informative and not +normative. They are made in good faith. Corrections are most welcome on the +edk2-devel mailing list. + +The following documents have been perused while writing the driver and this +document: +- Unified Extensible Firmware Interface Specification, Version 2.3.1, Errata C; + June 27, 2012 +- Driver Writer's Guide for UEFI 2.3.1, 03/08/2012, Version 1.01; +- Virtio PCI Card Specification, v0.9.5 DRAFT, 2012 May 7. + + +Summary +------- + +The VirtioNetDxe UEFI_DRIVER implements the Simple Network Protocol for +virtio-net devices. Higher level protocols are automatically installed on top +of it by the DXE Core / the ConnectController() boot service, enabling for +virtio-net devices eg. DHCP configuration, TCP transfers with edk2 StdLib +applications, and PXE booting in OVMF. + + +UEFI driver structure +--------------------- + +A driver instance, belonging to a given virtio-net device, can be in one of +four states at any time. The states stack up as follows below. The state +transitions are labeled with the primary function (and its important callees +faithfully indented) that implement the transition. + + | ^ + | | + [DriverBinding.c] | | [DriverBinding.c] + VirtioNetDriverBindingStart | | VirtioNetDriverBindingStop + VirtioNetSnpPopulate | | VirtioNetSnpEvacuate + VirtioNetGetFeatures | | + v | + +-------------------------+ + | EfiSimpleNetworkStopped | + +-------------------------+ + | ^ + [SnpStart.c] | | [SnpStop.c] + VirtioNetStart | | VirtioNetStop + | | + v | + +-------------------------+ + | EfiSimpleNetworkStarted | + +-------------------------+ + | ^ + [SnpInitialize.c] | | [SnpShutdown.c] + VirtioNetInitialize | | VirtioNetShutdown + VirtioNetInitRing {Rx, Tx} | | VirtioNetShutdownRx [SnpSharedHelpers.c] + VirtioRingInit | | VirtioNetShutdownTx [SnpSharedHelpers.c] + VirtioNetInitTx | | VirtioRingUninit {Tx, Rx} + VirtioNetInitRx | | + v | + +-----------------------------+ + | EfiSimpleNetworkInitialized | + +-----------------------------+ + +The state at the top means "nonexistent" and is hence unnamed on the diagram -- +a driver instance actually doesn't exist at that point. The transition +functions out of and into that state implement the Driver Binding Protocol. + +The lower three states characterize an existent driver instance and are all +states defined by the Simple Network Protocol. The transition functions between +them are member functions of the Simple Network Protocol. + +Each transition function validates its expected source state and its +parameters. For example, VirtioNetDriverBindingStop will refuse to disconnect +from the controller unless it's in EfiSimpleNetworkStopped. + + +Driver instance states (Simple Network Protocol) +------------------------------------------------ + +In the EfiSimpleNetworkStopped state, the virtio-net device is (has been) +re-set. No resources are allocated for networking / traffic purposes. The MAC +address and other device attributes have been retrieved from the device (this +is necessary for completing the VirtioNetDriverBindingStart transition). + +The EfiSimpleNetworkStarted is completely identical to the +EfiSimpleNetworkStopped state for virtio-net, in the functional and +resource-usage sense. This state is mandated / provided by the Simple Network +Protocol for flexibility that the virtio-net driver doesn't exploit. + +In particular, the EfiSimpleNetworkStarted state is the target of the Shutdown +SNP member function, and must therefore correspond to a hardware configuration +where "[it] is safe for another driver to initialize". (Clearly another UEFI +driver could not do that due to the exclusivity of the driver binding that +VirtioNetDriverBindingStart() installs, but a later OS driver might qualify.) + +The EfiSimpleNetworkInitialized state is the live state of the virtio NIC / the +driver instance. Virtio and other resources required for network traffic have +been allocated, and the following SNP member functions are available (in +addition to VirtioNetShutdown which leaves the state): + +- VirtioNetReceive [SnpReceive.c]: poll the virtio NIC for an Rx packet that + may have arrived asynchronously; + +- VirtioNetTransmit [SnpTransmit.c]: queue a Tx packet for asynchronous + transmission (meant to be used together with VirtioNetGetStatus); + +- VirtioNetGetStatus [SnpGetStatus.c]: query link status and status of pending + Tx packets; + +- VirtioNetMcastIpToMac [SnpMcastIpToMac.c]: transform a multicast IPv4/IPv6 + address into a multicast MAC address; + +- VirtioNetReceiveFilters [SnpReceiveFilters.c]: emulate unicast / multicast / + broadcast filter configuration (not their actual effect -- a more liberal + filter setting than requested is allowed by the UEFI specification). + +The following SNP member functions are not supported [SnpUnsupported.c]: + +- VirtioNetReset: reinitialize the virtio NIC without shutting it down (a loop + from/to EfiSimpleNetworkInitialized); + +- VirtioNetStationAddress: assign a new MAC address to the virtio NIC, + +- VirtioNetStatistics: collect statistics, + +- VirtioNetNvData: access non-volatile data on the virtio NIC. + +Missing support for these functions is allowed by the UEFI specification and +doesn't seem to trip up higher level protocols. + + +Events and task priority levels +------------------------------- + +The UEFI specification defines a sophisticated mechanism for asynchronous +events / callbacks (see "6.1 Event, Timer, and Task Priority Services" for +details). Such callbacks work like software interrupts, and some notion of +locking / masking is important to implement critical sections (atomic or +exclusive access to data or a device). This notion is defined as Task Priority +Levels. + +The virtio-net driver for OVMF must concern itself with events for two reasons: + +- The Simple Network Protocol provides its clients with a (non-optional) WAIT + type event called WaitForPacket: it allows them to check or wait for Rx + packets by polling or blocking on this event. (This functionality overlaps + with the Receive member function.) The event is available to clients starting + with EfiSimpleNetworkStopped (inclusive). + + The virtio-net driver is informed about such client polling or blockage by + receiving an asynchronous callback (a software interrupt). In the callback + function the driver must interrogate the driver instance state, and if it is + EfiSimpleNetworkInitialized, access the Rx queue and see if any packets are + available for consumption. If so, it must signal the WaitForPacket WAIT type + event, waking the client. + + For simplicity and safety, all parts of the virtio-net driver that access any + bit of the driver instance (data or device) run at the TPL_CALLBACK level. + This is the highest level allowed for an SNP implementation, and all code + protected in this manner satisfies even stricter non-blocking requirements + than what's documented for TPL_CALLBACK. + + The task priority level for the WaitForPacket callback too is set by the + driver, the choice is TPL_CALLBACK again. This in effect serializes the + WaitForPacket callback (VirtioNetIsPacketAvailable [Events.c]) with "normal" + parts of the driver. + +- According to the Driver Writer's Guide, a network driver should install a + callback function for the global EXIT_BOOT_SERVICES event (a special NOTIFY + type event). When the ExitBootServices() boot service has cleaned up internal + firmware state and is about to pass control to the OS, any network driver has + to stop any in-flight DMA transfers, lest it corrupts OS memory. For this + reason EXIT_BOOT_SERVICES is emitted and the network driver must abort + in-flight DMA transfers. + + This callback (VirtioNetExitBoot) is synchronized with the rest of the driver + code just the same as explained for WaitForPacket. In + EfiSimpleNetworkInitialized state it resets the virtio NIC, halting all data + transfer. After the callback returns, no further driver code is expected to + be scheduled. + + +Virtio internals -- Rx +---------------------- + +Requests (Rx and Tx alike) are always submitted by the guest and processed by +the host. For Tx, processing means transmission. For Rx, processing means +filling in the request with an incoming packet. Submitted requests exist on the +"Available Ring", and answered (processed) requests show up on the "Used Ring". + +Packet data includes the media (Ethernet) header: destination MAC, source MAC, +and Ethertype (14 bytes total). + +The following structures implement packet reception. Most of them are defined +in the Virtio specification, the only driver-specific trait here is the static +pre-configuration of the two-part descriptor chains, in VirtioNetInitRx. The +diagram is simplified. + + Available Index Available Index + last processed incremented + by the host by the guest + v -------> v +Available +-------+-------+-------+-------+-------+ +Ring |DescIdx|DescIdx|DescIdx|DescIdx|DescIdx| + +-------+-------+-------+-------+-------+ + =D6 =D2 + + D2 D3 D4 D5 D6 D7 +Descr. +----------+----------++----------+----------++----------+----------+ +Table |Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx||Adr:Len:Nx|Adr:Len:Nx| + +----------+----------++----------+----------++----------+----------+ + =A2 =D3 =A3 =A4 =D5 =A5 =A6 =D7 =A7 + + + A2 A3 A4 A5 A6 A7 +Receive +---------------+---------------+---------------+ +Destination |vnet hdr:packet|vnet hdr:packet|vnet hdr:packet| +Area +---------------+---------------+---------------+ + + Used Index Used Index incremented + last processed by the guest by the host + v -------> v +Used +-----------+-----------+-----------+-----------+-----------+ +Ring |DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len|DescIdx:Len| + +-----------+-----------+-----------+-----------+-----------+ + =D4 + +In VirtioNetInitRx, the guest allocates the fixed size Receive Destination +Area, which accommodates all packets delivered asynchronously by the host. To +each packet, a slice of this area is dedicated; each slice is further +subdivided into virtio-net request header and network packet data. The +(guest-physical) addresses of these sub-slices are denoted with A2, A3, A4 and +so on. Importantly, an even-subscript "A" always belongs to a virtio-net +request header, while an odd-subscript "A" always belongs to a packet +sub-slice. + +Furthermore, the guest lays out a static pattern in the Descriptor Table. For +each packet that can be in-flight or already arrived from the host, +VirtioNetInitRx sets up a separate, two-part descriptor chain. For packet N, +the Nth descriptor chain is set up as follows: + +- the first (=head) descriptor, with even index, points to the fixed-size + sub-slice receiving the virtio-net request header, + +- the second descriptor (with odd index) points to the fixed (1514 byte) size + sub-slice receiving the packet data, + +- a link from the first (head) descriptor in the chain is established to the + second (tail) descriptor in the chain. + +Finally, the guest populates the Available Ring with the indices of the head +descriptors. All descriptor indices on both the Available Ring and the Used +Ring are even. + +Packet reception occurs as follows: + +- The host consumes a descriptor index off the Available Ring. This index is + even (=2*N), and fingers the head descriptor of the chain belonging to packet + N. + +- The host reads the descriptors D(2*N) and -- following the Next link there + --- D(2*N+1), and stores the virtio-net request header at A(2*N), and the + packet data at A(2*N+1). + +- The host places the index of the head descriptor, 2*N, onto the Used Ring, + and sets the Len field in the same Used Ring Element to the total number of + bytes transferred for the entire descriptor chain. This enables the guest to + identify the length of Rx packets. + +- VirtioNetReceive polls the Used Ring. If a new Used Ring Element shows up, it + copies the data out to the caller, and recycles the index of the head + descriptor (ie. 2*N) to the Available Ring. + +- Because the host can process (answer) Rx requests in any order theoretically, + the order of head descriptor indices on each of the Available Ring and the + Used Ring is virtually random. (Except right after the initial population in + VirtioNetInitRx, when the Available Ring is full and increasing, and the Used + Ring is empty.) + +- If the Available Ring is empty, the host is forced to drop packets. If the + Used Ring is empty, VirtioNetReceive returns EFI_NOT_READY (no packet + available). + + +Virtio internals -- Tx +---------------------- + +The transmission structure erected by VirtioNetInitTx is similar, it differs +in the following: + +- There is no Receive Destination Area. + +- Each head descriptor, D(2*N), points to a read-only virtio-net request header + that is shared by all of the head descriptors. This virtio-net request header + is never modified by the host. + +- Each tail descriptor is re-pointed to the caller-supplied packet buffer + whenever VirtioNetTransmit places the corresponding head descriptor on the + Available Ring. The caller is responsible to hang on to the unmodified buffer + until it is reported transmitted by VirtioNetGetStatus. + +Steps of packet transmission: + +- Client code calls VirtioNetTransmit. VirtioNetTransmit tracks free descriptor + chains by keeping the indices of their head descriptors in a stack that is + private to the driver instance. All elements of the stack are even. + +- If the stack is empty (that is, each descriptor chain, in isolation, is + either pending transmission, or has been processed by the host but not + yet recycled by a VirtioNetGetStatus call), then VirtioNetTransmit returns + EFI_NOT_READY. + +- Otherwise the index of a free chain's head descriptor is popped from the + stack. The linked tail descriptor is re-pointed as discussed above. The head + descriptor's index is pushed on the Available Ring. + +- The host moves the head descriptor index from the Available Ring to the Used + Ring when it transmits the packet. + +- Client code calls VirtioNetGetStatus. In case the Used Ring is empty, the + function reports no Tx completion. Otherwise, a head descriptor's index is + consumed from the Used Ring and recycled to the private stack. The client + code's original packet buffer address is fetched from the tail descriptor + (where it has been stored at VirtioNetTransmit time) and returned to the + caller. + +- The Len field of the Used Ring Element is not checked. The host is assumed to + have transmitted the entire packet -- VirtioNetTransmit had forced it below + 1514 bytes (inclusive). The Virtio specification suggests this packet size is + always accepted (and a lower MTU could be encountered on any later hop as + well). Additionally, there's no good way to report a short transmit via + VirtioNetGetStatus; EFI_DEVICE_ERROR seems too serious from the specification + and higher level protocols could interpret it as a fatal condition. + +- The host can theoretically reorder head descriptor indices when moving them + from the Available Ring to the Used Ring (out of order transmission). Because + of this (and the choice of a stack over a list for free descriptor chain + tracking) the order of head descriptor indices on either Ring is + unpredictable. -- cgit v1.2.3