eBPF for Linux Admins: Part I
eBPF - This article is part of a series.
Pre-requisites#
To get the most out of this article, it’s helpful to have some background in Linux networking and packet tracing with tcpdump.
Some of the internals were intentionally excluded to simplify the topic.
Classic BPF#
Let’s take a scenario where you want to observe all ARP packets arriving at the NIC. Each packet first lands in the network device hardware and is then placed in a receive queue (RX_RING) inside the kernel.
For a user to see ARP packets, the packets need to be copied from kernel space to user space, and each packet then has to be checked against the wanted packet type, ARP.
If the system copies every packet that arrives in the RX_RING to user space and only then checks the packet type, it pays for a kernel-to-user copy, and the context switching that goes with it, even for packets that are discarded right away. That is inefficient and hurts system performance.
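The cost difference can be illustrated with a small simulation. This is only a sketch on synthetic frames: `make_frame`, `filter_in_user_space`, and `filter_before_copy` are hypothetical helpers, and `bytes(frame)` merely stands in for the real kernel-to-user copy.

```python
import struct

ETHERTYPE_ARP = 0x0806
ETHERTYPE_IPV4 = 0x0800


def make_frame(ethertype: int) -> bytes:
    """Build a synthetic Ethernet II frame (made-up addresses and payload)."""
    dst = b"\xff" * 6
    src = b"\x02\x00\x00\x00\x00\x01"
    return dst + src + struct.pack("!H", ethertype) + b"\x00" * 46


def filter_in_user_space(frames):
    """Copy every frame first, then check its type: the costly approach."""
    copies, matches = 0, []
    for frame in frames:
        user_copy = bytes(frame)  # stands in for the kernel-to-user copy
        copies += 1
        (ethertype,) = struct.unpack_from("!H", user_copy, 12)
        if ethertype == ETHERTYPE_ARP:
            matches.append(user_copy)
    return copies, matches


def filter_before_copy(frames):
    """Check the type first and copy only the matching frames."""
    copies, matches = 0, []
    for frame in frames:
        (ethertype,) = struct.unpack_from("!H", frame, 12)
        if ethertype == ETHERTYPE_ARP:
            matches.append(bytes(frame))
            copies += 1
    return copies, matches


# Nine IPv4 frames and one ARP frame sitting in the receive queue.
rx_ring = [make_frame(ETHERTYPE_IPV4)] * 9 + [make_frame(ETHERTYPE_ARP)]
copies_naive, _ = filter_in_user_space(rx_ring)
copies_filtered, _ = filter_before_copy(rx_ring)
print(copies_naive, copies_filtered)
```

With this input the first approach performs ten copies and the second performs one, while both return the same single ARP frame.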
So how can we filter packets on the way, within kernel space, and copy only the matching packets to user space?
This is where BPF, the Berkeley Packet Filter, comes in.
The BPF virtual machine is a pseudo VM inside the Linux kernel. For the sake of simplicity, you can think of it as something like the JavaScript engine inside your browser!
One of the tools in Linux that uses BPF is tcpdump, which relies on it for packet filtering.
The BPF VM supports a limited set of instructions, and there are many restrictions on how they can be used.
Below are the registers in the BPF VM (or pseudo-machine):
- A 32-bit wide accumulator [A], where the contents of the packet get loaded.
- A 32-bit wide index register [X].
- A scratch memory area of 16 32-bit registers.
- A program counter.
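The register set listed above can be modelled in a few lines. This is a minimal sketch for illustration, not the kernel's data structure; the class name `BpfState` and its methods are hypothetical.

```python
from dataclasses import dataclass, field

MASK32 = 0xFFFFFFFF   # registers are 32 bits wide
SCRATCH_SLOTS = 16    # size of the scratch memory area


@dataclass
class BpfState:
    a: int = 0    # accumulator [A]
    x: int = 0    # index register [X]
    pc: int = 0   # program counter
    mem: list = field(default_factory=lambda: [0] * SCRATCH_SLOTS)

    def load_a(self, value: int) -> None:
        """Load a value into A, keeping only the low 32 bits."""
        self.a = value & MASK32

    def store_a(self, slot: int) -> None:
        """Store A into one scratch slot."""
        self.mem[slot] = self.a


state = BpfState()
state.load_a(0x1_0000_0806)  # bits above bit 31 are dropped
state.store_a(3)
print(hex(state.a), len(state.mem), hex(state.mem[3]))
```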
The filters we pass to the tcpdump command are converted into “byte code” and then injected directly into the kernel. (More about byte code is coming later in this article.)
The load instructions load packet data into the accumulator, and then we can examine the packet in the BPF VM.
Let’s examine the code generated by the tcpdump command that filters the ARP packets coming to interface ens33.
[root@localhost ~]# tcpdump -i ens33 arp -d
(000) ldh [12]
(001) jeq #0x806 jt 2 jf 3
(002) ret #262144
(003) ret #0
[root@localhost ~]#
Explanation
(000) ldh - Load a half word (16 bits) from offset 12 of the packet into the accumulator ; this skips the 6 byte dst mac and the 6 byte src mac.
(001) jeq - If the accumulator value is 0x806 ; ie an ARP packet, then jump to 2 else jump to 3
(002) ret - Accept the packet and return up to 262144 bytes of it ; ie the entire packet or the [max snapshot length](https://github.com/the-tcpdump-group/tcpdump/blob/tcpdump-4.9/netdissect.h#L263)
(003) ret - Return 0 ; ie pass nothing to user space
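To make the control flow concrete, here is a toy interpreter for just these four instructions. It is a sketch, not the kernel's implementation: it understands only `ldh`, `jeq`, and `ret`, uses the absolute jump targets exactly as the `-d` listing prints them, and assumes (as classic BPF does) that a load beyond the end of the packet rejects the packet. The names `PROGRAM` and `run` are illustrative.

```python
import struct

# (mnemonic, k, jt, jf) for each line of the listing above.
PROGRAM = [
    ("ldh", 12, None, None),
    ("jeq", 0x806, 2, 3),
    ("ret", 262144, None, None),
    ("ret", 0, None, None),
]


def run(program, packet: bytes) -> int:
    """Return how many bytes of the packet to pass on (0 means drop)."""
    a = 0   # accumulator
    pc = 0  # program counter
    while True:
        op, k, jt, jf = program[pc]
        if op == "ldh":
            if k + 2 > len(packet):
                return 0  # out-of-bounds load: reject the packet
            (a,) = struct.unpack_from("!H", packet, k)  # big-endian half word
            pc += 1
        elif op == "jeq":
            pc = jt if a == k else jf
        elif op == "ret":
            return k
        else:
            raise ValueError(f"unsupported instruction: {op}")


# Synthetic frames: 6-byte dst MAC, 6-byte src MAC, EtherType, padding.
macs = b"\xff" * 6 + b"\x02\x00\x00\x00\x00\x01"
arp_frame = macs + b"\x08\x06" + b"\x00" * 28
ipv4_frame = macs + b"\x08\x00" + b"\x00" * 28

print(run(PROGRAM, arp_frame))   # 262144
print(run(PROGRAM, ipv4_frame))  # 0
```

The ARP frame takes the path 000, 001, 002 and is accepted; the IPv4 frame takes 000, 001, 003 and is dropped.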
You can find more details on the inner workings of BPF in this USENIX paper.
So the above filter skips the destination and source MAC fields and then loads 16 bits from offset 12, which is the packet type (EtherType) field.
If those 16 bits equal 0x806 (0000100000000110 in binary), the packet is an ARP packet!
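A few points to note:
- An Ethernet type II frame starts with a 6-byte destination MAC, followed by a 6-byte source MAC and a 2-byte type field; the payload comes after that.
- Multi-byte fields in the Ethernet header are big-endian.
- In a 32-bit system, a full word is 32 bits and a half word is 16 bits.
- 1 byte = 8 bits, 2 bytes = 16 bits.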
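The byte-order point is easy to check by hand. The sketch below reads the two bytes at offset 12 of a made-up 14-byte Ethernet header, once as big-endian and once as little-endian; only the big-endian read gives the ARP type value.

```python
import struct

# Made-up header: dst MAC, src MAC, then the two type bytes 0x08 0x06.
header = bytes.fromhex("ffffffffffff" "020000000001" "0806")

(big,) = struct.unpack_from(">H", header, 12)     # big-endian half word
(little,) = struct.unpack_from("<H", header, 12)  # little-endian half word

print(hex(big), big)        # 0x806 2054
print(hex(little), little)  # 0x608 1544
print(format(big, "016b"))  # 0000100000000110
```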
You can find the hex value of each Ethernet packet type in the IANA registry. The ARP entry looks like this:
----------------------------------------------------------------------------------------------------------------------------------
Ethertype (decimal)  Ethertype (hex)  Exp. Ethernet (decimal)  Exp. Ethernet (octal)  Description                        Reference
----------------------------------------------------------------------------------------------------------------------------------
2054                 0806             -                        -                      Address Resolution Protocol (ARP)  [RFC7042]
----------------------------------------------------------------------------------------------------------------------------------
The Byte Code#
The BPF program we discussed above can be converted to byte code.
What is byte code?
Byte code is a compact, platform-independent instruction set designed for execution by a virtual machine rather than directly by a physical CPU. In this case the VM is the BPF pseudo VM sitting inside the kernel.
User space can inject this bytecode into the BPF pseudo VM. The kernel then either interprets it or, when its JIT compiler is enabled, translates it into architecture-dependent machine code that runs directly on the hardware.
We can generate the bytecode for the BPF instructions with tcpdump itself.
[root@localhost ~]# tcpdump -i ens33 arp -ddd
4
40 0 0 12
21 0 1 2054
6 0 0 262144
6 0 0 0
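In this output the first line is the instruction count, and each following line is one instruction written as four decimal numbers: opcode, jump-if-true offset, jump-if-false offset, and the operand `k`. The sketch below parses that text and turns it back into the earlier listing. It handles only the three opcodes that appear here (0x28, 0x15, 0x06); `decode` and `disassemble` are hypothetical helper names. Note that the jump fields in the bytecode are relative to the next instruction, while the human-readable listing shows absolute targets.

```python
DDD_OUTPUT = """4
40 0 0 12
21 0 1 2054
6 0 0 262144
6 0 0 0"""


def decode(text: str):
    """Parse the decimal dump into a list of (code, jt, jf, k) tuples."""
    lines = text.strip().splitlines()
    count = int(lines[0])
    insns = [tuple(int(part) for part in line.split()) for line in lines[1:]]
    if len(insns) != count:
        raise ValueError("instruction count does not match the first line")
    return insns


def disassemble(insns):
    """Render the three opcodes used here in the style of the earlier listing."""
    out = []
    for index, (code, jt, jf, k) in enumerate(insns):
        if code == 0x28:    # load half word from absolute offset k
            out.append(f"({index:03d}) ldh [{k}]")
        elif code == 0x15:  # jump if A == k; offsets count from the next insn
            out.append(
                f"({index:03d}) jeq #{k:#x} jt {index + 1 + jt} jf {index + 1 + jf}"
            )
        elif code == 0x06:  # return the constant k
            out.append(f"({index:03d}) ret #{k}")
        else:
            raise ValueError(f"opcode {code:#x} not covered by this sketch")
    return out


listing = disassemble(decode(DDD_OUTPUT))
print("\n".join(listing))
```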
The bytecode can be injected into the system in different ways. The tcpdump utility has its own logic for this operation, which it gets from the libpcap library.
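One way to inject classic BPF bytecode on Linux is the `SO_ATTACH_FILTER` socket option. The sketch below is an illustration of that mechanism, not a description of tcpdump's own code path. Assumptions to note: the constant value 26 is taken from the generic Linux socket headers and may differ on some architectures, `open_arp_socket` needs root (or `CAP_NET_RAW`) and is therefore only defined here, not called, and the function names are hypothetical.

```python
import ctypes
import socket
import struct

SO_ATTACH_FILTER = 26  # assumed value from asm-generic/socket.h
ETH_P_ALL = 0x0003     # receive every protocol on the packet socket

# The four instructions from the `-ddd` output: (code, jt, jf, k).
ARP_FILTER = [(40, 0, 0, 12), (21, 0, 1, 2054), (6, 0, 0, 262144), (6, 0, 0, 0)]


def pack_filter(insns) -> bytes:
    """Pack instructions as an array of struct sock_filter (u16, u8, u8, u32)."""
    return b"".join(struct.pack("HBBI", *insn) for insn in insns)


def attach(sock, insns) -> None:
    """Attach the program to a socket through struct sock_fprog."""
    blob = ctypes.create_string_buffer(pack_filter(insns))
    # struct sock_fprog { unsigned short len; struct sock_filter *filter; }
    fprog = struct.pack("HP", len(insns), ctypes.addressof(blob))
    sock.setsockopt(socket.SOL_SOCKET, SO_ATTACH_FILTER, fprog)


def open_arp_socket(interface: str):
    """Open a raw packet socket that only delivers ARP frames (Linux, root)."""
    sock = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(ETH_P_ALL))
    sock.bind((interface, 0))
    attach(sock, ARP_FILTER)
    return sock


packed = pack_filter(ARP_FILTER)
print(len(packed))  # 8 bytes per instruction
```

With sufficient privileges, `open_arp_socket("ens33").recv(65535)` would block until an ARP frame arrives on that interface; every other frame is dropped inside the kernel before it reaches the program.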
With that, we conclude Part 1 of eBPF for Linux Admins.
In the next part, we will discuss eXpress Data Path (XDP) and eBPF.