Portal:DeveloperDocs/nftables internals

This page contains information for Netfilter developers on how nftables internals work.

The kernel subsystem

The nf_tables kernel subsystem contains two key components:

  • the netlink API (i.e., the control plane API)
  • the nf_tables core (i.e., the data plane engine)

Other components, such as external modules, are also in place and are intermixed with both the API and the core.

Generally speaking, the nf_tables subsystem implements a virtual machine of low-level expressions that operate on network packets.

TODO: add info.

nf_tables netlink API

The source code is mostly in net/netfilter/nf_tables_api.c.

TODO: add info.

nf_tables core

The source code is mostly in net/netfilter/nf_tables_core.c.

This file contains one of the most important functions in the core: nft_do_chain(). In a nutshell, this is the function that evaluates network packets against the ruleset.

The logic in this function is rather simple:

  • for each rule in the chain
    • for each low level expression in the rule
      • evaluate the packet against the expression
    • evaluate expression return code (break, continue, drop, accept, jump, goto, etc)
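
The loop above can be sketched as a small, self-contained C model. This is only an illustration under stated assumptions, not the kernel code: every toy_* name is hypothetical, the verdict codes are invented, and jump/goto handling is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy verdicts (hypothetical); the kernel uses its own verdict codes. */
enum toy_verdict { TOY_CONTINUE, TOY_BREAK, TOY_ACCEPT, TOY_DROP };

struct toy_pkt {
	uint16_t dport;
};

struct toy_regs {
	enum toy_verdict verdict;
	uint32_t data[4];
};

struct toy_expr {
	void (*eval)(const struct toy_expr *e, struct toy_regs *regs,
		     const struct toy_pkt *pkt);
	uint32_t arg;
};

struct toy_rule {
	const struct toy_expr *exprs;
	size_t num_exprs;
};

/* Load the packet's destination port into register 0. */
void toy_load_dport(const struct toy_expr *e, struct toy_regs *regs,
		    const struct toy_pkt *pkt)
{
	(void)e;
	regs->data[0] = pkt->dport;
}

/* Stop evaluating this rule unless register 0 equals the argument. */
void toy_cmp_eq(const struct toy_expr *e, struct toy_regs *regs,
		const struct toy_pkt *pkt)
{
	(void)pkt;
	if (regs->data[0] != e->arg)
		regs->verdict = TOY_BREAK;
}

/* Store a final verdict (TOY_ACCEPT or TOY_DROP) in the verdict register. */
void toy_set_verdict(const struct toy_expr *e, struct toy_regs *regs,
		     const struct toy_pkt *pkt)
{
	(void)pkt;
	assert(e->arg <= TOY_DROP);
	regs->verdict = (enum toy_verdict)e->arg;
}

/* For each rule, run its expressions until one changes the verdict. */
enum toy_verdict toy_do_chain(const struct toy_rule *rules, size_t num_rules,
			      const struct toy_pkt *pkt,
			      enum toy_verdict policy)
{
	struct toy_regs regs = { .verdict = TOY_CONTINUE };

	for (size_t i = 0; i < num_rules; i++) {
		regs.verdict = TOY_CONTINUE;
		for (size_t j = 0; j < rules[i].num_exprs; j++) {
			const struct toy_expr *e = &rules[i].exprs[j];

			e->eval(e, &regs, pkt);
			if (regs.verdict != TOY_CONTINUE)
				break;
		}
		/* A final verdict ends the chain; otherwise try the next rule. */
		if (regs.verdict == TOY_ACCEPT || regs.verdict == TOY_DROP)
			return regs.verdict;
	}
	return policy;
}
```

A rule made of toy_load_dport, toy_cmp_eq with argument 22 and toy_set_verdict with TOY_DROP drops packets to port 22. Any other packet falls through to the policy passed by the caller.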

TODO: add info.


There are many low-level expressions that allow us to operate on network packets in different ways. You can think of these low-level expressions as assembly-like instructions.

  • nft_immediate: loads an immediate value into a register.
  • nft_cmp: compares given data with data from a given register.
  • nft_payload: sets/gets arbitrary data in packet headers.
  • nft_bitwise: performs bitwise operations on data in a given register.
  • nft_byteorder: performs byte order operations on data in a given register.
  • nft_counter: a basic packet/byte counter that is incremented every time it is evaluated for a packet.
  • nft_meta: sets/gets packet meta information, such as related interfaces, timestamps, etc.
  • nft_lookup: searches a set for data from a given register (the key). If the set is a map/vmap, returns the value for that key.
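
To make the assembly analogy concrete, here is a minimal sketch of a register machine with four instructions modelled loosely on nft_payload, nft_bitwise, nft_cmp and nft_counter. All toy_* and TOY_* names are hypothetical. As a simplification, the payload load converts four header bytes into a host-order integer, which is an assumption of this sketch and not a description of how the kernel stores register data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NUM_REGS 4

enum toy_op { TOY_OP_PAYLOAD, TOY_OP_BITWISE, TOY_OP_CMP, TOY_OP_COUNTER };

struct toy_insn {
	enum toy_op op;
	unsigned int reg;
	size_t offset;    /* TOY_OP_PAYLOAD: byte offset into the header */
	uint32_t operand; /* TOY_OP_BITWISE: mask, TOY_OP_CMP: value */
};

struct toy_state {
	uint32_t regs[TOY_NUM_REGS];
	uint64_t packets;
};

/* Run the program against one header; false means "rule did not match". */
bool toy_run(const struct toy_insn *prog, size_t len,
	     const uint8_t *hdr, size_t hdr_len, struct toy_state *st)
{
	for (size_t i = 0; i < len; i++) {
		const struct toy_insn *in = &prog[i];

		assert(in->reg < TOY_NUM_REGS);
		switch (in->op) {
		case TOY_OP_PAYLOAD:
			/* Load 4 header bytes (big endian) into the register. */
			if (in->offset + 4 > hdr_len)
				return false;
			st->regs[in->reg] =
				(uint32_t)hdr[in->offset] << 24 |
				(uint32_t)hdr[in->offset + 1] << 16 |
				(uint32_t)hdr[in->offset + 2] << 8 |
				(uint32_t)hdr[in->offset + 3];
			break;
		case TOY_OP_BITWISE:
			st->regs[in->reg] &= in->operand;
			break;
		case TOY_OP_CMP:
			if (st->regs[in->reg] != in->operand)
				return false;
			break;
		case TOY_OP_COUNTER:
			st->packets++;
			break;
		}
	}
	return true;
}
```

With this model, matching the IPv4 source address against 192.168.0.0/24 takes four instructions: load the four bytes at header offset 12, mask with 0xffffff00, compare with 0xc0a80000, then count. The counter is only reached when the comparison succeeds.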

TODO: add info.

The userspace components

There are several important components in the userspace part of nftables:

  • libmnl: generic low level library used to communicate with the kernel using netlink sockets.
  • libnftnl: low level library that is capable of interacting with the nf_tables subsystem netlink API in the kernel. It is responsible for creating/parsing the nf_tables netlink messages. Uses libmnl under the hood.
  • libnftables: high level library that implements the logic to translate from high level statements to netlink objects and the other way around. Uses libnftnl under the hood.
  • nft: the command line interface binary. This is what most end users actually use in their systems. It reads user input and calls libnftables under the hood.

Generally speaking, userspace compiles high level statements (rules, etc.) into the netlink bytecode that the kernel API understands. When inspecting the ruleset (i.e., listing it), it does the opposite: it reconstructs high level statements from the low level netlink bytecode.


libnftnl

libnftnl provides data structures for the entities that exist in nf_tables nomenclature, such as tables, chains and rules. It serves as an intermediate layer between the nftables and iptables-nft user space applications on one side and the nfnetlink messages the kernel sends and receives on the other.

In general, each data structure comes with a set of handling routines for:

  • allocating and freeing an object of the given type
  • accessing data structure fields via an attribute number (a specific enum value)
  • populating a netlink message from the object, and vice versa
  • providing a textual representation, mostly for debugging purposes

Where sensible, there is a list variant, too. If so, it comes with handling routines as well:

  • allocating and freeing the list object (and its members)
  • adding to and removing from the list

Where useful, there might be a lookup routine as well. For example, the nftnl_chain_list object also contains a hash table of chain names, so lookup by chain name is faster than a linear search.

Iterators are a typical extra for list objects: a data structure holding state while browsing through the list. Usually the only routines used are allocators and a next routine.
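
The pattern described above (an opaque object, attributes addressed by enum value, a list variant and an iterator) can be sketched in self-contained C as follows. This is not libnftnl code: every toy_* name is hypothetical, and for brevity the list and the iterator are initialized in place instead of being allocated.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum toy_chain_attr { TOY_CHAIN_NAME, TOY_CHAIN_TABLE, TOY_CHAIN_ATTR_MAX };

struct toy_chain {
	char *attr[TOY_CHAIN_ATTR_MAX]; /* NULL means "attribute not set" */
	struct toy_chain *next;
};

struct toy_chain_list { struct toy_chain *head; struct toy_chain **tail; };
struct toy_chain_iter { struct toy_chain *cur; };

struct toy_chain *toy_chain_alloc(void)
{
	return calloc(1, sizeof(struct toy_chain));
}

void toy_chain_free(struct toy_chain *c)
{
	if (c == NULL)
		return;
	for (int i = 0; i < TOY_CHAIN_ATTR_MAX; i++)
		free(c->attr[i]);
	free(c);
}

/* Store a private copy of val under the given attribute number. */
int toy_chain_set_str(struct toy_chain *c, enum toy_chain_attr a,
		      const char *val)
{
	char *copy;

	assert(a < TOY_CHAIN_ATTR_MAX);
	copy = malloc(strlen(val) + 1);
	if (copy == NULL)
		return -1;
	strcpy(copy, val);
	free(c->attr[a]);
	c->attr[a] = copy;
	return 0;
}

const char *toy_chain_get_str(const struct toy_chain *c,
			      enum toy_chain_attr a)
{
	assert(a < TOY_CHAIN_ATTR_MAX);
	return c->attr[a];
}

void toy_chain_list_init(struct toy_chain_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
}

void toy_chain_list_add_tail(struct toy_chain_list *l, struct toy_chain *c)
{
	c->next = NULL;
	*l->tail = c;
	l->tail = &c->next;
}

/* Free every member; the list itself stays usable afterwards. */
void toy_chain_list_free_members(struct toy_chain_list *l)
{
	while (l->head != NULL) {
		struct toy_chain *c = l->head;

		l->head = c->next;
		toy_chain_free(c);
	}
	l->tail = &l->head;
}

void toy_chain_iter_init(struct toy_chain_iter *it,
			 const struct toy_chain_list *l)
{
	it->cur = l->head;
}

/* Return the current member and advance; NULL at the end of the list. */
struct toy_chain *toy_chain_iter_next(struct toy_chain_iter *it)
{
	struct toy_chain *c = it->cur;

	if (c != NULL)
		it->cur = c->next;
	return c;
}
```

Callers never touch struct fields directly. They set and get values through the attribute enum, which is what lets the internal layout change without breaking users.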

These are the entities defined by libnftnl:

  • nftnl_table: a rather boring "namespace" for chains
  • nftnl_chain: a container for rules, may attach to a netfilter hook in the kernel
  • nftnl_rule: a container for expressions
  • nftnl_expr: an nftables VM code instruction
  • nftnl_flowtable: similar to a chain, but holds flows between interfaces
  • nftnl_obj: a generic object, typically holding stateful information
  • nftnl_ruleset: a container for lists of tables, chains, sets and rules (not used by the nftables application anymore)
  • nftnl_set: a container for elements
  • nftnl_set_elem: a set element
  • nftnl_trace: a trace event sent by the kernel


While nftables distinguishes between expressions and statements, that distinction does not really exist at the libnftnl layer. For instance, a statement like:

ip saddr

is actually two expressions:

  • loading the IPv4 header's source address into a register
  • comparing data from a register against a stored value

Expressions have access to the packet, its metadata and all nftables registers (including the verdict register), and they may store multiple values internally. This makes them powerful and versatile.


nftnl_obj is a common API for various object types. An object's type is defined after allocation by setting the NFTNL_OBJ_TYPE attribute. The currently existing object types are:

  • counter
  • quota
  • ct helper
  • limit
  • tunnel
  • ct timeout
  • secmark
  • ct expect
  • synproxy


The nftnl batch interface is a wrapper around the same functionality in libmnl (which is used internally). In general, nftnl batches help collect multiple netlink messages for submission to the kernel.


libnftables

One goal in nftables development was to provide users with a library that is easier to integrate into applications than "shelling out" using system() and trying to parse nft command output.

At first, libnftnl was supposed to achieve this. However, it exposes internal implementation details and is pretty low-level in general, which made it rather unsuitable from a user's perspective.

To overcome this, the nft backend code was separated into a library that fills the gap between libnftnl on one side and the nft application itself on the other.

Usage of libnftables is supposed to be simple and straightforward, almost like calling nft itself but with a bit more convenience. The first step is to create a new context:

 struct nft_ctx *ctx = nft_ctx_new(0);

The context allows configuring library behaviour on a "per session" basis. With this in place, nftables commands may be executed:

 int rc = nft_run_cmd_from_buffer(ctx, "add table inet t");

or whole dump files loaded:

 int rc = nft_run_cmd_from_filename(ctx, "/etc/nftables/all-in-one.nft");

To control output, there are a number of functions:

 FILE *nft_ctx_set_output(struct nft_ctx *ctx, FILE *fp);
 int nft_ctx_buffer_output(struct nft_ctx *ctx);
 int nft_ctx_unbuffer_output(struct nft_ctx *ctx);
 const char *nft_ctx_get_output_buffer(struct nft_ctx *ctx);

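Equivalent functions exist for error output (nft_ctx_set_error(), nft_ctx_buffer_error(), nft_ctx_unbuffer_error() and nft_ctx_get_error_buffer()). See the libnftables(3) man page for further details.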

nft: from user space to the kernel

The following describes the steps and entities involved after a call to nft in user space until the actual communication with the kernel.

Since the creation of libnftables, nft is merely a lightweight front-end: it creates a libnftables handle, configures it according to the command-line options and feeds nftables syntax into it. The actual work takes place within the library. It may be divided into several phases:

  • Input parsing into internal data structures
  • Evaluation and expansion
  • Serialization into netlink messages
  • nfnetlink message session with kernel
  • Error handling

Input parsing into internal data structures

Depending on whether input comes from the command line or from a file (which may be stdin), main() calls either the nft_run_cmd_from_buffer() or the nft_run_cmd_from_filename() library function.

If JSON output was selected (nft -j), the JSON parser (in src/parser_json.c) is tried first. If it does not succeed, the standard ("human-readable") syntax parser is called.

In the end, both parsers populate a list of commands (struct cmd) and, if errors were detected, a list of error messages (struct error_record).

Standard syntax

The parser for standard syntax is implemented in lex and yacc, see src/scanner.l and src/parser_bison.y for reference. It is entered via the generated function nft_parse().

As a basic rule, in lex/yacc the scanner recognizes the words and the parser interprets them in their context. There is also (limited) scanner control from the parser by definition of a scope in which some words are valid or not. The parser defines recursive patterns to match input against. Here is the top-most one, input:

input       :       /* empty */
            |       input           line
                    {
                            if ($2 != NULL) {
                                    $2->location = @2;
                                    list_add_tail(&$2->list, state->cmds);
                            }
                    }
            ;

So it may be empty or (by recursion) consist of a number of line patterns. Each of those lines parses into a command and is appended to the list. The snippet above also shows how parser-provided location data is stored in the command object. This is used for error reporting.

JSON syntax

The JSON parser lives in src/parser_json.c and is entered via the nft_parse_json_buffer() or nft_parse_json_filename() function, respectively. It uses the jansson library for (de-)serialization and value (un-)packing. To learn about the code and to understand the program flow, the json_parse_cmd() function is a good starting point.

Evaluation and expansion

Input evaluation is a crucial step and combines several tasks. It extends input validation from the mere syntax checks done by the parser to semantic ones that take context into account.

Input may be changed, too. Sometimes it is necessary to insert extra statements as dependencies, and sometimes the type of the right-hand side of a comparison must be adjusted to the type of the left-hand side.

Before all of the above, the list of commands is scanned for cache requirements - see nft_cache_evaluate() for details. Since caching may be an expensive operation if the in-kernel ruleset is huge, this step attempts to reduce the data fetched from the kernel to the bare minimum needed for correct operation. A final call to nft_cache_update() then does the actual fetch.

If evaluation passed, expansion takes place. This is mostly to cover for input in "dump" notation, i.e. rules nested in chains nested in tables, etc. Such input is converted into individual "add" commands as required by the netlink message format. The code is pretty straightforward, see nft_cmd_expand() for reference.

Serialization into netlink messages

In this step, nftables-internal data types are converted into libnftnl ones (e.g., struct table into struct nftnl_table). The latter abstract their internal layout as attributes and are therefore opaque to the caller.

libnftnl provides helpers to convert its own data structures into netlink message format: a generic nftnl_nlmsg_build_hdr() for the header and type-specific ones for the payload (e.g., nftnl_table_nlmsg_build_payload()).

The netlink messages are stored in a struct nftnl_batch which provides the backing storage. This surrounding data structure serializes into an introductory NFNL_MSG_BATCH_BEGIN message and a finalizing one with type NFNL_MSG_BATCH_END.

In kernel space, the batch constitutes a transaction: if one of the messages is rejected, none of them take effect. Likewise, if the final batch end message is missing, the whole batch is undone. This is how nft's --check option is implemented.
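
The all-or-nothing behaviour can be modelled with a short sketch. It is a deliberate simplification with hypothetical toy_* names: the model works on a scratch copy of a tiny "ruleset" and publishes it only on success, whereas the kernel keeps a transaction log, as described later on this page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_TABLES 8

enum toy_msg_type {
	TOY_BATCH_BEGIN, TOY_NEW_TABLE, TOY_DEL_TABLE, TOY_BATCH_END
};

struct toy_msg { enum toy_msg_type type; unsigned int table; };
struct toy_ruleset { bool table[TOY_MAX_TABLES]; };

/*
 * Returns 0 if the batch was committed, 1 if every message was valid but
 * the batch end was missing (nothing applied, similar to a dry run), and
 * -1 if a message was rejected (nothing applied either).
 */
int toy_batch_process(struct toy_ruleset *rs, const struct toy_msg *msgs,
		      size_t n)
{
	struct toy_ruleset work;
	bool ended = false;

	assert(rs != NULL);
	if (n == 0 || msgs[0].type != TOY_BATCH_BEGIN)
		return -1;

	work = *rs; /* all changes go to a scratch copy first */
	for (size_t i = 1; i < n && !ended; i++) {
		const struct toy_msg *m = &msgs[i];

		switch (m->type) {
		case TOY_NEW_TABLE:
			if (m->table >= TOY_MAX_TABLES || work.table[m->table])
				return -1;
			work.table[m->table] = true;
			break;
		case TOY_DEL_TABLE:
			if (m->table >= TOY_MAX_TABLES || !work.table[m->table])
				return -1;
			work.table[m->table] = false;
			break;
		case TOY_BATCH_END:
			ended = true;
			break;
		case TOY_BATCH_BEGIN:
			return -1; /* nested batches are not modelled */
		}
	}
	if (!ended)
		return 1;
	*rs = work; /* commit: publish all changes at once */
	return 0;
}
```

A batch that adds a table twice is rejected as a whole, so the first table it added does not appear either. A valid batch without the end marker leaves the ruleset untouched.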

nfnetlink message session with kernel

In nft, communication with the kernel takes place in the function mnl_batch_talk(). It converts the nftnl_batch into a message suitable for sendmsg(), adjusts buffer sizes (if needed), transmits the data and listens for a reply. Any error messages are handled by the mnl_batch_extack_cb() function, which records them for later reporting. Other messages are relevant for --echo mode, in which the kernel "echoes" the requests back after updating them (with handle values, for instance). These are handled by netlink_echo_callback(), more or less a wrapper around nft's event monitoring code.

Error handling

Each struct cmd object is identified by its own sequence number (monotonic within the batch). Netlink error messages contain this number and also an offset value. Together they identify not only the problematic message but also the specific attribute of that message which was rejected.

The message attribute is mapped back to the line or word(s) of input via a mapping from the attribute offset to the struct location object stored while parsing. That bison parser-provided data holds line and column numbers, allowing nft to underline the problematic parts of the input when reporting back to the caller.

To follow the above in the source code, see the nft_cmd_error() function, which is called for each command and each error it caused. The mapping is established earlier while creating netlink messages, i.e. in code called from do_command() - watch out for the various calls to cmd_add_loc() populating the field attr in struct cmd.
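
A minimal sketch of that lookup is shown below. The struct layouts and all toy_* names are assumptions made for illustration; they are not the definitions used by nft. The sketch finds the command by sequence number, then the attribute by offset, and falls back to the location of the whole command when the offset is unknown.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_ATTRS 4

/* 1-based line and column range of a piece of input. */
struct toy_location { unsigned int line, first_col, last_col; };

struct toy_cmd {
	unsigned int seqnum;              /* netlink sequence number */
	struct toy_location loc;          /* location of the whole command */
	struct {
		unsigned int offset;      /* attribute offset in the message */
		struct toy_location loc;  /* input that produced the attribute */
	} attr[TOY_MAX_ATTRS];
	size_t num_attrs;
};

/* Map (sequence number, attribute offset) from an error to a location. */
const struct toy_location *toy_error_location(const struct toy_cmd *cmds,
					      size_t num_cmds,
					      unsigned int seqnum,
					      unsigned int offset)
{
	for (size_t i = 0; i < num_cmds; i++) {
		if (cmds[i].seqnum != seqnum)
			continue;
		for (size_t j = 0; j < cmds[i].num_attrs; j++) {
			if (cmds[i].attr[j].offset == offset)
				return &cmds[i].attr[j].loc;
		}
		return &cmds[i].loc; /* unknown offset: blame the command */
	}
	return NULL; /* no command with that sequence number */
}

/* Write a marker line: spaces up to first_col, then carets to last_col. */
void toy_underline(const struct toy_location *loc, char *buf, size_t len)
{
	size_t pos = 0;

	assert(len > 0);
	for (unsigned int col = 1; col <= loc->last_col && pos + 1 < len; col++)
		buf[pos++] = col < loc->first_col ? ' ' : '^';
	buf[pos] = '\0';
}
```

Printing the offending input line followed by the marker line gives the familiar underlined error report.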

nft: from the kernel to user space

Communication between nft in user space and nftables in the kernel happens via netlink, a packet-based IPC mechanism built for this purpose. Its kernel source code lives in the net/netlink directory. It can be extended by calling netlink_kernel_create(), passing a unique unit number and a struct netlink_kernel_cfg object.

nfnetlink is such an extension, attempting to serve all netfilter-related user space applications. It is implemented in net/netfilter/nfnetlink.c and can itself be extended by means of groups (see nfnetlink_groups in include/uapi/linux/netfilter/nfnetlink.h). These in turn map to nfnetlink subsystems; see the constant array nfnl_group2type in the nfnetlink source file. NFNL_SUBSYS_NFTABLES is the relevant one here, implemented in net/netfilter/nf_tables_api.c (see nf_tables_subsys and the call to nfnetlink_subsys_register() in there).

For insight, it is worthwhile to remain in generic nfnetlink code for a little longer: nfnetlink_net_ops are registered as a "pernet" subsystem, i.e. each network namespace gets its own instance. Upon netns creation, nfnetlink_net_init() is called which actually creates the NETLINK_NETFILTER subsystem. Its receive callback (nfnetlink_rcv()) checks whether the first message header starts a batch and diverts the code flow accordingly.

For batch handling, subsystems need to define commit and abort callbacks. Also, for each contained message, there must be a responsible callback entry with type NFNL_CB_BATCH. nf_tables_subsys fulfills these requirements.

Each callback in nf_tables_cb (and therefore each supported message type) decides whether it must be part of a batch or not - nfnetlink code does not allow for multiple handlers of the same message. In nftables, only the getters for the different ruleset elements are non-batch; anything that modifies the ruleset must be batched.
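
The idea of one handler per message type, each tagged as batch or non-batch, can be sketched as a small dispatch table. The names and return values below are hypothetical and only loosely modelled on the description above; the sketch assumes that a message arriving in the wrong context is simply rejected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum toy_cb_type { TOY_CB_UNSPEC, TOY_CB_GETTER, TOY_CB_BATCH };

enum toy_msg_type {
	TOY_MSG_NEWTABLE, TOY_MSG_GETTABLE, TOY_MSG_DELTABLE, TOY_MSG_MAX
};

struct toy_callback {
	enum toy_cb_type type;
	int (*call)(void);
};

/* Dummy handlers; each returns a distinct value so calls can be told apart. */
static int toy_newtable(void) { return 1; }
static int toy_gettable(void) { return 2; }
static int toy_deltable(void) { return 3; }

/* Exactly one handler per message type, indexed by the message type. */
static const struct toy_callback toy_cb[TOY_MSG_MAX] = {
	[TOY_MSG_NEWTABLE] = { TOY_CB_BATCH,  toy_newtable },
	[TOY_MSG_GETTABLE] = { TOY_CB_GETTER, toy_gettable },
	[TOY_MSG_DELTABLE] = { TOY_CB_BATCH,  toy_deltable },
};

/* Call the handler for msg, or return -1 if it is not valid in this context. */
int toy_dispatch(unsigned int msg, bool in_batch)
{
	const struct toy_callback *cb;

	if (msg >= TOY_MSG_MAX)
		return -1;
	cb = &toy_cb[msg];
	if (cb->call == NULL)
		return -1;
	assert(cb->type == TOY_CB_GETTER || cb->type == TOY_CB_BATCH);
	/* Batch handlers only run inside a batch, getters only outside. */
	if ((cb->type == TOY_CB_BATCH) != in_batch)
		return -1;
	return cb->call();
}
```

Because the table is indexed by message type, there is no way to register a second handler for the same message, which matches the restriction mentioned above.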

Non-batched handlers

These are getters for:

  • table
  • chain
  • rule
  • set
  • set element
  • generation ID
  • stateful objects
  • flowtable

They all behave similarly: unless the NLM_F_DUMP flag is set in the message, they perform a lookup based on the required identifiers and return an nfnetlink message to user space. Type-specific helpers named nf_tables_fill_<SOMETHING>_info populate the netlink message, and the packet is sent by a call to nfnetlink_unicast().

If NLM_F_DUMP was given, the getter iterates over all ruleset elements of the given type and fills a netlink message for each. In some cases, filtering the output by identifiers given in the request is supported - useful to dump e.g. all rules of a specific chain only.

The iterator code is a bit complicated because the socket buffer size may be exceeded. In that case, partial data is sent to user space and the dump continues afterwards. The iterators keep a "cursor" (actually a counter) for where to pick up again.
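
The resumable dump can be illustrated with a few lines of C. This is a sketch with hypothetical toy_* names, where a fixed-capacity output array stands in for the socket buffer and plain integers stand in for ruleset elements.

```c
#include <assert.h>
#include <stddef.h>

/* State kept between calls: how many items were already sent. */
struct toy_dump_ctx { size_t cursor; };

/*
 * Copy up to cap not-yet-sent items into out and remember where to resume.
 * Returns the number of items written; 0 means the dump is complete.
 */
size_t toy_dump(const int *items, size_t num_items,
		struct toy_dump_ctx *ctx, int *out, size_t cap)
{
	size_t idx, written = 0;

	assert(cap > 0);
	for (idx = 0; idx < num_items; idx++) {
		if (idx < ctx->cursor)
			continue; /* already sent in an earlier call */
		if (written == cap)
			break;    /* buffer full: stop and resume later */
		out[written++] = items[idx];
	}
	ctx->cursor = idx; /* index of the first item not yet sent */
	return written;
}
```

The caller invokes toy_dump() repeatedly with the same context until it returns 0. Skipping by counter is simple but walks the list from the start on every call.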

Batched handlers

To allow for rolling back a transaction which has failed or was aborted, message handlers of type NFNL_CB_BATCH allocate a struct nft_trans object and add it to the per-net commit list. This "log" of what was done is also useful to defer actions till the very end of the transaction. See nf_tables_commit() for how the list is used in the success case. Similar code is found in nf_tables_abort(), which reverts the previous changes.

To make the ruleset update atomic, nftables uses an internal generation ID. Its value alternates between zero and one upon each commit. Ruleset elements have a two-bit "generation mask", indicating whether that element is active in the generation at its bit index. This way, elements may die, get born or stay alive when the generation ID toggles again.
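
The generation mechanism can be modelled in a few lines. The sketch follows the description above, where a set bit means "active in that generation"; the kernel's own bit encoding may differ, and all toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct toy_net { unsigned int gen; };   /* current generation: 0 or 1 */
struct toy_elem { uint8_t genmask; };   /* bit g set: active in generation g */

static uint8_t toy_gen_bit(unsigned int gen)
{
	assert(gen <= 1);
	return (uint8_t)(1u << gen);
}

/* Packet path: only elements active in the current generation are seen. */
bool toy_is_active(const struct toy_net *net, const struct toy_elem *e)
{
	return (e->genmask & toy_gen_bit(net->gen)) != 0;
}

/* A new element is born in the next generation only. */
void toy_elem_add(const struct toy_net *net, struct toy_elem *e)
{
	e->genmask = toy_gen_bit(net->gen ^ 1u);
}

/* A deleted element stays active now but dies in the next generation. */
void toy_elem_del(const struct toy_net *net, struct toy_elem *e)
{
	e->genmask &= (uint8_t)~toy_gen_bit(net->gen ^ 1u);
}

/* Commit: toggle the generation, then settle every element's mask. */
void toy_commit(struct toy_net *net, struct toy_elem *elems, size_t n)
{
	net->gen ^= 1u;
	for (size_t i = 0; i < n; i++) {
		/* Survivors stay alive in both generations, the rest are dead. */
		elems[i].genmask = toy_is_active(net, &elems[i]) ? 0x3 : 0x0;
	}
}
```

Until toy_commit() toggles the generation, readers see the old ruleset in full. Afterwards they see the new one in full, without any intermediate state.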