Difference between revisions of "Portal:DeveloperDocs/nftables internals"

From nftables wiki
(Added section about libnftables)

Revision as of 13:24, 13 September 2022

This page contains information for Netfilter developers on how nftables internals work.

The kernel subsystem

The nf_tables kernel subsystem contains 2 key components:

  • the netlink API (i.e., the control plane API)
  • the nf_tables core (i.e., the data plane engine)

Other components, such as external modules, are also in place and are intermixed with both the API and the core.

Generally speaking, the nf_tables subsystem implements a virtual machine whose low-level expressions operate on network packets.

TODO: add info.

nf_tables netlink API

The source code is mostly in net/netfilter/nf_tables_api.c [elixir src] [git src]

TODO: add info.

nf_tables core

The source code is mostly in net/netfilter/nf_tables_core.c [elixir src] [git src]

There you can find one of the most important functions in the core: nft_do_chain(). In a nutshell, this is the function that evaluates network packets against the ruleset.

The logic in this function is rather simple:

  • for each rule in the chain
    • for each low level expression in the rule
      • evaluate the packet against the expression
    • evaluate expression return code (break, continue, drop, accept, jump, goto, etc)
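The loop above can be sketched in plain C. This is a toy model, not the kernel code: all toy_* names and types are hypothetical, jump and goto are left out, and the chain policy is assumed to apply when no rule issues a final verdict.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified types; the kernel's real structures differ. */
enum toy_verdict { TOY_CONTINUE, TOY_BREAK, TOY_ACCEPT, TOY_DROP };

struct toy_regs {
	enum toy_verdict verdict;
	uint32_t data[4];
};

struct toy_pkt {
	const uint8_t *payload;
	size_t len;
};

/* An expression reads the packet and reads or writes the registers. */
typedef void (*toy_expr_fn)(struct toy_regs *regs, const struct toy_pkt *pkt);

struct toy_rule {
	const toy_expr_fn *exprs;
	size_t num_exprs;
};

struct toy_chain {
	const struct toy_rule *rules;
	size_t num_rules;
	enum toy_verdict policy;
};

/* Example expression: mismatch unless the first payload byte is 0x45. */
void toy_match_first_byte_0x45(struct toy_regs *regs, const struct toy_pkt *pkt)
{
	if (pkt->len < 1 || pkt->payload[0] != 0x45)
		regs->verdict = TOY_BREAK;
}

/* Example expression: issue the final verdict "accept". */
void toy_accept(struct toy_regs *regs, const struct toy_pkt *pkt)
{
	(void)pkt;
	regs->verdict = TOY_ACCEPT;
}

/* Walk all rules; stop at the first rule that issues a final verdict. */
enum toy_verdict toy_do_chain(const struct toy_chain *chain,
			      const struct toy_pkt *pkt)
{
	struct toy_regs regs = { .verdict = TOY_CONTINUE };

	assert(chain != NULL && pkt != NULL);

	for (size_t r = 0; r < chain->num_rules; r++) {
		const struct toy_rule *rule = &chain->rules[r];

		regs.verdict = TOY_CONTINUE;
		for (size_t e = 0; e < rule->num_exprs; e++) {
			rule->exprs[e](&regs, pkt);
			/* Anything but "continue" ends this rule. */
			if (regs.verdict != TOY_CONTINUE)
				break;
		}

		/* "break" means the rule did not match: try the next one. */
		if (regs.verdict == TOY_ACCEPT || regs.verdict == TOY_DROP)
			return regs.verdict;
	}
	return chain->policy;
}
```

A rule made of toy_match_first_byte_0x45 followed by toy_accept accepts packets starting with 0x45; any other packet falls through to the chain policy.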

TODO: add info.

expressions

There are many low-level expressions that allow us to operate on network packets in different ways. You can think of these low-level expressions as assembly-like instructions.

  • nft_immediate: loads an immediate value into a register.
  • nft_cmp: compares given data with data from a given register.
  • nft_payload: sets/gets arbitrary data in packet headers.
  • nft_bitwise: performs bit-wise math operations on data in a given register.
  • nft_byteorder: performs byte order operations on data in a given register.
  • nft_counter: a basic packet/byte counter that is incremented every time it is evaluated for a packet.
  • nft_meta: sets/gets packet meta information, such as related interfaces, timestamps, etc.
  • nft_lookup: looks up data from a given register (the key) in a set. If the set is a map/vmap, it returns the value for that key.

TODO: add info.

The userspace components

There are several important components in the userspace part of nftables:

  • libmnl: generic low level library used to communicate with the kernel using netlink sockets.
  • libnftnl: low level library that is capable of interacting with the nf_tables subsystem netlink API in the kernel. It is responsible for creating/parsing the nf_tables netlink messages. Uses libmnl under the hood.
  • libnftables: high level library that implements the logic to translate from high level statements to netlink objects and the other way around. Uses libnftnl under the hood.
  • nft: the command line interface binary. This is what most end users actually use in their systems. It reads user input and calls libnftables under the hood.

Generally speaking, the userspace compiles high-level statements (rules, etc.) into the netlink bytecode that the kernel API understands. When inspecting the ruleset (i.e., listing it), it does the opposite: it reconstructs high-level statements from the low-level netlink bytecode.

libnftnl

This library provides data structures for entities existing in nf_tables nomenclature, such as tables, chains and rules. It serves as an intermediate layer between nftables and iptables-nft user space applications and nfnetlink messages the kernel sends and receives.

In general, each data structure comes with a set of handling routines:

  • allocators: To allocate and free an object of a given type
  • setters/getters: Data structure fields are accessed via an attribute number (via a specific enum field)
  • serializers: Populating a netlink message or vice versa
  • printers: Providing a textual representation, mostly for debugging purposes
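The allocator and setter/getter pattern can be illustrated with a small self-contained sketch. It deliberately does not use the libnftnl API: toy_table and all toy_* functions and attribute names are hypothetical and only mimic the idea of addressing fields by attribute number while tracking which attributes have been set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical attribute numbers, one per field. */
enum toy_table_attr {
	TOY_TABLE_NAME,
	TOY_TABLE_FAMILY,
};

/* A real library would hide this layout from its callers. */
struct toy_table {
	char *name;
	uint32_t family;
	uint32_t flags;		/* bit N set: attribute N holds a value */
};

struct toy_table *toy_table_alloc(void)
{
	return calloc(1, sizeof(struct toy_table));
}

void toy_table_free(struct toy_table *t)
{
	if (t == NULL)
		return;
	free(t->name);
	free(t);
}

bool toy_table_is_set(const struct toy_table *t, enum toy_table_attr attr)
{
	return (t->flags & (1u << attr)) != 0;
}

/* Returns 0 on success, -1 on wrong attribute or allocation failure. */
int toy_table_set_str(struct toy_table *t, enum toy_table_attr attr,
		      const char *str)
{
	char *copy;

	assert(t != NULL && str != NULL);
	if (attr != TOY_TABLE_NAME)
		return -1;
	copy = malloc(strlen(str) + 1);
	if (copy == NULL)
		return -1;
	strcpy(copy, str);
	free(t->name);
	t->name = copy;
	t->flags |= 1u << attr;
	return 0;
}

int toy_table_set_u32(struct toy_table *t, enum toy_table_attr attr,
		      uint32_t val)
{
	assert(t != NULL);
	if (attr != TOY_TABLE_FAMILY)
		return -1;
	t->family = val;
	t->flags |= 1u << attr;
	return 0;
}

/* Returns NULL if the attribute is unset or not a string. */
const char *toy_table_get_str(const struct toy_table *t,
			      enum toy_table_attr attr)
{
	if (attr != TOY_TABLE_NAME || !toy_table_is_set(t, attr))
		return NULL;
	return t->name;
}

/* Returns 0 if the attribute is unset or not a 32-bit value. */
uint32_t toy_table_get_u32(const struct toy_table *t,
			   enum toy_table_attr attr)
{
	if (attr != TOY_TABLE_FAMILY || !toy_table_is_set(t, attr))
		return 0;
	return t->family;
}
```

Because callers only ever name attributes, the struct layout can change without breaking them, which is the point of keeping such objects opaque.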

Where sensible, there is a list-variant, too. If so, it comes with handling routines as well:

  • allocators: Allocating and freeing the list object (and members)
  • populators: Add and remove from the list

Where useful, there might be a lookup routine as well. For example, the nftnl_chain_list object also contains a hash table of chain names, so looking up a chain by name is faster than a linear search.

A typical extra for list objects is an iterator: a data structure that holds state while walking through the list. Usually the only routines used are the allocators and a next routine.

These are the entities defined by libnftnl:

  • table: A rather boring "namespace" for chains
  • chain: A container for rules, may attach to a netfilter hook in kernel
  • rule: A container for expressions
  • expr: An nftables VM code instruction
  • flowtable: Similar to a chain, but holds flows between interfaces
  • obj: A generic object, typically holding stateful information
  • ruleset: A container for lists of tables, chains, sets and rules - not used by nftables application anymore
  • set: A container for elements
  • set_elem: A set element
  • trace: A trace event sent by the kernel

nftnl_expr

While nftables distinguishes between expressions and statements, no such distinction really exists at the libnftnl layer. For instance, a statement like:

ip saddr 192.168.0.1

is actually two expressions:

  • payload: loading IPv4 header's source address into a register
  • cmp: comparing data from a register against a stored value
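To make the split into two expressions concrete, here is a toy sketch in C. The toy_* names, the single 16-byte register, and the function signatures are assumptions made for illustration; only the position of the IPv4 source address (4 bytes at offset 12 of the IPv4 header) is a protocol fact.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_REG_SIZE 16

/* "payload": which bytes of the header to load into the register. */
struct toy_payload {
	size_t offset;
	size_t len;
};

/* "cmp": the stored value the register is compared against. */
struct toy_cmp {
	uint8_t data[TOY_REG_SIZE];
	size_t len;
};

/* Copy len bytes at offset into reg; fail if the header is too short. */
bool toy_payload_eval(const struct toy_payload *p, const uint8_t *hdr,
		      size_t hdr_len, uint8_t *reg)
{
	assert(p != NULL && hdr != NULL && reg != NULL);
	if (p->len > TOY_REG_SIZE || p->offset > hdr_len ||
	    p->len > hdr_len - p->offset)
		return false;
	memcpy(reg, hdr + p->offset, p->len);
	return true;
}

/* Compare the register content against the stored value. */
bool toy_cmp_eval(const struct toy_cmp *c, const uint8_t *reg)
{
	return memcmp(reg, c->data, c->len) == 0;
}

/* Toy equivalent of "ip saddr <addr>": payload load, then cmp. */
bool toy_match_ip_saddr(const uint8_t *ip_hdr, size_t hdr_len,
			const uint8_t addr[4])
{
	/* IPv4 source address: 4 bytes at offset 12 of the header. */
	const struct toy_payload load = { .offset = 12, .len = 4 };
	struct toy_cmp cmp = { .len = 4 };
	uint8_t reg[TOY_REG_SIZE] = { 0 };

	memcpy(cmp.data, addr, 4);
	return toy_payload_eval(&load, ip_hdr, hdr_len, reg) &&
	       toy_cmp_eval(&cmp, reg);
}
```

The two steps only communicate through the register, which is why the same cmp can be reused after any other expression that loads data.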

Since expressions have access to the packet, its meta data, all nftables registers (including the verdict register) and may store multiple values internally, they are mighty and versatile.

nftnl_obj

This is a common API for various object types. An object's type is defined post allocation by setting the NFTNL_OBJ_TYPE attribute. Currently existing object types are:

  • counter
  • quota
  • ct helper
  • limit
  • tunnel
  • ct timeout
  • secmark
  • ct expect
  • synproxy

nftnl_batch

This is a wrapper interface around the same functionality in libmnl (which is used internally). In general, nftnl batches aid in collecting multiple netlink messages for kernel submission.

libnftables

One goal in nftables development was to provide users with a library for easier integration into applications than "shelling out" using system() and trying to parse nft command output.

At first, libnftnl was supposed to achieve this, but it exposes internal implementation details and is pretty low-level in general, which made it rather unsuitable from a user's perspective.

To overcome this, nft backend code was separated into a library which should fill the gap between libnftnl on one side and nft application itself on the other.

Usage of libnftables is supposed to be simple and straightforward, almost like calling nft itself but with a bit more convenience. The first step is to create a new context:

 struct nft_ctx *ctx = nft_ctx_new(0);

The context allows configuring library behaviour on a "per session" basis. With this in place, nftables commands may be executed:

 int rc = nft_run_cmd_from_buffer(ctx, "add table inet t");

or whole dump files loaded:

 int rc = nft_run_cmd_from_filename(ctx, "/etc/nftables/all-in-one.nft");

To control output, there are a number of functions:

 FILE *nft_ctx_set_output(struct nft_ctx *ctx, FILE *fp);
 int nft_ctx_buffer_output(struct nft_ctx *ctx);
 int nft_ctx_unbuffer_output(struct nft_ctx *ctx);
 const char *nft_ctx_get_output_buffer(struct nft_ctx *ctx);

Equivalent functions exist for error output (stderr). See the libnftables(3) man page for further details.

nft: from user space to the kernel

The following describes the steps and entities involved from a call to nft in user space up to the actual communication with the kernel.

Since the creation of libnftables, nft is merely a lightweight front-end: it creates a libnftables handle, configures it according to command-line options and feeds nftables syntax into it. The actual work takes place within the library. It may be divided into several phases:

  • Input parsing into internal data structures
  • Evaluation and expansion
  • Serialization into netlink messages
  • nfnetlink message session with kernel
  • Error handling

Input parsing into internal data structures

Depending on whether input comes from the command line or a file (which may be stdin), main() calls either the nft_run_cmd_from_buffer() or the nft_run_cmd_from_filename() library function.

If JSON output was selected (nft -j), the JSON parser (in src/parser_json.c) is tried first. If it does not succeed, the standard ("human-readable") syntax parser is called.

Eventually both parsers populate a list of commands (struct cmd) and a list of error messages (struct error_record) in case errors were detected.

Standard syntax

The parser for standard syntax is implemented in lex and yacc, see src/scanner.l and src/parser_bison.y for reference. It is entered via the generated function nft_parse().

As a basic rule, in lex/yacc the scanner recognizes the words and the parser interprets them in their context. There is also (limited) scanner control from the parser by definition of a scope in which some words are valid or not. The parser defines recursive patterns to match input against. Here is the top-most one, input:

input       :       /* empty */
            |       input           line
            {
                    if ($2 != NULL) {
                            $2->location = @2;
                            list_add_tail(&$2->list, state->cmds);
                    }
            }
            ;

So it may be empty or (by recursion) consist of a number of line patterns. Each of those lines parses into a command and is appended to the list. The snippet above also shows how parser-provided location data is stored in the command object. This is used for error reporting.
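The effect of that grammar rule can be imitated without bison. The sketch below is a simplification under stated assumptions: the toy_* types are hypothetical, a "line" is simply a newline-terminated string, and the location only records the line number. As in the rule above, empty lines produce no command and every command is appended to the tail of the list together with its location.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for struct location and struct cmd. */
struct toy_location {
	unsigned int line;
};

struct toy_cmd {
	struct toy_cmd *next;
	struct toy_location location;
	char *text;
};

struct toy_cmd_list {
	struct toy_cmd *head;
	struct toy_cmd **tail;	/* where the next command gets linked */
	size_t count;
};

void toy_cmd_list_init(struct toy_cmd_list *l)
{
	l->head = NULL;
	l->tail = &l->head;
	l->count = 0;
}

/* Append one command at the tail, remembering where it came from. */
static int toy_cmd_append(struct toy_cmd_list *l, const char *s, size_t len,
			  unsigned int line)
{
	struct toy_cmd *cmd = calloc(1, sizeof(*cmd));

	if (cmd == NULL)
		return -1;
	cmd->text = malloc(len + 1);
	if (cmd->text == NULL) {
		free(cmd);
		return -1;
	}
	memcpy(cmd->text, s, len);
	cmd->text[len] = '\0';
	cmd->location.line = line;
	*l->tail = cmd;
	l->tail = &cmd->next;
	l->count++;
	return 0;
}

/* Split input into lines; each non-empty line becomes one command. */
int toy_parse_input(struct toy_cmd_list *l, const char *input)
{
	unsigned int line = 1;

	assert(l != NULL && input != NULL);
	while (*input != '\0') {
		size_t len = strcspn(input, "\n");

		if (len > 0 && toy_cmd_append(l, input, len, line) < 0)
			return -1;
		input += len;
		if (*input == '\n') {
			input++;
			line++;
		}
	}
	return 0;
}

void toy_cmd_list_free(struct toy_cmd_list *l)
{
	struct toy_cmd *cmd = l->head;

	while (cmd != NULL) {
		struct toy_cmd *next = cmd->next;

		free(cmd->text);
		free(cmd);
		cmd = next;
	}
	toy_cmd_list_init(l);
}
```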

JSON syntax

The JSON parser lives in src/parser_json.c and is entered via the nft_parse_json_buffer() or nft_parse_json_filename() function, respectively. It uses the jansson library for (de-)serialization and value (un-)packing. To learn about the code and to understand the program flow, the json_parse_cmd() function is a good starting point.

Evaluation and expansion

Input evaluation is a crucial step and combines several tasks. It extends input validation from the mere syntax checks done by the parser to semantic ones, taking context into account.

Input may be changed, too. Sometimes it is necessary to insert extra statements as dependencies; sometimes the type of the right-hand side of a comparison must be adjusted to the type of the left-hand side.

Before all the above, the list of commands is scanned for cache requirements - see nft_cache_evaluate() for details. Since caching may be an expensive operation if the in-kernel ruleset is huge, this step attempts to reduce the data fetched from the kernel to the bare minimum needed for correct operation. A final call to nft_cache_update() then does the actual fetch.
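The idea of collecting cache requirements before fetching anything can be sketched as a bitmask that is OR-ed over all commands. Everything in this example is an assumption made for illustration: the toy_* names, the set of flags, and in particular which command needs which part of the ruleset do not reflect the actual rules in nft_cache_evaluate().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache flags: one bit per kind of object to fetch. */
enum toy_cache_flag {
	TOY_CACHE_EMPTY = 0,
	TOY_CACHE_TABLE = 1 << 0,
	TOY_CACHE_CHAIN = 1 << 1,
	TOY_CACHE_SET   = 1 << 2,
	TOY_CACHE_RULE  = 1 << 3,
	TOY_CACHE_FULL  = (1 << 4) - 1,
};

/* Hypothetical command types. */
enum toy_cmd_op {
	TOY_OP_ADD_TABLE,
	TOY_OP_ADD_RULE,
	TOY_OP_LIST_SETS,
	TOY_OP_LIST_RULESET,
};

/* Illustrative mapping only: what a single command needs cached. */
static unsigned int toy_cache_needed(enum toy_cmd_op op)
{
	switch (op) {
	case TOY_OP_ADD_TABLE:
		return TOY_CACHE_TABLE;
	case TOY_OP_ADD_RULE:
		return TOY_CACHE_TABLE | TOY_CACHE_CHAIN | TOY_CACHE_SET;
	case TOY_OP_LIST_SETS:
		return TOY_CACHE_TABLE | TOY_CACHE_SET;
	case TOY_OP_LIST_RULESET:
		return TOY_CACHE_FULL;
	}
	return TOY_CACHE_EMPTY;
}

/* Union of the requirements of all commands: the minimum to fetch. */
unsigned int toy_cache_evaluate(const enum toy_cmd_op *ops, size_t n)
{
	unsigned int flags = TOY_CACHE_EMPTY;

	assert(ops != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		flags |= toy_cache_needed(ops[i]);
	return flags;
}
```

A single fetch driven by the combined flags then serves all commands in the input, and rules are only dumped from the kernel if some command actually asks for them.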

If evaluation passed, expansion takes place. This is mostly to cover for input in "dump" notation, i.e. rules nested in chains nested in tables, etc. Such input is converted into individual "add" commands as required by the netlink message format. The code is pretty straightforward, see nft_cmd_expand() for reference.
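As a rough illustration of expansion, the following sketch flattens a nested table description into individual "add" commands. It is not derived from nft_cmd_expand(): the toy_* types are hypothetical, commands are rendered as plain strings rather than struct cmd objects, and emitting all chains before any rule is a choice made in this sketch so that rules may refer to any chain of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define TOY_CMD_LEN 128

/* Hypothetical nested ("dump" style) input: table > chains > rules. */
struct toy_chain {
	const char *name;
	const char *const *rules;
	size_t num_rules;
};

struct toy_table {
	const char *family;
	const char *name;
	const struct toy_chain *chains;
	size_t num_chains;
};

/*
 * Write one flat command per object into out[] (at most max entries).
 * Returns the number of commands the input expands to, which may be
 * larger than max if the output array was too small.
 */
size_t toy_expand(const struct toy_table *t, char out[][TOY_CMD_LEN],
		  size_t max)
{
	size_t n = 0;

	assert(t != NULL && out != NULL);

	if (n < max)
		snprintf(out[n], TOY_CMD_LEN, "add table %s %s",
			 t->family, t->name);
	n++;

	/* First pass: create every chain. */
	for (size_t c = 0; c < t->num_chains; c++) {
		if (n < max)
			snprintf(out[n], TOY_CMD_LEN, "add chain %s %s %s",
				 t->family, t->name, t->chains[c].name);
		n++;
	}

	/* Second pass: add the rules to the chains created above. */
	for (size_t c = 0; c < t->num_chains; c++) {
		const struct toy_chain *chain = &t->chains[c];

		for (size_t r = 0; r < chain->num_rules; r++) {
			if (n < max)
				snprintf(out[n], TOY_CMD_LEN,
					 "add rule %s %s %s %s", t->family,
					 t->name, chain->name,
					 chain->rules[r]);
			n++;
		}
	}
	return n;
}
```

A table with chains input (one rule) and output (no rules) expands to four commands: one for the table, two for the chains and one for the rule.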

Serialization into netlink messages

In this step, nftables-internal data types are converted into libnftnl ones (e.g., struct table into struct nftnl_table). The latter abstract their internal layout as attributes and are therefore opaque to the caller.

libnftnl provides helpers to convert its own data structures into netlink message format: A generic nftnl_nlmsg_build_hdr() for the header and type-specific ones for the payload (e.g., nftnl_table_nlmsg_build_payload()).

The netlink messages are stored in a struct nftnl_batch which provides the backing storage. This surrounding data structure serializes into an introductory NFNL_MSG_BATCH_BEGIN message and a finalizing one with type NFNL_MSG_BATCH_END.

In kernel space, the batch constitutes a transaction: If one of the messages is rejected, none of them take effect. Likewise, if the final batch end message is missing, the whole batch is undone. This is how nft's --check option is implemented.
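The all-or-nothing behaviour can be modelled in a few lines of C. This is a toy simulation, not kernel or libnftnl code: the toy_* names, the single "new table" message type and the valid flag standing in for kernel-side validation are all assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical message types; the real ones are netlink messages. */
enum toy_msg_type {
	TOY_MSG_BATCH_BEGIN,
	TOY_MSG_NEWTABLE,
	TOY_MSG_BATCH_END,
};

struct toy_msg {
	enum toy_msg_type type;
	bool valid;		/* would the receiver accept this message? */
};

/* Stand-in for the in-kernel ruleset. */
struct toy_ruleset {
	unsigned int num_tables;
};

/*
 * Apply a batch to a scratch copy of the ruleset and publish the copy
 * only when the end message is reached. Returns true if committed.
 */
bool toy_batch_process(struct toy_ruleset *rs, const struct toy_msg *msgs,
		       size_t n)
{
	struct toy_ruleset pending;

	assert(rs != NULL && (msgs != NULL || n == 0));
	if (n == 0 || msgs[0].type != TOY_MSG_BATCH_BEGIN)
		return false;

	pending = *rs;
	for (size_t i = 1; i < n; i++) {
		switch (msgs[i].type) {
		case TOY_MSG_BATCH_END:
			*rs = pending;		/* commit */
			return true;
		case TOY_MSG_NEWTABLE:
			if (!msgs[i].valid)
				return false;	/* abort whole batch */
			pending.num_tables++;
			break;
		case TOY_MSG_BATCH_BEGIN:
			return false;		/* unexpected: abort */
		}
	}
	/* No end message seen: changes are dropped, as with --check. */
	return false;
}
```

In this model a batch without the end message still has every message validated, but nothing is ever published, which mirrors the dry-run behaviour described above.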

nfnetlink message session with kernel

In nft, communication with the kernel takes place in the function mnl_batch_talk(): It converts the nftnl_batch into a message suitable for sendmsg(), adjusts buffer sizes (if needed), transmits the data and listens for a reply. Any error messages are handled by the mnl_batch_extack_cb() function, which records them for later reporting. Other messages are relevant for --echo mode, in which the kernel "echoes" the requests back after updating them (with handle values, for instance). These are handled by netlink_echo_callback(), more or less a wrapper around nft's event monitoring code.

Error handling

Each struct cmd object is identified by its own sequence number (monotonic within the batch). Netlink error messages contain this number and also an offset value, which together identify not only the problematic message but also the specific attribute of that message which was rejected.

Mapping from message attribute back to line or word(s) of input works via a mapping from attribute offset to the struct location object stored while parsing. That bison parser-provided data holds line and column numbers, allowing nft to underline problematic parts of input when reporting back to the caller.
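The lookup from (sequence number, attribute offset) back to an input location can be sketched as follows. The toy_* structures are simplified, hypothetical stand-ins for struct cmd and struct location, and falling back to the location of the whole command when no attribute matches is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ATTRS 8

/* Hypothetical stand-in for the parser-provided location. */
struct toy_location {
	unsigned int line;
	unsigned int first_col;
	unsigned int last_col;
};

/* One entry per netlink attribute written for a command. */
struct toy_attr_loc {
	uint16_t offset;	/* offset of the attribute in the message */
	struct toy_location loc;
};

struct toy_cmd {
	uint32_t seqnum;	/* identifies the message in the batch */
	struct toy_location location;	/* location of the whole command */
	struct toy_attr_loc attr[TOY_MAX_ATTRS];
	size_t num_attrs;
};

/* Record, at serialization time, which input produced an attribute. */
int toy_cmd_add_loc(struct toy_cmd *cmd, uint16_t offset,
		    const struct toy_location *loc)
{
	assert(cmd != NULL && loc != NULL);
	if (cmd->num_attrs >= TOY_MAX_ATTRS)
		return -1;
	cmd->attr[cmd->num_attrs].offset = offset;
	cmd->attr[cmd->num_attrs].loc = *loc;
	cmd->num_attrs++;
	return 0;
}

/*
 * Resolve an error report. Returns the location of the offending
 * attribute, the location of the whole command if the offset is
 * unknown, or NULL if no command carries that sequence number.
 */
const struct toy_location *toy_cmd_error_loc(const struct toy_cmd *cmds,
					     size_t num_cmds, uint32_t seqnum,
					     uint16_t offset)
{
	for (size_t i = 0; i < num_cmds; i++) {
		const struct toy_cmd *cmd = &cmds[i];

		if (cmd->seqnum != seqnum)
			continue;
		for (size_t a = 0; a < cmd->num_attrs; a++) {
			if (cmd->attr[a].offset == offset)
				return &cmd->attr[a].loc;
		}
		return &cmd->location;
	}
	return NULL;
}
```

With line and column numbers in hand, a caller can print the offending input line and underline the columns from first_col to last_col.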

To follow the above in the source code, see the nft_cmd_error() function, which is called for each command and each error it caused. The mapping is established earlier while creating netlink messages, i.e. in code called from do_command() - watch out for the various calls to cmd_add_loc() populating the field attr in struct cmd.

nft: from the kernel to user space

TODO: add info.