Linux Firewalling in 2021 and a Gentle Introduction to NFTables Part II

Posted on Thu 28 October 2021 in Computing

Introduction

Following on from Part I, let's look at an example ruleset I built recently and walk through it.

It's a simple ruleset that opens just three ports, but it also sets up counters for all inbound DNS traffic, rate-limits that traffic, counts requests that exceed the rate limit (we'll call those flooded) and logs flood traffic, rate-limiting the logging too so we don't DoS ourselves with log traffic.

Reference Ruleset

flush ruleset

# ipv4 only firewall / table
table ip main {

    # Base input chain - drop by default
    chain inbound {
        type filter hook input priority 0; policy drop;

        # Allow traffic from established and related packets, drop invalid
        ct state vmap { established : accept, related : accept, invalid : drop }

        # Allow loopback traffic
        meta iifname lo accept

        # Allow ssh, http (for letsencrypt) and 853 for DoT
        tcp dport ssh accept
        tcp dport http accept
        tcp dport 853 jump inbound_dns
    }

    # Base output chain.  Not required, as the default policy is accept
    chain outbound {
        type filter hook output priority 0; policy accept;
    }

    # Base forward chain - drop by default
    chain forward {
        type filter hook forward priority 0; policy drop;
    }

    # Regular chain for limiting, counting and logging DNS traffic
    chain inbound_dns {
        counter name counter_all_dns_packets
        ct state new add @rate_meter_inbound_dns { ip saddr limit rate 30/minute burst 10 packets } accept
        counter name counter_flooded_dns_packets
        limit rate 6/minute log prefix "[nftables dns flood]"
    }

    # Counters and Maps
    counter counter_all_dns_packets {
    }

    counter counter_flooded_dns_packets {
    }

    set rate_meter_inbound_dns {
        type ipv4_addr
        flags dynamic
        timeout 10m
    }
}

Walkthrough

flush ruleset

Flush any existing rules first.

Tables

table ip main {

Tables are namespaces or containers for chains and chains are containers for rules. Multiple tables can be used if necessary.

Here we define a new table with the arbitrary name "main". ip defines a table containing rules for IPv4 traffic. Possible values are ip, ip6, inet, arp, bridge and netdev (inet captures both IPv4 and IPv6 traffic).

So, now we have a table, let's add some chains to it.

Chains

chain inbound {
    type filter hook input priority 0; policy drop;

Here we create a "Base Chain" with the arbitrary name "inbound". Base chains (as opposed to "regular" chains) have a type, a hook and a priority.

The type can be either filter, route or nat. We want to filter IP traffic so the filter type is used accordingly.

The hook will be familiar to users of iptables, as this is one place where the two are the same (both are part of netfilter). The hooks for IPv4 and IPv6 are prerouting, input, forward, output and postrouting.

The priority is a signed integer (so negative values are allowed), e.g. 10 or -100. Chains attached to the same hook are evaluated in ascending priority order, so lower values run first.

Rules

As with iptables, we define rules within chains.

A rule consists of something to match, written as an expression, and an action statement to perform upon it.

I'll start by explaining a very simple rule in the ruleset.

meta iifname lo accept

The match portion here is meta iifname lo and accept is the statement. This is actually known as a "verdict statement". Here is a copy of the possible verdict statements from the wiki:

  • accept: Accept the packet and stop the remain rules evaluation

  • drop: Drop the packet and stop the remain rules evaluation

  • queue: Queue the packet to userspace and stop the remain rules evaluation

  • continue: Continue the ruleset evaluation with the next rule

  • return: Return from the current chain and continue at the next rule of the last chain. In a base chain it is equivalent to accept

  • jump <chain>: Continue at the first rule of <chain>. It will continue at the next rule after a return statement is issued

  • goto <chain>: Similar to jump, but after the new chain the evaluation will continue at the last chain instead of the one containing the goto statement

The match here (meta iifname lo) is a meta match, which matches information about the packet rather than its contents. In this case it matches packets arriving on the input interface lo (loopback) and accepts them. As stated above, accept is a verdict statement. It is also what nftables sometimes calls a "terminating" statement: once it is reached, no more rules are evaluated for that packet. Not all verdict statements are terminating.

tcp dport ssh accept
tcp dport http accept

Hardly worthy of explanation. These are tcp matches on the destination port. Port numbers can be used instead of service names; service names must match those in /etc/services.

tcp dport 853 jump inbound_dns

Here the verdict statement jump tells the rule evaluation to continue with the rules in the named chain inbound_dns. More on that chain later. This is similar to the -j option in iptables.
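To make the jump behaviour concrete, here is a toy Python model of rule evaluation. It is only a sketch for building intuition, not how the kernel implements anything: the evaluate and filter_packet functions, the (match, verdict) rule representation and the within_limit packet field are all invented for this illustration.

```python
# Toy model of nftables rule evaluation (illustrative only, not kernel code).
# A rule is (match_function, verdict); a verdict is "accept", "drop",
# "continue", "return" or ("jump", chain_name).

def evaluate(chains, chain_name, packet):
    """Return 'accept' or 'drop' if a terminating verdict is hit, else None."""
    for match, verdict in chains[chain_name]:
        if not match(packet):
            continue
        if verdict in ("accept", "drop"):
            return verdict                      # terminating: stop everything
        if verdict == "return":
            return None                         # back to the calling chain
        if isinstance(verdict, tuple) and verdict[0] == "jump":
            result = evaluate(chains, verdict[1], packet)
            if result is not None:
                return result
            # the jumped-to chain ended without a verdict: carry on here
        # "continue" simply falls through to the next rule
    return None


def filter_packet(chains, base_chain, policy, packet):
    """Evaluate a base chain and apply its policy when no rule decides."""
    return evaluate(chains, base_chain, packet) or policy


chains = {
    "inbound": [
        (lambda p: p["dport"] == 22, "accept"),
        (lambda p: p["dport"] == 853, ("jump", "inbound_dns")),
    ],
    "inbound_dns": [
        # hypothetical stand-in for the rate-limit match
        (lambda p: p["within_limit"], "accept"),
    ],
}

print(filter_packet(chains, "inbound", "drop", {"dport": 22, "within_limit": True}))
print(filter_packet(chains, "inbound", "drop", {"dport": 853, "within_limit": True}))
print(filter_packet(chains, "inbound", "drop", {"dport": 853, "within_limit": False}))
```

The last packet is the interesting one: nothing in inbound_dns issues a terminating verdict for it, so evaluation returns to inbound, runs out of rules there, and the base chain's drop policy applies.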

ct state vmap { established : accept, related : accept, invalid : drop }

This can look unfriendly at first but is actually very straightforward and very concise.

ct state is a connection tracking match: it matches on the state that netfilter's conntrack (connection tracking) subsystem has assigned to the packet's connection.

The vmap or verdict map, is a map containing expressions as keys and verdicts as values. Without the use of a vmap, the same one line rule becomes:

ct state established accept
ct state related accept
ct state invalid drop

A question of style ultimately, but I prefer the vmap one.
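If it helps, a verdict map can be thought of as a dictionary lookup. The Python below is only an analogy; CT_STATE_VMAP and ct_state_verdict are names made up for this sketch.

```python
# A verdict map behaves much like a dictionary lookup keyed on the
# connection state (illustrative model only).
CT_STATE_VMAP = {
    "established": "accept",
    "related": "accept",
    "invalid": "drop",
}


def ct_state_verdict(state):
    """Return the mapped verdict, or None when the state is not in the map."""
    return CT_STATE_VMAP.get(state)


for state in ("established", "invalid", "new"):
    print(state, "->", ct_state_verdict(state))
```

A state with no entry, such as new, yields no verdict, which corresponds to the rule not matching and evaluation moving on to the next rule.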

Rate-limiting, Counters and Logging

We want TCP traffic with a destination port of 853 to be evaluated in its own chain. This is defined by the syntax chain inbound_dns. Nftables refers to this type of chain as a "regular" chain, and it requires just an arbitrary name. Essentially, regular chains are a means of organizing rules.

Let's look at the first rule in this chain:

counter name counter_all_dns_packets

This is a non-verdict and non-terminating statement. It increments a "named" counter which is defined later in the ruleset as:

counter counter_all_dns_packets

Therefore, every inbound 853 packet that reaches this chain is counted. This is perfect for metrics, and by using a named counter we can do some interrogation from the command line, e.g.

$>nft list counter main counter_all_dns_packets
table ip main {
        counter counter_all_dns_packets {
                packets 4106 bytes 262200
        }
}

Counters can also be dumped as JSON with the -j flag. Perfect for feeding into something like DataDog DogStatsD and generating metrics.

$>nft -j list counter main counter_all_dns_packets
{"nftables": [{"metainfo": {"version": "0.9.8", "release_name": "E.D.S.", "json_schema_version": 1}}, {"counter": {"family": "ip", "name": "counter_all_dns_packets", "table": "main", "handle": 5, "packets": 4107, "bytes": 262264}}]}
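As a rough idea of how that JSON could be consumed, here is a small Python sketch that pulls the packet and byte counts out of the output above. The extract_counters helper is a name invented for this example, and the sample string is simply the output shown above pasted in.

```python
import json

# Sample output copied from the command above.
SAMPLE = '{"nftables": [{"metainfo": {"version": "0.9.8", "release_name": "E.D.S.", "json_schema_version": 1}}, {"counter": {"family": "ip", "name": "counter_all_dns_packets", "table": "main", "handle": 5, "packets": 4107, "bytes": 262264}}]}'


def extract_counters(nft_json):
    """Map counter name -> (packets, bytes) from nft's JSON counter output."""
    counters = {}
    for item in json.loads(nft_json)["nftables"]:
        counter = item.get("counter")   # skips the "metainfo" entry
        if counter is not None:
            counters[counter["name"]] = (counter["packets"], counter["bytes"])
    return counters


print(extract_counters(SAMPLE))
```

In practice the string would come from running nft itself (for example via subprocess.run(["nft", "-j", "list", "counters"], capture_output=True, text=True), which needs root), and the resulting numbers could then be handed to whatever metrics client you use.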

Because this is a non-verdict and non-terminating statement, rule evaluation continues with:

ct state new add @rate_meter_inbound_dns { ip saddr limit rate 30/minute burst 10 packets } accept

This is obviously the most involved and most powerful rule in our ruleset. As you can probably see, we want to prevent a traffic-flooding DoS attack by rate-limiting inbound traffic and, importantly, do this per source IP address.

30 packets are allowed per minute (per IP), with a burst-limit of 10 (see below for an explanation of this). Each address's limit is reset when its entry times out after 10 minutes. Furthermore, we also want to count the number of requests that break this rule (flood) and we want to log a selection of those requests.

Those coming from iptables will know this kind of per-source limiting as the hashlimit match.

nftables uses dynamic maps and sets to keep state. Because this rate-limiting rule tracks source IP addresses, it is dynamic. So, we match new TCP connections and use a "named dynamic set" to store the source IP address, which forms part of the rule match. @ specifies the named set.

The named set is defined with this syntax:

set rate_meter_inbound_dns {
    type ipv4_addr
    flags dynamic
    timeout 10m

The man page covers sets. In a nutshell, we create a set to store the IPv4 address as part of the rule. Its contents are dynamic, of course, and any elements older than 10 minutes will be purged. This allows us to reset each address's rate limit every 10 minutes.

Because this is a named set, we can also easily interrogate it from the command line.

$>nft list set main rate_meter_inbound_dns
table ip main {
        set rate_meter_inbound_dns {
                type ipv4_addr
                size 65535
                flags dynamic,timeout
                timeout 10m
                elements = { 1.7.23.29 limit rate 30/minute burst 10 packets expires 9m45s892ms, 17.93.41.155 limit rate 30/minute burst 10 packets expires 9m23s708ms }
        }
}
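The timeout behaviour can be pictured with a toy Python model of a set whose elements expire. This is not nftables code: the DynamicSet class is invented here, and it assumes that adding an address that is already present leaves its expiry untouched, which is my understanding of how add (as opposed to update) behaves.

```python
# Toy model of a dynamic set with a per-element timeout (not nftables code).
TIMEOUT = 600.0  # seconds, matching "timeout 10m"


class DynamicSet:
    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.expires_at = {}          # element -> absolute expiry time

    def purge(self, now):
        """Drop every element whose timeout has passed."""
        self.expires_at = {k: t for k, t in self.expires_at.items() if t > now}

    def add(self, element, now):
        """Insert the element if absent; an existing expiry is left untouched
        (assumption: this mirrors 'add' rather than 'update')."""
        self.purge(now)
        self.expires_at.setdefault(element, now + self.timeout)

    def elements(self, now):
        self.purge(now)
        return sorted(self.expires_at)


s = DynamicSet()
s.add("1.7.23.29", now=0)
s.add("17.93.41.155", now=30)
s.add("1.7.23.29", now=300)          # already present: expiry stays at 600
print(s.elements(now=599))           # both addresses still present
print(s.elements(now=600))           # the first address has expired
print(s.elements(now=630))           # the set is empty again
```

Once an address drops out of the set, the next packet from it creates a fresh element with a fresh limit.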

Lastly, because the verdict statement is accept, any packets that match our rate-limit requirements are accepted and no further rules in the chain are evaluated. Any that do not (flood traffic) are evaluated by our final two rules:

counter name counter_flooded_dns_packets

This increments the named counter which tracks the number of flood packets (roughly corresponding to the number of DNS requests).

limit rate 6/minute log prefix "[nftables dns flood]"

Our final rule, again, is only matched by flood traffic. This logs to the kernel log with the level WARN by default. To log every flood packet is asking to DoS ourselves, so to avoid that we rate-limit the logging to 6/minute. Note that the order in which we write the match is critical here. Writing the rule as...

log prefix "[nftables dns flood]" limit rate 6/minute

will not have the desired effect: statements are evaluated left to right, so the log statement would run before the limit match and every packet would be logged.
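The effect of the ordering can be seen in a toy Python model where a rule is a list of statements run left to right and a statement that fails to match stops the rest of the rule. The Limit class here is a deliberately crude stand-in (it ignores time and just lets a fixed number of packets through), and run_rule and log are names made up for this sketch.

```python
# Toy model: statements in a rule run left to right, and a statement that
# does not match stops the rest of the rule (illustrative only).

class Limit:
    """Crude limiter: lets `allowed` packets through, then matches nothing."""

    def __init__(self, allowed):
        self.remaining = allowed

    def __call__(self, packet, log_lines):
        if self.remaining == 0:
            return False
        self.remaining -= 1
        return True


def log(packet, log_lines):
    log_lines.append(f"[nftables dns flood] {packet}")
    return True                       # logging never stops evaluation


def run_rule(statements, packets):
    """Run every packet through the rule and return the log lines produced."""
    log_lines = []
    for packet in packets:
        for statement in statements:
            if not statement(packet, log_lines):
                break
    return log_lines


packets = [f"packet-{i}" for i in range(100)]
print(len(run_rule([Limit(6), log], packets)))   # limit first: 6 log lines
print(len(run_rule([log, Limit(6)], packets)))   # log first: 100 log lines
```

With the limit first, only the packets that get past it are logged; with log first, every packet has already been logged by the time the limit is consulted.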

Understanding Burst and Rate-limit

This seems to be a source of confusion and misunderstanding, however it is actually fairly simple.

An easy example to understand is rate-limiting bandwidth. Assume a rate limit of 30MB/minute with a burst allowance of 2MB for a web server serving a website. This does not mean that the download rate is held at a constant 512KB/s (30MB/60s). The burst lets a user exceed that rate briefly, so the first 2MB of a new connection can be delivered as fast as the link allows. This is ideal for websites where downloading a small amount of initial content should be quick, while sustained downloads (which could quickly saturate bandwidth with many concurrent users) are capped at approximately 512KB/s once the burst is used up.

I think of the rate-limit and burst like a stamina meter in a video game. Think about sprinting in something like Battlefield or Zelda. The burst-limit is the maximum size of the stamina gauge and the rate-limit is how quickly that gauge recharges. If the timeout in the dynamic set is 10 minutes, then the stamina gauge is recharged to full every 10 minutes.
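The stamina-meter picture is essentially a token bucket, which can be sketched in a few lines of Python. This is a simplified model for intuition only and makes no claim to match the kernel's exact accounting; the TokenBucket class is invented for this example, with the numbers taken from limit rate 30/minute burst 10 packets.

```python
class TokenBucket:
    """Simplified token bucket: `burst` is the gauge size, the rate refills it."""

    def __init__(self, rate_per_minute, burst):
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)    # the gauge starts full
        self.last = 0.0

    def allow(self, now):
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate_per_minute=30, burst=10)

# 15 packets arrive at the same instant: only the burst of 10 gets through.
print(sum(bucket.allow(now=0.0) for _ in range(15)))

# At 30/minute the gauge refills one token every two seconds.
print(bucket.allow(now=1.0))   # only half a token so far
print(bucket.allow(now=2.0))   # a full token is available again
```

A sender that stays under one packet every two seconds never drains the gauge, while a flood empties it straight away and is then held to the refill rate.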

Command Line Examples

nft -cf <filename> - check syntax is valid without applying

nft -f <filename> - load a ruleset

nft list ruleset

nft list counters

nft list sets

List a named table, counter or set:

nft list table main

nft list counter main counter_all_dns_packets

nft list set main rate_meter_inbound_dns

The -j flag dumps the output as JSON, e.g. nft -j list ruleset