Linux 4.13 Released

Linux v4.13 was released this past weekend on Sunday, September 3rd; this is a quick summary of the SELinux and audit changes.

SELinux

  • The largest SELinux change in Linux v4.13 is the addition of SELinux access controls for Infiniband. This was a large effort that involved a new SELinux policy version (v31), two new object classes (infiniband_pkey and infiniband_endport), the creation of an LSM notification mechanism, and a number of changes to core Infiniband code. Daniel Jurgens, the patchset author, provided an excellent summary of the changes in his cover letter; a portion of it is excerpted below:

    From: Daniel Jurgens

    Infiniband applications access HW from user-space – traffic is generated directly by HW, bypassing the kernel. Consequently, Infiniband Partitions, which are associated directly with HW transport endpoints, are a natural choice for enforcing granular mandatory access control for Infiniband. QPs may only send or receives packets tagged with the corresponding partition key (PKey). The PKey is not a cryptographic key; it’s a 16 bit number identifying the partition.

    Every Infiniband fabric is controlled by a central Subnet Manager (SM). The SM provisions the partitions by assigning each port with the partitions it can access. In addition, the SM tags each port with a subnet prefix, which identifies the subnet. Determining which users are allowed to access which partition keys on a given subnet forms an effective policy for isolating users on the fabric. Any application that attempts to send traffic on a given subnet is automatically subject to the policy, regardless of which device and port it uses. SM software configures the subnet through a privileged Subnet Management Interface (SMI), which is presented by each Infiniband port. Thus, the SMI must also be controlled to prevent unauthorized changes to fabric configuration and partitioning.

    To support access control for IB partitions and subnet management, security contexts must be provided for two new types of objects - PKeys and IB ports.

    A PKey label consists of a subnet prefix and a range of PKey values and is similar to the labeling mechanism for netports. Each Infiniband port can reside on a different subnet. So labeling the PKey values for specific subnet prefixes provides the user maximum flexibility, as PKey values may be determined independently for different subnets. There is a single access vector for PKeys called “access”.

    An Infiniband port is labeled by device name and port number. There is a single access vector for IB ports called “manage_subnet”.

    Because RDMA allows kernel bypass, enforcement must be done during connection setup. Communication over RDMA requires a send and receive queue, collectively known as a Queue Pair (QP). A QP must be initialized by privileged system calls before it can be used to send or receive data. During initialization the user must provide the PKey and port the QP will use; at this time access control can be enforced.

    Because there is a possibility that the enforcement settings or security policy can change, a means of notifying the ib_core module of such changes is required. To facilitate this a generic notification callback mechanism is added to the LSM. One callback is registered for checking the QP PKey associations when the policy changes. Mad agents also register a callback, they cache the permission to send and receive SMPs to avoid another per packet call to the LSM.

    Because frequent accesses to the same PKey’s SID is expected a cache is implemented which is very similar to the netport cache.

    In order to properly enforce security when changes to the PKey table or security policy or enforcement occur ib_core must track which QPs are using which port, pkey index, and alternate path for every IB device. This makes operations that used to be atomic transactional.

  • An important part of the Infiniband work was the creation of an LSM notification mechanism that allows various kernel subsystems to receive notification of LSM events. At present this is limited to just SELinux policy changes, but I expect additional events to be added in the future as they are needed.

  • The SELinux “file:map” permission was added to control memory-mapped file access. This allows the SELinux policy to prevent specific files from being memory mapped, ensuring that every access to those files goes through a system call where it can be revalidated. Stephen Smalley provides more information in the patch description:

    From: Stephen Smalley

    Add a map permission check on mmap so that we can distinguish memory mapped access (since it has different implications for revocation). When a file is opened and then read or written via syscalls like read(2)/write(2), we revalidate access on each read/write operation via selinux_file_permission() and therefore can revoke access if the process context, the file context, or the policy changes in such a manner that access is no longer allowed. When a file is opened and then memory mapped via mmap(2) and then subsequently read or written directly in memory, we presently have no way to revalidate or revoke access. The purpose of a separate map permission check on mmap(2) is to permit policy to prohibit memory mapping of specific files for which we need to ensure that every access is revalidated, particularly useful for scenarios where we expect the file to be relabeled at runtime in order to reflect state changes (e.g. cross-domain solution, assured pipeline without data copying).

  • Allow proper per-file labeling for tracefs filesystems using the SELinux genfscon mechanism.

  • Starting with Linux v4.13, whenever SELinux policy is loaded into the kernel we log the SELinux policy capability state to the kernel’s ring buffer. An example can be seen below:
    [    2.017308] SELinux:  policy capability network_peer_controls=1
    [    2.017880] SELinux:  policy capability open_perms=1
    [    2.018344] SELinux:  policy capability extended_socket_class=0
    [    2.018919] SELinux:  policy capability always_check_network=0
    [    2.019513] SELinux:  policy capability cgroup_seclabel=0
    
  • The Linux Kernel does not allow sockets to be opened directly; the attempt fails with the ENXIO error. However, before the kernel ultimately rejects the access, the SELinux policy is checked, and in the case of a socket file descriptor the resulting check can seem a bit odd. The SELinux socket object classes do not contain the “open” permission; they contain the “recvfrom” permission instead, and this difference causes a socket “open” access to appear as a “recvfrom” SELinux denial. Linux v4.13 fixes this by skipping open access checking on sockets and letting the core kernel code handle the denial.

  • Allow the LSM security_sb_clone_mnt_opts() hook to enable or disable the native labeling behavior. This is important for proper SELinux file labeling on NFS v4.2+.

  • Normally valid SELinux labels must be used when labeling files; however, if the process has the CAP_MAC_ADMIN capability it is possible to set an unknown, or invalid, SELinux label on a file. Prior to Linux v4.13 setting an unknown SELinux label on a file would cause the SELinux subsystem to perform the usual SELinux checks, in addition to any other stacked LSMs’ CAP_MAC_ADMIN checks. Depending on the LSMs that were in use this could result in odd or unexpected behavior. We fix this in Linux v4.13 by only performing the base CAP_MAC_ADMIN capability checks in addition to the SELinux checks; no other LSMs are asked to provide access control decisions.

  • The SELinux internal ebitmap type was converted to use the kmem_cache mechanism. This potentially saves a small amount of memory on some systems and provides better SELinux memory usage statistics.

  • SELinux was converted to use the LSM security_task_alloc() hook instead of the security_task_create() hook. The expectation is that the security_task_create() hook will be deprecated and eventually removed from the Linux Kernel.

  • A clang build warning related to redundant filesystem labeling behavior checks was fixed.
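
For anyone who wants to check the policy capability state from a script, the log lines shown in the list above are regular enough to parse. The following is only a minimal sketch: the function name is my own invention, the regular expression assumes the exact line format from the example output, and the sample text is simply that example copied into a string.

```python
import re

# Matches lines such as:
#   [    2.017308] SELinux:  policy capability network_peer_controls=1
# The pattern assumes the format shown in the example output above.
_POLCAP_RE = re.compile(r"SELinux:\s+policy capability (\w+)=([01])\s*$")


def parse_policy_capabilities(log_text):
    """Return {capability_name: bool} for every matching line in log_text."""
    caps = {}
    for line in log_text.splitlines():
        match = _POLCAP_RE.search(line)
        if match:
            caps[match.group(1)] = match.group(2) == "1"
    return caps


# The example ring buffer output from this post.
SAMPLE = """\
[    2.017308] SELinux:  policy capability network_peer_controls=1
[    2.017880] SELinux:  policy capability open_perms=1
[    2.018344] SELinux:  policy capability extended_socket_class=0
[    2.018919] SELinux:  policy capability always_check_network=0
[    2.019513] SELinux:  policy capability cgroup_seclabel=0
"""

if __name__ == "__main__":
    for name, enabled in parse_policy_capabilities(SAMPLE).items():
        print(f"{name}: {'enabled' if enabled else 'disabled'}")
```

On a live system the same function could be fed the kernel ring buffer contents instead of the sample string; lines that do not match the pattern are simply ignored.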

Audit

  • Linux v4.3 added the concept of ambient capabilities to the Linux Kernel; we now log the ambient capabilities in the audit BPRM_FCAPS and CAPSET records using the “cap_pa”, “old_pa”, and “pa” fields.

  • Prior to Linux v4.13 file capabilities would only be recorded in the audit PATH record if they were set. Starting with v4.13 the permitted and inheritable file capabilities are always recorded in the PATH record, resulting in a more consistent record format.

  • The “new_<capability>” prefix has been shortened to simply “<capability>” in the audit BPRM_FCAPS record; for example “new_pp” is now “pp”.

  • Fixed a race condition where the kernel/auditd connection could be reset shortly after the audit daemon starts and registers itself with the kernel. Fedora BZ #1459326 has more information:

    This issue is partly due to the read-copy nature of RCU, and partly due to how we sync the auditd_connection state across kauditd_thread and the audit control channel. The kauditd_thread thread is always running so it can service the record queues and emit the multicast messages, if it happens to be just past the “main_queue” label, but before the “if (sk == NULL || …)” if-statement which calls auditd_reset() when the new auditd connection is registered it could end up resetting the auditd connection, regardless of if it is valid or not. This is a rather small window and the variable nature of multi-core scheduling explains why this is proving rather difficult to reproduce.

  • Fixed a use-after-free problem in the audit filesystem watch code. The core problem was improper fsnotify reference counting in the audit subsystem. Jan Kara provides more information in the patch description:

    From: Jan Kara

    audit_remove_watch_rule() drops watch’s reference to parent but then continues to work with it. That is not safe as parent can get freed once we drop our reference. The following is a trivial reproducer:

    mount -o loop image /mnt
    touch /mnt/file
    auditctl -w /mnt/file -p wax
    umount /mnt
    auditctl -D
    <crash in fsnotify_destroy_mark()>

    Grab our own reference in audit_remove_watch_rule() earlier to make sure mark does not get freed under us.

  • Ensure we clean up any audit filesystem watch fsnotify marks when a filesystem is unmounted.

  • Ensure that all of the audit records are sent to any multicast listeners, e.g. the systemd journal, when the audit daemon connection is reset. Prior to Linux v4.13 some audit records could be lost when the audit daemon unregistered from the kernel.

  • Fixed a memory leak in the auditd_send_unicast_skb() function that would leak an audit REPLACE record in certain situations.
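
Tools that read BPRM_FCAPS records may need to cope with both the old “new_<capability>” field names and the shortened v4.13 names described in the list above. The sketch below shows one way to normalize them. The helper names are hypothetical, the parser only handles simple space-separated key=value pairs, and the sample strings contain made-up values for illustration; they are not copied from real audit records.

```python
def parse_fields(record_body):
    """Split a 'key=value key=value' string into a dict (no quoting support)."""
    fields = {}
    for token in record_body.split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def normalize_fcaps_fields(fields):
    """Rename pre-v4.13 'new_<capability>' keys to the v4.13 '<capability>' form."""
    normalized = {}
    for key, value in fields.items():
        if key.startswith("new_"):
            key = key[len("new_"):]
        normalized[key] = value
    return normalized


# Hypothetical field values, for illustration only.
OLD_STYLE = "old_pp=0000000000000000 new_pp=0000000000000400"
NEW_STYLE = (
    "old_pp=0000000000000000 pp=0000000000000400 "
    "old_pa=0000000000000000 pa=0000000000000000"
)

if __name__ == "__main__":
    print(normalize_fcaps_fields(parse_fields(OLD_STYLE)))
    print(normalize_fcaps_fields(parse_fields(NEW_STYLE)))
```

Fields that already use the v4.13 names, including the “old_” fields, pass through unchanged, so the same code can process records from kernels on either side of the change.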

Kernel Repository Process

It has been over a year since I formally updated the SELinux and audit kernel repository processes, and based on how things have evolved it seems we are due for another update. This time the changes are rather small, and shouldn’t surprise anyone who has been following upstream development.

The process below applies to both SELinux and audit.

  1. After the merge window closes upstream, a decision will be made regarding the need to rebase the next branch on top of the current Linux -rc1 release. If there have been a number of subsystem-related changes outside of the subsystem’s next branch, or if the branch’s base is too far behind linux/master, it may be necessary to rebase the next branch. If a rebase is needed, it should be done before any patches are merged, and rebasing the next branch during the remaining -rcX releases should only be done in extreme cases.

  2. Patches will be merged into the subsystem’s next branch during the development cycle, which extends from the close of the merge window until the merge window reopens. However, it is important to note that large, complicated, or invasive patches sent late in the development cycle may be deferred until the next cycle. As a general rule, only small patches or critical fixes will be merged after -rc5/-rc6.

  3. Any patches deemed necessary for the current Linux -rcX releases will be merged into the current stable-X.Y branch, marked with a signed tag, and a pull request sent against linux/master as soon as it is reasonable to do so.

  4. During the development cycle Fedora Rawhide test kernels will be generated using the next and most recent stable-X.Y branches on a weekly basis, if not more often. These kernels will be tested against the SELinux test suite and audit test suite as well as being made available to everyone for additional testing.

  5. Once the merge window opens, the next branch will be copied to a new branch, stable-X.Y, and the branch will be marked with a signed tag in the format subsystem-pr-YYYYMMDD. A pull request will be sent against the linux/master branch using the signed tag.
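
The branch and tag naming used in the steps above is simple enough to generate mechanically. The following is just a sketch of the naming convention; the helper names are hypothetical and are not part of any tooling described in this post, and the date in the usage example is arbitrary.

```python
from datetime import date


def stable_branch(major, minor):
    """Return the stable branch name in the stable-X.Y format."""
    return f"stable-{major}.{minor}"


def pull_request_tag(subsystem, day):
    """Return the signed tag name in the subsystem-pr-YYYYMMDD format."""
    return f"{subsystem}-pr-{day.strftime('%Y%m%d')}"


if __name__ == "__main__":
    print(stable_branch(4, 13))
    print(pull_request_tag("selinux", date(2017, 9, 5)))
    print(pull_request_tag("audit", date(2017, 9, 5)))
```

For example, pull_request_tag("selinux", date(2017, 9, 5)) produces "selinux-pr-20170905".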

For reference, the previous process was defined here.

UPDATE: The SELinux kernel process has been updated now that we are basing the tree against Linus’ tree and sending pull requests directly to Linus.

UPDATE #2: The SELinux and audit processes have been merged (ha!) and the process has been changed to reflect potential rebasing to -rc1 at the start of the development cycle.

UPDATE #3: While the process documented here is not changing at this time, all future process updates will be documented in the README.md file in the base directory of each kernel subsystem tree. This post, and the process it describes, should be considered deprecated moving forward.

Kernel Repository Move

In a move that is long overdue, I’m moving the SELinux and audit repositories from infradead.org to kernel.org. The URLs for both repositories are shown below:

SELinux

Audit