CompletePart 1 • Chapter 6

The Anatomy of a FreeBSD Driver

A deep dive into the internal structure, lifecycle, and essential components that define every FreeBSD device driver.

The Anatomy of a FreeBSD Driver

Introduction

If the previous chapters gave you the language and tools to work inside FreeBSD, this chapter will show you how drivers are actually structured. Think of it as moving from learning carpentry techniques to understanding architectural blueprints: before you build a house, you need to know where the foundation goes, how the frame connects, where utilities run, and how all the pieces fit together.

Important: This chapter focuses on understanding driver structure and patterns. You will not yet write a complete, fully functional driver in this chapter; that begins in Chapter 7. Here, we're building your mental model and pattern recognition skills first.

Writing a device driver can feel mysterious at first. You know it talks to hardware, you know it lives in the kernel, but how does it all work? How does the kernel discover your driver? How does it decide when to call your code? What happens when a user program opens /dev/yourdevice? And most importantly, what does the blueprint of a real, working driver actually look like?

This chapter answers those questions by showing you the anatomy of FreeBSD drivers, the common structures, patterns, and lifecycles that all drivers share. You'll learn:

  • How drivers plug into FreeBSD through newbus, devfs, and module packaging
  • The common patterns that character, network, and storage drivers follow
  • The lifecycle from discovery through probe, attach, operation, and detach
  • How to recognize driver structure in real FreeBSD source code
  • Where to find your way around when reading or writing drivers

By the end of this chapter, you won't just understand drivers conceptually, you'll be able to read real FreeBSD driver code and immediately recognize the patterns. You'll know where to look for device attachment, how initialization happens, and how cleanup works. This chapter is your blueprint for understanding any driver you encounter in the FreeBSD source tree.

What This Chapter IS

This chapter is your architectural tour of driver structure. It teaches you:

  • Pattern recognition: The shapes and idioms all drivers follow
  • Navigation skills: Where to find what in driver source code
  • Vocabulary: The names and concepts (newbus, devfs, softc, cdevsw, ifnet)
  • Lifecycle understanding: When and why each driver function is called
  • Structural overview: How the pieces connect without deep implementation

Think of this as learning to read blueprints before you start building.

What This Chapter Is Not

This chapter deliberately defers deep mechanics so we can focus on structure without overwhelming beginners. We will not dive deeply into:

  • Bus specifics (PCI/USB/ACPI/FDT): We'll mention buses conceptually, but skip hardware- and bus-specific discovery/attachment details.
  • Interrupt handling: You'll see where handlers fit in a driver's lifecycle, not how to program or tune them.
  • DMA programming: We'll acknowledge DMA and why it exists, not how to set up maps, tags, or synchronization.
  • Hardware register I/O: We'll preview bus_space_* at a high level, not full MMIO/PIO access patterns.
  • Network packet paths: We'll point out how ifnet surfaces an interface, not implement packet TX/RX pipelines.
  • GEOM internals: We'll introduce storage surfaces, not provider/consumer plumbing or graph transformations.

If you're curious about these topics while reading, great, bookmark the terms and keep going. This chapter gives you the map; the detailed territories come later in the book.

How This Fits in Your Journey

You're entering the final chapter of Part 1 - Foundations. By the time you finish this chapter, you will have a clear blueprint of how a FreeBSD driver is shaped and how it plugs into the system, completing the foundation you've been building:

  • Chapters 1 to 5 (so far): why drivers matter, a safe lab, UNIX/FreeBSD basics, C for userspace, and C in the kernel context.
  • Chapter 6 (this chapter): the driver's anatomy; structure, lifecycle, and the user-visible surface, so you can recognise the pieces before you start coding.

With that foundation in place, Part 2 - Building Your First Driver will move from concepts to code, step by step:

  • Chapter 7: Writing Your First Driver - scaffold and load a minimal driver.
  • Chapter 8: Working with Device Files - create a /dev node and wire basic entry points.
  • Chapter 9: Reading and Writing to Devices - implement simple data paths for read(2)/write(2).
  • Chapter 10: Handling Input and Output Efficiently - introduce tidy, responsive I/O patterns.

Think of Chapter 6 as the bridge: you now have the language (C) and the environment (FreeBSD), and with this anatomy in mind, you're ready to start building in Part 2.

If you're skimming: Chapter 6 = the blueprint. Part 2 = the build.

Reader Guidance: How to Use This Chapter

This chapter is designed as both a structural reference and a guided reading experience. Unlike the hands-on coding focus of Chapter 7, this chapter emphasizes understanding, pattern recognition, and navigation. You'll spend time examining real FreeBSD driver code, identifying structures, and building a mental model of how everything connects.

Estimated Time Investment

Your total time depends on how deeply you engage. Use the track that fits your pace.

Track A - Read-through only Plan 8-10 hours to absorb the concepts, skim the diagrams, and read the code excerpts at a comfortable beginner pace. This gives you a solid mental model without hands-on steps.

Track B - Read + follow along in /usr/src Plan 12-14 hours if you open the referenced files under /usr/src/sys as you read, scroll around the surrounding context, and type the micro-snippets into a scratch file. This reinforces pattern recognition and navigation skills.

Track C - Read + follow along + all four labs Add 2.5-3.5 hours to complete all four beginner-safe labs for this chapter: Lab 1 (Scavenger Hunt), Lab 2 (Hello Module), Lab 3 (Device Node), Lab 4 (Error Handling). These are short, focused checkpoints that validate what you learned in the chapter's tours and explanations.

Optional - Challenge questions Add 2-4 hours to tackle the end-of-chapter challenges. These deepen your understanding of entry points, error unwinding, dependencies, and classification by reading real drivers.

Suggested pacing Break this chapter into two or three sessions. A practical split is: Session 1: Read through the driver model, skeleton, and lifecycle while following along in /usr/src. Session 2: Complete Labs 1-2. Session 3: Complete Labs 3-4 and, if desired, the challenges.

Reminder Don't rush. The goal here is driver literacy: the ability to open any driver, locate its probe/attach/detach paths, recognise the cdev/ifnet/GEOM shapes, and understand how it plugs into newbus and devfs. Mastery here makes Chapter 7's build go much faster and with fewer surprises.

What to Have Ready

To get the most from this chapter, prepare your workspace:

  1. Your FreeBSD lab environment from Chapter 2 (VM or physical machine)
  2. FreeBSD 14.3 with /usr/src installed (we'll reference real files from the kernel source tree)
  3. A terminal where you can run commands and examine files
  4. Your lab logbook for notes and observations
  5. Access to manual pages: You'll frequently consult man 9 <function> for reference

Note: All examples were tested on FreeBSD 14.3; adjust commands if you use a different release.

Pacing and Approach

This chapter works best when you:

  • Read sequentially: Each section builds on the previous one. The order matters.
  • Keep /usr/src open: When we reference a file like /usr/src/sys/dev/null/null.c, actually open it and look at the surrounding context
  • Use man 9 as you go: When you see a function like device_get_softc(), run man 9 device_get_softc to see the official documentation
  • Type micro-snippets yourself: Even in this "read-only" chapter, typing key patterns (like a probe function or method table) cements the shapes in your memory
  • Don't rush the labs: They're designed as checkpoints. Complete each one before moving to the next section

Managing Your Curiosity

As you read, you'll encounter concepts that spark deeper questions:

  • "How exactly do PCI interrupts work?"
  • "What are all the flags in bus_alloc_resource_any()?"
  • "How does the network stack call my transmit function?"

This is expected and healthy. But resist the urge to dive down every rabbit hole now. This chapter is about recognizing patterns and understanding structure. The deep mechanics have their own dedicated chapters.

Strategy: Keep a "Curiosity List" in your lab logbook. When something piques your interest, jot it down with a note about where in the book it will be covered. For example:

Curiosity List:
- Interrupt handler details  ->  Chapter 19: Handling Interrupts and 
                             ->  Chapter 20: Advanced Interrupt Handling
- DMA buffer setup  ->  Chapter 21: DMA and High-Speed Data Transfer
- Network packet queues  ->  Chapter 28: Writing a Network Driver
- PCI configuration space  ->  Chapter 18: Writing a PCI Driver

This lets you acknowledge your questions without derailing your current focus.

Success Criteria

By the end of this chapter, you should be able to:

  • Open any FreeBSD driver and immediately locate its probe, attach, and detach functions.
  • Identify whether a driver is character, network, storage, or bus-oriented.
  • Recognize a device method table and understand what it maps.
  • Find the softc structure and understand its role.
  • Trace the basic lifecycle from module load through device operation.
  • Read logs and match them to driver lifecycle events.
  • Locate relevant manual pages for key functions.

If you can do these things, you're ready for Chapter 7's hands-on coding.

How to Get the Most Out of This Chapter

Now that you know what to expect and how to pace yourself, let's discuss specific learning tactics that will make driver structure click for you. These strategies have proven effective for beginners tackling FreeBSD's driver model.

Keep /usr/src Nearby

Every code example in this chapter comes from real FreeBSD 14.3 source files. Don't just read the snippets in this book, open the actual files and see them in context.

Why this matters:

Seeing the full file shows you:

  • How includes are organized at the top
  • How multiple functions relate to each other
  • Comments and documentation the original developers left
  • Real-world patterns and idioms

Quick Locator: Where in the Source Tree?

Shape you're studyingTypical place in /usr/src/sysA concrete file to open first
Minimal character device (cdevsw)dev/null/dev/null/null.c
Simple infrastructure device (LED)dev/led/dev/led/led.c
Pseudo-network interface (tun/tap)net/net/if_tuntap.c
UART PCI "glue" exampledev/uart/dev/uart/uart_bus_pci.c
Bus plumbing (for reference)dev/pci/, kern/, bus/skim dev/pci/pcib*.* and related

Tip: open one of these side-by-side with the explanations to strengthen pattern recognition.

Practical tip: Keep a second terminal or editor window open. When the text says:

"Here's an example from /usr/src/sys/dev/null/null.c (lines 67-73):"

Actually navigate there:

% cd /usr/src/sys/dev/null
% less null.c

Use / in less to search for patterns like probe or cdevsw, and jump directly to relevant sections.

Type the Micro-Snippets Yourself

Even though Chapter 7 is where you'll write complete drivers, typing short patterns now builds fluency.

When you see a probe function example, don't just read it, type it in a scratch file:

static int
mydriver_probe(device_t dev)
{
    device_set_desc(dev, "My Example Driver");
    return (BUS_PROBE_DEFAULT);
}

Why this works: Typing engages muscle memory. Your fingers learn the shapes (device_t, BUS_PROBE_DEFAULT) faster than your eyes alone. By the time you reach Chapter 7, these patterns will feel natural.

Practical tip:

Create a scratch directory:

% mkdir -p ~/scratch/chapter06
% cd ~/scratch/chapter06
% vi patterns.c

Use this space to collect the patterns you're learning.

Treat Labs as Checkpoints

This chapter includes four hands-on labs (see "Hands-On Labs" section):

  1. Lab 1: Read-only scavenger hunt through real drivers
  2. Lab 2: Build and load a minimal module that just logs messages
  3. Lab 3: Create and remove a device node in /dev
  4. Lab 4: Error handling and defensive programming

Don't skip these. They're your validation that the concepts have moved from "I read about it" to "I can do it."

Timing: Complete each lab when you reach the "Hands-On Labs" section, not before. The labs assume you've read the earlier sections covering driver structure and patterns. They're designed to synthesize everything into hands-on practice.

Success mindset: Labs are meant to be achievable. If you get stuck, revisit the relevant section, check the man 9 pages cited in the text, and use the Summary Reference Table - Driver Building Blocks at a Glance at the end of the chapter. Each lab should take 20-45 minutes.

Defer the Deep Mechanics

This chapter repeatedly says things like:

  • "Interrupts are covered in Chapter 19 and 20"
  • "DMA details in Chapter 21"
  • "Network packet processing in Chapter 28"

Trust this structure. Trying to learn everything at once leads to confusion and burnout.

Analogy: When you learn to drive, you first understand the car's controls (steering, pedals, gearshift) before studying engine mechanics. Similarly, learn driver structure now, and dive into mechanisms later when you have context.

Strategy: When you hit a "defer this" moment, acknowledge it and move on. The deep topics are coming, and they'll make more sense once you've written a basic driver.

Use man 9 as Your Reference

FreeBSD's section 9 manual pages document kernel interfaces. They're invaluable but can be dense.

When to use them:

  • You see a function name you don't recognize
  • You want to know all parameters and return values
  • You need to confirm behavior

Example:

% man 9 device_get_softc
% man 9 bus_alloc_resource
% man 9 make_dev

Pro tip: Use apropos to search for related functions:

% apropos device | grep "^device"

This shows you all device-related functions at once.

Skim Code Before Reading Explanations

When a section references a source file, try this approach:

  1. Skim the file first (30 seconds)
  2. Notice patterns (where are probe/attach? what includes are there?)
  3. Then read the explanation in this chapter
  4. Go back to the code with newfound understanding

Why this works: Your brain creates a rough mental map first, then the explanation fills in details. This is more effective than reading explanation -> code, which treats code as an afterthought.

Visualize as You Read

Driver structure has a lot of moving parts: buses, devices, methods, lifecycles. Draw diagrams as you encounter new concepts.

Examples of useful diagrams:

  • Device tree showing parent-child relationships
  • Lifecycle flowchart (probe -> attach -> operate -> detach)
  • Character device flow (open -> read/write -> close)
  • Relationship between device_t, softc, and cdev

Tools: Paper and pencil work great. Or use simple text art:

root
 |- nexus0
     |- acpi0
         |- pci0
             |- em0 (network)
             |- ahci0 (storage)
                 |- ada0 (disk)

Study Patterns Across Multiple Drivers

The "Read-Only Tour of Tiny Real Drivers" section tours four real drivers (null, led, tun and minimal PCI). Don't just read each in isolation, compare them:

  • How does null.c structure its cdevsw vs. led.c?
  • Where does each driver initialize its softc?
  • What's similar in their probe functions? What's different?

Pattern recognition is the goal. Once you see the same shape repeated, you'll recognize it everywhere.

Set Realistic Expectations

Plan for about 18-22 hours if you complete all activities in this chapter (reading, tours, labs, and reviews). If you also tackle the optional challenges, allow up to an additional 4 hours. At two hours per day, expect roughly a week or a little more, that's normal and expected.

This isn't a race. The goal is mastery of structure, which is foundational for every subsequent chapter.

Mindset: Think of this chapter as a training program, not a sprint. Athletes don't try to gain all their strength in one session. Similarly, you're building driver literacy gradually.

When to Take Breaks

You'll know you need a break when:

  • You've read the same paragraph three times without absorbing it
  • Function names start blurring together
  • You feel overwhelmed by details

Solution: Step away. Go for a walk, do something else, then return refreshed. This material will still be here, and your brain processes complex information better with rest.

You're Building a Foundation

Remember: This chapter is your blueprint. Chapter 7 is where you'll build actual code. Investing time here pays enormous dividends later because you won't be guessing about the structure; you'll know it.

Let's begin with the big picture.

The Big Picture: How FreeBSD Sees Devices and Drivers

Before we examine any code, we need to establish a mental model of how FreeBSD conceptually organizes devices and drivers. Understanding this model is like learning how a building's plumbing works before you replace a pipe, you need to know where the water comes from and where it goes.

This section provides the one-page overview you'll carry forward through the rest of the chapter. We'll define key terms, show how pieces connect, and give you just enough vocabulary to navigate the rest of the material without drowning in details.

One-Screen Driver Lifecycle

Boot/Hot-plug
|
v
[ Device enumerated by bus ]
| (PCI/USB/ACPI/FDT discovers hardware and creates device_t)
v
[ probe(dev) ]
| Decide: "Am I the right driver?" (score and return priority)
| If not mine  ->  return ENXIO / lower score
v
[ attach(dev) ]
| Allocate softc/state
| Claim resources (memory BARs/IRQ/etc.)
| Create user surface (e.g., make_dev / ifnet)
| Register callbacks, start timers
v
[ operate ]
| Runtime: open/read/write/ioctl, TX/RX, interrupts, callouts
| Normal errors handled; resources reused
v
[ detach(dev) ]
| Quiesce I/O and timers
| Destroy user surface (destroy_dev / if_detach / etc.)
| Free resources and state
v
Goodbye

Keep this flow in mind while you read the tours, every driver you'll see fits this outline.

Devices, Drivers, and Devclasses

FreeBSD uses precise terminology for the components in its device model. Let's define them in plain language:

Device

A device is the kernel's representation of a hardware resource or logical entity. It's a device_t structure that the kernel creates and manages.

Think of it as a name tag for something the kernel needs to track: a network card, a disk controller, a USB keyboard, or even a pseudo-device like /dev/null.

Key insight: A device exists whether or not a driver is attached to it. During boot, buses enumerate hardware and create device_t structures for everything they find. These devices sit waiting for drivers to claim them.

Driver

A driver is code that knows how to control a specific type of device. It's the implementation, the probe, attach, and operational functions that make hardware useful.

A single driver can handle multiple device models. For example, the em driver handles dozens of different Intel Ethernet cards by checking device IDs and adapting behavior.

Devclass

A devclass (device class) is a grouping of related devices. It's how FreeBSD keeps track of, say, "all UART devices" or "all disk controllers."

When you run sysctl dev.em, you're querying the em devclass, which shows all instances (em0, em1, etc.) managed by that driver.

Example:

devclass: uart
devices in this class: uart0, uart1, uart2
each device has a driver attached (or not)

Relationship summary:

  • Devclass = category (e.g., "network interfaces")
  • Device = instance (e.g., "em0")
  • Driver = code (e.g., the em driver's functions)

Why this matters: When you write a driver, you'll register it with a devclass, and each device your driver attaches to becomes part of that class.

The Bus Hierarchy and Newbus (One Page)

FreeBSD organizes devices in a tree structure called the device tree, with buses as internal nodes and devices as leaves. This is managed by a framework called Newbus.

What is a bus?

A bus is any device that can have children. Examples:

  • PCI bus: Contains PCI cards (network, graphics, storage controllers)
  • USB hub: Contains USB peripherals
  • ACPI bus: Contains platform devices enumerated by ACPI tables

The device tree structure:

root
 |- nexus0 (platform-specific root bus)
     |- acpi0 (ACPI bus)
         |- cpu0
         |- cpu1
         |- pci0 (PCI bus)
             |- em0 (network card)
             |- ahci0 (SATA controller)
             |   |- ada0 (disk)
             |   |- ada1 (disk)
             |- ehci0 (USB controller)
                 |- usbus0 (USB bus)
                     |- ukbd0 (USB keyboard)

What is Newbus?

Newbus is FreeBSD's object-oriented device framework. It provides:

  • Device discovery: Buses enumerate their children
  • Driver matching: Probe functions determine which driver fits each device
  • Resource management: Buses allocate IRQs, memory ranges, and other resources to devices
  • Lifecycle management: Coordinating probe, attach, detach

The probe-attach flow:

  1. A bus (e.g., PCI) enumerates its devices by scanning hardware
  2. For each device, the kernel creates a device_t
  3. The kernel calls every compatible driver's probe function: "Can you handle this?"
  4. The driver with the best match wins
  5. The kernel calls that driver's attach function to initialize it

Why "Newbus"?

It replaced an older, less flexible device framework. The "new" is historical; it's been standard for decades now.

Your role as a driver author:

  • You write probe, attach, and detach functions
  • Newbus calls them at the right times
  • You don't manually search for devices, Newbus brings them to you

See it in action:

% devinfo -rv

This shows the complete device tree with resource assignments.

From Kernel to /dev: What devfs Presents

Many devices (especially character devices) appear as files in /dev. How does that work?

devfs (device filesystem)

devfs is a special filesystem that dynamically presents device nodes as files. It's kernel-managed: when a driver creates a device node, it instantly appears in /dev. When the driver is unloaded, the node disappears.

Why files?

The UNIX philosophy: "everything is a file" means uniform access:

% ls -l /dev/null
crw-rw-rw-  1 root  wheel  0x14 Oct 14 12:34 /dev/null

That c means character device. The major number (part of 0x14) identifies the driver; the minor number identifies which instance.

Note: Historically, the device number was split into 'major' (driver) and 'minor' (instance). With devfs and dynamic devices on modern FreeBSD, you don't rely on fixed major/minor values; treat that number as an internal identifier, and use the cdev and devfs APIs instead.

User-space view:

When a program opens /dev/null, the kernel:

  1. Looks up the device by major/minor number
  2. Finds the associated cdev (character device structure)
  3. Calls the driver's d_open function
  4. Returns a file descriptor to the program

For reads/writes:

  • User program calls read(fd, buf, len)
  • Kernel translates to driver's d_read function
  • Driver handles it, returns data or error
  • Kernel passes result back to user program

Not all devices appear in /dev:

  • Network interfaces (em0, wlan0) appear in ifconfig, not /dev
  • Storage layers often use /dev/ada0, but GEOM adds complexity
  • Pseudo-devices may or may not create nodes

Key takeaway: Character drivers typically create /dev entries using make_dev(), and devfs makes them visible. We'll cover this in detail in the "Creating and Removing Device Nodes" section.

Your Manual-Page Map (Read, Don't Memorize)

FreeBSD's section 9 manual pages document kernel APIs. Here's your starter map of the most important pages for driver development. You don't need to memorize these, just know they exist so you can look them up later.

Core device and driver APIs:

  • device(9) - Overview of the device_t abstraction
  • devclass(9) - Device class management
  • DRIVER_MODULE(9) - Registering your driver with the kernel
  • DEVICE_PROBE(9) - How probe methods work
  • DEVICE_ATTACH(9) - How attach methods work
  • DEVICE_DETACH(9) - How detach methods work

Character devices:

  • make_dev(9) - Creating device nodes in /dev
  • destroy_dev(9) - Removing device nodes
  • cdev(9) - Character device structure and operations

Network interfaces:

  • ifnet(9) - Network interface structure and registration
  • if_attach(9) - Attaching a network interface
  • mbuf(9) - Network buffer management

Storage:

  • GEOM(4) - Overview of FreeBSD's storage layer (note: section 4, not 9)
  • g_bio(9) - Bio (block I/O) structure

Resources and hardware access:

  • bus_alloc_resource(9) - Claiming IRQs, memory, etc.
  • bus_space(9) - Portable MMIO and PIO access
  • bus_dma(9) - DMA memory management

Module and lifecycle:

  • module(9) - Kernel module infrastructure
  • MODULE_DEPEND(9) - Declaring module dependencies
  • MODULE_VERSION(9) - Versioning your module

Locking and synchronization:

  • mutex(9) - Mutual exclusion locks
  • sx(9) - Shared/exclusive locks
  • rmlock(9) - Read-mostly locks

Utility functions:

  • printf(9) - Kernel printf variants (including device_printf)
  • malloc(9) - Kernel memory allocation
  • sysctl(9) - Creating sysctl nodes for observability

How to use this map:

When you encounter an unfamiliar function or concept, check if it has a man page:

% man 9 <function_or_topic>

Examples:

% man 9 device_get_softc
% man 9 bus_alloc_resource
% man 9 make_dev

If you're not sure of the exact name, use apropos:

% apropos -s 9 device

Pro tip: Many man pages include SEE ALSO sections at the bottom, pointing to related topics. Follow those breadcrumbs when exploring.

This is your reference library. You don't read it cover-to-cover, you consult it when needed. As you work through this chapter and later chapters, you'll build familiarity with the most common pages naturally.

Summary

You now have the big picture:

  • Devices are kernel objects, drivers are code, devclasses are groupings
  • Newbus manages the device tree and driver lifecycle (probe/attach/detach)
  • devfs presents devices as files in /dev (for character devices)
  • Manual pages in section 9 are your reference library

This mental model is your foundation. In the next section, we'll explore the different families of drivers and how to choose the right shape for your hardware.

Driver Families: Choosing the Right Shape

Not all drivers are created equal. Depending on what your hardware does, you'll need to present the right "face" to the FreeBSD kernel. Think of driver families like professional specializations: a cardiologist and an orthopedic surgeon are both doctors, but they work very differently. Similarly, a character device driver and a network driver both interact with hardware, but they plug into different parts of the kernel.

This section helps you identify which family your driver belongs to and understand the structural differences between them. We'll keep this at a recognition level, later chapters will dive into implementation.

Character Devices

Character devices are the simplest and most common driver family. They present a stream-oriented interface to user programs: open, close, read, write, and ioctl.

When to use:

  • Hardware that sends or receives data byte-by-byte or in arbitrary chunks
  • Control surfaces for configuration (LEDs, GPIO pins)
  • Sensors, serial ports, sound cards, custom hardware
  • Pseudo-devices that implement software functionality

User-space view:

% ls -l /dev/cuau0
crw-rw----  1 root  dialer  0x4d Oct 14 10:23 /dev/cuau0

Programs interact with character devices like files:

int fd = open("/dev/cuau0", O_RDWR);
write(fd, "Hello", 5);
read(fd, buffer, sizeof(buffer));
ioctl(fd, SOME_COMMAND, &arg);
close(fd);

Kernel view:

Your driver implements a struct cdevsw (character device switch) with function pointers:

static struct cdevsw mydev_cdevsw = {
    .d_version = D_VERSION,
    .d_open    = mydev_open,
    .d_close   = mydev_close,
    .d_read    = mydev_read,
    .d_write   = mydev_write,
    .d_ioctl   = mydev_ioctl,
    .d_name    = "mydev",
};

When a user program calls read(), the kernel routes it to your mydev_read() function.

Examples in FreeBSD:

  • /dev/null, /dev/zero, /dev/random - Pseudo-devices
  • /dev/led/* - LED control
  • /dev/cuau0 - Serial port
  • /dev/dsp - Audio device

Why start here: Character devices are the simplest family to understand and implement. If you're learning driver development, you'll almost certainly start with a character device. Chapter 7's first driver is a character device.

Storage via GEOM (Why "Block Devices" Are Different Here)

FreeBSD's storage architecture centers on GEOM (Geometry Management), a modular framework for storage transformations and layering.

Historical note: Traditional UNIX had "block devices" and "character devices." Modern FreeBSD unified this, all devices are character devices, and GEOM sits on top to provide block-level storage services.

GEOM conceptual model:

  • Providers: Supply storage (e.g., a disk: ada0)
  • Consumers: Use storage (e.g., a filesystem)
  • Geoms: Transformations in between (partitioning, RAID, encryption)

Example stack:

Filesystem (UFS)
     ->  consumes
GEOM LABEL (geom_label)
     ->  consumes
GEOM PART (partition table)
     ->  consumes
ada0 (disk driver via CAM)
     ->  talks to
AHCI driver (hardware)

When to use:

  • You're writing a disk controller driver (SATA, NVMe, SCSI)
  • You're implementing a storage transformation (software RAID, encryption, compression)
  • Your device presents block-oriented storage

User-space view:

% ls -l /dev/ada0
crw-r-----  1 root  operator  0xa9 Oct 14 10:23 /dev/ada0

Notice it's still a character device (c), but GEOM and the buffer cache provide block semantics.

Kernel view:

Storage drivers typically interact with CAM (Common Access Method), FreeBSD's SCSI/ATA layer. You register a SIM (SCSI Interface Module) that handles I/O requests.

Alternatively, you can create a GEOM class that processes bio (block I/O) requests.

Examples:

  • ahci - SATA controller driver
  • nvd - NVMe disk driver
  • gmirror - GEOM mirror (RAID 1)
  • geli - GEOM encryption layer

Why this is advanced

Storage drivers involve understanding:

  • DMA and scatter-gather lists
  • Block I/O scheduling
  • CAM or GEOM frameworks
  • Data integrity and error handling

We won't cover this in depth until much later. For now, just recognize that storage drivers have a different shape than character devices.

Network via ifnet

Network drivers don't appear in /dev. Instead, they register as network interfaces that appear in ifconfig and integrate with the FreeBSD network stack.

When to use:

  • Ethernet cards
  • Wireless adapters
  • Virtual network interfaces (tunnels, bridges, VPNs)
  • Any device that sends/receives network packets

User-space view:

% ifconfig em0
em0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
    ether 00:0c:29:3a:4f:1e
    inet 192.168.1.100 netmask 0xffffff00 broadcast 192.168.1.255

Programs don't open network interfaces directly. Instead, they create sockets and the kernel routes packets through the appropriate interface.

Kernel view:

Your driver allocates and registers an if_t (interface) structure:

if_t ifp;

ifp = if_alloc(IFT_ETHER);
if_setsoftc(ifp, sc);
if_initname(ifp, device_get_name(dev), device_get_unit(dev));
if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST);
if_setinitfn(ifp, mydriver_init);
if_setioctlfn(ifp, mydriver_ioctl);
if_settransmitfn(ifp, mydriver_transmit);
if_setqflushfn(ifp, mydriver_qflush);

ether_ifattach(ifp, sc->mac_addr);

Your driver must handle:

  • Transmit: Kernel gives you packets (mbufs) to send
  • Receive: You receive packets from hardware and pass them up the stack
  • Initialization: Configure hardware when interface comes up
  • ioctl: Handle configuration changes (address, MTU, etc.)

Examples:

  • em - Intel Ethernet (e1000 family)
  • igb - Intel Gigabit Ethernet
  • bge - Broadcom Gigabit Ethernet
  • if_tun - Tunnel device

Why this is different

Network drivers must:

  • Manage packet queues and mbuf chains
  • Handle link state changes
  • Support multicast filtering
  • Implement hardware offload features (checksums, TSO, etc.)

Chapter 28 covers network driver development in depth.

Pseudo and Clone Devices (Safe, Small, Instructive)

Pseudo-devices are software-only drivers with no backing hardware. They're perfect for learning because you can focus entirely on driver structure without worrying about hardware behavior.

Common pseudo-devices:

  1. null (/dev/null) - Discards writes, returns EOF on reads
  2. zero (/dev/zero) - Returns infinite zeros
  3. random (/dev/random) - Random number generator
  4. md - Memory disk (RAM disk)
  5. tun/tap - Network tunnel devices

Why they're valuable for learning:

  • No hardware complexity (no registers, no DMA, no interrupts)
  • Focus purely on driver structure and lifecycle
  • Easy to test (just read/write to /dev)
  • Small, readable source code

Special case: Clone devices

Some pseudo-devices support multiple simultaneous opens by creating new device nodes on demand. Example: /dev/bpf (Berkeley Packet Filter).

When you open /dev/bpf, the driver allocates a new instance (/dev/bpf0, /dev/bpf1, etc.) for your session.

Example: tun device (hybrid)

The tun device is interesting because it's both:

  • A character device (/dev/tun0) for control
  • A network interface (tun0 in ifconfig) for data

Programs open /dev/tun0 to configure the tunnel, but packets flow through the network interface. This "mixed model" demonstrates how drivers can present multiple surfaces.

Where to find them in the source:

% ls /usr/src/sys/dev/null/
% ls /usr/src/sys/dev/md/
% ls /usr/src/sys/net/if_tuntap.c

The "Read-Only Tour of Tiny Real Drivers" section will tour these drivers in detail. For now, just recognize that pseudo-devices are your training wheels, simple enough to understand, real enough to be useful.

Decision Checklist: Which Shape Fits?

Use this checklist to identify the right driver family for your hardware:

Choose Character Device if:

  • Hardware sends/receives arbitrary data streams (not packets, not blocks)
  • User programs need direct file-like access (open/read/write)
  • You're implementing a control interface (GPIO, LED, sensor)
  • It's a pseudo-device providing software functionality
  • It doesn't fit the network or storage models

Choose Network Interface if:

  • Hardware sends/receives network packets (Ethernet frames, etc.)
  • Should integrate with the network stack (routing, firewalls, sockets)
  • Appears in ifconfig, not /dev
  • Needs to support protocols (TCP/IP, etc.)

Choose Storage/GEOM if:

  • Hardware provides block-oriented storage
  • Should appear as a disk in the system
  • Needs to support filesystems
  • Requires partitioning, or sits in a storage transformation stack

Mixed Models:

  • Some devices (like tun) present both a control plane (character device) and a data plane (network interface or storage)
  • This is less common but powerful when needed

Still unsure?

  • Look at similar existing drivers
  • Check what user programs expect (do they open files, or use sockets?)
  • Ask: "What subsystem does my hardware naturally integrate with?"

Mini Exercise: Classify Live Drivers

Let's practice pattern recognition on your running FreeBSD system.

Instructions:

  1. Identify one character device:

    % ls -l /dev/null /dev/random /dev/cuau*
    

    Pick one. What makes it a character device?

  2. Identify one network interface:

    % ifconfig -l
    

    Pick one (e.g., em0, lo0). Look it up:

    % man 4 em
    

    What hardware does it drive?

  3. Identify one storage participant:

    % geom disk list
    

    Pick a disk (e.g., ada0 or nvd0). What driver manages it?

  4. Find the driver source:

    For each, try to locate its source:

    % find /usr/src/sys -name "null.c"
    % find /usr/src/sys -name "if_em.c"
    % find /usr/src/sys -name "ahci.c"
    
  5. Record in your lab logbook:

    Character: /dev/random -> sys/dev/random/randomdev.c
    Network:   em0 -> sys/dev/e1000/if_em.c
    Storage:   ada0 (via CAM) -> sys/dev/ahci/ahci.c
    

What you're learning: Recognition. By the time you've done this, you've connected abstract concepts (character, network, storage) to real, concrete examples on your system.

Summary

Drivers come in families with different shapes:

  • Character devices: Stream I/O via /dev, simplest to learn
  • Storage devices: Block I/O via GEOM/CAM, advanced
  • Network interfaces: Packet I/O via ifnet, no /dev presence
  • Pseudo-devices: Software-only, perfect for learning structure

Choosing the right shape: Match your hardware's purpose to the kernel subsystem it naturally integrates with.

In the next section, we'll examine the minimal driver skeleton, the universal scaffolding that all drivers share, regardless of family.

The Minimal Driver Skeleton

Every FreeBSD driver, from the simplest pseudo-device to the most complex PCI controller, shares a common skeleton, a scaffolding of required components that the kernel expects. Think of this skeleton as the chassis of a car: before you can add an engine, seats, or a stereo, you need the basic frame that everything else bolts onto.

This section introduces the universal pattern you'll see in every driver. We'll keep this minimal, just enough to load, attach, and unload cleanly. Later sections and chapters will add the muscles, organs, and features.

Core Types: device_t and the softc

Two fundamental types appear in every driver: device_t and your driver's softc (software context) structure.

device_t - the kernel's handle to this device

device_t is an opaque handle managed by the kernel. You never poke inside it; you ask the kernel for what you need via accessors.

#include <sys/bus.h>

const char *name   = device_get_name(dev);   // e.g., "mydriver"
int         unit   = device_get_unit(dev);   // 0, 1, 2, ...
device_t    parent = device_get_parent(dev); // the parent bus (PCI, USB, etc.)
void       *cookie = device_get_softc(dev);  // pointer to your softc (explained below)

Why opaque?

So the kernel can evolve its internal representation without breaking your code. You interact through a stable API instead of struct fields.

Where you see it

Every lifecycle callback (probe, attach, detach, ...) receives a device_t dev. That parameter is your "session" with the kernel for this particular device instance.

The softc - your driver's private state

Each device instance needs a place to keep state: resources, locks, stats, and any hardware-specific bits. That's the softc you define.

You define it

struct mydriver_softc {
    device_t         dev;        // back-pointer to device_t (handy for prints, etc.)
    struct resource *mem_res;    // MMIO resource
    int              mem_rid;    // resource ID (e.g., PCIR_BAR(0))
    struct mtx       mtx;        // driver lock
    uint64_t         bytes_rx;   // example statistic
    /* ... your driver-specific state ... */
};

The kernel allocates it for you

When you register the driver, you tell Newbus the size of your softc:

static driver_t mydriver_driver = {
    "mydriver",
    mydriver_methods,
    sizeof(struct mydriver_softc) // Newbus allocates and zeroes this per instance
};

Newbus creates (and zeroes) one softc per device instance during device creation. You don't malloc() it, and you don't free() it.

You retrieve it where you work

static int
mydriver_attach(device_t dev)
{
    struct mydriver_softc *sc;

    sc = device_get_softc(dev);  // get your per-instance state
    sc->dev = dev;               // stash the handle for convenience

    /* initialize locks/resources, map registers, set up interrupts, etc. */
    return (0);
}

That one-liner

struct mydriver_softc *sc = device_get_softc(dev);

appears at the top of almost every driver method that needs state. It's the idiomatic way to enter your driver's world.

Mental model

  • device_t: "ticket" the kernel hands you for this device.
  • softc: your "backpack" of state tied to that ticket.
  • Access pattern: kernel calls your method with dev -> you call device_get_softc(dev) -> operate via sc->....

Before we move on

  • Lifetime: the softc exists once Newbus creates the device object and lasts until the device is deleted. You still must destroy locks and release resources in detach; Newbus only frees the softc memory.
  • Probe vs attach: identify in probe; don't allocate resources there. Initialize hardware in attach.
  • Types: device_get_softc() returns void *; assigning to struct mydriver_softc * is fine in C (no cast needed).

That's all you need for the skeleton. We'll layer in resources, interrupts, and power management in their dedicated sections, keeping this mental model as home base.

Method Tables and kobj - Why Callbacks Look "Magical"

FreeBSD drivers use method tables to connect your functions to Newbus. This might look a bit magical at first, but it's actually simple and elegant.

The method table:

static device_method_t mydriver_methods[] = {
    /* Device interface (device_if.m) */
    DEVMETHOD(device_probe,     mydriver_probe),
    DEVMETHOD(device_attach,    mydriver_attach),
    DEVMETHOD(device_detach,    mydriver_detach),

    DEVMETHOD_END
};

What this table means (practical view)

It's a routing table from Newbus "method names" to your functions:

  • device_probe -> mydriver_probe Runs when the kernel is asking "does this driver match this device?" Do: check IDs/compat strings, set a description if you like, return a probe result. Don't: allocate resources or touch hardware yet.
  • device_attach -> mydriver_attach Runs after your probe wins. Do: allocate resources (MMIO/IRQs), initialize hardware, set up interrupts, create your /dev node if applicable. Handle failures cleanly. Don't: leave partial state behind, either unwind or fail gracefully.
  • device_detach -> mydriver_detach Runs when the device is being removed/unloaded. Do: stop hardware, tear down interrupts, destroy device nodes, free resources, destroy locks. Don't: return success if the device is still in use; return EBUSY when appropriate.

Why keep it this small?

This chapter focuses on the driver skeleton. We add power management and other hooks later, so you master the core lifecycle first.

The magic behind it: kobj

Under the hood, FreeBSD uses kobj (kernel objects) to implement method dispatch:

  1. Interfaces (collections of methods) are defined in .m files (e.g., device_if.m, bus_if.m).
  2. Build tools generate C glue from those .m files.
  3. At runtime, kobj uses your method table to look up the right function to call.

Example

When the kernel wants to probe a device, it effectively does:

DEVICE_PROBE(dev);  // The macro expands to a kobj lookup; kobj finds mydriver_probe here

Why this matters

  • The kernel can call methods polymorphically (same call, different driver implementations).
  • You override only what you need; unimplemented methods fall back to defaults where appropriate.
  • Interfaces are composable: you'll add more (e.g., bus or power-management methods) as your driver grows.

What you'll add later (when ready)

  • device_shutdown -> mydriver_shutdown Called during reboot/power-off to put hardware in a safe state. (Add once your basic attach/detach path is solid.)
  • device_suspend / device_resume For sleep/hibernate support: quiesce and restore hardware. (Covered when we tackle power management in Chapter 22.)

Mental model

Think of the table as a dictionary: keys are method names like device_attach; values are your functions. The DEVICE_* macros ask kobj to "find the function for this method on this object," and kobj consults your table to call it. No magic, just generated dispatch code.

Registration Macros You'll Always Meet

These macros are the driver's "business card." They tell the kernel what you are, where you attach, and what you depend on.

1) DRIVER_MODULE - register your driver

/* Minimal pattern: pick the correct parent bus for your hardware */
DRIVER_MODULE(mydriver, pci, mydriver_driver, 0, 0);

/*
 * Use the parent bus your device lives on: 'pci', 'usb', 'acpi', 'simplebus', etc.
 * 'nexus' is the machine-specific root bus and is rarely what you want for ordinary drivers.
 */

Parameters (order matters):

  • mydriver - the driver name (shows up in logs and as the base of the unit name, like mydriver0).
  • pci - the parent bus where you attach (choose what matches your hardware: pci, usb, acpi, simplebus, ...).
  • mydriver_driver - your driver_t (declares method table and softc size).
  • 0 - devclass (pass 0 to let the kernel manage it automatically).
  • 0 - optional event handler for module load and unload notifications (usually 0 or NULL at first).

When to keep it minimal

Early in this chapter we are focused on the skeleton. Passing 0 for devclass and the event handler keeps things simple. Note: pick the real parent bus for your device; nexus is the root bus and almost never the right choice for ordinary drivers.

Explicit devclass pattern (you will see this a lot too):

static devclass_t mydriver_devclass;

DRIVER_MODULE(mydriver, pci, mydriver_driver, mydriver_devclass, 0, 0);

Use an explicit devclass_t when you need tighter control over unit numbering and aliases, or when other code needs to look up your devclass by name.

What DRIVER_MODULE actually accomplishes

  • Registers your driver with Newbus under a parent bus.
  • Exposes your method table and softc size via driver_t.
  • Ensures the loader knows how to match devices discovered on that bus with your driver.

2) MODULE_VERSION - tag your module with a version

MODULE_VERSION(mydriver, 1);

This stamps the module with a simple integer version.

Why it matters

  • The kernel and other modules can check your version to avoid mismatches.
  • If you make a breaking change to the module ABI or exported symbols, bump this number.

Convention: start at 1 and bump only when something external would break if an older version was loaded.

3) MODULE_DEPEND - declare dependencies (when you have them)

/* mydriver requires the USB stack to be present */
MODULE_DEPEND(mydriver, usb, 1, 1, 1);

Parameters:

  • mydriver - your module.
  • usb - the module you depend on.
  • 1, 1, 1 - min, preferred, max versions of the dependency (all 1 is common when there's no nuanced versioning to enforce).

When to use

  • Your driver needs another module to be loaded first (e.g., usb, pci, or a helper library module).
  • You export or consume symbols that require consistent versions across modules.

Mental model

  • DRIVER_MODULE tells Newbus who you are and where you plug in.
  • MODULE_VERSION helps the loader keep compatible pieces together.
  • MODULE_DEPEND ensures modules load in the right order so your symbols and subsystems are ready when your driver starts.

What you'll write now vs later

For the minimal driver skeleton in this chapter, you'll almost always include DRIVER_MODULE and MODULE_VERSION.

Add MODULE_DEPEND when you actually rely on another module; we'll introduce common dependencies (and when they're required) in later chapters for PCI/USB/ACPI/SoC buses.

Finding Your State and Speaking Clearly

Two patterns appear in nearly every driver function: retrieving your softc and logging messages.

Retrieving state: device_get_softc()

static int
mydriver_attach(device_t dev)
{
    struct mydriver_softc *sc;
    
    sc = device_get_softc(dev);  // Get our private data
    
    // Now use sc-> for everything
    sc->dev = dev;
    sc->some_flag = 1;
}

This is your first line in almost every driver function. It connects the device_t the kernel gave you to your private state.

Logging: device_printf()

When your driver needs to log information, use device_printf():

device_printf(dev, "Driver attached successfully\n");
device_printf(dev, "Hardware version: %d.%d\n", major, minor);

Why device_printf instead of regular printf?

  • It prefixes output with your device name: mydriver0: Driver attached successfully
  • Users immediately know which device is talking
  • Essential when multiple instances exist (mydriver0, mydriver1, ...)

Example output:

em0: Intel PRO/1000 Network Connection 7.6.1-k
em0: Link is Up 1000 Mbps Full Duplex

Logging etiquette (we'll expand on this in the "Logging, Errors, and User-Facing Behaviour" section):

  • Attach: Log one line on successful attach
  • Errors: Always log why something failed
  • Verbose info: Only during boot or when debugging
  • Avoid spam: Don't log on every packet/interrupt (use counters instead)

Good example:

if (error != 0) {
    device_printf(dev, "Could not allocate memory resource\n");
    return (error);
}
device_printf(dev, "Attached successfully\n");

Bad example:

printf("Attaching...\n");  // No device name!
printf("Step 1\n");         // Too verbose
printf("Step 2\n");         // User doesn't care

Build & Load a Stub Safely (Preview Only)

We won't build a complete driver yet (that's Chapter 7 and Lab 2), but let's preview the build and load cycle so you know what's coming.

The minimal Makefile:

# Makefile
KMOD=    mydriver
SRCS=    mydriver.c

.include <bsd.kmod.mk>

That's it. FreeBSD's kernel module build system (bsd.kmod.mk) handles all the complexity.

Build:

% make clean
% make

This produces mydriver.ko (kernel object file).

Load:

% sudo kldload ./mydriver.ko

Verify:

% kldstat | grep mydriver
% dmesg | tail

Unload:

% sudo kldunload mydriver

What happens behind the scenes:

  1. kldload reads your .ko file
  2. Kernel resolves symbols and links it into the kernel
  3. Kernel calls your module event handler with MOD_LOAD
  4. If you registered devices/drivers, they're now available
  5. Newbus may immediately probe/attach if devices are present

On unload:

  1. Kernel checks if it's safe to unload (no devices attached, no active users)
  2. Calls your module event handler with MOD_UNLOAD
  3. Unlinks the code from the kernel
  4. Frees the module

Safety note: In your lab VM, loading/unloading is safe. If your code crashes the kernel, the VM reboots, no harm done. Never test new drivers on production systems.

Lab preview: In the "Hands-On Labs" section, Lab 2 will walk you through building and loading a minimal module that just logs messages. For now, just know this is the cycle you'll follow.

Summary

The minimal driver skeleton includes:

  1. device_t - Opaque handle to your device
  2. softc structure - Your per-device private data
  3. Method table - Maps kernel method calls to your functions
  4. DRIVER_MODULE - Registers your driver with the kernel
  5. MODULE_VERSION - Declares your version
  6. device_get_softc() - Retrieves your state in every function
  7. device_printf() - Logs messages with device name prefix

This pattern appears in every FreeBSD driver. Master it, and you can read any driver code with confidence.

Next, we'll explore the Newbus lifecycle, when and why each of these methods is called.

The Newbus Lifecycle: From Discovery to Goodbye

You've seen the skeleton (probe, attach, detach functions). Now let's understand when and why the kernel calls them. The Newbus device lifecycle is a precisely orchestrated sequence, and knowing this flow is essential for writing correct initialization and cleanup code.

Think of it like the lifecycle of a restaurant: there's a specific order to opening (inspect location, attach utilities, set up kitchen), operating (serve customers), and closing (clean up, turn off equipment, detach utilities). Drivers follow a similar lifecycle, and understanding the sequence helps you write robust code.

Where Enumeration Comes From

Before your driver ever runs, hardware must be discovered. This is called enumeration, and it's the job of bus drivers.

How buses discover devices

PCI bus: Reads configuration space at every bus/device/function address. When it finds a responding device, it reads vendor ID, device ID, class code, and resource requirements (memory BARs, IRQ lines).

USB bus: When you plug in a device, the hub detects electrical changes, issues a USB reset, and queries the device descriptor to learn what it is.

ACPI bus: Parses tables provided by BIOS/UEFI that describe platform devices (UARTs, timers, embedded controllers, etc.).

Device tree (ARM/embedded): Reads a devicetree blob (DTB) that statically describes hardware layout.

Key insight: Your driver doesn't search for devices. Devices are brought to you by bus drivers. You react to what the kernel presents.

The enumeration result

For each discovered device, the bus creates a device_t structure containing:

  • Device name (e.g., pci0:0:2:0)
  • Parent bus
  • Vendor/device IDs or compatible strings
  • Resource requirements

See it yourself:

% devinfo -v        # View device tree
% pciconf -lv       # PCI devices with vendor/device IDs
% sudo usbconfig dump_device_desc    # USB device descriptors

Timing: Enumeration happens during boot for built-in devices, or dynamically when you plug in hot-pluggable hardware (USB, Thunderbolt, PCIe hot-plug, etc).

probe: "Am I Your Driver?"

Once a device exists, the kernel needs to find the right driver for it. It does this by calling every compatible driver's probe function.

The probe signature:

static int
mydriver_probe(device_t dev)
{
    /* Examine the device and decide if we can handle it */
    
    /* If yes: */
    device_set_desc(dev, "My Awesome Hardware");
    return (BUS_PROBE_DEFAULT);
    
    /* If no: */
    return (ENXIO);
}

Your job in probe:

  1. Examine device properties (vendor/device ID, compatible string, etc.)
  2. Decide if you can handle it
  3. Return a priority value or error

Example: PCI driver probe

static int
mydriver_probe(device_t dev)
{
    uint16_t vendor = pci_get_vendor(dev);
    uint16_t device = pci_get_device(dev);
    
    if (vendor == MY_VENDOR_ID && device == MY_DEVICE_ID) {
        device_set_desc(dev, "My PCI Device");
        return (BUS_PROBE_DEFAULT);
    }
    
    return (ENXIO);  /* Not our device */
}

Probe return values and priority:

Return ValueNumeric ValueMeaning
BUS_PROBE_SPECIFIC-20000Exactly matches this device variant
BUS_PROBE_VENDOR-10000Vendor-specific driver
BUS_PROBE_DEFAULT0Standard driver for this device class
BUS_PROBE_GENERIC100Generic fallback (e.g., ugen for USB)
BUS_PROBE_NOWILDCARD50Don't match wildcards
ENXIO6Not our device

Lower numbers win. If multiple drivers return success, the kernel chooses the one with the lowest (most specific) value.

Why this matters: Allows specific drivers to override generic ones. Example: A vendor-optimized driver can beat a generic class driver by returning BUS_PROBE_VENDOR.

Rules for probe:

  • Do: Examine device properties
  • Do: Set a descriptive device description with device_set_desc()
  • Do: Return quickly (no long initialization)
  • Don't: Modify hardware state
  • Don't: Allocate resources (wait for attach)
  • Don't: Assume you'll win (another driver might beat you)

Real example from /usr/src/sys/dev/uart/uart_bus_pci.c:

static int
uart_pci_probe(device_t dev)
{
        struct uart_softc *sc;
        const struct pci_id *id;
        struct pci_id cid = {
                .regshft = 0,
                .rclk = 0,
                .rid = 0x10 | PCI_NO_MSI,
                .desc = "Generic SimpleComm PCI device",
        };
        int result;

        sc = device_get_softc(dev);

        id = uart_pci_match(dev, pci_ns8250_ids);
        if (id != NULL) {
                sc->sc_class = &uart_ns8250_class;
                goto match;
        }
        if (pci_get_class(dev) == PCIC_SIMPLECOMM &&
            pci_get_subclass(dev) == PCIS_SIMPLECOMM_UART &&
            pci_get_progif(dev) < PCIP_SIMPLECOMM_UART_16550A) {
                /* XXX rclk what to do */
                id = &cid;
                sc->sc_class = &uart_ns8250_class;
                goto match;
        }
        /* Add checks for non-ns8250 IDs here. */
        return (ENXIO);

 match:
        result = uart_bus_probe(dev, id->regshft, 0, id->rclk,
            id->rid & PCI_RID_MASK, 0, 0);
        /* Bail out on error. */
        if (result > 0)
                return (result);
        /*
         * If we haven't already matched this to a console, check if it's a
         * PCI device which is known to only exist once in any given system
         * and we can match it that way.
         */
        if (sc->sc_sysdev == NULL)
                uart_pci_unique_console_match(dev);
        /* Set/override the device description. */
        if (id->desc)
                device_set_desc(dev, id->desc);
        return (result);
}

What happens after probe: The kernel collects all successful probe results, sorts by priority, and selects the winner. That driver's attach function will be called next.

attach: "Prepare to Operate"

If your probe function won, the kernel calls your attach function. This is where real initialization happens.

The attach signature:

static int
mydriver_attach(device_t dev)
{
    struct mydriver_softc *sc;
    int error;
    
    sc = device_get_softc(dev);
    sc->dev = dev;
    
    /* Initialization steps go here */
    
    device_printf(dev, "Attached successfully\n");
    return (0);  /* Success */
}

Typical attach flow:

Step 1: Get your softc

struct mydriver_softc *sc = device_get_softc(dev);
sc->dev = dev;  /* Store back-pointer */

Step 2: Allocate hardware resources

sc->mem_rid = PCIR_BAR(0);
sc->mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
    &sc->mem_rid, RF_ACTIVE);
if (sc->mem_res == NULL) {
    device_printf(dev, "Could not allocate memory\n");
    return (ENXIO);
}

Step 3: Initialize hardware

/* Reset hardware */
/* Configure registers */
/* Detect hardware capabilities */

Step 4: Set up interrupts (if needed)

sc->irq_rid = 0;
sc->irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ,
    &sc->irq_rid, RF_ACTIVE | RF_SHAREABLE);
    
// Placeholder - interrupt handler implementation covered in Chapter 19
error = bus_setup_intr(dev, sc->irq_res, INTR_TYPE_NET | INTR_MPSAFE,
    NULL, mydriver_intr, sc, &sc->irq_hand);

Step 5: Create device nodes or register with subsystems

/* Character device: */
sc->cdev = make_dev(&mydriver_cdevsw, unit,
    UID_ROOT, GID_WHEEL, 0600, "mydriver%d", unit);
    
/* Network interface: */
ether_ifattach(ifp, sc->mac_addr);

/* Storage: */
/* Register with CAM or GEOM */

Step 6: Mark device ready

device_printf(dev, "Successfully attached\n");
return (0);

Error handling is critical: If any step fails, you must clean up everything you've already done:

static int
mydriver_attach(device_t dev)
{
    struct mydriver_softc *sc;
    int error;
    
    sc = device_get_softc(dev);
    
    /* Step 1 */
    sc->mem_res = bus_alloc_resource_any(...);
    if (sc->mem_res == NULL) {
        error = ENXIO;
        goto fail;
    }
    
    /* Step 2 */
    error = mydriver_hw_init(sc);
    if (error != 0)
        goto fail;
    
    /* Step 3 */
    sc->irq_res = bus_alloc_resource_any(...);
    if (sc->irq_res == NULL) {
        error = ENXIO;
        goto fail;
    }
    
    /* Success! */
    return (0);

fail:
    mydriver_detach(dev);  /* Clean up partial state */
    return (error);
}

Why jump to fail and call detach? Because detach is designed to clean up resources. By calling it on failure, you reuse cleanup logic instead of duplicating it.

detach and shutdown: "Leave No Footprints"

When your driver is unloaded or the device is removed, the kernel calls your detach function to cleanly shut down.

The detach signature:

static int
mydriver_detach(device_t dev)
{
    struct mydriver_softc *sc;
    
    sc = device_get_softc(dev);
    
    /* Cleanup steps in reverse order of attach */
    
    device_printf(dev, "Detached\n");
    return (0);
}

Typical detach flow (reverse of attach):

Step 1: Check if safe to detach

if (sc->open_count > 0) {
    return (EBUSY);  /* Device is in use, can't detach now */
}

Step 2: Stop hardware

mydriver_hw_stop(sc);  /* Disable interrupts, stop DMA, reset chip */

Step 3: Teardown interrupts

if (sc->irq_hand != NULL) {
    bus_teardown_intr(dev, sc->irq_res, sc->irq_hand);
    sc->irq_hand = NULL;
}
if (sc->irq_res != NULL) {
    bus_release_resource(dev, SYS_RES_IRQ, sc->irq_rid, sc->irq_res);
    sc->irq_res = NULL;
}

Step 4: Destroy device nodes or unregister

if (sc->cdev != NULL) {
    destroy_dev(sc->cdev);
    sc->cdev = NULL;
}
/* or */
ether_ifdetach(ifp);

Step 5: Release hardware resources

if (sc->mem_res != NULL) {
    bus_release_resource(dev, SYS_RES_MEMORY, sc->mem_rid, sc->mem_res);
    sc->mem_res = NULL;
}

Step 6: Free other allocations

if (sc->buffer != NULL) {
    free(sc->buffer, M_DEVBUF);
    sc->buffer = NULL;
}
mtx_destroy(&sc->mtx);

Critical rules:

  • Do: Release resources in reverse order of allocation
  • Do: Always check pointers before freeing (detach might be called on partial attach)
  • Do: Set pointers to NULL after freeing
  • Don't: Access hardware after stopping it
  • Don't: Free resources still in use

The shutdown method:

Some drivers also implement a shutdown method for graceful system shutdown:

static int
mydriver_shutdown(device_t dev)
{
    struct mydriver_softc *sc = device_get_softc(dev);
    
    /* Put hardware in a safe state for reboot */
    mydriver_hw_shutdown(sc);
    
    return (0);
}

Add to method table:

DEVMETHOD(device_shutdown,  mydriver_shutdown),

This is called when the system reboots or powers down, allowing your driver to gracefully stop hardware.

The Failure-Unwinding Pattern

We've seen hints of this, but let's make it explicit. Failure unwinding is a reusable pattern for handling partial attach failures.

The pattern:

static int
mydriver_attach(device_t dev)
{
    struct mydriver_softc *sc;
    int error = 0;
    
    sc = device_get_softc(dev);
    sc->dev = dev;
    
    /* Initialize mutex */
    mtx_init(&sc->mtx, "mydriver", NULL, MTX_DEF);
    
    /* Allocate resource 1 */
    sc->mem_res = bus_alloc_resource_any(...);
    if (sc->mem_res == NULL) {
        error = ENXIO;
        goto fail_mtx;
    }
    
    /* Allocate resource 2 */
    sc->irq_res = bus_alloc_resource_any(...);
    if (sc->irq_res == NULL) {
        error = ENXIO;
        goto fail_mem;
    }
    
    /* Initialize hardware */
    error = mydriver_hw_init(sc);
    if (error != 0)
        goto fail_irq;
    
    /* Success! */
    device_printf(dev, "Attached\n");
    return (0);

/* Cleanup labels in reverse order */
fail_irq:
    bus_release_resource(dev, SYS_RES_IRQ, sc->irq_rid, sc->irq_res);
fail_mem:
    bus_release_resource(dev, SYS_RES_MEMORY, sc->mem_rid, sc->mem_res);
fail_mtx:
    mtx_destroy(&sc->mtx);
    return (error);
}

Why this works:

  • Each goto jumps to the right cleanup level
  • Resources are released in reverse order
  • No resource is left dangling
  • Code is readable and maintainable

Alternative pattern

Call detach on failure:

fail:
    mydriver_detach(dev);
    return (error);
}

This works if your detach function checks pointers before freeing (it should!).

Watching the Lifecycle in Logs

The best way to understand the lifecycle is to see it happen. FreeBSD's logging makes this easy.

Watch in real-time:

Terminal 1:

% tail -f /var/log/messages

Terminal 2:

% sudo kldload if_em
% sudo kldunload if_em

What you'll see:

Oct 14 12:34:56 freebsd kernel: em0: <Intel(R) PRO/1000 Network Connection> port 0xc000-0xc01f mem 0xf0000000-0xf001ffff at device 2.0 on pci0
Oct 14 12:34:56 freebsd kernel: em0: Ethernet address: 00:0c:29:3a:4f:1e
Oct 14 12:34:56 freebsd kernel: em0: netmap queues/slots: TX 1/1024, RX 1/1024

The first line comes from the driver's attach function. You can see it detected the device, allocated resources, and initialized.

On unload:

Oct 14 12:35:10 freebsd kernel: em0: detached

Using dmesg:

% dmesg | grep em0

This shows all kernel messages related to em0 since boot.

Using devmatch:

FreeBSD's devmatch utility shows unattached devices and suggests drivers:

% devmatch

Example output:

pci0:0:2:0 needs if_em

Exercise: Load and unload a simple driver while watching logs. Try:

% sudo kldload null
% dmesg | tail
% kldstat | grep null
% sudo kldunload null

You won't see much from null (it's quiet), but the kernel confirms load/unload.

Summary

The Newbus lifecycle follows a strict sequence:

  1. Enumeration: Bus drivers discover hardware and create device_t structures
  2. Probe: Kernel asks drivers "Can you handle this?" via probe functions
  3. Driver selection: Best match wins based on priority return values
  4. Attach: Winner's attach function initializes hardware and resources
  5. Operation: Device is ready for use (read/write, transmit/receive, etc.)
  6. Detach: Driver cleanly shuts down and releases all resources
  7. Destruction: Kernel frees device_t after successful detach

Key patterns:

  • Probe: Examine only, don't modify
  • Attach: Initialize everything, handle failures with cleanup jumps
  • Detach: Reverse order of attach, check all pointers, set to NULL

Next, we'll explore character device entry points, including how your driver handles open, read, write, and ioctl operations.

Character Device Entry Points: Your I/O Surface

Now that you understand how drivers attach and detach, let's explore how they actually do work. For character devices, this means implementing the cdevsw (character device switch), a structure that routes user-space system calls to your driver functions.

Think of cdevsw as a menu of services your driver offers. When a program opens /dev/yourdevice and calls read(), the kernel looks up your driver's d_read function and calls it. This section shows you how that routing works.

cdev and cdevsw: The Routing Table

Two related structures power character device operations:

  • struct cdev - Represents a character device instance
  • struct cdevsw - Defines the operations your driver supports

The cdevsw structure (from /usr/src/sys/sys/conf.h):

struct cdevsw {
    int                 d_version;   /* Always D_VERSION */
    u_int               d_flags;     /* Device flags */
    const char         *d_name;      /* Base device name */
    
    d_open_t           *d_open;      /* Open handler */
    d_close_t          *d_close;     /* Close handler */
    d_read_t           *d_read;      /* Read handler */
    d_write_t          *d_write;     /* Write handler */
    d_ioctl_t          *d_ioctl;     /* Ioctl handler */
    d_poll_t           *d_poll;      /* Poll/select handler */
    d_mmap_t           *d_mmap;      /* Mmap handler */
    d_strategy_t       *d_strategy;  /* (Deprecated) */
    dumper_t           *d_dump;      /* Crash dump handler */
    d_kqfilter_t       *d_kqfilter;  /* Kqueue filter */
    d_purge_t          *d_purge;     /* Purge handler */
    /* ... additional fields for advanced features ... */
};

Minimal example from /usr/src/sys/dev/null/null.c:

static struct cdevsw null_cdevsw = {
        .d_version =    D_VERSION,
        .d_read =       (d_read_t *)nullop,
        .d_write =      null_write,
        .d_ioctl =      null_ioctl,
        .d_kqfilter =   kqfilter,
        .d_name =       "null",
};

Notice what's missing: no d_open or d_close. If you don't implement a method, the kernel provides sensible defaults:

  • Missing d_open -> Always succeeds
  • Missing d_close -> Always succeeds
  • Missing d_read -> Returns EOF (0 bytes)
  • Missing d_write -> Returns ENODEV error

Why this works: Most simple devices don't need complex open/close logic. Implement only what you need.

open/close: Sessions and Per-open State

When a user program opens your device, the kernel calls your d_open function. This is your opportunity to initialize per-open state, check permissions, or reject the open if conditions aren't right.

The d_open signature:

typedef int d_open_t(struct cdev *dev, int oflags, int devtype, struct thread *td);

Parameters:

  • dev - Your cdev structure
  • oflags - Open flags (O_RDONLY, O_RDWR, O_NONBLOCK, etc.)
  • devtype - Device type (usually ignored)
  • td - Thread performing the open

Typical open function:

static int
mydriver_open(struct cdev *dev, int oflags, int devtype, struct thread *td)
{
    struct mydriver_softc *sc;
    
    sc = dev->si_drv1;  /* Get softc from cdev back-pointer */
    
    /* Check if already open (if exclusive access needed) */
    if (sc->flags & MYDRV_OPEN) {
        return (EBUSY);
    }
    
    /* Mark as open */
    sc->flags |= MYDRV_OPEN;
    sc->open_count++;
    
    device_printf(sc->dev, "Device opened\n");
    return (0);
}

The d_close signature:

typedef int d_close_t(struct cdev *dev, int fflag, int devtype, struct thread *td);

Typical close function:

static int
mydriver_close(struct cdev *dev, int fflag, int devtype, struct thread *td)
{
    struct mydriver_softc *sc;
    
    sc = dev->si_drv1;
    
    /* Clean up per-open state */
    sc->flags &= ~MYDRV_OPEN;
    sc->open_count--;
    
    device_printf(sc->dev, "Device closed\n");
    return (0);
}

When to use open/close:

  • Initialize per-session state (buffers, cursors)
  • Enforce exclusive access (only one opener at a time)
  • Reset hardware state on open/close
  • Track usage for debugging

When you can skip them:

  • Device doesn't need setup on open
  • Hardware is always ready (like /dev/null)

read/write: Moving Bytes Safely

Read and write are the heart of data transfer for character devices. The kernel provides a uio (user I/O) structure to abstract the buffer and handle copying safely between kernel and user space.

The d_read signature:

typedef int d_read_t(struct cdev *dev, struct uio *uio, int ioflag);

The d_write signature:

typedef int d_write_t(struct cdev *dev, struct uio *uio, int ioflag);

Parameters:

  • dev - Your cdev
  • uio - User I/O structure (describes buffer, offset, remaining bytes)
  • ioflag - I/O flags (IO_NDELAY for non-blocking, etc.)

Simple read example:

static int
mydriver_read(struct cdev *dev, struct uio *uio, int ioflag)
{
    struct mydriver_softc *sc;
    char data[128];
    int error, len;
    
    sc = dev->si_drv1;
    
    /* How much does user want? */
    len = MIN(uio->uio_resid, sizeof(data));
    if (len == 0)
        return (0);  /* EOF */
    
    /* Fill buffer with your data */
    snprintf(data, sizeof(data), "Hello from mydriver\\n");
    len = MIN(len, strlen(data));
    
    /* Copy to user space */
    error = uiomove(data, len, uio);
    
    return (error);
}

Simple write example:

static int
mydriver_write(struct cdev *dev, struct uio *uio, int ioflag)
{
    struct mydriver_softc *sc;
    char buffer[128];
    int error, len;
    
    sc = dev->si_drv1;
    
    /* Get write size (bounded by our buffer) */
    len = MIN(uio->uio_resid, sizeof(buffer) - 1);
    if (len == 0)
        return (0);
    
    /* Copy from user space */
    error = uiomove(buffer, len, uio);
    if (error != 0)
        return (error);
    
    buffer[len] = '\\0';  /* Null terminate if treating as string */
    
    /* Do something with the data */
    device_printf(sc->dev, "User wrote: %s\\n", buffer);
    
    return (0);
}

Key functions for I/O:

uiomove() - Copy between kernel buffer and user space

int uiomove(void *cp, int n, struct uio *uio);

uio_resid - Remaining bytes to transfer

if (uio->uio_resid == 0)
    return (0);  /* Nothing to do */

Why uio exists

It handles:

  • Multi-segment buffers (scatter-gather)
  • Partial transfers
  • Offset tracking
  • Safe copying between kernel and user space

ioctl: Control Paths

Ioctl (I/O control) is the Swiss Army knife of device operations. It handles anything that doesn't fit read/write: configuration, querying status, triggering actions, etc.

The d_ioctl signature:

typedef int d_ioctl_t(struct cdev *dev, u_long cmd, caddr_t data, 
                       int fflag, struct thread *td);

Parameters:

  • dev - Your cdev
  • cmd - Command code (user-defined constant)
  • data - Pointer to data structure (already copied from user space by kernel)
  • fflag - File flags
  • td - Thread

Defining ioctl commands

Use the _IO, _IOR, _IOW, _IOWR macros:

#include <sys/ioccom.h>

/* Command with no data */
#define MYDRV_RESET         _IO('M', 0)

/* Command that reads data (kernel -> user) */
#define MYDRV_GETSTATUS     _IOR('M', 1, struct mydrv_status)

/* Command that writes data (user -> kernel) */
#define MYDRV_SETCONFIG     _IOW('M', 2, struct mydrv_config)

/* Command that does both */
#define MYDRV_EXCHANGE      _IOWR('M', 3, struct mydrv_data)

The 'M' is your "magic number" (unique letter identifying your driver). Pick one not used by system ioctls.

Implementing ioctl:

static int
mydriver_ioctl(struct cdev *dev, u_long cmd, caddr_t data,
               int fflag, struct thread *td)
{
    struct mydriver_softc *sc;
    struct mydrv_status *status;
    struct mydrv_config *config;
    
    sc = dev->si_drv1;
    
    switch (cmd) {
    case MYDRV_RESET:
        /* Reset hardware */
        mydriver_hw_reset(sc);
        return (0);
        
    case MYDRV_GETSTATUS:
        /* Return status to user */
        status = (struct mydrv_status *)data;
        status->flags = sc->flags;
        status->count = sc->packet_count;
        return (0);
        
    case MYDRV_SETCONFIG:
        /* Apply configuration */
        config = (struct mydrv_config *)data;
        if (config->speed > MAX_SPEED)
            return (EINVAL);
        sc->speed = config->speed;
        return (0);
        
    default:
        return (ENOTTY);  /* Invalid ioctl */
    }
}

Best practices:

  • Always return ENOTTY for unknown commands
  • Validate all input (ranges, pointers, etc.)
  • Use meaningful names for commands
  • Document your ioctl interface (man page or header comments)
  • Don't assume data pointers are valid (kernel already validated them)

Real example from /usr/src/sys/dev/usb/misc/uled.c:

static int
uled_ioctl(struct usb_fifo *fifo, u_long cmd, void *addr, int fflags)
{
        struct uled_softc *sc;
        struct uled_color color;
        int error;

        sc = usb_fifo_softc(fifo);
        error = 0;

        mtx_lock(&sc->sc_mtx);

        switch(cmd) {
        case ULED_GET_COLOR:
                *(struct uled_color *)addr = sc->sc_color;
                break;
        case ULED_SET_COLOR:
                color = *(struct uled_color *)addr;
                uint8_t buf[8];

                sc->sc_color.red = color.red;
                sc->sc_color.green = color.green;
                sc->sc_color.blue = color.blue;

                if (sc->sc_flags & ULED_FLAG_BLINK1) {
                        buf[0] = 0x1;
                        buf[1] = 'n';
                        buf[2] = color.red;
                        buf[3] = color.green;
                        buf[4] = color.blue;
                        buf[5] = buf[6] = buf[7] = 0;
                } else {
                        buf[0] = color.red;
                        buf[1] = color.green;
                        buf[2] = color.blue;
                        buf[3] = buf[4] = buf[5] = 0;
                        buf[6] = 0x1a;
                        buf[7] = 0x05;
                }
                error = uled_ctrl_msg(sc, UT_WRITE_CLASS_INTERFACE,
                    UR_SET_REPORT, 0x200, 0, buf, sizeof(buf));
                break;
        default:
                error = ENOTTY;
                break;
        }

        mtx_unlock(&sc->sc_mtx);
        return (error);
}

poll/kqfilter: Readiness Notifications

Poll and kqfilter support event-driven I/O, allowing programs to wait efficiently for your device to be ready for reading or writing.

When you need these:

  • Your device may not be ready immediately (hardware buffer empty/full)
  • You want to support select(), poll(), or kqueue() system calls
  • Non-blocking I/O makes sense for your device

The d_poll signature:

typedef int d_poll_t(struct cdev *dev, int events, struct thread *td);

Basic implementation:

static int
mydriver_poll(struct cdev *dev, int events, struct thread *td)
{
    struct mydriver_softc *sc = dev->si_drv1;
    int revents = 0;
    
    if (events & (POLLIN | POLLRDNORM)) {
        /* Check if data available for reading */
        if (sc->rx_ready)
            revents |= events & (POLLIN | POLLRDNORM);
        else
            selrecord(td, &sc->rsel);  /* Register for notification */
    }
    
    if (events & (POLLOUT | POLLWRNORM)) {
        /* Check if ready for writing */
        if (sc->tx_ready)
            revents |= events & (POLLOUT | POLLWRNORM);
        else
            selrecord(td, &sc->wsel);
    }
    
    return (revents);
}

When hardware becomes ready, wake up waiters:

/* In your interrupt handler or completion routine: */
selwakeup(&sc->rsel);  /* Wake readers */
selwakeup(&sc->wsel);  /* Wake writers */

The d_kqfilter signature (kqueue support):

typedef int d_kqfilter_t(struct cdev *dev, struct knote *kn);

Kqueue is more complex. For beginners, implementing poll is sufficient. Kqueue details belong in advanced chapters.

mmap: When Mapping Makes Sense

Mmap allows user programs to map device memory directly into their address space. This is powerful but advanced.

When to support mmap:

  • Hardware has a large memory region (framebuffer, DMA buffers)
  • Performance is critical (avoid copy overhead)
  • User space needs direct access to hardware registers (dangerous!)

When NOT to support mmap:

  • Security concerns (exposing kernel or hardware memory)
  • Synchronization complexity (cache coherency, DMA ordering)
  • It's overkill for simple devices

The d_mmap signature:

typedef int d_mmap_t(struct cdev *dev, vm_ooffset_t offset, vm_paddr_t *paddr,
                     int nprot, vm_memattr_t *memattr);

Basic implementation:

static int
mydriver_mmap(struct cdev *dev, vm_ooffset_t offset, vm_paddr_t *paddr,
              int nprot, vm_memattr_t *memattr)
{
    struct mydriver_softc *sc = dev->si_drv1;
    
    /* Only allow mapping hardware memory region */
    if (offset >= sc->mem_size)
        return (EINVAL);
    
    *paddr = rman_get_start(sc->mem_res) + offset;
    *memattr = VM_MEMATTR_UNCACHEABLE;  /* Uncached device memory */
    
    return (0);
}

For beginners: Defer mmap implementation until you actually need it. Most drivers don't.

Back-pointers (si_drv1, etc.)

You've seen dev->si_drv1 throughout this section. This is how you store your softc pointer in the cdev so you can retrieve it later.

Setting the back-pointer (in attach):

sc->cdev = make_dev(&mydriver_cdevsw, unit, UID_ROOT, GID_WHEEL,
                    0600, "mydriver%d", unit);
sc->cdev->si_drv1 = sc;  /* Store our softc */

Retrieving it (in every entry point):

struct mydriver_softc *sc = dev->si_drv1;

Available back-pointers:

  • si_drv1 - Primary driver data (typically your softc)
  • si_drv2 - Secondary data (if needed)

Why not just device_get_softc()?

Because cdev entry points receive a struct cdev *, not a device_t. The si_drv1 field is the bridge.

Permissions and Ownership

When creating device nodes, set appropriate permissions to balance usability and security.

make_dev parameters:

struct cdev *
make_dev(struct cdevsw *devsw, int unit, uid_t uid, gid_t gid,
         int perms, const char *fmt, ...);

Common permission patterns:

Root-only device (hardware control, dangerous operations):

make_dev(&mydrv_cdevsw, unit, UID_ROOT, GID_WHEEL, 0600, "mydriver%d", unit);

Permissions: rw------- (owner=root)

User-accessible read-only:

make_dev(&mydrv_cdevsw, unit, UID_ROOT, GID_WHEEL, 0444, "mysensor%d", unit);

Permissions: r--r--r-- (everyone can read)

Group-accessible device (e.g., audio):

make_dev(&mydrv_cdevsw, unit, UID_ROOT, GID_OPERATOR, 0660, "myaudio%d", unit);

Permissions: rw-rw---- (root and operator group)

Public device (like /dev/null):

make_dev(&mydrv_cdevsw, unit, UID_ROOT, GID_WHEEL, 0666, "mynull", unit);

Permissions: rw-rw-rw- (everyone)

Security principle: Start restrictive (0600) and only open up when necessary and safe.

Summary

Character device entry points route user-space I/O to your driver:

  • cdevsw: Routing table mapping system calls to your functions
  • open/close: Initialize and clean up per-session state
  • read/write: Transfer data using uiomove() and struct uio
  • ioctl: Configuration and control commands
  • poll/kqfilter: Event-driven readiness notifications (advanced)
  • mmap: Direct memory mapping (advanced, security-sensitive)
  • si_drv1: Back-pointer to retrieve your softc
  • Permissions: Set appropriate access controls with make_dev()

Next, we'll look at alternative surfaces for network and storage drivers, which present very different interfaces.

Alternative Surfaces: Network and Storage (Fast Orientation)

Character devices use /dev and cdevsw. But not all drivers fit that model. Network and storage drivers integrate with different kernel subsystems, presenting alternative "surfaces" to the rest of the system. This section provides a fast orientation, just enough to recognize these patterns when you see them.

A First Look at ifnet

Network drivers don't create /dev entries. Instead, they register network interfaces that appear in ifconfig and integrate with the network stack.

The ifnet structure (simplified view):

struct ifnet {
    char      if_xname[IFNAMSIZ];  /* Interface name (e.g., "em0") */
    u_int     if_flags;             /* Flags (UP, RUNNING, etc.) */
    int       if_mtu;               /* Maximum transmission unit */
    uint64_t  if_baudrate;          /* Link speed */
    u_char    if_addr[ETHER_ADDR_LEN];  /* Hardware address */
    
    /* Driver-provided methods */
    if_init_fn_t    if_init;      /* Initialize interface */
    if_ioctl_fn_t   if_ioctl;     /* Handle ioctl commands */
    if_transmit_fn_t if_transmit; /* Transmit a packet */
    if_qflush_fn_t  if_qflush;    /* Flush transmit queue */
    /* ... many more fields ... */
};

Registering a network interface (in attach):

if_t ifp;

/* Allocate interface structure */
ifp = if_alloc(IFT_ETHER);
if (ifp == NULL)
    return (ENOSPC);

/* Set driver data */
if_setsoftc(ifp, sc);
if_initname(ifp, device_get_name(dev), device_get_unit(dev));

/* Set capabilities and flags */
if_setflags(ifp, IFF_BROADCAST | IFF_SIMPLEX | IFF_MULTICAST);
if_setcapabilities(ifp, IFCAP_VLAN_MTU | IFCAP_HWCSUM);

/* Provide driver methods */
if_setinitfn(ifp, mydriver_init);
if_setioctlfn(ifp, mydriver_ioctl);
if_settransmitfn(ifp, mydriver_transmit);
if_setqflushfn(ifp, mydriver_qflush);

/* Attach as Ethernet interface */
ether_ifattach(ifp, sc->mac_addr);

What the driver must implement:

if_init - Initialize hardware and bring interface up:

static void
mydriver_init(void *arg)
{
    struct mydriver_softc *sc = arg;
    
    /* Reset hardware */
    /* Configure MAC address */
    /* Enable interrupts */
    /* Mark interface running */
    
    if_setdrvflagbits(sc->ifp, IFF_DRV_RUNNING, 0);
}

if_transmit - Transmit a packet:

static int
mydriver_transmit(if_t ifp, struct mbuf *m)
{
    struct mydriver_softc *sc = if_getsoftc(ifp);
    
    /* Queue packet for transmission */
    /* Program DMA descriptor */
    /* Notify hardware */
    
    return (0);
}

if_ioctl - Handle configuration changes:

static int
mydriver_ioctl(if_t ifp, u_long command, caddr_t data)
{
    struct mydriver_softc *sc = if_getsoftc(ifp);
    
    switch (command) {
    case SIOCSIFFLAGS:    /* Interface flags changed */
        /* Handle up/down, promisc, etc. */
        break;
    case SIOCSIFMEDIA:    /* Media selection changed */
        /* Handle speed/duplex changes */
        break;
    /* ... many more ... */
    }
    return (0);
}

Receiving packets (typically in interrupt handler):

/* In interrupt handler when packet arrives: */
struct mbuf *m;

m = mydriver_rx_packet(sc);  /* Get packet from hardware */
if (m != NULL) {
    (*ifp->if_input)(ifp, m);  /* Pass to network stack */
}

Key difference from character devices:

  • No open/close/read/write
  • Packets, not byte streams
  • Asynchronous transmit/receive model
  • Integration with routing, firewalls, protocols

Where to learn more: Chapter 28 covers network driver implementation in depth.

A First Look at GEOM

Storage drivers integrate with FreeBSD's GEOM (GEOmetry Management) layer, a modular framework for storage transformations.

GEOM conceptual model:

File System (UFS/ZFS)
     -> 
GEOM Consumer
     -> 
GEOM Class (partition, mirror, encryption)
     -> 
GEOM Provider
     -> 
Disk Driver (CAM)
     -> 
Hardware (AHCI, NVMe)

Providers and Consumers:

  • Provider: Supplies storage (e.g., a disk: ada0)
  • Consumer: Uses storage (e.g., a filesystem)
  • GEOM Class: Transformation layer (partitioning, RAID, encryption)

Creating a GEOM provider:

struct g_provider *pp;

pp = g_new_providerf(gp, "%s", name);
pp->mediasize = disk_size;
pp->sectorsize = 512;
g_error_provider(pp, 0);  /* Mark available */

Handling I/O requests (bio structure):

static void
mygeom_start(struct bio *bp)
{
    struct mygeom_softc *sc;
    
    sc = bp->bio_to->geom->softc;
    
    switch (bp->bio_cmd) {
    case BIO_READ:
        mygeom_read(sc, bp);
        break;
    case BIO_WRITE:
        mygeom_write(sc, bp);
        break;
    case BIO_DELETE:  /* TRIM command */
        mygeom_delete(sc, bp);
        break;
    default:
        g_io_deliver(bp, EOPNOTSUPP);
        return;
    }
}

Completing I/O:

bp->bio_completed = bp->bio_length;
bp->bio_resid = 0;
g_io_deliver(bp, 0);  /* Success */

Key differences from character devices:

  • Block-oriented (not byte streams)
  • Asynchronous I/O model (bio requests)
  • Layered architecture (transformations stack)
  • Integration with filesystems and storage stack

Where to learn more: Chapter 27 covers cover GEOM and CAM in depth.

Mixed Models (tun as a Bridge)

Some drivers expose both a control plane (character device) and a data plane (network interface or storage). This "bridge" pattern provides flexibility.

Example: tun/tap device

The tun device (network tunnel) presents:

  1. Character device (/dev/tun0) for control and packet I/O
  2. Network interface (tun0 in ifconfig) for kernel routing

User-space view:

/* Open control interface */
int fd = open("/dev/tun0", O_RDWR);

/* Configure via ioctl */
struct tuninfo info = { ... };
ioctl(fd, TUNSIFINFO, &info);

/* Read packets from network stack */
char packet[2048];
read(fd, packet, sizeof(packet));

/* Write packets to network stack */
write(fd, packet, packet_len);

Kernel view

The tun driver:

  • Creates a /dev/tunX node (cdevsw)
  • Creates a tunX network interface (ifnet)
  • Routes packets between them

When the network stack has a packet for tun0:

  1. Packet goes to tun driver's if_transmit
  2. Driver queues it
  3. User's read() on /dev/tun0 retrieves it

When user writes to /dev/tun0:

  1. Driver receives data in d_write
  2. Driver wraps it in mbuf
  3. Calls (*ifp->if_input)() to inject into network stack

Why this pattern

  • Control plane: Configuration, setup, teardown
  • Data plane: High-performance packet/block transfer
  • Separation: Clean interface boundaries

Other examples

  • BPF (Berkeley Packet Filter): /dev/bpf for control, sniffs network interfaces
  • TAP: Similar to TUN but operates at Ethernet layer

What Comes Later

This section provided recognition-level understanding of alternative surfaces. Full implementation comes in dedicated chapters:

Network drivers - Chapter 28

  • mbuf management and packet queues
  • DMA descriptor rings
  • Interrupt moderation and NAPI-like polling
  • Hardware offload (checksums, TSO, RSS)
  • Link state management
  • Media selection (speed/duplex negotiation)

Storage drivers - Chapter 27

  • CAM (Common Access Method) architecture
  • SCSI/ATA command handling
  • DMA and scatter-gather for block I/O
  • Error recovery and retries
  • NCQ (Native Command Queuing)
  • GEOM class implementation

For now: Just recognize that not all drivers use cdevsw. Some integrate with specialized kernel subsystems (network stack, storage layer) and present domain-specific interfaces.

Summary

Alternative driver surfaces:

  • Network interfaces (ifnet): Integrate with network stack, appear in ifconfig
  • Storage (GEOM): Block-oriented, layered transformations, filesystem integration
  • Mixed models: Combine character device control plane with network/storage data plane

Key takeaway: The driver family (character, network, storage) determines which kernel subsystem you integrate with. All still follow the same Newbus lifecycle (probe/attach/detach).

Next, we'll preview resources and registers, the vocabulary for hardware access.

Resources & Registers: A Safe Preview (No Deep Dive Yet)

Drivers don't just manage data structures, they talk to hardware. This means claiming resources (memory regions, IRQs), reading/writing registers, setting up interrupts, and potentially using DMA. This section provides just enough vocabulary to recognize these patterns without drowning in implementation details. Think of it as learning to recognize tools in a workshop before you learn to use them.

Claiming Resources (bus_alloc_resource_*)

Hardware devices use resources: memory-mapped I/O regions, I/O ports, IRQ lines, DMA channels. Before you can use them, you must ask the bus to allocate them.

The allocation function:

struct resource *
bus_alloc_resource_any(device_t dev, int type, int *rid, u_int flags);

Resource types (from /usr/src/sys/amd64/include/resource.h, /usr/src/sys/arm64/include/resource.h, etc):

  • SYS_RES_MEMORY - Memory-mapped I/O region
  • SYS_RES_IOPORT - I/O port region (x86)
  • SYS_RES_IRQ - Interrupt line
  • SYS_RES_DRQ - DMA channel (legacy)

Example: Allocating PCI BAR 0 (memory region):

sc->mem_rid = PCIR_BAR(0);  /* Base Address Register 0 */
sc->mem_res = bus_alloc_resource_any(dev, SYS_RES_MEMORY,
                                      &sc->mem_rid, RF_ACTIVE);
if (sc->mem_res == NULL) {
    device_printf(dev, "Could not allocate memory resource\\n");
    return (ENXIO);
}

Example: Allocating IRQ:

sc->irq_rid = 0;
sc->irq_res = bus_alloc_resource_any(dev, SYS_RES_IRQ,
                                      &sc->irq_rid, RF_ACTIVE | RF_SHAREABLE);
if (sc->irq_res == NULL) {
    device_printf(dev, "Could not allocate IRQ\\n");
    return (ENXIO);
}

Releasing resources (in detach):

if (sc->mem_res != NULL) {
    bus_release_resource(dev, SYS_RES_MEMORY, sc->mem_rid, sc->mem_res);
    sc->mem_res = NULL;
}

What you need to know now:

  • Hardware resources must be allocated before use
  • Always release them in detach
  • Allocation can fail (always check return value)

Full details: Chapter 18 covers resource management, PCI configuration, and memory mapping.

Talking to Hardware with bus_space

Once you've allocated a memory resource, you need to read and write hardware registers. FreeBSD provides bus_space functions for portable MMIO (Memory-Mapped I/O) and PIO (Port I/O) access.

Why not just pointer dereferences?

Direct memory access like *(uint32_t *)addr doesn't work reliably because:

  • Endianness varies across architectures
  • Memory barriers and ordering matter
  • Some architectures need special instructions

bus_space abstractions:

bus_space_tag_t    bst;   /* Bus space tag (method table) */
bus_space_handle_t bsh;   /* Bus space handle (mapped address) */

Getting bus_space handles from resource:

sc->bst = rman_get_bustag(sc->mem_res);
sc->bsh = rman_get_bushandle(sc->mem_res);

Reading registers:

uint32_t value;

value = bus_space_read_4(sc->bst, sc->bsh, offset);
/* _4 means 4 bytes (32 bits), offset is byte offset into region */

Writing registers:

bus_space_write_4(sc->bst, sc->bsh, offset, value);

Common width variants:

  • bus_space_read_1 / bus_space_write_1 - 8-bit (byte)
  • bus_space_read_2 / bus_space_write_2 - 16-bit (word)
  • bus_space_read_4 / bus_space_write_4 - 32-bit (dword)
  • bus_space_read_8 / bus_space_write_8 - 64-bit (qword)

Example: Reading hardware status register:

#define MY_STATUS_REG  0x00
#define MY_CONTROL_REG 0x04

/* Read status */
uint32_t status = bus_space_read_4(sc->bst, sc->bsh, MY_STATUS_REG);

/* Check a flag */
if (status & STATUS_READY) {
    /* Hardware is ready */
}

/* Write control register */
bus_space_write_4(sc->bst, sc->bsh, MY_CONTROL_REG, CTRL_START);

What you need to know now:

  • Use bus_space_read/write for hardware access
  • Never dereference hardware addresses directly
  • Offsets are in bytes

Full details: Chapter 16 covers bus_space patterns, memory barriers, and register access strategies.

Interrupts in Two Sentences

When hardware needs attention (packet arrived, transfer complete, error occurred), it raises an interrupt. Your driver registers an interrupt handler that the kernel calls asynchronously when the interrupt fires.

Setting up an interrupt handler (implementation details in Chapter 19):

// Placeholder - full interrupt programming covered in Chapter 19
error = bus_setup_intr(dev, sc->irq_res,
                       INTR_TYPE_NET | INTR_MPSAFE,
                       NULL, mydriver_intr, sc, &sc->irq_hand);

Your interrupt handler:

static void
mydriver_intr(void *arg)
{
    struct mydriver_softc *sc = arg;
    
    /* Read interrupt status */
    /* Handle the event */
    /* Acknowledge interrupt to hardware */
}

Golden rule: Keep interrupt handlers short and fast. Defer heavy work to a taskqueue or thread.

What you need to know now:

  • Interrupts are asynchronous hardware notifications
  • You register a handler function
  • Handler runs in interrupt context (limited what you can do)

Full details: Chapter 19 covers interrupt handling, filter vs. thread handlers, interrupt moderation, and taskqueues.

DMA in Two Sentences

For high-performance data transfer, hardware uses DMA (Direct Memory Access) to move data between memory and device without CPU involvement. FreeBSD provides bus_dma for safe, portable DMA setup including bounce buffers for architectures with IOMMU or DMA limitations.

Typical DMA pattern:

  1. Allocate DMA-capable memory with bus_dmamem_alloc
  2. Load buffer addresses into hardware descriptors
  3. Tell hardware to start DMA
  4. Hardware interrupts when done
  5. Unload and free when driver detaches

What you need to know now:

  • DMA = zero-copy data transfer
  • Requires special memory allocation
  • Architecture-dependent (bus_dma handles portability)

Full details: Chapter 21 covers DMA architecture, descriptor rings, scatter-gather, synchronization, and bounce buffers.

Concurrency Note

The kernel is multithreaded and preemptible. Your driver can be called simultaneously from:

  • Multiple user processes (different threads opening your device)
  • Interrupt context (hardware events)
  • System threads (taskqueues, timers)

This means you need locks to protect shared state:

/* In softc: */
struct mtx mtx;

/* In attach: */
mtx_init(&sc->mtx, "mydriver", NULL, MTX_DEF);

/* In your functions: */
mtx_lock(&sc->mtx);
/* ... access shared state ... */
mtx_unlock(&sc->mtx);

/* In detach: */
mtx_destroy(&sc->mtx);

What you need to know now:

  • Shared data needs protection
  • Use mutexes (MTX_DEF for most cases)
  • Acquire lock, do work, release lock
  • Interrupt handlers may need special lock types

Full details: Chapter 11 covers locking strategies, lock types (mutex, sx, rm), lock ordering, deadlock prevention, and lockless algorithms.

Summary

This section previewed the vocabulary for hardware access:

  • Resources: Allocate with bus_alloc_resource_any(), release in detach
  • Registers: Access with bus_space_read/write_N(), never direct pointers
  • Interrupts: Register handlers with bus_setup_intr(), keep them short
  • DMA: Use bus_dma for zero-copy transfers (complex, covered later)
  • Locking: Protect shared state with mutexes

Remember: This chapter is about recognition, not mastery. When you see these patterns in driver code, you'll know what they are. Implementation details come in dedicated chapters.

Next, we'll look at creating and removing device nodes in /dev.

Creating and Removing Device Nodes

Character devices need to appear in /dev so user programs can open them. This section shows you the minimal API for creating and destroying device nodes using FreeBSD's devfs.

make_dev/make_dev_s: Creating /dev/foo

The core function for creating device nodes is make_dev():

struct cdev *
make_dev(struct cdevsw *devsw, int unit, uid_t uid, gid_t gid,
         int perms, const char *fmt, ...);

Parameters:

  • devsw - Your character device switch (cdevsw)
  • unit - Unit number (minor number)
  • uid - Owner user ID (typically UID_ROOT)
  • gid - Owner group ID (typically GID_WHEEL)
  • perms - Permissions (octal, like 0600 or 0666)
  • fmt, ... - Printf-style device name

Example (creating /dev/mydriver0):

sc->cdev = make_dev(&mydriver_cdevsw, unit,
                    UID_ROOT, GID_WHEEL, 0600, "mydriver%d", unit);
if (sc->cdev == NULL) {
    device_printf(dev, "Failed to create device node\\n");
    return (ENOMEM);
}

/* Store softc pointer for retrieval in entry points */
sc->cdev->si_drv1 = sc;

The safer variant: make_dev_s()

make_dev_s() handles race conditions better and returns error codes:

struct make_dev_args mda;
int error;

make_dev_args_init(&mda);
mda.mda_devsw = &mydriver_cdevsw;
mda.mda_uid = UID_ROOT;
mda.mda_gid = GID_WHEEL;
mda.mda_mode = 0600;
mda.mda_si_drv1 = sc;  /* Set back-pointer directly */

error = make_dev_s(&mda, &sc->cdev, "mydriver%d", unit);
if (error != 0) {
    device_printf(dev, "Failed to create device node: %d\\n", error);
    return (error);
}

When to create device nodes: Typically in your attach function, after hardware initialization succeeds.

Minors and Naming Conventions

Minor numbers identify which instance of your driver a device node represents. The kernel assigns them automatically based on the unit parameter you pass to make_dev().

Naming conventions:

  • Single instance: mydriver (no number)
  • Multiple instances: mydriver0, mydriver1, etc.
  • Sub-devices: mydriver0.ctl, mydriver0a, mydriver0b
  • Subdirectories: Use / in name: "led/%s" creates /dev/led/foo

Examples from FreeBSD:

  • /dev/null, /dev/zero - Single, unnumbered
  • /dev/cuau0, /dev/cuau1 - Serial ports, numbered
  • /dev/ada0, /dev/ada1 - Disks, numbered
  • /dev/pts/0 - Pseudo-terminal in subdirectory

Best practices:

  • Use device number from device_get_unit() for consistency
  • Follow established naming patterns (users expect them)
  • Use descriptive names (not just /dev/dev0)

destroy_dev: Cleaning Up

When your driver detaches, you must remove device nodes to prevent stale entries in /dev.

Simple cleanup:

if (sc->cdev != NULL) {
    destroy_dev(sc->cdev);
    sc->cdev = NULL;
}

The problem: destroy_dev() is synchronous, it waits for all open file descriptors to close. This can block indefinitely if a program has your device open.

Better approach: destroy_dev_sched()

For drivers that might be unloaded while in use:

if (sc->cdev != NULL) {
    destroy_dev_sched(sc->cdev);  /* Schedule for destruction */
    sc->cdev = NULL;
}

This returns immediately and defers actual destruction until all users close the device.

When to destroy: Always in your detach function, before releasing other resources.

Complete example pattern:

static int
mydriver_detach(device_t dev)
{
    struct mydriver_softc *sc = device_get_softc(dev);
    
    /* Destroy device node first */
    if (sc->cdev != NULL) {
        destroy_dev_sched(sc->cdev);
        sc->cdev = NULL;
    }
    
    /* Then release other resources */
    if (sc->irq_hand != NULL)
        bus_teardown_intr(dev, sc->irq_res, sc->irq_hand);
    if (sc->irq_res != NULL)
        bus_release_resource(dev, SYS_RES_IRQ, sc->irq_rid, sc->irq_res);
    if (sc->mem_res != NULL)
        bus_release_resource(dev, SYS_RES_MEMORY, sc->mem_rid, sc->mem_res);
    
    return (0);
}

devctl/devmatch: Runtime Events

FreeBSD provides devctl and devmatch for monitoring device events and matching drivers to hardware.

devctl: Event notification system

Programs can listen to /dev/devctl for device events:

% sudo service devd stop
% cat /dev/devctl
!system=DEVFS subsystem=CDEV type=CREATE cdev=mydriver0
!system=DEVFS subsystem=CDEV type=DESTROY cdev=mydriver0
...
...
--- press CTRL+C to cancel / exit , remember to restart devd ---
% sudo service devd start

Events your driver generates:

  • Device node creation (automatically when you call make_dev)
  • Device node destruction (automatically when you call destroy_dev)
  • Attach/detach (via devctl_notify)

Manual notification (optional):

#include <sys/devctl.h>

/* Notify that device attached */
devctl_notify("DEVICE", "ATTACH", device_get_name(dev), device_get_nameunit(dev));

/* Notify of custom event */
char buf[128];
snprintf(buf, sizeof(buf), "status=%d", sc->status);
devctl_notify("MYDRIVER", "STATUS", device_get_nameunit(dev), buf);

devmatch: Automatic driver loading

The devmatch utility scans unattached devices and suggests (or loads) appropriate drivers:

% devmatch
kldload -n if_em
kldload -n snd_hda

Your driver participates automatically when you use DRIVER_MODULE correctly. The kernel's device database (generated at build time) tracks which drivers match which hardware IDs.

Summary

Creating device nodes:

  • Use make_dev() or make_dev_s() in attach
  • Set ownership and permissions appropriately
  • Store softc back-pointer in si_drv1

Destroying device nodes:

  • Use destroy_dev_sched() in detach for safety
  • Always destroy before releasing other resources

Device events:

  • devctl monitors create/destroy events
  • devmatch auto-loads drivers for unattached devices

Next, we'll explore module packaging and the load/unload lifecycle.

Module Packaging & Lifecycle (Load, Init, Unload)

Your driver doesn't just exist in source form, it's compiled into a kernel module (.ko file) that can be dynamically loaded and unloaded. This section explains what a module is, how the lifecycle works, and how to handle load/unload events gracefully.

What a Kernel Module (.ko) Is

A kernel module is compiled, relocatable code that the kernel can load at runtime without rebooting. Think of it as a plugin for the kernel.

File extension: .ko (kernel object)

Example: mydriver.ko

What's inside:

  • Your driver code (probe, attach, detach, entry points)
  • Module metadata (name, version, dependencies)
  • Symbol table (for linking with kernel symbols)
  • Relocation information

How it's built:

% cd mydriver
% make

FreeBSD's build system (/usr/share/mk/bsd.kmod.mk) compiles your source and links it into a .ko file.

Why modules matter:

  • No reboot needed: Load/unload drivers without restarting
  • Smaller kernel: Only load drivers for hardware you have
  • Development speed: Test changes quickly
  • Modularity: Each driver is independent

Built-in vs. module: Drivers can be compiled directly into the kernel (monolithic) or as modules. For development and learning, always use modules.

The Module Event Handler

When a module is loaded or unloaded, the kernel calls your module event handler to give you a chance to initialize or clean up.

The module event handler signature:

typedef int (*modeventhand_t)(module_t mod, int /*modeventtype_t*/ type, void *data);

Event types:

  • MOD_LOAD - Module is being loaded
  • MOD_UNLOAD - Module is being unloaded
  • MOD_QUIESCE - Kernel is checking if unload is safe
  • MOD_SHUTDOWN - System is shutting down

Typical module event handler:

static int
mydriver_modevent(module_t mod, int type, void *data)
{
    int error = 0;
    
    switch (type) {
    case MOD_LOAD:
        /* Module is being loaded */
        printf("mydriver: Module loaded\\n");
        /* Initialize global state if needed */
        break;
        
    case MOD_UNLOAD:
        /* Module is being unloaded */
        printf("mydriver: Module unloaded\\n");
        /* Clean up global state if needed */
        break;
        
    case MOD_QUIESCE:
        /* Check if it's safe to unload */
        if (driver_is_busy()) {
            error = EBUSY;
        }
        break;
        
    case MOD_SHUTDOWN:
        /* System is shutting down */
        break;
        
    default:
        error = EOPNOTSUPP;
        break;
    }
    
    return (error);
}

Registering the handler (for pseudo-devices without Newbus):

static moduledata_t mydriver_mod = {
    "mydriver",           /* Module name */
    mydriver_modevent,    /* Event handler */
    NULL                  /* Extra data */
};

DECLARE_MODULE(mydriver, mydriver_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);
MODULE_VERSION(mydriver, 1);

For Newbus drivers: The DRIVER_MODULE macro handles most of this automatically. You typically don't need a separate module event handler unless you have global initialization beyond per-device state.

Example: Pseudo-device with module event handler

From /usr/src/sys/dev/null/null.c (simplified):

static int
null_modevent(module_t mod __unused, int type, void *data __unused)
{
        switch(type) {
        case MOD_LOAD:
                if (bootverbose)
                        printf("null: <full device, null device, zero device>\n");
                full_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &full_cdevsw, 0,
                    NULL, UID_ROOT, GID_WHEEL, 0666, "full");
                null_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &null_cdevsw, 0,
                    NULL, UID_ROOT, GID_WHEEL, 0666, "null");
                zero_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &zero_cdevsw, 0,
                    NULL, UID_ROOT, GID_WHEEL, 0666, "zero");
                break;

        case MOD_UNLOAD:
                destroy_dev(full_dev);
                destroy_dev(null_dev);
                destroy_dev(zero_dev);
                break;

        case MOD_SHUTDOWN:
                break;

        default:
                return (EOPNOTSUPP);
        }

        return (0);
}

...
...
    
DEV_MODULE(null, null_modevent, NULL);
MODULE_VERSION(null, 1);

This creates /dev/null and /dev/zero when loaded, destroys them when unloaded.

Declaring Dependencies & Versions

If your driver depends on other kernel modules, declare those dependencies explicitly so the kernel loads them in the correct order.

MODULE_DEPEND macro:

MODULE_DEPEND(mydriver, usb, 1, 1, 1);
MODULE_DEPEND(mydriver, netgraph, 5, 7, 9);

Parameters:

  • mydriver - Your module name
  • usb - Module you depend on
  • 1 - Minimum acceptable version
  • 1 - Preferred version
  • 1 - Maximum acceptable version

Why this matters

If you try to load mydriver without usb being loaded, the kernel will either:

  • Auto-load usb first (if available)
  • Refuse to load mydriver with an error

MODULE_VERSION macro:

MODULE_VERSION(mydriver, 1);

This declares your module's version. Increment it when you make breaking changes to interfaces that other modules might depend on.

Dependency examples:

/* USB device driver */
MODULE_DEPEND(umass, usb, 1, 1, 1);
MODULE_DEPEND(umass, cam, 1, 1, 1);

/* Network driver using Netgraph */
MODULE_DEPEND(ng_ether, netgraph, NG_ABI_VERSION, NG_ABI_VERSION, NG_ABI_VERSION);

When to declare dependencies:

  • You call functions from another module
  • You use data structures defined in another module
  • Your driver won't work without another subsystem

Common dependencies:

  • usb - USB subsystem
  • pci - PCI bus support
  • cam - Storage subsystem (CAM)
  • netgraph - Network graph framework
  • sound - Sound subsystem

kldload/kldunload Flow and Logs

Let's trace what happens when you load and unload a module.

Loading a module:

% sudo kldload mydriver

Kernel flow:

  1. Reads mydriver.ko from filesystem
  2. Verifies ELF format and signature
  3. Resolves symbol dependencies
  4. Links module into kernel
  5. Calls module event handler with MOD_LOAD
  6. For Newbus drivers: immediately probes for matching devices
  7. If devices match: calls attach for each
  8. Module is now active

Check if loaded:

% kldstat
Id Refs Address                Size Name
 1   23 0xffffffff80200000  1c6e230 kernel
 2    1 0xffffffff81e6f000    5000 mydriver.ko

View kernel messages:

% dmesg | tail -5
mydriver0: <My Awesome Driver> mem 0xf0000000-0xf0001fff irq 16 at device 2.0 on pci0
mydriver0: Hardware version 1.2
mydriver0: Attached successfully

Unloading a module:

% sudo kldunload mydriver

Kernel flow:

  1. Calls module event handler with MOD_QUIESCE (optional check)
  2. If EBUSY returned: refuses to unload
  3. For Newbus drivers: calls detach for all attached devices
  4. Calls module event handler with MOD_UNLOAD
  5. Unlinks module from kernel
  6. Frees module memory

Common unload failures:

% sudo kldunload mydriver
kldunload: can't unload file: Device busy

Why:

  • Device nodes still open
  • Module is depended on by other modules
  • Driver returned EBUSY from detach

Force unload (dangerous, only for testing):

% sudo kldunload -f mydriver

This skips safety checks. Use only in a VM when testing!

Troubleshooting Loads

Problem: Module won't load

Check 1: Missing symbols

% sudo kldload ./mydriver.ko
link_elf: symbol usb_ifconfig undefined

Solution: Add MODULE_DEPEND(mydriver, usb, 1, 1, 1) and ensure USB module is loaded.

Check 2: Module not found

% sudo kldload mydriver
kldload: can't load mydriver: No such file or directory

Solution: Either provide full path (./mydriver.ko) or copy to /boot/modules/.

Check 3: Permission denied

% kldload mydriver.ko
kldload: Operation not permitted

Solution: Use sudo or become root.

Check 4: Version mismatch

% sudo kldload mydriver.ko
kldload: can't load mydriver: Exec format error

Solution: Module was compiled for different FreeBSD version. Rebuild against your running kernel.

Check 5: Duplicate symbols

% sudo kldload mydriver.ko
link_elf: symbol mydriver_probe defined in both mydriver.ko and olddriver.ko

Solution: Name collision. Unload conflicting module or rename your functions.

Debugging tips:

1. Verbose loading:

% sudo kldload -v mydriver.ko

2. Check module metadata:

% kldstat -v | grep mydriver

3. View symbols:

% nm mydriver.ko | grep mydriver_probe

4. Test in VM:

Always test new drivers in a VM, never on your main system. Crashes are expected during development!

5. Watch kernel log in real-time:

% tail -f /var/log/messages

Summary

Kernel modules:

  • .ko files containing driver code
  • Can be loaded/unloaded dynamically
  • No reboot needed for testing

Module event handler:

  • Handles MOD_LOAD, MOD_UNLOAD events
  • Initialize/cleanup global state
  • Can refuse unload with EBUSY

Dependencies:

  • Declare with MODULE_DEPEND
  • Version with MODULE_VERSION
  • Kernel enforces load order

Troubleshooting:

  • Missing symbols -> add dependencies
  • Can't unload -> check for open devices or dependencies
  • Always test in VM during development

Next, we'll discuss logging, errors, and user-facing behavior.

Logging, Errors, and User-Facing Behaviour

Your driver isn't just code, it's part of the user experience. Clear logging, consistent error reporting, and useful diagnostics separate professional drivers from amateur ones. This section covers how to be a good citizen of the FreeBSD kernel.

Logging Etiquette (device_printf, rate-limiting hints)

The cardinal rule: Log enough to be useful, but not so much that you spam the console or fill logs.

Use device_printf() for device-related messages:

device_printf(dev, "Attached successfully\\n");
device_printf(dev, "Hardware error: status=0x%x\\n", status);

Output:

mydriver0: Attached successfully
mydriver0: Hardware error: status=0x42

When to log:

Attach: ONE line summarizing successful attachment

device_printf(dev, "Attached (hw ver %d.%d)\\n", major, minor);

Errors: ALWAYS log failures with context

if (error != 0) {
    device_printf(dev, "Could not allocate IRQ: error=%d\\n", error);
    return (error);
}

Configuration changes: Log significant state changes

device_printf(dev, "Link up: 1000 Mbps full-duplex\\n");
device_printf(dev, "Entering power-save mode\\n");

When NOT to log:

Per-packet/per-I/O: NEVER log on every packet or read/write

/* BAD: This will flood the log */
device_printf(dev, "Received packet, length=%d\\n", len);

Verbose debugging info: Not in production code

/* BAD: Too verbose */
device_printf(dev, "Step 1\\n");
device_printf(dev, "Step 2\\n");
device_printf(dev, "Reading register 0x%x\\n", reg);

Rate-limiting for repetitive events:

If an error can occur repeatedly (hardware timeout, overflow), rate-limit:

static struct timeval last_overflow_msg;

if (ppsratecheck(&last_overflow_msg, NULL, 1)) {
    /* Max once per second */
    device_printf(dev, "RX overflow (message rate-limited)\\n");
}

Using printf vs. device_printf:

  • device_printf: For messages about a specific device

  • printf: For messages about module or subsystem

/* On module load */
printf("mydriver: version 1.2 loaded\\n");

/* On device attach */
device_printf(dev, "Attached successfully\\n");

Log levels (for future reference)

FreeBSD kernel doesn't have explicit log levels like syslog, but conventions exist:

  • Critical errors: Always log
  • Warnings: Log with "warning:" prefix
  • Info: Log major state changes
  • Debug: Compile-time conditional (MYDRV_DEBUG)

Example from real driver (/usr/src/sys/dev/uart/uart_core.c):

static void
uart_pps_print_mode(struct uart_softc *sc)
{

  device_printf(sc->sc_dev, "PPS capture mode: ");
  switch(sc->sc_pps_mode & UART_PPS_SIGNAL_MASK) {
  case UART_PPS_DISABLED:
    printf("disabled");
    break;
  case UART_PPS_CTS:
    printf("CTS");
    break;
  case UART_PPS_DCD:
    printf("DCD");
    break;
  default:
    printf("invalid");
    break;
  }
  if (sc->sc_pps_mode & UART_PPS_INVERT_PULSE)
    printf("-Inverted");
  if (sc->sc_pps_mode & UART_PPS_NARROW_PULSE)
    printf("-NarrowPulse");
  printf("\n");
}

Return Codes and Conventions

FreeBSD uses standard errno codes for error reporting. Using them consistently makes your driver predictable and debuggable.

Common errno codes (from <sys/errno.h>):

CodeValueMeaningWhen to Use
00SuccessOperation succeeded
ENOMEM12Out of memorymalloc/bus_alloc_resource failed
ENODEV19No such deviceHardware not present/responding
EINVAL22Invalid argumentBad parameter from user
EIO5Input/output errorHardware communication failed
EBUSY16Device busyCan't detach, resource in use
ETIMEDOUT60TimeoutHardware didn't respond
ENOTTY25Not a typewriterInvalid ioctl command
ENXIO6No such device/addressProbe rejected device

In probe:

if (vendor_id == MY_VENDOR && device_id == MY_DEVICE)
    return (BUS_PROBE_DEFAULT);  /* Success, with priority */
else
    return (ENXIO);  /* Not my device */

In attach:

sc->mem_res = bus_alloc_resource_any(...);
if (sc->mem_res == NULL)
    return (ENOMEM);  /* Resource allocation failed */

error = mydriver_hw_init(sc);
if (error != 0)
    return (EIO);  /* Hardware initialization failed */

return (0);  /* Success */

In entry points (read/write/ioctl):

/* Invalid parameter */
if (len > MAX_LEN)
    return (EINVAL);

/* Hardware not ready */
if (!(sc->flags & FLAG_READY))
    return (ENODEV);

/* I/O error */
if (timeout)
    return (ETIMEDOUT);

/* Success */
return (0);

In ioctl:

switch (cmd) {
case MYDRV_SETSPEED:
    if (speed > MAX_SPEED)
        return (EINVAL);  /* Bad parameter */
    sc->speed = speed;
    return (0);

default:
    return (ENOTTY);  /* Unknown ioctl command */
}

Summary:

  • 0 = success (always)
  • Positive errno = failure
  • Negative values = special meanings in some contexts (like probe priorities)

User-space sees these:

int fd = open("/dev/mydriver0", O_RDWR);
if (fd < 0) {
    perror("open");  /* Prints: "open: No such device" if ENODEV returned */
}

Lightweight Observability with sysctl

sysctl provides a way to expose driver state and statistics without requiring a debugger or special tools. It's invaluable for troubleshooting and monitoring.

Why sysctl is useful:

  • Users can check driver state from shell
  • Monitoring tools can scrape values
  • No device open required
  • Zero overhead when not accessed

Example: Exposing statistics

In softc:

struct mydriver_softc {
    /* ... */
    uint64_t stat_packets_rx;
    uint64_t stat_packets_tx;
    uint64_t stat_errors;
    uint32_t current_speed;
};

In attach, create sysctl nodes:

struct sysctl_ctx_list *ctx;
struct sysctl_oid *tree;

/* Get device's sysctl context */
ctx = device_get_sysctl_ctx(dev);
tree = device_get_sysctl_tree(dev);

/* Add statistics */
SYSCTL_ADD_U64(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
    "packets_rx", CTLFLAG_RD, &sc->stat_packets_rx, 0,
    "Packets received");

SYSCTL_ADD_U64(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
    "packets_tx", CTLFLAG_RD, &sc->stat_packets_tx, 0,
    "Packets transmitted");

SYSCTL_ADD_U64(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
    "errors", CTLFLAG_RD, &sc->stat_errors, 0,
    "Error count");

SYSCTL_ADD_U32(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
    "speed", CTLFLAG_RD, &sc->current_speed, 0,
    "Current link speed (Mbps)");

User access:

% sysctl dev.mydriver.0
dev.mydriver.0.packets_rx: 1234567
dev.mydriver.0.packets_tx: 987654
dev.mydriver.0.errors: 5
dev.mydriver.0.speed: 1000

Read-write sysctl (for configuration):

static int
mydriver_sysctl_debug(SYSCTL_HANDLER_ARGS)
{
    struct mydriver_softc *sc = arg1;
    int error, value;
    
    value = sc->debug_level;
    error = sysctl_handle_int(oidp, &value, 0, req);
    if (error || !req->newptr)
        return (error);
    
    /* Validate new value */
    if (value < 0 || value > 9)
        return (EINVAL);
    
    sc->debug_level = value;
    device_printf(sc->dev, "Debug level set to %d\\n", value);
    
    return (0);
}

/* In attach: */
SYSCTL_ADD_PROC(ctx, SYSCTL_CHILDREN(tree), OID_AUTO,
    "debug", CTLTYPE_INT | CTLFLAG_RW, sc, 0,
    mydriver_sysctl_debug, "I", "Debug level (0-9)");

User can change it:

% sysctl dev.mydriver.0.debug=3
dev.mydriver.0.debug: 0 -> 3

Best practices:

  • Expose counters and state (read-only)
  • Use clear, descriptive names
  • Add description strings
  • Group related sysctls under subtrees
  • Don't expose sensitive data (keys, passwords)
  • Don't make sysctls for every variable (only useful ones)

Cleanup: Sysctl nodes are automatically cleaned up when device detaches (if you used device_get_sysctl_ctx()).

Summary

Logging etiquette:

  • One line on attach, always log errors
  • Never log per-packet/per-I/O
  • Rate-limit repetitive messages
  • Use device_printf for device messages

Return codes:

  • 0 = success
  • Standard errno codes (ENOMEM, EINVAL, EIO, etc.)
  • Be consistent and predictable

sysctl observability:

  • Expose statistics and state for monitoring
  • Read-only for counters, read-write for config
  • Zero overhead when not used
  • Auto-cleanup on detach

Next, we'll take a read-only tour of tiny real drivers to see these patterns in practice.

Read-Only Tour of Tiny Real Drivers (FreeBSD 14.3)

Now that you understand driver structure conceptually, let's tour real FreeBSD drivers to see these patterns in practice. We'll examine four small, clean examples, pointing out exactly where probe, attach, entry points, and other structures live. This is read-only, you'll implement your own in Chapter 7. For now, recognize and understand.

Tour 1 - The canonical character trio /dev/null, /dev/zero, and /dev/full

Open the file:

% cd /usr/src/sys/dev/null
% less null.c

We'll walk top-to-bottom: headers -> globals -> cdevsw -> write/read/ioctl paths -> module event that creates and destroys the devfs nodes.

1) Includes + minimal globals (we'll be creating devfs nodes)

32: #include <sys/cdefs.h>
33: #include <sys/param.h>
34: #include <sys/systm.h>
35: #include <sys/conf.h>
36: #include <sys/uio.h>
37: #include <sys/kernel.h>
38: #include <sys/malloc.h>
39: #include <sys/module.h>
40: #include <sys/disk.h>
41: #include <sys/bus.h>
42: #include <sys/filio.h>
43:
44: #include <machine/bus.h>
45: #include <machine/vmparam.h>
46:
47: /* For use with destroy_dev(9). */
48: static struct cdev *full_dev;
49: static struct cdev *null_dev;
50: static struct cdev *zero_dev;
51:
52: static d_write_t full_write;
53: static d_write_t null_write;
54: static d_ioctl_t null_ioctl;
55: static d_ioctl_t zero_ioctl;
56: static d_read_t zero_read;
57:
Headers and Global Device Pointers

The null driver begins with standard kernel headers and forward declarations that establish the foundation for three related but distinct character devices.

Header Inclusions
#include <sys/cdefs.h>
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/uio.h>
#include <sys/kernel.h>
#include <sys/malloc.h>
#include <sys/module.h>
#include <sys/disk.h>
#include <sys/bus.h>
#include <sys/filio.h>

#include <machine/bus.h>
#include <machine/vmparam.h>

These headers provide the kernel infrastructure needed for character device drivers:

<sys/cdefs.h> and <sys/param.h>: Fundamental system definitions including compiler directives, basic types, and system-wide constants. Every kernel source file includes these first.

<sys/systm.h>: Core kernel functions like printf(), panic(), and bzero(). This is the kernel's equivalent of <stdio.h> in userspace.

<sys/conf.h>: Character and block device configuration structures, particularly cdevsw (character device switch table) and related types. This header defines the d_open_t, d_read_t, d_write_t function pointer types used throughout the driver.

<sys/uio.h>: User I/O operations. The struct uio type describes data transfers between kernel and userspace, tracking buffer location, size, and direction. The uiomove() function declared here performs the actual data copying.

<sys/kernel.h>: Kernel startup and module infrastructure, including module event types (MOD_LOAD, MOD_UNLOAD) and the SYSINIT framework for initialization ordering.

<sys/malloc.h>: Kernel memory allocation. Though this driver doesn't dynamically allocate memory, the header is included for completeness.

<sys/module.h>: Module loading and unloading infrastructure. Provides DEV_MODULE and related macros for registering loadable kernel modules.

<sys/disk.h> and <sys/bus.h>: Disk and bus subsystem interfaces. The null driver includes these for kernel dump (DIOCSKERNELDUMP) ioctl support.

<sys/filio.h>: File I/O control commands. Defines FIONBIO (set non-blocking I/O) and FIOASYNC (set asynchronous I/O) ioctls that the driver must handle.

<machine/bus.h> and <machine/vmparam.h>: Architecture-specific definitions. The vmparam.h header provides ZERO_REGION_SIZE and zero_region, a kernel virtual memory region pre-filled with zeros that /dev/zero uses for efficient reads.

Device Structure Pointers
/* For use with destroy_dev(9). */
static struct cdev *full_dev;
static struct cdev *null_dev;
static struct cdev *zero_dev;

These three global pointers store references to the character device structures created during module load. Each pointer represents one device node in /dev:

full_dev: Points to the /dev/full device structure. This device simulates a full disk, reads succeed but writes always fail with ENOSPC (no space left on device).

null_dev: Points to the /dev/null device structure, the classic "bit bucket" that discards all written data and returns immediate end-of-file on reads.

zero_dev: Points to the /dev/zero device structure, which returns an infinite stream of zero bytes when read and discards writes like /dev/null.

The comment references destroy_dev(9), indicating these pointers are needed for cleanup during module unload. The make_dev_credf() function called during MOD_LOAD returns struct cdev * values stored here, and destroy_dev() called during MOD_UNLOAD uses these pointers to remove the device nodes.

The static storage class limits these variables to this source file, no other kernel code can access them directly. This encapsulation prevents unintended external modification.

Function Forward Declarations
static d_write_t full_write;
static d_write_t null_write;
static d_ioctl_t null_ioctl;
static d_ioctl_t zero_ioctl;
static d_read_t zero_read;

These forward declarations establish function signatures before the cdevsw structures that reference them. Each declaration uses a typedef from <sys/conf.h>:

d_write_t: Write operation signature: int (*d_write)(struct cdev *dev, struct uio *uio, int ioflag)

d_ioctl_t: Ioctl operation signature: int (*d_ioctl)(struct cdev *dev, u_long cmd, caddr_t data, int fflag, struct thread *td)

d_read_t: Read operation signature: int (*d_read)(struct cdev *dev, struct uio *uio, int ioflag)

Notice the declarations needed:

  • Two write functions (full_write, null_write) because /dev/full and /dev/null behave differently on write
  • Two ioctl functions (null_ioctl, zero_ioctl) because they handle slightly different ioctl commands
  • One read function (zero_read) used by both /dev/zero and /dev/full (both return zeros)

Notably absent: no d_open_t or d_close_t declarations. These devices don't need open or close handlers, they have no per-file-descriptor state to initialize or clean up. Opening /dev/null requires no setup; closing it requires no teardown. The kernel's default handlers suffice.

Also absent: /dev/null doesn't need a read function. The cdevsw for /dev/null uses (d_read_t *)nullop, a kernel-provided function that immediately returns success with zero bytes read, signaling end-of-file.

Design Simplicity

This header section's simplicity reflects the devices' conceptual simplicity. Three device pointers and five function declarations are sufficient because these devices:

  • Maintain no state (no per-device data structures needed)
  • Perform trivial operations (reads return zeros, writes succeed or fail immediately)
  • Don't interact with complex kernel subsystems

This minimal complexity makes null.c an ideal starting point for understanding character device drivers, the concepts are clear without excessive infrastructure.

2) cdevsw: wiring system calls to your driver functions

58: static struct cdevsw full_cdevsw = {
59: 	.d_version =	D_VERSION,
60: 	.d_read =	zero_read,
61: 	.d_write =	full_write,
62: 	.d_ioctl =	zero_ioctl,
63: 	.d_name =	"full",
64: };
66: static struct cdevsw null_cdevsw = {
67: 	.d_version =	D_VERSION,
68: 	.d_read =	(d_read_t *)nullop,
69: 	.d_write =	null_write,
70: 	.d_ioctl =	null_ioctl,
71: 	.d_name =	"null",
72: };
74: static struct cdevsw zero_cdevsw = {
75: 	.d_version =	D_VERSION,
76: 	.d_read =	zero_read,
77: 	.d_write =	null_write,
78: 	.d_ioctl =	zero_ioctl,
79: 	.d_name =	"zero",
80: 	.d_flags =	D_MMAP_ANON,
81: };
Character Device Switch Tables

The cdevsw (character device switch) structures are the kernel's dispatch tables for character device operations. Each structure maps system call operation, read(2), write(2) and ioctl(2) to driver-specific functions. The null driver defines three separate cdevsw structures, one for each device, allowing them to share some implementations while differing where their behavior diverges.

The /dev/full Device Switch
static struct cdevsw full_cdevsw = {
    .d_version =    D_VERSION,
    .d_read =       zero_read,
    .d_write =      full_write,
    .d_ioctl =      zero_ioctl,
    .d_name =       "full",
};

The /dev/full device simulates a filesystem that's completely full. Its cdevsw establishes this behavior through function pointer assignments:

d_version = D_VERSION: Every cdevsw must specify this version constant, ensuring binary compatibility between the driver and the kernel's device framework. The kernel checks this field during device creation and rejects mismatched versions.

d_read = zero_read: Read operations return an infinite stream of zero bytes, identical to /dev/zero. The same function serves both devices since their read behavior is identical.

d_write = full_write: Write operations always fail with ENOSPC (no space left on device), simulating a full disk. This is the distinguishing characteristic of /dev/full.

d_ioctl = zero_ioctl: The ioctl handler processes control operations like FIONBIO (non-blocking mode) and FIOASYNC (async I/O).

d_name = "full": The device name string appears in kernel messages and identifies the device in system accounting. This string determines the device node name created in /dev.

Fields not specified (like d_open, d_close, d_poll) default to NULL, causing the kernel to use built-in default handlers. For simple devices with no state, these defaults are sufficient.

The /dev/null Device Switch
static struct cdevsw null_cdevsw = {
    .d_version =    D_VERSION,
    .d_read =       (d_read_t *)nullop,
    .d_write =      null_write,
    .d_ioctl =      null_ioctl,
    .d_name =       "null",
};

The /dev/null device is the classic Unix bit bucket that discards writes and immediately signals end-of-file on reads:

d_read = (d_read_t \*)nullop: The nullop function is a kernel-provided no-op that returns zero immediately, signaling end-of-file to the application. Any read(2) on /dev/null returns 0 bytes without blocking. The cast to (d_read_t *) satisfies the type checker, nullop has a generic signature that works for any device operation.

d_write = null_write: Write operations succeed immediately, updating the uio structure to indicate all data was consumed, but the data is discarded. Applications see successful writes, but nothing is stored or transmitted.

d_ioctl = null_ioctl: A separate ioctl handler from /dev/full and /dev/zero because /dev/null supports the DIOCSKERNELDUMP ioctl for kernel crash dump configuration. This ioctl removes all kernel dump devices, effectively disabling crash dumps.

The /dev/zero Device Switch
static struct cdevsw zero_cdevsw = {
    .d_version =    D_VERSION,
    .d_read =       zero_read,
    .d_write =      null_write,
    .d_ioctl =      zero_ioctl,
    .d_name =       "zero",
    .d_flags =      D_MMAP_ANON,
};

The /dev/zero device provides an infinite source of zero bytes and discards writes:

d_read = zero_read: Returns zero bytes as fast as the application can read them. The implementation uses a pre-zeroed kernel memory region for efficiency rather than zeroing a buffer on every read.

d_write = null_write: Shares the write implementation with /dev/null, writes are discarded, allowing applications to measure write performance or discard unwanted output.

d_ioctl = zero_ioctl: Handles standard terminal ioctls like FIONBIO and FIOASYNC, rejecting others with ENOIOCTL.

d_flags = D_MMAP_ANON: This flag enables a critical optimization for memory mapping. When an application calls mmap(2) on /dev/zero, the kernel doesn't actually map the device; instead, it creates anonymous memory (memory not backed by any file or device). This behavior allows applications to use /dev/zero for portable anonymous memory allocation:

void *mem = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE, 
                 open("/dev/zero", O_RDWR), 0);

The D_MMAP_ANON flag tells the kernel to substitute anonymous memory allocation for the mapping, providing zero-filled pages without involving the device driver. This pattern was historically important before MAP_ANON was standardized, and remains supported for compatibility.

Function Sharing and Reuse

Notice the strategic sharing of implementations:

zero_read: Used by both /dev/full and /dev/zero because both devices return zeros when read.

null_write: Used by both /dev/null and /dev/zero because both discard written data.

zero_ioctl: Used by both /dev/full and /dev/zero because they support the same basic ioctl operations.

null_ioctl: Used only by /dev/null because it alone supports kernel dump configuration.

full_write: Used only by /dev/full because it alone fails writes with ENOSPC.

This sharing eliminates code duplication while preserving behavioral differences. The three devices require only five functions total (two write, two ioctl, one read) despite having three complete cdevsw structures.

The cdevsw as Contract

Each cdevsw structure defines a contract between the kernel and the driver. When userspace calls read(fd, buf, len) on /dev/zero:

  1. The kernel identifies the file descriptor's associated device
  2. Looks up the cdevsw for that device (zero_cdevsw)
  3. Calls the function pointer in d_read (zero_read)
  4. Returns the result to userspace

This indirection through function pointers enables polymorphism in C: the same system call interface invokes different implementations based on which device is accessed. The kernel doesn't need to know the specifics of /dev/zero, it just calls the function registered in the switch table.

Static Storage and Encapsulation

All three cdevsw structures use static storage class, limiting their visibility to this source file. The structures are referenced by address during device creation (make_dev_credf(&full_cdevsw, ...)), but external code cannot modify them. This encapsulation ensures behavioral consistency, no other driver can accidentally override /dev/null's write behavior.

3) Write paths: "discard everything" vs "no space left"

83: /* ARGSUSED */
84: static int
85: full_write(struct cdev *dev __unused, struct uio *uio __unused, int flags __unused)
86: {
87:
88: 	return (ENOSPC);
89: }
91: /* ARGSUSED */
92: static int
93: null_write(struct cdev *dev __unused, struct uio *uio, int flags __unused)
94: {
95: 	uio->uio_resid = 0;
96:
97: 	return (0);
98: }
Write Operation Implementations

The write functions demonstrate two contrasting approaches to handling output: unconditional failure and unconditional success with data discard. These simple implementations reveal fundamental patterns in device driver design.

The /dev/full Write: Simulating No Space
/* ARGSUSED */
static int
full_write(struct cdev *dev __unused, struct uio *uio __unused, int flags __unused)
{

    return (ENOSPC);
}

The /dev/full write function is deliberately trivial, it immediately returns ENOSPC (error number 28, "No space left on device") without examining its arguments or performing any operations.

Function signature: All d_write_t functions receive three parameters:

  • struct cdev *dev - the device being written to
  • struct uio *uio - describes the user's write buffer (location, size, offset)
  • int flags - I/O flags like O_NONBLOCK or O_DIRECT

The __unused attribute: Each parameter is marked __unused, a compiler directive indicating the parameter is intentionally ignored. This prevents "unused parameter" warnings during compilation. The directive documents that the function's behavior doesn't depend on which device instance is accessed, what data the user provided, or what flags were specified.

The /\* ARGSUSED \*/ comment: This traditional lint directive predates modern compiler attributes, serving the same purpose for older static analysis tools. It signals "arguments unused by design, not by mistake." The comment and __unused attributes are redundant but maintain compatibility with multiple code analysis tools.

Return value ENOSPC: This errno value tells userspace that the write failed because no space remains. To the application, /dev/full appears as a storage device that's completely full. This behavior is useful for testing how programs handle write failures, many applications don't properly check write return values, leading to silent data loss when disks fill. Testing against /dev/full exposes these bugs.

Why not process the uio?: Normal device drivers would call uiomove() to consume data from the user's buffer and update uio->uio_resid to reflect bytes written. The /dev/full driver skips this entirely because it's simulating a failure condition where no bytes were written. Returning an error without touching uio signals "zero bytes written, operation failed."

Applications see:

ssize_t n = write(fd, buf, 100);
// n == -1, errno == ENOSPC
The /dev/null and /dev/zero Write: Discarding Data
/* ARGSUSED */
static int
null_write(struct cdev *dev __unused, struct uio *uio, int flags __unused)
{
    uio->uio_resid = 0;

    return (0);
}

The null_write function (used by both /dev/null and /dev/zero) implements the classic bit bucket behavior: accept all data, discard everything, report success.

Marking data consumed: The single operation uio->uio_resid = 0 is the key to this function's behavior. The uio_resid field tracks how many bytes remain to be transferred. Setting it to zero tells the kernel "all requested bytes were successfully written," even though the driver never actually accessed the user's buffer.

Why this works: The kernel's write system call implementation checks uio_resid to determine how many bytes were written. If a driver sets uio_resid to zero and returns success (0), the kernel calculates:

bytes_written = original_resid - current_resid
              = original_resid - 0
              = original_resid  // all bytes written

The application's write(2) call returns the full byte count requested, indicating complete success.

No actual data transfer: Unlike normal drivers that call uiomove() to copy data from userspace, null_write never accesses the user's buffer. The data remains in userspace, untouched and unread. The driver simply lies about having consumed it. This is safe because the data is being discarded anyway, there's no point copying data into kernel memory just to throw it away.

Return value zero: Returning 0 signals success. Combined with uio_resid = 0, this creates the illusion of a perfectly functioning write operation that accepted all data.

Why uio isn't marked __unused: The function modifies uio->uio_resid, so the parameter is actively used. Only dev and flags are ignored and marked __unused.

Applications see:

ssize_t n = write(fd, buf, 100);
// n == 100, all bytes "written"
Performance Implications

The null_write optimization is significant for performance-sensitive applications. Consider a program redirecting gigabytes of unwanted output to /dev/null:

% ./generate_logs > /dev/null

If the driver actually copied data from userspace (via uiomove()), this would waste CPU cycles and memory bandwidth copying data that's immediately discarded. By setting uio_resid = 0 without touching the buffer, the driver eliminates this overhead entirely. The application fills its userspace buffer, calls write(2), the kernel immediately returns success, and the CPU never accesses the buffer content.

Contrast in Error Handling Philosophy

These two functions embody different design philosophies:

full_write: Simulate a failure condition for testing purposes. Real error, immediate rejection.

null_write: Maximize performance by doing nothing. Fake success, instant return.

Both are correct implementations of their respective device semantics. The simplicity of these functions, five lines combined, demonstrates that device drivers don't need to be complex to be useful. Sometimes the best implementation is the one that does the least work necessary to satisfy the interface contract.

Interface Contract Satisfaction

Both functions satisfy the d_write_t contract:

  • Accept a device pointer, uio descriptor, and flags
  • Return 0 for success or errno for failure
  • Update uio_resid to reflect bytes consumed (or leave it unchanged if none were consumed)

The cdevsw function pointers enforce this contract at compile time. Any function not matching the d_write_t signature would cause a compilation error when assigned to d_write in the cdevsw structure. This type safety ensures all write implementations follow the same calling convention, allowing the kernel to invoke them uniformly.

4) IOCTLs: accept a tiny, sensible subset; reject the rest

100: /* ARGSUSED */
101: static int
102: null_ioctl(struct cdev *dev __unused, u_long cmd, caddr_t data __unused,
103:     int flags __unused, struct thread *td)
104: {
105: 	struct diocskerneldump_arg kda;
106: 	int error;
107:
108: 	error = 0;
109: 	switch (cmd) {
110: 	case DIOCSKERNELDUMP:
111: 		bzero(&kda, sizeof(kda));
112: 		kda.kda_index = KDA_REMOVE_ALL;
113: 		error = dumper_remove(NULL, &kda);
114: 		break;
115: 	case FIONBIO:
116: 		break;
117: 	case FIOASYNC:
118: 		if (*(int *)data != 0)
119: 			error = EINVAL;
120: 		break;
121: 	default:
122: 		error = ENOIOCTL;
123: 	}
124: 	return (error);
125: }
127: /* ARGSUSED */
128: static int
129: zero_ioctl(struct cdev *dev __unused, u_long cmd, caddr_t data __unused,
130: 	   int flags __unused, struct thread *td)
131: {
132: 	int error;
133: 	error = 0;
134:
135: 	switch (cmd) {
136: 	case FIONBIO:
137: 		break;
138: 	case FIOASYNC:
139: 		if (*(int *)data != 0)
140: 			error = EINVAL;
141: 		break;
142: 	default:
143: 		error = ENOIOCTL;
144: 	}
145: 	return (error);
146: }
Ioctl Operation Implementations

The ioctl (I/O control) functions handle device-specific control operations beyond standard read and write. While read and write transfer data, ioctl performs configuration, status queries, and special operations. The null driver implements two ioctl handlers that differ only in their support for kernel crash dump configuration.

The /dev/null Ioctl Handler
/* ARGSUSED */
static int
null_ioctl(struct cdev *dev __unused, u_long cmd, caddr_t data __unused,
    int flags __unused, struct thread *td)
{
    struct diocskerneldump_arg kda;
    int error;

    error = 0;
    switch (cmd) {
    case DIOCSKERNELDUMP:
        bzero(&kda, sizeof(kda));
        kda.kda_index = KDA_REMOVE_ALL;
        error = dumper_remove(NULL, &kda);
        break;
    case FIONBIO:
        break;
    case FIOASYNC:
        if (*(int *)data != 0)
            error = EINVAL;
        break;
    default:
        error = ENOIOCTL;
    }
    return (error);
}

Function signature: The d_ioctl_t type requires five parameters:

  • struct cdev *dev - the device being controlled
  • u_long cmd - the ioctl command number
  • caddr_t data - pointer to command-specific data (in/out parameter)
  • int flags - file descriptor flags from the original open(2)
  • struct thread *td - the calling thread (for credential checks, signal delivery)

Most parameters are marked __unused because this simple device doesn't need per-instance state (dev), doesn't examine most command data (data for some commands), and doesn't check flags or thread credentials.

Command dispatch via switch: The function uses a switch statement to handle different ioctl commands, each identified by a unique constant. The pattern switch (cmd) followed by case labels is universal in ioctl handlers.

Kernel Dump Configuration: DIOCSKERNELDUMP
case DIOCSKERNELDUMP:
    bzero(&kda, sizeof(kda));
    kda.kda_index = KDA_REMOVE_ALL;
    error = dumper_remove(NULL, &kda);
    break;

This case handles kernel crash dump configuration. When the system crashes, the kernel writes diagnostic information (memory contents, register state, stack traces) to a designated dump device, typically a disk partition or swap space. The DIOCSKERNELDUMP ioctl configures this dump device.

Why /dev/null for crash dumps?: The idiom ioctl(fd, DIOCSKERNELDUMP, &args) on /dev/null serves a specific purpose: disabling all kernel dumps. By directing dumps to the bit bucket, administrators can prevent crash dump collection entirely (useful for security-sensitive systems or when disk space is constrained).

Preparing the argument structure: bzero(&kda, sizeof(kda)) zeros the diocskerneldump_arg structure, ensuring all fields start in a known state. This is defensive programming, uninitialized stack memory might contain random values that could confuse the dump subsystem.

Removing all dump devices: kda.kda_index = KDA_REMOVE_ALL sets the magic index value indicating "remove all configured dump devices, don't add a new one." The constant KDA_REMOVE_ALL signals special semantics distinct from specifying a particular device index.

Calling the dump subsystem: dumper_remove(NULL, &kda) invokes the kernel's dump management function. The first parameter (NULL) indicates no specific device is being removed, the kda_index field provides the directive. The function returns 0 on success or an error code on failure.

Non-Blocking I/O: FIONBIO
case FIONBIO:
    break;

The FIONBIO ioctl sets or clears non-blocking mode on the file descriptor. The data parameter points to an integer: non-zero enables non-blocking mode, zero disables it.

Why do nothing?: The handler simply breaks without performing any operation. This is correct because /dev/null operations never block:

  • Reads immediately return end-of-file (0 bytes)
  • Writes immediately succeed (all bytes consumed)

There's no condition under which a /dev/null operation would block, so non-blocking mode is meaningless. The ioctl succeeds (returns 0) but has no effect, maintaining compatibility with applications that configure non-blocking mode without causing errors.

Asynchronous I/O: FIOASYNC
case FIOASYNC:
    if (*(int *)data != 0)
        error = EINVAL;
    break;

The FIOASYNC ioctl enables or disables asynchronous I/O notification. When enabled, the kernel sends SIGIO signals to the process when the device becomes readable or writable.

Parameter interpretation: The data parameter points to an integer. Zero means disable async I/O, non-zero means enable it.

Rejecting async I/O: The handler checks if the application is trying to enable async I/O (*(int *)data != 0). If so, it returns EINVAL (invalid argument), rejecting the request.

Why reject async I/O?: Asynchronous I/O only makes sense for devices that can block. Applications enable it to receive notification when a previously-blocked operation can proceed. Since /dev/null never blocks, async I/O is meaningless and potentially confusing. Rather than silently accepting a nonsensical configuration, the driver returns an error, alerting the application to the logical error.

Disabling async I/O succeeds: If *(int *)data == 0, the condition is false, error remains 0, and the function returns success. Disabling a feature that was never enabled is harmless.

Unknown Commands: Default Case
default:
    error = ENOIOCTL;

Any ioctl command not explicitly handled falls through to the default case, which returns ENOIOCTL. This special error code means "this ioctl is not supported by this device." It's distinct from EINVAL (invalid argument to a supported ioctl) and ENOTTY (inappropriate ioctl for device type, used for terminal operations on non-terminals).

The kernel's ioctl infrastructure may retry the operation through other layers when receiving ENOIOCTL, allowing generic handlers to process common commands.

The /dev/zero Ioctl Handler
/* ARGSUSED */
static int
zero_ioctl(struct cdev *dev __unused, u_long cmd, caddr_t data __unused,
       int flags __unused, struct thread *td)
{
    int error;
    error = 0;

    switch (cmd) {
    case FIONBIO:
        break;
    case FIOASYNC:
        if (*(int *)data != 0)
            error = EINVAL;
        break;
    default:
        error = ENOIOCTL;
    }
    return (error);
}

The zero_ioctl function is nearly identical to null_ioctl, with one critical difference: it doesn't handle DIOCSKERNELDUMP. The /dev/zero device cannot serve as a kernel dump device (dumps must be stored, not discarded), so the ioctl isn't supported.

The FIONBIO and FIOASYNC handling is identical, these are standard file descriptor ioctls that all character devices should handle consistently, even if the operations are no-ops.

Ioctl Design Patterns

Several patterns emerge from these implementations:

Explicit handling of no-op operations: Rather than returning errors for meaningless operations like FIONBIO on /dev/null, the handlers succeed silently. This maintains compatibility with applications that unconditionally configure file descriptors without checking device type.

Rejecting nonsensical configurations: Async I/O makes no sense for these devices, so the handlers return errors when applications try to enable it. This is a design choice, the handlers could succeed silently, but explicit errors help developers identify logic bugs.

Standard error codes: EINVAL for invalid arguments, ENOIOCTL for unsupported commands. These conventions allow userspace to distinguish different failure modes.

Minimal data validation: The handlers cast data pointers and dereference them without extensive validation. This is safe because the kernel's ioctl infrastructure has already verified the pointer is accessible to userspace. Device drivers trust the kernel's argument validation.

Why Two Ioctl Functions?

The /dev/full device uses zero_ioctl (not shown using it in the cdevsw, but by examining the structures we saw earlier). Only /dev/null needs the special dump device handling, so only null_ioctl includes the DIOCSKERNELDUMP case. This separation avoids polluting the simpler zero_ioctl with functionality that only one device needs.

The code reuse strategy: write the minimal handler (zero_ioctl), then extend it for special cases (null_ioctl). This keeps each function focused and avoids conditional logic like "if this is /dev/null, handle dumps."

5) Read path: a simple loop driven by uio->uio_resid

148: /* ARGSUSED */
149: static int
150: zero_read(struct cdev *dev __unused, struct uio *uio, int flags __unused)
151: {
152: 	void *zbuf;
153: 	ssize_t len;
154: 	int error = 0;
155:
156: 	KASSERT(uio->uio_rw == UIO_READ,
157: 	    ("Can't be in %s for write", __func__));
158: 	zbuf = __DECONST(void *, zero_region);
159: 	while (uio->uio_resid > 0 && error == 0) {
160: 		len = uio->uio_resid;
161: 		if (len > ZERO_REGION_SIZE)
162: 			len = ZERO_REGION_SIZE;
163: 		error = uiomove(zbuf, len, uio);
164: 	}
165:
166: 	return (error);
167: }
Read Operation: Infinite Zeros

The zero_read function provides an endless stream of zero bytes, serving both /dev/zero and /dev/full. This implementation demonstrates efficient data transfer using a pre-allocated kernel buffer and the uiomove() function for kernel-to-userspace copying.

Function Structure and Safety Assertion
/* ARGSUSED */
static int
zero_read(struct cdev *dev __unused, struct uio *uio, int flags __unused)
{
    void *zbuf;
    ssize_t len;
    int error = 0;

    KASSERT(uio->uio_rw == UIO_READ,
        ("Can't be in %s for write", __func__));

Function signature: The d_read_t type requires the same parameters as d_write_t:

  • struct cdev *dev - the device being read (unused, marked __unused)
  • struct uio *uio - describes the user's read buffer and tracks transfer progress
  • int flags - I/O flags (unused for this simple device)

Local variables: The function needs minimal state:

  • zbuf - pointer to the source of zero bytes
  • len - number of bytes to transfer in each iteration
  • error - tracks success or failure of transfer operations

Sanity check with KASSERT: The assertion verifies that uio->uio_rw equals UIO_READ, confirming this is actually a read operation. The uio structure serves both read and write operations, with the uio_rw field indicating direction.

This assertion catches programming errors during development. If somehow a write operation called this read function, the assertion would trigger a kernel panic with the message "Can't be in zero_read for write." The __func__ preprocessor macro expands to the current function name, making the error message precise.

In production kernels compiled without debugging, KASSERT compiles to nothing, eliminating any runtime overhead. This pattern, defensive checks during development, zero cost in production, is common throughout FreeBSD's kernel.

Accessing the Pre-Zeroed Buffer
zbuf = __DECONST(void *, zero_region);

The zero_region variable (declared in <machine/vmparam.h>) points to a region of kernel virtual memory that's permanently filled with zeros. The kernel allocates this region during boot and never modifies it, providing an efficient source of zero bytes without repeatedly zeroing temporary buffers.

The __DECONST macro: The zero_region is declared const to prevent accidental modification. However, uiomove() expects a non-const pointer because it's a generic function that handles both read (kernel to user) and write (user to kernel) operations. The __DECONST macro removes the const qualifier, essentially telling the compiler "I know this is const, but I need to pass it to a function expecting non-const. Trust me, it won't be modified."

This is safe because uiomove() with a read-direction uio only copies data from the kernel buffer to userspace, it never writes to the buffer. The const-cast is a necessary workaround for C's type system limitations.

The Transfer Loop
while (uio->uio_resid > 0 && error == 0) {
    len = uio->uio_resid;
    if (len > ZERO_REGION_SIZE)
        len = ZERO_REGION_SIZE;
    error = uiomove(zbuf, len, uio);
}

return (error);

The loop continues until either the entire read request is satisfied (uio->uio_resid == 0) or an error occurs (error != 0).

Checking remaining bytes: uio->uio_resid tracks how many bytes the application requested but haven't yet been transferred. Initially, this equals the original read size. After each successful transfer, uiomove() decrements it.

Limiting transfer size: The code calculates how many bytes to transfer in this iteration:

len = uio->uio_resid;
if (len > ZERO_REGION_SIZE)
    len = ZERO_REGION_SIZE;

If the remaining request exceeds the zero region's size, the transfer is capped at ZERO_REGION_SIZE. This limitation exists because the kernel only pre-allocated a finite zero buffer. Typical values for ZERO_REGION_SIZE are 64KB or 256KB, large enough for efficiency but small enough not to waste kernel memory.

Why this matters: If an application reads 1MB from /dev/zero, the loop executes multiple times, each iteration transferring up to ZERO_REGION_SIZE bytes. The same zero buffer is reused for each iteration, eliminating the need to allocate and zero 1MB of kernel memory.

Performing the transfer: uiomove(zbuf, len, uio) is the kernel's workhorse function for moving data between kernel and userspace. It:

  1. Copies len bytes from zbuf (kernel memory) to the user's buffer (described by uio)
  2. Updates uio->uio_resid by subtracting len (fewer bytes remaining)
  3. Advances uio->uio_offset by len (file position moves forward, though meaningless for /dev/zero)
  4. Returns 0 on success or an error code on failure (typically EFAULT if the user's buffer address is invalid)

If uiomove() returns an error, the loop exits immediately and returns the error to the caller. The application receives whatever data was successfully transferred before the error occurred.

Loop termination: The loop exits when:

  • Success: uio->uio_resid reaches zero, meaning all requested bytes were transferred
  • Error: uiomove() failed, typically because the user's buffer pointer was invalid or the process received a signal
Infinite Stream Semantics

Notice what's missing from this function: no end-of-file check. Most file reads eventually return 0 bytes, signaling EOF. The /dev/zero read function never does this, it always transfers the full requested amount (or fails with an error).

From userspace perspective:

char buf[4096];
ssize_t n = read(zero_fd, buf, sizeof(buf));
// n always equals 4096, never 0 (unless error)

This infinite stream property makes /dev/zero useful for:

  • Allocating zero-initialized memory (pre-MAP_ANON)
  • Generating arbitrary amounts of zero bytes for testing
  • Overwriting disk blocks with zeros for data sanitization
Performance Optimization

The pre-allocated zero_region is a significant optimization. Consider the alternative implementation:

// Inefficient approach
char zeros[4096];
bzero(zeros, sizeof(zeros));
while (uio->uio_resid > 0) {
    len = min(uio->uio_resid, sizeof(zeros));
    error = uiomove(zeros, len, uio);
}

This approach would zero a buffer on every function call, wasting CPU cycles. The production implementation zeros the buffer once at boot and reuses it forever, eliminating repeated zeroing overhead.

For applications reading gigabytes from /dev/zero, this optimization eliminates billions of store instructions, making reads essentially free (bounded only by memory copy speed).

Shared Between Devices

Recall from the cdevsw structures that both /dev/zero and /dev/full use zero_read. This sharing is correct because both devices should return zeros when read. The device identity (dev parameter) is ignored because the behavior is identical regardless of which device is accessed.

This implementation demonstrates a key principle: when multiple devices share behavior, implement it once and reference it from multiple switch tables. Code reuse eliminates duplication and ensures consistent behavior across related devices.

Error Propagation

If uiomove() fails partway through a large read, the function returns the error immediately. The userspace read(2) system call sees a short read followed by an error on the next call. For example:

// Reading 128KB when process receives signal after 64KB
char buf[128 * 1024];
ssize_t n = read(zero_fd, buf, sizeof(buf));
// n might equal 65536 (successful partial read)
// errno unset (partial success)

n = read(zero_fd, buf, sizeof(buf));
// n equals -1, errno equals EINTR (interrupted system call)

This error handling is automatic, uiomove() detects signals and returns EINTR, which the read function propagates to userspace. The driver doesn't need explicit signal handling logic.

6) Module event: create device nodes on load, destroy on unload

169: /* ARGSUSED */
170: static int
171: null_modevent(module_t mod __unused, int type, void *data __unused)
172: {
173: 	switch(type) {
174: 	case MOD_LOAD:
175: 		if (bootverbose)
176: 			printf("null: <full device, null device, zero device>\n");
177: 		full_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &full_cdevsw, 0,
178: 		    NULL, UID_ROOT, GID_WHEEL, 0666, "full");
179: 		null_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &null_cdevsw, 0,
180: 		    NULL, UID_ROOT, GID_WHEEL, 0666, "null");
181: 		zero_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &zero_cdevsw, 0,
182: 		    NULL, UID_ROOT, GID_WHEEL, 0666, "zero");
183: 		break;
184:
185: 	case MOD_UNLOAD:
186: 		destroy_dev(full_dev);
187: 		destroy_dev(null_dev);
188: 		destroy_dev(zero_dev);
189: 		break;
190:
191: 	case MOD_SHUTDOWN:
192: 		break;
193:
194: 	default:
195: 		return (EOPNOTSUPP);
196: 	}
197:
198: 	return (0);
199: }
201: DEV_MODULE(null, null_modevent, NULL);
202: MODULE_VERSION(null, 1);
Module Lifecycle and Registration

The final section of the null driver handles module loading, unloading, and registration with the kernel's module system. This code executes when the module is loaded at boot or via kldload, and when it's unloaded via kldunload.

The Module Event Handler
/* ARGSUSED */
static int
null_modevent(module_t mod __unused, int type, void *data __unused)
{
    switch(type) {

Function signature: Module event handlers receive three parameters:

  • module_t mod - a handle to the module itself (unused here)
  • int type - the event type: MOD_LOAD, MOD_UNLOAD, MOD_SHUTDOWN, etc.
  • void *data - event-specific data (unused for this driver)

The function returns 0 for success or an errno value for failure. A failed MOD_LOAD prevents the module from loading; a failed MOD_UNLOAD keeps the module loaded.

Module Load: Creating Devices
case MOD_LOAD:
    if (bootverbose)
        printf("null: <full device, null device, zero device>\n");
    full_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &full_cdevsw, 0,
        NULL, UID_ROOT, GID_WHEEL, 0666, "full");
    null_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &null_cdevsw, 0,
        NULL, UID_ROOT, GID_WHEEL, 0666, "null");
    zero_dev = make_dev_credf(MAKEDEV_ETERNAL_KLD, &zero_cdevsw, 0,
        NULL, UID_ROOT, GID_WHEEL, 0666, "zero");
    break;

The MOD_LOAD case executes when the module is first loaded, either during boot or when an administrator runs kldload null.

Boot message: The if (bootverbose) check controls whether a message appears during boot. The bootverbose variable is set when the system boots with verbose output enabled (via boot loader configuration or kernel option). When true, the driver prints an informational message identifying the devices it provides.

This conditional prevents cluttering the boot output in normal operation while allowing administrators to see driver initialization during diagnostic boots. The message format follows FreeBSD convention: driver name, colon, angle-bracketed device list.

Device creation with make_dev_credf: This function creates character device nodes in /dev. Each call requires several parameters that control the device's properties:

MAKEDEV_ETERNAL_KLD: A flag indicating this device should persist until explicitly destroyed. The ETERNAL part means the device won't be automatically removed if all references are closed, and KLD indicates it's part of a kernel loadable module (as opposed to a statically compiled driver). This flag combination ensures the device nodes remain available as long as the module is loaded, regardless of whether any process has them open.

&full_cdevsw (and similarly for null/zero): Pointer to the character device switch table that defines the device's behavior. This connects the device node to the driver's function implementations.

0: The device unit number. Since these are singleton devices (only one /dev/null exists system-wide), unit 0 is used. Multi-instance devices like /dev/tty0, /dev/tty1 would use different unit numbers.

NULL: Credential pointer for permission checks. NULL means no special credentials are required beyond the standard file permissions.

UID_ROOT: The device file owner (root, UID 0). This determines who can change the device's permissions or delete it.

GID_WHEEL: The device file group (wheel, GID 0). The wheel group traditionally has administrative privileges.

0666: The permission mode in octal. This value (readable and writable by owner, group, and others) allows any process to open these devices. Breaking it down:

  • Owner (root): read (4) + write (2) = 6
  • Group (wheel): read (4) + write (2) = 6
  • Others: read (4) + write (2) = 6

Unlike typical files where world-writable permissions are dangerous, these devices are designed for universal access, any process should be able to write to /dev/null or read from /dev/zero.

"full" (and similarly "null", "zero"): The device name string. This creates /dev/full, /dev/null, and /dev/zero respectively. The make_dev_credf function automatically prepends /dev/ to the name.

Return value storage: Each make_dev_credf call returns a struct cdev * pointer stored in the global variables (full_dev, null_dev, zero_dev). These pointers are essential for the unload handler to remove the devices later.

Module Unload: Destroying Devices
case MOD_UNLOAD:
    destroy_dev(full_dev);
    destroy_dev(null_dev);
    destroy_dev(zero_dev);
    break;

The MOD_UNLOAD case executes when an administrator runs kldunload null to remove the module from the kernel. The module system only calls this handler if the module is eligible for unload (no other code references it).

Device destruction: The destroy_dev function removes a device node from /dev and deallocates associated kernel structures. Each call uses the pointer saved during MOD_LOAD.

The function handles several cleanup tasks automatically:

  • Removes the /dev entry so new opens fail with ENOENT
  • Waits for existing opens to close (or forcibly closes them)
  • Frees the struct cdev and related memory
  • Unregisters the device from kernel accounting

The order of destruction doesn't matter for these independent devices. If they had dependencies (like one device routing operations to another), destruction order would be critical.

What if devices are open?: By default, destroy_dev blocks until all file descriptors referring to the device are closed. An administrator attempting kldunload null while a process has /dev/null open would experience a delay. In practice, /dev/null is frequently open (many daemons redirect output there), so unloading this module is rare.

System Shutdown: No-Op
case MOD_SHUTDOWN:
    break;

The MOD_SHUTDOWN event fires during system shutdown or reboot. The handler does nothing because these devices don't need special shutdown handling:

  • No hardware to disable or park in a safe state
  • No data buffers to flush
  • No network connections to close gracefully

Simply breaking (falling through to return (0)) indicates successful shutdown handling. The devices will cease to exist when the kernel halts; no explicit cleanup is necessary.

Unsupported Events: Error Return
default:
    return (EOPNOTSUPP);

The default case catches any module event types not explicitly handled. Returning EOPNOTSUPP (operation not supported) informs the module system that this event isn't applicable to this driver.

Other possible event types include MOD_QUIESCE (prepare for unload, used to check if unload is safe) and driver-specific custom events. This driver doesn't support those, so the default handler rejects them.

Why not panic?: An unknown event type isn't a driver bug, the kernel might introduce new event types in future versions. Returning an error is more robust than crashing.

Success Return
return (0);

After handling any supported event (load, unload, shutdown), the function returns 0 to signal success. This allows the module operation to complete normally.

Module Registration Macros
DEV_MODULE(null, null_modevent, NULL);
MODULE_VERSION(null, 1);

These macros register the module with the kernel's module system.

DEV_MODULE(null, null_modevent, NULL): Declares a device driver module with three arguments:

  • null - the module name, appearing in kldstat output and used with kldload/kldunload commands
  • null_modevent - pointer to the event handler function
  • NULL - optional additional data passed to the event handler (unused here)

The macro expands to generate data structures that the kernel's linker and module loader recognize. When the module loads, the kernel calls null_modevent with type = MOD_LOAD. When unloading, it calls with type = MOD_UNLOAD.

MODULE_VERSION(null, 1): Declares the module's version number. The arguments are:

  • null - module name (must match DEV_MODULE)
  • 1 - version number (integer)

Version numbers enable dependency checking. If another module depended on this one, it could specify "requires null version >= 1" to ensure compatibility. For this simple driver, versioning is primarily documentation, it signals that this is the first (and likely only) version of the interface.

Complete Module Lifecycle

The complete lifecycle for this driver:

At boot or kldload null:

  1. Kernel loads the module into memory
  2. Processes DEV_MODULE registration
  3. Calls null_modevent(mod, MOD_LOAD, NULL)
  4. Handler creates /dev/full, /dev/null, /dev/zero
  5. Devices are now available to userspace

During operation:

  • Applications open, read, write, ioctl the devices
  • The cdevsw function pointers route operations to driver code
  • No module events occur during normal operation

At kldunload null:

  1. Kernel checks if unload is safe (no dependencies)
  2. Calls null_modevent(mod, MOD_UNLOAD, NULL)
  3. Handler destroys the three devices
  4. Kernel removes module from memory
  5. Attempts to open /dev/null now fail with ENOENT

At system shutdown:

  1. Kernel calls null_modevent(mod, MOD_SHUTDOWN, NULL)
  2. Handler does nothing (returns success)
  3. System continues shutdown sequence
  4. Module ceases to exist when kernel halts

This lifecycle management, explicit load and unload handlers, registration macros, is the standard pattern for all FreeBSD kernel modules. Device drivers, filesystem implementations, network protocols, and system call additions all use the same module event mechanism.

Interactive Exercises - /dev/null, /dev/zero, and /dev/full

Goal: Confirm you can read a real driver, map user-visible behavior to kernel code, and explain the minimal character device skeleton.

A) Map System Calls to cdevsw (Warm-up)
  1. Which function handles writes to /dev/full, and what errno value does it return? Quote the function name and the return statement. What does this error code mean to userspace applications? Hint: Lines 85-89.

  2. Which function handles reads from both /dev/zero and /dev/full? Quote the relevant .d_read assignments from both cdevsw structures. Why is it correct for both devices to share the same read handler, what behavior do they have in common? Hint: Lines 58-64 (full_cdevsw), 74-81 (zero_cdevsw), 150-167 (zero_read).

  3. Create a table listing each cdevsw's name and its read/write function assignments:

cdevsw.d_name.d_read.d_write
full_cdevsw???
null_cdevsw???
zero_cdevsw???
Quote the line ranges used for each structure. *Hint:* Lines 58-81.
B) Read Path Reasoning with uiomove()
  1. Locate the KASSERT that verifies this is a read operation. Quote the line and explain what would happen if this assertion failed. What does the __func__ macro provide in the error message? Hint: Lines 156-157.

  2. Explain the role of uio->uio_resid in the while loop condition. What does this field represent, and how does it change during the loop? Quote the while condition. Hint: Line 159.

  3. Why does the code limit each transfer to ZERO_REGION_SIZE rather than copying all requested bytes at once? What would be the problem with transferring 1MB in a single uiomove() call? Quote the if statement that implements this limit. Hint: Lines 161-162.

  4. The code references two pre-allocated kernel resources: zero_region (a pointer) and ZERO_REGION_SIZE (a constant). Quote the lines where each is used. Then use grep to find where ZERO_REGION_SIZE is defined:

% grep -r "define.*ZERO_REGION_SIZE" /usr/src/sys/amd64/include/
What is the value on your system? *Hint:* Lines 158 (zero_region), 161-162 (ZERO_REGION_SIZE).
C) Write Path Contrasts
  1. Compare the implementations of null_write (lines 93-98) and full_write (lines 85-89). For each function, answer:
  • What does it do with uio->uio_resid?

  • What value does it return?

  • What will a userspace write(2) call return?

    Now verify from userspace:

# This should succeed, reporting bytes written:
% dd if=/dev/zero of=/dev/null bs=64k count=8 2>&1 | grep copied

# This should fail with "No space left on device":
% dd if=/dev/zero of=/dev/full bs=1k count=1 2>&1 | grep -i "space"
For each test, identify which write handler was called and quote the specific line that caused the observed behavior.
D) Minimal ioctl Shape
  1. Create a comparison table of ioctl handling. For null_ioctl (lines 102-125) and zero_ioctl (lines 129-146), fill in:
Commandnull_ioctl behaviorzero_ioctl behavior
DIOCSKERNELDUMP??
FIONBIO??
FIOASYNC??
Unknown command??
For each entry, quote the relevant case statement and explain the behavior.

2. The FIOASYNC case has special handling when enabling async I/O. Quote the conditional check and explain why these devices reject async I/O mode. Hint: Lines 117-120, 138-141.

E) Device Node Lifecycle
  1. During MOD_LOAD, three device nodes are created via make_dev_credf(). For each call (lines 177-182), identify:
  • The device name (what appears in /dev/)

  • The cdevsw pointer (which function table)

  • The permission mode (what does 0666 mean?)

  • The owner and group (UID_ROOT, GID_WHEEL)

    Quote one complete make_dev_credf() call and label each parameter.

  1. During MOD_UNLOAD, destroy_dev() is called three times (lines 186-188). Quote these lines and explain:
  • Why do we need the global pointers (full_dev, null_dev, zero_dev)?
  • What would happen if we forgot to call destroy_dev() during unload?
  • Why must the MOD_LOAD and MOD_UNLOAD operations be symmetric?
F) Trace from Userspace
  1. Verify that /dev/zero produces zeros and /dev/null consumes data:
% dd if=/dev/zero bs=1k count=1 2>/dev/null | hexdump -C | head -n 2
# Expected: all zeros (00 00 00 00...)

% printf 'test data' | dd of=/dev/null 2>/dev/null ; echo "Exit code: $?"
# Expected: Exit code: 0
Explain these results by tracing through:
  • zero_read: Which lines produce the zeros? How does the loop work?

  • null_write: Which line makes the write "succeed"? What happens to the data?

    Quote the specific lines responsible for each behavior.

  1. Read from /dev/full and examine what you get:
% dd if=/dev/full bs=16 count=1 2>/dev/null | hexdump -C

​ What output do you see? Look at the full_cdevsw structure (lines 58-64) which .d_read function does it use?

​ Why does /dev/full return zeros instead of an error?

G) Module Lifecycle
  1. Look at the null_modevent switch statement (lines 173-196). List all the case labels and what each one does. Which cases actually perform work versus just returning success?

  2. Find the two macros at the end of the file (lines 201-202) that register this module. Quote them and explain:

  • What does DEV_MODULE do?
  • What does MODULE_VERSION do?
  • Why do both use the name "null"?
  1. The MAKEDEV_ETERNAL_KLD flag is used in all three make_dev_credf() calls. What does this flag mean, and why is it appropriate for these devices? Hint: Lines 177-182, consider what happens if a process has /dev/null open when you try to unload the module.

Stretch (thought experiment)

Stretch 1: Examine null_write (lines 93-98). The function does two things: sets uio->uio_resid = 0 and returns 0.

Thought experiment: If we changed line 98 to return (EIO); but kept line 95 unchanged, what would happen?

  • What would the kernel think about bytes written?

  • What would write(2) return to userspace?

  • What would errno be set to?

    Quote the lines involved and explain the interaction between uio_resid and the return value.

Stretch 2: In zero_read (lines 159-164), the code limits each transfer to ZERO_REGION_SIZE. Quote lines 161-162 where this limit is enforced.

Thought experiment: Suppose we removed this check and always did:
len = uio->uio_resid;  // No limit!
error = uiomove(zbuf, len, uio);
If a user requests 10MB from `/dev/zero`:
  • What invariant would make this "work" (not crash)?
  • What resource constraint would we be ignoring?
  • Why does the current code use a pre-allocated buffer of limited size?

Hint: The zero_region is only ZERO_REGION_SIZE bytes. What happens if we try to copy more than that from this fixed-size buffer?

Bridge to the next tour

Before moving on: if you can match each user-visible behavior to specific line ranges in null.c, you've internalized the character-device skeleton we'll keep meeting. Next we'll look at led(4), which remains small but adds a user-visible control surface (writes that change state). Keep watching for three things: how the device node is created, how operations are routed, and how the driver declines unsupported actions cleanly.

Tour 2 - A tiny write-only control surface with timers: led(4)

Open the file:

% cd /usr/src/sys/dev/led
% less led.c

In one file we get a practical pattern for write-driven device control backed by a timer and per-device state. You'll see: a per-LED softc, global bookkeeping, a periodic callout that advances blink patterns, a parser that converts human-friendly commands into compact sequences, a write(2) entry point, and minimal create/destroy helpers.

1.0) Includes

12: #include <sys/cdefs.h>
13: #include <sys/param.h>
14: #include <sys/conf.h>
15: #include <sys/ctype.h>
16: #include <sys/kernel.h>
17: #include <sys/limits.h>
18: #include <sys/lock.h>
19: #include <sys/malloc.h>
20: #include <sys/mutex.h>
21: #include <sys/queue.h>
22: #include <sys/sbuf.h>
23: #include <sys/sx.h>
24: #include <sys/systm.h>
25: #include <sys/uio.h>
27: #include <dev/led/led.h>
Headers and Subsystem Interface

The LED driver begins with kernel headers and a crucial subsystem header that establishes its role as an infrastructure component used by other drivers. Unlike the null driver which stands alone, the LED driver provides services to hardware drivers that need to expose status indicators.

Standard Kernel Headers
#include <sys/cdefs.h>
#include <sys/param.h>
#include <sys/conf.h>
#include <sys/ctype.h>
#include <sys/kernel.h>
#include <sys/limits.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/mutex.h>
#include <sys/queue.h>
#include <sys/sbuf.h>
#include <sys/sx.h>
#include <sys/systm.h>
#include <sys/uio.h>

These headers provide the infrastructure for a stateful, timer-driven device driver:

<sys/cdefs.h>, <sys/param.h>, <sys/systm.h>: Fundamental system definitions identical to those in null.c. Every kernel source file begins with these.

<sys/conf.h>: Character device configuration, providing cdevsw and make_dev(). The LED driver uses these to create device nodes dynamically as hardware drivers register LEDs.

<sys/ctype.h>: Character classification functions like isdigit(). The LED driver parses user-supplied strings to control blink patterns, requiring character type checking.

<sys/kernel.h>: Kernel initialization infrastructure. This driver uses SYSINIT to perform one-time initialization during boot, setting up global resources before any LEDs are registered.

<sys/limits.h>: System limits like INT_MAX. The LED driver uses this to configure its unit number allocator with maximum range.

<sys/lock.h> and <sys/mutex.h>: Locking primitives for protecting shared data structures. The driver uses a mutex to protect the LED list and blinker state from concurrent access by timer callbacks and user writes.

<sys/queue.h>: BSD linked list macros (LIST_HEAD, LIST_FOREACH, LIST_INSERT_HEAD, LIST_REMOVE). The driver maintains a global list of all registered LEDs, allowing timer callbacks to iterate and update each one.

<sys/sbuf.h>: Safe string buffer manipulation. The driver uses sbuf to build blink pattern strings from user input, avoiding fixed-size buffer overflows. String buffers automatically grow as needed and provide bounds checking.

<sys/sx.h>: Shared/exclusive locks (reader/writer locks). The driver uses an sx lock to protect device creation and destruction, allowing concurrent reads of the LED list while serializing structural modifications.

<sys/uio.h>: User I/O operations. Like null.c, this driver needs struct uio and uiomove() to transfer data between kernel and userspace.

<sys/malloc.h>: Kernel memory allocation. Unlike null.c which had no dynamic memory, the LED driver allocates per-LED state structures and duplicates strings for LED names and blink patterns.

Subsystem Interface Header
#include <dev/led/led.h>

This header defines the LED subsystem's public API, the interface that other kernel drivers use to register and control LEDs. While the specific contents aren't shown in this source file, typical declarations would include:

led_t typedef: A function pointer type for LED control callbacks. Hardware drivers provide a function matching this signature that turns their physical LED on or off:

typedef void led_t(void *priv, int onoff);

Public functions: The API that hardware drivers call:

  • led_create() - register a new LED, creating a /dev/led/name device node
  • led_create_state() - register an LED with initial state
  • led_destroy() - unregister an LED when hardware is removed
  • led_set() - programmatically control an LED from kernel code

Example usage by a hardware driver:

// In a disk driver's attach function:
struct cdev *led_dev;
led_dev = led_create(disk_led_control, sc, "disk0");

// Later, in the LED control callback:
static void
disk_led_control(void *priv, int onoff)
{
    struct disk_softc *sc = priv;
    if (onoff)
        /* Turn on LED via hardware register write */
    else
        /* Turn off LED via hardware register write */
}
Architectural Role

The header organization reveals the LED driver's dual nature:

As a character device driver: It includes standard device driver headers (<sys/conf.h>, <sys/uio.h>) to create /dev/led/* nodes that userspace can write to.

As a subsystem: It includes <dev/led/led.h> to export an API that other drivers consume. Hardware drivers don't manipulate /dev/led/* directly, they call led_create() and provide callbacks.

This pattern, a driver that both exposes user-facing devices and provides kernel-facing APIs, appears throughout FreeBSD. Examples include:

  • The devctl driver: creates /dev/devctl while providing devctl_notify() for kernel event reporting
  • The random driver: creates /dev/random while providing read_random() for kernel consumers
  • The mem driver: creates /dev/mem while providing direct memory access functions

The LED driver sits between hardware-specific drivers (which know how to control physical LEDs) and userspace (which wants to control LED patterns). It provides abstraction, hardware drivers implement simple on/off control; the LED subsystem handles complex blink patterns, timing, and user interface.

1.1) Per-LED State (softc)

30: struct ledsc {
31: 	LIST_ENTRY(ledsc)	list;
32: 	char			*name;
33: 	void			*private;
34: 	int			unit;
35: 	led_t			*func;
36: 	struct cdev *dev;
37: 	struct sbuf		*spec;
38: 	char			*str;
39: 	char			*ptr;
40: 	int			count;
41: 	time_t			last_second;
42: };
Per-LED State Structure

The ledsc structure (LED softc, following FreeBSD naming convention for "software context") contains all per-device state for one registered LED. Unlike the null driver which had no per-device state, the LED driver creates one of these structures for each LED registered in the system, tracking both device identity and current blink pattern execution state.

Structure Definition and Fields
struct ledsc {
    LIST_ENTRY(ledsc)   list;
    char                *name;
    void                *private;
    int                 unit;
    led_t               *func;
    struct cdev *dev;
    struct sbuf         *spec;
    char                *str;
    char                *ptr;
    int                 count;
    time_t              last_second;
};

LIST_ENTRY(ledsc) list: Linkage for the global LED list. The LIST_ENTRY macro (from <sys/queue.h>) embeds forward and backward pointers directly in the structure, allowing this LED to be part of a doubly-linked list without separate allocation. The global led_list chains together all registered LEDs, enabling timer callbacks to iterate and update each one.

char *name: The LED's name string, duplicated from the hardware driver's registration call. This name appears in the device path /dev/led/name and identifies the LED in kernel API calls to led_set(). Examples: "disk0", "power", "heartbeat". The string is dynamically allocated and must be freed when the LED is destroyed.

void *private: An opaque pointer passed back to the hardware driver's control function. The hardware driver provides this during led_create(), typically pointing to its own device context structure. When the LED subsystem needs to turn the LED on or off, it calls the hardware driver's callback with this pointer, allowing the driver to locate the relevant hardware registers.

int unit: A unique unit number for this LED, used to construct the device minor number. Allocated from a unit number pool to prevent conflicts when multiple LEDs are registered. Unlike the null driver's fixed unit numbers (0 for all devices), the LED driver dynamically assigns units as LEDs are created.

led_t *func: Function pointer to the hardware driver's LED control callback. This function has the signature void (*led_t)(void *priv, int onoff) where priv is the private pointer above and onoff is non-zero for "on", zero for "off". This callback is the hardware-specific part, it knows how to manipulate GPIO pins, write to hardware registers, or send USB control transfers to actually light or extinguish the LED.

struct cdev *dev: Pointer to the character device structure representing /dev/led/name. This is what make_dev() returns during LED creation. The device node allows userspace to write blink patterns to the LED. The pointer is needed later to call destroy_dev() when the LED is removed.

The remaining fields track blink pattern execution by the timer callback:

struct sbuf *spec: The parsed blink specification string buffer. When a user writes a pattern like "f" (flash) or "m...---..." (morse code), the parser converts it to a sequence of timing codes and stores it in this sbuf. The string persists as long as the pattern is active, allowing the timer to repeatedly traverse it.

char *str: Pointer to the beginning of the pattern string (extracted from spec via sbuf_data()). This is where pattern execution starts and where it loops back after reaching the end. If NULL, no pattern is active and the LED is in static on/off state.

char *ptr: The current position in the pattern string. The timer callback examines this character to determine what to do next (turn LED on/off, delay for N tenths of a second). After processing each character, ptr advances. When it reaches the string terminator, it wraps back to str for continuous repetition.

int count: A countdown timer for delay characters. Pattern codes like 'a' through 'j' mean "wait for 1-10 tenths of a second". When the timer encounters such a code, it sets count to the delay value and decrements it on each timer tick. While count > 0, the timer skips pattern advancement, implementing the delay.

time_t last_second: Timestamp tracking the last second boundary, used for 'U'/'u' pattern codes that toggle the LED once per second (creating a 1Hz heartbeat pattern). The timer compares time_second (kernel's current time) to this field, only updating the LED when the second changes. This prevents multiple updates within the same second if the timer fires faster than 1Hz.

Memory Management and Lifecycle

Several fields point to dynamically allocated memory:

  • name - allocated with strdup(name, M_LED) during creation
  • spec - created with sbuf_new_auto() when a pattern is set
  • The structure itself is allocated with malloc(sizeof *sc, M_LED, M_WAITOK | M_ZERO)

All must be freed during led_destroy() to prevent memory leaks. The structure's lifetime spans from led_create() to led_destroy(), potentially lasting the entire system uptime if the hardware driver never unregisters the LED.

Relationship to Device Node

The ledsc structure and the /dev/led/name device node are bidirectionally linked:

struct cdev (device node)
     ->  si_drv1
struct ledsc
     ->  dev
struct cdev (same device node)

This bidirectional linkage allows:

  • The write handler to find the LED state: sc = dev->si_drv1
  • The destroy function to remove the device: destroy_dev(sc->dev)
Contrast with null.c

The null driver had no equivalent structure because its devices were stateless. The LED driver needs per-device state because:

Identity: Each LED has a unique name and device node

Callback: Each LED has hardware-specific control logic

Pattern state: Each LED may be executing a different blink pattern at different positions

Timing: Each LED's delay counters and timestamps are independent

This per-device state structure is typical of drivers managing multiple instances of similar hardware. The pattern is universal: one structure per managed entity, containing identity, configuration, and operational state.

1.2) Globals

44: static struct unrhdr *led_unit;
45: static struct mtx led_mtx;
46: static struct sx led_sx;
47: static LIST_HEAD(, ledsc) led_list = LIST_HEAD_INITIALIZER(led_list);
48: static struct callout led_ch;
49: static int blinkers = 0;
51: static MALLOC_DEFINE(M_LED, "LED", "LED driver");
Global State and Synchronization

The LED driver maintains several global variables that coordinate all registered LEDs. These globals provide resource allocation, synchronization, timer management, and a registry of active LEDs, infrastructure shared across all LED instances.

Resource Allocator
static struct unrhdr *led_unit;

The unit number handler allocates unique unit numbers for LED devices. Each registered LED receives a distinct unit number used to construct its device minor number, ensuring /dev/led/disk0 and /dev/led/power don't collide even if created simultaneously.

The unrhdr (unit number handler) provides thread-safe allocation and deallocation of integers from a range. During driver initialization, new_unrhdr(0, INT_MAX, NULL) creates a pool spanning the entire positive integer range. When hardware drivers call led_create(), the code calls alloc_unr(led_unit) to obtain the next available unit. When an LED is destroyed, free_unr(led_unit, sc->unit) returns the unit to the pool for reuse.

This dynamic allocation contrasts with the null driver's fixed units (always 0). The LED driver must handle arbitrary numbers of LEDs appearing and disappearing as hardware is added and removed.

Synchronization Primitives
static struct mtx led_mtx;
static struct sx led_sx;

The driver uses two locks with distinct purposes:

led_mtx (mutex): Protects the LED list and blink pattern execution state. This lock guards:

  • The led_list linked list as LEDs are added and removed
  • The blinkers counter tracking active patterns
  • Individual ledsc fields modified by timer callbacks (ptr, count, last_second)

The mutex uses MTX_DEF semantics (default, can sleep while held). Timer callbacks acquire this mutex briefly to examine and update LED states. Write operations acquire it to install new blink patterns.

led_sx (shared/exclusive lock): Protects device creation and destruction. This lock serializes:

  • Calls to make_dev() and destroy_dev()
  • Unit number allocation and deallocation
  • String duplication for LED names

Shared/exclusive locks allow multiple readers (threads examining which LEDs exist) to proceed concurrently while writers (threads creating or destroying LEDs) have exclusive access. For the LED driver, creation and destruction are infrequent operations that benefit from being fully serialized with an exclusive lock.

Why two locks?: The separation enables concurrency. Timer callbacks need fast access to LED states protected by the mutex, while device creation/destruction requires the heavier sx lock. If a single lock protected everything, timer callbacks would block waiting for slow device operations. The split allows timers to run freely while device management proceeds independently.

LED Registry
static LIST_HEAD(, ledsc) led_list = LIST_HEAD_INITIALIZER(led_list);

The global LED list maintains all registered LEDs in a doubly-linked list. The LIST_HEAD macro (from <sys/queue.h>) declares a list head structure and LIST_HEAD_INITIALIZER sets its initial empty state.

This list serves multiple purposes:

Timer iteration: The timer callback walks the list with LIST_FOREACH(sc, &led_list, list) to update each active LED's blink pattern. Without this registry, the timer wouldn't know which LEDs exist.

Name lookup: The led_set() function searches the list to find an LED by name when kernel code wants to control an LED programmatically.

Cleanup verification: When the last LED is removed (LIST_EMPTY(&led_list)), the driver can stop the timer callback, conserving CPU cycles when no LEDs need servicing.

The list is protected by led_mtx since both timer callbacks and device operations modify it.

Timer Callback Infrastructure
static struct callout led_ch;
static int blinkers = 0;

led_ch (callout): A kernel timer that fires periodically to advance blink patterns. When any LED has an active pattern, the timer is scheduled to fire 10 times per second (hz / 10, where hz is timer ticks per second, typically 1000). Each timer firing calls led_timeout() which walks the LED list and updates pattern states.

The callout remains idle (not scheduled) when no LEDs are blinking, conserving resources. The first LED to receive a blink pattern schedules the timer with callout_reset(&led_ch, hz / 10, led_timeout, NULL). Subsequent patterns don't reschedule, the single timer services all LEDs.

blinkers counter: Tracks how many LEDs currently have active blink patterns. When a pattern is assigned, blinkers++. When a pattern completes or is replaced with static on/off, blinkers--. When the counter reaches zero, the timer callback doesn't reschedule itself, stopping the periodic wakeups.

This reference counting is critical for performance. Without it, the timer would fire continuously even with no work to do. The counter gates timer activity: schedule when transitioning 0 -> 1, stop when transitioning 1 -> 0.

Memory Type Declaration
static MALLOC_DEFINE(M_LED, "LED", "LED driver");

The MALLOC_DEFINE macro registers a memory allocation type for the LED subsystem. All LED-related allocations specify M_LED:

  • malloc(sizeof *sc, M_LED, ...) for softc structures
  • strdup(name, M_LED) for LED name strings

Memory types enable kernel accounting and debugging:

  • vmstat -m shows memory consumption per type
  • Developers can track whether the LED driver is leaking memory
  • Kernel memory debuggers can filter allocations by type

The three arguments are:

  1. M_LED - the C identifier used in malloc() calls
  2. "LED" - short name appearing in accounting output
  3. "LED driver" - descriptive text for documentation
Initialization Coordination

These globals are initialized in a specific sequence during boot:

  1. Static initialization: led_list and blinkers get compile-time initial values
  2. led_drvinit() (via SYSINIT): Allocates led_unit, initializes led_mtx and led_sx, prepares the callout
  3. Runtime: Hardware drivers call led_create() to register LEDs, incrementing blinkers and populating led_list

The static storage class on all globals limits their visibility to this source file. No other kernel code can directly access these variables, all interactions go through the public API (led_create(), led_destroy(), led_set()). This encapsulation prevents external code from corrupting the LED subsystem's internal state.

Contrast with null.c

The null driver had minimal global state: three device pointers for its fixed devices. The LED driver's globals reflect its dynamic nature:

  • Resource allocation: Unit numbers for arbitrary device counts
  • Concurrency: Two locks for different access patterns
  • Registry: A list tracking all active LEDs
  • Scheduling: Timer infrastructure for pattern execution
  • Accounting: Memory type for allocation tracking

This richer global infrastructure supports the LED driver's role as a subsystem managing multiple dynamically-created devices with time-based behaviors, rather than a simple driver exposing fixed stateless devices.

2) The heartbeat: led_timeout() advances the pattern

This periodic callout walks all LEDs and advances each one's pattern. Patterns are encoded in ASCII, so the parser and state machine stay tiny.

54: static void
55: led_timeout(void *p)
56: {
57: 	struct ledsc	*sc;
58: 	LIST_FOREACH(sc, &led_list, list) {
59: 		if (sc->ptr == NULL)
60: 			continue;
61: 		if (sc->count > 0) {
62: 			sc->count--;
63: 			continue;
64: 		}
65: 		if (*sc->ptr == '.') {
66: 			sc->ptr = NULL;
67: 			blinkers--;
68: 			continue;
69: 		} else if (*sc->ptr == 'U' || *sc->ptr == 'u') {
70: 			if (sc->last_second == time_second)
71: 				continue;
72: 			sc->last_second = time_second;
73: 			sc->func(sc->private, *sc->ptr == 'U');
74: 		} else if (*sc->ptr >= 'a' && *sc->ptr <= 'j') {
75: 			sc->func(sc->private, 0);
76: 			sc->count = (*sc->ptr & 0xf) - 1;
77: 		} else if (*sc->ptr >= 'A' && *sc->ptr <= 'J') {
78: 			sc->func(sc->private, 1);
79: 			sc->count = (*sc->ptr & 0xf) - 1;
80: 		}
81: 		sc->ptr++;
82: 		if (*sc->ptr == '\0')
83: 			sc->ptr = sc->str;
84: 	}
85: 	if (blinkers > 0)
86: 		callout_reset(&led_ch, hz / 10, led_timeout, p);
87: }
Timer Callback: Pattern Execution Engine

The led_timeout function is the heart of the LED subsystem's blink pattern execution. Called by the kernel's timer subsystem approximately 10 times per second, it walks the global LED list and advances each active pattern by one step, interpreting a simple pattern language to control LED timing and state.

Function Entry and List Iteration
static void
led_timeout(void *p)
{
    struct ledsc    *sc;
    LIST_FOREACH(sc, &led_list, list) {

Function signature: Timer callbacks receive a single void * argument passed during timer scheduling. This driver doesn't use the argument (it's typically NULL), relying instead on the global LED list to find work.

Iterating all LEDs: The LIST_FOREACH macro walks the doubly-linked led_list, visiting each registered LED. This allows one timer to service multiple independent LEDs, each potentially executing a different blink pattern at a different position. The iteration is safe because the list is protected by led_mtx (the callout was initialized with this mutex via callout_init_mtx()).

Skipping Inactive LEDs
if (sc->ptr == NULL)
    continue;

The ptr field indicates whether this LED has an active blink pattern. When NULL, the LED is in static on/off state and needs no timer processing. The callback skips to the next LED immediately.

This check is the first filter: LEDs without patterns don't consume CPU time. Only LEDs actively blinking require processing on each timer tick.

Handling Delay States
if (sc->count > 0) {
    sc->count--;
    continue;
}

The count field implements delays in blink patterns. When the pattern interpreter encounters timing codes like 'a' through 'j' (meaning "wait 1-10 tenths of a second"), it sets count to the delay value. On subsequent timer ticks, the callback decrements count without advancing through the pattern.

Example: Pattern code 'c' (wait 3 tenths of a second) sets count = 2 (the value is 1 less than the intended delay). The next two timer ticks decrement count to 1, then 0. On the third tick, count is already 0, so this check fails and pattern execution proceeds.

This mechanism creates precise timing: at 10Hz, each count represents 0.1 seconds. Pattern 'AcAc' produces: LED on, wait 0.3s, LED on again, wait 0.3s, repeat.

Pattern Termination
if (*sc->ptr == '.') {
    sc->ptr = NULL;
    blinkers--;
    continue;
}

The period character '.' signals pattern end. Unlike most patterns which loop indefinitely, some user specifications include an explicit terminator. When encountered:

Stop pattern execution: Setting ptr = NULL marks this LED as inactive. Future timer ticks will skip it at the first check.

Decrement blinker count: Reducing blinkers tracks that one fewer LED needs servicing. When this counter reaches zero (checked at function end), the timer stops scheduling itself.

Skip remaining code: The continue jumps to the next LED in the list. Pattern advancement and looping (lines 81-83) don't execute for terminated patterns.

Heartbeat Pattern: Second-Based Toggle
else if (*sc->ptr == 'U' || *sc->ptr == 'u') {
    if (sc->last_second == time_second)
        continue;
    sc->last_second = time_second;
    sc->func(sc->private, *sc->ptr == 'U');
}

The 'U' and 'u' codes create once-per-second toggles, useful for heartbeat indicators showing the system is alive.

Second boundary detection: The kernel variable time_second holds the current Unix timestamp. Comparing it to last_second detects when a second boundary has passed. If the values match, we're still within the same second and the callback skips processing with continue.

Recording the transition: sc->last_second = time_second remembers this second, preventing multiple updates if the timer fires multiple times per second (which it does, 10 times per second).

Updating the LED: The callback invokes the hardware driver's control function. The second parameter determines LED state:

  • *sc->ptr == 'U' -> true (1) -> LED on
  • *sc->ptr == 'u' -> false (0) -> LED off

Pattern "Uu" creates a 1Hz toggle: on for one second, off for one second. Pattern "U" alone keeps the LED on but only updates at second boundaries, which may be used for synchronization purposes.

Off Delay Pattern
else if (*sc->ptr >= 'a' && *sc->ptr <= 'j') {
    sc->func(sc->private, 0);
    sc->count = (*sc->ptr & 0xf) - 1;
}

Lowercase letters 'a' through 'j' mean "turn LED off and wait." This combines two operations: immediate state change plus delay setup.

Turning off the LED: sc->func(sc->private, 0) calls the hardware driver's control function with the off command (second parameter is 0).

Computing the delay: The expression (*sc->ptr & 0xf) - 1 extracts the delay duration from the character code. In ASCII:

  • 'a' is 0x61, 0x61 & 0x0f = 1, minus 1 = 0 (wait 0.1 seconds)
  • 'b' is 0x62, 0x62 & 0x0f = 2, minus 1 = 1 (wait 0.2 seconds)
  • 'c' is 0x63, 0x63 & 0x0f = 3, minus 1 = 2 (wait 0.3 seconds)
  • ...
  • 'j' is 0x6A, 0x6A & 0x0f = 10, minus 1 = 9 (wait 1.0 seconds)

The mask & 0xf isolates the low 4 bits, which conveniently map 'a'-'j' to values 1-10. Subtracting 1 converts to the countdown format (timer ticks remaining minus one).

On Delay Pattern
else if (*sc->ptr >= 'A' && *sc->ptr <= 'J') {
    sc->func(sc->private, 1);
    sc->count = (*sc->ptr & 0xf) - 1;
}

Uppercase letters 'A' through 'J' work identically to lowercase, except the LED is turned on instead of off. The delay calculation is the same:

  • 'A' -> on for 0.1 seconds
  • 'B' -> on for 0.2 seconds
  • ...
  • 'J' -> on for 1.0 seconds

Pattern "AaBb" creates: on 0.1s, off 0.1s, on 0.2s, off 0.2s, repeat. Pattern "Aa" is a standard fast blink at ~2.5Hz.

Pattern Advancement and Looping
sc->ptr++;
if (*sc->ptr == '\0')
    sc->ptr = sc->str;

After processing the current pattern character (whether it was a heartbeat code or a delay code), the pointer advances to the next character.

Detecting pattern end: If the new position is the null terminator, the pattern has been fully executed once. Rather than stopping (as the '.' terminator does), most patterns loop indefinitely.

Looping back: sc->ptr = sc->str resets to the pattern's beginning. The next timer tick will start over from the first character, creating a repeating cycle.

Example: Pattern "AjBj" becomes on-1s, on-1s, repeat continuously. The pattern never stops unless replaced by a new write or the LED is destroyed.

Timer Rescheduling
if (blinkers > 0)
    callout_reset(&led_ch, hz / 10, led_timeout, p);
}

After processing all LEDs, the callback decides whether to reschedule itself. If any LED still has an active pattern (blinkers > 0), the timer is reset to fire again in hz / 10 ticks (0.1 seconds).

Self-perpetuating timer: Each invocation schedules the next invocation, creating a continuous loop as long as work remains. This is different from a periodic timer that fires unconditionally, the LED timer is work-driven.

Automatic shutdown: When the last active pattern terminates (either via '.' or being replaced with static state), blinkers drops to 0 and the timer doesn't reschedule. The callback exits and won't run again until a new pattern activates, conserving CPU when all LEDs are static.

The hz variable: The kernel constant hz represents timer ticks per second (typically 1000 on modern systems). Dividing by 10 gives the delay in ticks for one-tenth of a second, matching the pattern language's resolution.

Pattern Language Summary

The timer interprets a simple language embedded in pattern strings:

CodeMeaningDuration
'a'-'j'LED off0.1-1.0 seconds
'A'-'J'LED on0.1-1.0 seconds
'U'LED onAt second boundary
'u'LED offAt second boundary
'.'End pattern-

Example patterns and their effects:

  • "Aa" -> blink at ~2.5Hz (0.1s on, 0.1s off)
  • "AjAj" -> slow blink at 0.5Hz (1s on, 1s off)
  • "AaAaBjBj" -> fast double blink, long pause
  • "U" -> steady on, synced to seconds
  • "Uu" -> 1Hz toggle

This compact encoding allows complex blink behaviors from short strings, all interpreted by this one timer callback serving all LEDs in the system.

3) Apply a new state/pattern: led_state()

Given a compiled pattern (sbuf) or a simple on/off flag, this function updates the softc and starts or stops the periodic timer.

88: static int
89: led_state(struct ledsc *sc, struct sbuf **sb, int state)
90: {
91: 	struct sbuf *sb2 = NULL;
93: 	sb2 = sc->spec;
94: 	sc->spec = *sb;
95: 	if (*sb != NULL) {
96: 		if (sc->str != NULL)
97: 			free(sc->str, M_LED);
98: 		sc->str = strdup(sbuf_data(*sb), M_LED);
99: 		if (sc->ptr == NULL)
100: 			blinkers++;
101: 		sc->ptr = sc->str;
102: 	} else {
103: 		sc->str = NULL;
104: 		if (sc->ptr != NULL)
105: 			blinkers--;
106: 		sc->ptr = NULL;
107: 		sc->func(sc->private, state);
108: 	}
109: 	sc->count = 0;
110: 	*sb = sb2;
111: 	return(0);
112: }
LED State Management: Installing Patterns

The led_state function installs a new blink pattern or static state for an LED. It handles the transition between different LED modes, managing memory for pattern strings, updating the blinker counter for timer control, and invoking hardware callbacks when needed. This function is the central state change coordinator called by both the write handler and the kernel API.

Function Signature and Pattern Swap
static int
led_state(struct ledsc *sc, struct sbuf **sb, int state)
{
    struct sbuf *sb2 = NULL;

    sb2 = sc->spec;
    sc->spec = *sb;

Parameters: The function receives three values:

  • sc - the LED whose state is being changed
  • sb - pointer to a pointer to a string buffer containing the new pattern (or NULL for static state)
  • state - the desired static state (0 or 1) if no pattern is provided

The double pointer pattern: The sb parameter is struct sbuf **, allowing the function to swap buffers with the caller. The function takes ownership of the caller's buffer and returns the old buffer for cleanup. This swap avoids copying pattern strings and ensures proper memory management.

Preserving the old pattern: sb2 = sc->spec saves the current pattern buffer before installing the new one. At function end, this old buffer is returned to the caller via *sb = sb2. The caller becomes responsible for freeing it with sbuf_delete().

if (*sb != NULL) {
    if (sc->str != NULL)
        free(sc->str, M_LED);
    sc->str = strdup(sbuf_data(*sb), M_LED);
    if (sc->ptr == NULL)
        blinkers++;
    sc->ptr = sc->str;

When the caller provides a pattern (non-NULL sb), the function activates pattern mode.

Freeing the old string: If sc->str is non-NULL, a previous pattern's string exists and must be freed. The free(sc->str, M_LED) call releases this memory back to the kernel heap. The M_LED tag matches the allocation type used during strdup(), maintaining accounting consistency.

Duplicating the new pattern: sbuf_data(*sb) extracts the null-terminated string from the string buffer, and strdup(name, M_LED) allocates memory and copies it. The pattern string must persist because the timer callback will traverse it repeatedly, the string buffer itself may be deleted by the caller, so a separate copy is needed.

Activating the timer: The check if (sc->ptr == NULL) detects whether this LED was previously inactive. If so, incrementing blinkers++ records that one more LED now needs timer servicing. The timer callback checks this counter at the end of each run; transitioning from 0 to 1 causes the timer to be rescheduled.

Starting pattern execution: sc->ptr = sc->str sets the pattern position to the beginning. On the next timer tick, led_timeout will process this LED's first pattern character.

Why not start the timer here?: The timer might already be running if other LEDs have active patterns. The blinkers counter tracks this: if it was already non-zero, the timer is already scheduled and will process this LED on its next tick. Only when blinkers transitions from 0 to 1 (detected in the write handler or led_set()) does the timer need explicit scheduling.

Installing Static State
} else {
    sc->str = NULL;
    if (sc->ptr != NULL)
        blinkers--;
    sc->ptr = NULL;
    sc->func(sc->private, state);
}

When the caller passes NULL for sb, the LED should be set to a static on/off state without blinking.

Clearing pattern state: Setting sc->str = NULL marks that no pattern string exists. This field is checked during cleanup to determine if memory needs freeing.

Deactivating the timer: The check if (sc->ptr != NULL) detects whether this LED was previously executing a pattern. If so, decrementing blinkers-- records that one fewer LED needs timer servicing. If this was the last active LED, blinkers drops to zero and the timer callback won't reschedule itself, stopping timer firings.

Setting to NULL: sc->ptr = NULL marks this LED as inactive. The timer callback's first check (if (sc->ptr == NULL) continue;) will skip this LED on all future ticks.

Immediate hardware update: sc->func(sc->private, state) invokes the hardware driver's control callback to set the LED to the requested state (0 for off, 1 for on). Unlike pattern mode where the timer controls LED changes, static mode requires immediate hardware update since no timer is involved.

Resetting Delay Counter
sc->count = 0;

The delay counter is zeroed regardless of which path was taken. If a pattern is being installed, starting with count = 0 ensures the first pattern character executes immediately without inherited delay. If static state is being set, zeroing is harmless since the field isn't used when ptr is NULL.

Returning the Old Pattern
*sb = sb2;
return(0);

The function returns the previous pattern buffer through the double pointer. The caller receives either:

  • NULL if no previous pattern existed
  • The old sbuf if a pattern is being replaced

The caller must check this returned value and call sbuf_delete() if non-NULL to free the buffer's memory. This ownership transfer pattern prevents memory leaks while avoiding unnecessary copying.

The return value 0 signals success. This function currently cannot fail, but returning an error code provides future extensibility if validation or resource allocation were added.

State Transition Examples

Setting initial pattern on inactive LED:

Before: sc->ptr = NULL, sc->spec = NULL, blinkers = 0
Call:   led_state(sc, &pattern_sb, 0)
After:  sc->ptr = sc->str, sc->spec = pattern_sb, blinkers = 1
        Old NULL returned to caller

Replacing one pattern with another:

Before: sc->ptr = old_str, sc->spec = old_sb, blinkers = 3
Call:   led_state(sc, &new_sb, 0)
After:  sc->ptr = new_str, sc->spec = new_sb, blinkers = 3
        Old old_sb returned to caller for deletion

Changing from pattern to static:

Before: sc->ptr = pattern_str, sc->spec = pattern_sb, blinkers = 1
Call:   led_state(sc, &NULL_ptr, 1)
After:  sc->ptr = NULL, sc->spec = NULL, blinkers = 0
        Hardware callback invoked with state=1 (on)
        Old pattern_sb returned to caller for deletion

Setting static state on already-static LED:

Before: sc->ptr = NULL, sc->spec = NULL, blinkers = 0
Call:   led_state(sc, &NULL_ptr, 0)
After:  sc->ptr = NULL, sc->spec = NULL, blinkers = 0
        Hardware callback invoked with state=0 (off)
        Old NULL returned to caller
Thread Safety Considerations

This function operates under the protection of led_mtx, acquired by the caller (write handler or led_set()). The mutex serializes state changes and protects:

  • The blinkers counter from races when multiple LEDs change state simultaneously
  • Individual LED fields (ptr, str, spec, count) from corruption
  • The relationship between blinkers count and actual active patterns

Without the mutex, two simultaneous writes could both increment blinkers, creating an incorrect count. Or one thread could free sc->str while the timer callback traverses it, causing a use-after-free crash.

Memory Management Discipline

The function demonstrates careful memory management:

Ownership transfer: The caller gives up the new sbuf and receives the old one, establishing clear ownership at all times.

Paired allocation/free: Every strdup() has a corresponding free(), preventing leaks even when patterns are repeatedly replaced.

NULL tolerance: All checks handle NULL pointers gracefully, allowing transitions to/from uninitialized state without special cases.

This discipline prevents the common pattern-replacement bug where updating state leaks the old pattern's memory.

4) Parse user commands into patterns: led_parse()

116: static int
117: led_parse(const char *s, struct sbuf **sb, int *state)
118: {
119: 	int i, error;
121: 	/* '0' or '1' means immediate steady off/on (no pattern). */
124: 	if (*s == '0' || *s == '1') {
125: 		*state = *s & 1;
126: 		return (0);
127: 	}
129: 	*state = 0;
130: 	*sb = sbuf_new_auto();
131: 	if (*sb == NULL)
132: 		return (ENOMEM);
133: 	switch(s[0]) {
135: 	case 'f': /* blink (default 100/100ms); 'f2' => 200/200ms */
136: 		if (s[1] >= '1' && s[1] <= '9') i = s[1] - '1'; else i = 0;
137: 		sbuf_printf(*sb, "%c%c", 'A' + i, 'a' + i);
138: 		break;
149: 	case 'd': /* "digits": flash out numbers 0..9 */
150: 		for(s++; *s; s++) {
151: 			if (!isdigit(*s)) continue;
152: 			i = *s - '0'; if (i == 0) i = 10;
156: 			for (; i > 1; i--) sbuf_cat(*sb, "Aa");
158: 			sbuf_cat(*sb, "Aj");
159: 		}
160: 		sbuf_cat(*sb, "jj");
161: 		break;
162: 	/* other small patterns elided for brevity in this excerpt ... */
187: 	case 'm': /* Morse: '.' -> short, '-' -> long, ' ' -> space */
188: 		for(s++; *s; s++) {
189: 			if (*s == '.') sbuf_cat(*sb, "aA");
190: 			else if (*s == '-') sbuf_cat(*sb, "aC");
191: 			else if (*s == ' ') sbuf_cat(*sb, "b");
192: 			else if (*s == '\n') sbuf_cat(*sb, "d");
193: 		}
198: 		sbuf_cat(*sb, "j");
199: 		break;
200: 	default:
201: 		sbuf_delete(*sb);
202: 		return (EINVAL);
203: 	}
204: 	error = sbuf_finish(*sb);
205: 	if (error != 0 || sbuf_len(*sb) == 0) {
206: 		*sb = NULL;
207: 		return (error);
208: 	}
209: 	return (0);
210: }
Pattern Parser: User Commands to Internal Codes

The led_parse function translates human-friendly pattern specifications from userspace into the internal timing code language that the timer callback interprets. This parser allows users to write simple commands like "f" for flashing or "m...---..." for morse code, which are expanded into sequences of timing codes like "AaAa" or "aAaAaCaCaC".

Function Signature and Quick Static Path
static int
led_parse(const char *s, struct sbuf **sb, int *state)
{
    int i, error;

    /* '0' or '1' means immediate steady off/on (no pattern). */
    if (*s == '0' || *s == '1') {
        *state = *s & 1;
        return (0);
    }

Parameters: The parser receives three values:

  • s - the user's input string from the write operation
  • sb - pointer to a pointer where the allocated string buffer will be returned
  • state - pointer where static state (0 or 1) is returned for non-pattern commands

Fast path for static state: Commands "0" and "1" request static off and on respectively. The expression *s & 1 extracts the low bit of the ASCII character: '0' (0x30) & 1 = 0, '1' (0x31) & 1 = 1. This value is written to *state and the function returns immediately without allocating a string buffer. The caller receives *sb = NULL (never assigned) and knows to use led_state() with static mode.

This fast path handles the most common case efficiently, toggling LEDs on or off without complex timing.

String Buffer Allocation
*state = 0;
*sb = sbuf_new_auto();
if (*sb == NULL)
    return (ENOMEM);

For pattern commands, a string buffer is needed to build the internal code sequence.

Default state: Setting *state = 0 provides a default in case the pattern is used, though this value is ignored when *sb is non-NULL.

Creating auto-sizing buffer: sbuf_new_auto() allocates a string buffer that automatically grows as data is appended. This eliminates the need to pre-calculate pattern length. Morse code for a long message might produce a very long code sequence, but the buffer expands as needed.

Handling allocation failure: If memory is exhausted, the function returns ENOMEM immediately. The caller checks this error and propagates it to userspace, where the write operation fails with "Cannot allocate memory."

Pattern Dispatch
switch(s[0]) {

The first character determines the pattern type. Each case implements a different pattern language, expanding user input into timing codes.

Flash Pattern: Simple Blinking
case 'f': /* blink (default 100/100ms); 'f2' => 200/200ms */
    if (s[1] >= '1' && s[1] <= '9') i = s[1] - '1'; else i = 0;
    sbuf_printf(*sb, "%c%c", 'A' + i, 'a' + i);
    break;

The 'f' command creates a symmetric blink pattern, with equal on and off times.

Speed modifier: If a digit follows 'f', it specifies the blink speed:

  • "f" or "f1" -> i = 0 -> pattern "Aa" -> 0.1s on, 0.1s off (~2.5Hz)
  • "f2" -> i = 1 -> pattern "Bb" -> 0.2s on, 0.2s off (~1.25Hz)
  • "f3" -> i = 2 -> pattern "Cc" -> 0.3s on, 0.3s off (~0.83Hz)
  • ...
  • "f9" -> i = 8 -> pattern "Ii" -> 0.9s on, 0.9s off (~0.56Hz)

Pattern construction: sbuf_printf(*sb, "%c%c", 'A' + i, 'a' + i) generates two characters: an uppercase letter (on state) followed by the corresponding lowercase letter (off state). Both use the same duration, creating symmetric blinking.

This simple two-character pattern repeats indefinitely, providing the classic blink indicator effect.

case 'd': /* "digits": flash out numbers 0..9 */
    for(s++; *s; s++) {
        if (!isdigit(*s)) continue;
        i = *s - '0'; if (i == 0) i = 10;
        for (; i > 1; i--) sbuf_cat(*sb, "Aa");
        sbuf_cat(*sb, "Aj");
    }
    sbuf_cat(*sb, "jj");
    break;

The 'd' command followed by digits creates patterns that visually "count" by flashing the LED.

Parsing digits: The loop advances past the 'd' command character (s++) and examines each subsequent character. Non-digits are silently skipped with continue, allowing "d1x2y3" to be interpreted as "d123".

Digit mapping: i = *s - '0' converts ASCII digit to numeric value. The special case if (i == 0) i = 10 treats zero as ten flashes rather than no flashes, making it distinguishable from the inter-digit pause.

Flash generation: For digit value i:

  • Generate i-1 quick flashes: for (; i > 1; i--) sbuf_cat(*sb, "Aa")
  • Add one longer flash: sbuf_cat(*sb, "Aj")

Example for digit 3: two quick flashes "AaAa" plus one 1-second flash "Aj".

Digit separation: After all digits are processed, sbuf_cat(*sb, "jj") appends a 2-second pause before the pattern repeats, clearly separating repetitions.

Result: Command "d12" generates pattern "AjAjAaAjjj" meaning: 1-second flash (digit 1), pause, quick flash then 1-second flash (digit 2), long pause, repeat. This allows reading numbers from LED blinks, valid for diagnostic codes.

Morse Code Pattern
case 'm': /* Morse: '.' -> short, '-' -> long, ' ' -> space */
    for(s++; *s; s++) {
        if (*s == '.') sbuf_cat(*sb, "aA");
        else if (*s == '-') sbuf_cat(*sb, "aC");
        else if (*s == ' ') sbuf_cat(*sb, "b");
        else if (*s == '\n') sbuf_cat(*sb, "d");
    }
    sbuf_cat(*sb, "j");
    break;

The 'm' command interprets the following characters as morse code elements.

Morse element mapping:

  • '.' (dot) -> "aA" -> 0.1s off, 0.1s on (short flash)
  • '-' (dash) -> "aC" -> 0.1s off, 0.3s on (long flash)
  • ' ' (space) -> "b" -> 0.2s off (word separator)
  • '\n' (newline) -> "d" -> 0.4s off (long pause between messages)

Standard morse timing: International morse code specifies:

  • Dot: 1 unit
  • Dash: 3 units
  • Gap between elements: 1 unit
  • Gap between letters: 3 units (approximated by the trailing pause in each letter)
  • Gap between words: 7 units (space character)

The pattern "aA" gives dot (1 unit off, 1 unit on), "aC" gives dash (1 unit off, 3 units on), with each unit being 0.1 seconds.

Pattern termination: sbuf_cat(*sb, "j") adds a 1-second pause before the message repeats, separating consecutive transmissions.

Example: Command "m... ---" (SOS) generates "aAaAaAaCaCaC" meaning: dot-dot-dot, dash-dash-dash, repeat.

Error Handling for Unknown Commands
default:
    sbuf_delete(*sb);
    return (EINVAL);
}

If the first character doesn't match any known pattern type, the function rejects the command. The allocated string buffer is freed with sbuf_delete() to prevent leaks, and EINVAL (invalid argument) is returned to indicate bad user input.

The write operation will fail and return -1 to userspace with errno = EINVAL, informing the user that their command syntax is incorrect.

Finalizing the Pattern String
error = sbuf_finish(*sb);
if (error != 0 || sbuf_len(*sb) == 0) {
    *sb = NULL;
    return (error);
}
return (0);

Sealing the buffer: sbuf_finish() finalizes the string buffer, null-terminating it and marking it read-only. After this call, the buffer's contents can be extracted with sbuf_data() but no further appends are allowed.

Validation: Two error conditions are checked:

  • error != 0 - sbuf_finish() failed, typically due to memory exhaustion during a buffer resize
  • sbuf_len(*sb) == 0 - the pattern is empty, which shouldn't happen but is checked defensively

If either condition holds, the buffer is unusable. Setting *sb = NULL signals the caller that no pattern was generated, and the error code is returned. The caller must not attempt to use or free the buffer; it was already freed by sbuf_finish() on error.

Success: Returning 0 with *sb pointing to a valid buffer signals successful parsing. The caller now owns the buffer and must eventually free it with sbuf_delete().

Pattern Language Summary

The parser supports several pattern languages, each optimized for different use cases:

CommandPurposeExampleResult
0, 1Static state"1"LED on steady
f[1-9]Symmetric blink"f"Fast blink
d[digits]Count by flashes"d42"4 flashes, 2 flashes
m[morse]Morse code"msos"... --- ...

This variety allows users to express intent naturally without memorizing timing code syntax. The write handler accepts simple commands; the parser expands them to precise timing sequences; the timer executes those sequences.

5.1) The write entry point: echo "cmd" > /dev/led/<name>

User space writes a command string to the device. The driver parses it and updates the LED's state. The shape is exactly what you'll write later: uiomove() the user buffer, parse, then update the softc under a lock.

212: static int
213: led_write(struct cdev *dev, struct uio *uio, int ioflag)
214: {
215: 	struct ledsc	*sc;
216: 	char *s;
217: 	struct sbuf *sb = NULL;
218: 	int error, state = 0;
220: 	if (uio->uio_resid > 512)
221: 		return (EINVAL);
222: 	s = malloc(uio->uio_resid + 1, M_DEVBUF, M_WAITOK);
223: 	s[uio->uio_resid] = '\0';
224: 	error = uiomove(s, uio->uio_resid, uio);
225: 	if (error) { free(s, M_DEVBUF); return (error); }
226: 	/* parse  ->  (sb pattern) or (state only) */
227: 	error = led_parse(s, &sb, &state);
228: 	free(s, M_DEVBUF);
229: 	if (error) return (error);
230: 	mtx_lock(&led_mtx);
231: 	sc = dev->si_drv1;
232: 	if (sc != NULL)
233: 		error = led_state(sc, &sb, state);
234: 	mtx_unlock(&led_mtx);
235: 	if (sb != NULL) sbuf_delete(sb);
236: 	return (error);
237: }
Write Handler: User Command Interface

The led_write function implements the character device write operation for /dev/led/* devices. When a user writes a pattern command like "f" or "m...---..." to an LED device node, this function copies the data from userspace, parses it into an internal format, and installs the new LED pattern.

Size Validation and Buffer Allocation
static int
led_write(struct cdev *dev, struct uio *uio, int ioflag)
{
    struct ledsc    *sc;
    char *s;
    struct sbuf *sb = NULL;
    int error, state = 0;

    if (uio->uio_resid > 512)
        return (EINVAL);
    s = malloc(uio->uio_resid + 1, M_DEVBUF, M_WAITOK);
    s[uio->uio_resid] = '\0';

Size limit enforcement: The check uio->uio_resid > 512 rejects writes larger than 512 bytes. LED patterns are short text commands, even complex morse code messages rarely exceed a few dozen characters. This limit prevents memory exhaustion from malicious or buggy programs attempting multi-megabyte writes.

Returning EINVAL signals invalid argument to userspace. The write fails immediately without allocating memory or touching the LED state.

Temporary buffer allocation: Unlike the null driver's null_write, which never accesses user data, the LED driver must examine the written bytes to parse commands. The allocation reserves uio->uio_resid + 1 bytes, the exact write size plus one byte for null termination.

The M_DEVBUF allocation type is generic for device driver temporary buffers. The M_WAITOK flag allows the allocation to sleep if memory is temporarily unavailable, which is acceptable since this is a blocking write operation with no stringent latency requirements.

Null termination: Setting s[uio->uio_resid] = '\0' ensures the buffer is a proper C string. The uiomove call will fill the first uio->uio_resid bytes with user data, and this assignment adds the terminator immediately after. String functions like those used in parsing require null-terminated strings.

Copying Data from Userspace
error = uiomove(s, uio->uio_resid, uio);
if (error) { free(s, M_DEVBUF); return (error); }

The uiomove function transfers uio->uio_resid bytes from the user's buffer (described by uio) into the kernel buffer s. This is the same function used in the null and zero drivers for data transfer between address spaces.

Error handling: If uiomove fails (typically EFAULT for an invalid user pointer), the allocated buffer is freed immediately with free(s, M_DEVBUF) and the error propagates to userspace. The write fails without modifying LED state, and the temporary buffer doesn't leak.

This cleanup discipline is critical, kernel code must free allocated memory on all error paths, not just success paths.

Parsing the Command
/* parse  ->  (sb pattern) or (state only) */
error = led_parse(s, &sb, &state);
free(s, M_DEVBUF);
if (error) return (error);

Translation to internal format: The led_parse function interprets the user's command string, producing either:

  • A string buffer (sb) containing timing codes for pattern mode
  • A state value (0 or 1) for static on/off mode

The parser determines which mode based on the command's first character. Commands like "f", "d", "m" generate patterns; commands "0" and "1" set static state.

Immediate cleanup: The temporary buffer s is no longer needed after parsing, whether parsing succeeded or failed, the original command string is no longer required. Freeing it immediately rather than waiting until function end reduces memory consumption in the common case where parsing succeeds and additional processing follows.

Error propagation: If parsing fails (unrecognized command, memory exhaustion, empty pattern), the error is returned to userspace. The write operation fails before acquiring locks or modifying LED state. Users see the write fail with errno set to the parser's error code (typically EINVAL for bad syntax or ENOMEM for resource exhaustion).

Installing the New State
mtx_lock(&led_mtx);
sc = dev->si_drv1;
if (sc != NULL)
    error = led_state(sc, &sb, state);
mtx_unlock(&led_mtx);

Acquiring the lock: The led_mtx mutex protects the LED list and per-LED state from concurrent modification. Multiple threads might write to different LEDs simultaneously, or a write might race with timer callbacks updating blink patterns. The mutex serializes these operations.

Retrieving the LED context: dev->si_drv1 provides the ledsc structure for this device, established during led_create(). This pointer links the character device node to its LED state.

Defensive NULL check: The condition if (sc != NULL) guards against a race where the LED is being destroyed while a write is in progress. If led_destroy() has cleared si_drv1 but the write handler is still executing, this check prevents dereferencing NULL. In practice, proper reference counting makes this unlikely, but defensive checks prevent kernel panics.

State installation: led_state(sc, &sb, state) installs the new pattern or static state. This function:

  • Swaps the new pattern buffer with the old one
  • Updates blinkers counter if the LED transitions between active and inactive
  • Calls the hardware driver's callback for static state changes
  • Returns the old pattern buffer via the sb pointer

Lock release: After state installation completes, the mutex is released. Other threads blocked on LED operations can now proceed. Lock hold time is minimal, only the state swap and counter update, not the potentially slow parsing that happened earlier.

Cleanup and Return
if (sb != NULL) sbuf_delete(sb);
return (error);

Freeing the old pattern: After led_state returns, sb points to the old pattern buffer (or NULL if no previous pattern existed). The code must free this buffer to prevent memory leaks. Each pattern installation generates one buffer to free from the previous pattern.

The check if (sb != NULL) handles both initial pattern installation (no previous pattern) and static state commands (parser never allocated a buffer). Only actual pattern buffers need deletion.

Success return: Returning error (typically 0 for success) completes the write operation. The userspace write(2) call returns the number of bytes written (the original uio->uio_resid), indicating success.

Complete Write Flow

The whole sequence from userspace write to LED state change (the flow below uses a theoretical device, to illustrate the flow):

User: echo "f" > /dev/led/disk0
     -> 
led_write() called by kernel
     -> 
Validate size (< 512 bytes)
     -> 
Allocate temporary buffer
     -> 
Copy "f\n" from userspace
     -> 
Parse "f"  ->  timing code "Aa"
     -> 
Free temporary buffer
     -> 
Lock led_mtx
     -> 
Find LED via dev->si_drv1
     -> 
Install new pattern "Aa"
     -> 
Increment blinkers (0 -> 1)
     -> 
Schedule timer if needed
     -> 
Unlock led_mtx
     -> 
Free old pattern (NULL)
     -> 
Return success
     -> 
User: write() returns 2 bytes

On the next timer tick (0.1 seconds later), the LED begins blinking at ~2.5Hz, alternating on and off every 0.1 seconds.

Error Handling Paths

The function has multiple error exits, each with proper cleanup:

Size validation failure:

Check uio_resid > 512  ->  return EINVAL
(nothing allocated yet, no cleanup needed)

Allocation failure:

malloc() returns NULL  ->  kernel panics (M_WAITOK)
(M_WAITOK means "wait for memory, never fail")

Copyin failure:

uiomove() fails  ->  free(s)  ->  return EFAULT
(temporary buffer freed, no other resources allocated)

Parse failure:

led_parse() fails  ->  free(s)  ->  return EINVAL
(temporary buffer freed, no string buffer created)

State installation success:

led_state() succeeds  ->  sbuf_delete(old)  ->  return 0
(old pattern freed, new pattern installed)

Every error path frees all allocated resources, preventing memory leaks regardless of where failure occurs.

Contrast with null.c

The null driver's null_write was trivial: set uio_resid = 0 and return. The LED driver's write handler is substantially more complex because:

User input requires parsing: Commands like "f" and "m..." must be interpreted, not just discarded.

State must be modified: New patterns affect LED behavior, requiring coordination with timer callbacks.

Memory must be managed: Buffers are allocated, swapped, and freed across function boundaries.

Synchronization is required: Multiple writers and timer callbacks must coordinate via mutexes.

This increased complexity reflects the LED driver's role as infrastructure supporting rich user interaction with physical hardware, not just a simple data sink.

5.2) Kernel API: Programmatic LED Control

240: int
241: led_set(char const *name, char const *cmd)
...
247: 	error = led_parse(cmd, &sb, &state);
...
251: 	LIST_FOREACH(sc, &led_list, list) {
252: 		if (strcmp(sc->name, name) == 0) break;
253: 	}
254: 	if (sc != NULL) error = led_state(sc, &sb, state);
255: 	else error = ENOENT;

The led_set function provides a kernel-facing API that allows other kernel code to control LEDs without going through the character device interface. This enables drivers, kernel subsystems, and system event handlers to manipulate LEDs directly using the same pattern language available to userspace.

Function Signature and Purpose
int
led_set(char const *name, char const *cmd)

Parameters: The function receives two strings:

  • name - the LED identifier, matching the name used during led_create() (e.g., "disk0", "power")
  • cmd - the pattern command string, same syntax as userspace writes (e.g., "f", "1", "m...---...")

Return value: Zero for success, or an errno value for failure (EINVAL for parse errors, ENOENT for unknown LED name, ENOMEM for allocation failure).

Use cases: Kernel code can call this function to:

  • Indicate disk activity: led_set("disk0", "f") to blink during I/O
  • Show system status: led_set("power", "1") to turn on power LED after boot completes
  • Signal error conditions: led_set("status", "m...---...") to flash SOS pattern
  • Implement heartbeat: led_set("heartbeat", "Uu") for 1Hz toggle showing system liveness
Parsing the Command
error = led_parse(cmd, &sb, &state);

The function reuses the same parser as the write handler. Pattern strings are interpreted identically whether coming from userspace via write(2) or from kernel code via led_set().

This code reuse ensures consistency; a command that works in one context works in the other. The parser handles all the complexity of expanding "f" to "Aa" or "m..." to "aA", so kernel callers don't need to understand the internal timing code format.

If parsing fails (bad command syntax, memory exhaustion), the error is recorded in the error variable and checked later. The function continues to acquire the lock even on parse failure because the lock must be held to safely return without leaking the buffer.

Finding the Named LED
LIST_FOREACH(sc, &led_list, list) {
    if (strcmp(sc->name, name) == 0) break;
}
if (sc != NULL) error = led_state(sc, &sb, state);
else error = ENOENT;

Linear search: The LIST_FOREACH macro walks the global LED list, comparing each LED's name to the requested name with strcmp(). The loop terminates early with break when a match is found, leaving sc pointing to the matching LED.

Why linear search?: For small lists (typically 5-20 LEDs per system), linear search is faster than hash table overhead. The code simplicity and cache-friendly sequential access outweigh the O(n) complexity. Systems with hundreds of LEDs would benefit from a hash table, but such systems are rare.

Handling not found: If the loop completes without breaking, no LED matched the name and sc remains NULL (from the LIST_FOREACH initialization). Setting error = ENOENT (no such file or directory) signals that the named LED doesn't exist.

Installing state: When a match is found (sc != NULL), led_state() is called to install the new pattern or static state, using the same state installation function as the write handler. The return value overwrites any parse error, if parsing succeeded but state installation failed, the installation error takes precedence.

Critical Code Omitted in Fragment

The provided fragment omits several critical lines visible in the complete function:

Lock acquisition (before line 251):

mtx_lock(&led_mtx);

The LED list must be locked before traversal to prevent concurrent modifications. If one thread is searching the list while another thread destroys an LED, the search might access freed memory. The mutex serializes list access.

Lock release and cleanup (after line 255):

mtx_unlock(&led_mtx);
if (sb != NULL)
    sbuf_delete(sb);
return (error);

After the state installation attempt, the mutex is released and the old pattern buffer (returned via sb by led_state()) is freed. This cleanup mirrors the write handler's buffer management.

Comparison with Write Handler

Both led_write and led_set follow the same pattern:

Parse command  ->  Acquire lock  ->  Find LED  ->  Install state  ->  Release lock  ->  Cleanup

The key differences:

Aspectled_writeled_set
CallerUserspace via write(2)Kernel code
Input sourceuio structureDirect string pointers
LED identificationdev->si_drv1Name lookup
Size validationLimit 512 bytesNo explicit limit (caller responsible)
Error reportingerrno to userspaceReturn value to caller

The write handler uses the device pointer to find the LED directly (single device, single LED). The kernel API uses name lookup to support arbitrary LED selection from any kernel context.

Example Usage Patterns

Disk driver indicating activity:

void
disk_start_io(struct disk_softc *sc)
{
    /* Begin I/O operation */
    led_set(sc->led_name, "f");  // Start blinking
}

void
disk_complete_io(struct disk_softc *sc)
{
    /* I/O completed */
    led_set(sc->led_name, "0");  // Turn off
}

System initialization sequence:

void
system_boot_complete(void)
{
    led_set("power", "1");      // Solid on: system ready
    led_set("status", "0");     // Off: no errors
    led_set("heartbeat", "Uu"); // 1Hz toggle: alive
}

Error indication:

void
critical_error_handler(int error_code)
{
    char pattern[16];
    snprintf(pattern, sizeof(pattern), "d%d", error_code);
    led_set("status", pattern);  // Flash error code
}
Thread Safety

The function is thread-safe through mutex protection. Multiple threads can call led_set() concurrently:

Scenario: Thread A sets "disk0" to "f" while Thread B sets "power" to "1".

Thread A                    Thread B
Parse "f"  ->  "Aa"            Parse "1"  ->  state=1
Lock led_mtx                (blocks on lock)
Find "disk0"                ...
Install pattern             ...
Unlock led_mtx              Acquire lock
Delete old buffer           Find "power"
Return                      Install state
                            Unlock led_mtx
                            Delete old buffer
                            Return

The mutex serializes list traversal and state modification, preventing corruption. Both operations complete successfully without interference.

Error Handling

The function can fail in several ways:

Parse error:

led_set("disk0", "invalid")  // Returns EINVAL

LED not found:

led_set("nonexistent", "f")  // Returns ENOENT

Memory exhaustion:

led_set("disk0", "m..." /* very long morse */)  // Returns ENOMEM

Kernel callers should check the return value and handle errors appropriately, though in practice LED control failures are rarely fatal, the system continues operating, just without visual indicators.

Why Both APIs Exist

The dual interface (character device + kernel API) serves different needs:

Character device (/dev/led/*):

  • User scripts and programs
  • System administrators
  • Testing and debugging
  • Interactive control

Kernel API (led_set()):

  • Automated responses to events
  • Driver-integrated indicators
  • System state visualization
  • Performance-critical paths (no system call overhead)

This pattern, exposing functionality through both userspace devices and kernel APIs, appears throughout FreeBSD. The LED subsystem provides a clean example of how to structure such dual-interface services.

6) Hook into devfs and export the write method

272: static struct cdevsw led_cdevsw = {
273: 	.d_version =	D_VERSION,
274: 	.d_write =	led_write,
275: 	.d_name =	"LED",
276: };
Character Device Switch Table

The led_cdevsw structure defines the character device operations for all LED device nodes. Unlike the null driver which had three separate cdevsw structures for three devices, the LED driver uses a single cdevsw shared by all dynamically created /dev/led/* devices.

Structure Definition
static struct cdevsw led_cdevsw = {
    .d_version =    D_VERSION,
    .d_write =      led_write,
    .d_name =       "LED",
};

d_version = D_VERSION: The mandatory version field ensures binary compatibility between the driver and the kernel's device framework. All cdevsw structures must include this field.

d_write = led_write: The only operation explicitly defined. When userspace calls write(2) on any /dev/led/* device, the kernel invokes this function. The led_write handler parses pattern commands and updates LED state.

d_name = "LED": The device class name appearing in kernel messages and accounting. This string identifies the driver type, though individual devices have their own specific names (like "disk0" or "power").

Minimal Operation Set

Notice what's not defined:

No d_read: LEDs are output-only devices. Reading from /dev/led/disk0 is meaningless, there's no state to query, no data to retrieve. Omitting d_read causes read attempts to fail with ENODEV (operation not supported by device).

No d_open / d_close: LED devices require no per-open initialization or cleanup. Multiple processes can write to the same LED simultaneously (serialized by the mutex), and closing the device requires no state teardown. The kernel's default handlers suffice.

No d_ioctl: Unlike the null driver which supported terminal ioctls, LED devices have no control operations beyond writing patterns. All configuration happens through the write interface.

No d_poll / d_kqfilter: LEDs are write-only, so there's no condition to wait for. Polling for writability would always return "ready" since writes never block (beyond mutex acquisition), making poll support useless.

This minimalism contrasts with the null driver's more complete interface (which included ioctl handlers) and demonstrates that cdevsw structures need only provide operations that make sense for the device type.

Shared Across Devices

A critical distinction from the null driver: this single cdevsw serves all LED devices. When the system has three LEDs registered:

/dev/led/disk0   ->  led_cdevsw
/dev/led/power   ->  led_cdevsw
/dev/led/status  ->  led_cdevsw

All three device nodes share the same function pointer table. The led_write function determines which LED is being written to by examining dev->si_drv1, which points to the specific LED's ledsc structure.

This sharing is possible because:

  • All LEDs support the same operations (write pattern commands)
  • Per-device state is accessed through si_drv1, not through different functions
  • The same parsing and state installation logic applies to every LED
Contrast with null.c

The null driver defined three separate cdevsw structures:

static struct cdevsw full_cdevsw = { ... };
static struct cdevsw null_cdevsw = { ... };
static struct cdevsw zero_cdevsw = { ... };

Each had different function assignments because the devices had different behavior (full_write vs. null_write, nullop vs. zero_read). The devices were fundamentally different types.

The LED driver's devices are all the same type, they're LEDs that accept pattern commands. The only differences are:

  • Device name ("disk0" vs. "power")
  • Hardware control callback (different for each physical LED)
  • Current pattern state (independent per LED)

These differences are stored in per-device ledsc structures, not encoded in separate function tables. This design scales elegantly: registering 100 LEDs doesn't require 100 cdevsw structures, just 100 ledsc instances sharing one cdevsw.

Usage in Device Creation

When a hardware driver calls led_create(), the code creates a device node:

sc->dev = make_dev(&led_cdevsw, sc->unit,
    UID_ROOT, GID_WHEEL, 0600, "led/%s", name);

The &led_cdevsw parameter provides the function dispatch table. All created devices reference the same structure, make_dev() doesn't copy it, just stores the pointer. This means:

  • Zero memory overhead per device for the function table
  • Changes to led_write (during development) automatically affect all devices
  • The cdevsw must remain valid for the system's lifetime (hence static storage)
Device Identification

With all devices sharing one cdevsw, how does led_write distinguish which LED is being written to? The device linkage:

// In led_create():
sc->dev = make_dev(&led_cdevsw, ...);
sc->dev->si_drv1 = sc;  // Link device to its ledsc

// In led_write():
sc = dev->si_drv1;       // Retrieve the ledsc

The si_drv1 field (set during led_create()) creates a per-device pointer to the unique ledsc structure. Though all devices share the same cdevsw and thus the same led_write function, each invocation receives a different dev parameter, which provides access to device-specific state through si_drv1.

This pattern, shared function table, per-device state pointer, is the standard approach for drivers managing multiple similar devices. It combines efficiency (one function table) with flexibility (device-specific behavior through the state pointer).

7) Create per-LED device nodes

278: struct cdev *
279: led_create(led_t *func, void *priv, char const *name)
280: {
282: 	return (led_create_state(func, priv, name, 0));
283: }
285: struct cdev *
286: led_create_state(led_t *func, void *priv, char const *name, int state)
287: {
288: 	struct ledsc	*sc;
290: 	sc = malloc(sizeof *sc, M_LED, M_WAITOK | M_ZERO);
292: 	sx_xlock(&led_sx);
293: 	sc->name = strdup(name, M_LED);
294: 	sc->unit = alloc_unr(led_unit);
295: 	sc->private = priv;
296: 	sc->func = func;
297: 	sc->dev = make_dev(&led_cdevsw, sc->unit,
298: 	    UID_ROOT, GID_WHEEL, 0600, "led/%s", name);
299: 	sx_xunlock(&led_sx);
301: 	mtx_lock(&led_mtx);
302: 	sc->dev->si_drv1 = sc;
303: 	LIST_INSERT_HEAD(&led_list, sc, list);
304: 	if (state != -1)
305: 		sc->func(sc->private, state != 0);
306: 	mtx_unlock(&led_mtx);
308: 	return (sc->dev);
309: }
LED Registration: Creating Dynamic Devices

The led_create and led_create_state functions form the public API that hardware drivers use to register LEDs with the subsystem. These functions allocate resources, create device nodes, and integrate the LED into the global registry, making it accessible to both userspace and kernel code.

Simple Registration Wrapper
struct cdev *
led_create(led_t *func, void *priv, char const *name)
{
    return (led_create_state(func, priv, name, 0));
}

The led_create function provides a simplified interface for the common case where the LED's initial state doesn't matter. It delegates to led_create_state with an initial state of 0 (off), allowing hardware drivers to register LEDs with minimal code:

struct cdev *led;
led = led_create(my_led_callback, my_softc, "disk0");

This convenience wrapper follows the FreeBSD pattern of providing both simple and feature-complete versions of the same API.

Full Registration Function
struct cdev *
led_create_state(led_t *func, void *priv, char const *name, int state)
{
    struct ledsc    *sc;

Parameters: The function receives four values:

  • func - callback function that controls the physical LED hardware
  • priv - opaque pointer passed to the callback, typically the driver's softc
  • name - string identifying the LED, becomes part of /dev/led/name
  • state - initial LED state: 0 (off), 1 (on), or -1 (don't initialize)

Return value: Pointer to the created struct cdev, which the hardware driver should store for later use with led_destroy(). If creation fails, the function panics (due to M_WAITOK allocation) rather than returning NULL.

Allocating LED State
sc = malloc(sizeof *sc, M_LED, M_WAITOK | M_ZERO);

The softc structure is allocated to track this LED's state. The M_ZERO flag zeroes all fields, providing safe defaults:

  • Pointer fields (name, dev, spec, str, ptr) are NULL
  • Numeric fields (unit, count) are zero
  • The list entry is zeroed (will be initialized by LIST_INSERT_HEAD)

The M_WAITOK flag means the allocation can sleep waiting for memory, which is acceptable since LED registration happens during driver attach (a blocking context). If memory is truly exhausted, the kernel panics, LED registration is considered essential enough that failure is not recoverable.

Device Creation Under Exclusive Lock
sx_xlock(&led_sx);
sc->name = strdup(name, M_LED);
sc->unit = alloc_unr(led_unit);
sc->private = priv;
sc->func = func;
sc->dev = make_dev(&led_cdevsw, sc->unit,
    UID_ROOT, GID_WHEEL, 0600, "led/%s", name);
sx_xunlock(&led_sx);

Exclusive lock acquisition: The sx_xlock call acquires the shared/exclusive lock in exclusive (write) mode. This serializes all device creation and destruction operations, preventing races where two threads simultaneously create devices with the same name or allocate the same unit number.

Name duplication: strdup(name, M_LED) allocates a copy of the name string. The caller's string may be temporary (stack buffer or string literal), so a persistent copy is needed for the LED's lifetime. This copy will be freed in led_destroy().

Unit number allocation: alloc_unr(led_unit) obtains a unique unit number from the global pool. This number becomes the device's minor number, ensuring /dev/led/disk0 and /dev/led/power have distinct device identifiers even though they share the same major number.

Callback registration: The private and func fields are copied from parameters, establishing the connection to the hardware driver's control function. When the LED state changes (via pattern execution or static state command), sc->func(sc->private, onoff) will be called to manipulate the physical hardware.

Device node creation: make_dev creates /dev/led/name with the following properties:

  • &led_cdevsw - shared character device operations (write handler)
  • sc->unit - unique minor number for this LED
  • UID_ROOT, GID_WHEEL - owned by root:wheel
  • 0600 - read/write for owner only (root), no access for others
  • "led/%s", name - device path, automatically prepends /dev/

The restrictive permissions (0600) prevent unprivileged users from controlling LEDs, which could be a security concern (information leakage through LED patterns) or nuisance (making the power LED blink rapidly).

Lock release: After device creation completes, the exclusive lock is released. Other threads can now create or destroy LEDs. The lock hold time is minimal, just the core allocation and registration, not including the earlier softc allocation which didn't need protection.

Integration Under Mutex
mtx_lock(&led_mtx);
sc->dev->si_drv1 = sc;
LIST_INSERT_HEAD(&led_list, sc, list);
if (state != -1)
    sc->func(sc->private, state != 0);
mtx_unlock(&led_mtx);

Mutex acquisition: The led_mtx mutex protects the LED list and timer-related state. It's acquired after device creation because multiple locks with different purposes reduces contention, threads creating devices don't block threads modifying LED states.

Bidirectional linkage: Setting sc->dev->si_drv1 = sc creates the critical link from device node to softc. When led_write is called with this device, it can retrieve the softc via dev->si_drv1. This linkage must be established before the device is usable.

List insertion: LIST_INSERT_HEAD(&led_list, sc, list) adds the LED to the global registry at the head of the list. The list field in the softc was zeroed during allocation, and this macro initializes it properly while linking into the existing list.

Using LIST_INSERT_HEAD rather than LIST_INSERT_TAIL is arbitrary; order doesn't matter for LED list iteration. Head insertion is slightly faster (no need to find the tail), but the performance difference is negligible.

Optional initial state: If state != -1, the hardware callback is invoked immediately to set the LED's initial state:

  • state != 0 converts any non-zero value to boolean true (LED on)
  • state == 0 means LED off

The special value -1 means "don't initialize," leaving the LED in whatever state the hardware defaults to. This is useful when the hardware driver has already configured the LED before registration.

Lock release: After list insertion and optional initialization, the mutex is released. The LED is now fully operational; userspace can write to its device node, kernel code can call led_set() with its name, and timer callbacks will process any patterns.

Return Value and Ownership
return (sc->dev);
}

The function returns the cdev pointer, which the hardware driver should store:

struct my_driver_softc {
    struct cdev *led_dev;
    /* other fields */
};

void
my_driver_attach(device_t dev)
{
    struct my_driver_softc *sc = device_get_softc(dev);
    /* other initialization */
    sc->led_dev = led_create(my_led_callback, sc, "disk0");
}

The hardware driver needs this pointer to call led_destroy() during detach. Without storing it, the LED would leak, its device node and resources would persist even after the hardware driver unloads.

Resource Allocation Summary

A successful LED registration allocates:

  • Softc structure (freed in led_destroy)
  • Name string copy (freed in led_destroy)
  • Unit number (returned to pool in led_destroy)
  • Device node (destroyed in led_destroy)

All resources are cleaned up symmetrically during destruction, preventing leaks when hardware is removed.

Thread Safety

The two-lock design enables safe concurrent operations:

Scenario: Thread A creates "disk0" while Thread B creates "power".

Thread A                    Thread B
Allocate sc1                Allocate sc2
Lock led_sx (exclusive)     (blocks on led_sx)
Create /dev/led/disk0       ...
Unlock led_sx               Acquire led_sx
Lock led_mtx                Create /dev/led/power
Insert sc1 to list          Unlock led_sx
Unlock led_mtx              Lock led_mtx
                            Insert sc2 to list
                            Unlock led_mtx

The exclusive lock serializes device creation (preventing name/unit conflicts), while the mutex serializes list modification (preventing list corruption). Both threads complete successfully with two working LEDs.

Contrast with null.c

The null driver's device creation happened in null_modevent during module load:

// null.c: static devices created once
full_dev = make_dev_credf(..., "full");
null_dev = make_dev_credf(..., "null");
zero_dev = make_dev_credf(..., "zero");

The LED driver's device creation happens dynamically on demand:

// led.c: devices created whenever hardware drivers request
led_create(func, priv, "disk0");   // called by disk driver
led_create(func, priv, "power");   // called by power driver
led_create(func, priv, "status");  // called by GPIO driver

This dynamic approach scales naturally: the system can have any number of LEDs (zero to hundreds), with devices appearing and disappearing as hardware is added and removed. The subsystem provides infrastructure, but doesn't dictate what LEDs exist, that's determined by which hardware drivers are loaded and what hardware is present.

8) Destroy per-LED device nodes

306: void
307: led_destroy(struct cdev *dev)
308: {
309: 	struct ledsc *sc;
311: 	mtx_lock(&led_mtx);
312: 	sc = dev->si_drv1;
313: 	dev->si_drv1 = NULL;
314: 	if (sc->ptr != NULL)
315: 		blinkers--;
316: 	LIST_REMOVE(sc, list);
317: 	if (LIST_EMPTY(&led_list))
318: 		callout_stop(&led_ch);
319: 	mtx_unlock(&led_mtx);
321: 	sx_xlock(&led_sx);
322: 	free_unr(led_unit, sc->unit);
323: 	destroy_dev(dev);
324: 	if (sc->spec != NULL)
325: 		sbuf_delete(sc->spec);
326: 	free(sc->name, M_LED);
327: 	free(sc, M_LED);
328: 	sx_xunlock(&led_sx);
329: }
LED Deregistration: Cleanup and Resource Release

The led_destroy function unregisters an LED from the subsystem, reversing all operations performed during led_create. Hardware drivers call this function during detach to cleanly remove LEDs before the underlying hardware disappears, ensuring no dangling references or resource leaks remain.

Function Entry and Softc Retrieval
void
led_destroy(struct cdev *dev)
{
    struct ledsc *sc;

    mtx_lock(&led_mtx);
    sc = dev->si_drv1;
    dev->si_drv1 = NULL;

Parameter: The function receives the cdev pointer returned by led_create. Hardware drivers typically store this pointer in their own softc and pass it during cleanup:

void
my_driver_detach(device_t dev)
{
    struct my_driver_softc *sc = device_get_softc(dev);
    led_destroy(sc->led_dev);
    /* other cleanup */
}

Mutex acquisition: The led_mtx mutex is acquired first to protect the LED list and timer state. This serializes destruction with ongoing timer callbacks and write operations.

Breaking the linkage: Setting dev->si_drv1 = NULL immediately severs the connection between device node and softc. Any write operation that started before this function was called but hasn't yet acquired the mutex will see NULL when it checks dev->si_drv1 and safely fail rather than accessing freed memory. This defensive programming prevents use-after-free bugs during concurrent operations.

Deactivating Pattern Execution
if (sc->ptr != NULL)
    blinkers--;

If this LED has an active blink pattern (ptr != NULL), the global blinkers counter must be decremented. This counter tracks how many LEDs need timer servicing, and removing an active LED reduces that count.

Timer shutdown logic: When the counter reaches zero (this was the last blinking LED), the timer callback will notice and stop rescheduling itself. However, there's no explicit timer stop here; the counter update is sufficient. The timer callback checks blinkers > 0 before each reschedule.

Removing from Global Registry
LIST_REMOVE(sc, list);
if (LIST_EMPTY(&led_list))
    callout_stop(&led_ch);

List removal: LIST_REMOVE(sc, list) unlinks this LED from the global list. The macro updates neighboring list entries to skip this node, and future timer callbacks won't see this LED when iterating.

Explicit timer stop: If the list becomes empty after removal, callout_stop(&led_ch) explicitly stops the timer. This is an optimization, waiting for the timer to notice blinkers == 0 would work, but stopping immediately when all LEDs are gone is more efficient.

The callout_stop function is safe to call on an already-stopped timer (it does nothing), so the check for empty list is just an optimization to avoid the function call when unnecessary.

Lock release: After list modification and timer management, the mutex is released:

mtx_unlock(&led_mtx);

The remaining cleanup doesn't require mutex protection since this LED is now invisible to timer callbacks and write operations.

Resource Deallocation Under Exclusive Lock
sx_xlock(&led_sx);
free_unr(led_unit, sc->unit);
destroy_dev(dev);
if (sc->spec != NULL)
    sbuf_delete(sc->spec);
free(sc->name, M_LED);
free(sc, M_LED);
sx_xunlock(&led_sx);

Exclusive lock acquisition: The led_sx lock serializes device creation and destruction. Acquiring it exclusively prevents new devices from being created while this one is being destroyed, avoiding races where the freed unit number or name might be immediately reused.

Unit number return: free_unr(led_unit, sc->unit) returns the unit number to the pool, making it available for future LED registrations. Without this, unit numbers would leak and eventually exhaust the available range.

Device node destruction: destroy_dev(dev) removes /dev/led/name from the filesystem and deallocates the cdev structure. This function blocks until all open file descriptors to the device are closed, ensuring no write operations are in progress.

After destroy_dev returns, the device no longer exists in /dev, and any future attempts to open it will fail with ENOENT (no such file or directory).

Pattern buffer cleanup: If an active pattern exists (sc->spec != NULL), its string buffer is freed with sbuf_delete. This handles the case where an LED is destroyed while a blink pattern is running.

Name string cleanup: free(sc->name, M_LED) releases the duplicated name string allocated during led_create. The M_LED type tag matches the allocation, maintaining accounting consistency.

Softc deallocation: free(sc, M_LED) releases the LED state structure itself. After this call, the sc pointer is invalid and must not be accessed.

Lock release: The exclusive lock is released, allowing other device operations to proceed. All resources associated with this LED have been freed.

Symmetric Cleanup

The destruction sequence precisely reverses creation:

Creation StepDestruction Step
Allocate softcFree softc
Duplicate nameFree name
Allocate unitFree unit
Create device nodeDestroy device node
Insert into listRemove from list
Increment blinkers (if pattern)Decrement blinkers (if pattern)

This symmetry ensures complete cleanup with no resource leaks. Every allocation has a corresponding deallocation, every list insertion has a removal, every increment has a decrement.

Handling Active LEDs

If an LED is destroyed while actively blinking, the function handles this cleanly:

Before destruction:

LED state: ptr = "AaAa", spec = sbuf, blinkers = 1
Timer: scheduled, will fire in 0.1s

During destruction:

Mutex locked
dev->si_drv1 = NULL (breaks write path)
blinkers--  (now 0)
LIST_REMOVE (invisible to timer)
Mutex unlocked
Timer fires, sees empty list, doesn't reschedule
sbuf_delete (frees pattern)

After destruction:

LED state: freed
Timer: stopped
Device: removed from /dev

The LED's pattern is interrupted mid-execution, but no crashes or leaks occur. The hardware LED is left in whatever state it was in at destruction time, turning it off explicitly is the hardware driver's responsibility if desired.

Thread Safety Considerations

The two-phase locking (mutex then exclusive lock) prevents several race conditions:

Race 1: Write vs. Destroy

Thread A (write)                Thread B (destroy)
Begin led_write()               Begin led_destroy()
                                Lock led_mtx
                                dev->si_drv1 = NULL
                                Remove from list
                                Unlock led_mtx
Lock led_mtx                    Lock led_sx
sc = dev->si_drv1 (NULL)        destroy_dev() blocks
if (sc != NULL) ... (skipped)   ...
Unlock led_mtx                  [write returns]
Return error                    destroy_dev() completes

The write operation safely detects the destroyed LED via the NULL check and returns an error without accessing freed memory.

Race 2: Timer vs. Destroy

Timer callback running          led_destroy() called
Iterating LED list              Lock led_mtx (blocks)
Process this LED                ...
                                Acquire lock
                                Remove from list
                                Unlock
Move to next LED                [timer continues]
                                Free softc

The timer finishes processing the LED before it's removed from the list. The mutex ensures the LED isn't freed while the timer is accessing it.

Contrast with null.c

The null driver's cleanup in MOD_UNLOAD was simple:

destroy_dev(full_dev);
destroy_dev(null_dev);
destroy_dev(zero_dev);

Three fixed devices, three destroy calls, done. The LED driver's cleanup is more complex because:

Dynamic lifecycle: LEDs are created and destroyed individually as hardware appears and disappears, not all at once during module unload.

Active state: LEDs may have running timers and allocated patterns that need cleanup.

Reference counting: The blinkers counter must be maintained correctly for timer management.

List management: Removal from the global registry requires proper list manipulation.

This additional complexity is the cost of supporting dynamic device creation, the subsystem must handle arbitrary sequences of create/destroy operations without leaking resources or corrupting state.

Usage Example

A complete hardware driver lifecycle:

// During attach
sc->led_dev = led_create(my_led_control, sc, "disk0");

// During normal operation
// LED blinks, patterns execute, writes succeed

// During detach
led_destroy(sc->led_dev);
// LED is gone, /dev/led/disk0 removed
// All resources freed

After led_destroy returns, the hardware driver can safely unload without leaving orphaned LED state in the kernel.

9) Driver init: set up bookkeeping and the callout

331: static void
332: led_drvinit(void *unused)
333: {
335: 	led_unit = new_unrhdr(0, INT_MAX, NULL);
336: 	mtx_init(&led_mtx, "LED mtx", NULL, MTX_DEF);
337: 	sx_init(&led_sx, "LED sx");
338: 	callout_init_mtx(&led_ch, &led_mtx, 0);
339: }
341: SYSINIT(leddev, SI_SUB_DRIVERS, SI_ORDER_MIDDLE, led_drvinit, NULL);
Driver Initialization and Registration

The final section of the LED driver handles one-time initialization during system boot. This code sets up the global infrastructure needed before any LEDs can be registered, establishing the foundation that all subsequent operations rely on.

Initialization Function
static void
led_drvinit(void *unused)
{
    led_unit = new_unrhdr(0, INT_MAX, NULL);
    mtx_init(&led_mtx, "LED mtx", NULL, MTX_DEF);
    sx_init(&led_sx, "LED sx");
    callout_init_mtx(&led_ch, &led_mtx, 0);
}

Function signature: Initialization functions registered with SYSINIT receive a single void * argument for optional data. The LED driver doesn't need any initialization parameters, so the argument is unused and named accordingly.

Unit number allocator creation: new_unrhdr(0, INT_MAX, NULL) creates a unit number pool that can allocate integers from 0 to INT_MAX (typically 2,147,483,647). Each LED registered will receive a unique number from this range, used as the device minor number. The NULL parameter indicates no mutex protects this allocator; external locking (via led_sx) will serialize access instead.

Mutex initialization: mtx_init(&led_mtx, "LED mtx", NULL, MTX_DEF) initializes the mutex that protects:

  • The LED list during insertions, removals, and traversals
  • The blinkers counter
  • Per-LED pattern execution state

The parameters specify:

  • &led_mtx - the mutex structure to initialize
  • "LED mtx" - name appearing in lock debugging and analysis tools
  • NULL - no witness data (advanced lock-order checking not needed)
  • MTX_DEF - default mutex type (can sleep while held, standard recursion rules)

Shared/exclusive lock initialization: sx_init(&led_sx, "LED sx") initializes the lock that protects device creation and destruction. The simpler parameter list reflects that sx locks have fewer options than mutexes; they're always sleepable and non-recursive.

Timer initialization: callout_init_mtx(&led_ch, &led_mtx, 0) prepares the timer callback infrastructure. The parameters specify:

  • &led_ch - the callout structure to initialize
  • &led_mtx - the mutex held when timer callbacks execute
  • 0 - flags (none needed)

This initialization associates the timer with the mutex, so timer callbacks automatically hold led_mtx while executing. This simplifies locking in led_timeout, it doesn't need to acquire the mutex explicitly because the callout infrastructure does it automatically.

Boot-Time Registration
SYSINIT(leddev, SI_SUB_DRIVERS, SI_ORDER_MIDDLE, led_drvinit, NULL);

The SYSINIT macro registers the initialization function with the kernel's boot sequence. The kernel calls registered functions in order during startup, ensuring dependencies are satisfied.

Macro parameters:

leddev: A unique identifier for this initialization. Must be unique across the entire kernel to prevent collisions. The name doesn't affect behavior, it's purely for identification in debugging.

SI_SUB_DRIVERS: The subsystem level. The kernel initialization happens in phases (we will see a simplified list, the ... in the list below means that we have skipped some phases):

  • SI_SUB_TUNABLES - system tunables
  • SI_SUB_COPYRIGHT - display copyright
  • SI_SUB_VM - virtual memory
  • SI_SUB_KMEM - kernel memory allocator
  • ...
  • SI_SUB_DRIVERS - device drivers
  • ...
  • SI_SUB_RUN_SCHEDULER - start scheduler

The LED driver initializes during the driver phase, after core kernel services (memory allocation, locking primitives) are available but before devices start attaching.

SI_ORDER_MIDDLE: The order within the subsystem. Multiple initializers in the same subsystem execute in order from SI_ORDER_FIRST through SI_ORDER_ANY to SI_ORDER_LAST. Using MIDDLE places the LED driver in the middle of the driver initialization phase, not critical to go first, but not dependent on everything else either.

led_drvinit: Pointer to the initialization function.

NULL: No argument data to pass to the function.

Initialization Ordering

The SYSINIT mechanism ensures proper initialization order:

Before LED init:

Memory allocator running (malloc works)
Lock primitives available (mtx_init, sx_init work)
Timer subsystem operational (callout_init works)
Device filesystem ready (make_dev will work later)

During LED init:

led_drvinit() called
 -> 
Create unit allocator
Initialize locks
Prepare timer infrastructure

After LED init:

Hardware drivers attach
 -> 
Call led_create()
 -> 
Use the already-initialized infrastructure

Without SYSINIT, hardware drivers that call led_create() during their attach functions would crash attempting to use uninitialized locks or allocate from a NULL unit number pool.

Contrast with null.c Module Load

The null driver used module event handlers:

static int
null_modevent(module_t mod, int type, void *data)
{
    switch(type) {
    case MOD_LOAD:
        /* Create devices */
        break;
    case MOD_UNLOAD:
        /* Destroy devices */
        break;
    }
}

DEV_MODULE(null, null_modevent, NULL);

Module events fire when loadable modules are loaded or unloaded. The LED driver uses SYSINIT instead because:

Always needed: The LED subsystem is infrastructure that other drivers depend on. It should initialize early during boot, not wait for explicit module loading.

No unload: The LED subsystem doesn't provide a module unload handler. Once initialized, it remains available for the system's lifetime. Unloading would be complex, all registered LEDs would need to be destroyed, which requires coordinating with potentially many hardware drivers.

Separate concerns: SYSINIT handles initialization, while individual LEDs are created/destroyed dynamically as hardware appears/disappears. The null driver conflated initialization with device creation (both happened in MOD_LOAD), while the LED driver separates them.

What's Not Initialized

Notice what this function doesn't do:

No LED creation: Unlike the null driver which created its three devices during initialization, the LED driver creates no devices here. Device creation is demand-driven via led_create() calls from hardware drivers.

No list initialization: The global led_list was statically initialized:

static LIST_HEAD(, ledsc) led_list = LIST_HEAD_INITIALIZER(led_list);

Static initialization suffices for list heads, they're just pointer structures that start empty.

No blinkers initialization: The blinkers counter was declared static int, giving it an initial value of 0 automatically. No explicit initialization needed.

No timer scheduling: The timer callback starts inactive. It's only scheduled when the first LED receives a blink pattern, not during driver initialization.

This minimal initialization reflects good design: do the minimum necessary work at boot, defer everything else until actually needed.

Complete Boot Sequence

The full sequence from power-on to working LEDs:

1. Kernel starts
2. Early boot (memory, interrupts, etc.)
3. SYSINIT runs:
   - led_drvinit() initializes LED infrastructure
4. Device enumeration and driver attachment:
   - Disk driver attaches
   - Calls led_create(..., "disk0")
   - /dev/led/disk0 appears
   - GPIO driver attaches
   - Calls led_create(..., "power")
   - /dev/led/power appears
5. System running:
   - User scripts write patterns
   - Drivers call led_set()
   - LEDs blink and indicate status

The LED subsystem is ready before hardware drivers need it, and hardware drivers can register LEDs at any point during or after boot without worrying about initialization order.

Why This Matters

This initialization pattern, early infrastructure setup via SYSINIT, late device creation on demand, is fundamental to FreeBSD's modular architecture. It allows:

Flexibility: Hardware drivers don't need to coordinate initialization order. The LED subsystem is always ready when they need it.

Scalability: The subsystem doesn't pre-allocate resources for devices that might not exist. Memory usage scales with actual hardware.

Modularity: Hardware drivers depend only on the LED API, not on implementation details. The subsystem can change internally without affecting drivers.

Reliability: Initialization failures (like memory exhaustion during new_unrhdr) are fatal panics rather than obscure later crashes, making problems immediately visible during boot.

This design philosophy, initialize infrastructure early, create instances lazily, appears throughout the FreeBSD kernel and is worth understanding for anyone implementing subsystems or drivers.

Interactive Exercises for led(4)

Goal: Understand dynamic device creation, timer-based state machines, and pattern parsing. This driver builds on concepts from the null driver but adds stateful pattern execution and kernel API design.

A) Structure and Global State
  1. Examine the struct ledsc definition (lines 30-42). This structure contains both device identity and pattern execution state. Create a table categorizing the fields:
FieldPurposeCategory
list?Linkage
name?Identity
ptr?Pattern execution
.........
Quote the fields related to pattern execution (`str`, `ptr`, `count`, `last_second`) and explain the role of each in one sentence.

2. Locate the global variables (lines 44-51). For each one, explain its purpose:

  • led_unit - what does this allocate?
  • led_mtx vs. led_sx - why two locks? What does each protect?
  • led_list - who iterates this and when?
  • led_ch - what triggers this?
  • blinkers - what happens when this reaches 0?

Quote the declaration lines.

  1. Examine the led_cdevsw structure (lines 272-276). Which operation is defined? Which operations are notably absent (compare to null.c)? What appears under /dev when LEDs are created?
  1. Trace the data flow in led_write() (lines 213-237):
  • Find the size check - what's the limit and why?
  • Find the buffer allocation - why uio_resid + 1?
  • Find the uiomove() call - what's being copied?
  • Find the parse call - what does it produce?
  • Find the state update - what lock is held?

Quote each step and write one sentence explaining its purpose.

  1. In led_state() (lines 89-112), trace two paths:

Path 1 - Installing a pattern (sb != NULL):

  • Which fields change in the softc?
  • When is blinkers incremented?
  • What does sc->ptr = sc->str accomplish?

Path 2 - Setting static state (sb == NULL):

  • Which fields change?
  • When is blinkers decremented?
  • Why call sc->func() here but not in Path 1?

Quote the key lines for each path.

  1. Explain the timer-to-pattern connection:
  • When blinkers goes from 0 -> 1, what must happen? (Hint: who schedules the timer?)
  • When blinkers goes from 1 -> 0, what must happen? (Hint: line 317-318)
  • Why doesn't led_state() directly schedule the timer?
C) Timer Callback State Machine
  1. In led_timeout() (lines 54-87), explain the pattern interpreter:

Create a table showing what each code does:

CodeLED ActionDuration SetupExample
'A'-'J'?count = ?'C' means?
'a'-'j'?count = ?'c' means?
'U'/'u'?Special timingWhat's checked?
'.'?N/AWhat happens?

Quote the lines implementing each case.

  1. The count field implements delays:
  • When is count set to non-zero? Quote the line.
  • When is count decremented? Quote the line.
  • Why does the pattern advance skip when count > 0?

Trace pattern "Ac" (on 0.1s, off 0.3s) through three timer ticks:

  • Tick 1: What happens?
  • Tick 2: What happens?
  • Tick 3: What happens?
  1. Find the timer rescheduling logic (line 85-86):
  • What condition must be true for rescheduling?
  • What's the delay (hz / 10 means what in seconds)?
  • Why doesn't the timer reschedule when blinkers == 0?
D) Pattern Parsing DSL
  1. For the flash command "f2" (lines 135-138):
  • What does the digit '2' map to (i = ?)?
  • What two-character string is generated?
  • How long is each phase in timer ticks?
  • What frequency does this produce?

Quote the lines and calculate the blink rate.

  1. For the Morse command "m...---..." (lines 187-199):
  • What string is generated for '.' (dot)?
  • What string is generated for '-' (dash)?
  • What string is generated for ' ' (space)?
  • What string is generated for '\n' (newline)?

Quote the sbuf_cat() calls and explain how this implements standard Morse timing (dot = 1 unit, dash = 3 units).

  1. For the digit command "d12" (lines 149-161):
  • How is digit '1' represented in flashes?
  • How is digit '2' represented in flashes?
  • Why is '0' treated as 10 instead of 0?
  • What separates repetitions of the pattern?

Quote the loop and explain the formula for flash count.

E) Dynamic Device Lifecycle
  1. In led_create_state() (lines 286-309), identify the initialization sequence:
  • What's allocated first and with what flags?
  • Which lock is acquired for device creation? Why exclusive?
  • What parameters does make_dev() receive? What path is created?
  • Which lock protects list insertion? Why different from device creation?
  • When is the hardware callback invoked, and what does state != -1 mean?

Quote each phase and explain the lock separation.

  1. In led_destroy() (lines 306-329), trace the cleanup:
  • Why is dev->si_drv1 set to NULL immediately?
  • When is blinkers decremented?
  • Why call callout_stop() only when list becomes empty?
  • Which resources are freed under which lock?

Create a table mapping each led_create() allocation to its corresponding led_destroy() deallocation.

  1. Explain the two-phase locking:
  • Why acquire led_mtx first, then release it before acquiring led_sx?
  • What would happen if we held led_mtx during destroy_dev()?
  • Could we use just one lock for everything? What would be the downsides?
F) Kernel API vs. Device Write
  1. Compare led_write() and led_set():
  • Both call led_parse() and led_state() - what's different about how they find the LED?
  • led_write() has size limits - does led_set() need them? Why or why not?
  • Who typically calls each function? Give examples.

Quote the LED lookup logic in both functions.

  1. Find the led_cdevsw declaration (lines 272-276) and explain why it's shared:
  • How many cdevsw structures exist for N LEDs?
  • How does led_write() know which LED it's writing to?
  • Compare this to null.c which had three separate cdevsw structures.
G) System Integration
  1. Examine the initialization (lines 331-341):
  • What does SYSINIT do and when does it run?
  • What are the four resources initialized in led_drvinit()?
  • Why is the callout associated with led_mtx?
  • What's NOT initialized here (compare to null.c's MOD_LOAD)?
  1. Find where the driver registers with SYSINIT:
  • What's the subsystem level (SI_SUB_DRIVERS)?
  • Why not use DEV_MODULE like null.c did?
  • Can this driver be unloaded? Why or why not?
H) Safe Experiments (optional, only if you have a system with physical LEDs)
  1. If your system has LEDs in /dev/led, try these (as root in a VM):
# List available LEDs
ls -l /dev/led/

# Fast blink
echo "f" > /dev/led/SOME_LED_NAME

# Slow blink
echo "f5" > /dev/led/SOME_LED_NAME

# Morse code SOS
echo "m...---..." > /dev/led/SOME_LED_NAME

# Static on
echo "1" > /dev/led/SOME_LED_NAME

# Static off
echo "0" > /dev/led/SOME_LED_NAME

For each test:

  • Which parse case handles the command?
  • What internal pattern string is generated?
  • Estimate the timing you observe and verify against the code.
  1. Try invalid commands and explain the errors:
# Too long
perl -e 'print "f" x 600' > /dev/led/SOME_LED_NAME
# What error? Which line checks this?

# Invalid syntax
echo "xyz" > /dev/led/SOME_LED_NAME
# What error? Which case handles this?

Stretch (thought experiments)

  1. The timer self-rescheduling logic (line 85-86):

Suppose we removed the if (blinkers > 0) check and always called:

callout_reset(&led_ch, hz / 10, led_timeout, p);

Trace what happens when:

  • User writes "f" to an LED (timer starts)
  • Pattern runs for 5 seconds
  • User writes "0" to stop blinking (blinkers -> 0)

What's the symptom? Where's the wasted resource? Why does the current check prevent this?

  1. The write size limit (lines 220-221):

The code rejects writes over 512 bytes. Consider removing this check:

  • What's the immediate risk with malloc(uio->uio_resid, ...)?
  • The parser then allocates an sbuf - what's the risk there?
  • Could an attacker cause a denial of service? How?
  • Why is 512 bytes plenty for any legitimate LED pattern?

Point to the current guard and explain the defense-in-depth principle.

  1. The two-lock design:

Suppose we replaced both led_mtx and led_sx with a single mutex. What would break?

Scenario 1: led_create() calls make_dev() while holding the lock, and make_dev() blocks. What happens to timer callbacks during this time?

Scenario 2: A write operation holds the lock while parsing a complex pattern. What happens to other LEDs' timer updates?

Explain why separating device structure operations (led_sx) from state operations (led_mtx) improves concurrency.

Note: If your system doesn't have physical LEDs, you can still trace through the code and understand the patterns. The mental model of "timer walks list -> interprets codes -> calls callbacks" is the key lesson, not seeing actual lights blink.

Bridge to the next tour

If you can walk the path from user write() to a timer-driven state machine and back to device teardown, you've internalized the write-centric character-device shape with timers and sbuf-powered parsing. Next we'll look at a slightly different shape: a network interface pseudo-device that binds into the ifnet stack (if_tuntap.c). Keep your eyes on three things: how the driver registers with a larger subsystem, how I/O is routed through that subsystem's callbacks, and how open/close/lifecycle differs from the small /dev patterns you've just mastered.

Tour 3 - A pseudo-NIC that is also a character device: tun(4)/tap(4):

Open the file:

% cd /usr/src/sys/net
% less if_tuntap.c

This driver is a perfect "small but real" example of integrating a simple character device with a larger kernel subsystem (the network stack). It exposes /dev/tunN, /dev/tapN, and /dev/vmnetN character devices, while also registering ifnet interfaces that you can ifconfig.

As you read, keep these "anchors" in mind:

  • Character device surface: cdevsw + open/read/write/ioctl/poll/kqueue;

  • Network surface: ifnet + if_attach + bpfattach;

  • Cloning: on-demand creation of /dev/tunN and the corresponding ifnet;

  • how a cdevsw maps open/read/write/ioctl into driver code for three related device names;

  • how opening /dev/tun0 et al. lines up with creating/configuring an ifnet;

  • how data flows both ways: packets from kernel -> user via read(2), and user -> kernel via write(2).

Note

To keep this manageable, code examples below are excerpts from the 2071-line source file. Lines marked with ... have been omitted.

1) Where the character device surface is declared (the cdevsw)

 270: static struct tuntap_driver {
 271: 	struct cdevsw		 cdevsw;
 272: 	int			 ident_flags;
 273: 	struct unrhdr		*unrhdr;
 274: 	struct clonedevs	*clones;
 275: 	ifc_match_f		*clone_match_fn;
 276: 	ifc_create_f		*clone_create_fn;
 277: 	ifc_destroy_f		*clone_destroy_fn;
 278: } tuntap_drivers[] = {
 279: 	{
 280: 		.ident_flags =	0,
 281: 		.cdevsw =	{
 282: 		    .d_version =	D_VERSION,
 283: 		    .d_flags =		D_NEEDMINOR,
 284: 		    .d_open =		tunopen,
 285: 		    .d_read =		tunread,
 286: 		    .d_write =		tunwrite,
 287: 		    .d_ioctl =		tunioctl,
 288: 		    .d_poll =		tunpoll,
 289: 		    .d_kqfilter =	tunkqfilter,
 290: 		    .d_name =		tunname,
 291: 		},
 292: 		.clone_match_fn =	tun_clone_match,
 293: 		.clone_create_fn =	tun_clone_create,
 294: 		.clone_destroy_fn =	tun_clone_destroy,
 295: 	},
 296: 	{
 297: 		.ident_flags =	TUN_L2,
 298: 		.cdevsw =	{
 299: 		    .d_version =	D_VERSION,
 300: 		    .d_flags =		D_NEEDMINOR,
 301: 		    .d_open =		tunopen,
 302: 		    .d_read =		tunread,
 303: 		    .d_write =		tunwrite,
 304: 		    .d_ioctl =		tunioctl,
 305: 		    .d_poll =		tunpoll,
 306: 		    .d_kqfilter =	tunkqfilter,
 307: 		    .d_name =		tapname,
 308: 		},
 309: 		.clone_match_fn =	tap_clone_match,
 310: 		.clone_create_fn =	tun_clone_create,
 311: 		.clone_destroy_fn =	tun_clone_destroy,
 312: 	},
 313: 	{
 314: 		.ident_flags =	TUN_L2 | TUN_VMNET,
 315: 		.cdevsw =	{
 316: 		    .d_version =	D_VERSION,
 317: 		    .d_flags =		D_NEEDMINOR,
 318: 		    .d_open =		tunopen,
 319: 		    .d_read =		tunread,
 320: 		    .d_write =		tunwrite,
 321: 		    .d_ioctl =		tunioctl,
 322: 		    .d_poll =		tunpoll,
 323: 		    .d_kqfilter =	tunkqfilter,
 324: 		    .d_name =		vmnetname,
 325: 		},
 326: 		.clone_match_fn =	vmnet_clone_match,
 327: 		.clone_create_fn =	tun_clone_create,
 328: 		.clone_destroy_fn =	tun_clone_destroy,
 329: 	},
 330: };

This initial fragment demonstrates a clever design pattern: one driver implementation serving three related but distinct device types (tun, tap, and vmnet).

Let's see how it works:

The tuntap_driver Structure (lines 270-278)
struct tuntap_driver {
    struct cdevsw         cdevsw;           // Character device switch table
    int                   ident_flags;      // Identity flags (TUN_L2, TUN_VMNET)
    struct unrhdr        *unrhdr;           // Unit number allocator
    struct clonedevs     *clones;           // Cloning infrastructure
    ifc_match_f          *clone_match_fn;   // Network interface clone matching
    ifc_create_f         *clone_create_fn;  // Network interface creation
    ifc_destroy_f        *clone_destroy_fn; // Network interface destruction
};

This structure combines two kernel subsystems:

  1. Character device operations (cdevsw) - how userspace interacts with /dev/tunN, /dev/tapN, /dev/vmnetN
  2. Network interface cloning (clone_*_fn) - how the corresponding ifnet structures get created
The Critical cdevsw Structure

The cdevsw (character device switch) is FreeBSD's function dispatch table for character devices. Think of it as a vtable or interface:

.d_version   = D_VERSION      // ABI version check
.d_flags     = D_NEEDMINOR    // Device needs minor number tracking
.d_open      = tunopen        // Called on open(2)
.d_read      = tunread        // Called on read(2)
.d_write     = tunwrite       // Called on write(2)
.d_ioctl     = tunioctl       // Called on ioctl(2)
.d_poll      = tunpoll        // Called on poll(2)/select(2)
.d_kqfilter  = tunkqfilter    // Called for kqueue event registration
.d_name      = tunname        // Device name ("tun", "tap", "vmnet")

Key insight: All three device types share the same function implementations (tunopen, tunread, etc.), but behave differently based on ident_flags.

The Three Driver Instances
1. TUN (lines 279-295) - Layer 3 (IP) tunnel
.ident_flags = 0              // No flags = plain TUN device
.d_name = tunname             // "tun"  ->  /dev/tun0, /dev/tun1, ...
  • Point-to-point IP tunnel
  • Packets are raw IP (no Ethernet headers)
  • Used by VPNs like OpenVPN in TUN mode
2. TAP (lines 296-312) - Layer 2 (Ethernet) tunnel
.ident_flags = TUN_L2         // Layer 2 flag
.d_name = tapname             // "tap"  ->  /dev/tap0, /dev/tap1, ...
  • Ethernet-level tunnel
  • Packets include full Ethernet frames
  • Used by VMs, bridges, OpenVPN in TAP mode
3. VMNET (lines 313-329) - VMware compatibility
.ident_flags = TUN_L2 | TUN_VMNET  // Layer 2 + VMware semantics
.d_name = vmnetname                 // "vmnet"  ->  /dev/vmnet0, ...
  • Like TAP but with VMware-specific behavior
  • Different lifecycle rules (survives interface down)
How This Achieves Code Reuse

Notice that all three entries use identical function pointers:

  • tunopen handles opening all three device types
  • tunread/tunwrite handle I/O for all three
  • The functions check tp->tun_flags (derived from ident_flags) to determine behavior

For example, in tunopen, you'll see:

if ((tp->tun_flags & TUN_L2) != 0) {
    // TAP/VMNET-specific setup
} else {
    // TUN-specific setup
}
The Cloning Functions

Each driver has different clone match functions but shares create/destroy:

  • tun_clone_match - matches "tun" or "tunN"
  • tap_clone_match - matches "tap" or "tapN"
  • vmnet_clone_match - matches "vmnet" or "vmnetN"
  • All use tun_clone_create - shared creation logic
  • All use tun_clone_destroy - shared destruction logic

This lets the kernel automatically create /dev/tun0 when someone opens it, even if it doesn't exist yet.

2) From clone request -> cdev creation -> ifnet attach

2.1 Clone creation (tun_clone_create): pick name/unit, ensure cdev, then hand off to tuncreate

 520: tun_clone_create(struct if_clone *ifc, char *name, size_t len,
 521:     struct ifc_data *ifd, struct ifnet **ifpp)
 522: {
 523: 	struct tuntap_driver *drv;
 524: 	struct cdev *dev;
 525: 	int err, i, tunflags, unit;
 526: 
 527: 	tunflags = 0;
 528: 	/* The name here tells us exactly what we're creating */
 529: 	err = tuntap_name2info(name, &unit, &tunflags);
 530: 	if (err != 0)
 531: 		return (err);
 532: 
 533: 	drv = tuntap_driver_from_flags(tunflags);
 534: 	if (drv == NULL)
 535: 		return (ENXIO);
 536: 
 537: 	if (unit != -1) {
 538: 		/* If this unit number is still available that's okay. */
 539: 		if (alloc_unr_specific(drv->unrhdr, unit) == -1)
 540: 			return (EEXIST);
 541: 	} else {
 542: 		unit = alloc_unr(drv->unrhdr);
 543: 	}
 544: 
 545: 	snprintf(name, IFNAMSIZ, "%s%d", drv->cdevsw.d_name, unit);
 546: 
 547: 	/* find any existing device, or allocate new unit number */
 548: 	dev = NULL;
 549: 	i = clone_create(&drv->clones, &drv->cdevsw, &unit, &dev, 0);
 550: 	/* No preexisting struct cdev *, create one */
 551: 	if (i != 0)
 552: 		i = tun_create_device(drv, unit, NULL, &dev, name);
 553: 	if (i == 0) {
 554: 		dev_ref(dev);
 555: 		tuncreate(dev);
 556: 		struct tuntap_softc *tp = dev->si_drv1;
 557: 		*ifpp = tp->tun_ifp;
 558: 	}
 559: 	return (i);
 560: }

The tun_clone_create function serves as the bridge between FreeBSD's network interface cloning subsystem and character device creation. This function is invoked when a user executes commands like ifconfig tun0 create or ifconfig tap1 create, and its responsibility is to create both a character device (/dev/tun0) and its corresponding network interface.

Function Signature and Purpose
static int
tun_clone_create(struct if_clone *ifc, char *name, size_t len,
    struct ifc_data *ifd, struct ifnet **ifpp)

The function receives an interface name (like "tun0" or "tap3") and must return a pointer to a newly created ifnet structure through the ifpp parameter. Success returns 0; errors return the appropriate errno values, such as EEXIST or ENXIO.

Parsing the Interface Name

The first step extracts meaning from the interface name:

tunflags = 0;
err = tuntap_name2info(name, &unit, &tunflags);
if (err != 0)
    return (err);

The tuntap_name2info helper function parses strings like "tap3" or "vmnet1" to extract:

  • The unit number (3, 1, etc.)
  • The type flags that determine device behavior (0 for tun, TUN_L2 for tap, TUN_L2|TUN_VMNET for vmnet)

If the name contains no unit number (e.g., just "tun"), the function returns -1 for the unit, signaling that any available unit should be allocated.

Locating the Appropriate Driver
drv = tuntap_driver_from_flags(tunflags);
if (drv == NULL)
    return (ENXIO);

The extracted flags determine which entry from the tuntap_drivers[] array will handle this device. This lookup returns the tuntap_driver structure containing the correct cdevsw and device name ("tun", "tap", or "vmnet").

Unit Number Allocation

The driver maintains a unit number allocator (unrhdr) to prevent conflicts:

if (unit != -1) {
    /* User requested specific unit number */
    if (alloc_unr_specific(drv->unrhdr, unit) == -1)
        return (EEXIST);
} else {
    /* Allocate any available unit */
    unit = alloc_unr(drv->unrhdr);
}

The unrhdr (unit number handler) ensures thread-safe allocation of device minor numbers. When a user requests a specific unit (e.g., "tun3"), alloc_unr_specific either reserves that number or returns failure if already allocated. When no specific unit is requested, alloc_unr selects the next available number.

This mechanism prevents race conditions where multiple processes simultaneously attempt to create the same device unit, as the allocation is serialized by the global tunmtx mutex.

Name Normalization

After unit allocation, the function normalizes the interface name:

snprintf(name, IFNAMSIZ, "%s%d", drv->cdevsw.d_name, unit);

If the user specified ifconfig tun create without a unit number, this formats the name with the newly allocated unit, producing strings like "tun0" or "tun1". The name parameter serves as both input and output, the caller's buffer receives the finalized name.

Character Device Creation
dev = NULL;
i = clone_create(&drv->clones, &drv->cdevsw, &unit, &dev, 0);
if (i != 0)
    i = tun_create_device(drv, unit, NULL, &dev, name);

This section handles an important subtlety: the character device may already exist. The clone_create call searches for an existing /dev/tun0 device node, which might have been created earlier through devfs cloning when a process opened the device path.

When clone_create returns non-zero (device not found), the code calls tun_create_device to construct a new struct cdev. This dual-path approach accommodates two creation scenarios:

  1. A process opens /dev/tun0 before any network configuration, triggering devfs cloning
  2. A user runs ifconfig tun0 create, explicitly requesting interface creation
Network Interface Instantiation

The final step connects the character device to the network subsystem:

if (i == 0) {
    dev_ref(dev);
    tuncreate(dev);
    struct tuntap_softc *tp = dev->si_drv1;
    *ifpp = tp->tun_ifp;
}

After successful device creation or lookup:

  • dev_ref(dev) increments the device's reference count, preventing premature destruction during initialization
  • tuncreate(dev) allocates and initializes the ifnet structure, registering it with the network stack
  • dev->si_drv1 provides the critical linkage, this field points to the tuntap_softc structure, which contains both character device state and the ifnet pointer
  • *ifpp = tp->tun_ifp returns the newly created network interface to the if_clone subsystem
Coordination Architecture

The tun_clone_create function exemplifies a coordination pattern common in kernel drivers. It performs no heavy lifting itself, instead orchestrating several subsystems:

  1. Name parsing determines device type and unit
  2. Driver lookup selects the appropriate cdevsw dispatch table
  3. Unit allocation ensures uniqueness
  4. Device lookup or creation establishes the character device presence
  5. Interface creation registers with the network stack

This separation allows two independent creation paths, character device access and network configuration, to converge correctly regardless of invocation order.

The si_drv1 field serves as the architectural keystone, linking the character device world (struct cdev, file operations, /dev namespace) with the network world (struct ifnet, packet processing, ifconfig visibility). Every subsequent operation, whether a read(2) system call or packet transmission, will traverse this link to access the shared tuntap_softc state.

2.2 Create the cdev and wire si_drv1 (tun_create_device)

 807: static int
 808: tun_create_device(struct tuntap_driver *drv, int unit, struct ucred *cr,
 809:     struct cdev **dev, const char *name)
 810: {
 811: 	struct make_dev_args args;
 812: 	struct tuntap_softc *tp;
 813: 	int error;
 814: 
 815: 	tp = malloc(sizeof(*tp), M_TUN, M_WAITOK | M_ZERO);
 816: 	mtx_init(&tp->tun_mtx, "tun_mtx", NULL, MTX_DEF);
 817: 	cv_init(&tp->tun_cv, "tun_condvar");
 818: 	tp->tun_flags = drv->ident_flags;
 819: 	tp->tun_drv = drv;
 820: 
 821: 	make_dev_args_init(&args);
 822: 	if (cr != NULL)
 823: 		args.mda_flags = MAKEDEV_REF | MAKEDEV_CHECKNAME;
 824: 	args.mda_devsw = &drv->cdevsw;
 825: 	args.mda_cr = cr;
 826: 	args.mda_uid = UID_UUCP;
 827: 	args.mda_gid = GID_DIALER;
 828: 	args.mda_mode = 0600;
 829: 	args.mda_unit = unit;
 830: 	args.mda_si_drv1 = tp;
 831: 	error = make_dev_s(&args, dev, "%s", name);
 832: 	if (error != 0) {
 833: 		free(tp, M_TUN);
 834: 		return (error);
 835: 	}
 836: 
 837: 	KASSERT((*dev)->si_drv1 != NULL,
 838: 	    ("Failed to set si_drv1 at %s creation", name));
 839: 	tp->tun_dev = *dev;
 840: 	knlist_init_mtx(&tp->tun_rsel.si_note, &tp->tun_mtx);
 841: 	mtx_lock(&tunmtx);
 842: 	TAILQ_INSERT_TAIL(&tunhead, tp, tun_list);
 843: 	mtx_unlock(&tunmtx);
 844: 	return (0);
 845: }

The tun_create_device function constructs the character device node and its associated driver state. This is the point where /dev/tun0, /dev/tap0, or /dev/vmnet0 actually come into existence in the device filesystem.

Function Parameters
static int
tun_create_device(struct tuntap_driver *drv, int unit, struct ucred *cr,
    struct cdev **dev, const char *name)

The function accepts:

  • drv - pointer to the appropriate entry in tuntap_drivers[]
  • unit - the allocated device unit number (0, 1, 2, etc.)
  • cr - credential context (NULL for kernel-initiated creation, non-NULL for user-initiated)
  • dev - output parameter receiving the created struct cdev pointer
  • name - the complete device name string ("tun0", "tap3", etc.)
Allocating the Softc Structure
tp = malloc(sizeof(*tp), M_TUN, M_WAITOK | M_ZERO);
mtx_init(&tp->tun_mtx, "tun_mtx", NULL, MTX_DEF);
cv_init(&tp->tun_cv, "tun_condvar");
tp->tun_flags = drv->ident_flags;
tp->tun_drv = drv;

Every tun/tap/vmnet device instance requires a tuntap_softc structure to maintain its state. This structure contains everything needed to operate the device: flags, the associated network interface pointer, I/O synchronization primitives, and references back to the driver.

The allocation uses M_WAITOK, allowing the function to sleep if memory is temporarily unavailable. The M_ZERO flag ensures all fields initialize to zero, providing safe defaults for pointers and counters.

Two synchronization primitives are initialized:

  • tun_mtx - a mutex protecting the softc's mutable fields
  • tun_cv - a condition variable used during device destruction to wait for all operations to complete

The tun_flags field receives the driver's identity flags (0, TUN_L2, or TUN_L2|TUN_VMNET), establishing whether this instance behaves as a tun, tap, or vmnet device. The tun_drv backpointer allows the softc to access its parent driver's resources like the unit number allocator.

Preparing Device Creation Arguments

FreeBSD's modern device creation API uses a structure to pass parameters rather than a long argument list:

make_dev_args_init(&args);
if (cr != NULL)
    args.mda_flags = MAKEDEV_REF | MAKEDEV_CHECKNAME;
args.mda_devsw = &drv->cdevsw;
args.mda_cr = cr;
args.mda_uid = UID_UUCP;
args.mda_gid = GID_DIALER;
args.mda_mode = 0600;
args.mda_unit = unit;
args.mda_si_drv1 = tp;

The make_dev_args structure configures every aspect of the device node:

Flags: When cr is non-NULL (user-initiated creation), two flags are set:

  • MAKEDEV_REF - automatically add a reference to prevent immediate destruction
  • MAKEDEV_CHECKNAME - validate the name doesn't conflict with existing devices

Dispatch table: mda_devsw points to the cdevsw containing function pointers for open, read, write, ioctl, etc. This is how the kernel knows which functions to call when userspace performs operations on this device.

Credentials: mda_cr associates the creating user's credentials with the device, used for permission checks.

Ownership and permissions: The device node will be owned by the uucp user and dialer group with mode 0600 (read/write for owner only). These historical Unix conventions reflect the original use of serial devices for dial-up networking. In practice, administrators often adjust these permissions via devfs.rules or by having privileged daemons open the devices.

Unit number: mda_unit embeds the unit number into the device's minor number, allowing the kernel to distinguish /dev/tun0 from /dev/tun1.

Private data: mda_si_drv1 is crucial, this field will become the si_drv1 member of the created struct cdev, establishing the link from character device to driver state. Every subsequent operation on the device will retrieve the softc via this field.

Creating the Device Node
error = make_dev_s(&args, dev, "%s", name);
if (error != 0) {
    free(tp, M_TUN);
    return (error);
}

The make_dev_s call creates the struct cdev and registers it with devfs. If successful, *dev receives a pointer to the new device structure. The "%s" format string and name argument specify the device node path within /dev.

Common failure modes include:

  • Name conflicts (a device with that name already exists)
  • Resource exhaustion (out of kernel memory)
  • Devfs subsystem errors

On failure, the function immediately deallocates the softc and returns the error to the caller. This prevents resource leaks.

Finalizing Device State
KASSERT((*dev)->si_drv1 != NULL,
    ("Failed to set si_drv1 at %s creation", name));
tp->tun_dev = *dev;
knlist_init_mtx(&tp->tun_rsel.si_note, &tp->tun_mtx);

The KASSERT is a development-time sanity check verifying that make_dev_s correctly populated si_drv1 from mda_si_drv1. This assertion would fire during kernel development if the device creation logic broke, but it compiles away in release builds.

The tp->tun_dev assignment creates the reverse link: while si_drv1 points from cdev to softc, tun_dev points from softc to cdev. This bidirectional linkage allows code to traverse in either direction.

The knlist_init_mtx call initializes the kqueue notification list protected by the softc's mutex. This infrastructure supports kqueue(2) event monitoring, allowing userspace applications to efficiently wait for readable/writable conditions on the device.

Global Registration
mtx_lock(&tunmtx); 
TAILQ_INSERT_TAIL(&tunhead, tp, tun_list); 
mtx_unlock(&tunmtx); 
return (0);

Finally, the new device registers itself in the global tunhead list. This list allows the driver to enumerate all active tun/tap/vmnet instances, which is necessary during module unload or system-wide operations.

The tunmtx mutex protects the list from concurrent modification. Multiple threads might simultaneously create devices, so this lock ensures list consistency.

The Created Device State

At function completion, several kernel objects exist and are properly linked:

/dev/tun0 (struct cdev)
     ->  si_drv1
tuntap_softc
     ->  tun_dev
/dev/tun0 (struct cdev)
     ->  tun_drv
tuntap_drivers[0]

The softc is registered in the global device list, ready for both character device operations and network interface attachment. However, the network interface (ifnet) does not yet exist, that will be created by the tuncreate function.

This separation of concerns, character device creation versus network interface creation, allows the two subsystems to initialize independently and in flexible order.

2.3 Build & attach the ifnet (tuncreate): L2 (tap) vs L3 (tun)

 950: static void
 951: tuncreate(struct cdev *dev)
 952: {
 953: 	struct tuntap_driver *drv;
 954: 	struct tuntap_softc *tp;
 955: 	struct ifnet *ifp;
 956: 	struct ether_addr eaddr;
 957: 	int iflags;
 958: 	u_char type;
 959: 
 960: 	tp = dev->si_drv1;
 961: 	KASSERT(tp != NULL,
 962: 	    ("si_drv1 should have been initialized at creation"));
 963: 
 964: 	drv = tp->tun_drv;
 965: 	iflags = IFF_MULTICAST;
 966: 	if ((tp->tun_flags & TUN_L2) != 0) {
 967: 		type = IFT_ETHER;
 968: 		iflags |= IFF_BROADCAST | IFF_SIMPLEX;
 969: 	} else {
 970: 		type = IFT_PPP;
 971: 		iflags |= IFF_POINTOPOINT;
 972: 	}
 973: 	ifp = tp->tun_ifp = if_alloc(type);
 974: 	ifp->if_softc = tp;
 975: 	if_initname(ifp, drv->cdevsw.d_name, dev2unit(dev));
 976: 	ifp->if_ioctl = tunifioctl;
 977: 	ifp->if_flags = iflags;
 978: 	IFQ_SET_MAXLEN(&ifp->if_snd, ifqmaxlen);
 979: 	ifp->if_capabilities |= IFCAP_LINKSTATE | IFCAP_MEXTPG;
 980: 	if ((tp->tun_flags & TUN_L2) != 0)
 981: 		ifp->if_capabilities |=
 982: 		    IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6 | IFCAP_LRO;
 983: 	ifp->if_capenable |= IFCAP_LINKSTATE | IFCAP_MEXTPG;
 984: 
 985: 	if ((tp->tun_flags & TUN_L2) != 0) {
 986: 		ifp->if_init = tunifinit;
 987: 		ifp->if_start = tunstart_l2;
 988: 		ifp->if_transmit = tap_transmit;
 989: 		ifp->if_qflush = if_qflush;
 990: 
 991: 		ether_gen_addr(ifp, &eaddr);
 992: 		ether_ifattach(ifp, eaddr.octet);
 993: 	} else {
 994: 		ifp->if_mtu = TUNMTU;
 995: 		ifp->if_start = tunstart;
 996: 		ifp->if_output = tunoutput;
 997: 
 998: 		ifp->if_snd.ifq_drv_maxlen = 0;
 999: 		IFQ_SET_READY(&ifp->if_snd);
1000: 
1001: 		if_attach(ifp);
1002: 		bpfattach(ifp, DLT_NULL, sizeof(u_int32_t));
1003: 	}
1004: 
1005: 	TUN_LOCK(tp);
1006: 	tp->tun_flags |= TUN_INITED;
1007: 	TUN_UNLOCK(tp);
1008: 
1009: 	TUNDEBUG(ifp, "interface %s is created, minor = %#x\n",
1010: 	    ifp->if_xname, dev2unit(dev));
1011: }

The tuncreate function constructs and registers the network interface (ifnet) corresponding to a character device. After this function completes, the device appears in ifconfig output and can participate in network operations. This is where the character device world and the network stack converge.

Retrieving Driver Context
tp = dev->si_drv1;
KASSERT(tp != NULL,
    ("si_drv1 should have been initialized at creation"));

drv = tp->tun_drv;

The function begins by traversing the link from struct cdev to tuntap_softc established during device creation. The assertion verifies this fundamental invariant, every device must have an associated softc. The tun_drv field provides access to the driver-level resources and configuration.

Determining Interface Type and Flags
iflags = IFF_MULTICAST;
if ((tp->tun_flags & TUN_L2) != 0) {
    type = IFT_ETHER;
    iflags |= IFF_BROADCAST | IFF_SIMPLEX;
} else {
    type = IFT_PPP;
    iflags |= IFF_POINTOPOINT;
}

The interface type and behavior flags depend on whether this is a layer 2 (Ethernet) or layer 3 (IP) tunnel:

Layer 2 devices (tap/vmnet with TUN_L2 set):

  • IFT_ETHER - declares this as an Ethernet interface
  • IFF_BROADCAST - supports broadcast transmission
  • IFF_SIMPLEX - cannot receive its own transmissions (standard for Ethernet)
  • IFF_MULTICAST - supports multicast groups

Layer 3 devices (tun without TUN_L2):

  • IFT_PPP - declares this as a point-to-point protocol interface
  • IFF_POINTOPOINT - has exactly one peer (no broadcast domain)
  • IFF_MULTICAST - supports multicast (though less meaningful for point-to-point)

These flags control how the network stack treats the interface. For example, routing code uses IFF_POINTOPOINT to determine whether a route needs a gateway address or just a destination.

Allocating and Initializing the Interface
ifp = tp->tun_ifp = if_alloc(type);
ifp->if_softc = tp;
if_initname(ifp, drv->cdevsw.d_name, dev2unit(dev));

The if_alloc function allocates a struct ifnet of the specified type. This structure is the network stack's representation of the interface, containing packet queues, statistics counters, capability flags, and function pointers.

Three critical linkages are established:

  1. tp->tun_ifp = if_alloc(type) - softc points to ifnet
  2. ifp->if_softc = tp - ifnet points back to softc
  3. if_initname(ifp, drv->cdevsw.d_name, dev2unit(dev)) - associates the interface name ("tun0") with the ifnet

The bidirectional linkage allows code working with either representation to access the other. Network code receiving a packet can find the character device state; character device operations can access network statistics.

Configuring Interface Operations
ifp->if_ioctl = tunifioctl;
ifp->if_flags = iflags;
IFQ_SET_MAXLEN(&ifp->if_snd, ifqmaxlen);

The if_ioctl function pointer handles interface configuration requests like SIOCSIFADDR (set address), SIOCSIFMTU (set MTU), and SIOCSIFFLAGS (set flags). This is distinct from the character device's ioctl handler, which processes device-specific commands.

The interface flags are copied from the previously determined iflags value. The send queue's maximum length is set to ifqmaxlen (typically 50), limiting how many packets can await transmission to userspace.

Setting Interface Capabilities
ifp->if_capabilities |= IFCAP_LINKSTATE | IFCAP_MEXTPG;
if ((tp->tun_flags & TUN_L2) != 0)
    ifp->if_capabilities |=
        IFCAP_RXCSUM | IFCAP_RXCSUM_IPV6 | IFCAP_LRO;
ifp->if_capenable |= IFCAP_LINKSTATE | IFCAP_MEXTPG;

Interface capabilities declare what hardware offload features the device supports. Two sets of flags exist:

  • if_capabilities - features the interface can support
  • if_capenable - features currently enabled

All interfaces support:

  • IFCAP_LINKSTATE - can report link up/down state changes
  • IFCAP_MEXTPG - supports multi-page external mbufs (zero-copy optimization)

Layer 2 interfaces additionally support:

  • IFCAP_RXCSUM - receive checksum offload for IPv4
  • IFCAP_RXCSUM_IPV6 - receive checksum offload for IPv6
  • IFCAP_LRO - Large Receive Offload (TCP segment coalescing)

These capabilities are initially disabled for tap/vmnet devices. When userspace enables virtio-net header mode via the TAPSVNETHDR ioctl, additional transmit capabilities become available, and the code updates these flags accordingly.

Layer 2 Interface Registration
if ((tp->tun_flags & TUN_L2) != 0) {
    ifp->if_init = tunifinit;
    ifp->if_start = tunstart_l2;
    ifp->if_transmit = tap_transmit;
    ifp->if_qflush = if_qflush;

    ether_gen_addr(ifp, &eaddr);
    ether_ifattach(ifp, eaddr.octet);

For Ethernet interfaces, four function pointers configure packet processing:

  • if_init - called when the interface transitions to the up state
  • if_start - legacy packet transmission (called by the send queue)
  • if_transmit - modern packet transmission (bypasses send queue when possible)
  • if_qflush - discards queued packets

The ether_gen_addr function generates a random MAC address for the local side of the tunnel. The address uses the locally-administered bit pattern, ensuring it doesn't conflict with real hardware addresses.

ether_ifattach performs Ethernet-specific registration:

  • Registers the interface with the network stack
  • Attaches BPF (Berkeley Packet Filter) with DLT_EN10MB (Ethernet) link type
  • Initializes the interface's link-layer address structure
  • Sets up multicast filter management

After ether_ifattach, the interface is fully operational and visible to userspace tools.

Layer 3 Interface Registration
} else {
    ifp->if_mtu = TUNMTU;
    ifp->if_start = tunstart;
    ifp->if_output = tunoutput;

    ifp->if_snd.ifq_drv_maxlen = 0;
    IFQ_SET_READY(&ifp->if_snd);

    if_attach(ifp);
    bpfattach(ifp, DLT_NULL, sizeof(u_int32_t));
}

Point-to-point interfaces follow a simpler path:

The MTU is set to TUNMTU (typically 1500), and two packet transmission functions are installed:

  • if_start - handles packets from the send queue
  • if_output - called directly by the routing code

The if_snd.ifq_drv_maxlen = 0 setting is significant, it prevents the legacy send queue from holding packets, as the modern path uses if_transmit semantics even though the function pointer isn't set. IFQ_SET_READY marks the queue as operational.

if_attach registers the interface with the network stack, making it visible to routing and configuration tools.

bpfattach enables packet capture with DLT_NULL link type. This link type prepends a 4-byte address family field (AF_INET or AF_INET6) to each packet, allowing tools like tcpdump to distinguish IPv4 from IPv6 traffic without examining packet contents.

Marking Initialization Complete
TUN_LOCK(tp);
tp->tun_flags |= TUN_INITED;
TUN_UNLOCK(tp);

The TUN_INITED flag signals that the interface is fully constructed. Other code paths check this flag before performing operations. For example, the device open function verifies that both TUN_INITED and TUN_OPEN are set before allowing I/O.

The mutex protects this flag from races where one thread checks the state while another is still initializing.

The Completed Interface

After tuncreate returns, both the character device and network interface exist and are cross-linked:

/dev/tun0 (struct cdev)
     <->  si_drv1 / tun_dev
tuntap_softc
     <->  if_softc / tun_ifp
tun0 (struct ifnet)

Opening /dev/tun0 with open(2) allows userspace to read and write packets. Transmitting packets to the tun0 interface via sendto(2) or routing queues them for userspace to read. This bidirectional connection enables userspace VPN and virtualization software to implement custom network protocols while integrating seamlessly with the kernel's network stack.

1064: static int
1065: tunopen(struct cdev *dev, int flag, int mode, struct thread *td)
1066: {
1067: 	struct ifnet	*ifp;
1068: 	struct tuntap_softc *tp;
1069: 	int error __diagused, tunflags;
1070: 
1071: 	tunflags = 0;
1072: 	CURVNET_SET(TD_TO_VNET(td));
1073: 	error = tuntap_name2info(dev->si_name, NULL, &tunflags);
1074: 	if (error != 0) {
1075: 		CURVNET_RESTORE();
1076: 		return (error);	/* Shouldn't happen */
1077: 	}
1078: 
1079: 	tp = dev->si_drv1;
1080: 	KASSERT(tp != NULL,
1081: 	    ("si_drv1 should have been initialized at creation"));
1082: 
1083: 	TUN_LOCK(tp);
1084: 	if ((tp->tun_flags & TUN_INITED) == 0) {
1085: 		TUN_UNLOCK(tp);
1086: 		CURVNET_RESTORE();
1087: 		return (ENXIO);
1088: 	}
1089: 	if ((tp->tun_flags & (TUN_OPEN | TUN_DYING)) != 0) {
1090: 		TUN_UNLOCK(tp);
1091: 		CURVNET_RESTORE();
1092: 		return (EBUSY);
1093: 	}
1094: 
1095: 	error = tun_busy_locked(tp);
1096: 	KASSERT(error == 0, ("Must be able to busy an unopen tunnel"));
1097: 	ifp = TUN2IFP(tp);
1098: 
1099: 	if ((tp->tun_flags & TUN_L2) != 0) {
1100: 		bcopy(IF_LLADDR(ifp), tp->tun_ether.octet,
1101: 		    sizeof(tp->tun_ether.octet));
1102: 
1103: 		ifp->if_drv_flags |= IFF_DRV_RUNNING;
1104: 		ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;
1105: 
1106: 		if (tapuponopen)
1107: 			ifp->if_flags |= IFF_UP;
1108: 	}
1109: 
1110: 	tp->tun_pid = td->td_proc->p_pid;
1111: 	tp->tun_flags |= TUN_OPEN;
1112: 
1113: 	if_link_state_change(ifp, LINK_STATE_UP);
1114: 	TUNDEBUG(ifp, "open\n");
1115: 	TUN_UNLOCK(tp);
1116: 	/* ... cdevpriv setup ... */
1117: 	(void)devfs_set_cdevpriv(tp, tundtor);
1118: 	CURVNET_RESTORE();
1119: 	return (0);
1120: }

The tunopen function handles the open(2) system call on tun/tap/vmnet character devices. This is the entry point where userspace applications like VPN daemons or virtual machine monitors gain control over a network interface. Opening the device transitions it from an initialized but inactive state to an operational state ready for packet I/O.

Function Signature and Virtual Network Context
static int
tunopen(struct cdev *dev, int flag, int mode, struct thread *td)
{
    CURVNET_SET(TD_TO_VNET(td));

The function receives the standard character device open parameters: the device being opened, flags from the open(2) call, mode bits, and the thread performing the operation.

The CURVNET_SET macro is critical for FreeBSD's VNET (virtual network stack) support. In systems using jails or virtualization, multiple independent network stacks may exist. This macro switches to the network context associated with the opening thread's jail or vnet, ensuring all subsequent network operations affect the correct stack. Every function that touches network interfaces or routing tables must bracket its work between CURVNET_SET and CURVNET_RESTORE.

Device Type Validation
tunflags = 0;
error = tuntap_name2info(dev->si_name, NULL, &tunflags);
if (error != 0) {
    CURVNET_RESTORE();
    return (error);
}

Although the device should already exist and be properly typed, this code validates that the device name still corresponds to a known tun/tap/vmnet variant. The check should always succeed, as indicated by the comment "Shouldn't happen". The validation guards against corrupted kernel state or race conditions during device destruction.

Retrieving and Validating Device State
tp = dev->si_drv1;
KASSERT(tp != NULL,
    ("si_drv1 should have been initialized at creation"));

TUN_LOCK(tp);
if ((tp->tun_flags & TUN_INITED) == 0) {
    TUN_UNLOCK(tp);
    CURVNET_RESTORE();
    return (ENXIO);
}

The softc is retrieved via the si_drv1 link established during device creation. The assertion verifies this fundamental invariant.

The softc mutex is acquired before checking state flags, preventing race conditions. The TUN_INITED flag check ensures the network interface was successfully created. If initialization failed or hasn't completed yet, the open fails with ENXIO (device not configured).

Enforcing Exclusive Access
if ((tp->tun_flags & (TUN_OPEN | TUN_DYING)) != 0) {
    TUN_UNLOCK(tp);
    CURVNET_RESTORE();
    return (EBUSY);
}

Tun/tap devices enforce exclusive access, only one process may have a device open at a time. This design simplifies packet routing: there's always exactly one userspace consumer for packets arriving at the interface.

The check examines two flags:

  • TUN_OPEN - device is already open by another process
  • TUN_DYING - device is being destroyed

Either condition returns EBUSY, informing userspace that the device is unavailable. This prevents scenarios where multiple VPN daemons fight over the same tunnel or where a process opens a device mid-destruction.

Marking the Device Busy
error = tun_busy_locked(tp);
KASSERT(error == 0, ("Must be able to busy an unopen tunnel"));
ifp = TUN2IFP(tp);

The busy mechanism prevents device destruction while operations are in progress. The tun_busy_locked function increments the tun_busy counter and fails if TUN_DYING is set.

The assertion verifies that marking the device busy must succeed, since we hold the lock and already checked that neither TUN_OPEN nor TUN_DYING is set, no concurrent destruction can be occurring.

The TUN2IFP macro extracts the ifnet pointer from the softc, providing access to the network interface for subsequent configuration.

Layer 2 Interface Activation
if ((tp->tun_flags & TUN_L2) != 0) {
    bcopy(IF_LLADDR(ifp), tp->tun_ether.octet,
        sizeof(tp->tun_ether.octet));

    ifp->if_drv_flags |= IFF_DRV_RUNNING;
    ifp->if_drv_flags &= ~IFF_DRV_OACTIVE;

    if (tapuponopen)
        ifp->if_flags |= IFF_UP;
}

For Ethernet interfaces (tap/vmnet), opening the device activates several features:

The MAC address is copied from the interface to tp->tun_ether. This snapshot preserves the "remote" MAC address that userspace might need. While the interface itself knows its local MAC address, the softc stores this copy for symmetric access patterns.

Two driver flags are updated:

  • IFF_DRV_RUNNING - signals that the driver is ready to transmit and receive
  • IFF_DRV_OACTIVE - cleared to indicate output is not blocked

These "driver flags" (if_drv_flags) are distinct from interface flags (if_flags). Driver flags reflect the device driver's internal state, while interface flags reflect administratively configured properties.

The tapuponopen sysctl controls whether opening the device automatically marks the interface administratively up. When enabled, ifp->if_flags |= IFF_UP brings the interface up without requiring a separate ifconfig tap0 up command. This convenience feature is disabled by default to maintain traditional Unix semantics where device availability and interface state are orthogonal.

Recording Ownership
tp->tun_pid = td->td_proc->p_pid;
tp->tun_flags |= TUN_OPEN;

The controlling process's PID is recorded in tun_pid. This information appears in ifconfig output and helps administrators identify which process owns each tunnel. While not used for access control (the file descriptor provides that), it's valuable for debugging and monitoring.

The TUN_OPEN flag is set, transitioning the device into the open state. Subsequent open attempts will now fail with EBUSY until this process closes the device.

if_link_state_change(ifp, LINK_STATE_UP);
TUNDEBUG(ifp, "open\n");
TUN_UNLOCK(tp);

The if_link_state_change call notifies the network stack that the interface's link is now up. This generates routing socket messages that daemons like devd can monitor, and it updates the interface's link state visible in ifconfig output.

For physical Ethernet interfaces, link state reflects cable connection status. For tun/tap devices, link state reflects whether userspace has the device open. This semantic mapping allows routing protocols and management tools to treat virtual interfaces consistently with physical ones.

The debug message logs the open event, and the mutex is released before the final setup step.

Establishing Close Notification
(void)devfs_set_cdevpriv(tp, tundtor);
CURVNET_RESTORE();
return (0);

The devfs_set_cdevpriv call associates the softc with this file descriptor and registers tundtor (tunnel destructor) as the cleanup function. When the file descriptor is closed, whether explicitly via close(2) or implicitly via process termination, the kernel automatically invokes tundtor to tear down the device state.

This mechanism provides robust cleanup semantics. Even if a process crashes or is killed, the kernel ensures proper device shutdown. The function pointer and data association are per-file-descriptor, allowing the same device to be opened multiple times in succession (though not concurrently) with correct cleanup for each instance.

The return value 0 signals successful open. At this point, userspace can begin reading packets transmitted to the interface and writing packets to inject into the network stack.

State Transitions

The open operation transitions the device through several states:

Device created  ->  TUN_INITED set
     -> 
tunopen() called
     -> 
Check exclusive access
     -> 
Mark busy (prevent destruction)
     -> 
Configure interface (L2: set RUNNING, optionally set UP)
     -> 
Record owner PID
     -> 
Set TUN_OPEN flag
     -> 
Signal link state UP
     -> 
Register close handler
     -> 
Device ready for I/O

After successful open, the device exists in three interlinked representations:

  • Character device node (/dev/tun0) with an open file descriptor
  • Network interface (tun0) with link state UP
  • Softc structure binding them with TUN_OPEN set

Packets can now flow bidirectionally: the network stack queues outbound packets for userspace to read, and userspace writes inbound packets for the network stack to process.

4) read(2): userspace receives a whole packet (or EWOULDBLOCK)

1706: /*
1707:  * The cdevsw read interface - reads a packet at a time, or at
1708:  * least as much of a packet as can be read.
1709:  */
1710: static	int
1711: tunread(struct cdev *dev, struct uio *uio, int flag)
1712: {
1713: 	struct tuntap_softc *tp = dev->si_drv1;
1714: 	struct ifnet	*ifp = TUN2IFP(tp);
1715: 	struct mbuf	*m;
1716: 	size_t		len;
1717: 	int		error = 0;
1718: 
1719: 	TUNDEBUG (ifp, "read\n");
1720: 	TUN_LOCK(tp);
1721: 	if ((tp->tun_flags & TUN_READY) != TUN_READY) {
1722: 		TUN_UNLOCK(tp);
1723: 		TUNDEBUG (ifp, "not ready 0%o\n", tp->tun_flags);
1724: 		return (EHOSTDOWN);
1725: 	}
1726: 
1727: 	tp->tun_flags &= ~TUN_RWAIT;
1728: 
1729: 	for (;;) {
1730: 		IFQ_DEQUEUE(&ifp->if_snd, m);
1731: 		if (m != NULL)
1732: 			break;
1733: 		if (flag & O_NONBLOCK) {
1734: 			TUN_UNLOCK(tp);
1735: 			return (EWOULDBLOCK);
1736: 		}
1737: 		tp->tun_flags |= TUN_RWAIT;
1738: 		error = mtx_sleep(tp, &tp->tun_mtx, PCATCH | (PZERO + 1),
1739: 		    "tunread", 0);
1740: 		if (error != 0) {
1741: 			TUN_UNLOCK(tp);
1742: 			return (error);
1743: 		}
1744: 	}
1745: 	TUN_UNLOCK(tp);
1746: 
1747: 	len = min(tp->tun_vhdrlen, uio->uio_resid);
1748: 	if (len > 0) {
1749: 		struct virtio_net_hdr_mrg_rxbuf vhdr;
1750: 
1751: 		bzero(&vhdr, sizeof(vhdr));
1752: 		if (m->m_pkthdr.csum_flags & TAP_ALL_OFFLOAD) {
1753: 			m = virtio_net_tx_offload(ifp, m, false, &vhdr.hdr);
1754: 		}
1755: 
1756: 		TUNDEBUG(ifp, "txvhdr: f %u, gt %u, hl %u, "
1757: 		    "gs %u, cs %u, co %u\n", vhdr.hdr.flags,
1758: 		    vhdr.hdr.gso_type, vhdr.hdr.hdr_len,
1759: 		    vhdr.hdr.gso_size, vhdr.hdr.csum_start,
1760: 		    vhdr.hdr.csum_offset);
1761: 		error = uiomove(&vhdr, len, uio);
1762: 	}
1763: 	if (error == 0)
1764: 		error = m_mbuftouio(uio, m, 0);
1765: 	m_freem(m);
1766: 	return (error);
1767: }

The tunread function implements the read(2) system call for tun/tap devices, transferring packets from the kernel's network stack to userspace. This is the critical path where packets destined for transmission on the virtual network interface become available to VPN daemons, virtual machine monitors, or other userspace networking applications.

Function Overview and Context Retrieval
static int
tunread(struct cdev *dev, struct uio *uio, int flag)
{
    struct tuntap_softc *tp = dev->si_drv1;
    struct ifnet *ifp = TUN2IFP(tp);
    struct mbuf *m;
    size_t len;
    int error = 0;

The function receives the standard read(2) parameters: the device being read, a uio (user I/O) structure describing the userspace buffer, and flags from the open(2) call (particularly O_NONBLOCK).

The softc and interface pointers are retrieved via the established linkages. The mbuf pointer m will hold the packet being transferred, while len tracks how much data to copy.

Device Readiness Check
TUNDEBUG(ifp, "read\n");
TUN_LOCK(tp);
if ((tp->tun_flags & TUN_READY) != TUN_READY) {
    TUN_UNLOCK(tp);
    TUNDEBUG(ifp, "not ready 0%o\n", tp->tun_flags);
    return (EHOSTDOWN);
}

The TUN_READY macro combines two flags: TUN_OPEN | TUN_INITED. Both must be set for I/O to proceed:

  • TUN_INITED - the network interface was successfully created
  • TUN_OPEN - a process has opened the device

If either condition fails, the read returns EHOSTDOWN, signaling that the network path is unavailable. This error code is semantically appropriate, from the kernel's perspective, packets are being sent to a "host" (userspace), but that host is down.

Preparing for Packet Retrieval
tp->tun_flags &= ~TUN_RWAIT;

The TUN_RWAIT flag tracks whether a reader is blocked waiting for packets. Clearing it before entering the loop ensures correct state regardless of how the previous read completed, whether it retrieved a packet, timed out, or was interrupted.

The Packet Dequeue Loop
for (;;) {
    IFQ_DEQUEUE(&ifp->if_snd, m);
    if (m != NULL)
        break;
    if (flag & O_NONBLOCK) {
        TUN_UNLOCK(tp);
        return (EWOULDBLOCK);
    }
    tp->tun_flags |= TUN_RWAIT;
    error = mtx_sleep(tp, &tp->tun_mtx, PCATCH | (PZERO + 1),
        "tunread", 0);
    if (error != 0) {
        TUN_UNLOCK(tp);
        return (error);
    }
}
TUN_UNLOCK(tp);

This loop implements the standard kernel pattern for blocking I/O with non-blocking mode support.

Packet retrieval: IFQ_DEQUEUE atomically removes the head packet from the interface's send queue. This macro handles queue locking internally and returns NULL if the queue is empty.

Success path: When m != NULL, a packet was successfully dequeued, and the loop exits.

Non-blocking path: If the queue is empty and O_NONBLOCK was specified during open(2), the read immediately returns EWOULDBLOCK (also known as EAGAIN). This allows userspace to use poll(2), select(2), or kqueue(2) to wait efficiently for readable conditions without blocking the thread.

Blocking path: For blocking reads, the code:

  1. Sets TUN_RWAIT to indicate a reader is waiting
  2. Calls mtx_sleep to block the thread atomically

The mtx_sleep function atomically releases tp->tun_mtx and puts the thread to sleep. When woken (by tunstart or tunstart_l2 when packets arrive), it reacquires the mutex before returning.

The sleep parameters specify:

  • tp - the wait channel (arbitrary unique pointer, using the softc)
  • &tp->tun_mtx - mutex to release/reacquire atomically
  • PCATCH | (PZERO + 1) - allow signal interruption, priority just above normal
  • "tunread" - name for debugging (shows in ps or top)
  • 0 - no timeout (sleep indefinitely)

Signal handling: If interrupted by a signal (like SIGINT), mtx_sleep returns an error (typically EINTR or ERESTART), and the function propagates this to userspace. This allows Ctrl+C to interrupt a blocked read.

After successfully dequeuing a packet, the mutex is released. The remainder of the function operates on the mbuf without holding locks, avoiding contention with packet transmission threads.

Virtio-Net Header Processing
len = min(tp->tun_vhdrlen, uio->uio_resid);
if (len > 0) {
    struct virtio_net_hdr_mrg_rxbuf vhdr;

    bzero(&vhdr, sizeof(vhdr));
    if (m->m_pkthdr.csum_flags & TAP_ALL_OFFLOAD) {
        m = virtio_net_tx_offload(ifp, m, false, &vhdr.hdr);
    }
    /* ... debug output ... */
    error = uiomove(&vhdr, len, uio);
}

For tap devices configured with virtio-net header mode (via the TAPSVNETHDR ioctl), packets are prefixed with a metadata header describing offload features. This optimization allows userspace (particularly QEMU/KVM) to leverage hardware offload capabilities:

The tun_vhdrlen field is zero for standard mode and non-zero (typically 10 or 12 bytes) when virtio headers are enabled. The code only processes headers if both the header is enabled (len > 0) and the userspace buffer has room (uio->uio_resid).

The vhdr structure is zero-initialized to provide safe defaults. If the mbuf has offload flags set (TAP_ALL_OFFLOAD includes TCP/UDP checksum offload and TSO), virtio_net_tx_offload populates the header with:

  • Checksum computation parameters (where to start, where to insert)
  • Segmentation parameters (MSS, header length)
  • Generic flags (whether header is valid)

The uiomove(&vhdr, len, uio) call copies the header to userspace. This function handles the kernel-to-user memory transfer, updating uio to reflect consumed buffer space. If this copy fails (typically due to invalid userspace pointer), the error is recorded but processing continues to free the mbuf.

Packet Data Transfer
if (error == 0)
    error = m_mbuftouio(uio, m, 0);
m_freem(m);
return (error);

Assuming the header transfer succeeded (or no header was required), m_mbuftouio copies the packet data from the mbuf chain to the userspace buffer. This function:

  • Walks the mbuf chain (packets may be fragmented across multiple mbufs)
  • Copies each segment to userspace via uiomove
  • Updates uio->uio_resid to reflect remaining buffer space
  • Returns an error if the buffer is too small or pointers are invalid

The m_freem call releases the mbuf back to the kernel's memory pool. This must always execute, even if earlier operations failed, to prevent memory leaks. The mbuf is freed regardless of whether the copy succeeded, once dequeued from the send queue, the packet's fate is sealed.

Data Flow Summary

The complete path from network transmission to userspace read:

Application calls send()/sendto()
     -> 
Kernel routing selects tun0 interface
     -> 
tunoutput() or tap_transmit() enqueues mbuf
     -> 
tunstart()/tunstart_l2() wakes blocked reader
     -> 
tunread() dequeues mbuf from if_snd
     -> 
Optional: Generate virtio-net header
     -> 
Copy header to userspace (if enabled)
     -> 
Copy packet data to userspace
     -> 
Free mbuf
     -> 
Userspace receives packet data
Error Handling Semantics

The function returns several distinct error codes with specific meanings:

  • EHOSTDOWN - device not ready (not open or not initialized)
  • EWOULDBLOCK - non-blocking read, no packets available
  • EINTR/ERESTART - interrupted by signal while waiting
  • EFAULT - userspace buffer pointer invalid
  • 0 - success, packet transferred

These error codes allow userspace to distinguish between transient conditions (like EWOULDBLOCK requiring retry) and permanent failures (like EHOSTDOWN requiring device reopen).

Blocking and Wakeup Coordination

The TUN_RWAIT flag and mtx_sleep coordination ensure efficient resource usage. When no packets are available:

  1. Reader blocks in mtx_sleep, consuming no CPU
  2. When the network stack transmits a packet, tunstart or tunstart_l2 executes
  3. Those functions check TUN_RWAIT and call wakeup(tp) if set
  4. The sleeping thread wakes, loops, and dequeues the packet

This pattern avoids polling loops while ensuring prompt packet delivery. The mutex protects against races where packets arrive between the empty queue check and the sleep call.

5) write(2): userspace injects a packet (L2 vs L3 path)

5.1 Main write dispatcher (tunwrite)

1896: /*
1897:  * the cdevsw write interface - an atomic write is a packet - or else!
1898:  */
1899: static	int
1900: tunwrite(struct cdev *dev, struct uio *uio, int flag)
1901: {
1902: 	struct virtio_net_hdr_mrg_rxbuf vhdr;
1903: 	struct tuntap_softc *tp;
1904: 	struct ifnet	*ifp;
1905: 	struct mbuf	*m;
1906: 	uint32_t	mru;
1907: 	int		align, vhdrlen, error;
1908: 	bool		l2tun;
1909: 
1910: 	tp = dev->si_drv1;
1911: 	ifp = TUN2IFP(tp);
1912: 	TUNDEBUG(ifp, "tunwrite\n");
1913: 	if ((ifp->if_flags & IFF_UP) != IFF_UP)
1914: 		/* ignore silently */
1915: 		return (0);
1916: 
1917: 	if (uio->uio_resid == 0)
1918: 		return (0);
1919: 
1920: 	l2tun = (tp->tun_flags & TUN_L2) != 0;
1921: 	mru = l2tun ? TAPMRU : TUNMRU;
1922: 	vhdrlen = tp->tun_vhdrlen;
1923: 	align = 0;
1924: 	if (l2tun) {
1925: 		align = ETHER_ALIGN;
1926: 		mru += vhdrlen;
1927: 	} else if ((tp->tun_flags & TUN_IFHEAD) != 0)
1928: 		mru += sizeof(uint32_t);	/* family */
1929: 	if (uio->uio_resid < 0 || uio->uio_resid > mru) {
1930: 		TUNDEBUG(ifp, "len=%zd!\n", uio->uio_resid);
1931: 		return (EIO);
1932: 	}
1933: 
1934: 	if (vhdrlen > 0) {
1935: 		error = uiomove(&vhdr, vhdrlen, uio);
1936: 		if (error != 0)
1937: 			return (error);
1938: 		TUNDEBUG(ifp, "txvhdr: f %u, gt %u, hl %u, "
1939: 		    "gs %u, cs %u, co %u\n", vhdr.hdr.flags,
1940: 		    vhdr.hdr.gso_type, vhdr.hdr.hdr_len,
1941: 		    vhdr.hdr.gso_size, vhdr.hdr.csum_start,
1942: 		    vhdr.hdr.csum_offset);
1943: 	}
1944: 
1945: 	if ((m = m_uiotombuf(uio, M_NOWAIT, 0, align, M_PKTHDR)) == NULL) {
1946: 		if_inc_counter(ifp, IFCOUNTER_IERRORS, 1);
1947: 		return (ENOBUFS);
1948: 	}
1949: 
1950: 	m->m_pkthdr.rcvif = ifp;
1951: #ifdef MAC
1952: 	mac_ifnet_create_mbuf(ifp, m);
1953: #endif
1954: 
1955: 	if (l2tun)
1956: 		return (tunwrite_l2(tp, m, vhdrlen > 0 ? &vhdr : NULL));
1957: 
1958: 	return (tunwrite_l3(tp, m));
1959: }

The tunwrite function implements the write(2) system call for tun/tap devices, injecting packets from userspace into the kernel's network stack. This is the complementary operation to tunread, where tunread delivers kernel-originated packets to userspace, tunwrite accepts userspace packets for kernel processing. The comment "an atomic write is a packet - or else!" emphasizes a critical design principle: each write(2) call must contain exactly one complete packet.

Function Initialization and Context
static int
tunwrite(struct cdev *dev, struct uio *uio, int flag)
{
    struct virtio_net_hdr_mrg_rxbuf vhdr;
    struct tuntap_softc *tp;
    struct ifnet *ifp;
    struct mbuf *m;
    uint32_t mru;
    int align, vhdrlen, error;
    bool l2tun;

    tp = dev->si_drv1;
    ifp = TUN2IFP(tp);

The function retrieves the device context through the standard si_drv1 linkage. Local variables track the maximum receive unit, alignment requirements, virtio header length, and whether this is a layer 2 interface.

Interface State Validation
TUNDEBUG(ifp, "tunwrite\n");
if ((ifp->if_flags & IFF_UP) != IFF_UP)
    /* ignore silently */
    return (0);

if (uio->uio_resid == 0)
    return (0);

Two early checks filter out invalid operations:

Interface down check: If the interface is administratively down (not marked IFF_UP), the write succeeds immediately without processing the packet. This silent discard behavior differs from the read path, which returns EHOSTDOWN when not ready. The asymmetry makes sense: applications writing packets shouldn't fail when the interface is temporarily down, packets are simply dropped, mimicking what would happen on a real network interface with no carrier.

Zero-length write: Writing zero bytes is treated as a no-op success. This handles edge cases like write(fd, buf, 0) without error.

Determining Packet Size Limits
l2tun = (tp->tun_flags & TUN_L2) != 0;
mru = l2tun ? TAPMRU : TUNMRU;
vhdrlen = tp->tun_vhdrlen;
align = 0;
if (l2tun) {
    align = ETHER_ALIGN;
    mru += vhdrlen;
} else if ((tp->tun_flags & TUN_IFHEAD) != 0)
    mru += sizeof(uint32_t);

The Maximum Receive Unit (MRU) depends on the interface type:

  • Layer 3 (tun): TUNMRU (typically 1500 bytes, standard IPv4 MTU)
  • Layer 2 (tap/vmnet): TAPMRU (typically 1518 bytes, Ethernet frame size)

Alignment requirements: Layer 2 devices set align = ETHER_ALIGN (usually 2 bytes). This ensures the IP header following the 14-byte Ethernet header lands on a 4-byte boundary, which improves performance on architectures with alignment restrictions or cache line efficiency concerns.

Header adjustments: The MRU increases to accommodate:

  • Virtio-net headers for tap devices (vhdrlen bytes)
  • Address family indicator for tun devices in IFHEAD mode (4 bytes)

These headers precede the actual packet data in the userspace buffer but are not part of the on-wire packet format.

Validating Write Size
if (uio->uio_resid < 0 || uio->uio_resid > mru) {
    TUNDEBUG(ifp, "len=%zd!\n", uio->uio_resid);
    return (EIO);
}

The write size (uio->uio_resid) must fall within valid bounds. Negative sizes are impossible in correct operation but checked for safety. Oversized writes indicate either:

  • Application bugs (trying to write jumbo frames without configuration)
  • Protocol violations (incorrect packet framing)
  • Malicious behavior

The EIO return signals a generic I/O error, appropriate for data that cannot be processed.

Processing Virtio-Net Headers
if (vhdrlen > 0) {
    error = uiomove(&vhdr, vhdrlen, uio);
    if (error != 0)
        return (error);
    TUNDEBUG(ifp, "txvhdr: f %u, gt %u, hl %u, "
        "gs %u, cs %u, co %u\n", vhdr.hdr.flags,
        vhdr.hdr.gso_type, vhdr.hdr.hdr_len,
        vhdr.hdr.gso_size, vhdr.hdr.csum_start,
        vhdr.hdr.csum_offset);
}

When virtio-net header mode is enabled (common for VM networking), userspace prepends a small header to each packet describing offload operations:

  • Checksum offload: Instructs the kernel where to compute and insert checksums
  • Segmentation offload: For large packets (TSO/GSO), describes how to segment into MTU-sized chunks
  • Receive offload hints: Indicates checksums already validated by VM guest

The uiomove call copies the header from userspace, consuming vhdrlen bytes from the user buffer and advancing uio. If the copy fails (invalid pointer), the error propagates immediately, a corrupted header cannot be safely processed.

The debug output logs header fields for troubleshooting offload issues. In production builds with tundebug = 0, these statements compile away.

Constructing the Mbuf
if ((m = m_uiotombuf(uio, M_NOWAIT, 0, align, M_PKTHDR)) == NULL) {
    if_inc_counter(ifp, IFCOUNTER_IERRORS, 1);
    return (ENOBUFS);
}

The m_uiotombuf function is the kernel's utility for converting userspace data into the network stack's native packet format (mbuf chains). Its parameters specify:

  • uio - source data from userspace
  • M_NOWAIT - don't sleep for memory (return NULL immediately if allocation fails)
  • 0 - no maximum length (use all remaining uio_resid bytes)
  • align - start packet data this many bytes into the first mbuf
  • M_PKTHDR - allocate an mbuf with packet header (required for network packets)

Memory allocation failure: If m_uiotombuf returns NULL, the system is out of mbuf memory. The IFCOUNTER_IERRORS counter increments (visible in netstat -i), and ENOBUFS informs userspace of temporary resource exhaustion. Applications should generally retry after a brief delay.

The M_NOWAIT policy: Using M_NOWAIT rather than M_WAITOK prevents userspace writes from blocking indefinitely when memory is low. This is appropriate for the write path, if memory isn't available now, failing quickly allows the application to handle backpressure.

Setting Packet Metadata
m->m_pkthdr.rcvif = ifp;
#ifdef MAC
mac_ifnet_create_mbuf(ifp, m);
#endif

Two pieces of metadata are attached to the packet:

Receive interface: m_pkthdr.rcvif records which interface received the packet. This seems counterintuitive; we're injecting a packet, not receiving one, but from the kernel's perspective, packets written to /dev/tun0 are "received" on the tun0 interface. This field is crucial for:

  • Firewall rules (ipfw, pf) that filter based on incoming interface
  • Routing decisions that consider packet source
  • Accounting that attributes traffic to specific interfaces

MAC Framework labeling: If the Mandatory Access Control framework is enabled, mac_ifnet_create_mbuf applies security labels to the packet based on the interface's policy. This supports systems using TrustedBSD MAC for fine-grained network security.

Dispatching by Layer
if (l2tun)
    return (tunwrite_l2(tp, m, vhdrlen > 0 ? &vhdr : NULL));

return (tunwrite_l3(tp, m));

The final step delegates to layer-specific processing functions:

Layer 2 path (tunwrite_l2): For tap/vmnet devices, the mbuf contains a complete Ethernet frame. The function:

  • Validates the Ethernet header
  • Applies virtio-net offload hints if present
  • Injects the frame into the Ethernet processing path
  • Potentially processes through LRO (Large Receive Offload)

Layer 3 path (tunwrite_l3): For tun devices, the mbuf contains a raw IP packet (possibly preceded by an address family indicator in IFHEAD mode). The function:

  • Extracts the protocol family (IPv4 vs IPv6)
  • Dispatches to the appropriate network layer protocol handler
  • Bypasses link-layer processing entirely

Both functions assume ownership of the mbuf, they will either successfully inject it into the network stack or free it on error. The caller should not access the mbuf after these calls return.

Data Flow Summary

The complete path from userspace write to kernel network processing:

Application calls write(fd, packet, len)
     -> 
tunwrite() validates interface state and size
     -> 
Extract virtio-net header (if enabled)
     -> 
Copy packet data from userspace to mbuf
     -> 
Set mbuf metadata (rcvif, MAC labels)
     -> 
Layer 2: tunwrite_l2()           Layer 3: tunwrite_l3()
     ->                                   -> 
Validate Ethernet header          Extract address family
     ->                                   -> 
Apply offload hints               Dispatch to IP/IPv6
     ->                                   -> 
ether_input() / LRO               netisr_dispatch()
     ->                                   -> 
Network stack processes packet
     -> 
Routing, firewall, socket delivery
Atomic Write Semantics

The opening comment, "an atomic write is a packet - or else!", highlights a critical contract: userspace must write complete packets in single write(2) calls. The driver provides no buffering or packet assembly:

  • Writing 1000 bytes, then 500 bytes creates two packets (1000-byte and 500-byte)
  • Not "one 1500-byte packet assembled from two writes"

This design simplifies the driver and matches the semantics of real network interfaces, which receive complete frames. Applications that need to construct packets piece-by-piece must buffer in userspace before writing.

Error Handling and Resource Management

The function's error handling demonstrates defensive programming patterns:

  • Early validation prevents resource allocation for invalid requests
  • Immediate cleanup on m_uiotombuf failure (increment error counter, return ENOBUFS)
  • Ownership transfer to layer-specific functions eliminates double-free risks

The only resource allocated (the mbuf) has clear ownership transfer semantics. After calling tunwrite_l2 or tunwrite_l3, the write function never touches it again.

5.2 L3 (tun) dispatch to the network stack (netisr)

1845: static int
1846: tunwrite_l3(struct tuntap_softc *tp, struct mbuf *m)
1847: {
1848: 	struct epoch_tracker et;
1849: 	struct ifnet *ifp;
1850: 	int family, isr;
1851: 
1852: 	ifp = TUN2IFP(tp);
1853: 	/* Could be unlocked read? */
1854: 	TUN_LOCK(tp);
1855: 	if (tp->tun_flags & TUN_IFHEAD) {
1856: 		TUN_UNLOCK(tp);
1857: 		if (m->m_len < sizeof(family) &&
1858: 		(m = m_pullup(m, sizeof(family))) == NULL)
1859: 			return (ENOBUFS);
1860: 		family = ntohl(*mtod(m, u_int32_t *));
1861: 		m_adj(m, sizeof(family));
1862: 	} else {
1863: 		TUN_UNLOCK(tp);
1864: 		family = AF_INET;
1865: 	}
1866: 
1867: 	BPF_MTAP2(ifp, &family, sizeof(family), m);
1868: 
1869: 	switch (family) {
1870: #ifdef INET
1871: 	case AF_INET:
1872: 		isr = NETISR_IP;
1873: 		break;
1874: #endif
1875: #ifdef INET6
1876: 	case AF_INET6:
1877: 		isr = NETISR_IPV6;
1878: 		break;
1879: #endif
1880: 	default:
1881: 		m_freem(m);
1882: 		return (EAFNOSUPPORT);
1883: 	}
1884: 	random_harvest_queue(m, sizeof(*m), RANDOM_NET_TUN);
1885: 	if_inc_counter(ifp, IFCOUNTER_IBYTES, m->m_pkthdr.len);
1886: 	if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1);
1887: 	CURVNET_SET(ifp->if_vnet);
1888: 	M_SETFIB(m, ifp->if_fib);
1889: 	NET_EPOCH_ENTER(et);
1890: 	netisr_dispatch(isr, m);
1891: 	NET_EPOCH_EXIT(et);
1892: 	CURVNET_RESTORE();
1893: 	return (0);
1894: }

The tunwrite_l3 function handles packets written to layer 3 (tun) devices, injecting raw IP packets directly into the kernel's network protocol handlers. Unlike layer 2 (tap) devices that process complete Ethernet frames, tun devices work with IP packets that have no link-layer headers, making them ideal for VPN implementations and IP tunneling protocols.

Function Context and Protocol Family Extraction
static int
tunwrite_l3(struct tuntap_softc *tp, struct mbuf *m)
{
    struct epoch_tracker et;
    struct ifnet *ifp;
    int family, isr;

    ifp = TUN2IFP(tp);

The function receives the softc and an mbuf containing the packet. The epoch_tracker will be used later to ensure safe concurrent access to routing structures. The family variable will hold the protocol family (AF_INET or AF_INET6), and isr will identify the appropriate network interrupt service routine.

Determining Protocol Family
TUN_LOCK(tp);
if (tp->tun_flags & TUN_IFHEAD) {
    TUN_UNLOCK(tp);
    if (m->m_len < sizeof(family) &&
    (m = m_pullup(m, sizeof(family))) == NULL)
        return (ENOBUFS);
    family = ntohl(*mtod(m, u_int32_t *));
    m_adj(m, sizeof(family));
} else {
    TUN_UNLOCK(tp);
    family = AF_INET;
}

Tun devices support two modes for indicating packet protocol:

IFHEAD mode (TUN_IFHEAD flag set): Each packet begins with a 4-byte address family indicator in network byte order. This mode, enabled via the TUNSIFHEAD ioctl, allows a single tun device to carry both IPv4 and IPv6 traffic. The code:

  1. Checks if the first mbuf contains at least 4 bytes using m->m_len
  2. If not, calls m_pullup to consolidate the header into the first mbuf
  3. Extracts the family using mtod (mbuf-to-data pointer) and converts from network to host byte order with ntohl
  4. Strips the family indicator with m_adj, which advances the data pointer by 4 bytes

The m_pullup call can fail if memory is exhausted, returning NULL. In this case, the original mbuf has already been freed by m_pullup, so the function simply returns ENOBUFS without calling m_freem.

Non-IFHEAD mode (default): All packets are assumed to be IPv4. This legacy mode simplifies applications that only handle IPv4, but prevents multiplexing protocols over one device.

The mutex is held only while reading tun_flags, minimizing lock contention. The comment "Could be unlocked read?" questions whether the lock is even necessary, since flags rarely change after initialization, an unlocked read would likely be safe. However, the conservative approach prevents theoretical races.

Berkeley Packet Filter Tap
BPF_MTAP2(ifp, &family, sizeof(family), m);

The BPF_MTAP2 macro passes the packet to any attached BPF (Berkeley Packet Filter) listeners, typically packet capture tools like tcpdump. The macro name breaks down as:

  • BPF - Berkeley Packet Filter subsystem
  • MTAP - tap into the packet stream from an mbuf
  • 2 - two-argument variant that prepends metadata

The call prepends the 4-byte family value before the packet data, allowing capture tools to distinguish IPv4 from IPv6 without packet inspection. This matches the link-layer type DLT_NULL configured during interface creation; captured packets have a 4-byte address family header even if the wire format doesn't.

BPF operates efficiently: if no listeners are attached, the macro expands to a simple conditional check that costs only a few instructions. This design allows pervasive instrumentation points throughout the network stack without performance impact when not actively debugging.

Protocol Validation and Dispatch Setup
switch (family) {
#ifdef INET
case AF_INET:
    isr = NETISR_IP;
    break;
#endif
#ifdef INET6
case AF_INET6:
    isr = NETISR_IPV6;
    break;
#endif
default:
    m_freem(m);
    return (EAFNOSUPPORT);
}

The protocol family determines which network layer interrupt service routine (netisr) will process the packet:

  • AF_INET -> NETISR_IP - IPv4 processing
  • AF_INET6 -> NETISR_IPV6 - IPv6 processing

The #ifdef guards are crucial: if the kernel was compiled without IPv4 or IPv6 support, those cases don't exist, and attempting to inject such packets results in EAFNOSUPPORT (address family not supported).

Unsupported protocol families trigger immediate mbuf deallocation via m_freem and return an error. This prevents packets from leaking into the network stack with incorrect metadata that could cause crashes or security issues.

Entropy Collection
random_harvest_queue(m, sizeof(*m), RANDOM_NET_TUN);

This call contributes entropy to the kernel's random number generator. Network packet arrival timing is unpredictable and difficult for attackers to manipulate, making it a valuable entropy source. The function samples metadata about the mbuf structure (not packet contents) to seed the random pool.

The RANDOM_NET_TUN flag tags the entropy source, allowing the random subsystem to track entropy diversity. Systems relying on /dev/random for cryptographic operations benefit from accumulating entropy from multiple independent sources.

Interface Statistics
if_inc_counter(ifp, IFCOUNTER_IBYTES, m->m_pkthdr.len);
if_inc_counter(ifp, IFCOUNTER_IPACKETS, 1);

These calls update interface statistics visible via netstat -i or ifconfig:

  • IFCOUNTER_IBYTES - total bytes received
  • IFCOUNTER_IPACKETS - total packets received

From the kernel's perspective, packets written by userspace are "input" to the interface, hence the use of input counters rather than output counters. This matches the semantic established by setting m_pkthdr.rcvif earlier, the packet is being received from userspace.

The if_inc_counter function handles atomic updates, ensuring accurate counts even with concurrent packet processing on multiprocessor systems.

Network Stack Context Setup
CURVNET_SET(ifp->if_vnet);
M_SETFIB(m, ifp->if_fib);

Two pieces of context are established before injecting the packet:

Virtual network stack: CURVNET_SET switches to the network context (vnet) associated with the interface. In systems using jails or network stack virtualization, multiple independent network stacks coexist. This macro ensures routing tables, firewall rules, and socket lookups operate in the correct namespace.

Forwarding Information Base (FIB): M_SETFIB tags the packet with the interface's FIB number. FreeBSD supports multiple routing tables (FIBs), allowing policy-based routing where different applications or interfaces use distinct routing policies. The packet inherits the interface's FIB, ensuring routes are looked up in the appropriate table.

These settings affect all subsequent packet processing: firewall rules, routing decisions, and socket delivery.

Epoch-Protected Dispatch
NET_EPOCH_ENTER(et);
netisr_dispatch(isr, m);
NET_EPOCH_EXIT(et);
CURVNET_RESTORE();
return (0);

The critical packet injection occurs within an epoch section:

Network epoch: FreeBSD's network stack uses epoch-based reclamation (a form of read-copy-update) to protect data structures from concurrent access without heavy locking. NET_EPOCH_ENTER registers this thread as active in the network epoch, preventing routing entries, interface structures, and other network objects from being deallocated until NET_EPOCH_EXIT.

This mechanism enables lock-free reads of routing tables and interface lists, dramatically improving multicore scalability. The epoch tracker et maintains the context needed to exit cleanly.

Netisr dispatch: netisr_dispatch(isr, m) hands the packet to the network interrupt service routine subsystem. This asynchronous dispatch model decouples packet injection from protocol processing:

  1. The packet is queued to the appropriate netisr thread (typically one per CPU core)
  2. The calling thread (handling the write(2)) returns immediately
  3. The netisr thread dequeues and processes the packet asynchronously

This design prevents userspace writes from blocking in complex protocol processing (IP forwarding, firewall evaluation, TCP reassembly). The netisr thread will:

  • Validate IP headers (checksum, length, version)
  • Process IP options
  • Consult routing tables
  • Apply firewall rules
  • Deliver to local sockets or forward to other interfaces

Context restoration: CURVNET_RESTORE switches back to the calling thread's original network context. This is essential for correctness, without restoration, subsequent operations in the thread would execute in the wrong network namespace.

Ownership and Lifecycle

After netisr_dispatch, the function returns success but no longer owns the mbuf. The netisr subsystem assumes responsibility for either:

  • Delivering the packet to its destination and freeing the mbuf
  • Dropping the packet (for policy, routing, or validation reasons) and freeing the mbuf

The function never needs to call m_freem in the success path, ownership has transferred to the network stack.

Data Flow Through the Network Stack

The complete path after dispatch:

tunwrite_l3() injects packet
     -> 
netisr_dispatch() queues to NETISR_IP/NETISR_IPV6
     -> 
Netisr thread dequeues packet
     -> 
ip_input() / ip6_input() processes
     -> 
Routing table lookup
     -> 
Firewall evaluation (ipfw, pf)
     -> 
    | ->  Local delivery: socket input queue
    | ->  Forward: ip_forward()  ->  output interface
    | ->  Drop: m_freem()
Error Paths and Resource Management

The function has three possible outcomes:

  1. Success (return 0): Packet dispatched to network stack, mbuf ownership transferred
  2. Pullup failure (return ENOBUFS): m_pullup freed the mbuf, no further cleanup needed
  3. Unsupported protocol (return EAFNOSUPPORT): Mbuf explicitly freed with m_freem

All paths correctly manage mbuf ownership, preventing both leaks and double-frees. This careful resource management is characteristic of well-designed kernel code.

6) Readiness: poll(2) and kqueue

1965:  */
1966: static	int
1967: tunpoll(struct cdev *dev, int events, struct thread *td)
1968: {
1969: 	struct tuntap_softc *tp = dev->si_drv1;
1970: 	struct ifnet	*ifp = TUN2IFP(tp);
1971: 	int		revents = 0;
1972: 
1973: 	TUNDEBUG(ifp, "tunpoll\n");
1974: 
1975: 	if (events & (POLLIN | POLLRDNORM)) {
1976: 		IFQ_LOCK(&ifp->if_snd);
1977: 		if (!IFQ_IS_EMPTY(&ifp->if_snd)) {
1978: 			TUNDEBUG(ifp, "tunpoll q=%d\n", ifp->if_snd.ifq_len);
1979: 			revents |= events & (POLLIN | POLLRDNORM);
1980: 		} else {
1981: 			TUNDEBUG(ifp, "tunpoll waiting\n");
1982: 			selrecord(td, &tp->tun_rsel);
1983: 		}
1984: 		IFQ_UNLOCK(&ifp->if_snd);
1985: 	}
1986: 	revents |= events & (POLLOUT | POLLWRNORM);
1987: 
1988: 	return (revents);
1989: }
1990: 
1991: /*
1992:  * tunkqfilter - support for the kevent() system call.
1993:  */
1994: static int
1995: tunkqfilter(struct cdev *dev, struct knote *kn)
1996: {
1997: 	struct tuntap_softc	*tp = dev->si_drv1;
1998: 	struct ifnet	*ifp = TUN2IFP(tp);
1999: 
2000: 	switch(kn->kn_filter) {
2001: 	case EVFILT_READ:
2002: 		TUNDEBUG(ifp, "%s kqfilter: EVFILT_READ, minor = %#x\n",
2003: 		    ifp->if_xname, dev2unit(dev));
2004: 		kn->kn_fop = &tun_read_filterops;
2005: 		break;
2006: 
2007: 	case EVFILT_WRITE:
2008: 		TUNDEBUG(ifp, "%s kqfilter: EVFILT_WRITE, minor = %#x\n",
2009: 		    ifp->if_xname, dev2unit(dev));
2010: 		kn->kn_fop = &tun_write_filterops;
2011: 		break;
2012: 
2013: 	default:
2014: 		return (EINVAL);
2015: 	}
2016: 
2017: 	kn->kn_hook = tp;
2018: 	knlist_add(&tp->tun_rsel.si_note, kn, 0);
2019: 
2020: 	return (0);
2021: }

The tunpoll function implements support for poll(2) and select(2), which allow applications to monitor multiple file descriptors for I/O readiness:

static int
tunpoll(struct cdev *dev, int events, struct thread *td)
{
    struct tuntap_softc *tp = dev->si_drv1;
    struct ifnet *ifp = TUN2IFP(tp);
    int revents = 0;

The function receives:

  • dev - the character device being polled
  • events - bitmask of events the application wants to monitor
  • td - the calling thread context

The return value revents indicates which requested events are currently ready. The function builds this bitmask by checking actual device conditions.

Event Notification Mechanisms: tunpoll and tunkqfilter

Efficient I/O multiplexing is essential for applications managing multiple tun/tap devices or integrating tunnel I/O with other event sources. FreeBSD provides two interfaces for this: the traditional poll(2)/select(2) system calls and the more scalable kqueue(2) mechanism. The tunpoll and tunkqfilter functions implement these interfaces, allowing applications to wait efficiently for readable or writable conditions without busy-polling.

Read Readiness
if (events & (POLLIN | POLLRDNORM)) {
    IFQ_LOCK(&ifp->if_snd);
    if (!IFQ_IS_EMPTY(&ifp->if_snd)) {
        TUNDEBUG(ifp, "tunpoll q=%d\n", ifp->if_snd.ifq_len);
        revents |= events & (POLLIN | POLLRDNORM);
    } else {
        TUNDEBUG(ifp, "tunpoll waiting\n");
        selrecord(td, &tp->tun_rsel);
    }
    IFQ_UNLOCK(&ifp->if_snd);
}

When the application requests read events (POLLIN or POLLRDNORM, which are synonymous for devices):

Queue check: The send queue lock is acquired, and IFQ_IS_EMPTY tests whether packets await reading. If packets are present:

  • The requested read events are added to revents
  • The application will be notified that read(2) can proceed without blocking

Registration for notification: If the queue is empty:

  • selrecord registers this thread's interest in the device becoming readable
  • The thread's context is added to tp->tun_rsel, a per-device selection list
  • When packets arrive later (in tunstart or tunstart_l2), the code calls selwakeup(&tp->tun_rsel) to notify all registered threads

The selrecord mechanism is the key to efficient waiting. Instead of the application repeatedly polling, the kernel maintains a list of interested threads and wakes them when conditions change. This pattern appears throughout the FreeBSD kernel for any device supporting poll(2).

The send queue lock protects against races where packets arrive between checking the queue and registering interest. The lock ensures atomicity: if the queue is empty during the check, registration completes before any packet arrival can call selwakeup.

Write Readiness
revents |= events & (POLLOUT | POLLWRNORM);

Writes are always ready for tun/tap devices. The device has no internal buffering that could fill, write(2) either succeeds immediately (allocating an mbuf and dispatching to the network stack) or fails immediately (if mbuf allocation fails). There's no condition where writing would block waiting for buffer space to become available.

This unconditional write readiness is common for network devices. Unlike pipes or sockets with limited buffer space, tun/tap devices accept writes as fast as the application can generate them, relying on the mbuf allocator's dynamic memory management.

Kqueue Interface: tunkqfilter

The tunkqfilter function implements support for kqueue(2), FreeBSD's scalable event notification mechanism. Kqueue offers several advantages over poll(2):

  • Edge-triggered semantics (notifications only on state changes)
  • Better performance with thousands of file descriptors
  • User data can be attached to events
  • More flexible event types (not just read/write)
static int
tunkqfilter(struct cdev *dev, struct knote *kn)
{
    struct tuntap_softc *tp = dev->si_drv1;
    struct ifnet *ifp = TUN2IFP(tp);

The function receives a knote (kernel note) structure representing the event registration. The knote persists across multiple event deliveries, unlike poll(2) which requires re-registration on every call.

Filter Type Validation
switch(kn->kn_filter) {
case EVFILT_READ:
    TUNDEBUG(ifp, "%s kqfilter: EVFILT_READ, minor = %#x\n",
        ifp->if_xname, dev2unit(dev));
    kn->kn_fop = &tun_read_filterops;
    break;

case EVFILT_WRITE:
    TUNDEBUG(ifp, "%s kqfilter: EVFILT_WRITE, minor = %#x\n",
        ifp->if_xname, dev2unit(dev));
    kn->kn_fop = &tun_write_filterops;
    break;

default:
    return (EINVAL);
}

The application specifies which event type to monitor via kn->kn_filter:

  • EVFILT_READ - monitor for readable condition
  • EVFILT_WRITE - monitor for writable condition

For each filter type, the code assigns a function table (kn_fop) that implements the filter's semantics. These tables were defined earlier in the source:

static const struct filterops tun_read_filterops = {
    .f_isfd = 1,
    .f_attach = NULL,
    .f_detach = tunkqdetach,
    .f_event = tunkqread,
};

static const struct filterops tun_write_filterops = {
    .f_isfd = 1,
    .f_attach = NULL,
    .f_detach = tunkqdetach,
    .f_event = tunkqwrite,
};

The filterops structure defines callbacks:

  • f_isfd - flag indicating this filter operates on file descriptors
  • f_attach - called when the filter is registered (NULL here, no special setup needed)
  • f_detach - called when the filter is removed (tunkqdetach cleanup)
  • f_event - called to test event condition (tunkqread or tunkqwrite)

Unsupported filter types (like EVFILT_SIGNAL or EVFILT_TIMER) return EINVAL, as they don't make sense for tun/tap devices.

Registering the Event
kn->kn_hook = tp;
knlist_add(&tp->tun_rsel.si_note, kn, 0);

return (0);
}

Two steps complete registration:

Attach context: kn->kn_hook stores the softc pointer. This allows the filter operation functions (tunkqread, tunkqwrite) to access device state without global lookups. When the event fires, the callback receives the knote, extracts kn_hook, and casts it back to tuntap_softc *.

Add to notification list: knlist_add inserts the knote into the device's kernel note list (tp->tun_rsel.si_note). This list is shared between poll(2) and kqueue(2) infrastructure, the si_note field within tun_rsel handles kqueue events, while other tun_rsel fields handle poll/select events.

When packets arrive (in tunstart or tunstart_l2), the code calls KNOTE_LOCKED(&tp->tun_rsel.si_note, 0), which iterates the knote list and invokes each filter's f_event callback. If the callback returns true (readable/writable condition met), the kqueue subsystem delivers the event to userspace.

The third argument to knlist_add (0) indicates no special flags, the knote is added unconditionally without requiring specific locking state.

Filter Operation Callbacks

Though not shown in this fragment, the filter operations are worth understanding:

tunkqread: Called to test read readiness

static int
tunkqread(struct knote *kn, long hint)
{
    struct tuntap_softc *tp = kn->kn_hook;
    struct ifnet *ifp = TUN2IFP(tp);

    if ((kn->kn_data = ifp->if_snd.ifq_len) > 0) {
        return (1);  // Readable
    }
    return (0);  // Not readable
}

The callback checks the send queue length and stores it in kn->kn_data, making the count available to userspace via the kevent structure. Returning 1 signals the event should fire; returning 0 means the condition is not yet met.

tunkqwrite: Called to test write readiness

static int
tunkqwrite(struct knote *kn, long hint)
{
    struct tuntap_softc *tp = kn->kn_hook;
    struct ifnet *ifp = TUN2IFP(tp);

    kn->kn_data = ifp->if_mtu;
    return (1);  // Always writable
}

Since writes are always possible, this always returns 1. The kn_data field is set to the interface MTU, giving userspace information about maximum write size.

tunkqdetach: Called when removing the event

static void
tunkqdetach(struct knote *kn)
{
    struct tuntap_softc *tp = kn->kn_hook;

    knlist_remove(&tp->tun_rsel.si_note, kn, 0);
}

This removes the knote from the device's notification list, ensuring no further events are delivered for this registration.

Comparison: Poll vs. Kqueue

The two mechanisms serve similar purposes but with different characteristics:

Poll/Select:

  • Level-triggered: reports readiness state on every call
  • Requires kernel scanning of all file descriptors on each call
  • Simple API, widely portable
  • O(n) complexity in number of file descriptors

Kqueue:

  • Edge-triggered: reports changes in readiness state
  • Kernel maintains active event list, only reports changes
  • More complex API, FreeBSD/macOS specific
  • O(1) complexity for event delivery

For applications monitoring a single tun/tap device, the difference is negligible. For VPN concentrators or network simulators managing hundreds of virtual interfaces, kqueue's scalability advantages become significant.

Notification Flow

When a packet arrives for transmission, the complete notification sequence:

Network stack routes packet to tun0
     -> 
tunoutput() / tap_transmit() enqueues mbuf
     -> 
tunstart() / tunstart_l2() wakes waiters:
    | ->  wakeup(tp) - wakes blocked read()
    | ->  selwakeup(&tp->tun_rsel) - wakes poll()/select()
    | ->  KNOTE_LOCKED(&tp->tun_rsel.si_note, 0) - delivers kqueue events
     -> 
Application receives notification
     -> 
Application calls read() to retrieve packet

This multi-mechanism notification ensures applications using any waiting strategy, blocking reads, poll/select loops, or kqueue event loops, receive prompt packet delivery notification.

Interactive Exercises for tun(4)/tap(4)

Goal: Trace both directions of data flow and map user-space operations to the exact kernel lines.

A) Device personalities and cloning (warm-up)
  1. In the tuntap_drivers[] array, list the three .d_name values and identify which function pointers (.d_open, .d_read, .d_write, etc.) are assigned for each. Note: are they the same or different functions? Quote the initializer lines you used. (Tip: examine lines around 280-291 and the subsequent entries for tap/vmnet.)

  2. In tun_clone_create(), find where the driver:

    • computes the final name with unit,
    • calls clone_create(),
    • falls back to tun_create_device(), and
    • calls tuncreate() to attach the ifnet.

    Quote those lines and explain the sequence.

  3. In tun_create_device(), record the mode used for the cdev and which field points si_drv1 to the softc. Quote the lines. (Hint: look for mda_mode and mda_si_drv1.)

B) Interface bring-up path
  1. In tuncreate(), point to the if_alloc(), if_initname(), and if_attach() calls. Why is bpfattach() called for L3 mode with DLT_NULL instead of DLT_EN10MB? Quote the lines you used.

  2. In tunopen(), identify where link state is marked UP on open. Quote the line(s).

  3. In tunopen(), what prevents two processes from opening the same device simultaneously? Quote the check and explain the flags involved. (Hint: look for TUN_OPEN and EBUSY.)

C) Read a packet from user space (kernel -> user)
  1. In tunread(), explain the blocking and non-blocking behaviors. Which flag forces EWOULDBLOCK? Where is the sleep done? Quote the lines.

  2. Where is the optional virtio header copied to user space, and how is the payload then delivered? Quote those lines.

  3. Where are readers woken when output arrives from the stack? Trace the wakeups in tunstart_l2() (or the L3 start path): wakeup, selwakeuppri, and KNOTE. Quote the lines.

D) Write a packet from user space (user -> kernel)
  1. In tunwrite(), find the guard that silently ignores writes if the interface is down, and the check that bounds the maximum write size (MRU + headers). Quote the lines.

  2. Still in tunwrite(), where is the user buffer turned into an mbuf? Quote the call and explain the align parameter for L2.

  3. Follow the L3 path into tunwrite_l3(): where is the address family read (when TUN_IFHEAD is set), where is BPF tapped, and where is the netisr dispatch called? Quote those lines.

  4. Follow the L2 path into tunwrite_l2(): where does it drop frames whose destination MAC address doesn't match the interface's MAC (unless promiscuous mode is set)? This simulates what real Ethernet hardware wouldn't deliver. Quote those lines.

E) Quick user-space validations (safe experiments)

These checks assume you created a tun0 (L3) or tap0 (L2) and brought it up in a private VM.

# L3: read a packet the kernel queued for us
% ifconfig tun0 10.0.0.1/24 up
% ( ping -c1 10.0.0.2 >/dev/null & ) &
% dd if=/dev/tun0 bs=4096 count=1 2>/dev/null | hexdump -C | head -n2
# Expected: You should see an ICMP echo request (type 8)
# with destination IP 10.0.0.2 starting around offset 0x14

# L3: inject an IPv4 echo request (requires crafting a full frame)
# (later in the book we'll show a tiny C sender using write())

For each command you run, point to the exact lines in tunread() or tunwrite_l3() that explain the behavior you observe.

Stretch (thought experiments)

  1. If tunwrite() returned EIO when the interface is down, instead of ignoring writes, how would tools that rely on blind writes behave? Point to the current "ignore if down" line and explain the design choice.

  2. Suppose tunstart_l2() called wakeup(tp) but not selwakeuppri(&tp->tun_rsel, ...). What would happen to an application using poll(2) to wait for packets? Would blocking read(2) still work? Point to both notification mechanisms and explain why each is necessary.

Bridging to the next tour

The if_tuntap driver demonstrates how character devices and network interfaces integrate, with userspace acting as the "hardware" endpoint. Our next driver explores fundamentally different territory: uart_bus_pci shows how real hardware devices are discovered and bound to kernel drivers through FreeBSD's layered bus architecture.

This shift from character device operations to bus attachment represents a critical architectural pattern: the separation between bus-specific glue code and device-agnostic core functionality. The uart_bus_pci driver is intentionally minimal, with under 300 lines of code, focusing solely on device identification (matching PCI vendor/device IDs), resource negotiation (claiming I/O ports and interrupts), and handoff to the generic UART subsystem via uart_bus_probe() and uart_bus_attach().

Tour 4 - The PCI glue: uart(4)

Open the file:

% cd /usr/src/sys/dev/uart
% less uart_bus_pci.c

This file is the PCI "bus glue" for the generic UART core. It matches hardware via a PCI ID table, picks a UART class, calls the shared uart bus probe/attach, and adds a tiny bit of bus-specific logic (MSI preference, unique console matching). The actual UART register shuffling lives in the common UART code; this file is about matching and wiring.

1) Method table + driver object (what Newbus calls)

 52: static device_method_t uart_pci_methods[] = {
 53: 	/* Device interface */
 54: 	DEVMETHOD(device_probe,		uart_pci_probe),
 55: 	DEVMETHOD(device_attach,	uart_pci_attach),
 56: 	DEVMETHOD(device_detach,	uart_pci_detach),
 57: 	DEVMETHOD(device_resume,	uart_bus_resume),
 58: 	DEVMETHOD_END
 59: };
 61: static driver_t uart_pci_driver = {
 62: 	uart_driver_name,
 63: 	uart_pci_methods,
 64: 	sizeof(struct uart_softc),
 65: };

Map this mentally to the Newbus lifecycle: probe -> attach -> detach (+ resume).

Device Methods and Driver Structure

FreeBSD's device driver framework uses an object-oriented approach where drivers declare which operations they support through method tables. The uart_pci_methods array and uart_pci_driver structure establish this driver's interface to the kernel's device management subsystem.

The Device Method Table
static device_method_t uart_pci_methods[] = {
    /* Device interface */
    DEVMETHOD(device_probe,     uart_pci_probe),
    DEVMETHOD(device_attach,    uart_pci_attach),
    DEVMETHOD(device_detach,    uart_pci_detach),
    DEVMETHOD(device_resume,    uart_bus_resume),
    DEVMETHOD_END
};

The device_method_t array maps generic device operations to driver-specific implementations. Each DEVMETHOD entry binds a method identifier to a function pointer:

device_probe -> uart_pci_probe: Called by the PCI bus driver during device enumeration to ask "can you drive this device?" The function examines the device's PCI vendor and device IDs, returning a priority value indicating how well it matches. Lower values mean better matches; returning ENXIO means "not my device."

device_attach -> uart_pci_attach: Called after a successful probe to initialize the device. This function allocates resources (I/O ports, interrupts), configures the hardware, and makes the device operational. If attachment fails, the driver should release any allocated resources.

device_detach -> uart_pci_detach: Called when the device is being removed from the system (hot-unplug, driver unload, or system shutdown). Must release all resources claimed during attach and ensure the hardware is left in a safe state.

device_resume -> uart_bus_resume: Called when the system resumes from a suspend state. Note this points to uart_bus_resume, not a PCI-specific function, the generic UART layer handles power management uniformly across all bus types.

DEVMETHOD_END: A sentinel marking the array's end. The kernel iterates this table until reaching this terminator.

The Driver Declaration
static driver_t uart_pci_driver = {
    uart_driver_name,
    uart_pci_methods,
    sizeof(struct uart_softc),
};

The driver_t structure packages the method table with metadata:

uart_driver_name: A string identifying this driver, typically "uart". This name appears in kernel messages, device tree output, and administrative tools. The name is defined in the generic uart code and shared across all bus attachments (PCI, ISA, ACPI), ensuring consistent device naming regardless of how the UART was discovered.

uart_pci_methods: Pointer to the method table defined above. When the kernel needs to perform an operation on a uart_pci device, it looks up the appropriate method in this table and calls the corresponding function.

sizeof(struct uart_softc): The size of the driver's per-device state structure. The kernel allocates this much memory when creating a device instance, accessible via device_get_softc(). Importantly, this uses uart_softc from the generic UART layer, not a PCI-specific structure, the core UART state is bus-agnostic.

Architectural Significance

This simple structure embodies FreeBSD's layered driver model. The method table contains four functions:

  • Two are PCI-specific (uart_pci_probe, uart_pci_attach, uart_pci_detach)
  • One is bus-agnostic (uart_bus_resume)

The PCI-specific functions handle only bus-related concerns: matching device IDs, claiming PCI resources, and managing MSI interrupts. All UART-specific logic, baud rate configuration, FIFO management, character I/O, lives in the generic uart_bus.c code that these functions call.

This separation means the same UART hardware logic works whether the device appears on the PCI bus, the ISA bus, or as an ACPI-enumerated device. Only the probe/attach glue changes. This pattern, thin bus-specific wrappers around substantial generic cores, reduces code duplication and simplifies porting to new bus types or architectures.

The method table mechanism also enables runtime polymorphism. If a UART appears on different buses (a 16550 on both PCI and ISA, for example), the kernel loads different driver modules (uart_pci, uart_isa), each with its own method table, but both share the underlying uart_softc structure and call the same generic functions for actual device operation.

2) Local structs + flags we'll use

 67: struct pci_id {
 68: 	uint16_t	vendor;
 69: 	uint16_t	device;
 70: 	uint16_t	subven;
 71: 	uint16_t	subdev;
 72: 	const char	*desc;
 73: 	int		rid;
 74: 	int		rclk;
 75: 	int		regshft;
 76: };
 78: struct pci_unique_id {
 79: 	uint16_t	vendor;
 80: 	uint16_t	device;
 81: };
 83: #define PCI_NO_MSI	0x40000000
 84: #define PCI_RID_MASK	0x0000ffff

What matters later: rid (which BAR/IRQ to use), optional rclk and regshft, and the PCI_NO_MSI hint.

Device Identification Structures

Hardware drivers must identify which specific devices they can manage. For PCI devices, this identification relies on vendor and device ID codes burned into the hardware's configuration space. The pci_id and pci_unique_id structures encode this matching logic along with device-specific configuration parameters.

The Primary Identification Structure
struct pci_id {
    uint16_t    vendor;
    uint16_t    device;
    uint16_t    subven;
    uint16_t    subdev;
    const char  *desc;
    int         rid;
    int         rclk;
    int         regshft;
};

Each pci_id entry describes one UART variant and how to configure it:

vendor and device: The primary identification pair. Every PCI device has a 16-bit vendor ID (assigned by the PCI Special Interest Group) and a 16-bit device ID (assigned by the vendor). For example, Intel is vendor 0x8086, and their AMT Serial-over-LAN controller is device 0x108f. These IDs are read from the device's configuration space at bus enumeration time.

subven and subdev: Secondary identification for OEM customization. Many manufacturers build cards using reference designs from chipset vendors, then assign their own subsystem vendor and device IDs. A value of 0xffff in these fields acts as a wildcard, meaning "match any subsystem IDs." This allows matching either specific OEM variants or entire chipset families.

The four-level matching hierarchy enables precise identification:

  1. Match only specific OEM cards: all four IDs must match exactly
  2. Match all cards using a chipset: vendor/device match, subven/subdev are 0xffff
  3. Match specific OEM customization: vendor/device plus exact subven/subdev

desc: Human-readable device description displayed in boot messages and dmesg output. Examples: "Intel AMT - SOL" or "Oxford Semiconductor OXCB950 Cardbus 16950 UART". This string helps administrators identify which physical device corresponds to which /dev/cuaU* entry.

rid: Resource ID specifying which PCI Base Address Register (BAR) contains the UART's registers. PCI devices can have up to six BARs (numbered 0x10, 0x14, 0x18, 0x1c, 0x20, 0x24). Most UARTs use BAR 0 (0x10), but some multi-function cards place the UART at alternate BARs. This field may also encode flags via the high bits.

rclk: Reference clock frequency in Hz. The UART's baud rate generator divides this clock to produce serial bit timing. Standard PC UARTs use 1843200 Hz (1.8432 MHz), but embedded UARTs and specialized cards often use different frequencies. Some Intel devices use 24x the standard clock for high-speed operation. An incorrect rclk causes garbled serial communication due to baud rate mismatch.

regshft: Register address shift value. Most UARTs place consecutive registers at consecutive byte addresses (shift = 0), but some embed the UART in larger register spaces with registers at every 4th byte (shift = 2) or other intervals. The driver shifts register offsets by this amount when accessing hardware. This accommodates SoC designs where the UART shares address space with other peripherals.

The Simplified Identification Structure
struct pci_unique_id {
    uint16_t    vendor;
    uint16_t    device;
};

This smaller structure identifies devices guaranteed to exist only once per system. Certain hardware, particularly server management controllers and embedded SoC UARTs, is designed as singleton devices. For these, vendor and device IDs alone suffice for matching against system consoles, without needing subsystem IDs or configuration parameters.

The distinction matters for console matching: if a UART serves as the system console (configured in firmware or boot loader), the kernel must identify which enumerated device corresponds to the pre-configured console. For unique devices, a simple vendor/device match provides certainty.

Resource ID Encoding
#define PCI_NO_MSI      0x40000000
#define PCI_RID_MASK    0x0000ffff

The rid field serves double duty through bit packing:

PCI_RID_MASK (0x0000ffff): The lower 16 bits contain the actual BAR number (0x10, 0x14, etc.). Masking with this value extracts the resource ID for bus allocation functions.

PCI_NO_MSI (0x40000000): The high bit flags devices with broken or unreliable Message Signaled Interrupt (MSI) support. Some UART implementations don't correctly implement MSI, causing interrupt delivery failures or system hangs. This flag tells the attach function to use traditional line-based interrupts instead of attempting MSI allocation.

This encoding scheme avoids enlarging the pci_id structure with an additional boolean field. Since BAR numbers only use the low byte, the high bits are available for flags. The driver extracts the actual RID with id->rid & PCI_RID_MASK and checks MSI capability with (id->rid & PCI_NO_MSI) == 0.

Purpose in Device Matching

These structures populate a large static array (examined in the next fragment) that the probe function searches during device enumeration. When the PCI bus driver discovers a device with class "Simple Communications" (modems and UARTs), it calls this driver's probe function. The probe function walks the array comparing the device's IDs against each entry, looking for a match. Upon finding one, it uses the associated desc, rid, rclk, and regshft values to configure the device correctly.

This table-driven approach simplifies adding new hardware support: most new UART variants require only adding a table entry with the correct IDs and clock frequency, without modifying code.

3) The PCI ID table (ns8250-ish parts)

Below is the contiguous table used to match vendor/device(/subvendor/subdevice), plus per-device hints (RID, reference clock, register shift). The 0xffff row terminates the list.

 86: static const struct pci_id pci_ns8250_ids[] = {
 87: { 0x1028, 0x0008, 0xffff, 0, "Dell Remote Access Card III", 0x14,
 88: 	128 * DEFAULT_RCLK },
 89: { 0x1028, 0x0012, 0xffff, 0, "Dell RAC 4 Daughter Card Virtual UART", 0x14,
 90: 	128 * DEFAULT_RCLK },
 91: { 0x1033, 0x0074, 0x1033, 0x8014, "NEC RCV56ACF 56k Voice Modem", 0x10 },
 92: { 0x1033, 0x007d, 0x1033, 0x8012, "NEC RS232C", 0x10 },
 93: { 0x103c, 0x1048, 0x103c, 0x1227, "HP Diva Serial [GSP] UART - Powerbar SP2",
 94: 	0x10 },
 95: { 0x103c, 0x1048, 0x103c, 0x1301, "HP Diva RMP3", 0x14 },
 96: { 0x103c, 0x1290, 0xffff, 0, "HP Auxiliary Diva Serial Port", 0x18 },
 97: { 0x103c, 0x3301, 0xffff, 0, "HP iLO serial port", 0x10 },
 98: { 0x11c1, 0x0480, 0xffff, 0, "Agere Systems Venus Modem (V90, 56KFlex)", 0x14 },
 99: { 0x115d, 0x0103, 0xffff, 0, "Xircom Cardbus Ethernet + 56k Modem", 0x10 },
100: { 0x125b, 0x9100, 0xa000, 0x1000,
101: 	"ASIX AX99100 PCIe 1/2/3/4-port RS-232/422/485", 0x10 },
102: { 0x1282, 0x6585, 0xffff, 0, "Davicom 56PDV PCI Modem", 0x10 },
103: { 0x12b9, 0x1008, 0xffff, 0, "3Com 56K FaxModem Model 5610", 0x10 },
104: { 0x131f, 0x1000, 0xffff, 0, "Siig CyberSerial (1-port) 16550", 0x18 },
105: { 0x131f, 0x1001, 0xffff, 0, "Siig CyberSerial (1-port) 16650", 0x18 },
106: { 0x131f, 0x1002, 0xffff, 0, "Siig CyberSerial (1-port) 16850", 0x18 },
107: { 0x131f, 0x2000, 0xffff, 0, "Siig CyberSerial (1-port) 16550", 0x10 },
108: { 0x131f, 0x2001, 0xffff, 0, "Siig CyberSerial (1-port) 16650", 0x10 },
109: { 0x131f, 0x2002, 0xffff, 0, "Siig CyberSerial (1-port) 16850", 0x10 },
110: { 0x135a, 0x0a61, 0xffff, 0, "Brainboxes UC-324", 0x18 },
111: { 0x135a, 0x0aa1, 0xffff, 0, "Brainboxes UC-246", 0x18 },
112: { 0x135a, 0x0aa2, 0xffff, 0, "Brainboxes UC-246", 0x18 },
113: { 0x135a, 0x0d60, 0xffff, 0, "Intashield IS-100", 0x18 },
114: { 0x135a, 0x0da0, 0xffff, 0, "Intashield IS-300", 0x18 },
115: { 0x135a, 0x4000, 0xffff, 0, "Brainboxes PX-420", 0x10 },
116: { 0x135a, 0x4001, 0xffff, 0, "Brainboxes PX-431", 0x10 },
117: { 0x135a, 0x4002, 0xffff, 0, "Brainboxes PX-820", 0x10 },
118: { 0x135a, 0x4003, 0xffff, 0, "Brainboxes PX-831", 0x10 },
119: { 0x135a, 0x4004, 0xffff, 0, "Brainboxes PX-246", 0x10 },
120: { 0x135a, 0x4005, 0xffff, 0, "Brainboxes PX-101", 0x10 },
121: { 0x135a, 0x4006, 0xffff, 0, "Brainboxes PX-257", 0x10 },
122: { 0x135a, 0x4008, 0xffff, 0, "Brainboxes PX-846", 0x10 },
123: { 0x135a, 0x4009, 0xffff, 0, "Brainboxes PX-857", 0x10 },
124: { 0x135c, 0x0190, 0xffff, 0, "Quatech SSCLP-100", 0x18 },
125: { 0x135c, 0x01c0, 0xffff, 0, "Quatech SSCLP-200/300", 0x18 },
126: { 0x135e, 0x7101, 0xffff, 0, "Sealevel Systems Single Port RS-232/422/485/530",
127: 	0x18 },
128: { 0x1407, 0x0110, 0xffff, 0, "Lava Computer mfg DSerial-PCI Port A", 0x10 },
129: { 0x1407, 0x0111, 0xffff, 0, "Lava Computer mfg DSerial-PCI Port B", 0x10 },
130: { 0x1407, 0x0510, 0xffff, 0, "Lava SP Serial 550 PCI", 0x10 },
131: { 0x1409, 0x7168, 0x1409, 0x4025, "Timedia Technology Serial Port", 0x10,
132: 	8 * DEFAULT_RCLK },
133: { 0x1409, 0x7168, 0x1409, 0x4027, "Timedia Technology Serial Port", 0x10,
134: 	8 * DEFAULT_RCLK },
135: { 0x1409, 0x7168, 0x1409, 0x4028, "Timedia Technology Serial Port", 0x10,
136: 	8 * DEFAULT_RCLK },
137: { 0x1409, 0x7168, 0x1409, 0x5025, "Timedia Technology Serial Port", 0x10,
138: 	8 * DEFAULT_RCLK },
139: { 0x1409, 0x7168, 0x1409, 0x5027, "Timedia Technology Serial Port", 0x10,
140: 	8 * DEFAULT_RCLK },
141: { 0x1415, 0x950b, 0xffff, 0, "Oxford Semiconductor OXCB950 Cardbus 16950 UART",
142: 	0x10, 16384000 },
143: { 0x1415, 0xc120, 0xffff, 0, "Oxford Semiconductor OXPCIe952 PCIe 16950 UART",
144: 	0x10 },
145: { 0x14e4, 0x160a, 0xffff, 0, "Broadcom TruManage UART", 0x10,
146: 	128 * DEFAULT_RCLK, 2},
147: { 0x14e4, 0x4344, 0xffff, 0, "Sony Ericsson GC89 PC Card", 0x10},
148: { 0x151f, 0x0000, 0xffff, 0, "TOPIC Semiconductor TP560 56k modem", 0x10 },
149: { 0x1d0f, 0x8250, 0x0000, 0, "Amazon PCI serial device", 0x10 },
150: { 0x1d0f, 0x8250, 0x1d0f, 0, "Amazon PCI serial device", 0x10 },
151: { 0x1fd4, 0x1999, 0x1fd4, 0x0001, "Sunix SER5xxxx Serial Port", 0x10,
152: 	8 * DEFAULT_RCLK },
153: { 0x8086, 0x0c5f, 0xffff, 0, "Atom Processor S1200 UART",
154: 	0x10 | PCI_NO_MSI },
155: { 0x8086, 0x0f0a, 0xffff, 0, "Intel ValleyView LPIO1 HSUART#1", 0x10,
156: 	24 * DEFAULT_RCLK, 2 },
157: { 0x8086, 0x0f0c, 0xffff, 0, "Intel ValleyView LPIO1 HSUART#2", 0x10,
158: 	24 * DEFAULT_RCLK, 2 },
159: { 0x8086, 0x108f, 0xffff, 0, "Intel AMT - SOL", 0x10 },
160: { 0x8086, 0x19d8, 0xffff, 0, "Intel Denverton UART", 0x10 },
161: { 0x8086, 0x1c3d, 0xffff, 0, "Intel AMT - KT Controller", 0x10 },
162: { 0x8086, 0x1d3d, 0xffff, 0, "Intel C600/X79 Series Chipset KT Controller",
163: 	0x10 },
164: { 0x8086, 0x1e3d, 0xffff, 0, "Intel Panther Point KT Controller", 0x10 },
165: { 0x8086, 0x228a, 0xffff, 0, "Intel Cherryview SIO HSUART#1", 0x10,
166: 	24 * DEFAULT_RCLK, 2 },
167: { 0x8086, 0x228c, 0xffff, 0, "Intel Cherryview SIO HSUART#2", 0x10,
168: 	24 * DEFAULT_RCLK, 2 },
169: { 0x8086, 0x2a07, 0xffff, 0, "Intel AMT - PM965/GM965 KT Controller", 0x10 },
170: { 0x8086, 0x2a47, 0xffff, 0, "Mobile 4 Series Chipset KT Controller", 0x10 },
171: { 0x8086, 0x2e17, 0xffff, 0, "4 Series Chipset Serial KT Controller", 0x10 },
172: { 0x8086, 0x31bc, 0xffff, 0, "Intel Gemini Lake SIO/LPSS UART 0", 0x10,
173: 	24 * DEFAULT_RCLK, 2 },
174: { 0x8086, 0x31be, 0xffff, 0, "Intel Gemini Lake SIO/LPSS UART 1", 0x10,
175: 	24 * DEFAULT_RCLK, 2 },
176: { 0x8086, 0x31c0, 0xffff, 0, "Intel Gemini Lake SIO/LPSS UART 2", 0x10,
177: 	24 * DEFAULT_RCLK, 2 },
178: { 0x8086, 0x31ee, 0xffff, 0, "Intel Gemini Lake SIO/LPSS UART 3", 0x10,
179: 	24 * DEFAULT_RCLK, 2 },
180: { 0x8086, 0x3b67, 0xffff, 0, "5 Series/3400 Series Chipset KT Controller",
181: 	0x10 },
182: { 0x8086, 0x5abc, 0xffff, 0, "Intel Apollo Lake SIO/LPSS UART 0", 0x10,
183: 	24 * DEFAULT_RCLK, 2 },
184: { 0x8086, 0x5abe, 0xffff, 0, "Intel Apollo Lake SIO/LPSS UART 1", 0x10,
185: 	24 * DEFAULT_RCLK, 2 },
186: { 0x8086, 0x5ac0, 0xffff, 0, "Intel Apollo Lake SIO/LPSS UART 2", 0x10,
187: 	24 * DEFAULT_RCLK, 2 },
188: { 0x8086, 0x5aee, 0xffff, 0, "Intel Apollo Lake SIO/LPSS UART 3", 0x10,
189: 	24 * DEFAULT_RCLK, 2 },
190: { 0x8086, 0x8811, 0xffff, 0, "Intel EG20T Serial Port 0", 0x10 },
191: { 0x8086, 0x8812, 0xffff, 0, "Intel EG20T Serial Port 1", 0x10 },
192: { 0x8086, 0x8813, 0xffff, 0, "Intel EG20T Serial Port 2", 0x10 },
193: { 0x8086, 0x8814, 0xffff, 0, "Intel EG20T Serial Port 3", 0x10 },
194: { 0x8086, 0x8c3d, 0xffff, 0, "Intel Lynx Point KT Controller", 0x10 },
195: { 0x8086, 0x8cbd, 0xffff, 0, "Intel Wildcat Point KT Controller", 0x10 },
196: { 0x8086, 0x8d3d, 0xffff, 0,
197: 	"Intel Corporation C610/X99 series chipset KT Controller", 0x10 },
198: { 0x8086, 0x9c3d, 0xffff, 0, "Intel Lynx Point-LP HECI KT", 0x10 },
199: { 0x8086, 0xa13d, 0xffff, 0,
200: 	"100 Series/C230 Series Chipset Family KT Redirection",
201: 	0x10 | PCI_NO_MSI },
202: { 0x9710, 0x9820, 0x1000, 1, "NetMos NM9820 Serial Port", 0x10 },
203: { 0x9710, 0x9835, 0x1000, 1, "NetMos NM9835 Serial Port", 0x10 },
204: { 0x9710, 0x9865, 0xa000, 0x1000, "NetMos NM9865 Serial Port", 0x10 },
205: { 0x9710, 0x9900, 0xa000, 0x1000,
206: 	"MosChip MCS9900 PCIe to Peripheral Controller", 0x10 },
207: { 0x9710, 0x9901, 0xa000, 0x1000,
208: 	"MosChip MCS9901 PCIe to Peripheral Controller", 0x10 },
209: { 0x9710, 0x9904, 0xa000, 0x1000,
210: 	"MosChip MCS9904 PCIe to Peripheral Controller", 0x10 },
211: { 0x9710, 0x9922, 0xa000, 0x1000,
212: 	"MosChip MCS9922 PCIe to Peripheral Controller", 0x10 },
213: { 0xdeaf, 0x9051, 0xffff, 0, "Middle Digital PC Weasel Serial Port", 0x10 },
214: { 0xffff, 0, 0xffff, 0, NULL, 0, 0}
215: };

Notice per-device RID (which BAR/IRQ), frequency hints (rclk like 24 \* DEFAULT_RCLK), and optional regshft.

The Device Identification Table

The pci_ns8250_ids array is the heart of the driver's device recognition logic. This table lists every known PCI UART variant compatible with the NS8250/16550 register interface, along with the configuration parameters needed to operate each correctly. During system boot, the PCI bus driver walks all discovered devices and calls this driver's probe function for potential matches; the probe function searches this table to determine compatibility.

Table Structure and Purpose
static const struct pci_id pci_ns8250_ids[] = {

The array name, pci_ns8250_ids, reflects that all listed devices implement the National Semiconductor 8250 (or compatible 16450/16550/16650/16750/16850/16950) register interface. Despite coming from dozens of manufacturers, these UARTs share a common programming model dating back to the original IBM PC's serial port design. This compatibility allows a single driver to support disparate hardware through a unified register abstraction.

The static const qualifiers indicate this data is read-only and internal to this compilation unit. The table resides in read-only memory, preventing accidental modification and allowing the kernel to share one copy across all CPU cores.

Entry Analysis: Understanding the Patterns

Examining representative entries reveals the matching hierarchy and configuration diversity:

Simple wildcard match (line 159):

{ 0x8086, 0x108f, 0xffff, 0, "Intel AMT - SOL", 0x10 },
  • Vendor 0x8086 (Intel), device 0x108f (AMT Serial-over-LAN)
  • Subsystem IDs 0xffff (wildcard) match all OEM variants
  • Description for boot messages and device listings
  • RID 0x10 (BAR0), standard clock rate (implied DEFAULT_RCLK), no register shift

This pattern matches Intel's AMT SOL controller regardless of which motherboard manufacturer integrated it.

OEM-specific match (lines 93-95):

{ 0x103c, 0x1048, 0x103c, 0x1227, "HP Diva Serial [GSP] UART - Powerbar SP2", 0x10 },
{ 0x103c, 0x1048, 0x103c, 0x1301, "HP Diva RMP3", 0x14 },
  • Same chipset (HP vendor 0x103c, device 0x1048) used in multiple products
  • Different subsystem device IDs (0x1227, 0x1301) distinguish variants
  • Different BARs (0x10 vs 0x14) indicate the UART appears at different addresses in each card's configuration space

This illustrates how one chipset spawns multiple table entries when OEMs configure it differently across product lines.

Non-standard clock frequency (lines 87-88):

{ 0x1028, 0x0008, 0xffff, 0, "Dell Remote Access Card III", 0x14,
    128 * DEFAULT_RCLK },
  • Dell (0x1028) RAC III uses 128x the standard 1.8432 MHz clock = 235.9296 MHz
  • This extremely high frequency supports baud rates far beyond standard serial ports
  • Without the correct rclk value, all baud rate calculations would be wrong by 128x, producing gibberish

Server management cards often use high clocks to support fast console redirection over network links.

Register address shifting (lines 155-156):

{ 0x8086, 0x0f0a, 0xffff, 0, "Intel ValleyView LPIO1 HSUART#1", 0x10,
    24 * DEFAULT_RCLK, 2 },
  • Intel SoC UART with 24x standard clock for high-speed operation
  • regshft = 2 means registers appear at 4-byte intervals (addresses 0, 4, 8, 12, ...)
  • The generic UART code shifts all register offsets left by 2 bits: address << 2

This accommodates SoC designs where the UART shares a large memory-mapped region with other peripherals, often with registers aligned to 32-bit boundaries for bus efficiency.

MSI incompatibility (lines 153-154, 199-201):

{ 0x8086, 0x0c5f, 0xffff, 0, "Atom Processor S1200 UART",
    0x10 | PCI_NO_MSI },
  • The PCI_NO_MSI flag in the RID field indicates broken MSI support
  • The attach function will detect this flag and use legacy line-based interrupts instead
  • These devices claim MSI capability in their PCI configuration space but don't deliver interrupts correctly

Such quirks typically arise from silicon errata or incomplete MSI implementation in integrated peripherals.

Multiple subsystem variants (lines 131-140):

{ 0x1409, 0x7168, 0x1409, 0x4025, "Timedia Technology Serial Port", 0x10,
    8 * DEFAULT_RCLK },
{ 0x1409, 0x7168, 0x1409, 0x4027, "Timedia Technology Serial Port", 0x10,
    8 * DEFAULT_RCLK },
  • Same base chipset (vendor 0x1409, device 0x7168) used across a product family
  • Each subsystem device ID represents a different card model or port count variant
  • All share the same clock (8x standard) and BAR configuration
  • The probe function matches the first entry with compatible subsystem IDs

This repetition is unavoidable when a manufacturer uses one chipset across many SKUs, each with unique subsystem identification.

The Sentinel Entry
{ 0xffff, 0, 0xffff, 0, NULL, 0, 0}

The final entry marks the table's end. The matching function walks entries until finding vendor == 0xffff, indicating no more devices to check. Using 0xffff (an invalid vendor ID; no such vendor exists) ensures the sentinel can't accidentally match real hardware.

Table Maintenance and Evolution

This table grows continuously as new UART hardware appears. Adding support for a new device typically requires:

  1. Determining the vendor/device/subsystem IDs (via pciconf -lv on FreeBSD)
  2. Finding the correct BAR where the UART registers reside (often documented, sometimes discovered via trial)
  3. Identifying the clock frequency (from datasheets or experimentation)
  4. Testing that standard NS8250 register access works

Most entries use default values (standard clock, no shift, BAR0), requiring only IDs and a description. Complex entries like those with unusual clocks or MSI quirks often emerge from bug reports or hardware donations to developers.

The table-driven approach keeps the code maintainable: adding a new UART rarely requires code changes, just a new table entry. This is critical for a subsystem supporting dozens of manufacturers and hundreds of product variants accumulated over decades of PC hardware evolution.

Architectural Note

This table documents only NS8250-compatible UARTs. Non-compatible serial controllers (like USB serial adapters, IEEE 1394 serial, or proprietary designs) use different drivers. The probe function verifies NS8250 compatibility before accepting a device, ensuring this table's assumptions hold for all matched hardware.

4) Matching function: from PCI IDs to a hit

218: const static struct pci_id *
219: uart_pci_match(device_t dev, const struct pci_id *id)
220: {
221: 	uint16_t device, subdev, subven, vendor;
222: 
223: 	vendor = pci_get_vendor(dev);
224: 	device = pci_get_device(dev);
225: 	while (id->vendor != 0xffff &&
226: 	    (id->vendor != vendor || id->device != device))
227: 		id++;
228: 	if (id->vendor == 0xffff)
229: 		return (NULL);
230: 	if (id->subven == 0xffff)
231: 		return (id);
232: 	subven = pci_get_subvendor(dev);
233: 	subdev = pci_get_subdevice(dev);
234: 	while (id->vendor == vendor && id->device == device &&
235: 	    (id->subven != subven || id->subdev != subdev))
236: 		id++;
237: 	return ((id->vendor == vendor && id->device == device) ? id : NULL);

First match vendor/device; if the entry has specific sub-IDs, check those too; otherwise accept the wildcard.

Device Matching Logic: uart_pci_match

The uart_pci_match function implements a two-phase search algorithm that efficiently matches PCI devices against the identification table while respecting the vendor/device/subsystem hierarchy. This function is the core of device recognition, called during probe to determine if a discovered PCI device is a supported UART.

Function Signature and Context
const static struct pci_id *
uart_pci_match(device_t dev, const struct pci_id *id)
{
    uint16_t device, subdev, subven, vendor;

The function accepts a device_t representing the PCI device being probed and a pointer to the start of the identification table. It returns either a pointer to the matching pci_id entry (containing configuration parameters) or NULL if no match exists.

The return type is const struct pci_id * because the function returns a pointer into the read-only table, the caller must not modify the returned entry.

Phase One: Primary ID Matching
vendor = pci_get_vendor(dev);
device = pci_get_device(dev);
while (id->vendor != 0xffff &&
    (id->vendor != vendor || id->device != device))
    id++;
if (id->vendor == 0xffff)
    return (NULL);

The function begins by reading the device's primary identification from PCI configuration space. The pci_get_vendor() and pci_get_device() functions access configuration space registers 0x00 and 0x02, which every PCI device must implement.

The search loop: The while condition has two termination criteria:

  1. id->vendor != 0xffff - haven't reached the sentinel entry
  2. (id->vendor != vendor || id->device != device) - current entry doesn't match

The loop advances through the table until finding either a matching vendor/device pair or the sentinel. This linear search is acceptable because:

  • The table has fewer than 100 entries (fast even with linear search)
  • Probe happens once per device at boot (not performance-critical)
  • The table is in cache-friendly sequential memory

Sentinel detection: If the loop exits with id->vendor == 0xffff, no entry matched the device's primary IDs. Returning NULL signals "not my device" to the probe function, which will return ENXIO to allow other drivers a chance.

Wildcard Subsystem Handling
if (id->subven == 0xffff)
    return (id);

This is the fast-path exit for entries with wildcard subsystem IDs. When subven == 0xffff, the entry matches all variants of this chipset regardless of OEM customization. The function returns immediately without reading subsystem IDs from configuration space.

This optimization avoids unnecessary PCI configuration reads for the common case where the driver accepts all OEM variants of a chipset (e.g., "Intel AMT - SOL" matches Intel's chipset in any motherboard).

Phase Two: Subsystem ID Matching
subven = pci_get_subvendor(dev);
subdev = pci_get_subdevice(dev);
while (id->vendor == vendor && id->device == device &&
    (id->subven != subven || id->subdev != subdev))
    id++;

For entries requiring specific subsystem matches, the function reads the subsystem vendor and device IDs from PCI configuration space registers 0x2C and 0x2E.

The refinement loop: This second search advances through consecutive table entries with the same primary IDs, looking for a subsystem match. The loop continues while:

  1. id->vendor == vendor && id->device == device - still examining entries for this chipset
  2. (id->subven != subven || id->subdev != subdev) - subsystem IDs don't match

This handles tables with multiple entries for one chipset, each specifying different OEM variants:

c

{ 0x103c, 0x1048, 0x103c, 0x1227, "HP Diva Serial - Powerbar SP2", 0x10 },
{ 0x103c, 0x1048, 0x103c, 0x1301, "HP Diva RMP3", 0x14 },

Both entries have vendor 0x103c and device 0x1048, but different subsystem device IDs. The loop examines each until finding the correct variant.

Final Validation
return ((id->vendor == vendor && id->device == device) ? id : NULL);

After the refinement loop exits, one of two conditions holds:

  1. The loop found a matching entry (all four IDs match) -> return it
  2. The loop exhausted entries for this chipset without matching subsystems -> return NULL

The ternary expression performs a final sanity check: even though the loop condition guarantees id points to an entry with matching primary IDs (or past the last such entry), explicitly verifying ensures correct behavior if the loop walked past all entries for this device without finding a subsystem match.

This covers the case where:

  • Primary IDs match (phase one succeeded)
  • Table has entries with specific subsystem requirements
  • None of those subsystem entries match the device
  • The loop advanced until finding a different primary ID or the sentinel
Matching Examples

Example 1: Simple wildcard match

  • Device: Intel AMT SOL (vendor 0x8086, device 0x108f)
  • Phase one: finds { 0x8086, 0x108f, 0xffff, 0, ... }
  • Wildcard check: subven == 0xffff, return immediately
  • Result: match without reading subsystem IDs

Example 2: OEM-specific match

  • Device: HP Diva RMP3 (vendor 0x103c, device 0x1048, subven 0x103c, subdev 0x1301)
  • Phase one: finds first entry with vendor 0x103c, device 0x1048
  • Wildcard check: subven != 0xffff, read subsystem IDs
  • Phase two: first entry has subdev 0x1227 (no match), advance
  • Phase two: second entry has subdev 0x1301 (match!), return
  • Result: returns second entry with BAR 0x14 and correct description

Example 3: No match

  • Device: Unknown UART (vendor 0x1234, device 0x5678)
  • Phase one: walks entire table without finding matching primary IDs
  • Sentinel detection: returns NULL
  • Result: probe function returns ENXIO
Efficiency Considerations

The two-phase approach optimizes the common case:

  • Most table entries use wildcard subsystems (require only primary ID match)
  • Reading PCI configuration space is slower than memory access
  • Deferring subsystem ID reads until necessary reduces probe latency

For devices with wildcard entries, the function performs two configuration space reads (vendor, device) and returns. Only devices requiring subsystem matching incur four reads.

The linear search is justified because:

  • Table size is bounded and small (< 100 entries)
  • Modern CPUs prefetch sequential memory efficiently
  • Probe happens once per device lifetime, not in I/O paths
  • Code simplicity outweighs marginal speedup from binary search or hash tables
Integration with Probe Function

The probe function calls uart_pci_match with the table base pointer:

id = uart_pci_match(dev, pci_ns8250_ids);
if (id != NULL) {
    sc->sc_class = &uart_ns8250_class;
    goto match;
}

A non-NULL return provides both confirmation that the device is supported and access to its configuration parameters (id->rid, id->rclk, id->regshft). The probe function uses these values to initialize the generic UART layer correctly for this hardware variant.

5) Console uniqueness helper (rare but educational)

239: extern SLIST_HEAD(uart_devinfo_list, uart_devinfo) uart_sysdevs;
242: static const struct pci_unique_id pci_unique_devices[] = {
243: { 0x1d0f, 0x8250 }	/* Amazon PCI serial device */
244: };
248: static void
249: uart_pci_unique_console_match(device_t dev)
250: {
251: 	struct uart_softc *sc;
252: 	struct uart_devinfo * sysdev;
253: 	const struct pci_unique_id * id;
254: 	uint16_t vendor, device;
255: 
256: 	sc = device_get_softc(dev);
257: 	vendor = pci_get_vendor(dev);
258: 	device = pci_get_device(dev);
259: 
260: 	/* Is this a device known to exist only once in a system? */
261: 	for (id = pci_unique_devices; ; id++) {
262: 		if (id == &pci_unique_devices[nitems(pci_unique_devices)])
263: 			return;
264: 		if (id->vendor == vendor && id->device == device)
265: 			break;
266: 	}
267: 
268: 	/* If it matches a console, it must be the same device. */
269: 	SLIST_FOREACH(sysdev, &uart_sysdevs, next) {
270: 		if (sysdev->pci_info.vendor == vendor &&
271: 		    sysdev->pci_info.device == device) {
272: 			sc->sc_sysdev = sysdev;
273: 			sysdev->bas.rclk = sc->sc_bas.rclk;
274: 		}
275: 	}

If a PCI UART is known to be unique in a system, tie it to the console instance automatically.

Console Device Matching: uart_pci_unique_console_match

FreeBSD must identify which UART serves as the system console, the device where boot messages appear and where single-user mode login occurs. For most systems, firmware or the boot loader configures the console before the kernel starts, but the kernel must later match this pre-configured console to the correct driver instance during PCI enumeration. The uart_pci_unique_console_match function solves this matching problem for devices guaranteed to exist only once per system.

The Console Matching Problem

When the kernel boots, early console output may use a UART initialized by firmware (BIOS/UEFI) or the boot loader. This "system device" (sysdev) has register addresses and basic configuration but no association with a PCI device tree entry. Later, during normal device enumeration, the PCI bus driver discovers UARTs and attaches driver instances. The kernel must determine which enumerated device corresponds to the pre-configured console.

The challenge: PCI enumeration order is not guaranteed. The device at PCI address 0:1f:3 (bus 0, device 31, function 3) might enumerate as uart0 on one boot and uart1 after adding a card. Matching by device tree position would be unreliable.

The Unique Device Approach
extern SLIST_HEAD(uart_devinfo_list, uart_devinfo) uart_sysdevs;

static const struct pci_unique_id pci_unique_devices[] = {
{ 0x1d0f, 0x8250 }  /* Amazon PCI serial device */
};

The solution for certain hardware: some devices are architecturally guaranteed to exist only once. Server management controllers, SoC-integrated UARTs, and cloud instance serial ports fall into this category. For these devices, vendor and device IDs alone suffice for matching.

The uart_sysdevs list contains pre-configured console devices recorded during early boot. Each uart_devinfo structure captures the console's register base address, baud rate, and (if known) PCI identification.

The pci_unique_devices array lists devices meeting the uniqueness criterion. Currently it contains only Amazon's EC2 serial device (vendor 0x1d0f, device 0x8250), which exists exactly once in EC2 instances and serves as the console for serial console access.

Function Entry and Device Identification
static void
uart_pci_unique_console_match(device_t dev)
{
    struct uart_softc *sc;
    struct uart_devinfo * sysdev;
    const struct pci_unique_id * id;
    uint16_t vendor, device;

    sc = device_get_softc(dev);
    vendor = pci_get_vendor(dev);
    device = pci_get_device(dev);

The function is called from uart_pci_probe after successful device identification but before final probe completion. It receives the device being probed and retrieves:

  • The softc (driver instance state) via device_get_softc()
  • The device's vendor and device IDs from PCI configuration space

The softc at this point has been partially initialized by uart_bus_probe() with register access methods and clock rates, but sc->sc_sysdev is NULL unless console matching succeeds.

Uniqueness Verification
/* Is this a device known to exist only once in a system? */
for (id = pci_unique_devices; ; id++) {
    if (id == &pci_unique_devices[nitems(pci_unique_devices)])
        return;
    if (id->vendor == vendor && id->device == device)
        break;
}

The loop searches the unique device table for a match. Two exit conditions:

Not unique: If the loop walks past the last entry without matching, this device isn't guaranteed unique. The function returns immediately; console matching requires stricter identification (likely including subsystem IDs or base address comparison), which this function doesn't attempt.

Is unique: If vendor and device IDs match an entry, the device is guaranteed unique in the system. The loop breaks, and matching proceeds.

The array bounds check uses nitems(pci_unique_devices), a macro computing array element count. This pointer comparison detects when id has advanced past the array's end:

if (id == &pci_unique_devices[nitems(pci_unique_devices)])

This is equivalent to id == pci_unique_devices + array_length, checking if the pointer equals the address just beyond the last valid element.

Console Device Matching
/* If it matches a console, it must be the same device. */
SLIST_FOREACH(sysdev, &uart_sysdevs, next) {
    if (sysdev->pci_info.vendor == vendor &&
        sysdev->pci_info.device == device) {
        sc->sc_sysdev = sysdev;
        sysdev->bas.rclk = sc->sc_bas.rclk;
    }
}

The SLIST_FOREACH macro iterates the system device list, checking each pre-configured console for matching PCI IDs. The list typically contains zero or one entry (systems without serial consoles or with one console), but the code correctly handles multiple consoles.

Match confirmation: When sysdev->pci_info matches the device's vendor and device IDs, the uniqueness guarantee ensures this enumerated device is the same physical hardware the firmware configured as a console. No ambiguity exists; there's only one device with these IDs in the system.

Linking the instances: sc->sc_sysdev = sysdev creates a bidirectional association:

  • The driver instance (sc) now knows it's managing a console device
  • Console-specific behaviors activate: special character handling, kernel message output, debugger entry

Clock synchronization: sysdev->bas.rclk = sc->sc_bas.rclk updates the system device's clock rate to match the value from the identification table. Early boot initialization might not know the precise clock frequency, using a default or probe-detected value. The PCI driver, having matched the device against the table, knows the correct frequency and updates the system device record.

This clock update is critical: if early boot used an incorrect clock, baud rate calculations would be wrong. The console might have worked by luck (if firmware configured the UART's divisor latch directly) but would fail when the driver reconfigures it. Synchronizing rclk ensures subsequent operations use correct values.

Why This Function Exists

Traditional console matching compares base addresses: the system device's physical register address matches the PCI BAR of one enumerated device. This works reliably but requires reading BARs for all UARTs and handling complications like I/O port vs. memory-mapped registers.

For unique devices, vendor/device ID matching is simpler and equally reliable. The uniqueness guarantee eliminates ambiguity: if a unique device exists as a console and that device is enumerated, they must be the same.

Limitations and Scope

This function only handles devices in pci_unique_devices. Most UARTs don't qualify:

  • Multi-port cards have identical vendor/device IDs for all ports
  • Generic chipsets appear in multiple products
  • Motherboard UARTs from one vendor may use the same chipset across product lines

For non-unique devices, the probe function falls back to other matching methods (typically base address comparison in uart_bus_probe), or the console association might be established through hints or device tree properties.

The function is called opportunistically: it attempts to match for all probed devices but only succeeds for unique devices that also happen to be consoles. Failure is not an error; it simply means this device is either not unique or not a console.

Integration Context

The probe function calls this after initial device identification:

result = uart_bus_probe(dev, ...);
if (sc->sc_sysdev == NULL)
    uart_pci_unique_console_match(dev);

The check sc->sc_sysdev == NULL ensures this function runs only if uart_bus_probe didn't already establish a console association through other means. This ordering provides a fallback: try precise matching first (base address comparison), then try unique device matching.

If matching succeeds, subsequent driver operations recognize the console status and enable special handling: synchronous output for panic messages, debugger break character detection, and kernel message routing.

6) probe: choose the class and call the shared bus probe

277: static int
278: uart_pci_probe(device_t dev)
279: {
280: 	struct uart_softc *sc;
281: 	const struct pci_id *id;
282: 	struct pci_id cid = {
283: 		.regshft = 0,
284: 		.rclk = 0,
285: 		.rid = 0x10 | PCI_NO_MSI,
286: 		.desc = "Generic SimpleComm PCI device",
287: 	};
288: 	int result;
289: 
290: 	sc = device_get_softc(dev);
291: 
292: 	id = uart_pci_match(dev, pci_ns8250_ids);
293: 	if (id != NULL) {
294: 		sc->sc_class = &uart_ns8250_class;
295: 		goto match;
296: 	}
297: 	if (pci_get_class(dev) == PCIC_SIMPLECOMM &&
298: 	    pci_get_subclass(dev) == PCIS_SIMPLECOMM_UART &&
299: 	    pci_get_progif(dev) < PCIP_SIMPLECOMM_UART_16550A) {
300: 		/* XXX rclk what to do */
301: 		id = &cid;
302: 		sc->sc_class = &uart_ns8250_class;
303: 		goto match;
304: 	}
305: 	/* Add checks for non-ns8250 IDs here. */
306: 	return (ENXIO);
307: 
308:  match:
309: 	result = uart_bus_probe(dev, id->regshft, 0, id->rclk,
310: 	    id->rid & PCI_RID_MASK, 0, 0);
311: 	/* Bail out on error. */
312: 	if (result > 0)
313: 		return (result);
314: 	/*
315: 	 * If we haven't already matched this to a console, check if it's a
316: 	 * PCI device which is known to only exist once in any given system
317: 	 * and we can match it that way.
318: 	 */
319: 	if (sc->sc_sysdev == NULL)
320: 		uart_pci_unique_console_match(dev);
321: 	/* Set/override the device description. */
322: 	if (id->desc)
323: 		device_set_desc(dev, id->desc);
324: 	return (result);
325: }

Two routes to a match: explicit table hit or class/subclass fallback. Then call the UART bus probe with regshft, rclk, and rid.

Device Probe Function: uart_pci_probe

The probe function is the kernel's first interaction with a potential device during enumeration. When the PCI bus driver discovers a device, it calls the probe function of every registered driver, asking "can you manage this device?" The probe function examines the hardware's identification and configuration, returning a priority value indicating match quality or an error signaling "not my device."

Function Purpose and Contract
static int
uart_pci_probe(device_t dev)
{
    struct uart_softc *sc;
    const struct pci_id *id;
    int result;

    sc = device_get_softc(dev);

The probe function receives a device_t representing the hardware being examined. It must determine compatibility without modifying device state or allocating resources; those operations belong in the attach function.

The return value encodes probe results:

  • Negative values or zero indicate success, with lower values representing better matches
  • Positive values (particularly ENXIO) indicate "this driver cannot manage this device"
  • The kernel selects the driver returning the lowest (best) value

The softc is retrieved via device_get_softc(), which returns a zeroed structure of the size specified in the driver declaration (sizeof(struct uart_softc)). The probe function initializes critical fields like sc_class before delegating to generic code.

Explicit Device Table Matching
id = uart_pci_match(dev, pci_ns8250_ids);
if (id != NULL) {
    sc->sc_class = &uart_ns8250_class;
    goto match;
}

The primary matching path searches the explicit device table. If uart_pci_match returns non-NULL, the device is explicitly supported with known configuration parameters.

Setting the UART class: sc->sc_class = &uart_ns8250_class assigns the function table for NS8250-compatible register access. The uart_class structure (defined in the generic UART layer) contains function pointers for operations like:

  • Reading/writing registers
  • Configuring baud rates
  • Managing FIFOs and flow control
  • Handling interrupts

Different UART families (NS8250/16550, SAB82532, Z8530) would assign different class pointers. This driver only handles NS8250 variants, so the class assignment is unconditional.

The goto match bypasses subsequent checks, once explicitly identified, no further heuristics are needed.

Generic SimpleComm Device Fallback
if (pci_get_class(dev) == PCIC_SIMPLECOMM &&
    pci_get_subclass(dev) == PCIS_SIMPLECOMM_UART &&
    pci_get_progif(dev) < PCIP_SIMPLECOMM_UART_16550A) {
    /* XXX rclk what to do */
    id = &cid;
    sc->sc_class = &uart_ns8250_class;
    goto match;
}

This fallback handles devices not in the explicit table but advertising themselves as generic UARTs through PCI class codes. The PCI specification defines a class/subclass/programming interface hierarchy for device categorization:

Class check: PCIC_SIMPLECOMM (0x07) identifies "Simple Communication Controllers," which includes serial ports, parallel ports, and modems.

Subclass check: PCIS_SIMPLECOMM_UART (0x00) narrows this to serial controllers specifically.

Programming interface check: pci_get_progif(dev) < PCIP_SIMPLECOMM_UART_16550A accepts devices claiming 8250-compatible (ProgIF 0x00) or 16450-compatible (ProgIF 0x01) programming interfaces, but rejects devices claiming 16550A compatibility (ProgIF 0x02) or higher.

This seemingly backwards logic exists because early 16550A implementations had broken FIFOs. The PCI specification allowed devices to claim "16550-compatible" without specifying whether FIFOs worked. Rejecting 16550A+ ProgIF values forces these devices through explicit table matching, where quirks can be documented. Only conservative 8250/16450 claims are trusted.

Fallback configuration: The cid structure (declared at function entry) provides default parameters:

struct pci_id cid = {
    .regshft = 0,        /* Standard register spacing */
    .rclk = 0,           /* Use default clock */
    .rid = 0x10 | PCI_NO_MSI,  /* BAR0, no MSI */
    .desc = "Generic SimpleComm PCI device",
};

The comment /* XXX rclk what to do */ highlights uncertainty: without explicit table entry, the correct clock frequency is unknown. The generic code defaults to 1.8432 MHz (standard PC UART clock), which works for most hardware but fails for devices with non-standard clocks.

The PCI_NO_MSI flag in the default RID disables MSI for generic devices. Since quirks aren't known, conservative interrupt handling prevents potential MSI-related hangs or interrupt storms.

Setting id = &cid makes this local structure visible to the match path below, treating the generic configuration as if it came from the table.

Non-Match Exit
/* Add checks for non-ns8250 IDs here. */
return (ENXIO);

If neither explicit matching nor generic class matching succeeds, the device isn't a supported UART. Returning ENXIO ("Device not configured") tells the kernel to try other drivers.

The comment indicates an extension point: drivers for other UART families (Exar, Oxford, Sunix with proprietary registers) would add their checks here before the final ENXIO.

Delegating to Generic Probe Logic
match:
result = uart_bus_probe(dev, id->regshft, 0, id->rclk,
    id->rid & PCI_RID_MASK, 0, 0);
/* Bail out on error. */
if (result > 0)
    return (result);

The match label unifies both identification paths (explicit table and generic class). All subsequent code operates on id, which points either to a table entry or the cid structure.

Calling the generic layer: uart_bus_probe() lives in uart_bus.c and handles bus-agnostic initialization:

  • Allocates and maps the I/O resource (BAR indicated by id->rid)
  • Configures register access using id->regshft
  • Sets the reference clock to id->rclk (or default if zero)
  • Probes the hardware to verify UART presence and identify the FIFO depth
  • Establishes register base address

The additional parameters (three zeros) specify:

  • Flags controlling probe behavior
  • Device unit number hint (0 = auto-assign)
  • Reserved for future use

Error handling: If uart_bus_probe returns a positive value (error), that value propagates to the caller. Typical errors include:

  • ENOMEM - couldn't allocate resources
  • ENXIO - registers don't respond correctly (not a UART or disabled)
  • EIO - hardware access failures

Successful probe returns zero or a negative priority value.

Console Device Association
if (sc->sc_sysdev == NULL)
    uart_pci_unique_console_match(dev);

After successful generic probe, the driver attempts console matching. The check sc->sc_sysdev == NULL ensures this runs only if uart_bus_probe didn't already identify the device as a console (which it might have done via base address comparison).

Console association is opportunistic; failure doesn't prevent device attachment, it just means this UART won't receive kernel messages or serve as a login prompt.

Setting Device Description
/* Set/override the device description. */
if (id->desc)
    device_set_desc(dev, id->desc);
return (result);

The device description appears in boot messages, dmesg, and pciconf -lv output. It helps administrators identify hardware: "Intel AMT - SOL" is more meaningful than "PCI device 8086:108f."

For explicitly matched devices, id->desc contains the table-specified string. For generic devices, it's "Generic SimpleComm PCI device." The description is set unconditionally if present; even if a generic probe set one, the PCI-specific driver overrides it with more accurate information.

Finally, the function returns the result from uart_bus_probe, which the kernel uses to select among competing drivers. For UARTs, this is typically zero (success, neutral priority) since NS8250 drivers are the only ones claiming these devices.

Probe Priority and Driver Selection

The probe priority mechanism handles hardware claimed by multiple drivers. Consider a multi-function card with serial ports and network interfaces:

  • uart_pci might probe it (matches PCI class)
  • A vendor-specific driver might also probe it (matches vendor/device ID)

The vendor driver should return a better (lower) priority like -1 or -10, winning the selection. The uart_pci driver returns 0, meaning "I can drive this, but I'm generic."

For most serial hardware, only uart_pci probes successfully, making priority moot. But the mechanism allows graceful coexistence with specialized drivers.

The Complete Probe Flow
PCI bus discovers device
     -> 
Calls uart_pci_probe(dev)
     -> 
Check explicit table  ->  uart_pci_match()
     ->  (if matched)
Set NS8250 class, jump to match label
     -> 
Check PCI class codes
     ->  (if generic UART)
Use default config, jump to match label
     ->  (if neither)
Return ENXIO (not my device)

match:
     -> 
Call uart_bus_probe() for generic init
     ->  (on error)
Return error code
     ->  (on success)
Attempt console matching (if needed)
     -> 
Set device description
     -> 
Return success (0 or priority)

After successful probe, the kernel records this driver as the handler for this device and will later call uart_pci_attach to complete initialization.

7) attach: prefer single-vector MSI, then defer to the core

327: static int
328: uart_pci_attach(device_t dev)
329: {
330: 	struct uart_softc *sc;
331: 	const struct pci_id *id;
332: 	int count;
333: 
334: 	sc = device_get_softc(dev);
335: 
336: 	/*
337: 	 * Use MSI in preference to legacy IRQ if available. However, experience
338: 	 * suggests this is only reliable when one MSI vector is advertised.
339: 	 */
340: 	id = uart_pci_match(dev, pci_ns8250_ids);
341: 	if ((id == NULL || (id->rid & PCI_NO_MSI) == 0) &&
342: 	    pci_msi_count(dev) == 1) {
343: 		count = 1;
344: 		if (pci_alloc_msi(dev, &count) == 0) {
345: 			sc->sc_irid = 1;
346: 			device_printf(dev, "Using %d MSI message\n", count);
347: 		}
348: 	}
349: 
350: 	return (uart_bus_attach(dev));
351: }

Small bus-specific policy (prefer 1-vector MSI) and then delegate to uart_bus_attach().

Device Attach Function: uart_pci_attach

The attach function is called after successful probe to make the device operational. While probe merely identifies the device and verifies compatibility, attach allocates resources, configures hardware, and integrates the device into the system. For uart_pci, attach focuses on one PCI-specific concern, interrupt configuration, before delegating to the generic UART initialization code.

Function Entry and Context
static int
uart_pci_attach(device_t dev)
{
    struct uart_softc *sc;
    const struct pci_id *id;
    int count;

    sc = device_get_softc(dev);

The attach function receives the same device_t passed to probe. The softc retrieved here contains initialization performed during probe: the UART class assignment, base address configuration, and any console association.

Unlike probe (which must be idempotent and non-destructive), attach may modify device state, allocate resources, and fail destructively. If attach fails, the device becomes unavailable and typically requires reboot or manual intervention to recover.

Message Signaled Interrupts: Background

Traditional PCI interrupts use dedicated physical signal lines (INTx: INTA#, INTB#, INTC#, INTD#) shared among multiple devices. This sharing causes several problems:

  • Interrupt storms when devices don't properly acknowledge interrupts
  • Latency from iterating handlers until finding the interrupting device
  • Limited routing flexibility in complex systems

Message Signaled Interrupts (MSI) replace physical signals with memory writes to special addresses. When a device needs service, it writes to a CPU-specific address, triggering an interrupt on that CPU. MSI advantages:

  • No sharing, each device gets dedicated interrupt vectors
  • Lower latency, direct CPU targeting
  • Better scalability, thousands of vectors available vs. four INTx lines

However, MSI implementation quality varies, particularly in UARTs (simple devices often getting minimal validation). Some UART MSI implementations suffer from lost interrupts, spurious interrupts, or system hangs.

MSI Eligibility Check
/*
 * Use MSI in preference to legacy IRQ if available. However, experience
 * suggests this is only reliable when one MSI vector is advertised.
 */
id = uart_pci_match(dev, pci_ns8250_ids);
if ((id == NULL || (id->rid & PCI_NO_MSI) == 0) &&
    pci_msi_count(dev) == 1) {

The driver attempts MSI allocation only when three conditions hold:

Device not in table OR MSI not explicitly disabled: The condition (id == NULL || (id->rid & PCI_NO_MSI) == 0) evaluates true in two cases:

  1. id == NULL - device matched via generic class codes, not explicit table entry (no known quirks)
  2. (id->rid & PCI_NO_MSI) == 0 - device in table, but MSI flag is clear (MSI known working)

If the device has PCI_NO_MSI set in its table entry, this condition fails and MSI allocation is skipped entirely. Legacy line-based interrupts will be used instead.

Single MSI vector advertised: pci_msi_count(dev) == 1 queries the device's MSI capability structure to determine how many interrupt vectors it supports. UARTs only need one interrupt (serial events: received character, transmit buffer empty, modem status change), so multi-vector support is unnecessary.

The comment captures hard-won experience: devices advertising multiple MSI vectors (even though they only use one) often have buggy implementations. Restricting allocation to single-vector devices avoids these problems. A device advertising eight vectors for a simple UART likely received minimal MSI testing.

MSI Allocation
count = 1;
if (pci_alloc_msi(dev, &count) == 0) {
    sc->sc_irid = 1;
    device_printf(dev, "Using %d MSI message\n", count);
}

Requesting allocation: pci_alloc_msi(dev, &count) asks the PCI subsystem to allocate MSI vectors for this device. The count parameter is both input and output:

  • Input: requested number of vectors (1)
  • Output: actual allocated count (might be less if resources exhausted)

The function returns zero on success, non-zero on failure. Failure reasons include:

  • System doesn't support MSI (old chipsets, disabled in BIOS)
  • MSI resources exhausted (too many devices already using MSI)
  • Device MSI capability structure is malformed

Recording interrupt resource ID: On successful allocation, sc->sc_irid = 1 records that interrupt resource ID 1 will be used. The significance:

  • RID 0 typically represents the legacy INTx interrupt
  • RID 1+ represent MSI vectors
  • The generic UART attach code will allocate the interrupt resource using this RID

Without this assignment, the default RID (0) would be used, causing the driver to allocate the legacy interrupt instead of the newly-allocated MSI vector.

User notification: device_printf logs the MSI allocation to the console and system message buffer. This information helps administrators debug interrupt-related issues. Output appears as:

uart0: <Intel AMT - SOL> port 0xf0e0-0xf0e7 mem 0xfebff000-0xfebff0ff irq 16 at device 22.0 on pci0
uart0: Using 1 MSI message

Silent fallback: If pci_alloc_msi fails, the conditional body doesn't execute. The sc->sc_irid field remains at its default value (0), and no message is printed. The attach function proceeds to generic initialization, which will allocate the legacy interrupt. This silent fallback ensures device functionality even when MSI is unavailable, legacy interrupts work universally.

Delegating to Generic Attach
return (uart_bus_attach(dev));

After PCI-specific interrupt configuration, the function calls uart_bus_attach() to complete initialization. This generic function (shared across all bus types: PCI, ISA, ACPI, USB) performs:

Resource allocation:

  • I/O ports or memory-mapped registers (already mapped during probe)
  • Interrupt resource (using sc->sc_irid to select MSI or legacy)
  • Possibly DMA resources (not used by most UARTs)

Hardware initialization:

  • Reset the UART
  • Configure default parameters (8 data bits, no parity, 1 stop bit)
  • Enable and size the FIFO
  • Set up modem control signals

Character device creation:

  • Allocate TTY structures
  • Create device nodes (/dev/cuaU0, /dev/ttyU0)
  • Register with the TTY layer for line discipline support

Console integration:

  • If sc->sc_sysdev is set, configure as system console
  • Enable console output through this UART
  • Handle kernel debugger entry via break signals

Return value propagation: The return value from uart_bus_attach() passes directly to the kernel. Success (0) indicates the device is operational; errors (positive errno values) indicate failure.

Attach Failure Handling

If uart_bus_attach() fails, the device remains unusable. The PCI subsystem notes the failure and won't call device methods (read, write, ioctl) on this instance. However, resources already allocated by attach (like MSI vectors) may leak unless the driver's detach function is called.

Proper error handling in the generic attach code ensures:

  • Failed interrupt allocation triggers resource cleanup
  • Partial initialization is rolled back
  • The device remains in a safe state for retry or removal
The Complete Attach Flow
Kernel calls uart_pci_attach(dev)
     -> 
Check MSI eligibility
    | ->  Device has PCI_NO_MSI flag  ->  skip MSI
    | ->  Device advertises multiple vectors  ->  skip MSI
    | ->  Device advertises one vector  ->  attempt MSI
         -> 
    Allocate MSI vector via pci_alloc_msi()
        | ->  Success: set sc->sc_irid = 1, log message
        | ->  Failure: silent, sc->sc_irid remains 0
     -> 
Call uart_bus_attach(dev)
     -> 
Generic code allocates interrupt using sc->sc_irid
    | ->  RID 1: MSI vector
    | ->  RID 0: legacy INTx
     -> 
Complete UART initialization
     -> 
Create device nodes (/dev/cuaU*, /dev/ttyU*)
     -> 
Return success/failure

After successful attach, the UART is fully operational. Applications can open /dev/cuaU0 for serial communication, kernel messages flow to the console (if configured), and interrupt-driven I/O handles character transmission and reception.

Architectural Simplicity

The attached function's brevity, twenty-three lines including comments, demonstrates the layered architecture's power. PCI-specific concerns (MSI allocation) are handled here in minimal code, while complex UART initialization lives in the generic layer where it's shared across all bus types.

This separation means:

  • ISA-attached UARTs skip MSI logic but reuse all UART initialization
  • ACPI-attached UARTs might handle power management differently but share character device creation
  • USB serial adapters use completely different interrupt delivery but share TTY integration

The uart_pci driver is thin glue connecting PCI resource management to generic UART functionality, exactly as intended.

8) detach and module registration

353: static int
354: uart_pci_detach(device_t dev)
355: {
356: 	struct uart_softc *sc;
357: 
358: 	sc = device_get_softc(dev);
359: 
360: 	if (sc->sc_irid != 0)
361: 		pci_release_msi(dev);
362: 
363: 	return (uart_bus_detach(dev));
364: }
366: DRIVER_MODULE(uart, pci, uart_pci_driver, NULL, NULL);

Release MSI if we took it, then let the UART core unwind. Finally, register this driver on the pci bus.

Device Detach Function and Driver Registration

The detach function is called when a device must be removed from the system, either due to hot-unplug, driver unload, or system shutdown. It must reverse all operations performed during attach, releasing resources and ensuring the hardware is left in a safe state. The final DRIVER_MODULE macro registers the driver with the kernel's device framework.

Device Detach Function: uart_pci_detach
static int
uart_pci_detach(device_t dev)
{
    struct uart_softc *sc;

    sc = device_get_softc(dev);

Detach receives the device being removed and retrieves its softc containing the current configuration. The function must be prepared to handle partial initialization states, if attach failed midway, detach might be called to clean up whatever succeeded.

MSI Resource Release
if (sc->sc_irid != 0)
    pci_release_msi(dev);

The conditional checks whether MSI was allocated during attach. Recall that sc->sc_irid = 1 signals successful MSI allocation; the default value (0) indicates legacy interrupts were used.

Releasing MSI vectors: pci_release_msi(dev) returns the MSI interrupt vector to the system pool, making it available for other devices. This call must be made before the generic detach, which will deallocate the interrupt resource itself. The sequence matters:

  1. Release MSI allocation (returns vector to system)
  2. Generic detach deallocates the interrupt resource (frees kernel structures)

Reversing this order would leak MSI vectors, the kernel would consider them allocated even after the device is gone.

Why check sc_irid?: Calling pci_release_msi when MSI wasn't allocated is harmless but wastes cycles. More importantly, it documents the code's intent: "if we allocated MSI during attach, release it during detach." This symmetry aids understanding.

The lack of error handling is intentional, pci_release_msi cannot meaningfully fail during detach. The device is being removed regardless; if MSI release fails (due to corrupted kernel state), proceeding with detach is still correct.

Delegating to Generic Detach
return (uart_bus_detach(dev));

After PCI-specific resource cleanup, the function calls uart_bus_attach() to handle generic UART teardown. This mirrors the attach sequence: PCI-specific code wraps generic code.

Generic detach operations:

Character device removal: Close any open file descriptors, destroy /dev/cuaU* and /dev/ttyU* nodes, and deregister from the TTY layer.

Hardware shutdown: Disable interrupts at the UART, flush FIFOs, and deassert modem control signals. This prevents the hardware from generating spurious interrupts or asserting control lines after the driver is gone.

Resource deallocation: Free the interrupt resource (the kernel structure, not the MSI vector, that was already released above), unmap I/O ports or memory regions, and release any allocated kernel memory.

Console disconnection: If this device was the system console, redirect console output to an alternative device or disable console output entirely. The system must remain bootable even if the console UART is removed.

Return value: uart_bus_detach() returns zero on success or an error code on failure. In practice, detach rarely fails, the device is being removed whether or not software cleanup succeeds gracefully.

Detach Failure Consequences

If detach returns an error, the kernel's response depends on context:

Driver unload: If attempting to unload the driver module (kldunload uart_pci), the operation fails and the module remains loaded. The device stays attached, preventing resource leaks.

Device hot-removal: If physical removal triggered detach (PCIe hot-unplug), the hardware is already gone. Detach failure is logged but the device tree entry is removed anyway. Resource leaks may occur, but system stability is preserved.

System shutdown: During shutdown, detach failures are ignored. The system is halting regardless, so resource leaks are irrelevant.

Well-designed detach functions should never fail. The uart_pci implementation achieves this by:

  • Performing only infallible operations (resource release)
  • Delegating complex logic to generic code that handles edge cases
  • Not requiring hardware responses (hardware might already be disconnected)
Driver Registration: DRIVER_MODULE
DRIVER_MODULE(uart, pci, uart_pci_driver, NULL, NULL);

This macro registers the driver with FreeBSD's device framework, making it available for device matching during boot and module load. The macro expands to considerable infrastructure code, but its parameters are straightforward:

uart: The driver name, matching the string in uart_driver_name. This name appears in kernel messages, device tree paths, and administrative commands. Multiple drivers can share the same name if they attach to different buses, uart_pci, uart_isa, and uart_acpi all use "uart", distinguishing themselves by the bus they attach to.

pci: The parent bus name. This driver attaches to the PCI bus, so it specifies "pci". The kernel's bus framework uses this to determine when to call the driver's probe function, only PCI devices are offered to uart_pci.

uart_pci_driver: Pointer to the driver_t structure defined earlier, containing the method table and softc size. The kernel uses this to invoke driver methods and allocate per-device state.

NULL, NULL: Two reserved parameters for module initialization hooks. Most drivers don't need these, passing NULL for both. The hooks allow running code when the module loads (before any device attach) or unloads (after all devices detach). Uses include:

  • Allocating global resources (memory pools, worker threads)
  • Registering with subsystems (like the network stack)
  • Performing one-time hardware initialization

For uart_pci, no module-level initialization is needed, all work happens in probe/attach on a per-device basis.

The Module Lifecycle

The DRIVER_MODULE macro makes the driver participates in FreeBSD's modular kernel architecture:

Static compilation: If compiled into the kernel (options UART in kernel config), the driver is available at boot. The linker includes uart_pci_driver in the kernel's driver table, and PCI enumeration during boot calls its probe function.

Dynamic loading: If compiled as a module (kldload uart_pci.ko), the module loader processes the DRIVER_MODULE registration, adding the driver to the active table. Existing devices are reprobed; new matches trigger attach.

Dynamic unloading: kldunload uart_pci attempts to detach all devices managed by this driver. If any detach fails or devices are in use (open file descriptors), unload fails and the module remains. Successful unload removes the driver from the active table.

Relationship to Other UART Drivers

The FreeBSD UART subsystem includes multiple bus-specific drivers all sharing generic code:

  • uart_pci.c - PCI-attached UARTs (this driver)
  • uart_isa.c - ISA bus UARTs (legacy COM ports)
  • uart_acpi.c - ACPI-enumerated UARTs (modern laptops/servers)
  • uart_fdt.c - Flattened Device Tree UARTs (embedded systems, ARM)

Each uses DRIVER_MODULE to register with its respective bus:

DRIVER_MODULE(uart, pci, uart_pci_driver, NULL, NULL);   // PCI bus
DRIVER_MODULE(uart, isa, uart_isa_driver, NULL, NULL);   // ISA bus
DRIVER_MODULE(uart, acpi, uart_acpi_driver, NULL, NULL); // ACPI bus

All share the name "uart" but attach to different buses. A system might load all four modules simultaneously, with each handling UARTs discovered on its bus. A desktop might have:

  • Two ISA COM ports (COM1/COM2 via uart_isa)
  • One PCI management controller (IPMI via uart_pci)
  • Zero ACPI UARTs (not present)

Each device gets an independent driver instance, all sharing the generic UART code in uart_bus.c and uart_core.c.

Complete Driver Structure

With all pieces explained, the complete driver structure is:

uart_pci_methods[] ->  Method table (probe/attach/detach/resume)
      
uart_pci_driver ->  Driver declaration (name, methods, softc size)
      
DRIVER_MODULE() ->  Registration (uart, pci, uart_pci_driver)

At runtime, the PCI bus driver discovers devices and consults the registered driver table. For each device, it calls probe functions of matching drivers. The uart_pci probe function examines device IDs against its table, returning success for matches. The kernel then calls attach to initialize the device. Later, detach cleans up when the device is removed.

This architecture, method tables, layered initialization, bus-independent core logic, repeats throughout FreeBSD's device driver framework. Understanding it in the uart_pci context prepares you for more complex drivers: network cards, storage controllers, and graphics adapters all follow similar patterns at larger scale.

Interactive Exercises for uart(4)

Goal: Cement the PCI driver pattern: device identification tables -> probe -> attach -> generic core, with MSI as a bus-specific variation.

A) Driver Skeleton & Registration
  1. Point to the device_method_t array and the driver_t structure. For each, identify what it declares and how they connect to each other. Quote the relevant lines. Which field in driver_t points to the method table? Hint: Lines 52-65.

  2. Where is the DRIVER_MODULE macro and which bus does it target? What are the five parameters it receives? Quote it and explain each parameter. Hint: Line 366.

B) Device Identification and Matching
  1. In the pci_ns8250_ids[] table, find at least two Intel entries (vendor 0x8086) that demonstrate special handling: one with the PCI_NO_MSI flag and one with a non-standard clock frequency (rclk). Quote both complete entries and explain what each special parameter means for the hardware. Hint: Search for vendor 0x8086 entries around lines 153-158.

  2. In uart_pci_match(), trace the two-phase matching logic. Where does the first loop match primary IDs (vendor/device)? Where does the second loop match subsystem IDs? What happens if an entry has subven == 0xffff? Quote the relevant lines (3-5 lines total). Hint: Lines 225-237, look for the wildcard check at line 230.

  3. Find an example in pci_ns8250_ids[] where the same vendor/device pair appears multiple times with different subsystem IDs. Quote 2-3 consecutive entries and explain why this duplication exists. Hint: HP Diva entries around lines 93-95, or Timedia around lines 131-140.

C) Probe Flow
  1. In uart_pci_probe(), show where the code sets sc->sc_class to &uart_ns8250_class after successful table matching, and where it then calls uart_bus_probe(). Quote both spots (2-3 lines each). Hint: Lines 292-295 for class assignment, 309-310 for bus probe call.

  2. What does uart_pci_unique_console_match() do when it finds a unique device that matches a console? Quote the assignment to sc->sc_sysdev and the rclk synchronization line. Why is clock synchronization necessary? Hint: Lines 272-273.

  3. In uart_pci_probe(), explain the fallback path for "Generic SimpleComm" devices. What PCI class, subclass, and progif values trigger this path? Why does the comment say "XXX rclk what to do"? Quote the conditional check and note what configuration is used. Hint: Lines 297-303, look at the cid structure at lines 282-286.

D) Attach and Detach
  1. In uart_pci_attach(), why does the function re-match the device against the ID table when probe already did matching? Quote the line. Hint: Line 340.

  2. Quote the exact conditional that checks MSI eligibility (must prefer single-vector MSI) and the call that allocates it. What happens if MSI allocation fails? Quote 5-7 lines. Hint: Lines 341-347.

  3. In uart_pci_detach(), quote the two critical operations: MSI release and delegation to generic detach. Why must MSI be released before calling uart_bus_detach()? Explain the order dependency. Hint: Lines 360-363.

E) Integration: Tracing Complete Flow
  1. Starting from boot, trace how a Dell RAC 4 (vendor 0x1028, device 0x0012) becomes /dev/cuaU0. For each step, quote the relevant line:
  • Which table entry matches?
  • What clock frequency does it specify?
  • What happens in probe? (which class is set? which function is called?)
  • What happens in attach? (will it use MSI?)
  • Which generic function creates the device node?
  1. A device has vendor 0x8086, device 0xa13d (100 Series Chipset KT). Will it use MSI? Trace through the logic:
  • Find and quote the table entry
  • Check the rid field, what flag is present?
  • Quote the conditional in uart_pci_attach() that checks this flag
  • What interrupt mechanism will be used instead?
F) Architecture and Design Patterns
  1. Compare if_tuntap.c (from the previous section) with uart_bus_pci.c:
  • if_tuntap had ~2200 lines; uart_bus_pci has ~370. Why such a difference in size?
  • if_tuntap contained complete device logic; uart_bus_pci is mostly glue code. Where does the actual UART register access, baud rate configuration, and TTY integration happen? (Hint: what function does attach call?)
  • Which design approach, monolithic like if_tuntap or layered like uart_bus_pci, makes it easier to support the same hardware on multiple buses (PCI, ISA, USB)?
  1. Imagine you need to add support for:
  • A new PCI UART: vendor 0xABCD, device 0x1234, standard clock, BAR 0x10

  • An ISA-attached version of the same UART chipset

    For the PCI variant, what would you modify in uart_bus_pci.c? (Quote the structure and location) For the ISA variant, would you modify uart_bus_pci.c at all, or work in a different file? How many lines of UART register access code would you need to write/duplicate?

Stretch (thought experiments)

Examine the MSI allocation logic in uart_pci_attach().

The comment says "experience suggests this is only reliable when one MSI vector is advertised."

  1. Why would a simple UART (which only needs one interrupt) ever advertise multiple MSI vectors?
  2. What problems might occur with multi-vector MSI that the driver avoids by checking pci_msi_count(dev) == 1?
  3. If MSI allocation fails silently (the if condition is false), the driver continues. Where in the generic attach code will the interrupt resource be allocated instead? What type of interrupt will be used

Why this matters in your "anatomy" chapter

You've just walked a tiny PCI glue driver end-to-end. It matches devices, chooses a UART class, calls a shared probe/attach in the subsystem core, and sprinkles light PCI policy (MSI/console). This is the same shape you'll reuse for other buses: match -> probe -> attach -> core, plus resources/IRQs and clean detach. Keep this pattern in mind when you move from pseudo-devices to real hardware in later chapters.

From Four Drivers to One Mental Model

You've now walked through four complete drivers, each demonstrating different aspects of FreeBSD's device driver architecture. These weren't arbitrary examples; they form a deliberate progression that reveals the patterns underlying all kernel drivers.

The Journey You've Completed

Tour 1: /dev/null, /dev/zero, /dev/full (null.c)

  • Simplest possible character devices
  • Static device creation during module load
  • Trivial operations: discard writes, return zeros, simulate errors
  • No per-device state, no timers, no complexity
  • Key lesson: The cdevsw function dispatch table and basic I/O with uiomove()

Tour 2: LED Subsystem (led.c)

  • Dynamic device creation on demand
  • Subsystem providing both userspace interface and kernel API
  • Timer-driven state machine for pattern execution
  • Pattern parsing DSL converting user commands to internal codes
  • Key lesson: Stateful devices, infrastructure drivers, lock separation (mtx vs. sx)

Tour 3: TUN/TAP Network Tunnels (if_tuntap.c)

  • Dual character device + network interface
  • Bidirectional data flow: kernel <-> userspace packet exchange
  • Network stack integration (ifnet, BPF, routing)
  • Blocking I/O with proper wakeups (poll/select/kqueue support)
  • Key lesson: Complex integration bridging two kernel subsystems

Tour 4: PCI UART Driver (uart_bus_pci.c)

  • Hardware bus attachment (PCI enumeration)
  • Layered architecture: thin bus glue + thick generic core
  • Device identification via vendor/device ID tables
  • Resource management (BARs, interrupts, MSI)
  • Key lesson: The probe-attach-detach lifecycle, code reuse through layering

Patterns That Emerged

As you progressed through these drivers, certain patterns appeared repeatedly:

1. The Character Device Pattern

Every character device follows the same structure, whether it's /dev/null or /dev/tun0:

  • A cdevsw structure mapping system calls to functions
  • make_dev() creating the /dev entry
  • si_drv1 linking the device node to per-device state
  • destroy_dev() cleaning up on removal

The complexity varies, null.c has no state, led.c tracks patterns, tuntap tracks network interface, but the skeleton is identical.

2. The Dynamic vs. Static Device Pattern

null.c creates three fixed devices at module load. led.c and tuntap create devices on demand as hardware registers or users open device nodes. This flexibility comes with complexity:

  • Unit number allocation (unrhdr)
  • Global registries (linked lists)
  • More sophisticated locking

3. The Subsystem API Pattern

led.c demonstrates infrastructure design: it's both a device driver (exposing /dev/led/*) and a service provider (exporting led_create() for other drivers). This dual role appears throughout FreeBSD drivers that are libraries for other drivers.

4. The Layered Architecture Pattern

uart_bus_pci.c is minimal because most logic lives in uart_bus.c. The pattern:

  • Bus-specific code handles: device identification, resource claiming, interrupt setup
  • Generic code handles: device initialization, protocol implementation, user interface

This separation means the same UART logic works on PCI, ISA, USB, and device-tree platforms.

5. The Data Movement Patterns

You've seen three approaches to transferring data:

  • Simple: null_write sets uio_resid = 0 and returns (discard data)
  • Buffered: zero_read loops calling uiomove() from a kernel buffer
  • Zero-copy: tuntap uses mbufs for efficient packet handling

6. The Synchronization Patterns

Each driver's locking reflects its needs:

  • null.c: none (stateless devices)
  • led.c: two locks (mtx for fast state, sx for slow structure changes)
  • tuntap: per-device mutex protecting queues and ifnet state
  • uart_pci: minimal (most locking in generic uart_bus layer)

7. The Lifecycle Patterns

All drivers follow create-operate-destroy, but with variations:

  • Module lifecycle: null.c's MOD_LOAD/MOD_UNLOAD events
  • Dynamic lifecycle: led.c's led_create()/led_destroy() API
  • Clone lifecycle: tuntap's on-demand device creation
  • Hardware lifecycle: uart_pci's probe-attach-detach sequence

What You Can Now Recognize

After these four tours, when you encounter any FreeBSD driver, you should immediately identify:

What kind of driver is this?

  • Character device only? (like null.c)
  • Infrastructure/subsystem? (like led.c)
  • Dual device/network? (like tuntap)
  • Hardware bus attachment? (like uart_pci)

Where's the state?

  • Global only? (led.c's global list and timer)
  • Per-device? (tuntap's softc with queues and ifnet)
  • Split? (uart_pci's minimal state + uart_bus's rich state)

How's it locked?

  • One mutex for everything?
  • Multiple locks for different data/access patterns?
  • Handed off to generic code?

What's the data path?

  • Copying with uiomove()?
  • Using mbufs?
  • Zero-copy techniques?

What's the lifecycle?

  • Fixed (created once at load)?
  • Dynamic (created on demand)?
  • Hardware-driven (appears/disappears with physical devices)?

The Blueprint Ahead

The document that follows distills these patterns into a quick-reference guide, a checklist and template collection you can use when writing or analyzing drivers. It's organized by integration point (character device, network interface, bus attachment) and captures the critical decisions and invariants you must maintain.

Think of the four drivers you've studied as worked examples, and the blueprint as the extracted principles. Together, they form your foundation for understanding FreeBSD's driver architecture. The drivers showed you how things work in context; the blueprint reminds you what you must do to make your own drivers work correctly.

When you're ready to write your own driver or modify an existing one, start with the blueprint's self-check questions. Then refer back to the appropriate tour (null.c for basic devices, led.c for timers and APIs, tuntap for networking, uart_pci for hardware) to see those patterns in complete implementations.

You're now equipped to navigate the kernel's device drivers, not as intimidating black boxes, but as variations on patterns you've internalized through hands-on study.

Driver Anatomy Blueprint (FreeBSD 14.3)

This is your quick-reference map for FreeBSD drivers. It captures the shape (the moving parts and where they live), the contract (what the kernel expects from you), and the pitfalls (what breaks under load). Use it as the checklist before and after you code.

Core Skeleton: What Every Driver Needs

Identify your integration point:

Character device (devfs) -> struct cdevsw + make_dev*()/destroy_dev()

  • Entry points: open/read/write/ioctl/poll/kqfilter
  • Example: null.c, led.c

Network interface (ifnet) -> if_alloc()/if_attach()/if_free() + optional cdev

  • Callbacks: if_transmit or if_start, input via netisr_dispatch()
  • Example: if_tuntap.c

Bus-attached (e.g., PCI) -> device_method_t[] + driver_t + DRIVER_MODULE()

  • Lifecycle: probe/attach/detach (+ suspend/resume if needed)
  • Example: uart_bus_pci.c

Minimal invariants (commit these to memory):

  • Every object you create (cdev, ifnet, callout, taskqueue, resource) has a symmetric destroy/free on error paths and during detach/unload
  • Concurrency is explicit: if you touch state from multiple contexts (syscall path, timeout, rx/tx, interrupt), you hold the right lock or design for lock-free with strict rules
  • Resource cleanup must happen in reverse order of allocation

Character Device Blueprint

Shape:

  • static struct cdevsw with only what you implement; leave others nullop or omit
  • Module or init hook creates nodes: make_dev_credf()/make_dev_s()
  • Keep a struct cdev * to tear down later

Entry points:

read: Loop while uio->uio_resid > 0; move bytes with uiomove(); return early on error

  • Example: zero_read loops copying from pre-zeroed kernel buffer

write: Either consume (uio_resid = 0; return 0;) or fail (return ENOSPC/EIO/...)

  • No partial writes unless you mean it
  • Example: null_write consumes all; full_write always fails

ioctl: Small switch(cmd); return 0, specific errno, or ENOIOCTL

  • Handle standard terminal ioctls (FIONBIO, FIOASYNC) even if they're no-ops
  • Example: null_ioctl handles kernel dump configuration

poll/kqueue (optional): Wire readiness + notifications if userspace blocks

  • Example: tuntap's poll checks queue and registers via selrecord()

Concurrency & timers:

  • If you have periodic work (e.g., LED blink), use a callout bound to the right mutex
  • Arm/re-arm responsibly; stop it in teardown when the last user goes away
  • Example: led.c's callout_init_mtx(&led_ch, &led_mtx, 0)

Teardown:

  • destroy_dev(), stop callouts/taskqueues, free buffers
  • Clear pointers (e.g., si_drv1 = NULL) under lock before freeing
  • Example: led_destroy's two-phase cleanup (mtx then sx)

Check before lab:

  • Can you match each user-visible behavior to the exact entry point?
  • Are all allocations paired with frees on every error path?

Network Pseudo-Interface Blueprint

Two faces:

  • Character device side (/dev/tunN, /dev/tapN) with open/read/write/ioctl/poll
  • ifnet side (ifconfig tun0 ...) with attach, flags, link state, and BPF hooks

Data flow:

Kernel -> user (read):

  • Dequeue packet (mbuf) from your queue
  • Block until available unless O_NONBLOCK (then EWOULDBLOCK)
  • Copy optional headers first (virtio/ifhead), then payload via m_mbuftouio()
  • Free mbuf with m_freem()
  • Example: tunread's loop with mtx_sleep() for blocking

User -> kernel (write):

  • Build mbuf with m_uiotombuf()
  • Decide L2 vs L3 path
  • For L3: pick AF and netisr_dispatch()
  • For L2: validate destination (drop frames real NIC wouldn't receive unless promisc)
  • Example: tunwrite_l3 dispatches via NETISR_IP/NETISR_IPV6

Lifecycle:

  • Clone or first open creates cdev and softc
  • Then if_alloc()/if_attach() and bpfattach()
  • Open can raise link up; close can drop it
  • Example: tuncreate builds ifnet, tunopen marks link UP

Notify readers:

  • wakeup(), selwakeuppri(), KNOTE() when packets arrive
  • Example: tunstart's triple notification when packet enqueued

Check before lab:

  • Do you know which paths block and which return immediately?
  • Is your maximum I/O size bounded (MRU + headers)?
  • Are wakeups fired on every packet enqueue?

PCI Glue Blueprint

Match -> Probe -> Attach -> Detach:

Match: Vendor/device(/subvendor/subdevice) table; fall back to class/subclass when needed

  • Example: uart_pci_match's two-phase search (primary IDs then subsystem)

Probe: Choose driver class, compute parameters (reg shift, rclk, BAR RID), then call shared bus probe

  • Example: uart_pci_probe sets sc->sc_class = &uart_ns8250_class

Attach: Allocate interrupts (prefer single-vector MSI if supported), then delegate to subsystem

  • Example: uart_pci_attach's conditional MSI allocation

Detach: Release MSI/IRQ, then delegate to subsystem detach

  • Example: uart_pci_detach checks sc_irid and releases MSI if allocated

Resources:

  • Map BARs, allocate IRQs, hand resources to core
  • Track IDs so you can release them symmetrically
  • Example: id->rid & PCI_RID_MASK extracts BAR number

Check before lab:

  • Do you handle the "no match" path cleanly (ENXIO)?
  • Are you leak-free across any mid-attach failure?
  • Do you check for quirks (like PCI_NO_MSI flag)?

Locking & Concurrency Cheatsheet

Fast path data movement (read/write, rx/tx):

  • Protect queues and state with a mutex
  • Minimize hold time; never sleep while holding if avoidable
  • Example: tuntap's tun_mtx protecting send queue

Configuration / topology (create/destroy, link up/down):

  • Typically an sx lock or higher-level serialization
  • Example: led.c's led_sx for device creation/destruction

Timer/callout:

  • Use callout_init_mtx(&callout, &mtx, flags) so timeout runs with your mutex held
  • Example: led.c's timer automatically holds led_mtx

User-space notifications:

  • After enqueuing: wakeup(tp), selwakeuppri(&sel, PRIO), KNOTE(&klist, NOTE_*)
  • Example: tunstart's triple notification pattern

Lock ordering rules:

  • Never acquire locks in inconsistent order
  • Document your lock hierarchy
  • Example: led.c acquires led_mtx then releases before taking led_sx

Data Movement Patterns

uiomove() loop for cdev read/write:

  • Cap chunk size to a safe buffer (avoid giant copies)
  • Check and handle errors on each iteration
  • Example: zero_read limits to ZERO_REGION_SIZE per iteration

mbuf path for networking:

User -> kernel:

m = m_uiotombuf(uio, M_NOWAIT, 0, align, M_PKTHDR);
// set metadata (AF/virtio)
netisr_dispatch(isr, m);

Kernel -> user:

// optional header to user (uiomove())
m_mbuftouio(uio, m, 0);
m_freem(m);

Example: tunwrite builds mbuf; tunread extracts to userspace

Common Patterns From the Tours

Pattern: Shared cdevsw, per-device state via si_drv1

  • One function table, many device instances
  • Example: led.c shares led_cdevsw across all LEDs
  • State accessed via sc = dev->si_drv1

Pattern: Subsystem providing both APIs

  • Userspace interface (character device)
  • Kernel API (function calls)
  • Example: led.c's led_write() vs. led_set()

Pattern: Timer-driven state machine

  • Reference counter tracks active items
  • Timer reschedules only when work remains
  • Example: led.c's blinkers counter gates timer

Pattern: Two-phase cleanup

  • Phase 1: Make invisible (clear pointers, remove from lists)
  • Phase 2: Free resources
  • Example: led_destroy clears si_drv1 before destroying device

Pattern: Unit number allocation

  • Use unrhdr for dynamic assignment
  • Prevents conflicts in multi-instance devices
  • Example: led.c's led_unit pool

Errors, Edge Cases, and User Experience

Error handling:

  • Prefer clear errno over silent behavior unless silence is part of the contract
  • Example: tunwrite silently ignores writes when interface down (expected behavior)
  • Example: led_write returns EINVAL for bad commands (error condition)

Bound inputs:

  • Always validate sizes, counts, indices
  • Example: led_write rejects commands over 512 bytes
  • Example: tuntap checks against MRU + headers

Default to fail fast:

  • Unsupported ioctl -> ENOIOCTL
  • Invalid flags -> EINVAL
  • Malformed frames -> drop and increment error counter

Module unload:

  • Think about impact on active users
  • Don't yank foundational devices from busy systems
  • Example: null.c can be unloaded; led.c cannot (no unload handler)

Minimal Templates

Character Device (Read/Write/Ioctl Only)

static d_read_t  foo_read;
static d_write_t foo_write;
static d_ioctl_t foo_ioctl;

static struct cdevsw foo_cdevsw = {
    .d_version = D_VERSION,
    .d_read    = foo_read,
    .d_write   = foo_write,
    .d_ioctl   = foo_ioctl,
    .d_name    = "foo",
};

static struct cdev *foo_dev;

static int
foo_read(struct cdev *dev, struct uio *uio, int flags)
{
    while (uio->uio_resid > 0) {
        size_t n = MIN(uio->uio_resid, CHUNK);
        int err = uiomove(srcbuf, n, uio);
        if (err) return err;
    }
    return 0;
}

static int
foo_write(struct cdev *dev, struct uio *uio, int flags)
{
    /* Consume all (bit bucket pattern) */
    uio->uio_resid = 0;
    return 0;
}

static int
foo_ioctl(struct cdev *dev, u_long cmd, caddr_t data, 
          int fflag, struct thread *td)
{
    switch (cmd) {
    case FIONBIO:
        return 0;  /* Non-blocking always OK */
    default:
        return ENOIOCTL;
    }
}

Dynamic Device Registration

static struct unrhdr *foo_units;
static struct mtx foo_mtx;
static LIST_HEAD(, foo_softc) foo_list;

struct cdev *
foo_create(void *priv, const char *name)
{
    struct foo_softc *sc;
    
    sc = malloc(sizeof(*sc), M_FOO, M_WAITOK | M_ZERO);
    sc->unit = alloc_unr(foo_units);
    sc->private = priv;
    
    sc->dev = make_dev(&foo_cdevsw, sc->unit,
        UID_ROOT, GID_WHEEL, 0600, "foo/%s", name);
    sc->dev->si_drv1 = sc;
    
    mtx_lock(&foo_mtx);
    LIST_INSERT_HEAD(&foo_list, sc, list);
    mtx_unlock(&foo_mtx);
    
    return sc->dev;
}

void
foo_destroy(struct cdev *dev)
{
    struct foo_softc *sc;
    
    mtx_lock(&foo_mtx);
    sc = dev->si_drv1;
    dev->si_drv1 = NULL;
    LIST_REMOVE(sc, list);
    mtx_unlock(&foo_mtx);
    
    free_unr(foo_units, sc->unit);
    destroy_dev(dev);
    free(sc, M_FOO);
}

PCI Glue (Probe/Attach/Detach)

static int foo_probe(device_t dev)
{
    /* Table match  ->  pick class */
    id = foo_pci_match(dev, foo_ids);
    if (id == NULL)
        return ENXIO;
    
    sc->sc_class = &foo_device_class;
    return foo_bus_probe(dev, id->regshft, id->rclk, 
                         id->rid & RID_MASK);
}

static int foo_attach(device_t dev)
{
    /* Maybe allocate single-vector MSI */
    if (pci_msi_count(dev) == 1) {
        count = 1;
        if (pci_alloc_msi(dev, &count) == 0)
            sc->sc_irid = 1;
    }
    return foo_bus_attach(dev);
}

static int foo_detach(device_t dev)
{
    /* Release MSI if used */
    if (sc->sc_irid != 0)
        pci_release_msi(dev);
    
    return foo_bus_detach(dev);
}

static device_method_t foo_methods[] = {
    DEVMETHOD(device_probe,  foo_probe),
    DEVMETHOD(device_attach, foo_attach),
    DEVMETHOD(device_detach, foo_detach),
    DEVMETHOD_END
};

static driver_t foo_driver = {
    "foo",
    foo_methods,
    sizeof(struct foo_softc)
};

DRIVER_MODULE(foo, pci, foo_driver, NULL, NULL);

Pre-Lab Self-Check (2 Minutes)

Ask yourself these questions before writing code:

  1. Which integration point am I targeting (devfs, ifnet, PCI)?

  2. Do I know my entry points and what each must return on success/failure?

  3. What are my locks and which contexts touch each field?

  4. Can I list every resource I allocate and where I free it on:

    • Success path
    • Mid-attach failure
    • Detach/unload
  5. Have I studied a similar driver from the tours?

    • null.c for simple character devices
    • led.c for dynamic devices and timers
    • tuntap for network integration
    • uart_pci for hardware attachment

After-Lab Reflection (5 minutes)

After writing or modifying code, verify:

  1. Did I leak anything on an early return?
  2. Did I block in a context that shouldn't sleep?
  3. Did I notify userspace/kernel peers after enqueuing work?
  4. Can I point from a user-visible behavior back to the specific source lines?
  5. Does my locking follow a consistent hierarchy?
  6. Are my error messages helpful for debugging?

Common Pitfalls and How to Avoid Them

This section catalogs the mistakes that cause the most pain in driver development, silent corruption, deadlocks, panics, and resource leaks. Each pitfall includes the symptom, the root cause, and the correct pattern to follow.

Data Movement Errors

Pitfall: Forgetting to update uio_resid

Symptom: Infinite loops in read/write handlers, or userspace receiving wrong byte counts.

Root cause: The kernel uses uio_resid to track remaining bytes. If you don't decrement it, the kernel thinks no progress was made.

Wrong:

static int
bad_write(struct cdev *dev, struct uio *uio, int flags)
{
    /* Data is discarded but uio_resid never changes! */
    return 0;  /* Kernel sees 0 bytes written, retries infinitely */
}

Correct:

static int
good_write(struct cdev *dev, struct uio *uio, int flags)
{
    uio->uio_resid = 0;  /* Mark all bytes consumed */
    return 0;
}

How to avoid: Always ask "how many bytes did I actually process?" and update uio_resid accordingly. Even if you discard data (like /dev/null), you must mark it consumed.

Related: Partial transfers are dangerous. If you process some bytes but then fail, you must update uio_resid to reflect what was actually transferred before returning the error, or userspace will retry with the wrong offset.

Pitfall: Not bounding chunk sizes in uiomove() loops

Symptom: Stack overflow if copying to stack buffer, kernel panic on huge allocations.

Root cause: User requests can be arbitrarily large. Copying multi-megabyte transfers in one shot exhausts resources.

Wrong:

static int
bad_read(struct cdev *dev, struct uio *uio, int flags)
{
    char buf[uio->uio_resid];  /* Stack overflow if user requests 1MB! */
    memset(buf, 0, sizeof(buf));
    return uiomove(buf, uio->uio_resid, uio);
}

Correct:

#define CHUNK_SIZE 4096

static int
good_read(struct cdev *dev, struct uio *uio, int flags)
{
    char buf[CHUNK_SIZE];
    int error;
    
    memset(buf, 0, sizeof(buf));
    
    while (uio->uio_resid > 0) {
        size_t len = MIN(uio->uio_resid, CHUNK_SIZE);
        error = uiomove(buf, len, uio);
        if (error)
            return error;
    }
    return 0;
}

How to avoid: Always loop with a reasonable chunk size (typically 4KB-64KB). Study zero_read in null.c, it limits transfers to ZERO_REGION_SIZE per iteration.

Pitfall: Accessing user memory directly from kernel

Symptom: Security vulnerabilities, kernel crashes on invalid pointers.

Root cause: Kernel and user memory spaces are separate. Dereferencing user pointers directly bypasses protection.

Wrong:

static int
bad_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td)
{
    char *user_ptr = *(char **)data;
    strcpy(kernel_buf, user_ptr);  /* DANGER: user_ptr not validated! */
}

Correct:

static int
good_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td)
{
    char *user_ptr = *(char **)data;
    char kernel_buf[256];
    int error;
    
    error = copyinstr(user_ptr, kernel_buf, sizeof(kernel_buf), NULL);
    if (error)
        return error;
    /* Now safe to use kernel_buf */
}

How to avoid: Never dereference pointers received from userspace. Use copyin(), copyout(), copyinstr(), or uiomove() for all user <-> kernel transfers. These functions validate addresses and handle page faults safely.

Locking Disasters

Pitfall: Holding locks across uiomove()

Symptom: System deadlock when user memory is paged out.

Root cause: uiomove() can page fault, which may need to acquire VM locks. If you hold another lock during the fault, and that lock is needed by the paging path, deadlock results.

Wrong:

static int
bad_read(struct cdev *dev, struct uio *uio, int flags)
{
    mtx_lock(&my_mtx);
    /* Build response in kernel buffer */
    uiomove(kernel_buf, len, uio);  /* DEADLOCK RISK: uiomove while locked */
    mtx_unlock(&my_mtx);
    return 0;
}

Correct:

static int
good_read(struct cdev *dev, struct uio *uio, int flags)
{
    char *local_buf;
    size_t len;
    
    mtx_lock(&my_mtx);
    /* Copy data to private buffer while locked */
    len = MIN(uio->uio_resid, bufsize);
    local_buf = malloc(len, M_TEMP, M_WAITOK);
    memcpy(local_buf, sc->data, len);
    mtx_unlock(&my_mtx);
    
    /* Transfer to user without holding lock */
    error = uiomove(local_buf, len, uio);
    free(local_buf, M_TEMP);
    return error;
}

How to avoid: Always release locks before uiomove(), copyin(), copyout(). Snapshot the data you need while locked, then transfer it to userspace unlocked.

Exception: Some sleep-capable locks (sx locks with SX_DUPOK) can be held across user memory access if carefully designed, but mutexes never can.

Pitfall: Inconsistent lock ordering

Symptom: Deadlock when two threads acquire the same locks in opposite orders.

Root cause: Lock ordering violations create circular wait conditions.

Wrong:

/* Thread A */
mtx_lock(&lock_a);
mtx_lock(&lock_b);  /* Order: A then B */

/* Thread B */
mtx_lock(&lock_b);
mtx_lock(&lock_a);  /* Order: B then A - DEADLOCK! */

Correct:

/* Establish hierarchy: always lock_a before lock_b */

/* Thread A */
mtx_lock(&lock_a);
mtx_lock(&lock_b);

/* Thread B */
mtx_lock(&lock_a);  /* Same order everywhere */
mtx_lock(&lock_b);

How to avoid:

  1. Document your lock hierarchy in comments at the top of the file
  2. Always acquire locks in the same order throughout the driver
  3. Use WITNESS kernel option during development to detect violations
  4. Study led.c: it acquires led_mtx first, releases it, then acquires led_sx, never holds both simultaneously
Pitfall: Forgetting to initialize locks

Symptom: Kernel panic with "lock not initialized" or immediate hang on first lock acquisition.

Root cause: Lock structures must be explicitly initialized before use.

Wrong:

static struct mtx my_lock;  /* Declared but not initialized */

static int
foo_attach(device_t dev)
{
    mtx_lock(&my_lock);  /* PANIC: uninitialized lock */
}

Correct:

static struct mtx my_lock;

static void
foo_init(void)
{
    mtx_init(&my_lock, "my lock", NULL, MTX_DEF);
}

SYSINIT(foo, SI_SUB_DRIVERS, SI_ORDER_FIRST, foo_init, NULL);

How to avoid:

  • Initialize locks in module load handler, SYSINIT, or attach function
  • Use mtx_init(), sx_init(), rw_init() as appropriate
  • For callouts: callout_init_mtx() associates timer with lock
  • Study led.c's led_drvinit(): initializes all locks before any devices are created
Pitfall: Destroying locks while threads still hold them

Symptom: Kernel panic during module unload or device detach.

Root cause: Lock structures must remain valid until all users are done.

Wrong:

static int
bad_detach(device_t dev)
{
    mtx_destroy(&sc->mtx);     /* Destroy lock */
    destroy_dev(sc->dev);       /* But device write handler may still run! */
    return 0;
}

Correct:

static int
good_detach(device_t dev)
{
    destroy_dev(sc->dev);       /* Wait for all users to finish */
    /* Now safe - no threads can be in device operations */
    mtx_destroy(&sc->mtx);
    return 0;
}

How to avoid:

  • destroy_dev() blocks until all open file descriptors close and in-progress operations complete
  • Destroy locks only after devices/resources are gone
  • For global locks: destroy in module unload or never (if module can't unload)

Resource Management Failures

Pitfall: Leaking resources on error paths

Symptom: Memory leaks, device node leaks, eventual resource exhaustion.

Root cause: Early returns skip cleanup code.

Wrong:

static int
bad_attach(device_t dev)
{
    sc = malloc(sizeof(*sc), M_DEV, M_WAITOK);
    
    sc->res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE);
    if (sc->res == NULL)
        return ENXIO;  /* LEAK: sc not freed! */
    
    error = setup_irq(dev);
    if (error)
        return error;  /* LEAK: sc and sc->res not freed! */
    
    return 0;
}

Correct:

static int
good_attach(device_t dev)
{
    sc = malloc(sizeof(*sc), M_DEV, M_WAITOK | M_ZERO);
    
    sc->res = bus_alloc_resource_any(dev, SYS_RES_MEMORY, &rid, RF_ACTIVE);
    if (sc->res == NULL) {
        error = ENXIO;
        goto fail;
    }
    
    error = setup_irq(dev);
    if (error)
        goto fail;
    
    return 0;

fail:
    if (sc->res != NULL)
        bus_release_resource(dev, SYS_RES_MEMORY, rid, sc->res);
    free(sc, M_DEV);
    return error;
}

How to avoid:

  • Use a single fail: label at the end of the function
  • Check which resources were allocated and free only those
  • Initialize pointers to NULL so you can check them
  • Consider: every malloc() needs a free(), every make_dev() needs a destroy_dev()
Pitfall: Use-after-free in concurrent cleanup

Symptom: Kernel panic with "page fault in kernel mode", often intermittent.

Root cause: One thread frees memory while another thread still accesses it.

Wrong:

void
bad_destroy(struct cdev *dev)
{
    struct foo_softc *sc = dev->si_drv1;
    
    free(sc, M_FOO);            /* Free immediately */
    /* Another thread's foo_write may still be using sc! */
}

Correct:

void
good_destroy(struct cdev *dev)
{
    struct foo_softc *sc;
    
    mtx_lock(&foo_mtx);
    sc = dev->si_drv1;
    dev->si_drv1 = NULL;        /* Break link first */
    LIST_REMOVE(sc, list);      /* Remove from searchable lists */
    mtx_unlock(&foo_mtx);
    
    destroy_dev(dev);           /* Wait for operations to drain */
    
    /* Now safe - no threads can find or access sc */
    free(sc, M_FOO);
}

How to avoid:

  • Make objects invisible before freeing (clear pointers, remove from lists)
  • Use destroy_dev() which waits for in-progress operations
  • Study led_destroy: clears si_drv1 first, removes from list, then frees
Pitfall: Not checking for allocation failures with M_NOWAIT

Symptom: Kernel panic dereferencing NULL pointer.

Root cause: M_NOWAIT allocations can fail, but code assumes success.

Wrong:

static int
bad_write(struct cdev *dev, struct uio *uio, int flags)
{
    char *buf = malloc(uio->uio_resid, M_TEMP, M_NOWAIT);
    /* PANIC if malloc returns NULL and we dereference buf! */
    uiomove(buf, uio->uio_resid, uio);
    free(buf, M_TEMP);
}

Correct:

static int
good_write(struct cdev *dev, struct uio *uio, int flags)
{
    char *buf = malloc(uio->uio_resid, M_TEMP, M_NOWAIT);
    if (buf == NULL)
        return ENOMEM;
    
    error = uiomove(buf, uio->uio_resid, uio);
    free(buf, M_TEMP);
    return error;
}

Better: Use M_WAITOK when safe:

static int
better_write(struct cdev *dev, struct uio *uio, int flags)
{
    /* M_WAITOK can sleep but never returns NULL */
    char *buf = malloc(uio->uio_resid, M_TEMP, M_WAITOK);
    error = uiomove(buf, uio->uio_resid, uio);
    free(buf, M_TEMP);
    return error;
}

How to avoid:

  • Use M_WAITOK unless in interrupt context or holding spin locks
  • Always check M_NOWAIT allocations for NULL
  • Study led_write: uses M_WAITOK since write operations can sleep

Timer and Asynchronous Operation Errors

Pitfall: Timer callback accessing freed memory

Symptom: Panic in timer callback, memory corruption.

Root cause: Device destroyed but timer still scheduled.

Wrong:

void
bad_destroy(struct cdev *dev)
{
    struct foo_softc *sc = dev->si_drv1;
    
    destroy_dev(dev);
    free(sc, M_FOO);            /* Free softc */
    /* Timer may fire and access sc! */
}

Correct:

void
good_destroy(struct cdev *dev)
{
    struct foo_softc *sc = dev->si_drv1;
    
    callout_drain(&sc->callout);  /* Wait for callback to complete */
    destroy_dev(dev);
    free(sc, M_FOO);              /* Now safe */
}

How to avoid:

  • Use callout_drain() before freeing structures accessed by callback
  • Or use callout_stop() and ensure no callback is running
  • Initialize callouts with callout_init_mtx() to automatically hold your lock
  • Study led_destroy: stops timer when list becomes empty
Pitfall: Timer rescheduling unconditionally

Symptom: CPU waste, system slowdown, unnecessary wakeups.

Root cause: Timer fires even when there's no work to do.

Wrong:

static void
bad_timeout(void *arg)
{
    /* Process items */
    LIST_FOREACH(item, &list, entries) {
        if (item->active)
            process_item(item);
    }
    
    /* Always reschedule - wastes CPU even when list empty! */
    callout_reset(&timer, hz / 10, bad_timeout, arg);
}

Correct:

static void
good_timeout(void *arg)
{
    int active_count = 0;
    
    LIST_FOREACH(item, &list, entries) {
        if (item->active) {
            process_item(item);
            active_count++;
        }
    }
    
    /* Only reschedule if there's work */
    if (active_count > 0)
        callout_reset(&timer, hz / 10, good_timeout, arg);
}

How to avoid:

  • Maintain a counter of items needing service
  • Only schedule timer when counter > 0
  • Study led.c: blinkers counter gates timer rescheduling

Network Driver Specific Issues

Pitfall: Not freeing mbufs on error paths

Symptom: mbuf exhaustion, "network buffers exhausted" messages.

Root cause: Mbufs are a limited resource that must be explicitly freed.

Wrong:

static int
bad_transmit(struct ifnet *ifp, struct mbuf *m)
{
    if (validate_packet(m) < 0)
        return EINVAL;  /* LEAK: m not freed! */
    
    if (queue_full())
        return ENOBUFS; /* LEAK: m not freed! */
    
    enqueue_packet(m);
    return 0;
}

Correct:

static int
good_transmit(struct ifnet *ifp, struct mbuf *m)
{
    if (validate_packet(m) < 0) {
        m_freem(m);
        return EINVAL;
    }
    
    if (queue_full()) {
        m_freem(m);
        return ENOBUFS;
    }
    
    enqueue_packet(m);  /* Queue now owns mbuf */
    return 0;
}

How to avoid:

  • Whoever has the mbuf pointer is responsible for freeing it
  • On error: m_freem(m) before returning
  • On success: ensure someone else took ownership (queued, transmitted, etc.)
Pitfall: Forgetting to notify blocked readers/writers

Symptom: Processes hang in read/write/poll even though data is available.

Root cause: Data arrives but waiters aren't woken.

Wrong:

static void
bad_rx_handler(struct foo_softc *sc, struct mbuf *m)
{
    TAILQ_INSERT_TAIL(&sc->rxq, m, list);
    /* Reader blocked in read() never wakes up! */
}

Correct:

static void
good_rx_handler(struct foo_softc *sc, struct mbuf *m)
{
    TAILQ_INSERT_TAIL(&sc->rxq, m, list);
    
    /* Triple notification pattern */
    wakeup(sc);                              /* Wake sleeping threads */
    selwakeuppri(&sc->rsel, PZERO + 1);      /* Wake poll/select */
    KNOTE_LOCKED(&sc->rsel.si_note, 0);      /* Wake kqueue */
}

How to avoid:

  • After enqueueing data: call wakeup(), selwakeuppri(), KNOTE()
  • Study tunstart in if_tuntap.c: triple notification pattern
  • For write: notify after dequeueing (when space becomes available)

Input Validation Failures

Pitfall: Not bounding input sizes

Symptom: Denial of service, kernel memory exhaustion.

Root cause: Attacker can request huge allocations or cause huge copies.

Wrong:

static int
bad_write(struct cdev *dev, struct uio *uio, int flags)
{
    char *buf = malloc(uio->uio_resid, M_TEMP, M_WAITOK);
    /* Attacker writes 1GB, kernel allocates 1GB! */
    uiomove(buf, uio->uio_resid, uio);
    process(buf);
    free(buf, M_TEMP);
}

Correct:

#define MAX_CMD_SIZE 4096

static int
good_write(struct cdev *dev, struct uio *uio, int flags)
{
    char *buf;
    
    if (uio->uio_resid > MAX_CMD_SIZE)
        return EINVAL;  /* Reject excessive requests */
    
    buf = malloc(uio->uio_resid, M_TEMP, M_WAITOK);
    uiomove(buf, uio->uio_resid, uio);
    process(buf);
    free(buf, M_TEMP);
}

How to avoid:

  • Define maximum sizes for all inputs (commands, packets, buffers)
  • Check limits before allocation
  • Study led_write: rejects commands over 512 bytes
Pitfall: Trusting user-provided lengths and offsets

Symptom: Buffer overruns, reading uninitialized memory, information leaks.

Root cause: User controls length fields in ioctl structures.

Wrong:

struct user_request {
    void *buf;
    size_t len;
};

static int
bad_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td)
{
    struct user_request *req = (struct user_request *)data;
    char kernel_buf[256];
    
    /* User can set len > 256! */
    copyin(req->buf, kernel_buf, req->len);  /* Buffer overrun! */
}

Correct:

static int
good_ioctl(struct cdev *dev, u_long cmd, caddr_t data, int flag, struct thread *td)
{
    struct user_request *req = (struct user_request *)data;
    char kernel_buf[256];
    
    if (req->len > sizeof(kernel_buf))
        return EINVAL;
    
    return copyin(req->buf, kernel_buf, req->len);
}

How to avoid:

  • Validate all length fields against buffer sizes
  • Validate offsets are within valid ranges
  • Use MIN() to cap lengths: len = MIN(user_len, MAX_LEN)

Race Conditions and Timing Issues

Pitfall: Check-then-use races

Symptom: Intermittent crashes, security vulnerabilities (TOCTOU bugs).

Root cause: State changes between check and use.

Wrong:

static int
bad_write(struct cdev *dev, struct uio *uio, int flags)
{
    struct foo_softc *sc = dev->si_drv1;
    
    if (sc == NULL)          /* Check */
        return ENXIO;
    
    /* Another thread destroys device here! */
    
    process_data(sc->buf);   /* Use - sc may be freed! */
}

Correct:

static int
good_write(struct cdev *dev, struct uio *uio, int flags)
{
    struct foo_softc *sc;
    int error;
    
    mtx_lock(&foo_mtx);
    sc = dev->si_drv1;
    if (sc == NULL) {
        mtx_unlock(&foo_mtx);
        return ENXIO;
    }
    
    /* Process while holding lock */
    error = process_data_locked(sc->buf);
    mtx_unlock(&foo_mtx);
    return error;
}

How to avoid:

  • Hold appropriate lock from check through use
  • Make checks and uses atomic with respect to each other
  • Or use reference counting to keep objects alive
Pitfall: Missing memory barriers on lock-free code

Symptom: Rare corruption on multi-core systems, works fine in single-core.

Root cause: CPU reordering of memory operations.

Wrong:

/* Producer */
sc->data = new_value;    /* Write data */
sc->ready = 1;           /* Set flag - may be reordered before data write! */

/* Consumer */
if (sc->ready)           /* Check flag */
    use(sc->data);       /* May see old data! */

Correct with explicit barriers:

/* Producer */
sc->data = new_value;
atomic_store_rel_int(&sc->ready, 1);  /* Release barrier */

/* Consumer */
if (atomic_load_acq_int(&sc->ready))  /* Acquire barrier */
    use(sc->data);

Better: Just use locks:

/* Much simpler and correct */
mtx_lock(&sc->mtx);
sc->data = new_value;
sc->ready = 1;
mtx_unlock(&sc->mtx);

How to avoid:

  • Avoid lock-free programming unless you're an expert
  • Use locks for correctness, optimize only if profiling shows need
  • If you must go lock-free: use atomic operations with explicit barriers

Module Lifecycle Issues

Pitfall: Device operations racing with module unload

Symptom: Crash during kldunload, jumps to invalid memory.

Root cause: Functions unloaded while still in use.

Wrong:

static int
bad_unload(module_t mod, int type, void *data)
{
    switch(type) {
    case MOD_UNLOAD:
        destroy_dev(my_dev);
        return 0;  /* Module text may be unloaded while write() in progress! */
    }
}

Correct:

static int
good_unload(module_t mod, int type, void *data)
{
    switch(type) {
    case MOD_UNLOAD:
        /* destroy_dev() waits for all operations to complete */
        destroy_dev(my_dev);
        /* Now safe - no code paths reference module functions */
        return 0;
    }
}

How to avoid:

  • destroy_dev() automatically prevents this by waiting
  • For infrastructure modules (like led.c): don't provide unload handler
  • Test unload under load: while true; do cat /dev/foo; done & sleep 1; kldunload foo
Pitfall: Unload leaving dangling references

Symptom: Crashes in seemingly unrelated code after module unload.

Root cause: Other code holds pointers to unloaded module's data/functions.

Wrong:

/* Your module */
void my_callback(void *arg) { /* ... */ }

static int
bad_load(module_t mod, int type, void *data)
{
    register_callback(my_callback);  /* Register with another subsystem */
    return 0;
}

static int
bad_unload(module_t mod, int type, void *data)
{
    return 0;  /* Forgot to unregister - subsystem will call invalid function! */
}

Correct:

static int
good_unload(module_t mod, int type, void *data)
{
    unregister_callback(my_callback);  /* Clean up registrations */
    /* Wait for any in-progress callbacks to complete */
    return 0;
}

How to avoid:

  • Every registration needs corresponding deregistration
  • Every callback installation needs removal
  • Every "register with subsystem" needs "unregister from subsystem"

Debugging Pitfall Patterns

How to detect these bugs:

For locking issues:

# In kernel config or loader.conf
options WITNESS
options WITNESS_SKIPSPIN
options INVARIANTS
options INVARIANT_SUPPORT

WITNESS detects lock order violations and reports them in dmesg.

For memory issues:

# Track allocations
vmstat -m | grep M_YOURTYPE

# Enable kernel malloc debugging
options MALLOC_DEBUG_MAXZONES=8

For race conditions:

  • Run stress tests on multi-core systems
  • Use stress2 test suite
  • Concurrent operations: multiple threads opening/closing/reading/writing

For leak detection:

  • Before load: note resource counts (vmstat -m, devfs, ifconfig -a)
  • Load module, exercise it heavily
  • Unload module
  • Check resource counts - should return to baseline

Prevention Checklist

Before committing code, verify:

Data movement

  • All uiomove() calls properly update uio_resid
  • Chunk sizes bounded to reasonable limits
  • No direct dereferencing of user pointers

Locking

  • No locks held across uiomove()/copyin()/copyout()
  • Consistent lock ordering documented and followed
  • All locks initialized before use
  • Locks destroyed only after last user done

Resources

  • Every allocation has matching free on all paths
  • Error paths tested and leak-free
  • Objects made invisible before freeing
  • NULL checks after M_NOWAIT allocations

Timers

  • callout_drain() before freeing structures
  • Timer rescheduling gated by work counter
  • Callout initialized with associated mutex

Network (if applicable)

  • Mbufs freed on all error paths
  • Triple notification after enqueue
  • Input sizes validated against MRU

Input validation

  • Maximum sizes defined and enforced
  • User-provided lengths checked
  • Offsets validated before use

Races

  • No check-then-use patterns without locks
  • Critical sections properly protected
  • Lock-free code avoided unless necessary

Lifecycle

  • destroy_dev() before freeing softc
  • All registrations have deregistrations
  • Unload tested under concurrent use

When Things Go Wrong

If you see "sleeping with lock held":

  • Likely holding mutex across uiomove() or allocation with M_WAITOK
  • Solution: Release lock before blocking operation

If you see "lock order reversal":

  • Two locks acquired in different orders in different code paths
  • Solution: Establish and document hierarchy, fix violating code

If you see "page fault in kernel mode":

  • Usually use-after-free or NULL dereference
  • Check: Are you accessing memory after freeing? Is si_drv1 cleared first?

If processes hang forever:

  • Missing wakeup() or notification
  • Check: Does every enqueue call wakeup/selwakeup/KNOTE?

If resources leak:

  • Error path missing cleanup
  • Check: Does every early return free what was allocated?

You're Ready: From Patterns to Practice

By studying these pitfalls and their solutions in the context of the four driver tours, you develop the instincts to avoid them. The patterns repeat: check before use, lock appropriately, free what you allocate, notify when you enqueue, validate user input. Master these, and your drivers will be robust.

You now have a compact mental model: the same few patterns repeat with different applications. Keep this blueprint open while you tackle hands-on labs, it's the shortest path from "I think I get it" to "I can ship a driver that behaves correctly."

When in doubt, return to the four driver tours. They're your worked examples showing these patterns in complete, working code.

Next, it's time to get hands-on with four practical labs.

Hands-On Labs: From Reading to Building (Beginner-Safe)

You've read about driver structure; now experience it. These four carefully designed labs take you from reading code to building working kernel modules, each one validating your understanding before moving forward.

Lab Design Philosophy

These labs are:

  • Safe: Run in your lab VM, isolated from your main system
  • Incremental: Each builds on the last with clear checkpoints
  • Self-validating: You'll know immediately if you've succeeded
  • Explanatory: Code includes comments explaining the "why" behind the "what"
  • Complete: All code is tested on FreeBSD 14.3 and ready to use

Prerequisites for All Labs

Before starting, ensure you have:

  1. FreeBSD 14.3 running (VM or physical machine)

  2. Source code installed: /usr/src must exist

    # If /usr/src is missing, install it:
    % sudo pkg install git
    % sudo git clone --branch releng/14.3 --depth 1 https://git.FreeBSD.org/src.git src /usr/src
    
  3. Build tools installed:

    % sudo pkg install llvm
    
  4. Root access via sudo or su

  5. Your lab logbook for notes and observations

Time Commitment

  • Lab 1 (Scavenger Hunt): 30-40 minutes
  • Lab 2 (Hello Module): 40-50 minutes
  • Lab 3 (Device Node): 60-75 minutes
  • Lab 4 (Error Handling): 30-40 minutes

Total: 2.5 - 3.5 hours for all labs

Recommendation: Complete Lab 1 and Lab 2 in one session, take a break, then tackle Lab 3 and 4 in a second session.

Lab 1: Explore the Driver Map (Read-Only Scavenger Hunt)

Goal

Locate and identify key driver structures in real FreeBSD source code. Build navigation confidence and pattern recognition skills.

What You'll Learn

  • How to find and read FreeBSD driver source files
  • How to recognize common patterns (cdevsw, probe/attach, DRIVER_MODULE)
  • Where different types of drivers live in the source tree
  • How to use less and grep effectively for driver exploration

Prerequisites

  • FreeBSD 14.3 with /usr/src installed
  • Text editor or less for viewing files
  • Terminal with your favorite shell

Time Estimate

30-40 minutes (questions only)
+10 minutes if you want to explore beyond the questions

Instructions

Part 1: Character Device Driver - The Null Driver

Step 1: Navigate to the null driver

% cd /usr/src/sys/dev/null
% ls -l
total 8
-rw-r--r--  1 root  wheel  4127 Oct 14 10:15 null.c

Step 2: Open the file with less

% less null.c

Navigation tips for less:

  • Press / to search (example: /cdevsw to find cdevsw structure)
  • Press n to find next occurrence
  • Press q to quit
  • Press g to go to top, G to go to bottom

Step 3: Answer these questions (write in your lab logbook):

Q1: What line number defines the null_cdevsw structure?
Hint: Search for /cdevsw in less

Q2: Which function handles writes to /dev/null?
Hint: Look at the .d_write = line in the cdevsw structure

Q3: What does the write function return?
Hint: Look at the function implementation

Q4: Where is the module event handler? What's its name?
Hint: Search for modevent

Q5: What macro registers the module with the kernel?
Hint: Look near the end of the file, search for DECLARE_MODULE

Q6: How many device nodes does this module create in /dev?
Hint: Count the make_dev_credf calls in the load handler

Q7: What are the device node names?
Hint: Look at the last parameter in each make_dev_credf call

Part 2: Infrastructure Driver - The LED Driver

Step 4: Navigate to the LED driver

% cd /usr/src/sys/dev/led
% less led.c

Step 5: Answer these questions:

Q8: Find the softc structure. What's it called?
Hint: Search for _softc { to find structure definitions

Q9: Where is led_create() defined?
Hint: Search for ^led_create (^ means start of line)

Q10: What subdirectory in /dev do LED device nodes appear under?
Hint: Look at the make_dev call in led_create(), check the path

Q11: Find the led_write function. What does it do with user input?
Hint: Look for the function definition, read the code

Q12: Is there a probe/attach pair, or does this use a module event handler?
Hint: Search for probe and attach vs modevent

Q13: Can you find where the driver allocates memory for the softc?
Hint: Look in led_create() for malloc calls

Part 3: Network Driver - The Tun/Tap Driver

Step 6: Navigate to the tun/tap driver

% cd /usr/src/sys/net
% less if_tuntap.c

Note: This is a larger, more complex driver. Don't try to understand everything, just find the specific patterns.

Step 7: Answer these questions:

Q14: Find the softc structure for tun. What's it called?
Hint: Search for tun_softc {

Q15: Does the softc contain both a struct cdev * and network interface pointer?
Hint: Look at the members of the softc structure

Q16: Where is the tun_cdevsw structure defined?
Hint: Search for tun_cdevsw =

Q17: What function is called when you open /dev/tun?
Hint: Look at the .d_open = line in the cdevsw

Q18: Where does the driver create the network interface?
Hint: Search for if_alloc in the source

Part 4: Bus-Attached Driver - A PCI UART

Step 8: Navigate to a PCI driver

% cd /usr/src/sys/dev/uart
% less uart_bus_pci.c

Step 9: Answer these questions:

Q19: Find the probe function. What's it called?
Hint: Look for a function ending in _probe

Q20: What does the probe function check to identify compatible hardware?
Hint: Look inside the probe function for ID comparisons

Q21: Where is DRIVER_MODULE declared?
Hint: Search for DRIVER_MODULE - should be near end of file

Q22: What bus does this driver attach to?
Hint: Look at the second parameter to DRIVER_MODULE macro

Q23: Find the device method table. What's it called?
Hint: Search for device_method_t - should be an array

Q24: How many methods are defined in the method table?
Hint: Count entries between the declaration and DEVMETHOD_END

Check Your Answers

After completing all questions, compare with the answer key below. Don't peek before attempting!

Part 1: Null Driver

A1: Around line 85-95 (varies with FreeBSD version)

A2: null_write function

A3: Returns 0 (success) and the value of uio->uio_resid (consumes all input but discards it)

A4: null_modevent() around line 120-140

A5: DECLARE_MODULE macro (for non-newbus drivers)

A6: Three device nodes: /dev/null, /dev/zero and /dev/full

A7: "null", "zero", "full"

Part 2: LED Driver

A8: struct led_softc

A9: Near the middle of the file (around line 200-250)

A10: /dev/led/ (LEDs appear as /dev/led/name)

A11: It converts the user's input string to LED control commands (like "on", "off", or patterns)

A12: Module event handler (no probe/attach - it's infrastructure, not hardware driver)

A13: Yes, in led_create(): malloc(sizeof(struct led_softc), M_LED, M_WAITOK | M_ZERO)

Part 3: Tun/Tap Driver

A14: struct tuntap_softc

A15: Yes - both dev (struct cdev *) and network interface members

A16: Around line 400-450 (look for the cdevsw initialization)

A17: tuntap_open() or similar (name may vary slightly)

A18: In the open or create function, using if_alloc(IFT_PPP) or if_alloc(IFT_ETHER)

Part 4: PCI UART Driver

A19: uart_pci_probe()

A20: PCI vendor ID and device ID (checks against a table of known UART chips)

A21: At the end of the file

A22: pci bus (the parent bus)

A23: uart_pci_methods[]

A24: Typically 3-5 methods (probe, attach, detach, and possibly others)

If your answers differ significantly:

  1. Don't worry! FreeBSD code evolves between versions
  2. The important part is finding the structures, not exact line numbers
  3. If you found similar patterns in different locations, that's success

Success Criteria

  • Found all major structures in each drive
  • Understand the pattern: entry points (cdevsw/ifnet), lifecycle (probe/attach/detach), registration (DRIVER_MODULE/DECLARE_MODULE)
  • Can navigate driver source confidently
  • Recognize differences between driver types (character vs network vs bus-attached)

What You Learned

  • Character devices use cdevsw structures with entry point functions
  • Network devices combine character devices (cdev) with network interfaces (ifnet)
  • Bus-attached drivers use newbus (probe/attach/detach) and method tables
  • Infrastructure modules may skip probe/attach if they're not hardware drivers
  • softc structures hold per-device state
  • Module registration varies (DECLARE_MODULE vs DRIVER_MODULE) depending on driver type

Lab Logbook Entry Template

Lab 1 Complete: [Date]

Time taken: ___ minutes
Questions answered: 24/24

Most interesting discovery: 
[What surprised you most about real driver code?]

Challenging aspects:
[What was hard to find? Any patterns you didn't expect?]

Key insight:
[What "clicked" for you during this exploration?]

Next steps:
[Ready for Lab 2 where you'll build your first module]

Lab 2: Minimal Module with Logs Only

Goal

Build, load, and unload your first kernel module. Confirm your toolchain works and understand the module lifecycle through direct observation.

What You'll Learn

  • How to write a minimal kernel module
  • How to create a Makefile for kernel module builds
  • How to load and unload modules safely
  • How to observe kernel messages in dmesg
  • The module event handler lifecycle (load/unload)
  • How to troubleshoot common build errors

Prerequisites

  • FreeBSD 14.3 with /usr/src installed
  • Build tools installed (clang, make)
  • sudo/root access
  • Completed Lab 1 (recommended but not required)

Time Estimate

40-50 minutes (including build, test, and documentation)

Instructions

Step 1: Create Working Directory

% mkdir -p ~/drivers/hello
% cd ~/drivers/hello

Why this location?: Your home directory keeps driver experiments separate from system files and survives reboots.

Step 2: Create the Minimal Driver

Create a file named hello.c:

% vi hello.c   # or nano, emacs, your choice

Enter the following code (explanation follows):

/*
 * hello.c - Minimal FreeBSD kernel module for testing
 * 
 * This is the simplest possible kernel module: it does nothing except
 * print messages when loaded and unloaded. Perfect for verifying that
 * your build environment works correctly.
 *
 * FreeBSD 14.3 compatible
 */

#include <sys/param.h>      /* System parameter definitions */
#include <sys/module.h>     /* Kernel module definitions */
#include <sys/kernel.h>     /* Kernel types and macros */
#include <sys/systm.h>      /* System functions (printf) */

/*
 * Module event handler
 * 
 * This function is called whenever something happens to the module:
 * - MOD_LOAD: Module is being loaded into the kernel
 * - MOD_UNLOAD: Module is being removed from the kernel
 * - MOD_SHUTDOWN: System is shutting down (rare, usually not implemented)
 * - MOD_QUIESCE: Module should prepare for unload (advanced, not shown here)
 *
 * Parameters:
 *   mod: Module identifier (handle to this module)
 *   event: What's happening (MOD_LOAD, MOD_UNLOAD, etc.)
 *   arg: Extra data (usually NULL, not used here)
 *
 * Returns:
 *   0 on success
 *   Error code (like EOPNOTSUPP) on failure
 */
static int
hello_modevent(module_t mod __unused, int event, void *arg __unused)
{
    int error = 0;
    
    /*
     * The __unused attribute tells the compiler "I know these parameters
     * aren't used, don't warn me about it." It's good practice to mark
     * intentionally unused parameters.
     */
    
    switch (event) {
    case MOD_LOAD:
        /*
         * This runs when someone does 'kldload hello.ko'
         * 
         * printf() in kernel code goes to the kernel message buffer,
         * which you can see with 'dmesg' or in /var/log/messages.
         * 
         * Notice we say "Hello:" at the start - this helps identify
         * which module printed the message when reading logs.
         */
        printf("Hello: Module loaded successfully!\n");
        printf("Hello: This message appears in dmesg\n");
        printf("Hello: Module address: %p\n", (void *)&hello_modevent);
        break;
        
    case MOD_UNLOAD:
        /*
         * This runs when someone does 'kldunload hello'
         * 
         * This is where you'd clean up resources if this module
         * had allocated anything. Our minimal module has nothing
         * to clean up.
         */
        printf("Hello: Module unloaded. Goodbye!\n");
        break;
        
    default:
        /*
         * We don't handle other events (like MOD_SHUTDOWN).
         * Return EOPNOTSUPP ("operation not supported").
         */
        error = EOPNOTSUPP;
        break;
    }
    
    return (error);
}

/*
 * Module declaration structure
 * 
 * This tells the kernel about our module:
 * - name: "hello" (how it appears in kldstat)
 * - evhand: pointer to our event handler
 * - priv: private data (NULL for us, we have none)
 */
static moduledata_t hello_mod = {
    "hello",            /* module name */
    hello_modevent,     /* event handler function */
    NULL                /* extra data (not used) */
};

/*
 * DECLARE_MODULE macro
 * 
 * This is the magic that registers our module with FreeBSD.
 * 
 * Parameters:
 *   1. hello: Unique module identifier (matches name in moduledata_t)
 *   2. hello_mod: Our moduledata_t structure
 *   3. SI_SUB_DRIVERS: Subsystem order (we're a "driver" subsystem)
 *   4. SI_ORDER_MIDDLE: Load order within subsystem (middle of the pack)
 *
 * Load order matters when modules depend on each other. SI_SUB_DRIVERS
 * and SI_ORDER_MIDDLE are safe defaults for simple modules.
 */
DECLARE_MODULE(hello, hello_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);

/*
 * MODULE_VERSION macro
 * 
 * Declares the version of this module. Version numbers help the kernel
 * manage module dependencies and compatibility.
 * 
 * Format: MODULE_VERSION(name, version_number)
 * Version 1 is fine for new modules.
 */
MODULE_VERSION(hello, 1);

Code explanation summary:

  • Includes: Bring in kernel headers (unlike userspace, we can't use <stdio.h>)
  • Event handler: Function called when module loads/unloads
  • moduledata_t: Connects the module name to its event handler
  • DECLARE_MODULE: Registers everything with the kernel
  • MODULE_VERSION: Declares version for dependency tracking

Step 3: Create the Makefile

Create a file named Makefile (exact name, capital M):

% vi Makefile

Enter this content:

# Makefile for hello kernel module
#
# This Makefile uses FreeBSD's kernel module build infrastructure.
# The .include at the end does all the heavy lifting.

# KMOD: Kernel module name (will produce hello.ko)
KMOD=    hello

# SRCS: Source files to compile (just hello.c)
SRCS=    hello.c

# Include FreeBSD's kernel module build rules
# This single line gives you:
#   - 'make' or 'make all': Build the module
#   - 'make clean': Remove build artifacts
#   - 'make install': Install to /boot/modules (don't use in lab!)
#   - 'make load': Load the module (requires root)
#   - 'make unload': Unload the module (requires root)
.include <bsd.kmod.mk>

Makefile notes:

  • Must be named "Makefile" (or "makefile", but "Makefile" is convention)
  • Tabs matter: If you get errors, check that indentation uses TABS not spaces
  • KMOD determines the output filename (hello.ko)
  • bsd.kmod.mk is FreeBSD's kernel module build infrastructure (does the complex stuff)

Step 4: Build the Module

% make clean
rm -f hello.ko hello.o ... [various cleanup]

% make
cc -O2 -pipe -fno-strict-aliasing  -Werror -D_KERNEL -DKLD_MODULE ... -c hello.c
ld -d -warn-common -r -d -o hello.ko hello.o

What's happening:

  1. make clean: Removes old build artifacts (always safe to run)
  2. make: Compiles hello.c to hello.o, then links to create hello.ko
  3. The compiler flags (-D_KERNEL -DKLD_MODULE) tell the code it's in kernel mode

Expected output: You should see compilation commands but no errors.

Common error messages:

Error: "implicit declaration of function 'printf'"
Fix: Check your includes - you need <sys/systm.h>

Error: "expected ';' before '}'"
Fix: Check for missing semicolons in your code

Error: "undefined reference to __something"
Fix: Usually means wrong includes or typo in function name

Step 5: Verify Build Success

% ls -lh hello.ko
-rwxr-xr-x  1 youruser  youruser   14K Nov 14 15:30 hello.ko

What to look for:

  • File exists: hello.ko is present
  • Size is reasonable: 10-20 KB is typical for minimal modules
  • Executable bit set: -rwxr-xr-x (the 'x' means executable)

Step 6: Load the Module

% sudo kldload ./hello.ko

Important notes:

  • Must use sudo (or be root): Only root can load kernel modules
  • Use ./hello.ko: The ./ tells kldload to use the local file, not search system paths
  • No output is normal: If it loads successfully, kldload prints nothing

If you get an error:

kldload: can't load ./hello.ko: module already loaded or in kernel
Solution: The module is already loaded. Unload it first: sudo kldunload hello

kldload: can't load ./hello.ko: Exec format error
Solution: Module was built for different FreeBSD version. Rebuild on target system.

kldload: an error occurred. Please check dmesg(8) for more details.
Solution: Run 'dmesg | tail' to see what went wrong

Step 7: Verify Module Is Loaded

% kldstat | grep hello
 5    1 0xffffffff82500000     3000 hello.ko

Column meanings:

  • 5: Module ID (your number may differ)
  • 1: Reference count (how many things depend on it)
  • 0xffffffff82500000: Kernel memory address where module is loaded
  • 3000: Size in hex (0x3000 = 12288 bytes = 12 KB)
  • hello.ko: Module filename

Step 8: View Kernel Messages

% dmesg | tail -5
Hello: Module loaded successfully!
Hello: This message appears in dmesg
Hello: Module address: 0xffffffff82500000

What's dmesg?: The kernel message buffer. Everything printed with printf() in kernel code goes here.

Alternative ways to view:

% dmesg | grep Hello
% tail -f /var/log/messages   # Watch in real-time (Ctrl+C to stop)

Step 9: Unload the Module

% sudo kldunload hello

What happens:

  1. Kernel calls your hello_modevent() with MOD_UNLOAD
  2. Your handler prints "Goodbye!" and returns 0 (success)
  3. Kernel removes the module from memory

Step 10: Verify Unload Messages

% dmesg | tail -3
Hello: This message appears in dmesg
Hello: Module address: 0xffffffff82500000
Hello: Module unloaded. Goodbye!

Step 11: Confirm Module Is Gone

% kldstat | grep hello
[no output - module is unloaded]

% ls -l /dev/ | grep hello
[no output - this module doesn't create devices]

Behind the Scenes: What Just Happened?

Let's trace the complete journey of your module:

When you ran kldload ./hello.ko:

  1. Kernel loads file: Read hello.ko from disk into kernel memory
  2. Relocation: Adjust memory addresses in the code to work at the loaded address
  3. Symbol resolution: Connect function calls to their implementations
  4. Initialization: Call your hello_modevent() with MOD_LOAD
  5. Registration: Add "hello" to the kernel's module list
  6. Complete: kldload returns success (exit code 0)

Your printf() calls in MOD_LOAD happened during step 4.

When you ran kldunload hello:

  1. Lookup: Find the "hello" module in kernel's module list
  2. Reference check: Ensure nothing is using the module (ref count = 1)
  3. Shutdown: Call your hello_modevent() with MOD_UNLOAD
  4. Cleanup: Remove from module list
  5. Unmap: Free the kernel memory that held the module code
  6. Complete: kldunload returns success

Your printf() in MOD_UNLOAD happened during step 3.

Why DECLARE_MODULE and MODULE_VERSION matter:

DECLARE_MODULE(hello, hello_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);

This macro expands to code that creates a special data structure in a special ELF section (.set section) of the hello.ko file. When the kernel loads the module, it scans for these structures and knows:

  • Name: "hello"
  • Handler: hello_modevent
  • When to initialize: SI_SUB_DRIVERS phase, SI_ORDER_MIDDLE position

Without this macro, the kernel wouldn't know your module exists!

Troubleshooting Guide

Problem: Module won't compile

Symptom: make shows errors

Common causes:

  1. Typo in code: Carefully compare with example above
  2. Wrong includes: Check that all four #include lines are present
  3. Tabs vs spaces in Makefile: Makefiles require TABS for indentation
  4. Missing /usr/src: Build needs kernel headers from /usr/src

Debug steps:

# Check if /usr/src exists
% ls /usr/src/sys/sys/param.h
[should exist]

# Try compiling manually to see better errors
% cc -c -D_KERNEL -I/usr/src/sys hello.c

Problem: "Operation not permitted" when loading

Symptom: kldload: can't load ./hello.ko: Operation not permitted

Cause: Not running as root

Fix:

% sudo kldload ./hello.ko
# OR
% su
# kldload ./hello.ko

Problem: "module already loaded"

Symptom: kldload: can't load ./hello.ko: module already loaded

Cause: Module is already in the kernel

Fix:

% sudo kldunload hello
% sudo kldload ./hello.ko

Problem: No messages in dmesg

Symptom: kldload succeeds but dmesg shows nothing

Possible causes:

  1. Messages scrolled away: Use dmesg | tail -20 to see recent messages
  2. Wrong module loaded: Check kldstat to verify your module is there
  3. Event handler not called: Check that DECLARE_MODULE matches moduledata_t name

Problem: Kernel panic

Symptom: System crashes, shows panic message

Unlikely with this minimal module, but if it happens:

  1. Don't panic (no pun intended): Your VM can be rebooted
  2. Check the code: Probably a typo in the DECLARE_MODULE macro
  3. Start fresh: Reboot VM, compare your code character-by-character with example

Success Criteria

  • Module compiles without errors or warnings
  • hello.ko file is created (10-20 KB)
  • Module loads without errors
  • Messages appear in dmesg showing load
  • Module appears in kldstat output
  • Module unloads successfully
  • Unload message appears in dmesg
  • No kernel panics or crashes

What You Learned

Technical skills:

  • Writing a minimal kernel module structure
  • Using FreeBSD's kernel module build system
  • Loading and unloading kernel modules safely
  • Observing kernel messages with dmesg

Concepts:

  • Module event handlers (MOD_LOAD/MOD_UNLOAD lifecycle)
  • DECLARE_MODULE and MODULE_VERSION macros
  • Kernel printf vs userspace printf
  • Why root access is required for module operations

Confidence:

  • Your build environment works correctly
  • You can compile and load kernel code
  • You understand the basic module lifecycle
  • You're ready to add actual functionality (Lab 3)

Lab Logbook Entry Template

Lab 2 Complete: [Date]

Time taken: ___ minutes

Build results:
- First attempt: [ ] Success  [ ] Errors (describe: ___)
- After fixes: [ ] Success

Module operations:
- Load: [ ] Success  [ ] Errors
- Visible in kldstat: [ ] Yes  [ ] No
- Messages in dmesg: [ ] Yes  [ ] No
- Unload: [ ] Success  [ ] Errors

Key insight:
[What did you learn about the kernel module lifecycle?]

Challenges faced:
[What went wrong? How did you fix it?]

Next steps:
[Ready for Lab 3: adding real functionality with device nodes]

Optional Experiment: Module Load Order

Want to see why SI_SUB and SI_ORDER matter?

  1. Check current boot order:
% kldstat -v | less
  1. Try different subsystem orders: Edit hello.c and change:
DECLARE_MODULE(hello, hello_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);

to:

DECLARE_MODULE(hello, hello_mod, SI_SUB_PSEUDO, SI_ORDER_FIRST);

Rebuild and reload. Module still works! The order only matters when modules depend on each other.

Lab 3: Create and Remove a Device Node

Goal

Extend the minimal module to create a /dev entry that users can interact with. Implement basic read and write operations.

What You'll Learn

  • How to create a character device node in /dev
  • How to implement cdevsw (character device switch) entry points
  • How to safely copy data between user and kernel space with uiomove()
  • How open/close/read/write syscalls connect to your driver functions
  • The relationship between struct cdev, cdevsw, and device operations
  • Proper resource cleanup and NULL pointer safety

Prerequisites

  • Completed Lab 2 (Hello Module)
  • Understanding of file operations (open, read, write, close)
  • Basic C string handling knowledge

Time Estimate

60-75 minutes (including code understanding, building, and thorough testing)

Instructions

Step 1: Create New Working Directory

% mkdir -p ~/drivers/demo
% cd ~/drivers/demo

Why a new directory?: Keep each lab self-contained for easy reference later.

Step 2: Create the Driver Source

Create demo.c with the following complete code:

/*
 * demo.c - Simple character device with /dev node
 * 
 * This driver demonstrates:
 * - Creating a device node in /dev
 * - Implementing open/close/read/write operations
 * - Safe data transfer between kernel and user space
 * - Proper resource management and cleanup
 *
 * Compatible with FreeBSD 14.3
 */

#include <sys/param.h>      /* System parameters and limits */
#include <sys/module.h>     /* Kernel module support */
#include <sys/kernel.h>     /* Kernel types */
#include <sys/systm.h>      /* System functions like printf */
#include <sys/conf.h>       /* Character device configuration */
#include <sys/uio.h>        /* User I/O structures and uiomove() */
#include <sys/malloc.h>     /* Kernel memory allocation */

/*
 * Global device node pointer
 * 
 * This holds the handle to our /dev/demo entry. We need to keep this
 * so we can destroy the device when the module unloads.
 * 
 * NULL when module is not loaded.
 */
static struct cdev *demo_dev = NULL;

/*
 * Open handler - called when someone opens /dev/demo
 * 
 * This is called every time a process opens the device file:
 *   open("/dev/demo", O_RDWR);
 *   cat /dev/demo
 *   echo "hello" > /dev/demo
 * 
 * Parameters:
 *   dev: Device being opened (our cdev structure)
 *   oflags: Open flags (O_RDONLY, O_WRONLY, O_RDWR, O_NONBLOCK, etc.)
 *   devtype: Device type (usually S_IFCHR for character devices)
 *   td: Thread opening the device (process context)
 * 
 * Returns:
 *   0 on success
 *   Error code (like EBUSY, ENOMEM) on failure
 * 
 * Note: The __unused attribute marks parameters we don't use, avoiding
 *       compiler warnings.
 */
static int
demo_open(struct cdev *dev __unused, int oflags __unused,
          int devtype __unused, struct thread *td __unused)
{
    /*
     * In a real driver, you might:
     * - Check if exclusive access is required
     * - Allocate per-open state
     * - Initialize hardware
     * - Check device readiness
     * 
     * Our simple demo just logs that open happened.
     */
    printf("demo: Device opened (pid=%d, comm=%s)\n", 
           td->td_proc->p_pid, td->td_proc->p_comm);
    
    return (0);  /* Success */
}

/*
 * Close handler - called when last reference is closed
 * 
 * Important: This is called when the LAST file descriptor referring to
 * this device is closed. If a process opens /dev/demo twice, close is
 * called only after both fds are closed.
 * 
 * Parameters:
 *   dev: Device being closed
 *   fflag: File flags from the open call
 *   devtype: Device type
 *   td: Thread closing the device
 * 
 * Returns:
 *   0 on success
 *   Error code on failure
 */
static int
demo_close(struct cdev *dev __unused, int fflag __unused,
           int devtype __unused, struct thread *td __unused)
{
    /*
     * In a real driver, you might:
     * - Free per-open state
     * - Flush buffers
     * - Update hardware state
     * - Cancel pending operations
     */
    printf("demo: Device closed (pid=%d)\n", td->td_proc->p_pid);
    
    return (0);  /* Success */
}

/*
 * Read handler - transfer data from kernel to user space
 * 
 * This is called when someone reads from the device:
 *   cat /dev/demo
 *   dd if=/dev/demo of=output.txt bs=1024 count=1
 *   read(fd, buffer, size);
 * 
 * Parameters:
 *   dev: Device being read from
 *   uio: User I/O structure describing the read request
 *   ioflag: I/O flags (IO_NDELAY for non-blocking, etc.)
 * 
 * The 'uio' structure contains:
 *   uio_resid: Bytes remaining to transfer (initially = read size)
 *   uio_offset: Current position in the "file" (we ignore this)
 *   uio_rw: Direction (UIO_READ for read operations)
 *   uio_td: Thread performing the I/O
 *   [internal]: Scatter-gather list describing user buffer(s)
 * 
 * Returns:
 *   0 on success
 *   Error code (like EFAULT if user buffer is invalid)
 */
static int
demo_read(struct cdev *dev __unused, struct uio *uio, int ioflag __unused)
{
    /*
     * Our message to return to user space.
     * Could be device data, sensor readings, status info, etc.
     */
    char message[] = "Hello from demo driver!\n";
    size_t len;
    int error;
    
    /*
     * Log the read request details.
     * uio_resid tells us how many bytes the user wants to read.
     */
    printf("demo: Read called, uio_resid=%zd bytes requested\n", 
           uio->uio_resid);
    
    /*
     * Calculate how many bytes to actually transfer.
     * 
     * We use MIN() to transfer the smaller of:
     * 1. What the user requested (uio_resid)
     * 2. What we have available (sizeof(message)-1, excluding null terminator)
     * 
     * Why -1? The null terminator '\0' is for C string handling in the
     * kernel, but we don't send it to user space. Text files don't have
     * null terminators between lines.
     */
    len = MIN(uio->uio_resid, sizeof(message) - 1);
    
    /*
     * uiomove() - The safe way to copy data to user space
     * 
     * This function:
     * 1. Verifies the user's buffer is valid and writable
     * 2. Copies 'len' bytes from 'message' to the user's buffer
     * 3. Automatically updates uio->uio_resid (subtracts len)
     * 4. Handles scatter-gather buffers (if user buffer is non-contiguous)
     * 5. Returns error if user buffer is invalid (EFAULT)
     * 
     * CRITICAL SAFETY RULE:
     * Never use memcpy(), bcopy(), or direct pointer access for user data!
     * User pointers are in user space, not accessible in kernel space.
     * uiomove() safely bridges this gap.
     * 
     * Parameters:
     *   message: Source data (kernel space)
     *   len: Bytes to copy
     *   uio: Destination description (user space)
     * 
     * After uiomove() succeeds:
     *   uio->uio_resid is decreased by len
     *   uio->uio_offset is increased by len (for seekable devices)
     */
    error = uiomove(message, len, uio);
    
    if (error != 0) {
        printf("demo: Read failed, error=%d\n", error);
        return (error);
    }
    
    printf("demo: Read completed, transferred %zu bytes\n", len);
    
    /*
     * Return 0 for success.
     * The caller knows how much we transferred by checking how much
     * uio_resid decreased.
     */
    return (0);
}

/*
 * Write handler - receive data from user space
 * 
 * This is called when someone writes to the device:
 *   echo "hello" > /dev/demo
 *   dd if=input.txt of=/dev/demo bs=1024
 *   write(fd, buffer, size);
 * 
 * Parameters:
 *   dev: Device being written to
 *   uio: User I/O structure describing the write request
 *   ioflag: I/O flags
 * 
 * Returns:
 *   0 on success (usually - see note below)
 *   Error code on failure
 * 
 * IMPORTANT WRITE SEMANTICS:
 * Unlike read(), write() is expected to consume ALL the data.
 * If you don't consume everything (uio_resid > 0 after return),
 * the kernel will call write() again with the remaining data.
 * This can cause infinite loops if you always return 0 with resid > 0!
 */
static int
demo_write(struct cdev *dev __unused, struct uio *uio, int ioflag __unused)
{
    char buffer[128];  /* Temporary buffer for incoming data */
    size_t len;
    int error;
    
    /*
     * Limit transfer size to our buffer size.
     * 
     * We use sizeof(buffer)-1 to reserve space for null terminator
     * (so we can safely print the string).
     * 
     * Note: Real drivers might:
     * - Accept unlimited data (loop calling uiomove)
     * - Have larger buffers
     * - Queue data for processing
     * - Return EFBIG if data exceeds device capacity
     */
    len = MIN(uio->uio_resid, sizeof(buffer) - 1);
    
    /*
     * uiomove() for write: Copy FROM user space TO kernel buffer
     * 
     * Same function, but now we're the destination.
     * The direction is determined by uio->uio_rw internally.
     */
    error = uiomove(buffer, len, uio);
    if (error != 0) {
        printf("demo: Write failed during uiomove, error=%d\n", error);
        return (error);
    }
    
    /*
     * Add null terminator so we can safely use printf.
     * 
     * SECURITY NOTE: In a real driver, you must validate data!
     * - Check for null bytes if expecting text
     * - Validate ranges for numeric data
     * - Sanitize before using in format strings
     * - Never trust user input
     */
    buffer[len] = '\0';
    
    /*
     * Do something with the data.
     * 
     * Real drivers might:
     * - Send to hardware (network packet, disk write, etc.)
     * - Process commands (like LED control strings)
     * - Update device state
     * - Queue for async processing
     * 
     * We just log it.
     */
    printf("demo: User wrote %zu bytes: \"%s\"\n", len, buffer);
    
    /*
     * Return success.
     * 
     * At this point, uio->uio_resid should be 0 (we consumed everything).
     * If not, the kernel will call us again with the remainder.
     */
    return (0);
}

/*
 * Character device switch (cdevsw) structure
 * 
 * This is the "method table" that connects system calls to your functions.
 * When a user process calls open(), read(), write(), etc. on /dev/demo,
 * the kernel looks up this table to find which function to call.
 * 
 * Think of it as a virtual function table (vtable) in OOP terms.
 */
static struct cdevsw demo_cdevsw = {
    .d_version =    D_VERSION,      /* ABI version - always required */
    .d_open =       demo_open,      /* open() syscall handler */
    .d_close =      demo_close,     /* close() syscall handler */
    .d_read =       demo_read,      /* read() syscall handler */
    .d_write =      demo_write,     /* write() syscall handler */
    .d_name =       "demo",         /* Device name for identification */
    
    /*
     * Other possible entries (not used here):
     * 
     * .d_ioctl =   demo_ioctl,   // ioctl() for configuration/control
     * .d_poll =    demo_poll,    // poll()/select() for readiness
     * .d_mmap =    demo_mmap,    // mmap() for direct memory access
     * .d_strategy= demo_strategy,// For block devices (legacy)
     * .d_kqfilter= demo_kqfilter,// kqueue event notification
     * 
     * Unimplemented entries default to NULL and return ENODEV.
     */
};

/*
 * Module event handler
 * 
 * This is called on module load and unload.
 * We create our device node on load, destroy it on unload.
 */
static int
demo_modevent(module_t mod __unused, int event, void *arg __unused)
{
    int error = 0;
    
    switch (event) {
    case MOD_LOAD:
        /*
         * make_dev() - Create a device node in /dev
         * 
         * This is the key function that makes your driver visible
         * to user space. It creates an entry in the devfs filesystem.
         * 
         * Parameters:
         *   &demo_cdevsw: Pointer to our method table
         *   0: Unit number (minor number) - use 0 for single-instance devices
         *   UID_ROOT: Owner user ID (0 = root)
         *   GID_WHEEL: Owner group ID (0 = wheel group)
         *   0666: Permissions (rw-rw-rw- = world read/write)
         *   "demo": Device name (appears as /dev/demo)
         * 
         * Returns:
         *   Pointer to cdev structure on success
         *   NULL on failure (rare - usually only if name collision)
         * 
         * The returned cdev is an opaque handle representing the device.
         */
        demo_dev = make_dev(&demo_cdevsw, 
                           0,              /* unit number */
                           UID_ROOT,       /* owner UID */
                           GID_WHEEL,      /* owner GID */
                           0666,           /* permissions: rw-rw-rw- */
                           "demo");        /* device name */
        
        /*
         * Always check if make_dev() succeeded.
         * Failure is rare but possible.
         */
        if (demo_dev == NULL) {
            printf("demo: Failed to create device node\n");
            return (ENXIO);  /* "Device not configured" */
        }
        
        printf("demo: Device /dev/demo created successfully\n");
        printf("demo: Permissions: 0666 (world readable/writable)\n");
        printf("demo: Try: cat /dev/demo\n");
        printf("demo: Try: echo \"test\" > /dev/demo\n");
        break;
        
    case MOD_UNLOAD:
        /*
         * Cleanup on module unload.
         * 
         * CRITICAL ORDERING:
         * 1. Make device invisible (destroy_dev)
         * 2. Wait for all operations to complete
         * 3. Free resources
         * 
         * destroy_dev() does steps 1 and 2 automatically!
         */
        
        /*
         * Always check for NULL before destroying.
         * This protects against:
         * - MOD_LOAD failure (demo_dev never created)
         * - Double-unload attempts
         * - Corrupted state
         */
        if (demo_dev != NULL) {
            /*
             * destroy_dev() - Remove device node and clean up
             * 
             * This function:
             * 1. Removes /dev/demo from the filesystem
             * 2. Marks device as "going away"
             * 3. WAITS for all in-progress operations to complete
             * 4. Ensures no new operations can start
             * 5. Frees associated kernel resources
             * 
             * SYNCHRONIZATION GUARANTEE:
             * After destroy_dev() returns, no threads are executing
             * your open/close/read/write functions. This makes cleanup
             * safe - no race conditions with active I/O.
             * 
             * This is why you can safely unload modules while they're
             * in use (e.g., someone has the device open). The unload
             * will wait until they close it.
             */
            destroy_dev(demo_dev);
            
            /*
             * Set pointer to NULL for safety.
             * 
             * Defense in depth: If something accidentally tries to
             * use demo_dev after unload, NULL pointer dereference
             * is much easier to debug than use-after-free.
             */
            demo_dev = NULL;
            
            printf("demo: Device /dev/demo destroyed\n");
        }
        break;
        
    default:
        /*
         * We don't handle MOD_SHUTDOWN or other events.
         */
        error = EOPNOTSUPP;
        break;
    }
    
    return (error);
}

/*
 * Module declaration - connects everything together
 */
static moduledata_t demo_mod = {
    "demo",           /* Module name */
    demo_modevent,    /* Event handler */
    NULL              /* Extra data */
};

/*
 * Register module with kernel
 */
DECLARE_MODULE(demo, demo_mod, SI_SUB_DRIVERS, SI_ORDER_MIDDLE);

/*
 * Declare module version
 */
MODULE_VERSION(demo, 1);

Key concepts in this code:

  1. cdevsw structure: The dispatch table connecting syscalls to your functions
  2. uiomove(): Safe kernel <-> user data transfer (never use memcpy!)
  3. make_dev(): Creates visible /dev entry
  4. destroy_dev(): Removes device and waits for operations to complete
  5. NULL safety: Always check pointers before use, set to NULL after free

Step 3: Create the Makefile

Create Makefile:

# Makefile for demo character device driver

KMOD=    demo
SRCS=    demo.c

.include <bsd.kmod.mk>

Step 4: Build the Driver

% make clean
rm -f demo.ko demo.o ...

% make
cc -O2 -pipe -fno-strict-aliasing -Werror -D_KERNEL ... -c demo.c
ld -d -warn-common -r -d -o demo.ko demo.o

Expected: Clean build with no errors.

If you see warnings about unused parameters: This is fine - we marked them __unused but some compiler versions still warn.

Step 5: Load the Driver

% sudo kldload ./demo.ko

% dmesg | tail -5
demo: Device /dev/demo created successfully
demo: Permissions: 0666 (world readable/writable)
demo: Try: cat /dev/demo
demo: Try: echo "test" > /dev/demo

Step 6: Verify Device Node Creation

% ls -l /dev/demo
crw-rw-rw-  1 root  wheel  0x5e Nov 14 16:00 /dev/demo

What you're seeing:

  • c: Character device (not block device or regular file)
  • rw-rw-rw-: Permissions 0666 (anyone can read/write)
  • root wheel: Owned by root, group wheel
  • 0x5e: Device number (major/minor combined - your value may differ)
  • /dev/demo: The device path

Step 7: Test Reading

% cat /dev/demo
Hello from demo driver!

What happened:

  1. cat opened /dev/demo -> demo_open() called
  2. cat called read() -> demo_read() called
  3. Driver copied "Hello from demo driver!\n" to cat's buffer via uiomove()
  4. cat printed the received data to stdout
  5. cat closed the file -> demo_close() called

Check kernel log:

% dmesg | tail -5
demo: Device opened (pid=1234, comm=cat)
demo: Read called, uio_resid=65536 bytes requested
demo: Read completed, transferred 25 bytes
demo: Device closed (pid=1234)

Note: uio_resid=65536 means cat requested 64 KB (its default buffer). We only sent 25 bytes, which is fine - read() returns how much was actually transferred.

Step 8: Test Writing

% echo "Test message" > /dev/demo

% dmesg | tail -4
demo: Device opened (pid=1235, comm=sh)
demo: User wrote 13 bytes: "Test message
"
demo: Device closed (pid=1235)

What happened:

  1. Shell opened /dev/demo for writing
  2. echo wrote "Test message\n" (13 bytes including newline)
  3. Driver received it via uiomove() and logged it
  4. Shell closed the device

Step 9: Test Multiple Operations

% (cat /dev/demo; echo "Another test" > /dev/demo; cat /dev/demo)
Hello from demo driver!
Hello from demo driver!

Watch dmesg in another terminal:

% dmesg -w    # Watch mode - updates in real-time
...
demo: Device opened (pid=1236, comm=sh)
demo: Read called, uio_resid=65536 bytes requested
demo: Read completed, transferred 25 bytes
demo: Device closed (pid=1236)
demo: Device opened (pid=1237, comm=sh)
demo: User wrote 13 bytes: "Another test
"
demo: Device closed (pid=1237)
demo: Device opened (pid=1238, comm=sh)
demo: Read called, uio_resid=65536 bytes requested
demo: Read completed, transferred 25 bytes
demo: Device closed (pid=1238)

Step 10: Test With dd (Controlled I/O)

% dd if=/dev/demo bs=10 count=1 2>/dev/null
Hello from

% dd if=/dev/demo bs=100 count=1 2>/dev/null
Hello from demo driver!

What this shows:

  • First dd: Requested 10 bytes, got 10 bytes ("Hello from")
  • Second dd: Requested 100 bytes, got 25 bytes (our full message)
  • The driver respects the requested size via uio_resid

Step 11: Verify Unload Protection

Open the device and keep it open:

% (sleep 30; echo "Done") > /dev/demo &
[1] 1240

Now try to unload (in the same 30-second window):

% sudo kldunload demo
[hangs... waiting...]

After 30 seconds:

Done
demo: Device closed (pid=1240)
demo: Device /dev/demo destroyed
[kldunload completes]

What happened: destroy_dev() waited for the write operation to complete before allowing the unload. This is a CRITICAL safety feature - it prevents crashes from unloading code that's still executing.

Step 12: Final Cleanup

% sudo kldunload demo    # If still loaded
% ls -l /dev/demo
ls: /dev/demo: No such file or directory  # Good - it's gone

Behind the Scenes: The Complete Path

Let's trace cat /dev/demo from shell to driver and back:

1. Shell executes cat

User space:
  Shell forks, execs /bin/cat with argument "/dev/demo"

2. cat opens the file

User space:
  cat: fd = open("/dev/demo", O_RDONLY);

Kernel:
   ->  VFS layer: Lookup "/dev/demo" in devfs
   ->  devfs: Find cdev structure (created by make_dev)
   ->  devfs: Allocate file descriptor, file structure
   ->  devfs: Call cdev->si_devsw->d_open (demo_open)
  
Kernel (in demo_open):
   ->  printf("Device opened...")
   ->  return 0 (success)
  
Kernel:
   ->  Return file descriptor to cat
  
User space:
  cat: fd = 3 (success)

3. cat reads data

User space:
  cat: n = read(fd, buffer, 65536);

Kernel:
   ->  VFS: Lookup file descriptor 3
   ->  VFS: Find associated cdev
   ->  VFS: Allocate and initialize uio structure:
      uio_rw = UIO_READ
      uio_resid = 65536 (requested size)
      uio_offset = 0
      [iovec array pointing to cat's buffer]
   ->  VFS: Call cdev->si_devsw->d_read (demo_read)
  
Kernel (in demo_read):
   ->  printf("Read called, uio_resid=65536...")
   ->  len = MIN(65536, 24)  # We have 25 bytes (24 + null)
   ->  uiomove("Hello from demo driver!\n", 24, uio)
       ->  Copy 24 bytes from kernel message[] to cat's buffer
       ->  Update uio_resid: 65536 - 24 = 65512
   ->  printf("Read completed, transferred 24 bytes")
   ->  return 0
  
Kernel:
   ->  Calculate transferred = (original resid - final resid) = 24
   ->  Return 24 to cat
  
User space:
  cat: n = 24 (got 24 bytes)

4. cat processes data

User space:
  cat: write(STDOUT_FILENO, buffer, 24);
  [Your terminal shows: Hello from demo driver!]

5. cat tries to read more

User space:
  cat: n = read(fd, buffer, 65536);  # Try to read more
  
Kernel:
   ->  Call demo_read again
   ->  uiomove returns 24 bytes again (we always return same message)
  
User space:
  cat: n = 24
  cat: write(STDOUT_FILENO, buffer, 24);
  [Would print again, but cat knows this is a device not a file]

Actually, cat will keep reading until it gets 0 bytes (EOF). Our driver never returns 0, so cat would hang! But typically cat times out or you hit Ctrl+C.

Better read() implementation for file-like behavior:

static size_t bytes_sent = 0;  /* Track position */

static int
demo_read(struct cdev *dev __unused, struct uio *uio, int ioflag __unused)
{
    char message[] = "Hello from demo driver!\n";
    size_t len;
    
    /* If we already sent the message, return 0 (EOF) */
    if (bytes_sent >= sizeof(message) - 1) {
        bytes_sent = 0;  /* Reset for next open */
        return (0);  /* EOF */
    }
    
    len = MIN(uio->uio_resid, sizeof(message) - 1 - bytes_sent);
    uiomove(message + bytes_sent, len, uio);
    bytes_sent += len;
    
    return (0);
}

But for our demo, the simple version is fine.

6. cat closes file

User space:
  cat: close(fd);
  
Kernel:
   ->  VFS: Decrement file reference count
   ->  VFS: If last reference, call cdev->si_devsw->d_close (demo_close)
  
Kernel (in demo_close):
   ->  printf("Device closed...")
   ->  return 0
  
Kernel:
   ->  Free file descriptor
   ->  Return to cat
  
User space:
  cat: exit(0)

Concept Deep-Dive: Why uiomove()?

Question: Why can't we just use memcpy() or direct pointer access?

Answer: User space and kernel space have separate address spaces.

Address space separation:

User space (cat process):
  Address 0x1000: cat's buffer[0]
  Address 0x1001: cat's buffer[1]
  ...
  
Kernel space:
  Address 0x1000: DIFFERENT memory (maybe page tables)
  Address 0x1001: DIFFERENT memory

A pointer that's valid in user space (like cat's buffer at 0x1000) is meaningless in kernel space. If you try:

/* WRONG - WILL CRASH */
char *user_buf = (char *)0x1000;  /* User's buffer address */
strcpy(user_buf, "data");  /* KERNEL PANIC! */

The kernel will try to write to address 0x1000 in kernel address space, which is completely different memory. At best, you corrupt kernel data. At worst, immediate panic.

What uiomove() does:

  1. Validates: Checks that user addresses are actually in user space
  2. Maps: Temporarily maps user pages into kernel address space
  3. Copies: Performs the copy using valid kernel addresses
  4. Unmaps: Cleans up the temporary mapping
  5. Handles faults: If user buffer is invalid, returns EFAULT

This is why every driver must use uiomove(), copyin(), or copyout() for user data transfer. Direct access is always wrong and dangerous.

Success Criteria

  • Driver compiles without errors
  • Module loads successfully
  • Device node /dev/demo appears with correct permissions
  • Can read from device (get message)
  • Can write to device (message logged in dmesg)
  • Operations appear in dmesg with correct PIDs
  • Module can be unloaded cleanly
  • evice node disappears after unload
  • Unload waits for operations to complete (tested with sleep experiment)
  • No kernel panics or crashes

What You Learned

Technical skills:

  • Creating character device nodes with make_dev()
  • Implementing cdevsw method table
  • Safe user-kernel data transfer with uiomove()
  • Proper resource cleanup with destroy_dev()
  • Debugging with printf() and dmesg

Concepts:

  • How syscalls (open/read/write/close) map to driver functions
  • The role of cdevsw as a dispatch table
  • Why uiomove() is necessary (address space separation)
  • How destroy_dev() provides synchronization
  • The relationship between cdev, devfs, and /dev entries

Best practices:

  • Always check make_dev() return value
  • Always check for NULL before destroy_dev()
  • Set pointers to NULL after freeing
  • Use MIN() to prevent buffer overruns
  • Log operations for debugging

Common Mistakes and How to Avoid Them

Mistake 1: Using memcpy() instead of uiomove()

Wrong:

memcpy(user_buffer, kernel_data, size);  /* CRASH! */

Right:

uiomove(kernel_data, size, uio);  /* Safe */

Mistake 2: Not consuming all write data

Wrong:

demo_write(...) {
    /* Only process part of the data */
    uiomove(buffer, 10, uio);
    return (0);  /* BUG: uio_resid is not 0! */
}

Result: Kernel calls demo_write() again with remaining data -> infinite loop

Right:

demo_write(...) {
    /* Process ALL data */
    len = MIN(uio->uio_resid, buffer_size);
    uiomove(buffer, len, uio);
    /* Now uio_resid = 0, or we return EFBIG if too much */
    return (0);
}

Mistake 3: Forgetting NULL check before destroy_dev()

Wrong:

MOD_UNLOAD:
    destroy_dev(demo_dev);  /* What if make_dev failed? */

Right:

MOD_UNLOAD:
    if (demo_dev != NULL) {
        destroy_dev(demo_dev);
        demo_dev = NULL;
    }

Mistake 4: Wrong permissions on device node

If you use 0600 permissions:

make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0600, "demo");

Regular users can't access it:

% cat /dev/demo
cat: /dev/demo: Permission denied

Use 0666 for world-accessible devices (appropriate for learning/testing).

Lab Logbook Entry Template

Lab 3 Complete: [Date]

Time taken: ___ minutes

Build results:
- Compilation: [ ] Success  [ ] Errors
- Module size: ___ KB

Testing results:
- Device node created: [ ] Yes  [ ] No
- Permissions correct: [ ] Yes  [ ] No (expected: crw-rw-rw-)
- Read test: [ ] Success  [ ] Failed
- Write test: [ ] Success  [ ] Failed
- Multiple operations: [ ] Success  [ ] Failed
- Unload protection: [ ] Tested  [ ] Not tested

Key insight:
[What did you learn about user-kernel data transfer?]

Most interesting discovery:
[What surprised you? Maybe how destroy_dev waits?]

Challenges faced:
[Any build errors? Runtime issues? How did you resolve them?]

Code understanding:
- uiomove() purpose: [Explain in your own words]
- cdevsw role: [Explain in your own words]
- Why NULL checks matter: [Explain in your own words]

Next steps:
[Ready for Lab 4: deliberate bugs and error handling]

Lab 4: Error Handling and Defensive Programming

Goal

Learn error handling by deliberately introducing bugs, observing symptoms, and fixing them properly. Develop defensive programming instincts for driver development.

What You'll Learn

  • What happens when cleanup is incomplete
  • How to detect resource leaks
  • The importance of cleanup order
  • How to handle allocation failures
  • Defensive programming techniques (NULL checks, pointer clearing)
  • How to debug driver issues using kernel logs and system tools

Prerequisites

  • Completed Lab 3 (Demo Device)
  • Understanding of demo.c code structure
  • Ability to edit C code and rebuild

Time Estimate

30-40 minutes (deliberate breaking, observing, and fixing)

Important Safety Note

These experiments involve deliberately crashing your driver (not the kernel, just the driver). This is safe in your lab VM but demonstrates real bugs you must avoid in production code.

Always:

  • Use your lab VM, never your host system
  • Take a VM snapshot before starting
  • Be prepared to reboot if something hangs

Part 1: The Resource Leak Bug

Experiment 1A: Forget to destroy_dev()

Goal: See what happens when you forget to clean up device nodes.

Step 1: Edit demo.c, comment out destroy_dev():

case MOD_UNLOAD:
    if (demo_dev != NULL) {
        /* destroy_dev(demo_dev);  */  /* COMMENTED OUT - BUG! */
        demo_dev = NULL;
        printf("demo: Device /dev/demo destroyed\n");  /* LIE! */
    }
    break;

Step 2: Rebuild and load:

% make clean && make
% sudo kldload ./demo.ko
% ls -l /dev/demo
crw-rw-rw-  1 root  wheel  0x5e Nov 14 17:00 /dev/demo

Step 3: Unload the module:

% sudo kldunload demo
% dmesg | tail -1
demo: Device /dev/demo destroyed  # Lied!

Step 4: Check if device still exists:

% ls -l /dev/demo
crw-rw-rw-  1 root  wheel  0x5e Nov 14 17:00 /dev/demo  # STILL THERE!

Step 5: Try to use the orphaned device:

% cat /dev/demo

Symptoms you might see:

  • Hang (cat blocks forever)
  • Kernel panic (jumps to unmapped memory)
  • Error message about invalid device

Step 6: Check for leaks:

% vmstat -m | grep cdev
    cdev     10    15K     -    1442     16,32,64

The count may be higher than before you started.

Step 7: Reboot to clean up:

% sudo reboot

What you learned:

  • Orphaned device nodes persist in /dev even when driver unloads
  • Trying to use orphaned devices causes undefined behavior (crash, hang, or errors)
  • This is a resource leak - the cdev structure and device node are never freed
  • Always call destroy_dev() in cleanup path

Experiment 1B: Fix it properly

Step 1: Restore the destroy_dev() call:

case MOD_UNLOAD:
    if (demo_dev != NULL) {
        destroy_dev(demo_dev);  /* RESTORED */
        demo_dev = NULL;
        printf("demo: Device /dev/demo destroyed\n");
    }
    break;

Step 2: Rebuild, load, test, unload:

% make clean && make
% sudo kldload ./demo.ko
% ls -l /dev/demo        # Exists
% cat /dev/demo          # Works
% sudo kldunload demo
% ls -l /dev/demo        # GONE - correct!

Success: Device node properly cleaned up.

Part 2: The Wrong Order Bug

Experiment 2A: Free before destroying

Goal: See why cleanup order matters.

Step 1: Add a malloc'd buffer to demo.c:

After static struct cdev *demo_dev = NULL;, add:

static char *demo_buffer = NULL;

Step 2: Allocate in MOD_LOAD:

case MOD_LOAD:
    /* Allocate a buffer */
    demo_buffer = malloc(128, M_TEMP, M_WAITOK | M_ZERO);
    printf("demo: Allocated buffer at %p\n", demo_buffer);
    
    demo_dev = make_dev(...);
    /* ... rest of load code ... */
    break;

Step 3: WRONG CLEANUP - free before destroy:

case MOD_UNLOAD:
    /* BUG: Free while device is still accessible! */
    if (demo_buffer != NULL) {
        free(demo_buffer, M_TEMP);
        demo_buffer = NULL;
        printf("demo: Freed buffer\n");
    }
    
    /* Device is still alive and can be opened! */
    if (demo_dev != NULL) {
        destroy_dev(demo_dev);
        demo_dev = NULL;
    }
    break;

Step 4: Rebuild and test:

% make clean && make
% sudo kldload ./demo.ko

Step 5: While module is loaded, in another terminal:

% ( sleep 2; cat /dev/demo ) &  # Start delayed cat
% sudo kldunload demo           # Try to unload

Race condition:

  1. kldunload starts
  2. Your code frees demo_buffer
  3. destroy_dev() called
  4. Meanwhile, cat opened /dev/demo (device still existed!)
  5. demo_read() tries to use freed demo_buffer
  6. Use-after-free crash or corrupted data

Symptoms:

  • Kernel panic: "page fault in kernel mode"
  • Corrupted output
  • Hang

Step 6: Reboot to recover.

Experiment 2B: Fix the ordering

Correct order: Make device invisible FIRST, then free resources.

case MOD_UNLOAD:
    /* CORRECT: Destroy device first */
    if (demo_dev != NULL) {
        destroy_dev(demo_dev);  /* Waits for all operations */
        demo_dev = NULL;
    }
    
    /* Now safe - no one can call our functions */
    if (demo_buffer != NULL) {
        free(demo_buffer, M_TEMP);
        demo_buffer = NULL;
        printf("demo: Freed buffer\n");
    }
    break;

Why this works:

  1. destroy_dev() removes /dev/demo from filesystem
  2. destroy_dev() waits for any in-progress operations (like active reads)
  3. After destroy_dev() returns, no new operations can start
  4. Now it's safe to free demo_buffer - nothing can access it

Step 7: Rebuild and test:

% make clean && make
% sudo kldload ./demo.ko
% ( sleep 2; cat /dev/demo ) &
% sudo kldunload demo
# Works safely - no crash

What you learned:

  • Cleanup order is critical: Device invisible -> wait for operations -> free resources
  • destroy_dev() provides synchronization (waits for operations)
  • Reverse order of initialization: Last allocated, first freed

Part 3: The NULL Pointer Bug

Experiment 3A: Missing NULL check

Goal: See why NULL checks matter.

Step 1: Make make_dev() fail by using an existing name:

Load demo module, then try to load again in MOD_LOAD:

case MOD_LOAD:
    demo_dev = make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "demo");
    
    /* BUG: Don't check for NULL! */
    printf("demo: Device created at %p\n", demo_dev);  /* Might print NULL! */
    /* Continuing even though make_dev failed... */
    break;

Or simulate failure:

case MOD_LOAD:
    demo_dev = NULL;  /* Simulate make_dev failure */
    /* BUG: No check! */
    printf("demo: Device created at %p\n", demo_dev);
    break;

Step 2: Try to unload without NULL check:

case MOD_UNLOAD:
    /* BUG: No NULL check! */
    destroy_dev(demo_dev);  /* Passing NULL to destroy_dev! */
    break;

Step 3: Test:

% make clean && make
% sudo kldload ./demo.ko
# Module "loads" but device wasn't created
% sudo kldunload demo
# Might panic or crash

Symptoms:

  • Kernel panic in destroy_dev
  • "panic: bad address"
  • System hang

Experiment 3B: Proper NULL checking

case MOD_LOAD:
    demo_dev = make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "demo");
    
    /* ALWAYS check return value! */
    if (demo_dev == NULL) {
        printf("demo: Failed to create device node\n");
        return (ENXIO);  /* Abort load */
    }
    
    printf("demo: Device /dev/demo created successfully\n");
    break;

case MOD_UNLOAD:
    /* ALWAYS check for NULL before using pointer! */
    if (demo_dev != NULL) {
        destroy_dev(demo_dev);
        demo_dev = NULL;  /* Clear pointer for safety */
    }
    break;

Defensive programming rules:

  1. Check every allocation: if (ptr == NULL) handle_error();
  2. Check before freeing: if (ptr != NULL) free(ptr);
  3. Clear after freeing: ptr = NULL; (defense against use-after-free)

Part 4: The Allocation Failure Bug

Experiment 4: Handling malloc failures

Goal: Learn to handle M_NOWAIT allocation failures.

Step 1: Add allocation to attach:

case MOD_LOAD:
    /* Allocate with M_NOWAIT - can fail! */
    demo_buffer = malloc(128, M_TEMP, M_NOWAIT | M_ZERO);
    
    /* BUG: Don't check for NULL */
    strcpy(demo_buffer, "Hello");  /* CRASH if malloc failed! */
    
    demo_dev = make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "demo");
    /* ... */
    break;

If malloc fails (rare but possible):

panic: page fault while in kernel mode
fault virtual address = 0x0
fault code = supervisor write data
instruction pointer = 0x8:0xffffffff12345678
current process = 1234 (kldload)

Step 2: Fix with proper error handling:

case MOD_LOAD:
    /* Allocate with M_NOWAIT */
    demo_buffer = malloc(128, M_TEMP, M_NOWAIT | M_ZERO);
    if (demo_buffer == NULL) {
        printf("demo: Failed to allocate buffer\n");
        return (ENOMEM);  /* Out of memory */
    }
    
    /* Now safe to use */
    strcpy(demo_buffer, "Hello");
    
    /* Create device */
    demo_dev = make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "demo");
    if (demo_dev == NULL) {
        printf("demo: Failed to create device node\n");
        /* BUG: Forgot to free demo_buffer! */
        return (ENXIO);
    }
    
    printf("demo: Device created successfully\n");
    break;

Wait, there's still a bug! If make_dev() fails, we return without freeing demo_buffer.

Step 3: Fix with complete error unwinding:

case MOD_LOAD:
    int error = 0;
    
    /* Allocate buffer */
    demo_buffer = malloc(128, M_TEMP, M_NOWAIT | M_ZERO);
    if (demo_buffer == NULL) {
        printf("demo: Failed to allocate buffer\n");
        return (ENOMEM);
    }
    
    strcpy(demo_buffer, "Hello");
    
    /* Create device */
    demo_dev = make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "demo");
    if (demo_dev == NULL) {
        printf("demo: Failed to create device node\n");
        error = ENXIO;
        goto fail;
    }
    
    printf("demo: Device created successfully\n");
    return (0);  /* Success */
    
fail:
    /* Error cleanup - undo everything we did */
    if (demo_buffer != NULL) {
        free(demo_buffer, M_TEMP);
        demo_buffer = NULL;
    }
    return (error);

Error unwinding pattern:

  1. Each allocation step can fail
  2. On failure, undo everything done before
  3. Common pattern: use goto fail to centralize cleanup
  4. Free in reverse order of allocation

Part 5: Complete Example with Full Error Handling

Here's a template showing all best practices:

case MOD_LOAD:
    int error = 0;
    
    /* Step 1: Allocate buffer */
    demo_buffer = malloc(128, M_TEMP, M_WAITOK | M_ZERO);
    if (demo_buffer == NULL) {  /* Paranoid - M_WAITOK shouldn't fail */
        error = ENOMEM;
        goto fail_0;  /* Nothing to clean up yet */
    }
    
    strcpy(demo_buffer, "Initialized");
    
    /* Step 2: Create device node */
    demo_dev = make_dev(&demo_cdevsw, 0, UID_ROOT, GID_WHEEL, 0666, "demo");
    if (demo_dev == NULL) {
        printf("demo: Failed to create device node\n");
        error = ENXIO;
        goto fail_1;  /* Need to free buffer */
    }
    
    /* Success! */
    printf("demo: Module loaded successfully\n");
    return (0);

/* Error unwinding - labels in reverse order of operations */
fail_1:
    /* Failed after allocating buffer */
    free(demo_buffer, M_TEMP);
    demo_buffer = NULL;
fail_0:
    /* Failed before allocating anything */
    return (error);

Why this pattern works:

  • Each fail_N label knows exactly what was allocated up to that point
  • Cleanup happens in reverse order (last allocated, first freed)
  • Single return point for errors makes debugging easier
  • All error paths properly clean up

Debugging Checklist: Finding Driver Bugs

When your driver misbehaves, check these systematically:

1. Check dmesg for kernel messages

% dmesg | tail -20
% dmesg | grep -i panic
% dmesg | grep -i "page fault"

Look for:

  • Panic messages
  • "sleeping with lock held"
  • "lock order reversal"
  • Your driver's printf messages

2. Check for resource leaks

Before loading module:

% vmstat -m | grep cdev > before.txt

After load + unload:

% vmstat -m | grep cdev > after.txt
% diff before.txt after.txt

If counts increased, you have a leak.

3. Check for orphaned devices

% ls -l /dev/ | grep demo

If /dev/demo exists after unload, you forgot destroy_dev().

4. Test unload under load

% ( sleep 10; cat /dev/demo ) &
% sudo kldunload demo

Should wait for cat to finish. If it crashes, you have a race condition.

5. Check module state

% kldstat -v | grep demo

Shows dependencies and references.

Success Criteria

  • Observed orphaned device node (Experiment 1A)
  • Fixed with proper destroy_dev() (Experiment 1B)
  • Observed use-after-free crash (Experiment 2A)
  • Fixed with correct cleanup order (Experiment 2B)
  • Understood NULL pointer dangers (Experiment 3)
  • Implemented proper NULL checking (Experiment 3B)
  • Learned error unwinding pattern (Experiment 4)
  • Can identify resource leaks with vmstat
  • Can debug with dmesg

What You Learned

Bug types:

  • Resource leaks (forgotten destroy_dev)
  • Use-after-free (wrong cleanup order)
  • NULL pointer dereference (missing checks)
  • Memory leaks (failed error unwinding)

Defensive programming:

  • Always check return values
  • Always NULL-check before using pointers
  • Clean up in reverse order of initialization
  • Clear pointers after freeing (ptr = NULL)
  • Use goto for error unwinding

Debugging techniques:

  • Using dmesg to track operations
  • Using vmstat to detect leaks
  • Testing unload under load
  • Deliberately introducing bugs to understand symptoms

Patterns to follow:

/* Allocation */
ptr = malloc(size, type, M_WAITOK);
if (ptr == NULL) {
    error = ENOMEM;
    goto fail;
}

/* Device creation */
dev = make_dev(...);
if (dev == NULL) {
    error = ENXIO;
    goto fail_after_malloc;
}

/* Success */
return (0);

/* Error cleanup */
fail_after_malloc:
    free(ptr, type);
    ptr = NULL;
fail:
    return (error);

Lab Logbook Entry Template

Lab 4 Complete: [Date]

Time taken: ___ minutes

Experiments conducted:
- Orphaned device: [ ] Observed  [ ] Fixed
- Wrong cleanup order: [ ] Observed crash  [ ] Fixed
- NULL pointer bug: [ ] Observed  [ ] Fixed
- Error unwinding: [ ] Implemented  [ ] Tested

Most valuable insight:
[What "clicked" about error handling?]

Bugs I've seen before:
[Have you made similar mistakes in userspace code?]

Defensive programming rules I'll remember:
1. [e.g., "Always check malloc return"]
2. [e.g., "Cleanup in reverse order"]
3. [e.g., "Set pointers to NULL after free"]

Debugging techniques learned:
[Which debugging method was most useful?]

Ready for Chapter 7:
[ ] Yes - I understand error handling
[ ] Need more practice - I'll review the error patterns again

Labs Summary and Next Steps

Congratulations! You've completed all four labs. Here's what you've accomplished:

Lab Progression Summary

LabWhat You BuiltKey Skill
Lab 1Navigation skillsRead and understand driver code
Lab 2Minimal moduleBuild and load kernel modules
Lab 3Character deviceCreate /dev nodes, implement I/O
Lab 4Error handlingDefensive programming, debugging

Key Concepts Mastered

Module lifecycle:

  • MOD_LOAD -> initialize
  • MOD_UNLOAD -> cleanup
  • DECLARE_MODULE registration

Device framework:

  • cdevsw as method dispatch table
  • make_dev() to create /dev entries
  • destroy_dev() for cleanup + synchronization

Data transfer:

  • uiomove() for safe user-kernel copying
  • uio structure for I/O requests
  • uio_resid tracking

Error handling:

  • NULL checking all allocations
  • Reverse-order cleanup
  • Error unwinding with goto
  • Resource leak prevention

Debugging:

  • Using dmesg for kernel logs
  • vmstat for resource tracking
  • Testing under load

Your Driver Development Toolkit

You now have a solid foundation of:

  1. Pattern recognition: You can look at any FreeBSD driver and identify its structure
  2. Practical skills: You can build, load, test, and debug kernel modules
  3. Safety knowledge: You understand common bugs and how to avoid them
  4. Debugging ability: You can diagnose problems using system tools

Celebrate Your Achievement!

You've completed hands-on labs that many developers skip. You didn't just read about drivers, you built them, broke them, and fixed them. This experiential learning is invaluable.

Wrapping Up

Congratulations! You've completed a comprehensive tour of FreeBSD driver anatomy. Let's recap what you've learned and where we're heading next.

What You Now Know

Vocabulary - You can speak the language of FreeBSD drivers:

  • newbus: The device framework (probe/attach/detach)
  • devclass: Grouping of related devices
  • softc: Per-device private data structure
  • cdevsw: Character device switch (entry point table)
  • ifnet: Network interface structure
  • GEOM: Storage layer architecture
  • devfs: Dynamic device filesystem

Structure - You recognize driver patterns instantly:

  • Probe functions check device IDs and return priority
  • Attach functions initialize hardware and create device nodes
  • Detach functions clean up in reverse order
  • Method tables map kernel calls to your functions
  • Module declarations register with the kernel

Lifecycle - You understand the flow:

  1. Bus enumeration discovers hardware
  2. Probe functions compete for devices
  3. Attach functions initialize winners
  4. Devices operate (read/write, transmit/receive)
  5. Detach functions clean up on unload

Entry points - You know how user programs reach your driver:

  • Character devices: open/close/read/write/ioctl via /dev
  • Network interfaces: transmit/receive via network stack
  • Storage devices: bio requests via GEOM/CAM

What You Can Now Do

  • Navigate the FreeBSD kernel source tree confidently
  • Recognize common driver patterns (probe/attach/detach, cdevsw)
  • Understand probe/attach/detach lifecycle
  • Build kernel modules with proper Makefiles
  • Load and unload modules safely
  • Create character device nodes with appropriate permissions
  • Implement basic I/O operations (open/close/read/write)
  • Use uiomove() correctly for user-kernel data transfer
  • Handle errors and clean up resources properly
  • Debug with dmesg and system tools
  • Avoid common pitfalls (resource leaks, wrong cleanup order, NULL pointers)

Mindset Shift

Notice the shift in this chapter:

  • Chapter 1-5: Foundations (UNIX, C, kernel C)
  • Chapter 6 (this one): Structure and patterns (recognition)
  • Chapter 7+: Implementation (building)

You've crossed a threshold. You're no longer just learning concepts, you're ready to write real kernel code. This is exciting and a little intimidating, and that's exactly right.

Final Thoughts

Driver development is like learning a musical instrument. At first, the patterns feel foreign and complex. But with practice, they become second nature. You'll start to see probe/attach/detach everywhere you look. You'll recognize cdevsw instantly. You'll know what "allocate resources, check for errors, clean up on failure" means without thinking.

Trust the process. The labs were just the beginning. In Chapter 7, you'll write more code, make mistakes, debug them, and build confidence. By Chapter 8, driver structure will feel natural.

Before You Move On

Take a moment to:

  • Review your lab logbook - What surprised you? What clicked?
  • Revisit any confusing sections - Now that you've done the labs, re-reading makes more sense
  • Browse one more driver - Pick any from /usr/src/sys/dev and see how much you recognize

Challenge Exercises (Optional)

These optional exercises deepen your understanding and build confidence. They're more open-ended than labs but still safe for beginners. Complete as many as you like before moving to Chapter 7.

Challenge 1: Trace a Lifecycle in dmesg

Goal: Capture and annotate real driver lifecycle messages.

Instructions:

  1. Choose a driver that's loadable as a module (e.g., if_em, snd_hda, usb)
  2. Set up logging:
    % tail -f /var/log/messages > ~/driver_lifecycle.log &
    
  3. Load the driver:
    % sudo kldload if_em
    
  4. Watch the attach sequence in real time
  5. Unload the driver:
    % sudo kldunload if_em
    
  6. Stop logging (kill the tail process)
  7. Annotate the log file:
    • Mark where probe was called
    • Mark where attach happened
    • Mark resource allocations
    • Mark where detach cleaned up
  8. Write a one-page summary explaining the lifecycle you observed

Success criteria: Your annotated log shows clear understanding of when each lifecycle phase occurred.

Challenge 2: Map the Entry Points

Goal: Completely document a driver's cdevsw structure.

Instructions:

  1. Open /usr/src/sys/dev/null/null.c
  2. Create a table:
Entry PointFunction NamePresent?What It Does
d_open???
d_close???
d_read???
d_write???
d_ioctl???
d_poll???
d_mmap???
  1. Fill in the table
  2. For missing entry points, explain why they're not needed
  3. For present entry points, describe what they do in 1-2 sentences
  4. Repeat for /usr/src/sys/dev/led/led.c
  5. Compare the two tables: What's similar? What's different? Why?

Success criteria: Your tables are accurate and your explanations demonstrate understanding.

Challenge 3: Classification Drill

Goal: Practice identifying driver families by examining source code.

Instructions:

  1. Choose five random drivers from /usr/src/sys/dev/

    % ls /usr/src/sys/dev | shuf | head -5
    
  2. For each driver, create an entry in your logbook:

    • Driver name
    • Primary source file
    • Classification (character, network, storage, bus, or mixed)
    • Evidence (how did you determine the classification?)
    • Purpose (what does this driver do?)
  3. Verification: Use man 4 <drivername> to confirm your classification

Example entry:

Driver: led
File: sys/dev/led/led.c
Classification: Character device
Evidence: Has cdevsw structure, creates /dev/led/* nodes, no ifnet or GEOM
Purpose: Control system LEDs (keyboard lights, chassis indicators)
Man page: man 4 led (confirmed)

Success criteria: Correctly classified all five, with clear evidence for each.

Challenge 4: Error Code Audit

Goal: Understand error handling patterns in real drivers.

Instructions:

  1. Open /usr/src/sys/dev/uart/uart_core.c

  2. Find the uart_bus_attach() function

  3. List every error code returned (ENOMEM, ENXIO, EIO, etc.)

  4. For each, note:

    • What condition triggered it
    • What resources were freed before returning
    • Whether cleanup was complete
  5. Repeat for /usr/src/sys/dev/ahci/ahci.c (ahci_attach function)

  6. Write a short essay (1-2 pages):

    • Common error handling patterns you observed
    • How drivers ensure no resource leaks
    • Best practices you can apply to your own code

Success criteria: Your essay demonstrates understanding of proper error unwinding.

Challenge 5: Dependency Detective

Goal: Understand module dependencies and load order.

Instructions:

  1. Find a driver that declares MODULE_DEPEND

    % grep -r "MODULE_DEPEND" /usr/src/sys/dev/usb | head -5
    
  2. Pick one example (e.g., a USB driver)

  3. Open the source file and find all MODULE_DEPEND declarations

  4. For each dependency:

    • What module does it depend on?
    • Why is this dependency needed? (What functions/types from that module are used?)
    • What would happen if you tried to load without the dependency?
  5. Test it:

    % sudo kldload <dependency_module>
    % sudo kldload <your_driver>
    % kldstat
    
  6. Try to unload the dependency while your driver is loaded:

    % sudo kldunload <dependency_module>
    

    What happens? Why?

  7. Document your findings: Draw a dependency graph showing the relationships.

Success criteria: You can explain why each dependency exists and predict load order.

Summary

These challenges develop:

  • Challenge 1: Real-world lifecycle observation
  • Challenge 2: Entry point mastery
  • Challenge 3: Pattern recognition across drivers
  • Challenge 4: Error handling discipline
  • Challenge 5: Dependency understanding

Optional: Share your challenge results in the FreeBSD forums or mailing lists. The community loves seeing newcomers dive deep!

Summary Reference Table - Driver Building Blocks at a Glance

This one-screen cheat sheet maps concepts to implementations. Bookmark this page for quick reference while working on Chapter 7 and beyond.

ConceptWhat It IsTypical API/StructureWhere in TreeWhen You'll Use It
device_tOpaque device handledevice_t dev<sys/bus.h>Every driver function (probe/attach/detach)
softcPer-device private datastruct mydriver_softcYou define itStore state, resources, locks
devclassDevice class groupingdevclass_t<sys/bus.h>Auto-managed by DRIVER_MODULE
cdevswCharacter device switchstruct cdevsw<sys/conf.h>Character device entry points
d_openOpen handlerd_open_tIn your cdevswInitialize per-session state
d_closeClose handlerd_close_tIn your cdevswClean up per-session state
d_readRead handlerd_read_tIn your cdevswTransfer data to user
d_writeWrite handlerd_write_tIn your cdevswAccept data from user
d_ioctlIoctl handlerd_ioctl_tIn your cdevswConfiguration and control
uiomoveCopy to/from userint uiomove(...)<sys/uio.h>In read/write handlers
make_devCreate device nodestruct cdev *make_dev(...)<sys/conf.h>In attach (character devices)
destroy_devRemove device nodevoid destroy_dev(...)<sys/conf.h>In detach
ifnet (if_t)Network interfaceif_t<net/if_var.h>Network drivers
ether_ifattachRegister Ethernet ifvoid ether_ifattach(...)<net/ethernet.h>Network driver attach
ether_ifdetachUnregister Ethernet ifvoid ether_ifdetach(...)<net/ethernet.h>Network driver detach
GEOM providerStorage providerstruct g_provider<geom/geom.h>Storage drivers
bioBlock I/O requeststruct bio<sys/bio.h>Storage I/O handling
bus_alloc_resourceAllocate resourcestruct resource *<sys/bus.h>Attach (memory, IRQ, etc.)
bus_release_resourceRelease resourcevoid<sys/bus.h>Detach cleanup
bus_space_read_NRead registeruint32_t bus_space_read_4(...)<machine/bus.h>Hardware register access
bus_space_write_NWrite registervoid bus_space_write_4(...)<machine/bus.h>Hardware register access
bus_setup_intrRegister interruptint bus_setup_intr(...)<sys/bus.h>Attach (interrupt setup)
bus_teardown_intrUnregister interruptint bus_teardown_intr(...)<sys/bus.h>Detach cleanup
device_printfDevice-specific logvoid device_printf(...)<sys/bus.h>All driver functions
device_get_softcRetrieve softcvoid *device_get_softc(device_t)<sys/bus.h>First line of most functions
device_set_descSet device descriptionvoid device_set_desc(...)<sys/bus.h>In probe function
DRIVER_MODULERegister driverMacro<sys/module.h>Once per driver (end of file)
MODULE_VERSIONDeclare versionMacro<sys/module.h>Once per driver
MODULE_DEPENDDeclare dependencyMacro<sys/module.h>If you depend on other modules
DEVMETHODMap method to functionMacro<sys/bus.h>In method table
DEVMETHOD_ENDEnd method tableMacro<sys/bus.h>Last entry in method table
mtxMutex lockstruct mtx<sys/mutex.h>Protect shared state
mtx_initInitialize mutexvoid mtx_init(...)<sys/mutex.h>In attach
mtx_destroyDestroy mutexvoid mtx_destroy(...)<sys/mutex.h>In detach
mtx_lockAcquire lockvoid mtx_lock(...)<sys/mutex.h>Before accessing shared data
mtx_unlockRelease lockvoid mtx_unlock(...)<sys/mutex.h>After accessing shared data
mallocAllocate memoryvoid *malloc(...)<sys/malloc.h>Dynamic allocation
freeFree memoryvoid free(...)<sys/malloc.h>Cleanup
M_WAITOKWait for memoryFlag<sys/malloc.h>malloc flag (can sleep)
M_NOWAITDon't waitFlag<sys/malloc.h>malloc flag (returns NULL if unavailable)

Quick Lookup by Task

Need to...Use ThisMan Page
Create a character devicemake_dev()make_dev(9)
Read/write hardware registersbus_space_read/write_N()bus_space(9)
Allocate hardware resourcesbus_alloc_resource()bus_alloc_resource(9)
Set up interruptsbus_setup_intr()bus_setup_intr(9)
Copy data to/from useruiomove()uio(9)
Log a messagedevice_printf()device(9)
Protect shared datamtx_lock() / mtx_unlock()mutex(9)
Register a driverDRIVER_MODULE()DRIVER_MODULE(9)

Probe/Attach/Detach Quick Reference

/* Probe - Check if we can handle this device */
static int mydrv_probe(device_t dev) {
    /* Check IDs, return BUS_PROBE_DEFAULT or ENXIO */
}

/* Attach - Initialize device */
static int mydrv_attach(device_t dev) {
    sc = device_get_softc(dev);
    /* Allocate resources */
    /* Initialize hardware */
    /* Create device node or register interface */
    return (0);  /* or error code */
}

/* Detach - Clean up */
static int mydrv_detach(device_t dev) {
    sc = device_get_softc(dev);
    /* Reverse order of attach */
    /* Check pointers before freeing */
    /* Set pointers to NULL after freeing */
    return (0);  /* or EBUSY if can't detach */
}
Chapter Info
Status
Complete
Estimated Read Time
1080 minutes
Last Updated
11/1/2025
Author
Edson Brandi
Edit on GitHub