A while ago I was working on wireless access points (APs) based on OpenWrt. One day I discovered that remote logging wasn't working, and the debugging that followed had some surprises.
On OpenWrt, there's a process called logread
responsible for
shipping the logs to another device via the syslog protocol. These
APs don't persist their logs between boots, so sending logs to a
system that can store them was essential for diagnosing problems. I
noticed logread
wasn't running, though it starts on boot, so I added
something to the init script to restart logread
if it crashed, and
was going to call it a day. But I went to test it, and the logs
weren't showing up; sometimes, the logs would show up right after the
AP booted, but then at some point, it would stop working.
I had already spent a lot of time on the other side of this, the
syslog that receives the logs, and was pretty sure the setup was
correct there. So I ran logread
by hand, and it failed with
failed to connect: Permission denied
What? Permission denied? I read the code to find out where this was
happening, and it was in usock()
, which is some socket code that's
used all over OpenWrt, and there were no obvious calls that could fail
with EACCES
in it.
After checking some ACLs, making sure this couldn't possibly be a
permission problem (it's running as root), I decided to strace
logread
(this required rebuilding the flash image for the AP, which
is why I didn't do it earlier), and saw:
socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_UDP) = 8 connect(8, {sa_family=AF_INET, sin_port=htons(65535), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 close(8) = 0 socket(AF_INET6, SOCK_DGRAM|SOCK_CLOEXEC, IPPROTO_UDP) = 8 connect(8, {sa_family=AF_INET6, sin6_port=htons(65535), sin6_flowinfo=htonl(0), inet_pton(AF_INET6, "::1", &sin6_addr), sin6_scope_id=0}, 28) = -1 EACCES (Perm ission denied) close(8) = 0
What the heck? First off, the connection logread
is trying to make
in this case is a TCP connection, and we're giving it an IP address;
why is it making UDP connections to localhost
? And why are those
connections failing?
I had a guess on why this started happening – a little while before,
IPv6 had been disabled on these devices. Maybe it hadn't been done
thoroughly enough? I checked ip addr
, and lo
definitely did not
have ::1
as an address, and IPv6 was disabled through the
disable_ipv6
sysctl
.
I decided that it was probably a call to getaddrinfo()
making UDP
connections – maybe it's trying to resolve DNS – but why port 65535?
Is that just an ephemeral port it's choosing every single time?
I tested getaddrinfo
from Lua (the only interpreted language on the
device), but it worked fine, so there had to be something about how
usock
was calling it; did it want IPv6 addresses specifically or
something?
musl is the libc of choice on these devices. Checking its
implementation of getaddrinfo
, we see this block of code near the
top:
if (flags & AI_ADDRCONFIG) { /* Define the "an address is configured" condition for address * families via ability to create a socket for the family plus * routability of the loopback address for the family. */ // … static const struct sockaddr_in6 lo6 = { .sin6_family = AF_INET6, .sin6_port = 65535, .sin6_addr = IN6ADDR_LOOPBACK_INIT }; const void *ta[2] = { &lo4, &lo6 }; // … for (i=0; i<2; i++) { // … int s = socket(tf[i], SOCK_CLOEXEC|SOCK_DGRAM, IPPROTO_UDP); if (s>=0) { int cs; pthread_setcancelstate( PTHREAD_CANCEL_DISABLE, &cs); int r = connect(s, ta[i], tl[i]); pthread_setcancelstate(cs, 0); close(s); if (!r) continue; } switch (errno) { case EADDRNOTAVAIL: case EAFNOSUPPORT: case EHOSTUNREACH: case ENETDOWN: case ENETUNREACH: break; default: return EAI_SYSTEM; } // …
And sure enough, usock
always sets AI_ADDRCONFIG
on flags. So
this is a kind of probing connect musl is using to check the validity
of IPv4 or IPv6 on the system. The connect
is returning EACCES
,
but musl isn't handling it as part of the errors it considers
"normal". It bails out early, and leaves errno
set to EACCES
where logread
prints it out to mystify us.
But why would connect
fail with EACCES
? The man page doesn't list
anything that makes sense for this.1 Weirder still, I
decide to check if there are any IPv6 addresses at all – and there is
one, but for eth0
, not lo
. I delete it, and suddenly logread
works.
At this point I start looking for information about musl's
getaddrinfo
and this issue, and find a patch posted to the mailing
list with no replies, never applied.
Sweet! I head over to #musl
on IRC and ask them if it wasn't
applied for a reason, and they say it must have been overlooked. But
then someone tries to reproduce with the instructions in the patch,
and can't.
I dive into the kernel source trying to figure out what actually
returns EACCES
here. There is a lot of code under
ip6_datagram_connect
so I tried to grep and pray, but there were
still too many possibilites to know for sure. This is an opportunity
to use ftrace
! I had to rebuild the kernel, since these are
stripped down images for embedded devices, and I was worried
trace-cmd
might actually crash the device, but I got a capture fine.
I could see clearly that the last useful function called under
ip6_datagram_connect
was fib6_rule_action
, which can return
EACCES
, but why? What even are these rules?
I spent a while even trying to figure out what these rules are and how to manipulate them. It turns out they're for "policy-based routing" (PBR), which I hadn't really explored before. I didn't even realize some of these firewall-like policies could be handled at this level.
I was running the ip rule
command and not seeing anything
interesting, until I finally read the source for ip
and noticed it
defaults to IPv4 – I needed to run ip -6 rule
, but that flag is
in the ip(8)
manpage, not the ip-rule(8)
manpage for the
subcommand. But running it on the AP, I saw:
# ip -6 rule 0: from all lookup local 32766: from all lookup main 4200000001: from all iif lo lookup unspec 12 4200000002: from all iif eth0 lookup unspec 12 4200000003: from all iif eth1 lookup unspec 12
I'm not sure I fully understand these rules now, and it took a bit of
looking (and strace
to confirm the netlink message being sent) to
see that this is action "12", which isn't one of the actions in the
(mainline) kernel. But it was enough info to demonstrate that the
issue could be reproduced on any Linux system with
ip -6 rule add from all iif lo lookup unspec prohibit
Some discussion on the musl mailing list revealed that action 12 is a
special rule OpenWrt adds in their kernel. I discovered that
netifd
, which manages interfaces and rules on OpenWrt, was setting
IPv6 policies like this, even when IPv6 was disabled, so I patched
that out. And finally remote logging worked again.
This was a surprising set of interactions. Figuring it out was tractable thanks to having all the source for everything, and reasonable tools for introspection. Is there a moral to this story? Perhaps a few tidbits: strace and ftrace are good; getaddrinfo is bad; maybe don't disable IPv6; and blessed are those who update manpages.
At the time this happened, the man pages on my system didn't list this, but looking now, I see Stefan Puiu added a note about this, debugging much the same situation as mine. What a time saver this would have been.