A slight diversion from my serialization series. Last week, strcat submitted
#18885 pull request, which adds
support for using a Vec
as a Writer
. Over the weekend I submitted
#18980, which allows a &[u8]
to be used as a Reader
. Overall a pretty simple change. However, when I was
running the test suite, I found that the std::io::net::tcp::write_close_ip4()
test was occasionally failing:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 |
|
The write
would succeed a few times, then occasionally error with the
unknown error: Protocol wrong type for socket
, or the EPROTOTYPE
errno.
This is a really surprising error. As far as I know, the only functions that
return that errno are socket
, socketpair
, and connect
. I searched
everywhere, but I couldn’t find any documentation suggesting that write
would
ever produce that error.
I wasn’t the only one who ran into it. bjz opened #18900 describing the same problem. One interesting thing to note is they were also on OSX Yosemite. So I took a little bit of time to extract out that test from the Rust test suite into this gist and got someone on #rust-internals to run it on linux for me with this little driver script:
1 2 3 4 5 6 7 8 9 |
|
and it didn’t error out. So it seems to be system dependent. Further
experimentation showed that if we introduced sleeps or a mutex synchronization
appeared to fix the problem as well. At this point I wasn’t sure if this was a
non-issue, a bug in our code, or a bug in the OS. One things for sure though,
if there is a bug, it could be scattered somewhere across the Rust codebase,
which just std::io
alone is 20 files at around 12522 lines. It’d be a pain to
cut that down to a self contained test case.
Fortunately we’ve got C-Reduce to help us out. Back in May Professor John Regehr from the University of Utah came to our Bay Area Rust meetup to talk about compiler testing and fuzzing. We recorded the talk, so if you want to watch it, you can find it here. One of the things he talked about was the tool C-Reduce his research group developed to automatically cut out unnecessary lines of code that can still reproduce a bug you’re looking for. While it’s written to target C files, it turns out Rust is syntatically close enough to C that it works out pretty well for it too. All you need is a single source file and driver script that’ll report if the compiled source file reproduced the bug.
Aside 1: By the way, one of the other things I did this weekend was I put together a Homebrew pull request for C-Reduce. It hasn’t landed yet, but you want to use it, you can do:
1 2 |
|
Hopefully it’ll land soon so you’ll be able to do:
1
|
|
Anyway, back to the story. So we’ve got a rather large code base to cover, and
while C-reduce does a pretty good job of trimming away lines, just pointing it
at the entire std
module is a bit too much for it to handle in a reasonable
amount of time. It probably needs some more semantic information about Rust to
do a better job of cutting out broad swaths of code.
So I needed to do at least a rough pass to slim down the files. I figured the
problem was probably contained in std::io
or std::sys
, so I wrote a simple
test.rs
that explicitly included those modules as part of the crate (see
this gist if you are
interested), and used the pretty printer to merge it into one file:
1
|
|
Aside 2: Our pretty printer still has some bugs in it, which I filed:
19075 and
19077. Fortunately it was
pretty simple to fix those cases by hand in the generated std.rs
, and odds
are good they’ll be fixed by the time you might use this process.
Next up, we need a driver script. We can adapt our one from before. The only
special consideration is that we need to make sure to only exit with a return
code of 0 if the version of std.rs
we’re compiling errors with EPROTOTYPE
:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 |
|
I used the helper script timeout.sh to time out tests in case C-Reduce accidentally made us an infinite loop.
Finally we’re ready to start running C-Reduce:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
I let it run in the background for sometime in the background while I did some other things. When I got back, C-Reduce automatically reduced the file from 153KB to a slim 22KB. I then reran rust with the lints enabled to manually cut out the dead code C-Reduce failed to remove, and flattened away some unnecessary structs and methods. I was finally left with:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 |
|
This snippet reproduces the same EPROTOTYPE
that we started with at the top
of the post. Pretty cool that we got here without much effort?
Now At this point you might say to yourself that couldn’t I have extracted this
out myself? And yeah, you would be right. This is a pretty much a c-in-rust
implementation of this bug. But what’s great about using C-Reduce here is that
I only had to make some very rough guesses about what files were and were not
important to include in my test.rs
. Eventually when we get some rust plugins
written for C-Reduce I probably could just point it at the whole libstd
let
C-Reduce do it’s thing. Doing this by hand can be a pretty long and painful
manual process, especially if we’re trying to debug a codegen or runtime bug.
In the past I’ve spent hours reducing some codegen bugs down into a small
snippet that C-Reduce was also able to produce in a couple minutes.
The last step with this code was to eliminate Rust from the picture, and translate this code into C:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 |
|
This also produces EPROTOTYPE
, so we can eliminate Rust altogther. But lets
keep digging. What exactly is producing this error? If I was on Linux, I’d use
strace
, but that program isn’t on Macs. There’s a similar tool called
dtruss
, but that seemed to slow things down enough that the EPROTOTYPE
never happened. Fortunately though there is another program called errinfo
,
that just prints the errno
along with every syscall. In one terminal I ran
while ./test; do sleep 0.1; done
. In the other:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
|
Right there we see our sendto
syscall is actually returning the EPROTOTYPE
.
This errno
then is definitely being created inside the OSX kernel, not in any
userspace code. Fortunately, most of the Apple kernel, XNU, is open sourced, so
we can dig down to what’ll be the my last layer. You can find the tarballs at
http://www.opensource.apple.com/. But I’d rather use the
unoffical GitHub repository. Using GitHub’s
search tools, We can find all 17 instances of
EPROTOTYPE
in the codebase. Now I don’t know the XNU codebase, but there are still some
really interesting things we can find. The first is in
bsd/kern/uipc_usrreq.c:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
|
Hey look at that! There’s handler for the send
syscall (although for IPC, not
TCP) that actually documents that it can return EPROTOTYPE
! While it doesn’t
explain exactly how this can happen, the fact it mentions unp_connect
hints
that uipc_send
may trigger a connect, and that’s exactly what we find a
couple lines into the function:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
|
The fact that the comment says the socket might not be connected yet when we’re
doing a send
hints that Apple may have introduced some level of asynchrony
and preemption to sockets. So if we trigger the actual connect here, it could
then return EPROTOTYPE
, which makes sense. Unfortunately that’s still not
quite the behavior we’re seeing. We’re not getting EPROTOTYPE
on our first
write, but after we’ve done a couple.
I believe we find that behavior in the actual TCP syscall file, bsd/netinet/tcp_usrreq.c:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
|
I believe that comment explains everything we’re seeing. If we trigger a send
while the kernel is in the middle of tearing down the socket, it returns
EPROTOTYPE
. This then looks to be an error we could retry. Once the socket is
fully torn down, it should eventually return the proper EPIPE
. This is also
pretty easy to test. So I modified the inner loop of our C test:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
|
And yep, it exits cleanly. After all of this, I think it’s pretty clear at this point that there’s no weird kernel corruption bug going on, just a poorly documented edge case. But it sure was fun chasing this condition through the system.
To prevent anyone else from tripping over this edge case, I filed a Apple Radar
ticket (number #19012087 for any Apple employees reading this). Hopefully if
anyone runs into this mysterious EPROTOTYPE
it’ll be documented for them, or
at least there’s a chance they’ll stumble over this blog post and save
themselves a weekend diving through the OS.