Overall serde’s approach for serialization works out pretty well. One thing I
forgot to include in the last post was that I also have two benchmarks that are
not using serde, but are just safely reading and writing values. Assuming I
haven’t missed anything, they should be the upper limit in performance we can
get out of any serialization framework. Here’s serialization:
| language | library | serialization (MB/s) |
|----------|---------|----------------------|
| rust | max without string escapes | 353 |
| c++ | rapidjson | 304 |
| rust | max with string escapes | 234 |
| rust | serde::json | 201 |
| rust | serialize::json | 147 |
| go | ffjson | 147 |
So beyond optimizing string escaping, serde::json is only 14% slower than the
zero-cost version and 34% slower than rapidjson.
First, serde::json is built upon consuming from an Iterator<u8>, so we’re
48% slower than our theoretical max, and 58% slower than rapidjson. It looks
like tagged tokens, while faster than the closures in libserialize, are still
pretty expensive.
Second, ffjson is beating us and they compile dramatically faster too. The
goser test suite takes about 0.54
seconds to compile, whereas mine takes about 30 seconds at --opt-level=3
(!!). Rust itself is only taking 1.5 seconds, the rest is spent in LLVM. With
no optimization, it compiles in “only” 5.6 seconds, but the result runs 96% slower.
Third, Reader is a surprisingly expensive trait when dealing with a format
like JSON that needs to read a byte at a time. It turns out we’re not
generating great code for
types with padding. aatch has been working on fixing this though.
Since I wrote that last section, I did a little more experimentation to try to
figure out why our serialization upper bound is 23% slower than rapidjson. And,
well, maybe I found it?
| benchmark | serialization (MB/s) |
|-----------|----------------------|
| serde::json with a MyMemWriter | 346 |
| serde::json with a Vec | 247 |
All I did with MyMemWriter is copy the Vec::<u8> implementation of Writer
into the local codebase:
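Roughly like this (a sketch; the actual benchmark code may differ in detail):

```rust
// A minimal Writer that is the same idea as Vec<u8>'s Writer impl,
// just defined locally so LLVM sees everything in one crate.
struct MyMemWriter {
    buf: Vec<u8>,
}

impl Writer for MyMemWriter {
    #[inline]
    fn write(&mut self, src: &[u8]) -> IoResult<()> {
        self.buf.push_all(src);
        Ok(())
    }
}
```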
Somehow it’s not enough to just mark Vec::write as
#[inline]; having it in the same file gave LLVM enough information to
optimize its overhead away. Even using #[inline(always)] on Vec::write and
Vec::push_all isn’t able to get the same increase, so I’m not sure how to
replicate this in the general case.
Also interesting is bench_serializer_slice, which uses BufWriter.
Another digression. Since I wrote the above, aatch has put out some PRs that
should help speed up enums.
#19898 and
#20060, which were able to optimize
the padding out of enums and fix an issue with returns generating bad code. In
my bug from earlier,
his patches were able to speed up my benchmark returning a
Result<(), IoError> from running at 40MB/s to 88MB/s. However, if we’re able
to reduce IoError down to a word, we get the performance up to 730MB/s! We
also might get enum compression, so a type like Result<(), IoError> then
would speed up to 1200MB/s! I think going in this direction is going to really
help speed things up.
So libserialize has some pretty serious downsides. It’s slow, it’s got this
weird recursive closure thing going on, and it can’t even represent enum types
like a serialize::json::Json. We need a new solution, and while I was at it,
we ended up with two: serde and
serde2. Both are
different approaches to trying to address these problems, the biggest being
the type representation problem.
Serde Version 1
Deserialization
I want to start with deserialization first, as that’s really the interesting
bit. To repeat myself a little bit from
part 1,
here is a generic json Value enum:
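Something along these lines (a sketch; the variant names are taken from the
example below, and the exact definition may differ):

```rust
// A generic JSON value. TreeMap was std's ordered map at the time.
pub enum Value {
    Null,
    Boolean(bool),
    I64(i64),
    U64(u64),
    F64(f64),
    String(String),
    Array(Vec<Value>),
    Object(TreeMap<String, Value>),
}
```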
To deserialize a string like [1, true] into
Array(vec![I64(1), Boolean(true)]), we need to peek at one character ahead
(ignoring whitespace) in order to discover what is the type of the next value.
We then can use that knowledge to pick the right variant, and parse the next
value correctly. While I haven’t formally studied this stuff, I believe this
can be stated more formally as: Value requires at least an LL(1) grammar,
but since libserialize supports no lookahead, it can handle at most LL(0)
grammars.
Since I was thinking of this problem in terms of grammars, I wanted to take a
page out of their book and implement generic deserialization in this style.
serde::de::Deserializers are then an Iterator<serde::de::Token> lexer that
produces a token stream, and serde::de::Deserializes are a parser that
consumes this stream to produce a value. Here’s serde::de::Token, which can
represent nearly all the rust types:
```rust
pub enum Token {
    Null,
    Bool(bool),
    Int(int),
    I8(i8),
    I16(i16),
    I32(i32),
    I64(i64),
    Uint(uint),
    U8(u8),
    U16(u16),
    U32(u32),
    U64(u64),
    F32(f32),
    F64(f64),
    Char(char),
    Str(&'static str),
    String(String),
    Option(bool),     // true if the option has a value
    TupleStart(uint), // estimate of the number of values
    StructStart(
        &'static str, // the struct name
        uint,         // estimate of the number of (string, value) pairs
    ),
    EnumStart(
        &'static str, // the enum name
        &'static str, // the variant name
        uint,         // estimate of the number of values
    ),
    SeqStart(uint),   // number of values
    MapStart(uint),   // number of (value, value) pairs
    End,
}
```
The serde::de::Deserializer stream must generate tokens that follow this
grammar:
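Roughly this (my reconstruction from the Token variants above; the original
post spelled it out in more detail):

```
value ::= Null
        | Bool | Int | ... | F64 | Char | Str | String
        | Option(true) value
        | Option(false)
        | TupleStart value* End
        | StructStart (Str value)* End
        | EnumStart value* End
        | SeqStart value* End
        | MapStart (value value)* End
```

The two traits that drive this handshake look like this: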
```rust
pub trait Deserialize<D: Deserializer<E>, E> {
    fn deserialize(d: &mut D) -> Result<Self, E> {
        let token = try!(d.expect_token());
        Deserialize::deserialize_token(d, token)
    }

    fn deserialize_token(d: &mut D, token: Token) -> Result<Self, E>;
}

pub trait Deserializer<E>: Iterator<Result<Token, E>> {
    /// Called when a `Deserialize` expected more tokens, but the
    /// `Deserializer` was empty.
    fn end_of_stream_error(&mut self) -> E;

    /// Called when a `Deserializer` was unable to properly parse the stream.
    fn syntax_error(&mut self, token: Token, expected: &'static [TokenKind]) -> E;

    /// Called when a named structure or enum got a name that it didn't expect.
    fn unexpected_name_error(&mut self, token: Token) -> E;

    /// Called when a value was unable to be coerced into another value.
    fn conversion_error(&mut self, token: Token) -> E;

    /// Called when a `Deserialize` structure did not deserialize a field
    /// named `field`.
    fn missing_field<T: Deserialize<Self, E>>(
        &mut self,
        field: &'static str
    ) -> Result<T, E>;

    /// Called when a `Deserialize` has decided to not consume this token.
    fn ignore_field(&mut self, _token: Token) -> Result<(), E> {
        let _: IgnoreTokens = try!(Deserialize::deserialize(self));
        Ok(())
    }

    #[inline]
    fn expect_token(&mut self) -> Result<Token, E> {
        self.next().unwrap_or_else(|| Err(self.end_of_stream_error()))
    }

    ...
}
```
The Deserialize trait is kept pretty slim, and is how lookahead is
implemented. Deserializer is an enhanced Iterator<Result<Token, E>>, with
many helpful default methods. Here they are in action. First we’ll start with
what’s probably the simplest Deserializer, which just wraps a Vec<Token>:
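A sketch of the idea (the names and error variants here are mine, not
necessarily serde’s):

```rust
// Replays a pre-recorded token stream; handy for tests.
struct TokenDeserializer {
    tokens: vec::MoveItems<Token>,
}

#[deriving(Show)]
enum Error {
    EndOfStream,
    SyntaxError,
}

impl Iterator<Result<Token, Error>> for TokenDeserializer {
    fn next(&mut self) -> Option<Result<Token, Error>> {
        self.tokens.next().map(|token| Ok(token))
    }
}

impl Deserializer<Error> for TokenDeserializer {
    fn end_of_stream_error(&mut self) -> Error { EndOfStream }

    fn syntax_error(&mut self, _: Token, _: &'static [TokenKind]) -> Error {
        SyntaxError
    }

    // ... the remaining error constructors are equally mechanical.
}
```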
Overall it should be pretty straightforward. As usual, error handling makes
things a bit noisier, but hopefully it’s not too onerous. Next is a
Deserialize impl for bool:
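Which looks something like this (a sketch; the TokenKind variant name is my
guess):

```rust
// The kinds we accept, used for error reporting.
static BOOL_KINDS: &'static [TokenKind] = &[BoolKind];

impl<D: Deserializer<E>, E> Deserialize<D, E> for bool {
    #[inline]
    fn deserialize_token(d: &mut D, token: Token) -> Result<bool, E> {
        match token {
            Bool(value) => Ok(value),
            // Anything else is a syntax error.
            token => Err(d.syntax_error(token, BOOL_KINDS)),
        }
    }
}
```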
Simple! Sequences are a bit more tricky. Here’s Deserialize for a Vec<T>. We
use a helper adaptor, SeqDeserializer, to deserialize from all types that
implement FromIterator:
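A simplified sketch that inlines the adaptor’s loop (the real version funnels
the tokens through FromIterator so any collection works; the kind name is my
guess):

```rust
static SEQ_KINDS: &'static [TokenKind] = &[SeqStartKind];

impl<
    D: Deserializer<E>,
    E,
    T: Deserialize<D, E>
> Deserialize<D, E> for Vec<T> {
    fn deserialize_token(d: &mut D, token: Token) -> Result<Vec<T>, E> {
        match token {
            SeqStart(len) => {
                let mut values = Vec::with_capacity(len);

                // Keep deserializing values until we hit the `End` token.
                loop {
                    match try!(d.expect_token()) {
                        End => { return Ok(values); }
                        token => {
                            let value = try!(
                                Deserialize::deserialize_token(d, token));
                            values.push(value);
                        }
                    }
                }
            }
            token => Err(d.syntax_error(token, SEQ_KINDS)),
        }
    }
}
```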
It’s more complicated than libserialize’s struct parsing, but it performs
much better because it can handle out-of-order maps without buffering tokens.
Serialization
Serialization’s story is a much simpler one. Conceptually
serde::ser::Serializer/serde::ser::Serialize are inspired by the
deserialization story, but we don’t need the tagged tokens because we already
know the types. Here are the traits:
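In outline they look something like this (an abridged sketch; the method names
here are assumptions, and the real Serializer has a method per primitive type
plus compound helpers):

```rust
pub trait Serialize<S: Serializer<E>, E> {
    fn serialize(&self, s: &mut S) -> Result<(), E>;
}

pub trait Serializer<E> {
    fn serialize_null(&mut self) -> Result<(), E>;
    fn serialize_bool(&mut self, v: bool) -> Result<(), E>;
    fn serialize_int(&mut self, v: int) -> Result<(), E>;
    fn serialize_str(&mut self, v: &str) -> Result<(), E>;

    // Compound types are driven by the value being serialized, so no
    // tagged tokens are needed.
    fn serialize_seq_start(&mut self, len: uint) -> Result<(), E>;
    fn serialize_seq_end(&mut self) -> Result<(), E>;

    // ... and so on for the other types.
}
```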
There are many default methods, so only a handful of implementations need to be
specified. Now let’s look at how they are used. Here’s a simple
AssertSerializer that I use in my test suite to make sure I’m serializing
properly:
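It looks roughly like this (illustrative sketch; the real test helper may
differ):

```rust
// Drives a Serialize impl and asserts each call matches an expected
// stream of test-only tokens. `Token` here is a test helper enum with
// variants like Bool and Int, not serde's deserialization Token.
struct AssertSerializer<Iter> {
    iter: Iter,
}

impl<Iter: Iterator<Token>> Serializer<()> for AssertSerializer<Iter> {
    fn serialize_bool(&mut self, v: bool) -> Result<(), ()> {
        assert_eq!(self.iter.next(), Some(Bool(v)));
        Ok(())
    }

    fn serialize_int(&mut self, v: int) -> Result<(), ()> {
        assert_eq!(self.iter.next(), Some(Int(v)));
        Ok(())
    }

    // ... one assertion per Serializer method.
}
```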
So how does it perform? Here are the serialization benchmarks, with yet another
ordering. This time sorted by performance:
| language | library | format | serialization (MB/s) |
|----------|---------|--------|-----------------------|
| Rust | capnproto-rust | Cap’n Proto (unpacked) | 4349 |
| Go | go-capnproto | Cap’n Proto | 3824.20 |
| Rust | bincode | Binary | 1020 |
| Go | gogoprotobuf | Protocol Buffers | 596.78 |
| Rust | capnproto-rust | Cap’n Proto (packed) | 583 |
| Rust | rust-msgpack | MessagePack | 397 |
| Rust | rust-protobuf | Protocol Buffers | 357 |
| C++ | rapidjson | JSON | 304 |
| Rust | serde::json | JSON | 222 |
| Go | goprotobuf | Protocol Buffers | 214.68 |
| Go | ffjson | JSON | 147.37 |
| Rust | serialize::json | JSON | 147 |
| Go | encoding/json | JSON | 80.49 |
serde::json is doing pretty well! It’s still got a ways to go to catch up
to rapidjson, but it’s pretty cool that it’s
beating goprotobuf out of the box :)
Here are the deserialization numbers:
| language | library | format | deserialization (MB/s) |
|----------|---------|--------|-------------------------|
| Rust | capnproto-rust | Cap’n Proto (unpacked) | 2185 |
| Go | go-capnproto | Cap’n Proto (zero copy) | 1407.95 |
| Go | go-capnproto | Cap’n Proto | 711.77 |
| Rust | capnproto-rust | Cap’n Proto (packed) | 351 |
| Go | gogoprotobuf | Protocol Buffers | 272.68 |
| C++ | rapidjson | JSON (sax) | 189 |
| C++ | rapidjson | JSON (dom) | 162 |
| Rust | rust-msgpack | MessagePack | 138 |
| Rust | rust-protobuf | Protocol Buffers | 129 |
| Go | ffjson | JSON | 95.06 |
| Rust | bincode | Binary | 80 |
| Go | goprotobuf | Protocol Buffers | 79.78 |
| Rust | serde::json | JSON | 67 |
| Rust | serialize::json | JSON | 24 |
| Go | encoding/json | JSON | 22.79 |
Well, on the plus side, serde::json is nearly 3 times faster than
libserialize::json. On the downside, rapidjson is nearly 3 times faster than
us with its SAX-style parsing. Even the newly added deserialization support in
ffjson is 1.4 times faster than us. So we’ve
got our work cut out for us!
Next time, serde2!
PS: I’m definitely getting close to the end of my story, and while I have some
better numbers with serde2, nothing is quite putting me in the rapidjson
range. Anyone want to help optimize
serde? I would greatly appreciate the help!
PPS: I’ve gotten a number of requests for my
serialization benchmarks
to be ported over to other languages and libraries, especially a C++ version
of Cap’n Proto. Unfortunately I don’t really have the time to do it myself.
Would anyone be up for helping to implement it?
I’m really supposed to be working on my
serialization series,
but I just saw a neat new library that was open sourced by Yahoo a couple days
ago called MDBM on
Hacker News. I know nothing
about the library, but there are some really neat claims:
- It’s supposed to be fast, and that’s always nice.
- It’s supposed to have a really slim interface.
- It’s so fast because it’s passing around naked pointers to mmapped files,
  which is terribly unsafe. Unless you’ve got Rust, which can prove that those
  pointers won’t escape :)
So I wanted to see how easy it’d be to make a Rust binding for the project.
If you want to follow along, first make sure you have
rust installed. Unfortunately it
looks like MDBM only supports Linux and FreeBSD, so I had to build out a Fedora
VM to test this out on. I think this is all you need to build it:
```sh
% git clone https://github.com/yahoo/mdbm
% cd mdbm/redhat
% make
% rpm -Uvh ~/rpmbuild/RPMS/x86_64/mdbm-4.11.1-1.fc21.x86_64.rpm
% rpm -Uvh ~/rpmbuild/RPMS/x86_64/mdbm-devel-4.11.1-1.fc21.x86_64.rpm
```
Unfortunately it’s only for Linux and I’ve got a Mac, but it turns out there’s
plenty I can do to prep while VirtualBox and Fedora 21 download. Let’s start out
by creating our project with cargo:
```sh
% cargo new rust-mdbm
% cd rust-mdbm
```
(Right now there’s no way to have the name be different than the path, so edit
Cargo.toml to rename the project to mdbm. I filed
#1030 to get that
implemented).
By convention, we put bindgen packages into $name-sys, so make that crate as
well:
```sh
% cargo new --no-git mdbm-sys
% cd mdbm-sys
```
We’ve got a really cool tool called
bindgen, which uses clang to parse
header files and convert them into an unsafe Rust interface. So let’s check out
MDBM and generate a crate to wrap it up in:
```sh
% cd ../..
% git clone git@github.com:crabtw/rust-bindgen.git
% cd rust-bindgen
% cargo build
% cd ..
% git clone git@github.com:yahoo/mdbm.git
% cd rust-mdbm/mdbm-sys
% DYLD_LIBRARY_PATH=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib \
    ~/rust/rust-bindgen/target/bindgen \
    -lmdbm \
    -o src/lib.rs \
    mdbm/include/mdbm.h
```
Pretty magical. Make sure it builds:
```
% cargo build
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3:21: 3:25 error: failed to resolve. Maybe a missing `extern crate libc`?
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3 pub type __int8_t = ::libc::c_char;
                                                                    ^~~~
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3:21: 3:35 error: use of undeclared type name `libc::c_char`
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3 pub type __int8_t = ::libc::c_char;
                                                                  ^~~~~~~~~~~~~~
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:4:22: 4:26 error: failed to resolve. Maybe a missing `extern crate libc`?
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:4 pub type __uint8_t = ::libc::c_uchar;
                                                                     ^~~~
...
```
Nope! The problem is that we don’t have the libc crate imported. We don’t
have a convention for this yet, but what I like to do is:
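Something like this (my usual layout; the exact file arrangement here is an
assumption):

```rust
// mdbm-sys/src/lib.rs
#![allow(non_camel_case_types)]

extern crate libc;

pub use ffi::*;

// The bindgen output lives untouched in src/ffi.rs (regenerate with
// -o src/ffi.rs), so it never clobbers this hand-written shim.
mod ffi;
```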
This lets me run bindgen later on without mucking up the library. This now
compiles. Next up is our high-level interface. Add mdbm-sys as a dependency
by adding this to the rust-mdbm/Cargo.toml file:
```toml
[dependencies.mdbm-sys]
path = "mdbm-sys"
```
By now I had my VirtualBox setup working, so on to the actual code! Let’s start
with a barebones wrapper around the database:
```rust
pub struct MDBM {
    db: *mut mdbm_sys::MDBM,
}
```
Next is the constructor and destructor. I’m hardcoding things for now and using
IoError, since MDBM appears to report everything through errno:
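A sketch of what they look like (the mdbm_open arguments follow MDBM’s C API;
the exact flag handling in the original may differ):

```rust
impl MDBM {
    pub fn new(
        path: &Path,
        flags: uint,
        mode: uint,
        psize: uint,
        presize: uint
    ) -> Result<MDBM, IoError> {
        unsafe {
            let path = path.to_c_str();
            let db = mdbm_sys::mdbm_open(
                path.as_ptr(),
                flags as libc::c_int,
                mode as libc::c_int,
                psize as libc::c_int,
                presize as libc::c_int);

            if db.is_null() {
                Err(IoError::last_error())
            } else {
                Ok(MDBM { db: db })
            }
        }
    }
}

impl Drop for MDBM {
    fn drop(&mut self) {
        unsafe {
            mdbm_sys::mdbm_sync(self.db);
            mdbm_sys::mdbm_close(self.db);
        }
    }
}
```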
Pretty straightforward translation of the examples, with some hardcoded values
to start out. Next up is a wrapper around MDBM’s datum type, which is the
type used for both keys and values. datum is just a simple struct containing
a pointer and length, pretty much analogous to our &[u8] slices. However, our
slices are much more powerful because the type system can guarantee that in
safe Rust these slices can never outlive the value they are derived from:
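Along these lines (a sketch; the AsDatum impls shown are just the minimum
needed for the examples below):

```rust
pub struct Datum<'a> {
    data: &'a [u8],
}

impl<'a> Datum<'a> {
    pub fn new(data: &'a [u8]) -> Datum<'a> {
        Datum { data: data }
    }
}

pub trait AsDatum {
    fn as_datum<'a>(&'a self) -> Datum<'a>;
}

impl<'a> AsDatum for &'a str {
    fn as_datum<'b>(&'b self) -> Datum<'b> {
        Datum::new(self.as_bytes())
    }
}

// Convert to the raw C datum, which is a (char*, int) pair.
fn to_raw_datum(datum: &Datum) -> mdbm_sys::datum {
    mdbm_sys::datum {
        dptr: datum.data.as_ptr() as *mut libc::c_char,
        dsize: datum.data.len() as libc::c_int,
    }
}
```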
And finally, we’ve got setting and getting a key-value pair. Setting is pretty
straightforward. The only fun thing is using the AsDatum constraints so we
can do db.set(&"foo", &"bar", 0) instead of
db.set(Datum::new("foo".as_slice()), Datum::new("bar".as_slice()), 0). Since
we’re copying into the database, we don’t have to worry about lifetimes yet:
```rust
impl MDBM {
    ...

    /// Set a key.
    pub fn set<K, V>(&self, key: &K, value: &V, flags: int) -> Result<(), IoError>
        where K: AsDatum, V: AsDatum,
    {
        unsafe {
            let rc = mdbm_sys::mdbm_store(
                self.db,
                to_raw_datum(&key.as_datum()),
                to_raw_datum(&value.as_datum()),
                flags as libc::c_int);

            if rc == -1 {
                Err(IoError::last_error())
            } else {
                Ok(())
            }
        }
    }

    ...
}
```
MDBM requires the database to be locked in order to get the keys. This is
where things get fun in order to prevent those interior pointers from escaping.
We’ll create another wrapper type that manages the lock, and uses RAII to
unlock when we’re done. We tie the lifetime of the Lock to the lifetime of
the database and key, which prevents it from outliving either object:
```rust
impl MDBM {
    ...

    /// Lock a key.
    pub fn lock<'a, K>(&'a self, key: &'a K, flags: int) -> Result<Lock<'a>, IoError>
        where K: AsDatum,
    {
        let rc = unsafe {
            mdbm_sys::mdbm_lock_smart(
                self.db,
                &to_raw_datum(&key.as_datum()),
                flags as libc::c_int)
        };

        if rc == 1 {
            Ok(Lock { db: self, key: key.as_datum() })
        } else {
            Err(IoError::last_error())
        }
    }

    ...
}

pub struct Lock<'a> {
    db: &'a MDBM,
    key: Datum<'a>,
}

#[unsafe_destructor]
impl<'a> Drop for Lock<'a> {
    fn drop(&mut self) {
        unsafe {
            let rc = mdbm_sys::mdbm_unlock_smart(
                self.db.db,
                &to_raw_datum(&self.key),
                0);

            assert_eq!(rc, 1);
        }
    }
}
```
(Note that I’ve heard #[unsafe_destructor] as used here may become unnecessary
in 1.0.)
Finally, let’s get our value! Assuming the value exists, we tie the lifetime of
the Lock to the lifetime of the returned &[u8]:
```rust
impl<'a> Lock<'a> {
    /// Fetch a key.
    pub fn get<'a>(&'a self) -> Option<&'a [u8]> {
        unsafe {
            let value = mdbm_sys::mdbm_fetch(
                self.db.db,
                to_raw_datum(&self.key));

            if value.dptr.is_null() {
                None
            } else {
                // we want to constrain the ptr to our lifetime.
                let ptr: &*const u8 = mem::transmute(&value.dptr);
                Some(slice::from_raw_buf(ptr, value.dsize as uint))
            }
        }
    }
}
```
Now to verify it works:
```rust
#[test]
fn test() {
    let db = MDBM::new(
        &Path::new("test.db"),
        super::MDBM_O_RDWR | super::MDBM_O_CREAT,
        0o644,
        0,
        0
    ).unwrap();

    db.set(&"hello", &"world", 0).unwrap();

    {
        // key needs to be an lvalue so the lock can hold a reference to
        // it.
        let key = "hello";

        // Lock the key. RAII will unlock it when we exit this scope.
        let value = db.lock(&key, 0).unwrap();

        // Convert the value into a string. The lock is still live at this
        // point.
        let value = str::from_utf8(value.get().unwrap()).unwrap();
        assert_eq!(value, "world");
        println!("hello: {}", value);
    }
}
```
Success! Not too bad for 2 hours of work. Barring bugs, this mdbm
library should perform at roughly the same speed as the C library, but
eliminate many very painful bug opportunities that would otherwise require
tools like Valgrind to debug.
Edit: you can find all the code in this post
here,
and I filed #19281 for the
regression I mentioned at the end of the post.
Low level benchmarking is confusing and non-intuitive.
The end.
Or not. Whatever. So I’m trying to get my
implement-Reader-and-Writer-for-&[u8] type PR
#18980 landed. But
Steven Fackler
obnoxiously and correctly pointed out that this won’t play that nicely with the
new Reader and Writer implementations for Vec<u8>. Grumble grumble. And then
Alex Crichton
had the gall to mention that a Writer for &mut [u8] also probably won’t be
that common either. Sure, he’s right and all, but I got it working without
needing an index! That means that the &mut [u8] Writer only needs 2
pointers instead of BufWriter’s three, so it just has to be faster! Well,
doesn’t it?
Stupid benchmarks.
I’ve got to say, it’s pretty addictive writing micro-benchmarks. It’s a lot of fun
seeing how sensitive low-level code can be to just the smallest of tweaks. It’s
also really annoying when you write something you think is pretty neat, then
you find it’s chock-full of false dependencies between cache lines, or other
mean things CPUs like to impose on poor programmers.
Anyway, to start, let’s look at what should be the fastest way to write to a
buffer: completely unsafely, with no checks.
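Something like this (a sketch of the baseline; the names are mine):

```rust
// Copy `src` into `dst` `batches` times using raw pointers, with no
// bounds checks at all.
unsafe fn unsafe_fill(mut dst: *mut u8, src: &[u8], batches: uint) {
    for _ in range(0, batches) {
        ptr::copy_nonoverlapping_memory(dst, src.as_ptr(), src.len());
        dst = dst.offset(src.len() as int);
    }
}
```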
With SRC_LEN=4 and BATCHES=128, we get this. For fun I added the new
libtest from #19233 that will
hopefully land soon. I also ran variations that explicitly inlined
and didn’t inline the inner function:
Crud. So not only did I add an implementation that’s probably not going to work
with write!, it also turns out the performance is pretty terrible. Inlining
isn’t helping like it did in the unsafe case. So how does
std::io::BufWriter
compare?
```rust
pub struct BufWriter<'a> {
    buf: &'a mut [u8],
    pos: uint
}

impl<'a> Writer for BufWriter<'a> {
    #[inline]
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        // return an error if the entire write does not fit in the buffer
        let cap = if self.pos >= self.buf.len() {
            0
        } else {
            self.buf.len() - self.pos
        };

        if buf.len() > cap {
            return Err(IoError {
                kind: io::OtherIoError,
                desc: "Trying to write past end of buffer",
                detail: None
            })
        }

        slice::bytes::copy_memory(self.buf[mut self.pos..], buf);
        self.pos += buf.len();
        Ok(())
    }
}
```
That’s just cruel. The optimization gods obviously hate me. So I started
playing with a lot of
variations
in what is, yeah yeah, really my serialization benchmark suite (I’m planning on
making it more general purpose; besides, it’s my suite and I can do whatever I
want with it, so there):
- (BufWriter0): Turning this Writer into a struct wrapper shouldn’t do
  anything, and it didn’t.
- (BufWriter1): There’s error handling; does removing it help? Nope!
- (BufWriter5): There’s an implied branch in let write_len = min(dst_len,
  src_len). We can turn that into a branch-predictor-friendly form.
- (BufWriter2): Fine then, optimization gods! Let’s remove the branch
  altogether and just always advance the slice src.len() bytes! Damn the
  safety! That, of course, works. I can hear them giggle.
- (BufWriter3): Maybe, just maybe, there’s something weird going on with
  inlining across crates? Let’s copy std::io::BufWriter and make sure that
  it’s still nearly optimal. It still is.
- (BufWriter6): Technically the min(dst_len, src_len) is a bounds check, so
  we could switch from the bounds-checked std::slice::bytes::copy_memory to
  the unsafe std::ptr::copy_nonoverlapping_memory, but that also doesn’t
  help.
- (BufWriter7): Might as well apply the last trick to std::io::BufWriter;
  it does shave a couple nanoseconds off, and might be worth pushing
  upstream.
- (BufWriter4): While I’m using one less uint than std::io::BufWriter, I’m
  doing two writes to advance my slice: one to advance the pointer, and one to
  shrink the length. std::io::BufWriter only has to advance its pos index.
  But if instead of treating the slice as a (ptr, length) pair we convert it
  into a (start_ptr, end_ptr) pair, where start_ptr=ptr and
  end_ptr=ptr+length, advancing is a single pointer bump. This works! Ish
  (see the sketch after this list):
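A sketch of the BufWriter4 idea (my reconstruction, not the exact benchmark
code):

```rust
pub struct BufWriter4 {
    start: *mut u8, // next byte to write
    end: *mut u8,   // one past the last writable byte
}

impl BufWriter4 {
    #[inline]
    fn write(&mut self, src: &[u8]) -> IoResult<()> {
        let src_len = src.len();
        let cap = self.end as uint - self.start as uint;

        if src_len > cap {
            return Err(IoError {
                kind: io::OtherIoError,
                desc: "Trying to write past end of buffer",
                detail: None,
            })
        }

        unsafe {
            ptr::copy_nonoverlapping_memory(self.start, src.as_ptr(), src_len);
            // Advancing costs a single pointer bump.
            self.start = self.start.offset(src_len as int);
        }

        Ok(())
    }
}
```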
But still nowhere near where we need to be. It does suggest, though, that
always cutting down the src, which triggers another bounds check, has some
measurable impact. So maybe I should only shrink the src slice when we know it
needs to be shrunk?
At this point, both solutions are approximately as fast as the unsafe
ptr::copy_nonoverlapping_memory! So that’s awesome. Now, would anyone really
care enough about the extra uint? There may be a few very specialized cases
where that extra uint could cause a problem, but I’m not sure if it’s worth
it. What do you all think?
I thought that was good, but since I’m already here, how’s the new Vec<u8>
writer doing? Here’s the driver:
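Roughly like this (a sketch; the constants and setup are mine):

```rust
#[bench]
fn bench_vec_writer(b: &mut test::Bencher) {
    let src: Vec<u8> = Vec::from_elem(SRC_LEN, 1);
    let mut dst = Vec::with_capacity(SRC_LEN * BATCHES);

    b.iter(|| {
        dst.clear();

        // Write through the new Writer impl for Vec<u8>.
        for _ in range(0, BATCHES) {
            dst.write(src.as_slice()).unwrap();
        }
    });
}
```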
Wow. That’s pretty terrible. Something weird must be going on with
Vec::push_all. (Maybe that’s what caused my serialization benchmarks to slow
down by a third?) Let’s skip it:
```rust
impl<'a> MyWriter for VecWriter1<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let src_len = src.len();

        self.dst.reserve(src_len);

        let dst = self.dst.as_mut_slice();

        unsafe {
            // we reserved enough room in `dst` to store `src`.
            ptr::copy_nonoverlapping_memory(
                dst.as_mut_ptr(),
                src.as_ptr(),
                src_len);
        }

        Ok(())
    }
}
```
There’s even less going on here than before. The only difference is that
reserve call. Commenting it out gets us back to copy_nonoverlapping_memory
territory:
Unfortunately it’s getting pretty late, so rather than wait until the next time
to dive into this, I’ll leave it up to you all. Does anyone know why reserve
is causing so much trouble here?
PS: While I was working on this, I saw
stevencheg
submitted a patch to speed up the protocol buffer support. But when I ran the
tests, everything was about 40% slower than in the last benchmark
post! Something
happened to Rust’s performance over these past couple weeks!
A slight diversion from my serialization series. Last week, strcat submitted
pull request #18885, which adds
support for using a Vec as a Writer. Over the weekend I submitted
#18980, which allows a &[u8]
to be used as a Reader. Overall a pretty simple change. However, when I was
running the test suite, I found that the std::io::net::tcp::write_close_ip4()
test was occasionally failing:
The write would succeed a few times, then occasionally fail with
unknown error: Protocol wrong type for socket, aka the EPROTOTYPE errno.
This is a really surprising error. As far as I know, the only functions that
return that errno are socket, socketpair, and connect. I searched
everywhere, but I couldn’t find any documentation suggesting that write would
ever produce that error.
I wasn’t the only one who ran into it. bjz opened
#18900 describing the same
problem. One interesting thing to note is they were also on OSX Yosemite. So I
took a little bit of time to extract out that test from the Rust test suite
into this gist and got
someone on #rust-internals to run it on linux for me with this little driver
script:
```sh
#!/bin/sh

set -e

rustc test.rs
for i in $(seq 1 1000); do
    ./test tcp::test::write_close_ip4
done
```
and it didn’t error out. So it seems to be system dependent. Further
experimentation showed that introducing sleeps or a mutex synchronization
appeared to fix the problem as well. At this point I wasn’t sure if this was a
non-issue, a bug in our code, or a bug in the OS. One thing’s for sure though:
if there is a bug, it could be scattered somewhere across the Rust codebase,
and std::io alone is 20 files at around 12522 lines. It’d be a pain to
cut that down to a self-contained test case.
Fortunately we’ve got C-Reduce to help us
out. Back in May Professor John Regehr from
the University of Utah came to our
Bay Area Rust meetup
to talk about compiler testing and fuzzing. We recorded the talk, so if you
want to watch it, you can find it
here. One of the things he
talked about was the tool C-Reduce his research group developed to
automatically cut
out unnecessary lines of code that can still reproduce a bug you’re looking
for. While it’s written to target C files, it turns out Rust is syntactically
close enough to C that it works out pretty well for it too. All you need is a
single source file and driver script that’ll report if the compiled source file
reproduced the bug.
Aside 1: By the way, one of the other things I did this weekend was put
together a Homebrew
pull request for C-Reduce.
It hasn’t landed yet, but if you want to use it, you can install it straight
from that pull request. Hopefully it’ll land soon so you’ll be able to just do:

```sh
% brew install creduce
```
Anyway, back to the story. So we’ve got a rather large code base to cover, and
while C-Reduce does a pretty good job of trimming away lines, just pointing it
at the entire std module is a bit too much for it to handle in a reasonable
amount of time. It probably needs some more semantic information about Rust to
do a better job of cutting out broad swaths of code.
So I needed to do at least a rough pass to slim down the files. I figured the
problem was probably contained in std::io or std::sys, so I wrote a simple
test.rs that explicitly included those modules as part of the crate (see
this gist if you are
interested), and used the pretty printer to merge it into one file:
```sh
% rustc --pretty normal test.rs > std.rs
```
Aside 2: Our pretty printer still has some bugs in it, which I filed:
#19075 and
#19077. Fortunately it was
pretty simple to fix those cases by hand in the generated std.rs, and odds
are good they’ll be fixed by the time you might use this process.
Next up, we need a driver script. We can adapt our one from before. The only
special consideration is that we need to make sure to only exit with a return
code of 0 if the version of std.rs we’re compiling errors with EPROTOTYPE:
```sh
#!/bin/sh

rustc \
    -A unused_imports \
    -A unused_unsafe \
    -A non_camel_case_types \
    -A unused_variables \
    --test \
    -o test \
    test.rs >/dev/null 2>&1

if [ "$?" -ne "0" ]; then
    exit 1
fi

root=`dirname $0`

for i in $(seq 1 1000); do
    $root/timeout.sh 5 ./test --nocapture tcp::test::write_close_ip4 2>&1 | tee log

    grep "Protocol wrong type for socket" log
    RET="$?"

    if [ "$RET" -eq "0" ]; then
        exit 0
    elif [ "$RET" -eq "143" ]; then
        exit 1
    fi

    echo
done

exit 1
```
I used the helper script
timeout.sh to
time out tests in case C-Reduce accidentally made us an infinite loop.
I let it run in the background for some time while I did some
other things. When I got back, C-Reduce had automatically reduced the file from
153KB to a slim 22KB. I then reran rustc with the lints enabled to manually cut
out the dead code C-Reduce failed to remove, and flattened away some
unnecessary structs and methods. I was finally left with:
This snippet reproduces the same EPROTOTYPE error that we started with at the
top of the post. Pretty cool that we got here without much effort, huh?
Now at this point you might say to yourself, couldn’t I have extracted this
out myself? And yeah, you would be right. This is pretty much a C-in-Rust
implementation of this bug. But what’s great about using C-Reduce here is that
I only had to make some very rough guesses about which files were and were not
important to include in my test.rs. Eventually, when we get some Rust plugins
written for C-Reduce, I could probably just point it at the whole libstd and
let C-Reduce do its thing. Doing this by hand can be a pretty long and painful
manual process, especially if we’re trying to debug a codegen or runtime bug.
In the past I’ve spent hours reducing some codegen bugs down into a small
snippet that C-Reduce was also able to produce in a couple minutes.
The last step with this code was to eliminate Rust from the picture, and
translate this code into C:
```c
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <strings.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <errno.h>
#include <signal.h>

int do_server() {
    int fd;
    struct sockaddr_in server_addr;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) {
        perror("error socket server");
        exit(1);
    }

    bzero((char *) &server_addr, sizeof(server_addr));
    server_addr.sin_family = AF_INET;
    server_addr.sin_addr.s_addr = INADDR_ANY;
    server_addr.sin_port = htons(9600);

    if (bind(fd, (struct sockaddr *) &server_addr, sizeof(server_addr)) < 0) {
        perror("error binding");
        exit(1);
    }

    return fd;
}

void* do_child_thread(void *unused) {
    struct sockaddr_in client_addr;
    int fd;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) {
        perror("error socket client");
        exit(1);
    }

    bzero((char *) &client_addr, sizeof(client_addr));
    client_addr.sin_family = AF_INET;
    client_addr.sin_addr.s_addr = INADDR_ANY;
    client_addr.sin_port = htons(9600);

    if (connect(fd, (struct sockaddr *) &client_addr, sizeof(client_addr)) < 0) {
        perror("error connect");
        exit(1);
    }

    fprintf(stderr, "closing client socket\n");
    if (close(fd) < 0) {
        perror("error close client socket");
        exit(1);
    }
    fprintf(stderr, "closed client socket\n");

    return NULL;
}

int main(int argc, char **argv) {
    int server_fd, client_fd;
    socklen_t client_len;
    struct sockaddr_in client_addr;
    char buf[] = {'a', '\n'};
    pthread_t child_thread;
    int rc;

    signal(SIGPIPE, SIG_IGN);

    server_fd = do_server();

    rc = listen(server_fd, 5);
    if (rc < 0) {
        perror("error listen");
        return 1;
    }

    rc = pthread_create(&child_thread, NULL, do_child_thread, NULL);
    if (rc != 0) {
        perror("error pthread_create");
        return 1;
    }

    client_len = sizeof(client_addr);
    client_fd = accept(server_fd, (struct sockaddr *) &client_addr, &client_len);
    if (client_fd < 0) {
        perror("error accept");
        return 1;
    }

    while (1) {
        fprintf(stderr, "before send\n");
        rc = send(client_fd, buf, sizeof(buf), 0);
        fprintf(stderr, "after send: %d\n", rc);

        if (rc < 0) {
            if (errno == EPIPE) {
                break;
            } else {
                int so_type;
                socklen_t so_len = sizeof(so_type);
                getsockopt(client_fd, SOL_SOCKET, SO_TYPE, &so_type, &so_len);
                fprintf(stderr, "type: %d %d\n", so_type, SOCK_STREAM);

                perror("error send");
                return 1;
            }
        }
    }

    fprintf(stderr, "before server closing client fd\n");
    if (close(client_fd) < 0) {
        perror("error close client");
        return 1;
    }
    fprintf(stderr, "after server closing client fd\n");

    fprintf(stderr, "before server closing fd\n");
    if (close(server_fd) < 0) {
        perror("error close server");
        return 1;
    }
    fprintf(stderr, "after server closing fd\n");

    rc = pthread_join(child_thread, NULL);
    if (rc != 0 && rc != ESRCH) {
        fprintf(stderr, "error pthread_join: %d\n", rc);
        return 1;
    }

    return 0;
}
```
This also produces EPROTOTYPE, so we can eliminate Rust altogether. But let’s
keep digging. What exactly is producing this error? If I were on Linux, I’d use
strace, but that program isn’t on Macs. There’s a similar tool called
dtruss, but that seemed to slow things down enough that the EPROTOTYPE
never happened. Fortunately though, there is another program called errinfo,
which just prints the errno along with every syscall. In one terminal I ran
while ./test; do sleep 0.1; done. In the other:
Right there we see our sendto syscall is actually returning EPROTOTYPE.
This errno, then, is definitely being created inside the OSX kernel, not in any
userspace code. Fortunately, most of the Apple kernel, XNU, is open sourced, so
we can dig down to what’ll be my last layer. You can find the tarballs at
http://www.opensource.apple.com/. But I’d rather use the
unofficial GitHub repository. Using GitHub’s
search tools, we can find all 17 instances of
EPROTOTYPE
in the codebase. Now I don’t know the XNU codebase, but there are still some
really interesting things we can find. The first is in
bsd/kern/uipc_usrreq.c:
```c
/*
 * Returns:  0                            Success
 *           EINVAL
 *           EOPNOTSUPP
 *           EPIPE
 *           ENOTCONN
 *           EISCONN
 *     unp_internalize:EINVAL
 *     unp_internalize:EBADF
 *     unp_connect:EAFNOSUPPORT           Address family not supported
 *     unp_connect:EINVAL                 Invalid argument
 *     unp_connect:ENOTSOCK               Not a socket
 *     unp_connect:ECONNREFUSED           Connection refused
 *     unp_connect:EISCONN                Socket is connected
 *     unp_connect:EPROTOTYPE             Protocol wrong type for socket
 *     unp_connect:???
 *     sbappendaddr:ENOBUFS               [5th argument, contents modified]
 *     sbappendaddr:???                   [whatever a filter author chooses]
 */
static int
uipc_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam,
          struct mbuf *control, proc_t p)
{
    ...
```
Hey, look at that! There’s a handler for the send syscall (although for IPC,
not TCP) that actually documents that it can return EPROTOTYPE! While it
doesn’t explain exactly how this can happen, the fact that it mentions
unp_connect hints that uipc_send may trigger a connect, and that’s exactly what
we find a couple lines into the function:
```c
    ...

    /* Connect if not connected yet. */
    /*
     * Note: A better implementation would complain
     * if not equal to the peer's address.
     */
    if ((so->so_state & SS_ISCONNECTED) == 0) {
        if (nam) {
            error = unp_connect(so, nam, p);
            if (error)
                break;    /* XXX */
        } else {
            error = ENOTCONN;
            break;
        }
    }

    ...
```
The fact that the comment says the socket might not be connected yet when we’re
doing a send hints that Apple may have introduced some level of asynchrony
and preemption to sockets. So if we trigger the actual connect here, it could
then return EPROTOTYPE, which makes sense. Unfortunately that’s still not
quite the behavior we’re seeing: we’re not getting EPROTOTYPE on our first
write, but after we’ve done a couple. The TCP send handler, tcp_usr_send in
bsd/netinet/tcp_usrreq.c, turns out to be more relevant:
```c
static int
tcp_usr_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam,
             struct mbuf *control, struct proc *p)
{
    int error = 0;
    struct inpcb *inp = sotoinpcb(so);
    struct tcpcb *tp;
    uint32_t msgpri = MSG_PRI_DEFAULT;
#if INET6
    int isipv6;
#endif
    TCPDEBUG0;

    if (inp == NULL || inp->inp_state == INPCB_STATE_DEAD
#if NECP
        || (necp_socket_should_use_flow_divert(inp))
#endif /* NECP */
        ) {
        /*
         * OOPS! we lost a race, the TCP session got reset after
         * we checked SS_CANTSENDMORE, eg: while doing uiomove or a
         * network interrupt in the non-splnet() section of sosend().
         */
        if (m != NULL)
            m_freem(m);
        if (control != NULL) {
            m_freem(control);
            control = NULL;
        }

        if (inp == NULL)
            error = ECONNRESET;    /* XXX EPIPE? */
        else
            error = EPROTOTYPE;
        tp = NULL;
        TCPDEBUG1();
        goto out;
    }

    ...
```
I believe that comment explains everything we’re seeing. If we trigger a send
while the kernel is in the middle of tearing down the socket, it returns
EPROTOTYPE. This then looks to be an error we could retry: once the socket is
fully torn down, it should eventually return the proper EPIPE. This is also
pretty easy to test, so I modified the inner loop of our C test:
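Along these lines (a sketch of the change; the original modification may have
differed slightly):

```c
/* Inside the while (1) send loop: treat EPROTOTYPE as retryable. */
if (rc < 0) {
    if (errno == EPIPE) {
        break;
    } else if (errno == EPROTOTYPE) {
        /* The kernel is still tearing the socket down; try again
         * until we get the real EPIPE. */
        continue;
    } else {
        perror("error send");
        return 1;
    }
}
```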
And yep, it exits cleanly. After all of this, I think it’s pretty clear that
there’s no weird kernel corruption bug going on, just a poorly
documented edge case. But it sure was fun chasing this condition through the
system.
To prevent anyone else from tripping over this edge case, I filed an Apple
Radar ticket (number 19012087, for any Apple employees reading this). Hopefully if
anyone runs into this mysterious EPROTOTYPE it’ll be documented for them, or
at least there’s a chance they’ll stumble over this blog post and save
themselves a weekend diving through the OS.
Back to the benchmarks! I got some great comments on
reddit,
so I wanted to do another post to update my numbers. Here’s what I changed:

- I wasn’t consistent on whether or not the serialization benchmarks included
  the allocation of a buffer to write into. I’ve changed it so most are reusing
  one, which speeds everything up (especially capnproto-rust!). This does
  depend on #18885 landing though.
- I’ve added bincode, which serializes
  values as raw bytes. Quite speedy too! Not nearly as fast as Cap’n Proto
  though.
- I’ve changed the C++ and Rust JSON tests to serialize enums as uints.
- I added the time it takes to create and populate the structures. I’m betting
  the reason the Rust numbers are so high is that we’re allocating strings. Not
  sure if the other languages are able to avoid that allocation.
JSON:

| language | library | population (ns) | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|-----------------|-----------------------|-------------------------|
| Rust | serialize::json | 1127 | 117 | 26 |
| C++ | rapidjson (dom) | 546 | 281 | 144 |
| C++ | rapidjson (sax) | 546 | 281 | 181 |
| Go | encoding/json | 343 | 63.99 | 22.46 |
| Go | ffjson | 343 | 144.60 | (not supported) |
Cap’n Proto:

| language | library | population (ns) | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|-----------------|-----------------------|-------------------------|
| Rust | capnproto-rust (unpacked) | 325 | 4977 | 2251 |
| Rust | capnproto-rust (packed) | 325 | 398 | 246 |
| Go | go-capnproto | 2368 | 2226.71 | 450 |
| Go | go-capnproto (zero copy) | 2368 | 2226.71 | 1393.3 |
Protocol Buffers:

| language | library | population (ns) | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|-----------------|-----------------------|-------------------------|
| Rust | rust-protobuf | 1041 | 370 | 118 |
| Go | goprotobuf | 1133 | 138.27 | 91.18 |
| Go | gogoprotobuf | 343 | 472.69 | 295.33 |
Misc:

| language | library | population (ns) | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|-----------------|-----------------------|-------------------------|
| Rust | rust-msgpack | 1143 | 454 | 144 |
| Rust | bincode | 1143 | 1149 | 82 |
Anyone want to add more C/Go/Rust/Java/etc benchmarks?
After part 2 I received
a couple requests to add some other Rust serialization libraries. So one
thing led to another, and now I’ve got a benchmark suite I’m calling
rust-serialization-benchmarks.
Really creative name, eh? This includes all the other benchmarks I referred to
previously, as well as capnproto,
msgpack, and
protobuf.
| language | library | format | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|--------|-----------------------|-------------------------|
| C++ | rapidjson | JSON (dom) | 233 | 102 |
| C++ | rapidjson | JSON (sax) | 233 | 124 |
| Go | encoding/json | JSON | 54.93 | 16.72 |
| Go | ffjson | JSON | 126.40 | (not supported) |
| Go | goprotobuf | Protocol Buffers | 138.27 | 91.18 |
| Go | gogoprotobuf | Protocol Buffers | 472.69 | 295.33 |
| Go | go-capnproto | Cap’n Proto | 2226.71 | 450 |
| Go | go-capnproto | Cap’n Proto (zero copy) | 2226.71 | 1393.3 |
| Rust | serialize::json | JSON | 89 | 18 |
| Rust | rust-msgpack | MessagePack | 160 | 52 |
| Rust | rust-protobuf | Protocol Buffers | 177 | 70 |
| Rust | capnproto-rust | Cap’n Proto (unpacked) | 1729 | 1276 |
| Rust | capnproto-rust | Cap’n Proto (packed) | 398 | 246 |
I upgraded to OS X Yosemite, so I think that brought these numbers down overall
from the last post.
As I said in the last post,
Rust’s serialize library, specifically serialize::json, is pretty slow.
Back when I started this project a number of months ago, I wanted to benchmark
it to see how we compared to some other languages. There are a bunch of JSON
benchmarks, but the one I chose was Cloudflare’s Go
Goser benchmark, mainly because it was using a
complex real world log structure, and they did the hard work of implementing
benchmarks for encoding/json,
goprotobuf,
gogoprotobuf, and
go-capnproto. I also included the
Go ffjson and C++
rapidjson, which
both claim to be the fastest JSON libraries for those languages. Here are the
results I got:
| language | library | format | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|--------|-----------------------|-------------------------|
| C++ | rapidjson | JSON | 294 | 164 (DOM) / 192 (SAX) |
| Go | encoding/json | JSON | 71.47 | 25.09 |
| Go | ffjson | JSON | 156.67 | (not supported) |
| Go | goprotobuf | Protocol Buffers | 148.78 | 99.57 |
| Go | gogoprotobuf | Protocol Buffers | 519.48 | 319.40 |
| Go | go-capnproto | Cap’n Proto | 3419.54 | 665.35 |
| Rust | serialize::json | JSON | 40-ish | 10-ish |
Notes:

- rapidjson supports both DOM-style and SAX-style deserializing. DOM-style
  means deserializing into a generic object, then from there into the final
  object; SAX-style means a callback approach, where a handler is called for
  each JSON token.
- Go’s encoding/json uses reflection to serialize arbitrary values. ffjson
  uses code generation to get its serialization speed, but it doesn’t implement
  deserialization.
- Both goprotobuf and gogoprotobuf use code generation, but gogoprotobuf
  uses Protocol Buffers’ extension support to do cheaper serialization.
- Cap’n Proto doesn’t really do serialization, but lays the serialized data out
  just like it is in memory, so it has nearly zero serialization cost.
- The Rust numbers are from a couple months ago and I couldn’t track down
  the exact numbers.
So. Yikes. Not only are we nowhere near rapidjson, we were being soundly
beaten by Go’s reflection-based framework encoding/json. Even worse, our
compile time was at least 10 times theirs. So, not pretty at all.
But that was a couple months ago. Between then and now, Patrick Walton, Luqman
Aden, myself, and probably lots others found and fixed a number of bugs across
serialize::json, std::io, generic function calls, and more. All this work
got us to more than double our performance:
| language | library | format | serialization (MB/s) | deserialization (MB/s) |
|----------|---------|--------|-----------------------|-------------------------|
| Rust | serialize::json | JSON | 117 | 25 |
We’re (kind of) beating Go! At least the builtin reflection-based solution.
Better, but not great. I think our challenge is those dang closures. While LLVM
can optimize simple closures, it seems to have a lot of trouble with all these
recursive closure calls. While finished unboxed closures might finally let us
break through this performance bottleneck, it’s not guaranteed.
All in all, this and the representational problems from
post 1 make it pretty
obvious that we’ve got some fundamental issues here and we need an alternative
solution. Next post I’ll start getting into the details of the design of
serde.
Hello everybody! It’s been, what, two years since I last blogged? Not my best
performance, I’m sorry to say. So to all of my 3 pageviews that are probably
bots, I apologize for such a long delay in updating my blog. I’ve got to say
I’ve been pretty inspired by the great Julia Evans (who I hope we
can someday get back to working on Rust stuff). She’s an epic blogger, and I
hope I can get somewhere near that speed.
Anyway, on to the post. My main on-again-off-again project this past year has
been working on Rust’s generic serialize
library. If you haven’t played with it yet, it’s really nifty. It’s a generic
framework that allows a generic Encoder to serialize a generic Encodable, and
the inverse with Decoder and Decodable. This allows you to write just one
Encodable impl that can transparently work with our
json library,
msgpack,
toml, and so on. It’s simple to use,
too: in most cases you can just add #[deriving(Encodable, Decodable)] to
automatically create an implementation for your type. Here’s an example:
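Reconstructing the kind of example the post used (the Employee struct matches
the handshake described below; the expanded impl is a sketch of what the
deriving generates):

```rust
extern crate serialize;

use serialize::json;

#[deriving(Encodable, Decodable, Show)]
struct Employee {
    name: String,
    age: uint,
}

fn main() {
    let employee = Employee {
        name: "Frank".to_string(),
        age: 349,
    };

    let s = json::encode(&employee);
    println!("{}", s);
}
```

Expanded by hand, the derived Encodable impl looks roughly like this:

```rust
impl<S: Encoder<E>, E> Encodable<S, E> for Employee {
    fn encode(&self, s: &mut S) -> Result<(), E> {
        // Each compound value is serialized by handing the Encoder a
        // closure, which it calls back into.
        s.emit_struct("Employee", 2, |s| {
            try!(s.emit_struct_field("name", 0, |s| self.name.encode(s)));
            s.emit_struct_field("age", 1, |s| self.age.encode(s))
        })
    }
}
```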
As you can see, parsing compound structures requires these recursive closure
calls in order to perform the handshake between the Encoder and the
Encodable. A couple people have run into bugs in the past where they didn’t
implement this pattern, which results in some confusing bugs. Furthermore, LLVM
isn’t great at inlining these recursive calls, so serialize impls tend to not
perform well.
That’s not the worst of it though. The real problem is that there are types
that can implement Encodable but for which there’s no way to write a Decodable
implementation. They’re pretty common too. For example, the
serialize::json::Json type:
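Which, at the time, looked roughly like this (from memory; the exact variant
set may differ):

```rust
pub enum Json {
    I64(i64),
    U64(u64),
    F64(f64),
    String(String),
    Boolean(bool),
    List(List),
    Object(Object),
    Null,
}

pub type List = Vec<Json>;
pub type Object = TreeMap<String, Json>;
```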
The Json value can represent any value that’s in a JSON string. Implied in
this is the notion that Decodable has to look ahead to see what the next
value is, so it can decide which Json variant to construct. Unfortunately our
current Decoder infrastructure doesn’t support lookahead. The way the
Decoder/Decodable handshake works is essentially:
1. Decodable asks for a struct named "Employee".
2. Decodable asks for a field named "name".
3. Decodable asks for a value of type String.
4. Decodable asks for a field named "age".
5. Decodable asks for a value of type uint.
6. …
Any deviation from this pattern results in an error. There isn’t a way for the
Decodable to ask what the type of the next value is, so this is why we
serialize generic enums by explicitly tagging the variant, as in:
```rust
extern crate serialize;

use serialize::json;

#[deriving(Encodable, Decodable, Show)]
enum Animal {
    Dog(uint),
    Frog(String, uint),
}

fn main() {
    let animal = Frog("Frank".to_string(), 349);

    let s = json::encode(&animal);
    println!("{}", s);
    // prints {"variant":"Frog","fields":["Frank",349]}
}
```
That’s probably good enough for now. In my next post I’ll go into in my
approach to fix this in serde.
I’ve written a bunch of simple parsers for Rust, and it’s starting to get a
little obnoxious. So I added a Rust backend to the
Ragel State Machine Compiler.
You can find my fork here. I’m waiting for
Rust to stabilize before I try to push it upstream.
Ragel is a rather neat way of writing simple parsers. In some ways it’s pretty
similar to Lex, but Ragel also allows you execute arbitrary code at any point
in the state machine. Furthermore, this arbitrary code can manipulate the state
machine itself, so it can be used in many places you’d traditionally need a
full parser, such as properly handling parentheses.
```
%%{
machine atoi;

action see_neg   { neg = true; }
action add_digit { res = res * 10 + (fc as int - '0' as int); }

main :=
    ( '-' @see_neg | '+' )? ( digit @add_digit )+
    '\n'?
;

write data;
}%%

fn atoi(data: ~str) -> option<int> {
    let mut neg = false;
    let mut res = 0;

    // Ragel assumes that it will be iterating over a value called data, but we
    // need to tell ragel where to start (p) and end (pe) parsing.
    let mut p = 0;
    let mut pe = data.len();

    // This is the current state in the state machine.
    let mut cs: int;

    write init;
    write exec;

    if neg { res = -1 * res; }

    // If we stopped before we hit one of the exit states, then there must have
    // been an error of some sort.
    if cs < atoi_first_final {
        none
    } else {
        some(res)
    }
}
```
While this is probably a bit more verbose than writing atoi by hand, it does
make the grammar pretty explicit, which can help keep it accurate.
Unfortunately there are some pretty severe performance issues at the moment.
Ragel supports two state machine styles, table-driven and goto-driven. My
backend uses tables, but since Rust doesn’t yet support global constant
vectors, I need to malloc the state machine table on every function call. This
results in the ragel-based url
parser being about
10 times slower than the equivalent table-based parser in OCaml. You can see
the generated code here.
The goto route could be promising to explore. We could simulate it using
mutually recursive function calls. OCaml does this. But again, since Rust
doesn’t support tailcalls (and may
never), we could run into a stack
explosion. It may work well for small grammars though, and maybe LLVM could
optimize calls into tailcalls.
Unless I’m doing something glaringly wrong, it seems likely that we are going
to need some compiler help before these performance issues get solved.