Edit: you can find all the code in this post here, and I filed #19281 for the
regression I mentioned at the end of the post.
Low-level benchmarking is confusing and non-intuitive.
The end.
Or not. Whatever. So I’m trying to get my
implement-Reader-and-Writer-for-&[u8] type PR
#18980 landed. But
Steven Fackler
obnoxiously and correctly pointed out that this won’t play that nicely with the
new Reader and Writer implementation for Vec<u8>. Grumble grumble. And then
Alex Crichton
had the gall to mention that a Writer for &mut [u8] also probably won’t be
that common either. Sure, he’s right and all, but I got it working without
needing an index! That means that the &mut [u8] Writer only needs two
pointers instead of BufWriter’s three, so it just has to be faster! Well,
doesn’t it?
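For reference, the writer in question is roughly this shape. This is a sketch rather than the exact #18980 code; the point is that it holds nothing but the slice, and advances by re-slicing:

```rust
use std::mem;
use std::slice;
use std::io::{IoResult, IoError, OtherIoError};

pub struct SliceWriter<'a> {
    dst: &'a mut [u8],
}

impl<'a> Writer for SliceWriter<'a> {
    #[inline]
    fn write(&mut self, src: &[u8]) -> IoResult<()> {
        if src.len() > self.dst.len() {
            return Err(IoError {
                kind: OtherIoError,
                desc: "Trying to write past end of buffer",
                detail: None
            })
        }

        // Take the slice out of `self` so we can re-slice it without
        // fighting the borrow checker, then put the remainder back.
        let dst = mem::replace(&mut self.dst, &mut []);
        let (head, tail) = dst.split_at_mut(src.len());
        slice::bytes::copy_memory(head, src);
        self.dst = tail;

        Ok(())
    }
}
```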
Stupid benchmarks.
I’ve got to say, it’s pretty addictive writing micro-benchmarks. It’s a lot of fun
seeing how sensitive low-level code can be to just the smallest of tweaks. It’s
also really annoying when you write something you think is pretty neat, then
you find it’s chock-full of false dependencies between cache lines, or other
mean things CPUs like to impose on poor programmers.
Anyway, to start, let’s look at what should be the fastest way to write to a
buffer: completely unsafely, with no checks.
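Something like this (a sketch of the idea rather than the exact harness; one raw, unchecked memcpy per write):

```rust
use std::ptr;

// The no-checks baseline: blindly memcpy `src` into `dst` once per batch,
// trusting the caller that `dst` really has room for all of it.
unsafe fn unchecked_writes(mut dst: *mut u8, src: &[u8], batches: uint) {
    for _ in range(0, batches) {
        ptr::copy_nonoverlapping_memory(dst, src.as_ptr(), src.len());
        dst = dst.offset(src.len() as int);
    }
}
```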
With SRC_LEN=4 and BATCHES=128, we get this. For fun I added the new
libtest from #19233 that will
hopefully land soon. I also ran variations that explicitly inlined the inner
function and ones that didn’t:
Crud. So not only did I add an implementation that’s probably not going to
work with write!, but it also turns out the performance is pretty terrible.
Inlining isn’t helping like it did in the unsafe case. So how does
std::io::BufWriter
compare?
```rust
pub struct BufWriter<'a> {
    buf: &'a mut [u8],
    pos: uint
}

impl<'a> Writer for BufWriter<'a> {
    #[inline]
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        // return an error if the entire write does not fit in the buffer
        let cap = if self.pos >= self.buf.len() {
            0
        } else {
            self.buf.len() - self.pos
        };

        if buf.len() > cap {
            return Err(IoError {
                kind: io::OtherIoError,
                desc: "Trying to write past end of buffer",
                detail: None
            })
        }

        slice::bytes::copy_memory(self.buf[mut self.pos..], buf);
        self.pos += buf.len();
        Ok(())
    }
}
```
That’s just cruel. The optimization gods obviously hate me. So I started
playing with a lot of
variations
(yeah, yeah, it’s my serialization benchmark suite; I’m planning on
making it more general purpose. Besides, it’s my suite and I can do whatever I
want with it, so there):
(BufWriter0): Turning this Writer into a struct wrapper shouldn’t do anything, and it
didn’t.
(BufWriter1): There’s error handling; does removing it help? Nope!
(BufWriter5): There’s an implied branch in let write_len = min(dst_len, src_len). We can
turn that into something more branch-predictor-friendly:
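I don’t remember the exact form it took, but it was something in this spirit: spell the min out so the common everything-fits case is the straight-line path (a sketch, with hypothetical local names):

```rust
// Instead of `let write_len = min(dst_len, src_len)`, write the comparison
// so the branch predictor sees "src fits entirely" as the usual, untaken branch.
let write_len = if src_len > dst_len { dst_len } else { src_len };
```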
(BufWriter2): Fine then, optimization gods! Let’s remove the branch altogether and just
always advance the slice src.len() bytes! Damn the safety! That, of course,
works. I can hear them giggle.
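Roughly like this, on the same SliceWriter shape sketched above (a sketch; a too-short dst now panics in the slicing rather than returning an Err):

```rust
// BufWriter2-style: no capacity check, no error path; just copy and always
// advance the slice by `src.len()` bytes.
#[inline]
fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
    let dst = mem::replace(&mut self.dst, &mut []);
    let (head, tail) = dst.split_at_mut(src.len());
    slice::bytes::copy_memory(head, src);
    self.dst = tail;
    Ok(())
}
```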
(BufWriter3): Maybe, just maybe there’s something weird going on with
inlining across crates? Let’s copy std::io::BufWriter and make sure that
it’s still nearly optimal. It still is.
(BufWriter6): Technically the min(dst_len, src_len) is a bounds check, so
we could switch from the bounds checked std::slice::bytes::copy_memory to
the unsafe std::ptr::copy_nonoverlapping_memory, but that also doesn’t
help.
(BufWriter7): Might as well apply the last trick to std::io::BufWriter,
and it does shave a couple of nanoseconds off. It might be worth pushing it
upstream:
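The change would look roughly like this (a sketch; since the cap check has already run, the unchecked copy can’t go out of bounds):

```rust
// BufWriter7-style: keep std::io::BufWriter's cap check, but do the
// already-validated copy with the unchecked memcpy.
#[inline]
fn write(&mut self, buf: &[u8]) -> IoResult<()> {
    let cap = if self.pos >= self.buf.len() {
        0
    } else {
        self.buf.len() - self.pos
    };

    if buf.len() > cap {
        return Err(IoError {
            kind: io::OtherIoError,
            desc: "Trying to write past end of buffer",
            detail: None
        })
    }

    unsafe {
        ptr::copy_nonoverlapping_memory(
            self.buf.as_mut_ptr().offset(self.pos as int),
            buf.as_ptr(),
            buf.len());
    }
    self.pos += buf.len();
    Ok(())
}
```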
(BufWriter4): While I’m using one less uint than std::io::BufWriter, I’m
doing two writes to advance my slice: one to advance the pointer, and one to
shrink the length. std::io::BufWriter only has to advance its pos index.
But instead of treating the slice as a (ptr, length) pair, we
can convert it into a (start_ptr, end_ptr) pair, where start_ptr = ptr and
end_ptr = ptr + length. This works! Ish:
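Roughly, with names of my own choosing (a sketch of the two-pointer bookkeeping):

```rust
// BufWriter4-style: track the window as two raw pointers, so advancing is a
// single pointer bump instead of a pointer move plus a length shrink.
pub struct PtrWriter {
    start: *mut u8,
    end: *mut u8,
}

impl PtrWriter {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let cap = self.end as uint - self.start as uint;
        if src.len() > cap {
            return Err(IoError {
                kind: io::OtherIoError,
                desc: "Trying to write past end of buffer",
                detail: None
            })
        }

        unsafe {
            ptr::copy_nonoverlapping_memory(self.start, src.as_ptr(), src.len());
            self.start = self.start.offset(src.len() as int);
        }
        Ok(())
    }
}
```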
But still nowhere near where we need to be. That suggests, though, that always
cutting down the src slice, which triggers another bounds check, has some
measurable impact. So maybe I should only shrink the src slice when we know it
needs to be shrunk?
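Something like this, again on the sketched SliceWriter shape (only-shrink-when-needed):

```rust
#[inline]
fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
    let dst = mem::replace(&mut self.dst, &mut []);

    // Only cut `src` down (and pay that bounds check) when it genuinely
    // doesn't fit; in the common case it passes through untouched.
    let src = if src.len() > dst.len() { src[..dst.len()] } else { src };

    let (head, tail) = dst.split_at_mut(src.len());
    slice::bytes::copy_memory(head, src);
    self.dst = tail;
    Ok(())
}
```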
At this point, both solutions are approximately just as fast as the unsafe
ptr::copy_nonoverlapping_memory! So that’s awesome. Now would anyone really
care enough about the extra uint? There may be a few very specialized cases
where that extra uint could cause a problem, but I’m not sure if it’s worth
it. What do you all think?
I thought that was good, but since I’m already here, how’s the new Vec<u8>
writer doing? Here’s the driver:
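It’s essentially just a Vec::push_all per write (a sketch; VecWriter0 is a name I’m using here, to match the VecWriter1 below):

```rust
struct VecWriter0<'a> {
    dst: &'a mut Vec<u8>,
}

impl<'a> MyWriter for VecWriter0<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        // All the real work happens inside `push_all`: reserve + copy.
        self.dst.push_all(src);
        Ok(())
    }
}
```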
Wow. That’s pretty terrible. Something weird must be going on with
Vec::push_all. (Maybe that’s what caused my serialization benchmarks to slow
down by a third?) Let’s skip it:
```rust
impl<'a> MyWriter for VecWriter1<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let src_len = src.len();

        self.dst.reserve(src_len);

        let dst = self.dst.as_mut_slice();

        unsafe {
            // we reserved enough room in `dst` to store `src`.
            ptr::copy_nonoverlapping_memory(
                dst.as_mut_ptr(),
                src.as_ptr(),
                src_len);
        }

        Ok(())
    }
}
```
There’s even less going on here than before. The only difference is that
reserve call. Commenting it out gets us back to copy_nonoverlapping_memory
territory:
Unfortunately it’s getting pretty late, so rather than wait until the next time
to dive into this, I’ll leave it up to you all. Does anyone know why reserve
is causing so much trouble here?
PS: While I was working on this, I saw that
stevencheg
submitted a patch to speed up the protocol buffer support. But when I ran the
tests, everything was about 40% slower than the last benchmark
post! Something
happened with Rust’s performance over these past couple weeks!