Chasing Rabbits

A poorly updated blog about what I’m working on

Rewriting Rust Serialization, Part 3.1: Another Performance Digression

Wow, home stretch! Here’s the rest of the series if you want to catch up: part 1, part 2, part 2.1, part 2.2, and part 3.

Overall, serde’s approach to serialization works out pretty well. One thing I forgot to include in the last post is that I also have two benchmarks that don’t use serde at all, but just safely read and write values. Assuming I haven’t missed anything, they should represent the upper limit on performance we can get out of any serialization framework. Here’s serialization:

language library serialization (MB/s)
rust max without string escapes 353
c++ rapidjson 304
rust max with string escapes 234
rust serde::json 201
rust serialize::json 147
go ffjson 147

So beyond optimizing string escaping, serde::json is only 14% slower than the zero-cost version and 34% slower than rapidjson.

Deserialization, on the other hand, still has a ways to go:

language library deserialization (MB/s)
c++ rapidjson (SAX) 189
c++ rapidjson (DOM) 162
rust max with Iterator<u8> 152
go ffjson 95
rust max with Reader 78
rust serde::json 73
rust serialize::json 24

There are a couple interesting things here:

First, serde::json is built upon consuming from an Iterator<u8>, so we’re about 52% slower than our theoretical max and 61% slower than rapidjson. It looks like tagged tokens, while faster than the closures in libserialize, are still pretty expensive.

Second, ffjson is beating us, and it compiles dramatically faster too. The goser test suite takes about 0.54 seconds to compile, whereas mine takes about 30 seconds at --opt-level=3 (!!). Rust itself only takes 1.5 seconds; the rest is spent in LLVM. With no optimization it compiles in “only” 5.6 seconds, but the result runs 96% slower.

Third, Reader is a surprisingly expensive trait when dealing with a format like JSON that needs to read a byte at a time. It turns out we’re not generating great code for types with padding. aatch has been working on fixing this, though.


Since I wrote that last section, I did a little more experimentation to try to figure out why our serialization upper bound is 23% slower than rapidjson. And, well, maybe I found it?

serialization (MB/s)
serde::json with a MyMemWriter 346
serde::json with a Vec 247

All I did with MyMemWriter is copy the Vec<u8> implementation of Writer into the local codebase:

struct MyMemWriter0 {
    buf: Vec<u8>,
}

impl MyMemWriter0 {
    pub fn with_capacity(cap: uint) -> MyMemWriter0 {
        MyMemWriter0 {
            buf: Vec::with_capacity(cap)
        }
    }
}

impl Writer for MyMemWriter0 {
    #[inline]
    fn write(&mut self, buf: &[u8]) -> io::IoResult<()> {
        self.buf.push_all(buf);
        Ok(())
    }
}

#[bench]
fn bench_serializer_my_mem_writer0(b: &mut Bencher) {
    let log = Log::new();
    let json = json::to_vec(&log);
    b.bytes = json.len() as u64;

    let mut wr = MyMemWriter0::with_capacity(1024);

    b.iter(|| {
        wr.buf.clear();

        let mut serializer = json::Serializer::new(wr.by_ref());
        log.serialize(&mut serializer).unwrap();
        let _json = serializer.unwrap();
    });
}

#[bench]
fn bench_serializer_vec(b: &mut Bencher) {
    let log = Log::new();
    let json = json::to_vec(&log);
    b.bytes = json.len() as u64;

    let mut wr = Vec::with_capacity(1024);

    b.iter(|| {
        wr.clear();

        let mut serializer = json::Serializer::new(wr.by_ref());
        log.serialize(&mut serializer).unwrap();
        let _json = serializer.unwrap();
    });
}

Somehow it’s not enough to just mark Vec::write as #[inline]; having it in the same file gave LLVM enough information to optimize its overhead away. Even using #[inline(always)] on Vec::write and Vec::push_all isn’t able to get the same increase, so I’m not sure how to replicate this in the general case.

Also interesting is bench_serializer_slice, which uses a BufWriter.

serialization (MB/s)
serde::json with a BufWriter 342
#[bench]
fn bench_serializer_slice(b: &mut Bencher) {
    let log = Log::new();
    let json = json::to_vec(&log);
    b.bytes = json.len() as u64;

    let mut buf = [0, .. 1024];

    b.iter(|| {
        for item in buf.iter_mut(){ *item = 0; }
        let mut wr = std::io::BufWriter::new(&mut buf);

        let mut serializer = json::Serializer::new(wr.by_ref());
        log.serialize(&mut serializer).unwrap();
        let _json = serializer.unwrap();
    });
}

Another digression. Since I wrote the above, aatch has put out some PRs (#19898 and #20060) that should help speed up enums; he was able to optimize the padding out of enums and fix an issue with returns generating bad code. In my bug from earlier, his patches sped up my benchmark returning a Result<(), IoError> from 40MB/s to 88MB/s. However, if we’re able to reduce IoError down to a word, we get the performance up to 730MB/s! We also might get enum compression, so a type like Result<(), IoError> would then speed up to 1200MB/s! I think going in this direction is really going to help speed things up.
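To make the size question concrete, here’s a quick way to check it (just a sketch; IoError’s exact size depends on the compiler version, and these aren’t numbers from my benchmarks):

use std::io::IoError;
use std::mem;

fn main() {
    // IoError carries a kind, a &'static str description, and an
    // Option<String> detail, so Result<(), IoError> ends up several
    // words wide, all of which have to be moved on every return.
    println!("IoError:             {}", mem::size_of::<IoError>());
    println!("Result<(), IoError>: {}", mem::size_of::<Result<(), IoError>>());
}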


That was taking a while, so until next time!

Rewriting Rust Serialization, Part 3: Introducing Serde

There’s been a long digression over the past month (possible kernel bugs, benchmarking Writers, and don’t believe in magic, folks), but I’m back to serialization. Woo! Here’s part 1, part 2, part 2.1, and part 2.2 if you need to catch up.

So libserialize has some pretty serious downsides. It’s slow, it’s got this weird recursive closure thing going on, and it can’t even represent enum types like serialize::json::Json. We need a new solution, and while I was at it, we ended up with two: serde and serde2. Both are different approaches to addressing these problems, the biggest being the type representation problem.

Serde Version 1

Deserialization

I want to start with deserialization, as that’s really the interesting bit. To repeat myself a little from part 1, here is a generic json Value enum:

pub enum Value {
    I64(i64),
    U64(u64),
    F64(f64),
    String(String),
    Boolean(bool),
    Array(Vec<Value>),
    Object(TreeMap<String, Value>),
    Null,
}

To deserialize a string like [1, true] into Array(vec![I64(1), Boolean(true)]), we need to peek one character ahead (ignoring whitespace) in order to discover the type of the next value. We can then use that knowledge to pick the right variant and parse the next value correctly. While I haven’t formally studied this stuff, I believe this can be stated more formally as: Value requires at least an LL(1) grammar, but since libserialize supports no lookahead, it can handle at most LL(0) grammars.
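As a concrete illustration, that single character of lookahead might look something like this (a sketch, not serde’s actual code; Parser and its peek_byte/parse_* helpers are hypothetical):

fn parse_value(p: &mut Parser) -> Result<Value, Error> {
    // peek at the next non-whitespace byte without consuming it
    match try!(p.peek_byte()) {
        b'[' => p.parse_array(),
        b'{' => p.parse_object(),
        b'"' => p.parse_string(),
        b't' | b'f' => p.parse_bool(),
        b'n' => p.parse_null(),
        _ => p.parse_number(),
    }
}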

Since I was thinking of this problem in terms of grammars, I wanted to take a page out of their book and implement generic deserialization in this style. A serde::de::Deserializer is then an Iterator<serde::de::Token> lexer that produces a token stream, and a serde::de::Deserialize is a parser that consumes this stream to produce a value. Here’s serde::de::Token, which can represent nearly all the Rust types:

pub enum Token {
    Null,
    Bool(bool),
    Int(int),
    I8(i8),
    I16(i16),
    I32(i32),
    I64(i64),
    Uint(uint),
    U8(u8),
    U16(u16),
    U32(u32),
    U64(u64),
    F32(f32),
    F64(f64),
    Char(char),
    Str(&'static str),
    String(String),

    Option(bool),     // true if the option has a value

    TupleStart(uint), // estimate of the number of values

    StructStart(
        &'static str, // the struct name
        uint,         // estimate of the number of (string, value) pairs
    ),

    EnumStart(
        &'static str, // the enum name
        &'static str, // the variant name
        uint          // estimate of the number of values
    ),

    SeqStart(uint), // number of values

    MapStart(uint), // number of (value, value) pairs

    End,
}

The serde::de::Deserializer stream must generate tokens that follow this grammar:

value ::= Null
        | Bool
        | Int
        | ...
        | option
        | tuple
        | struct
        | enum
        | sequence
        | map
        ;

option ::= Option value
         | Option
         ;

tuple ::= TupleStart value* End;

struct ::= StructStart (Str value)* End;

enum ::= EnumStart value* End;

sequence ::= SeqStart value* End;

map ::= MapStart (value value)* End;

For performance reasons, there is no separator in the compound grammar.
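For example, a value like Foo { a: 1u, b: vec![true] } could be represented by the following token stream (hypothetical, but consistent with the grammar above); note that the (Str value) pairs simply run together with nothing between them:

StructStart("Foo", 2)
    Str("a") Uint(1)
    Str("b") SeqStart(1)
        Bool(true)
    End
End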

Finishing up this section are the actual traits, Deserialize and Deserializer:

pub trait Deserialize<D: Deserializer<E>, E> {
    fn deserialize(d: &mut D) -> Result<Self, E> {
        let token = try!(d.expect_token());
        Deserialize::deserialize_token(d, token)
    }

    fn deserialize_token(d: &mut D, token: Token) -> Result<Self, E>;
}

pub trait Deserializer<E>: Iterator<Result<Token, E>> {
    /// Called when a `Deserialize` expected more tokens, but the
    /// `Deserializer` was empty.
    fn end_of_stream_error(&mut self) -> E;

    /// Called when a `Deserializer` was unable to properly parse the stream.
    fn syntax_error(&mut self, token: Token, expected: &'static [TokenKind]) -> E;

    /// Called when a named structure or enum got a name that it didn't expect.
    fn unexpected_name_error(&mut self, token: Token) -> E;

    /// Called when a value was unable to be coerced into another value.
    fn conversion_error(&mut self, token: Token) -> E;

    /// Called when a `Deserialize` structure did not deserialize a field
    /// named `field`.
    fn missing_field<
        T: Deserialize<Self, E>
    >(&mut self, field: &'static str) -> Result<T, E>;

    /// Called when a `Deserialize` has decided to not consume this token.
    fn ignore_field(&mut self, _token: Token) -> Result<(), E> {
        let _: IgnoreTokens = try!(Deserialize::deserialize(self));
        Ok(())
    }

    #[inline]
    fn expect_token(&mut self) -> Result<Token, E> {
        self.next().unwrap_or_else(|| Err(self.end_of_stream_error()))
    }

    ...
}

The Deserialize trait is kept pretty slim, and is how lookahead is implemented. Deserializer is an enhanced Iterator<Result<Token, E>> with many helpful default methods. Here they are in action. First we’ll start with what’s probably the simplest Deserializer, which just wraps a Vec<Token>:

enum Error {
    EndOfStream,
    SyntaxError(Vec<TokenKind>),
    UnexpectedName,
    ConversionError,
    MissingField(&'static str),
}

struct TokenDeserializer<Iter> {
    tokens: Iter,
}

impl<Iter: Iterator<Token>> TokenDeserializer<Iter> {
    fn new(tokens: Iter) -> TokenDeserializer<Iter> {
        TokenDeserializer {
            tokens: tokens,
        }
    }
}

impl<Iter: Iterator<Token>> Iterator<Result<Token, Error>> for TokenDeserializer<Iter> {
    fn next(&mut self) -> option::Option<Result<Token, Error>> {
        self.tokens.next().map(|token| Ok(token))
    }
}

impl<Iter: Iterator<Token>> Deserializer<Error> for TokenDeserializer<Iter> {
    fn end_of_stream_error(&mut self) -> Error {
        Error::EndOfStream
    }

    fn syntax_error(&mut self, _token: Token, expected: &[TokenKind]) -> Error {
        Error::SyntaxError(expected.to_vec())
    }

    fn unexpected_name_error(&mut self, _token: Token) -> Error {
        Error::UnexpectedName
    }

    fn conversion_error(&mut self, _token: Token) -> Error {
        Error::ConversionError
    }

    #[inline]
    fn missing_field<
        T: Deserialize<TokenDeserializer<Iter>, Error>
    >(&mut self, field: &'static str) -> Result<T, Error> {
        Err(Error::MissingField(field))
    }
}

Overall it should be pretty straightforward. As usual, error handling makes things a bit noisier, but hopefully it’s not too onerous. Next is the Deserialize impl for bool:

impl<D: Deserializer<E>, E> Deserialize<D, E> for bool {
    #[inline]
    fn deserialize_token(d: &mut D, token: Token) -> Result<bool, E> {
        d.expect_bool(token)
    }
}

pub trait Deserializer<E>: Iterator<Result<Token, E>> {
    ...

    #[inline]
    fn expect_bool(&mut self, token: Token) -> Result<bool, E> {
        match token {
            Token::Bool(value) => Ok(value),
            token => {
                static EXPECTED_TOKENS: &'static [TokenKind] = &[
                    TokenKind::BoolKind,
                ];
                Err(self.syntax_error(token, EXPECTED_TOKENS))
            }
        }
    }


    ...
}

Simple! Sequences are a bit more tricky. Here’s Deserialize for a Vec<T>. We use a helper adaptor, SeqDeserializer, to deserialize from all types that implement FromIterator:

impl<
    D: Deserializer<E>,
    E,
    T: Deserialize<D ,E>
> Deserialize<D, E> for Vec<T> {
    #[inline]
    fn deserialize_token(d: &mut D, token: Token) -> Result<Vec<T>, E> {
        d.expect_seq(token)
    }
}

pub trait Deserializer<E>: Iterator<Result<Token, E>> {
    ...

    #[inline]
    fn expect_seq<
        T: Deserialize<Self, E>,
        C: FromIterator<T>
    >(&mut self, token: Token) -> Result<C, E> {
        let len = try!(self.expect_seq_start(token));
        let mut err = None;

        let collection: C = {
            let d = SeqDeserializer {
                d: self,
                len: len,
                err: &mut err,
            };

            d.collect()
        };

        match err {
            Some(err) => Err(err),
            None => Ok(collection),
        }
    }

    ...
}

struct SeqDeserializer<'a, D: 'a, E: 'a> {
    d: &'a mut D,
    len: uint,
    err: &'a mut Option<E>,
}

impl<
    'a,
    D: Deserializer<E>,
    E,
    T: Deserialize<D, E>
> Iterator<T> for SeqDeserializer<'a, D, E> {
    #[inline]
    fn next(&mut self) -> option::Option<T> {
        match self.d.expect_seq_elt_or_end() {
            Ok(next) => {
                self.len -= 1;
                next
            }
            Err(err) => {
                *self.err = Some(err);
                None
            }
        }
    }

    #[inline]
    fn size_hint(&self) -> (uint, option::Option<uint>) {
        (self.len, Some(self.len))
    }
}

Last is a struct deserializer. This relies on a simple state machine in order to deserialize out-of-order maps:

struct Foo {
    a: (),
    b: uint,
    c: TreeMap<String, Option<char>>,
}

impl<
    D: Deserializer<E>,
    E
> Deserialize<D, E> for Foo {
    #[inline]
    fn deserialize_token(d: &mut D, token: Token) -> Result<Foo, E> {
        try!(d.expect_struct_start(token, "Foo"));

        let mut a = None;
        let mut b = None;
        let mut c = None;

        static FIELDS: &'static [&'static str] = &["a", "b", "c"];

        loop {
            let idx = match try!(d.expect_struct_field_or_end(FIELDS)) {
                Some(idx) => idx,
                None => { break; }
            };

            match idx {
                Some(0) => { a = Some(try!(d.expect_struct_value())); }
                Some(1) => { b = Some(try!(d.expect_struct_value())); }
                Some(2) => { c = Some(try!(d.expect_struct_value())); }
                Some(_) => unreachable!(),
                None => { let _: IgnoreTokens = try!(Deserialize::deserialize(d)); }
            }
        }

        Ok(Foo { a: a.unwrap(), b: b.unwrap(), c: c.unwrap() })
    }
}

It’s more complicated than libserialize’s struct parsing, but it performs much better because it can handle out-of-order maps without buffering tokens.
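As a usage sketch (hypothetical test code, assuming () deserializes from Null and uint from Uint), the TokenDeserializer from earlier can drive this impl even when the fields arrive out of order:

#[test]
fn test_deserialize_foo_out_of_order() {
    // "b" and "c" arrive before "a"
    let tokens = vec![
        Token::StructStart("Foo", 3),
        Token::Str("b"), Token::Uint(5),
        Token::Str("c"), Token::MapStart(0), Token::End,
        Token::Str("a"), Token::Null,
        Token::End,
    ];

    let mut d = TokenDeserializer::new(tokens.into_iter());
    let foo: Foo = Deserialize::deserialize(&mut d).unwrap();
    assert_eq!(foo.b, 5);
}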

Serialization

Serialization’s story is a much simpler one. Conceptually serde::ser::Serializer/serde::ser::Serialize are inspired by the deserialization story, but we don’t need the tagged tokens because we already know the types. Here are the traits:

pub trait Serialize<S: Serializer<E>, E> {
    fn serialize(&self, s: &mut S) -> Result<(), E>;
}

pub trait Serializer<E> {
    fn serialize_null(&mut self) -> Result<(), E>;

    fn serialize_bool(&mut self, v: bool) -> Result<(), E>;

    #[inline]
    fn serialize_int(&mut self, v: int) -> Result<(), E> {
        self.serialize_i64(v as i64)
    }

    #[inline]
    fn serialize_i8(&mut self, v: i8) -> Result<(), E> {
        self.serialize_i64(v as i64)
    }

    #[inline]
    fn serialize_i16(&mut self, v: i16) -> Result<(), E> {
        self.serialize_i64(v as i64)
    }

    #[inline]
    fn serialize_i32(&mut self, v: i32) -> Result<(), E> {
        self.serialize_i64(v as i64)
    }

    #[inline]
    fn serialize_i64(&mut self, v: i64) -> Result<(), E>;

    #[inline]
    fn serialize_uint(&mut self, v: uint) -> Result<(), E> {
        self.serialize_u64(v as u64)
    }

    #[inline]
    fn serialize_u8(&mut self, v: u8) -> Result<(), E> {
        self.serialize_u64(v as u64)
    }

    #[inline]
    fn serialize_u16(&mut self, v: u16) -> Result<(), E> {
        self.serialize_u64(v as u64)
    }

    #[inline]
    fn serialize_u32(&mut self, v: u32) -> Result<(), E> {
        self.serialize_u64(v as u64)
    }

    #[inline]
    fn serialize_u64(&mut self, v: u64) -> Result<(), E>;

    #[inline]
    fn serialize_f32(&mut self, v: f32) -> Result<(), E> {
        self.serialize_f64(v as f64)
    }

    fn serialize_f64(&mut self, v: f64) -> Result<(), E>;

    fn serialize_char(&mut self, v: char) -> Result<(), E>;

    fn serialize_str(&mut self, v: &str) -> Result<(), E>;

    fn serialize_tuple_start(&mut self, len: uint) -> Result<(), E>;
    fn serialize_tuple_elt<
        T: Serialize<Self, E>
    >(&mut self, v: &T) -> Result<(), E>;
    fn serialize_tuple_end(&mut self) -> Result<(), E>;

    fn serialize_struct_start(&mut self, name: &str, len: uint) -> Result<(), E>;
    fn serialize_struct_elt<
        T: Serialize<Self, E>
    >(&mut self, name: &str, v: &T) -> Result<(), E>;
    fn serialize_struct_end(&mut self) -> Result<(), E>;

    fn serialize_enum_start(&mut self, name: &str, variant: &str, len: uint) -> Result<(), E>;
    fn serialize_enum_elt<
        T: Serialize<Self, E>
    >(&mut self, v: &T) -> Result<(), E>;
    fn serialize_enum_end(&mut self) -> Result<(), E>;

    fn serialize_option<
        T: Serialize<Self, E>
    >(&mut self, v: &Option<T>) -> Result<(), E>;

    fn serialize_seq<
        T: Serialize<Self, E>,
        Iter: Iterator<T>
    >(&mut self, iter: Iter) -> Result<(), E>;

    fn serialize_map<
        K: Serialize<Self, E>,
        V: Serialize<Self, E>,
        Iter: Iterator<(K, V)>
    >(&mut self, iter: Iter) -> Result<(), E>;
}

There are many default methods, so only a handful of implementations need to be specified. Now let’s look at how they are used. Here’s a simple AssertSerializer that I use in my test suite to make sure I’m serializing properly:

struct AssertSerializer<Iter> {
    iter: Iter,
}

impl<'a, Iter: Iterator<Token<'a>>> AssertSerializer<Iter> {
    fn new(iter: Iter) -> AssertSerializer<Iter> {
        AssertSerializer {
            iter: iter,
        }
    }

    fn serialize<'b>(&mut self, token: Token<'b>) -> Result<(), Error> {
        let t = self.iter.next().unwrap();

        assert_eq!(t, token);

        Ok(())
    }
}

impl<'a, Iter: Iterator<Token<'a>>> Serializer<Error> for AssertSerializer<Iter> {
    fn serialize_null(&mut self) -> Result<(), Error> {
        self.serialize(Token::Null)
    }
    fn serialize_bool(&mut self, v: bool) -> Result<(), Error> {
        self.serialize(Token::Bool(v))
    }
    fn serialize_int(&mut self, v: int) -> Result<(), Error> {
        self.serialize(Token::Int(v))
    }
    ...
}

Implementing Serialize for values follows the same pattern. Here’s bool:

impl<S: Serializer<E>, E> Serialize<S, E> for bool {
    #[inline]
    fn serialize(&self, s: &mut S) -> Result<(), E> {
        s.serialize_bool(*self)
    }
}
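As a quick sketch of the two traits meeting in the middle (hypothetical test code, using the ser::Token type that AssertSerializer checks against):

#[test]
fn test_serialize_bool() {
    // we expect exactly one Bool token to be emitted
    let mut s = AssertSerializer::new(vec![Token::Bool(true)].into_iter());
    true.serialize(&mut s).unwrap();
}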

Vec<T>:

impl<
    S: Serializer<E>,
    E,
    T: Serialize<S, E>
> Serialize<S, E> for Vec<T> {
    #[inline]
    fn serialize(&self, s: &mut S) -> Result<(), E> {
        s.serialize_seq(self.iter())
    }
}

impl<'a, Iter: Iterator<Token<'a>>> Serializer<Error> for AssertSerializer<Iter> {
    ...

    fn serialize_seq<
        T: Serialize<AssertSerializer<Iter>, Error>,
        SeqIter: Iterator<T>
    >(&mut self, mut iter: SeqIter) -> Result<(), Error> {
        let (len, _) = iter.size_hint();
        try!(self.serialize(Token::SeqStart(len)));
        for elt in iter {
            try!(elt.serialize(self));
        }
        self.serialize(Token::SeqEnd)
    }

    ...
}

And structs:

struct Foo {
    a: (),
    b: uint,
    c: TreeMap<String, Option<char>>,
}

impl<
  S: Serializer<E>,
  E
> Serialize<S, E> for Foo {
    fn serialize(&self, s: &mut S) -> Result<(), E> {
        try!(s.serialize_struct_start("Foo", 3u));
        try!(s.serialize_struct_elt("a", &self.a));
        try!(s.serialize_struct_elt("b", &self.b));
        try!(s.serialize_struct_elt("c", &self.c));
        s.serialize_struct_end()
    }
}

Much simpler than deserialization.

Performance

So how does it perform? Here are the serialization benchmarks, in yet another ordering, this time sorted by performance:

language library format serialization (MB/s)
Rust capnproto-rust Cap’n Proto (unpacked) 4349
Go go-capnproto Cap’n Proto 3824.20
Rust bincode Binary 1020
Go gogoprotobuf Protocol Buffers 596.78
Rust capnproto-rust Cap’n Proto (packed) 583
Rust rust-msgpack MessagePack 397
Rust rust-protobuf Protocol Buffers 357
C++ rapidjson JSON 304
Rust serde::json JSON 222
Go goprotobuf Protocol Buffers 214.68
Go ffjson JSON 147.37
Rust serialize::json JSON 147
Go encoding/json JSON 80.49

serde::json is doing pretty well! It still has a ways to go to catch up to rapidjson, but it’s pretty cool that it’s beating goprotobuf out of the box :)

Here are the deserialization numbers:

language library format deserialization (MB/s)
Rust capnproto-rust Cap’n Proto (unpacked) 2185
Go go-capnproto Cap’n Proto (zero copy) 1407.95
Go go-capnproto Cap’n Proto 711.77
Rust capnproto-rust Cap’n Proto (packed) 351
Go gogoprotobuf Protocol Buffers 272.68
C++ rapidjson JSON (sax) 189
C++ rapidjson JSON (dom) 162
Rust rust-msgpack MessagePack 138
Rust rust-protobuf Protocol Buffers 129
Go ffjson JSON 95.06
Rust bincode Binary 80
Go goprotobuf Protocol Buffers 79.78
Rust serde::json JSON 67
Rust serialize::json JSON 24
Go encoding/json JSON 22.79

Well, on the plus side, serde::json is nearly 3 times faster than libserialize::json. On the downside, rapidjson is nearly 3 times faster than us with its SAX-style parsing. Even the newly added deserialization support in ffjson is 1.4 times faster than us. So we’ve got our work cut out for us!

Next time, serde2!

PS: I’m definitely getting close to the end of my story, and while I have some better numbers with serde2, nothing is quite putting me in the rapidjson range. Anyone want to help optimize serde? I would greatly appreciate the help!

PPS: I’ve gotten a number of requests for my serialization benchmarks to be ported over to other languages and libraries. Especially a C++ version of Cap’n Proto. Unfortunately I don’t really have the time to do it myself. Would anyone be up for helping to implement it?

comments on reddit

Using Rust to Make a Safer Interface for Yahoo’s Fast MDBM Database

I’m really supposed to be working on my serialization series, but I just saw on Hacker News a neat new library called MDBM that was open sourced by Yahoo a couple of days ago. I know nothing about the library, but there are some really neat claims:

  1. It’s supposed to be fast, and that’s always nice.
  2. It’s supposed to have a really slim interface.
  3. It’s so fast because it’s passing around naked pointers to mmapped files, which is terribly unsafe. Unless you’ve got Rust, which can prove that those pointers won’t escape :)

So I wanted to see how easy it’d be to make a Rust binding for the project. If you want to follow along, first make sure you have Rust installed. Unfortunately it looks like MDBM only supports Linux and FreeBSD, so I had to build out a Fedora VM to test this on. I think this is all you need to build it:

% git clone https://github.com/yahoo/mdbm
% cd mdbm/redhat
% make
% rpm -Uvh ~/rpmbuild/RPMS/x86_64/mdbm-4.11.1-1.fc21.x86_64.rpm
% rpm -Uvh ~/rpmbuild/RPMS/x86_64/mdbm-devel-4.11.1-1.fc21.x86_64.rpm

Unfortunately it’s only for Linux, and I’ve got a Mac, but it turns out there’s plenty I can do to prep while VirtualBox and Fedora 21 download. Let’s start out by creating our project with cargo:

% cargo new rust-mdbm
% cd rust-mdbm

(Right now there’s no way to have the package name be different from the path, so edit Cargo.toml to rename the project to mdbm. I filed #1030 to get that implemented.)
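For reference, the edited Cargo.toml ends up looking something like this (the version and authors values here are just illustrative):

[package]
name = "mdbm"
version = "0.0.1"
authors = ["erickt"]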

By convention, we put bindgen packages into $name-sys, so make that crate as well:

% cargo new --no-git mdbm-sys
% cd mdbm-sys

We’ve got a really cool tool called bindgen, which uses clang to parse header files and convert them into an unsafe Rust interface. So let’s check out MDBM and generate a crate to wrap it up in:

% cd ../..
% git clone git@github.com:crabtw/rust-bindgen.git
% cd rust-bindgen
% cargo build
% cd ..
% git clone git@github.com:yahoo/mdbm.git
% cd rust-mdbm/mdbm-sys
% DYLD_LIBRARY_PATH=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib \
  ~/rust/rust-bindgen/target/bindgen \
  -lmdbm \
  -o src/lib.rs \
  mdbm/include/mdbm.h

Pretty magical. Make sure it builds:

% cargo build
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3:21: 3:25 error: failed to resolve. Maybe a missing `extern crate libc`?
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3 pub type __int8_t = ::libc::c_char;
                                                                  ^~~~
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3:21: 3:35 error: use of undeclared type name `libc::c_char`
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:3 pub type __int8_t = ::libc::c_char;
                                                                  ^~~~~~~~~~~~~~
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:4:22: 4:26 error: failed to resolve. Maybe a missing `extern crate libc`?
/Users/erickt/rust-mdbm/mdbm-sys/src/lib.rs:4 pub type __uint8_t = ::libc::c_uchar;
                                                                   ^~~~
...

Nope! The problem is that we don’t have the libc crate imported. We don’t have a convention for this yet, but what I like to do is:

mv src/lib.rs src/ffi.rs

And create a src/lib.rs that contains:

#![allow(non_camel_case_types)]
#![allow(non_snake_case)]

extern crate libc;

type __builtin_va_list = libc::c_void;

include!("ffi.rs")

This lets me run bindgen later on without mucking up the library. This now compiles. Next up is our high-level interface. Add a dependency on mdbm-sys by adding this to the rust-mdbm/Cargo.toml file:

[dependencies.mdbm-sys]
path = "mdbm-sys"

By now I had my VirtualBox setup working, so on to the actual code! Let’s start with a barebones wrapper around the database:

pub struct MDBM {
    db: *mut mdbm_sys::MDBM,
}

Next are the constructor and destructor. I’m hardcoding things for now and using IoError, since MDBM appears to report all its errors through errno:

impl MDBM {
    pub fn new(
        path: &Path,
        flags: uint,
        mode: uint,
        psize: uint,
        presize: uint
    ) -> Result<MDBM, IoError> {
        unsafe {
            let path = path.to_c_str();
            let db = mdbm_sys::mdbm_open(
                path.as_ptr(),
                flags as libc::c_int,
                mode as libc::c_int,
                psize as libc::c_int,
                presize as libc::c_int);

            if db.is_null() {
                Err(IoError::last_error())
            } else {
                Ok(MDBM { db: db })
            }
        }
    }

    ...
}

impl Drop for MDBM {
    fn drop(&mut self) {
        unsafe {
            mdbm_sys::mdbm_sync(self.db);
            mdbm_sys::mdbm_close(self.db);
        }
    }
}

Pretty straightforward translation of the examples, with some hardcoded values to start out. Next up is a wrapper around MDBM’s datum type, which is used for both keys and values. datum is just a simple struct containing a pointer and length, pretty much analogous to our &[u8] slices. However, our slices are much more powerful because the type system can guarantee that in safe Rust, these slices can never outlive the value they are derived from:

pub struct Datum<'a> {
    bytes: &'a [u8],
}

impl<'a> Datum<'a> {
    pub fn new(bytes: &[u8]) -> Datum {
        Datum { bytes: bytes }
    }
}

fn to_raw_datum(datum: &Datum) -> mdbm_sys::datum {
    mdbm_sys::datum {
        dptr: datum.bytes.as_ptr() as *mut _,
        dsize: datum.bytes.len() as libc::c_int,
    }
}

And for convenience, let’s add an AsDatum conversion trait:

pub trait AsDatum for Sized? {
    fn as_datum<'a>(&'a self) -> Datum<'a>;
}

impl<'a, Sized? T: AsDatum> AsDatum for &'a T {
    fn as_datum<'a>(&'a self) -> Datum<'a> { (**self).as_datum() }
}

impl AsDatum for [u8] {
    fn as_datum<'a>(&'a self) -> Datum<'a> {
        Datum::new(self)
    }
}

impl AsDatum for str {
    fn as_datum<'a>(&'a self) -> Datum<'a> {
        self.as_bytes().as_datum()
    }
}

And finally, we get to setting and getting a key-value pair. Setting is pretty straightforward. The only fun thing is using the AsDatum constraints so we can do db.set(&"foo", &"bar", 0) instead of db.set(Datum::new("foo".as_bytes()), Datum::new("bar".as_bytes()), 0). Since we’re copying into the database, we don’t have to worry about lifetimes yet:

impl MDBM {
    ...

    /// Set a key.
    pub fn set<K, V>(&self, key: &K, value: &V, flags: int) -> Result<(), IoError> where
        K: AsDatum,
        V: AsDatum,
    {
        unsafe {
            let rc = mdbm_sys::mdbm_store(
                self.db,
                to_raw_datum(&key.as_datum()),
                to_raw_datum(&value.as_datum()),
                flags as libc::c_int);

            if rc == -1 {
                Err(IoError::last_error())
            } else {
                Ok(())
            }
        }
    }

    ...

MDBM requires the database to be locked in order to get the keys. This is where things get fun, in order to prevent those interior pointers from escaping. We’ll create another wrapper type that manages the lock, and uses RAII to unlock when we’re done. We tie the lifetime of the Lock to the lifetime of the database and key, which prevents it from outliving either object:

impl MDBM {
    ...

    /// Lock a key.
    pub fn lock<'a, K>(&'a self, key: &'a K, flags: int) -> Result<Lock<'a>, IoError> where
        K: AsDatum,
    {
        let rc = unsafe {
            mdbm_sys::mdbm_lock_smart(
                self.db,
                &to_raw_datum(&key.as_datum()),
                flags as libc::c_int)
        };

        if rc == 1 {
            Ok(Lock { db: self, key: key.as_datum() })
        } else {
            Err(IoError::last_error())
        }
    }

    ...
}

pub struct Lock<'a> {
    db: &'a MDBM,
    key: Datum<'a>,
}

#[unsafe_destructor]
impl<'a> Drop for Lock<'a> {
    fn drop(&mut self) {
        unsafe {
            let rc = mdbm_sys::mdbm_unlock_smart(
                self.db.db,
                &to_raw_datum(&self.key),
                0);

            assert_eq!(rc, 1);
        }
    }
}

(Note that I’ve heard #[unsafe_destructor], as used here, may become unnecessary in 1.0.)

Finally, let’s get our value! Assuming the value exists, we tie the lifetime of the returned &[u8] to the lifetime of the Lock:

impl<'a> Lock<'a> {
    /// Fetch a key.
    pub fn get<'a>(&'a self) -> Option<&'a [u8]> {
        unsafe {
            let value = mdbm_sys::mdbm_fetch(
                self.db.db,
                to_raw_datum(&self.key));

            if value.dptr.is_null() {
                None
            } else {
                // we want to constrain the ptr to our lifetime.
                let ptr: &*const u8 = mem::transmute(&value.dptr);
                Some(slice::from_raw_buf(ptr, value.dsize as uint))
            }
        }
    }
}

Now to verify it works:

#[test]
fn test_set_get() {
    let db = MDBM::new(
        &Path::new("test.db"),
        super::MDBM_O_RDWR | super::MDBM_O_CREAT,
        0o644,
        0,
        0
    ).unwrap();

    db.set(&"hello", &"world", 0).unwrap();

    {
        // key needs to be an lvalue so the lock can hold a reference to
        // it.
        let key = "hello";

        // Lock the key. RAII will unlock it when we exit this scope.
        let value = db.lock(&key, 0).unwrap();

        // Convert the value into a string. The lock is still live at this
        // point.
        let value = str::from_utf8(value.get().unwrap()).unwrap();
        assert_eq!(value, "world");
        println!("hello: {}", value);
    }
}

Which, when run with cargo test, produces:

   Compiling rust-mdbm v0.0.1 (file:///home/erickt/Projects/rust-mdbm)
     Running target/rust-mdbm-98b81ab156dc1e5f

running 1 test
test tests::test_set_get ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

   Doc-tests rust-mdbm

running 0 tests

test result: ok. 0 passed; 0 failed; 0 ignored; 0 measured

Next we want to make sure invalid behavior is a compile-time error. First, make sure we don’t leak the keys:

#[test]
fn test_keys_cannot_escape() {
    let db = MDBM::new(
        &Path::new("test.db"),
        super::MDBM_O_RDWR | super::MDBM_O_CREAT,
        0o644,
        0,
        0
    ).unwrap();

    db.set(&"hello", &"world", 0).unwrap();

    let _ = {
        let key = vec![1];
        db.lock(&key.as_slice(), 0).unwrap()
    };
}

Which errors with:

src/lib.rs:217:22: 217:25 error: `key` does not live long enough
src/lib.rs:217             db.lock(&key.as_slice(), 0).unwrap()
                                    ^~~
src/lib.rs:215:17: 218:10 note: reference must be valid for the expression at 215:16...
src/lib.rs:215         let _ = {
src/lib.rs:216             let key = vec![1];
src/lib.rs:217             db.lock(&key.as_slice(), 0).unwrap()
src/lib.rs:218         };
src/lib.rs:215:17: 218:10 note: ...but borrowed value is only valid for the block at 215:16
src/lib.rs:215         let _ = {
src/lib.rs:216             let key = vec![1];
src/lib.rs:217             db.lock(&key.as_slice(), 0).unwrap()
src/lib.rs:218         };
error: aborting due to previous error

And confirm the value doesn’t leak either:

#[test]
fn test_values_cannot_escape() {
    let db = MDBM::new(
        &Path::new("test.db"),
        super::MDBM_O_RDWR | super::MDBM_O_CREAT,
        0o644,
        0,
        0
    ).unwrap();

    let _ = {
        db.set(&"hello", &"world", 0).unwrap();

        let key = "hello";
        let value = db.lock(&key, 0).unwrap();
        str::from_utf8(value.get().unwrap()).unwrap()
    };
}

Which errors with:

src/lib.rs:237:34: 237:37 error: `key` does not live long enough
src/lib.rs:237             let value = db.lock(&key, 0).unwrap();
                                                ^~~
src/lib.rs:233:17: 239:10 note: reference must be valid for the expression at 233:16...
src/lib.rs:233         let _ = {
src/lib.rs:234             db.set(&"hello", &"world", 0).unwrap();
src/lib.rs:235
src/lib.rs:236             let key = "hello";
src/lib.rs:237             let value = db.lock(&key, 0).unwrap();
src/lib.rs:238             str::from_utf8(value.get().unwrap()).unwrap()
               ...
src/lib.rs:233:17: 239:10 note: ...but borrowed value is only valid for the block at 233:16
src/lib.rs:233         let _ = {
src/lib.rs:234             db.set(&"hello", &"world", 0).unwrap();
src/lib.rs:235
src/lib.rs:236             let key = "hello";
src/lib.rs:237             let value = db.lock(&key, 0).unwrap();
src/lib.rs:238             str::from_utf8(value.get().unwrap()).unwrap()
               ...
src/lib.rs:238:28: 238:33 error: `value` does not live long enough
src/lib.rs:238             str::from_utf8(value.get().unwrap()).unwrap()
                                          ^~~~~
src/lib.rs:233:17: 239:10 note: reference must be valid for the expression at 233:16...
src/lib.rs:233         let _ = {
src/lib.rs:234             db.set(&"hello", &"world", 0).unwrap();
src/lib.rs:235
src/lib.rs:236             let key = "hello";
src/lib.rs:237             let value = db.lock(&key, 0).unwrap();
src/lib.rs:238             str::from_utf8(value.get().unwrap()).unwrap()
               ...
src/lib.rs:233:17: 239:10 note: ...but borrowed value is only valid for the block at 233:16
src/lib.rs:233         let _ = {
src/lib.rs:234             db.set(&"hello", &"world", 0).unwrap();
src/lib.rs:235
src/lib.rs:236             let key = "hello";
src/lib.rs:237             let value = db.lock(&key, 0).unwrap();
src/lib.rs:238             str::from_utf8(value.get().unwrap()).unwrap()
                                                     ...

Success! Not too bad for two hours of work. Barring bugs, this mdbm library should perform at roughly the same speed as the C library, but eliminate many very painful bug opportunities that would otherwise require tools like Valgrind to debug.

Benchmarking Is Confusing in Low Level Rust

Edit: you can find all the code in this post here, and I filed #19281 for the regression I mentioned at the end of the post.


Low level benchmarking is confusing and non-intuitive.

The end.


Or not. Whatever. So I’m trying to get my implement-Reader-and-Writer-for-&[u8] PR #18980 landed. But Steven Fackler obnoxiously and correctly pointed out that this won’t play that nicely with the new Reader and Writer implementations for Vec<u8>. Grumble grumble. And then Alex Crichton had the gall to mention that a Writer for &mut [u8] also probably won’t be that common either. Sure, he’s right and all, but I got it working without needing an index! That means the &mut [u8] Writer only needs 2 pointers instead of BufWriter’s three, so it just has to be faster! Well, doesn’t it?

Stupid benchmarks.

I’ve got to say, writing micro-benchmarks is pretty addicting. It’s a lot of fun seeing how sensitive low-level code can be to just the smallest of tweaks. It’s also really annoying when you write something you think is pretty neat, then find it’s chock-full of false dependencies between cache lines, or other mean things CPUs like to impose on poor programmers.

Anyway, to start, let’s look at what should be the fastest way to write to a buffer: completely unsafely, with no checks.

unsafe fn do_copy_nonoverlapping_memory(
  mut dst: *mut u8,
  src: *const u8,
  len: uint,
  batches: uint
) {
    for _ in range(0, batches) {
        ptr::copy_nonoverlapping_memory(dst, src, len);
        dst = dst.offset(len as int);
    }
}

// benchmark parameters (values given in the text below)
const SRC_LEN: uint = 4;
const BATCHES: uint = 128;

#[test]
fn test_copy_nonoverlapping_memory() {
    let dst = &mut [0_u8, .. BATCHES * SRC_LEN];
    let src = &[1, .. SRC_LEN];

    unsafe {
        do_copy_nonoverlapping_memory(
            dst.as_mut_ptr(),
            src.as_ptr(),
            src.len(),
            BATCHES
        );
    }
    assert!(dst.iter().all(|c| *c == 1));
}

With SRC_LEN=4 and BATCHES=128, we get this. For fun I added the new libtest output from #19233, which will hopefully land soon. I also ran variations that explicitly inlined and didn’t inline the inner function:

test bench_copy_nonoverlapping_memory               ... bench: 50 |   [-***#**------]                        | 200:        72 ns/iter (+/- 45)
test bench_copy_nonoverlapping_memory_inline_always ... bench: 50 |       [---***#****************----------]| 100:        65 ns/iter (+/- 39)
test bench_copy_nonoverlapping_memory_inline_never  ... bench: 500 |      [------********#*********-------] | 1000:       747 ns/iter (+/- 393)

So overall it does quite well. Now let’s compare with the code I wrote:

impl<'a> Writer for &'a mut [u8] {
    #[inline]
    fn write(&mut self, src: &[u8]) -> IoResult<()> {
        if self.is_empty() { return Err(io::standard_error(io::EndOfFile)); }

        let dst_len = self.len();
        let src_len = src.len();

        let write_len = min(dst_len, src_len);

        slice::bytes::copy_memory(*self, src.slice_to(write_len));

        unsafe {
            *self = mem::transmute(raw::Slice {
                data: self.as_ptr().offset(write_len as int),
                len: dst_len - write_len,
            });
        }

        if src_len > dst_len {
            Err(io::standard_error(io::ShortWrite(write_len)))
        } else {
            Ok(())
        }
    }
}

And. Well. It didn’t do that well.

test writer::bench_slice_writer                             ... bench: 500 |   [------**#**--]                      | 2000:       920 ns/iter (+/- 448)
test writer::bench_slice_writer_inline_always               ... bench: 600 | [-**#*****---]                         | 2000:       711 ns/iter (+/- 405)
test writer::bench_slice_writer_inline_never                ... bench: 600 |   [-***#******---]                     | 2000:       838 ns/iter (+/- 474)

Wow. That’s pretty bad compared to the ideal.

Crud. So not only did I add an implementation that’s probably not going to work with write!, it also turns out the performance is pretty terrible. Inlining isn’t helping like it did in the unsafe case. So how does std::io::BufWriter compare?

pub struct BufWriter<'a> {
    buf: &'a mut [u8],
    pos: uint
}

impl<'a> Writer for BufWriter<'a> {
    #[inline]
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        // return an error if the entire write does not fit in the buffer
        let cap = if self.pos >= self.buf.len() { 0 } else { self.buf.len() - self.pos };
        if buf.len() > cap {
            return Err(IoError {
                kind: io::OtherIoError,
                desc: "Trying to write past end of buffer",
                detail: None
            })
        }

        slice::bytes::copy_memory(self.buf[mut self.pos..], buf);
        self.pos += buf.len();
        Ok(())
    }
}

Here’s how it does:

test writer::bench_std_buf_writer                           ... bench: 50 |     [------**************#******-------] | 100:        79 ns/iter (+/- 40)
test writer::bench_std_buf_writer_inline_always             ... bench: 50 |    [-----************#*********-------]  | 100:        75 ns/iter (+/- 40)
test writer::bench_std_buf_writer_inline_never              ... bench: 600 |   [#****-----]                         | 2000:       705 ns/iter (+/- 337)

That’s just cruel. The optimization gods obviously hate me. So I started playing with a lot of variations (they’re in my serialization benchmark suite; yeah, yeah, I’m planning on making it more general purpose. Besides, it’s my suite and I can do whatever I want with it, so there):

  • (BufWriter0): Turning this Writer into a struct wrapper shouldn’t do anything, and it didn’t.
  • (BufWriter1): There’s error handling, does removing it help? Nope!
  • (BufWriter5): There’s an implied branch in let write_len = min(dst_len, src_len). We can turn that into the branch-predictor-friendly:
let x = (dst_len < buf_len) as uint;
let write_len = dst_len * x + src_len * (1 - x);

Doesn’t matter, still performs the same.

  • (BufWriter2): Fine then, optimization gods! Let’s remove the branch altogether and just always advance the slice src.len() bytes! Damn the safety! That, of course, works. I can hear them giggle.
  • (BufWriter3): Maybe, just maybe, there’s something weird going on with inlining across crates? Let’s copy std::io::BufWriter and make sure that it’s still nearly optimal. It still is.
  • (BufWriter6): Technically the min(dst_len, src_len) is a bounds check, so we could switch from the bounds-checked std::slice::bytes::copy_memory to the unsafe std::ptr::copy_nonoverlapping_memory, but that also doesn’t help.
  • (BufWriter7): Might as well apply the last trick to std::io::BufWriter, and it does shave a couple nanoseconds off. It might be worth pushing it upstream:
test writer::bench_buf_writer_7                             ... bench: 50 |   [-*#********--------------------------]| 100:        55 ns/iter (+/- 44)
test writer::bench_buf_writer_7_inline_always               ... bench: 50 |     [---------********#*********-------] | 100:        76 ns/iter (+/- 39)
test writer::bench_buf_writer_7_inline_never                ... bench: 600 |   [-***#****----]                      | 2000:       828 ns/iter (+/- 417)
  • (BufWriter4): While I’m using one less uint than std::io::BufWriter, I’m doing two writes to advance my slice: one to advance the pointer, and one to shrink the length. std::io::BufWriter only has to advance its pos index. But in this case, instead of treating the slice as a (ptr, length), we can convert it into a (start_ptr, end_ptr), where start_ptr=ptr and end_ptr=ptr+length (see the sketch after this list). This works! Ish:
test writer::bench_buf_writer_4                             ... bench: 80 |   [--******#*******-----]                | 200:       109 ns/iter (+/- 59)
test writer::bench_buf_writer_4_inline_always               ... bench: 100 |     [------***#******---]               | 200:       133 ns/iter (+/- 44)
test writer::bench_buf_writer_4_inline_never                ... bench: 500 |      [-----***********#***********----]| 1000:       778 ns/iter (+/- 426)
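Here’s a minimal sketch of that (start_ptr, end_ptr) representation (the type and names are mine, not the actual BufWriter4 code):

use std::{cmp, ptr};

struct TwoPtrWriter {
    start: *mut u8, // next byte to write
    end: *mut u8,   // one past the last writable byte
}

impl TwoPtrWriter {
    #[inline]
    unsafe fn write(&mut self, src: &[u8]) {
        let cap = self.end as uint - self.start as uint;
        let write_len = cmp::min(cap, src.len());

        ptr::copy_nonoverlapping_memory(self.start, src.as_ptr(), write_len);

        // advancing is now a single pointer bump instead of a pointer
        // bump plus a length update
        self.start = self.start.offset(write_len as int);
    }
}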

I know when I’m defeated. Oh well. I guess I can at least update std::io::BufWriter to support the new error handling approach:

impl<'a> MyWriter for BufWriter10<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let dst_len = self.dst.len();

        if self.pos == dst_len { return Err(io::standard_error(io::EndOfFile)); }

        let src_len = src.len();
        let cap = dst_len - self.pos;

        let write_len = min(cap, src_len);

        slice::bytes::copy_memory(self.dst[mut self.pos..], src[..write_len]);

        if src_len > dst_len {
            return Err(io::standard_error(io::ShortWrite(write_len)));
        }

        self.pos += write_len;
        Ok(())
    }
}

How’s it do?

test writer::bench_buf_writer_10                            ... bench: 600 | [----**#***--]                         | 2000:       841 ns/iter (+/- 413)
test writer::bench_buf_writer_10_inline_always              ... bench: 600 |  [----**#***--]                        | 2000:       872 ns/iter (+/- 387)
test writer::bench_buf_writer_10_inline_never               ... bench: 600 |   [--******#**---]                     | 2000:       960 ns/iter (+/- 486)

Grumble grumble. It turns out that if we tweak the copy_memory line to:

    slice::bytes::copy_memory(self.dst[mut self.pos..], src);

It shaves 674 nanoseconds off the run:

test writer::bench_buf_writer_10                            ... bench: 200 |[---***#************------------]        | 400:       230 ns/iter (+/- 147)
test writer::bench_buf_writer_10_inline_always              ... bench: 200 |   [-----********#*******------]         | 400:       280 ns/iter (+/- 128)
test writer::bench_buf_writer_10_inline_never               ... bench: 600 |   [--***#****----]                     | 2000:       885 ns/iter (+/- 475)

But still nowhere near where we need to be. That suggests, though, that always cutting down the src, which triggers another bounds check, has some measurable impact. So maybe I should only shrink the src slice when we know it needs to be shrunk?

impl<'a> MyWriter for BufWriter11<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let dst = self.dst[mut self.pos..];
        let dst_len = dst.len();

        if dst_len == 0 {
            return Err(io::standard_error(io::EndOfFile));
        }

        let src_len = src.len();

        if dst_len >= src_len {
            unsafe {
                ptr::copy_nonoverlapping_memory(
                    dst.as_mut_ptr(),
                    src.as_ptr(),
                    src_len);
            }

              self.pos += src_len;

            Ok(())
        } else {
            unsafe {
                ptr::copy_nonoverlapping_memory(
                    dst.as_mut_ptr(),
                    src.as_ptr(),
                    dst_len);
            }

              self.pos += dst_len;

            Err(io::standard_error(io::ShortWrite(dst_len)))
        }
    }
}

Let’s see how it failed this time…

test writer::bench_buf_writer_11                            ... bench: 60 |      [------********#*********-----]     | 100:        79 ns/iter (+/- 28)
test writer::bench_buf_writer_11_inline_always              ... bench: 60 |[-------******#*************-----------]  | 100:        72 ns/iter (+/- 35)
test writer::bench_buf_writer_11_inline_never               ... bench: 600 |  [--***#***----]                       | 2000:       835 ns/iter (+/- 423)

No way. That actually worked?! That’s awesome! I’ll carve that out into another PR. Maybe it’ll work for my original version that doesn’t use a pos?

impl<'a> MyWriter for BufWriter12<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let dst_len = self.dst.len();

        if dst_len == 0 {
            return Err(io::standard_error(io::EndOfFile));
        }

        let src_len = src.len();

        if dst_len >= src_len {
            unsafe {
                ptr::copy_nonoverlapping_memory(
                    self.dst.as_mut_ptr(),
                    src.as_ptr(),
                    src_len);

                self.dst = mem::transmute(raw::Slice {
                    data: self.dst.as_ptr().offset(src_len as int),
                    len: dst_len - src_len,
                });
            }

            Ok(())
        } else {
            unsafe {
                ptr::copy_nonoverlapping_memory(
                    self.dst.as_mut_ptr(),
                    src.as_ptr(),
                    dst_len);

                self.dst = mem::transmute(raw::Slice {
                    data: self.dst.as_ptr().offset(dst_len as int),
                    len: 0,
                });
            }

            Err(io::standard_error(io::ShortWrite(dst_len)))
        }
    }
}

And yep, just as fast!

test writer::bench_buf_writer_12                            ... bench: 50 |[**#*----------------------------------]   | 80:        51 ns/iter (+/- 26)
test writer::bench_buf_writer_12_inline_always              ... bench: 50 |  [--------**********#********------------]| 90:        69 ns/iter (+/- 36)
test writer::bench_buf_writer_12_inline_never               ... bench: 800 |  [---**#***-]                          | 2000:      1000 ns/iter (+/- 263)

At this point, both solutions are approximately as fast as the unsafe ptr::copy_nonoverlapping_memory! So that’s awesome. Now, would anyone really care enough about the extra uint? There may be a few very specialized cases where that extra uint could cause a problem, but I’m not sure if it’s worth it. What do you all think?


I thought that was good, but since I’m already here, how’s the new Vec<u8> writer doing? Here’s the driver:

impl Writer for Vec<u8> {
    #[inline]
    fn write(&mut self, buf: &[u8]) -> IoResult<()> {
        self.push_all(buf);
        Ok(())
    }
}

#[bench]
fn bench_std_vec_writer(b: &mut test::Bencher) {
    let mut dst = Vec::with_capacity(BATCHES * SRC_LEN);
    let src = &[1, .. SRC_LEN];

    b.iter(|| {
        dst.clear();

        do_std_writes(&mut dst, src, BATCHES);
    })
}

And the results:

test writer::bench_std_vec_writer                           ... bench: 1000 | [----*****#*****--------]             | 2000:      1248 ns/iter (+/- 588)
test writer::bench_std_vec_writer_inline_always             ... bench: 900 |   [----*#***--]                        | 2000:      1125 ns/iter (+/- 282)
test writer::bench_std_vec_writer_inline_never              ... bench: 1000 |  [----***#*****--------]              | 2000:      1227 ns/iter (+/- 516)

Wow. That’s pretty terrible. Something weird must be going on with Vec::push_all. (Maybe that’s what caused my serialization benchmarks to slow down by a third?) Let’s skip it:

impl<'a> MyWriter for VecWriter1<'a> {
    #[inline]
    fn my_write(&mut self, src: &[u8]) -> IoResult<()> {
        let src_len = src.len();

        self.dst.reserve(src_len);

        let dst = self.dst.as_mut_slice();

        unsafe {
            // we reserved enough room in `dst` to store `src`.
            ptr::copy_nonoverlapping_memory(
                dst.as_mut_ptr(),
                src.as_ptr(),
                src_len);
        }

        Ok(())
    }
}

And it looks a bit better, but not perfect:

test writer::bench_vec_writer_1                             ... bench: 100 |         [------*********#*****--------] | 200:       160 ns/iter (+/- 68)
test writer::bench_vec_writer_1_inline_always               ... bench: 100 |     [--------****#**--]                 | 300:       182 ns/iter (+/- 79)
test writer::bench_vec_writer_1_inline_never                ... bench: 600 |   [---****#**--]                       | 2000:       952 ns/iter (+/- 399)

There’s even less going on here than before. The only difference is that reserve call. Commenting it out gets us back to copy_nonoverlapping_memory territory:

test writer::bench_vec_writer_1                             ... bench: 70 | [----------------*********#******-------]| 100:        89 ns/iter (+/- 27)
test writer::bench_vec_writer_1_inline_always               ... bench: 50 |   [-***#******--]                        | 200:        75 ns/iter (+/- 46)
test writer::bench_vec_writer_1_inline_never                ... bench: 500 |   [--***#***---]                       | 2000:       775 ns/iter (+/- 433)

Unfortunately it’s getting pretty late, so rather than wait until the next time to dive into this, I’ll leave it up to you all. Does anyone know why reserve is causing so much trouble here?


PS: While I was working on this, I saw stevencheg submitted a patch to speed up the protocol buffer support. But when I ran the tests, everything was about 40% slower than in the last benchmark post! Something happened to Rust’s performance over these past couple weeks!

test goser::bincode::bench_decoder                          ... bench:      7682 ns/iter (+/- 3680) = 52 MB/s
test goser::bincode::bench_encoder                          ... bench:       516 ns/iter (+/- 265) = 775 MB/s
test goser::bincode::bench_populate                         ... bench:      1504 ns/iter (+/- 324)
test goser::capnp::bench_deserialize                        ... bench:       251 ns/iter (+/- 140) = 1784 MB/s
test goser::capnp::bench_deserialize_packed_unbuffered      ... bench:      1344 ns/iter (+/- 533) = 250 MB/s
test goser::capnp::bench_populate                           ... bench:       663 ns/iter (+/- 236)
test goser::capnp::bench_serialize                          ... bench:       144 ns/iter (+/- 37) = 3111 MB/s
test goser::capnp::bench_serialize_packed_unbuffered        ... bench:       913 ns/iter (+/- 436) = 369 MB/s
test goser::msgpack::bench_decoder                          ... bench:      3411 ns/iter (+/- 1837) = 84 MB/s
test goser::msgpack::bench_encoder                          ... bench:       961 ns/iter (+/- 477) = 298 MB/s
test goser::msgpack::bench_populate                         ... bench:      1564 ns/iter (+/- 453)
test goser::protobuf::bench_decoder                         ... bench:      3116 ns/iter (+/- 1485) = 91 MB/s
test goser::protobuf::bench_encoder                         ... bench:      1220 ns/iter (+/- 482) = 234 MB/s
test goser::protobuf::bench_populate                        ... bench:       942 ns/iter (+/- 836)
test goser::serialize_json::bench_decoder                   ... bench:     31934 ns/iter (+/- 16186) = 18 MB/s
test goser::serialize_json::bench_encoder                   ... bench:      8481 ns/iter (+/- 3392) = 71 MB/s
test goser::serialize_json::bench_populate                  ... bench:      1471 ns/iter (+/- 426)

Chasing an EPROTOTYPE Through Rust, Sendto, and the OSX Kernel With C-Reduce

A slight diversion from my serialization series. Last week, strcat submitted pull request #18885, which adds support for using a Vec as a Writer. Over the weekend I submitted #18980, which allows a &[u8] to be used as a Reader. Overall a pretty simple change. However, when I was running the test suite, I found that the std::io::net::tcp::write_close_ip4() test was occasionally failing:

#[test]
fn write_close_ip4() {
    let addr = next_test_ip4();
    let mut acceptor = TcpListener::bind(addr).listen();

    spawn(proc() {
        let _stream = TcpStream::connect(addr);
        // Close
    });

    let mut stream = acceptor.accept();
    let buf = [0];
    loop {
        match stream.write(buf) {
            Ok(..) => {}
            Err(e) => {
                assert!(e.kind == ConnectionReset ||
                        e.kind == BrokenPipe ||
                        e.kind == ConnectionAborted,
                        "unknown error: {}", e);
                break;
            }
        }
    }
}

The write would succeed a few times, then occasionally fail with "unknown error: Protocol wrong type for socket", which is the EPROTOTYPE errno. This is a really surprising error. As far as I know, the only functions that return that errno are socket, socketpair, and connect. I searched everywhere, but I couldn’t find any documentation suggesting that write would ever produce that error.

I wasn’t the only one who ran into it. bjz opened #18900 describing the same problem. One interesting thing to note is that they were also on OSX Yosemite. So I took a little time to extract that test from the Rust test suite into this gist and got someone on #rust-internals to run it on Linux for me with this little driver script:

#!/bin/sh

set -e

rustc test.rs

for i in $(seq 1 1000); do
  ./test tcp::test::write_close_ip4
done

and it didn’t error out, so the problem seems to be system dependent. Further experimentation showed that introducing sleeps or a mutex synchronization appeared to fix the problem as well. At this point I wasn’t sure if this was a non-issue, a bug in our code, or a bug in the OS. One thing’s for sure though: if there is a bug, it could be scattered anywhere across the Rust codebase, and std::io alone is 20 files at around 12,522 lines. It’d be a pain to cut that down to a self-contained test case.

Fortunately we’ve got C-Reduce to help us out. Back in May, Professor John Regehr from the University of Utah came to our Bay Area Rust meetup to talk about compiler testing and fuzzing. We recorded the talk, so if you want to watch it, you can find it here. One of the things he talked about was C-Reduce, a tool his research group developed to automatically cut out the lines of code that aren’t needed to reproduce a bug you’re looking for. While it’s written to target C files, it turns out Rust is syntactically close enough to C that it works pretty well for Rust too. All you need is a single source file and a driver script that reports whether the compiled source file reproduced the bug.

Aside 1: By the way, one of the other things I did this weekend was put together a Homebrew pull request for C-Reduce. It hasn’t landed yet, but if you want to use it, you can do:

% brew install https://raw.githubusercontent.com/erickt/homebrew/delta-and-creduce/Library/Formula/delta.rb
% brew install https://raw.githubusercontent.com/erickt/homebrew/delta-and-creduce/Library/Formula/creduce.rb

Hopefully it’ll land soon so you’ll be able to do:

% brew install creduce

Anyway, back to the story. So we’ve got a rather large code base to cover, and while C-Reduce does a pretty good job of trimming away lines, just pointing it at the entire std module is a bit too much for it to handle in a reasonable amount of time. It probably needs some more semantic information about Rust to do a better job of cutting out broad swaths of code.

So I needed to do at least a rough pass to slim down the files. I figured the problem was probably contained in std::io or std::sys, so I wrote a simple test.rs that explicitly included those modules as part of the crate (see this gist if you are interested), and used the pretty printer to merge it into one file:

% rustc --pretty normal test.rs > std.rs
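
The test.rs skeleton itself was just a few module declarations pointing at copies of the std sources, so the pretty printer had something to flatten. Roughly like this (an illustrative reconstruction; the real file is in the gist above, and the paths here are made up):

// Compile local copies of the suspect std modules as part of this crate;
// the #[path] attributes point at a checkout of the Rust source tree.
#[path = "rust/src/libstd/io/mod.rs"]
pub mod io;

#[path = "rust/src/libstd/sys/mod.rs"]
pub mod sys;

fn main() {}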

Aside 2: Our pretty printer still has some bugs in it, which I filed: 19075 and 19077. Fortunately it was pretty simple to fix those cases by hand in the generated std.rs, and odds are good they’ll be fixed by the time you might use this process.

Next up, we need a driver script. We can adapt the one from before. The only special consideration is that we need to exit with a return code of 0 only if the version of the source we’re compiling errors with EPROTOTYPE:

#!/bin/sh

rustc \
  -A unused_imports \
  -A unused_unsafe \
  -A non_camel_case_types \
  -A unused_variables \
  --test \
  -o test \
  test.rs >/dev/null 2>&1
if [ "$?" -ne "0" ]; then
  exit 1
fi

root=`dirname $0`

for i in $(seq 1 1000); do
  $root/timeout.sh 5 ./test --nocapture tcp::test::write_close_ip4 2>&1 | tee log
  grep "Protocol wrong type for socket" log
  RET="$?"
  if [ "$RET" -eq "0" ]; then
          exit 0
  elif [ "$RET" -eq "143" ]; then
          exit 1
  fi
  echo
done

exit 1

I used the helper script timeout.sh to time out tests in case C-Reduce accidentally created an infinite loop.

Finally we’re ready to start running C-Reduce:

% creduce ./driver.sh test.rs
===< 30598 >===
running 8 interestingness test(s) in parallel
===< pass_blank :: 0 >===
(0.0 %, 156170 bytes)
===< pass_clang_binsrch :: replace-function-def-with-decl >===
===< pass_clang_binsrch :: remove-unused-function >===
===< pass_lines :: 0 >===
(-0.6 %, 157231 bytes)
(1.2 %, 154323 bytes)
(3.7 %, 150455 bytes)
(4.6 %, 149074 bytes)
(5.5 %, 147639 bytes)
(6.4 %, 146172 bytes)
(7.3 %, 144881 bytes)
(7.7 %, 144187 bytes)
(9.1 %, 141936 bytes)
(9.9 %, 140706 bytes)
(10.3 %, 140104 bytes)
(10.4 %, 139998 bytes)
...

I let it run in the background for some time while I did some other things. When I got back, C-Reduce had automatically reduced the file from 153KB to a slim 22KB. I then reran rustc with the lints enabled to manually cut out the dead code C-Reduce failed to remove, and flattened away some unnecessary structs and methods. I was finally left with:

extern crate libc;

use std::io::net::ip::Ipv4Addr;
use std::io::net::ip::SocketAddr;
use std::io::net::ip::ToSocketAddr;
use std::io::net::ip;
use std::io::test::next_test_ip4;
use std::mem;
use std::num::Int;
use std::os::errno;
use std::os::error_string;
use std::ptr;

fn bind(addr: ip::SocketAddr) -> libc::c_int {
    unsafe {
        let fd = match libc::socket(libc::AF_INET, libc::SOCK_STREAM, 0) {
            -1 => panic!(),
            fd => fd,
        };

        let mut storage = mem::zeroed();
        let len = addr_to_sockaddr(addr, &mut storage);
        let addrp = &storage as *const _ as *const libc::sockaddr;
        match libc::bind(fd, addrp, len) {
            -1 => panic!(),
            _ => fd,
        }
    }
}

fn listen(fd: libc::c_int, backlog: int) {
    unsafe {
        match libc::listen(fd, backlog as libc::c_int) {
            -1 => panic!(),
            _ => {}
        }
    }
}

fn accept(fd: libc::c_int) -> libc::c_int {
    unsafe {
        let x = libc::accept(fd, ptr::null_mut(), ptr::null_mut());
        match x {
            -1 => panic!(),
            fd => fd,
        }
    }
}

fn connect(addr: SocketAddr) -> libc::c_int {
    unsafe {
        let fd = match libc::socket(libc::AF_INET, libc::SOCK_STREAM, 0) {
            -1 => panic!(),
            fd => fd,
        };

        let mut storage = mem::zeroed();
        let len = addr_to_sockaddr(addr, &mut storage);
        let addrp = &storage as *const _ as *const libc::sockaddr;
        let x = libc::connect(fd, addrp, len);
        fd
    }
}

fn write(fd: libc::c_int, buf: &[u8]) -> Result<(), uint> {
    unsafe {
        let len = buf.len();
        let ret = libc::send(fd, buf.as_ptr() as *const _, len as libc::size_t, 0) as i64;
        if ret < 0 {
            Err(errno())
        } else {
            Ok(())
        }
    }
}

fn close(fd: libc::c_int) {
    unsafe {
        let x = libc::close(fd);
        assert_eq!(x, 0);
    }
}

fn addr_to_sockaddr(addr: SocketAddr, storage: &mut libc::sockaddr_storage) -> libc::socklen_t {
    let inaddr = match addr.ip {
        Ipv4Addr(a, b, c, d) => {
            let ip = a as u32 << 24 | b as u32 << 16 | c as u32 << 8 | d as u32 << 0;
            libc::in_addr {
                s_addr: Int::from_be(ip),
            }
        }
        _ => panic!(),
    };
    unsafe {
        let storage = storage as *mut _ as *mut libc::sockaddr_in;
        (*storage).sin_family = libc::AF_INET as libc::sa_family_t;
        (*storage).sin_port = addr.port.to_be();
        (*storage).sin_addr = inaddr;
        let len = mem::size_of::<libc::sockaddr_in>();
        len as libc::socklen_t
    }
}

fn main() {
    let addr = next_test_ip4();

    let listener = bind(addr);
    listen(listener, 128);

    spawn(proc() {
        let addresses = addr.to_socket_addr_all().unwrap();
        for addr in addresses.into_iter() {
            let fd = connect(addr);
            close(fd);
            return;
        }
    });

    let stream = accept(listener);
    loop {
        let x = write(stream, [0]);
        match x {
            Ok(..) => { }
            Err(e) => {
                let e = e as i32;
                assert!(
                    e == libc::ECONNREFUSED ||
                    e == libc::EPIPE ||
                    e == libc::ECONNABORTED,
                    "unknown error: {} {}", e, error_string(e as uint));
                break;
            }
        }
    }

    close(stream);
    close(listener);
}

This snippet reproduces the same EPROTOTYPE that we started with at the top of the post. Pretty cool that we got here without much effort, huh?

At this point you might ask yourself: couldn’t I have extracted this out myself? And yeah, you would be right. This is pretty much a C-in-Rust implementation of the bug. But what’s great about using C-Reduce here is that I only had to make some very rough guesses about which files were and were not important to include in my test.rs. Eventually, once we get some Rust plugins written for C-Reduce, I could probably just point it at the whole libstd and let C-Reduce do its thing. Doing this by hand can be a long and painful manual process, especially when debugging a codegen or runtime bug. In the past I’ve spent hours reducing some codegen bugs down to a small snippet that C-Reduce was able to produce in a couple minutes.

The last step was to eliminate Rust from the picture entirely and translate this code into C:

#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <strings.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <pthread.h>
#include <errno.h>
#include <signal.h>   /* for signal()/SIG_IGN */

int do_server() {
  int fd;
  struct sockaddr_in server_addr;

  fd = socket(AF_INET, SOCK_STREAM, 0);

  if (fd == -1) {
    perror("error socket server");
    exit(1);
  }

  bzero((char*) &server_addr, sizeof(server_addr));
  server_addr.sin_family = AF_INET;
  server_addr.sin_addr.s_addr = INADDR_ANY;
  server_addr.sin_port = htons(9600);

  if (bind(fd, (struct sockaddr *) &server_addr, sizeof(server_addr)) < 0) {
    perror("error binding");
    exit(1);
  }

  return fd;
}

void* do_child_thread(void* unused) {
  struct sockaddr_in client_addr;
  int fd;

  fd = socket(AF_INET, SOCK_STREAM, 0);

  if (fd == -1) {
    perror("error socket client");
    exit(1);
  }

  bzero((char*) &client_addr, sizeof(client_addr));
  client_addr.sin_family = AF_INET;
  client_addr.sin_addr.s_addr = INADDR_ANY;
  client_addr.sin_port = htons(9600);

  if (connect(fd, (struct sockaddr *) &client_addr, sizeof(client_addr)) < 0) {
    perror("error connect");
    exit(1);
  }

  fprintf(stderr, "closing client socket\n");

  if (close(fd) < 0) {
    perror("error close client socket");
    exit(1);
  }

  fprintf(stderr, "closed client socket\n");

  return NULL;
}

int main(int argc, char** argv) {
  int server_fd, client_fd;
  socklen_t client_len;
  struct sockaddr_in client_addr;
  char buf[] = { 'a', '\n' };
  pthread_t child_thread;
  int rc;

  signal(SIGPIPE, SIG_IGN);

  server_fd = do_server();
  rc = listen(server_fd, 5);
  if (rc < 0) {
    perror("error listen");
    return 1;
  }

  rc = pthread_create(&child_thread, NULL, do_child_thread, NULL);
  if (rc != 0) {
    perror("error pthread_create");
    return 1;
  }

  client_len = sizeof(client_addr);
  client_fd = accept(server_fd, (struct sockaddr *) &client_addr, &client_len);
  if (client_fd < 0) {
    perror("error accept");
    return 1;
  }

  while (1) {
    fprintf(stderr, "before send\n");
    rc = send(client_fd, buf, sizeof(buf), 0);
    fprintf(stderr, "after send: %d\n", rc);

    if (rc < 0) {
      if (errno == EPIPE) {
        break;
      } else {
        int so_type;
        socklen_t so_len = sizeof(so_type);
        getsockopt(client_fd, SOL_SOCKET, SO_TYPE, &so_type, &so_len);
        fprintf(stderr, "type: %d %d\n", so_type, SOCK_STREAM);

        perror("error send");
        return 1;
      }
    }
  }

  fprintf(stderr, "before server closing client fd\n");
  if (close(client_fd) < 0) {
    perror("error close client");
    return 1;
  }
  fprintf(stderr, "after server closing client fd\n");


  fprintf(stderr, "before server closing fd\n");
  if (close(server_fd) < 0) {
    perror("error close server");
    return 1;
  }
  fprintf(stderr, "after server closing fd\n");

  rc = pthread_join(child_thread, NULL);
  if (rc != 0 && rc != ESRCH) {
    fprintf(stderr, "error pthread_join: %d\n", rc);
    return 1;
  }

  return 0;
}

This also produces EPROTOTYPE, so we can eliminate Rust altogether. But let’s keep digging. What exactly is producing this error? If I were on Linux, I’d use strace, but that program isn’t on Macs. There’s a similar tool called dtruss, but that seemed to slow things down enough that the EPROTOTYPE never happened. Fortunately there is another program called errinfo that just prints the errno along with every syscall. In one terminal I ran while ./test; do sleep 0.1; done. In the other:

% sudo errinfo -n test
...
           a.out           stat64    0
           a.out             open    0
           a.out psynch_mutexwait    0
           a.out   write_nocancel    0
           a.out           sendto    0
           a.out   write_nocancel    0
           a.out   write_nocancel    0
           a.out           sendto   41  Protocol wrong type for socket
           a.out   write_nocancel    0
           a.out       getsockopt    0
...

Right there we see our sendto syscall is actually returning EPROTOTYPE. So this errno is definitely being created inside the OSX kernel, not in any userspace code. Fortunately, most of the Apple kernel, XNU, is open source, so we can dig down into what will be my last layer. You can find the tarballs at http://www.opensource.apple.com/, but I’d rather use the unofficial GitHub repository. Using GitHub’s search tools, we can find all 17 instances of EPROTOTYPE in the codebase. Now, I don’t know the XNU codebase, but there are still some really interesting things we can find. The first is in bsd/kern/uipc_usrreq.c:

/*
 * Returns:  0      Success
 *    EINVAL
 *    EOPNOTSUPP
 *    EPIPE
 *    ENOTCONN
 *    EISCONN
 *  unp_internalize:EINVAL
 *  unp_internalize:EBADF
 *  unp_connect:EAFNOSUPPORT  Address family not supported
 *  unp_connect:EINVAL        Invalid argument
 *  unp_connect:ENOTSOCK      Not a socket
 *  unp_connect:ECONNREFUSED  Connection refused
 *  unp_connect:EISCONN       Socket is connected
 *  unp_connect:EPROTOTYPE    Protocol wrong type for socket
 *  unp_connect:???
 *  sbappendaddr:ENOBUFS      [5th argument, contents modified]
 *  sbappendaddr:???          [whatever a filter author chooses]
 */
static int
uipc_send(struct socket *so, int flags, struct mbuf *m, struct sockaddr *nam,
    struct mbuf *control, proc_t p)
{
...

Hey look at that! There’s a handler for the send syscall (although for IPC, not TCP) that actually documents that it can return EPROTOTYPE! While it doesn’t explain exactly how this can happen, the fact that it mentions unp_connect hints that uipc_send may trigger a connect, and that’s exactly what we find a couple of lines into the function:

...
    /* Connect if not connected yet. */
    /*
     * Note: A better implementation would complain
     * if not equal to the peer's address.
     */
    if ((so->so_state & SS_ISCONNECTED) == 0) {
      if (nam) {
        error = unp_connect(so, nam, p);
        if (error)
          break;  /* XXX */
      } else {
        error = ENOTCONN;
        break;
      }
    }
...

The fact that the comment says the socket might not be connected yet when we’re doing a send hints that Apple may have introduced some level of asynchrony and preemption to sockets. So if we trigger the actual connect here, it could then return EPROTOTYPE, which makes sense. Unfortunately that’s still not quite the behavior we’re seeing: we’re not getting EPROTOTYPE on our first write, but only after we’ve done a few.

I believe we find that behavior in the actual TCP syscall file, bsd/netinet/tcp_usrreq.c:

static int
tcp_usr_send(struct socket *so, int flags, struct mbuf *m,
     struct sockaddr *nam, struct mbuf *control, struct proc *p)
{
  int error = 0;
  struct inpcb *inp = sotoinpcb(so);
  struct tcpcb *tp;
  uint32_t msgpri = MSG_PRI_DEFAULT;
#if INET6
  int isipv6;
#endif
  TCPDEBUG0;

  if (inp == NULL || inp->inp_state == INPCB_STATE_DEAD
#if NECP
    || (necp_socket_should_use_flow_divert(inp))
#endif /* NECP */
    ) {
    /*
     * OOPS! we lost a race, the TCP session got reset after
     * we checked SS_CANTSENDMORE, eg: while doing uiomove or a
     * network interrupt in the non-splnet() section of sosend().
     */
    if (m != NULL)
      m_freem(m);
    if (control != NULL) {
      m_freem(control);
      control = NULL;
    }

    if (inp == NULL)
      error = ECONNRESET;  /* XXX EPIPE? */
    else
      error = EPROTOTYPE;
    tp = NULL;
    TCPDEBUG1();
    goto out;
  }
...

I believe that comment explains everything we’re seeing. If we trigger a send while the kernel is in the middle of tearing down the socket, it returns EPROTOTYPE. This then looks to be an error we could retry. Once the socket is fully torn down, it should eventually return the proper EPIPE. This is also pretty easy to test. So I modified the inner loop of our C test:

  while (1) {
    fprintf(stderr, "before send\n");
    rc = send(client_fd, buf, sizeof(buf), 0);
    fprintf(stderr, "after send: %d\n", rc);

    if (rc < 0) {
      if (errno == EPIPE) {
        break;
      } else if (errno == EPROTOTYPE) {
        continue;
      } else {
        int so_type;
        socklen_t so_len = sizeof(so_type);
        getsockopt(client_fd, SOL_SOCKET, SO_TYPE, &so_type, &so_len);
        fprintf(stderr, "type: %d %d\n", so_type, SOCK_STREAM);

        perror("error send");
        return 1;
      }
    }
  }

And yep, it exits cleanly. After all of this, I think it’s pretty clear that there’s no weird kernel corruption bug going on, just a poorly documented edge case. But it sure was fun chasing this condition through the system.
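
Translated back into the inner loop of the reduced Rust test from earlier, the same workaround would look something like this (assuming the libc bindings expose EPROTOTYPE, which I haven’t verified):

loop {
    match write(stream, [0]) {
        Ok(..) => { }
        Err(e) => {
            let e = e as i32;
            // A socket the kernel is still tearing down can spuriously
            // report EPROTOTYPE; retry until it settles on the real error.
            if e == libc::EPROTOTYPE {
                continue;
            }
            assert!(
                e == libc::ECONNREFUSED ||
                e == libc::EPIPE ||
                e == libc::ECONNABORTED,
                "unknown error: {} {}", e, error_string(e as uint));
            break;
        }
    }
}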

To prevent anyone else from tripping over this edge case, I filed an Apple Radar ticket (number 19012087, for any Apple employees reading this). Hopefully if anyone else runs into this mysterious EPROTOTYPE it’ll be documented for them, or at least there’s a chance they’ll stumble onto this blog post and save themselves a weekend of diving through the OS.

Rewriting Rust Serialization, Part 2.2: More Benchmarks

Back to the benchmarks! I got some great comments on reddit, so I wanted to do another post to update my numbers. Here’s what I changed:

  • I wasn’t consistent about whether the serialization benchmarks included the allocation of a buffer to write into. I’ve changed it so most of them now reuse one, which speeds everything up (especially capnproto-rust!). This does depend on #18885 landing, though.
  • I’ve added bincode, which serializes values as raw bytes. Quite speedy too! Not nearly as fast as Cap’n Proto though.
  • I’ve changed C++ and Rust JSON tests to serialize enums as uints.
  • I added the time it takes to populate the structures. I’m betting the reason the Rust numbers are so high is that we’re allocating strings. I’m not sure if the other languages are able to avoid that allocation.

JSON:

language library population (ns) serialization (MB/s) deserialization (MB/s)
Rust serialize::json 1127 117 26
C++ rapidjson (dom) 546 281 144
C++ rapidjson (sax) 546 281 181
Go encoding/json 343 63.99 22.46
Go ffjson 343 144.60 (not supported)

Cap’n Proto:

language library population (ns) serialization (MB/s) deserialization (MB/s)
Rust capnproto-rust (unpacked) 325 4977 2251
Rust capnproto-rust (packed) 325 398 246
Go go-capnproto 2368 2226.71 450
Go go-capnproto (zero copy) 2368 2226.71 1393.3

Protocol Buffers:

language library population (ns) serialization (MB/s) deserialization (MB/s)
Rust rust-protobuf 1041 370 118
Go goprotobuf 1133 138.27 91.18
Go gogoprotobuf 343 472.69 295.33

Misc:

language library population (ns) serialization (MB/s) deserialization (MB/s)
Rust rust-msgpack 1143 454 144
Rust bincode 1143 1149 82

Anyone want to add more C/Go/Rust/Java/etc benchmarks?

Rewriting Rust Serialization, Part 2.1: Benchmarks

After part 2 I received a couple of requests to add some other Rust serialization libraries. So one thing led to another, and now I’ve got a benchmark suite I’m calling rust-serialization-benchmarks. Really creative name, eh? This includes all the benchmarks I referred to previously, as well as capnproto, msgpack, and protobuf.

language library format serialization (MB/s) deserialization (MB/s)
C++ rapidjson JSON (dom) 233 102
C++ rapidjson JSON (sax) 233 124
Go encoding/json JSON 54.93 16.72
Go ffjson JSON 126.40 (not supported)
Go goprotobuf Protocol Buffers 138.27 91.18
Go gogoprotobuf Protocol Buffers 472.69 295.33
Go go-capnproto Cap’n Proto 2226.71 450
Go go-capnproto Cap’n Proto (zero copy) 2226.71 1393.3
Rust serialize::json JSON 89 18
Rust rust-msgpack MessagePack 160 52
Rust rust-protobuf Protocol Buffers 177 70
Rust capnproto-rust Cap’n Proto (unpacked) 1729 1276
Rust capnproto-rust Cap’n Proto (packed) 398 246

I upgraded to OS X Yosemite, so I think that brought these numbers down overall from the last post.

Rewriting Rust Serialization, Part 2: Performance

As I said in the last post, Rust’s serialize library, specifically serialize::json, is pretty slow. Back when I started this project a number of months ago, I wanted to benchmark to see how we compared to some other languages. There are a bunch of JSON benchmarks out there, but the one I chose was Cloudflare’s Go benchmark suite Goser, mainly because it uses a complex real-world log structure, and they did the hard work of implementing benchmarks for encoding/json, goprotobuf, gogoprotobuf, and go-capnproto. I also included the Go ffjson and C++ rapidjson, which both claim to be the fastest JSON libraries for their languages. Here are the results I got:

language library format serialization (MB/s) deserialization (MB/s)
C++ rapidjson JSON 294 164 (DOM) / 192 (SAX)
Go encoding/json JSON 71.47 25.09
Go ffjson JSON 156.67 (not supported)
Go goprotobuf Protocol Buffers 148.78 99.57
Go gogoprotobuf Protocol Buffers 519.48 319.40
Go go-capnproto Cap’n Proto 3419.54 665.35
Rust serialize::json JSON 40-ish 10-ish

Notes:

  • rapidjson supports both DOM-style and SAX-style deserializing. DOM-style means deserializing into a generic object, then from there into the final object, SAX-style means a callback approach where a callback handler is called for each JSON token.
  • Go’s encoding/json uses reflection to serialize arbitrary values. ffjson uses code generation to get its serialization speed, but it doesn’t implement deserialization.
  • both goprotobuf and gogoprotobuf use code generation, but gogoprotobuf uses Protocol Buffer’s extension support to do cheaper serialization.
  • Cap’n Proto doesn’t really do serialization; it lays the data out just as it is in memory, so it has nearly zero serialization cost.
  • The Rust numbers are from a couple months ago and I couldn’t track down the exact numbers.

So. Yikes. Not only are we nowhere near rapidjson, we were being soundly beaten by Go’s reflection-based framework encoding/json. Even worse, our compile time was at least 10 times theirs. So, not pretty at all.

But that was a couple months ago. Between then and now, Patrick Walton, Luqman Aden, myself, and probably lots of others found and fixed a number of bugs across serialize::json, std::io, generic function calls, and more. All this work got us to more than double our performance:

language library format serialization (MB/s) deserialization (MB/s)
Rust serialize::json JSON 117 25

We’re (kind of) beating Go! At least we’re beating the builtin reflection-based solution. Better, but not great. I think our challenge is those dang closures. While LLVM can optimize simple closures, it seems to have a lot of trouble with all these recursive closure calls. Finished unboxed closures might finally let us break through this performance bottleneck, but that’s not guaranteed.

All in all, this and the representational problems from part 1 make it pretty obvious that we’ve got some fundamental issues here and need an alternative solution. Next post I’ll start getting into the details of the design of serde.

Rewriting Rust Serialization, Part 1

Hello everybody! It’s been, what, two years since I last blogged? Not my best performance, I’m sorry to say. So for all of my 3 pageviews that are probably bots, I apologize for such a long delay in updating my blog. I’ve got to say I’ve been pretty inspired by the great Julia Evans (who I hope we can someday get back to working on Rust stuff). She’s an epic blogger, and I hope I can get somewhere near that speed.

Anyway, on to the post. My main on-again-off-again project this past year has been working on Rust’s generic serialize library. If you haven’t played with it yet, it’s really nifty. It’s a generic framework that allows a generic Encoder to serialize a generic Encodable, and the inverse with Decoder and Decodable. This allows you to write just one Encodable impl that can transparently work with our json library, msgpack, toml, and so on. It’s simple to use, too: in most cases you can use #[deriving(Encodable, Decodable)] to automatically create an implementation for your type. Here’s an example:

extern crate serialize;

use serialize::json;

#[deriving(Encodable, Decodable, Show)]
struct Employee {
    name: String,
}

#[deriving(Encodable, Decodable, Show)]
struct Company {
    employees: Vec<Employee>,
}

fn main() {
    let company = Company {
        employees: vec![
            Employee { name: "Dan".to_string() },
            Employee { name: "Erin".to_string() },
            Employee { name: "Jeff".to_string() },
            Employee { name: "Spencer".to_string() },
        ],
    };

    let s = json::encode(&company);
    let company: Company = json::decode(s.as_slice()).unwrap();
}

There are some downsides to serialize though. Manually implementing these traits can be a bit of a pain. Here’s the Employee example from before, written out by hand:

impl<S: Encoder<E>, E> Encodable<S, E> for Employee {
    fn encode(&self, s: &mut S) -> Result<(), E> {
        match *self {
            Employee { name: ref name } => {
                s.emit_struct("Employee", 1u, |s| {
                    s.emit_struct_field("name", 0u, |s| name.encode(s))
                })
            }
        }
    }
}

impl<D: Decoder<E>, E> Decodable<D, E> for Employee {
    fn decode(d: &mut D) -> Result<Employee, E> {
        d.read_struct("Employee", 1u, |d| {
            Ok(Employee {
                name: {
                    try!(d.read_struct_field("name", 0u, |d| {
                        Decodable::decode(d)
                    }))
                }
            })
        })
    }
}

As you can see, parsing compound structures requires these recursive closure calls in order to perform the handshake between the Encoder and the Encodable. A couple of people have run into confusing bugs in the past when they didn’t implement this pattern correctly. Furthermore, LLVM isn’t great at inlining these recursive calls, so serialize impls tend not to perform well.

That’s not the worst of it though. The real problem is that there are types that can implement Encodable but for which there’s no way to write a Decodable implementation. They’re pretty common, too. For example, the serialize::json::Json type:

pub enum Json {
    I64(i64),
    U64(u64),
    F64(f64),
    String(string::String),
    Boolean(bool),
    List(JsonList),
    Object(JsonObject),
    Null,
}

pub type JsonList = Vec<Json>;
pub type JsonObject = TreeMap<string::String, Json>;

The Json value can represent any value that’s in a JSON string. Implied in this is that the Decodable has to look ahead to see what the next value is, so it can decide which Json variant to construct. Unfortunately our current Decoder infrastructure doesn’t support lookahead. The way the Decoder/Decodable handshake works is essentially:

  • Decodable asks for a struct named "Employee".
  • Decodable asks for a field named "name".
  • Decodable asks for a value of type String.
  • Decodable asks for a field named "age".
  • Decodable asks for a value of type uint.

Any deviation from this pattern results in an error. There isn’t a way for the Decodable to ask what the type of the next value is, which is why we serialize generic enums by explicitly tagging the variant, as in:

extern crate serialize;

use serialize::json;

#[deriving(Encodable, Decodable, Show)]
enum Animal {
    Dog(uint),
    Frog(String, uint),
}

fn main() {
    let animal = Frog("Frank".to_string(), 349);

    let s = json::encode(&animal);

    println!("{}", s);
    // prints {"variant":"Frog","fields":["Frank",349]}
}
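
To make the lookahead problem concrete, here’s a sketch of the Json impl we can’t write today. The peek method it wants is hypothetical; nothing like it exists on Decoder, and that’s exactly the missing piece:

impl<D: Decoder<E>, E> Decodable<D, E> for Json {
    fn decode(d: &mut D) -> Result<Json, E> {
        // There's no lookahead on Decoder, so there's nothing to branch
        // on here. With a hypothetical `d.peek()` it would read roughly:
        //
        //     match d.peek() {
        //         NextIsU64    => Ok(U64(try!(d.read_u64()))),
        //         NextIsString => Ok(String(try!(d.read_str()))),
        //         NextIsNull   => { try!(d.read_nil()); Ok(Null) }
        //         // ... and so on for the other variants.
        //     }
        panic!("Decoder has no lookahead")
    }
}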

That’s probably good enough for now. In my next post I’ll go into my approach to fixing this in serde.

Rust for Ragel

I’ve written a bunch of simple parsers for Rust, and it’s starting to get a little obnoxious. So I added a Rust backend to the Ragel State Machine Compiler. You can find my fork here. I’m waiting for Rust to stabilize before I try to push it upstream.

Ragel is a rather neat way of writing simple parsers. In some ways it’s pretty similar to Lex, but Ragel also allows you to execute arbitrary code at any point in the state machine. Furthermore, this arbitrary code can manipulate the state machine itself, so it can be used in many places where you’d traditionally need a full parser, such as properly handling parentheses.

Here’s an example of an atoi function:

%%{
machine atoi;

action see_neg   { neg = true; }
action add_digit { res = res * 10 + (fc as int - '0' as int); }

main :=
    ( '-' @see_neg | '+' )? ( digit @add_digit )+
    '\n'?
;

write data;
}%%

fn atoi(data: ~str) -> option<int> {
    let mut neg = false;
    let mut res = 0;

    // Ragel assumes that it will be iterating over a value called data, but we
    // need to tell ragel where to start (p) and end (pe) parsing.
    let mut p = 0;
    let mut pe = data.len();

    // This is the current state in the state machine.
    let mut cs: int;

    write init;
    write exec;

    if neg { res = -1 * res; }

    // If we stopped before we hit one of the exit states, then there must have
    // been an error of some sort.
    if cs < atoi_first_final {
        none
    } else {
        some(res)
    }
}

While this is probably a bit more verbose than writing atoi by hand, it does make the grammar pretty explicit, which can help keep it accurate.

Unfortunately there are some pretty severe performance issues at the moment. Ragel supports two state machine styles, table-driven and goto-driven. My backend uses tables, but since Rust doesn’t yet support global constant vectors, I need to allocate the state machine table on every function call. This results in the Ragel-based URL parser being about 10 times slower than the equivalent table-based parser in OCaml. You can see the generated code here.

The goto route could be promising to explore. We could simulate it using mutually recursive function calls; OCaml does this. But again, since Rust doesn’t support tail calls (and may never), we could run into a stack explosion. It may work well for small grammars though, and maybe LLVM could optimize the calls into tail calls.

Unless I’m doing something glaringly wrong, it seems likely that we are going to need some compiler help before these performance issues get solved.