So libserialize has some pretty serious downsides. It’s slow, it’s got this
weird recursive closure thing going on, and it can’t even represent enum types
like serialize::json::Json. We need a new solution, and while I was at it,
I ended up writing two: serde and serde2. Both are different approaches to
addressing these problems, the biggest being the type representation problem.
Serde Version 1
Deserialization
I want to start with deserialization, as that’s really the interesting
bit. To repeat myself a little from
part 1,
here is a generic json Value enum:
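Roughly, it looks like this (a sketch; the real serde::json::Value has a few
more variants, but this is the shape that matters):

```rust
// A sketch of a generic JSON value type. Variant names follow the
// `[1, true]` example below; the real `serde::json::Value` differs in
// the details.
use std::collections::TreeMap; // the pre-1.0 ordered map

pub enum Value {
    Null,
    Boolean(bool),
    I64(i64),
    F64(f64),
    String(String),
    Array(Vec<Value>),
    Object(TreeMap<String, Value>),
}
```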
To deserialize a string like [1, true] into
Array(vec![I64(1), Boolean(true)]), we need to peek one character ahead
(ignoring whitespace) in order to discover the type of the next value.
We can then use that knowledge to pick the right variant and parse the next
value correctly. While I haven’t formally studied this stuff, I believe this
can be stated more formally as: Value requires at least an LL(1) grammar,
but since libserialize supports no lookahead, it can handle at most LL(0)
grammars.
Since I was thinking of this problem in terms of grammars, I wanted to take a
page out of their book and implement generic deserialization in this style.
A serde::de::Deserializer is then a lexer, an Iterator over serde::de::Tokens
that produces a token stream, and a serde::de::Deserialize is a parser that
consumes this stream to produce a value. Here’s serde::de::Token, which can
represent nearly all the Rust types:
```rust
pub enum Token {
    Null,
    Bool(bool),
    Int(int),
    I8(i8),
    I16(i16),
    I32(i32),
    I64(i64),
    Uint(uint),
    U8(u8),
    U16(u16),
    U32(u32),
    U64(u64),
    F32(f32),
    F64(f64),
    Char(char),
    Str(&'static str),
    String(String),
    Option(bool),        // true if the option has a value

    TupleStart(uint),    // estimate of the number of values

    StructStart(
        &'static str,    // the struct name
        uint,            // estimate of the number of (string, value) pairs
    ),

    EnumStart(
        &'static str,    // the enum name
        &'static str,    // the variant name
        uint,            // estimate of the number of values
    ),

    SeqStart(uint),      // number of values

    MapStart(uint),      // number of (value, value) pairs

    End,
}
```
The serde::de::Deserializer must generate tokens that follow a simple grammar:
a value is either a primitive token, an Option marker (followed by a value when
it is Some), or one of the Start tokens (TupleStart, StructStart, EnumStart,
SeqStart, MapStart) followed by its nested values and a closing End. Here are
the two core traits:
```rust
pub trait Deserialize<D: Deserializer<E>, E> {
    fn deserialize(d: &mut D) -> Result<Self, E> {
        let token = try!(d.expect_token());
        Deserialize::deserialize_token(d, token)
    }

    fn deserialize_token(d: &mut D, token: Token) -> Result<Self, E>;
}

pub trait Deserializer<E>: Iterator<Result<Token, E>> {
    /// Called when a `Deserialize` expected more tokens, but the
    /// `Deserializer` was empty.
    fn end_of_stream_error(&mut self) -> E;

    /// Called when a `Deserializer` was unable to properly parse the stream.
    fn syntax_error(&mut self, token: Token, expected: &'static [TokenKind]) -> E;

    /// Called when a named structure or enum got a name that it didn't expect.
    fn unexpected_name_error(&mut self, token: Token) -> E;

    /// Called when a value was unable to be coerced into another value.
    fn conversion_error(&mut self, token: Token) -> E;

    /// Called when a `Deserialize` structure did not deserialize a field
    /// named `field`.
    fn missing_field<T: Deserialize<Self, E>>(
        &mut self,
        field: &'static str
    ) -> Result<T, E>;

    /// Called when a `Deserialize` has decided to not consume this token.
    fn ignore_field(&mut self, _token: Token) -> Result<(), E> {
        let _: IgnoreTokens = try!(Deserialize::deserialize(self));
        Ok(())
    }

    #[inline]
    fn expect_token(&mut self) -> Result<Token, E> {
        self.next().unwrap_or_else(|| Err(self.end_of_stream_error()))
    }

    ...
}
```
The Deserialize trait is kept pretty slim, and it’s where the lookahead
happens: deserialize reads one token and hands it to deserialize_token, which
can dispatch on that token before consuming anything else. Deserializer is an
enhanced Iterator<Result<Token, E>>, with many helpful default methods. Here
they are in action. First we’ll start with what’s probably the simplest
Deserializer, one that just wraps a Vec<Token>:
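Something along these lines (a sketch: the error enum and a few of the error
hooks are simplified, and it assumes Token implements Clone):

```rust
// A sketch of a `Deserializer` that just replays a fixed `Vec<Token>`.
enum Error {
    EndOfStream,
    SyntaxError,
    MissingField(&'static str),
}

struct TokenDeserializer {
    tokens: Vec<Token>,
    idx: uint, // index of the next token to hand out
}

impl Iterator<Result<Token, Error>> for TokenDeserializer {
    fn next(&mut self) -> Option<Result<Token, Error>> {
        if self.idx < self.tokens.len() {
            self.idx += 1;
            // Assumes `Token` derives `Clone`.
            Some(Ok(self.tokens[self.idx - 1].clone()))
        } else {
            None
        }
    }
}

impl Deserializer<Error> for TokenDeserializer {
    fn end_of_stream_error(&mut self) -> Error { EndOfStream }

    fn syntax_error(&mut self, _token: Token, _expected: &'static [TokenKind]) -> Error {
        SyntaxError
    }

    fn unexpected_name_error(&mut self, _token: Token) -> Error { SyntaxError }

    fn conversion_error(&mut self, _token: Token) -> Error { SyntaxError }

    fn missing_field<T: Deserialize<TokenDeserializer, Error>>(
        &mut self,
        field: &'static str
    ) -> Result<T, Error> {
        Err(MissingField(field))
    }
}
```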
Overall it should be pretty straightforward. As usual, error handling makes
things a bit noisier, but hopefully it’s not too onerous. Next is the
Deserialize impl for bool:
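Roughly like this (a sketch; BoolKind is assumed to be the matching TokenKind
variant):

```rust
// A sketch of `Deserialize` for `bool`: match the one token we expect,
// otherwise report a syntax error.
static BOOL_KINDS: &'static [TokenKind] = &[BoolKind];

impl<D: Deserializer<E>, E> Deserialize<D, E> for bool {
    #[inline]
    fn deserialize_token(d: &mut D, token: Token) -> Result<bool, E> {
        match token {
            Bool(value) => Ok(value),
            token => Err(d.syntax_error(token, BOOL_KINDS)),
        }
    }
}
```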
Simple! Sequences are a bit trickier. Here’s Deserialize for a Vec<T>. We
use a helper adaptor, SeqDeserializer, so the same logic can deserialize into
any type that implements FromIterator:
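Here’s a simplified sketch of the Vec<T> case written directly, without the
adaptor (SEQ_KINDS and SeqStartKind are assumed names):

```rust
// Note the lookahead: each element's first token is read here and then
// handed to `deserialize_token`.
static SEQ_KINDS: &'static [TokenKind] = &[SeqStartKind];

impl<
    T: Deserialize<D, E>,
    D: Deserializer<E>,
    E
> Deserialize<D, E> for Vec<T> {
    fn deserialize_token(d: &mut D, token: Token) -> Result<Vec<T>, E> {
        match token {
            SeqStart(len) => {
                // `len` is only an estimate, so it's just a capacity hint.
                let mut values = Vec::with_capacity(len);
                loop {
                    match try!(d.expect_token()) {
                        End => { return Ok(values); }
                        token => {
                            let value = try!(Deserialize::deserialize_token(d, token));
                            values.push(value);
                        }
                    }
                }
            }
            token => Err(d.syntax_error(token, SEQ_KINDS)),
        }
    }
}
```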
Deserializing a struct follows the same pattern: match the StructStart token,
then pull out (name, value) pairs until the End token. It’s more complicated
than libserialize’s struct parsing, but it performs much better because it can
handle out-of-order maps without buffering tokens.
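For example, here’s a sketch of what a hand-written impl for a two-field
struct could look like (Point, STRUCT_KINDS, and StructStartKind are assumed
names, and a real impl would also accept String field-name tokens, not just
Str):

```rust
struct Point {
    x: int,
    y: int,
}

static STRUCT_KINDS: &'static [TokenKind] = &[StructStartKind];

impl<D: Deserializer<E>, E> Deserialize<D, E> for Point {
    fn deserialize_token(d: &mut D, token: Token) -> Result<Point, E> {
        match token {
            StructStart("Point", _) => {
                let mut x = None;
                let mut y = None;

                loop {
                    // Fields can arrive in any order; we just fill in
                    // whichever slot matches the name we see.
                    match try!(d.expect_token()) {
                        End => { break; }
                        Str("x") => { x = Some(try!(Deserialize::deserialize(d))); }
                        Str("y") => { y = Some(try!(Deserialize::deserialize(d))); }
                        // Unknown fields are skipped via the `ignore_field` hook.
                        token => { try!(d.ignore_field(token)); }
                    }
                }

                let x = match x {
                    Some(x) => x,
                    None => try!(d.missing_field("x")),
                };
                let y = match y {
                    Some(y) => y,
                    None => try!(d.missing_field("y")),
                };

                Ok(Point { x: x, y: y })
            }
            token => Err(d.syntax_error(token, STRUCT_KINDS)),
        }
    }
}
```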
Serialization
Serialization is a much simpler story. Conceptually,
serde::ser::Serializer and serde::ser::Serialize mirror the
deserialization design, but we don’t need tagged tokens because we already
know the types. Here are the traits:
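They look roughly like this (a sketch: the method names are illustrative
rather than serde’s exact ones, and the real Serializer trait covers every
Token-like case, with most methods provided as defaults):

```rust
pub trait Serialize<S: Serializer<E>, E> {
    fn serialize(&self, s: &mut S) -> Result<(), E>;
}

pub trait Serializer<E> {
    fn serialize_null(&mut self) -> Result<(), E>;

    fn serialize_bool(&mut self, v: bool) -> Result<(), E>;

    fn serialize_i64(&mut self, v: i64) -> Result<(), E>;

    fn serialize_f64(&mut self, v: f64) -> Result<(), E>;

    fn serialize_str(&mut self, v: &str) -> Result<(), E>;

    // Compound values are driven by the value being serialized: a
    // sequence announces its length, serializes each element, then ends.
    fn serialize_seq_start(&mut self, len: uint) -> Result<(), E>;

    fn serialize_seq_elt<T: Serialize<Self, E>>(&mut self, value: &T) -> Result<(), E>;

    fn serialize_seq_end(&mut self) -> Result<(), E>;
}
```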
There are many default methods, so only a handful of implementations need to be
specified. Now let’s look at how they are used. Here’s a simple
AssertSerializer that I use in my test suite to make sure I’m serializing
properly:
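In the spirit of the sketch above, an AssertSerializer boils down to something
like this (the expectation type is simplified to strings here; the real one
compares richer values):

```rust
// A test-only serializer: it holds the sequence of calls it expects and
// asserts that each `serialize_*` call matches the next one.
struct AssertSerializer {
    expected: Vec<String>,
    idx: uint,
}

impl AssertSerializer {
    fn check(&mut self, got: String) -> Result<(), ()> {
        assert_eq!(self.expected[self.idx], got);
        self.idx += 1;
        Ok(())
    }
}

impl Serializer<()> for AssertSerializer {
    fn serialize_null(&mut self) -> Result<(), ()> {
        self.check("null".to_string())
    }

    fn serialize_bool(&mut self, v: bool) -> Result<(), ()> {
        self.check(format!("bool({})", v))
    }

    fn serialize_i64(&mut self, v: i64) -> Result<(), ()> {
        self.check(format!("i64({})", v))
    }

    fn serialize_f64(&mut self, v: f64) -> Result<(), ()> {
        self.check(format!("f64({})", v))
    }

    fn serialize_str(&mut self, v: &str) -> Result<(), ()> {
        self.check(format!("str({})", v))
    }

    fn serialize_seq_start(&mut self, len: uint) -> Result<(), ()> {
        self.check(format!("seq_start({})", len))
    }

    fn serialize_seq_elt<T: Serialize<AssertSerializer, ()>>(
        &mut self,
        value: &T
    ) -> Result<(), ()> {
        // Elements just recurse back into `serialize`, so nested values
        // get checked the same way.
        value.serialize(self)
    }

    fn serialize_seq_end(&mut self) -> Result<(), ()> {
        self.check("seq_end".to_string())
    }
}
```

A test then just builds the expected list, constructs the serializer, and
calls value.serialize(&mut serializer).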
So how does it perform? Here are the serialization benchmarks, in yet another
ordering, this time sorted by performance:
| language | library | format | serialization (MB/s) |
|----------|---------|--------|----------------------|
| Rust | capnproto-rust | Cap’n Proto (unpacked) | 4349 |
| Go | go-capnproto | Cap’n Proto | 3824.20 |
| Rust | bincode | Binary | 1020 |
| Go | gogoprotobuf | Protocol Buffers | 596.78 |
| Rust | capnproto-rust | Cap’n Proto (packed) | 583 |
| Rust | rust-msgpack | MessagePack | 397 |
| Rust | rust-protobuf | Protocol Buffers | 357 |
| C++ | rapidjson | JSON | 304 |
| Rust | serde::json | JSON | 222 |
| Go | goprotobuf | Protocol Buffers | 214.68 |
| Go | ffjson | JSON | 147.37 |
| Rust | serialize::json | JSON | 147 |
| Go | encoding/json | JSON | 80.49 |
serde::json is doing pretty well! It still has a ways to go to catch up
to rapidjson, but it’s pretty cool that it’s
beating goprotobuf out of the box :)
Here are the deserialization numbers:
| language | library | format | deserialization (MB/s) |
|----------|---------|--------|------------------------|
| Rust | capnproto-rust | Cap’n Proto (unpacked) | 2185 |
| Go | go-capnproto | Cap’n Proto (zero copy) | 1407.95 |
| Go | go-capnproto | Cap’n Proto | 711.77 |
| Rust | capnproto-rust | Cap’n Proto (packed) | 351 |
| Go | gogoprotobuf | Protocol Buffers | 272.68 |
| C++ | rapidjson | JSON (sax) | 189 |
| C++ | rapidjson | JSON (dom) | 162 |
| Rust | rust-msgpack | MessagePack | 138 |
| Rust | rust-protobuf | Protocol Buffers | 129 |
| Go | ffjson | JSON | 95.06 |
| Rust | bincode | Binary | 80 |
| Go | goprotobuf | Protocol Buffers | 79.78 |
| Rust | serde::json | JSON | 67 |
| Rust | serialize::json | JSON | 24 |
| Go | encoding/json | JSON | 22.79 |
Well, on the plus side, serde::json is nearly 3 times faster than
libserialize::json. On the downside, rapidjson is nearly 3 times faster than
us with its SAX-style parsing. Even the newly added deserialization support in
ffjson is 1.4 times faster than us. So we’ve
got our work cut out for us!
Next time, serde2!
PS: I’m definitely getting close to the end of my story, and while I have some
better numbers with serde2, nothing is quite putting me in the rapidjson
range. Anyone want to help optimize
serde? I would greatly appreciate the help!
PPS: I’ve gotten a number of requests for my
serialization benchmarks
to be ported over to other languages and libraries. Especially a C++ version
of Cap’n Proto. Unfortunately I don’t really have the time to do it myself.
Would anyone be up for helping to implement it?