Chasing Rabbits

A poorly updated blog about what I’m working on

Serde 0.7

I’m thrilled to announce Serde 0.7.0! It’s been a long time coming, and it has a number of long-awaited new features, breaking changes, and other notable changes. Serde 0.6.x is now deprecated, and while I’ll try to keep serde_codegen and serde_macros working while projects switch over to 0.7, I’m going to shift to a more pull-based approach, so please file a bug ticket if a nightly release has broken you.

On to the list of the major changes!

Serde

  • Removed the word Error from serde::de::Error variants.
  • Renamed visit_ methods to serialize_ and deserialize_.
  • Removed dependency on the deprecated num crate.
  • Require that implementations of serde::de::Error implement std::error::Error.
  • Added serde::de::Deserializer::deserialize_struct_field hook that allows a Deserializer to know the next value is a struct field.
  • Added serde::ser::Error, which allows a Serialize type to produce an error if it cannot be serialized.
  • Serializing std::path::Path with non-unicode characters will result in a Serde error, rather than a panic.
  • Added implementations for std::net types.
  • Added serde::de::Error::unknown_variant error message hook.
  • Renamed serde::de::Error::syntax to serde::de::Error::custom.

Serde Codegen and Macros

  • Serde now ignores unknown fields by default when deserializing. The previous behavior, where Serde reports unknown fields as an error, can be opted into with the container annotation #[serde(deny_unknown_fields)], as in:
#[serde(deny_unknown_fields)]
struct Point {
    x: usize,
    y: usize,
}
  • Added the container annotation #[serde(rename="...")] to rename the container, as in:
#[serde(rename="point")]
struct Point {
    x: usize,
    y: usize,
}
  • Added the variant annotation #[serde(rename="...")] to rename variants, as in:
enum Value {
    #[serde(rename="type")]
    Type
}
  • Added the rename annotation #[serde(rename(serialize="...", deserialize="..."))], which supports schemas like AWS’s that expect serialized fields to start with a lowercase character but the corresponding fields in the deserialized response to start with an uppercase one.
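For instance, a field using this dual rename might look like the following (the struct and field names here are hypothetical, chosen to match the AWS-style convention described above):

struct Instance {
    #[serde(rename(serialize="instanceId", deserialize="InstanceId"))]
    instance_id: String,
}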
  • Removed the unused format-specific rename support.
  • Added the field annotation #[serde(default="$path")], where $path is a reference to some function that returns a default value for a field if it’s not present when deserializing. For example:
trait MyDefault {
    fn my_default() -> Self;
}

struct Point<T: MyDefault> {
    x: T,
    #[serde(default="MyDefault::my_default")]
    y: T,
}
  • Added the field annotation #[serde(skip_serializing_if="$path")], where $path is a path reference to some function that returns a bool; if it returns true, the field is skipped when serializing. For example:
trait ShouldSkip {
    fn should_skip(&self) -> bool;
}

struct Point<T: ShouldSkip> {
    x: T,
    #[serde(skip_serializing_if="ShouldSkip::should_skip")]
    y: T,
}
  • Added the field annotations #[serde(serialize_with="$path")] and #[serde(deserialize_with="$path")], where $path is a path reference to some function that serializes or deserializes the field, as in:
trait MySerialization {
    fn serialize_with<S: Serializer>(&self, serializer: &mut S) -> Result<(), S::Error>;

    fn deserialize_with<D: Deserializer>(deserializer: &mut D) -> Result<Self, D::Error>;
}

struct Record {
    #[serde(serialize_with="MySerialization::serialize_with")]
    #[serde(deserialize_with="MySerialization::deserialize_with")]
    timestamp: time::Timespec,
}

Serde JSON

  • Added StreamDeserializer, which enables parsing a stream of JSON values, optionally separated by whitespace, into an iterator of the deserialized values.
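As a rough illustration of the idea (this sketch uses today’s serde_json API, which differs in detail from the 0.7-era constructor, so treat it as an approximation rather than the exact 0.7 interface):

extern crate serde_json;

use serde_json::{Deserializer, Value};

fn main() {
    // Three whitespace-separated JSON values, parsed lazily as an iterator.
    let data = "{\"x\": 1} {\"x\": 2} {\"x\": 3}";
    for value in Deserializer::from_str(data).into_iter::<Value>() {
        println!("{:?}", value.unwrap());
    }
}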

Thanks

I’d like to thank everyone that’s helped out over the past few months. Please forgive me if I accidentally left you off the list:

  • Craig M. Brandenburg
  • Florian Gilcher
  • Hans Kristian Flaatten
  • Issam Hakimi
  • Joe Wilm
  • John Heitmann
  • Marek Kotewicz
  • Ms2ger
  • Nathan Lilienthal
  • Oliver Schneider
  • Paul Woolcock
  • Roma Sokolov
  • Simon Persson
  • Thomas Bahn
  • Tom Jakubowski
  • thorbenk

Stateful, Part 2: How Stateful Cheats at Analysis

As I mentioned in the last part, Stateful has some challenges it needs to overcome in order to add new and exciting control flow mechanisms to Rust. While we don’t get access to any of the cool analysis passes inside the Rust compiler, Stateful is able to sneak around their necessity in many cases since it really only needs to support a subset of Rust. Here are some of the techniques it exploits, err, uses.

Variables

First off, let’s talk about variables. One of the primary things Stateful needs to do is manage the process of state flowing through the machine. However, consider a statement like this:

let x = ...;

“Obviously it’s a variable, right?” Actually you can’t be sure. What if someone did:

enum Foo {
    x,
}
use Foo::*;

Well then the compiler would helpfully report:

foo.rs:7:9: 7:10 error: declaration of `x` shadows an enum variant or unit-like struct in scope [E0413]
foo.rs:7     let x = x;

But that warning only works for simple let statements. Consider what happens with matches. Consider:

match ... {
    x => { ... }
    y => { ... }
}

Is x or y a variable, or a variant? There’s no way to know unless you perform name resolution, otherwise known as the resolve pass in the compiler. Unfortunately though, there’s no way for Stateful to run that analysis. As Sméagol said, “There is another way. More secret, and dark way.”. This leads us to Cheat Number One: Stateful assumes that all lowercase identifiers are variables, and uppercase ones are enum variants. Sure, Rust supports lowercase variants, but there’s no reason why Stateful has to use them. It makes our lives much easier.
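As a tiny illustration of that convention (hypothetical code, not Stateful output):

fn classify(opt: Option<u32>) -> u32 {
    match opt {
        // `Some` and `None` start with an uppercase letter, so Stateful would
        // treat them as enum variants; lowercase `n` is assumed to be a binding.
        Some(n) => n,
        None => 0,
    }
}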

Types

The next problem is typing. Sure, Rust is nice and all in that you can write a local variable like let x = ... and it’ll infer the type for you. All Rust asks is that the user explicitly specify the type of any value that enters or leaves the bounds of a function. Our problem is that one of the main tasks of Stateful is to lift variables into some State structure so that they’re available when the function is re-entered. So in effect, all variables inside Stateful must be typed. Consider the example from last week:

fn advance(mut state: State) -> (Option<usize>, State) {
    loop {
        match state {
            State::Enter => {
                let mut i = 0;
                goto!(State::Loop { i: i });
            }
            State::Loop { mut i } => {
                if i < 3 {
                    goto!(State::Then { i: i });
                } else {
                    goto!(State::Else { i: i });
                }
            }
            State::Then { mut i } => {
                return_!(Some(i); State::AfterYield { i: i });
            }
            State::Else { mut i } => {
                goto!(State::AfterLoop { i: i });
            }
            State::AfterYield { mut i } => {
                i += 1;
                goto!(State::Loop { i: i });
            }
            State::AfterLoop { mut i } => {
                goto!(State::Exit);
            }
            State::Exit => {
                return_!(None; State::Exit);
            }
        }
    }
}

This State enumeration is what I’m talking about. It gets passed into and out of the advance function. It needs to be some concrete type, which looks something like this:

enum State {
    Enter,
    Loop { i: usize },
    Then { i: usize },
    Else { i: usize },
    AfterYield { i: usize },
    AfterLoop { i: usize },
    Exit,
}

The problem is that we want to write code like this:

#[generator]
fn gen3() -> Iterator<Item=usize> {
    let mut i = 0;
    while i < 3 {
        yield_!(i);
        i += 1;
    }
}

So how can we resolve this? Well first, we could wait for RFC 105 or RFC 1305 to get implemented, but that’s not happening any time soon. Until then, there is Cheat Number Two: hide the state variables in a boxed trait. This one is from Eduard Burtescu. Instead of the nice, well-typed example from the last post, we actually generate some code that hides the types behind an overabundance of generic type parameters:

fn gen3() -> Box<Iterator<Item=usize>> {
    struct Wrapper<S, F> {
        state: S,
        next: F,
    }

    impl<S, T, F> Iterator for Wrapper<S, F>
        where S: Default,
              F: Fn(S) -> (Option<T>, S)
      {
          type Item = T;

          fn next(&mut self) -> Option<Self::Item> {
              let old_state = ::std::mem::replace(
                  &mut self.state,
                  S::default());

              let (value, next_state) = (self.next)(old_state);
              self.state = next_state;
              value
          }
      }

    enum State<T0> {
        Enter,
        Loop { i: T0 },
        Then { i: T0 },
        Else { i: T0 },
        AfterYield { i: T0 },
        AfterLoop { i: T0 },
        Exit,
    }

    impl<T0> Default for State<T0> {
        fn default() -> Self {
            State::Exit
        }
    }

    Box::new(Wrapper::new(State::Enter, |mut state| {
        loop {
            match state {
                State::Enter => {
                    let mut i = 0;
                    goto!(State::Loop { i: i });
                }
                State::Loop { mut i } => {
                    if i < 3 {
                        goto!(State::Then { i: i });
                    } else {
                        goto!(State::Else { i: i });
                    }
                }
                State::Then { mut i } => {
                    return_!(Some(i); State::AfterYield { i: i });
                }
                State::Else { mut i } => {
                    goto!(State::AfterLoop { i: i });
                }
                State::AfterYield { mut i } => {
                    i += 1;
                    goto!(State::Loop { i: i });
                }
                State::AfterLoop { mut i } => {
                    goto!(State::Exit);
                }
                State::Exit => {
                    return_!(None; State::Exit);
                }
            }
        }
    }))
}

All for the cost of a boxed variable. It’s not ideal, but it does let us keep experimenting. However, if we do want to avoid this allocation, we can just require that all variables that survive across a yield point have their type specified. So our previous example would be written as:

#[generator]
fn gen3() -> Iterator<Item=usize> {
    let mut i: usize = 0;
    while i < 3 {
        yield_!(i);
        i += 1;
    }
}

It’s not so bad here, but it’d get obnoxious if we had a generator like:

#[generator]
fn complicated(items: &[usize]) -> Iterator<Item=usize> {
    let iter = items.iter().map(|item| item * 3);

    for item in iter {
        yield_!(item);
    }
}

The type of iter, by the way, is impossible to write because there is currently no way to specify the type of the closure. Instead, it needs to be rewritten to use a free function:

#[generator]
fn complicated<'a>(items: &'a [usize]) -> Iterator<Item=usize> {
    fn map<'a>(item: &'a usize) -> usize { *item * 3 }

    let iter: std::iter::Map<
            std::slice::Iter<'a, usize>,
            fn(&'a usize) -> usize
        >
        = items.iter().map(map);

    for item in iter {
        yield_!(item);
    }
}

If we want to support closures though, we need to use the Box<Iterator<...>> trick.
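For reference, here’s a sketch of what that boxed escape hatch might look like for the iter above (my own illustration using 2015-era trait object syntax, not actual Stateful output):

fn boxed_map<'a>(items: &'a [usize]) -> Box<Iterator<Item = usize> + 'a> {
    // Boxing erases the unnameable closure type behind a trait object,
    // at the cost of an allocation.
    Box::new(items.iter().map(|item| item * 3))
}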

References

This one’s a doozy. Here’s an example of the problem. Consider:

#[generator]
fn gen<'a>(opt: Option<&'a mut usize>) -> Iterator<Item=&'a mut usize> {
    match opt {
        Some(value) => { yield_!(value); }
        None => { }
    }
}

This would look something like this (which also demonstrates how match statements are represented):

...
Box::new(Wrapper::new(State::State0Start { opt: opt }, |mut state| {
    loop {
        match state {
            State::State0Start { opt: opt } => {
                match opt {
                    Some(value) => {
                        state = State::State2Arm {
                            opt: opt,
                            value: value,
                        };
                        continue;
                    }
                    None => {
                        state = State::State3Arm { opt: opt };
                        continue;
                    }
                }
            }
            State::State1End => {
                return (::std::option::Option::None, State::State1End);
            }
            State::State2Arm { opt: opt, value: value } => {
                return (::std::option::Option::Some(value),
                        State::State5AfterYield {
                    opt: opt,
                    value: value,
                });
            }
            State::State3Arm { opt: opt } => {
                {
                };
                state = State::State4MatchJoin { opt: opt };
                continue;
            }
            State::State4MatchJoin { opt: opt } => {
                ::std::mem::drop(opt);
                state = State::State1End;
                continue;
            }
            State::State5AfterYield {
                              opt: opt, value: value } => {
                ::std::mem::drop(value);
                state = State::State4MatchJoin { opt: opt };
                continue;
            }
        }
    }
}))

Zero in on this block:

...
Some(value) => {
    state = State::State2Arm {
        opt: opt,
        value: value,
    };
    continue;
}
...

The type of opt is Option<&'a mut usize>, and value is &'a mut usize. So we’ve got two outstanding mutable borrows, which is illegal. The real problem is that without the Resolve and Borrow Checker passes, Stateful cannot know in all cases whether a use of a variable is a copy or a move. So we now have Cheat Number Three: use pseudo-macros to hint to Stateful whether a type is copyable or movable. This is the same technique we use to implement the pseudo-macro yield_!(...); we would add move_!(...) and copy_!(...) to inform Stateful when something has been, well, moved or copied. Our previous example would then be written as:

#[generator]
fn gen<'a>(opt: Option<&'a mut usize>) -> Iterator<Item=&'a mut usize> {
    match move_!(opt) {
        Some(value) => { yield_!(value); }
        None => { }
    }
}

Which would then give Stateful enough information to generate something like this, which would then know that the match consumed the option:

Box::new(Wrapper::new(State::State0Start { opt: opt }, |mut state| {
    loop {
        match state {
            State::State0Start { opt: opt } => {
                match opt {
                    Some(value) => {
                        state = State::State2Arm {
                            value: value,
                        };
                        continue;
                    }
                    None => {
                        state = State::State3Arm;
                        continue;
                    }
                }
            }
            State::State1End => {
                return (::std::option::Option::None, State::State1End);
            }
            State::State2Arm { value: value } => {
                return (::std::option::Option::Some(value),
                State::State5AfterYield);
            }
            State::State3Arm => {
                state = State::State4MatchJoin;
                continue;
            }
            State::State4MatchJoin => {
                state = State::State1End;
                continue;
            }
            State::State5AfterYield => {
                state = State::State4MatchJoin;
                continue;
            }
        }
    }
}))

I’m also considering some default rules, that can be overridden with these macros:

  • If a value is known to be copyable (it’s a primitive type, or it’s a &T type), then it’s always copied. All other types are assumed to not be copyable.
  • Non-copyable types are moved when passed into a function argument, unless wrapped in a copy_!(...) hint.
  • Non-copyable type method calls are by reference, unless explicitly wrapped in a move_!(...) hint.
  • Non-copyable types are moved in a match statement, unless one of the match arms uses ref or ref mut.

Hopefully this will enable the vast majority of code to work without copy_!(...) or move_!(...) hints.
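As a small, hypothetical illustration of the last rule (ordinary Rust, not Stateful output):

fn consume(opt: Option<String>) -> usize {
    // No arm uses `ref`, so under the proposed defaults this match would be
    // treated as moving `opt`, and no explicit move_!(...) hint would be needed.
    match opt {
        Some(s) => s.len(),
        None => 0,
    }
}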

Conclusion

Those are our major cheats! I’m sure there will be plenty more in the future. In the meantime, I want to show off some actual working code! Check this puppy out!

#![feature(plugin)]
#![plugin(stateful)]

#![allow(dead_code)]
#![allow(non_shorthand_field_patterns)]
#![allow(unused_mut)]
#![allow(unused_variables)]

#[generator]
fn gen<'a, T>(items: &'a [T]) -> &'a T {
    let mut iter = items.iter();
    loop {
        match iter.next() {
            Some(item) => {
                yield_!(item);
            }
            None => {
                break;
            }
        };
    };
}

fn main() {
    let items = &[1, 2, 3];
    for value in gen(items).take(20) {
        println!("{}", value);
    }
}

Produces:

1
2
3

Isn’t it beautiful? We got generics, mutable variables, loops, matches, breaks, and a whole host of ignored warnings!

Stateful: A Rust Experimental Syntax Extension for Generators and More

AKA: Erick Does More Horrible Things to Rust

Hello internet! It’s been too long. Not only are the Rust Meetups back up and running, it’s time for me to get back to blogging. For the past couple of months, I’ve been working on a new syntax extension that will allow people to create fun and exciting new control flow mechanisms in stable Rust. “For the love of all that is sigils, why?!” Well, because I can. Sometimes when you stare into the madness, it stares back into you? Or something like that?

It’s called Stateful, which helpfully has no documentation. Such an innocent name, right? It’s a very much in-progress (and mostly broken) implementation of some of the ideas in this and future posts. So don’t go and think these code snippets are executable just yet :)

Anyway, let’s show off Stateful by showing how we can implement Generators. We’ve got an RFC ticket to implement them, but wouldn’t it be nice to have them sooner? For those of you unfamiliar with the concept, Generators are functions that can be returned from multiple times, all while preserving state between those calls. Basically, they’re just a simpler way to write Iterators.

Say we wanted to iterate over the numbers 0, 1, and 2. Today, we would write an Iterator with something like this:

struct Iter3(usize);

impl Iter3 {
    fn new() -> Self {
        Iter3(0)
    }
}

impl Iterator for Iter3 {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        if self.0 < 3 {
            let i = self.0;
            self.0 += 1;
            Some(i)
        } else {
            None
        }
    }
}

fn main() {
    let mut iter = Iter3::new();
    assert_eq!(iter.next(), Some(0));
    assert_eq!(iter.next(), Some(1));
    assert_eq!(iter.next(), Some(2));
    assert_eq!(iter.next(), None);
}

The struct preserves our state across these function calls. It’s a pretty straightforward implementation, but it does have some amount of boilerplate code. For large iterator implementations, this state management can get quite complicated. Instead, let’s see how this same code could be expressed with something like Stateful:

#![plugin(stateful)]

#[generator]
fn gen3() -> Iterator<Item=usize> {
    let mut i = 0;
    while i < 3 {
        yield_!(i);
        i += 1;
    }
}

Where yield_!(i) is some magical control flow mechanism that not only returned some value Some(i), but also made sure the next call to iter.next() would jump execution to just after the yield. At the end of the generator, we’d just return None. We could simplify this even more by unrolling that loop into:

#[generator]
fn gen3_unrolled() -> Iterator<Item=usize> {
    yield_!(0);
    yield_!(1);
    yield_!(2);
}

The fun part is figuring out how to convert these generators into something that’s roughly equivalent to Iter3. At its heart, Iter3 really is a simple state machine, where we save the counter state in the structure before we “yield” the value to the caller. Let’s look at what we would generate for gen3_unrolled.

First, we need some boilerplate that sets up the state of our generator. We don’t yet have impl trait, so we hide all our stuff in a module:

fn gen3_unrolled() -> gen3_unrolled::Generator {
    gen3_unrolled::Generator::new()
}

mod gen3_unrolled {
    pub struct Generator {
        state: State,
    }

    impl Generator {
        pub fn new() -> Self {
            Generator {
                state: State::Enter,
            }
        }
    }

    ...

We represent our generator’s state with an enum. We have our initial state, a state per yield, then an exit state:

enum State {
    Enter,
    AfterYield0,
    AfterYield1,
    AfterYield2,
    Exit,
}

Finally, we have our state machine, and a pretty trivial Iterator implementation that manages entering and exiting the state machine:

impl Iterator for Generator {
    type Item = usize;

    fn next(&mut self) -> Option<usize> {
        let state = mem::replace(&mut self.state, State::Exit);
        let (result, next_state) = advance(state);
        self.state = next_state;
        result
    }
}

fn advance(mut state: State) -> (Option<usize>, State) {
    loop {
        state = match state {
            State::Enter => {
                return_!(Some(0); State::AfterYield0);
            }
            State::AfterYield0 => {
                return_!(Some(1); State::AfterYield1);
            }
            State::AfterYield1 => {
                return_!(Some(2); State::AfterYield2);
            }
            State::AfterYield2 => {
                goto!(State::Exit);
            }
            State::Exit => {
                return_!(None; State::Exit);
            }
        }
    }
}
}

We move the current state into advance, then have this loop-match state machine. Then there are two new control flow constructs: return_!($expr; $next_state) and our old friend goto!($next_state). return_!() returns some value and also sets the position the generator should resume at, and goto!() just sets the next state without leaving the function.

Here’s one way they might be implemented:

macro_rules! goto {
    ($next_state:expr) => {
        $state = $next_state;
        continue;
    }
}

macro_rules! return_ {
    ($result: expr; $next_state:expr) => {
        return ($result, $next_state);
    }
}

Relatively straightforward transformation, right? But that’s an easy case. Things start to get a wee bit more complicated when we start thinking about how we’d transform gen3, because it’s got both a while loop and a mutable variable. Let’s see that in action. I’ll leave out the boilerplate code and just focus on the advance function:

fn advance(mut state: State) -> (Option<usize>, State) {
    loop {
        match state {
            State::Enter => {
                let mut i = 0;
                goto!(State::Loop(i));
            }
            State::Loop(mut i) => {
                if i < 3 {
                    goto!(State::Then(i));
                } else {
                    goto!(State::Else(i));
                }
            }
            State::Then(mut i) => {
                return_!(Some(i); State::AfterYield(i));
            }
            State::Else(mut i) => {
                goto!(State::AfterLoop(i));
            }
            State::AfterYield(mut i) => {
                i += 1;
                goto!(State::Loop(i));
            }
            State::AfterLoop(mut i) => {
                goto!(State::Exit);
            }
            State::Exit => {
                return_!(None; State::Exit);
            }
        }
    }
}

Now things are getting interesting! There are two critical things we can see off the bat. First, we need to reify the loops and conditionals into the state machine, because they affect the control flow. Second, we need to lift any variables that are accessed across states into the State enum.

We can also start seeing the complications. The obvious one is mutable variables. We need to somehow thread the information about i’s mutability through each of the states. This naive implementation would trip over the #[warn(unused_mut)] lint. And now you might start to get a sense of the horror that lies beneath Stateful.

At this point, you might be thinking to yourself, “Self, if mutable variables are going to be complicated, what about copies and moves?” You sound like a pretty sensible person. Therein lies madness. You might want to stop thinking too deeply on it. If you can’t, maybe you think “Wait. What about Generics?” Yep. “Borrows?!” Now I’m getting a little worried. “How do you even know what’s a variable!?!” Sorry.

Yeah so there are one or two things that might be a tad challenging.


So that’s Stateful. It’s an experiment to get some real world experience with these control flow mechanisms that may someday feed into RFCs, and maybe, just maybe, might get implemented in the compiler. There’s no reason we need to support everything, which would require us to basically reimplement the compiler. Instead, I believe there’s a subset of Rust that we can support in order to start getting real experience now.

Generators are really just the start. There’s a whole host of other things that, if you just squint at ’em, are really just state machines in disguise. It’s quite possible that if we can pull off Stateful, we’ll also be able to implement things like Coroutines, Continuations, and that hot new mechanism all the cool languages are implementing these days, Async/Await.

But that’s all for later. First is to get this to work. In closing, I leave you with these wise words.

ph’nglui mglw’nafh Cthulhu R’lyeh wgah’nagl fhtagn.

If You Use Unsafe, You Should Be Using Compiletest

One of the coolest things about the Rust typesystem is that you can use it to make unsafe bindings safe. Read all about it in the Rustonomicon. However, it can be really quite easy to slip in a bug where you’re not actually making the guarantees you think you’re making. For example, here’s a real bug I made in the ZeroMQ FFI bindings (which have been edited for clarity):

pub struct Socket {
    sock: *mut libc::c_void,
    closed: bool
}

impl Socket {
    pub fn as_poll_item<'a>(&self, events: i16) -> PollItem<'a> { // <- BUG!!!
        PollItem {
            socket: self.sock,
            fd: 0,
            events: events,
            revents: 0,
            marker: PhantomData
        }
    }
}

impl Drop for Socket {
    fn drop(&mut self) {
        unsafe {
            zmq_sys::zmq_close(self.sock);
        }
    }
}

pub struct PollItem<'a> {
    socket: *mut libc::c_void,
    fd: libc::c_int,
    events: i16,
    revents: i16,
    marker: PhantomData<&'a Socket>
}

pub fn poll(items: &mut [PollItem], timeout: i64) -> Result<i32, Error> {
    unsafe {
        let rc = zmq_sys::zmq_poll(
            items.as_mut_ptr() as *mut zmq_sys::zmq_pollitem_t,
            items.len() as c_int,
            timeout as c_long);

        if rc == -1i32 {
            Err(errno_to_error())
        } else {
            Ok(rc as i32)
        }
    }
}

Here’s the bug if you missed my callout:

pub fn as_poll_item<'a>(&self, events: i16) -> PollItem<'a> { // <- BUG!!!

My intention was to tie the lifetime of PollItem<'a> to the lifetime of the Socket, but because I left out one measly 'a, Rust doesn’t tie the two together, and instead is actually using the 'static lifetime. This then lets you do something evil like:

// leak the pointer!
let poll_item = {
      let context = zmq::Context::new();
      let socket = context.socket(zmq::PAIR).unwrap();
      socket.as_poll_item(0)
};

// And use the now uninitialized pointer! Wee! Party like it's C/C++!
poll(&[poll_item], 0).unwrap();

It’s just that easy. The fix is simple: just change the function to use &'a self and Rust will refuse to compile this snippet. Job well done!
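Concretely, the corrected signature ties the returned PollItem to the borrow of self (the body stays the same):

pub fn as_poll_item<'a>(&'a self, events: i16) -> PollItem<'a> {
    // ... same body as before ...
}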

Well, no, not really. Because what was particularly devious about this bug is that it actually came back. Later on I accidentally reverted &'a self back to &self because I secretly hate myself. The project and examples still compiled and ran, but that uninitialized dereference was just waiting around to cause a security vulnerability.

Oops.

Crap.

Making sure Rust actually rejects programs that it ought to be rejecting is fundamentally important when writing a library that uses Unsafe Rust.

That’s where compiletest comes in. It’s a testing framework that’s been extracted from rust-lang/rust that lets you write these “shouldn’t-compile” tests. Here’s how to use it. First, add this to your Cargo.toml. We do a little feature dance because currently compiletest only runs on nightly:

...
[features]
unstable = ["compiletest_rs"]
...

[dependencies]
compiletest_rs = { version = "*", optional = true }
...

Then, add a test driver tests/compile-tests.rs (or whatever you want to name it) that runs the compiletest tests:

#![cfg(feature = "unstable")]

extern crate compiletest_rs as compiletest;

use std::path::PathBuf;
use std::env::var;

fn run_mode(mode: &'static str) {
    let mut config = compiletest::default_config();

    let cfg_mode = mode.parse().ok().expect("Invalid mode");

    config.target_rustcflags = Some("-L target/debug/ -L target/debug/deps/".to_owned());
    if let Ok(name) = var::<&str>("TESTNAME") {
        let s : String = name.to_owned();
        config.filter = Some(s)
    }
    config.mode = cfg_mode;
    config.src_base = PathBuf::from(format!("tests/{}", mode));

    compiletest::run_tests(&config);
}

#[test]
fn compile_test() {
    run_mode("compile-fail");
}

Finally, add the test! Here’s the one I wrote, tests/compile-fail/no-leaking-poll-items.rs:

extern crate zmq;

fn main() {
    let mut context = zmq::Context::new();
    let _poll_item = {
        let socket = context.socket(zmq::PAIR).unwrap();
        socket.as_poll_item(0) //~ ERROR error: `socket` does not live long enough
    };
}

Now you can live in peace with the confidence that this bug won’t ever appear again:

% multirust run nightly cargo test --features unstable
     Running target/debug/compile_tests-335c5f56b353961f

running 1 test

running 1 test
test [compile-fail] compile-fail/no-leaking-poll-items.rs ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured

test compile_test ... ok

In summary, use compiletest, and demand its use from the Unsafe Rust libraries you use! Otherwise you can never be sure whether unsafe and undefined behavior like this will sneak into your project.


Serde 0.5.0 - Many Many Changes

Hello all you beautiful and talented people! I’m pleased to announce serde 0.5.0. We’re bumping the major (unstable) version number here because there have been a huge number of breaking changes in the API. This has been done to better support serialization formats like bincode, which rely on the Serialize and Deserialize types to hint to the Serializer and Deserializer how to handle the next bytes. This will enable Servo to use bincode for its IPC protocol.

Here are the major changes:

  • serde::json was factored out into its own separate crate serde_json #114.
  • Added serialization and deserialization type hints.
  • Renamed many functions to change visit_named_{map,seq} to visit_struct and visit_tuple_struct #114 #120.
  • Added hooks to allow serializers to serialize newtype tuple structs without a wrapper type #121.
  • Remove _error from de::Error #129.
  • Rewrote json parser to not consume the whole stream #127.
  • Fixed serde_macros for generating fully generic code #117.

Thank you to everyone that’s helped with this release:

  • Craig Brandenburg
  • Hugo Duncan
  • Jarred Nicholis
  • Oliver Schneider
  • Patrick Walton
  • Sebastian Thiel
  • Skylar Lipthay
  • Thomas Bahn
  • dswd

Benchmarks

It’s been a bit since we last did some benchmarks, so here are the latest numbers with these compilers:

  • rustc: 1.4.0-nightly (1181679c8 2015-08-07)
  • go: version go1.4.2 darwin/amd64
  • clang: Apple LLVM version 6.1.0 (clang-602.0.53) (based on LLVM 3.6.0svn)

bincode’s serde support makes its first appearance, which starts out roughly 1/3 slower at serialization, but about the same speed at deserialization. I haven’t done much optimization, so there’s probably a lot of low hanging fruit.

serde_json saw a good amount of improvement, mainly from some compiler optimizations in the 1.4 nightly. The deserializer is slightly slower due to the parser rewrite.

capnproto-rust’s unpacked format shows a surprisingly large serialization improvement, jumping from about 4GB/s to 15GB/s. Good job dwrensha! Deserialization is about half as fast as before though. Perhaps I have a bug in my code?

I’ve changed the Rust MessagePack implementation to rmp, which has a wee bit faster serializer, but deserialization was about the same.

I’ve also updated the numbers for Go and C++, but those numbers stayed roughly the same.

Serialization:

language library format serialization (MB/s, previous → current)
Rust capnproto-rust Cap’n Proto (unpacked) 4349 → 15448
Go go-capnproto Cap’n Proto 3877
Rust bincode Raw 1020 → 3278
Rust bincode (serde) Raw 2143
Rust capnproto-rust Cap’n Proto (packed) 583 → 656
Go gogoprotobuf Protocol Buffers 596 → 627
Rust rmp MessagePack 397 → 427
Rust rust-protobuf Protocol Buffers 357 → 373
Rust serde::json JSON 288 → 337
C++ rapidjson JSON 307
Go goprotobuf Protocol Buffers 214 → 226
Rust serialize::json JSON 147 → 212
Go ffjson JSON 147
Go encoding/json JSON 85

Deserialization:

language library format deserialization (MB/s, previous → current)
Rust capnproto-rust Cap’n Proto (unpacked) 2185 → 1306
Go go-capnproto Cap’n Proto (zero copy) 1407
Go go-capnproto Cap’n Proto 711
Rust capnproto-rust Cap’n Proto (packed) 351 → 464
Rust bincode (serde) Raw 310
Rust bincode Raw 142 → 291
Go gogoprotobuf Protocol Buffers 270
C++ rapidjson JSON (sax) 182
C++ rapidjson JSON (dom) 155
Rust rust-protobuf Protocol Buffers 143
Rust rmp MessagePack 138 → 128
Rust serde::json JSON 140 → 122
Go ffjson JSON 95
Go goprotobuf Protocol Buffers 81
Go encoding/json JSON 23
Rust serialize::json JSON 23

Serde 0.4.0 - Syntax Extensions in Stable Rust and More!

Hello Internet! I’m pleased to announce serde 0.4.0, which now supports many new features with help from our growing serde community. The largest is now serde supports syntax extensions in stable Rust by way of syntex. syntex is a fork of Rust’s parser library libsyntax that has been modified to enable code generation. serde uses it along with a Cargo build script to expand the #[derive(Serialize, Deserialize)] decorator annotations. Here’s how to use it.

First, lets start with a simple serde 0.3.x project that’s forced to use nightly because it uses serde_macros. The Cargo.toml is:

[package]
name = "hello_world"
version = "0.1.0"
authors = ["Erick Tryzelaar <erick.tryzelaar@gmail.com>"]
license = "MIT/Apache-2.0"

[dependencies]
serde = "*"
serde_macros = "*"

And the actual library is src/lib.rs:

#![feature(custom_derive, plugin)]
#![plugin(serde_macros)]

extern crate serde;

#[derive(Serialize, Deserialize)]
pub struct Point {
    x: u32,
    y: u32,
}

In order to use Stable Rust, we can use the new serde_codegen. Our strategy is to split our input into two files. The first is the entry point Cargo will use to compile the library, src/lib.rs. The second is a template that contains the macros, src/lib.rs.in. It will be expanded into $OUT_DIR/lib.rs, which is included in src/lib.rs. So src/lib.rs looks like:

extern crate serde;

include!(concat!(env!("OUT_DIR"), "/lib.rs"));

src/lib.rs.in then just looks like:

#[derive(Serialize, Deserialize)]
pub struct Point {
    x: u32,
    y: u32,
}

In order to generate the $OUT_DIR/lib.rs, we’ll use a Cargo build script. We’ll configure Cargo.toml with:

[package]
name = "hello_world"
version = "0.1.0"
authors = ["Erick Tryzelaar <erick.tryzelaar@gmail.com>"]
license = "MIT/Apache-2.0"
build = "build.rs"

[build-dependencies]
syntex = "*"
serde_codegen = "*"

[dependencies]
serde = "*"

Finally, the build.rs script itself uses syntex to expand the syntax extensions:

extern crate syntex;
extern crate serde_codegen;

use std::env;
use std::path::Path;

fn main() {
    let out_dir = env::var_os("OUT_DIR").unwrap();

    let src = Path::new("src/lib.rs.in");
    let dst = Path::new(&out_dir).join("lib.rs");

    let mut registry = syntex::Registry::new();

    serde_codegen::register(&mut registry);
    registry.expand("", &src, &dst).unwrap();
}

Downside 1: Error Locations

While syntex is quite powerful, there are a few major downsides. Rust does not yet support the ability for a generated file to provide error location information from a template file. This means that tracking down errors requires manually looking at the generated code and trying to identify where the error is in the template. However, there is a workaround. It’s actually not that difficult to support both syntex and the Rust Nightly compiler plugins. To update our example, we’ll change the Cargo.toml to:

[package]
name = "hello_world"
version = "0.1.0"
authors = ["Erick Tryzelaar <erick.tryzelaar@gmail.com>"]
license = "MIT/Apache-2.0"
build = "build.rs"

[features]
default = ["with_syntex"]
nightly = ["serde_macros"]
with_syntex = ["serde", "serde_codegen"]

[build-dependencies]
syntex = { version = "*", optional = true }
serde_codegen = { version = "*", optional = true }

[dependencies]
serde = "*"
serde_macros = { version = "*", optional = true }

Then the build.rs is changed to optionally expand the macros in our template:

#[cfg(feature = "with_syntex")]
mod inner {
    extern crate syntex;
    extern crate serde_codegen;

    use std::env;
    use std::path::Path;

    pub fn main() {
        let out_dir = env::var_os("OUT_DIR").unwrap();

        let src = Path::new("src/lib.rs.in");
        let dst = Path::new(&out_dir).join("lib.rs");

        let mut registry = syntex::Registry::new();

        serde_codegen::register(&mut registry);
        registry.expand("", &src, &dst).unwrap();
    }
}

#[cfg(not(feature = "with_syntex"))]
mod inner {
    pub fn main() {}
}

pub fn main() {
    inner::main()
}

Finally, src/lib.rs is updated to:

#![cfg_attr(feature = "nightly", feature(plugin))]
#![cfg_attr(feature = "nightly", plugin(serde_macros))]

extern crate serde;

#[cfg(feature = "nightly")]
include!("lib.rs.in");

#[cfg(feature = "with_syntex")]
include!(concat!(env!("OUT_DIR"), "/lib.rs"));

Then most development can happen using Nightly Rust and cargo build --no-default-features --features nightly for better error messages, but downstream consumers can use Stable Rust without worry.

Downside 2: Macros in Macros

Syntex can only expand macros inside macros it knows about, and it doesn’t know about the builtin macros. This is because a lot of the stable macros are using unstable features under the covers. So unfortunately if you’re using a library like the quasiquoting library quasi, you cannot write:

let exprs = vec![quote_expr!(cx, 1 + 2)];

Instead you have to pull out the syntex macros into a separate variable:

let expr = quote_expr!(cx, 1 + 1);
let exprs = vec![expr];

Downside 3: Compile Times

Syntex can take a while to compile. It may be possible to optimize this, but that may be difficult while keeping compatibility with libsyntax.


That’s v0.4.0. I hope you enjoy it! Please let me know if you run into any problems.

Release Notes

Here are other things that came with this version:

  • Added field annotation to enable renaming fields for different backends #69. For example:
struct Point {
  #[serde(rename="X")]
  x: u32,

  #[serde(rename(json="the-x", xml="X"))]
  y: u32,
}
  • Faster JSON string parsing #71.
  • Add a LineColIterator that tracks line and column information for deserializers #58.
  • Improved bytestring support #72
  • Changed de::PrimitiveVisitor to also depend on FromStr #70
  • Added impls for fixed sized arrays with 1 to 32 elements #74
  • Added json::Value::lookup, that allows values to be extracted with value.lookup("foo.bar.baz") #76

Bug Fixes:

  • Make sure that -0.0 gets serialized as “-0.0” f0c87fb.
  • Missing field errors displayed original field name instead of renamed #64.
  • Fixed handling of JSON integer overflow.

A special thanks to everyone that helped with this release:

  • Alex Crichton
  • Andrew Poelstra
  • Corey Farwell
  • Hugo Duncan
  • Jorge Israel Peña
  • Kang Seonghoon
  • Mikhail Borisov
  • Oliver Schneider
  • Sebastian Thiel
  • Steven Fackler
  • Thomas Bahn
  • derhaskell

Serde 0.3.1 - Now Compatible With Beta! Plus Aster and Quasi Updates

I just pushed up serde 0.3.1 to crates.io, which is now compatible with beta! serde_macros 0.3.1, however, still requires nightly. But this means that if you implement all the traits using stable features, then any users of serde should work with Rust 1.0.

Here’s what’s also new in serde v0.3.1:

  • Renamed ValueDeserializer::deserializer into ValueDeserializer::into_deserializer.
  • Renamed the attribute that changes the name a field is serialized with from #[serde(alias="...")] to #[serde(rename="...")].
  • Added implementations for Box, Rc, and Arc.
  • Updated VariantVisitor to hint to the deserializer which variant kind it is expecting. This allows serializers to serialize a unit variant as a string.
  • Added an Error::unknown_field_error error message.
  • Progress on the documentation, but there’s still plenty more to go.

Upstream of serde, I’ve been also doing some work on aster and quasi, which are my helper libraries to simplify writing syntax extensions.

aster v0.2.0:

  • Added builders for qualified paths, slices, Vec, Box, Rc, and Arc.
  • Extended item builders to support use statements with simple paths, globs, and lists.
  • Added a helper for building the #[automatically_derived] annotation.

quasi v0.1.9:

  • Backported support for quote_attr!() and quote_matchers!() from libsyntax.
  • Added support for unquoting arbitrary slices.

Thanks for everyone’s help with this release!

Serde 0.3

I’m happy to announce that I’ve released serde 0.3 on crates.io today. For those unfamiliar with serde, it’s a generic serialization framework, much like rustc-serialize, but much more powerful. Check out my serialization series if you’re interested in serde’s original development.

There’s been a ton of work since 0.2. Here are the highlights:

  • Ported over from std::old_io to std::io. There is a bit of a performance hit when serializing to &mut [u8], although it’s really not that bad. In my goser benchmarks, previously it ran in 373 MB/s, but now it’s running at 260 MB/s. However, this hasn’t impacted the Vec<u8> serialization performance, nor deserialization performance.

  • Much better JSON deserialization errors. Now std::io::Error is properly propagated, and error locations are reported when a Deserialize raises an error.

  • Merged serde::ser::Serializer and serde::ser::Visitor.

  • Renamed serde::ser::Serialize::visit to serde::ser::Serialize::serialize.

  • Replaced serde::ser::{Seq,Map}Visitor::size_hint with a len() method that returns an optional length. This puts a stronger emphasis on the fact that we either need an exact length or no length. Formats that need an exact length should make sure to verify that the length passed in matches the actual number of values serialized.

  • serde::json now deserializes missing values as a ().

  • Finished implementing #[derive(Serialize, Deserialize)] for all struct and enum forms.

  • Ported serde_macros over to aster and quasi, which simplifies code generation.

  • Removed the unnecessary first argument from visit_{seq,map}_elt.

  • Rewrote enum deserializations to not require allocations. Oddly enough this is a tad slower than the allocation form. I suspect it’s coming from the function calls not getting inlined away.

  • Allowed enum serialization and deserialization to support more than one variant.

  • Allowed Deserialize types to hint that it’s expecting a sequence or a map.

  • Allowed maps to be deserialized from a ().

  • Added a serde::bytes::{Bytes,ByteBuf}, which wrap &[u8]/Vec<u8> to allow some formats to encode these values more efficiently than generic sequences.

  • Added serde::de::value, which contains some helper deserializers to deserialize from a Rust type.

  • Added impls for most collection types in the standard library.

Thanks everyone that’s helped out with this release!

Rewriting Rust Serialization: Part 4: Serde2 Is Ready!

It’s been a while, hasn’t it? Here’s part 1, part 2, part 2.1, part 2.2, part 3, and part 3.1 if you want to catch up.

Serde Version 2

Well it’s a long time coming, but serde2 is finally in a mostly usable position! If you recall from part 3, one of the problems with serde1 is that we’re paying a lot for tagging our types, and it’s really hurting us on the deserialization side of things. So there’s one other pattern that we can use that allows for lookahead that doesn’t need tags: visitors. A year or so ago I rewrote our generic hashing framework to use the visitor pattern to great success. serde2 came out of experiments to see if I could do the same thing here. It turned out that it was a really elegant approach.

Serialize

It all starts with a type that we want to serialize:

pub trait Serialize {
    fn visit<
        V: Visitor,
    >(&self, visitor: &mut V) -> Result<V::Value, V::Error>;
}

(Aside: while I’d rather use where here for this type parameter, that would force me to write <V as Visitor>::Value due to #20300).
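That is, the where-based spelling I’d have had to write at the time looks something like this (my reconstruction of what the aside describes):

pub trait Serialize {
    fn visit<V>(&self, visitor: &mut V) -> Result<<V as Visitor>::Value, <V as Visitor>::Error>
        where V: Visitor;
}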

This Visitor trait then looks like:

pub trait Visitor {
    type Value;
    type Error;

    fn visit_unit(&mut self) -> Result<Self::Value, Self::Error>;

    #[inline]
    fn visit_named_unit(&mut self, _name: &str) -> Result<Self::Value, Self::Error> {
        self.visit_unit()
    }


    fn visit_bool(&mut self, v: bool) -> Result<Self::Value, Self::Error>;

    ...
}

So the implementation for a bool then looks like:

impl Serialize for bool {
    #[inline]
    fn visit<
        V: Visitor,
    >(&self, visitor: &mut V) -> Result<V::Value, V::Error> {
        visitor.visit_bool(*self)
    }
}

Things get more interesting when we get to compound structures like a sequence. Here’s Visitor again. It needs to both be able to visit the overall structure as well as each item:

...

fn visit_seq<V>(&mut self, visitor: V) -> Result<Self::Value, Self::Error>
    where V: SeqVisitor;

fn visit_seq_elt<T>(&mut self,
                    first: bool,
                    value: T) -> Result<Self::Value, Self::Error>
    where T: Serialize;

...
}

We also have this SeqVisitor trait that the type to serialize provides. It really just looks like an Iterator, but the type parameter has been moved to the visit method so that it can return different types:

pub trait SeqVisitor {
    fn visit<
        V: Visitor,
    >(&mut self, visitor: &mut V) -> Result<Option<V::Value>, V::Error>;

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, None)
    }
}

Finally, to implement this for a type like &[T] we create an Iterator-to-SeqVisitor adaptor and pass it to the visitor, which then in turn visits each item:

pub struct SeqIteratorVisitor<Iter> {
    iter: Iter,
    first: bool,
}

impl<T, Iter: Iterator<Item=T>> SeqIteratorVisitor<Iter> {
    #[inline]
    pub fn new(iter: Iter) -> SeqIteratorVisitor<Iter> {
        SeqIteratorVisitor {
            iter: iter,
            first: true,
        }
    }
}

impl<
    T: Serialize,
    Iter: Iterator<Item=T>,
> SeqVisitor for SeqIteratorVisitor<Iter> {
    #[inline]
    fn visit<
        V: Visitor,
    >(&mut self, visitor: &mut V) -> Result<Option<V::Value>, V::Error> {
        let first = self.first;
        self.first = false;

        match self.iter.next() {
            Some(value) => {
                let value = try!(visitor.visit_seq_elt(first, value));
                Ok(Some(value))
            }
            None => Ok(None),
        }
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        self.iter.size_hint()
    }
}

impl<
    'a,
    T: Serialize,
> Serialize for &'a [T] {
    #[inline]
    fn visit<
        V: Visitor,
    >(&self, visitor: &mut V) -> Result<V::Value, V::Error> {
        visitor.visit_seq(SeqIteratorVisitor::new(self.iter()))
    }
}

SeqIteratorVisitor is publicly exposed, so it should be easy to use it with custom data structures. Maps follow the same pattern (and also expose MapIteratorVisitor), but each item instead uses visit_map_elt(first, key, value). Tuples, struct tuples, and tuple enum variants are all really just named sequences. Likewise, structs and struct enum variants are just named maps.
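For example, a map-like impl might look roughly like this (a sketch of mine, assuming MapIteratorVisitor::new mirrors SeqIteratorVisitor::new and takes an iterator of key/value pairs):

use std::collections::BTreeMap;

impl<K: Serialize + Ord, V: Serialize> Serialize for BTreeMap<K, V> {
    #[inline]
    fn visit<Vis: Visitor>(&self, visitor: &mut Vis) -> Result<Vis::Value, Vis::Error> {
        // Hand the visitor an adaptor over (key, value) pairs, just like the
        // sequence case above.
        visitor.visit_map(MapIteratorVisitor::new(self.iter()))
    }
}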

Because struct implementations are so common, here’s an example how to do it:

struct Point {
    x: i32,
    y: i32,
}

struct PointVisitor<'a> {
    state: u32,
    value: &'a Point,
}

impl<'a> MapVisitor for PointVisitor<'a> {
    fn visit<
        V: Visitor,
    >(&mut self, visitor: &mut V) -> Result<Option<V::Value>, V::Error> {
        match self.state {
            0 => {
                self.state += 1;
                Ok(Some(try!(visitor.visit_map_elt(true, "x", &self.value.x))))
            }
            1 => {
                self.state += 1;
                Ok(Some(try!(visitor.visit_map_elt(false, "y", &self.value.y))))
            }
            _ => Ok(None),
        }
    }
}

impl Serialize for Point {
    fn visit<
        V: Visitor,
    >(&self, visitor: &mut V) -> Result<V::Value, V::Error> {
        visitor.visit_named_map("Point", PointVisitor {
            state: 0,
            value: self,
        })
    }
}

Fortunately serde2 also comes with a #[derive_serialize] macro so you don’t need to write this out by hand if you don’t want to.
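In other words, with that macro the hand-written visitor above collapses to something like this (assuming the serde2 plugin is loaded):

#[derive_serialize]
struct Point {
    x: i32,
    y: i32,
}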

Serializer

Now to actually build a serializer. We start with a trait:

pub trait Serializer {
    type Value;
    type Error;

    fn visit<T>(&mut self, value: &T) -> Result<Self::Value, Self::Error>
        where T: Serialize;
}

It’s the responsibility of the serializer to create a visitor and then pass it to the type. Oftentimes the serializer also implements Visitor, but it’s not required. Here’s a snippet of the JSON serializer visitor:

struct Visitor<'a, W: 'a> {
    writer: &'a mut W,
}

impl<'a, W: Writer> ser::Visitor for Visitor<'a, W> {
    type Value = ();
    type Error = io::Error;

    fn visit_unit(&mut self) -> IoResult<()> {
        self.writer.write_all(b"null")
    }

    #[inline]
    fn visit_bool(&mut self, value: bool) -> IoResult<()> {
        if value {
            self.writer.write_all(b"true")
        } else {
            self.writer.write_all(b"false")
        }
    }

    #[inline]
    fn visit_isize(&mut self, value: isize) -> IoResult<()> {
        write!(self.writer, "{}", value)
    }

    ...

    #[inline]
    fn visit_map<V>(&mut self, mut visitor: V) -> IoResult<()>
        where V: ser::MapVisitor,
    {
        try!(self.writer.write_all(b"{"));

        while let Some(()) = try!(visitor.visit(self)) { }

        self.writer.write_all(b"}")
    }

    #[inline]
    fn visit_map_elt<K, V>(&mut self, first: bool, key: K, value: V) -> IoResult<()>
        where K: ser::Serialize,
              V: ser::Serialize,
    {
        if !first {
            try!(self.writer.write_all(b","));
        }

        try!(key.visit(self));
        try!(self.writer.write_all(b":"));
        value.visit(self)
    }
}

Hopefully it is pretty straightforward.
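To tie it together, the top-level JSON Serializer is then roughly something like this (a hedged sketch of mine, not serde2’s exact code):

pub struct Serializer<W> {
    writer: W,
}

impl<W: Writer> ser::Serializer for Serializer<W> {
    type Value = ();
    type Error = io::Error;

    // Build a visitor over our writer and hand it to the value, which walks
    // itself and calls back into the visitor methods shown above.
    fn visit<T>(&mut self, value: &T) -> Result<(), io::Error>
        where T: ser::Serialize,
    {
        value.visit(&mut Visitor { writer: &mut self.writer })
    }
}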

Deserialization

Now serialization is the easy part. Deserialization is where it always gets more tricky. We follow a similar pattern as serialization. A deserializee creates a visitor that accepts any type (with most methods resulting in an error), and passes it to a deserializer. This deserializer then extracts its next value from its stream and passes it to the visitor, which then produces the actual type.

It’s achingly close to the same pattern between a serializer and a serializee, but as hard as I tried, I couldn’t unify the two. The error semantics are different. In serialization, you want the serializer (which creates the visitor) to define the error. In deserialization, you want the deserializer which consumes the visitor to define the error.

Let’s start first with Error. As opposed to serialization, when we’re deserializing we can error both in the Deserializer if there is a parse error, or in the Deserialize if it’s received an unexpected value. We do this with an Error trait, which allows a deserializee to generically create the few errors it needs:

pub trait Error {
    fn syntax_error() -> Self;

    fn end_of_stream_error() -> Self;

    fn missing_field_error(&'static str) -> Self;
}

Now the Deserialize trait, which looks similar to Serialize:

pub trait Deserialize {
    fn deserialize<
        D: Deserializer,
    >(deserializer: &mut D) -> Result<Self, D::Error>;
}

The Visitor also looks like the serialization Visitor, except that the methods error by default.

pub trait Visitor {
    type Value;

    fn visit_bool<
        E: Error,
    >(&mut self, _v: bool) -> Result<Self::Value, E> {
        Err(Error::syntax_error())
    }

    fn visit_isize<
        E: Error,
    >(&mut self, v: isize) -> Result<Self::Value, E> {
        self.visit_i64(v as i64)
    }

    ...
}

Sequences and Maps are also a little different:

pub trait Visitor {
    ...

    fn visit_seq<
        V: SeqVisitor,
    >(&mut self, _visitor: V) -> Result<Self::Value, V::Error> {
        Err(Error::syntax_error())
    }

    fn visit_map<
        V: MapVisitor,
    >(&mut self, _visitor: V) -> Result<Self::Value, V::Error> {
        Err(Error::syntax_error())
    }

    ...
}

pub trait SeqVisitor {
    type Error: Error;

    fn visit<
        T: Deserialize,
    >(&mut self) -> Result<Option<T>, Self::Error>;

    fn end(&mut self) -> Result<(), Self::Error>;

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, None)
    }
}

pub trait MapVisitor {
    type Error: Error;

    #[inline]
    fn visit<
        K: Deserialize,
        V: Deserialize,
    >(&mut self) -> Result<Option<(K, V)>, Self::Error> {
        match try!(self.visit_key()) {
            Some(key) => {
                let value = try!(self.visit_value());
                Ok(Some((key, value)))
            }
            None => Ok(None)
        }
    }

    fn visit_key<
        K: Deserialize,
    >(&mut self) -> Result<Option<K>, Self::Error>;

    fn visit_value<
        V: Deserialize,
    >(&mut self) -> Result<V, Self::Error>;

    fn end(&mut self) -> Result<(), Self::Error>;

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        (0, None)
    }
}

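As a hedged sketch of how the sequence side is meant to be driven (again, this is my own illustration rather than code lifted from serde2), a Vec<T> deserializer would hand the Deserializer a visitor whose visit_seq drains the SeqVisitor until it returns None:

use std::marker::PhantomData;

struct VecVisitor<T> {
    marker: PhantomData<T>,
}

impl<T: Deserialize> Visitor for VecVisitor<T> {
    type Value = Vec<T>;

    fn visit_seq<
        V: SeqVisitor,
    >(&mut self, mut visitor: V) -> Result<Vec<T>, V::Error> {
        let (len, _) = visitor.size_hint();
        let mut values = Vec::with_capacity(len);

        // Keep pulling elements out of the SeqVisitor until it reports None.
        while let Some(value) = try!(visitor.visit()) {
            values.push(value);
        }

        try!(visitor.end());

        Ok(values)
    }
}

impl<T: Deserialize> Deserialize for Vec<T> {
    fn deserialize<
        D: Deserializer,
    >(deserializer: &mut D) -> Result<Vec<T>, D::Error> {
        deserializer.visit(&mut VecVisitor { marker: PhantomData })
    }
}
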
Here is an example struct deserializer. Structs are deserialized as a map, but since maps are unordered, we need a simple state machine to extract the values. In order to get the keys, we create an enum for the fields, along with a custom deserializer that converts a string into a field without an allocation:

struct Point {
    x: i32,
    y: i32,
}

impl Deserialize for Point {
    fn deserialize<
        D: Deserializer,
    >(deserializer: &mut D) -> Result<Point, D::Error> {
        enum Field {
            x,
            y,
        }

        struct FieldVisitor;

        impl Visitor for FieldVisitor {
            type Value = Field;

            fn visit_str<
                E: Error,
            >(&mut self, value: &str) -> Result<Field, E> {
                match value {
                    "x" => Ok(Field::x),
                    "y" => Ok(Field::y),
                    _ => Err(Error::syntax_error()),
                }
            }
        }

        impl Deserialize for Field {
            fn deserialize<
                D: Deserializer,
            >(state: &mut D) -> Result<Field, D::Error> {
                state.visit(&mut FieldVisitor)
            }
        }

        struct PointVisitor;

        impl Visitor for PointVisitor {
            type Value = Point;

            fn visit_map<
                V: MapVisitor,
            >(&mut self, mut visitor: V) -> Result<Point, V::Error> {
                {
                    let mut x = None;
                    let mut y = None;

                    while let Some(key) = try!(visitor.visit_key()) {
                        match key {
                            Field::x => {
                                x = Some(try!(visitor.visit_value()));
                            }
                            Field::y => {
                                y = Some(try!(visitor.visit_value()));
                            }
                        }
                    }

                    let x = match x {
                        Some(x) => x,
                        None => {
                            return Err(Error::missing_field_error("x"));
                        }
                    };
                    let y = match y {
                        Some(y) => y,
                        None => {
                            return Err(Error::missing_field_error("y"));
                        }
                    };

                    Ok(Point {
                        x: x,
                        y: y,
                    })
                }
            }

            fn visit_named_map<
                V: MapVisitor,
            >(&mut self, name: &str, visitor: V) -> Result<Point, V::Error> {
                if name == "Point" {
                    self.visit_map(visitor)
                } else {
                    Err(Error::syntax_error())
                }
            }
        }

        deserializer.visit(&mut PointVisitor)
    }
}

It’s a little more complicated, but once again there is #[derive_deserialize], which does all this work for you.

Deserializer

Deserializers then follow the same pattern as serializers. The one difference is that we need to provide a special hook for Option<T> types, so that formats like JSON can map a null value to None.

pub trait Deserializer {
    type Error: Error;

    fn visit<
        V: Visitor,
    >(&mut self, visitor: &mut V) -> Result<V::Value, Self::Error>;

    /// The `visit_option` method allows a `Deserialize` type to inform the
    /// `Deserializer` that it's expecting an optional value. This allows
    /// deserializers that encode an optional value as a nullable value to
    /// convert the null value into a `None`, and a regular value as
    /// `Some(value)`.
    #[inline]
    fn visit_option<
        V: Visitor,
    >(&mut self, visitor: &mut V) -> Result<V::Value, Self::Error> {
        self.visit(visitor)
    }
}

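To round out the picture from the other side, here is a minimal, hypothetical Deserializer over a single in-memory value; the Value enum and the MyError type from the earlier sketch are my own inventions for illustration. It just forwards whatever it is holding to the matching visitor hook, and it leaves visit_option at its default since this toy format has no notion of null:

enum Value {
    Bool(bool),
    I64(i64),
}

struct ValueDeserializer {
    value: Value,
}

impl Deserializer for ValueDeserializer {
    type Error = MyError;

    fn visit<
        V: Visitor,
    >(&mut self, visitor: &mut V) -> Result<V::Value, MyError> {
        // Forward the value we are holding; the visitor decides whether it
        // accepts that type or reports a syntax error.
        match self.value {
            Value::Bool(ref v) => visitor.visit_bool(*v),
            Value::I64(ref v) => visitor.visit_i64(*v),
        }
    }
}
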
Performance

So how does it perform? Here are the serialization benchmarks, with yet another ordering, this time sorted by performance:

language  library                 format                  serialization (MB/s)
Rust      capnproto-rust          Cap’n Proto (unpacked)  4226
Go        go-capnproto            Cap’n Proto             3824.20
Rust      bincode                 Binary                  1020
Rust      capnproto-rust          Cap’n Proto (packed)    672
Go        gogoprotobuf            Protocol Buffers        596.78
Rust      rust-msgpack            MessagePack             397
Rust      serde2::json (&[u8])    JSON                    373
Rust      rust-protobuf           Protocol Buffers        357
C++       rapidjson               JSON                    316
Rust      serde2::json (Custom)   JSON                    306
Rust      serde2::json (Vec)      JSON                    288
Rust      serde::json (Custom)    JSON                    244
Rust      serde::json (&[u8])     JSON                    222
Go        goprotobuf              Protocol Buffers        214.68
Rust      serialize::json         JSON                    183
Rust      serde::json (Vec)       JSON                    149
Go        ffjson                  JSON                    147.37
Go        encoding/json           JSON                    80.49

I think it’s fair to say that on at least this benchmark we’ve hit our performance targets. Writing to a preallocated buffer with BufWriter is 18% faster than rapidjson (although, to be fair, they are allocating). Our Vec<u8> writer comes in 12% slower. What’s interesting is the custom Writer: it turns out LLVM is still having trouble lowering our generic Vec::push_all into a memcpy, but this Writer variant is able to get us to rapidjson’s level:

fn push_all_bytes(dst: &mut Vec<u8>, src: &[u8]) {
    let dst_len = dst.len();
    let src_len = src.len();

    dst.reserve(src_len);

    unsafe {
        // we would have failed if `reserve` overflowed.
        dst.set_len(dst_len + src_len);

        ::std::ptr::copy_nonoverlapping_memory(
            dst.as_mut_ptr().offset(dst_len as isize),
            src.as_ptr(),
            src_len);
    }
}

struct MyMemWriter1 {
    buf: Vec<u8>,
}

impl Writer for MyMemWriter1 {
    #[inline]
    fn write_all(&mut self, buf: &[u8]) -> IoResult<()> {
        push_all_bytes(&mut self.buf, buf);
        Ok(())
    }
}

For deserialization we do much better than serde because we aren’t passing around all those tags, but we have a ways to go to catch up to rapidjson. Even so, being just 37% slower than the fastest JSON deserializer makes me feel pretty proud.

language  library                 format                   deserialization (MB/s)
Rust      capnproto-rust          Cap’n Proto (unpacked)   2123
Go        go-capnproto            Cap’n Proto (zero copy)  1407.95
Go        go-capnproto            Cap’n Proto              711.77
Rust      capnproto-rust          Cap’n Proto (packed)     529
Go        gogoprotobuf            Protocol Buffers         272.68
C++       rapidjson               JSON (sax)               189
C++       rapidjson               JSON (dom)               162
Rust      bincode                 Binary                   142
Rust      rust-protobuf           Protocol Buffers         141
Rust      rust-msgpack            MessagePack              138
Rust      serde2::json            JSON                     122
Go        ffjson                  JSON                     95.06
Go        goprotobuf              Protocol Buffers         79.78
Rust      serde::json             JSON                     67
Rust      serialize::json         JSON                     25
Go        encoding/json           JSON                     22.79

Conclusion

What a long trip it’s been! I hope you enjoyed it. There are still a few things left to port over from serde1 to serde2 (like the JSON pretty-printer), and some things should probably be renamed, but I’m happy with the design, so I think it’s in a place where people can start using it now. Please let me know how it goes!

Syntex: Syntax Extensions for Rust 1.0

I use and love syntax extensions, and I’m planning on using them to simplify how one interacts with a system like serde. Unfortunately, to write them you need to use Rust’s libsyntax, which is not going to be exposed in Rust 1.0 because we’re not ready to stabilize its API; doing so would really hamper future development of the compiler.

It would be so nice, though, to write this for every type we want to serialize:

#[derive_deserialize]
struct Point {
    x: int,
    y: int,
}

instead of:

impl <D: Deserializer<E>, E> Deserialize<D, E> for Point {
    fn deserialize_token(state: &mut D, token: Token) -> Result<Point, E> {
        try!(state.expect_struct_start(token, "Point"));

        let mut x = None;
        let mut y = None;

        loop {
            let idx = match try!(state.expect_struct_field_or_end(&["x", "y"])) {
                Some(idx) => idx,
                None => { break ; }
            };

            match idx {
                Some(0us) => { x = Some(try!(state.expect_struct_value())); }
                Some(1us) => { y = Some(try!(state.expect_struct_value())); }
                None | Some(_) => { panic!() }
            }
        }

        let x = match x {
            Some(x) => x,
            None => try!(state.missing_field("x")),
        };

        let y = match y {
            Some(y) => y,
            None => try!(state.missing_field("y")),
        };

        Ok(Point{x: x, y: y,})
    }
}

So I want to announce my plan for how to deal with this (and also publicly announce that everyone can blame me if this turns out to hurt Rust’s future development). I’ve started syntex, a library that enables code generation using an unofficial tracking fork of libsyntax. In order to deal with the fact that libsyntax might make breaking changes between minor Rust releases, we will simply cut a minor or major release of syntex, depending on whether there were any breaking changes in libsyntax. Ideally syntex will allow the community to experiment with different code generation approaches to see what would be worth merging upstream, or, even better, what hooks are needed in the compiler so it doesn’t have to think about syntax extensions at all.

I’ve got the basic version working right now. Here’s a simple hello_world_macros syntax extension (you can see the actual code here). First, the hello_world_macros/Cargo.toml:

[package]
name = "hello_world_macros"
version = "0.2.0"
authors = [ "erick.tryzelaar@gmail.com" ]

[dependencies]
syntex = "*"
syntex_syntax = "*"

syntex_syntax is the crate containing my fork of libsyntax, and syntex provides some helper functions to ease registering syntax extensions.

Then the src/lib.rs, which declares a macro hello_world that just produces a "hello world" string:

extern crate syntex;
extern crate syntex_syntax;

use syntex::Registry;

use syntex_syntax::ast::TokenTree;
use syntex_syntax::codemap::Span;
use syntex_syntax::ext::base::{ExtCtxt, MacExpr, MacResult, TTMacroExpander};
use syntex_syntax::ext::build::AstBuilder;
use syntex_syntax::parse::token::InternedString;

fn expand_hello_world<'cx>(
    cx: &'cx mut ExtCtxt,
    sp: Span,
    tts: &[TokenTree]
) -> Box<MacResult + 'cx> {
    let expr = cx.expr_str(sp, InternedString::new("hello world"));

    MacExpr::new(expr)
}

pub fn register(registry: &mut Registry) {
    registry.register_fn("hello_world", expand_hello_world);
}

Now to use it. This is a little more complicated because we have to do code generation, but Cargo helps with that. Our strategy is to use a build.rs script to expand the macros in a main.rss file, and then use the include!() macro to pull the generated code into a dummy main.rs file. Here’s the Cargo.toml we need:

[package]
name = "hello_world"
version = "0.2.0"
authors = [ "erick.tryzelaar@gmail.com" ]
build = "build.rs"

[build-dependencies]
syntex = "*"
syntex_syntax = "*"

[build-dependencies.hello_world_macros]
path = "hello_world_macros"

Here’s the build.rs, which actually performs the code generation:

extern crate syntex;
extern crate hello_world_macros;

use std::os;

fn main() {
    let mut registry = syntex::Registry::new();
    hello_world_macros::register(&mut registry);

    registry.expand(
        "hello_world",
        &Path::new("src/main.rss"),
        &Path::new(os::getenv("OUT_DIR").unwrap()).join("main.rs"));
}

Our main.rs driver script:

// Include the real main
include!(concat!(env!("OUT_DIR"), "/main.rs"));

And finally the main.rss:

fn main() {
    let s = hello_world!();
    println!("{}", s);
}

One limitation you can see above is that we unfortunately can’t compose our macros with the built-in Rust macros: syntex currently has no awareness of them, and since macros are parsed outside-in, we have to leave the tokens inside a macro like println!() untouched.


That’s syntex. There is a bunch more work left to be done in syntex to make it really usable, and there’s also a lot of work in Rust and Cargo that would help it be really effective:

  • We need a way to inform Rust that a block of code actually comes from a different file than the one it’s processing. This is roughly equivalent to the #line directive in C.
  • We could upstream the “ignore unknown macros” patch to minimize changes to libsyntax.
  • It would be nice if we could allow #[path] to reference an environment variable. This would be a little cleaner than using include!(...).
  • We need a way to extract macros from a crate.

On Cargo’s side:

  • It would be nice if Cargo could be told to use a generated file as the main.rs/lib.rs/etc.
  • Cargo.toml could grow a plugin mechanism to remove the need to write a build.rs script.

I’m sure there’s plenty more that needs to get done! So, please help out!

edit: comments on reddit