Wire – Writing one of the fastest .NET serializers


First of all, there is no such thing as “the fastest” serializer; it is all contextual.
But under some conditions, I would argue that Wire is, by far, the fastest of all the .NET serializers out there.

Given the following POCO type.

public class Poco
{
   public string StringProp { get; set; }
   public int IntProp { get; set; }
   public Guid GuidProp { get; set; }
   public DateTime DateProp { get; set; }
}

Round tripping one million objects of this type, that is, serializing and then deserializing a million objects using Wire with all optimizations on, completes in about 550 milliseconds on my personal laptop.
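
The measured round trip could look roughly like the sketch below. This is not the actual benchmark code; it assumes Wire's Serialize/Deserialize API shape, reuses a single serializer instance, and uses a fresh MemoryStream per iteration.

//a rough sketch of the round trip being measured, not the real benchmark harness
var serializer = new Wire.Serializer();
var poco = new Poco
{
    StringProp = "hello",
    IntProp = 123,
    GuidProp = Guid.NewGuid(),
    DateProp = DateTime.UtcNow
};
var sw = Stopwatch.StartNew();
for (var i = 0; i < 1000000; i++)
{
    var stream = new MemoryStream();
    serializer.Serialize(poco, stream); //serialize...
    stream.Position = 0;
    var copy = serializer.Deserialize<Poco>(stream); //...and deserialize again
}
Console.WriteLine(sw.ElapsedMilliseconds);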

Doing the same using MS Bond, which is the second fastest serializer in the benchmark, takes about 830 milliseconds, and this is while being very generous to Bond as it has some very specific prerequisites.
Protobuf.NET, which is the third fastest serializer in this benchmark, completes in about 1360 milliseconds.

Other serializers that were included in the same benchmark were Jil, NetSerializer, FsPickler, Json.NET and the .NET BinaryFormatter.

Just to clarify: this is very selective benchmarking, the old “lies, damned lies, and statistics” and all that.
Running the same benchmark with smaller types, e.g. a POCO with only one or two properties, favors Jil and NetSerializer a lot; NetSerializer beats Wire under those conditions.

I would also imagine that running benchmarks with a lot bigger types might favor e.g. Protobuf.NET as it does some clever byte buffer pooling.

Wire was originally built as a replacement for the Json.NET based serializer we use in Akka.NET.
Akka.NET is a concurrency and distributed computing framework based on messaging.

As such, the requirements we had were that it should support polymorphic types, “surrogate” types, and the plethora of standard types in the .NET ecosystem, such as immutable collections, F# discriminated unions and so on; all of the things that had been bugging us with Json.NET.
Note: there is nothing wrong with Json.NET, it is a fantastic serializer, it is just not the right choice for Akka.NET.

Messages in Akka.NET are typically quite small, thus the shape of the POCO used in the benchmark above; it is fairly representative of a message.

This post is not intended to show how fast and fancy Wire is, but rather about some of the lessons learned while building and optimizing it.

When I started to build Wire, speed was not my main concern, I just wanted it to be “fast enough”; the main concerns were the above mentioned requirements.

Having never built a serializer before, I did not know much about the topic.
I had a rough idea of how I wanted to go about this. I knew I needed to preserve type information, even for primitives in some cases, e.g. if you serialize an array of object containing different primitives, which is exactly what we do for Actor constructor arguments in Akka.NET remote deployment.

It was also fairly obvious that it would be inefficient to write the entire type name for every such occurrence.

Therefore I introduced the idea of a ValueSerializer; this is a type that can serialize and deserialize the content of a given type, be it a complex or a primitive type.

Looking up value serializers by type

The very early attempts contained a concurrent dictionary of value serializers, so that the serializer could check whatever value was about to be serialized or deserialized and then look up the correct value serializer.

This worked, but doing dictionary lookups is fairly costly, so this is where I first started to introduce some optimizations.

Instead of having code like:

public ValueSerializer GetSerializerByType(Type type)
{
  ValueSerializer serializer;

  if (_serializers.TryGetValue(type, out serializer))
    return serializer;

  //more code to build custom type serializers.. ignore for now.
}

I turned the code into something like:

public ValueSerializer GetSerializerByType(Type type)
{
  if (type == typeof(string))
    return StringSerializer.Instance;

  if (type == typeof(Int32))
    return Int32Serializer.Instance;

  if (type == typeof(Int64))
    return Int64Serializer.Instance;
  ....

This was faster for primitives; no hashing or lookup needed, only reference checking.
But, there is one call in there for each comparison, can you spot it?

Calls to typeof() actually generate a bit of IL code:

ldtoken      [mscorlib]System.String
call         class [mscorlib]System.Type [mscorlib]System.Type::GetTypeFromHandle(valuetype [mscorlib]System.RuntimeTypeHandle)

We can prevent these extra calls per type simply by storing references to each primitive in advance:

public ValueSerializer GetSerializerByType(Type type)
{
  if (ReferenceEquals(type.GetTypeInfo().Assembly, ReflectionEx.CoreAssembly))
  {
    if (type == TypeEx.StringType) //we simply keep a reference to each primitive type
      return StringSerializer.Instance;

    if (type == TypeEx.Int32Type)
      return Int32Serializer.Instance;

    if (type == TypeEx.Int64Type)
      return Int64Serializer.Instance;

Another optimization introduced here was to only do these primitive lookups if the type we want to look up belongs to the core assembly.

This prevents unnecessary comparisons for any user defined type.
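
The TypeEx and ReflectionEx helpers referenced above are essentially just cached references, resolved once and then compared by reference. A minimal sketch of what they could look like (my naming follows the snippet above; the real implementation may differ):

//a sketch: resolve each type reference once, then only compare by reference
public static class TypeEx
{
    public static readonly Type StringType = typeof(string);
    public static readonly Type Int32Type = typeof(int);
    public static readonly Type Int64Type = typeof(long);
    //...one cached field per primitive type
}

public static class ReflectionEx
{
    //the assembly that contains the primitive types
    public static readonly Assembly CoreAssembly = typeof(int).GetTypeInfo().Assembly;
}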

Once we had this, we could do fast serializer lookups for primitive types.
The conclusion from this part was to never assume the cost of anything, always profile, always decompile.

Looking up types when deserializing

Another issue that bit me big time early on was looking up types via fully qualified names during deserialization.

If we want to deserialize a complex type, we first need to:

  1. Read the length of the type name
  2. Read a UTF8 encoded byte array containing the type name
  3. Translate the byte array to a string
  4. Look up the type with this name
  5. Finally, look up the value serializer for this type.

It turns out that looking up types by name, e.g. Type.GetType(name), is really slow.
Another thing that is horribly slow is translating strings to and from UTF8 encoded byte arrays.

To avoid both of these issues, I introduced the idea of a ByteArrayKey, a struct that contains a byte array together with a pre-computed hash code.

This way, we can have a concurrent dictionary from ByteArrayKey to Type for fast lookups.
So instead of doing steps 3 and 4, I could simply take the byte array and look up the type directly.
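
A sketch of such a key follows; the hash function here is a placeholder, and Wire's actual implementation may differ:

//a sketch of a byte array key with a pre-computed hash code
public struct ByteArrayKey : IEquatable<ByteArrayKey>
{
    public readonly byte[] Bytes;
    private readonly int _hashCode;

    public ByteArrayKey(byte[] bytes)
    {
        Bytes = bytes;
        _hashCode = ComputeHash(bytes); //hash once, up front
    }

    public override int GetHashCode() => _hashCode;

    public bool Equals(ByteArrayKey other)
    {
        if (_hashCode != other._hashCode) return false;
        var a = Bytes;
        var b = other.Bytes;
        if (a.Length != b.Length) return false;
        for (var i = 0; i < a.Length; i++) //only compare contents when hashes match
            if (a[i] != b[i]) return false;
        return true;
    }

    public override bool Equals(object obj) => obj is ByteArrayKey && Equals((ByteArrayKey)obj);

    private static int ComputeHash(byte[] bytes)
    {
        unchecked
        {
            var hash = 17;
            foreach (var b in bytes) hash = hash * 31 + b;
            return hash;
        }
    }
}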

The only time we need to execute steps 3 and 4 is when we get a cache miss, i.e. when the type has not been used before. This is a one time operation per type and process.

This had a really nice effect on deserialization performance, and I’m pretty sure most other serializers do not do this trick yet.

Byte buffers, allocations and GC

In the very early code of Wire, I simply ignored how many allocations were made; whenever there was a need to write data into a buffer, I allocated the buffer in place and used it.

For example, when deserializing a string, the code looked something like:

//StringValueSerializer.cs
public override object ReadValue(Stream stream)
{
    var length = stream.ReadInt(); //read the length of the string in bytes
    var buffer = new byte[length]; //allocate a new buffer
    stream.Read(buffer, 0, length);
    return Encoding.UTF8.GetString(buffer);
}

The above might be a bit pseudo but you get the gist of it.
This is clearly inefficient; it takes time to allocate new buffers, and it will hit the GC hard if we have a lot of unused byte arrays floating around.

To solve this issue, I introduced the concept of “Sessions”, there is a SerializerSession and a DeserializerSession.

In the beginning, only the deserializer session contained code to deal with buffer recycling.
This allowed us to do something like this:

//StringValueSerializer.cs
public override object ReadValue(Stream stream, DeserializerSession session)
{
    var length = stream.ReadInt();          //length of the string in bytes
    var buffer = session.GetBuffer(length); //fetch a preallocated buffer
    stream.Read(buffer, 0, length);         //read into the recycled buffer
    return Encoding.UTF8.GetString(buffer, 0, length);
}

The session contains a small pre-allocated buffer, which can be re-used whenever a buffer is needed.
Only if a buffer larger than the existing one is requested does the session re-allocate the buffer.
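
A minimal sketch of that session buffer logic (my naming, not necessarily Wire's exact code):

//a sketch of the recycled session buffer
private byte[] _buffer = new byte[256]; //small buffer, allocated once per session

public byte[] GetBuffer(int length)
{
    if (length > _buffer.Length)
        _buffer = new byte[length]; //only re-allocate when a larger buffer is requested
    return _buffer;
}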

This saves us a lot of allocations and execution time overhead.

Recently, Szymon Kulec (http://blog.scooletz.com, https://twitter.com/Scooletz), part of the Particular Software team, started contributing to Wire.

He has added some truly awesome optimizations to Wire.
One of the things that he did was to introduce the same buffer concept for serializer sessions.

So when data is being written to a stream, we can now use the same trick.
He created an allocation free bit converter, much like the built-in BitConverter, but one that writes bytes into an existing array instead of allocating a new one.

This allowed us to go from code that looked like this:

//Int64ValueSerializer.cs
public override void WriteValue(Stream stream, object value)
{
     long l = (long)value;
     var bytes = BitConverter.GetBytes(l); //this allocates a new byte array every time
     stream.Write(bytes, 0, bytes.Length);
}

To something like this:

//Int64ValueSerializer.cs
public override void WriteValue(Stream stream, object value, SerializerSession session)
{
     const int size = 8; //size of an Int64 in bytes
     long l = (long)value;
     var buffer = session.GetBuffer(size); //fetch a preallocated buffer
     NoBitConverter.GetBytes(l, buffer);   //write the Int64 into the buffer
     stream.Write(buffer, 0, size);
}

This way, we eliminate the same allocation and execution time overhead when serializing.

There are some interesting tradeoffs here also.

Buffer recycling

In Protobuf.NET, Marc Gravell uses a BufferPool, which contains a lock free pool of byte arrays.
These arrays are fairly large, IIRC they are 1024 bytes, and they can be recycled, so once you are done with one of them, you can release it back to the pool.

This is obviously good if you need large buffers as you avoid allocations and execution time overhead from creating them.
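
A sketch of a lock free pool in that spirit; this is not Protobuf.NET's actual code, just an illustration of the Interlocked-based rent/return pattern:

//a sketch of a lock free buffer pool, in the spirit of Protobuf.NET's BufferPool
public static class BufferPool
{
    private const int BufferSize = 1024;
    private static readonly object[] Pool = new object[20]; //fixed number of slots

    public static byte[] Rent()
    {
        for (var i = 0; i < Pool.Length; i++)
        {
            //Interlocked.Exchange makes the take atomic, but it is not free
            var buffer = Interlocked.Exchange(ref Pool[i], null) as byte[];
            if (buffer != null) return buffer;
        }
        return new byte[BufferSize]; //pool empty, allocate
    }

    public static void Return(byte[] buffer)
    {
        for (var i = 0; i < Pool.Length; i++)
        {
            //put the buffer back in the first free slot
            if (Interlocked.CompareExchange(ref Pool[i], buffer, null) == null)
                return;
        }
        //pool full, let the GC collect the buffer
    }
}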

It is fairly easy to use such a pool from within Wire and make the session types use the same BufferPool type; I have tried this myself.
However, it turns out that just touching the Interlocked members used inside the buffer pool hurts Wire’s performance for the kind of objects we aim to optimize for.

Therefore, we do not do this; instead we allocate a small byte array for each call to Serialize or Deserialize and resize it if needed.

Clever allocations

The session types contain different kinds of lookups, e.g. from identifier to object, from identifier to type, from type to identifier and so forth; things that the different sessions need to keep track of while serializing or deserializing.

One such lookup that is being hammered pretty hard during serialization is for checking if we need to output the type manifest for the object that is about to be written.

The type manifest should only be written once per session; for every later occurrence of the same type we instead output an identifier referring to the already written manifest.

This was originally done using a Dictionary.

There are two things going on here: first, we need to allocate this dictionary object for each session, as it keeps track of types per session, and second, we need to perform lookups against it.

Both of those operations are a bit costly, and most messages that we want to serialize are simple POCOs with only a few primitive properties.

Do we really need to allocate and use this dictionary even if, most of the time, there will only be a single type in it?

No, we can simply cheat and allocate it later.
Like this:

public class FastTypeUShortDictionary
{
  private int _length;                   //keeps track of how many types have been added
  private Type _firstType;               //at first, we just set this member field
  private Dictionary<Type, ushort> _all; //this is only allocated once there are two types

The lookup will have 0 to n entries.
When there are 0 entries, we know there is nothing in it, so there are no allocations and any lookup will just return directly.
When there is 1 entry, the _firstType field is set, so any lookup just compares the lookup type with the _firstType field; still no allocations or hash lookups.
Only when we add a second type to the lookup do we fall back and allocate the dictionary, as sketched below.
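
A sketch of the lookup logic just described (the ushort identifier for the single entry case is an assumption on my part):

//a sketch of the fast path lookup
public bool TryGetValue(Type type, out ushort value)
{
    switch (_length)
    {
        case 0: //empty: no allocations, return directly
            value = 0;
            return false;
        case 1: //one entry: a single reference comparison, still no dictionary
            if (type == _firstType)
            {
                value = 0; //the first type gets the first identifier
                return true;
            }
            value = 0;
            return false;
        default: //two or more entries: fall back to the real dictionary
            return _all.TryGetValue(type, out value);
    }
}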

This saves us a lot of allocations and heavy hash lookups, as most messages contain just a single user defined type and a few primitives.

Boxing, Unboxing and Virtual calls

As you might have seen already, the interface of the ValueSerializer type contains methods like abstract object Read(...) and abstract void Write(..., object value,...)

This causes boxing to occur for any value type being written or read.
I was skeptical that there would be any good solution to this due to the shape of the value serializer type that I defined very early on in the project.

Szymon however figured out that as we already do code generation for complex types, we could just as well let the value serializer join the code generation process.

He introduced the idea of EmitWriter and EmitReader into the value serializer.
This allows us to have typed implementations for each primitive and let the value serializer hook into the code generation process to inject the correct code to read and write the primitive, without calling any virtual method and without boxing.

We let the value serializer emit its code using an ICompiler abstraction, like so:

public sealed override void EmitWriteValue(ICompiler<ObjectWriter> c, int stream, int fieldValue, int session)
{
    var byteArray = c.GetVariable<byte[]>(DefaultCodeGenerator.PreallocatedByteBuffer);
    c.EmitStaticCall(_write, stream, fieldValue, byteArray);
}

Fast creation of empty objects

Wire relies on the old FormatterServices.GetUninitializedObject(type) in order to create empty instances of objects; this is because not all types have a default constructor, and we can’t know whether a constructor has side effects or not.

But it turns out that calling a constructor is actually faster; the problem is that we need to know whether it has side effects or not.

You can however extract this information:

var defaultCtor = type.GetTypeInfo().GetConstructor(new Type[] {});
var il = defaultCtor?.GetMethodBody()?.GetILAsByteArray();
var sideEffectFreeCtor = il != null && il.Length <= 8; //this is the size of an empty ctor
if (sideEffectFreeCtor)
{
    //safe to call the constructor instead of GetUninitializedObject
}

By extracting the constructor method body as IL byte code and checking that it is 8 bytes or less, we know it is an empty constructor, and thus side effect free.
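
One way to put that flag to use is to compile a delegate that either invokes the empty constructor or falls back to GetUninitializedObject. This is a sketch on my part; Wire's actual instantiation code generation may differ. It requires System.Linq.Expressions and System.Runtime.Serialization:

//a sketch: pick the faster creation strategy when the ctor is side effect free
Func<object> createInstance;
if (sideEffectFreeCtor)
{
    //compile "() => (object)new T()" once; calling the delegate is cheap
    createInstance = Expression
        .Lambda<Func<object>>(Expression.Convert(Expression.New(defaultCtor), typeof(object)))
        .Compile();
}
else
{
    createInstance = () => FormatterServices.GetUninitializedObject(type);
}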

There are of course a lot of other optimizations and interesting things going on in Wire, but at least this post gives some insight into how we solved the main issues we experienced while building it.

Actor based distributed transactions


One question that often shows up when talking about the Actor Model, is how to deal with distributed transactions.

In .NET there is the concept of MSDTC, the Microsoft Distributed Transaction Coordinator, which can be used to solve this problem when working with things like SQL Server.

The MS Research project Orleans (the MS Azure Actor framework) also supports distributed transactions.
See section 3.8 of http://research.microsoft.com/pubs/153347/socc125-print.pdf on this.
[Edit] The transactions described in that paper were revoked; the Orleans team is currently working on a new implementation.

The problem with distributed transactions is that they are expensive, very expensive, and they do not scale very well.

We as programmers are also trained to think of transactions as some sort of binary, instant event: it either succeeds or it fails, and the time span is very short, but during this time span you freeze and lock everything that is involved.

In the real world however, transactions are more fuzzy, they don’t necessarily succeed or fail in a binary fashion, and they are far from instant.

And just to avoid confusion here, let’s think of this as technical transactions vs. business transactions. In the end, they both ensure a degree of consistency, even though the concepts are different.

In the actor world, we can approach this the same way that the Saga pattern works.
See @Kellabyte’s excellent post on this topic: http://kellabyte.com/2012/05/30/clarifying-the-saga-pattern/

In Kelly’s post, she talks about failures in a technical context, but the failure could very well be a failure to comply with an agreement; there would be no distinction between technical and business failures.

Let’s say that you purchase something on credit in a store, you make up a payment plan that stretches over X months.
During this time, you agree to pay Y amount of money at the end of each month for example.
If you do so, everything is fine, if you don’t, the store will send you a reminder, and if you still don’t pay, you will get fined.
This is a transaction that stretches over a very long time, and it can partially succeed.

Happy path:

[diagram: happy path, all payments made on time]

Partial failure with compensating action:

[diagram: an overdue payment followed by a reminder]

Complete failure with compensating action:

[diagram: payments stop entirely, resulting in a fine]

This is how you could model business transactions in a distributed system, it is asynchronous and it scales extremely well.

When two or more parties begin a transaction, all parties have to agree to some sort of contract before the transaction starts. This is your message flow, much like a protocol, that defines what happens if you violate the contract: a scheme of messages and actions that describes exactly how your (business) transaction is supposed to be resolved, and which of the involved parties needs to ensure that a specific part of this agreement is handled.

This will make your transactional flow very business oriented; it goes very well with the concept of domain driven design. The transactional flow is actually just a process of domain events.
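
To make this a bit more concrete, the agreed contract for the payment plan example could be expressed as plain immutable message types, one per domain event. This is a sketch; the names are mine, not from any framework:

//a sketch of the payment plan contract as immutable messages
public class PaymentPlanStarted
{
    public readonly decimal MonthlyAmount;
    public readonly int Months;
    public PaymentPlanStarted(decimal monthlyAmount, int months)
    {
        MonthlyAmount = monthlyAmount;
        Months = months;
    }
}

public class PaymentReceived
{
    public readonly decimal Amount;
    public PaymentReceived(decimal amount)
    {
        Amount = amount;
    }
}

public class PaymentOverdue { }   //compensating action: send a reminder
public class PaymentDefaulted { } //compensating action: issue a fine

Each party's actor handles the subset of these messages it is responsible for, and the compensating actions are just ordinary message handlers.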

Akka.NET – Concurrency control


Time to break the silence!

A lot of things have happened since I last wrote.
I’ve got a new job at nethouse.se as developer and mentor.

Akka.NET has been making some crazy progress the last few months.
When I last wrote, we were only two developers; now, we are about 10 core developers.
The project also has a fresh new site here: akkadotnet.github.com.
We are at version 0.7 right now, but pushing hard for a 1.0 release as soon as possible.

But now, let’s get back on topic.

In this mini tutorial, I will show how to deal with concurrency using Akka.NET.

Let’s say we need to model a bank account.
That is a classic concurrency problem.
If we would use OOP, we might start with something like this:

public class BankAccount
{
    private decimal _balance;
    public void Withdraw(decimal amount)
    {
        if (_balance < amount)
            throw new ArgumentException(
              "amount may not be larger than account balance");

        _balance -= amount;

        //... more code
    }
    //... more code
}

That seems fair, right?
This will work fine in a single threaded environment where only one thread is accessing the above code.
But what happens when two or more competing threads are calling the same code?

    public void Withdraw(decimal amount)
    {
        if (_balance < amount) //<-

That if-statement might be running in parallel on two or more threads, and at that very point in time the balance is still unchanged, so all threads get past the guard that is supposed to prevent a negative balance.

So in order to deal with this we need to introduce locks.
Maybe something like this:

    private readonly object _lock = new object();
    private decimal _balance;
    public void Withdraw(decimal amount)
    {
        lock(_lock) //<-
        {
           if (_balance < amount)
              throw new ArgumentException(
               "amount may not be larger than account balance");

           _balance -= amount;

        }
    }

This prevents multiple threads from accessing and modifying the state inside the Withdraw method at the same time.
So all is fine and dandy, right?

Not so much... locks are bad for scaling; threads will end up waiting for resources to be freed up.
And in bad cases, your software might spend more time waiting for resources than it does actually running business code.
It will also make your code harder to read and reason about, do you really want threading primitives in your business code?

Here is where the Actor Model and Akka.NET comes into the picture.
The Actor Model makes actors behave “as if” they are single threaded.
Actors have a concurrency constraint that prevents them from processing more than one message at any given time.
This still applies if there are multiple producers passing messages to the actor.

So let’s model the same problem using an Akka.NET actor:


//immutable message for withdraw:
public class Withdraw
{
     public readonly decimal Amount;
     public Withdraw(decimal amount)
     {
         Amount = amount;
     }
}

//the account actor
public class BankAccount : TypedActor, IHandle<Withdraw>
{
     private decimal _balance;

     public void Handle(Withdraw withdraw)
     {
         if (_balance < withdraw.Amount)
         {
              Sender.Tell("fail");
              //you should use real message types here
              return;
         }

          _balance -= withdraw.Amount;
          Sender.Tell("success");
          //and here too
     }
}

So what do we have here?
We have a message class that represents the Withdraw message; the actor model relies on async message passing.
The BankAccount actor is then told to handle any message of type Withdraw by subtracting the amount from the balance.

If the amount is too large, the actor will reply to its sender, telling it that the operation failed because the requested amount was larger than the balance.

In the example code, I use strings as the response for the status of the operation; you probably want to use some real message types for this purpose, but to keep the example small, strings will do fine.
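
A sketch of what such message types could look like (hypothetical names, not part of the example above):

//immutable result messages instead of strings
public class WithdrawSucceeded { }

public class WithdrawFailed
{
    public readonly string Reason;
    public WithdrawFailed(string reason)
    {
        Reason = reason;
    }
}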

How do we use this code then?


ActorSystem system = ActorSystem.Create("mysystem");
...
var account = system.ActorOf<BankAccount>();

var result = await account.Ask(new Withdraw(100));
//result is now "success" or "fail"

That’s about it; we now have a completely lock free implementation of a bank account.

Feel free to ask any question :-)

Deploying actors with Akka.NET


We have now ported both the code and configuration based deployment features of Akka.
This means that you can now use Akka.NET to deploy actors and routers on remote nodes either via code or configuration.

For those new to Akka, what does this mean?

Let’s say that we are building a simple local actor system.
It might have one actor that deals with user input and another actor that does some sort of work.

It could look something like this:

var system = ActorSystem.Create("mysystem");
var worker = system.ActorOf<WorkerActor>("worker");
var userInput = system.ActorOf<UserInput>("userInput");
while(true)
{
    var input = Console.ReadLine();
    userInput.Tell(input);
}

Ok, maybe a bit cheap on the user experience there, but let’s keep the sample small.
Depending on input, the user input actor might pass messages to the worker and order it to perform some sort of work.

Now, let’s say it turns out that our worker can’t handle the load; since actors (act as if they) are single threaded, we might want to add additional workers.
Instead of letting the user input actor know how many workers we have, we can introduce the concept of “Routers”.

Router actors act as a facade on top of other actors, this means that the router can delegate incoming messages to a pool or group of underlying actors.

var system = ActorSystem.Create("mysystem");
var worker1 = system.ActorOf<Worker>("worker1");
var worker2 = system.ActorOf<Worker>("worker2");
var worker3 = system.ActorOf<Worker>("worker3");

var worker = system.ActorOf(Props.Empty.WithRouter(new RoundRobinGroup(new[] { worker1, worker2, worker3 })));

var userInput = system.ActorOf<UserInput>("userInput");
while(true)
{
    var input = Console.ReadLine();
    userInput.Tell(input);
}

Here we have introduced three worker actors and one “round robin group” router.
A round robin group router is a router that uses an array of workers, and for each message it receives, it delegates that message to one of the workers in turn.

We do not need to change any of the other code, as long as the user input actor can find the worker router, we are good to go.

If we want to accomplish the same thing using a config instead, we can do something like this:

 var config = ConfigurationFactory.ParseString(@"
	akka.actor.deployment {
            /worker {
                router = round-robin-pool
                # pool routers are not yet implemented
                # you have to use the group routers with an array of workers still
                nr-of-instances = 5
            }
        }
");
var system = ActorSystem.Create("mysystem",config);
var worker = system.ActorOf<Worker>("worker");
var userInput = system.ActorOf<UserInput>("userInput");
while(true)
{
    var input = Console.ReadLine();
    userInput.Tell(input);
}

The worker router is no longer configured via code but rather in a soft config, which could be placed in an external file if you want.
This means that we can scale up and utilize more CPU of our computer just by changing our configuration.
But what if this is still not enough?
We might need to scale out also, and introduce more machines.
This can be done using “remote deployment”, like this:

 var config = ConfigurationFactory.ParseString(@"
	akka.actor.deployment {
            /worker {
                router = round-robin-pool
                nr-of-instances = 5
                remote = akka.tcp://otherSystem@someMachine:8080
            }
        }

        # ....more config to set up Akka.Remote
");
var system = ActorSystem.Create("mysystem",config);
var worker = system.ActorOf<Worker>("worker");
var userInput = system.ActorOf<UserInput>("userInput");
while(true)
{
    var input = Console.ReadLine();
    userInput.Tell(input);
}

Using this configuration, we can now tell Akka.NET to deploy the worker router on a different machine.
The settings will be read from the config, packed into a remoting message, and sent to the remote system on which we want to create our worker router.
(This of course means that Akka.NET must be running on the remote machine and listening on the port specified in the config.)

So by just adding configuration, we can now scale up and out from a single machine to a remote server, or even a cloud provider e.g. Azure.

That’s all for now :)

For more info, see: https://github.com/rogeralsing/Pigeon

Actor lifecycle management and routers – Akka Actors for .NET


Porting Akka to .NET

I finally got around to implementing full lifecycle management in Pigeon.
The Pigeon actor behavior is now consistent with real Akka.

The following Scala test, https://github.com/akka/akka/blob/master/akka-actor-tests/src/test/scala/akka/actor/ActorLifeCycleSpec.scala,
is now ported to .NET and shows that the lifecycle events fire in the expected order: https://github.com/rogeralsing/Pigeon/blob/master/Pigeon.Tests/ActorLifeCycleSpec.cs

I’ve also managed to port the fundamentals of the Akka Routers, so RoundRobinGroup routing is now in place: https://github.com/rogeralsing/Pigeon/wiki/Routing

A completely boring blog post, but this is what I’ve been up to the last few days :-)

 

Configuring Pigeon – Akka Actors for .NET


When I began to write the configuration support for my Akka Actors port “Pigeon”, I used JSON for the config files.
I’ve now managed to make some nice progress porting Typesafe’s Configuration library too.
So Pigeon now uses HOCON notation for the config files, and thus, allows for re-use of real Akka config files in Pigeon.

This means you can write configurations like:

var config = ConfigurationFactory.ParseString(@"
# we use real Akka Hocon notation configs
akka {
    remote {
        #this is the host and port the ActorSystem will listen to for connections
        server {
            host : 127.0.0.1
            port : 8080
        }
    }
}
");
//consuming code
var port = config.GetInt("akka.remote.server.port");

This might sound overly optimistic, since Pigeon currently only utilizes a handful of the properties from the config.
But still, it’s pretty much only substitution support and numeric units that are missing from the config lib now, so I hope to have a fully working config system in a few days.

Once that is done, I will start incorporating it in the ActorSystem and its sub modules.
For those who are interested, the configuration support can be reused in other kinds of applications.
You can read up on the HOCON spec from Typesafe here: https://github.com/typesafehub/config/blob/master/HOCON.md
My port to .NET can be found here: https://github.com/rogeralsing/Pigeon/tree/master/Pigeon/Configuration

If you are interested in using Actor Model programming for .NET, please check out Pigeon:
https://github.com/rogeralsing/Pigeon