Problems I see with the ATA-over-Ethernet protocol

July 15, 2007

I've been experimenting with AoE lately, and as a result I've been looking at the protocol more than I did in my earlier exposure. Unfortunately, the more I look at the AoE protocol, the more uncomfortable I get.

The AoE protocol is quite simple: requests and replies are plain Ethernet frames, and a request's result must fit in a single reply packet. This means that the maximum read and write sizes per request are bounded by the size of the Ethernet frame. AoE does all IO in 512-byte sectors, and only two whole sectors plus the headers fit in a standard 1500-byte frame payload, so on a normal Ethernet the maximum is 1K per request.
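To make the frame-size arithmetic concrete, here is a rough sketch of how many sectors fit in one frame. The 22-byte header figure is an assumption based on my reading of the AoE header layout, not something stated above, and the function names are made up for illustration.

```python
SECTOR_BYTES = 512       # AoE does all IO in 512-byte sectors
AOE_HEADER_BYTES = 22    # ASSUMPTION: AoE + ATA header bytes carried inside the Ethernet payload


def sectors_per_frame(mtu):
    """Whole 512-byte sectors that fit in one frame payload of `mtu` bytes."""
    return max(0, (mtu - AOE_HEADER_BYTES) // SECTOR_BYTES)


def max_request_bytes(mtu):
    """Largest read or write a single AoE request can carry at this MTU."""
    return sectors_per_frame(mtu) * SECTOR_BYTES


if __name__ == "__main__":
    # 1500 is the standard Ethernet MTU; 9000 is a common jumbo frame size.
    for mtu in (1500, 9000):
        print(mtu, sectors_per_frame(mtu), max_request_bytes(mtu))
```

Any header size up to a few hundred bytes gives the same answer at a 1500-byte MTU, since three sectors (1536 bytes) already exceed the payload on their own.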

So, the problems I see:

  • AoE effectively requires the target to do buffering in order to bridge the gap between AoE's small requests and the large IO requests that modern disk systems need to see to get decent performance.

    Buffering writes makes targets less transparent and more dangerous. Requiring read buffering means that target performance goes down dramatically if the target can't do it, either because it can't predict the necessary readahead pattern or because it has run out of spare memory.

    (I am especially worried about readahead prediction because we will be using this for NFS servers that are used by a lot of people at once, so the targets will see what looks like random IO. I do not expect target-based readahead to do at all well in that situation.)

  • Because AoE uses such small requests and replies, it must send and receive a huge number of packets per second to get full bandwidth. For example, on a normal Ethernet getting 100 Mbytes/sec of read bandwidth requires handling over 200,000 packets per second (about 100,000 pps sent and 100,000 pps received).

    This is a problem because most systems are much better at handling high network bandwidth than they are at handling high numbers of packets per second. (And historically, the pps rate machines can handle has grown more slowly than network bandwidth has.)
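The packet rate figure above can be checked with a short calculation. This is only a sketch: it assumes one request packet per reply packet for reads, takes 100 Mbytes/sec to mean 100 * 1024 * 1024 bytes, and uses a hypothetical function name.

```python
def read_packet_rates(bytes_per_sec, payload_bytes):
    """Return (requests_sent, replies_received) per second for streaming reads.

    Assumes every reply carries a full payload and that each reply
    is triggered by exactly one request packet.
    """
    # Ceiling division: a partial final payload still costs a whole packet.
    replies = -(-bytes_per_sec // payload_bytes)
    requests = replies
    return requests, replies


if __name__ == "__main__":
    target = 100 * 1024 * 1024   # ASSUMPTION: "100 Mbytes/sec" in binary megabytes
    sent, received = read_packet_rates(target, 1024)   # 1K per request on normal Ethernet
    print(sent, received, sent + received)
```

The payload size is a parameter, so the same function shows how the packet rate falls as the per-request size grows.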

The packets per second issue probably only really affects reads; there are few disk systems that can sustain 100 Mbytes/sec of writes, but it is not difficult to build one that can do 100 Mbytes/sec of reads.

(And the interesting thing for us is to build a system that will still manage to use the full network bandwidth when it is not one streaming read but 30 different people each doing their own streaming reads, all being mixed together on the target.)

I find all of this unfortunate. I would like to like AoE, because it has an appealing simplicity; however, I'm a pragmatist, so simplicity without performance is not good enough.

Sidebar: the buffer count problem

There's a third, smaller problem. The 'Buffer Count' in the server configuration reply (section 3.2 of the AoE specification) cannot mean what it says it means. The protocol claims that this is a global limit, that it is:

The maximum number of outstanding messages the server can queue for processing.

The problem is that one initiator has no idea how many messages other initiators are currently sending the server. So this has to actually be the number of outstanding messages a single initiator can send the server, and it is the server's responsibility to divide up a global pool among all of the initiators.

(In practice this means that the server needs to be manually configured to know how many initiators it has.)
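As a minimal sketch of what dividing up a global pool might look like, assuming the server is told its initiator count in advance and simply splits the pool evenly (the function name and the even-split policy are illustrative assumptions, not anything the AoE specification describes):

```python
def per_initiator_buffer_count(global_pool, initiators):
    """Buffer Count to advertise to each initiator under an even split.

    `global_pool` is the number of messages the server can really queue;
    `initiators` is the manually configured number of initiators.
    """
    if initiators < 1:
        raise ValueError("need at least one initiator")
    share = global_pool // initiators
    if share < 1:
        # Advertising 0 would stall every initiator, so refuse outright.
        raise ValueError("pool too small for this many initiators")
    return share


if __name__ == "__main__":
    print(per_initiator_buffer_count(64, 4))
```

An even split is deliberately pessimistic: idle initiators leave their share of the pool unused, which is the price of the server not knowing who is actually busy.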


Comments on this page:

From 216.105.40.123 at 2011-05-11 04:41:39:

Just like with iSCSI, nobody serious about running an ethernet SAN sticks with 1500 mtu packets. All of my AoE SANs use 9000 mtu packets and I stream a little over 100MB/s of actual disk throughput on gigabit ethernet links.

By cks at 2011-05-11 10:33:07:

Jumbo frames aren't necessary on 1G Ethernet for iSCSI. We found no visible performance difference when we tested (which is good, since we found no cheap switches with good jumbo frame implementations), and this matches the experience and recommendations of the IET developers.

Written on 15 July 2007.

Last modified: Sun Jul 15 23:39:52 2007