I played around with building a userland block device on top of Linux's Network Block Device (NBD).
The backend is just plain memory, so, as in the previous post, I benchmarked it with fio.
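An NBD userland server mostly boils down to reading fixed-size request headers off the socket and answering them. As a minimal sketch (not my actual implementation), here is how the kernel's 28-byte `struct nbd_request` header could be parsed in Rust; the struct and field names are mine, while the magic constant comes from the NBD protocol:

```rust
use std::convert::TryInto;

// NBD_REQUEST_MAGIC from the NBD protocol; every request the kernel
// sends on the data socket starts with this big-endian value.
const NBD_REQUEST_MAGIC: u32 = 0x2560_9513;

#[derive(Debug)]
struct NbdRequest {
    cmd: u32,        // 0 = read, 1 = write, 2 = disconnect, ...
    handle: [u8; 8], // opaque cookie, echoed back in the reply
    offset: u64,     // byte offset into the device
    len: u32,        // request length in bytes
}

// Parse one fixed 28-byte request header; None on a bad magic.
fn parse_request(buf: &[u8; 28]) -> Option<NbdRequest> {
    let magic = u32::from_be_bytes(buf[0..4].try_into().unwrap());
    if magic != NBD_REQUEST_MAGIC {
        return None;
    }
    Some(NbdRequest {
        cmd: u32::from_be_bytes(buf[4..8].try_into().unwrap()),
        handle: buf[8..16].try_into().unwrap(),
        offset: u64::from_be_bytes(buf[16..24].try_into().unwrap()),
        len: u32::from_be_bytes(buf[24..28].try_into().unwrap()),
    })
}

fn main() {
    // Hand-craft a "write 512 bytes at offset 4096" request.
    let mut buf = [0u8; 28];
    buf[0..4].copy_from_slice(&NBD_REQUEST_MAGIC.to_be_bytes());
    buf[4..8].copy_from_slice(&1u32.to_be_bytes());
    buf[16..24].copy_from_slice(&4096u64.to_be_bytes());
    buf[24..28].copy_from_slice(&512u32.to_be_bytes());
    let req = parse_request(&buf).unwrap();
    println!("cmd={} offset={} len={}", req.cmd, req.offset, req.len);
}
```

For a memory backend, serving a request is then just a `copy_from_slice` into (or out of) the backing buffer, followed by a 16-byte reply header.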
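For reference, a fio job file that roughly matches the options visible in the output below might look like this. The `directory` path (a filesystem mounted on the nbd device) and the per-file layout are my guesses from the "Laying out IO files (8 files / total 250MiB)" lines; everything else is read off the result header:

```ini
; Reconstructed job, not the original: 4 jobs, libaio, qd64,
; 4 KiB random mixed read/write for 180 seconds.
[global]
ioengine=libaio
direct=1
rw=randrw
bs=4k
iodepth=64
numjobs=4
runtime=180
time_based

[rw]
directory=/mnt/nbd   ; hypothetical mount point of /dev/nbd0
nrfiles=8
size=250m
```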
```
rw: (g=0): rw=randrw, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-2.20
Starting 4 processes
rw: Laying out IO files (8 files / total 250MiB)
rw: Laying out IO files (8 files / total 250MiB)
rw: Laying out IO files (8 files / total 250MiB)
rw: Laying out IO files (8 files / total 250MiB)
Jobs: 4 (f=32): [m(4)][100.0%][r=200MiB/s,w=199MiB/s][r=51.3k,w=50.9k IOPS][eta 00m:00s]
rw: (groupid=0, jobs=4): err= 0: pid=679: Sun Jun 11 06:52:31 2017
  read: IOPS=60.3k, BW=235MiB/s (247MB/s)(41.4GiB/180002msec)
    slat (usec): min=1, max=33650, avg= 9.24, stdev=31.16
    clat (usec): min=0, max=38215, avg=1007.59, stdev=907.35
     lat (usec): min=15, max=45771, avg=1017.25, stdev=918.07
    clat percentiles (usec):
     |  1.00th=[  466],  5.00th=[  540], 10.00th=[  580], 20.00th=[  644],
     | 30.00th=[  692], 40.00th=[  732], 50.00th=[  772], 60.00th=[  812],
     | 70.00th=[  876], 80.00th=[  988], 90.00th=[ 1352], 95.00th=[ 2896],
     | 99.00th=[ 5408], 99.50th=[ 5728], 99.90th=[ 7136], 99.95th=[ 8512],
     | 99.99th=[14272]
   bw (  KiB/s): min=21080, max=167656, per=0.02%, avg=60272.28, stdev=14824.84
  write: IOPS=60.3k, BW=235MiB/s (247MB/s)(41.4GiB/180002msec)
    slat (usec): min=1, max=22476, avg=12.33, stdev=34.13
    clat (usec): min=86, max=46102, avg=3212.60, stdev=2211.13
     lat (usec): min=98, max=46108, avg=3225.39, stdev=2215.15
    clat percentiles (usec):
     |  1.00th=[  548],  5.00th=[  700], 10.00th=[  796], 20.00th=[ 1080],
     | 30.00th=[ 1496], 40.00th=[ 2040], 50.00th=[ 2736], 60.00th=[ 3472],
     | 70.00th=[ 4384], 80.00th=[ 5344], 90.00th=[ 6368], 95.00th=[ 7072],
     | 99.00th=[ 8384], 99.50th=[ 9152], 99.90th=[13504], 99.95th=[16768],
     | 99.99th=[28800]
   bw (  KiB/s): min=21392, max=166912, per=0.02%, avg=60277.15, stdev=14798.88
  lat (usec) : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec) : 250=0.01%, 500=1.35%, 750=25.00%, 1000=22.80%
  lat (msec) : 2=17.53%, 4=14.70%, 10=18.43%, 20=0.15%, 50=0.02%
  cpu          : usr=5.31%, sys=19.70%, ctx=38960144, majf=0, minf=50
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwt: total=10848673,10849680,0, short=0,0,0, dropped=0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: bw=235MiB/s (247MB/s), 235MiB/s-235MiB/s (247MB/s-247MB/s), io=41.4GiB (44.4GB), run=180002-180002msec
  WRITE: bw=235MiB/s (247MB/s), 235MiB/s-235MiB/s (247MB/s-247MB/s), io=41.4GiB (44.4GB), run=180002-180002msec

Disk stats (read/write):
  nbd0: ios=10809774/10679800, merge=28452/159229, ticks=7736680/30982613, in_queue=38873356, util=100.00%
```
It's a pretty rough implementation, but it seems to get about the same throughput as lio without blk-mq. Newer kernels let you assign multiple sockets to a single nbd device, so next I want to make the server multithreaded and see what changes. I wrote the implementation in Rust. Rust is great; use Rust.
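The threaded variant would roughly mean one serving thread per kernel-supplied socket, all sharing the single memory backend. A minimal sketch of that shape (the socket handling is simulated here; real code would loop on each fd handed to `NBD_SET_SOCK`):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Shared in-memory backing store; a real server would wrap the
// Vec in something finer-grained than one big Mutex.
fn read_at(store: &Arc<Mutex<Vec<u8>>>, offset: usize, len: usize) -> Vec<u8> {
    let s = store.lock().unwrap();
    s[offset..offset + len].to_vec()
}

fn write_at(store: &Arc<Mutex<Vec<u8>>>, offset: usize, data: &[u8]) {
    let mut s = store.lock().unwrap();
    s[offset..offset + data.len()].copy_from_slice(data);
}

fn main() {
    let store = Arc::new(Mutex::new(vec![0u8; 1 << 20])); // 1 MiB device
    // Pretend the kernel gave us 4 sockets: one worker thread each.
    let handles: Vec<_> = (0..4usize)
        .map(|i| {
            let store = Arc::clone(&store);
            thread::spawn(move || {
                // Each "connection" writes its own 4 KiB block, reads it back.
                let off = i * 4096;
                write_at(&store, off, &[i as u8; 4096]);
                read_at(&store, off, 4096)
            })
        })
        .collect();
    for (i, h) in handles.into_iter().enumerate() {
        assert!(h.join().unwrap().iter().all(|&b| b == i as u8));
    }
    println!("all threads ok");
}
```

With real sockets the per-thread loop would block on its own fd, so requests queued to different sockets get served in parallel without any coordination beyond the store lock.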