The fio load-testing tool and I/O queue depth: understanding and misconceptions

March 9th, 2012 Yu Feng

Original article. If you repost it, please credit: reposted from 系统技术非业余研究

Fio is a powerful I/O stress-testing tool. I have written quite a bit about using fio in practice; see my earlier posts.

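As block devices have evolved, and especially since SSDs arrived, devices offer more and more internal parallelism. One trick for using them well is to raise the iodepth: hand the device more I/O requests at once, so that the elevator (the I/O scheduler) and the device get a chance to merge and reorder requests and process them in parallel internally, which improves overall efficiency.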

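Applications usually do I/O in one of two ways: synchronous or asynchronous. Synchronous I/O issues one request at a time and returns only after the kernel has completed it, so the iodepth of a single thread never exceeds 1. You can work around this by running several threads concurrently; we typically use 16 to 32 threads working at the same time to fill the queue. Asynchronous I/O uses something like libaio (Linux native AIO) to submit a batch at once and then wait for a batch of completions, which cuts the number of round trips and is more efficient.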
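
The multi-threaded approach can be sketched as follows. This is only an illustration, not how fio implements it: the scratch file, the 512-byte block size and the 16 workers are assumptions picked for the example, and the reads go through the page cache rather than O_DIRECT.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

BLOCK = 512    # assumed block size for the example
WORKERS = 16   # each worker keeps at most one read in flight

def read_block(fd, index):
    # os.pread blocks until the kernel returns the data, so a single
    # thread contributes a queue depth of at most 1.
    return len(os.pread(fd, BLOCK, index * BLOCK))

def parallel_read(path, blocks):
    # Up to WORKERS blocking reads are outstanding at the same time.
    fd = os.open(path, os.O_RDONLY)
    try:
        with ThreadPoolExecutor(max_workers=WORKERS) as pool:
            return sum(pool.map(lambda i: read_block(fd, i), range(blocks)))
    finally:
        os.close(fd)

# Hypothetical scratch file standing in for the device under test.
with tempfile.NamedTemporaryFile(delete=False) as scratch:
    scratch.write(os.urandom(BLOCK * 64))
total = parallel_read(scratch.name, 64)
os.unlink(scratch.name)
print(total)
```

The queue depth seen by the device here is bounded by the number of worker threads, which is why a synchronous workload needs many threads to keep a parallel device busy.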

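The right I/O queue depth depends heavily on the device. So how do we use fio to find a reasonable value?

Let's first look at the parameters related to iodepth: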
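
One simple way to probe is to run the same job at several depths and compare the results. The helper below is a sketch that only generates the job-file text; `make_job`, the list of depths and the device path are assumptions for illustration, and the option names are the ones used in the job file shown later in this post.

```python
def make_job(iodepth, device="/dev/nvdisk0"):
    # Build the text of a fio job file for one queue depth.
    lines = [
        "[global]",
        "bs=512",
        "ioengine=libaio",
        "direct=1",
        "rw=randrw",
        "rwmixwrite=20",
        "time_based",
        "runtime=180",
        f"iodepth={iodepth}",
        "[test]",
        f"filename={device}",
        "numjobs=1",
    ]
    return "\n".join(lines) + "\n"

# Hypothetical sweep: powers of two from 1 to 64.
depths = [1, 2, 4, 8, 16, 32, 64]
jobs = {depth: make_job(depth) for depth in depths}
print(jobs[16])
```

Each generated job can be saved to a file and run with fio; the depth beyond which IOPS stops improving while latency keeps rising is a reasonable candidate.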

iodepth=int
Number of I/O units to keep in flight against the file. Note that increasing iodepth beyond 1 will not affect synchronous ioengines
(except for small degrees when verify_async is in use). Even async engines may impose OS restrictions causing the desired depth not to be
achieved. This may happen on Linux when using libaio and not setting direct=1, since buffered IO is not async on that OS. Keep an eye on
the IO depth distribution in the fio output to verify that the achieved depth is as expected. Default: 1.

iodepth_batch=int
Number of I/Os to submit at once. Default: iodepth.

iodepth_batch_complete=int
This defines how many pieces of IO to retrieve at once. It defaults to 1 which
means that we’ll ask for a minimum of 1 IO in the retrieval process from the kernel. The IO retrieval will go on until we hit the limit
set by iodepth_low. If this variable is set to 0, then fio will always check for completed events before queuing more IO. This helps
reduce IO latency, at the cost of more retrieval system calls.

iodepth_low=int
Low watermark indicating when to start filling the queue again. Default: iodepth.

direct=bool
If true, use non-buffered I/O (usually O_DIRECT). Default: false.

fsync=int
How many I/Os to perform before issuing an fsync(2) of dirty data. If 0, don’t sync. Default: 0.

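The documentation explains fairly clearly what these parameters do under the libaio engine, but allow me to walk through the I/O request flow once more: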

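The libaio engine calls io_setup with the iodepth value to prepare a context that can hold iodepth submitted I/Os, and it allocates an I/O request queue to hold them. While the test runs, fio generates the I/O requests and drops them into that queue. When the number of queued I/Os reaches iodepth_batch, it calls io_submit to submit them as a batch, and then calls io_getevents to reap the I/Os that have completed. How many does each reap collect? Because the timeout is set to 0 when reaping, it takes however many have completed, up to iodepth_batch_complete. As I/Os are reaped, the queue drains and has to be topped up with new I/Os. When? Once the number of I/Os drops to iodepth_low, fio refills the queue, which guarantees that the OS always sees at least iodepth_low I/Os waiting at the elevator.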
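
The flow described above can be sketched as a toy model. This is not fio's code: `simulate` and its variable names are made up for illustration, and it assumes that every submitted I/O has already completed by the time it is reaped, which a real device does not guarantee.

```python
def simulate(iodepth=16, batch=8, low=8, batch_complete=8, total=64):
    """Return the sizes of each submit batch and each reap batch."""
    queued = 0      # prepared but not yet submitted
    in_flight = 0   # submitted and not yet reaped
    issued = 0
    done = 0
    submits, reaps = [], []
    while done < total:
        # Fill the queue, submitting whenever iodepth_batch I/Os are queued.
        while issued < total and queued + in_flight < iodepth:
            queued += 1
            issued += 1
            if queued == batch:
                submits.append(queued)
                in_flight += queued
                queued = 0
        if queued:
            # Submit a partial batch when nothing more can be queued.
            submits.append(queued)
            in_flight += queued
            queued = 0
        # Reap up to iodepth_batch_complete at a time until the low
        # watermark is reached, then go back to refilling.
        while in_flight > 0:
            n = min(max(batch_complete, 1), in_flight)
            reaps.append(n)
            in_flight -= n
            done += n
            if in_flight <= low and issued < total:
                break
    return submits, reaps

submits, reaps = simulate()
print(submits)
print(reaps)
```

With the values from the job file in this post (iodepth=16, iodepth_batch=8, iodepth_low=8, iodepth_batch_complete=8), the model alternates between submitting 8 and reaping 8, which matches the batches of 8 visible in the strace output further down.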

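Note: the documentation's description of these parameters has a few small problems; for example, some of the stated defaults are not quite right. My advice is to set these parameters explicitly.

How do we confirm that fio is really working according to our configuration? fio provides a diagnostic option, --debug=io. Let's demonstrate: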

# cat nvdisk-test
[global]
bs=512
ioengine=libaio
userspace_reap
rw=randrw
rwmixwrite=20
time_based
runtime=180
direct=1
group_reporting
randrepeat=0
norandommap
ramp_time=6
iodepth=16
iodepth_batch=8
iodepth_low=8
iodepth_batch_complete=8
exitall
[test]
filename=/dev/nvdisk0
numjobs=1

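A few points in the fio job configuration deserve close attention:
1. libaio needs the file to be opened in direct mode.
2. The block size must be a multiple of the sector size (512 bytes).
3. userspace_reap speeds up the reaping of asynchronous I/O.
4. ramp_time runs the workload for the given lead-in time before any performance numbers are logged, so start-up effects stay out of the results for high-speed I/O.
5. As long as direct is on, no fsync is issued.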

# fio nvdisk-test --debug=io
fio: set debug option io
io       22441 load ioengine libaio
io       22441 load ioengine libaio
test: (g=0): rw=randrw, bs=512-512/512-512, ioengine=libaio, iodepth=16
fio 2.0.5
Starting 1 process
io       22444 invalidate cache /dev/nvdisk0: 0/8589926400
io       22444 fill_io_u: io_u 0x6d3210: off=3694285312/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d3210: off=3694285312/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d3210)=0
io       22444 queue: io_u 0x6d3210: off=3694285312/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d2f80: off=4595993600/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d2f80: off=4595993600/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d2f80)=0
io       22444 queue: io_u 0x6d2f80: off=4595993600/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d2cb0: off=3825244160/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d2cb0: off=3825244160/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d2cb0)=0
io       22444 queue: io_u 0x6d2cb0: off=3825244160/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d29a0: off=6994864640/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d29a0: off=6994864640/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d29a0)=0
io       22444 queue: io_u 0x6d29a0: off=6994864640/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d2710: off=2572593664/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d2710: off=2572593664/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d2710)=0
io       22444 queue: io_u 0x6d2710: off=2572593664/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d2400: off=3267822080/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d2400: off=3267822080/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d2400)=0
io       22444 queue: io_u 0x6d2400: off=3267822080/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d2130: off=7099489280/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d2130: off=7099489280/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d2130)=0
io       22444 queue: io_u 0x6d2130: off=7099489280/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6d1ea0: off=7682447872/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d1ea0: off=7682447872/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d1ea0)=0
io       22444 queue: io_u 0x6d1ea0: off=7682447872/len=512/ddir=0//dev/nvdisk0
io       22444 calling ->commit(), depth 8
io       22444 fill_io_u: io_u 0x6d1b90: off=5983331840/len=512/ddir=0//dev/nvdisk0
io       22444 prep: io_u 0x6d1b90: off=5983331840/len=512/ddir=0//dev/nvdisk0
io       22444 ->prep(0x6d1b90)=0
io       22444 queue: io_u 0x6d1b90: off=5983331840/len=512/ddir=0//dev/nvdisk0
io       22444 fill_io_u: io_u 0x6cdfa0: off=6449852928/len=512/ddir=0//dev/nvdisk0
...

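This shows the I/O flow in detail. The method does not require deep familiarity with the OS, so it is quite practical.

Another method is to trace the system calls with strace, which is a bit more direct.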

# pstree -p
init(1)─┬─agent_eagleye(22296)
        ├─screen(13490)─┬─bash(18324)─┬─emacs(19429)
        │               │             ├─emacs(20365)
        │               │             ├─emacs(21268)
        │               │             ├─fio(22452)─┬─fio(22454)
        │               │             │            └─{fio}(22453)
        │               │             └─man(20385)───sh(20386)───sh(20387)───less(20391)
        ├─sshd(1834)───sshd(13115)───bash(13117)───screen(13662)
        └─udevd(705)─┬─udevd(1438)
                     └─udevd(1745
# strace -p 22454
...
io_submit(140534061244416, 8, {{(nil), 0, 1, 0, 3}, {(nil), 0, 0, 0, 3}, {(nil), 0, 0, 0, 3}, {(nil), 0, 0, 0, 3}, {(nil), 0, 0, 0, 3}, {(nil), 0, 1, 0, 3}, {(nil), 0, 1, 0, 3}, {(nil), 0, 0, 0, 3}}) = 8
io_getevents(140534061244416, 8, 8, {{(nil), 0x6d3210, 512, 0}, {(nil), 0x6d2f80, 512, 0}, {(nil), 0x6d2cb0, 512, 0}, {(nil), 0x6d29a0, 512, 0}, {(nil), 0x6d2710, 512, 0}, {(nil), 0x6d2400, 512, 0}, {(nil), 0x6d2130, 512, 0}, {(nil), 0x6d1ea0, 512, 0}}, NULL) = 8
...

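The last effective trick is to use iostat -dx 1 to confirm that your iodepth suits the characteristics of the device.

(I am using an NVRAM card whose device driver has no queue, so iostat cannot show its queue depth. I therefore used a screenshot from a different device instead, to show that iostat can be used to watch the I/O queue depth. Thanks to reader Uranus for pointing this out.)
Once these methods confirm that your configuration is right, the data you go on to analyze will actually mean something.

Have fun!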


Categories: Tools Tags: fio, iodepth, libaio

Comments (14)

Uranus

March 10th, 2012 at 01:12 | #1

Doesn't the block size also have to be a multiple of the page size?

One more thing I don't understand: how is the iodepth confirmed from iostat (avgqu-sz is 13.14)?

Thanks...

Yu Feng Reply:
March 10th, 2012 at 10:38 am
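1. In direct mode the block size must be a multiple of 512; in other modes it must be a multiple of 4096.
2. I am using an NVRAM card whose device driver has no queue, so iostat cannot show its queue depth. I used a screenshot from a different device instead, to show that iostat can be used to watch the I/O queue depth.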

bo_zhou Reply:
August 10th, 2012 at 2:31 pm
When checking with iostat -dx 1, the calculation should be avgqu-sz/avgrq-sz = io_depth, right?

bo_zhou Reply:
August 10th, 2012 at 2:38 pm
The explanation in the iostat man page is a bit confusing, but I ran a test with iodepth set to 8
and 32 concurrent threads.
iostat -dx 1 showed avgrq-sz=32 avgqu-sz=254.0, so it looks like avgqu-sz/avgrq-sz=io_depth

bo_zhou Reply:
August 10th, 2012 at 7:30 pm
I got that wrong. avgrq-sz is the average size of each request, and avgqu-sz is the queue length. That queue length
is roughly equal to numjobs * iodepth
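
The rule of thumb in the reply above can be checked against the numbers reported in this thread (32 threads, iodepth 8, avgqu-sz 254.0). The function name is made up for illustration.

```python
def expected_queue_depth(numjobs, iodepth):
    # Each job keeps up to iodepth I/Os in flight against the device.
    return numjobs * iodepth

expected = expected_queue_depth(32, 8)
observed = 254.0  # avgqu-sz reported by iostat in the test above
print(expected, observed / expected)
```

The observed queue length sits within about 1% of numjobs * iodepth, so the device queue was essentially kept full.
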
duxing

June 27th, 2012 at 18:30 | #2

"As long as direct is on, no fsync is issued": so direct I/O does not need fsync?

Yu Feng Reply:
June 27th, 2012 at 11:33 pm
Yes.
charmland

March 4th, 2013 at 17:38 | #3

The job file above sets time_based, randrepeat=0 and norandommap together. What does each of them do?

Yu Feng Reply:
March 5th, 2013 at 12:36 pm
time_based determines how long the job runs. norandommap and randrepeat make the contents of the generated data buffers differ from run to run.
Yun Mao

April 5th, 2013 at 04:03 | #4

Are you sure userspace_reap is in effect? The man page says ” The reaping mode is only enabled when polling for a minimum of 0 events (eg when iodepth_batch_complete=0).” But you have iodepth_batch_complete=8 set. Thanks.

Yu Feng Reply:
April 7th, 2013 at 2:09 pm
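The memory that AIO uses is allocated in the user process's address space, so at reap time both the kernel and user space can access it. That memory is just a simple ring structure, so to save system calls, high-performance servers choose to reap in user space (userspace_reap). fio supports this mode, and it really is effective.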
liu.li

January 20th, 2014 at 16:22 | #5

Hi, I have a question. When I test with fio, my job file is as follows:
[global]
bs=1m
ioengine=libaio
time_based
direct=1
size=130g
group_reporting=1
iodepth=16
invalidate=1
numjobs=24
timeout=500
filename=/dev/sdb2

[read]
stonewall
rw=read

[write]
stonewall
rw=write

[randread]
stonewall
rw=randread

[randwrite]
stonewall
rw=randwrite

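The disks are three 600 GB SAS drives in RAID 5. The sequential read bandwidth came out at 2.7 GB/s,
which does not seem realistic. Why is that?

The results are as follows: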

read: (groupid=0, jobs=24): err= 0: pid=61608: Mon Jan 20 08:15:51 2014
read : io=1358.2GB, bw=2781.5MB/s, iops=2781, runt=500023msec
slat (usec): min=58, max=308919, avg=8605.12, stdev=22443.51
clat (usec): min=389, max=718347, avg=129418.99, stdev=34687.73
lat (usec): min=528, max=718493, avg=138024.55, stdev=31929.99
clat percentiles (msec):
| 1.00th=[ 61], 5.00th=[ 67], 10.00th=[ 74], 20.00th=[ 123],
| 30.00th=[ 126], 40.00th=[ 128], 50.00th=[ 130], 60.00th=[ 131],
| 70.00th=[ 133], 80.00th=[ 135], 90.00th=[ 151], 95.00th=[ 188],
| 99.00th=[ 260], 99.50th=[ 326], 99.90th=[ 379], 99.95th=[ 396],
| 99.99th=[ 519]
bw (KB /s): min=28672, max=147456, per=4.18%, avg=118986.51, stdev=13766.18
lat (usec) : 500=0.01%, 750=0.01%, 1000=0.01%
lat (msec) : 2=0.01%, 4=0.01%, 10=0.01%, 20=0.01%, 50=0.06%
lat (msec) : 100=10.59%, 250=88.19%, 500=1.14%, 750=0.02%
cpu : usr=0.04%, sys=1.73%, ctx=259114, majf=0, minf=13094
IO depths : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=100.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.1%, 32=0.0%, 64=0.0%, >=64=0.0%
issued : total=r=1390783/w=0/d=0, short=r=0/w=0/d=0
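
One quick way to read these numbers is Little's law, which says the average number of requests in flight equals throughput multiplied by average latency. The sketch below plugs in the iops and average lat values from the output above; the variable names are made up for illustration.

```python
iops = 2781                 # from "iops=2781"
avg_lat_usec = 138024.55    # from "lat (usec): ... avg=138024.55"

# Little's law: requests in flight = throughput * time spent per request.
in_flight = iops * avg_lat_usec / 1_000_000
configured = 24 * 16        # numjobs * iodepth from the job file

print(in_flight, configured)
```

The result is very close to numjobs * iodepth, so all 384 slots were occupied throughout the run and each request spent roughly 138 ms waiting, most of it queued.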

Yu Feng Reply:
January 21st, 2014 at 2:36 pm
The RAID card has memory, and the access is sequential, so of course it is possible.

liu.li Reply:
January 22nd, 2014 at 11:40 am
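The RAID card has memory? I don't follow. Do you mean the RAID card's cache? How does the RAID card's caching mechanism improve I/O performance?
Could you explain?

If you mean the system's own memory: my system has 64 GB of RAM and the test sets size=130 GB. The test runs long enough that it should fill it up.

I hope you can clarify.

Much appreciated.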

Yu Feng Reply:
January 22nd, 2014 at 2:51 pm
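A RAID card usually has 512 MB or 1 GB of memory. After the operating system submits data, the data is not written to disk immediately but stays in this memory, which relies on a battery for persistence.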

liu.li Reply:
January 23rd, 2014 at 4:05 pm
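My RAID card has 1 GB of cache, but can that cache really raise disk I/O performance to 2.7 GB/s? In theory, three 600 GB SAS drives in RAID 5 should deliver only two or three hundred MB/s of bandwidth, right? I don't quite get it.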

liu.li Reply:
January 22nd, 2014 at 11:49 am
Which layer of cache does direct=1 bypass? Only the operating system level, or the hardware level as well?

Thanks for replying

Yu Feng Reply:
January 22nd, 2014 at 2:48 pm
The operating system's.
liu.li

January 23rd, 2014 at 16:08 | #6

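Ba Ye, is there a way to ask you questions online? Could I add you by email, MSN, QQ or the like? Waiting for comment replies is agonizing. If that is possible, please send your contact details to my mailbox: jiayu_66@126.com. Thank you, Ba Ye.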

Yu Feng Reply:
January 23rd, 2014 at 7:49 pm
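See this: http://www.lsi.com/downloads/Public/MegaRAID%20SAS/41450-04_RevA.pdf
Performance figures excerpted from it:
- 6.0 Gb/s Serial Attached SCSI (SAS) performance
- 6.0 Gb/s SATA III performance
- Eight-lane, 5 GT/s (4Gb/s) PCI Express host interface
In other words, the theoretical speed of PCIe x8 is 4 Gb/s * 8 = 32 Gb/s = 4 GB/s, and memory bandwidth is around 5-6 GB/s. Your 2.x GB/s is perfectly normal and matches expectations. We attach many SSDs to our RAID cards, and this kind of performance is not surprising.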
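
The arithmetic behind that estimate can be written out as a short sketch. It assumes the 8b/10b line coding used by 5 GT/s PCI Express links, which is what turns 5 GT/s into the 4 Gb/s per lane quoted in the datasheet excerpt; the variable names are made up for illustration.

```python
GT_PER_S = 5.0        # raw transfer rate per lane, from the datasheet excerpt
ENCODING = 8 / 10     # 8b/10b coding: 8 data bits per 10 bits on the wire
LANES = 8

per_lane_gbit = GT_PER_S * ENCODING            # usable Gb/s per lane
link_gbyte = per_lane_gbit * LANES / 8         # usable GB/s for the x8 link

measured_gbyte = 2.7  # sequential read bandwidth reported in this thread
print(per_lane_gbit, link_gbyte, measured_gbyte < link_gbyte)
```

The measured 2.7 GB/s is below the roughly 4 GB/s the host link can carry, so the link itself does not rule the result out.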

Yu Feng Reply:
January 23rd, 2014 at 8:02 pm
gmail: mryufeng@gmail.com
zhangyue

February 4th, 2014 at 15:27 | #7

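A question, please,
about the direct parameter.
The documentation says: This may happen on Linux when using libaio and not setting direct=1. So does that mean that using the libaio engine requires direct=0 and buffered=1?
Is that what it means?
But in practice people generally set direct=1 while buffered stays at its default of 1. Isn't that a conflict? What am I getting wrong?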

zhangyue Reply:
February 4th, 2014 at 3:31 pm
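With direct=1, whether or not buffered=0 is also set makes quite a difference to the result. (buffered=1 here means not setting it, that is, leaving the default.)
buffered=0 ==> read : io=19456MB, bw=491779KB/s, iops=122944
buffered=1 ==> read : io=19456MB, bw=700723KB/s, iops=175180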
zengzhi

March 17th, 2014 at 17:02 | #8

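Can fio run IOPS tests against distributed storage systems such as Hadoop and MooseFS?
I tried pointing filename at the mount point of the distributed storage, and it reported "error=Is a directory". How should I test in this case?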

Yu Feng Reply:
March 17th, 2014 at 5:26 pm
Any standard mount point should work. Run fio with --debug to see the specific cause.

zengzhi Reply:
March 18th, 2014 at 11:27 am
My moose mount point looks like this:
[root@node05 tmp]# df -h /mnt/mfs
Filesystem Size Used Avail Use% Mounted on
192.168.2.13:9421 1.3T 0 1.3T 0% /mnt/mfs

Then, when I run fio like this, it fails: fio --filename=/mnt/mfs/ --directory=/mfsOut/ --name=test --rw=randread --iodepth=1 --direct=1 --thread --size=1M --debug=io --bs=4K

The error is:
fio: pid=0, err=21/file:filesetup.c:59, func=unlink, error=Is a directory

Run status group 0 (all jobs):
io 23460 ioengine cpuio unregistered
io 23460 ioengine mmap unregistered
io 23460 ioengine sync unregistered
io 23460 ioengine psync unregistered

Yu Feng Reply:
March 18th, 2014 at 11:49 am
--filename=/mnt/mfs/ --directory=/mfsOut/ How can filename be a directory?

zengzhi Reply:
March 18th, 2014 at 2:00 pm
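Then how should I test moose? /mnt/mfs/ is my moose mount directory, and 192.168.2.13:9421 is the IP address.

I saw earlier in your tutorial that filename points at the device under test, so I pointed it at /mnt/mfs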

Yu Feng Reply:
March 18th, 2014 at 2:53 pm
Mine is a raw device, such as /dev/xxxx. A raw device is itself a file. What you have mounted is a directory.

Yu Feng Reply:
March 18th, 2014 at 2:54 pm
[global]
runtime=86400
time_based
group_reporting
directory=/your_dir
ioscheduler=deadline
refill_buffers

[mysql-binlog]
filename=test-mysql-bin.log
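
The distinction made in this exchange, a raw device or file goes in filename while a mounted directory goes in directory with a file name inside it, can be sketched as a small helper. `fio_target_args` and the default file name `fio-test.dat` are hypothetical; only the --filename and --directory options themselves come from the thread.

```python
import os
import tempfile

def fio_target_args(path, test_file="fio-test.dat"):
    # A directory (for example a mount point) cannot be used as filename;
    # pass it as directory and let fio create a file inside it.
    if os.path.isdir(path):
        return [f"--directory={path}", f"--filename={test_file}"]
    # A raw device or a regular file is used directly.
    return [f"--filename={path}"]

# Hypothetical stand-in for a mount point such as /mnt/mfs.
with tempfile.TemporaryDirectory() as mount_point:
    dir_args = fio_target_args(mount_point)
dev_args = fio_target_args("/dev/xxxx")
print(dir_args, dev_args)
```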

zengzhi Reply:
March 18th, 2014 at 3:36 pm
Then how do I test when the device under test is a network location mounted through FUSE?

zengzhi Reply:
March 18th, 2014 at 11:29 am
How should distributed storage like this be tested? Thank you very much.

MyCo Reply:
July 21st, 2014 at 2:38 pm
fio --directory=/mnt/mfs/ --direct=1 --rw=randwrite --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio --bs=4k --rwmixread=100 --iodepth=1 --numjobs=100 --runtime=120 --group_reporting --name=4ktestwrite --size=500M
若谷

April 22nd, 2014 at 17:30 | #9

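Hello, I have a question. I ran fio -name iops -filename /dev/sdb -ioengine libaio -direct=1 -bssplit=16K -iodepth 128 -rw=write, and while it ran I watched iostat: the merged request size was always 16K, which means no I/O merging took place. Doesn't submit_bio go through elv_merge? Everything under /sys/block/sdb/ is at its default, and I have not changed the no_merge option either. Ba Ye, what is going on here? I have been puzzling over this for a long time.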

Yu Feng Reply:
April 22nd, 2014 at 8:17 pm
Run fio with --debug=all and look at how fio issues the operations.

若谷 Reply:
April 23rd, 2014 at 11:55 am
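Oh, I see now. Thank you, Ba Ye. O(∩_∩)O. I had misunderstood the elevator's merge: elevator merging only places adjacent I/Os together; it does not combine them into a single iovec.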
MyCo

May 9th, 2014 at 09:49 | #10

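I have a question. I ran fio with numjobs set to 20, doing 4k reads. At the start of the run the jobs looked like this:
Jobs: 20 (f=20): [rrrrrrrrrrrrrrrrrrrr]
About ten minutes later it had become
Jobs: 5 (f=5): [rrrrr_______________]
and IOPS dropped a lot. What is going on?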

Yu Feng Reply:
May 9th, 2014 at 1:59 pm
Is it an SSD? Device performance can fluctuate.
liuli

June 23rd, 2014 at 14:21 | #11

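Ba Ye,

I have a question about fio test results. Looking at the results of my fio runs, I noticed a problem:

When testing with a bs of 512 KB or larger, I monitored bandwidth and tps with iostat -m 2. After the fio run finished, I compared its results with the bandwidth and IOPS that iostat had reported. The bandwidth values are the same, but the IOPS values differ: the IOPS seen by iostat is exactly twice the IOPS in the fio results. Dividing the bandwidth in the fio results by the IOPS gives exactly the configured bs.

This does not happen with small blocks of 256K or below.

I wonder whether you have noticed this issue. I would appreciate your guidance.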

yang Reply:
June 29th, 2016 at 12:02 pm
The storage unit of your test device is 256K. You have probably set up striping, and a default of 256K is normal.
MyCo

November 18th, 2014 at 14:36 | #12

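I'd like to ask about fio's --rw parameter. write is sequential write and randwrite is random write; what is the difference between the two? Is it really just that the generated data blocks are laid out sequentially or randomly?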
hongmeng

October 3rd, 2015 at 20:35 | #13

@liu.li
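Here is how I see it:
1. A single SAS disk delivers only about 200 MB/s of throughput. With 3 SAS disks in RAID 5, throughput will not exceed 600 MB/s, and in practice RAID 5 performs below three independent disks. I have not measured it, but this is the bottleneck and it cannot be exceeded.
2. RAID card caches come in 512 MB, 1 GB and 2 GB sizes, and they can cache data that has been read.
3. Your fio read test has 24 processes, and at any given moment those 24 processes are accessing the same LBA, as shown here:
#fio raidtest --debug=io | grep off | grep fill_io_u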
io 21622 fill_io_u: io_u 0xf42040: off=0/len=1048576/ddir=0//dev/sda
io 21621 fill_io_u: io_u 0xf42040: off=0/len=1048576/ddir=0//dev/sda
io 21622 fill_io_u: io_u 0xf42040: off=1048576/len=1048576/ddir=0//dev/sda
io 21621 fill_io_u: io_u 0xf42040: off=1048576/len=1048576/ddir=0//dev/sda
io 21622 fill_io_u: io_u 0xf42040: off=2097152/len=1048576/ddir=0//dev/sda
io 21621 fill_io_u: io_u 0xf42040: off=2097152/len=1048576/ddir=0//dev/sda
io 21622 fill_io_u: io_u 0xf42040: off=3145728/len=1048576/ddir=0//dev/sda
io 21621 fill_io_u: io_u 0xf42040: off=3145728/len=1048576/ddir=0//dev/sda
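This is an example with 2 processes; you can see that the offsets are the same.
4. At that point the first process fetches the data from the SAS disks and it is cached in the RAID card's cache. The other processes then hit the RAID card, whose cache returns the data directly without going to the disks, so performance looks very good.
5. That is why you see 2.7 GB/s: everything is returned from the RAID card's cache (DRAM) without fetching data from the disks. But such a number is not meaningful.
6. Sequential reads and writes measure throughput, and 16*24=384 concurrent I/Os is pointless for that. Use high concurrency when measuring random read/write IOPS and latency. With a RAID card plus SAS disks, both IOPS and throughput have real hardware bottlenecks, so pick a suitable level of concurrency. At a concurrency of 384 the requests merely queue up in the system; the hardware cannot actually process them.
7. fio's behavior of having its processes access the same LBA is really annoying. I'll change that when I get a chance.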
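
The overlap described in point 3 can be sketched numerically. The sketch assumes fio's offset_increment option, which the fio documentation describes as making each job start at offset + offset_increment * job number; check that your fio version supports it. The function name and the 5 GiB increment are made up for illustration.

```python
GIB = 1024 ** 3

def job_start_offsets(numjobs, offset=0, offset_increment=0):
    # Starting offset of each cloned job, numbered from 0.
    return [offset + offset_increment * n for n in range(numjobs)]

# Without an increment, all 24 jobs start at the same offset.
same = job_start_offsets(24)

# With a hypothetical 5 GiB increment, each job reads its own region.
spread = job_start_offsets(24, offset_increment=5 * GIB)

print(len(set(same)), len(set(spread)))
```

With distinct starting offsets the jobs no longer read the same blocks at the same time, so a controller cache is less likely to serve most of the reads.
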
colin_zhen

November 12th, 2015 at 17:11 | #14

Hello, I have a question. This is latency data from testing one of our storage solutions with fio:

clat percentiles (usec):
| 1.00th=[ 47], 5.00th=[ 48], 10.00th=[ 49], 20.00th=[ 52],
| 30.00th=[ 58], 40.00th=[ 70], 50.00th=[ 82], 60.00th=[ 84],
| 70.00th=[ 86], 80.00th=[ 92], 90.00th=[ 126], 95.00th=[ 157],
| 99.00th=[ 221], 99.50th=[ 239], 99.90th=[ 262], 99.95th=[ 274],
| 99.99th=[ 318]
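What exactly does 99.99th mean here? My understanding is that it is the average of the largest 0.01% of latencies. Is that right?
Also, in the data I captured these latencies show up quite randomly. Is there any way to optimize this? Thanks.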
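
On the meaning of 99.99th: a percentile is a threshold, not an average. The 99.99th percentile is the latency that 99.99% of the samples do not exceed. The sketch below uses the nearest-rank definition on made-up samples to show how that differs from the mean of the slowest 0.01%; fio derives its percentiles from its own latency statistics, so this only illustrates the concept.

```python
import math

def percentile(samples, p):
    # Nearest-rank percentile: the smallest sample such that at least
    # p percent of all samples are less than or equal to it.
    ordered = sorted(samples)
    # round() removes floating-point noise before taking the ceiling.
    rank = max(1, math.ceil(round(p / 100 * len(ordered), 9)))
    return ordered[rank - 1]

# Hypothetical latencies: 10000 samples of 1..10000 usec.
latencies = list(range(1, 10001))

p9999 = percentile(latencies, 99.99)

# Mean of the slowest 0.01% of samples, which is a different quantity.
tail = sorted(latencies)[-1:]
tail_mean = sum(tail) / len(tail)

print(p9999, tail_mean)
```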