2008 Linux Storage & Filesystem Workshop (LSF '08)
February 25-26, 2008, San Jose, CA

"Storage Track" summary by Grant Grundler
(C) Copyright 2008 Google Inc

Thanks also to James Bottomley, Martin Petersen, Chris Mason, and the speakers for help in collecting material for this summary.

The original schedule wasn't exactly followed. I've organized this summary to match the actual schedule:

Actual Schedule
    http://iou.parisc-linux.org/lsf2008/SCHEDULE.txt
Original Schedule
    http://www.usenix.org/event/lsf08/tech/

This document is distributed under the Creative Commons Attribution License as described here:
    http://code.google.com/policies.html
Actual license text is here:
    http://creativecommons.org/licenses/by/2.5/

This complies with Usenix Copyright/redistribution policy:
    http://www.usenix.org/publications/login/writing.html#Copyright

Usenix (and anyone actually) has the right to reprint, reuse and redistribute this article or just parts of it. Please give Google and Usenix attribution. Usenix requests reprints should include the text "Reprinted from ;login: The Magazine of USENIX, vol. XX, no. YY (Berkeley, CA: USENIX Association, [year of publication]), pp. nn-nn."

Executive Summary
-----------------

Several themes came up over the two days:

1) SSDs (Solid State Disk) are coming. Good presentation by Dongjun Shin (Samsung) on SSD internal operation. Some discussion on which parameters were needed for optimal operation (theme #2 below). IO stack needs both micro-optimizations (perf within driver layers) and architectural changes (e.g. parameterize the key attributes so FS's can utilize SSDs optimally). Intel presented scsi_ram and ata_ram drivers to help developers tune the SCSI, ATA, and block IO subsystems for these orders-of-magnitude-faster (random read) devices.

Hybrid drives were a hot topic at LSF2007 but only briefly discussed in the introduction this year.
The conclusion was that separate cache management was a non starter for hybrid drives (this was the model employed by first generation hybrid drives). If they're to be useful at all, the drive itself has to manage the cache. The hybrid drives are expected to provide methods allowing the OS to manage the cache as well. It was suggested that perhaps linux can just give the hybrid drive flash as a second device to the filesystems.

2) The "device parameters" discussion is just beginning on how to parametrise device characteristics to the block IO schedulers and file systems. For instance, SSDs want all writes to be in units of the erase block size if possible, and device mapping layers would like better control over alignment and placement. Key discussion here is how to provide enough parameters to be useful but not so many that "users" (e.g. file systems) get it wrong. General consensus was that more than 2 or 3 parameters would cause more problems than it solved.

3) IO priorities and/or bandwidth sharing have lots of folks interested in IO schedulers. Considered splitting the IO scheduler into two parts: an upper half to deal with the different needs of feeding the Q (limit bio resource consumption) and a lower half to rate limit what gets pushed to the storage driver.

4) Network Block Storage: two technologies were previewed for addition to the linux kernel: pNFS (Parallel NFS) and FCoE (Fibre Channel over Ethernet). Neither is ready for kernel.org inclusion but some constructive guidance was given on what directions specific implementations needed to take. Issues iSCSI was facing were also presented and discussed. User- vs Kernel-space drivers were hot topics within those "Networked Block Storage" forums.

MONDAY February 25, 2008
------------------------

9:00 - 9:30
Introduction and Opening Statements
Recap of Last Year
Chris Mason, James Bottomley (aka "jejb")

This session was primarily a scorecard of how many topics discussed last year are fixed or implemented this year.
The bright spots were the new filesystem (BTRFS = B-Tree FS?) and emerging support for OSD (Object-based Storage Device) in the form of bidirectional command integration (done) and long CDB commands (pending); it was also mentioned that Seagate is looking at producing OSD drives. Error handling was getting better (no more breaking a 128k write up into individual sectors for the retry) but there's still a lot of work to be done and we have some new tools to help test error handling. The 4k sector size, which was a big issue last year, has receded in importance because manufacturers are hiding the problem in firmware.

9:30 - 10:15
SSD
Dongjun Shin
Samsung Electronics
http://iou.parisc-linux.org/lsf2008/ssd-Dongjun_Shin.pdf

Solid State Disk (SSD) storage was a recurring theme in the workshop. Dongjun gave an excellent introduction and details of how SSDs are organized internally (sort of a 2-d matrix). The intent was that FS folks could understand how data allocation and read/write requests should be optimally structured. "Stripes" and "channels" are the two dimensions used to increase the level of parallelization and thus the throughput of the drive. The exact configuration is vendor specific. The tradeoff is reducing stripe size to allow multithreaded apps to have multiple IO pending without incurring the "lock up a channel during erase operation" penalty for all pending IOs. HDDs prefer large sequential IOs whereas SSDs prefer many smaller random IOs.

Dongjun presented "postmark" (mail server) performance numbers for various file systems. "nilfs" seemed to be an obvious leader of performance for most cases and was never the worst. Successive slides gave more details on some of the FSs tested. Some notable issues:
  o Flush Barriers kill XFS performance
  o BTRFS (B-Tree FS) performance better with 4K blocks than with 16k blocks.
Dongjun shared what specific standards were available or being added to enable better performance of SSDs:
  o Trim command to truncate and mark space as "unused"
  o SSD identify to report page and erase block sizes (for FS and Vol Mgr)

His last slide neatly summarised the issues and it covers the entire storage stack from FS to Block Layer to IO controller.

Discussion: jejb asked which parameter (is there only one?) was the most important one. The answer wasn't entirely clear but it seemed to be the "erase block" size followed by the "stripe" size. This was compared to HD RAID controller design which would like to export similar information.

Flush Barriers are the only Block IO barriers defined today and we need to revisit that. One issue is that flush barriers kill performance on SSDs since the "Flash Translation Layer" can no longer coalesce IOs and has to write data out in blocks smaller than the erase block size. Ideally the file system would just issue writes using erase block sizes. We (linux kernel community) need to define more appropriate barriers that work better for SSDs and still allow file systems to indicate required ordering/completions of IOs.

10:30 - 11:00
Error Handling
Ric Wheeler
EMC
[ no slides ]

Ric Wheeler introduced the perennial Error Handling topic with the note that bad sector handling had markedly improved over the "total disaster" it was in 2007. He moved on to silent data corruption and noted that the situation here was improving with data checksumming now being built into filesystems (most notably BTRFS and XFS) and emerging support for T10 DIF.

The "forced unmount" topic provoked a lengthy discussion with James Bottomley claiming that, at least from a block point of view, everything should just work (surprise ejection of USB storage was cited as the example). Ric countered that NFS still doesn't work and others pointed out that even if block IO works, the filesystem might still not release the inodes.
Ted Ts'o closed the debate by drawing attention to a yet to be presented paper at FAST2008 showing over 1,300 cases where errors were dropped or lost in the block and filesystem layers.

Error injection was the last topic. Everybody agreed that if errors are forced into the system, it's possible to consistently check how errors are handled. The session wrapped up with Mark Lord demonstrating new hdparm features to induce an uncorrectable sector failure on a SATA disk with the WRITE_LONG and WRITE_UNC_EXT commands. This forces the "on disk" CRCs to mismatch, thus allowing at least medium errors to be injected from the base of the stack.

11:00 - 11:45
Power Management
Kristen Carlson Accardi
Intel
http://iou.parisc-linux.org/lsf2008/power_management-Kristen_Carlson_Accardi.pdf
http://iou.parisc-linux.org/lsf2008/power_management-Kristen_Carlson_Accardi.odp

Search using google found another excellent link:
    http://www.lesswatts.org/tips/disks.php

Arjan van de Ven wrote PowerTOP and it's been useful in tracking down processes that cause CPU power consumption but not IO. kjournald and pdflush are shown as the "apps" responsible but obviously they are just surrogates for finishing async IO. e.g. postfix uses sockets which triggers inode updates:
  - consider lazy update of non-file inodes?
  - "virtual inodes" were also suggested

With ALPM (Aggressive Link Power Management), up to 1.5 watts per disk can be saved on desktop systems. Unlike disk drives, no HW issues have been seen with repeated power up/down of the Phy, so this is safer to implement. The problem is AHCI can only tell which state the drive is currently in (link down or not) and not how long or _when_ the link state changes (hard to gather metrics or determine perf impact).

Performance was of interest since trading off power means some latency will be associated with coming back up to a full power state.
The transition (mostly due to AN - Async Negotiation when restoring power to the Phys) from SLUMBER to ACTIVE state costs about 10ms. Normal benchmarks show no performance hit as the drive is always busy. We need to define a "bursty" power benchmark that is more typical of many environments.

Kristen presented three more ideas on where linux could help save power. The first was to "batch average" group IO: where waiting 5-30 seconds to flush data is normal, instead wait up to 10 minutes before flushing. The second "suggestion" was a question: Can the block layer provide hints to the low level driver? E.g. "soon we are going to see IO, wake up". The third suggestion was making "Smarter" timers to limit CPU power up events. I.e. coordinate the timers so they can wake up at the same time, do necessary work, then let the CPU go to a low power state for a longer period of time. But we need a new interface that specifies how much variability each timer can tolerate.

Ric Wheeler (EMC) opened up the discussion on "Could we power down a disk?" since the savings there are typically 6-15 watts per disk. But powering up disks requires coordination across the data center. Otherwise network traffic could cause lots of drives to spin up at the same time. He also suggested using SSD to cache and wondered how that might work.

Eric Reidel (Seagate): EPA requirements - Idle CPU vs HD? One would be trading off power consumption for Data Access. Regarding "Spin Up/Down" usage: Seagate can design for higher down/up lifecycles. Currently it's not a high count only because Seagate is not getting data from OEMs on how high that count needs to be. It was noted that one version of Ubuntu was killing drives after a few months by spinning them down/up too often.

Last question: will ALPM commands work over a SATA port multiplier? Kristen thought it should but had not tested it. She also suggested using a "Watts up" meter to measure actual power savings.
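
The "smarter timers" idea from this session can be illustrated with a small sketch. It assumes each timer declares a window [earliest, latest] in which it is willing to fire, which is one possible way to express "how much variability each timer can tolerate". The function name, the tuple layout, and the sample timers are hypothetical and are not an existing kernel interface; the grouping uses a standard greedy pass over the windows.

```python
from typing import List, Tuple

# (name, earliest, latest): the timer may fire any time in [earliest, latest].
# This tuple layout is an assumption made for the sketch.
Timer = Tuple[str, float, float]


def coalesce(timers: List[Timer]) -> List[Tuple[float, List[str]]]:
    """Group timers into as few CPU wakeups as possible.

    Greedy approach: sort by the latest allowed time, wake up at the
    latest moment the most urgent timer tolerates, and fire every other
    timer whose window already includes that moment.
    """
    wakeups = []
    pending = sorted(timers, key=lambda t: t[2])
    while pending:
        wake_at = pending[0][2]
        # Every pending timer has latest >= wake_at, so a timer can fire
        # now exactly when its earliest time has been reached.
        fired = [t for t in pending if t[1] <= wake_at]
        pending = [t for t in pending if t[1] > wake_at]
        wakeups.append((wake_at, [t[0] for t in fired]))
    return wakeups


if __name__ == "__main__":
    sample = [
        ("flush", 5.0, 30.0),
        ("poll", 9.0, 11.0),
        ("stats", 10.0, 20.0),
        ("lease", 40.0, 45.0),
    ]
    for wake_at, names in coalesce(sample):
        print(wake_at, names)
```

With these sample windows, four timers collapse into two wakeups, which is the effect the suggestion is after: fewer CPU power-up events and longer stretches in a low power state.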
11:45 - 12:30
[IO] CFQ and Containers
Fernando Luis Vazquez Cao
Hiroaki Nakano
NTT collaborating with VMware
http://iou.parisc-linux.org/lsf2008/IO-CFQ_vs_Containers-Fernando_Luis_Vázquez_Cao.pdf

The talk touched on three related topics: Block IO resources and cgroups, I/O group scheduling, and IO bandwidth allocation (ioband driver). "cgroups" covered how to "define arbitrary groupings of processes" and the "ioband" driver was about how to manage IO bandwidth available to those groups. The proposals were not accepted as-is but the user facing issues were agreed upon. The use case would be Xen, KVM, or (I assumed) VMware.

Currently the IO Priority is determined by the process which initiated the IO. But the IO priority applies to _all_ devices that process is using. This changed in the month preceding the conference and the speakers acknowledged that. A more complex scheme was proposed that supports hierarchical assignment of resource control (e.g. CPU, memory and IO priorities). A graphical example is given on slide 9 (of 18).

Proposed was "page_cgroup" to track write bandwidth (but not needed/used for read bandwidth tracking). One must track when the page is dirtied - forking across dirty pages is a trouble area. The page would get assigned to a cgroup when the BIO is allocated. One advantage of the "get_context()" approach is it does NOT depend on the "current" process and thus would also work for kernel threads.

Slide #12 proposed three ideas on "group aware" scheduling/bandwidth control. This was a critical slide for this presentation and much of the content below will refer to those three ideas. This revisited the "Stackable Requests" discussion and it was clear Request-DM (Device Mapper) multipath needs the same infrastructure.

Idea #1 proposed a layer between the IO scheduler and IO driver. This needs some changes to elevator.c and additional infrastructure changes. Jens Axboe pointed out one can't control the incoming queue from below the Block IO scheduler.
The scheduler needs to be informed when the device is being throttled from below in order to prevent the IO scheduler queue from getting excessively long and consuming excessive memory resources. Jens suggested they start with #1 since it implements "Fairness".

Idea #2 was generally not accepted. This made the last section of the talk moot (slides 15 to 17).

For idea #3 (group scheduler above LVM "make_request"), adding a hook so cgroup can limit I/O handed to a particular scheduler was proposed and got some traction. Jens thought #3 would require less infrastructure than #1. Effectively, #3 would lead to a "variable sized Q-depth". And #3 would limit BIO resource allocation.

Fernando/Hiroaki's IO bandwidth control implementation was using Token Buckets. This implements proportional sharing of available bandwidth. They also wanted a "max bandwidth" limit for each cgroup. Ric Wheeler observed that the latency of "seekiness" reduces the "usable bandwidth" and makes workload/bandwidth prediction impossible.

Discussion about "needs to be scheduler independent" ensued. Most folks are attracted to this idea because "code sharing is good" and one wouldn't have to touch any of the schedulers directly. If integrated in the IO scheduler, IO bandwidth control would add some overhead for all users - even those that don't want it. Jens was encouraging as well.

But the idea has some downsides too. After LSF, Naveen Gupta (google) talked with Jens Axboe and thought he convinced Jens this may not be such a good idea. One can implement such a bandwidth limiting policy many ways and different schedulers will interact differently with each one. And it seems that the scheduler issuing its own IO (e.g. anticipatory reads) will just break the B/W limiting and make the wrong tradeoffs. One also can't merge cgroup IOs together and the IO scheduler would have to know that.

"Tracking I/O" was about how to track write bandwidth, not read. The slide 13 diagram showed the proposed data structures.
It would associate "ownership" (aka IO Priority) of dirtied pages with the cgroup or original process. A dirty page would get assigned to a cgroup when allocating the BIO. One must track dirtied pages for this to work but forking with dirtied pages is going to be ugly. James Bottomley preferred the "io_context" approach.

1:30 - 2:20
[IO] NCQ Emulation
Gwendal Grignou
Google
http://iou.parisc-linux.org/lsf2008/IO-SATA_NCQ_issues-Gwendal_Grignou.pdf
https://docs.google.com/a/google.com/Presentation?id=dhmm8dwf_57rfm8b9tq#

Gwendal started by explaining what Native Command Queuing (NCQ) was, his test environment (fio) and which workloads were expected to benefit. In general, the idea is to let the device determine (and decide) the optimal ordering of IOs since it knows the current head position on the track and the seek times to any IOs it has in its queue. Obviously, the more choices the device has, the better choices it can make and thus the better overall throughput the device will achieve. Results he presented bear this out - in particular for small (< 32K), random read workloads (e.g. classic database).

But the problem is that since the device is deciding the order, it can choose to ignore some IOs for quite a while too. And thus latency sensitive applications will suffer occasionally with IOs taking more than 1-2 seconds to complete. He implemented and showed the results of a "queue plugging" that "starved" the drive of new IO requests until the oldest request was no longer over a given threshold. Other methods to achieve the same effect were discussed but each had its drawbacks (including this one).

He also showed how by pushing more IO to the drive, we also impact the ability of Block Schedulers to coalesce IO and anticipate which IOs to issue next.

Briefly discussed (and rejected) was replicating the drive topology knowledge in the IO scheduler. German "CT Magazine" was stated to have written a program to accurately characterize drive models.
This was rejected since it introduced a substantial maintenance issue and also required knowing RPM and tracking the "current head position".

Jens Axboe also noted that I/O Priorities and B/W allocation policies become harder to enforce. And while NCQ was effective on a "best case" benchmark, it was debated how effective it would be in real life (perhaps < 5%). The actual performance gains depend too much on workload and often the best performance "improvements" come from merging IO requests (which also reduces the number of rotations of the platters for a given IO sequence).

2:20 - 3:00
[IO] Making the IO Scheduler Aware of the Underlying Storage Topology
Aaron Carroll and Joshua Root
University of New South Wales
http://iou.parisc-linux.org/lsf2008/IO-Scheduler_and_Topology-Aaron_Carroll.pdf

Disclosure: I arranged the grant from Google to fund this work. HP is also funding a portion of this work.

Aaron and Joshua have created an infrastructure to measure the performance of any particular block trace and were interested in seeing how IO schedulers behave under particular workloads. The perf slides are graphs of how the various schedulers perform as one increases the number of processes generating the workload. Note: with the "idle window" parameter turned off, (IIRC) the CFQ (Completely Fair Queuing) scheduler looks like the NOOP scheduler.

They tested the following schedulers:
    AS (Anticipatory Scheduler), CFQ (Completely Fair Queuing), Deadline, FIFO, NOOP

And a few different configs:
  o RAID 0 sequential, async
  o single disk random and sequential
  o 10 disk RAID 0 random and sequential

From this, they started trying to figure out which parameters were relevant:
  o Queue Depth
  o Underlying Storage device type
  o RAID Topology

and asked what was the right way to determine those parameters:
  o User Input
  o runtime microbenchmark measurements
  o Ask lower layers

Queue Depth is generally not as important and not helpful for any sort of anticipation.
For Device type, it would be obvious to ask the underlying device driver but this needs a suitable level of abstraction and later slides discussed that. For RAID topology, the key info was "stripe boundaries" (others referred to this as "alignment") and width. He suggested one could do per spindle scheduling but that wasn't "warmly embraced".

Ric Wheeler: can see differences in perf depending on Seek profile.
  o if Most IOs are to one disk at a time
  o if Array is doing read ahead
  o Random reads for RAID 3/5/6 depend on worst case (slowest drive)

Jens: Disk type could be exported easily
  o "plugging" : stop Q to build bigger IO
  o "anticipatory" : start new IO, AFTER previous one has completed but before application has requested the data/meta data.

Discussed how to split fairness/bandwidth sharing/Priorities (or whatever you want to call it) so a component above the SW RAID md driver would manage incoming requests. A "lower half" of the scheduler would do "time slice" (fairness per device instead of per process). Need to define an API for this.

Also noted that CFQ can "unfairly" penalize bursty IO measurements. Suggestion was to use a Token Bucket to mitigate bursty traffic.

Discussion resumed on which parameters to export. Partitioned LUNs are a simple case of stupidity ("Don't do that on devices that allow varying the size of the LUN") which exporting the "preferred alignment and size" could avoid. The problem is the RAID stripe becomes misaligned if the start of the LUN doesn't happen to align with the underlying RAID stripe. XFS does attempt to align with the RAID stripe and gets messed up on partitioned LUNs.

Aaron and Joshua introduced two new schedulers that might be useful in the future: FIFO (true fifo, no merging), V(R) SSTF. No discussion on these - I hope they get submitted soon.

Lastly, there was a brief discussion about a CFQ bug they (Aaron and Joshua) found. This was related to the case where the CFQ scheduler tries to measure Q-depth.

3:30 - 4:20
[IO] DMA Representations
sg_table vs.
sg_ring
IOMMUs and LLD's Restrictions
Fujita Tomonori
http://iou.parisc-linux.org/lsf2008/IO-DMA_Representations-fujita_tomonori.pdf

(LLD = Low Level Driver, e.g. NIC or HBA device driver)

Fujita did an excellent job of summarizing the current mess that is used inside the linux kernel to represent DMA capabilities of devices. As Fujita dove straight into the technical material with no introduction, I'll attempt to explain what an IOMMU is and the Kernel DMA API.

Historically, there have always been IO devices (e.g. ISA, EISA, or more recently 32-bit PCI) which were not capable of generating physical addresses for all of system RAM. The solution without an IOMMU is a "bounce buffer": DMA to a "low" address the device can reach and then memcpy to the target location. IOMMUs can virtualize (aka remap) host physical address space for a device and thus allow these "legacy" devices to "directly" DMA to any memory address. The bounce buffer is no longer necessary and we save the CPU cost of the memcpy. IOMMUs can also provide isolation and containment of IO devices (prevent any given device from spewing crap over random memory - think Virtual Machines), merging of scatter-gather lists into fewer "IO bus addresses" (more efficient block IO transfers), and provide DMA cache coherency for virtually indexed/tagged CPUs (e.g. PA-RISC).

The PCI DMA Mapping interface was introduced into the Linux 2.4 kernel by Dave Miller primarily to support IOMMUs. James Bottomley updated this to support non-cache coherent DMA and become "Bus Agnostic" by authoring the Documentation/DMA-API.txt in Linux 2.6 kernels. (Just to be clear, neither author did this alone - see the documents for shared credits.)

The current DMA API also does not require the IOMMU (IO Memory Management Unit) drivers to respect the "max segment length" (ie IOMMU support is coalescing DMA into bigger chunks than the device can handle). The DMA alignment (ie boundaries a DMA cannot cross) has similar issues. e.g.
some PCI devices can't DMA across a 4GB address boundary. Currently, the drivers which have either length or alignment limitations have code to split the DMA up into smaller chunks again. max_seg_boundary_mask in the request queue is not visible to the IOMMU since only "struct device *" is passed to IOMMU code.

Slide 7 proposes adding a new "struct device_dma_parameters" in order to put all the DMA related parameters into one place and make them visible to IOMMUs. This idea was rejected as overkill. The preferred solution (jejb and others) was to just add the missing fields to struct device. Slide 10 summarized where all the various parameters currently live and adding another struct to deal with it was just adding more pointers that we didn't really need. Most devices do DMA, which argues for adding the fields directly to "struct device".

jejb also confessed he added "u64 *dma_mask" to avoid ripping dma_mask out of pci_dev and thus annoying Dave Miller (who was primary author of the "PCI DMA API"). He agreed this needs to happen though.

Next issue discussed was IOMMU performance and IO TLB flushing. IOMMU driver (and HW) performance is critical to good system performance. New x86 platforms support virtualization of IO and thus it's not just a "high end RISC" computer problem. Issues discussed related to IOMMU mapping/unmapping:

1) How to best manage IOMMU address space? Common Code? IOMMU drivers use either a bitmap (most RISC) or, in Intel's case, a "Red Black" tree. He tried converting POWER to use a Red/Black tree and lost 20% performance with netperf. jejb and ggg agree the address allocation policy needs to be managed by the IOMMU or arch specific code since IO TLB replacement policy dictates the optimal method for allocating IOMMU address space.

2) "When should we flush IOTLB?" (slide 15 of 27) One would like to avoid flushing the IO TLB since (a) it's expensive (as measured in CPU cycles) and (b) it disturbs outstanding DMA (forces reloading IO TLB).
However, if we flush the entries when the driver claims the DMA is done, we can prevent DMA going to a virtual DMA address that might have been freed and/or reallocated to someone else. ie flushing IO TLB is required to prevent devices from corrupting RAM. This is NOT a common problem but having an IOMMU to enforce DMA access to RAM is the only way to conclusively determine this (assuming you don't have expensive bus analyzers). Bottom line: the tradeoff is between performance and "safety" (aka robustness).

3) Should we just map everything once? (Assumes IOMMU can span all of RAM). IIRC, some RISC archs did this in the 2.2 kernel. The performance advantage is that one doesn't need to map, unmap, and flush IO TLB for individual pages. This trades off isolation (any device can DMA anywhere). Useful in some cases (e.g. "embedded" devices like an NFS server).

The last DMA mapping related issue was SG chaining vs SG rings. This is a settled issue (SG chaining is preferred) and Rusty wasn't present to argue the "other side". The scsi_sglist() macro was changed after 2.6.24 (slide 20 of 27). SCSI data accessors do allow insertion between chains - the last entry has a flag to indicate the next chain is valid. Slide 24 (of 27) shows how to walk the scsi_sglist(). It was mentioned the "sg" (SCSI Generic, a pass-thru) driver can build a very large sg_list - bigger than the midlayer can handle. Boaz Harrosh pointed out the sg_list from the BIO layer can be inserted or extended also.

4:20 - 5:00 (first half of this slot)
[IO] iSCSI Transport Class Simplification
Mike Christie
http://iou.parisc-linux.org/lsf2008/IO-iSCSI_transport_class-Mike_Christie.pdf
http://iou.parisc-linux.org/lsf2008/IO-iSCSI_transport_class-Mike_Christie.odp

Main thrust here is that common libs are needed to share common objects between transport classes. In particular, he called out the issues that the lsscsi maintainer has faced across different kernel versions where /sys has evolved.
James Bottomley conceded there were "issues with original implementation". Mike also mentioned problems with parsing /sys under iSCSI devices. The goal is to provide a common starting point for user space visible names.

Mike proposed a scsi_transport_template that contained new "scsi_port" and "scsi_i_t_nexus" data structures. iSCSI also needs an abstraction between SCSI ports - an I_T_nexus. Other users of I_T_nexus were also discussed. James Bottomley pointed out libsas already has an I_T_nexus abstraction. Libsas provides a "host/port/phy/rphy/target/lun" hierarchy for /sys. However the exported paths need to be more flexible.

Mike floated the idea of a new library to encapsulate the SCSI naming conventions so tools like lsscsi wouldn't have to struggle.

4:20 - 5:00 (second half of this slot)
Nicholas Bellinger
http://iou.parisc-linux.org/lsf2008/IO-iSCSI_and_Target_Mode-Nicholas_Bellinger.pdf

The focal point for iSCSI development is Linux-iSCSI.org. iSCSI exposed issues with error recovery. The slideset neatly summarizes most of the points Nicholas wanted to make.

"Advanced Features" discussed started with Multiple Connections per Session (MC/S) - stated to be faster than ethernet bonding using Gigabit ethernet. iSER (RFC-5045 and related RFC-5040/-5044) standards will help iSCSI to support 10GigE link speeds and direct data placement (e.g. iSCSI over Infiniband). ISNS (RFC-4171) would rework fabric "discovery" and is extensible for other storage fabrics.

Project status of the various pieces related to iSCSI is described starting on slide 8 (of 19): SCSI Target (current design), SCST (older Target design), LIO-SE (Linux iSCSI.Org Storage Engine) and LIO-Target, IET.

Slide 14 showed the relationships between "front ends" (e.g. iSER/iWARP), the common "storage engine" (SE), and the data structures used in the kernel to track storage (e.g. struct page and struct scsi_device).
Slide 15 made some good arguments for sharing SCSI CDB (command block) emulation code in the kernel. My impression was the arguments are good - just the implementation wasn't acceptable (yet).

Slide 16 (of 19) started the contentious discussion over user vs kernel space implementations. Relevant quote from linus:
    The only split that has worked pretty well is "connection initiation/setup in user space, actual data transfers in kernel space".

The lively but inconclusive debate left me thinking most of the code will be forced to live in user space until evidence is presented otherwise. iSCSI, FC, and SAS would be better in kernel because concurrency control fundamentally resides in the kernel. And LIO-SE assumes most drivers belong and are implemented in kernel space because transport APIs force "middle code" into the kernel. KVM perf suffers because of movement between virtual kernels (for IO IIRC).

5:00 - 5:40
[IO] Request Based Multipathing
Kiyoshi Ueda
Jun'ichi Nomura
NEC
http://iou.parisc-linux.org/lsf2008/IO-Request_DM-NEC_corp.pdf

Key Point: the proposed Multi-path support belongs below the IO scheduler and this seems to be the favored design. Problems are expected with request completion and cleaning up the block layer. block_end_request() in 2.6.25 (needed?)

An RFC for the request stacking framework was posted to the linux-scsi and linux-ide mailing lists. See the last slide (37) for URLs to the postings.

The big advantage of Request-based DM multipath is that BIOs are already merged and the multipath driver can do load balancing since it knows exactly how many IOs are going to each available path.

Three issues raised (see Slide #7 of 37):
    Issue 1: How to avoid deadlock during completion?
    Issue 2: How to keep requests in mergeable state?
    Issue 3: How to hook completion for stacking driver?

#1 __blk_end_request() will deadlock because the queue lock is held through the completion process. jejb suggested moving completions to a tasklet (Soft IRQ) since SCSI at one point had the same issue.
Also had discussion about migrating drivers to use blk_end_request() instead of __blk_end_request().

#2 merging IO: "Busy" stack drivers won't know when the lower driver is busy and once a request is removed from the scheduler queue, it's no longer mergeable. Slides 14-21 have a very good graphic representation of the problem. jejb suggested a "prep" and "unprep" function to indicate when requests are mergeable or not. "prepared" means it's unmergeable. "unprep" would push it back to a mergeable state. Someone volunteered that Jens Axboe had objections but Jens had left the room a bit earlier.

One basic difference between "Bio" (existing code) and the proposed Request DM was pointed out: device locking (queue lock) will be required for both submission and completion of Request DM handler IOs and is not required by BIO.

#3 completion hook: Slide 24
Problem: req->end_io() is called too late and is called with a queue lock held.

Solutions were offered/discussed in the remaining slides (29-36):

#1a Only allow use of non-locking drivers - ie drivers that do not lock in the completion path:
  o all SCSI drivers, cciss, i2o already meet this criterion.
  o Block Layer is using locking completion.
  o DASD driver change needed.
  o Discussion about how much work it was to convert other drivers

#1b No submission during completion.
  o pass submission to a workqueue - sounds like a workaround

mkp: why use Q-lock (besides for releasing a request)? Suggested to separate the free and completion functions. Then use a workqueue to harvest "free" requests.
jejb: other purposes as well but agreed they could be separated. "SCSI already does #1a". :)

#2 Busy_state - not a queue condition exclusively. Other resources may result in a busy response. e.g. mapping resource or other "prep" work might fail.

#3 Stacking Hook:
  A) add "end_io" was the RFC already posted. Don't use
  B) move end_io calling place
  C) Use end_io as is.

For (A), one can't use end_io for a stacking driver. end_io is called to destroy the request.
The call to end_io has to be delayed until no one needs to reference the request. Discussion around how to make req->blk_end_io() work:
    - allow it to point at md_end_io(), end_io(), or free().

(B) had one easy question from Boaz: how many callers of end_io()? A: two - sr and sd. Very few end_io functions. One just has to make sure it's called exactly once.

(C) wasn't considered a solution. Partial completion couldn't be supported by stacking drivers.

TUESDAY February 26, 2008
--------------------------

9:00 - 9:40 FS and Volume Managers
Dave Chinner
SGI
http://iou.parisc-linux.org/lsf2008/fs_and_volmgrs-Dave_Chinner.odp

Dave covered several major areas:
    o proposal he called "BIO hints" (Val Henson called them "BIO commands")
    o DM (Device Mapper) Multi-path
    o chunk sizes
    o IO Barriers

BIO hints is an attempt to let the FS give the low level block layer "hints" about how the storage is being used. The definition of "hint" was something that the storage device _could_ implement but was not required to implement for correct operation. Suggestions he offered were "release" (indicate space is no longer in use), COW (snapshot, preallocation of ranges), and Don't COW (range of blocks we never snapshot, e.g. journal). mkfs could provide the "space is free" hints. Good for RAID devices, transparent security (zero released data blocks), and SSDs, which could put unused blocks into their garbage collection. Some file system aware storage devices already implement some of this (e.g. FAT file system on USB sticks).

DM Multi-path has a basic trust issue. Most folks don't trust it because it hasn't had the investment necessary to make it trustworthy. Chicken and egg problem. Ric Wheeler stated EMC does certify DM configs. Other complaints were poor perf, can't partition properly, mgt tools have a lousy user interface, and not supporting existing devices. (This is just an oral list - not verified.)

Power of 2 chunk sizes don't work with HW RAID which use 4+1 or 8+1 disks.
Barriers today are only for cache flushing - both to force data to media and to enforce ordering of requests. Ordered commands (which SCSI offers at a perf cost) might be helpful. But however painful flush barriers are, disks with WCE (Write Cache Enable) set still perform better even with the flush barriers. Later discussion suggested acquire/release style barriers that enforce ordering of read/write requests might often be sufficient. jejb: suggest "commit on transaction"

9:50 - 10:20 [IO] OSD-based pNFS
Benny Halevy
Boaz Harrosh
Panasas
http://iou.parisc-linux.org/lsf2008/IO-pNFS_obj-Benny_Halevy.pdf

Benny first described the role of the "layout driver" for OSD-based pNFS. "Layouts" are a catalog of "devices", describing the byte range and attributes of that "device". The main advantage of the layout driver is that one can dynamically determine the object storage policy. One suggestion was to store "small" files on RAID1 and "large" files on RAID5. Striping across "devices" is also possible. By caching the "layouts" (object storage server descriptions), one can defer cataloging all the OSD servers at boot time and implement "on-demand" access to those servers.

Current "device" implementations include iSCSI, iSER, and FC. "SCSI over USB" and FCoE are also possible. Functional testing has been done and performance was described as "can saturate a gige link". Future work will include "OSD 2.0" protocol development and it's already clear there will be changes to the "OSD" protocol.

Requirements for the linux kernel to support OSD pNFS were discussed. Bidirectional SCSI CDB support is in 2.6.25-rcX kernels. No objection to patches for Variable Length CDBs - might go into 2.6.26. Recent patches to implement "Long Sense Buffers" were rejected and a better implementation is required.

The discussion ended on DM (device mapper) and ULD (Upper Level Driver; e.g. sd, tape, cd/dvd). DM contains the desired striping functionality but it also takes "ownership" of the device.
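For readers unfamiliar with striping, the core of it is a mapping from a file byte offset to a device and an offset on that device. The sketch below assumes plain round-robin (RAID0-style) striping and adds the "small on RAID1, large on RAID5" policy suggestion from the talk as a one-line rule. The function names, the 64 KiB stripe unit and the size threshold are hypothetical values chosen for illustration, not anything taken from the pNFS layout driver.

```python
def stripe_map(offset, stripe_unit, num_devices):
    """Map a file byte offset to (device index, byte offset on that device).

    Assumes round-robin striping: stripe unit 0 goes to device 0,
    unit 1 to device 1, and so on, wrapping around.
    """
    stripe_no, within = divmod(offset, stripe_unit)
    device = stripe_no % num_devices
    dev_offset = (stripe_no // num_devices) * stripe_unit + within
    return device, dev_offset


def choose_raid(file_size, small_limit=64 * 1024):
    """Hypothetical per-file policy: mirror small files, use parity for large."""
    return "RAID1" if file_size <= small_limit else "RAID5"


unit = 64 * 1024
for off in (0, unit, 4 * unit + 10):
    print(off, "->", stripe_map(off, unit, num_devices=4))
print(choose_raid(4096), choose_raid(10 * 1024 * 1024))
```

Because the mapping is a pure function of the layout parameters, a client holding a cached layout can compute which server to contact without asking the metadata server for every IO.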
Distributed error handling is not possible unless the DM would pass errors back up to higher layers. Each ULD is expected to register an "OSD type". But the real question is whether we want to represent objects as block devices (segue to the next talk) and how to represent those in some name space.

10:20 - 10:40 [IO] Block-based pNFS
Andy Adamson (U of Michigan)
Jason Glasgow (EMC)
http://iou.parisc-linux.org/lsf2008/IO-pnfs_block-William_Andy_Adamson.pdf

Afterwards, pNFS was summarized to me as "clustered FS folks are trying to pull coherency into NFS". The underlying issue is that every clustered file system (e.g. Lustre) requires coherency of meta data across the nodes of the cluster. NFS historically has bottlenecked on the NFS server since it was the only entity managing the meta data coherency.

The first part of this talk explained the "Volume Topologies" and how pNFS block devices are identified ("fsid"). Each fsid can represent "arbitrarily complex" volume topologies which, under DM, get "flattened" to a set of DM targets. But they didn't want to lose access to the hierarchy of the underlying storage paths in order to do failover.

The proposal for "Failover to NFS" survived Benny's explanation of how a dirty page would be written out via the block path and, if that failed, then via the NFS (file system) code path. The main steps for the first path would be "write", "commit", "layout commit" and for the failover path "write to MDS" and then "commit". This provoked sharp criticism from Christoph Hellwig (and others):
    This is stupid - adds complexity without significant benefit. Client has two paths that are completely different and the corner cases will kill us.

The complexity he referred to was the "unwinding" of work after starting an IO request down the block IO code path and then restarting the IO request down a completely different code path (file system now). A lively debate ensued around changes needed to the Block IO and VFS layers.
Christoph was not the only person to object and this idea right now looks like a non-starter.

The remaining "other issues" (slide title) became singular and covered block size: 4k is working but not interoperable with other implementations.

11:00 - 11:40 FS and Storage Layer Scalability Problems
Dave Chinner
SGI
http://iou.parisc-linux.org/lsf2008/linux_storage_scalability-Dave_Chinner.pdf

Dave offered "random thoughts" on 3-5 year challenges. The first comment was that "Direct IO" is a solved problem and that we are "only working on micro-optimizations".

He resurrected and somewhat summarized previous discussion on exposing the geometry and status of devices. He wanted to see "independent failure domains" being made known to the FS and device mapper so those could automate recovery. Load feedback could be used to avoid "hot spots" on media/IO paths. And similar to "failure domains", dynamic "online growing" could make use of "Loss redundancy" metrics to automate redistribution of data to match application/user intent.

Buffered IO writeback (discussing "pdflush") raised another batch of issues. It's very inefficient within a file system because the mix of meta-data and data in the IO stream causes both syncing and ordering problems. pdflush is also not NUMA aware and should use CPUsets (not Containers) to become NUMA aware. James Bottomley noted the IO completion is on the wrong node as well - where the IRQ is handled. And lastly, different FSs will use more/less CPU, and functionality like checksumming data and an aging FS might saturate a single CPU. He gave an example where the raw HW can do 8 GB/s but he was only seeing 1.5 GB/s throughput with the CPU 90% utilized. And like the FS, knowing the "preferred alignment" of the underlying storage would improve performance dramatically in those cases too. pdflush was summarized as "a bunch of heuristics" and no one objected to adding more.
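The 8 GB/s vs 1.5 GB/s example is easier to appreciate as arithmetic. The sketch below is my own back-of-the-envelope calculation, not something presented in the talk; it assumes the CPU cost per byte stays constant and that the work could be spread perfectly across CPUs, both of which are optimistic. cpus_needed() is a hypothetical helper name.

```python
def cpus_needed(target_gbps, observed_gbps, cpu_utilization):
    """CPUs needed to reach target throughput at the observed cost per GB.

    Assumes CPU cost scales linearly with throughput (an optimistic
    simplification; real scaling is usually worse).
    """
    cpu_seconds_per_gb = cpu_utilization / observed_gbps
    return target_gbps * cpu_seconds_per_gb


# Figures from the example above: 1.5 GB/s observed with one CPU 90% busy,
# raw hardware capable of 8 GB/s.
print(round(cpus_needed(8.0, 1.5, 0.90), 1))   # prints 4.8
```

In other words, a single writeback thread cannot keep such hardware busy even in the best case, which is why making pdflush NUMA aware and spreading the work matters.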
Dave also revisited the topic of Error Handling with the assertion that given enough disks, errors are COMMON. He endorsed the use of the existing error injection tools, especially the scsi_debug driver.

His last "rant" was on the "IOPS challenge" (IO Per Second) that SSDs presented. He questioned whether linux drivers and HBAs are ready for 50K IOPS from a single "spindle". Raw IOPS is limited by poor HBA design with excessive per-transaction CPU overhead. HBA designers need to look at NICs. Using MSI-X direct interrupts intelligently would help a lot but both SW and HW design need to evolve.

I'd like to point folks to "mmio_test" - see gnumonks.org - so they can measure this for themselves. Disclaimer: I'm one of the contributors to mmio_test (Robert Olsson, Andi Kleen and Harald Welte are the others). Any MMIO (Memory Mapped IO) reads in the driver "performance path" will prohibit the level of performance Dave Chinner described. And most SATA/SAS drivers have some MMIO reads. I estimate most "commercial" HBA designs are ~3 years behind NICs - e.g. RNIC (RDMA NIC). Not surprising given the cost models. There are some exceptions now (e.g. Marvell's 8-port 664x SAS/SATA chip) and certainly others I'm not aware of yet.

Joern Engel added that tasklets were added about 2 years ago which now do the equivalent of NAPI ("New API" - for NIC drivers). NAPI was added about 5-6 years ago to prevent incoming NIC traffic from live-locking a system. All the CPU cycles could be consumed exclusively handling interrupts. This interrupt mitigation worked pretty well even if the HW didn't support interrupt coalescing.

11:40 - 12:20 [IO] T10 DIF
Martin Petersen
Oracle
http://iou.parisc-linux.org/lsf2008/IO-data_integrity-Martin_Petersen.pdf

Martin pointed out the FAST2008 paper on "nearline vs SAS". His first point was that data can get corrupted in nearly every stage between host memory and the final storage media.
The typical "data at rest" corruption (aka "grown media defects") is just one form of corruption. The remaining forms of data corruption are grouped as "while data is in flight" and applications need to implement the first level of protection here. He also characterized Oracle's "HARD" as the most extreme implementation and compared others to "bong hits from outerspace". There was agreement that, given the volume of data being generated, trivial CRCs would not be sufficient.

While some vendors are pushing file systems with "logical block cryptographically strong checksumming" and similar techniques as bullet proof, they only detect the problems at read time. This could be months later when the original data is long gone. The goal of the TDIF (T10 Data Integrity Feature) standard was to prevent bad data from being written to disk in the first place.

HW RAID controllers routinely reformat FC and SCSI drives to use 520 byte sectors to store additional data integrity/recovery bits on the drive. The goal of TDIF was to have "end to end" data integrity checks by standardizing and transmitting those extra 8 bytes from the application all the way down to the media. The data could then be validated at every "stop" on its way to media, providing "end to end" integrity checking.

Slide 3 described what TDIF implements in the additional 8 bytes it adds to the normal 512 byte sector. Slide 4 explains which parts of the data path each of the competing standards "covers" and successive slides summarize more of the respective standards. Slide 16 (of 17) neatly summarized the "Application/OS Challenges" of how a robust "end to end" protection could be achieved.

He pointed out which changes are needed in SCSI, and one of those (variable length CDBs) is already in the kernel. James Bottomley observed he could no longer get SCSI specs to implement new features like this one due to recent changes in distribution.
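For readers without the slides at hand, here is a sketch of what the extra 8 bytes per 512 byte sector look like. The field layout (a 2-byte guard tag holding a CRC16 of the sector data, a 2-byte application tag, and a 4-byte reference tag) and the CRC polynomial 0x8BB7 come from my recollection of the T10 protection information format, not from this summary, and the reference tag is assumed to be the low 32 bits of the LBA (the "Type 1" convention). make_dif_tuple() and verify_sector() are hypothetical helper names.

```python
import struct


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC16 with polynomial 0x8BB7, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def make_dif_tuple(sector: bytes, lba: int, app_tag: int = 0) -> bytes:
    """Build the 8-byte tuple: guard (CRC), application tag, reference tag."""
    assert len(sector) == 512
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, lba & 0xFFFFFFFF)


def verify_sector(sector: bytes, dif: bytes, expected_lba: int) -> bool:
    """Check that any 'stop' along the data path could perform."""
    guard, _app_tag, ref_tag = struct.unpack(">HHI", dif)
    return (guard == crc16_t10dif(sector)
            and ref_tag == (expected_lba & 0xFFFFFFFF))


sector = bytes(range(256)) * 2
dif = make_dif_tuple(sector, lba=1234)
print(len(sector) + len(dif))                          # 520 byte sector
print(verify_sector(sector, dif, expected_lba=1234))   # intact data
corrupted = b"\xff" + sector[1:]
print(verify_sector(corrupted, dif, expected_lba=1234))
print(verify_sector(sector, dif, expected_lba=1235))   # misdirected write
```

The guard tag catches data that was corrupted in flight, while the reference tag catches data that is intact but was written to (or read from) the wrong block.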
He also pointed out the FS developers could use some of the "tag CRC" bits to implement a "reverse lookup" function they were interested in. The best comment, which closed the discussion, came from Boaz Harrosh:
    Integrity checks are great! They catch bugs during development!

1:30 - 2:20 [IO] FCoE
Robert Love
Christopher Leech
http://iou.parisc-linux.org/lsf2008/IO-FCoE-Chris_Leech.pdf

Robert and Christopher took turns giving a description of the project, an update on current project status, and discussion of issues they needed help with. FCoE is succinctly described as "an encapsulation protocol to carry Fibre Channel frames over Ethernet" and is standardized in T11. The main goal of this is to integrate an existing FC SAN into a 10GigE network and continue to use the existing FC SAN management tools. The discovery protocol is still under development. James Bottomley observed VLAN would allow the FC protocol to pretend there is no other traffic on the "ethernet network" since the on-wire protocol supports 802.1Q tags.

Slide 4 (of 20) nicely maps the FC-0 and FC-1 layers to the IEEE 802.3 PHY and MAC layers respectively. Key here is that the FC standard was originally intended to support both TCP/IP and FC protocol traffic over FC transport, and the layering already existed in the FC standard to enable a simple "addressing" scheme ("Fabric Id" maps to "MAC address"). And following Martin Petersen's talk on TDIF, the presenters pointed out that the FC protocol implements its own "end to end" CRC.

Open-FCoE.org seems to be making good progress in several areas but it's not ready for "production use" yet. The current code has a functional initiator and SW gateway. Tools for wireshark to decode the FCoE protocol are also upstream and working. But a re-architecture is underway and new functional development on the current tree has temporarily been suspended. They are considering "library-izing" the FCP support (e.g. libfc) instead of putting everything into scsi_transport_fc.
Current problems discussed included the complexity of the code, frustration with the (excessive) number of abstractions, and wanting to take advantage of current NIC offload capabilities. The current rework is taking direction from James Smart, making better use of existing linux SCSI/FC code and then determining how much code could be shared with existing FC HBA drivers.

Discussion covered making use of a proposed "I_T_Nexus" support. Robert and Christopher agreed "I_T_Nexus" would be useful for FCoE as well since they had the same issues as others managing the connection state. James Bottomley also pointed out their current implementation didn't properly handle error states and got a commitment back that Robert would revisit that code.

Use of sysfs vs ioctl to "send command" came around to "why not use SG Passthru?" Also, SCST was recommended as the "target mode" but that was debated since Target can be out of tree until it's stable. Current "target" support is STGT and implemented in user space. This resurrected an earlier discussion where iSCSI was told to "put everything in userspace".

2:20 - 3:00 [IO] SATA (was retitled "Linux Storage Stack performance")
Kristen Carlson Accardi
Intel
http://iou.parisc-linux.org/lsf2008/IO-latency-Kristen-Carlson-Accardi.pdf

Kristen and Matthew "willy" Wilcox provided "forward looking" performance tools to address expected performance issues with the linux storage stack when used with SSDs. This follows the "provide data and the problem will get solved" philosophy. Storage stacks are tuned for "seek avoidance" (a waste of time for SSDs) and SSDs are still fairly expensive and uncommon. The underlying assumption is that the lack of SSDs in the hands of developers means the data won't get generated and no one will accept optimizations that help SSDs.

Her first slides provided a summary of SSD cost/performance numbers (e.g. Zeus and Mtron) which stated a single device is now capable of 50,000+ IOPS (IO Per Second).
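To see why an IOPS figure like this stresses the software stack, it helps to turn it into a time budget per IO. The following is my own arithmetic sketch, not material from the slides; both helper names are hypothetical, and the 10 microsecond CPU cost in the example is an assumed figure for illustration, not a measurement.

```python
def per_io_budget_us(iops):
    """Average time available per IO, in microseconds, at a given IOPS rate."""
    return 1_000_000 / iops


def max_iops_per_cpu(cpu_us_per_io):
    """Upper bound on IOPS one fully busy CPU can drive at a given cost per IO."""
    return 1_000_000 / cpu_us_per_io


print(per_io_budget_us(50_000))   # 20.0 microseconds per IO
print(max_iops_per_cpu(10))       # assumed 10 us of CPU per IO
```

At 50,000 IOPS the whole stack (file system, block layer, protocol layer, driver, interrupt handling) has on average 20 microseconds per request, so per-request overhead that was invisible behind a disk seek becomes the dominant cost.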
Current rotational media can "only" do 150-250 IOPS per device on a random read workload (3000-4000 if it's only talking to the disk cache) and largely depends on IO request merging (larger IO sizes) to get better throughput. Ric Wheeler pointed out EMC's disk array can actually do much more but it requires racks of disks. Her point was this level of performance will be in many laptops "soon" and it would be great if linux could support that level of performance.

scsi_ram and ata_ram drivers are the new kernel components that would allow developers to emulate SSD performance and focus on measuring performance of the protocol layers. Willy recently submitted those drivers to the respective mailing lists:
http://marc.info/?l=linux-scsi&m=120331663227540&w=2
http://www.ussg.iu.edu/hypermail/linux/kernel/0802.2/3695.html

Concurrently, he developed "iolat" that would provide a "reasonable" workload _and_ performance measurements (CPU utilization) conveniently rolled into one tool. Personally, I'm not happy about "yet another benchmark" and would rather just see fio (or some other simple benchmark) and a performance monitoring tool rolled into one script.

Criticisms of the "iolat" benchmark results were pointed out. The first was that the "readprofile()" data presented had the wrong symbols in its lookups. The second was that readprofile uses timer ticks to sample and thus will never measure code called in an interrupt handler or held under a lock acquired with spin_lock_irqsave(). In this case, only the latter issue is a problem since the RAM driver doesn't generate any interrupts like a normal IO device would. I suggested using the IOAT (DMA engine for offloading user/kernel space copies) to emulate a real IO device - it also churns the CPU cache and consumes memory bandwidth without burning CPU cycles.

Despite criticisms of iolat, the results (slide 12 of 22) comparing ata_ram, scsi_ram, rd (ram disk), and normal disk are enlightening. "small direct random read" performance of scsi_ram and ata_ram is approximately 1/4th of the "rd" driver. This is all due to latency in the protocol layer code path.

The outcome of this is that Willy started rewriting iolat to use oprofile instead. We need accurate profile data to determine where the CPU is spending time. Kristen also suggested libata stop using the SCSI midlayer and thus avoid the SCSI-to-ATA translation layer (which is costing another 10% in performance), but one needs to be careful with HBA drivers that support both SAS and SATA.

3:30 - 4:10 [IO] Sysfs Representations
Hannes Reinecke and Kay Sievers
SuSE
http://iou.parisc-linux.org/lsf2008/IO-SCSI_sysfs_representation-Hannes_Reinecke.pdf
http://iou.parisc-linux.org/lsf2008/IO-SCSI_sysfs_representation-Hannes_Reinecke.odp

Summary by James Bottomley - thanks!

Hannes Reinecke and Kay Sievers led a discussion on sysfs in SCSI. They first observed that SCSI represents pure SCSI objects as devices with upper layer drivers (except sg - SCSI Generic) being scsi bus drivers. However, everything else, driven by the transport classes, is stored as class devices. Kay and Greg want to eliminate class devices from the tree and the SCSI transport classes are the biggest obstacle to this.

The next topic was Object Lifetime. Hannes pointed to the nasty race SCSI has so far been unable to solve where a nearly dead device gets re-added to the system and currently cannot be activated (because a dying device is in the way). Hannes proposed the "resurrection" patch set (bringing dead devices back to life). James Bottomley declared that he didn't like this and a heated discussion ensued, during which it was agreed that perhaps simply removing the dying device from visibility and allowing multiple devices representing the same SCSI target into the device list (but only allowing one to be visible) might be the best way to manage this situation and the references that depend on the dying device.
Non-controversial topics were reordering target creation at scan time to try to stem the tide of spurious events they generate, and moving SCSI attributes to default attributes so they would all get created at the correct time. The latter solves a race today where the upward propagation of the device creation uevent races with the attribute creation and may result in the root device not being found if udev wins the race.

The session wound up with James demanding that Greg and Kay show exactly what the sysfs people have in store for SCSI, with the topics of multiple binding (necessary to allow sg to bind to an already bound driver and also to allow the transport classes to attach through the driver infrastructure) and elimination of the scsi_device class in favour of the same information provided by /sys/bus/scsi. James was OK with this in principle but pointed out that we have a lot of tools in existence today which depend on things like /sys/class/scsi_device, so compatibility links would have to be provided.

Summary, Lightning Talks, and Wrap-up

Summary by James Bottomley - thanks!

James Bottomley opened the Lightning talks with the presentation of some slides from Doug Gilbert about standards changes. The main observation was the new UAS (USB Attached SCSI), which James worried would end up being the compliance level of current USB with all the added complexity of SCSI. The INCITS decision to try to close off access to all SCSI standards was mentioned and discussed, with the general comment being that this was being done against the wishes of at least the T10 and T13 members. Ted Ts'o observed that INCITS is trying to make money selling standards documents and perhaps it was time for a group like the Free Standards Group to offer the SCSI Technical Committees a home.

Val Henson told everyone how she'd spent her summer vacation trying to speed up fsck by parallelising the I/O; an approach which, unfortunately, didn't work.
The main problems being that threaded async isn't better and the amount of read ahead is small. A question from the floor asked if what we really want is to saturate the system. Val answered maybe only about a quarter of the buffer cache, but that we'd like to hint to the Operating System about our usage patterns to avoid unnecessary I/Os. Chris Mason commented that both BTRFS and EXT3 FS could use this. Nick Bellinger asked about the trade off between putting things in the kernel and putting them in user space. His principal concern was the current target driver implementation in user space, which might lower the IOPS (IO Per Second) in iSCSI. Martin Petersen asserted we didn't really have enough data about this yet. There followed a discussion where the KVM virtual I/O drivers were given as a counter example (the initial storage driver was a user space IDE emulation) and wound up concluding that KVM wasn't a good example. The final discussion was around NFS with the majority of people suggesting that it went in-kernel not for performance reasons but for data access and concurrency reasons. The question of where to draw the line between kernel and user space in any implementation is clearly still a hot topic.