| Description | In the Linux kernel, the following vulnerability has been resolved:  btrfs: fix space cache corruption and potential double allocations  When testing space_cache v2 on a large set of machines, we encountered a few symptoms:  1. "unable to add free space :-17" (EEXIST) errors. 2. Missing free space info items, sometimes caught with a "missing free    space info for X" error. 3. Double-accounted space: ranges that were allocated in the extent tree    and also marked as free in the free space tree, ranges that were    marked as allocated twice in the extent tree, or ranges that were    marked as free twice in the free space tree. If the latter made it    onto disk, the next reboot would hit the BUG_ON() in    add_new_free_space(). 4. On some hosts with no on-disk corruption or error messages, the    in-memory space cache (dumped with drgn) disagreed with the free    space tree.  All of these symptoms have the same underlying cause: a race between caching the free space for a block group and returning free space to the in-memory space cache for pinned extents causes us to double-add a free range to the space cache. This race exists when free space is cached from the free space tree (space_cache=v2) or the extent tree (nospace_cache, or space_cache=v1 if the cache needs to be regenerated). struct btrfs_block_group::last_byte_to_unpin and struct btrfs_block_group::progress are supposed to protect against this race, but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit") subtly broke this by allowing multiple transactions to be unpinning extents at the same time.  Specifically, the race is as follows:  1. An extent is deleted from an uncached block group in transaction A. 2. btrfs_commit_transaction() is called for transaction A. 3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed    ref for the deleted extent. 4. __btrfs_free_extent() -> do_free_extent_accounting() ->    add_to_free_space_tree() adds the deleted extent back to the free    space tree. 5. do_free_extent_accounting() -> btrfs_update_block_group() ->    btrfs_cache_block_group() queues up the block group to get cached.    block_group->progress is set to block_group->start. 6. btrfs_commit_transaction() for transaction A calls    switch_commit_roots(). It sets block_group->last_byte_to_unpin to    block_group->progress, which is block_group->start because the block    group hasn't been cached yet. 7. The caching thread gets to our block group. Since the commit roots    were already switched, load_free_space_tree() sees the deleted extent    as free and adds it to the space cache. It finishes caching and sets    block_group->progress to U64_MAX. 8. btrfs_commit_transaction() advances transaction A to    TRANS_STATE_SUPER_COMMITTED. 9. fsync calls btrfs_commit_transaction() for transaction B. Since    transaction A is already in TRANS_STATE_SUPER_COMMITTED and the    commit is for fsync, it advances. 10. btrfs_commit_transaction() for transaction B calls     switch_commit_roots(). This time, the block group has already been     cached, so it sets block_group->last_byte_to_unpin to U64_MAX. 11. btrfs_commit_transaction() for transaction A calls     btrfs_finish_extent_commit(), which calls unpin_extent_range() for     the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by     transaction B!), so it adds the deleted extent to the space cache     again!  This explains all of our symptoms above:  * If the sequence of events is exactly as described above, when the free   space is re-added in step 11, it will fail with EEXIST. * If another thread reallocates the deleted extent in between steps 7   and 11, then step 11 will silently re-add that space to the space   cache as free even though it is actually allocated. Then, if that   space is allocated *again*, the free space tree will be corrupted   (namely, the wrong item will be deleted). * If we don't catch this free space tree corr ---truncated--- |