bcache: make bch_sectors_dirty_init() multithreaded

When attaching a cached device (a.k.a. backing device) to a cache
device, bch_sectors_dirty_init() is called to count dirty sectors and
stripes (see what bcache_dev_sectors_dirty_add() does) on the cache
device.

The counting is done by a single thread, which calls the recursive
function bch_btree_map_keys() to iterate over all the bcache btree
nodes. If the btree has a huge number of nodes,
bch_sectors_dirty_init() takes quite a long time. In my testing, if the
cache set being registered has an existing UUID which matches an
already registered cached device, the automatic attachment during
registration may take more than 55 minutes. This is too long to wait
for bcache to become usable in a real deployment.

Fortunately, when bch_sectors_dirty_init() is called, no other thread
accesses the btree yet, so it is safe to do a read-only, parallelized
dirty sector count with multiple threads.

This patch creates multiple threads. Each thread fetches a key from the
btree root node and counts the dirty sectors in the sub-tree indexed by
that key. After the sub-tree is counted, the thread fetches another
root node key, and repeats until the fetched key is NULL.

The number of parallel threads depends on the number of keys in the
btree root node and the number of online CPU cores. The thread count is
the smaller of the two, and never more than BCH_DIRTY_INIT_THRD_MAX. If
there are only 2 keys in the root node, this patch can make the
counting at most 2x faster. But if there are 10 keys in the root node,
the counting can be up to 10x faster.

Signed-off-by: Coly Li <colyli@suse.de>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
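
The fetch-and-count scheme described above can be illustrated with a small user-space sketch. This is not the kernel code from the patch: it uses POSIX threads instead of kernel threads, and `struct subtree`, `struct dirty_init_state`, `fetch_next_key()`, `dirty_init_thread_count()`, `count_dirty_parallel()` and the cap value `DIRTY_INIT_THRD_MAX = 64` are all hypothetical names and values chosen for illustration. Each "sub-tree" is reduced to a precomputed dirty sector count so that only the work distribution is shown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical cap, standing in for BCH_DIRTY_INIT_THRD_MAX (value assumed). */
#define DIRTY_INIT_THRD_MAX 64

/* Hypothetical stand-in for the sub-tree indexed by one root node key. */
struct subtree {
	uint64_t dirty_sectors;
};

struct dirty_init_state {
	const struct subtree *subtrees;	/* one entry per root node key */
	int nr_keys;
	int next_key;			/* index of the next unclaimed root key */
	uint64_t total;			/* dirty sectors counted so far */
	pthread_mutex_t lock;		/* protects next_key and total */
};

/* Claim the next root key; -1 plays the role of the NULL key. */
static int fetch_next_key(struct dirty_init_state *st)
{
	int idx = -1;

	pthread_mutex_lock(&st->lock);
	if (st->next_key < st->nr_keys)
		idx = st->next_key++;
	pthread_mutex_unlock(&st->lock);
	return idx;
}

/* Worker: count one sub-tree at a time until no root keys are left. */
static void *dirty_init_thread(void *arg)
{
	struct dirty_init_state *st = arg;
	uint64_t local = 0;
	int idx;

	while ((idx = fetch_next_key(st)) >= 0)
		local += st->subtrees[idx].dirty_sectors; /* stands in for a sub-tree walk */

	pthread_mutex_lock(&st->lock);
	st->total += local;
	pthread_mutex_unlock(&st->lock);
	return NULL;
}

/* Smaller of root key count and CPU count, capped at the maximum. */
int dirty_init_thread_count(int nr_keys, int nr_cpus)
{
	int n = nr_keys < nr_cpus ? nr_keys : nr_cpus;

	return n > DIRTY_INIT_THRD_MAX ? DIRTY_INIT_THRD_MAX : n;
}

uint64_t count_dirty_parallel(const struct subtree *subtrees,
			      int nr_keys, int nr_cpus)
{
	struct dirty_init_state st = {
		.subtrees = subtrees, .nr_keys = nr_keys,
	};
	pthread_t threads[DIRTY_INIT_THRD_MAX];
	int nr_threads, started = 0;

	assert(nr_keys >= 0 && nr_cpus >= 1);
	nr_threads = dirty_init_thread_count(nr_keys, nr_cpus);
	pthread_mutex_init(&st.lock, NULL);

	for (int i = 0; i < nr_threads; i++) {
		if (pthread_create(&threads[started], NULL,
				   dirty_init_thread, &st) != 0)
			break;	/* continue with the threads that did start */
		started++;
	}
	if (started == 0)
		dirty_init_thread(&st);	/* fall back to counting in the caller */

	for (int i = 0; i < started; i++)
		pthread_join(threads[i], NULL);
	pthread_mutex_destroy(&st.lock);
	return st.total;
}
```

In this sketch a caller would pass one `struct subtree` per root node key, for example `count_dirty_parallel(subtrees, 10, 4)` for 10 root keys on 4 CPUs, which starts 4 workers that share the 10 sub-trees. Because workers claim keys one at a time rather than being assigned fixed ranges, a thread that finishes a small sub-tree early simply picks up the next unclaimed key.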