summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* parisc: fix kernel BUG at arch/parisc/include/asm/mmzone.h:50Helge Deller2013-06-011-4/+1
| | | | | | | | | | | | | | | | | | | | | | | With CONFIG_DISCONTIGMEM=y and multiple physical memory areas, cat /proc/kpageflags triggers this kernel bug: kernel BUG at arch/parisc/include/asm/mmzone.h:50! CPU: 2 PID: 7848 Comm: cat Tainted: G D W 3.10.0-rc3-64bit #44 IAOQ[0]: kpageflags_read0x128/0x238 IAOQ[1]: kpageflags_read0x12c/0x238 RP(r2): proc_reg_read0xbc/0x130 Backtrace: [<00000000402ca2d4>] proc_reg_read0xbc/0x130 [<0000000040235bcc>] vfs_read0xc4/0x1d0 [<0000000040235f0c>] SyS_read0x94/0xf0 [<0000000040105fc0>] syscall_exit0x0/0x14 kpageflags_read() walks through the whole memory, even if some memory areas are physically not available. So, we should better not BUG on an unavailable pfn in pfn_to_nid() but just return the expected value -1 or 0. Signed-off-by: Helge Deller <deller@gmx.de>
* parisc: memory overflow, 'name' length is too short for usingChen Gang2013-06-011-1/+1
| | | | | | | | | | | | | | | | 'path.bc[i]' can be asigned by PCI_SLOT() which can '> 10', so sizeof(6 * "%u:" + "%u" + '\0') may be 21. Since 'name' length is 20, it may be memory overflow. And 'path.bc[i]' is 'unsigned char' for printing, we can be sure the max length of 'name' must be less than 28. So simplify thinking, we can use 28 instead of 20 directly, and do not think of whether 'patchc.bc[i]' can '> 100'. Signed-off-by: Chen Gang <gang.chen@asianux.com> Signed-off-by: Helge Deller <deller@gmx.de>
* Merge tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mturquette/linuxLinus Torvalds2013-06-016-6/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull clock subsystem fixes from Mike Turquette: "A mix of small fixes affecting mostly ARM platforms as well as a discrete clock expander chip. Most fixes are corrections to lousy clock data of one form or another." * tag 'clk-fixes-for-linus' of git://git.linaro.org/people/mturquette/linux: clk: mxs: Include clk mxs header file clk: vt8500: Fix unbalanced spinlock in vt8500_dclk_set_rate() clk: si5351: Set initial clkout rate when defined in platform data. clk: si5351: Fix clkout rate computation. clk: samsung: Add CLK_IGNORE_UNUSED flag for the sysreg clocks clk: ux500: clk-sysctrl: handle clocks with no parents clk: ux500: Provide device enumeration number suffix for SMSC911x
| * clk: mxs: Include clk mxs header fileFabio Estevam2013-05-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | Fix the following sparse warnings: drivers/clk/mxs/clk-imx28.c:72:5: warning: symbol 'mxs_saif_clkmux_select' was not declared. Should it be static? drivers/clk/mxs/clk-imx28.c:156:12: warning: symbol 'mx28_clocks_init' was not declared. Should it be static? Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Acked-by: Shawn Guo <shawn.guo@linaro.org> Signed-off-by: Mike Turquette <mturquette@linaro.org> [mturquette@linaro.org: fixed $SUBJECT line]
| * clk: vt8500: Fix unbalanced spinlock in vt8500_dclk_set_rate()Tony Prisk2013-05-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | With the addition of a DVO clock, a bug is now evident in the vt8500 clock code: [ 0.290000] WARNING: at init/main.c:698 do_one_initcall+0x158/0x18c() [ 0.300000] initcall wm8505fb_driver_init+0x0/0xc returned with disabled int This is caused by an unbalanced spinlock in vt8500_dclk_set_rate(). Replace the second call to spin_lock_irqsave() with spin_unlock_irqrestore(). Signed-off-by: Tony Prisk <linux@prisktech.co.nz> Signed-off-by: Mike Turquette <mturquette@linaro.org>
| * clk: si5351: Set initial clkout rate when defined in platform data.Marek Belisko2013-05-291-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | clock-frequency property from platform data was read but never used. Apply defined rate when clock is registered. Signed-off-by: Marek Belisko <marek.belisko@streamunlimited.com> Acked-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Signed-off-by: Mike Turquette <mturquette@linaro.org> [mturquette@linaro.org: add missing changelog] Cc: stable@kernel.org Signed-off-by: Mike Turquette <mturquette@linaro.org>
| * clk: si5351: Fix clkout rate computation.Marek Belisko2013-05-291-1/+1
| | | | | | | | | | | | | | | | | | Rate was incorrectly computed because we read from wrong divider register. Signed-off-by: Marek Belisko <marek.belisko@streamunlimited.com> Acked-by: Sebastian Hesselbarth <sebastian.hesselbarth@gmail.com> Signed-off-by: Mike Turquette <mturquette@linaro.org> Cc: stable@kernel.org
| * clk: samsung: Add CLK_IGNORE_UNUSED flag for the sysreg clocksSylwester Nawrocki2013-05-291-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently no driver *) handles the sysreg clock, with an assumption that this clock is always left in its default state (enabled). Before commit 6e6aac7590f902d14d90bace3fd499 ARM: EXYNOS: Migrate clock support to common clock framework the sysreg clock was not even defined and hence wasn't handled explicitly in the kernel. To restore the previous behaviour disable masking the sysreg clock off in the clock core by default. *) Except the Exynos4x12 FIMC-IS driver, which will be modified to not touch the sysreg clock. Signed-off-by: Sylwester Nawrocki <s.nawrocki@samsung.com> Signed-off-by: Kyungmin Park <kyungmin.park@samsung.com> Signed-off-by: Mike Turquette <mturquette@linaro.org>
| * clk: ux500: clk-sysctrl: handle clocks with no parentsFabio Baltieri2013-05-291-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | Fix clk_reg_sysctrl() to set main clock registers of new struct clk_sysctrl even if the registered clock has no parents. This fixes an issue where "ulpclk" was registered with all clk->reg_* fields uninitialized, causing a -EINVAL error from clk_prepare(). Signed-off-by: Fabio Baltieri <fabio.baltieri@linaro.org> Acked-by: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Mike Turquette <mturquette@linaro.org>
| * clk: ux500: Provide device enumeration number suffix for SMSC911xLee Jones2013-05-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | First Ethernet device has a ".0" appended onto the device name. It appears that we need this in order to obtain the correct clock. Without this fix Ethernet does not function on Ux500 devices, which is a regression. Cc: Ulf Hansson <ulf.hansson@linaro.org> Signed-off-by: Lee Jones <lee.jones@linaro.org> Signed-off-by: Mike Turquette <mturquette@linaro.org> [mturquette@linaro.org: improved changelog]
* | Merge tag 'fbdev-for-3.10-rc4' of ↵Linus Torvalds2013-06-017-5/+42
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/plagnioj/linux-fbdev Pull fbdev fixes from Jean-Christophe PLAGNIOL-VILLARD: "This contains some small fixes - Atmel LCDC: fix blank the backlight on remove - ps3fb: fix compile warning - OMAPDSS: Fix crash with DT boot" * tag 'fbdev-for-3.10-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/plagnioj/linux-fbdev: atmel_lcdfb: blank the backlight on remove trivial: atmel_lcdfb: add missing error message OMAPDSS: Fix crash with DT boot fbdev/ps3fb: fix compile warning
| * | atmel_lcdfb: blank the backlight on removeRichard Genoud2013-06-011-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When removing atmel_lcdfb module, the backlight is unregistered but not blanked. (only for CONFIG_BACKLIGHT_ATMEL_LCDC case). This can result in the screen going full white depending on how the PWM is wired. Signed-off-by: Richard Genoud <richard.genoud@gmail.com> Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
| * | trivial: atmel_lcdfb: add missing error messageRichard Genoud2013-06-011-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | When a too small framebuffer is given, the atmel_lcdfb_check_var silently fails. Adding an error message will save some head scratching. Signed-off-by: Richard Genoud <richard.genoud@gmail.com> Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
| * | Merge branch 'fbdev-3.10-fixes' of git://gitorious.org/linux-omap-dss2/linux ↵Jean-Christophe PLAGNIOL-VILLARD2013-05-296-2/+30
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | into linux-fbdev/for-3.10-fixes Pull Tomi fixes for ps3fb and omap2 Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
| | * OMAPDSS: Fix crash with DT bootTomi Valkeinen2013-05-235-1/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When booting with DT, there's a crash when omapfb is probed. This is caused by the fact that omapdss+DT is not yet supported, and thus omapdss is not probed at all. On the other hand, omapfb is always probed. When omapfb tries to use omapdss, there's a NULL pointer dereference crash. The same error should most likely happen with omapdrm and omap_vout also. To fix this, add an "initialized" state to omapdss. When omapdss has been probed, it's marked as initialized. omapfb, omapdrm and omap_vout check this state when they are probed to see that omapdss is actually there. Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com> Tested-by: Peter Ujfalusi <peter.ujfalusi@ti.com>
| | * fbdev/ps3fb: fix compile warningTomi Valkeinen2013-05-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 11bd5933abe0 ("fbdev/ps3fb: use vm_iomap_memory()") introduced the following warning: drivers/video/ps3fb.c: In function 'ps3fb_mmap': drivers/video/ps3fb.c:712:2: warning: suggest parentheses around '+' inside '<<' [-Wparentheses] Fix this by adding the parentheses. Signed-off-by: Tomi Valkeinen <tomi.valkeinen@ti.com>
* | | Merge branch 'for-linus' of ↵Linus Torvalds2013-06-017-16/+21
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull assorted fixes from Al Viro: "There'll be more - I'm trying to dig out from under the pile of mail (a couple of weeks of something flu-like ;-/) and there's several more things waiting for review; this is just the obvious stuff." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: zoran: racy refcount handling in vm_ops ->open()/->close() befs_readdir(): do not increment ->f_pos if filldir tells us to stop hpfs: deadlock and race in directory lseek() qnx6: qnx6_readdir() has a braino in pos calculation fix buffer leak after "scsi: saner replacements for ->proc_info()" vfs: Fix invalid ida_remove() call
| * | | zoran: racy refcount handling in vm_ops ->open()/->close()Al Viro2013-05-312-8/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | worse, we lock ->resource_lock too late when we are destroying the final clonal VMA; the check for lack of other mappings of the same opened file can race with mmap(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | befs_readdir(): do not increment ->f_pos if filldir tells us to stopAl Viro2013-05-311-2/+2
| | | | | | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | hpfs: deadlock and race in directory lseek()Al Viro2013-05-311-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For one thing, there's an ABBA deadlock on hpfs fs-wide lock and i_mutex in hpfs_dir_lseek() - there's a lot of methods that grab the former with the caller already holding the latter, so it must take i_mutex first. For another, locking the damn thing, carefully validating the offset, then dropping locks and assigning the offset is obviously racy. Moreover, we _must_ do hpfs_add_pos(), or the machinery in dnode.c won't modify the sucker on B-tree surgeries. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | qnx6: qnx6_readdir() has a braino in pos calculationAl Viro2013-05-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to mask lower 5 bits out, not leave only those and clear the rest... As it is, we end up always starting to read from the beginning of directory, no matter what the current position had been. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | fix buffer leak after "scsi: saner replacements for ->proc_info()"Jan Beulich2013-05-311-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | That patch failed to set proc_scsi_fops' .release method. Signed-off-by: Jan Beulich <jbeulich@suse.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | | vfs: Fix invalid ida_remove() callTakashi Iwai2013-05-311-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the group id of a shared mount is not allocated, the umount still tries to call mnt_release_group_id(), which eventually hits a kernel warning at ida_remove() spewing a message like: ida_remove called for id=0 which is not allocated. This patch fixes the bug simply checking the group id in the caller. Reported-by: Cristian Rodríguez <crrodriguez@opensuse.org> Signed-off-by: Takashi Iwai <tiwai@suse.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | Merge tag 'nfs-for-3.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2013-06-012-1/+3
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull two NFS client fixes from Trond Myklebust: - Fix a regression that broke NFS mounting using klibc and busybox - Stable fix to check access modes correctly on NFSv4 delegated open() * tag 'nfs-for-3.10-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: NFS: Fix security flavor negotiation with legacy binary mounts NFSv4: Fix a thinko in nfs4_try_open_cached
| * | | | NFS: Fix security flavor negotiation with legacy binary mountsChuck Lever2013-05-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Darrick J. Wong <darrick.wong@oracle.com> reports: > I have a kvm-based testing setup that netboots VMs over NFS, the > client end of which seems to have broken somehow in 3.10-rc1. The > server's exports file looks like this: > > /storage/mtr/x64 192.168.122.0/24(ro,sync,no_root_squash,no_subtree_check) > > On the client end (inside the VM), the initrd runs the following > command to try to mount the rootfs over NFS: > > # mount -o nolock -o ro -o retrans=10 192.168.122.1:/storage/mtr/x64/ /root > > (Note: This is the busybox mount command.) > > The mount fails with -EINVAL. Commit 4580a92d44 "NFS: Use server-recommended security flavor by default (NFSv3)" introduced a behavior regression for NFS mounts done via a legacy binary mount(2) call. Ensure that a default security flavor is specified for legacy binary mount requests, since they do not invoke nfs_select_flavor() in the kernel. Busybox uses klibc's nfsmount command, which performs NFS mounts using the legacy binary mount data format. /sbin/mount.nfs is not affected by this regression. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: Darrick J. Wong <darrick.wong@oracle.com> Acked-by: Weston Andros Adamson <dros@netapp.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | NFSv4: Fix a thinko in nfs4_try_open_cachedTrond Myklebust2013-05-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to pass the full open mode flags to nfs_may_open() when doing a delegated open. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
* | | | | Merge branch 'for_linus' of ↵Linus Torvalds2013-06-014-3/+25
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull reiserfs fixes from Jan Kara: "Three reiserfs fixes. They fix real problems spotted by users so I hope they are ok even at this stage." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: reiserfs: fix deadlock with nfs racing on create/lookup reiserfs: fix problems with chowning setuid file w/ xattrs reiserfs: fix spurious multiple-fill in reiserfs_readdir_dentry
| * | | | | reiserfs: fix deadlock with nfs racing on create/lookupJeff Mahoney2013-05-311-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reiserfs is currently able to be deadlocked by having two NFS clients where one has removed and recreated a file and another is accessing the file with an open file handle. If one client deletes and recreates a file with timing such that the recreated file obtains the same [dirid, objectid] pair as the original file while another client accesses the file via file handle, the create and lookup can race and deadlock if the lookup manages to create the in-memory inode first. The create thread, in insert_inode_locked4, will hold the write lock while waiting on the other inode to be unlocked. The lookup thread, anywhere in the iget path, will release and reacquire the write lock while it schedules. If it needs to reacquire the lock while the create thread has it, it will never be able to make forward progress because it needs to reacquire the lock before ultimately unlocking the inode. This patch drops the write lock across the insert_inode_locked4 call so that the ordering of inode_wait -> write lock is retained. Since this would have been the case before the BKL push-down, this is safe. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | | reiserfs: fix problems with chowning setuid file w/ xattrsJeff Mahoney2013-05-312-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | reiserfs_chown_xattrs() takes the iattr struct passed into ->setattr and uses it to iterate over all the attrs associated with a file to change ownership of xattrs (and transfer quota associated with the xattr files). When the setuid bit is cleared during chown, ATTR_MODE and iattr->ia_mode are passed to all the xattrs as well. This means that the xattr directory will have S_IFREG added to its mode bits. This has been prevented in practice by a missing IS_PRIVATE check in reiserfs_acl_chmod, which caused a double-lock to occur while holding the write lock. Since the file system was completely locked up, the writeout of the corrupted mode never happened. This patch temporarily clears everything but ATTR_UID|ATTR_GID for the calls to reiserfs_setattr and adds the missing IS_PRIVATE check. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
| * | | | | reiserfs: fix spurious multiple-fill in reiserfs_readdir_dentryJeff Mahoney2013-05-311-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After sleeping for filldir(), we check to see if the file system has changed and research. The next_pos pointer is updated but its value isn't pushed into the key used for the search itself. As a result, the search returns the same item that the last cycle of the loop did and filldir() is called multiple times with the same data. The end result is that the buffer can contain the same name multiple times. This can be returned to userspace or used internally in the xattr code where it can manifest with the following warning: jdm-20004 reiserfs_delete_xattrs: Couldn't delete all xattrs (-2) reiserfs_for_each_xattr uses reiserfs_readdir_dentry to iterate over the xattr names and ends up trying to unlink the same name twice. The second attempt fails with -ENOENT and the error is returned. At some point I'll need to add support into reiserfsck to remove the orphaned directories left behind when this occurs. The fix is to push the value into the key before researching. Signed-off-by: Jeff Mahoney <jeffm@suse.com> Signed-off-by: Jan Kara <jack@suse.cz>
* | | | | | Merge tag 'for-linus-v3.10-rc4-crc-xattr-fixes' of git://oss.sgi.com/xfs/xfsLinus Torvalds2013-06-014-183/+307
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs extended attribute fixes for CRCs from Ben Myers: "Here are several fixes that are relevant on CRC enabled XFS filesystems. They are followed by a rework of the remote attribute code so that each block of the attribute contains a header with a CRC. Previously there was a CRC header per extent in the remote attribute code, but this was untenable because it was not possible to know how many extents would be allocated for the attribute until after the allocation has completed, due to the fragmentation of free space. This became complicated because the size of the headers needs to be added to the length of the payload to get the overall length required for the allocation. With a header per block, things are less complicated at the cost of a little space. I would have preferred to defer this and the rest of the CRC queue to 3.11 to mitigate risk for existing non-crc users in 3.10. Doing so would require setting a feature bit for the on-disk changes, and so I have been pressured into sending this pull request by Eric Sandeen and David Chinner from Red Hat. I'll send another pull request or two with the rest of the CRC queue next week. - Remove assert on count of remote attribute CRC headers - Fix the number of blocks read in for remote attributes - Zero remote attribute tails properly - Fix mapping of remote attribute buffers to have correct length - initialize temp leaf properly in xfs_attr3_leaf_unbalance, and xfs_attr3_leaf_compact - Rework remote atttributes to have a header per block" * tag 'for-linus-v3.10-rc4-crc-xattr-fixes' of git://oss.sgi.com/xfs/xfs: xfs: rework remote attr CRCs xfs: fully initialise temp leaf in xfs_attr3_leaf_compact xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalance xfs: correctly map remote attr buffers during removal xfs: remote attribute tail zeroing does too much xfs: remote attribute read too short xfs: remote attribute allocation may be contiguous
| * | | | | | xfs: rework remote attr CRCsDave Chinner2013-05-304-158/+247
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Note: this changes the on-disk remote attribute format. I assert that this is OK to do as CRCs are marked experimental and the first kernel it is included in has not yet reached release yet. Further, the userspace utilities are still evolving and so anyone using this stuff right now is a developer or tester using volatile filesystems for testing this feature. Hence changing the format right now to save longer term pain is the right thing to do. The fundamental change is to move from a header per extent in the attribute to a header per filesytem block in the attribute. This means there are more header blocks and the parsing of the attribute data is slightly more complex, but it has the advantage that we always know the size of the attribute on disk based on the length of the data it contains. This is where the header-per-extent method has problems. We don't know the size of the attribute on disk without first knowing how many extents are used to hold it. And we can't tell from a mapping lookup, either, because remote attributes can be allocated contiguously with other attribute blocks and so there is no obvious way of determining the actual size of the atribute on disk short of walking and mapping buffers. The problem with this approach is that if we map a buffer incorrectly (e.g. we make the last buffer for the attribute data too long), we then get buffer cache lookup failure when we map it correctly. i.e. we get a size mismatch on lookup. This is not necessarily fatal, but it's a cache coherency problem that can lead to returning the wrong data to userspace or writing the wrong data to disk. And debug kernels will assert fail if this occurs. I found lots of niggly little problems trying to fix this issue on a 4k block size filesystem, finally getting it to pass with lots of fixes. The thing is, 1024 byte filesystems still failed, and it was getting really complex handling all the corner cases that were showing up. And there were clearly more that I hadn't found yet. It is complex, fragile code, and if we don't fix it now, it will be complex, fragile code forever more. Hence the simple fix is to add a header to each filesystem block. This gives us the same relationship between the attribute data length and the number of blocks on disk as we have without CRCs - it's a linear mapping and doesn't require us to guess anything. It is simple to implement, too - the remote block count calculated at lookup time can be used by the remote attribute set/get/remove code without modification for both CRC and non-CRC filesystems. The world becomes sane again. Because the copy-in and copy-out now need to iterate over each filesystem block, I moved them into helper functions so we separate the block mapping and buffer manupulations from the attribute data and CRC header manipulations. The code becomes much clearer as a result, and it is a lot easier to understand and debug. It also appears to be much more robust - once it worked on 4k block size filesystems, it has worked without failure on 1k block size filesystems, too. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit ad1858d77771172e08016890f0eb2faedec3ecee)
| * | | | | | xfs: fully initialise temp leaf in xfs_attr3_leaf_compactDave Chinner2013-05-301-16/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_attr3_leaf_compact() uses a temporary buffer for compacting the the entries in a leaf. It copies the the original buffer into the temporary buffer, then zeros the original buffer completely. It then copies the entries back into the original buffer. However, the original buffer has not been correctly initialised, and so the movement of the entries goes horribly wrong. Make sure the zeroed destination buffer is fully initialised, and once we've set up the destination incore header appropriately, write is back to the buffer before starting to move entries around. While debugging this, the _d/_s prefixes weren't sufficient to remind me what buffer was what, so rename then all _src/_dst. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit d4c712bcf26a25c2b67c90e44e0b74c7993b5334)
| * | | | | | xfs: fully initialise temp leaf in xfs_attr3_leaf_unbalanceDave Chinner2013-05-301-3/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_attr3_leaf_unbalance() uses a temporary buffer for recombining the entries in two leaves when the destination leaf requires compaction. The temporary buffer ends up being copied back over the original destination buffer, so the header in the temporary buffer needs to contain all the information that is in the destination buffer. To make sure the temporary buffer is fully initialised, once we've set up the temporary incore header appropriately, write is back to the temporary buffer before starting to move entries around. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 8517de2a81da830f5d90da66b4799f4040c76dc9)
| * | | | | | xfs: correctly map remote attr buffers during removalDave Chinner2013-05-301-9/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we don't map the buffers correctly (same as for get/set operations) then the incore buffer lookup will fail. If a block number matches but a length is wrong, then debug kernels will ASSERT fail in _xfs_buf_find() due to the length mismatch. Ensure that we map the buffers correctly by basing the length of the buffer on the attribute data length rather than the remote block count. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 6863ef8449f1908c19f43db572e4474f24a1e9da)
| * | | | | | xfs: remote attribute tail zeroing does too muchDave Chinner2013-05-301-19/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an attribute data does not fill then entire remote block, we zero the remaining part of the buffer. This, however, needs to take into account that the buffer has a header, and so the offset where zeroing starts and the length of zeroing need to take this into account. Otherwise we end up with zeros over the end of the attribute value when CRCs are enabled. While there, make sure we only ask to map an extent that covers the remaining range of the attribute, rather than asking every time for the full length of remote data. If the remote attribute blocks are contiguous with other parts of the attribute tree, it will map those blocks as well and we can potentially zero them incorrectly. We can also get buffer size mistmatches when trying to read or remove the remote attribute, and this can lead to not finding the correct buffer when looking it up in cache. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 4af3644c9a53eb2f1ecf69cc53576561b64be4c6)
| * | | | | | xfs: remote attribute read too shortDave Chinner2013-05-301-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reading a maximally size remote attribute fails when CRCs are enabled with this verification error: XFS (vdb): remote attribute header does not match required off/len/owner) There are two reasons for this, the first being that the length of the buffer being read is determined from the args->rmtblkcnt which doesn't take into account CRC headers. Hence the mapped length ends up being too short and so we need to calculate it directly from the value length. The second is that the byte count of valid data within a buffer is capped by the length of the data and so doesn't take into account that the buffer might be longer due to headers. Hence we need to calculate the data space in the buffer first before calculating the actual byte count of data. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 913e96bc292e1bb248854686c79d6545ef3ee720)
| * | | | | | xfs: remote attribute allocation may be contiguousDave Chinner2013-05-301-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CRCs are enabled, there may be multiple allocations made if the headers cause a length overflow. This, however, does not mean that the number of headers required increases, as the second and subsequent extents may be contiguous with the previous extent. Hence when we map the extents to write the attribute data, we may end up with less extents than allocations made. Hence the assertion that we consume the number of headers we calculated in the allocation loop is incorrect and needs to be removed. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 90253cf142469a40f89f989904abf0a1e500e1a6)
* | | | | | | Merge tag 'for-linus-v3.10-rc4' of git://oss.sgi.com/xfs/xfsLinus Torvalds2013-06-0110-60/+92
|\ \ \ \ \ \ \ | |/ / / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Ben Myers: - Fix nested transactions in xfs_qm_scall_setqlim - Clear suid/sgid bits when we truncate with size update - Fix recovery for split buffers - Fix block count on remote symlinks - Add fsgeom flag for v5 superblock support - Disable XFS_IOC_SWAPEXT for CRC enabled filesystems - Fix dirv3 freespace block corruption * tag 'for-linus-v3.10-rc4' of git://oss.sgi.com/xfs/xfs: xfs: fix dir3 freespace block corruption xfs: disable swap extents ioctl on CRC enabled filesystems xfs: add fsgeom flag for v5 superblock support. xfs: fix incorrect remote symlink block count xfs: fix split buffer vector log recovery support xfs: kill suid/sgid through the truncate path. xfs: avoid nesting transactions in xfs_qm_scall_setqlim()
| * | | | | | xfs: fix dir3 freespace block corruptionDave Chinner2013-05-302-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the directory freespace index grows to a second block (2017 4k data blocks in the directory), the initialisation of the second new block header goes wrong. The write verifier fires a corruption error indicating that the block number in the header is zero. This was being tripped by xfs/110. The problem is that the initialisation of the new block is done just fine in xfs_dir3_free_get_buf(), but the caller then users a dirv2 structure to zero on-disk header fields that xfs_dir3_free_get_buf() has already zeroed. These lined up with the block number in the dir v3 header format. While looking at this, I noticed that the struct xfs_dir3_free_hdr() had 4 bytes of padding in it that wasn't defined as padding or being zeroed by the initialisation. Add a pad field declaration and fully zero the on disk and in-core headers in xfs_dir3_free_get_buf() so that this is never an issue in the future. Note that this doesn't change the on-disk layout, just makes the 32 bits of padding in the layout explicit. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 5ae6e6a401957698f2bd8c9f4a86d86d02199fea)
| * | | | | | xfs: disable swap extents ioctl on CRC enabled filesystemsDave Chinner2013-05-301-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, swapping extents from one inode to another is a simple act of switching data and attribute forks from one inode to another. This, unfortunately in no longer so simple with CRC enabled filesystems as there is owner information embedded into the BMBT blocks that are swapped between inodes. Hence swapping the forks between inodes results in the inodes having mapping blocks that point to the wrong owner and hence are considered corrupt. To fix this we need an extent tree block or record based swap algorithm so that the BMBT block owner information can be updated atomically in the swap transaction. This is a significant piece of new work, so for the moment simply don't allow swap extent operations to succeed on CRC enabled filesystems. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 02f75405a75eadfb072609f6bf839e027de6a29a)
| * | | | | | xfs: add fsgeom flag for v5 superblock support.Dave Chinner2013-05-302-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently userspace has no way of determining that a filesystem is CRC enabled. Add a flag to the XFS_IOC_FSGEOMETRY ioctl output to indicate that the filesystem has v5 superblock support enabled. This will allow xfs_info to correctly report the state of the filesystem. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 74137fff067961c9aca1e14d073805c3de8549bd)
| * | | | | | xfs: fix incorrect remote symlink block countDave Chinner2013-05-301-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CRCs are enabled, the number of blocks needed to hold a remote symlink on a 1k block size filesystem may be 2 instead of 1. The transaction reservation for the allocated blocks was not taking this into account and only allocating one block. Hence when trying to read or invalidate such symlinks, we are mapping a hole where there should be a block and things go bad at that point. Fix the reservation to use the correct block count, clean up the block count calculation similar to the remote attribute calculation, and add a debug guard to detect when we don't write the entire symlink to disk. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 321a95839e65db3759a07a3655184b0283af90fe)
| * | | | | | xfs: fix split buffer vector log recovery supportDave Chinner2013-05-302-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A long time ago in a galaxy far away.... .. the was a commit made to fix some ilinux specific "fragmented buffer" log recovery problem: http://oss.sgi.com/cgi-bin/gitweb.cgi?p=archive/xfs-import.git;a=commitdiff;h=b29c0bece51da72fb3ff3b61391a391ea54e1603 That problem occurred when a contiguous dirty region of a buffer was split across across two pages of an unmapped buffer. It's been a long time since that has been done in XFS, and the changes to log the entire inode buffers for CRC enabled filesystems has re-introduced that corner case. And, of course, it turns out that the above commit didn't actually fix anything - it just ensured that log recovery is guaranteed to fail when this situation occurs. And now for the gory details. xfstest xfs/085 is failing with this assert: XFS (vdb): bad number of regions (0) in inode log format XFS: Assertion failed: 0, file: fs/xfs/xfs_log_recover.c, line: 1583 Largely undocumented factoid #1: Log recovery depends on all log buffer format items starting with this format: struct foo_log_format { __uint16_t type; __uint16_t size; .... As recoery uses the size field and assumptions about 32 bit alignment in decoding format items. So don't pay much attention to the fact log recovery thinks that it decoding an inode log format item - it just uses them to determine what the size of the item is. But why would it see a log format item with a zero size? Well, luckily enough xfs_logprint uses the same code and gives the same error, so with a bit of gdb magic, it turns out that it isn't a log format that is being decoded. What logprint tells us is this: Oper (130): tid: a0375e1a len: 28 clientid: TRANS flags: none BUF: #regs: 2 start blkno: 144 (0x90) len: 16 bmap size: 2 flags: 0x4000 Oper (131): tid: a0375e1a len: 4096 clientid: TRANS flags: none BUF DATA ---------------------------------------------------------------------------- Oper (132): tid: a0375e1a len: 4096 clientid: TRANS flags: none xfs_logprint: unknown log operation type (4e49) ********************************************************************** * ERROR: data block=2 * ********************************************************************** That we've got a buffer format item (oper 130) that has two regions; the format item itself and one dirty region. The subsequent region after the buffer format item and it's data is them what we are tripping over, and the first bytes of it at an inode magic number. Not a log opheader like there is supposed to be. That means there's a problem with the buffer format item. It's dirty data region is 4096 bytes, and it contains - you guessed it - initialised inodes. But inode buffers are 8k, not 4k, and we log them in their entirety. So something is wrong here. The buffer format item contains: (gdb) p /x *(struct xfs_buf_log_format *)in_f $22 = {blf_type = 0x123c, blf_size = 0x2, blf_flags = 0x4000, blf_len = 0x10, blf_blkno = 0x90, blf_map_size = 0x2, blf_data_map = {0xffffffff, 0xffffffff, .... }} Two regions, and a signle dirty contiguous region of 64 bits. 64 * 128 = 8k, so this should be followed by a single 8k region of data. And the blf_flags tell us that the type of buffer is a XFS_BLFT_DINO_BUF. It contains inodes. And because it doesn't have the XFS_BLF_INODE_BUF flag set, that means it's an inode allocation buffer. So, it should be followed by 8k of inode data. But we know that the next region has a header of: (gdb) p /x *ohead $25 = {oh_tid = 0x1a5e37a0, oh_len = 0x100000, oh_clientid = 0x69, oh_flags = 0x0, oh_res2 = 0x0} and so be32_to_cpu(oh_len) = 0x1000 = 4096 bytes. It's simply not long enough to hold all the logged data. There must be another region. There is - there's a following opheader for another 4k of data that contains the other half of the inode cluster data - the one we assert fail on because it's not a log format header. So why is the second part of the data not being accounted to the correct buffer log format structure? It took a little more work with gdb to work out that the buffer log format structure was both expecting it to be there but hadn't accounted for it. It was at that point I went to the kernel code, as clearly this wasn't a bug in xfs_logprint and the kernel was writing bad stuff to the log. First port of call was the buffer item formatting code, and the discontiguous memory/contiguous dirty region handling code immediately stood out. I've wondered for a long time why the code had this comment in it: vecp->i_addr = xfs_buf_offset(bp, buffer_offset); vecp->i_len = nbits * XFS_BLF_CHUNK; vecp->i_type = XLOG_REG_TYPE_BCHUNK; /* * You would think we need to bump the nvecs here too, but we do not * this number is used by recovery, and it gets confused by the boundary * split here * nvecs++; */ vecp++; And it didn't account for the extra vector pointer. The case being handled here is that a contiguous dirty region lies across a boundary that cannot be memcpy()d across, and so has to be split into two separate operations for xlog_write() to perform. What this code assumes is that what is written to the log is two consecutive blocks of data that are accounted in the buf log format item as the same contiguous dirty region and so will get decoded as such by the log recovery code. The thing is, xlog_write() knows nothing about this, and so just does it's normal thing of adding an opheader for each vector. That means the 8k region gets written to the log as two separate regions of 4k each, but because nvecs has not been incremented, the buf log format item accounts for only one of them. Hence when we come to log recovery, we process the first 4k region and then expect to come across a new item that starts with a log format structure of some kind that tells us whenteh next data is going to be. Instead, we hit raw buffer data and things go bad real quick. So, the commit from 2002 that commented out nvecs++ is just plain wrong. It breaks log recovery completely, and it would seem the only reason this hasn't been since then is that we don't log large contigous regions of multi-page unmapped buffers very often. Never would be a closer estimate, at least until the CRC code came along.... So, lets fix that by restoring the nvecs accounting for the extra region when we hit this case..... .... and there's the problemin log recovery it is apparently working around: XFS: Assertion failed: i == item->ri_total, file: fs/xfs/xfs_log_recover.c, line: 2135 Yup, xlog_recover_do_reg_buffer() doesn't handle contigous dirty regions being broken up into multiple regions by the log formatting code. That's an easy fix, though - if the number of contiguous dirty bits exceeds the length of the region being copied out of the log, only account for the number of dirty bits that region covers, and then loop again and copy more from the next region. It's a 2 line fix. Now xfstests xfs/085 passes, we have one less piece of mystery code, and one more important piece of knowledge about how to structure new log format items.. Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Mark Tinguely <tinguely@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 709da6a61aaf12181a8eea8443919ae5fc1b731d)
| * | | | | | xfs: kill suid/sgid through the truncate path.Dave Chinner2013-05-301-15/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS has failed to kill suid/sgid bits correctly when truncating files of non-zero size since commit c4ed4243 ("xfs: split xfs_setattr") introduced in the 3.1 kernel. Fix it. Fix it. cc: stable kernel <stable@vger.kernel.org> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit 56c19e89b38618390addfc743d822f99519055c6)
| * | | | | | xfs: avoid nesting transactions in xfs_qm_scall_setqlim()Dave Chinner2013-05-301-17/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lockdep reports: ============================================= [ INFO: possible recursive locking detected ] 3.9.0+ #3 Not tainted --------------------------------------------- setquota/28368 is trying to acquire lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 but task is already holding lock: (sb_internal){++++.?}, at: [<c11e8846>] xfs_trans_alloc+0x26/0x50 from xfs_qm_scall_setqlim()->xfs_dqread() when a dquot needs to be allocated. xfs_qm_scall_setqlim() is starting a transaction and then not passing it into xfs_qm_dqet() and so it starts it's own transaction when allocating the dquot. Splat! Fix this by not allocating the dquot in xfs_qm_scall_setqlim() inside the setqlim transaction. This requires getting the dquot first (and allocating it if necessary) then dropping and relocking the dquot before joining it to the setqlim transaction. Reported-by: Michael L. Semon <mlsemon35@gmail.com> Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Ben Myers <bpm@sgi.com> Signed-off-by: Ben Myers <bpm@sgi.com> (cherry picked from commit f648167f3ac79018c210112508c732ea9bf67c7b)
* | | | | | | Merge tag 'please-pull-aertracefix' of ↵Linus Torvalds2013-06-015-24/+12
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras Pull aer error logging fix from Tony Luck: "Can't call pci_get_domain_bus_and_slot() from interupt context" * tag 'please-pull-aertracefix' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: aerdrv: Move cper_print_aer() call out of interrupt context
| * | | | | | | aerdrv: Move cper_print_aer() call out of interrupt contextLance Ortiz2013-05-305-24/+12
| | |_|_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following warning was seen on 3.9 when a corrected PCIe error was being handled by the AER subsystem. WARNING: at .../drivers/pci/search.c:214 pci_get_dev_by_id+0x8a/0x90() This occurred because a call to pci_get_domain_bus_and_slot() was added to cper_print_pcie() to setup for the call to cper_print_aer(). The warning showed up because cper_print_pcie() is called in an interrupt context and pci_get* functions are not supposed to be called in that context. The solution is to move the cper_print_aer() call out of the interrupt context and into aer_recover_work_func() to avoid any warnings when calling pci_get* functions. Signed-off-by: Lance Ortiz <lance.ortiz@hp.com> Acked-by: Borislav Petkov <bp@suse.de> Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Tony Luck <tony.luck@intel.com>
* | | | | | | Merge tag 'arm64-stable' of ↵Linus Torvalds2013-06-014-6/+25
|\ \ \ \ \ \ \ | |_|_|_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64 Pull arm64 fixes from Catalin Marinas: - Module compilation issues (symbol not exported). - Plug a hole where user space can bring the kernel down. * tag 'arm64-stable' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-aarch64: arm64: don't kill the kernel on a bad esr from el0 arm64: treat unhandled compat el0 traps as undef arm64: Do not report user faults for handled signals arm64: kernel: compiling issue, need 'EXPORT_SYMBOL(clear_page)'
| * | | | | | arm64: don't kill the kernel on a bad esr from el0Mark Rutland2013-05-311-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rather than completely killing the kernel if we receive an esr value we can't deal with in the el0 handlers, send the process a SIGILL and log the esr value in the hope that we can debug it. If we receive a bad esr from el1, we'll die() as before. Signed-off-by: Mark Rutland <mark.rutland@arm.com> Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Cc: stable@vger.kernel.org
OpenPOWER on IntegriCloud