Context Navigation

← Previous Ticket
Next Ticket →

#387 closed defect (worksforme)

OpenAFS getcwd() sometimes returns ENOENT

Reported by:	andersk	Owned by:
Priority:	blocker	Milestone:
Component:	internals	Keywords:
Cc:

Description (last modified by andersk)

Every few hours, a server gets stuck in a state where certain users receive ENOENT from getcwd(), resulting in errors like

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

Sometimes the problem goes away on its own, but usually the server needs to be rebooted.

I attempted to debug this with upstream on 2013-06-11 and 2013-06-16. Using crash, I found that we’re ending up with two dentries referring to the same directory inode, which is bad. We’re all confused.

Change History (16)

comment:1 Changed 10 years ago by andersk

Description modified (diff)

comment:2 Changed 10 years ago by andersk

Ugh. It turns out it’s really trivial to trigger this warning:

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1381,14 +1381,15 @@ EXPORT_SYMBOL(d_set_d_op);
 
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
        spin_lock(&dentry->d_lock);
        if (inode) {
                if (unlikely(IS_AUTOMOUNT(inode)))
                        dentry->d_flags |= DCACHE_NEED_AUTOMOUNT;
+               WARN_ON(S_ISDIR(inode->i_mode) && !hlist_empty(&inode->i_dentry));
                hlist_add_head(&dentry->d_alias, &inode->i_dentry);
        }
        dentry->d_inode = inode;
        dentry_rcuwalk_barrier(dentry);
        spin_unlock(&dentry->d_lock);
        fsnotify_d_instantiate(dentry, inode);
 }

with, say,

$ ls /afs/athena.mit.edu/user/a/n/andersk
$ ls /afs/.athena.mit.edu/user/a/n/andersk

Backtrace:

------------[ cut here ]------------
WARNING: at fs/dcache.c:1388 __d_instantiate+0x10c/0x150()
Modules linked in: lockd coretemp microcode openafs(PO) fuse binfmt_misc crc32_pclmul crc32c_intel xen_netfront xen_blkfront ghash_clmulni_intel sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
Pid: 3201, comm: ls Tainted: P           O 3.9.10-100.scripts.1.fc17.x86_64 #1
Call Trace:
 [<ffffffff8105efd5>] warn_slowpath_common+0x75/0xa0
 [<ffffffff8105f01a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff811b248c>] __d_instantiate+0x10c/0x150
 [<ffffffff811b3d82>] d_splice_alias+0xd2/0x100
 [<ffffffffa023f284>] afs_linux_lookup+0x124/0x2b0 [openafs]
 [<ffffffff811a6e0d>] lookup_real+0x1d/0x60
 [<ffffffff811a7028>] __lookup_hash+0x38/0x50
 [<ffffffff81654696>] lookup_slow+0x45/0xa9
 [<ffffffff811a9b03>] path_lookupat+0x6c3/0x720
 [<ffffffff81184925>] ? kmem_cache_alloc+0x35/0x200
 [<ffffffff811a8410>] ? getname_flags.part.34+0x30/0x150
 [<ffffffff811a9b94>] filename_lookup+0x34/0xc0
 [<ffffffff811ac94e>] user_path_at_empty+0x8e/0x110
 [<ffffffff8108353c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffff81004c92>] ? xen_mc_flush+0xb2/0x1b0
 [<ffffffff811ac9e1>] user_path_at+0x11/0x20
 [<ffffffff811a2042>] vfs_fstatat+0x52/0xb0
 [<ffffffff811a20be>] vfs_lstat+0x1e/0x20
 [<ffffffff811a232a>] sys_newlstat+0x1a/0x40
 [<ffffffff81014c15>] ? math_state_restore+0x95/0x140
 [<ffffffff8165caf9>] ? do_device_not_available+0x19/0x20
 [<ffffffff816652ce>] ? device_not_available+0x1e/0x30
 [<ffffffff810dcd84>] ? __audit_syscall_entry+0x94/0xf0
 [<ffffffff81664159>] system_call_fastpath+0x16/0x1b
---[ end trace 095fa0704440b6a9 ]---

Last edited 10 years ago by andersk (previous) (diff)

comment:3 Changed 10 years ago by andersk

Okay, I have a promising theory, and I’m now testing this patch on b-m, p-b, r-m, and w-e.

Last edited 10 years ago by andersk (previous) (diff)

comment:4 Changed 10 years ago by andersk

Darn. The getcwd bug showed up on b-m twice with the patch, although it did fix itself five minutes later.

The right answer might be that this dentry aliasing stuff isn’t supposed to work at all, and we need to do something more drastic like making separate inodes for the same volume at separate paths…

comment:5 Changed 10 years ago by andersk

kaduk sent this upstream: https://rt.central.org/rt/Ticket/Display.html?id=131780&user=guest&pass=guest

comment:6 Changed 10 years ago by andersk

Date:	Sat, 25 Jan 2014 15:51:58 -0500
From:	Marc Dionne
To:	Anders Kaseorg
Subject:	Re: Change in openafs[master]: Linux: canonical_dentry: Don't reverse the order if D_ALIAS_...

On Sat, Jan 25, 2014 at 3:32 PM, Anders Kaseorg wrote:

On Sat, 25 Jan 2014, Marc Dionne wrote:

Are you still experiencing this issue, and is it frequent enough that trying out a code change would be feasible and could give useful feedback?

I think I have an idea of what's going on and I can end up reproducing it, but not very deterministically so it's hard to be sure that the potential fix helps.

Yes! We're usually seeing this issue on multiple servers each day from our 8 server pool. We'd be delighted to try installing a patch on half of them.

Anders

Is there any indication that there was a problem contacting the file server hosting the directory that gets the ENOENT at some point before the symptom appears?

I think the root of the problem may be a few places where we do d_drop() of a dentry. In the case I've been able to trace, I get an afs_lookup() failure in afs_linux_dentry_revalidate because the server is temporarily offline or busy, and we then do d_drop of the dentry. This unhashes it, and if someone else is holding a reference, for instance a process having it as its cwd (which takes a reference), that process will see ENOENT until something causes the dentry to be revalidated again with the server up.

So I think that the problem is doing d_drop directly without checking for other references. We should just call d_invalidate() which will do the shrink_dcache_parent(), and then do d_drop but only if we're the only ones with a ref. Is that something you could try?

Marc

comment:7 Changed 10 years ago by andersk

Date:	Sat, 25 Jan 2014 16:51:39 -0500
From:	Marc Dionne
To:	Anders Kaseorg
Subject:	Re: Change in openafs[master]: Linux: canonical_dentry: Don't reverse the order if D_ALIAS_...

To be clearer, something simple like this:

src/afs/LINUX/osi_vnodeops.c

diff --git a/src/afs/LINUX/osi_vnodeops.c b/src/afs/LINUX/osi_vnodeops.c
index 28841c8..b7cae5a 100644

  afs_linux_dentry_revalidate(struct dentry *dp, int flags)
     if (credp)
         crfree(credp);
+    if (!valid) {
+        shrink_dcache_parent(dp);
+        d_drop(dp);
+    }
+    if (!valid)
+        d_invalidate(dp);
     return valid;
   bad_dentry:

Other uses of d_drop don't look as suspicious at first glance, and this is the case I hit.

Marc

Last edited 10 years ago by andersk (previous) (diff)

comment:8 Changed 10 years ago by andersk

Installed Marc’s patch on b-m, p-b, r-m, ~~and w-e~~ (edit: I failed to actually install it on w-e, oops). Rebooted all eight servers for good measure. We’ll see what happens.

Last edited 10 years ago by andersk (previous) (diff)

comment:9 Changed 10 years ago by andersk

Marc submitted this upstream as 10774.

The score so far is one getcwd explosion from the unpatched half of the pool, zero from the patched half.

comment:10 Changed 10 years ago by andersk

w-e was still producing getcwd errors for a user. It turns out that was because I had failed to install the patch on w-e. It’s installed there now.

comment:11 Changed 10 years ago by andersk

Resolution set to fixed
Status changed from new to closed

This patch seems to work. Committed as r2504 and installed on all servers.

comment:12 Changed 10 years ago by andersk

Resolution fixed deleted
Status changed from closed to reopened

This bug has reappeared with the upgrade from 1.6.8 to 1.6.10pre1. :-(

The most suspicious commit is 11358, I think.

comment:13 follow-up: ↓ 14 Changed 10 years ago by andersk

Reverted 11358 on c-w only. We’ll see if that helps.

comment:14 in reply to: ↑ 13 Changed 10 years ago by andersk

Replying to andersk:

Reverted 11358 on c-w only. We’ll see if that helps.

I added a log message in the case where this causes different behavior.

src/afs/LINUX/osi_vnodeops.c

  afs_linux_dentry_revalidate(struct dentry *dp, int flags)
             code = afs_lookup(pvcp, (char *)dp->d_name.name, &tvc, credp);
             if (!tvc || tvc != vcp) {
                 dput(parent);
+                /* Force unhash; the name doesn't point to this file
+                 * anymore. */
+                force_drop = 1;
+                if (code && code != ENOENT) {
+                    /* ...except if we couldn't perform the actual lookup,
+                     * we don't know if the name points to this file or not. */
+                    force_drop = 0;
+                /* Force unhash if name is known not to exist. */
+                if (code == ENOENT)
+                    force_drop = 1;
+                if (!code) {
+                    /* As of http://gerrit.openafs.org/11358, upstream
+                     * also sets force_drop = 1 in this case, but that
+                     * seems to exacerbate the getcwd() ENOENT
+                     * problem. */
+                    char buf[256];
+                    char *path = dentry_path_raw(dp, buf, sizeof(buf));
+                    pr_warn("openafs: Rescued %s from force_drop\n", IS_ERR(path) ? "(error)" : path);
+                }
                 goto bad_dentry;
+            }

This is now running on both c-w and b-b.

comment:15 Changed 9 years ago by andersk

This is still happening. Right now, getcwd() in /afs/athena.mit.edu returns ENOENT on three of our five servers.

comment:16 Changed 9 years ago by andersk

Resolution set to worksforme
Status changed from reopened to closed

After all the OpenAFS patches, this problem was still happening occasionally on kernel 3.17, and constantly on kernel 3.18, but we haven’t seen it since upgrading to kernel 3.19 last week. It may have finally disappeared.

Version 0, edited 9 years ago by andersk (next)

Note: See TracTickets for help on using tickets.

Download in other formats: