Opened 8 years ago

Closed 6 years ago

Last modified 6 years ago

#387 closed defect (worksforme)

OpenAFS getcwd() sometimes returns ENOENT

Reported by: andersk Owned by:
Priority: blocker Milestone:
Component: internals Keywords:
Cc:

Description (last modified by andersk)

Every few hours, a server gets stuck in a state where certain users receive ENOENT from getcwd(), resulting in errors like

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

Sometimes the problem goes away on its own, but usually the server needs to be rebooted.

I attempted to debug this with upstream on 2013-06-11 and 2013-06-16. Using crash, I found that we’re ending up with two dentries referring to the same directory inode, which is bad. We’re all confused.

Change History (16)

comment:1 Changed 8 years ago by andersk

  • Description modified (diff)

comment:2 Changed 8 years ago by andersk

Ugh. It turns out it’s really trivial to trigger this warning:

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1381,14 +1381,15 @@ EXPORT_SYMBOL(d_set_d_op);
 
 static void __d_instantiate(struct dentry *dentry, struct inode *inode)
 {
        spin_lock(&dentry->d_lock);
        if (inode) {
                if (unlikely(IS_AUTOMOUNT(inode)))
                        dentry->d_flags |= DCACHE_NEED_AUTOMOUNT;
+               WARN_ON(S_ISDIR(inode->i_mode) && !hlist_empty(&inode->i_dentry));
                hlist_add_head(&dentry->d_alias, &inode->i_dentry);
        }
        dentry->d_inode = inode;
        dentry_rcuwalk_barrier(dentry);
        spin_unlock(&dentry->d_lock);
        fsnotify_d_instantiate(dentry, inode);
 }

with, say,

$ ls /afs/athena.mit.edu/user/a/n/andersk
$ ls /afs/.athena.mit.edu/user/a/n/andersk

Backtrace:

------------[ cut here ]------------
WARNING: at fs/dcache.c:1388 __d_instantiate+0x10c/0x150()
Modules linked in: lockd coretemp microcode openafs(PO) fuse binfmt_misc crc32_pclmul crc32c_intel xen_netfront xen_blkfront ghash_clmulni_intel sunrpc be2iscsi bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi libiscsi_tcp qla4xxx iscsi_boot_sysfs libiscsi scsi_transport_iscsi
Pid: 3201, comm: ls Tainted: P           O 3.9.10-100.scripts.1.fc17.x86_64 #1
Call Trace:
 [<ffffffff8105efd5>] warn_slowpath_common+0x75/0xa0
 [<ffffffff8105f01a>] warn_slowpath_null+0x1a/0x20
 [<ffffffff811b248c>] __d_instantiate+0x10c/0x150
 [<ffffffff811b3d82>] d_splice_alias+0xd2/0x100
 [<ffffffffa023f284>] afs_linux_lookup+0x124/0x2b0 [openafs]
 [<ffffffff811a6e0d>] lookup_real+0x1d/0x60
 [<ffffffff811a7028>] __lookup_hash+0x38/0x50
 [<ffffffff81654696>] lookup_slow+0x45/0xa9
 [<ffffffff811a9b03>] path_lookupat+0x6c3/0x720
 [<ffffffff81184925>] ? kmem_cache_alloc+0x35/0x200
 [<ffffffff811a8410>] ? getname_flags.part.34+0x30/0x150
 [<ffffffff811a9b94>] filename_lookup+0x34/0xc0
 [<ffffffff811ac94e>] user_path_at_empty+0x8e/0x110
 [<ffffffff8108353c>] ? remove_wait_queue+0x3c/0x50
 [<ffffffff81004c92>] ? xen_mc_flush+0xb2/0x1b0
 [<ffffffff811ac9e1>] user_path_at+0x11/0x20
 [<ffffffff811a2042>] vfs_fstatat+0x52/0xb0
 [<ffffffff811a20be>] vfs_lstat+0x1e/0x20
 [<ffffffff811a232a>] sys_newlstat+0x1a/0x40
 [<ffffffff81014c15>] ? math_state_restore+0x95/0x140
 [<ffffffff8165caf9>] ? do_device_not_available+0x19/0x20
 [<ffffffff816652ce>] ? device_not_available+0x1e/0x30
 [<ffffffff810dcd84>] ? __audit_syscall_entry+0x94/0xf0
 [<ffffffff81664159>] system_call_fastpath+0x16/0x1b
---[ end trace 095fa0704440b6a9 ]---
Last edited 8 years ago by andersk (previous) (diff)

comment:3 Changed 8 years ago by andersk

Okay, I have a promising theory, and I’m now testing this patch on b-m, p-b, r-m, and w-e.

Last edited 8 years ago by andersk (previous) (diff)

comment:4 Changed 8 years ago by andersk

Darn. The getcwd bug showed up on b-m twice with the patch, although it did fix itself five minutes later.

The right answer might be that this dentry aliasing stuff isn’t supposed to work at all, and we need to do something more drastic like making separate inodes for the same volume at separate paths…

comment:6 Changed 7 years ago by andersk

Date: Sat, 25 Jan 2014 15:51:58 -0500
From: Marc Dionne
To: Anders Kaseorg
Subject: Re: Change in openafs[master]: Linux: canonical_dentry: Don't reverse the order if D_ALIAS_...

On Sat, Jan 25, 2014 at 3:32 PM, Anders Kaseorg wrote:

On Sat, 25 Jan 2014, Marc Dionne wrote:

Are you still experiencing this issue, and is it frequent enough that trying out a code change would be feasible and could give useful feedback?

I think I have an idea of what's going on and I can end up reproducing it, but not very deterministically so it's hard to be sure that the potential fix helps.

Yes! We're usually seeing this issue on multiple servers each day from our 8 server pool. We'd be delighted to try installing a patch on half of them.

Anders

Is there any indication that there was a problem contacting the file server hosting the directory that gets the ENOENT at some point before the symptom appears?

I think the root of the problem may be a few places where we do d_drop() of a dentry. In the case I've been able to trace, I get an afs_lookup() failure in afs_linux_dentry_revalidate because the server is temporarily offline or busy, and we then do d_drop of the dentry. This unhashes it, and if someone else is holding a reference, for instance a process having it as its cwd (which takes a reference), that process will see ENOENT until something causes the dentry to be revalidated again with the server up.

So I think that the problem is doing d_drop directly without checking for other references. We should just call d_invalidate() which will do the shrink_dcache_parent(), and then do d_drop but only if we're the only ones with a ref. Is that something you could try?

Marc

comment:7 Changed 7 years ago by andersk

Date: Sat, 25 Jan 2014 16:51:39 -0500
From: Marc Dionne
To: Anders Kaseorg
Subject: Re: Change in openafs[master]: Linux: canonical_dentry: Don't reverse the order if D_ALIAS_...

To be clearer, something simple like this:

  • src/afs/LINUX/osi_vnodeops.c

    diff --git a/src/afs/LINUX/osi_vnodeops.c b/src/afs/LINUX/osi_vnodeops.c
    index 28841c8..b7cae5a 100644
    a b afs_linux_dentry_revalidate(struct dentry *dp, int flags) 
    12621262    if (credp)
    12631263        crfree(credp);
    12641264
    1265     if (!valid) {
    1266         shrink_dcache_parent(dp);
    1267         d_drop(dp);
    1268     }
     1265    if (!valid)
     1266        d_invalidate(dp);
     1267
    12691268    return valid;
    12701269
    12711270  bad_dentry:

Other uses of d_drop don't look as suspicious at first glance, and this is the case I hit.

Marc

Last edited 7 years ago by andersk (previous) (diff)

comment:8 Changed 7 years ago by andersk

Installed Marc’s patch on b-m, p-b, r-m, and w-e (edit: I failed to actually install it on w-e, oops). Rebooted all eight servers for good measure. We’ll see what happens.

Last edited 7 years ago by andersk (previous) (diff)

comment:9 Changed 7 years ago by andersk

Marc submitted this upstream as 10774.

The score so far is one getcwd explosion from the unpatched half of the pool, zero from the patched half.

comment:10 Changed 7 years ago by andersk

w-e was still producing getcwd errors for a user. It turns out that was because I had failed to install the patch on w-e. It’s installed there now.

comment:11 Changed 7 years ago by andersk

  • Resolution set to fixed
  • Status changed from new to closed

This patch seems to work. Committed as r2504 and installed on all servers.

comment:12 Changed 7 years ago by andersk

  • Resolution fixed deleted
  • Status changed from closed to reopened

This bug has reappeared with the upgrade from 1.6.8 to 1.6.10pre1. :-(

The most suspicious commit is 11358, I think.

comment:13 follow-up: Changed 7 years ago by andersk

Reverted 11358 on c-w only. We’ll see if that helps.

comment:14 in reply to: ↑ 13 Changed 7 years ago by andersk

Replying to andersk:

Reverted 11358 on c-w only. We’ll see if that helps.

I added a log message in the case where this causes different behavior.

  • src/afs/LINUX/osi_vnodeops.c

    a b afs_linux_dentry_revalidate(struct dentry *dp, int flags) 
    12451245            code = afs_lookup(pvcp, (char *)dp->d_name.name, &tvc, credp);
    12461246            if (!tvc || tvc != vcp) {
    12471247                dput(parent);
    1248                 /* Force unhash; the name doesn't point to this file
    1249                  * anymore. */
    1250                 force_drop = 1;
    1251                 if (code && code != ENOENT) {
    1252                     /* ...except if we couldn't perform the actual lookup,
    1253                      * we don't know if the name points to this file or not. */
    1254                     force_drop = 0;
     1248                /* Force unhash if name is known not to exist. */
     1249                if (code == ENOENT)
     1250                    force_drop = 1;
     1251                if (!code) {
     1252                    /* As of http://gerrit.openafs.org/11358, upstream
     1253                     * also sets force_drop = 1 in this case, but that
     1254                     * seems to exacerbate the getcwd() ENOENT
     1255                     * problem. */
     1256                    char buf[256];
     1257                    char *path = dentry_path_raw(dp, buf, sizeof(buf));
     1258                    pr_warn("openafs: Rescued %s from force_drop\n", IS_ERR(path) ? "(error)" : path);
    12551259                }
    12561260                goto bad_dentry;
    12571261            }

This is now running on both c-w and b-b.

comment:15 Changed 6 years ago by andersk

This is still happening. Right now, getcwd() in /afs/athena.mit.edu returns ENOENT on three of our five servers.

comment:16 Changed 6 years ago by andersk

  • Resolution set to worksforme
  • Status changed from reopened to closed

After all the OpenAFS patches, this problem was still happening occasionally on kernel 3.17, and constantly on kernel 3.18, but we haven’t seen it at all since upgrading to kernel 3.19 last week. It may have finally disappeared.

Last edited 6 years ago by andersk (previous) (diff)
Note: See TracTickets for help on using tickets.