#387 closed defect (worksforme)
OpenAFS getcwd() sometimes returns ENOENT
Reported by: | andersk | Owned by: | |
---|---|---|---|
Priority: | blocker | Milestone: | |
Component: | internals | Keywords: | |
Cc: |
Description (last modified by andersk)
Every few hours, a server gets stuck in a state where certain users receive ENOENT from getcwd(), resulting in errors like
shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory job-working-directory: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory
Sometimes the problem goes away on its own, but usually the server needs to be rebooted.
I attempted to debug this with upstream on 2013-06-11 and 2013-06-16. Using crash, I found that we’re ending up with two dentries referring to the same directory inode, which is bad. We’re all confused.
Change History (16)
comment:1 Changed 11 years ago by andersk
- Description modified (diff)
comment:2 Changed 11 years ago by andersk
comment:3 Changed 11 years ago by andersk
Okay, I have a promising theory, and I’m now testing this patch on b-m, p-b, r-m, and w-e.
comment:4 Changed 11 years ago by andersk
Darn. The getcwd bug showed up on b-m twice with the patch, although it did fix itself five minutes later.
The right answer might be that this dentry aliasing stuff isn’t supposed to work at all, and we need to do something more drastic like making separate inodes for the same volume at separate paths…
comment:5 Changed 11 years ago by andersk
kaduk sent this upstream: https://rt.central.org/rt/Ticket/Display.html?id=131780&user=guest&pass=guest
comment:6 Changed 11 years ago by andersk
Date: | Sat, 25 Jan 2014 15:51:58 -0500 |
From: | Marc Dionne |
To: | Anders Kaseorg |
Subject: | Re: Change in openafs[master]: Linux: canonical_dentry: Don't reverse the order if D_ALIAS_... |
On Sat, Jan 25, 2014 at 3:32 PM, Anders Kaseorg wrote:
On Sat, 25 Jan 2014, Marc Dionne wrote:
Are you still experiencing this issue, and is it frequent enough that trying out a code change would be feasible and could give useful feedback?
I think I have an idea of what's going on and I can end up reproducing it, but not very deterministically so it's hard to be sure that the potential fix helps.
Yes! We're usually seeing this issue on multiple servers each day from our 8 server pool. We'd be delighted to try installing a patch on half of them.
Anders
Is there any indication that there was a problem contacting the file server hosting the directory that gets the ENOENT at some point before the symptom appears?
I think the root of the problem may be a few places where we do d_drop() of a dentry. In the case I've been able to trace, I get an afs_lookup() failure in afs_linux_dentry_revalidate because the server is temporarily offline or busy, and we then do d_drop of the dentry. This unhashes it, and if someone else is holding a reference, for instance a process having it as its cwd (which takes a reference), that process will see ENOENT until something causes the dentry to be revalidated again with the server up.
So I think that the problem is doing d_drop directly without checking for other references. We should just call d_invalidate() which will do the shrink_dcache_parent(), and then do d_drop but only if we're the only ones with a ref. Is that something you could try?
Marc
comment:7 Changed 11 years ago by andersk
Date: | Sat, 25 Jan 2014 16:51:39 -0500 |
From: | Marc Dionne |
To: | Anders Kaseorg |
Subject: | Re: Change in openafs[master]: Linux: canonical_dentry: Don't reverse the order if D_ALIAS_... |
To be clearer, something simple like this:
-
src/afs/LINUX/osi_vnodeops.c
diff --git a/src/afs/LINUX/osi_vnodeops.c b/src/afs/LINUX/osi_vnodeops.c index 28841c8..b7cae5a 100644
a b afs_linux_dentry_revalidate(struct dentry *dp, int flags) 1262 1262 if (credp) 1263 1263 crfree(credp); 1264 1264 1265 if (!valid) { 1266 shrink_dcache_parent(dp); 1267 d_drop(dp); 1268 } 1265 if (!valid) 1266 d_invalidate(dp); 1267 1269 1268 return valid; 1270 1269 1271 1270 bad_dentry:
Other uses of d_drop don't look as suspicious at first glance, and this is the case I hit.
Marc
comment:8 Changed 11 years ago by andersk
Installed Marc’s patch on b-m, p-b, r-m, and w-e (edit: I failed to actually install it on w-e, oops). Rebooted all eight servers for good measure. We’ll see what happens.
comment:9 Changed 11 years ago by andersk
Marc submitted this upstream as 10774.
The score so far is one getcwd explosion from the unpatched half of the pool, zero from the patched half.
comment:10 Changed 11 years ago by andersk
w-e was still producing getcwd errors for a user. It turns out that was because I had failed to install the patch on w-e. It’s installed there now.
comment:11 Changed 11 years ago by andersk
- Resolution set to fixed
- Status changed from new to closed
This patch seems to work. Committed as r2504 and installed on all servers.
comment:12 Changed 10 years ago by andersk
- Resolution fixed deleted
- Status changed from closed to reopened
This bug has reappeared with the upgrade from 1.6.8 to 1.6.10pre1. :-(
The most suspicious commit is 11358, I think.
comment:13 follow-up: ↓ 14 Changed 10 years ago by andersk
Reverted 11358 on c-w only. We’ll see if that helps.
comment:14 in reply to: ↑ 13 Changed 10 years ago by andersk
Replying to andersk:
Reverted 11358 on c-w only. We’ll see if that helps.
I added a log message in the case where this causes different behavior.
-
src/afs/LINUX/osi_vnodeops.c
a b afs_linux_dentry_revalidate(struct dentry *dp, int flags) 1245 1245 code = afs_lookup(pvcp, (char *)dp->d_name.name, &tvc, credp); 1246 1246 if (!tvc || tvc != vcp) { 1247 1247 dput(parent); 1248 /* Force unhash; the name doesn't point to this file 1249 * anymore. */ 1250 force_drop = 1; 1251 if (code && code != ENOENT) { 1252 /* ...except if we couldn't perform the actual lookup, 1253 * we don't know if the name points to this file or not. */ 1254 force_drop = 0; 1248 /* Force unhash if name is known not to exist. */ 1249 if (code == ENOENT) 1250 force_drop = 1; 1251 if (!code) { 1252 /* As of http://gerrit.openafs.org/11358, upstream 1253 * also sets force_drop = 1 in this case, but that 1254 * seems to exacerbate the getcwd() ENOENT 1255 * problem. */ 1256 char buf[256]; 1257 char *path = dentry_path_raw(dp, buf, sizeof(buf)); 1258 pr_warn("openafs: Rescued %s from force_drop\n", IS_ERR(path) ? "(error)" : path); 1255 1259 } 1256 1260 goto bad_dentry; 1257 1261 }
This is now running on both c-w and b-b.
comment:15 Changed 10 years ago by andersk
This is still happening. Right now, getcwd() in /afs/athena.mit.edu returns ENOENT on three of our five servers.
comment:16 Changed 9 years ago by andersk
- Resolution set to worksforme
- Status changed from reopened to closed
After all the OpenAFS patches, this problem was still happening occasionally on kernel 3.17, and constantly on kernel 3.18, but we haven’t seen it at all since upgrading to kernel 3.19 last week. It may have finally disappeared.
Ugh. It turns out it’s really trivial to trigger this warning:
with, say,
Backtrace: