stage0/status: fix failure when systemd never runs in stage1 #3713

fabiokung · 2017-06-16T23:35:14Z

A pid file never gets written if systemd never gets to run in stage1. This can happen if the image has a bad command, e.g. one that is not in $PATH.

Steps to reproduce:

$ sudo rkt --insecure-options=image run --uuid-file-save=/tmp/id docker://alpine --exec=bad
stage1: cannot initialize immutable environment: unable to find "bad" in "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

$ sudo rkt status $(cat /tmp/id)
state=exited
created=2017-06-16 16:34:27.38 -0700 PDT
status: unable to print status: unable to get PID for pod "e6ccefea-1cdf-459a-b98d-7da95373a7d2": <nil>

Signed-off-by: Fabio Kung <fabio.kung@gmail.com>

ghost · 2017-06-16T23:35:16Z

Can one of the admins verify this patch?

lucab · 2017-06-19T08:52:06Z

ok to test

fabiokung · 2017-06-19T13:35:31Z

The CI failures seem unrelated: some flakiness on a TTY test.

lucab · 2017-06-19T14:21:54Z

@fabiokung I'll review this after today's release, but could you please format your commit titles so that they include an area/component prefix?

fabiokung · 2017-06-19T14:48:25Z

@lucab done.

lucab · 2017-06-20T15:57:19Z

rkt/status.go


-		stdout.Printf("pid=%d\nexited=%t", pid, (state == pkgPod.Exited || state == pkgPod.ExitedGarbage))
+	if pid, err := p.Pid(); err == nil {


I think this change may reintroduce the race documented in the comment below, which was previously addressed by the goroutine+timeout. I would suggest special-casing the running state: block until the pid exists or we hit the timeout, and then proceed with your logic.

lucab · 2017-06-20T15:59:23Z

I understand what you are experiencing but I think the current PR is a bit too aggressive and may bring back the previous racing issue.

lucab · 2017-06-20T16:11:26Z

On the other hand, this completely skips printing pid in the error and racing case, so it may actually be ok and push the retrying logic to the consumer side. I'm not sure what is the best behavior, but this approach may be more coherent.

fabiokung · 2017-06-20T23:59:14Z

> On the other hand, this completely skips printing pid in the error and racing case, so it may actually be ok and push the retrying logic to the consumer side. I'm not sure what is the best behavior, but this approach may be more coherent.

Exactly my thoughts. There is already rkt status --wait that does the retrying/polling for users, so I'd be inclined to point people that need to guarantee a pid exists (and the container is up) to that.

The sleep code is also racy: nothing guarantees that the pid file will be written within 1s. I really dislike that approach.

lucab · 2017-06-26T13:03:42Z

@fabiokung I think I can agree with that.

@s-urbaniak @squeed any second opinion on this discussion?

s-urbaniak

I do agree with this change. Given we already have two separate --wait-... instructions, this implicit wait does not fit the semantics.

It is not guaranteed that a rkt run invocation "happened-before" a subsequent rkt status invocation, which is what the current code tries to work around. In reality, users should retry rkt status themselves after invoking rkt run.

The only nit I'd have is that we should make the above fact clearer in the documentation of rkt status.

fabiokung · 2017-06-28T04:51:06Z

@s-urbaniak @lucab I added some bits to the doc (on rkt status) about pid being sometimes not available. PTAL
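For consumers of the output, a pid that is sometimes absent means the `pid` key has to be treated as optional. The following is a hedged sketch of one way to do that, assuming the key=value line format shown in the reproduction steps above; `parseStatus` and `pidFrom` are hypothetical helpers, not part of rkt.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseStatus splits key=value lines into a map. Lines without "=" are ignored.
func parseStatus(out string) map[string]string {
	fields := make(map[string]string)
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		if k, v, ok := strings.Cut(sc.Text(), "="); ok {
			fields[k] = v
		}
	}
	return fields
}

// pidFrom reports the pid and true only if a numeric pid field is present.
func pidFrom(fields map[string]string) (int, bool) {
	v, ok := fields["pid"]
	if !ok {
		return 0, false
	}
	pid, err := strconv.Atoi(v)
	return pid, err == nil
}

func main() {
	// Sample output with no pid line, modeled on the reproduction above.
	out := "state=exited\ncreated=2017-06-16 16:34:27.38 -0700 PDT\n"
	fields := parseStatus(out)
	if pid, ok := pidFrom(fields); ok {
		fmt.Println("pid:", pid)
	} else {
		fmt.Println("no pid reported; state is", fields["state"])
	}
}
```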

lucab

LGTM

lucab · 2017-07-18T07:43:45Z

@fabiokung it looks like the CI was flaking a bit at the time this PR was last pushed. Do you mind rebasing once more on top of current master? It should be ready to go then.

fabiokung · 2017-07-22T02:39:50Z

Will do.

A pid file never gets written if systemd never gets to run in stage1. This can happen if the image had a bad command, i.e.: not in $PATH.

In that case, rkt status will constantly error with:

status: unable to print status: unable to get PID for pod ...

Signed-off-by: Fabio Kung <fabio.kung@gmail.com>

fabiokung · 2017-07-24T23:50:51Z

@lucab all green!

lucab

LGTM

lucab added component/stage0 kind/bugfix needs/review labels Jun 19, 2017

lucab added this to the 1.28.0 milestone Jun 19, 2017

fabiokung force-pushed the status-pid-bug branch from 613471f to f7b4ad8 Compare June 19, 2017 14:47

lucab suggested changes Jun 20, 2017

lucab added needs/second-opinion and removed needs/review labels Jun 20, 2017

lucab changed the title ~~stage0: status fails when systemd never runs in stage1~~ rkt/status: fix failure when systemd never runs in stage1 Jun 26, 2017

s-urbaniak reviewed Jun 26, 2017

fabiokung force-pushed the status-pid-bug branch from f7b4ad8 to 874227b Compare June 28, 2017 04:50

fabiokung changed the title ~~rkt/status: fix failure when systemd never runs in stage1~~ stage0/status: fix failure when systemd never runs in stage1 Jun 28, 2017

lucab approved these changes Jul 10, 2017

lucab added needs/rebase and removed needs/second-opinion labels Jul 18, 2017

lucab mentioned this pull request Jul 18, 2017

release: tracker for 1.28.0 #3721

Closed

fabiokung force-pushed the status-pid-bug branch from 874227b to 35dae38 Compare July 24, 2017 20:38

lucab approved these changes Jul 25, 2017

lucab removed the needs/rebase label Jul 25, 2017

lucab merged commit 5d5f72f into rkt:master Jul 25, 2017
