Skip to content
This repository has been archived by the owner on Feb 24, 2020. It is now read-only.

stage0/status: fix failure when systemd never runs in stage1 #3713

Merged
merged 1 commit into from
Jul 25, 2017

Conversation

fabiokung
Copy link
Contributor

A pid file never gets written if systemd never gets to run in stage1. This can happen if the image had a bad command, i.e.: not in $PATH.

Steps to reproduce:

$ sudo rkt --insecure-options=image run --uuid-file-save=/tmp/id docker://alpine --exec=bad
stage1: cannot initialize immutable environment: unable to find "bad" in "/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

$ sudo rkt status $(cat /tmp/id)
state=exited
created=2017-06-16 16:34:27.38 -0700 PDT
status: unable to print status: unable to get PID for pod "e6ccefea-1cdf-459a-b98d-7da95373a7d2": <nil>

Signed-off-by: Fabio Kung fabio.kung@gmail.com

@ghost
Copy link

ghost commented Jun 16, 2017

Can one of the admins verify this patch?

@lucab
Copy link
Member

lucab commented Jun 19, 2017

ok to test

@fabiokung
Copy link
Contributor Author

The CI failures seems unrelated, some flakiness on a TTY test.

@lucab
Copy link
Member

lucab commented Jun 19, 2017

@fabiokung I'll review this after today release, but you could please format your commit titles so that they include an area/component prefix?

@fabiokung
Copy link
Contributor Author

@lucab done.


stdout.Printf("pid=%d\nexited=%t", pid, (state == pkgPod.Exited || state == pkgPod.ExitedGarbage))
if pid, err := p.Pid(); err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change may re-introduce the race documented in the comment below, which was previously addressed by the goroutine+timeout. I would suggest to special-case the running case, block on it until pid exists or we hit the timeout, and then proceed with your logic.

@lucab
Copy link
Member

lucab commented Jun 20, 2017

I understand what you are experiencing but I think the current PR is a bit too aggressive and may bring back the previous racing issue.

@lucab
Copy link
Member

lucab commented Jun 20, 2017

On the other hand, this completely skips printing pid in the error and racing case, so it may actually be ok and push the retrying logic to the consumer side. I'm not sure what is the best behavior, but this approach may be more coherent.

@fabiokung
Copy link
Contributor Author

fabiokung commented Jun 20, 2017

On the other hand, this completely skips printing pid in the error and racing case, so it may actually be ok and push the retrying logic to the consumer side. I'm not sure what is the best behavior, but this approach may be more coherent.

Exactly my thoughts. There is already rkt status --wait that does the retrying/polling for users, so I'd be inclined to point people that need to guarantee a pid exists (and the container is up) to that.

The sleep code is also racy, nothing guarantees that the pid file will be written in 1s. I really dislike that approach.

@lucab lucab changed the title stage0: status fails when systemd never runs in stage1 rkt/status: fix failure when systemd never runs in stage1 Jun 26, 2017
@lucab
Copy link
Member

lucab commented Jun 26, 2017

@fabiokung I think I can agree with that.

@s-urbaniak @squeed any second opinion on this discussion?

Copy link
Contributor

@s-urbaniak s-urbaniak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree with this change. Given we already have two separate --wait-... instructions this implicit wait does not fit into the semantics.

It is not guaranteed that a rkt run invocation "happened-before" a subsequent rkt status invocation which the current code tries to overcome. The reality is that the user should retry rkt status after invoking rkt run himself.

The only nit I'd have is that we should make the above fact more clear in the documentation of rkt status.

@fabiokung
Copy link
Contributor Author

@s-urbaniak @lucab I added some bits to the doc (on rkt status) about pid being sometimes not available. PTAL

@fabiokung fabiokung changed the title rkt/status: fix failure when systemd never runs in stage1 stage0/status: fix failure when systemd never runs in stage1 Jun 28, 2017
Copy link
Member

@lucab lucab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lucab
Copy link
Member

lucab commented Jul 18, 2017

@fabiokung it looks like the CI was flaking a bit at the time this PR was last pushed. Do you mind rebasing once more on top of current master? It should be ready to go then.

@fabiokung
Copy link
Contributor Author

Will do.

A pid file never gets written if systemd never gets to run in stage1.
This can happen if the image had a bad command, i.e.: not in $PATH.

In that case, rkt status will constantly error with:

status: unable to print status: unable to get PID for pod ...

Signed-off-by: Fabio Kung <fabio.kung@gmail.com>
@fabiokung
Copy link
Contributor Author

@lucab all green!

Copy link
Member

@lucab lucab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lucab lucab merged commit 5d5f72f into rkt:master Jul 25, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants