
Container deployment issues seen on a Swarm mode cluster with rebooted workers running on VMs #1112

Open
sisudhir opened this issue Jan 18, 2018 · 7 comments


sisudhir commented Jan 18, 2018

Description

In a mixed Swarm mode cluster (baremetal and VMs) with Contiv 1.1.7, docker service scale issues are seen after rebooting the worker VMs.
Before the reboot, the cluster had containers running on all the nodes (baremetal and VMs) using the Contiv network and policy framework.

Expected Behavior

Rebooting the worker VMs should not affect performance on the Contiv network.

Observed Behavior

After rebooting the VMs that were running containers, the containers moved successfully to the surviving worker nodes. However, docker service scale takes an unusually long time, and connection errors like the following appear in the netmaster log:
Error dial tcp 10.65.121.129:9002: getsockopt: no route to host connecting to 10.65.121.129:%!s(uint16=9002). Retrying..

Steps to Reproduce (for bugs)

  1. Created a DEE 17.06 cluster in Swarm mode with a mixed topology of baremetal and VM nodes: master nodes on baremetal and worker nodes on VMs.
  2. Installed Contiv 1.1.7 and created the back-end Contiv network and policies. Applied policies via a group with a Contiv tag and created the corresponding Docker network.
  3. Created a Docker service using the Contiv network as the backend and checked network endpoint connectivity between the containers and the SVIs. All worked as expected.
  4. Rebooted 2 worker VMs; the containers running on them moved successfully to the surviving nodes.
  5. Tried scaling the same Docker service to add 5 more containers on the same Contiv network (a command sketch is included after this list).
  6. The service scale took unusually long, more than 30 minutes, to add the 5 additional containers.
  7. Saw connection errors to the rebooted worker VMs in the netmaster logs.
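
A minimal command sketch of the flow above. The network name, subnet, service name, image, and the Contiv v2plugin driver tag are placeholder assumptions, not values taken from this report.

# Create a Docker network backed by the Contiv v2plugin driver (driver tag assumed).
docker network create -d contiv/v2plugin:1.1.7 --subnet 10.1.1.0/24 contiv-net

# Create a Docker service on that network (step 3).
docker service create --name web --replicas 5 --network contiv-net nginx

# After rebooting two worker VMs (step 4), scale the service by 5 more replicas (step 5).
docker service scale web=10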

Your Environment

  • netctl version - 1.1.7/v2Plugin
  • Orchestrator version (e.g. kubernetes, mesos, swarm): Swarm 17.06/UCP-2.3*
  • Operating System and version: RHEL7.3
  • Contiv Data Path: physical vNIC exposed by ESXi on worker VMs in pass-through mode

Attached logs: contiv-logs.tar.gz

vhosakot commented Jan 18, 2018

Looking at the logs in contiv-logs.tar.gz, this looks like an RPC issue when netmaster connects to the Ofnet agent:

netmaster.log has:

time="Jan 18 08:36:21.576831134" level=warning msg="Error dial tcp 10.65.121.129:9002: getsockopt: no route to host connecting to 10.65.121.129:%!s(uint16=9002). Retrying.."
time="Jan 18 08:36:22.578994895" level=error msg="Failed to connect to Rpc server 10.65.121.129:9002"
time="Jan 18 08:36:22.579084442" level=error msg="Error calling RPC: OfnetAgent.AddMaster. Could not connect to server"
time="Jan 18 08:36:22.579133952" level=error msg="Error calling AddMaster rpc call on node {10.65.121.129 9002}. Err: Could not connect to server"
time="Jan 18 08:36:22.579152875" level=error msg="Error adding node {10.65.121.129 9002}. Err: Could not connect to server"

Can you send the docker daemon's logs when you see this issue?
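
On RHEL 7.3, assuming the daemon runs under systemd as docker.service, the logs could be collected with something like:

journalctl -u docker.service --since "2018-01-18" --no-pager > docker-daemon.log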

@blaksmit

On today's call, there was a request to check whether this is an issue on K8s as well or only in Docker Swarm mode.

@vhosakot

@blaksmit This issue is seen in Docker Swarm mode. I'm fairly sure it cannot be seen in k8s, as k8s does not even have the docker service scale command that exposes this issue.

@blaksmit

@vhosakot the comment was to see whether a similar VM scale issue is seen with K8s.

@vhosakot

@blaksmit I see, got it. We could test whether this issue is seen when kubectl scale is run (the k8s equivalent of docker service scale), as sketched below.
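
A minimal sketch of that check; the deployment name and replica count are placeholders, not from this issue:

# Scale an existing deployment backed by the Contiv network from 5 to 10 replicas.
kubectl scale deployment web --replicas=10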

@sisudhir sisudhir changed the title Docker service scale issues seen with rebooted workers running on VMs Container deployment issues seen on a Swarm mode cluster with rebooted workers running on VMs Jan 23, 2018
@sisudhir

Please note the changed title.
This is not just a scale issue: we also see it when deploying a new service with just a single container, for example as sketched below.
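
A minimal sketch of the single-container case; the service name, image, and network name are placeholders:

# Deploying a brand-new one-replica service on the Contiv-backed network also shows the issue.
docker service create --name single-test --replicas 1 --network contiv-net nginx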


g1rana commented Mar 29, 2018

@sisudhir, is this issue seen on every iteration of your failure test? Is it possible for you to share your setup with me? I can take a look at the setup while the errors are occurring.
