
FAQs

Question 1: kube-proxy reports iptables problems

E0627 09:28:54.054930 1 proxier.go:1598] Failed to execute iptables-restore: exit status 1 (iptables-restore: line 86 failed ) I0627 09:28:54.054962 1 proxier.go:879] Sync failed; retrying in 30s

Solution: Clear iptables directly.

iptables -F && iptables -t nat -F && iptables -t mangle -F && iptables -X
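
Flushing the rules removes everything kube-proxy has programmed; kube-proxy rebuilds them on its next sync, and you can force an immediate rebuild by restarting the kube-proxy pods (a small sketch; the k8s-app=kube-proxy label is the one used on kubeadm-installed clusters):

kubectl -n kube-system delete pod -l k8s-app=kube-proxy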

Question 2: calico and coredns are stuck in the initializing state

The following message appears when running kubectl describe pod <podname>; it is roughly related to network and sandbox issues.

Failed to create pod sandbox: rpc error: code = Unknown desc = [failed to set up sandbox container "7f5b66ebecdfc2c206027a2afcb9d1a58ec5db1a6a10a91d4d60c0079236e401" network for pod "calico-kube-controllers-577f77cb5c-99t8z": networkPlugin cni failed to set up pod "calico-kube-controllers-577f77cb5c-99t8z_kube-system" network: error getting ClusterInformation: Get "https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: i/o timeout, failed to clean up sandbox container "7f5b66ebecdfc2c206027a2afcb9d1a58ec5db1a6a10a91d4d60c0079236e401" network for pod "calico-kube-controllers-577f77cb5c-99t8z": networkPlugin cni failed to teardown pod "calico-kube-controllers-577f77cb5c-99t8z_kube-system" network: error getting ClusterInformation: Get "https://[10.96.0.1]:443/apis/crd.projectcalico.org/v1/clusterinformations/default": dial tcp 10.96.0.1:443: i/o timeout]

Reason: This problem occurs when the k8s cluster has been initialized more than once and the old k8s network configuration was not deleted beforehand.

Solution:

# delete the old k8s CNI network configuration
rm -rf /etc/cni/net.d/

# then re-initialize k8s following the installation instructions (a sketch follows)
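
A minimal sketch of the re-initialization, assuming the cluster was created with kubeadm (re-use the flags from your original installation):

# tear down the old control plane and stale CNI configuration
kubeadm reset -f
rm -rf /etc/cni/net.d/
# initialize again with your original flags, then reinstall calico
kubeadm init <your original init flags>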

Question 3: metrics-server keeps failing

Reason: The taint on the master node has not been removed, so the metrics-server pod cannot be scheduled.

Solution:

# remove the taint from the master node so the metrics-server pod can be scheduled
# (a 'not found' message for either key can be ignored)
kubectl taint nodes --all node-role.kubernetes.io/master- node-role.kubernetes.io/control-plane-
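
To confirm the taint is gone and that metrics-server recovers, a quick check:

kubectl describe node <master-node> | grep -i taint
kubectl -n kube-system get pods | grep metrics-server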

Question 4: 10002 already in use

The error message 'xxx already in use' appears in the output of journalctl -u cloudcore.service -xe.

Reason: The previous processes were not cleaned up.

Solution: Find the process occupying the port and directly kill it.

lsof -i:xxxx
kill xxxxx

Question 5: edgecore file exists

When keadm installs edgecore, it tries to create a hard link for edgecore.service, but the target path already exists, so the link cannot be created.

execute keadm command failed:  failed to exec 'bash -c sudo ln /etc/kubeedge/edgecore.service /etc/systemd/system/edgecore.service && sudo systemctl daemon-reload && sudo systemctl enable edgecore && sudo systemctl start edgecore', err: ln: failed to create hard link '/etc/systemd/system/edgecore.service': File exists, err: exit status 1

Reason: edgecore.service already exists in the /etc/systemd/system/ directory if edgecore is installed more than once.

Solution: Just delete it.

sudo rm /etc/systemd/system/edgecore.service

Question 6: TLSStreamPrivateKeyFile not exist

 TLSStreamPrivateKeyFile: Invalid value: "/etc/kubeedge/certs/stream.key": TLSStreamPrivateKeyFile not exist
1214 23:02:23 cloud.kubeedge cloudcore[196229]: TLSStreamCertFile: Invalid value: "/etc/kubeedge/certs/stream.crt": TLSStreamCertFile not exist
1214 23:02:23 cloud.kubeedge cloudcore[196229]: TLSStreamCAFile: Invalid value: "/etc/kubeedge/ca/streamCA.crt": TLSStreamCAFile not exist
1214 23:02:23 cloud.kubeedge cloudcore[196229]: ]

Solution: Check whether the directory /etc/kubeedge contains the file certgen.sh and run bash certgen.sh stream (a sketch follows).
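
A sketch of the certificate generation, run on the cloud node, assuming cloudcore is installed under /etc/kubeedge (the IP is a placeholder; certgen.sh reads it from CLOUDCOREIPS):

export CLOUDCOREIPS="<cloudcore public IP>"
cd /etc/kubeedge
bash certgen.sh stream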

Question 7: EdgeMesh edge-to-edge connections succeed but cloud-to-edge connections fail

Troubleshooting:

First, based on where the error occurs, check whether the edgemesh-agent container exists on the visited node and whether it is running normally.


If edgemesh-agent is missing on the master node, this is common because the master node usually has taints, which keep other pods off the node and cause the edgemesh-agent deployment to fail there. The issue can be resolved by removing the node taints.

If the edgemesh-agent on both the visiting node and the visited node has started normally but this error is still reported, it may be due to unsuccessful peer discovery between the two nodes. Troubleshoot as follows:

  1. The edgemesh-agent on each node has a peer ID (generated from the hash of the node name):
edge1:
I'm {12D3KooWFz1dKY8L3JC8wAY6sJ5MswvPEGKysPCfcaGxFmeH7wkz: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.1.2/tcp/20006]}

edge2:
I'm {12D3KooWPpY4GqqNF3sLC397fMz5ZZfxmtMTNa1gLYFopWbHxZDt: [/ip4/127.0.0.1/tcp/20006 /ip4/192.168.1.4/tcp/20006]}

  2. If the visiting node and the visited node are in the same LAN (intranet IPs), refer to Question 12 in the EdgeMesh Q&A. The log line for discovering a node in the same LAN is [MDNS] Discovery found peer: <visited node peer ID: [visited IP list (including relay node IP)]>.

  3. If the visiting node and the visited node are in different LANs, check the relayNodes setting (details in the KubeEdge EdgeMesh Architecture). The log line for discovering a node across LANs is [DHT] Discovery found peer: <visited node peer ID: [visited IP list (including relay node IP)]>. A command to check these logs is shown after this list.
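
To check the discovery logs mentioned above, a sketch (the edgemesh-agent DaemonSet is deployed in the kubeedge namespace by default; adjust if yours differs):

kubectl -n kubeedge get pods -o wide | grep edgemesh-agent
kubectl -n kubeedge logs <edgemesh-agent-pod-on-visiting-node> | grep "Discovery found peer"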

Solution:

Before deploying EdgeMesh with kubectl apply -f build/agent/resources/, modify the file 04-configmap.yaml and add the relayNodes setting (an illustrative sketch follows).
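
A sketch of that edit; the field layout follows recent EdgeMesh releases, and the node name and address are placeholders, so check your EdgeMesh version's 04-configmap.yaml for the exact structure:

# edit the ConfigMap before applying the manifests
vim build/agent/resources/04-configmap.yaml
# add the relay node's name and a reachable address, roughly:
#   relayNodes:
#   - nodeName: <cloud-node-name>
#     advertiseAddress:
#     - <cloud-node-public-IP>
kubectl apply -f build/agent/resources/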


Question 8: GPU is not found on master

Use nvidia-smi to check the GPU status on the cloud server. To use the GPU in Kubernetes pods on the cloud server, extra components are needed (the NVIDIA Container Toolkit, plus the k8s device plugin described in Question 9).

Please refer to the following link: Installing the NVIDIA Container Toolkit — NVIDIA Container Toolkit 1.14.3 documentation

Configure as described in the link, and add default-runtime in /etc/docker/daemon.json (remember to restart docker after the modification):

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  }
}
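
After editing daemon.json, restart Docker and confirm that the default runtime is nvidia (a quick check; the output wording varies with the Docker version):

sudo systemctl restart docker
docker info | grep -i "default runtime"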

Question 9: GPU resources cannot be found on Jetson nodes

k8s-device-plugin has been installed successfully, but the nvidia-device-plugin pod on Jetson nodes (Tegra architecture) reports that no GPU is found:

2024/01/04 07:43:58 Retreiving plugins.
2024/01/04 07:43:58 Detected non-NVML platform: could not load NVML: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
2024/01/04 07:43:58 Detected non-Tegra platform: /sys/devices/soc0/family file not found
2024/01/04 07:43:58 Incompatible platform detected
2024/01/04 07:43:58 If this is a GPU node, did you configure the NVIDIA Container Toolkit?
2024/01/04 07:43:58 You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
2024/01/04 07:43:58 You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
2024/01/04 07:43:58 If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
2024/01/04 07:43:58 No devices found. Waiting indefinitely.

Check which NVIDIA packages are installed on the Jetson node, and in particular the version of the NVIDIA Container Toolkit:

dpkg -l '*nvidia*'


This is a known upstream issue; see Plugin does not detect Tegra device (Jetson Nano) · Issue #377 · NVIDIA/k8s-device-plugin (github.com), where a maintainer notes:

Note that looking at the initial logs that you provided you may have been using v1.7.0 of the NVIDIA Container Toolkit. This is quite an old version and we greatly improved our support for Tegra-based systems with the v1.10.0 release. It should also be noted that in order to use the GPU Device Plugin on Tegra-based systems (specifically targetting the integrated GPUs) at least v1.11.0 of the NVIDIA Container Toolkit is required.

There are no Tegra-specific changes in the v1.12.0 release, so using the v1.11.0 release should be sufficient in this case.

Solution: Upgrade the NVIDIA Container Toolkit on jetson nodes.

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update

sudo apt-get install -y nvidia-container-toolkit
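
After upgrading the toolkit, re-configure the container runtime and restart things so the device plugin re-runs its detection (a sketch assuming Docker as the runtime; the device-plugin pod label depends on how the plugin was deployed):

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# restart the device plugin pod on the Jetson node so it re-detects the GPU
# (the label below matches the static deployment manifest; adjust for Helm installs)
kubectl -n kube-system delete pod -l name=nvidia-device-plugin-ds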

Question 10: 127.0.0.53:53: no such host / connection refused

During the Sedna installation stage, an error such as 127.0.0.53:53: no such host or connection refused appears in the logs.

Reason: See question 5 in EdgeMesh Q&A.

Solution:

First check whether hostNetwork is set in the install.sh script. Delete the hostNetwork options if present (install.sh in the dayu-customized Sedna is already correct).

If hostNetwork is not present but the error is still reported, use the following method as a temporary workaround (not recommended); a command sketch follows the list:

  1. Add nameserver 169.254.96.16 as the last line of the file /etc/resolv.conf on the edge nodes whose Sedna pods report the error.
  2. Check that clusterDNS is 169.254.96.16 in /etc/kubeedge/config/edgecore.yaml on those edge nodes.
  3. If the error still exists, reinstall Sedna.
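
A sketch of steps 1 and 2, run on the affected edge node:

# step 1: append the EdgeMesh DNS address
echo "nameserver 169.254.96.16" | sudo tee -a /etc/resolv.conf
# step 2: confirm clusterDNS in edgecore.yaml
grep -n -A 2 "clusterDNS" /etc/kubeedge/config/edgecore.yaml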

Question 11: 169.254.96.16: no such host

Check the configuration of EdgeMesh:

  1. Check the iptables chains as described in Hybrid proxy | EdgeMesh (a quick check follows this list).
  2. Check clusterDNS as required in the EdgeMesh Installation guide.
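
One quick way to see whether EdgeMesh chains are present at all (a sketch; the chain names depend on the EdgeMesh version, so adjust the pattern if nothing matches):

iptables-save -t nat | grep -i edgemesh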

Question 12: kubectl logs <pod-name> timeout

Reason 1: Refer to Kubernetes nodes cannot capture monitoring metrics (zhihu.com).

Reason 2: The error happens on port 10350, which KubeEdge uses to forward logs/exec requests to the edge; in that case it is a configuration issue with cloudcore.

Solution: Follow Enable Kubectl logs/exec to debug pods on the edge | KubeEdge; a condensed sketch follows.
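
A condensed sketch of that setup on the cloud node (the IP is a placeholder; see Question 6 for the stream certificates and Question 19 for the iptables rule):

# generate the stream certificates if they do not exist yet
export CLOUDCOREIPS="<cloudcore public IP>"
cd /etc/kubeedge && bash certgen.sh stream
# forward the log/exec port (10350) to cloudStream (10003)
iptables -t nat -A OUTPUT -p tcp --dport 10350 -j DNAT --to $CLOUDCOREIPS:10003
# enable cloudStream in cloudcore.yaml and edgeStream in edgecore.yaml, then restart cloudcore and edgecore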

Question 13: Stuck in kubectl logs <pod-name>

Reason: It may be caused by forcibly terminating a previous kubectl logs command.

Solution: Restart edgecore/cloudcore:

# on cloud
systemctl restart cloudcore.service
# on edge
systemctl restart edgecore.service

Question 14: CloudCore reports certificate error


Reason: Token of master has changed.

Solution: Get a new token from the cloud and rejoin the edge node; a sketch follows.
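
A sketch of the rejoin (addresses and versions are placeholders):

# on the cloud node: print the current token
keadm gettoken
# on the edge node: reset and rejoin with the new token
keadm reset
keadm join --cloudcore-ipport=<cloudcore IP>:10000 --token=<new token>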

Question 15: Deleting a namespace gets stuck in the Terminating state

Method 1 (Probably won't work)

kubectl delete ns sedna --force --grace-period=0

Method 2

Open a proxy terminal:

kubectl proxy

Starting to serve on 127.0.0.1:8001

Open another terminal to operate in:

# save configuration file of namespace (use sedna as example here)
kubectl get ns sedna -o json > sedna.json
# Delete the contents of spec and status sections and the "," sign after the metadata field.

The remaining content in sedna.json is shown as follows:

{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "creationTimestamp": "2023-12-14T09:12:13Z",
    "deletionTimestamp": "2023-12-14T09:15:25Z",
    "managedFields": [
      {
        "apiVersion": "v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:status": {
            "f:phase": {}
          }
        },
        "manager": "kubectl-create",
        "operation": "Update",
        "time": "2023-12-14T09:12:13Z"
      },
      {
        "apiVersion": "v1",
        "fieldsType": "FieldsV1",
        "fieldsV1": {
          "f:status": {
            "f:conditions": {
              ".": {},
              "k:{\"type\":\"NamespaceContentRemaining\"}": {
                ".": {},
                "f:lastTransitionTime": {},
                "f:message": {},
                "f:reason": {},
                "f:status": {},
                "f:type": {}
              },
              "k:{\"type\":\"NamespaceDeletionContentFailure\"}": {
                ".": {},
                "f:lastTransitionTime": {},
                "f:message": {},
                "f:reason": {},
                "f:status": {},
                "f:type": {}
              },
              "k:{\"type\":\"NamespaceDeletionDiscoveryFailure\"}": {
                ".": {},
                "f:lastTransitionTime": {},
                "f:message": {},
                "f:reason": {},
                "f:status": {},
                "f:type": {}
              },
              "k:{\"type\":\"NamespaceDeletionGroupVersionParsingFailure\"}": {
                ".": {},
                "f:lastTransitionTime": {},
                "f:message": {},
                "f:reason": {},
                "f:status": {},
                "f:type": {}
              },
              "k:{\"type\":\"NamespaceFinalizersRemaining\"}": {
                ".": {},
                "f:lastTransitionTime": {},
                "f:message": {},
                "f:reason": {},
                "f:status": {},
                "f:type": {}
              }
            }
          }
        },
        "manager": "kube-controller-manager",
        "operation": "Update",
        "time": "2023-12-14T09:15:30Z"
      }
    ],
    "name": "sedna",
    "resourceVersion": "3351515",
    "uid": "99cb8afb-a4c1-45e6-960d-ff1b4894773d"
  }
}

Delete the namespace by calling the finalize interface:

curl -k -H "Content-Type: application/json" -X PUT --data-binary @sedna.json http://127.0.0.1:8001/api/v1/namespaces/sedna/finalize
{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "sedna",
    "uid": "99cb8afb-a4c1-45e6-960d-ff1b4894773d",
    "resourceVersion": "3351515",
    "creationTimestamp": "2023-12-14T09:12:13Z",
    "deletionTimestamp": "2023-12-14T09:15:25Z",
    "managedFields": [
      {
        "manager": "curl",
        "operation": "Update",
        "apiVersion": "v1",
        "time": "2023-12-14T09:42:38Z",
        "fieldsType": "FieldsV1",
        "fieldsV1": {"f:status":{"f:phase":{}}}
      }
    ]
  },
  "spec": {},
  "status": {
    "phase": "Terminating",
    "conditions": [
      {
        "type": "NamespaceDeletionDiscoveryFailure",
        "status": "True",
        "lastTransitionTime": "2023-12-14T09:15:30Z",
        "reason": "DiscoveryFailed",
        "message": "Discovery failed for some groups, 1 failing: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request"
      },
      {
        "type": "NamespaceDeletionGroupVersionParsingFailure",
        "status": "False",
        "lastTransitionTime": "2023-12-14T09:15:30Z",
        "reason": "ParsedGroupVersions",
        "message": "All legacy kube types successfully parsed"
      },
      {
        "type": "NamespaceDeletionContentFailure",
        "status": "False",
        "lastTransitionTime": "2023-12-14T09:15:30Z",
        "reason": "ContentDeleted",
        "message": "All content successfully deleted, may be waiting on finalization"
      },
      {
        "type": "NamespaceContentRemaining",
        "status": "False",
        "lastTransitionTime": "2023-12-14T09:15:30Z",
        "reason": "ContentRemoved",
        "message": "All content successfully removed"
      },
      {
        "type": "NamespaceFinalizersRemaining",
        "status": "False",
        "lastTransitionTime": "2023-12-14T09:15:30Z",
        "reason": "ContentHasNoFinalizers",
        "message": "All content-preserving finalizers finished"
      }
    ]
  }
}

Question 16: Deployment failed after forcibly deleting pod

A Deployment created pods, but after deleting the Deployment with kubectl delete deploy <deploy-name>, the pods are stuck in the 'Terminating' status.

Then kubectl delete pod edgeworker-deployment-7g5hs-58dffc5cd7-b77wz --force --grace-period=0 is used to delete the pods, and the re-deployed pods get stuck in the 'Pending' state after being assigned to a node.

Solution:

Deleting pods with the --force option only removes the API objects; it does not actually terminate the containers on the node.

You need to go to the node where the pod was running and manually delete the corresponding docker containers (including the pause container; see details at Pause in K8S Pod):

# Check where the pod was placed (on the cloud)
kubectl get pods -A -o wide
# Delete the docker containers on that node
docker ps -a
docker stop <container>
docker rm <container>
# Restart edgecore on that node
systemctl restart edgecore.service

Question 17: After deleting deployments and pods, the pods still restart automatically

Check the edgecore logs:

journalctl -u edgecore.service -f


Solution: Restart edgecore:

systemctl restart edgecore.service

Question 18: Large-scale pod eviction (disk pressure)

Reason:

The free disk space (usually on the root partition) has fallen below the threshold (15%), so kubelet kills pods to reclaim resources (details at K8S Insufficient Resource).
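
To confirm the disk pressure, a quick check:

df -h /
kubectl describe node <node-name> | grep -i diskpressure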

Solution:

The fundamental solution is to increase the available disk space.

If a temporary solution is needed, the threshold can be changed by modifying the k8s configuration.

Specifically, modify the configuration file and add --eviction-hard=nodefs.available<5%:

systemctl status kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent
Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/kubelet.service.d
└─10-kubeadm.conf
Active: active (running) since Fri 2023-12-15 09:17:50 CST; 5min ago
Docs: https://kubernetes.io/docs/home/
Main PID: 1070209 (kubelet)
Tasks: 59 (limit: 309024)
Memory: 67.2M
CGroup: /system.slice/kubelet.service

From the output you can see that the drop-in directory is /etc/systemd/system/kubelet.service.d and the configuration file is 10-kubeadm.conf.

vim /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml --eviction-hard=nodefs.available<5%"

# add '--eviction-hard=nodefs.available<5%' at the end

Restart kubelet:

systemctl daemon-reload
systemctl restart kubelet

You will find that pods can now be scheduled and deployed normally (this is only a temporary workaround; the disk space still needs to be cleaned up).

Question 19: Executing 'iptables' reports that the system does not support '--dport'

When executing command iptables -t nat -A OUTPUT -p tcp --dport 10351 -j DNAT --to $CLOUDCOREIPS:10003, an error of not supporting --dport is reported.

Reason: This version of iptables (nf_tables backend) does not support '--dport' here:

# Check iptables version
iptables -V
# Version of 'iptables v1.8.7 (nf_tables)' does not support '--dport'

Solution:

# Switch version. In the three options, choose 'legacy'
sudo update-alternatives --config iptables
# verify (success with no error)
iptables -t nat -A OUTPUT -p tcp --dport 10351 -j DNAT --to $CLOUDCOREIPS:10003

Question 20: Report 'token format error' after 'keadm join'

After executing the command keadm join --cloudcore-ipport=114.212.81.11:10000 --kubeedge-version=1.9.2 --token=……, journalctl -u edgecore.service -f reports a token format error.

Solution:

The token was wrong or had expired when joining. Get the latest token from the cloud and redo the join starting from keadm reset (the commands are the same as in Question 14).

Question 21: Mapping errors are reported after restarting edgecore.service

After executing the command systemctl restart edgecore.service to restart edgecore, journalctl -u edgecore.service -f reports a mapping error about failing to translate YAML to JSON.

Solution:

Check the format of the file /etc/kubeedge/config/edgecore.yaml. Note that tabs are not allowed in YAML files (use spaces instead).

Question 22: 'connection refused' error after restarting edgecore

During the startup of EdgeMesh, the file /etc/kubeedge/config/edgecore.yaml is modified and the service is restarted with systemctl restart edgecore.service. journalctl -u edgecore.service -f then reports a 'connection refused' error and the cloud refuses the communication.

Solution:

Check the status of cloudcore on cloud:

# check cloudcore status
systemctl status cloudcore
# restart cloudcore
systemctl restart cloudcore.service
# check error message
journalctl -u cloudcore.service -f

The error described in Question 4 may be found there.

Question 23: Error of 'Shutting down' is reported when deploying metrics-server

During the deployment of metrics-server, the port in the components.yaml file is changed to '4443', but metrics-server still fails with the error Shutting down RequestHeaderAuthRequestController.

Solution:

When deploying KubeEdge, the metrics-server port is automatically overwritten to '10250'. Manually change the port in the components.yaml file to '10250'.

Question 24: 169.254.96.16:53: i/o timeout

When a new node joins the cluster, Sedna is deployed on it automatically. An error is reported in the log of sedna-lc:

client tries to connect global manager(address: gm.sedna:9000) failed, error: dial tcp: lookup gm.sedna on 169.254.96.16:53: read udp 172.17.0.3:49991->169.254.96.16:53: i/o timeout

Troubleshooting:

First check the status of edgemesh-agent on the new node; the error above is reported by edgemesh-agent.


Use kubectl describe pod <pod-name> and find that the most recent event is the pod being assigned to the new node.


Use journalctl -u edgecore.service -xe on the new node to check the errors.


Reason: The new node cannot reach Docker Hub directly and no Docker registry mirror has been configured, so the image cannot be pulled.

Solution: Configure a Docker registry mirror, then restart docker and edgecore. Refer to Docker Registry Configuration for details; a sketch follows.
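
A minimal sketch of the registry configuration (the mirror address is a placeholder; merge it with any existing daemon.json settings instead of overwriting them):

# add a registry mirror to /etc/docker/daemon.json, e.g. "registry-mirrors": ["https://<your-mirror-address>"]
vim /etc/docker/daemon.json
# restart docker and edgecore so the new configuration takes effect
sudo systemctl restart docker
sudo systemctl restart edgecore.service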

Question 25: keadm join error on edge nodes

Executing keadm join on an edge node reports errors.

Solution: Check the edgecore.yaml file:

vim /etc/kubeedge/config/edgecore.yaml


Add the address of the master node (cloud) in edgeHub/httpServer, for example https://114.212.81.11:10002, and delete the redundant ':' in websocket/server.

Re-run the keadm join command after modification.

Question 26: failed to build map of initial containers from runtime

Edgecore reports an error in journalctl -u edgecore.service -f:

initialize module error: failed to build map of initial containers from runtime: no PodsandBox found with Id 'c45ed1592e75e885e119664d777107645a7e7904703c690664691c61a9f79ed3'

Solution:

Find the docker ID and delete it:

docker ps -a --filter "label=io.kubernetes.sandbox.id=c45ed1592e75e885e119664d777107645a7e7904703c690664691c61a9f79ed3"
# find the related docker ID
docker rm <docker ID>
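
After removing the stale container, restart edgecore so initialization can complete:

systemctl restart edgecore.service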