Currently, runc uses a bind mount to pass a device into the container. However, this approach has some limitations:
- The device must already exist on the host under the same name.
- The device keeps the characteristics of the host device. For example, the owner uid and gid come from the initial user namespace, which is why the device is shown as owned by nobody/nogroup inside the container.
- The device must grant permissions to other users, because the host user and the container user are not the same. Otherwise, access fails with a Permission denied error. For this reason, a device with 0660 permissions cannot be passed.
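The last two limitations can be illustrated with a small sketch. It assumes a container whose user namespace maps container ids 0..65535 to host ids 100000..165535 (an example mapping, not taken from the PoC) and the default overflow id 65534. The helper names `parse_id_map`, `host_to_container` and `access_bits` are hypothetical, and the permission check models only the classic owner/group/other bits, ignoring capabilities and ACLs.

```python
OVERFLOW_ID = 65534  # default overflowuid/overflowgid, displayed as nobody/nogroup


def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, length) tuples."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            inside, outside, length = (int(f) for f in fields)
            entries.append((inside, outside, length))
    return entries


def host_to_container(host_id, id_map):
    """Return the id seen inside the namespace, or the overflow id if unmapped."""
    for inside, outside, length in id_map:
        if outside <= host_id < outside + length:
            return inside + (host_id - outside)
    return OVERFLOW_ID


def access_bits(mode, owner_uid, owner_gid, caller_uid, caller_gids):
    """Return the rwx bits (0..7) that apply to the caller, using host ids."""
    if caller_uid == owner_uid:
        return (mode >> 6) & 7
    if owner_gid in caller_gids:
        return (mode >> 3) & 7
    return mode & 7


# Assumed example mapping: container root is host uid/gid 100000.
id_map = parse_id_map("0 100000 65536\n")

# A host device owned by root:root is not mapped, so it appears as nobody/nogroup.
print(host_to_container(0, id_map))

# Container root (host uid 100000) only gets the "other" bits of a 0660 device.
print(access_bits(0o660, 0, 0, 100000, [100000]))
```

With this mapping, the host owner 0 has no counterpart inside the container, and a 0660 device leaves no permission bits for the container user, which matches the Permission denied error described above.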
To solve these problems, a runc PoC was created (link) that calls mknod instead of using a bind mount.
The main change for criu is that such devices are no longer listed in /proc/$pid/mountinfo (because they are created with mknod rather than bind-mounted). Also, the tmpfs mount for /dev has to be created in the initial user namespace.
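To see the difference in mountinfo, here is a minimal sketch that lists the mount entries at or below /dev. The function name `mounts_under` and the sample lines are made up for illustration, and the parser does not decode octal escapes such as \040 in paths. With bind-mounted devices each device has its own entry; with mknod only the tmpfs for /dev itself shows up.

```python
def mounts_under(mountinfo_text, prefix="/dev"):
    """Return (mount_point, root, fstype) for every mount at or below prefix."""
    result = []
    for line in mountinfo_text.splitlines():
        # Fields before " - " are per-mount, fields after it describe the filesystem.
        left, sep, right = line.partition(" - ")
        left_fields = left.split()
        right_fields = right.split()
        if not sep or len(left_fields) < 6 or not right_fields:
            continue
        root, mount_point = left_fields[3], left_fields[4]
        fstype = right_fields[0]
        if mount_point == prefix or mount_point.startswith(prefix + "/"):
            result.append((mount_point, root, fstype))
    return result


# Illustrative sample: /dev is a tmpfs and /dev/null is bind-mounted from the host.
sample = (
    "100 90 0:50 / /dev rw,nosuid - tmpfs tmpfs rw,mode=755\n"
    "101 100 0:6 /null /dev/null rw,nosuid - devtmpfs udev rw\n"
    "102 90 0:51 / /proc rw - proc proc rw\n"
)

for entry in mounts_under(sample):
    print(entry)
```

On a live system the input would be the content of /proc/$pid/mountinfo for a task inside the container.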
However, criu assumes that all tasks live in the same set of namespaces (https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/cr-restore.c#L2024), so /dev would be restored in the container user namespace.
The tmpfs image records the devices that were created, and mknod is called for them on restore. The restore therefore fails, because device nodes cannot be created with mknod in a non-initial user namespace.
To solve this problem, a simple criu PoC was created that calls usernsd to restore such a mount in the initial user namespace. The uid/gid also have to be preserved, which is why the /dev dump is performed in the initial user namespace as well.
Current limitations of the PoC:
- The device major/minor numbers must be the same at checkpoint and at restore.
- The user namespace mapping (host id, container id, length) must be the same at checkpoint and at restore.
- Only simple mount scenarios have been checked (e.g. the case where the user mounts the container's dev at some other path has not been checked).
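The first two limitations amount to a precondition check before restore. The sketch below is only an illustration of that check and not part of either PoC. The function `restore_mismatches` and the dictionary layout (device path to (major, minor), plus uid_map/gid_map as lists of (container id, host id, length)) are assumptions made for this example.

```python
def restore_mismatches(dumped, current):
    """Return human-readable differences between dump-time and restore-time state."""
    problems = []
    # Device numbers recorded at checkpoint must match the restore host.
    for name, devnum in dumped["devices"].items():
        now = current["devices"].get(name)
        if now != devnum:
            problems.append(f"{name}: major/minor {devnum} at dump, {now} at restore")
    # The user namespace mappings must be identical as well.
    for key in ("uid_map", "gid_map"):
        if dumped[key] != current[key]:
            problems.append(f"{key} differs: {dumped[key]} vs {current[key]}")
    return problems


# Hypothetical state: the device name and numbers are placeholders.
at_dump = {
    "devices": {"/dev/example": (10, 200)},
    "uid_map": [(0, 100000, 65536)],
    "gid_map": [(0, 100000, 65536)],
}
at_restore = dict(at_dump, uid_map=[(0, 200000, 65536)])

for problem in restore_mismatches(at_dump, at_restore):
    print(problem)
```

An empty result means both conditions hold for the listed devices. It says nothing about the third limitation, which concerns mount layouts that have not been tested.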
I will be glad to hear some feedback :)