So, you might not be aware, but there is a race going on in the realms of the internet. A friend of mine showed me a fairly fresh git repo containing a reimplementation of the cp program that uses the new io_uring feature of the Linux kernel. The author claims that his creation is "Up to 70% faster than cp". That's cool, but the second I heard it, my brain went - "How on earth can you be faster than the kernel fs driver?"
So I cloned the repo and followed the build instructions. Soon after, I had my own copy of the binary and decided to first confirm that what the author claimed is true. I am running a Samsung NVMe SSD on a machine with 32 GB of RAM and a 5.11 kernel (btw I use Arch), somewhat similar to the one the author is running. But this shouldn't really matter, since I am only interested in checking whether speed_of_cp < speed_of_wcp holds. And tbh I couldn't reproduce the author's cp results at first.
So I generated a 10 GiB file with:
holz@XION > dd if=/dev/urandom of=bigfile bs=512 count=20971520
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 232.9 s, 46.1 MB/s
And started "benchmarking":
holz@XION > time cp bigfile bigfile2
real 0m5.212s
user 0m0.010s
sys 0m3.938s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m6.775s
user 0m0.014s
sys 0m3.973s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m9.333s
user 0m0.015s
sys 0m4.031s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m6.266s
user 0m0.004s
sys 0m4.219s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m8.924s
user 0m0.008s
sys 0m4.075s
holz@XION > rm bigfile2
My cp seems to be extremely unstable. I don't really know why; it's probably related to buffering inside the kernel, especially because cp uses a very surprising (for me) method of copying the file. Check out this strace output ;]
holz@XION > strace --trace=read,write,openat cp -r bigfile bigfile2
# ... Some binary loading stuff
openat(AT_FDCWD, "bigfile", O_RDONLY|O_NOFOLLOW) = 3
openat(AT_FDCWD, "bigfile2", O_WRONLY|O_TRUNC) = 4
read(3, "=%~,\352b\375\220\242\271\233\4\362\240\226\237p\3611\351\44\255"..., 131072) = 131072
write(4, "=%~,\352b\375\220\242\271\233\4\362\240\226\237p\3611\351244\255"..., 131072) = 131072
read(3, "\317:\352\307\334\226\366y\336-\270\272\207\250z61\365\23_55B"..., 131072) = 131072
write(4, "\317:\352\307\334\226\366y\336-\270\272\207\250z61\365\23355B"..., 131072) = 131072
read(3, "i]\356\255\3753v\264y\350\352\264\343\226\262p)Wx\307\27\f30_\255"..., 131072) = 131072
write(4, "i]\356\255\3753v\264y\350\352\264\343\226\262p)Wx\307\27\230_\255"..., 131072) = 131072
read(3, "\256\16\265\306\204\344\213\340\207\336\203\260\227N\203\261\224\207\262-"..., 131072) = 131072
write(4, "\256\16\265\306\204\344\213\340\207\336\203\260\227N\203\261\224\207\262-"..., 131072) = 131072
read(3, "\323\317E\257\33IZ\263_\313\361\323\367\365t>\374\\\320j\2\27"..., 131072) = 131072
# ... More read - write pair of calls
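In other words, cp here just shuttles 128 KiB chunks through user space. A minimal Python sketch of that same loop, assuming nothing beyond what the strace output shows (the function name copy_read_write is made up for illustration, and this is not the actual coreutils code):

```python
import os

BUF_SIZE = 128 * 1024  # 131072 bytes, the chunk size visible in the strace output


def copy_read_write(src_path, dst_path, buf_size=BUF_SIZE):
    """Copy src_path to dst_path with a plain read/write loop.

    Returns the number of bytes copied. Every chunk travels from the kernel
    to user space (read) and back to the kernel (write).
    """
    total = 0
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            while True:
                chunk = os.read(src_fd, buf_size)
                if not chunk:  # empty read means end of file
                    break
                view = memoryview(chunk)
                while view:  # write() may be partial, so loop until the chunk is out
                    written = os.write(dst_fd, view)
                    view = view[written:]
                    total += written
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
    return total
```

Each iteration costs two syscalls and two copies across the user/kernel boundary, which is what made the output above look so surprising to me.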
If you are surprised by the implementation, you are not alone. The thing is, the cp results don't really matter that much, and I will shortly tell you why. But for completeness, let's run wcp and see if it's faster.
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 07s 10.00 GiB / 10.00 GiB 1.38 GiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m7.340s
user 0m7.263s
sys 0m0.070s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 1007.68 MiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.218s
user 0m10.149s
sys 0m0.060s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 951.50 MiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.839s
user 0m10.751s
sys 0m0.077s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 973.20 MiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.598s
user 0m10.511s
sys 0m0.077s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 942.91 MiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.933s
user 0m10.850s
sys 0m0.074s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 11s 10.00 GiB / 10.00 GiB 921.61 MiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m11.184s
user 0m11.103s
sys 0m0.070s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 938.59 MiB/s ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.965s
user 0m10.904s
sys 0m0.054s
holz@XION >
Well, for a 10 GiB file wcp seems pretty stable and at the same time not so fast anymore. To be honest, I did not expect that. The thing is, the time for cp seems to even triple if we don't delete the existing files before doing the copy again. So I redid the tests the same way the author did and got somewhat similar results. With many small files and without "predeleting", I get wcp being ~50% faster, which is what the author claims.
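Since the run-to-run spread is so large, timing by hand gets old quickly. Here is a rough sketch of a harness that repeats a copy and deletes the destination before every run, like I did above. The bench name and the returned dictionary are made up for illustration, and it measures wall-clock time only; it makes no attempt to control the page cache.

```python
import os
import statistics
import time


def bench(copy_fn, src, dst, runs=5):
    """Time copy_fn(src, dst) `runs` times, removing dst before each run.

    Returns the min, median and max wall-clock time in seconds.
    """
    timings = []
    for _ in range(runs):
        if os.path.exists(dst):
            os.remove(dst)  # "predelete" so every run creates a fresh file
        start = time.perf_counter()
        copy_fn(src, dst)
        timings.append(time.perf_counter() - start)
    return {
        "min": min(timings),
        "median": statistics.median(timings),
        "max": max(timings),
    }
```

Any callable that takes a source and a destination path fits, for example bench(shutil.copyfile, "bigfile", "bigfile2") after an import shutil.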
Yeah ok, cool, but none of that really matters! I must admit I am having a hard time recovering from learning the shocking truth about the implementation of cp, really devastating. If you had asked me this very morning how cp is implemented, I would have said that it surely uses the filesystem driver interface for the most efficient copy of the file. Who can know better how to copy a file than the filesystem implementation itself? Also, each driver can implement it in a different way that is optimized for its own structure (some filesystems optimize for certain scenarios and trade performance [or other aspects] in one area to gain in others, which makes sense if you know how you are going to use the filesystem).
Check out the man page of the sendfile syscall.
SENDFILE(2) Linux Programmer's Manual SENDFILE(2)
NAME
sendfile - transfer data between file descriptors
SYNOPSIS
#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);
DESCRIPTION
sendfile() copies data between one file descriptor and another. Because this copying is
done within the kernel, sendfile() is more efficient than the combination of read(2) and
write(2), which would require transferring data to and from user space.
[ ... ]
CONFORMING TO
Not specified in POSIX.1-2001, nor in other standards.
Other UNIX systems implement sendfile() with different semantics and prototypes. It should
not be used in portable programs.
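Python exposes this syscall as os.sendfile, so the idea is easy to try out. A minimal sketch, assuming Linux (where the destination may be a regular file); the copy_sendfile name and the 1 GiB chunk size are my own choices for illustration:

```python
import os

CHUNK = 1 << 30  # ask for at most 1 GiB per call; one call transfers at most 0x7ffff000 bytes


def copy_sendfile(src_path, dst_path):
    """Copy src_path to dst_path with sendfile(2); returns bytes copied.

    The data never enters user space: the kernel moves it between the two
    file descriptors directly.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            sent = os.sendfile(dst.fileno(), src.fileno(), offset, min(remaining, CHUNK))
            if sent == 0:  # source shrank underneath us, nothing more to send
                break
            offset += sent
            remaining -= sent
    return offset
```

Compared to the read/write pairs in the strace output, a 10 GiB file needs only a handful of syscalls here instead of tens of thousands.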
I know for a fact that Python 3 uses sendfile when shutil.copyfile is called on Linux (source). So let's check how well it does in the benchmark.
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.683s
user 0m0.010s
sys 0m3.938s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.697s
user 0m0.014s
sys 0m3.880s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.594s
user 0m0.017s
sys 0m3.968s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m5.212s
user 0m0.017s
sys 0m4.131s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.651s
user 0m0.014s
sys 0m3.918s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.751s
user 0m0.014s
sys 0m3.943s
holz@XION >
Pretty stable results every time I run it on my machine (I skipped the first run, it was slightly slower). Also, sendfile is cool because it leaves the copying to the kernel, which should pick the most efficient way of doing it, and if it doesn't, then it is a bug in the kernel! The downside of the sendfile syscall is the thing stated in the "CONFORMING TO" section of the manual I pasted above: this syscall is not portable to other UNIX systems.
You might ask - why doesn't cp implement copying with sendfile? I have found an interesting email on, I believe, GNU's coreutils mailing list. I leave it to the reader to decide whether it was a good choice.
I know that the above "benchmarks" are kinda sketchy, but my intention was to go straight to the sendfile argument from the very beginning. Hope you have learnt something, I certainly did ;]. Till next time.
EDIT:
The --reflink flag allows the user to create CoW copies of a file; you can read about it in the cp manual.
Big thanks to cxiao for spotting this!
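For the curious: on Linux a reflink is requested with the FICLONE ioctl. Below is a rough sketch of trying that from Python and falling back to a regular copy when the filesystem says no. The copy_reflink name is made up, and the numeric FICLONE value is an assumption that holds on x86-64 and arm64 but may differ on other architectures.

```python
import fcntl
import shutil

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int).
# Assumption: this numeric value is correct for x86-64 and arm64 only.
FICLONE = 0x40049409


def copy_reflink(src_path, dst_path):
    """Try to make dst_path a CoW clone of src_path.

    Returns True if the clone succeeded, False if a regular copy was made
    instead (for example because the filesystem has no reflink support).
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to share src's extents with dst.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            pass  # not supported here, fall through to a plain copy
    shutil.copyfile(src_path, dst_path)
    return False
```

On filesystems with reflink support (Btrfs and XFS, for example) the clone returns almost instantly regardless of file size, because no data is actually copied until one of the files is modified.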