So, you might not be aware, but there is a race going on in the realms of the internet. A friend of mine showed me a fairly fresh git repo containing a reimplementation of the cp program that uses the new io_uring feature of the Linux kernel. The author claims that his creation is "Up to 70% faster than cp". That's cool, but the second I heard it, my brain went: "How on earth can you be faster than the kernel fs driver?"
So I cloned the repo and followed the build instructions. Soon after, I had my own copy of the binary and decided to first confirm that what the author claims is true. I am running a Samsung NVMe SSD on a machine with 32 GB of RAM and a 5.11 kernel (btw I use Arch), somewhat similar to the one the author is running. This shouldn't really matter, though, since I am only interested in checking whether speed_of_cp < speed_of_wcp holds. And tbh, I couldn't reproduce the author's cp results at first.
So I generated a 10 GiB file with:
holz@XION > dd if=/dev/urandom of=bigfile bs=512 count=20971520
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 232.9 s, 46.1 MB/s
And started "benchmarking":
holz@XION > time cp bigfile bigfile2
real 0m5.212s
user 0m0.010s
sys 0m3.938s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m6.775s
user 0m0.014s
sys 0m3.973s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m9.333s
user 0m0.015s
sys 0m4.031s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m6.266s
user 0m0.004s
sys 0m4.219s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2
real 0m8.924s
user 0m0.008s
sys 0m4.075s
holz@XION > rm bigfile2
My cp seems to be extremely unstable. I don't really know why; it's probably related to buffering inside the kernel. Especially since cp uses a very surprising (for me) method of copying the file. Check out this strace output ;]
holz@XION > strace --trace=read,write,openat cp -r bigfile bigfile2
# ... Some binary loading stuff
openat(AT_FDCWD, "bigfile", O_RDONLY|O_NOFOLLOW) = 3
openat(AT_FDCWD, "bigfile2", O_WRONLY|O_TRUNC) = 4
read(3, "=%~,\352b\375\220\242\271\233\4\362\240\226\237p\3611\351\44\255"..., 131072) = 131072
write(4, "=%~,\352b\375\220\242\271\233\4\362\240\226\237p\3611\351244\255"..., 131072) = 131072
read(3, "\317:\352\307\334\226\366y\336-\270\272\207\250z61\365\23_55B"..., 131072) = 131072
write(4, "\317:\352\307\334\226\366y\336-\270\272\207\250z61\365\23355B"..., 131072) = 131072
read(3, "i]\356\255\3753v\264y\350\352\264\343\226\262p)Wx\307\27\f30_\255"..., 131072) = 131072
write(4, "i]\356\255\3753v\264y\350\352\264\343\226\262p)Wx\307\27\230_\255"..., 131072) = 131072
read(3, "\256\16\265\306\204\344\213\340\207\336\203\260\227N\203\261\224\207\262-"..., 131072) = 131072
write(4, "\256\16\265\306\204\344\213\340\207\336\203\260\227N\203\261\224\207\262-"..., 131072) = 131072
read(3, "\323\317E\257\33IZ\263_\313\361\323\367\365t>\374\\\320j\2\27"..., 131072) = 131072
# ... More read - write pair of calls
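The pattern in the trace above is a plain userspace copy loop: read 131072 bytes (128 KiB) into a buffer, write them back out, repeat. Here is a rough Python sketch of the same idea. The function name and structure are mine, for illustration only; this is not how GNU cp is actually written.

```python
import os

BUF_SIZE = 131072  # 128 KiB, the read/write size visible in the strace output


def read_write_copy(src: str, dst: str, buf_size: int = BUF_SIZE) -> int:
    """Copy src to dst with a read/write loop; returns the number of bytes copied.

    Hypothetical helper: every chunk travels kernel -> user space -> kernel.
    """
    in_fd = os.open(src, os.O_RDONLY)
    try:
        out_fd = os.open(dst, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            total = 0
            while True:
                buf = os.read(in_fd, buf_size)  # one read(2) per chunk
                if not buf:  # empty bytes object means end of file
                    break
                view = memoryview(buf)
                while view:  # write(2) may be partial, so loop until the chunk is out
                    written = os.write(out_fd, view)
                    view = view[written:]
                total += len(buf)
            return total
        finally:
            os.close(out_fd)
    finally:
        os.close(in_fd)
```

For a 10 GiB file and a 128 KiB buffer that is 81920 read calls plus at least as many write calls, each one copying data across the user/kernel boundary.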
If you are surprised by the implementation, you are not alone. The thing is, the cp results don't really matter that much, and I will shortly tell you why. But for completeness, let's run wcp and see if it's faster.
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 07s 10.00 GiB / 10.00 GiB 1.38 GiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m7.340s
user 0m7.263s
sys 0m0.070s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 1007.68 MiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.218s
user 0m10.149s
sys 0m0.060s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 951.50 MiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.839s
user 0m10.751s
sys 0m0.077s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 973.20 MiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.598s
user 0m10.511s
sys 0m0.077s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 942.91 MiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.933s
user 0m10.850s
sys 0m0.074s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 11s 10.00 GiB / 10.00 GiB 921.61 MiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m11.184s
user 0m11.103s
sys 0m0.070s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2
Elapsed: 10s 10.00 GiB / 10.00 GiB 938.59 MiB/s ETA: ~00s 100% ██████████████████████████████████████████████████████████████████████████████████████████████████
real 0m10.965s
user 0m10.904s
sys 0m0.054s
holz@XION >
Well, for a 10 GiB file wcp seems pretty stable and, at the same time, not so fast anymore. To be honest, I did not expect that. The thing is that the time for cp seems to even triple if we don't delete the existing files before copying again. So I redid the test the same way the author did and got somewhat similar results. With many small files and without "predeleting", I get wcp being ~50% faster, which is roughly in line with what the author claims.
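Since deleting or not deleting the destination changes the numbers so much, it helps to make that choice explicit instead of typing rm by hand between runs. Below is a small, hypothetical timing harness (the names bench and summarize are made up for this sketch). It only measures wall-clock time and does nothing about the page cache, so treat it as a convenience, not as a rigorous benchmark.

```python
import os
import statistics
import subprocess
import time


def bench(cmd: list[str], dst: str, runs: int = 5, predelete: bool = True) -> list[float]:
    """Run cmd `runs` times and return wall-clock timings in seconds.

    With predelete=True the destination is removed before every run,
    which mirrors the manual `rm bigfile2` between copies.
    """
    timings = []
    for _ in range(runs):
        if predelete and os.path.exists(dst):
            os.remove(dst)
        start = time.perf_counter()
        subprocess.run(cmd, check=True)  # raises if the copy command fails
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: list[float]) -> dict[str, float]:
    """Min, mean and sample standard deviation of the timings."""
    return {
        "min": min(timings),
        "mean": statistics.mean(timings),
        "stdev": statistics.stdev(timings) if len(timings) > 1 else 0.0,
    }
```

Something like summarize(bench(["cp", "bigfile", "bigfile2"], "bigfile2")) next to the same call with predelete=False would put the two scenarios side by side.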
Yeah ok, cool, but none of that really matters! I must admit I am having a hard time recovering from learning the shocking truth about the implementation of cp, really devastating. If you had asked me this very morning how cp is implemented, I would have said that it surely uses the filesystem driver interface for the most effective copy of the file. Who can know better how to copy a file than the filesystem implementation itself? Also, each driver can implement a different way of doing it, one that is optimized for its own structure (some filesystems optimize for certain scenarios and trade performance [or other aspects] in one area to gain it in others, which makes sense if you know how you are going to use the filesystem).
Check out the man page of the sendfile syscall.
SENDFILE(2)                Linux Programmer's Manual               SENDFILE(2)

NAME
       sendfile - transfer data between file descriptors

SYNOPSIS
       #include <sys/sendfile.h>

       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

DESCRIPTION
       sendfile() copies data between one file descriptor and another.
       Because this copying is done within the kernel, sendfile() is more
       efficient than the combination of read(2) and write(2), which would
       require transferring data to and from user space.

       [ ... ]

CONFORMING TO
       Not specified in POSIX.1-2001, nor in other standards.

       Other UNIX systems implement sendfile() with different semantics and
       prototypes. It should not be used in portable programs.
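To make the man page a bit more concrete, here is a minimal sketch of a file-to-file copy built on os.sendfile, Python's thin wrapper around the syscall. The function name and the 1 GiB chunk size are arbitrary choices of mine, and this assumes Linux, where out_fd is allowed to be a regular file.

```python
import os


def sendfile_copy(src: str, dst: str, chunk: int = 2**30) -> int:
    """Copy src to dst using sendfile(2); returns the number of bytes copied.

    Linux-only sketch: the data never passes through a user-space buffer.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        in_fd, out_fd = fsrc.fileno(), fdst.fileno()
        offset = 0
        while True:
            # Ask the kernel to move up to `chunk` bytes starting at `offset`;
            # it may transfer fewer, so keep going until it reports 0 (EOF).
            sent = os.sendfile(out_fd, in_fd, offset, chunk)
            if sent == 0:
                break
            offset += sent
    return offset
```

Compared with a read/write loop, the number of syscalls drops dramatically and there is no buffer to shuttle between user space and the kernel.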
I know for a fact that Python 3 (since 3.8) uses sendfile when shutil.copyfile is called on Linux (source). So let's check how well it does in the benchmark.
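The copyfile.py script itself is not shown in this post. A minimal version, assuming it does nothing more than pass its two arguments to shutil.copyfile, could look like this (the copy wrapper is a made-up name for the sketch):

```python
import shutil
import sys


def copy(src: str, dst: str) -> str:
    """Copy src to dst and return dst.

    shutil.copyfile decides how to move the bytes; on Linux with
    Python 3.8+ it tries os.sendfile first and falls back to a plain
    read/write loop if that is not possible.
    """
    return shutil.copyfile(src, dst)


# Assumed invocation: python3 copyfile.py SRC DST
if __name__ == "__main__" and len(sys.argv) == 3:
    copy(sys.argv[1], sys.argv[2])
```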
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.683s
user 0m0.010s
sys 0m3.938s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.697s
user 0m0.014s
sys 0m3.880s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.594s
user 0m0.017s
sys 0m3.968s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m5.212s
user 0m0.017s
sys 0m4.131s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.651s
user 0m0.014s
sys 0m3.918s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2
real 0m4.751s
user 0m0.014s
sys 0m3.943s
holz@XION >
Pretty stable results every time I run it on my machine (I skipped the first run, it was slightly slower). Also, sendfile is cool because it should give you the most efficient way of copying the file, and if it doesn't, then that is a bug in the kernel! The downside of the sendfile syscall is what's stated in the "CONFORMING TO" section of the manual I pasted above: this syscall is not portable to other UNIX systems.
You might ask: why doesn't cp implement copying with sendfile? I found an interesting email on, I believe, the GNU coreutils mailing list. I leave it to the reader to decide whether it was a good choice.
I know that the above "benchmarks" are kinda sketchy, but my intention was to go straight into the sendfile argument from the very beginning. Hope you have learnt something, I certainly did ;]. Till next time.
EDIT:
The --reflink flag allows the user to create CoW copies of a file; you can read about it in the cp manual.
Big thanks to cxiao for spotting this!
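For the curious: on Linux, a CoW copy like the one --reflink produces can be requested with the FICLONE ioctl, which only works on filesystems that support it (Btrfs and XFS, for example). The sketch below tries that and falls back to an ordinary copy otherwise. The function name is made up, the constant is hard-coded from linux/fs.h, and whether cp takes exactly this path internally is an assumption on my part.

```python
import fcntl
import shutil

# From <linux/fs.h>: FICLONE = _IOW(0x94, 9, int). Hard-coded here because
# Python's fcntl module does not export it on all versions.
FICLONE = 0x40049409


def reflink_or_copy(src: str, dst: str) -> str:
    """Try a CoW clone of src into dst; fall back to a regular copy.

    Returns "reflink" if the clone worked and "copy" otherwise.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the filesystem to make dst share src's data blocks.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return "reflink"
        except OSError:
            pass  # unsupported filesystem, cross-device copy, etc.
    shutil.copyfile(src, dst)
    return "copy"
```

When the clone succeeds, no file data is copied at all; the two files share blocks until one of them is modified.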