Copy contest
09 / 03 / 2021

So, you might not be aware, but there is a race going on in the realms of the internet. A friend of mine showed me a fairly fresh git repo that reimplements the cp program using the new io_uring feature of the Linux kernel. The author claims that his creation is "Up to 70% faster than cp". That's cool, but the second I heard it, my brain went - "How on earth can you be faster than the kernel fs driver?"

So I cloned the repo and followed the build instructions. Soon after, I had my own copy of the binary and decided to first confirm that what the author claimed is true. I am running a Samsung NVMe SSD on a machine with 32 GB of RAM and a 5.11 kernel (btw I use Arch), somewhat similar to the one the author is running. But this shouldn't really matter, since I am only interested in checking whether speed_of_cp < speed_of_wcp holds. And tbh I couldn't reproduce the author's cp results at first.

So I generated a 10 GiB file with:

holz@XION > dd if=/dev/urandom of=bigfile bs=512 count=20971520
20971520+0 records in
20971520+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 232.9 s, 46.1 MB/s
          

And started "benchmarking":

holz@XION > time cp bigfile bigfile2

real	0m5.212s
user	0m0.010s
sys	0m3.938s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2

real	0m6.775s
user	0m0.014s
sys	0m3.973s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2

real	0m9.333s
user	0m0.015s
sys	0m4.031s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2

real	0m6.266s
user	0m0.004s
sys	0m4.219s
holz@XION > rm bigfile2
holz@XION > time cp bigfile bigfile2

real	0m8.924s
user	0m0.008s
sys	0m4.075s
holz@XION > rm bigfile2
          

My cp seems to be extremely unstable. I don't really know why; it's probably related to buffering inside the kernel. Especially because cp uses a very surprising (for me) method of copying the file. Check out this strace output ;]

holz@XION > strace --trace=read,write,openat cp -r bigfile bigfile2
# ... Some binary loading stuff
openat(AT_FDCWD, "bigfile", O_RDONLY|O_NOFOLLOW) = 3
openat(AT_FDCWD, "bigfile2", O_WRONLY|O_TRUNC) = 4
read(3, "=%~,\352b\375\220\242\271\233\4\362\240\226\237p\3611\351\44\255"..., 131072) = 131072
write(4, "=%~,\352b\375\220\242\271\233\4\362\240\226\237p\3611\351244\255"..., 131072) = 131072
read(3, "\317:\352\307\334\226\366y\336-\270\272\207\250z61\365\23_55B"..., 131072) = 131072
write(4, "\317:\352\307\334\226\366y\336-\270\272\207\250z61\365\23355B"..., 131072) = 131072
read(3, "i]\356\255\3753v\264y\350\352\264\343\226\262p)Wx\307\27\f30_\255"..., 131072) = 131072
write(4, "i]\356\255\3753v\264y\350\352\264\343\226\262p)Wx\307\27\230_\255"..., 131072) = 131072
read(3, "\256\16\265\306\204\344\213\340\207\336\203\260\227N\203\261\224\207\262-"..., 131072) = 131072
write(4, "\256\16\265\306\204\344\213\340\207\336\203\260\227N\203\261\224\207\262-"..., 131072) = 131072
read(3, "\323\317E\257\33IZ\263_\313\361\323\367\365t>\374\\\320j\2\27"..., 131072) = 131072
# ... More read - write pair of calls
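
To make the pattern in the trace concrete, here is a minimal Python sketch of the same read/write loop. It is not cp's actual source, just an illustration: the function name copy_read_write is made up for this example, and the 131072-byte buffer size is taken from the strace output above.

```python
import os

BUF_SIZE = 128 * 1024  # 131072 bytes, the chunk size seen in the strace output


def copy_read_write(src_path, dst_path, buf_size=BUF_SIZE):
    """Copy src_path to dst_path with a plain read()/write() loop.

    Hypothetical helper for illustration; returns the number of bytes copied.
    """
    src_fd = os.open(src_path, os.O_RDONLY)
    try:
        dst_fd = os.open(dst_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            total = 0
            while True:
                chunk = os.read(src_fd, buf_size)  # kernel -> user space
                if not chunk:
                    break  # end of file
                view = memoryview(chunk)
                while len(view) > 0:
                    # write() may accept fewer bytes than offered
                    written = os.write(dst_fd, view)  # user space -> kernel
                    view = view[written:]
                total += len(chunk)
            return total
        finally:
            os.close(dst_fd)
    finally:
        os.close(src_fd)
```

Every chunk crosses the kernel/user-space boundary twice, which is exactly what the read/write pairs in the trace show.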
          

If you are surprised by the implementation, you are not alone. The thing is, the cp results don't really matter that much, and I will shortly tell you why. But for completeness, let's run wcp and see if it's faster.

holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 07s                             10.00 GiB / 10.00 GiB                  1.38 GiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m7.340s
user	0m7.263s
sys	0m0.070s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 10s                             10.00 GiB / 10.00 GiB               1007.68 MiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m10.218s
user	0m10.149s
sys	0m0.060s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 10s                             10.00 GiB / 10.00 GiB                951.50 MiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m10.839s
user	0m10.751s
sys	0m0.077s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 10s                             10.00 GiB / 10.00 GiB                973.20 MiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m10.598s
user	0m10.511s
sys	0m0.077s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 10s                             10.00 GiB / 10.00 GiB                942.91 MiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m10.933s
user	0m10.850s
sys	0m0.074s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 11s                             10.00 GiB / 10.00 GiB                921.61 MiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m11.184s
user	0m11.103s
sys	0m0.070s
holz@XION > rm bigfile2
holz@XION > time ./wcp bigfile bigfile2

 Elapsed: 10s                             10.00 GiB / 10.00 GiB                938.59 MiB/s   ETA: ~00s
100% ██████████████████████████████████████████████████████████████████████████████████████████████████

real	0m10.965s
user	0m10.904s
sys	0m0.054s
holz@XION >
          

Well, for a 10 GiB file wcp seems pretty stable and at the same time not so fast anymore. To be honest, I did not expect that. The thing is, the time for cp seems to even triple if we don't delete the existing file before doing the copy again. So I redid the test the same way the author did and got somewhat similar results. With many small files and without "predeleting", I get wcp being ~50% faster, which is roughly what the author claims.
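
Since the "predeleting" detail changes the numbers so much, it may help to script the runs instead of typing them by hand. The following is only a sketch of how that could look, not what I actually ran: bench and summarize are hypothetical helper names, and the file names in the usage part are the ones from this post.

```python
import os
import statistics
import subprocess
import time


def bench(argv, dst_path, runs=5, predelete=True):
    """Run argv `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        if predelete and os.path.exists(dst_path):
            os.remove(dst_path)  # same as the manual `rm bigfile2` between runs
        start = time.perf_counter()
        subprocess.run(argv, check=True)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (mean, sample standard deviation) of the durations."""
    spread = statistics.stdev(durations) if len(durations) > 1 else 0.0
    return statistics.mean(durations), spread


if __name__ == "__main__":
    # Assumes bigfile and ./wcp exist in the current directory.
    for label, argv in [("cp", ["cp", "bigfile", "bigfile2"]),
                        ("wcp", ["./wcp", "bigfile", "bigfile2"])]:
        mean, spread = summarize(bench(argv, "bigfile2"))
        print(f"{label}: mean {mean:.3f}s, stdev {spread:.3f}s")
```

Passing predelete=False reproduces the overwrite scenario. Note that this sketch does nothing about the page cache, so it is no less sketchy than the manual runs, only less tedious.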

Yeah ok, cool, but none of that really matters! I must admit I am having a hard time recovering from learning the shocking truth about the implementation of cp, really devastating. If you had asked me this very morning how cp is implemented, I would have said that it surely uses the filesystem driver interface for the most effective copy of the file. Who can know better how to copy a file than the filesystem implementation itself? Also, each driver can implement a different way of doing it that is optimized for its own structure (some filesystems optimize for certain scenarios and trade performance [or other aspects] in one area to gain it in others, which makes sense if you know how you are going to use the filesystem).

Check out the man page of the sendfile syscall.

SENDFILE(2)                          Linux Programmer's Manual                          SENDFILE(2)

NAME
       sendfile - transfer data between file descriptors

SYNOPSIS
       #include <sys/sendfile.h>

       ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

DESCRIPTION
       sendfile()  copies  data  between  one file descriptor and another.  Because this copying is
       done within the kernel, sendfile() is more efficient than the  combination  of  read(2)  and
       write(2), which would require transferring data to and from user space.


            [ ... ]

CONFORMING TO
       Not specified in POSIX.1-2001, nor in other standards.

       Other  UNIX systems implement sendfile() with different semantics and prototypes.  It should
       not be used in portable programs.
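
As a rough illustration of the call described above, here is a minimal Python sketch that copies a file with os.sendfile, the standard library's thin wrapper around the syscall. The function name copy_sendfile and the 1 GiB per-call cap are my own choices for the example, and it assumes Linux, where the output descriptor may be a regular file.

```python
import os

MAX_CHUNK = 1 << 30  # arbitrary per-call cap chosen for this sketch


def copy_sendfile(src_path, dst_path):
    """Copy src_path to dst_path using sendfile(2); returns bytes copied.

    Hypothetical helper for illustration, Linux only.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            # A single call may transfer fewer bytes than requested,
            # so keep going until the whole file has been sent.
            sent = os.sendfile(dst.fileno(), src.fileno(), offset,
                               min(remaining, MAX_CHUNK))
            if sent == 0:
                break  # source is shorter than expected (e.g. truncated)
            offset += sent
            remaining -= sent
        return offset
```

Unlike a read/write loop, the data never has to be copied into a user-space buffer here; the program only tells the kernel which descriptors to connect.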
          

I know for a fact that Python 3 uses sendfile when shutil.copyfile is called on Linux (source). So let's check how well it does in the benchmark.
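
The copyfile.py script itself is not listed in this post. A minimal sketch of what such a script could look like, assuming it does nothing more than pass its two arguments to shutil.copyfile:

```python
import shutil
import sys


def main(argv):
    """Copy argv[1] to argv[2]; returns a process exit code."""
    if len(argv) != 3:
        print(f"usage: {argv[0]} SRC DST", file=sys.stderr)
        return 2
    # On Linux, shutil.copyfile tries os.sendfile() first (Python 3.8+)
    # and falls back to a read/write loop if that is not possible.
    shutil.copyfile(argv[1], argv[2])
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```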

holz@XION > time python3 copyfile.py bigfile bigfile2

real	0m4.683s
user	0m0.010s
sys	0m3.938s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2

real	0m4.697s
user	0m0.014s
sys	0m3.880s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2

real	0m4.594s
user	0m0.017s
sys	0m3.968s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2

real	0m5.212s
user	0m0.017s
sys	0m4.131s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2

real	0m4.651s
user	0m0.014s
sys	0m3.918s
holz@XION > rm bigfile2
holz@XION > time python3 copyfile.py bigfile bigfile2

real	0m4.751s
user	0m0.014s
sys	0m3.943s
holz@XION >
          

Pretty stable result every time I run it on my machine (I skipped the first one, it was slightly slower). Also, sendfile is cool because it ensures the most efficient way of copying the file, and if it doesn't, then it is a bug in the kernel! The downside of the sendfile syscall is the thing stated in the "CONFORMING TO" section of the manual I pasted above: this syscall is not portable to other UNIX systems.

You might ask - why doesn't cp implement copying with sendfile? I found an interesting email, I believe on the GNU coreutils mailing list. I leave it to the reader to decide whether it was a good choice.

I know that the above "benchmarks" are kinda sketchy, but my intention was to go straight into the sendfile argument from the very beginning. Hope you have learnt something, I certainly did ;]. Till next time.

EDIT:

The --reflink flag allows the user to create CoW copies of a file; you can read about it in the cp manual.
Big thanks to cxiao for spotting this!
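
For the curious, a CoW copy can also be requested directly from a program. The sketch below is my own illustration, not cp's code: it issues the Linux FICLONE ioctl and falls back to a regular copy when the filesystem does not support cloning, which is roughly the behaviour of cp --reflink=auto. The function name is made up, and the FICLONE constant is hardcoded from linux/fs.h because Python's fcntl module does not export it in the versions I know of.

```python
import fcntl
import shutil

FICLONE = 0x40049409  # _IOW(0x94, 9, int) from <linux/fs.h>, hardcoded here


def copy_reflink_or_fallback(src_path, dst_path):
    """Try a CoW clone of src_path; fall back to a regular copy.

    Hypothetical helper for illustration. Returns True if the file was
    cloned, False if a normal copy was made instead.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to share the source's data blocks.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            pass  # e.g. no reflink support, or files on different filesystems
    shutil.copyfile(src_path, dst_path)
    return False
```

On filesystems with reflink support (Btrfs, or XFS created with reflink enabled) the clone returns almost immediately regardless of file size, because no data is actually copied until one of the files is modified.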