Quần Cam Blog

How to explain rsync to a six year old

Einstein had a saying that people always quote.

[Image: Einstein quote]

I have no clue why he said it, or whether he said it at all. After all, a six year old who knows Einstein is no longer really six.

But I will try my best.

Everything starts with a tale

Once upon a time, there were three brothers named Generator, Receiver and Sender (GRS for short). They were so clever that the chief simply hated them.

One day the chief came to the GRS brothers and gave them a task: synchronize libraries A and B in the town, so that the books in B were exactly the same as those in A, within 24 hours. Otherwise they would be exiled.

That was rather evil of the chief, because it looked like an impossible mission! Each library had thousands of books and millions of pages. The easiest way would have been to throw away all the books in B and bring over copies of the books from A, but that would have been far too slow. Even if the brothers could print 10 pages per minute, it would take a few months to copy everything, and they only had 24 hours.

The chief thought this would be the end of the GRS brothers, but a surprise came the next day: library B had been completely and perfectly synchronized.

How did the GRS brothers do that?

The Algo

The brothers realized that the books in both libraries were almost the same; only a few pages were too old to be readable. So here is how they got the task done.

1. Generator

First, Generator looks through all the books in library A and, for each of them, sends a message to Receiver, who is in library B at the time.

  • A book that B has but A does not => Generator asks Receiver to throw it away.

  • A book that A has but B does not => Generator asks Receiver to make a copy of the one from A.

  • A book that they both have => Generator asks Receiver to do the math based on The Algo™ and write the results down as a list of checksums, as below.

1-10: 99
2-11: 98
3-12: 55
4-13: 102
...
91-100: 123

The checksums are then sent to Sender.
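Generator's part of the tale can be sketched in a few lines of Python. This is only an illustration of the story, not of rsync itself: a library is assumed to be a dict mapping a book title to a list of page texts, and the names `plan_sync` and `window_checksums` are made up for this post. The "checksum" is the toy one from the tale, the number of words in each 10-page segment.

```python
def plan_sync(library_a, library_b):
    """Sort book titles into the three cases Generator tells Receiver about."""
    titles_a, titles_b = set(library_a), set(library_b)
    return {
        "throw_away": sorted(titles_b - titles_a),  # B has it, A does not
        "copy": sorted(titles_a - titles_b),        # A has it, B does not
        "checksum": sorted(titles_a & titles_b),    # both have it
    }


def window_checksums(pages, size=10):
    """Toy checksum from the tale: word count of every `size`-page segment.

    Keys are 1-based (first_page, last_page) pairs, like "1-10" in the list above.
    """
    counts = [len(page.split()) for page in pages]
    return {
        (start + 1, start + size): sum(counts[start:start + size])
        for start in range(len(pages) - size + 1)
    }
```

Receiver would call `window_checksums` on his own copy of each book listed under `"checksum"` and pass the result on.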

2. Sender

After receiving the checksums, Sender can now use them and The Algo™ to find the differences between his book and the one in B.

First, for pages 1-10, Sender quickly counts 90 words in the segment and compares that with Receiver's checksum for pages 1-10 (which is 99). They are obviously different, so he writes the number 1 in his draft and makes a copy of page 1.

Then he repeats the work for pages 2-11. The word counts differ again, so he writes 2 in his draft and makes a copy of page 2 as well.

The same happens for pages 3-12, 4-13 and 5-14, so Sender writes 3, 4 and 5 in his draft and, as you might have guessed, makes copies of those pages.

But for the segment of pages 6-15, Sender finds that the number of words is equal to Receiver's. So, without writing anything or making any copy, he jumps straight to pages 16-25 instead of moving on to pages 7-16 as in the previous steps.

The same step is repeated until the end of the book. Here is what he has in his draft:

1, 2, 3, 4, 5, 16, 101, 102, 103

and, of course, the copies of those pages. He sends both to Receiver and moves on to the other books.
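Sender's walk through the book can be sketched as below. It follows the tale literally, with word counts as the checksum, so it is a toy and not what rsync does byte for byte. The function names are invented for this post, and `receiver_checksums` is assumed to be a dict keyed by 1-based `(first_page, last_page)` pairs.

```python
def word_count(pages):
    """Toy checksum from the tale: total number of words on the given pages."""
    return sum(len(page.split()) for page in pages)


def find_changed_pages(book_a, receiver_checksums, size=10):
    """Return (draft, copies): the page numbers to send and their text from A."""
    draft = []   # 1-based page numbers Receiver must replace or add
    copies = {}  # page number -> text of that page in A's book
    page = 1
    while page <= len(book_a):
        window = book_a[page - 1:page - 1 + size]
        expected = receiver_checksums.get((page, page + size - 1))
        if len(window) == size and word_count(window) == expected:
            page += size  # the whole segment matches: jump past it
        else:
            draft.append(page)  # mismatch, or no such segment in B
            copies[page] = book_a[page - 1]
            page += 1
    return draft, copies
```

Note how a single changed page makes Sender copy every page before it in the current segment too, exactly as in the story where pages 1 to 5 are all copied. That waste is one reason the real algorithm is smarter than counting words.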

3. Receiver

After receiving the number list from Sender, Receiver will start synchronizing the book.

The list shows that pages 1, 2, 3, 4, 5, 16, 101, 102 and 103 are the ones that differ, so Receiver replaces those pages in his book with the corresponding copies from Sender, adding any page his book did not have yet.
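Receiver's step is the simplest one. A minimal sketch, assuming the draft is in ascending order and that the book in A is at least as long as the one in B (the tale does not cover pages that must be torn out); `apply_draft` is a made-up name:

```python
def apply_draft(book_b, draft, copies):
    """Return a new book: B's pages with the listed pages replaced or added."""
    new_book = list(book_b)  # work on a copy, leave the original untouched
    for page in draft:       # page numbers are 1-based
        if page <= len(new_book):
            new_book[page - 1] = copies[page]  # replace an existing page
        else:
            new_book.append(copies[page])      # a page B did not have yet
    return new_book
```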

So, a book has been successfully synchronized without having to reprint everything.

Back to real life

Yes, in a simplified form, this is how rsync works.

Libraries A and B are the source and destination directories, books are files, and page segments are byte buffers.

The Algo™ mentioned above is the rolling checksum, which is the heart of rsync. Of course the real checksum algorithm is not as simple as counting words, and it should not be. It is considerably more complex, as you can see here.
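To give a feel for what "rolling" means, here is a minimal sketch of the weak checksum as described in the rsync technical report: two sums, `a` and `b`, each kept modulo 2^16. The function names are made up for this post, and the actual rsync source differs in details. The point is that sliding the window by one byte only needs the byte that leaves and the byte that enters, not a full recount.

```python
M = 1 << 16  # both sums are kept modulo 2^16


def weak_checksum(block):
    """Compute (a, b) from scratch for a block of bytes."""
    n = len(block)
    a = sum(block) % M
    # b weights each byte by its distance from the end of the block
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a, b, out_byte, in_byte, block_len):
    """Slide the window one byte to the right in constant time."""
    a = (a - out_byte + in_byte) % M
    b = (b - block_len * out_byte + a) % M
    return a, b


def combine(a, b):
    """Pack the two 16-bit sums into one 32-bit value."""
    return a + (b << 16)
```

Starting from `weak_checksum(data[:n])`, calling `roll(a, b, data[k - 1], data[k - 1 + n], n)` for k = 1, 2, ... gives the same pair as recomputing `weak_checksum(data[k:k + n])` each time.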

There are two types of checksum in rsync: the weak and the strong. If the weak checksums match, the sender calculates the strong one to make sure the buffers really are equal.
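The weak-then-strong check can be sketched as follows. Unlike the tale, the destination side here checksums its file in fixed, non-overlapping blocks, and the sender slides a window over its own file one byte at a time looking for any of those blocks. Several things are assumptions made for brevity: MD5 from `hashlib` plays the strong checksum, the weak checksum is recomputed at every position instead of being rolled, a trailing partial block of the old file is ignored, and all function names are invented for this post.

```python
import hashlib

M = 1 << 16


def weak(block):
    """Cheap 32-bit checksum; collisions are possible."""
    a = sum(block) % M
    b = sum((len(block) - i) * x for i, x in enumerate(block)) % M
    return a | (b << 16)


def strong(block):
    """Expensive checksum, only computed when the weak one matches."""
    return hashlib.md5(block).digest()


def signatures(old, block_len):
    """Destination side: weak checksum -> list of (strong checksum, block index)."""
    table = {}
    for index in range(len(old) // block_len):
        block = old[index * block_len:(index + 1) * block_len]
        table.setdefault(weak(block), []).append((strong(block), index))
    return table


def delta(new, table, block_len):
    """Source side: list of ("copy", block_index) and ("data", bytes) steps."""
    ops, literal, pos = [], bytearray(), 0
    while pos + block_len <= len(new):
        window = new[pos:pos + block_len]
        match = None
        candidates = table.get(weak(window))
        if candidates:  # weak match: confirm with the strong checksum
            digest = strong(window)
            match = next((i for s, i in candidates if s == digest), None)
        if match is None:
            literal.append(new[pos])  # no block starts here: send the byte
            pos += 1
        else:
            if literal:
                ops.append(("data", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", match))  # destination already has this block
            pos += block_len
    literal += new[pos:]
    if literal:
        ops.append(("data", bytes(literal)))
    return ops


def patch(old, ops, block_len):
    """Destination side: rebuild the new file from old blocks and sent bytes."""
    out = bytearray()
    for kind, value in ops:
        if kind == "copy":
            out += old[value * block_len:(value + 1) * block_len]
        else:
            out += value
    return bytes(out)
```

Only the `"data"` steps carry file content over the wire; a `"copy"` step is just a block number.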

In practice the receiver does not make changes directly to its local file but writes to a temp file instead. This keeps the local file undamaged in case of errors. Once the sync is complete, the temp file is renamed and replaces the old one.
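The temp-file trick is a general pattern and is easy to show in Python. This is a sketch of the idea rather than rsync's own code; `replace_atomically` and the `.sync-tmp-` prefix are made up for the example. The temp file is created in the same directory as the target so that the final rename stays on one filesystem.

```python
import os
import tempfile


def replace_atomically(path, new_content):
    """Write new_content to a temp file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".sync-tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(new_content)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, path)  # the old file is swapped out in one step
    except BaseException:
        # On any error the original file is left as it was.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the write fails halfway, readers of `path` still see the complete old file, never a half-written one.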

So, as you can see, this architecture is what makes rsync efficient.

  1. Partial file synchronization: two files are made identical without copying everything.
  2. Each member of GRS works separately and does not block the others; data is passed between them through communication.
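The second point can be pictured as a small pipeline. In the sketch below, threads and in-memory queues stand in for the three brothers and the messages they pass; that is an assumption made to keep the example short, not a description of how rsync is built. The strings flowing through the queues are placeholders for real checksums and deltas.

```python
import queue
import threading

DONE = object()  # marker telling the next brother there is no more work


def generator(titles, to_sender):
    for title in titles:
        to_sender.put((title, f"checksums of {title}"))
    to_sender.put(DONE)


def sender(to_sender, to_receiver):
    while (item := to_sender.get()) is not DONE:
        title, checksums = item
        to_receiver.put((title, f"delta from {checksums}"))
    to_receiver.put(DONE)


def receiver(to_receiver, synced):
    while (item := to_receiver.get()) is not DONE:
        synced.append(item[0])  # pretend the delta was applied to the book


def run_pipeline(titles):
    to_sender, to_receiver, synced = queue.Queue(), queue.Queue(), []
    workers = [
        threading.Thread(target=generator, args=(titles, to_sender)),
        threading.Thread(target=sender, args=(to_sender, to_receiver)),
        threading.Thread(target=receiver, args=(to_receiver, synced)),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return synced
```

While Sender is busy with one book, Generator can already be working on the next one, so nobody sits idle waiting for a whole book to finish.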
