I think a big difference between cloud computing and grid or cluster computing is that you use the Cloud as storage that you access frequently. In this sense, the Cloud is our great friend. However, today I realised this could also be the point where the Cloud messes with you.
By mistake, I created a lot of small files (in many deeply hierarchical folders) in a folder watched by Dropbox. I have 4 computers, all connected to Dropbox. While I was deleting the files on one computer, Dropbox was adding them to another computer, and later it synced the files I had just deleted back!
So I thought about the reason. This is what I got on my Linux server, where files are always synchronized with my desktops by Dropbox.
    $ python ~/bin/dropbox.py status
    Uploading 13,904 files...
    Indexing 24,132 files...
    Downloading file list...

Too many files are the cause. This is what happened, to the best of my brain power:
1. When file F is deleted on computer A, it is not removed from the Cloud until the local synchronization client has indexed the change and had time to report it to the Cloud.
2. Before that happens, the synchronization client on computer B downloads F.
3. After A tells the Cloud to remove F, it takes some time for the Cloud to tell B to do the same. In my case, "some time" was long enough to cause problems.
4. It is very possible that, before step 3 completes, B indexes F, considers it a new file, and uploads it.
5. Since F is now a recently uploaded new file, the Cloud tells A to download F.
6. This may go on forever, maybe even as an endless loop.
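To convince myself that the steps above really can loop, here is a toy replay of them in Python. It is only a sketch of my guess, not how Dropbox is actually implemented: the `simulate` function, the three sets, and the naive rule "a local file that the Cloud does not list is a new file" are all assumptions made up for illustration.

```python
def simulate(rounds=3):
    """Replay steps 1-5 `rounds` times and count how often F returns to A.

    Toy model only: the sets and the "unknown to the cloud means new"
    rule are assumptions, not Dropbox's real protocol.
    """
    cloud = {"F"}   # files the cloud currently lists
    a = {"F"}       # files on computer A
    b = set()       # files on computer B
    resurrections = 0

    for _ in range(rounds):
        # Step 1: F is deleted on A, but the deletion is not reported yet.
        a.discard("F")
        # Step 2: B downloads whatever the cloud still lists.
        b |= cloud
        # Step 3: A's deletion reaches the cloud; the notice to B is delayed.
        cloud.discard("F")
        # Step 4: B indexes its folder; files the cloud does not list look new,
        # so B uploads them.
        stale = b - cloud
        cloud |= stale
        # Step 5: the cloud tells A to download the "new" file.
        restored = cloud - a
        a |= restored
        resurrections += len(restored)

    return resurrections, a, b, cloud


if __name__ == "__main__":
    count, a, b, cloud = simulate(rounds=3)
    print(f"F came back to A {count} times; A={a}, B={b}, cloud={cloud}")
```

In this model every deletion on A is undone within the same round, so F never goes away no matter how many times it is deleted. The real client is surely smarter than this, but the timing problem is the same.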
The root of it is that Dropbox uses a non-conservative strategy in order to synchronize quickly across computers.
The purpose of this blog post is not to attack Dropbox, which, I have no doubt, is a great company. But this is a problem when using the Cloud as main storage.
(Picture) My precious Saturday afternoon:
I may have made mistakes in steps 1-5, so please feel free to tell me. I didn't think it through very carefully.