A Distributed File Store
By Usman Fazil
In the traditional web, user data is stored on centralized storage servers that have complete control over it. This control gives the operators elevated privileges that may be abused without the user’s knowledge or consent. Moreover, centralized storage may have availability issues, especially if the data is stored in just one place, which creates a single point of failure.
File Storage on the Web
The web uses location-based addressing to store and retrieve files. Say we want a cat picture, cat.png, from the domain abc.com. We request that location (i.e., abc.com/cat.png) in a web browser and get the cat picture in return. If, however, the file has been removed from the abc.com servers for whatever reason, we can no longer access the picture. Someone else on the internet may well have a copy of that same cat picture, but we have no way of finding them and fetching their copy. File names do not help either: many files on the internet share a name while their contents differ.
IPFS (the InterPlanetary File System) is a peer-to-peer protocol for file storage that uses content-based addressing instead of location-based addressing. To find a file, we do not need to know where it is (abc.com/cat.png) but rather what it contains (QmSNssW5a9S3KVRCYMemjsTByrNNrtXFnxNYLfmDr9Vaan), which is identified by a hash of the content.
A hash function creates a practically unique fingerprint for every file. To retrieve a file, we ask the network “who has this file (QmSNssW5a9S3KV...)?”, and a node on the IPFS network that has it provides it to us. We can verify the integrity of the file by hashing what we received and comparing the result with the hash we requested; if the two match, we know that the file has not been changed. Hashing also de-duplicates the network: the same content always yields the same hash, so identical content does not need to be stored twice. This reduces storage requirements and also improves the performance of the network.
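The idea of addressing, verifying, and de-duplicating content by its hash can be illustrated with a minimal Python sketch. It is a toy model under stated assumptions, not IPFS itself: the `ContentStore` class and `content_address` helper are hypothetical names, and a plain hex SHA-256 digest stands in for the `Qm...` identifiers shown above, which use a different encoding.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return a hex SHA-256 digest, used here as a stand-in for an IPFS hash."""
    return hashlib.sha256(data).hexdigest()


class ContentStore:
    """Toy in-memory store keyed by the hash of the content (hypothetical)."""

    def __init__(self):
        self.blocks = {}  # address -> content

    def add(self, data: bytes) -> str:
        address = content_address(data)
        # Identical content maps to the same key, so it is stored only once.
        self.blocks.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        data = self.blocks[address]
        # Integrity check: the returned bytes must hash to the requested address.
        if content_address(data) != address:
            raise ValueError("content does not match the requested address")
        return data


store = ContentStore()
first = store.add(b"cat picture bytes")
second = store.add(b"cat picture bytes")  # same content, same address
other = store.add(b"dog picture bytes")   # different content, different address
```

Adding the same bytes twice returns the same address and leaves a single stored copy, while any change to the stored bytes makes `get` fail the integrity check.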
How IPFS Stores Files
Files are stored as IPFS objects. An IPFS object is a data structure that consists of:
- Data: a blob that can store up to 256 KB.
- Links: an array of links to other IPFS objects.
If our file is bigger than 256 KB, it is split up and stored in multiple IPFS objects, and an additional object with no data of its own is created that links to all the other objects of the file, as shown in the figure below.
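The splitting scheme described here can be sketched in Python as follows. This is an illustration only: `IPFSObject`, `add_file`, and `read_file` are hypothetical names, the store is a plain dictionary, and the way an object is hashed (SHA-256 over a JSON encoding) is an assumption made for the sketch rather than the format IPFS actually uses.

```python
import hashlib
import json
from dataclasses import dataclass, field

CHUNK_SIZE = 256 * 1024  # 256 KB, the per-object data limit described above


@dataclass
class IPFSObject:
    """Toy model of an IPFS object: a data blob plus links to other objects."""
    data: bytes = b""
    links: list = field(default_factory=list)  # addresses of linked objects

    def address(self) -> str:
        # Hash both data and links so the address covers the whole object.
        payload = json.dumps({"data": self.data.hex(), "links": self.links})
        return hashlib.sha256(payload.encode()).hexdigest()


def add_file(content: bytes, store: dict) -> str:
    """Store content as one or more objects and return the root address."""
    if len(content) <= CHUNK_SIZE:
        obj = IPFSObject(data=content)
        store[obj.address()] = obj
        return obj.address()

    links = []
    for start in range(0, len(content), CHUNK_SIZE):
        chunk = IPFSObject(data=content[start:start + CHUNK_SIZE])
        store[chunk.address()] = chunk
        links.append(chunk.address())

    # The root holds no data of its own; it only links the chunks in order.
    root = IPFSObject(links=links)
    store[root.address()] = root
    return root.address()


def read_file(address: str, store: dict) -> bytes:
    """Reassemble a file by following links from the given object."""
    obj = store[address]
    return obj.data + b"".join(read_file(link, store) for link in obj.links)


store = {}
# 600 KB of non-repeating bytes, so every chunk has different content.
big = b"".join(i.to_bytes(4, "big") for i in range(150 * 1024))
root = add_file(big, store)
```

In this sketch a 600 KB file becomes three data objects plus one data-less root object whose links list the chunks in order, and `read_file(root, store)` returns the original bytes.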
IPFS works as immutable storage: once something is added to the network it cannot be changed, because changing the file changes its hash. So how do we update a file? For this, IPFS supports versioning modeled on Git, the version control system widely used in the open source community in particular. IPFS has commit objects, which keep track of all the versions of a file since it was created. Every time we add a file to the IPFS network, a commit object is made for that file, and when we update the file, a new commit object is created that points to the file’s previous commit object, as shown in the figure below.
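The chain of commit objects can be pictured with a short Python sketch. The `Commit` class and the `commit_version` and `history` helpers are hypothetical, and the fields and hashing are assumptions chosen for illustration; the sketch only shows the idea of each new version pointing back at the previous one.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    """Toy commit object: points at a file version and at the previous commit."""
    object_hash: str              # address of the file content for this version
    parent: Optional[str] = None  # address of the previous commit, if any

    def address(self) -> str:
        payload = f"{self.object_hash}:{self.parent}".encode()
        return hashlib.sha256(payload).hexdigest()


def commit_version(content: bytes, parent: Optional[str], commits: dict) -> str:
    """Record a new version of a file and return the new commit's address."""
    commit = Commit(hashlib.sha256(content).hexdigest(), parent)
    commits[commit.address()] = commit
    return commit.address()


def history(head: str, commits: dict) -> list:
    """Walk parent pointers from the newest commit back to the first."""
    versions = []
    current = head
    while current is not None:
        commit = commits[current]
        versions.append(commit.object_hash)
        current = commit.parent
    return versions


commits = {}
v1 = commit_version(b"hello", None, commits)
v2 = commit_version(b"hello world", v1, commits)
```

Nothing is overwritten when the file is updated: the old content and its commit keep their addresses, and `history(v2, commits)` lists the content hashes from newest to oldest.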
Does My File Live Forever on the Network?
No. Only important files are kept on the network; unimportant files are removed by the garbage collector. Importance is determined by “pinning”: pinning a file marks it as important so that it persists, while unpinned files are only cached temporarily.
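The interaction between pinning and garbage collection can be sketched as follows. This is a toy model under stated assumptions: the `Node` class and its methods are hypothetical and do not mirror the API of any real IPFS implementation, and when garbage collection actually runs is left out entirely.

```python
import hashlib


class Node:
    """Toy node: a block cache plus a set of pinned addresses."""

    def __init__(self):
        self.blocks = {}     # address -> content
        self.pinned = set()  # addresses protected from garbage collection

    def add(self, content: bytes) -> str:
        address = hashlib.sha256(content).hexdigest()
        self.blocks[address] = content
        return address

    def pin(self, address: str) -> None:
        self.pinned.add(address)

    def collect_garbage(self) -> list:
        """Delete every unpinned block and return the removed addresses."""
        removed = [a for a in self.blocks if a not in self.pinned]
        for address in removed:
            del self.blocks[address]
        return removed


node = Node()
keep = node.add(b"important file")
drop = node.add(b"temporary file")
node.pin(keep)
removed = node.collect_garbage()
```

After the collection pass only the pinned block remains on the node; the unpinned one was merely cached and is gone.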
Issues with IPFS
Now let’s talk about the challenges associated with IPFS. One of the biggest issues with IPFS is keeping files available. IPFS promises permanence but it does not promise persistence: a content address never changes its meaning, but the content behind it can only be retrieved while some node still stores it. Suppose Alice makes a file available and Bob accesses it. If Alice then goes offline, Bob may still be able to access the file, provided the garbage collector has not yet deleted his cached copy. If Bob has pinned the file, on the other hand, it remains accessible even if Alice’s node goes down. Persistence therefore depends on pinning, which makes pinning a major problem. As of this writing, there are more than a few public pinning services, so persistence comes at a cost.
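The Alice and Bob scenario can be played through in a small Python simulation. Everything here is a hypothetical simplification: the `Peer` class, the `fetch` function, and the third peer Carol are introduced only for illustration, and peer discovery is reduced to looping over a list.

```python
import hashlib


class Peer:
    """Toy peer: holds blocks, may be offline, and may pin what it holds."""

    def __init__(self, name: str):
        self.name = name
        self.online = True
        self.blocks = {}     # address -> content
        self.pinned = set()  # addresses that survive garbage collection

    def collect_garbage(self) -> None:
        self.blocks = {a: c for a, c in self.blocks.items() if a in self.pinned}


def fetch(address: str, requester: Peer, peers: list) -> bytes:
    """Ask every other online peer for the content and cache it locally."""
    for peer in peers:
        if peer is not requester and peer.online and address in peer.blocks:
            content = peer.blocks[address]
            if hashlib.sha256(content).hexdigest() != address:
                continue  # ignore a peer that returns the wrong content
            requester.blocks[address] = content  # cached, but not pinned
            return content
    raise LookupError("no online peer has this content")


alice, bob, carol = Peer("alice"), Peer("bob"), Peer("carol")
peers = [alice, bob, carol]

content = b"cat picture bytes"
address = hashlib.sha256(content).hexdigest()
alice.blocks[address] = content
alice.pinned.add(address)

fetch(address, bob, peers)               # Bob now holds a cached copy
alice.online = False                     # Alice's node goes down
from_bob = fetch(address, carol, peers)  # still served, from Bob's cache
```

While Bob’s cached copy exists, the content stays reachable without Alice. Once the unpinned copies are garbage collected and Alice is still offline, `fetch` raises `LookupError`, which is the gap that pinning closes.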
Another limitation is the actual sharing of files. You have to share a file’s link (its content address) with someone else through a traditional communication channel such as instant message, email, Skype, or Slack. In other words, file sharing is not built into the system. People have developed crawlers and search engines for the network, but it will likely take some time before they catch on.