Resilient Sync for Local First

When I started programming in the 80s, the situation was simple: data was written to a file on a local disk. If files needed to be exchanged, they were copied to a floppy disk, which was simply inserted into another computer where they could be edited.

Later, floppy disks and CD drives disappeared and files were either exchanged by e-mail or the data was stored directly on a server. Even Microsoft Word moved to the cloud at some point.

However, moving data from the local computer to the cloud created new problems. What if the provider closed its service? But data loss was not the only risk: being locked in by the provider, because the data could no longer be retrieved from the service, increasingly became a problem for some and a business model for others.

Local-First

The local-first movement wants to bring the data back to the user while retaining the advantages of the internet. With CRDTs, a robust and widely accepted solution has been found for the concurrent editing of data, but not yet for the exchange of data. This is because it is not (yet) possible to do entirely without services on a server. Even with peer-to-peer, where devices mostly communicate directly with each other, the connection must initially be mediated by servers.

In an ideal local-first world, the following criteria should be met with regard to the exchange of data:

  1. data is processed and stored locally (offline).
  2. the sync can be interrupted for long periods of time; data can still be edited locally during this time and will be identical everywhere after the next sync (CRDT).
  3. data should not be readable in transit (E2EE).

And I would also like to add:

  1. the whole thing should also work with the technology of the 80s.

Why? For resilience: there are many ways things can go offline.

Resilient Sync

For this reason, I propose a data exchange format that works both in the file system and on the Internet. It should be simple enough that it can be implemented on top of any underlying storage medium or service.

Log of changes

To start with, each client writes a simple, continuous log of its changes. We call the position of a change its index; it starts at 0 and increases in whole numbers: 0, 1, 2, 3, .... Usually these changes are CRDT compliant, but this is not significant for the protocol. The content can also be stored in encrypted form. Let’s call this content data.

Each client is given a unique identifier, usually a UUID; we call it clientId. The change logs are stored keyed by this clientId. We will call a collection of such logs a workspace.
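To make this concrete, the structures could be sketched in TypeScript roughly as follows; the names ChangeEntry and Workspace are illustrative and not part of the proposal:

// One entry in a client's append-only change log.
// `data` may be CRDT changes, possibly encrypted; the protocol does not care.
interface ChangeEntry {
  clientId: string // unique identifier of the writing client, e.g. a UUID
  index: number // consecutive position: 0, 1, 2, 3, ...
  data: Uint8Array | string
}

// A workspace is a collection of such logs, keyed by clientId.
type Workspace = Map<string, ChangeEntry[]>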

Each entry can be enriched with further data, such as a timestamp, which can be useful if we want to replay the changes historically or implement a kind of endless “undo”.

It is also conceivable that each new entry contains a hash of the previous entry, in order to ensure consistency in the style of a blockchain; further refinements in the direction of a Merkle tree are possible as well. A hash of the content of the current entry can also contribute to data integrity.
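As a minimal sketch of such a hash chain, using the Web Crypto API and assuming a log entry shape with an optional prevHash field (the helper names are illustrative):

// Hex-encode the SHA-256 digest of the given bytes (Web Crypto API).
async function sha256Hex(bytes: Uint8Array): Promise<string> {
  const digest = await crypto.subtle.digest('SHA-256', bytes)
  return Array.from(new Uint8Array(digest))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('')
}

// Append a new entry whose prevHash points to the hash of the previous data,
// so that gaps or tampering in the log become detectable.
async function appendWithHash(
  log: { data: Uint8Array, prevHash?: string }[],
  data: Uint8Array,
): Promise<void> {
  const prev = log[log.length - 1]
  const prevHash = prev ? await sha256Hex(prev.data) : undefined
  log.push({ data, prevHash })
}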

Assets, blobs, the big binary chunks

The reality is that there is also larger data that rarely changes: images, videos, audio, in other words files. These are not part of the content changes and would quickly bring data synchronization to a standstill. In addition, they are not always needed immediately and could also be loaded “on demand”. I therefore suggest treating these files separately from the changes. Let’s call them assets.

But here, too, the data should be stored per clientId and in ascending order by index. The change entries can thus refer to an asset, e.g. with a data entry in the form of a URL that contains all the necessary information:
asset:///<clientId>/<index>/<filename>?size=<sizeInBytes>&type=<mimeType>&hash=<checksumOfContents>. An example could look like this: asset:///abc/1/test.txt?size=100&type=text%2Fplain&hash=1a2b3c.
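Since this is an ordinary URL, it can be handled with the standard URL class. A small sketch of parsing such a reference; the field names of the result are illustrative:

// Parse an asset URL of the form
// asset:///<clientId>/<index>/<filename>?size=...&type=...&hash=...
function parseAssetUrl(value: string) {
  const url = new URL(value)
  if (url.protocol !== 'asset:')
    throw new Error('Not an asset URL')
  const [clientId, index, filename] = url.pathname.slice(1).split('/')
  return {
    clientId,
    index: Number(index),
    filename: decodeURIComponent(filename),
    size: Number(url.searchParams.get('size')),
    type: url.searchParams.get('type') ?? undefined,
    hash: url.searchParams.get('hash') ?? undefined,
  }
}

// parseAssetUrl('asset:///abc/1/test.txt?size=100&type=text%2Fplain&hash=1a2b3c')
// -> { clientId: 'abc', index: 1, filename: 'test.txt',
//      size: 100, type: 'text/plain', hash: '1a2b3c' }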

Benefits

The key advantage of this method is that we always know where the next data will appear: if a client has just written index 123, the next one will be 124.

Why is this important? For the following reasons:

  1. we are not dependent on being notified of new data (“push”), but can also ask for it ourselves (“pull”); a sketch of this follows after the list.
  2. we can load individual entries without knowing anything about their content. A sync can even take place without an intermediary having to “understand” the data.
  3. we notice immediately if data is missing and can request it again.
  4. the data can be replicated as often as required; there does not have to be just one sync source. This also makes it well suited as a backup.
  5. clients that have no direct connection to each other can rely on “dumb” copy operations.
  6. data exchange can be documented easily and comprehensibly, which can be important for legally compliant data storage, e.g. for tax and compliance purposes.
  7. there are no conflicts when storing the data, because entries are only ever appended, never modified or deleted (“append only”).
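A sketch of such a “pull”, assuming a hypothetical loadChange(clientId, index) function that returns the entry at that position, or undefined if it does not exist yet:

// Pull all changes of a remote client that we have not seen yet.
// `localCount` is the number of entries we already hold for that clientId,
// so the next expected index is exactly `localCount`.
async function pullChanges(
  clientId: string,
  localCount: number,
  loadChange: (clientId: string, index: number) => Promise<Uint8Array | undefined>,
): Promise<Uint8Array[]> {
  const pulled: Uint8Array[] = []
  let index = localCount
  while (true) {
    const data = await loadChange(clientId, index)
    if (data === undefined)
      break // nothing newer available yet, try again later
    pulled.push(data)
    index += 1
  }
  return pulled
}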

Databases

Let’s start with the implementation as a database. As described, a table with the following fields would be sufficient:

  • index: Integer
  • clientId: String or integer
  • data: String or binary
  • timestamp: Integer (optional)
  • prevHash: String or integer (optional)

This would look similar for the assets.

“Database” here can refer both to the client’s local storage, e.g. IndexedDB, and to the storage on a synchronization server.
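For the local side in a browser, a minimal sketch of such tables in IndexedDB could look like this; the database and store names are illustrative:

// Open an IndexedDB database with one object store per table.
// The combination of clientId and index is unique and serves as the key.
function openSyncDb(): Promise<IDBDatabase> {
  return new Promise((resolve, reject) => {
    const request = indexedDB.open('resilient-sync', 1)
    request.onupgradeneeded = () => {
      const db = request.result
      const changes = db.createObjectStore('changes', { keyPath: ['clientId', 'index'] })
      changes.createIndex('byClient', 'clientId')
      db.createObjectStore('assets', { keyPath: ['clientId', 'index'] })
    }
    request.onsuccess = () => resolve(request.result)
    request.onerror = () => reject(request.error)
  })
}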

Filesystems

The data is stored in a directory in a file system. Metadata, such as the clientId of the creator or details of the encryption used, is stored in a JSON file called index.json.
In addition, there is a directory for each client, named after its clientId.

These in turn each contain two directories:

  1. changes
  2. assets

The changes are recorded in changes and numbered consecutively. The same is done in assets with the binary data described above.
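For a client with the (hypothetical) clientId abc, and ignoring for a moment the folder distribution described below, the layout might look like this:

index.json
abc/
  changes/
    0
    1
    2
  assets/
    0
    1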

Modern file systems have no significant restrictions on the number of files per directory, but it can’t hurt to limit the number anyway. The following algorithm in TypeScript ensures an even distribution:

/**
 * Distribute files, named by natural numbers, in a way that each folder
 * contains at most `maxEntriesPerFolder` subfolders or files. Returns a list
 * of names, where the last one is the file name; all others are folder names.
 *
 * Example: `distributedFilePath(1003)` results in `['2', '1', '3']`, which
 * could be translated to the file path `2/1/3.json`.
 */
export function distributedFilePath(index: number, maxEntriesPerFolder: number = 1000): string[] {
  if (index < 0)
    throw new Error('Only numbers >= 0 supported')
  const names: string[] = []
  do {
    names.unshift((index % maxEntriesPerFolder).toString())
    index = Math.floor(index / maxEntriesPerFolder)
  } while (index > 0)
  names.unshift(names.length.toString())
  return names
}

Source: zeed Framework

Online Services and Peer-To-Peer

For online services, the database approach, the file system approach, or a suitable mix of both can be selected, depending on what the chosen service supports best. For Dropbox or WebDAV, for example, this would be the file system approach; for Apple CloudKit, the database approach.

But a simple service of your own is also conceivable, with REST, WebSocket or other suitable interfaces.
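As an illustration only, such a service could map the log directly onto URLs. The route layout below is a hypothetical example, not a fixed part of the proposal; the resulting function could be plugged straight into a pull loop like the one sketched above:

// Hypothetical REST mapping:
//   GET /<clientId>/changes/<index> -> entry data, or 404 if not written yet
//   PUT /<clientId>/changes/<index> -> store an own entry
async function loadChangeHttp(
  base: string,
  clientId: string,
  index: number,
): Promise<Uint8Array | undefined> {
  const response = await fetch(`${base}/${clientId}/changes/${index}`)
  if (response.status === 404)
    return undefined // not written yet
  if (!response.ok)
    throw new Error(`Unexpected status ${response.status}`)
  return new Uint8Array(await response.arrayBuffer())
}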

There is also no reason not to synchronize the data via peer-to-peer (P2P) or another local communication channel. After all, the data is identical everywhere and can therefore reach other clients via the fastest route. Redundancy is an advantage here and does not make things any more complicated. With CRDTs, the order and repeated application of changes is, at least in theory, also unproblematic.

Refinements

There is still potential for improvement in some areas:

  • Control of the data size per log entry: collecting several changes into larger packets, or splitting a change into several packets if it becomes too large.
  • Compression or summarization of data.
  • Announcement of new clients, e.g. through special log entries, verified using cryptographic methods.
  • By adding a logical clock, such as the Lamport clock, entries can be ordered logically, thereby improving the chronology across clients (a sketch follows after this list).
  • Writing the data into a single file per client for reasons of resource optimization.
  • Through the clever use of cryptographic means, it may even be possible to implement rights management (Cryptree).
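A minimal sketch of such a Lamport clock, independent of the rest of the protocol:

// Lamport clock: a logical counter that only ever moves forward.
class LamportClock {
  private time = 0

  // Call for every local change; returns the logical timestamp for the new entry.
  tick(): number {
    this.time += 1
    return this.time
  }

  // Call when an entry with a logical timestamp arrives from another client.
  receive(remoteTime: number): number {
    this.time = Math.max(this.time, remoteTime) + 1
    return this.time
  }
}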

Outlook

I have been using this technique in my apps for several years, for example in the now completed project Onepile. New projects using this approach will be published soon.

Simplicity and flexibility seem to me to be the biggest advantages of this approach. As a result, it should also be future-proof and be able to adapt quickly to new technical conditions.

The following diagram is an example of an ecosystem for a web app:

[Diagram: R2v3.png]

Published on June 24, 2024

 