View on GitHub

Oak ZeroMQ Nodestore

Mirror of Apache Jackrabbit Oak

Oak ZeroMQ Nodestore

The ZeroMQ nodestore is a distributed nodestore for Apache Oak using ZeroMQ as its communication layer.

NodeStates

We adopt the Oak NodeState model also for the storage backend. Nodestates contain properties and references to their child nodestates. They are serialised and identified by the hash of their serialisation (content-addressable storage). Therefore the serialised nodestate is the basic unit of storage (key/value pair or document) and can be treated like any other blob.

Oak NodeState Model

Journal

A reference to the root nodestate is stored separately in a journal head and defines a repository. Several root nodestates and therefore journals/repositories can share the same blobstore bucket. These root nodestates can even belong to different tenants. A repository can be cloned in no time by creating a new journal referencing the same root nodestate.

The log

Changes are written to a log, which gives them a global order. When a commit is merged, the blobs and nodestates are written first, then the new journal entry. All participating Oak instances listen to and process the log and its commits, including their own, in the same manner, and therefore eventually have the same state. In the case of conflicts, all instances use the same resolution mechanism.

In addition to Oak instances, other processes listen to the log:

Services

Security

The current PoC implementation does not use any encryption and therefore assumes that blobstore buckets and the logs are kept separate between tenants. If tenants share a common blobstore and/or the log, the data would need to be encrypted before sending it to the log and decrypted when read by the Oak nodestore. This is what we’re aiming for so the PoC should do that eventually and check the performance impact. An alternative would be a per-tenant blob cache doing the encryption and decryption.

The nice thing about using ZeroMQ for communication is that it provides us with a pluggable architecture where such a cache can be easily added, as long as we keep the communication protocol stable.

PoC implementation

Communication hub

The “comm-hub” process is the communication backbone. It opens two sockets, one for writing messages to the queue, and one for reading.

The message format is a bit verbose but makes debugging easier. Fields are separated by a space. The first field starts with a service identifier (currently “read” or “write”), a dash “-“, and a type (“req” for request, “rep” for reply). The second field identifies the requesting thread («pid»@«host»-«threadId»). The third field should be a message id set by the requester and echoed in the reply, but I think that’s not fully implemented yet. In a reply message, the fourth field indicates the message status: «N» for “not found”, «F» for “fault/error”, «C» for “continued” (this is a multi-part message and more parts will follow), or «E» “end” (this is the end of a multi-part message or the only message part). The remaining fields are service dependent.

Reader service

The reader service provides blobs/nodestates («blob» and «hasblob») and the current root of a journal («journal»). For example

read-req 33624@Axels-MacBook-Pro.local-306 2 hasblob 8578AEA22CF6ADE2C845ECE9C375B4DA
read-rep 33624@Axels-MacBook-Pro.local-306 2 E false

Writer service

The writer service stores blobs/nodestates and the journal(s) in the blobstore. For efficiency the «braw» command is used most of the time. It contains the blob in binary format in the last field of the message. For blobs exceeding the max message size, they are split into b64 encoded chunks and sent to the queue using «b64+» (start with blobId as parameter), «b64d» (with b64 encoded data), and «b64!» (end). Upon success the service responds with an empty «E» message.

A journal entry is written as

write-req 33624@Axels-MacBook-Pro.local-306 618 journal golden DE6E888A0CCF62B208982F63CC7BB6FA 5B77940AA4269AB940431E4C54D2D786

where the «journal» command is followed by the name of the journal/repository, followed by the new root, followed by the old (expected) root.

Nodestore implementation

The frontend, the actual NodeStore implementation, consists of two parts.

The “merge” method uses the builder to recursively diff against the current state and generate new nodestates along the way. These new nodestates are sent to the queue (writer service). Finally the new root nodestate is written and its id is sent to the queue (writer service) as the new journal root. The thread then waits to be notified if the merge was successful or not, which is the responsibility for the second part.

The second part is a thread which continuously listens on the queue for new journal messages. If it sees one and the journal/repository name matches its own, it tries to apply it (rebasing if necessary). If this fails (conflict) and the commit is not its own, it just skips it. If the commit happens to be its own, it notifies the merge thread about success or failure to apply it.

Build it

git clone https://github.com/ahanikel/jackrabbit-oak
cd jackrabbit-oak
mvn -DskipTests clean install

Run the backend processes

cd oak-run/target

Communication hub

Open a new terminal window and run

java -jar oak-run-1.42.0-zeromq.jar comm-hub tcp://localhost:8000 tcp://localhost:8001

Reader service

Open a new terminal window and run

java -jar oak-run-1.42.0-zeromq.jar simple-blob-reader-service simple:///tmp/blobstore tcp://localhost:8000 tcp://localhost:8001

assuming that /tmp/blobstore does not exist.

Writer service

Open a new terminal window and run

java -jar oak-run-1.42.0-zeromq.jar simple-blob-writer-service simple:///tmp/blobstore tcp://localhost:8000 tcp://localhost:8001

ZeroMQ Listener for debugging

If you want to see live what messages are being sent back and forth, open a new terminal and run

java -jar oak-run-1.42.0-zeromq.jar simple-queue-listener tcp://localhost:8001

but beware that this will produce lots of output and will slow down the initial sling start considerably. Press CTRL-C when you’ve seen enough.

Run the frontend it with Sling

cd org-apache-sling-starter
mvn -DskipTests clean install
cd target
# edit slingfeature-tmp/feature-oak_tar.json (TBD)
JAVA_HOME=/usr/local/opt/openjdk@8 $JAVA_HOME/bin/java -jar dependency/org.apache.sling.fe
ature.launcher.jar -f slingfeature-tmp/feature-oak_tar.json