{"id":5126,"date":"2020-10-01T08:45:12","date_gmt":"2020-10-01T07:45:12","guid":{"rendered":"https:\/\/sandbox.weareadaptive.com\/?p=5126\/"},"modified":"2020-10-10T13:56:02","modified_gmt":"2020-10-10T12:56:02","slug":"building-fault-tolerant-low-latency-exchanges","status":"publish","type":"post","link":"https:\/\/sandbox.weareadaptive.com\/fr\/2020\/10\/01\/building-fault-tolerant-low-latency-exchanges\/","title":{"rendered":"Building fault-tolerant, low-latency exchanges"},"content":{"rendered":"<div style=\"text-align: justify;\">\n<p>This article is the first in a series where we\u2019ll be sharing our experience of building fault-tolerant, highly-performant marketplaces.<\/p>\n<p>Before we discuss how we build bespoke marketplaces at Adaptive, let\u2019s consider an alternative: buying an off-the-shelf solution.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5140\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image11-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image11.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image11-300x225.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image11-768x576.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image11-667x500.png 667w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Building a bespoke solution has the obvious benefit that we can tailor it to match requirements. For example, in a marketplace, we might introduce a novel matching algorithm or order type that domain experts think will improve liquidity. These are the kinds of customisation that are impossible, or at least prohibitively expensive, to achieve with an off-the-shelf solution whilst retaining the underlying intellectual property.<\/p>\n<p>On the other hand, building a bespoke solution is riskier. 
Building marketplaces typically requires a large team with in-depth knowledge both at the domain level and at the infrastructure level. An organisation may need to hire a team if it doesn\u2019t have enough capacity in-house. Hiring great people takes time and effort. Factors like this can lengthen the time to market beyond what it takes to develop a solution from scratch. What if a competing marketplace innovates first? How many people are needed to maintain the solution after its initial delivery? Can those people be retained?<\/p>\n<p>At Adaptive, we offer what we think is a compelling third option: a hybrid approach.<\/p>\n<p>Between 2012 and 2017, together with various clients, we built many trading systems, including exchanges and RFQ workflows, across different asset classes using similar architectures (read more about them in <a href=\"https:\/\/sandbox.weareadaptive.com\/2020\/05\/27\/aeronhydra-part-ii\/\" target=\"_blank\" rel=\"noopener noreferrer\">this blog post<\/a>). In the most recent of these deliveries, we had a budget of 100 microseconds to respond to a NewOrderSingle FIX message, i.e., a \u201cplace order\u201d instruction, with execution reports at the 99th percentile. Below is a bird\u2019s-eye view of the key logical components in this system. 
We\u2019ll revisit it later in more detail.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5133\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image4-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image4-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image4-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image4-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image4-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image4.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Over those five years, it became increasingly clear which parts of the architecture remained constant between projects. To offer better value to our clients and to accelerate their deliveries, we wanted to stop building these parts from scratch on each new project; therefore, back in 2017, Adaptive began to invest in a solution: Hydra Platform. Hydra Platform provides an opinionated architecture and building-blocks for the typical logical components that one would find in a trading system. 
It allows teams to focus on business logic and deliver value to the client straight away without having to worry about low-level details or cross-cutting concerns like how fault-tolerance or disaster recovery will work.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5134\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image5-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image5.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image5-300x225.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image5-768x576.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image5-667x500.png 667w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Hydra Platform has been, and continues to be, a compelling option for our clients.<\/p>\n<ul>\n<li>It supports bespoke functionality and customisation.<\/li>\n<li>It de-risks project delivery.\n<ul>\n<li>Projects require smaller development teams.<\/li>\n<li>Projects use a proven architecture where we\u2019ve already established patterns for cross-cutting concerns like fault-tolerance.<\/li>\n<li>Projects cost less up-front than designing and developing the underlying patterns and infrastructure from scratch.<\/li>\n<\/ul>\n<\/li>\n<li>It reduces the time to market.\n<ul>\n<li>From the very first day, project teams focus on the business problems they are solving, i.e., the essential complexity, rather than building the underlying infrastructure, i.e., the <a href=\"http:\/\/faculty.salisbury.edu\/~xswang\/Research\/Papers\/SERelated\/no-silver-bullet.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">accidental complexity<\/a>.<\/li>\n<li>Smaller development teams require less effort to hire.<\/li>\n<li>Projects can leverage building-blocks for the most common logical components in marketplaces (which 
makes Hydra Platform an excellent fit for building RFQ engines, exchanges, internalisation engines, etc.).<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>In the following sections, we\u2019ll describe how we glue Hydra Platform components together while preserving fault-tolerance across the whole system. By the end, we\u2019ll have a map of the architecture of projects built with Hydra Platform. In future articles, we will zoom into particular areas of this map, discuss how we\u2019ve distilled our experience to build specific Hydra Platform components, and demonstrate how easy they are to use.<\/p>\n<p>To find out more about the evolution of this architecture and why we\u2019ve used it over more traditional architectures, please read <a href=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2017\/04\/Application-Level-Consensus.pdf\" target=\"_blank\" rel=\"noopener noreferrer\">our whitepaper on application-level consensus<\/a>.<\/p>\n<h2>The clustered heart of the architecture<\/h2>\n<p>A clustered engine sits at the heart of marketplaces built on Hydra Platform.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5137\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image8-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image8-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image8-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image8-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image8-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image8.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Each node of the clustered engine contains several stateful, application-level modules. 
These might be reference data, risk management, and a matching engine, for example. A module processes a consensus-agreed sequence of commands using <em>deterministic<\/em> business logic to generate a sequence of events that is identical across all nodes.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5142\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image13-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image13-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image13-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image13-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image13-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image13.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Modules behave a bit like the <em>idealised<\/em>, <a href=\"https:\/\/en.wikipedia.org\/wiki\/Pure_function\" target=\"_blank\" rel=\"noopener noreferrer\">pure function<\/a> signature below, which is similar to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Fold_(higher-order_function)\" target=\"_blank\" rel=\"noopener noreferrer\">signature for a fold<\/a>. 
On a single thread, modules process each command in the context of the current state to produce a sequence of events and the next state.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5135\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image6-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image6-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image6-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image6-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image6-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image6.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Over the last decade or so, similar, log-based patterns have emerged across the whole stack. For example, <a href=\"https:\/\/martinfowler.com\/eaaDev\/EventSourcing.html\" target=\"_blank\" rel=\"noopener noreferrer\">event-sourcing<\/a> has become popular on the back-end, and <a href=\"https:\/\/redux.js.org\/introduction\/core-concepts\" target=\"_blank\" rel=\"noopener noreferrer\">Redux<\/a> and <a href=\"https:\/\/elmprogramming.com\/model-view-update-part-1.html\" target=\"_blank\" rel=\"noopener noreferrer\">Elm<\/a>-based applications have become popular on the front-end. 
Based on our experience, we attribute this to two main factors.<\/p>\n<ul>\n<li>Deterministic, single-threaded code is easy to understand, especially when paired with clearly defined inputs, outputs and state.<\/li>\n<li><a href=\"https:\/\/en.wikipedia.org\/wiki\/Time_travel_debugging\" target=\"_blank\" rel=\"noopener noreferrer\">Time-travel debugging<\/a> makes it easy to track down and fix bugs, and it is trivial to implement in systems where one can replay commands\/events\/actions.<\/li>\n<\/ul>\n<p>We place our modules at the core of a <a href=\"https:\/\/en.wikipedia.org\/wiki\/Hexagonal_architecture_(software)\" target=\"_blank\" rel=\"noopener noreferrer\">hexagonal architecture<\/a> to eliminate infrastructure-level concerns from our business logic. This makes testing straightforward and reduces lock-in to messaging infrastructure. We implement adapters for a few simple interfaces that our business logic exposes. These adapters plug the business logic into Hydra Platform so that it runs with fault-tolerance.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5131\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image2-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image2-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image2-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image2-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image2-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image2.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3>Divergence protection<\/h3>\n<p>Earlier, we described modules as processing \u201ceach command in the context of the current state to produce a sequence of events and the next state\u201d. 
When clustered business logic is <em>accidentally<\/em> not deterministic, it may cause divergence of these outputs, i.e., the next state and the emitted events might differ across cluster nodes. Non-determinism creeps in when a developer uses something, such as system time, that differs across nodes or between replays of the log.<\/p>\n<p>To guard against divergence, we use two strategies in Hydra Platform. First, we use static-analysis tooling to nudge developers into writing deterministic code. Second, we deploy monitoring software to detect divergence at runtime.<\/p>\n<p>&nbsp;<\/p>\n<h3>High-performance replication, consensus and distribution<\/h3>\n<p>Hydra Platform uses a highly-efficient <a href=\"https:\/\/raft.github.io\/\" target=\"_blank\" rel=\"noopener noreferrer\">Raft<\/a> implementation, built on top of <a href=\"https:\/\/github.com\/real-logic\/aeron\" target=\"_blank\" rel=\"noopener noreferrer\">Aeron<\/a>, to replicate, persist and sequence the commands that the FIX and web gateways send to the engine. It also uses Aeron to distribute and replay events to downstream components efficiently.<\/p>\n<p>Performance and protocol-design gurus <a href=\"https:\/\/mechanical-sympathy.blogspot.com\/\" target=\"_blank\" rel=\"noopener noreferrer\">Martin Thompson<\/a> and Todd Montgomery, of Disruptor and 29West fame respectively, built Aeron for low-latency messaging. 
To get the best out of it, we use zero-allocation, zero-copy messaging codecs similar to <a href=\"https:\/\/github.com\/real-logic\/simple-binary-encoding\" target=\"_blank\" rel=\"noopener noreferrer\">SBE<\/a>, <a href=\"https:\/\/google.github.io\/flatbuffers\/\" target=\"_blank\" rel=\"noopener noreferrer\">Flatbuffers<\/a> and <a href=\"https:\/\/capnproto.org\/\" target=\"_blank\" rel=\"noopener noreferrer\">Cap\u2019n Proto<\/a> on top.<\/p>\n<p>Let\u2019s look at some numbers!<\/p>\n<p>We recently ran a series of benchmarks to measure round-trip latency between different kinds of Hydra Platform components. We ran our benchmarks on:<\/p>\n<ul>\n<li>\u201cm5d.metal\u201d EC2 instances,<\/li>\n<li><a href=\"https:\/\/docs.google.com\/document\/d\/1RgGKM5KZ1NFJvIAgEint-Q7G0nA2I2OURY6nC15sojg\/edit?usp=sharing\" target=\"_blank\" rel=\"noopener noreferrer\">bare metal<\/a> with kernel-based networking, and<\/li>\n<li><a href=\"https:\/\/docs.google.com\/document\/d\/1RgGKM5KZ1NFJvIAgEint-Q7G0nA2I2OURY6nC15sojg\/edit?usp=sharing\" target=\"_blank\" rel=\"noopener noreferrer\">bare metal<\/a> with kernel bypass.<\/li>\n<\/ul>\n<p>In any system, latency measurements will change as the load upon it changes. Therefore, it is essential to consider the load a system was under when its latency was measured and not look at latency figures in isolation. 
We ran each benchmark under a load of 100,000 round-trips per second from a single client, where each message was 100 bytes in length.<\/p>\n<p>Between non-clustered Hydra components, on two machines, our measurements showed that 99.99% of all round-trips, that is, two hops, were less than:<\/p>\n<ul>\n<li>200 microseconds on EC2,<\/li>\n<li>50 microseconds on bare metal, and<\/li>\n<li>23 microseconds on bare metal with kernel bypass.<\/li>\n<\/ul>\n<p>Between a Hydra Platform engine clustered across three machines and its client on another, our measurements showed that 99.99% of all round-trips, including hops for consensus, were less than:<\/p>\n<ul>\n<li>175 microseconds on bare metal, and<\/li>\n<li>73 microseconds on bare metal with kernel bypass.<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>Limitations on the size of the clustered engine state<\/h3>\n<p>The programming model embraced by Hydra Platform limits the business logic inside our cluster to processing commands using only its (\u201csnapshottable\u201d) in-memory state. The finite nature of main memory imposes an upper limit on the size of this state. Therefore, we avoid storing unbounded collections, e.g., trade executions, inside it. Instead, we record events to disk using our event log.<\/p>\n<p>As our business logic is on a single thread, expensive queries slow down the processing of latency-sensitive commands like cancelling an order. Wherever it is sensible, we avoid executing queries inside the clustered engine.<\/p>\n<p>If a downstream component needs data, but we\u2019re not willing to store it in the clustered engine or query it, where do we obtain such data? Let\u2019s look at an example.<\/p>\n<p>The admin gateway and web trading gateway both present live pages of tabular data based on user-defined queries using Hydra Platform\u2019s LiveQuery. They source real-time data from the streaming tail of the event log. 
However, this streaming tail doesn\u2019t give enough information to answer questions like, \u201cwhich ten BTC\/USD trades had the highest prices in the last 48 hours?\u201d. Answering questions like this requires historical data too.<\/p>\n<p>While it is possible to replay two days of data in an exchange environment, we don\u2019t recommend it! Consider the following questions.<\/p>\n<ul>\n<li>From what position in our (unindexed) event log should we replay?<\/li>\n<li>How expensive is it to ingest two days\u2019 worth of data to compute the result set?<\/li>\n<li>At what rate will we receive these queries?<\/li>\n<\/ul>\n<\/div>\n<p>&nbsp;<\/p>\n<div style=\"text-align: justify;\">\n<h3>Accumulating data in read-models<\/h3>\n<p>If replaying is not an option, where do we obtain historical data? The answer is that we continuously update a data-structure to answer our queries efficiently. In <a href=\"https:\/\/martinfowler.com\/bliki\/CQRS.html\" target=\"_blank\" rel=\"noopener noreferrer\">CQRS<\/a>-speak, we call this data-structure a \u201cread-model\u201d.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5138\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image9-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image9-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image9-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image9-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image9-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image9.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>In the case of historical trade execution data, the history service acts as our read-model. It maintains an indexed table inside a relational database. 
It converts each \u201ctrade executed\u201d event that the engine appends to the event log into a row insert. We store the table in a denormalised form to avoid the computation of expensive joins when the database executes queries.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5143\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image14-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image14-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image14-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image14-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image14-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image14.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>The history service exposes an API for other components to issue queries. Upon receiving a request, it converts the query into SQL and executes the query against its database. 
It then transforms the result set into messages that it sends directly back to the component that issued the request.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5139\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image10-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image10-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image10-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image10-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image10-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image10.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3>Read-model resilience and recovery<\/h3>\n<p>We run redundant instances of the history service in an active\/active configuration to avoid introducing a single point of failure. Whenever a service writes a row out to its database, it writes a high-water mark alongside it that indicates how far into the event log the service has consumed. 
When the service restarts, it queries its database for this high-water mark and replays only the unprocessed events from the event log.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5130\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image1-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image1-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image1-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image1-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image1-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image1.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>&nbsp;<\/p>\n<h3>External connectivity<\/h3>\n<p>At the fringes of our initial component diagram, coloured in blue and red, are gateways that provide connectivity with external systems.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5141\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image12-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image12-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image12-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image12-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image12-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image12.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>In line with the cluster, we tend to use the hexagonal pattern in our gateways too. 
Again, we implement adapters over the interfaces that the business logic exposes to plug it into Hydra Platform.<\/p>\n<p>Gateways and clients vary between projects. Hydra Platform allows developers to build custom gateways and clients quickly using its building-blocks.<\/p>\n<p>There are several gateways shown in the original diagram.<\/p>\n<ul>\n<li>We built some gateways using the Hydra Platform FIX Gateway building-block.\n<ul>\n<li>The FIX Order Management (OM) Gateway allows customers to modify orders.<\/li>\n<li>The FIX Market Data (MD) Gateway gives customers access to live updates for selected order-books.<\/li>\n<li>The FIX Drop Copy Gateway integrates with customers\u2019 back-office infrastructure to support reconciling trades.<\/li>\n<\/ul>\n<\/li>\n<li>We built other gateways using the Hydra Platform Web Gateway building-block.\n<ul>\n<li>The Web Trading Gateway provides similar functionality to the FIX OM and MD gateways but exposes a JSON API over WebSocket.<\/li>\n<li>The Admin Gateway exposes an API and Web GUI for market operators to configure and observe the marketplace, e.g., they might use this functionality to list instruments or to see live orders.<\/li>\n<\/ul>\n<\/li>\n<\/ul>\n<p>&nbsp;<\/p>\n<h3>Protocol transformation<\/h3>\n<p>Gateways typically transform one protocol into another, e.g., from FIX into an efficient internal protocol and vice-versa. We often split gateways into two parts: an inbound flow and an outbound flow. 
This split is similar to the division in CQRS.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5132\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image3-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image3.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image3-300x225.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image3-768x576.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image3-667x500.png 667w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>The inbound flow receives messages from external sessions that might mutate the state of the system, e.g., a command to place an order. First, it validates these commands, e.g., checking that a place order command references a tradable instrument. Next, it transforms these commands. Externally, we may expose heavyweight types, e.g., strings to represent instrument identifiers. In contrast, internally, we might use more efficient representations, e.g., 64-bit integers, that we might need to look up in a map. Finally, it forwards the transformed commands to the clustered engine.<\/p>\n<p>The outbound flow receives events from the clustered engine. It transforms these events into messages of the external protocol, e.g., FIX messages, and disseminates them to external sessions. It might also maintain a cache of data to service queries.<\/p>\n<p>&nbsp;<\/p>\n<h3>Keeping the clustered engine lean by farming out work to gateways<\/h3>\n<p>Gateways can be useful for offloading work from the clustered engine. For example, we built an exchange-like system where we had to apply a markup to prices before disseminating them. Each customer observed slightly different prices (with tailored markups). 
We put this calculation logic in gateway instances rather than inside the clustered engine, as it allowed us to scale (more) horizontally to match the number of concurrent customer connections.<\/p>\n<p>&nbsp;<\/p>\n<h3>In-memory read-models for joining data<\/h3>\n<p>Sometimes transformations in gateways, like price markup, require more information than is available in the individual events that we transform. For example, we might need price and customer tier to calculate a price with markup. Price information but not all customers\u2019 tier information may be available on something akin to an \u201corder placed\u201d event. Therefore, we need to obtain and maintain, via the processing of events, this customer tier data separately to join together with our \u201corder placed\u201d events later. For this purpose, in addition to servicing queries, gateways sometimes maintain in-memory read-models.<\/p>\n<p>&nbsp;<\/p>\n<h3>Gateway resilience and recovery<\/h3>\n<p>Unlike the history service that is backed by a durable database, there is no high-water mark for an in-memory read-model on restart. All the state is gone. 
To seed our in-memory read-model at startup time, we \u201cprime\u201d it either by querying the clustered engine, if the data set is small and the query reasonably quick, or by requesting the data from a durable read-model like the history service.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5136\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image7-1024x768.png\" alt=\"\" width=\"1024\" height=\"768\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image7.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image7-300x225.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image7-768x576.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image7-667x500.png 667w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>We usually run redundant instances of gateways in an active\/active configuration. If we have a business requirement that only one gateway should accept connections at any time, e.g., to avoid differences in latency for modifying orders, we can run groups of gateway instances in an active\/passive configuration. Hydra Platform will detect failures of active gateways and \u201cpromote\u201d passive instances. This frees the application layer, and the application development team, from dealing with the complexity around failover orchestration.<\/p>\n<p>&nbsp;<\/p>\n<h2>Conclusion<\/h2>\n<p>We\u2019re at the end! We\u2019ve covered all of the components in our initial diagram.<\/p>\n<p>In the diagram below, we unfurl the inbound and outbound parts of our gateways, which we mentioned earlier, to represent our components slightly differently. It shows that, if we squint, we have a unidirectional data flow when we structure our gateways in this manner. 
This flow, when coupled with deterministic business logic, extends some of the benefits of the clustered engine that we mentioned earlier, such as time-travel debugging, to everything downstream of it.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-5129\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image15-1024x640.png\" alt=\"\" width=\"1024\" height=\"640\" srcset=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image15-1024x640.png 1024w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image15-300x188.png 300w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image15-768x480.png 768w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image15-667x417.png 667w, https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/image15.png 1280w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>Thank you for reading this article. We hope that you:<\/p>\n<ul>\n<li>have picked up a high-level understanding of some of the components we use to build highly-performant, fault-tolerant trading systems; and<\/li>\n<li>understand where Hydra Platform sits on the buy vs build spectrum.<\/li>\n<\/ul>\n<p>In future articles, we\u2019ll cover various aspects of Hydra Platform in more detail and show how simple its APIs are to use. 
In the meantime, you can also find out more in this <a href=\"https:\/\/sandbox.weareadaptive.com\/2020\/05\/20\/aeronhydra\/\" target=\"_blank\" rel=\"noopener noreferrer\">blog series about Hydra Platform by our CTO<\/a>.<\/p>\n<p>If you would like more information about Aeron or Hydra Platform, please don\u2019t hesitate to <a href=\"https:\/\/www.linkedin.com\/in\/zachary-bray-5b729315\/\" target=\"_blank\" rel=\"noopener noreferrer\">reach out to me on LinkedIn<\/a> or click the button below.<\/p>\n<\/div>\n<p>&nbsp;<\/p>\n<p style=\"text-align: right;\"><img loading=\"lazy\" decoding=\"async\" class=\"size-thumbnail wp-image-4061 alignright\" src=\"https:\/\/sandbox.weareadaptive.com\/wp-content\/uploads\/2020\/09\/Zach-Bray.png\" alt=\"\" width=\"150\" height=\"150\" \/><\/p>\n<h1 style=\"text-align: right;\">Zachary Bray<\/h1>\n<p style=\"text-align: right;\">Senior Software Engineer, Adaptive Financial Consulting Ltd<\/p>\n<p style=\"text-align: right;\"><button class=\"waves-effect waves-light btn-flat alt cta-talk\">Let\u2019s talk<\/button><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This article is the first in a series where we\u2019ll be sharing our experience of building fault-tolerant, highly-performant marketplaces. 
Before &#8230;<\/p>\n","protected":false},"author":24,"featured_media":5147,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[6,138,219,217,216],"tags":[],"class_list":["post-5126","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-blog","category-accelerators","category-electronic-trading","category-exchanges","category-hydra-platform"],"_links":{"self":[{"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/posts\/5126"}],"collection":[{"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/users\/24"}],"replies":[{"embeddable":true,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/comments?post=5126"}],"version-history":[{"count":1,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/posts\/5126\/revisions"}],"predecessor-version":[{"id":5204,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/posts\/5126\/revisions\/5204"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/media\/5147"}],"wp:attachment":[{"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/media?parent=5126"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/categories?post=5126"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/sandbox.weareadaptive.com\/fr\/wp-json\/wp\/v2\/tags?post=5126"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}