Lambda Architecture for your Big data solutions
Lambda Architecture
Lambda Architecture is a generic architecture for distributed data processing systems, introduced by Nathan Marz based on his experience at BackType and Twitter. It is designed to be fault tolerant against both hardware failures and human mistakes, and to support use cases that need low-latency reads and updates. Because it describes a distributed data processing system, it should scale linearly, and it should scale out (by adding machines) rather than scale up.
1. Data
The incoming data can be high in volume, velocity, and variety. In the IoT world, for example, it may be sensor data, machine logs, and so on. All of this data is dispatched to both the Batch Layer and the Speed Layer for further processing.
2. Batch Layer
The Batch Layer stores the master copy of the dataset and precomputes batch views on that master dataset. The master dataset can be thought of as a very large list of records.
The Batch Layer needs to be able to do two things: store an immutable, constantly growing master dataset and compute arbitrary functions on that dataset. This type of processing is best done using batch processing systems.
The simplest form of the batch layer can be represented in pseudo-code as follows:
function runBatchLayer():
    while(true):
        recomputeBatchViews()
The Batch Layer runs in a while(true) loop and continuously recomputes the batch views from scratch.
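As a rough illustration of the loop above, here is a minimal Python sketch. The `PageView` record type, the page-view count used as the batch view, and the `iterations` bound (standing in for `while(true)` so the example terminates) are all assumptions made for this example, not part of the original architecture description.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class PageView(NamedTuple):
    # Hypothetical record type: one immutable fact in the master dataset.
    url: str
    timestamp: int


def recompute_batch_view(master_dataset: Iterable[PageView]) -> dict[str, int]:
    """Recompute the page-view count per URL from the entire master dataset."""
    return dict(Counter(record.url for record in master_dataset))


def run_batch_layer(master_dataset: list[PageView], iterations: int) -> dict[str, int]:
    """Bounded stand-in for the while(true) loop: every pass starts from scratch."""
    view: dict[str, int] = {}
    for _ in range(iterations):
        # No state is carried over from the previous pass.
        view = recompute_batch_view(master_dataset)
    return view
```

Because every pass rebuilds the view from the raw records, a bug in the computation can be corrected by fixing the function and letting the next pass run, which is one way the architecture tolerates human mistakes.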
Key features of the Batch Layer:
- It is very simple to use
- Batch computations are written like single-threaded programs
- Because the Batch Layer runs on a distributed system, it is easy to write robust, highly scalable computations, and parallel processing comes for free
- Adding new machines increases the capacity of the Batch Layer
3. Serving Layer
The Batch Layer emits the batch view as the result of its functions. The Serving Layer is a specialized distributed database that loads the batch view and makes it possible to do random reads on it. When new batch views are available, the serving layer automatically swaps those in so that more up-to-date results are available.
The Serving Layer database supports batch updates and random reads. Because it does not support random writes, such a database is extremely simple, which makes it robust, predictable, and easy to configure and operate.
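The batch-update and random-read behaviour described above can be sketched with a toy in-memory class. This is only an illustration: `ServingLayer`, `swap_in`, and `get` are hypothetical names, and a real serving layer would be a distributed database rather than a Python dictionary.

```python
from types import MappingProxyType
from typing import Mapping


class ServingLayer:
    """Toy stand-in: supports batch updates and random reads, but no random writes."""

    def __init__(self) -> None:
        # Start with an empty, read-only view.
        self._view: Mapping[str, int] = MappingProxyType({})

    def swap_in(self, new_batch_view: Mapping[str, int]) -> None:
        # Batch update: copy the new view and replace the old one wholesale.
        self._view = MappingProxyType(dict(new_batch_view))

    def get(self, key: str, default: int = 0) -> int:
        # Random read by key against the currently loaded batch view.
        return self._view.get(key, default)
```

There is deliberately no method for updating a single key; the only way to change what readers see is to swap in a complete new batch view.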
4. Speed Layer
The Speed Layer is similar to the Batch Layer in that it produces views based on the data it receives. One big difference is that the Speed Layer only looks at recent data, whereas the Batch Layer looks at all the data at once. The Speed Layer also does incremental computation instead of the full recomputation done in the Batch Layer.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
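To make the contrast with recomputation concrete, here is a small Python sketch of an incremental real-time view. The `SpeedLayer` class, its method names, and the page-view counting use case are assumptions chosen for illustration.

```python
class SpeedLayer:
    """Toy real-time view that is updated incrementally, one event at a time."""

    def __init__(self) -> None:
        self._realtime_view: dict[str, int] = {}

    def on_event(self, url: str) -> None:
        # Incremental computation: adjust the existing view for the new event
        # instead of recomputing it from all the data.
        self._realtime_view[url] = self._realtime_view.get(url, 0) + 1

    def expire(self) -> None:
        # Once the batch views cover this data, the real-time results can be
        # discarded, so this layer only ever holds recent data.
        self._realtime_view.clear()

    def view(self) -> dict[str, int]:
        # Return a copy so callers cannot modify the internal state.
        return dict(self._realtime_view)
```

In this sketch `expire` drops everything at once; a real system would more likely discard only the portion of the real-time view that a finished batch run has already absorbed.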
5. Incoming Queries
Any incoming query can be answered by merging results from batch views and real-time views.
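A minimal sketch of such a merge, assuming both views map a URL to a page-view count and that the real-time view covers only data not yet included in the batch view (so the two counts can simply be added). The function name `query_page_views` is hypothetical.

```python
from typing import Mapping


def query_page_views(
    url: str,
    batch_view: Mapping[str, int],
    realtime_view: Mapping[str, int],
) -> int:
    """Answer a query by combining the batch view with the real-time view."""
    # Keys missing from either view contribute zero to the total.
    return batch_view.get(url, 0) + realtime_view.get(url, 0)
```

Addition works here only because counts are additive and the two views are assumed not to overlap; other kinds of views (for example, unique-visitor sets) would need a different merge function.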
References
- http://lambda-architecture.net/
- Big Data: Principles and best practices of scalable realtime data systems, by Nathan Marz and James Warren