Takeaways from my Explorations

Distributed processing - Design Patterns

Distributed processing has always fascinated me, and here are my takeaway notes about it. This post gives an overview of big data, distributed storage, and distributed processing systems.

Target audience

A basic understanding of any distributed storage system like HDFS (Hadoop Distributed File System) would make this post more helpful. Additionally, an understanding of any distributed processing system like Apache Spark would make this post more interesting.

Be aware, this is mostly a theoretical post: a “no code zone” lies ahead :)

Content

  1. Note on Big Data and Distributed Storage Systems
  2. Distributed processing - Design patterns
  3. An example making use of distributed processing patterns

Note on Big Data and Distributed Storage Systems

Handling data in computer science applications has been an interesting domain for a long time now. The recent advancement in the consumer space of computer technologies has led to the generation of huge amounts of data. There are many articles that talk about the data generated by the internet, like this one.

HDFS is one of the most versatile distributed storage systems in the open-source big-data world. A beginner-level knowledge of how HDFS works will help you get a better understanding of the concepts below. But don’t worry if you don’t know what it is.

Storing and maintaining vast amounts of data is indeed a challenging task, and that’s where distributed storage systems come into the picture. There is a variety of storage systems, each with its own pros and cons. But most of them rely on the same underlying principles to achieve the below requirements:

Fault Tolerance: Don’t want to lose the entire data because of trivial failures

High Availability: Always accessible from anywhere

High Reliability: Always maintain data consistency

Replication: Duplicate data as a backup/fallback mechanism in case of failures

Scalability: Provide vertical and horizontal scalability to increase storage capacity

Some systems may be good at fault tolerance but don’t scale well; some systems may scale very well but may not be reliable. No system is perfect, but based on the business requirements, trade-offs can be made to choose the most suitable storage system.

Most distributed storage systems today share these key principles in one form or another.

This is just a crude introduction to distributed storage systems, and there is definitely more to it. For now, these key points should be enough to understand the next concept: distributed processing.

Distributed processing - Design patterns

Now that we know about distributed storage systems, let’s see how we process the data stored in such a system.

The processing could be any operation that we apply to the data: an ETL job, pre-processing for consumption by business intelligence, feature engineering for machine learning, and so on.

Processing a huge amount of data involves many challenges related to speed, fault tolerance, scalability, consistency and more. Most of the currently available distributed processing systems make use of a set of patterns to solve these challenges.

Patterns are nothing but solutions/guidelines for commonly occurring problems in a given domain. It looks like many of these patterns come from an interesting field of computer science called functional programming. Understanding these patterns should help one answer the question: why are distributed processing systems built the way they are?

Here are the patterns we will look at: abstraction, immutability, lazy evaluation, the iterator pattern, pure functions, pipelining, divide and conquer, associative operations, and bringing the code to the data. That’s a long list! Now let’s understand each one of these in detail.

Abstraction

Abstraction is nothing but hiding the complexities of a component to make it easier to understand, use and generalize. By means of abstraction, one can concentrate on the bigger picture of a problem and ignore the low-level details.

For example, in any database, a table is an abstraction over structured data. The complexities of manipulating the underlying data are hidden from the user of the database. This way, the user can concentrate on how to organize the data as multiple tables and build complex relationships between the tables, rather than worrying about how to store and index the data in the file system directly.

Similarly, in HDFS, the user/client accessing an HDFS file doesn’t have to worry about how the actual data is split into parts and stored across multiple machines.
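To make this concrete, here is a minimal Scala sketch (the names Dataset, InMemoryDataset and LocalFileDataset are hypothetical, not from any real framework): the processing code depends only on the abstraction and never sees where the bytes actually live.

trait Dataset {
  // the only thing a consumer needs: a way to iterate over records
  def lines(): Iterator[String]
}

class InMemoryDataset(data: Seq[String]) extends Dataset {
  def lines(): Iterator[String] = data.iterator
}

class LocalFileDataset(path: String) extends Dataset {
  def lines(): Iterator[String] = scala.io.Source.fromFile(path).getLines()
}

// Works the same whether the data sits in memory, on a local disk,
// or (in a real system) split across a cluster.
def countNonEmpty(ds: Dataset): Int = ds.lines().count(_.nonEmpty)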

Immutable

A rule of functional programming which says data cannot be modified, it can only be created (by the way, this is my own fancy definition!). Some people may not feel the power of this feature, and to others it may seem like more of a restriction.

Modifying a shared resource simultaneously from multiple threads/processes and getting a consistent result every time is impossible without proper synchronization and locking mechanisms. These mechanisms are very costly because only a few (in most cases, one) threads/processes can modify the shared resource at a time, and the rest have to wait for their turn! And this waiting time grows as the concurrency (the number of threads/processes trying to modify the shared data) increases!

The solution is simple: don’t modify the data at all! But how do we process the data if we are not allowed to modify it in the first place? The answer is: always create new data from existing data! Every time we want to process some data, we take the existing (immutable) data as input, apply the transformation, and store the result as new data, leaving the original untouched.

A trivial example would be the immutable String in Java. Obviously, this approach increases the storage requirements drastically, as we keep duplicating the data. In the next sections, we will see how lazy evaluation and the iterator pattern solve this problem.
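As a minimal Scala sketch of the idea (not tied to any particular framework), notice that every transformation below builds a new collection and the original is never touched; this is also exactly the duplication that lazy evaluation and iterators will later avoid.

val input = List(1, 2, 3, 4, 5)        // immutable list

val doubled = input.map(_ * 2)         // new list: List(2, 4, 6, 8, 10)
val filtered = doubled.filter(_ > 4)   // yet another new list: List(6, 8, 10)

// `input` is unchanged, so any number of threads can read it concurrently
// without any locking or synchronization.
println(input)      // List(1, 2, 3, 4, 5)
println(filtered)   // List(6, 8, 10)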

Lazy evaluation

This characteristic says: don’t do anything until and unless it is absolutely necessary. That way, we don’t unnecessarily consume or hold on to computing resources like CPU, memory, network and storage. This helps in making optimal use of resources.

In an application, certain code snippets may be used only in a few rare cases, or may not be needed at all. If these snippets consume a considerable amount of computing resources, then evaluating them lazily can optimize the application greatly. Also, because of lazy evaluation, the order of code execution changes, which may speed up the execution of the remaining parts of the application.

One of the best examples of lazy evaluation is the way virtual memory is implemented in operating systems. By loading only the required data from secondary storage into memory, the operating system makes optimal use of the limited memory space. This is a very broad example, but it illustrates the power of lazy evaluation.
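On a much smaller scale, Scala’s lazy vals and lazy collection views show the same idea. This is just a sketch of the mechanism, not of any particular processing engine:

def expensiveLookup(): Int = {
  println("doing the expensive work...")   // side effect so we can see *when* it runs
  42
}

lazy val cached = expensiveLookup()   // nothing is computed yet
println("before first use")
println(cached)                       // the expensive work happens here, exactly once

// Lazy collection views behave similarly: the map below is only recorded, not executed.
val pending = (1 to 1000000).view.map(_ * 2)
println(pending.take(3).toList)       // List(2, 4, 6) -- only three elements are ever computed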

Iterator pattern

One of the most versatile design patterns used in programming languages is the iterator pattern: accessing a sequence of items one at a time, until the entire sequence has been consumed. The best part is that the sequence of items could be infinite! An iterator produces one element of the sequence at a time and knows when the sequence runs out.

If the above explanation sounds a bit abstract, think of Java iterators. Usually, an iterator is backed by a finite collection of items, and it produces one item from the collection every time we call the next() method. The consumer of the iterator has to make sure the underlying collection has more items to serve by calling the hasNext() method before every call to next().

The question is, what makes the iterator so powerful? Iterators can serve an infinitely large sequence of items, and do so within the memory and storage limits of traditional commodity hardware.

Think of a custom iterator (not backed by any finite collection) that reads one line from a file on the local file system every time the next() method is called. Note that the iterator doesn’t have to load the entire file into memory to read it line by line; there are APIs to read the contents at a known location of a file stored in secondary storage. The hasNext() method keeps track of the offset/position of the current line that has been read and the total number of lines in the file. Based on these details, hasNext() can decide whether there are more lines to read or not.

The iterator described above can read an infinitely large file with very small memory requirements. Iterators like this, designed carefully, can serve such large amounts of data optimally. This concept is also used in most distributed stream processing engines.
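Here is a small Scala sketch of such an iterator. It leans on scala.io.Source, which already reads the file lazily, so hasNext/next serve one line at a time and the whole file never sits in memory (error handling and closing are omitted for brevity; the file path is hypothetical).

import scala.io.Source

class LineIterator(path: String) extends Iterator[String] {
  private val underlying = Source.fromFile(path).getLines()   // lazy, line-by-line reader
  def hasNext: Boolean = underlying.hasNext    // are there more lines to serve?
  def next(): String = underlying.next()       // produce just the next line
}

// Usage: count error lines in an arbitrarily large file with a tiny, constant memory footprint.
val it = new LineIterator("/tmp/huge.log")     // hypothetical path
var errors = 0L
while (it.hasNext) {
  if (it.next().contains("ERROR")) errors += 1
}
println(s"error lines: $errors")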

One important thing to note here is that the items consumed by an iterator should not be collected or accumulated in memory (for the obvious reason that a large file may not fit into memory). Let’s understand the reason behind this in the next section.

Pure Functions

Pure functions are one of the core concepts of functional programming. A function/method is pure if it produces exactly the same result for a given input, without any side effects. This concept is what makes it difficult to get the hang of distributed processing systems, especially for those who come from a non-functional programming paradigm.

If you are 100% sure that a function produces exactly the same result every time you call it on a given input, you can blindly run the function in parallel on parts of the data using the divide and conquer technique. Also, if for some reason the operation fails in the middle of processing, it can be restarted by simply discarding the results produced so far. This is what makes distributed systems resilient to failures!
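A tiny Scala sketch of the difference (the parallel-retry claim is only illustrated in the comments; no framework is involved):

// Pure: same input, same output, no side effects.
// It can be applied to different parts of the data in parallel and safely retried after a failure.
def square(x: Int): Int = x * x

// Impure: the result depends on (and mutates) hidden state,
// so re-running it after a failure would produce a different answer.
var counter = 0
def impureSquare(x: Int): Int = {
  counter += 1          // side effect
  x * x + counter       // output depends on how many times it has been called
}

println(square(3) == square(3))               // true  -- always the same
println(impureSquare(3) == impureSquare(3))   // false -- different on every call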

Pipelining or Keep-it-flowing

One of the commonly used patterns in distributed processing systems is to never accumulate/collect the data at any point. When we think of big data, it is obvious that we cannot estimate the size of the data we are going to process. So accumulating data during any processing step leads to the obvious question: how much storage/memory is required to hold the accumulated data? And the answer is, we really don’t know! It could be GBs today and TBs tomorrow!

The question of storage requirements can be avoided by the use of pipelining. An intuitive example would be a car-manufacturing assembly line. The assembly line contains multiple stages, and the car being manufactured goes through each of the stages in order. At each stage, some changes are made to the car. There are two important things to observe here:

  1. At any given time, each stage works on at most one car.
  2. All the stages work on different cars simultaneously.

The point is, we don’t move all the cars from one stage to the next at once. We make sure that each manufacturing stage is applied to one car at a time. That way, we make efficient use of time and resources.

The same principle is applied in distributed processing systems. Any data processing step is similar to a manufacturing stage. Assume we are processing data from a database table; think of each row as a car being manufactured. Each row goes through a series of stages or processing steps until the last stage, resulting in a transformed row. Both the points mentioned above about the car assembly line apply here: first, at any given time, there is at most one row being processed at each processing step; second, all the processing steps are processing different rows simultaneously.

The pipeline is one of the key concepts to understand in distributed processing systems.
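As a small sketch, Scala iterators already behave like such a pipeline: each map below is a stage, and rows flow through all the stages one at a time, with nothing accumulated in between (the rows themselves are made up for the example).

val rows: Iterator[String] = Iterator("1,alice", "2,bob", "3,carol")   // hypothetical rows

val pipeline: Iterator[String] =
  rows
    .map(_.split(","))        // stage 1: parse the row
    .map(cols => cols(1))     // stage 2: pick the name column
    .map(_.toUpperCase)       // stage 3: transform it

// Only when the iterator is consumed does each row travel through all three stages,
// one row at a time, no matter how many rows there are.
pipeline.foreach(println)     // ALICE, BOB, CAROL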

Divide and conquer

One of the most familiar concepts that almost every computer science student has heard about is divide and conquer. It is one of the best algorithmic approaches for making use of parallel processing to speed up computation. Merging or combining the results is the final step of any divide and conquer procedure.

Distributed processing systems, one way or the other, make use of the divide and conquer technique to scale data processing. Consider HDFS, for instance: the underlying principle in HDFS is to split or divide the data into parts and distribute them across the cluster nodes. Any processing on top of HDFS can then easily distribute the processing steps to the individual parts. These processing steps can run in parallel, thereby increasing the processing speed. The results produced by processing each of the part files are inherently distributed and can be merged whenever needed.
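Here is a minimal single-machine sketch in Scala: the data is divided into parts, each part is processed independently (here on parallel threads; in a real system, on different nodes), and the partial results are merged at the end.

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val data = (1 to 1000000).toVector

val parts = data.grouped(250000).toVector                             // divide into 4 parts
val partials = parts.map(p => Future(p.map(_.toLong).sum))            // conquer each part in parallel
val total = Await.result(Future.sequence(partials), 30.seconds).sum   // merge the partial results

println(total)   // 500000500000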

Associative operations

The mathematical definition of the associative property can be summarized as: the order in which the operations are applied does not matter, and the final result will not change because of it. What does this have to do with distributed processing? A lot! Especially if you are a beginner in the distributed data processing domain, this is a very important concept to know about.

Let us consider the addition of a few values: [4, 2, 7, 9, 1, 6]

Based on the associative property of addition, we know that the order in which the additions in the above expression are applied can be changed. Is that it? What more does the associative property say?

If the order of operations does not matter, then we are free to apply the operations in any order and, importantly, at any point in time, and we would still get the same result!

That means we can run different parts of the operation simultaneously, instead of adding all the numbers sequentially.

More importantly, we can define the overall operation in terms of the same operation, recursively! Yes, a sequence of associative operations can be defined recursively, and the final result will be the same!

Let us define an addition function that takes two numbers and produces another number.

def add(a: Int, b: Int): Int = a + b

//Step-1
val a1 = add(4,2)  //produces 6
val a2 = add(7,9)  //produces 16
val a3 = add(1,6)  //produces 7

//Step-2
val b1 = add(a1,a2)  //produces 22

//Step-3
val res = add(a3, b1)  //produces 29

The result is 29 as expected.

Let us look at the code snippet carefully. All the additions in step-1 could be parallelized. Step-2 depends on the results of step-1, and step-3 depends on the result of step-2. But all the processing steps use the same underlying add function.

Say we had to add 1 billion numbers: we could make the best use of the associative property to recursively divide and conquer them.

One important property to note here is that the result type of the operation should be the same as that of its inputs.
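The sketch below (plain Scala, nothing framework-specific) makes this concrete: because add is associative and returns the same type as its inputs, the big sum can be defined recursively as a tree of smaller sums, and the grouping does not change the answer.

def add(a: Int, b: Int): Int = a + b   // the same add function as above

// Recursively split a non-empty sequence in half, add each half, then add the two results.
// In a distributed setting, the two halves could be reduced on different machines.
def treeReduce(xs: Vector[Int]): Int = xs match {
  case Vector(x) => x
  case _ =>
    val (left, right) = xs.splitAt(xs.length / 2)
    add(treeReduce(left), treeReduce(right))
}

val nums = Vector(4, 2, 7, 9, 1, 6)
println(nums.reduce(add))   // 29, strictly left to right
println(treeReduce(nums))   // 29, pairwise/recursive -- same result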

Bring the code to data

A simple but very effective pattern that all distributed processing systems use: run the code where the data resides.

Assume we have a cluster of 3 machines or nodes, named A, B and C. A worker process running on machine A wants to process the data stored on machine B. The only way is to transfer all the data from machine B to machine A over the network! But network operations are costly, both in terms of time and money (don’t forget the bandwidth usage!), and this cost increases as the size of the data increases.

How do we avoid this cost and optimize the data processing? Simple: don’t move the data, bring the code to the data! All we have to do is keep track of where the data required for a processing step resides in the cluster, send the code to that machine, and run it there. Once the results are ready, we keep track of them for later use.

When we look at the relative size/bandwidth usage of transferring code versus transferring data, we can easily grasp the power of this pattern. An individual part of the data that gets distributed could be a few hundred MBs to GBs, whereas the code that processes a given part could be a few hundred KBs or a few MBs at most! That’s a huge difference in bandwidth, isn’t it!

HDFS distributes the data into parts across a cluster of machines and, more importantly, keeps track of all the parts. The interesting fact is that HDFS even knows on which machine a given part of the data resides, in which rack that machine sits, what other machines exist in that rack, and so on. All this information about a single part of the data helps a data processing framework decide where to run the processing code!

Another mystery is: how do we send the code to different machines? Code serialization and deserialization is the answer. The code required to process a part of the data is serialized, sent to the worker machine over the network, and then deserialized and executed there.
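As a rough, single-process Scala sketch of that round trip (real frameworks also ship class files and handle versioning and security, none of which is shown here), a function object can be serialized into a handful of bytes, “sent” to the worker, and then executed next to the locally stored data:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// The "code" we want to run next to the data: a serializable function object.
class ProcessPart extends (Seq[Int] => Int) with Serializable {
  def apply(part: Seq[Int]): Int = part.map(_ * 2).sum
}

// "Driver" side: serialize the function into a small byte array.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(new ProcessPart)
oos.close()
val bytes = bos.toByteArray
println(s"code size on the wire: ${bytes.length} bytes")   // a few hundred bytes, not gigabytes

// "Worker" side: deserialize the function and apply it to the part of the data stored locally.
val revived = new ObjectInputStream(new ByteArrayInputStream(bytes))
  .readObject().asInstanceOf[Seq[Int] => Int]
val localPart = Seq(1, 2, 3, 4, 5)   // data that already lives on this node
println(revived(localPart))          // 30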

An illustrative example of patterns

Let us consider a simple example and try to make use of all the design patterns that we have seen previously.

Assume we have to perform the below processing steps on the array A = [26, 62, 93, 17, 25]:

  1. divide each number by 2
  2. add 10 to each number
  3. replace all the even numbers by -1
  4. add the corresponding elements of the results of steps 1 and 2
  5. add the corresponding elements of the results of steps 3 and 4
  6. add the corresponding elements of the result of step 5 and the original array A
  7. find the average of all the numbers in the result of step 6

Now let’s try to use most of the patterns (the tricky ones) and see what a solution could look like:
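Here is one possible sketch in plain Scala (integer division is assumed in step 1; a real engine would additionally distribute the parts across machines). Steps 1-3 are pure, lazy transformations of the immutable input, steps 4-6 keep the rows flowing through a pipeline without accumulating intermediate results, and step 7 is an associative aggregation (a sum and a count) that could be computed per part and merged, just like the add() example above.

val A = Array(26, 62, 93, 17, 25)

// Steps 1-3: independent, pure, *lazy* transformations of the original (immutable) array.
val s1 = A.view.map(_ / 2)                              // 1. divide each number by 2
val s2 = A.view.map(_ + 10)                             // 2. add 10 to each number
val s3 = A.view.map(x => if (x % 2 == 0) -1 else x)     // 3. replace even numbers by -1

// Steps 4-6: element-wise combinations; still lazy, values keep flowing one at a time.
val s4 = s1.zip(s2).map { case (a, b) => a + b }        // 4. step 1 + step 2
val s5 = s3.zip(s4).map { case (a, b) => a + b }        // 5. step 3 + step 4
val s6 = s5.zip(A).map { case (a, b) => a + b }         // 6. step 5 + original A

// Step 7: an associative aggregation -- (sum, count) pairs could be computed per part
// of the data and merged at the end.
val (sum, count) = s6.foldLeft((0, 0)) { case ((s, c), x) => (s + x, c + 1) }
println(sum.toDouble / count)                           // 147.8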

What’s next?

There will be a follow-up to this post about Apache Spark, mostly covering how Spark implements the common patterns of distributed processing listed above.

References