Distributed processing has always fascinated me and here is my take-away notes about it. This post has overview of Big data, Distributed storage and processing systems.
A basic understanding of any distributed storage system like HDFS (Hadoop Distributed File System) would make this post more helpful. Additionaly, understanding of any distributed processing system like Apache Spark would make this post more interesting.
Be aware, this is mostly a theoretical post “no code zone” ahead :)
Handling data in computer science applications has been an interesting domain for a long time now. The recent advancement in user space of computer technologies has led to generation of huge amount of data. There are many articles which talk about data generated by the internet like this.
HDFS is one of the most versatile distributed storage systems in open source big-data world. A beginner-level knowledge of how HDFS works would help you get better understanding of the below concenpts. But don’t worry if you don’t know what it is.
Storage and maintainance of vast amount data is indeed a challanging task and that’s where distributed storage systems come into picture. There are variety of storage systems each having it’s own pros and cons. But most of them have the same underlying principles to achieve the below requirements:
Fault Tolerance Don’t want to loose the entire data for triviel failures
High Availability Always accessible from anywhere
High Reliability Always maintain data consistency
Replication Duplicating data as backup/fallback mechanisms in case of failures
Scalability Provide verticle and horizontal scalability to increase the storage capacities
Some systems may be good at fault tolerance but doesn’t scale much, some systems may scale very well but may not be reliable. No system is perfect, but based on the business requirement trade off could be made to choose the most suitable storage system.
Below are a few of the key points that most of the distributed storage systems have in common today:
This is just a crude introduction to distributed storage systems and there is definitely more to it. For now these key points should be enough understand the next concept: Distributed processing.
Now that we know about distributed storage systems, let’s see how do we process the data stored in any distributed storage system.
The processing could be any operation that we apply on the data including. Type of processing could be an ETL on the data, pre-processing for the consumption of business intelligence, or feature engineering in machine learning etc.
Processing a huge amount of data will involve many challanges related to speed, fault tolerance, scalability, consistency and more. Most of the currently available distributed processing systems make use of a set of patterns to solve these challanges.
Patterns are nothing but solutions/guidelines to common occuring problems in a given domain. It looks like many of these patterns come from an interesting field of computer science called functional programming. Understanding these patterns should help one answer the question why distributed processing systems are built like that?.
That’s a long list! Now let’s understand each one of these in detail.
Abstraction is nothing but hiding out the complexities of a component to make it easy to understand, use and generalize. By means of abstraction, one can concentrate on the bigger picture of a problem and ignore the low level details.
For example, in any database a table is an abstraction of structured data. The complexities of manipulating the underlying data is hidden from the user of the database. This way, user can concentrate on how to organize his data as multiple tables and build complex relationships between the tables, rather than worrying about how to store and index the data in the file system directly.
Similarly in HDFS, the user/client accessing an HDFS file doesn’t have to worry about how the actual data is split into parts and stored in multiple machines.
A rule of functional-programming which says data cannot be modified, it can only be created (btw, this is my own fancy definition!). Well, some people may not feel the power of this feature and for some, it may seem as more of a restriction.
Modifying a shared resource simultaneously via multiple threads/processes and getting the consistent result everytime is impossible without proper synchronization and locking mechanisms. These mechanisms are very very costly because, only a few(in most of the cases 1) threads/processes will be modifying the shared resource and rest of them will have to wait for their turn! And this waiting time increases as the concurrency(number of threads/processes trying to modify the shared data) increases!
The solution is simple: don’t modify the data at all! But how do we process the data if we are not allowed to modify it in the first place? Answer is, always create new data from existing data! Everytime we want to process some data, we follow these steps
Trivial example would be immutable String in java. Obviously, this will increase the storage requirements exponentially, as we keep duplicating the data. In the next sections, we will see how lazy evaluation and iterator pattern will solve this problem.
This characteristic says, don’t do anything until and unless it is absolute necessary. That way, we don’t unnecessarily consume/hold the computing resources like CPU, memory, network, storage etc. This helps in making optimal use of resources.
In an application, certain code snippets may be used only in few rare cases or some snippets may not be necessary at all. If these snippets consume considerable amount of computing resources, then lazy evaluation of these snippets would optimize the application greatly. Also, because of lazy evaluation, the order of code execution will change and may speed up the execution of remaining parts of the application.
One of the best examples for lazy evaluation would be the way virtual memory is implemented in Operating Systems. By loading only required data from the secondary storage into memory, the operating system makes the optimal use of limited memory space. This is a very broad example, but explains the power of lazy evaluation.
One of the versatile desing patterns used in programming langauges is iterator pattern. Accessing a sequence of items, only one at a time, untill the entire sequence has been consumed. The best part is that the sequence of items could be infinite! An iterator can produce one element from a sequence of items and it knows when the sequence runs out.
If above explanation sounds a bit abstract, think of Java iterators. Usually, iterators will be backed by a finite collection of items and the iterator will produce one item from the collection everytime we call the next() method. The consumer of this iterator has to make sure that the underlying collection has more items to serve, by calling hasNext() method before calling the next() method everytime.
The question is, what makes this iterator guy so powerful? Iterators could serve infinitly large sequence of items, that too within the memory and storage limits of a traditional commodity hardware.
Think of a custom iterator (not backed by any finite collection) that reads one line from a file from a local file system everytime the next() method is called. Note that, the iterator doesn’t have to load the entire file into memory to read line by line. There are APIs to read contents at a known location from a file stored in the secodary storages. The method hasNext() keeps track of the offset/position of current line that has been read and total number of lines in the file. Based on these details, hasNext can decide whether there are more lines to read or not.
The above described iterator can read infintely large file with a very small memory requirements. The iterators, as described above, can be designed carefully to serve such large amount of data, optimally. This concept is used in most of the Distributed Stream Processing engines also.
One importent thing to note here is that, the items consumed by an iterator should not be collected or accumulated in memory (because of the obvious reason that, the large sized file may not fit into memory). Let’s understand the reason behind this in the next section.
Pure functions is one of the core concept of functional programming. Any function/method is a pure function if it produces exactly same results for a given input without any side-effects. This concept is what makes it difficult to get the hang of distributed processing sysmtes, especially for those who come from non-functional programming paradigm.
If you are 100% sure that a function produces exactly same results, everytime you call it on a given input, you can blindly run the function parallelly on parts of data by using divide and conquer technique. Also, if for some reason, the operation fails in the middle of processing, the operation can be restarted by discarding the results produced so far. This is makes distributed systems resilient to failures!
One of the commonly used pattern in distributed processing systems is, not accumulating/collecting the data at any point. When we think of big data, it is obvious that we cannot estimate the size/amount of data that we are going to process. So, accumulating data during any processing step leads to the obvious question of how much storage/memory is required to store the accumulated data? And the answer is, we really don’t know! It could be in GBs today and in TBs tomorrow!
The question of storage requirement can be avoided by use of pipelining. An intuitive example would be car manufacturing assembly line. The manufacturing assembly line would contain multiple stages. The car being manufactured, will go through each of the stages in order. At each stage, some changes are made to the car. There are two important things to observe here:
The point is, we don’t take all the cars from one stage to another stage at once. We make sure that the manufacturing stages are applied on one car at a time. That way, we make efficient use of time and resources.
Same principle is applied in the distributed processing systems. Any data processing step is similar to a manufacturing stage. Assume, we are processing data from a database table, think of each row as a car being manufactured. Each row will go through a series of stages or processing steps untill the last stage, resulting in a transformed row. Both the points mentioned above regarding the car manufacturing assembly will apply here: First, at any given time, there will be at most one row being processed at each of the processing steps, second, all the processing steps will be processing different rows simultaneously.
The pipeline concept in distributed processing systems is one of the key concepts to understand.
One of the trivial concept that almost every computer science student would have heard about is divide and conquer. This is one of the best algorithmic approaches that can make use of parallel processing to speed up the computation. Merging or combining the results is the final step of any divide and conquer procedure.
Distributed processing systems, one way or the other, make use of divide and conquer technique to scale the data processing steps. Consider HDFS for instance, the underlying principle in HDFS is to split or divide the data into parts and distribute them across the cluster nodes. Now, any processing on top of HDFS can easily distribute the processing steps to individual parts. These processing steps can run in parallel there by increasing the processing speed. The result produced by processing each of the part files will inherently be distributed and can be merged whenever needed.
The mathematical definition of associative property can be summarized as order of operations does not matter and the final result will not change because of it. What does this have to do with distributed processing? A lot! Especially, if you are a beginner in distributed data processing domain, this is a very importent concept to know about.
Let us consider addition of few values: [4, 2, 7, 9, 1, 6]
Based on associative property of addition, we know that the order of numbers in the above equation can be changed. Is that it? What more does the associative property say?
If order of operations does not matter, then we are free to apply the operations in any order and importantly, we can apply the operations at any point of time and we would still get the same result!
That means, we can run different parts of the operations simultaneously, instead of adding all the numbers sequentially.
More importantly, we can define the over-all operation using same operation, recursively!! Yes the sequence of associative operations can be defined recursively and the final result will be same!
Let us define a addition function that takes two numbers and produces another number.
def add(val a, val b) = (a + b)
//Step-1
val a1 = add(4,2) //produces 6
val a2 = add(7,9) //produces 16
val a3 = add(1,6) //produces 7
//Step-2
val b1 = add(a1,a2) //produces 22
//Step-3
val res = add(a3, b1) //produces 29
The result is 29 as expected.
Let us look at the code snippet carefully. All the additions of step-1 could be parallellized. The step-2 depends on results of step-1, step-3 depends on the result of step-2. But all the processing steps use the same underlying add function.
Let us say we had to add 1 billion numbers, we can make the best use of associative property to recursively divide and conquer the 1 billion numbers.
One importent property to note here is that, the result type of the operation should be same as that of the inputs.
A simple but most effective pattern that all the distributed processing systems use is run the code where the data resides.
Assume, we have a cluster of 3 machines or nodes named A, B and C. A worker process that is running in machine A and it wants to process the data stored in the machine B. The only way is to transfer all the data from machine B to A over a network! But network operations are costly both in terms of time and money(don’t forget the bandwidth usage!). This cost of network operation increases as the size of the data increases.
How do we avoid this cost and optimize the data processing? Simple, don’t move the data, bring the code to data! All we have to do is keep track of where the data required for a processing step resides in the cluster of machines and send the code to that machine and run it there. Once the results are ready, then keep track of the results for future.
When we look the relative size/bandwidth usage of transfering code and data, we can easily grasp the power of this pattern. The individual part of data that gets distributed could be few hundred MBs to GBs where as, the code that processes a given part could be few hundred KBs or few MBs max! That’s a huge difference of bandwidth, isn’t it!!
HDFS distributes the data into parts across a cluster of machines and more importantly keeps track of all the parts. Interesting fact is that, HDFS even knows on which machine a given part of a data resides, where in the rack does that machine exist, what other machines exist in a given rack, etc. All these information about a single part of data helps a data processing framework to decide where to run the processing code!
Another mystery is, how do we send the code to different machines! Code serialization and deserialization is the solution. The code required to process a part of the data will be serialized and sent to the worker machine over the network and it gets deserialized and executed there.
##An illustrative example of patterns
Let us consider a simple example and try to make use of all the design patterns that we have seen previously.
Assume we have to perform below processing steps on an array A = [26, 62, 93, 17, 25],
Now let’s try to use most of the patterns(tricky ones) and see how our solution would be:
There will be a next part of this post about Apache Spark. That will mostly cover how Apache Spark implements the common patterns of distributed processing listed above.