Big Data Books
It's not a news anymore that many people are reacting in different ways to the news of National Security Agency accessing the phone records of Verizon - cellphone and land lines - to learn who called who. But one thing is common - all these reactions are pretty intense. While the debate lingers on whether the government overstepped its boundaries in accessing the records of private citizens or whether it was perfectly justified in the context of the threats our country is facing right now - and there will be a debate for some time to come - I had a different thought of my own. I was wondering about the technological aspects of the case. The phone records must be massive. What kind of tools and technologies the folks in the "black room", a.k.a Room 641A must have used to collate and synthesize the records into meaningful intelligence - the very task NSA is supposed to gather and act upon? Along with anything it piques the interest from an engineering point of view.
And that brings the attention to the aspect of computation involving massive amounts of data. It's one thing to slice and dice a finite, reasonable amount of dataset; but in the case of the phone records, and especially collated with other records to identify criminal or pseudo-criminal activities such as financial records, travel records, etc., the traditional databases such as Oracle and DB2 likely will not scale well. But the challenge is not just in the realm of romanticized espionage; it's very much a concern for purely un-romantic corporations who, among other things, want to track customer behavior to fine tune their service and product offerings. Well, it's espionage of a slightly different kind. Perhaps the sources of data are different - website logs, Facebook feeds as opposed to phone records; but the challenges are the same - how to synthesize enormous amounts of seeming unrelated data into meaningful intelligence. This is the challenge of the "Big Data".
Meet the V'sSeveral years ago, Yahoo! faced the same issue - how to present the attractiveness the webpages it puts on its portal and analyze the pattern of clicking by users to attract advertisers to put relevant ads on their pages. Google had a similar challenge of indexing the entire World Wide Web in its servers so that it present search results very, very quickly. Both of these issues represent the issues that others probably didn't face earlier. Here are the relatively unique aspects of this data, which are known as the "Three V's of Big Data".
- Volume - the sheer mass of the data made it difficult, if not impossible, to sort through them
- Velocity - the data was highly transient. Website logs are relevant only for that time period; for a different period it was different
- Variety - the data was not pre-defined and not quite structured - at least not the way we think of structure when we think of relational databases
- There was no need to use large servers, which typically costs a lot more than small servers
- There was a built-in data redundancy since the data was replicated between these small servers
- But the most important, it could scale well, very well simply by adding more of those small servers
Name=Value PairsA typical relational database works by logically arranging the data into rows and columns. Here is an example. You decide on a table design to hold your customers, named simply CUSTOMERS. It has the columns CUST_ID, NAME, ADDRESS, PHONE. Later, your organization decides to provide some incentives to the spouses as well and so you added another column - SPOUSE.
Everything was well, until the time you discovered customer 1 and spouse are divorced and there is a new spouse now. However the company decides to keep the names of the ex-spouses as well, for marketing analytics. Like the right relational application, you decide to break SPOUSE away from the main table and create a new table - SPOUSES, which is a child of CUSTOMERS, joined by CUST_ID. This requires massive code and database changes; but you survive. Later you had the same issue with addresses (people have different addresses - work, home, vacation, etc.) and phone numbers (cell phone, home phone, work phone, assistant's phone, etc.). So you decide to break them into different tables as well. Again, code and database changes. But the changes did not stop there. You had to add various tables to record hobbies, associates, weights, dates of birth - the list is endless. Every thing you record requires a database change and a code change. But worse - not all the tables will be populated for every customer. In your company's quest to build a 360 degree view of the customer, you collect some information; but there is no guarantee that all the data points will be gathered. you are left with sparse tables. Now, suddenly, someone says there is yet another attribute required for the customer - professional associations. So, off you go - build yet another table, followed by change control, code changes to incorporate that.
If you look at the scenario above, you will find that the real issue is trying force a structure around a dataset that inherently unstructured - akin to a square peg in a round hole. The lack of structure of the data is what makes it agile and useful; but the lack of structure is also what makes it difficult in a relational database that demands structure. This is the primary issue if you wan tto capture social media data - Twitter feeds, Facebook updates, LinkedIn updates and Pinterest posts. It's impossible to predict in advance, at least accurately, the exact information you will expect to see in them. So, putting a structure around the data storage not only makes life difficult for everyone - the DBAs will constantly need to alter the structures and the developers/designers will constantly wait for the structure to be in the form they want - slowing down capture and analysis of data.
So, what is the solution. If you think about it, think about how we - human beings - process information. Do we parse information in form of rows in some table? Hardly. We process and store information by associations. For instance, let's say I have a friend John. I probably have nuggets of information like this:
Last Name = Smith
Lives at = 13 Main St, Anytown, USA
Age = 40
Birth Day = June 7th
Wife = Jane
Child = Jill
Jill goes to school = Top Notch Academy
Jill is in Grade = 3
... and so on. Suppose I meet another person - Martha- who tells me that her child also goes to Grade 3 in Top Notch Academy. My brain probably goes through a sequence like this:
Search for "Top Notch Academy"
Found it. It's Jill
Search for Jill.
Found it. She is child of John
Who is John's wife?
Found it. It's Jane.
Where do John and Jill live? ...
And finally, after this processing is all over, I say to Martha as a part of the coversation - "What a coincidence! Jill - the daughter of my friends John and Jane Smith goes there are well. Do you know them?". "Yes, I do," replies Martha. "In fact they are in the same class, that of Mrs Gillen-Heller".
Immediately my brain processed this new piece of information and filed the data as:
Jill's Teacher = Mrs. Gillen-Heller
Jane's Friend = Martha
Martha's Child Goes to = ...
Months later, I meet with Jane and mention to her that I met Martha whose child went to Mrs. Gillen-Heller's class - the same one as Jill. "Glad you met Martha,", Jane says. "Oh, Jill is no longer in that class. Now she is in Mr. Fallmeister's class."
Aha! My brain probably stored that information as:
Jill's former teacher = Mrs. Gillen-Heller
And it updated the already stored information:
Jill's Teacher = Mr. Fallmeister
This is called storing by a name=value pair. You see I stored the information as a pair of property and it value. As information goes on, I keep adding more and more pairs. When I need to retrieve information, I just get the proper property and by associations, I get all the data I need. But the storing of data by name=value pairs gives me enormous flexibility in storing all kinds of information without modifying any data structures I may currently have.
This is also how the Big Data is tamed for processing. Since the data coming of Twitter, Facebook, LinkedIn, Pinterest, etc. is impossible to categorize in advance, it will be practically impossible to put it all in the relational format. Therefore, a name=value pair type storage is the logical step in compiling and collating the data. the name is also known as "key"; so the model is sometimes called key-value pair. The value doesn't have to have a datatype. In fact it's probably a BLOB; so anything can go in there - booking amount, birth dates, comments, XML documents, pictures, audio and even movies. It provides an immense flexibility in capturing the information that is inherently unstructured.
NoSQL DatabaseNow that you know about name value pairs, the next logical question you may have is - how do we store these? Our thoughts about databases are typically colored by our long-standing association with relational databases, making them almost synonymous with the concept of database. Before relational databases were there, even as a concept, big machines called mainframes ruled the earth. The databases inside them were stored in hierarchical format. One such database from IBM was IMS/DB, which was hierarchical. Later, when relational databases were up and coming, another type of database concept - called a network database - was developed to compete against it. An example of that category was IDMS (now owned by Computer Associates) developed for mainframes. The point is, relational databases were not the answer to all the questions then; and it is clear that they are not now either.
This leads to the development a different type of database technologies based on the the key-value model. Relational database systems are queried by SQL language, which I am sure is familiar to almost anyone reading this blog post. SQL is a set-oriented language - it operates on sets of data. In the key-value pair mode, however, that does not work anymore. Therefore these key-value databases are usually known as NoSQL, to separate them from the relational SQL-based counterparts. Since their introduction, some NoSQL databases actually support SQL, which is why "NoSQL" is not a correct term anymore. Therefore sometimes it is referred to as "Not only SQL" databases. But the point is that their structure is not dependent on relational. But how exactly the data is stored is usually left to the implementer. Some examples are MongoDB, Dynamo, Big Table (from Google) etc.
I would stress here that almost any type of non-relational database can be classified as NoSQL; not just the name-value pair models. For instance, Object Store, an object database is also NoSQL. But for this blog post, I am assuming only key-value pair database as the NoSQL one.
Map/ReduceLet's summarize what we have learned so far:
- The key-value pair model in databases offer flexibility in data storage without the need for a predefined table structure
- The data can be distributed across many machines where they are independently processed and then collated.
Here is a very rudimentary but practical example of Map/Reduce. Suppose you get Facebook feeds and you are expected to find out the total of likes for our company's recent post. Facebook feed comes in the form of a massive dataset. The first task is to divide that among many servers - a principle described earlier to make the process scale well. Once the dataset is divided, each machine run some code to extract and collate the information and then present the data to some central coordinator. to collate for the final time. Here is a pseudo-code for the process for each server doing the processing on a subset of data:
while (there_are_remaining_posts) loop
extract status of "like" for the specific post
if status = "like" then
like_count := like_count + 1
no_comment := no_comment + 1
Let's name this program counter(). Counter runs on all the servers, which are called Nodes. As shown in the figure, there are three nodes. The raw dataset is divided into three sub-datasets which are then fed to each of the three Nodes. A copy of the subdataset is kept another server as well. That takes care of redundancy. Each node perform their computation, send their results to an intermediate result set where they are collated.
(1) First, since the data is stored in chunks and the copy of a chunk is stored in a different node, there is built-in redundancy. There is no need to protect the data being fed since there is a copy available elsewhere.
(2) Second, since the data is available elsewhere, if a node fails, all it needs to done is that some other nodes will pick up the slack. There is no need to reshuffle or restart the job.
(3) Third, since the nodes all perform task independently, when the datasize becomes larger, all you have to do is to add a new node. Now the data will be divided four ways instead of three and so will be processing load.
This is very similar to parallel query processes in Oracle Databases, with PQ servers being analogous to nodes.
There are two very important points to note here:
(1) The subset of data each node gets is not needed to be viewed by all the nodes. Each node gets its own set of data to be processed. A copy of the subset is maintained in a different node - making the simultaneous access to the data unnecessary. This means you can have the data in a local storage; not in expensive SANs. This is not only brings cost significantly down but may performs better as well due to a local access. As cost of Solid State Devices and flash-based storage plummets, it could also mean the that the storage cost per performance will be even better.
(2) The nodes need not be super fast. A relatively simple commodity class server is enough for the processing as opposed to a large server. Typically servers are priced for their use, e.g. a Enterprise class server with 32 CPUs is probably roughly equivalent in performance to eight 4-CPU blades. But the cost of the former is way more than 8 times the cost of the blade server. This model takes advantage of the cheaper computers by scaling horizontally; not vertically.
HadoopNow that you know how the processing data in parallel and using a concept called Map/Reduce allows you to shove in several compute intensive applications to dissect large amounts of data, you will often wonder - there are a lot of moving parts to be taken care of just to empower this process. In a monolithic server environment you just have to kick off multiple copies of the program. The Operating System does the job of scheduling these programs on the available CPUs, taking them off the CPU (paging) to roll in another process, prevent processes from corrupting each others' memory, etc. Now that these processes are occurring on multiple computers, there has to be all these manual processes to make sure they work. For instance, in this model you have to ensure that the jobs are split between the nodes reasonably equally, the dataset is split equitably, the queue for feeding data and getting data back from the Map/Reduce jobs are properly maintained, the jobs fail over in case of node failure, so on. In short, you need an operating system of operating systems to manage all these nodes as a monolithic processor.
What would you do if this operating procedures were already defined for you? Well, that would make things really easy, won't it? you can then focus on what you are good at - developing the procedures to slice and dice the data and derive intelligence from it. This "procedure", or the framework is available, and it is called Hadoop. It's an open source offering, similar to Mozilla and Linux; no single company has exclusive ownership of it. However, many companies have adopted it and evolved it into their offerings, similar to Linux distributions such as Red Hat, SuSe and Oracle Enterprise Linux. Some of those companies are Cloudera, Hortonworks, IBM, etc. Oracle does not have a Hadoop distribution. Instead it licenses the one from Cloudera for its own Big Data Appliance. The Hadoop framework runs on all the nodes of the cluster of nodes and acts as the coordinator.
A very important point to note here that Hadoop is just a framework; not the actual program that performs Map/Reduce. Compare that to the operating system analogy; an OS like Windows does not offer a spreadsheet. You will need to either develop or buy an off the shelf product such as Excel to have that functionality. Similarly, Hadoop offers a platform to run the Map/Reduce programs that you develop and you put that logic in the code what you "map" and how you "reduce".
Remember another important advantage you had seen in this model earlier - the ability to replicate data between multiple nodes so that the failure of a single node does not cause the processing to be abandoned. This is offered through a new type of filesystem called Hadoop Distributed Filesystem (HDFS). HDFS, which is a distributed (and not a clustered) filesystem, by default has 3 copies of data on three different nodes - two on the same rack and the third on a different rack. The nodes communicate to each other using a HDFS- specific protocol that is built on TCP/IP. The nodes are aware of the data present on the other nodes, which is precisely what allows Hadoop job scheduler to divide the work among the nodes. Oh, by the way, HDFS is not absolutely required for Hadoop; but as you can see, HDFS is the only way for Hadoop to know which node has what data for smart job scheduling. Without it, the division of labor will not be as efficient.
HiveNow that you learned how Hadoop fills a major void for computations on a massive dataset, you can't help but see the significance for datawarehouses where massive datasets are common. Also common are jobs that churn through this data. However, there is a little challenge. Remember, NoSQL databases mentioned earlier? That means they do not support SQL. To get the data you have to write a program using the APIs the vendor supplies. This may reek of COBOL programs of yesteryear where you had to write a program to get the data out, making it inefficient and highly programmer driven (although it did strut up job-security, especially during the Y2K transition times). The inefficacy of the system gave rise to 4th Generation Languages like SQL, which brought the power of queries to common users, ripping the power away from programers. In other words, it brought the data and its users closer, reducing the role the middleman significantly. In datawarehouses, it was especially true since the power users issued queries to after getting the result from the previous queries. It was like a conversation - ask a question, get the answer, formulate your next question - and so on. If the conversation were dependent on writing programs, it would have been impossible to be effective.
With that in mind, consider the implications of lack of SQL in these databases highly suitable for datawarehouses. This requirement to write a program to get the data everytime would have taken it straight to the COBOL-days. Well, not to worry, Hadoop has another product that allows a SQL-like language called HiveQL. Just as users could query relational databases with SQL very quickly, HiveQL allows users to get the data for analytical processing directly. It was initially developed at Facebook.
Comparing to Oracle RACWhen we talk about clustering in Hadoop, you may not help wonder - shouldn't the same functionality be provided by Oracle Real Application Cluster? Well, the short answer is - a big resounding NO. Oracle RAC combines the power of multiple nodes in the cluster which communicate with one another and transfer data (cache fusion); but that's where the similarity ends. The biggest difference is the way the datasets are accessed. In RAC, the datasets are common, i.e. they must be visible to all the nodes of the cluster. In Hadoop, the datasets are specific to the individual nodes, which allows them to be local. The filesystems in RAC can't be local; they have to be clustered or available globally, either by a clustered filesystem, shared volumes, clustered volume managers (such as ASM) or by NFS mounting. In Hadoop the local files are replicated to other nodes, which means there is no reason to create a RAID lever protection at the storage level. ASM does provide a software level mirroring, which may sound similar to Hadoop's replication; but remember, ASM's mirrors are not node specific. The mirrored copies of ASM must be visible to all the nodes. There is a preferred node concept in ASM; but that simply means that data is read by a specific node from one mirror copy. The mirror copies can't be local; they must be be globally visible.
Besides, Oracle RAC if for a relational database. The Hadoop cluster is not one. There are traits such as transnational integrity, multiple concurrent writes that are innate features of any modern database system. In Hadoop, these things are not present and not really necessarily. So while both are technically databases, a comparison may not be fair to either Hadoop or RAC. They are like apples and tomatoes. (Tomato is technically a fruit, just in case you are wondering about this analogy).
The PlayersSo, who are the players for this new branch of data processing? The major players are show below with small description. This list is by no means exhaustive. It simply is a collection of companies and products I have studied.
- Cloudera - they have their distribution called, what else, Cloudera Distribution for Hadoop (CDH). But perhaps the most impressive from them is Impala - a real-time SQL like interface to query data from the Hadoop cluster.
- Hortonworks - may of the folks who founded this company came from Google and Yahoo! where they built or added to the building blocks of Hadoop
- IBM - they have a suite called Big Insights which has their distribution of Hadoop. This is one of the very few companies who offer both the hardware and software. the most impressive feature from IBM is a product called Streams that can mine data frm a non-structured stream like Facebook in realtime and send alerts and data feeds to other systems.
- Dell - like IBM they also have a hardware/software combination running a licensed version of Cloudera, along with Pentaho Business Analytics
ConclusionIf the buzzing of the buzz-words surrounding any new technology annoy you and all you get is tons of websites on the topic but not a small consolidated compilation of terms, you are just like me. I was frustrated by the lack of information in a digestible form on these buzzwords that are too important to ignore but would take too much time to understand fully. This is my small effort to bridge that gap and get you going on your quest for more information. If you have 100's or 1000's of questions after reading this, I would congratulate myself - that is precisely what my objective was. For instance how HiveQL differs from SQL or how Map/Reduce jobs are written - these are questions that should be flying in your mind now. My future posts will cover them and some more topics like HBase, Zookeeper, etc. that will unravel the mysteries around the technology that is going to be commonplace in the very near future.
Welcome aboard. I wish you luck in learning. As always, your feedback will be highly appreciated.