Count all the Bees in the “Hive”

After working on the Map-Reduce word counting example and tinkering with a couple of other map-reduce programs, I thought I would try something with Hive. I ended up doing the same word count exercise because it is simple to set up and, though trivial, it gives a nice contrast to using map-reduce directly.

Briefly, Hive presents a SQL-like interface to Hadoop. The Hive query language is called QL and, like SQL, it is declarative. That means one uses QL to declare what results are wanted rather than specifying detailed procedures for getting them; Hive goes and figures out the best way to answer the request. This is in contrast to the map-reduce example I presented before, which uses a procedural approach: one has to program all the steps in the form of mappers and reducers in order to produce the desired result. QL hides the map-reduce procedures from the user just as a SQL-based relational database abstracts away the procedures and structures used to return results. In that way QL is easier to use than programming map-reduce, but it is less flexible and requires more structure to the data.

To be fair, using map-reduce directly in Hadoop also hides much of the detail of distributed computing by automatically partitioning input data, scheduling tasks across a cluster, scaling, handling inter-machine communication, and recovering from failures. This allows programmers to focus their work on how to express a problem in terms of key/value pairs and the needed aggregation on like keys. Once that is done, what remains is writing rather straightforward mapper and reducer methods, further aided by using and extending the rich set of Hadoop and Java APIs that are part of the implementation and framework.
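To make the contrast concrete, here is a minimal sketch of what that procedural approach looks like using the standard Hadoop Mapper and Reducer classes. This is only a sketch, not the exact code from my earlier post; the class names and the simple whitespace tokenizing are illustrative choices.

// A minimal sketch of procedural word count with the Hadoop Java API.
// (Illustrative only -- not the exact code from the earlier post.)
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emit a (word, 1) pair for every word on the input line.
  public static class WordMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String token : line.toString().toLowerCase().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // one key/value pair per word
        }
      }
    }
  }

  // Reducer: Hadoop groups the pairs by word; sum the counts for each key.
  public static class WordReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable count : counts) {
        sum += count.get();
      }
      context.write(word, new IntWritable(sum));
    }
  }

  // Driver: wire the mapper and reducer into a job and submit it to the cluster.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordMapper.class);
    job.setReducerClass(WordReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With Hive, all of that boilerplate is generated for you from a single QL statement.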
Hive sits on top of all this, hiding even more of the complexity of Hadoop. I do not want to create the false impression that Hive and map-reduce have nothing to do with each other. Hive takes the QL statements, “compiles” them into map-reduce jobs, and submits them to the Hadoop cluster for execution. The results emitted by the reduce phase are displayed by Hive to the end user. As you would expect, the query language needs some structure in order to read, write, and process the data. As such, in order to use Hive you must create a schema and load data into it. To facilitate that, Hive has a data definition language (DDL) used to create tables with attributes, data types, and other structures. Hive makes Hadoop look a lot like a SQL database.

Again, in the interest of guarding against false impressions, there are many other differences between Hive and relational technology. I suggest reading this for a more comprehensive description of Hive. Note that Hive use cases are data warehouse use cases involving long scans rather than short reads or updates as in a transactional application. I do not believe that Hive supports atomic transactions, commits, roll-backs, etc. Things are evolving fast, however, and a lot of work is being done by many to merge and blur the distinction between a SQL relational database and the Hadoop map-reduce/HDFS framework. Specifically, look at the work Hortonworks, Cloudera, and Hadapt are doing on this topic.

More than enough said; let’s get on with the simple word count exercise using Hive. One more thing: the title of this blog is taken from an old song by the Nitty Gritty Dirt Band entitled “The House at Pooh Corner”, which was based on the famous children’s books by A. A. Milne in which the main character is a stuffed bear named Winnie-the-Pooh. Very famous indeed. Now on to this simple example.

Step 1. Create the schema and prep the data. This example uses one table with one column in it. I called the table pooh and gave it a column called words. Here is the interaction with Hive for this step.

hive > CREATE DATABASE mytest;
hive > USE mytest;
hive > CREATE TABLE pooh
(words string);
hive > DESC pooh;
OK
words	string
OK

So far this looks pretty familiar to anyone who has used a database such as MySQL. One thing to point out is that we use a data type called string instead of varchar, which is commonly used in a database.

Step 2. Load the data into the table. I wrote a short Java program to take a text file and output every word on its own line. It converts all the text to lowercase and strips out punctuation except apostrophes and hyphens. The file with the single-word lines is then loaded into the single-column table created in Step 1.
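For the curious, here is roughly what that word-splitting program looks like. This is a sketch rather than the exact program I ran; in particular the class name and the regular expression are just illustrative.

// Sketch: read a text file, lower-case it, drop punctuation except
// apostrophes and hyphens, and print one word per line.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class WordSplitter {
  public static void main(String[] args) throws IOException {
    for (String line : Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)) {
      // Everything that is not a letter, digit, apostrophe or hyphen becomes a space.
      String cleaned = line.toLowerCase().replaceAll("[^a-z0-9'-]+", " ");
      for (String word : cleaned.trim().split("\\s+")) {
        if (!word.isEmpty()) {
          System.out.println(word);
        }
      }
    }
  }
}

Running something like java WordSplitter book.txt > poohWords.out (book.txt being a placeholder name here) produces the file used in the LOAD below.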

hive > LOAD DATA LOCAL INPATH
'/home/tgendron/dataFiles/poohWords.out'
INTO TABLE mytest.pooh;
Copying data from file:
/home/tgendron/dataFiles/poohWords.out
[...snip...]
Loading data to table mytest.pooh
chgrp: changing ownership of
'/user/hive/warehouse/mytest.db/pooh/poohWords.out':
 User does not belong to hive
[...snip...]
Time taken: 0.797 seconds

Once again, we see syntax very familiar from MySQL. Hive was pretty chatty about its actions, and the error about the user not belonging to hive does not seem to matter for this exercise. The data was loaded, and queries on the table did what was expected.

Step 3. Use Hive to count the words in a Winnie-the-Pooh book.

hive > SELECT words, count(*) FROM pooh GROUP BY words;

Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified.
Estimated from input data size: 1
[...snip...]
Hadoop job information for Stage-1:
number of mappers: 1; number of reducers: 1
[...snip...]
Time taken: 36.297 seconds
[...snip...]

And the results:

OK
a              32
about           3
after           1
all            15
always          1
and            69
another         2
[...snip...]
terrifying      1
than           14
thank           8
thank-you       2
thanks          1
that          259
that's         37
thaws           1
the           717
the-pooh        2
their          11
them           52
themselves      4
then          139
there         111
[...snip...]
Time taken: 34.18 seconds

There you have it: counting words with Hive. Something interesting to note is the time taken, just over 34 seconds. We have scalability here, but not much speed. Let's try this with Vertica, which is also a scalable solution.

=> \timing
Timing is on.
=> select words, count(*) from pooh group by words;
[...snip...]
Time: First fetch (1000 rows): 37.122 ms.
All rows formatted: 1821.101 ms

Wow, big difference. Others have shown this sort of thing too; look at this blog on counting triangles, and there are also these results from UC Berkeley. Scalability, availability, speed: pick two or three. I am a little partial to Vertica since I have a little experience with it. It would be good to see how the performance of Vertica compares to Spark and Impala. Just looking at the ratios of these against Hive on the Berkeley benchmarks, I think Vertica has a significant performance advantage. And you don't give up scalability or availability. Also, you can use the same commodity hardware. Not to mention excellent compression. Just saying…. Use the best tool for the job, whatever that may be for your particular task.