Migrating Hive data from Google Cloud to Windows Azure

I've recently moved some Hive tables that were stored on Google Cloud to a Windows Azure HDInsight cluster, and thought it might be useful to share my experience here.

So, you won't see any super exciting code in this post, just a bunch of commands and links that might make your life easier in case you ever need to do the same thing.

The migration procedure will consist of 3 simple steps:
1. Download the Hive data files from the Google Cloud bucket to some local storage.
2. Upload the data to a blob storage on Azure.
3. Create external Hive tables in the HDInsight cluster on top of the uploaded data.

Download data to local storage

Assuming your Hive cluster on Google Cloud is configured to use Google Cloud Storage, you can use the gsutil command-line tool to download the data locally.

After installing and authenticating gsutil it should be ready to use.
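For example, assuming you installed gsutil as part of the Google Cloud SDK (standalone installs use gsutil config instead), authenticating and verifying access to your bucket might look like this:

gcloud auth login  
gsutil ls gs://[your hive google storage path]  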

We will next use the cp command:

gsutil -m cp -r gs://[your hive google storage path] [local path]  

Here the -r switch specifies a recursive copy operation, and -m performs the copy in parallel using multiple threads/processes (this is super important and can save you a lot of time).
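To make this concrete, here is what the copy might look like with made-up bucket and directory names:

gsutil -m cp -r gs://my-hive-bucket/hive/warehouse/my_table /data/hive-export/my_table  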

Note that you can easily find the location of your Hive data files by running the following command inside your Hive console:

DESCRIBE FORMATTED [table name];  

Among the (many) fields in the output is a Location field, which contains the path of the Hive data files.
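For example, the relevant part of the output might look something like this (the bucket and path here are hypothetical):

Location:   gs://my-hive-bucket/hive/warehouse/my_table  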

Upload data to Azure blob storage

Assuming you have an Azure blob storage account set up, there are several options you can use to move your files to the cloud.

Note: In case you have an HDInsight cluster already configured, make sure to use the same storage account that the cluster uses.

I. If you are using a Windows machine, AzCopy is an easy-to-use command-line tool for uploading your data.
Your AzCopy command will look something like this:

"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe" /Source:[local data dir] /Dest:[Azure blob storage path]/[your data path] /DestKey:[your storage key] /S /Pattern:* /Y  

Note that you can retrieve your storage key and path from the Azure management portal by opening the storage account's settings:

The storage key is located under the Keys tab, inside the PRIMARY ACCESS KEY field.
The storage path will be located under the Properties tab inside the Primary Blob Service Endpoint field.

Note that [your data path] should contain the relative directory path in which the data files will eventually reside (for example: my/awesome/path).
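Putting it all together, a concrete invocation (with made-up account name, container and local path) might look like:

"C:\Program Files (x86)\Microsoft SDKs\Azure\AzCopy\AzCopy.exe" /Source:C:\data\hive-export\my_table /Dest:https://mystorageaccount.blob.core.windows.net/mycontainer/somepath/my_table /DestKey:[your storage key] /S /Pattern:* /Y  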

II. An alternative to AzCopy is the Azure Storage Explorer, which provides an easy-to-use, explorer-like UI for browsing your storage and uploading or downloading data. It works on both Windows and Mac, and lets you easily connect to any of your storage accounts once authentication is completed.

III. If you still prefer a command-line tool (or you're using Linux), you can also check out the Azure Command-Line Interface and its set of storage management commands.
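As a rough sketch, uploading a whole directory with the cross-platform Azure CLI (the 'az' tool; the account name, container and paths below are placeholders) could look something like this:

az storage blob upload-batch --source /data/hive-export/my_table --destination mycontainer --destination-path somepath/my_table --account-name mystorageaccount --account-key [your storage key]  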

Create external Hive tables in HDInsight


After we've moved our data to Azure, we still need to tell Hive how to use it.
If you don't have an HDInsight cluster yet, you can follow this tutorial to set up a new one. When you choose a storage account, make sure to pick the same storage account that you used in the previous section.

Next, we will take advantage of Hive's external table feature.

An external table is a table that is backed by a set of files which aren't located in Hive's default location for table data.
Whenever an external table is created, the location of its data should also be specified. If the specified location already contains data files, we can immediately start running queries on the existing data!

To create an external table, log into your Hive console and run the following command:

CREATE EXTERNAL TABLE  
[table name](
    [column1 name] [column1 type],
    [column2 name] [column2 type],
    ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '[DELIMITER]'  
STORED AS [file format] LOCATION 'wasb:///[path to data]';  

For example, if our table is called 'my_table', has two string fields, and its data files (Ctrl-A delimited text files) were uploaded to the path somepath/my_table, our command will look like this:

CREATE EXTERNAL TABLE  
my_table(  
    first STRING,
    second STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\001'  
STORED AS TEXTFILE LOCATION 'wasb:///somepath/my_table/';  

Note: The DESCRIBE FORMATTED [table name] command that we mentioned earlier will also give you the descriptions of the table's fields and the format of its data files.
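Once the table is created, a quick sanity check is to run a small query over the migrated data, for example:

SELECT * FROM my_table LIMIT 10;  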

That's it! You should now be able to run Hive queries on your migrated data using your HDInsight cluster.