Hadoop and the members of the Hadoop ecosystem have been known as primarily tools for big data analytics and durable storage. Although they can analyze Terrabytes and even Petabytes of data efficiently, the original Hadoop applications were not meant to efficiently handle transactions, particularly writes to HDFS.
HBase (Hadoop Database) is an open source project that attempts to give Hadoop the ability to make fast, transactional writes to its filesystem. It takes advantage of Hadoop's distributed computing environment to build a durable, consistent, yet high-performing database.
HBase is a wide column key-value store adapted from Google's Big Table. Each row has a specific key known as a rowkey. The data in your HBase cluster is stored lexicographically by that rowkey. The data in the row are represented by columns, which are grouped together into column families. Columns within a column family are stored physically close to each other to take advantage of data locality and minimize communication across nodes.
The following is an example of a table schema in HBase.
Schema adapted from AWS Blog Post
The key in this case is a concatenation of last name, first name, and customer identification number (e.g., armstrong_alice_8570365786
). There are three column families: address
, cc
, and contact
. This means that all of a specific customer's address information is stored together on the same physical machine. Their credit card information, since it all belongs to one column family, is also stored together on the same machine, though it may not be on the same machine as the other column families.
Assuming you have Hadoop and HBase installed in your cluster, you can create a table by first running an HBase shell:
1hbase shell
Then run the create
command. The general syntax of the create command is:
1create '<table_name>', '<column_family_1>', '<column_family_2>', '<column_family_3>', . . . '<column_family_n>'
To create the table illustrated by the schema above, run the following command:
1create 'customer', 'address', 'cc', 'contact'
Source: HBase Architecture
HBase consist of several services that coordinate with each other through Zookeeper, a distributed service discovery and configuration management server. A certain range of rows is stored in HRegions
and served by an HRegionServer
. An HRegionServer
may serve one or many HRegions
. The HMaster
determines what HRegion
each row is stored in, assigns the store
each column family is stored in, and updates the state of the cluster on Zookeeper.
The HBase client communicates with the HMaster
for DDL statements, but mostly interfaces with the HRegionServer
directly to query table data. The client relies on Zookeeper to identify what rows are stored in a particualr region.
Each store on HBase consists of a MemStore
and zero or more StoreFiles
. The MemStore
stores all recent writes to HBase in memory for faster data retreival. When the MemStore
reaches the limit specified in the hbase.hregion.memstore.flush.size
configuration property, the data in the MemStore
is flushed to HDFS in the form of StoreFiles
. A StoreFile
contains an HFile
where the HBase data actually resides. Technically speaking, HFiles
and StoreFiles
are separate entities, but they are used interchangeably in practice.
High availability may be an issue if data is not flushed to HDFS. If there is some sort of cluster failure, all data stored in memory, and therefore all the data in the MemStore
, is erased. To recover from such failures, HBase maintains a Write Ahead Log, or WAL, to store all recent changes made to the table on HDFS. This allows HBase to replay these changes should there be any node failure in the cluster.
Now that you understand the basic schema and architecture of HBase, you can begin inserting data into HBase. The general syntax for data inserts using the HBase shell is as follows:
1put '<table_name>','<rowkey>','<column_family>:<column>','<value>'
You can insert only one value in one column per put
command. The following code block inserts information for two customers in your customer table.
1put 'customer','armstrong_alice_8570365786','address:city','Claraport'
2put 'customer','armstrong_alice_8570365786','address:state','NH'
3put 'customer','armstrong_alice_8570365786','address:street','Melany Gateway'
4put 'customer','armstrong_alice_8570365786','address:zip','53522'
5put 'customer','armstrong_alice_8570365786','cc:number','1212-1221-1121-1234'
6put 'customer','armstrong_alice_8570365786','cc:expire','2024-04-12'
7put 'customer','armstrong_alice_8570365786','cc:type','visa'
8put 'customer','armstrong_alice_8570365786','contact:phone','1-871-480-5984'
9
10put 'customer','bailey_bob_7073092045','address:city','Graysonfurt'
11put 'customer','bailey_bob_7073092045','address:state','NV'
12put 'customer','bailey_bob_7073092045','address:street','Hodkiewicz Glens'
13put 'customer','bailey_bob_7073092045','address:zip','45250'
14put 'customer','bailey_bob_7073092045','cc:number','1228-1221-1221-1431'
15put 'customer','bailey_bob_7073092045','cc:expire','2024-06-12'
16put 'customer','bailey_bob_7073092045','cc:type','mastercard'
17put 'customer','bailey_bob_7073092045','contact:phone','1-828-193-0549'
There are two operations you can perform to read data from HBase: get
and scan
.
The get
operation allows you to get information from a specific rowkey
. The general syntax of a get
operation is as follows:
1get ‘<table_name’>, ‘<rowkey>’, ‘<information>’
For example, the following query returns the credit card type of Alice Armstrong.
1get 'customer', 'armstrong_alice_8570365786', 'cc:type'
You can also get all the information in a given column family.
1get 'customer', 'armstrong_alice_8570365786', {COLUMN => 'address'}
If you omit the information field, you can also get all the information of a specific key across all column families.
1get 'customer', 'armstrong_alice_8570365786'
It is important to note, however, that the get operation only returns information about a single rowkey. To look for information across row keys, you will need to perform a scan operation. For example, the following query looks for all cities whose rowkey begins with armstrong
.
1scan 'customer', {STARTROW => 'armstrong_', ENDROW => 'armstronh', COLUMN => 'address:city'}
HBase is a fast and durable database based on Google Big Table and built on top of the Hadoop ecosystem. It gives users full control over where their data is stored and how the system stores and groups together that data. The architecture takes advantage of HDFS' durable and fault-tolerant file system while also utilizing cluster memory to speed up queries.