Tuesday, May 4, 2010

Getting started with InfiniDB Part I

My new job with Calpont has me scrambling to learn all the facets of the InfiniDB storage engine. For those of you new to the scene, InfiniDB is column-orientated as opposed to row-orientated and is designed for data warehouses and business intelligence. Most MySQL DBAs are probably not used to thinking sideways and hopefully this series of postings will change that.

To make things interesting, I am using real data from BP's Statistical Review of World Energy 2009. The data is a historical information on various forms of energy. And it is free to download and comes in Excel workbook format. I wanted data that was unfamiliar to me and in a format that would be commonly used in a BI setting.

The first step is to obtain and install the InfiniDB community software. Since the install instructions are extremely clear, I will not go over those steps.

Next I exported the data from the Crude oil prices 1861 to 2008 worksheet into a CSV file. Now you have choices on how to load the data into your table. The first is the good old MySQL LOAD DATA INFILE statement.
mysql>USE energy;
LOAD DATA INFILE '/tmp/oil.csv'

Or you can use the supplied bulk loading tool. In this example the table is simple and there are only 148 rows. LOAD DATA INFLE works fine for small sets. I copied CSV file into the directory InfiniDB used for the bulk loading tool -- /usr/local/Calpont/data/bulk/data/import --, changed the commas to pipes (s/,/|/g) preferred by the bulk loader, and prepare it for the columns tructure:

# /usr/local/Calpont/bin/colxml energy -t oil -j04

Bulkload root directory : /usr/local/Calpont/data/bulk
job description file : Job_04.xml
2010-05-04 12:42:21 (5364) INFO : successfully load job file /usr/local/Calpont/data/bulk/job/Job_04.xml
2010-05-04 12:42:21 (5364) INFO : PreProcessing check starts
2010-05-04 12:42:21 (5364) INFO : PreProcessing check completed
2010-05-04 12:42:21 (5364) INFO : preProcess completed, total run time : 0 seconds
2010-05-04 12:42:21 (5364) INFO : No of Read Threads Spawned = 1
2010-05-04 12:42:21 (5364) INFO : No of Parse Threads Spawned = 3
2010-05-04 12:42:21 (5364) INFO : For table energy.oil: 148 rows processed and 148 rows inserted.
2010-05-04 12:42:22 (5364) INFO : Bulk load completed, total run time : 1 seconds

[root@localhost import]# 
Now to load the data:
 /usr/local/Calpont/bin/cpimport -j 04
Running colxml with the following parameters:
2010-05-04 12:42:13 (5312) INFO : 
 Schema: energy
 Tables: oil  
 Load Files: 
 -b 0
 -c 1048576
 -d |
 -e 10
 -f CSV
 -j 04
 -p /usr/local/Calpont/data/bulk/job/
 -r 5
 -w 10485760
 -x tbl

File completed for tables:
Normal exit.
Almost forgot to show you the table!
CREATE TABLE oil (`year` INT(4), price_historical DECIMAL(5,2), price_current(5,2)) ENGINE=InfiniDB;

So now we can ask for the historical average price of a barrel of oil.
mysql>SELECT AVG(price_historical) AS 'Historical Avg Price' FROM oil\g

| Historical Avg Price |
|             7.772297 |
Many of you reading this are probably muttering 'I could have done this with any storage engine except Blackhole! and you are right. The beauty behind the column orientated approach does not appear until you are working with large sets of data. In the next post we will get into large sets of data and compare speeds with other storage engines.


hingo said...

Congrats for the new job!

Will there be InfiniDB certifications then?

Unknown said...

Hi Dave,

There is a typo in the actual link to www.infinidb.org at the top of the page. The incorrect text is www.infinisb.org.

Just an FYI.

I will be looking into the InfiniDB storage Engine a lot more... (does it replicate)? if so, does it keep live table checksums? Cheers.