To make things interesting, I am using real data from BP's Statistical Review of World Energy 2009. The data is historical information on various forms of energy, it is free to download, and it comes in Excel workbook format. I wanted data that was unfamiliar to me and in a format commonly used in a BI setting.
The first step is to obtain and install the InfiniDB community software. Since the install instructions are extremely clear, I will not go over those steps.
Next I exported the data from the "Crude oil prices 1861 to 2008" worksheet into a CSV file. Now you have a couple of choices for loading the data into your table. The first is the good old MySQL LOAD DATA INFILE statement.
mysql> USE energy;
mysql> LOAD DATA INFILE '/tmp/oil.csv' INTO TABLE oil FIELDS TERMINATED BY ',';

Or you can use the supplied bulk loading tool. In this example the table is simple and there are only 148 rows, so LOAD DATA INFILE works fine for a set this small. To try the bulk loader, I copied the CSV file into the directory InfiniDB uses for bulk loading -- /usr/local/Calpont/data/bulk/data/import -- and changed the commas to the pipes (s/,/|/g) the bulk loader prefers.
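The copy-and-convert step looks roughly like this (a sketch only; the oil.tbl file name is my assumption, based on the .tbl extension the job file expects):

cp /tmp/oil.csv /usr/local/Calpont/data/bulk/data/import/oil.tbl
sed -i 's/,/|/g' /usr/local/Calpont/data/bulk/data/import/oil.tbl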
Next, colxml prepares the column structure by generating a job file for the table:

# /usr/local/Calpont/bin/colxml energy -t oil -j04
Running colxml with the following parameters:
2010-05-04 12:42:13 (5312) INFO :
Schema: energy
Tables: oil
Load Files:
-b 0
-c 1048576
-d |
-e 10
-f CSV
-j 04
-n
-p /usr/local/Calpont/data/bulk/job/
-r 5
-s
-u
-w 10485760
-x tbl
File completed for tables: energy.oil
Normal exit.

Now to load the data:

/usr/local/Calpont/bin/cpimport -j 04
Bulkload root directory : /usr/local/Calpont/data/bulk
job description file : Job_04.xml
2010-05-04 12:42:21 (5364) INFO : successfully load job file /usr/local/Calpont/data/bulk/job/Job_04.xml
2010-05-04 12:42:21 (5364) INFO : PreProcessing check starts
2010-05-04 12:42:21 (5364) INFO : PreProcessing check completed
2010-05-04 12:42:21 (5364) INFO : preProcess completed, total run time : 0 seconds
2010-05-04 12:42:21 (5364) INFO : No of Read Threads Spawned = 1
2010-05-04 12:42:21 (5364) INFO : No of Parse Threads Spawned = 3
2010-05-04 12:42:21 (5364) INFO : For table energy.oil: 148 rows processed and 148 rows inserted.
2010-05-04 12:42:22 (5364) INFO : Bulk load completed, total run time : 1 seconds
[root@localhost import]#

Almost forgot to show you the table!
CREATE TABLE oil (`year` INT(4), price_historical DECIMAL(5,2), price_current DECIMAL(5,2)) ENGINE=InfiniDB;
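A quick sanity check is to make sure the row count matches the 148 rows cpimport reported:

mysql> SELECT COUNT(*) FROM oil;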
So now we can ask for the historical average price of a barrel of oil.
mysql> SELECT AVG(price_historical) AS 'Historical Avg Price' FROM oil\g
+----------------------+
| Historical Avg Price |
+----------------------+
| 7.772297 |
+----------------------+

Many of you reading this are probably muttering 'I could have done this with any storage engine except Blackhole!' and you are right. The beauty of the column-oriented approach does not appear until you are working with large sets of data. In the next post we will get into large data sets and compare speeds with other storage engines.
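In the meantime, here is a preview of the kind of aggregate a column store is built to chew through -- my own example against the same oil table, not something I have benchmarked here:

mysql> SELECT FLOOR(`year`/10)*10 AS decade, AVG(price_current) AS 'Avg Price' FROM oil GROUP BY FLOOR(`year`/10)*10 ORDER BY decade;

A query like this touches only the two columns it needs, which is exactly where a column-oriented engine should start to pull ahead once the table grows well beyond 148 rows.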
2 comments:
Congrats on the new job!
Will there be InfiniDB certifications then?
Hi Dave,
There is a typo in the actual link to www.infinidb.org at the top of the page. The incorrect text is www.infinisb.org.
Just an FYI.
I will be looking into the InfiniDB storage engine a lot more... Does it replicate? If so, does it keep live table checksums? Cheers.