Wednesday, June 30, 2010

Doing your own on-time flight time analysis Part III

In the last post, the data from the on-time flight database was loaded in a column-orientated storage engine. Now the numbers can be crunched.

The original goal of this exercise was to find the flight from Los Angeles International Airport, LAX, to Dallas Fort Worth International Airport, DFW, that was the most likely to arrive on-time.

The data is 'opportunity rich' in that there is a lot information in there. It is easy to start wondering about the various nuggets of information in there. Are their certain aircraft (tail numbers) that are routinely bad performers? Are some days of the week better than others? Do national holidays have an effect on the on-time performance? If you are delayed, is there a 'regular amount' of delay? Does early departure make for an early arrival? Can the flight crew make up for a late departure? How much time is usually spend on runways?

But to look for the flight from LAX to DFW with the lowest arrival delay1.

SELECT FlightNum, SUM(ArrDelayMinutes) as Delay
FROM flight
WHERE Origin = 'LAX'
AND Dest = 'DFW'
ORDER BY FlightNum;

So there are lots of good questions to ask with this data but it ends this series on doing your own on-time flight analysis. If anyone digs into the flight data further, please let the rest of us know what you are finding.

1. For the data on hand the answer is American Airlines flight 2463.

Thursday, June 24, 2010

Doing your own on-time flight analysis, Part II

The On Time Flight Data (see previous entry) has 93 rows per flight and the first line of the CSV file has the names of the columns. Use head -1 on that file to obtain that information and to build a schema by piping the output to a file (and do not forget to remove that top line before loading the data).

"Year", "Quarter", "Month", "DayofMonth",
"Div5LongestGTime", "Div5WheelsOff", "Div5TailNum",

Strip out the quotation marks, add CREATE TABLE flight ( at the front and ) ENGINE=InfiniDB; at the end. Year is the title of the first column and it has been renamed to FlightYear to avoid confusion with the function of the same name. The next part is to assign a data type to each column and a quick perusal of the data will show which fields are numeric, which ones are alphanumeric, and the one column of type DATE. Personal preference on the designation are left up to the reader as an exercise in data architecture.

CREATE TABLE flight (FlightYear smallint,
Quarter tinyint,
Month tinyint,
Div5WheelsOff int,
Div5TailNum int) engine=InfiniDB;

Create the table by running MySQL and sourcing the file created. Now the data needs some attention.

The InfiniDB bulk loader cpimport can use any character to delimit fields. In CSV files it is painfully obvious that the fields are separated by commas. However the On Time Flight Data has a column, OriginCityName, that contains a comma. A extra column can be added after OriginCityName, the comma can be removed from that one column, OR the other commas need to be changed to another character. The easiest for most will be adding an extra column.

Next a XML file needs to be created using colxml, colxml -j 50 -d "," -l On_Time_Performance_2010_1.csv -t flight air. The the data can be loaded with cpimport, cpimport -j 50.

So for a quick test! Can I find the flights from LA back to Texas?

mysql> SELECT FlightNum, COUNT(FlightNum) FROM flight WHERE Origin = 'LAX' and Dest = 'DFW' GROUP BY FlightNum ORDER BY FlightNum;
| FlightNum | COUNT(FlightNum) |
| 2400 | 31 |
| 2404 | 30 |
| 2410 | 31 |
| 2412 | 31 |
| 2416 | 31 |
| 2422 | 31 |
| 2428 | 31 |
| 2434 | 31 |
| 2436 | 31 |
| 2440 | 31 |
| 2444 | 21 |
| 2446 | 31 |
| 2448 | 31 |
| 2450 | 26 |
| 2464 | 31 |
| 2470 | 31 |
| 2501 | 1 |
| 6219 | 31 |
| 6234 | 31 |
19 rows in set (0.52 sec)

Next time, find the flight least likely to be delayed!

Tuesday, June 22, 2010

Doing your own on-time flight analysis, Part I

This will be a quick tutorial on looking at on-time flight analysis. This material will be part of a lab for a class on InfiniDB that I am developing. The information is from Data.Gov Website and you are free to follow the steps presented.

What I want to know is what flight from a certain airport arrives at my local airport on time the most frequently. Traveling from LAX to DFW can often be a combination of cancellations, flight delays, and being the nth plane in line for takeoff. So what is the best flight choice for that route?

The first step is getting the data. And is is available for free from Airline On-Time Performance and Causes of Flight Delays. Be sure to select the check box for documentation so that there will be a readme.html to described the file fields included in the zipped file you will download. If you pick all the available information you will receive over 500,000 lines in a .CSV file.

It is worth comparing what the readme.html file and a few sample lines of data from the .CSV file. Any programmer who has written to the specification and then found the 'real world' data is different from what was expected will expound on the value of double checking what you are 'supposed to have' and 'what you got'. In this case, for example, the FlightDate is documented as yyyymmdd but is yyyy-mm-dd in the data. This in itself is not a big deal but a little time can save aggravation later.

So in Part II, we will create a table and load up the data.

Tuesday, June 15, 2010

Anybody have a few millions lines of Apache Log files they can share?

Anybody have a few millions lines of Apache Log files they can share? I am working on a lab/demo on InfiniDB and need a few millions lines of an Apache HTTPD log file. I no longer run any large websites and would rather use real data over creating something. I will sanitize your URL so you will be anonymous, so will end up as or something similar.

The demo/lab will show how to load data into the columnar InfiniDB storage engine and run some analytics against the data. Please let me know if you can help.

Wednesday, June 9, 2010

InfinDB Extent Maps saving I/O

InfiniDB extent map example 1 InfiniDB uses Extent Maps to retrieve data. Data is stored in extents and each extent is a logical block of space that holds the same number of rows.

The Extent Map catalogs all extents and their corresponding blocks. The Extent Map also maintains minimum and maximum values for a column’s data within an extent.

The Extent Map lets InfiniDB to retrieve only the blocks needed to satisfy a query. There is another benefit – that of logical range partitioning. This is accomplished via the minimum and maximum values for each extent that are contained within the Extent Map. Extent elimination is first accomplished in InfiniDB via the column-oriented architecture (only needed columns are scanned), but the process is accelerated because of the logical horizontal partitioning that is implemented in the Extent Map. So only the needed sections of the column are read. So if only the data from Col1 between 105 and 180 is requested, InfiniDB will return data from Extent 2, skipping Extents 1, 3 and 4. That saves a 75% of I/O compared to a row based storage engine.

InfiniDB extent map example 2 The savings in I/O also carries over to more complex queries. If a column WHERE filter of “col1 BETWEEN 220 AND 250 AND col2 < 10000” is specified, InfiniDB will eliminate extents 1, 2 and 4 from the first column filter, then, looking at just the matching extents for col2 (i.e. just extent 3), it will determine that no extents match and return zero rows without doing any I/O at all.

Upon system startup, the Extent Map file is read into memory and updated automatically as needed.