Tuesday, June 22, 2010

Doing your own on-time flight analysis, Part I

This will be a quick tutorial on looking at on-time flight analysis. This material will be part of a lab for a class on InfiniDB that I am developing. The information is from Data.Gov Website and you are free to follow the steps presented.

What I want to know is what flight from a certain airport arrives at my local airport on time the most frequently. Traveling from LAX to DFW can often be a combination of cancellations, flight delays, and being the nth plane in line for takeoff. So what is the best flight choice for that route?

The first step is getting the data. And is is available for free from Airline On-Time Performance and Causes of Flight Delays. Be sure to select the check box for documentation so that there will be a readme.html to described the file fields included in the zipped file you will download. If you pick all the available information you will receive over 500,000 lines in a .CSV file.

It is worth comparing what the readme.html file and a few sample lines of data from the .CSV file. Any programmer who has written to the specification and then found the 'real world' data is different from what was expected will expound on the value of double checking what you are 'supposed to have' and 'what you got'. In this case, for example, the FlightDate is documented as yyyymmdd but is yyyy-mm-dd in the data. This in itself is not a big deal but a little time can save aggravation later.

So in Part II, we will create a table and load up the data.

4 comments:

Jeremy Cole said...

Dave,

Have you seen flightstats.us? I did a bunch of work on this before (for a MySQL/subqueries demo) which you could probably reuse. :)

Justin Swanhart said...

You also have to be very careful because data between different time periods in those files is structured differently. You'll notice that some fields are marked as "not in data set before X" and those files don't have those columns.

Further, the order of the columns is different in different years.

A guess a good ETL tool will automatically map the column headers to rows, but I didn't have much luck with that.

Justin Swanhart said...

Apparently those consistency problems in the data were cleaned up since the last time I downloaded the data, in early 2008 :)

Sheeri K. Cabral said...

Interestingly I have been toying with the idea of a similar project - comparing weather forecasts with actual weather.

Often I'll wonder if yesterday really reached that high temperature that was forecasted, or if the snow/rainfall accumulation was what the projection was, etc.