Thursday, June 24, 2010

Doing your own on-time flight analysis, Part II

The On Time Flight Data (see previous entry) has 93 rows per flight and the first line of the CSV file has the names of the columns. Use head -1 on that file to obtain that information and to build a schema by piping the output to a file (and do not forget to remove that top line before loading the data).

"Year", "Quarter", "Month", "DayofMonth",
...
"Div5LongestGTime", "Div5WheelsOff", "Div5TailNum",

Strip out the quotation marks, add CREATE TABLE flight ( at the front and ) ENGINE=InfiniDB; at the end. Year is the title of the first column and it has been renamed to FlightYear to avoid confusion with the function of the same name. The next part is to assign a data type to each column and a quick perusal of the data will show which fields are numeric, which ones are alphanumeric, and the one column of type DATE. Personal preference on the designation are left up to the reader as an exercise in data architecture.

CREATE TABLE flight (FlightYear smallint,
Quarter tinyint,
Month tinyint,
...
Div5WheelsOff int,
Div5TailNum int) engine=InfiniDB;


Create the table by running MySQL and sourcing the file created. Now the data needs some attention.

The InfiniDB bulk loader cpimport can use any character to delimit fields. In CSV files it is painfully obvious that the fields are separated by commas. However the On Time Flight Data has a column, OriginCityName, that contains a comma. A extra column can be added after OriginCityName, the comma can be removed from that one column, OR the other commas need to be changed to another character. The easiest for most will be adding an extra column.

Next a XML file needs to be created using colxml, colxml -j 50 -d "," -l On_Time_Performance_2010_1.csv -t flight air. The the data can be loaded with cpimport, cpimport -j 50.

So for a quick test! Can I find the flights from LA back to Texas?


mysql> SELECT FlightNum, COUNT(FlightNum) FROM flight WHERE Origin = 'LAX' and Dest = 'DFW' GROUP BY FlightNum ORDER BY FlightNum;
+-----------+------------------+
| FlightNum | COUNT(FlightNum) |
+-----------+------------------+
| 2400 | 31 |
| 2404 | 30 |
| 2410 | 31 |
| 2412 | 31 |
| 2416 | 31 |
| 2422 | 31 |
| 2428 | 31 |
| 2434 | 31 |
| 2436 | 31 |
| 2440 | 31 |
| 2444 | 21 |
| 2446 | 31 |
| 2448 | 31 |
| 2450 | 26 |
| 2464 | 31 |
| 2470 | 31 |
| 2501 | 1 |
| 6219 | 31 |
| 6234 | 31 |
+-----------+------------------+
19 rows in set (0.52 sec)



Next time, find the flight least likely to be delayed!

1 comment:

Swany said...

The loader doesn't support optional enclosures? All the string columns in the ontime data are enclosed in quotes.

You can load it without modifications use the regular MySQL loader:
LOAD DATA INFILE ... FIELDS TERMINATED BY ',' optionally enclosed by '"'