WSDOT Research Paper
June 9th, 2011 by Anthony Curreri
I was the database administrator on a project called Developing a GPS-Based Truck Freight Performance Measure Platform. The project resulted in a research paper, on which I am a named author and which was published by the Washington State Department of Transportation (WSDOT), and the Transportation Research Board (TRB).
I wrote this portion of the paper, which outlines the work I did on the project:
The database feed from each GPS vendor is different. One vendor emails a CSV file, a flat file with comma-separated values. The research team has written a PHP script to parse the data in this file and load it into a database table. Vendor A provided the install file for a JavaScript client, which also loads data into a single database table. This client only requires the connection information for the MySQL database, and now it automatically retrieves live data.
Vendors B and C each provided a sample program written in C#, both of which required considerable modification before they would retrieve data properly. Vendor C’s program retrieves data from its Web service in a compressed format and saves this directly to a ZIP file. It was desirable to decompress the data stream in the C# program and then use C# to load the data into MySQL. The database administrator was unable to find and install the proper decompression classes into Visual Studio so instead wrote a script that uses the C# program to download a ZIP file, invokes another program to decompress the ZIP (which contains a CSV file), then runs yet another script to open the CSV file, parse it, and insert it into a single database table. Vendor B’s C# application downloads data from its Web service in an XML object. The administrator decided to use C# to parse the XML and write the data out as a CSV file, which is then parsed and inserted into a single database table in the same way as vendor C’s data.
Our largest dataset receives about 3 million rows of data each month. Having such a large amount of data in a single table makes processing very time consuming. Also, each dataset provides special information. This special, extended information varies from vendor to vendor. To speed up processing, at the end of every month the project team creates two new tables for each dataset. These tables contain only data for the previous month. The data are split into “core” values, which we need for our processing, and “extended” values, which comprise everything else we are provided. These tables are easy to join together to do processing that utilizes data points in the extended table; however, to speed processing time we keep all our active data points in the core tables.
For origin and destination (OD) processing, the project team runs a script directly on these MySQL database tables, which have been split into month-long datasets. For GIS processing, we then export the core table for each month into CSV files. These CSV files are easier to work with because they leave off data points we are not able to use in GIS (those stored in the extended tables). These CSV files also contain O/D trip information we generated. When GIS processing has been completed, transportation analysis zone and other information is written into a new CSV file, which is then parsed and used to update the database to contain the GIS information.
