The database ingestion process is complex. Mike Colee and Alex Clayton can answer questions about how the data flow from the site to the servers at UCSB. Ned Bair can answer questions about how the data are then ingested into the database and made available on the web.
A brief description of the ingestion process:
Tables from each of the data loggers are sent as flat files from the CUES site to a Linux server at UCSB every 15 minutes. The number of fields and the measurement intervals vary with the data logger and instrument.
Once the flat files arrive at UCSB, they are parsed with command-line GNU text-processing tools, such as “sed” and “tail”, and imported into a MySQL server with the mysqlimport program. Before the flat files can be imported, tables with the appropriate fields and data types must be created in MySQL. A new table must be created each time a data logger is installed, or whenever the tables from an existing logger are altered. Eight-byte MySQL DATETIME stamps are used as the primary key. This prevents duplicate records, since a logger should never record multiple measurements for one instrument at the same time. Because primary keys are indexed, this also speeds up searches significantly, since almost all queries for the CUES data are sorted by a time range.

Current tables, as of this writing, are all named with the prefix “CUES_CR3K” (outlined in yellow) [FIGURE 1]. These tables extend back to August 2007. Logger tables with the suffix “02_07” (outlined in red) [FIGURE 1] contain records for November 2002 through July 2007. Prior to 2002, it was impossible to reconstruct all of the logger tables, so data from an old SQL Server database (called mmsaSQL) were used. Those data were stored in a more normalized format than the data logger tables, which accounts for the different structure of the “publicdata9203_small” table (outlined in brown) [FIGURE 1]. The “small” suffix was added because these data came from a parent table with extraneous fields, which were eliminated. The import process is scheduled with “crontab”, with tasks executing several times every hour.
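The cleanup-and-import step described above can be sketched as a small shell script. This is only an illustration: the file name CUES_CR3K_met.dat, the database name cuesdb, the four-line header, and the column layout are all assumptions, not the actual CUES files.

```shell
#!/bin/sh
# Sketch of the flat-file cleanup step, assuming a logger upload named
# CUES_CR3K_met.dat with a four-line Campbell-style header. The real
# file names, header lengths, and columns may differ.

# Create a small sample file standing in for one 15-minute upload.
cat > CUES_CR3K_met.dat <<'EOF'
"TOA5","CUES","CR3000","1234","met"
"TIMESTAMP","RECORD","AirTC","SWin"
"TS","RN","Deg C","W/m^2"
"","","Avg","Avg"
"2024-01-15 10:00:00",101,-2.3,412.5
"2024-01-15 10:15:00",102,-2.1,430.8
EOF

# tail skips the header lines; sed strips the quotes, leaving plain
# comma-separated values that mysqlimport can load.
tail -n +5 CUES_CR3K_met.dat | sed 's/"//g' > CUES_CR3K_met.txt

# mysqlimport derives the target table name from the file name, so
# CUES_CR3K_met.txt loads into the pre-created CUES_CR3K_met table.
# The TIMESTAMP column maps onto the table's DATETIME primary key, so a
# re-imported row is rejected as a duplicate rather than inserted twice.
# mysqlimport --local --fields-terminated-by=',' cuesdb CUES_CR3K_met.txt
```

In production, a script of this shape would run from the crontab entries mentioned above rather than by hand.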
Data logger table graphs:
After import into MySQL, the data logger tables are queried and graphed several times per hour with the gnuplot program via another “crontab” entry. The output is a PNG file. Because gnuplot references fields by column number, every time an incoming data logger table changes, the column numbers in the plotting commands must be updated accordingly. The graphs allow users to monitor current conditions and to check on the loggers, since it is common for a logger to stop transmitting data without warning.
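One such plotting job might look like the sketch below. The file names (latest_met.txt, air_temp.gp, air_temp.png) and the column assignments are hypothetical; the point is the hard-coded "using 1:3" column reference that must be renumbered whenever a logger table changes.

```shell
#!/bin/sh
# Sketch of one crontab-driven plotting job, assuming a query dump named
# latest_met.txt (DATETIME in column 1, air temperature in column 3).

cat > air_temp.gp <<'EOF'
set terminal png size 800,400
set output 'air_temp.png'
set xdata time
set timefmt '%Y-%m-%d %H:%M:%S'
set datafile separator ','
# Column 1 is the DATETIME stamp, column 3 the air temperature;
# renumber "using 1:3" if the incoming table layout changes.
plot 'latest_met.txt' using 1:3 with lines title 'Air temp (C)'
EOF

# In production this would run from crontab several times per hour, e.g.:
# */15 * * * * gnuplot /path/to/air_temp.gp
```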
Having a different table for each data logger makes it difficult to query a specific measurement, or group of measurements, across loggers. For instance, a query for 20 years of incoming solar radiation would require a complex set of UNIONs. To overcome this, we normalize measurements dynamically when the database is queried. This is accomplished with stored procedures (in green) [FIGURE 2], which build temporary tables and then populate them with values for user-selected measurements. The stored procedures make use of related tables (in teal) [FIGURE 2], which define the contents of the data tables described above. Below is a list of the tables that define records in the data tables from all time periods:
· DataTable – ID and name for each table containing measurements in the database.
· DataTableColumns – ID and name for columns in each table listed in DataTable, as well as foreign keys which describe the instrument and type of measurement in each column.
· Instruments – ID, name, and description of each instrument.
· Measurement – ID, name, description, and units for each type of measurement. Also contains a foreign key which groups measurements.
· MeasurementGroup – ID, name, and description of measurement groups.
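The joins at the heart of this normalization can be sketched as follows. Only the table names listed above come from the source; the foreign-key column names (DataTableID, InstrumentID, MeasurementID), the measurement name string, and the database name cuesdb are guesses for illustration.

```shell
#!/bin/sh
# Sketch of how the related tables resolve one measurement across all
# logger tables. The stored procedures use joins of this shape to decide
# which data-table columns to pull into a temporary table. Foreign-key
# column names here are assumptions, not the actual schema.

cat > find_columns.sql <<'EOF'
-- Find every column, in every data table, that holds incoming solar
-- radiation, so one normalized result can span all time periods.
SELECT dt.name  AS data_table,
       dtc.name AS column_name,
       i.name   AS instrument
FROM   DataTableColumns dtc
JOIN   DataTable   dt ON dt.ID = dtc.DataTableID
JOIN   Instruments i  ON i.ID  = dtc.InstrumentID
JOIN   Measurement m  ON m.ID  = dtc.MeasurementID
WHERE  m.name = 'Incoming Solar Radiation';
EOF

# A stored procedure iterates over rows like these, building and
# populating a temporary table instead of hand-writing one UNION per
# logger table:
# mysql cuesdb < find_columns.sql
```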
PHP web front-end:
Web page forms for submitting query parameters and for displaying query results were built with the PHP (PHP: Hypertext Preprocessor) scripting language, using the mysql and mysqli database extensions.