February 28, 2012

Database Import is a Series of Tubes

TL;DR

DBTruck is a (pre-alpha!) tool that automatically imports a data file into your database, so you can worry about running queries instead of importing data or designing a schema.  Here’s an example session that imports the FEC’s presidential candidate donation data into a new PostgreSQL table named “contrib” in the database “election”:

git clone git://github.com/sirrice/dbtruck.git
cd dbtruck/
wget ftp://ftp.fec.gov/FEC/Presidential_Map/2012/P00000001/P00000001-ALL.zip
unzip P00000001-ALL.zip
python dbtruck.py P00000001-ALL.txt contrib election

That’s it!
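Once it finishes, the table is queryable like any other.  A quick sanity check might look something like this (assuming a local PostgreSQL where the “election” database is reachable without a password, and the psycopg2 driver installed):

# Quick sanity check that the rows made it in.
import psycopg2

conn = psycopg2.connect(dbname="election")
cur = conn.cursor()
cur.execute("SELECT count(*) FROM contrib")
print(cur.fetchone()[0])
conn.close()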

The long story:

As Ted Stevens famously said,

It’s a series of tubes. And if you don’t understand, those tubes can be filled and if they are filled, when you put your message in, it gets in line and it’s going to be delayed

True.  Tubes are long, thin, and easily clogged.  When a clog happens, it’s a pain to get things flowing again.  Importing raw data into a database is like connecting a series of tubes, and the process goes something like this:

  1. Download a raw text file (e.g., from data.gov, the FEC, etc.)
  2. Stare hard at the data and figure out how to delimit each column
  3. Figure out each column’s type
  4. Come up with a name for each column (100 columns?  too bad!)
  5. Write and run a CREATE TABLE statement
  6. Clean the data by removing invalid values, corrupted rows, etc
  7. Reformat the data into the proper CSV-like format that your database expects (e.g., escaping commas)
  8. Use your database’s bulk loading command (steps 5-8 are sketched in code below)
  9. Whoops, one of your rows was corrupted.  Everything resets. Go back to step 6.
  10. Run a query
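To make steps 5 through 8 concrete, here is roughly what the by-hand route looks like from Python with psycopg2.  The table name, column names, and file name are made up for illustration; they aren’t the real FEC schema.

# Rough sketch of steps 5-8 done by hand, with an illustrative schema.
import psycopg2

conn = psycopg2.connect(dbname="election")
cur = conn.cursor()

# Step 5: write and run the CREATE TABLE statement yourself
cur.execute("""
    CREATE TABLE contrib_manual (
        cand_nm   text,
        contbr_st text,
        amt       float,
        dt        date
    )""")

# Steps 6-8: after cleaning and reformatting the file into CSV, bulk load it.
# One bad row aborts the whole COPY, which is what sends you back to step 6.
with open("donations_cleaned.csv") as f:
    cur.copy_expert("COPY contrib_manual FROM STDIN WITH CSV", f)

conn.commit()
conn.close()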

That’s nine steps where things can screw up (and they will), but there is a reason for this madness.  Modern databases are designed for two major markets: banks and business intelligence.  In both markets the businesses are long running, and are willing to spend days (or weeks! or months!) figuring out the best way to load, store, and curate their data.  In fact, there is a whole industry around getting huge datasets into databases.  It’s called ETL.

The crux of the issue is that data analysis tools trade off “pain in the butt to analyze” against “pain in the butt to set up”.  Tools like grep don’t require any setup, but are limited to “looking for strings”.  At the other extreme, databases do step 10 really, really well: SQL is an incredibly powerful language, and what would take a hundred lines of scripting can be done with a single SQL statement.  The cost of this power is the data import phase, and that scares a lot of people away.
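For a taste of what that power buys, here’s the kind of one-statement analysis that would otherwise be a pile of scripting: donation counts and totals by state.  The column names (contbr_st, contb_receipt_amt) are guesses at what the import produced, and this assumes the amount column was inferred as a numeric type.

# Donation count and total by state, in a single SQL statement.
# Column names are assumed; adjust them to whatever your import produced.
import psycopg2

conn = psycopg2.connect(dbname="election")
cur = conn.cursor()
cur.execute("""
    SELECT contbr_st, count(*), sum(contb_receipt_amt)
    FROM contrib
    GROUP BY contbr_st
    ORDER BY sum(contb_receipt_amt) DESC
    LIMIT 10""")
for state, num, total in cur.fetchall():
    print(state, num, total)
conn.close()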

In the end, what is a poor data analyst going to do?  The analyst who gets her hands on a modest dataset (from a night of web scraping, or downloaded from the web) and just wants to “take a look”?  She will analyze the data for a session or two, compute some histograms, look for trends, and really only care about a handful of columns.  She doesn’t care about a proper schema, and is perfectly happy to throw out the 5% of the data that is bad.

The burning question is:

“Is there a way to go straight to step 10?”

DBTruck is a pre-alpha tool designed to automatically do steps 1-9 for you.  Give it a text data file and it’ll do everything imaginable to get your data into a database.  It’ll figure out how to split your data into columns, infer the column types, and throw out rows that make the load fail before retrying.  The one current requirement is that the data file contain one database row per line.
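To give a flavor of the type inference step, here’s a minimal sketch of the idea (an illustration, not DBTruck’s actual code): sample some values from each column and pick the most specific type that fits all of them, falling back to text when nothing else works.

# Minimal sketch of per-column type inference: try the most specific type
# first and fall back to text.  Illustration only, not DBTruck's actual code.
from datetime import datetime

def guess_type(values):
    def is_int(v):
        try:
            int(v)
            return True
        except ValueError:
            return False

    def is_float(v):
        try:
            float(v)
            return True
        except ValueError:
            return False

    def is_date(v):
        for fmt in ("%Y-%m-%d", "%d-%b-%y"):
            try:
                datetime.strptime(v, fmt)
                return True
            except ValueError:
                pass
        return False

    if all(is_int(v) for v in values):
        return "int"
    if all(is_float(v) for v in values):
        return "float"
    if all(is_date(v) for v in values):
        return "date"
    return "text"   # text always loads

print(guess_type(["250", "2500.00", "35"]))   # prints: float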

Right now, DBTruck works with PostgreSQL and expects a bunch of command line options that tell it which database to load into, whether to append to an existing table or create a new one, and so on.  In the future you shouldn’t even need to figure out the command line options; it will ask you interactively when it’s confused.

I’m really interested in what parts of this are useful and what breaks, so let me know!

In the end, Ted Stevens wasn’t completely wrong:

And again, … [DBTruck] is … something that you just dump something on. It’s … a big truck.  It’s [not] a series of tubes.

Some similar projects

  • Mike Cafarella and Cloudera’s Record Breaker is a great tool for taking a structured text file and inferring column names and column types.  I plan to integrate something like this in the future.
  • Google Refine and Data Wrangler are web-based tools that help you clean up messy data, and transform it into something ready to be loaded into a database.  
