What

Some random doodles I'm inspired to make, mixed in with some tech related ideas.

April 3, 2012

News Class: Explainers

A friend brought the recent press surrounding the Trayvon shooting to my attention. While many of the articles I read provided insight into how the shooting progressed, few articles placed the story in the broader context.

Here are attempts at several explainers, one for each kind of reader:

  1. The headline-only reader
  2. The afternoon-break reader
  3. The I’m-in-bed-so-help-put-me-to-sleep reader

Honestly, Wikipedia does a fantastic job with (3), so I wrote up (1) and (2).

Do you have 1 minute or 5 minutes or 10+ minutes?

March 14, 2012

Fact Checking a Technique

Fact Checking a Lyrical Genius

Felipe Andres Coronel, better known as Immortal Technique, is a popular underground rapper of Afro-Peruvian descent whose lyrics focus on controversial issues such as class imbalance, racial inequality, and institutional oppression. Unfortunately, his lyrics are often dismissed as conspiracy theories and the antics of a wild man. In response, he argues lucidly that his lyrics are simply “the truth”, and that the truth is often seen as revolutionary.

In his own words

I give niggaz the truth, cause they pride is indigent

On March 17, 2012, Immortal Technique will take the stage at Boston’s Paradise Rock Club. As a fan of Mr. Coronel’s lyrical prowess and a budding journalist, I will naturally attend the concert and listen to his thoughts on current issues. I find it valuable to perform a cursory factual assessment of his lyrics. To do this, I picked his verses from the song Young Lords, off his most recent album, The Martyr:

Enjoy the song on Grooveshark!

I survived the cointelpro assassinations.
AIDS epidemic, Crack era, fractured a nation,
The Interpretation of American Democracy,
Is best exemplified in it's foreign policy dichotomy,
I live a double life of political philosophy,
But revolution follows me, the struggle for equality,
Against the morally bankrupt claiming to be born again,
It's a civil war again like MS-13s origin
Ban ethnic studies claiming our culture will swallow them,
But you can't conquer people and build a country on top of them,
And then feel offended that they breathe the same oxygen,
Your family values lack the wisdom of Solomon,
But Operation Condor and Operation Bootstrap are Polisci 101,
Research for the new jack,
It's hard to reach Communist Utopia tomorrow,
When your hands are in a fuckin glass jar like Che Guevara,
Forget the distorted historical facts you were given,
Slave trade was the capital for capitalism,
Trapped in a prison mentally, dying existentially,
Separated from people you can't see yourself to be,
Then racially integrated into a burning house colony of an empire,
Economically burning out,
Can't win a debate so they sponsor every threat to me,
I wonder if agent 800 is standing next to me!

Let’s go through the first half.

COINTELPRO assassinations: COINTELPRO was a series of covert, illegal, and since-declassified FBI projects aimed at weakening domestic political organizations, such as the KKK and the Black Panthers. The summary report by the Senate acknowledges that “… the domestic activities of the intelligence community at times violated specific statutory prohibitions and infringed the constitutional rights of American citizens.” However, the projects were active between 1956 and 1971, and are unlikely to have directly affected Felipe (born 1978).

The Interpretation of American Democracy / Is best exemplified in it’s foreign policy dichotomy: The US government’s foreign policy has often been called a dichotomy, for example when the government calls for fewer weapons in the Middle East while supplying tanks to countries in the Gulf. Similarly, while the United States is called the “greatest democracy on the planet”, controversies such as financial institutions’ ties to the Federal Reserve and the 1% call that label into question.

Civil war like MS-13s origin: MS-13 is a gang born in Los Angeles and notorious for its excessive cruelty. It originated as a group to protect Salvadoran immigrants, who had fled civil war in their home country, from the existing, well-established Mexican gangs in the area, even though both sides were immigrant populations living in the same region. Stepping back, much of the news over the past several months has focused on income disparities and the resulting unrest. Comparing current events to a civil war between a militant government and a guerrilla coalition is certainly an overstatement.

Based on an admittedly small sample, the connections that Technique draws between (factually accurate) historical events, current events, and himself are tenuous at best. This is an instance where the individual facts are correct but the contextual information is “pants on fire”.

To perform the fact checking I used a combination of Rap Genius (not a very good source), Wikipedia, and old-fashioned Google searches.

February 28, 2012

Database Import is a Series of Tubes

TL;DR

DBTruck is a (pre-alpha!) tool that automatically imports your data file into your database, so you only need to worry about running queries, not about importing data or designing a schema. Here’s an example session to import FEC’s presidential candidate donations data into a new PostgreSQL table named “contrib” in the database “election”:

git clone git://github.com/sirrice/dbtruck.git

cd dbtruck/

wget ftp://ftp.fec.gov/FEC/Presidential_Map/2012/P00000001/P00000001-ALL.zip

unzip P00000001-ALL.zip

python dbtruck.py P00000001-ALL.txt contrib election

That’s it!

The long story:

As Ted Stevens famously said,

It’s a series of tubes. And if you don’t understand, those tubes can be filled and if they are filled, when you put your message in, it gets in line and it’s going to be delayed

True. Tubes are long and thin, and they clog easily. When a clog happens, it’s a pain to get things flowing again. The process of importing raw data into a database is like connecting a series of tubes that goes something like this:

  1. Download a raw text file (e.g., from data.gov, FEC, etc)
  2. Stare hard at the data and figure out how to delimit each column
  3. Figure out each column’s type
  4. Come up with a name for each column (100 columns?  too bad!)
  5. Write and run a CREATE TABLE statement
  6. Clean the data by removing invalid values, corrupted rows, etc
  7. Reformat the data into the proper CSV-like format that your database expects (e.g., escaping commas)
  8. Use your database’s bulk loading command
  9. Whoops, one of your rows was corrupted.  Everything resets. Go back to step 6.
  10. Run a query
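
Steps 2 through 5 are the most mechanical of these, and can be roughed out in a few lines of Python. This is only a sketch, not DBTruck's actual code: the function names are made up for illustration, it assumes the first line is a header that supplies the column names, and it ignores quoted fields entirely.

```python
def guess_delimiter(lines, candidates=(",", "\t", "|", ";")):
    """Pick the candidate that appears the same non-zero number of times on every line."""
    best, best_count = None, 0
    for cand in candidates:
        counts = {line.count(cand) for line in lines}
        if len(counts) == 1:  # same count on every line
            n = counts.pop()
            if n > best_count:
                best, best_count = cand, n
    if best is None:
        raise ValueError("could not guess a delimiter")
    return best


def infer_type(values):
    """Return the narrowest SQL type that every value in the column parses as."""
    for sql_type, cast in (("integer", int), ("double precision", float)):
        try:
            for value in values:
                cast(value)
            return sql_type
        except ValueError:
            continue
    return "text"


def create_table_sql(table, lines):
    """Build a CREATE TABLE statement, assuming lines[0] is a header row."""
    delim = guess_delimiter(lines)
    rows = [line.split(delim) for line in lines]
    header, body = rows[0], rows[1:]
    columns = zip(*body)  # transpose rows into columns
    defs = ", ".join(
        f"{name} {infer_type(col)}" for name, col in zip(header, columns)
    )
    return f"CREATE TABLE {table} ({defs});"


if __name__ == "__main__":
    sample = ["name|amount|n", "alice|12.5|3", "bob|7|4"]
    print(create_table_sql("contrib", sample))
```

Even this toy version shows where the tubes clog: one stray delimiter inside a field, or one "N/A" in a numeric column, changes the answer for the whole file.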

There are 9 steps where things can screw up (and they will), but there is a reason for this madness. Modern databases are designed for two major markets: banks and business intelligence. In both of these markets, the businesses are long-running and willing to spend days (or weeks! or months!) figuring out the best way to load, store, and curate their data. In fact, there is a whole industry around getting huge datasets into databases. It’s called ETL (extract, transform, load).

The crux of the issue is that data analysis tools make a trade-off between “pain in the butt to analyze” and “pain in the butt to set up”. Tools like grep don’t require any setup, but are limited to “looking for strings”. At the other extreme, databases do step 10 really, really well: SQL is an incredibly powerful language, and what would take a hundred lines of scripting can be done with a single SQL statement. The cost of this power is the data import phase, and that scares a lot of people away.
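
To make that concrete, here is a small sketch of the kind of question a single statement answers. It uses Python's built-in sqlite3 module as a stand-in database, and the table and column names (contrib, candidate, amount) are invented for the example rather than taken from the FEC file.

```python
import sqlite3


def totals_by_candidate(rows):
    """Load (candidate, amount) pairs and summarize them with one SQL statement."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contrib (candidate TEXT, amount REAL)")
    conn.executemany("INSERT INTO contrib VALUES (?, ?)", rows)
    # Grouping, counting, summing, and sorting all happen in this one query.
    query = """
        SELECT candidate, COUNT(*), SUM(amount)
        FROM contrib
        GROUP BY candidate
        ORDER BY SUM(amount) DESC
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    donations = [("A", 100.0), ("B", 50.0), ("A", 25.0)]
    for candidate, count, total in totals_by_candidate(donations):
        print(candidate, count, total)
```

Notice that most of the function is setup (creating the table and inserting rows), which is exactly the trade-off described above.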

In the end, what is a poor data analyst going to do? Picture the analyst who gets her hands on a modest dataset (from a night of web scraping, or downloaded from the web) and just wants to “take a look”. She will analyze the data for one or two sessions, compute some histograms and look for some trends, and really only cares about a handful of columns. She doesn’t care about a proper schema, and is perfectly happy to throw out 5% of the bad data.

The burning question is:

“Is there a way to go straight to step 10?”

DBTruck is a pre-alpha tool designed to automatically do steps 1-9 for you. Give it a text data file and it’ll do everything imaginable to get your data into a database. It’ll figure out how to split your data, infer the data types, throw out rows that cause the load to fail, and retry. The current requirement is that the data file contains one database row per line.
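
One way the “throw out bad rows and retry” step could work is bisection: try to load a batch in one transaction, and if it fails, split it in half and retry each half until the offending rows are isolated and dropped. The sketch below illustrates that idea only. It is not necessarily how DBTruck does it, it uses Python's sqlite3 module as a stand-in for PostgreSQL, and the load_rows helper and two-column table are hypothetical.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert rows, bisecting on failure; return how many rows were loaded."""
    if not rows:
        return 0
    try:
        # "with conn" commits on success and rolls back if the batch fails,
        # so a failed batch leaves nothing half-inserted.
        with conn:
            conn.executemany("INSERT INTO contrib VALUES (?, ?)", rows)
        return len(rows)
    except sqlite3.Error:
        if len(rows) == 1:
            return 0  # a single bad row: throw it out
        mid = len(rows) // 2
        return load_rows(conn, rows[:mid]) + load_rows(conn, rows[mid:])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE contrib (name TEXT NOT NULL, amount REAL)")
    data = [
        ("a", 1.0),
        ("b",),       # wrong number of fields
        ("c", 2.0),
        (None, 3.0),  # violates NOT NULL
        ("d", 4.0),
    ]
    print(load_rows(conn, data), "of", len(data), "rows loaded")
```

When bad rows are rare, most batches go through on the first try, so the cost of retrying stays small compared with inserting one row at a time.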

Right now, DBTruck works for PostgreSQL and expects a bunch of command line options to tell it which database to load into, whether you want to append to an existing table or create a new one, etc. In the future, you shouldn’t even need to figure the command line options out: it will interactively ask you when it’s confused.

I’m really interested in what parts of this are useful and what breaks, so let me know!

In the end, Ted Stevens wasn’t completely wrong:

And again, … [DBTruck] is … something that you just dump something on. It’s … a big truck.  It’s [not] a series of tubes.

Some similar projects

  • Mike Cafarella and Cloudera’s Record Breaker is a great tool for taking a structured text file and inferring column names and column types.  I plan to integrate something like this in the future.
  • Google Refine and Data Wrangler are web-based tools that help you clean up messy data, and transform it into something ready to be loaded into a database.  

December 12, 2011

Seasonal Robots

all 4 seasons

December 8, 2011

Oh no. It’s Christmas season

That means buying presents… or making them! Here’s 1 of 4!
