Big Data Democratized

How I mashed up Google’s and NOAA’s massive computing infrastructure to answer a Big Data question in 3 hours with no tech skills.  (It cost me $1.39.)


Democratization of Big Data
2.5 exabytes – the equivalent of one trillion digital photos – are captured each day by computer systems across the globe.  This number is growing exponentially with the proliferation of mobile devices, cameras, and other information sensors at the edge of the network.  This presents both an enormous challenge and a game-changing opportunity.  The challenge is how to deal with data at a scale that is almost inconceivable.  The opportunity is this:  If we can tame the data, it will usher in the next generation of technological breakthroughs.  It’s already happened in fields as far-flung as genomics, banking, and national security – and it’s still early days.

Big Data refers to datasets so large that they exceed the capabilities of traditional database systems.  By today’s standards that means datasets of a few dozen terabytes or more.  Datasets this large require big tech infrastructure – massively parallel software running on hundreds, thousands, or in some cases millions of servers.  Well-funded, tech-savvy companies are investing heavily in this technology, but what about startups and other small businesses that can’t afford the computing infrastructure required to deal with their own ever-expanding data assets?  Small businesses – or even individuals – might want to use big data to outfox larger competitors or simply make better-informed decisions.

It’s no secret that cloud computing has removed barriers to access and driven the cost of computing power down dramatically.  But just how far?  With this question in mind, I set out to conduct a small experiment to determine how quickly an individual with limited tech skills could spin up a big (or at least big-ish) data infrastructure and use it to solve a simple but data-intensive problem.

The Experiment

Which US cities have the most days of perfect weather?

It’s a question I ask myself each September, after enduring five months of relentless heat and humidity in Texas.  As an outdoor enthusiast, weather plays a major role in my psyche.  Too often I feel like I’m limited to just a few short, precious windows of great weather.  Maybe it’s time to relocate – but where?

There’s no shortage of weather information on the web, but most of it is based on averages – and averages often don’t tell the whole story.  I was more interested in a new metric:  Perfect Weather Days.  Put simply, how many days each year is the weather ideal for outdoor activities?  The definition of a Perfect Weather Day is, of course, highly subjective.  My personal definition of Perfect Weather had four simple criteria:

  1. No precipitation (it doesn’t necessarily have to be sunny as long as there’s no rain or snow)

  2. A low temperature no lower than 50 degrees (as they say in Texas, I’m “warm blooded”)

  3. A high temperature of at least 65 degrees (after all, it’s not a perfect day if you can’t wear a t-shirt and shorts)

  4. High temperature no higher than 82 degrees

I considered other criteria such as humidity and wind speed but in the end decided to keep it simple.  If there’s no rain and good temps then I’m not going to complain.
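For concreteness, the four criteria are easy to express as a single predicate.  Here’s a minimal sketch in Python (the function name and units are mine; I’m assuming precipitation in inches and temperatures already converted to degrees Fahrenheit):

```python
def is_perfect_day(precip_inches, low_f, high_f):
    """True if a day meets all four Perfect Weather criteria."""
    return (precip_inches == 0   # 1. no precipitation
            and low_f >= 50      # 2. low no lower than 50
            and high_f >= 65     # 3. high of at least 65
            and high_f <= 82)    # 4. high no higher than 82
```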

My first task was to find a source of daily weather data for US cities.  It turns out that NOAA makes its enormous database of global weather data available for free.  Using an online tool on the NOAA site I submitted a request for a dataset containing the following daily measures for thousands of weather stations across the US for the 8-year period between 2005 and 2012 (I learned on the NOAA site that the quality of data prior to 2005 is lower):

  • station ID

  • latitude / longitude

  • date

  • precipitation

  • low temperature

  • high temperature

It took about a day for NOAA to deliver the dataset, which totaled 46.9 million records.   Granted, this may not technically qualify as big data.  However,  it turns out that I could have just as easily completed the project with a dataset 10 or even 100 times larger.

The Technology
Just a few years ago the idea of spinning up IT infrastructure to answer one trivial question would have been preposterous.  But with a few minutes of digging, I discovered BigQuery, a new product from Google that seemed to be purpose-built for my little experiment.  In Google’s words:

Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. Simply move your data into BigQuery and let us handle the hard work.

Score!  In less than an hour I was able to upload my NOAA dataset to Google and start using a web-based SQL query tool to play with the data.  Keep in mind that I have almost no programming skills – I remembered a bit of SQL from college and spent a few minutes reading the BigQuery documentation.  With a little trial and error I came up with a query to rank order weather stations by the annual percentage of Perfect Days.  I was blown away by how fast and easy BigQuery is.  My queries usually took just 3 or 4 seconds to run and I could move large datasets around just as quickly. I didn’t have to think about database schema, indexing, or optimization at all.  It all just worked.

Here’s the query I used (it’s not nearly as complex as it looks).

SELECT
weatherdata.stations_US_clean.station as station,
weatherdata.stations_US_clean.state as USstate,
weatherdata.stations_US_clean.name as stationname,
weatherdata.stations_US_clean.latitude as lat,
weatherdata.stations_US_clean.longitude as lon,
count(weatherdata.2005_2012.date)/(8*365)*100 as percentperfectdays,
count(weatherdata.2005_2012.date)/8 as perfectdays,
count(weatherdata.2005_2012.date) as totalperfectdays
FROM [weatherdata.2005_2012]
JOIN [weatherdata.stations_US_clean] ON weatherdata.stations_US_clean.station = weatherdata.2005_2012.station
WHERE weatherdata.2005_2012.prcp = 0
AND (weatherdata.2005_2012.tmin*.18 + 32) >= 50
AND (weatherdata.2005_2012.tmax*.18 + 32) >= 65
AND (weatherdata.2005_2012.tmax*.18 + 32) <= 82
group by station, stationname, USstate, lat, lon
order by percentperfectdays DESC
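The odd-looking tmin*.18 + 32 expressions convert NOAA’s raw values to Fahrenheit: the dataset stores temperatures in tenths of a degree Celsius, so °F = tenths × 0.18 + 32.  A quick sanity check in Python (the function name is mine):

```python
def ghcn_tenths_to_f(tenths_celsius):
    """Convert a temperature in tenths of a degree Celsius to Fahrenheit."""
    return tenths_celsius * 0.18 + 32

# e.g. 250 tenths = 25.0 C, which is about 77 F
```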

The Results
With BigQuery I was able to crunch 47 million records down to a summary of Perfect Weather Days for 7,102 weather stations across the country.  This dataset was small enough to move into Excel for more analysis.  It’s worth noting that the weather data is for weather stations, which don’t necessarily correlate to cities (many weather stations are located on mountain tops, in national or state parks, or other remote areas).  So I had to manually map top-ranking weather stations to cities using latitude/longitude coordinates and Google Maps.
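The manual mapping step could be scripted, too: given a city’s coordinates, find the nearest weather station by great-circle distance.  A sketch in Python using the haversine formula (the helper names and sample tuples are mine, not part of the original project):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3956 * asin(sqrt(a))  # 3956 = Earth's radius in miles

def nearest_station(city_lat, city_lon, stations):
    """stations: iterable of (station_id, lat, lon) tuples."""
    return min(stations,
               key=lambda s: haversine_miles(city_lat, city_lon, s[1], s[2]))
```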

As I sifted through the results I noticed a couple of things.  First, it was clear that California dominated the top of the list.  No surprise there – California weather is legendary.  Second, I noticed that many cities that ranked high in number of Perfect Days are cities that I know to have extreme weather.  In other words, they may have a large number of Perfect Days each year, but they also have a lot of bad weather, making it impractical (or at least uncomfortable) to be outside.  For example, my home town, Fort Worth, might have 40-50 Perfect Days a year (not too bad) but we also get months on end of daily highs near 100 degrees.  Conversely, other cities might have mild summers but brutally cold winters.  So I concluded that I needed to also consider the number of Bad Weather Days for each city.   I defined a Bad Weather Day as a day that meets any of these criteria:

  1. Half an inch or more of rain

  2. Low temperature above 85 degrees

  3. Low temperature below 15 degrees

  4. High temperature above 95 degrees

  5. High temperature below 32 degrees

(Again, this is highly subjective.  If you’re into snow skiing, a day with clear skies and a high temp below 32 might be perfect.)

I then combined Perfect Days and Bad Weather Days into a score using this simple formula:

Weather Score = Perfect Days – ½ * Bad Days

My logic was simple (and yet again, subjective):  It takes two Bad Weather Days to erase the joy of a Perfect Day.  Now I could rank order each weather station based on my Weather Score.  Back to the results…
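For concreteness, here’s the scoring formula in Python, checked against Fort Worth’s counts (47 Perfect Days and 86 Bad Days):

```python
def weather_score(perfect_days, bad_days):
    """Weather Score = Perfect Days - 1/2 * Bad Days."""
    return perfect_days - 0.5 * bad_days

# Fort Worth: weather_score(47, 86) -> 4.0
```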

At the top of the list is Kula on the Hawaiian island of Maui, a true paradise where 74% of all days are Perfect.  While Hawaii claims the top spot, California dominates the top of the list with San Diego and LA running neck and neck for the highest scores in the continental U.S.  Those cities average over 200 Perfect Days each year with less than 10 Bad Weather Days.  In contrast, Fort Worth gets 47 Perfect Days but 86 Bad Days (about 2 Bad Days for every Perfect Day!).

Here’s a list of the top 25 cities by Weather Score:


And on a map…


Every city in the top 25 is coastal.  I probably should have included humidity or heat index in my definition of perfect days since I know firsthand that 80 degrees in Key West can be far more uncomfortable than 85 in Phoenix or Boulder.

I also ranked the 50 largest US cities by weather score.  It’s interesting how sharply the number of Perfect Days drops off outside of California.


The bottom of the list was dominated by Alaska.  No surprise given that my definition of a Perfect Day involves fairly warm temps and no precipitation.  In dead last is Deadhorse, Alaska with only 2 Perfect Days and 235 Bad Weather Days each year (I’m guessing that the town’s name was inspired by its brutal winters).  In the continental U.S., the lowest score goes to Lajitas, TX in Big Bend country, with just 5 Perfect Days and 154 Bad Days.  (Interestingly, just 30 miles away in Big Bend National Park is the Chisos Basin which ranks near the top of the list.)

For those of us who prefer mountain air, I took a quick look at locations above 4,000 feet of elevation.  Many of these locations (which were mostly in California) didn’t correspond to cities, so there was a lot of noise in the data.  For example, the highest ranking location above 4,000 feet is the aforementioned Chisos Basin in Big Bend National Park with 83 perfect days and 16 bad.  Yosemite Valley (my favorite place on Earth) also ranked high.  The highest ranking cities at altitude include:  Carson City (NV), Alamogordo (NM), Cortez (CO), Sierra Vista (AZ), Silver City (NM), and Provo (UT).  I was surprised that one of my favorite mountain towns, Boulder, CO, didn’t rank very high with only 14 perfect and 48 bad days.  It’s clear that mountain towns get penalized by my definitions of Perfect and Bad, which favor warmer temperatures.

The Bill
When all was said and done, this experiment confirms the obvious  – for great weather California wins by a landslide.  But it also provides a quantitative measure to compare other cities.  My home town of Fort Worth scored a 4 compared to a 203 in LA.  Ouch.

But the most interesting part of this project was the project itself.  With about 3 total hours of late night work I was able to sift through an impressive amount of data to answer my question.  And because BigQuery runs on top of Google’s massively parallel infrastructure, I could have completed the project in about the same amount of time if my dataset had been billions of records instead of tens of millions of records.

What about the cost?  Here’s my statement from Google totaling $1.39.


That’s $0.19 for storage, $1.13 for BigQuery compute cycles, and $0.07 in tax.

Clearly democratization of Big Data is happening, and that means small businesses and even individuals like me can take advantage of massive computing power for almost no cost and with minimal technical skills.  Imagine millions of businesses and individuals with the power to hack huge amounts of data covering just about every field of human interest.  It’s like the Internet all over again – the Internet of data.


I Heart Surveys (a marketing rant)

As a marketer, I’ve always been a huge fan of customer research and more specifically, surveys.  Maybe I need to get a life, because surveys excite me.   They’re an incredibly cheap and fast way to take some of the guesswork out of marketing.  Why not let your customers tell you how to sell to them?

This afternoon I was reviewing some customer research for an upcoming TV campaign and it reminded me of “Vortex”, a marketing program I created 10 years ago at a small software company…

In 2002 I went to work for a venture-backed software startup called Metallect (I know, strange name – I didn’t pick it).  Metallect developed a cool (and geeky) software product called the “IQ Server” that was used by large I.T. organizations.  The IQ Server had bots that would crawl through all of the software code in a company’s systems to catalog it, organize it, and make it searchable.  The IQ Server did for software code what Google does for web pages.

Why is this important?

Because over the past 20+ years, every large company has built and purchased dozens of software and database systems containing millions and millions of lines of software code.  These systems are critical to running the business, but they are poorly documented and tend to break.  Fixing those breaks is expensive and time-consuming.  In fact, most big companies spend over 70% of their I.T. budget maintaining their existing systems.  We’re talking about billions of dollars wasted.

Our business plan was simple:  The IQ Server would help I.T. organizations quickly pinpoint problems and proactively assess risk.  A $100,000 IQ Server could save a company millions of dollars!

The product had great potential but we had one big problem:  Our target customers — CIOs at Fortune 500 companies — didn’t know the IQ Server existed, let alone that they needed it.  We were defining a new product category and our #1 challenge was explaining a new and complex product in a way that was compelling enough that a busy I.T. executive would: A) pay attention and B) write us (a 10-person software company in Texas) a big check!

As a marketer, you’re lucky if you come up with 4 or 5 truly great marketing ideas in your career.  I think the survey-based program I’m about to describe was one of those ideas (here’s another).   Like most good ideas, it was deceptively simple.  I would conduct an online survey of prospective customers, asking them to read my best description of the IQ Server and its key selling points.  Then I would ask three simple questions:

1) On a scale of 1 to 10, how valuable would this product be to your company?
2) Why?
3) Would you be interested in learning more about this product?

Question #1 provided me with a quantitative measure of how compelling my marketing messaging was.  Question #2 provided qualitative insight into the first question.  Real target customers told me in their own words why they did or didn’t find my sales pitch compelling.  And Question #3 was the moment of truth:  Who’s ready to write a check (or at least have a meeting)?

Now all I needed was a bunch of Fortune 500 CIOs to take my survey.  But how?  Everybody wants these folks’ attention.

I solved this problem the old-fashioned way — with bribes (completely legal, of course).  I bought highly targeted mailing lists from CIO Magazine and sent personalized letters offering each person a $100 Amazon gift certificate just for completing a 10-question research survey.  My bet was that the learning I’d gain from the survey would justify the cost.

As you might guess, the response rate to my direct mail invitation was high (over 10%).  But the answers to my survey questions weren’t very encouraging.  The average response to question #1 (How valuable would the IQ Server be to your company?) was around 6.  Potential customers saw some value in the IQ Server but they weren’t jumping up and down demanding it.  That won’t cut it for a start-up that needs to win customers quickly.

Fortunately, question #2 provided a wealth of “verbatim” feedback.  Without knowing it, prospective customers were telling me how to sell the IQ Server to them.  So I adjusted my messaging and ran the same test again (to a different list of customer prospects).

This time the results were a little better.  The average response to question #1 was around 7.  To make a long story short, I repeated this process maybe half a dozen times.  I spent tens of thousands of dollars on Amazon gift certificates, but along the way I was able to hone my sales pitch and get the average response to question #1 up to almost 9.

More importantly, through question #3 I generated over a hundred leads for our sales team.  These leads turned into our first few paying customers, generating hundreds of thousands in revenue.  My customer research had paid for itself – score!  I dubbed this program “Vortex”, a nod to the classic business book Inside the Tornado by Geoffrey Moore.

So what’s the point of this story?

The point is this:  Through the years I’ve used this approach over and over in different industries and it’s worked every time.  It may sound like marketing 101 (and it is) but very few marketers make intelligent use of surveys.

The beauty of surveys lies in their simplicity and that’s where most marketers screw it up (in my humble opinion).  Almost all of the surveys I see are poorly thought out and ask way too many questions (most of which aren’t actionable).

I believe that the quality of survey data is inversely proportional to the number of questions being asked.  So don’t get greedy and ask for too much.  Ten questions max and preferably only five to seven!  But think hard about which five to seven questions to ask and obsess over the wording of each question to make sure it’s crystal clear and will provide the exact insight you need.

One of the things I love about marketing is that it’s a combination of science and art.  The best idea wins, but you can stack the deck in your favor with the right data.  I believe that simply listening to customers is the best way to do that.

How to Fix the Auto Industry

Exhibit A:  The Tesla Model S

There’s a lot of talk these days about how to fix the U.S. auto industry.  It’s a tough problem to solve because the industry is such a big part of our economy — we’re talking about millions of jobs and hundreds of billions of dollars of GDP.  But the big automakers are so bloated and broken it’s hard to see how they get fixed without becoming a lot smaller.

I’m a firm believer that the only sustainable way to grow a business or a whole economy is through innovation and entrepreneurship.  Create new technologies, products, and services that provide real value to consumers (and capture their imagination) and you’ll stimulate spending and create new jobs.  The good news is that this is what America is best at.  It’s why we have the world’s most powerful economy.

The bad news is that the U.S. auto industry is the exception to this rule.  Case in point:  I’m in the market for not one but two cars right now, but there’s not a single vehicle out there that provides the combination of design, utility, and fuel economy that I’m looking for.  I’ve talked to lots of other folks that feel the same way.

Thank goodness for Silicon Valley…

So how do you fix the auto industry?  Allow me to submit Exhibit A, the Model S from Tesla.  It’s an all-electric 4-door sedan that will do 0-60 in 5.6 seconds and has a 300 mile range.  The base price is around $50,000, but the fuel cost will be about one tenth of a comparable gasoline-powered sedan.  For a typical 12,000 mile a year driver this translates into about $800 a year in fuel savings.
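The $800 figure is easy to sanity-check.  The mileage efficiency and gas-price numbers below are my own illustrative assumptions, not Tesla’s:

```python
miles_per_year = 12000
mpg = 25              # assumed efficiency of a comparable gasoline sedan
gas_price = 1.85      # assumed price per gallon (circa 2009)

gas_cost = miles_per_year / mpg * gas_price   # 480 gallons -> ~$888/year
ev_cost = gas_cost / 10                       # "about one tenth" the fuel cost
savings = gas_cost - ev_cost                  # ~$799/year, roughly the $800 claimed
```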

The only problem is that the Model S won’t be available for another year.  But that hasn’t stopped over 500 people (in the first week alone) from paying a $5,000 fee to get on the waiting list.  If Tesla’s claims are accurate, this car will be a hit and their biggest problem will be production capacity — which, ironically, is one thing the big automakers have in spades.

Our country is struggling with lots of big economic issues:  Failing industries, rising unemployment, dependence on foreign oil, a growing national deficit.  I’m not saying that Tesla Motors is the solution for any of these problems, but it’s a metaphor for the type of forward-looking, game-changing moves we should be making to get our economy back on track.  I’d rather invest my tax dollars in defining the future of the auto industry than in salvaging the past.

Why can’t the U.S. become the “Saudi Arabia” of alternative fuel and the “Apple” of automobiles?

The Legend of the Key: A Short History of Farstar


Tonight I’m going to a private concert at Farstar’s new offices in Frisco, so I’ve been reflecting a bit on the history of Farstar.  It’s a long story, but it’s a pretty good one…

Kevin Lofgren and I founded Farstar in the summer of 2002 based on an idea I had while working at Claria.  The concept was to combine direct mail and the Internet to create marketing campaigns that are incredibly intriguing, personalized, and trackable.  We were certain it would work — all we needed was a paying customer.

Somehow, KL managed to get us in the door at Oracle.  So we put together a PowerPoint deck to pitch a product that didn’t exist (we called it “ResponsePlus”) to one of the world’s largest companies.  Abracadabra — they bit.

Oracle:  “We love it.  How much does it cost?”

Kevin & Kevin (looking at each other in dismay): “Uhhhh… fifty thousand dollars?”

Oracle: “Sold.  Who do we make the check out to?”

Kevin & Kevin:  “Uhhhh… let us get back to you on that.”

The folks at Oracle wanted to use ResponsePlus to generate leads for their business software.  Great, all we need to do now is build the actual product (oh, and open a bank account).  We called our friend Shane, who agreed to build the technology for a piece of the action.

To make this first campaign a smashing success, we needed a killer creative concept for the direct mail piece.
