How I mashed-up Google and NOAA’s massive computing infrastructure to answer a Big Data question in 3 hours with no tech skills. (It cost me $1.39.)
Democratization of Big Data
2.5 exabytes – the equivalent of one trillion digital photos – is captured each day by computer systems across the globe. This number is growing exponentially with proliferation of mobile devices, cameras, and other information sensors at the edge of the network. This presents both an enormous challenge and a game-changing opportunity. The challenge is how to deal with data at a scale that is almost inconceivable. The opportunity is this: If we can tame the data, it will usher in the next generation of technological breakthroughs. It’s already happened in fields as far-flung as genomics, banking, and national security and it’s still early days.
Big Data refers to datasets so large that they exceed the capabilities of traditional database systems. By today’s standards that means a few dozen terabytes. Datasets this large require big tech infrastructure – massively parallel software running on hundreds, thousands, or in some cases millions of servers. Well-funded, tech savvy companies are investing heavily in this technology, but what about startups and other small businesses that can’t afford the computing infrastructure required to deal with their own ever-expanding data assets? Small businesses – or even individuals – might want to use big data to out-fox larger competitors or simply make better informed decisions.
It’s no secret that cloud computing has removed barriers to access and driven the cost of computing power down dramatically. But just how far? With this question in mind, I set out to conduct a small experiment to determine how quickly an individual with limited tech skills could spin up a big (or at least big-ish) data infrastructure and use it to solve a simple but data-intensive problem.
Which US cities have the most days of perfect weather?
It’s a question I ask myself each September, after enduring five months of relentless heat and humidity in Texas. As an outdoor enthusiast, weather plays a major role in my psyche. Too often I feel like I’m limited to just a few short, precious windows of great weather. Maybe it’s time to relocate – but where?
There’s no shortage of weather information on the web, but most of it is based on averages – and averages often don’t tell the whole story. I was more interested in a new metric: Perfect Weather Days. Put simply, how many days each year is the weather ideal for outdoor activities? The definition of a Perfect Weather Day is, of course, highly subjective. My personal definition of Perfect Weather had four simple criteria:
No precipitation (it doesn’t necessarily have to be sunny as long as there’s no rain or snow)
A low temperature no lower than 50 degrees (as they say in Texas, I’m “warm blooded”)
A high temperature of at least 65 degrees (after all, it’s not a perfect day if you can’t wear a t-shirt and shorts)
High temperature no higher than 82 degrees
I considered other criteria such as humidity and wind speed but in the end decided to keep it simple. If there’s no rain and good temps then I’m not going to complain.
My first task was to find a source of daily weather data for US cities. It turns out that NOAA makes its enormous database of global weather data available for free. Using an online tool on the NOAA site I submitted a request for a dataset containing the following daily measures for thousands of weather stations across the US for the 8 year period between 2005 and 2012 (I learned on the NOAA site that the quality of data prior to 2005 is lower):
latitude / longitude
It took about a day for NOAA to deliver the dataset, which totaled 46.9 million records. Granted, this may not technically qualify as big data. However, it turns out that I could have just as easily completed the project with a dataset 10 or even 100 times larger.
Just a few years ago the idea of spinning up IT infrastructure to answer one trivial question would be preposterous. But with a few minutes of digging, I discovered BigQuery, a new product from Google that seemed to be purpose-built for my little experiment. In Google’s words:
Querying massive datasets can be time consuming and expensive without the right hardware and infrastructure. Google BigQuery solves this problem by enabling super-fast, SQL-like queries against append-only tables, using the processing power of Google’s infrastructure. Simply move your data into BigQuery and let us handle the hard work.
Score! In less than an hour I was able to upload my NOAA dataset to Google and start using a web-based SQL query tool to play with the data. Keep in mind that I have almost no programming skills – I remembered a bit of SQL from college and spent a few minutes reading the BigQuery documentation. With a little trial and error I came up with a query to rank order weather stations by the annual percentage of Perfect Days. I was blown away by how fast and easy BigQuery is. My queries usually took just 3 or 4 seconds to run and I could move large datasets around just as quickly. I didn’t have to think about database schema, indexing, or optimization at all. It all just worked.
Here’s the query I used (it’s not nearly as complex as it looks).
weatherdata.stations_US_clean.station as station,
weatherdata.stations_US_clean.state as USstate,
weatherdata.stations_US_clean.name as stationname,
weatherdata.stations_US_clean.latitude as lat,
weatherdata.stations_US_clean.longitude as lon,
count(weatherdata.2005_2012.date)/1460 as percentperfectdays,
count(weatherdata.2005_2012.date)/4 as perfectdays,
count(weatherdata.2005_2012.date) as totalperfectdays
JOIN [weatherdata.stations_US_clean] ON weatherdata.stations_US_clean.station = weatherdata.2005_2012.station
AND (weatherdata.2005_2012.tmin*.18 + 32) >= 50
AND (weatherdata.2005_2012.tmax*.18 + 32) >= 65
AND (weatherdata.2005_2012.tmax*.18 + 32) <= 82
group by station, stationname, USstate, lat, lon
order by percentperfectdays DESC
With BigQuery I was able to crunch 47 million records down to a summary of Perfect Weather Days for 7,102 weather stations across the country. This dataset was small enough to move into Excel for more analysis. It’s worth noting that the weather data is for weather stations, which don’t necessarily correlate to cities (many weather stations are located on mountain tops, in national or state parks, or other remote areas). So I had to manually map top-ranking weather stations to cities using latitude/longitude coordinates and Google Maps.
As I sifted through the results I noticed a couple of things. First, it was clear that California dominated the top of the list. No surprise there – California weather is legendary. Second, I noticed that many cities that ranked high in number of Perfect Days are cities that I know to have extreme weather. In other words, they may have a large number of Perfect Days each year, but they also have a lot of bad weather, making it impractical (or at least uncomfortable) to be outside. For example, my home town, Fort Worth, might have 40-50 Perfect Days a year (not too bad) but we also get months on end of daily highs near 100 degrees. Conversely, other cities might have mild summers but brutally cold winters. So I concluded that I needed to also consider the number of Bad Weather Days for each city. I defined a Bad Weather Day as a day that meets any of these criteria:
Half an inch or more of rain
Low temperature above 85 degrees
Low temperature below 15 degrees
High temperature above 95 degrees
High temperature below 32 degrees
(Again, this is highly subjective. If you’re into snow skiing, a day with clear skies and a high temp below 32 might be perfect.)
I then combined Perfect Days and Bad Weather Days into a score using this simple formula:
Weather Score = Perfect Days – ½ * Bad Days
My logic was simple (and yet again, subjective): It takes two Bad Weather Days to erase the joy of a Perfect Day. Now I could rank order each weather station based on my Weather Score. Back to the results…
At the top of the list is Kula on the Hawaiian island of Maui, a true paradise where 74% of all days are Perfect. While Hawaii claims the top spot, California dominates the top of the list with San Diego and LA running neck and neck for the highest scores in the continental U.S. Those cities average over 200 Perfect Days each year with less than 10 Bad Weather Days. In contrast, Fort Worth gets 47 Perfect Days but 86 Bad Days (about 2 Bad Days for every Perfect Day!).
Here’s a list of the top 25 cities by Weather Score:
And on a map…
Every city in the top 25 is coastal. I probably should have included humidity or heat index in my definition of perfect days since I know first hand that 80 degrees in Key West can be far more uncomfortable than 85 in Phoenix or Boulder.
I also ranked the 50 largest US cities by weather score. It’s interesting how sharply the number of Perfect Days drops off outside of California.
The bottom of the list was dominated by Alaska. No surprise given that my definition of a Perfect Day involves fairly warm temps and no precipitation. In dead last is Deadhorse, Alaska with only 2 Perfect Days and 235 Bad Weather Days each year (I’m guessing that the town’s name was inspired by its brutal winters). In the continental U.S., the lowest score goes to Lajitas, TX in Big Bend country, with just 5 Perfect Days and 154 Bad Days. (Interestingly, just 30 miles away in Big Bend National Park is the Chisos Basin which ranks near the top of the list.)
For those of us that prefer mountain air, I took a quick look at locations above 4,000 feet of elevation. Many of these locations (which were mostly in California) didn’t correspond to cities, so there was a lot of noise in the data. For example, the highest ranking location above 4,000 feet is the aforementioned Chisos Basin in Big Bend National Park with 83 perfect days and 16 bad. Yosemite Valley (my favorite place on Earth) also ranked high. The highest ranking cities at altitude include: Carson City (NV), Alamogordo (NM), Cortez (NM), Sierra Vista (AZ), Silver City (NM), and Provo (UT). I was surprised that one of my favorite mountain towns, Boulder, CO, didn’t rank very high with only 14 perfect and 48 bad days. It’s clear that mountain towns get penalized by my definitions of Perfect and Bad, which favor warmer temperatures.
When all was said and done, this experiment confirms the obvious - for great weather California wins by a landslide. But it also provides a quantitative measure to compare other cities. My home town of Fort Worth scored a 4 compared to a 203 in LA. Ouch.
But the most interesting part of this project was the project itself. With about 3 total hours of late night work I was able to sift through an impressive amount of data to answer my question. And because BigQuery runs on top of Google’s massively parallel infrastructure, I could have completed the project in about the same amount of time if my dataset had been billions of records instead of tens of millions of records.
What about the cost? Here’s my statement from Google totaling $1.39.
That’s $0.19 for storage, $1.13 for BigQuery compute cycles, and $0.07 in tax.
Clearly democratization of Big Data is happening, and that means small businesses and even individuals like me can take advantage of massive computing power for almost no cost and with minimal technical skills. Imagine millions of businesses and individuals with the power to hack huge amounts of data covering just about every field of human interest. It’s like the Internet all over again – the Internet of data.