Amazon Redshift And You
In early 2013, Amazon publicly released its new hosted data warehousing product, to much fanfare from BI analysts and executives alike. Over the ensuing years, many prominent companies (Amazon included) moved large swathes of their data warehouse stacks over to Amazon Redshift. Should you follow suit?
Cutting Edge Technology
Prior to Amazon Redshift, building a truly scalable data warehouse usually meant investing in distributed map-reduce solutions (like Hadoop and Mongo) or column-based systems (like Teradata and Vertica). While these technologies gave you powerful tools for turning your data into actionable intelligence, they came at a cost.
There’s no two ways about it – getting off the ground with these solutions was a time-consuming and expensive endeavor, even for seasoned veterans of the BI industry. If you weren’t investing capital in subject matter experts (and it does take a couple of them) to do the work in-house, you were probably paying an arm and a leg for third-party consultants to implement the product in your environment. When all was said and done, you would have absolutely cutting-edge technology at your fingertips!
As any BI executive will tell you – this was only half the battle. Now that you had the ability to analyze your data, you actually had to, well… analyze it. This was a lot easier said than done. With these technologies, you were often coding and tuning queries in very complex languages to get answers to even simple questions. Again – more time and more money.
Until 2013, this was (actually) your best bet, and it was what most successful BI teams worked towards. Once they had exhausted their patience with SQL-based technologies that didn’t scale, there was simply no other way for them to analyze incredibly large data sets efficiently.
The Shift to Red
Amazon, in rather prescient fashion, realized they had the power to flip the script entirely. Current solutions were just too complex (from an end-user perspective), too time-consuming (from an engineering perspective), and too expensive (from an executive perspective) for an industry that thrived on reducing complexity, increasing available time, and saving money in the long run. Something had to change.
First and foremost (from this BI analyst’s perspective) among Redshift’s advantages is its ability to be queried using the same basic syntax that underlies all SQL-based languages. You know…
SELECT (some stuff)
FROM (some place)
WHERE (some things are true)
The bread and butter of many BI analysts. Unfortunately, prior to Redshift, staying in this language also meant housing your data in a traditional row-based SQL warehouse. As lots of people will tell you, when you start venturing into the world of tables with 10-million-plus records and complex joins, these systems start to crawl – and that’s if they were set up correctly. In many cases, you might find yourself waiting minutes for a query that used to take seconds (back when you had a tenth of the data). In some cases, you might wait an hour for a query that used to take minutes. If you see an analyst staring blankly at their laptop screen, hands on the sides of their head, with a look of utter frustration (and confusion) – this might be why!
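To make that concrete, here is the kind of query an analyst might run on day one against a Redshift cluster. This is an illustrative sketch – the events table and its columns are hypothetical, not from any particular schema – but every function in it (DATE_TRUNC, DATEADD, GETDATE) is standard Redshift SQL:

```sql
-- Hypothetical example: count daily signups over the last 30 days.
-- Table and column names are illustrative placeholders.
SELECT DATE_TRUNC('day', created_at) AS signup_day,
       COUNT(*) AS signups
FROM events
WHERE event_type = 'signup'
  AND created_at >= DATEADD(day, -30, GETDATE())
GROUP BY 1
ORDER BY 1;
```

No new language to learn – if your analysts can write this against MySQL or Postgres today, they can write it against Redshift tomorrow.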
I’ve been there. It’s no fun at all.
All companies in the “hockey-stick” phase of growth eventually hit this point. The data is simply growing faster than the system can handle in any reasonable amount of time. While frustrating, it is generally a good sign – you have lots of data to make use of, and now it’s time to join the big leagues.
Switching over to a scalable solution usually meant learning the complex query language that came with it. Redshift’s ability to be queried in the same language as the system you are likely growing out of should put it high on the short list of technologies you are eyeing for the move.
Amazon knew that their product shouldn’t just be easy to query; it should be easy to actually set up. As we discussed in the first part of this post, getting going with a solution like Hadoop or a proprietary column-based system was usually a heroic undertaking. The gains would be there in the end, but only if you had the time, money, and knowledge (or dare I say – gumption) to stay the course. What if it were easier than that?
How long does it take to start your Redshift cluster? Minutes. You just press go. This might sound like an exaggeration, but anyone who has been tasked with setting up a Redshift data warehouse will happily tell you, while unsuccessfully trying to hide the smirk on their face, that this is truly the case.
To be clear – this doesn’t mean you flip a switch and your data is in there, optimized and ready to spew actionable intelligence from the query browser. But it does remove a major pain point that many a BI engineer experiences – the management and hosting of the analytical system.
Before you enjoy your home (the analytical system), you’re going to have to purchase the lot (machines/servers) and build your house (software, patches, drivers, etc). Sometimes the lot was cleared and ready for development (hosted machines/servers), and you built the house yourself or paid someone to do it. Sometimes you cleared the lot yourself (purchased machines/servers yourself) and did all the heavy lifting yourself. Both solutions had their benefits and drawbacks, usually striking a balance between setup time and customizability.
But the story was the same – you were spending gobs of time and money (I’m sensing a theme here!) before you could even place a record into your database. Amazon Redshift removes this phase entirely – you can be up and running within minutes. No setup, no servers to manage, no one pulling their hair out. The engineers thank you ahead of time.
To come back to our metaphor – with Redshift, you’re deciding to buy a pre-fab home instead of building it yourself. But you don’t want an empty house. You need to fill your home with your furniture (data), and you need your furniture in the right place. With other analytical solutions, you might find yourself sawing the legs off the armoire to get it into the bedroom (changing the format of your data so that your analytical system accepts it). In some cases, you might rather blow out the door frame because the armoire is a family heirloom (changing the structure of the analytical system so that it accepts your format). But who’s kidding whom – no one wants to have to make that decision.
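In practice, moving the furniture in is done with Redshift’s COPY command, which bulk-loads flat files straight from S3. The sketch below uses a hypothetical bucket, IAM role, and table (the account ID is the standard AWS documentation placeholder); the COPY syntax and the CSV/GZIP options are real Redshift features:

```sql
-- Hypothetical bulk load from S3; bucket, prefix, role ARN, and table
-- name are illustrative placeholders.
COPY events
FROM 's3://my-company-data/events/2015/10/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/MyRedshiftRole'
CSV
GZIP;
```

COPY loads in parallel across the nodes of your cluster, so the door frame stays intact – you point it at your files and it does the heavy lifting.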
We get it… You’re the VP of Analytics. The company is firing on all cylinders. You just raised a new round of funding. Everything is apple pie. But you know the truth.
That MySQL server you’ve been chugging along on for 3 years just isn’t cutting it. You know it, the analysts (definitely) know it, and the rest of the business is starting to notice it. Emails are going out late, queries are taking forever, you’re acting on old data. You’re going to make the push for a new system, and you already know the concerns that your superiors are going to have…
Will the new system fix our current issues? Will we still be able to do everything we’re already doing? And most importantly – How much time? How much money?
You think about your options. You’ve read up on Hadoop as much as possible. So far, your conclusions are (in order):
Hopefully. Probably not. Lots. I have no idea.
Amazon had you in mind when they created Redshift. Let’s change that to:
Yes. Absolutely. Minimal. Here’s the cost sheet.
We know it’s going to be easy to use because the analysts won’t have to learn a new language. We know it’s going to be easy to get off the ground for the engineers because it’s hosted and managed. But what about the (huge) chunk in between:
What about the nitty gritty of actually warehousing the data?
Will it be worth it?
Tune in to our next blog post where we will discuss Redshift on a more technical level, and how utilizing this technology will help you drive actionable intelligence within your organization.
– See more at: http://das42.com/2015/10/amazon-redshift-and-you/