The 7 Big Data Skills Every Consultant Needs to Know
Today’s world is a Big Data world.
Those who can tap the business opportunities hidden in internal company data and freely available online information will be the primary business value creators of the 21st century.
At a successful big data analytics firm, employees come from many backgrounds: product managers, data scientists, software engineers, database architects, and business-focused solutions specialists (aka “Big Data consultants”). In particular, professional services should frame customers’ Big Data challenges as problems that data scientists and data architects can solve with cutting-edge algorithms.
In order to compete more effectively in the Big Data marketplace, consultants should have the following skill sets.
1. Understand Hardware
Do you know how a computer works? If you don’t, you should. In particular, understanding what kinds of software run best on different types of computers is vital to being able to scope the hardware requirements for your Big Data work to succeed. Some considerations: memory vs. disk/storage space (e.g. for analytics on in-memory databases), the number of cores (i.e. processing power), and whether the business problem is best solved on one server (such as a regression) or distributed among multiple servers (such as determining how many tweets on a topic are negative).
If you are running your algorithms on a distributed system where it’s OK for one “node” (computer or server) to fail, commodity hardware can keep your costs low. If your algorithms have to be run on a supercomputer (like weather forecasts, perhaps), that probably means you’ll need expensive specialized hardware. Not surprisingly, the trend these days is towards using more commodity hardware and writing software that can distribute the work across multiple servers, rather than spending a lot of money on an expensive supercomputer that will become obsolete in a few years.
You should know what hardware your analytics work requires because it determines whether to put a given solution in the cloud (such as Rackspace or Amazon’s Elastic Compute Cloud) or run it on hardware bought, set up, and maintained in-house by your IT department.
In order to scope your hardware needs appropriately, when starting a software (or SaaS) project, here are a few preliminary questions to ask your customer:
- Do you want to host the software inside the customer’s firewall, or is hosting on the cloud or elsewhere OK? (i.e. “Your place or mine.”)
- Will the software/SaaS need 95% uptime, 99% uptime, or more, and during what hours? (More uptime is expensive because you need to buy backup hardware in case a server fails.)
- What is the maximum number of users who will be accessing the website at the same time, and in what time zones? (More concurrent users means more hardware.)
- Can the solution be run on a “virtual machine” or does it need to be on its own dedicated hardware? (This varies with the number of concurrent users, uptime and cost requirements. See the Wikipedia article on Virtualization for details.)
- What kind of data security is needed? (This affects network structure and whether you host the software behind the customer’s firewall or on the cloud.)
Scoping hardware requirements is more complex than this, but these questions are helpful for getting started.
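To make the uptime question concrete, here is a minimal Python sketch of the arithmetic behind it. The function names are made up for illustration, and the redundancy calculation rests on a simplifying assumption: servers fail independently, and the service stays up as long as at least one server is running. Real-world failures are often correlated, so treat the numbers as a rough guide.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_pct: float) -> float:
    """Hours per year a service may be down while still meeting its uptime target."""
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)


def combined_uptime_pct(single_server_uptime_pct: float, servers: int) -> float:
    """Uptime of a group of redundant servers.

    Assumption: servers fail independently, and the service is up
    as long as at least one of them is running.
    """
    p_down = 1 - single_server_uptime_pct / 100
    return 100 * (1 - p_down ** servers)


if __name__ == "__main__":
    for target in (95, 99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% uptime allows {hours:.1f} hours of downtime per year")
    print(f"Two redundant 95% servers: {combined_uptime_pct(95, 2):.2f}% combined uptime")
```

Under these assumptions, 95% uptime leaves room for about 438 hours of downtime a year and 99% for about 88, which is why each extra "nine" tends to mean buying more backup hardware.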
2. Understand the Software Development Life Cycle and how Software Works
As Arthur C. Clarke said, “Any sufficiently advanced technology is indistinguishable from magic.” If you showed an iPhone to someone from the 1920s, it might seem like magic to them. But in reality, the iPhone — or any software — starts with a detailed list of requirements. Think of building a skyscraper: you have to have the blueprints before you lay a single steel beam.
Software itself is nothing more than a very specific set of instructions that tells a computer what to do. How would you tell a robot how to get from New York to Los Angeles? Would you start with “Get into a car, and take highway number…?” If so, you’re assuming that the robot knows what a car is, what a highway is, what different highway numbers mean, what your location is, and how to drive. The difference between a “high-level” software language and a “low-level” software language is the amount of detail the human has to work with. In a low-level language, you need to specify more details to the computer, whereas in a high-level language, some of the details come pre-built (e.g. the concept of “car” in the traveling robot example). Low-level languages (such as C) tend to run more quickly because the code is translated into the microprocessor’s 1’s and 0’s ahead of time, whereas higher-level languages (like Python) typically do some of that translation while the program is running, which can take some time. When you’re making a website about your dog, this doesn’t matter much, but if you’re writing a trading algorithm where microseconds can influence how much money you make, the speed at which the software runs does matter.
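Here is a small illustration of "spelling out the details" versus "using what comes pre-built." Both functions below are Python (so neither is truly low-level), and the function names are invented for this example. The first walks through every step by hand, the way a lower-level language would force you to; the second leans on a pre-built tool from Python's standard library.

```python
from statistics import mean


def average_spelled_out(values):
    """Compute an average step by step, keeping track of every detail ourselves.

    Assumes `values` is not empty.
    """
    total = 0.0
    count = 0
    for value in values:
        total += value  # add each number to a running total
        count += 1      # and count how many numbers we have seen
    return total / count


def average_prebuilt(values):
    """Compute the same average using a pre-built function (the 'car' in the robot example)."""
    return mean(values)


if __name__ == "__main__":
    daily_sales = [1, 2, 3, 4]
    print(average_spelled_out(daily_sales))  # 2.5
    print(average_prebuilt(daily_sales))     # 2.5
```

The answers match; the difference is how much of the route the human had to describe.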
The software development life cycle, often abbreviated SDLC, describes the process for writing software. The stages are fairly analogous to building a house:
Defining requirements, writing code (“dev”), testing it (“QA” or Quality Assurance), and shipping it (“prod” or production) are like creating a blueprint, building the house, getting a certificate of habitability, and finally letting someone move in. The dev-QA-prod process is repeated to fix bugs. QA is important because any code in the production environment that hasn’t gone through testing might break, just as a 2x4 that hasn’t been factory-tested under pressure might split and break a house wall.
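As a rough picture of what the "test it" step can look like in practice, here is a sketch using Python's built-in unittest module. The function apply_discount and its rules are hypothetical, invented only to have something to test; a real QA process covers far more than a few automated checks.

```python
import unittest


def apply_discount(price, percent):
    """Hypothetical 'dev' code: return the price after a percentage discount."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    """'QA' checks that run before the code is allowed into production."""

    def test_typical_discount(self):
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_no_discount(self):
        self.assertEqual(apply_discount(200.0, 0), 200.0)

    def test_rejects_impossible_discount(self):
        with self.assertRaises(ValueError):
            apply_discount(200.0, 150)


if __name__ == "__main__":
    unittest.main()
```

If a later code change breaks one of these checks, the problem shows up in testing rather than in front of a customer.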
A few notes on databases — this layer is a whole field in itself, but the basic things to know about your customers’ data that will make your database architects happier are:
- Will the categories of things being stored in the database change fairly often (so a more flexible structure like XML or JSON is better), or will the categories of data be fairly static (so a relational database queried with SQL is better)?
- Will any of your data be empty (have null values)?
- What is the unique identifying piece of information that can link other data together? (For example, a phone number or a Social Security Number can identify a person, but people may have more than one phone number. SSN is a unique identifier, but because of that it is more sensitive and people don’t share it as freely.)
Stanford has a great online class (Intro to Databases) for more info.
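The three database questions above can be seen in miniature in the following Python sketch, which uses the sqlite3 and json modules from the standard library. The table names, column names, and sample customers are all made up for illustration; a real schema would be designed by your database architects.

```python
import json
import sqlite3

# Static categories: a fixed SQL schema held in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- the unique identifier
        name        TEXT NOT NULL,
        email       TEXT                  -- may be empty (NULL)
    )
""")
conn.execute("""
    CREATE TABLE phones (
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
        phone       TEXT NOT NULL
    )
""")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", "ada@example.com"), (2, "Grace", None)],
)
# One person, two phone numbers: the phone number cannot be the unique key.
conn.executemany(
    "INSERT INTO phones VALUES (?, ?)",
    [(1, "555-0100"), (1, "555-0101")],
)


def customers_missing_email(connection):
    """Names of customers whose email is empty (NULL)."""
    rows = connection.execute(
        "SELECT name FROM customers WHERE email IS NULL ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def phones_for(connection, customer_id):
    """All phone numbers linked to one customer through the unique identifier."""
    rows = connection.execute(
        "SELECT phone FROM phones WHERE customer_id = ? ORDER BY phone",
        (customer_id,),
    ).fetchall()
    return [phone for (phone,) in rows]


# Changing categories: a JSON record can gain a new field without a schema change.
flexible_record = json.loads(
    '{"customer_id": 3, "name": "Linus", "loyalty_tier": "gold"}'
)

if __name__ == "__main__":
    print(customers_missing_email(conn))    # ['Grace']
    print(phones_for(conn, 1))              # ['555-0100', '555-0101']
    print(flexible_record["loyalty_tier"])  # gold
```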
Understanding software architecture is like being able to read a blueprint. Some jargon:
Server-side means software that is designed to run on a more powerful computer (usually a server), which is typically always on and always connected to the Internet. Server-side software tends to be called the “backend” since it’s not usually something an end user (like an executive) sees.
Client-side means software that runs on the end user’s own device, which is usually less powerful, like your laptop. If you wanted to host a website from your laptop, you could, but if you turned off your computer to go out for dinner, the website would cease to exist online because the host computer would be offline, not serving up webpages. Client-side software tends to be called the “front end” because it faces the customer, such as a software-as-a-service offering accessed through a browser like Chrome, Firefox, or Internet Explorer. (Related jargon: a “thin client” leaves most of the work to the server, while a “fat client” does most of the work on the user’s own machine.)
Like the walls and furniture of a house, there are different parts to software architecture. The most commonly used is a 3-tier architecture consisting of one or more databases, application servers, and web servers. For example, consider a car rental company. The database stores the information about the cars that can be rented, plus all the customers’ names, email addresses, contact info and credit card numbers. The application layer does various operations on the underlying data, such as calculating how much a customer owes on a week-long rental for a luxury car. The web servers take the application and make that information accessible in a password-protected way, online all the time, to customers. It’s a best practice to have more than one server for each layer, so that if one of the servers crashes (Windows users, you know what I mean!), people can still reserve cars through your website and not get an error message.
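The car rental example can be sketched in a few lines of Python to show how the three tiers divide the work. Everything here is a stand-in: the dictionary plays the role of the database, the daily rates are invented, and handle_request only imitates what a web server does (there is no real network, login, or credit card handling).

```python
# Data tier: a dictionary standing in for the database of rentable cars.
CARS = {
    "luxury": {"daily_rate": 120.0},   # made-up rates for illustration
    "economy": {"daily_rate": 35.0},
}


def get_daily_rate(car_class):
    """Data tier: look up the stored daily rate for a class of car."""
    return CARS[car_class]["daily_rate"]


def rental_cost(car_class, days):
    """Application tier: business logic that operates on the underlying data."""
    if days <= 0:
        raise ValueError("days must be positive")
    return get_daily_rate(car_class) * days


def handle_request(params):
    """Web tier: turn a customer's request into a response they can read."""
    try:
        cost = rental_cost(params["car_class"], int(params["days"]))
    except (KeyError, ValueError):
        return {"status": 400, "body": "Unknown car class or invalid number of days"}
    return {"status": 200, "body": f"Total due: ${cost:.2f}"}


if __name__ == "__main__":
    # A week-long rental of a luxury car.
    print(handle_request({"car_class": "luxury", "days": "7"}))
```

Because each tier only talks to the one below it, any tier can be moved to its own server (or several servers) without rewriting the others.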
3. Learn to Code
Just as it’s very useful to learn to speak a language before going to another country, it’s very valuable to know how to code if you are a consultant working in Big Data. Being able to introduce yourself in another language, ask directions, and order from a restaurant puts you significantly ahead of tourists who only speak English. Similarly, even if you don’t know how to write an entire iPhone app or machine learning algorithm, just taking a few introductory courses in Ruby (for example) will put you well ahead of the consultant who only knows PPT and Excel. If you have the time to become fluent in one software language (e.g. Python rather than just Excel), you will understand the nuances of the environment and the software implications of business decisions far better.
That said, PPT and Excel are the bread and butter of traditional consultants. You crunch some numbers and put your findings in a “deck” that executives look at and take action on. However, there is a big difference between running some numbers in Excel and putting the findings in PPT, on the one hand, and building a full SaaS offering on the other, such as Python scripts crunching data in a Hadoop environment with a MySQL database and an Apache Tomcat web server pushing recommendations to a user’s Internet browser.
As anyone who has spent hours perfecting every last detail of a PPT knows, PPT as a presentation layer is highly manual and not scalable. Although Excel macros can automate some processes, even VBA is limited in its ability to crunch large data sets. (Anyone who has ever dealt with a 10MB Excel file knows it can take agonizing minutes to open or save.) When you can crunch data on a server or cluster of servers (which can handle gigabytes, rather than megabytes, of data), you can truly work with Big Data in a scalable way. When your algorithms feed output into a visual layer (such as a website), you don’t need to check every number in your PPT because if the code is written well, Big Data analytics turn into business insights that an executive can get anywhere she has Internet access.
4. Know When to Use Hadoop, and When Not To
Hadoop is not the only Big Data tool out there, but it’s a good one. It matters because Hadoop is a software environment that enables you to crunch a lot of data relatively cheaply, using commodity hardware, where it doesn’t matter as much if one of your servers crashes because the work can be distributed among the rest of the computing cluster.
Hadoop is best for problems that run well when they are distributed among multiple computers. If the business problem you’re trying to solve uses a regression to predict something, you’re probably better off with a single server rather than a Hadoop cluster. Put simply: you should peel an apple with an apple peeler, not a bazooka. Using the right tool matters.
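To see why counting negative tweets distributes well, here is a toy Python sketch of the map-and-reduce idea that Hadoop is built around. This is plain Python, not the Hadoop API: the word list, the sample tweets, and the function names are all invented, and the "nodes" are just slices of a list processed one after another.

```python
from functools import reduce

NEGATIVE_WORDS = {"bad", "terrible", "awful"}  # hypothetical toy word list


def is_negative(tweet):
    """Crude sentiment check: does the tweet contain any word from the list?"""
    words = {word.strip(".,!?").lower() for word in tweet.split()}
    return bool(words & NEGATIVE_WORDS)


def split_into_chunks(items, n_chunks):
    """Deal the items out into n_chunks piles, one pile per (pretend) server."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def map_chunk(tweets):
    """Map step: each server counts the negative tweets in its own pile."""
    return sum(1 for tweet in tweets if is_negative(tweet))


def reduce_counts(counts):
    """Reduce step: add the per-server counts into one total."""
    return reduce(lambda a, b: a + b, counts, 0)


tweets = [
    "This phone is terrible!",
    "Love the new update",
    "Awful battery life.",
    "Pretty good overall",
    "bad bad bad",
]
chunks = split_into_chunks(tweets, 3)
total_negative = reduce_counts(map_chunk(chunk) for chunk in chunks)

if __name__ == "__main__":
    print(total_negative)  # 3
```

Each pile can be counted without knowing anything about the others, so if one server fails its pile can simply be handed to another. A regression generally needs to weigh all the data points together, which is part of why it tends to fit better on a single server.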
5. Use Data Visualization to Tell a Story that Inspires Action, and Get Clear on UI / UX
Data visualization is about more than pie charts, bar graphs, or pretty pictures. Good data visualization is about telling a story that leads to action. To paraphrase the rules of journalism, good data visualization shows what happens, when, and to whom, as well as what actions can be taken to improve business outcomes.
On a related topic, UI = User Interface; UX = User Experience. How people interact with websites should also be as easy as possible. For examples, look at:
- The Amazon.com checkout process (shopping cart)
- The Netflix homepage after you log in (recommendations)
- The New York Times main website (for news)
Each of these sites has been fine-tuned to be optimal for a specific user to interact in a specific way (to buy, to watch, to read). Understanding your user and what they want to do — and how to help them get what they want — is the first (of many) considerations in a good UI/UX strategy.
6. Use Customer-Facing Communications Skills to Scope Requirements
The basic business communications skills consultants need are the ability to communicate clearly, both verbally and in writing, including in PPT and Excel. But that is only the beginning (par for the course, not a strategic differentiator) in Big Data analytics. A consultant who knows what the customer wants to see at the end (the website UI instead of a PPT) will be well placed to translate the customer’s data flows into a problem statement and framework that data scientists and software engineers can work with. This role as “business translator” should not be underestimated. As the saying goes, “Garbage in, garbage out”: software just tells a computer what to do. It can’t guess what the customer wants if those wants were not specified in the code. So the better the consultant can translate customer needs into software requirements, the better the code and the happier the customer will be.
7. Be Able to Do Project Management AND Product Management
In traditional consulting, project management is one thing consultants are expected to do. But in a Big Data world, analytics providers need to “productize” their offerings to be more valuable companies. This is because investors value consulting firms at 1x revenues — you’re only as valuable as last year’s client list — but product firms like Salesforce.com are valued at many multiples of revenues, because the people who use their products have “stickier” licenses for one or more years.
In order to be a good project manager, you have to keep the customer happy. In traditional consulting done with PPT and Excel, any last minute changes merely mean that the analysts and associates have to pull a few all-nighters to meet deadlines. But in a product management setting, last minute changes may mean re-thinking the entire software code; it’s far better to set things up right the first time because thousands of lines of code usually can’t be rewritten AND made bug-free in a few nights. Additionally, product management is about building a re-usable tool that multiple customers can use, rather than tailoring all work to one customer’s needs. So “project manager” consultants who are used to doing whatever it takes to keep a customer happy may have a hard time creating a product that multiple customers can use. Transitioning from project manager to product manager can start with learning how to build your own website, re-create Tetris, or otherwise make a tool that multiple people can use, in a software language of your choice.
Lastly, project management in consulting often involves a lot of meetings (and PPT and Excel). Product management involves working more with software engineers and data scientists who build the product according to the specifications (translated from the customer’s needs by a consultant or product manager). If you ever wonder why software engineers keep odd hours, it might be because they know they won’t get asked to join yet another meeting in the middle of the night. Because of the nature of software and analytics work, uninterrupted stretches of time are vital to productivity. Smart product managers set aside blocks of time where their engineers and scientists know they will not be interrupted. Agile software development is one methodology that includes a daily 15-minute meeting, where everyone goes through what they accomplished the day before, what they plan to do that day, and anything blocking them from making progress. Typically held standing up (so that people are motivated to keep the meeting short), agile “scrum” style meetings are an effective way to keep consultants in daily sync with engineers and scientists while maximizing the time the latter can spend working on the code.
The Bottom Line
Traditional business consulting skills are important, but the ability to speak and write clearly and create nice Excel and PPT documents is only one layer of the new Big Data consultant skill set. If you look at job sites or LinkedIn postings, the number of “pure strategy” consulting roles is small compared to the number involving data science. Consultants who do not learn about hardware, software, data visualization, UI/UX, Hadoop and product management will soon fall behind in the marketplace. Even if you don’t intend to get a PhD in machine learning or build the next Bloomberg terminal, being able to “talk tech” is a great way to become invaluable to any company.