The question is, how are you going to explain your job to other people?
Intro Paragraphs Are The ‘Captain America: The First Avenger’ of the Essay Universe
Data science is a lot more like Thor, to be honest: as powerful as it is attractive, but relies entirely on faith in the gods to get any of your programs to run.
Chris Hemsworth love aside, data science is an enormous field that encompasses countless disciplines and professions. You could be living under a rock and still have heard of it. Glassdoor has even ranked ‘data scientist’ as the best job in America for 3 consecutive years, to say nothing of how it has affected other areas of work.
But what is data science? What does jargon like ‘big data’ and ‘data analytics’ even mean? How does one even get started learning about it?
I have no idea. But Google does, so let’s go with that.
What Is Data Science?
It’s hard to pin down an exact definition, but the best way to put it would be that it’s a mixture of various algorithms, tools, and machine learning principles that allow people to discover patterns and connections within raw data. It has a heavy emphasis on statistics/probabilities as well as computer science/programming but also involves some soft skills like being able to express your findings verbally or through writing.
Since there’s no concrete bound on what data science can be, its effects can be felt in numerous career paths. IT workers, statisticians, salespeople, analysts, and accountants are different professions yet are all integral to businesses and organizations in their own way. While data scientist is a job that many strive to have, there is no need to feel limited to just that. Much like the nation of Wakanda is diverse in the culture of the tribes that inhabit it, so too is the world of data science and data science-adjacent careers. That comparison was a stretch but it works, and I won’t apologize for trying.
Data scientists, in particular, build and use a variety of tools to organize large amounts of digital data and synthesize different conclusions and arguments from them. They then communicate their findings to others and supply their knowledge and leadership to help deliver tangible results to their organization or business.
Data science encompasses that entire path, from researching and preparing data to communicating results. Because of this, it’s a field that has more variety and complexities than, say, the average Marvel movie.
What Do Some Data Science Terms Mean?
This article is about Marvel movies (and maybe data science too I guess), so while DC movies may be related, we neither care about nor want anything from them. Similarly, data science overlaps a lot with math-based fields like statistics, but for the sake of brevity, I’ll ignore terms from those areas.
There are almost as many data science terms as Marvel movies, so here’s a list of some of the most important ones:
A/B Testing
Used most often in product development, A/B Testing is when two (or more) variants of a given product are shown to different groups of users. Changes can be extremely simple, such as just changing a single button or icon, but what matters most is that experimenters can determine which variant their users respond best to (and work with that information).
For example, take the three actors who played Hulk (Eric Bana, Edward Norton, and Mark Ruffalo). If you showed each of their respective movies to Marvel Cinematic Universe (MCU) fans, you could find out which one they responded best to and plan your future movies off of that.
Mark Ruffalo > Edward Norton > Eric Bana, because honestly who can remember anything from the latter two’s movies? At least Ruffalo’s Hulk talks.
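To make the "which variant did users respond best to" part a little more concrete, here is a minimal sketch of one common way to compare two variants: a two-proportion z-test. The function name and the viewer counts are made up for illustration, and a real experiment might call for a different test entirely.

```python
import math

def two_proportion_z_test(success_a, total_a, success_b, total_b):
    """Return (z, two-sided p-value) for the difference between two response rates."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    # Pooled rate under the assumption that both variants perform the same.
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Invented numbers: viewers in each test group who rated their Hulk favorably.
z, p = two_proportion_z_test(success_a=120, total_a=1000,
                             success_b=160, total_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests the gap between the two groups is unlikely to be pure chance, which is the kind of evidence you would want before planning your next few movies around one Hulk.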
Artificial Intelligence (AI)
Put simply, this is an area of computer science that involves creating intelligent machines that are aware of their surroundings and can perform tasks that normally require some level of human intelligence. Everything from Facebook’s algorithm to find and ban inappropriate content to the behavior of enemies in modern shooter games is AI.
Think Ultron, but less evil and nothing is voiced by James Spader. So kinda lame.
Big Data
Big data is a term that suffers from being exceptionally broad, but here’s a brief rundown. Essentially, it describes the large volume of digital data that a business or organization handles on a day-to-day basis. Tangentially related to it are various strategies and tools that help computers and people do complex analysis of it.
It can be categorized by the 4 V’s: velocity (how fast the data comes in), volume (the measurable amount of data), variety (the different types of data), and veracity (how accurate and trustworthy the data is).
Marvel is no stranger to big data, considering the research and planning that have gone into their ‘4 Phases’ strategy as well as the gigantic comic book universe that their characters are drawn from. There’s an interesting article about it here that discusses a graph database of every known Marvel character (from the comics) that’s worth reading if you’re interested!
Data Wrangling
Sometimes referred to as ‘data munging,’ this is a process of formatting or restructuring raw data to fulfill a certain need or make it easier to use in a broader project. It helps data scientists use the data for whatever purpose they have in mind.
A good analogy would be the Border Tribe in Wakanda wrangling rhinos around (as in, herding or moving them for a specific purpose). Sure, I could use more normal animals like cows, but rhinos are cool. Especially war rhinos.
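For the less rhino-inclined, here is a rough sketch of what wrangling can look like in practice, using only Python's standard library. The raw text, the column names, and the `wrangle` function are all invented for this example; real projects often lean on a dedicated library instead.

```python
import csv
import io

# Hypothetical raw export: inconsistent capitalization, stray spaces, a missing value.
RAW = """name, first_film_year ,team
 tony stark,2008,Avengers
PETER PARKER,2016,avengers
t'challa,,Wakanda
"""

def wrangle(raw_text):
    """Turn messy CSV text into a list of clean, consistently typed records."""
    reader = csv.reader(io.StringIO(raw_text))
    header = [col.strip() for col in next(reader)]
    records = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        # This sketch assumes every row has exactly three cells.
        name, year, team = (cell.strip() for cell in row)
        records.append({
            header[0]: name.title(),
            header[1]: int(year) if year else None,  # missing year becomes None
            header[2]: team.title(),
        })
    return records

for record in wrangle(RAW):
    print(record)
```

The data going in and the data coming out say the same thing; the difference is that the cleaned version has consistent names, real numbers instead of strings, and an explicit marker for what is missing.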
Decision Tree
A decision tree is a tree-shaped visual model of a decision-making process, often used in machine learning, that uses a set of branching questions or observations about a certain data set to predict a target value.
The decision tree on the left is a more complex version of this, but it still relies on close-ended questions like “is he an alien” or “is he a war criminal in hiding” to determine a final outcome. Naturally, the Punisher only has one outcome.
You didn’t think I was going to ignore the Marvel Netflix shows, did you? Not a chance.
Decision trees start from a single root question and branch into different paths depending on the answer. As for why the decision tree is drawn upside down, with the root at the top, rather than growing upwards like a normal tree, nobody can say for certain. According to one math textbook, conventional wisdom says that the people who named them never went outside to see what real trees looked like. After spending the time to research and write this, I’m inclined to agree.
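The branching idea is easy to sketch in code. The tree below is written by hand, and its questions and outcomes are invented for illustration; in machine learning the questions would normally be learned from data rather than typed in.

```python
# A tiny hand-built tree. Each internal node asks a yes/no question about one
# feature; each leaf (a plain string) is the predicted outcome.
TREE = {
    "question": "is_alien",
    "yes": "call the Avengers",
    "no": {
        "question": "is_war_criminal",
        "yes": "call the Punisher",
        "no": "let them go",
    },
}

def predict(node, features):
    """Walk from the root to a leaf, following the branch that matches each answer."""
    while isinstance(node, dict):
        answer = "yes" if features[node["question"]] else "no"
        node = node[answer]
    return node

print(predict(TREE, {"is_alien": False, "is_war_criminal": True}))
```

Every prediction is just a path from the root question down to a leaf, which is why these models are comparatively easy to read and explain.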
Machine Learning (ML)
A subset of AI, ML is a process where a computer uses an adaptable algorithm to identify patterns within a set of data, and then ‘learns’ from those patterns by applying this knowledge to new or existing problems and requests. As more data flows in, the algorithm is updated and modified so it can be more accurate and efficient.
Interestingly enough, Thanos’s face was developed by ML software. Between 100 and 150 tracking dots were attached to Josh Brolin’s face, which captured recordings and then funneled them to an ML algorithm that determined which high-resolution face shape (out of a database of different faces and emotions) would work best. The solution could be tweaked based on the visual effects crew’s input, which the algorithm would then factor in for future use. Read more about it here!
Neural Network
This term refers to a set of algorithms, loosely modeled after the human brain, that interpret sensory data and help cluster and label it. They function as components of broader machine learning algorithms or applications.
One example would be identifying faces. A neural network would take in a low-resolution image of someone’s face, process different parts of it at a higher resolution for more accurate facial recognition, and put all these pieces together to determine if it is, in fact, a face. So if S.H.I.E.L.D. needed to track someone down, chances are they’d use a similar neural network to determine who their suspect is.
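Facial recognition is far too big for a blog post, but the building blocks fit in a few lines. Below is a toy sketch of a two-layer network that computes XOR ("one or the other, but not both"). The weights are picked by hand so the result is easy to check; a real network would learn its weights from data, and all the function names here are made up.

```python
def step(z):
    """Fire (1) if the weighted input is positive, otherwise stay quiet (0)."""
    return 1 if z > 0 else 0

def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum of the inputs plus a bias, then an activation."""
    return step(sum(x * w for x, w in zip(inputs, weights)) + bias)

def xor_network(x1, x2):
    # Hidden layer: one neuron behaves like OR, the other like AND.
    h_or = neuron([x1, x2], [1, 1], -0.5)
    h_and = neuron([x1, x2], [1, 1], -1.5)
    # Output layer: "OR but not AND" is exactly XOR.
    return neuron([h_or, h_and], [1, -1], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor_network(a, b))
```

No single neuron here can compute XOR on its own. It takes the hidden layer feeding the output layer, which is the "putting all these pieces together" idea in miniature.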
Supervised vs. Unsupervised Learning
These are both branches of machine learning but differ based on the amount of human interaction.
Supervised learning refers to a data scientist training an algorithm on examples that already have the correct answers attached, much like teaching a child how to do math properly. This typically starts with a well-defined, labeled set of data so the computer knows exactly what it’s looking for.
Unsupervised learning refers to the computer finding structure in data without being given the correct answers, building its own understanding as it goes. It is less exposed to the bias of human labeling than supervised learning but is much more complicated, and is usually left for more complex tasks as a result.
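Here is a small side-by-side sketch of the difference, using a nearest-neighbor classifier as the supervised example and a bare-bones k-means (with two groups) as the unsupervised one. The "power level" numbers, the labels, and the function names are all invented, and these are only two of many possible algorithms.

```python
# Hypothetical data: a hero's "power level" on a made-up 0-100 scale.
labeled = [(12, "street-level"), (18, "street-level"), (25, "street-level"),
           (80, "cosmic"), (88, "cosmic"), (95, "cosmic")]

def nearest_neighbor(train, value):
    """Supervised: predict the label of the closest labeled example."""
    _, closest_label = min(train, key=lambda pair: abs(pair[0] - value))
    return closest_label

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled numbers into two groups (k-means with k=2).

    Assumes the list contains at least two distinct values.
    """
    low, high = min(values), max(values)  # initial cluster centers
    for _ in range(iterations):
        group_low = [v for v in values if abs(v - low) <= abs(v - high)]
        group_high = [v for v in values if abs(v - low) > abs(v - high)]
        # Move each center to the average of the points assigned to it.
        low = sum(group_low) / len(group_low)
        high = sum(group_high) / len(group_high)
    return group_low, group_high

print(nearest_neighbor(labeled, 30))

unlabeled = [value for value, _ in labeled]  # same numbers, labels thrown away
print(two_means(unlabeled))
```

The supervised function can name its answer because a human supplied the names up front. The unsupervised one only discovers that there are two clumps of numbers; deciding what to call them is still up to you.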
Web Scraping
Web scraping is the process of extracting data from websites and putting it into a file for analysis. It can help create lists of product names or IDs, contact information like phone numbers or emails, and much more. The possibilities for this kind of data are endless, which is one of the reasons why web scraping is such a useful and in-demand skill.
Data scientist Christopher Redino scraped the Marvel Wikipedia to generate some interesting data visualizations about a few key Marvel characters, like how often they appear or how often they meet up with one another. All the characters he discusses are shown in the movies, so for my purposes, it still technically counts and I can use it. Read about it here!
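As a rough idea of what the extraction step involves, here is a minimal sketch using Python's built-in `html.parser` module. The page content, the class names, and the `HeroScraper` class are invented for this example, and the HTML is an inline string so nothing is actually downloaded; this is not how the analysis above was done.

```python
from html.parser import HTMLParser

# Stand-in for a downloaded page; a real scraper would fetch this over HTTP first.
PAGE = """
<ul id="roster">
  <li class="hero">Iron Man</li>
  <li class="hero">Black Widow</li>
  <li class="villain">Ultron</li>
</ul>
"""

class HeroScraper(HTMLParser):
    """Collect the text of every <li class="hero"> element."""

    def __init__(self):
        super().__init__()
        self.heroes = []
        self._inside_hero = False

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "li" and ("class", "hero") in attrs:
            self._inside_hero = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside_hero = False

    def handle_data(self, data):
        if self._inside_hero and data.strip():
            self.heroes.append(data.strip())

scraper = HeroScraper()
scraper.feed(PAGE)
print(scraper.heroes)
```

From here, the collected list could be written out to a file for analysis. Before scraping a real site, it is worth checking its terms of use.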
How Do I Get Started In Data Science?
Jumping headfirst into a topic as deep as data science is a possible strategy, but it’s not a highly recommended one. Instead, it would be smarter to search for online resources that outline some basic courses and resources you could take advantage of to learn more. We here at the Data Science Library have compiled such a guide, which I will shamelessly plug here. Once you have more experience with basic data science topics, feel free to move on to more complicated ones; the Data Science Library can offer these guides too, like the one we have for machine learning (located here).
Yes, I’m a shill, but have you ever met a Marvel fan that wasn’t one? That’s what I thought.
But more than just knowing data science concepts, you should also know the programming languages that go hand-in-hand with these topics. I’ll briefly go over 3 such languages that are a crucial part of every productive data scientist’s toolkit:
Python
Python is an object-oriented programming language, which is a fancy way of saying it organizes code around objects that bundle data together with the functions that operate on it. If that sounds complicated, consider the people who watched ‘Avengers: Infinity War’ without seeing more than 5 other Marvel movies, and then rethink how difficult this really is to understand.
Python is a popular language in the data science community, since it’s easy and reliable to use, has a thriving community around it, and has a wide array of libraries and add-ons to suit different needs.
There’s a whole lot more to Python, but the best way to learn is to find some online courses and try it out on your own. In case you’re interested in doing so, the Data Science Library has a very neat Python guide that you might find interesting (right here).
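If the "objects" idea still feels abstract, here is a tiny sketch of what it looks like in Python. The `Hero` class and its methods are made up purely for illustration.

```python
class Hero:
    """An object bundles data (attributes) with the functions (methods) that act on it."""

    def __init__(self, name, films):
        self.name = name
        self.films = list(films)

    def add_film(self, title):
        self.films.append(title)

    def film_count(self):
        return len(self.films)

thor = Hero("Thor", ["Thor", "Thor: The Dark World"])
thor.add_film("Thor: Ragnarok")
print(thor.name, "has", thor.film_count(), "solo films")
```

The data (`name`, `films`) and the behavior (`add_film`, `film_count`) travel together as one `thor` object, which is the whole point.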
R
I could make a few pirate jokes here, but let’s be honest, they’re more interested in ‘C’ anyways. Even the Ravagers, the space pirates from ‘Guardians of the Galaxy,’ wouldn’t bother with that joke, but it’s too late to back away from it now.
R is primarily a statistics language, which is why it’s so useful for data science (graphing, visualizing, and analyzing different data sets). But it can also be more complicated and intimidating for those without a coding background; going from Python to R can be daunting if you aren’t too confident in your skills with the former.
In case you need a confidence-booster, or just want to strengthen your already solid coding foundations, check out this Data Science Library article on R right here!
SQL
Pronounced like the word ‘sequel,’ SQL stands for ‘Structured Query Language.’ It’s just a fancy way of saying it interacts with a database, typically by inserting, updating, removing, or retrieving different sets of data. It’s a query language rather than a general-purpose programming language, so it’s pretty easy to pick up and learn.
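Those four operations (inserting, updating, removing, and retrieving) can be tried out without installing a database server, since Python ships with the `sqlite3` module. The sketch below runs the SQL from Python against a throwaway in-memory database; the table and column names are invented for the example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE heroes (name TEXT PRIMARY KEY, team TEXT)")

# Insert
conn.executemany(
    "INSERT INTO heroes (name, team) VALUES (?, ?)",
    [("Star-Lord", "Ravagers"), ("Spider-Man", "Avengers"), ("Yondu", "Ravagers")],
)
# Update
conn.execute("UPDATE heroes SET team = ? WHERE name = ?", ("Guardians", "Star-Lord"))
# Remove
conn.execute("DELETE FROM heroes WHERE name = ?", ("Yondu",))
# Retrieve
rows = conn.execute("SELECT name, team FROM heroes ORDER BY name").fetchall()
print(rows)
conn.close()
```

The `?` placeholders let the database driver fill in the values safely, which is a good habit to pick up early.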
Unfortunately, we here at the Data Science Library are still working on a SQL guide at the moment. But for those who want to learn but don’t want to end up like Peter Parker struggling to figure out how to work his supersuit’s built-in A.I. (affectionately named Karen), try these guides:
- Codecademy’s SQL tutorial, designed for a beginner with little to no prior coding experience. Highly recommended if you’re going in totally blind.
- W3Schools’ SQL tutorial, an extremely in-depth guide from one of the biggest online collections of programming information.
- Vertabelo Academy’s SQL tutorial, similar to Codecademy’s tutorial but much more colorful and flashy.
Conclusion Paragraphs Are Basically Post-Credits Scenes But In Words
I genuinely hope this article isn’t as forgettable as ‘Thor: The Dark World’ because data science is an extremely interesting (and profitable) field. Not to mention, data is everywhere in life. According to a Forbes article on the issue, there are more than 40,000 Google searches every second, 456,000 Tweets every minute, and 1.5 billion people active on Facebook every day. Someone’s gotta look at all that data, find connections in it, draw conclusions from it, and figure out how to best utilize this information. With enough training and practice, that could very well be you.
This article is dedicated to Zoshua, the person in charge of this blog who decided to spoil the entirety of ‘Avengers: Endgame’ for me a day before I went to see it because he has no empathy for those who “live in a fantasy world” and thinks I should “stop being a child, grow up, and focus on more real-world subjects.” He has since been hounding me to write something.
And so this entire article had to be Marvel-themed. Hope you enjoyed it, Zoshua. Being the child I am, I know I did.