ai like im 5: what really is data? (article 5)

ai like im 5
11 min read · Dec 19, 2023

before i start, a couple things

  1. i really encourage you to read the earlier articles before diving in: https://medium.com/@ailikeim5/list/ai-like-im-5-in-order-87ef4064afe8
  2. those articles covered ai, data science, machine learning, and deep learning
  3. if you don’t have a rough idea of what these things are, i really encourage you to read them first.
  4. this content is not aimed at 5 year olds, it is just meant to be simple!

so you now… hopefully… are getting a grasp on how these things work!

if not

  1. ai is our attempt at creating intelligence in machines
  2. data science is the vessel for this creation and data is the key to that vessel
  3. data is a collection of a bunch of numbers
  4. machines can learn thanks to data, and we call this machine learning
  5. there are three types of this learning

a. supervised -> we know the outcome of our situation or what it is

b. unsupervised -> we don’t know the outcome of our situation or what it is

c. reinforcement -> we interact with an environment and receive feedback

6. this learning is computational and logical, therefore it lacks human ideas like real emotion and creativity

7. conventional machine learning is amazing and can solve the majority of real world data problems, but it has its flaws

8. deep learning is an expansion of machine learning, attempting to address the flaws and achieve “human-like” intelligence and learning

9. the backbone of deep learning is the neural network

10. neural networks are a simplified mathematical representation of how our brains process information

these articles are going to be mainly focused on deep learning

anyways,

before i go deeper into machine and deep learning, there’s a couple of topics that need to be covered and questions that should be answered

  • before i said data is just a bunch of numbers, and although this is true… there is some context missing
  • data is a lot more than just numbers, it’s a collection of many different things like images, audio, video, words, just normal numbers, and more
  • data is everywhere, it could be what type of music you listen to, what food you ordered off of uber eats, or what youtube videos you watch

so why would i call it just a bunch of numbers???

  1. computers and machines love numbers, it is how they communicate.
  2. and if mathematics is a language, numbers are the words of that language.
  3. so if ai is about machines and computers learning with the help of mathematics and other fields, it is going to be really helpful, and at times required, to speak the language of mathematics (numbers)

here’s a story to illustrate this:

i am going to drop you off in a foreign country, let’s say italy, with 1 task: get home before the end of the week… easy right?

but there’s a catch

  • you do not speak the language of anyone in this place, there is not a single soul that can understand you!
  • everything is in italian, signs, stores, etc
  • technology does not exist… there are no phones
  • and you have no money, you are broke

hopefully, you can recognize that there is only one way to be successful with this very hard challenge: learn how to communicate.

  1. the first thing you might do is draw pictures and shove them in people’s faces

2. but that really doesn’t work, so you start using gestures, facial expressions, charades, and body language

3. after harassing enough people with your charades and gibberish, you finally meet someone who is very nice and they hand you a book, a translation from english to italian.

  • you are saved, for the most part, and you begin to read the book and learn!
  • you finally feel confident enough to ask for help and you go up to a random stranger and forget everything
  • you then open up the book, going page by page and word for word in order to complete one sentence. then you hand the stranger the book, and back and forth you guys go until you have had a full conversation
  • so after conversation and conversation with different strangers, writing notes, and learning signs, you end up getting enough money, finding the airport, and getting home safely

here’s what i want you to take away

  • if we do not speak the language, we cannot communicate without the proper tools!
  • and sometimes, the proper tool is human intervention and it takes time away and that time is important to us!
  • although data is words, pictures, videos… and just regular numbers, we would want to avoid passing a translation book back and forth as much as possible or find a better solution, like a magic translator.

and if we want to do complex tasks, like deep learning, we are going to need direct communication or better tools for communication, i.e. numbers!
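to make "speaking in numbers" a bit more concrete, here is a minimal python sketch. it uses python’s built-in ord() and chr() to turn the letters of a word into numbers and back again. the word "ciao" is just my example, and real models use fancier ways of turning text into numbers, so treat this as an illustration only.

```python
# a tiny sketch: turning a word into numbers a computer can work with.
# ord() gives the unicode number of a character, chr() goes the other way.
word = "ciao"
numbers = [ord(letter) for letter in word]
print(numbers)  # [99, 105, 97, 111]

# the translation works in both directions, nothing is lost
decoded = "".join(chr(n) for n in numbers)
print(decoded)  # ciao
```

the point is not this exact mapping, it is that once the word is numbers, the machine no longer needs the translation book.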

data types

  • we often divide data into separate types
  • data types are super important for direct communication
  • some important data types:

a. numeric data: our data is already expressed in numbers:

  • numerical data is the most straightforward data type, it is already numbers
  • and because our data is already numbers, there is almost no processing required!

numerical data is divided into two categories

  1. discrete -> countable data
  • distinct and separate numbers that you can count one by one.
  • the numbers of pieces in a puzzle, number of cars in the lot, leaves on a tree
  • there are no fractions or decimals for discrete data

2. continuous data -> measurable data

  • can take any value within a range
  • provides a detailed representation of physical things: like how fast that plane is going, how much eggnog your uncle drinks during christmas, and how big your pet goldfish has grown.
  • measurements: weight, temperature, height, and many other things
  • decimals, fractions, and more!
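here is a small python sketch of the difference. it assumes that counts are stored as whole numbers (int) and measurements as decimals (float), which is a common habit but not a rule (a count can end up stored as 3.0). the variable names and values are made up for illustration.

```python
# discrete data: things we count, so whole numbers (int)
puzzle_pieces = 500
cars_in_lot = 42

# continuous data: things we measure, so decimals are allowed (float)
plane_speed_kmh = 885.5
goldfish_length_cm = 7.25


def kind_of_number(value):
    """label a value as discrete or continuous based on its python type."""
    if isinstance(value, int):
        return "discrete"
    if isinstance(value, float):
        return "continuous"
    raise TypeError("expected an int or a float")


print(kind_of_number(puzzle_pieces))       # discrete
print(kind_of_number(goldfish_length_cm))  # continuous
```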

b. categorical data: our data is expressed in categories

  • divided and grouped by characteristics or attributes
  • it is not about amounts or measurements, instead it’s about labels or names that help us understand and categorize our data.
  • the different types of ice cream, the different colors of hair, the different breeds of dogs, grades of school!
  • as you may have guessed: this is not numbers.

we are going to divide categorical data into three categories:

  1. nominal -> categories without order
  • the order and rank of our categories has no meaning
  • colors, types of food, type of fish

2. ordinal -> categories with order

  • the order and rank of our categories is very significant and has meaning
  • not just about being a part of a category, but instead your position inside that category
  • let’s say we compete in a race, we are not just grouped based on our performance, we are ranked.
  • rank provides insight into the performance and holds special meaning
  • whoever finishes 1st is the fastest, and last is going to be the slowest

3. binary -> belonging to two possible categories

  • things belong to one or the other, there are only two possible categories!
  • these categories are usually opposites
  • yes/no, true/false, male/female, on/off
  • this is the simplest form of categorical data, and due to its simplicity it is very important
  • note: binary data is a type of nominal data, but deserves its own category in the context of machine learning

categorical encoding:

  • remember our earlier illustration about us and the italian passing the book back and forth?
  • categorical data puts us in the same spot, so we need to convert this non-numerical data so it is ready for direct communication!
  • there are techniques for this that will be illustrated in the future
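as a sneak peek, here is a rough python sketch of three common ways to turn categories into numbers: one-hot encoding for nominal data, rank numbers for ordinal data, and 0/1 for binary data. the one_hot helper and the example categories are my own made up names, and these are not the only techniques out there.

```python
# nominal: no order, so give each category its own on/off slot (one-hot encoding)
flavors = ["vanilla", "chocolate", "strawberry"]


def one_hot(value, categories):
    """return a list with a 1 in the slot of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]


print(one_hot("chocolate", flavors))  # [0, 1, 0]

# ordinal: order matters, so map each category to its rank
race_rank = {"1st": 1, "2nd": 2, "3rd": 3}
print(race_rank["2nd"])  # 2

# binary: only two options, so 0 and 1 are enough
answer_to_number = {"no": 0, "yes": 1}
print(answer_to_number["yes"])  # 1
```

notice that the nominal flavors do not get 1, 2, 3. that would sneak in an order (strawberry "bigger" than vanilla) that does not actually exist.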

there are many types of data but there is one more that needs to be talked about especially for deep learning:

c. images, text, audio, videos, etc

  • we know that images, audio, text, and videos are complex, they are not as simple as categorical data
  • so how are we going to speak the language of mathematics?
  • i am going to, briefly, introduce you to a really important idea to machine learning and the most important idea to deep learning: tensors

so imagine you are talking to the italian, but this time i am going to give you a helper: a robot that translates everything and makes the conversation so much easier

  • tensors are like this robot, they allow us to represent complex things: images, audio, text, videos!
  • tensors are a mathematical idea that translates complex things into an organized structure, similar to what a dataset does for data
  • these tensors have certain mathematical properties that make them very complex
  • but for now you can think of them as this robot translator
  • and without this robot translator, creating those magical black boxes would be impossible!

these tensors are complex, so for now this is just a brief introduction!
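if you want a feel for the "organized structure" part, here is a loose python sketch that uses plain nested lists to stand in for tensors. this is only an illustration: real deep learning libraries have their own tensor types, and the shape helper below is something i wrote for this example (it assumes the lists are not empty and that every row has the same length). the pixel values are made up.

```python
# a single number (0 dimensions), e.g. the brightness of one pixel
pixel = 255

# a row of pixels (1 dimension)
row = [0, 128, 255]

# a tiny grayscale "image" with 2 rows and 3 columns (2 dimensions)
image = [
    [0, 128, 255],
    [255, 128, 0],
]

# a short "video" is just a stack of images (3 dimensions): 2 frames here
video = [image, image]


def shape(data):
    """walk down the nested lists and record the length at each level."""
    dims = []
    while isinstance(data, list):
        dims.append(len(data))
        data = data[0]
    return tuple(dims)


print(shape(pixel))  # ()
print(shape(row))    # (3,)
print(shape(image))  # (2, 3)
print(shape(video))  # (2, 2, 3)
```

so an image, which feels very far from "just numbers", ends up as a neat grid of them.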

data structuring

  • data structuring is very important to data and the world
  • we dedicate a lot of resources and time to making sure data is
  1. understandable → if i have a recipe to bake a cake, i would want my recipe to be as clear as possible with directions, ingredients, etc.
  2. accurate → if i want to make a cake, i want the best ingredients possible… if i add a bad ingredient (bad data), and i keep adding them, my cake is going to stink!
  3. consistent → let’s say i make 1000 cakes and finally find the best recipe, i am not going to change anything after that, i will keep it as consistent as possible
  4. efficient → let’s say i have the most understandable, accurate, and consistent cake recipe in the world, if it takes 3 days to make, am i really going to want to make it?
  5. and accessible → so imagine i have a 10 minute, to die for cake recipe, would i store the recipe at the bottom of a stack of a million papers, and sort through the papers every time for it?

without proper organization and these factors and many others, data is pretty much useless

datasets

  • going back to our cake example, let’s say you dedicated your entire life to writing a cook book and ended up with ten thousand recipes
  • after enough cooking, it is time to write the book, you think about how you want to organize the recipes so that the reader can find that recipe asap!
  • there are so many different ways to organize or categorize it: in alphabetical order, by type of food, by culture, by length, by diet, etc.
  • we are going to call these the features of recipes, and a feature can be anything we want that helps us better organize and categorize our recipes or data

features

  • distinct and measurable attributes or properties
  • features have a distinct data type
  • often the input of our machine learning and deep learning models

your first book is successful, you write three more, and you end up meeting other cooking authors. one of them suggests you combine everything you ever wrote into a website, where people can use the different features to find the perfect recipe.

  • remember, you don’t want your computer to flip through page after page for just one recipe
  • so you want the best way to store this information, one that speaks the computer’s language!

this is where we introduce datasets:

  • datasets are the primary storage of online data
  • datasets store data in a table, where we have rows (horizontal) and columns (vertical)
  • columns represent the names of our different features, let’s say:

the name of a recipe, who wrote the recipe, what culture it’s from, how long it takes, etc…

  • rows represent the actual data, let’s say:

chicken alfredo, jenny benny, italy, 45 minutes, etc

here’s what that table or dataset would look like:
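one way to sketch that table in python is a list of column names plus a list of rows. the column names are my own shorthand for the features above, and the second row is made up just so the table has more than one recipe.

```python
# columns = the names of our features
columns = ["name", "author", "culture", "time_minutes"]

# rows = the actual data, one recipe per row
rows = [
    ["chicken alfredo", "jenny benny", "italy", 45],
    # this row is invented for the example, it is not from the cook book
    ["veggie tacos", "sam example", "mexico", 20],
]

# print the header, then each row, with a fixed width per column
print(" | ".join(f"{name:<16}" for name in columns))
for row in rows:
    print(" | ".join(f"{str(value):<16}" for value in row))
```

every row has exactly one value per column, and that strict shape is what makes the table easy for a computer to work with.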

  1. data and types of collection have been studied for years before modern technology, this study is called statistics
  2. you probably have worked with this type of organization in products like excel, google sheets, and more.

so why this structure?

datasets allow for these 6 important things and more

  1. organization -> all types of data can be incredibly organized
  2. efficient retrieval -> accessing elements is incredibly quick if we know what features or traits we are looking for
  3. easy modification -> editing and cleaning our datasets is incredibly easy
  4. insights -> datasets minimize the context necessary to understand a situation and gain insight
  5. scalability -> there is no limit to how big these tables can get, with data and features!
  6. easy storage-> dataset storage is easy

note: there are many different types of data structuring but datasets are fundamental to machine learning, deep learning, and ai!

we use two main types of storage for data and datasets

a. database -> live dataset storage, we need to access the information constantly, optimized for live data

b. data warehouse -> historical dataset storage, optimized for big big big amounts of data
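to give the database idea some shape, here is a minimal sketch using sqlite3, a small database that ships with python. the table name and column names are assumptions for this example, and ":memory:" means the database only lives while the program runs. a data warehouse is a much bigger beast and is not shown here.

```python
import sqlite3

# ":memory:" creates a temporary database that disappears when the program ends
connection = sqlite3.connect(":memory:")

# create a table whose columns are our features
connection.execute(
    "CREATE TABLE recipes (name TEXT, culture TEXT, time_minutes INTEGER)"
)

# store one row of data
connection.execute(
    "INSERT INTO recipes VALUES (?, ?, ?)",
    ("chicken alfredo", "italy", 45),
)
connection.commit()

# ask the database a question: which recipes are from italy?
rows = connection.execute(
    "SELECT name, time_minutes FROM recipes WHERE culture = ?",
    ("italy",),
).fetchall()
print(rows)  # [('chicken alfredo', 45)]

connection.close()
```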

data processing

  • so now that you get data and the best way to store it, i have to say 1 thing:

there is a lot of work to do before this data is ready for machine learning and deep learning

we are going to call this data processing, but it goes by many names, such as data cleaning.

and there’s just 1 problem… this requires human intervention and judgement (for now)

  • this human judgment involves identifying and correcting inaccurate data, deciding what to do with outliers (data points that sit far away from the rest and might seem confusing), and other things like labelling data!
  • these tasks require context about the situation, it is not an exact science, and there is no right answer in a lot of situations
  • like any skill, you begin to learn and get better and better and better
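here is a small python sketch of what a cleaning pass can look like. the cook times are made up, and the "more than 10 times the median" rule for spotting outliers is an arbitrary choice i picked for the example. that choice is exactly the kind of human judgment call described above.

```python
from statistics import median

# made up cook times in minutes: one is missing, one looks like a typo
cook_times = [45, 30, None, 20, 4500, 35]

# step 1: drop the missing values
present = [time for time in cook_times if time is not None]

# step 2: flag outliers with a simple (and arbitrary) rule of thumb
limit = 10 * median(present)
outliers = [time for time in present if time > limit]
clean = [time for time in present if time <= limit]

print(outliers)  # [4500]
print(clean)     # [45, 30, 20, 35]
```

was 4500 a typo for 45, or a three day stew? the code cannot know, and that is why a human still has to look.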

and the cool thing is: the creation and curation of an amazing dataset can be just as important as the actual model itself!

keep in mind with data:

  • data is a precious resource -> the more data you can get, the better you get at learning, the more recipes we learn, the better we get at cooking
  • more data does not mean good data -> if we wanted to create the best chef, would we want our book to contain our best recipes, or all our recipes?
  • good data is hard to find and at times requires effort → building that recipe book took you your whole life

so ais like chatgpt, alexa, and more are nothing without amazing data!

a recap

  • data is more than numbers and it’s everywhere
  • computers and machines love numbers
  • and if math is a language, numbers are the words
  • direct communication is important and especially for complex tasks
  • there are many data types for storing data
  • we have a dedicated data structure for complex things like images, audio, videos, and more
  • these are tensors and they are rooted in mathematics and are very complex!
  • data without the proper organization can mean nothing!
  • for that reason, we store data in special tables called datasets
  • these datasets are perfect for machine learning and deep learning
  • but before it is ready for that, it requires a lot of tidying up and human intervention, at least for now!
  • data is a precious resource, more data does not mean good data, and good data is hard to find
  • the best ai’s are nothing without amazing data and data processing can be just as important as the model!

so anyways,

here’s my human moment of the day:

ross creations is one of the funniest youtube channels and truly a florida man. this video has been in my head all day, please check out this part!

oh, one more thing: here is some data of me (pink beard), my buddy, guy, and some random strangers at a rave!

godspeed!
