Data Ethics, the Data Science Concept and the Machine Learning Concept: Do They Have Significant Differences or Similarities?

Yalwa · Jan 29, 2024

This write-up is intended to cover data ethics and data science; however, data science cannot be separated from machine learning in the context of AI. It therefore contains the following: data ethics, the data science concept, what machine learning is, artificial intelligence (AI), and the relationship between data science and machine learning.

1.0 Data ethics

1. Data Quality

Ensuring that data is accurate, reliable, and relevant for its intended use. Poor data quality can lead to incorrect analyses, imperfect decision-making, and unreliable outcomes. For example, a healthcare system ensures that patient records are accurate and up to date; fabricated or outdated data could lead to incorrect diagnoses and treatments.

2. Misinterpretation

Avoiding the misreading of data, which can occur due to biases, lack of context, or misunderstandings.

Accurate interpretation is crucial for making informed decisions and drawing reliable conclusions. For example, an analytics team might present data on customer satisfaction without providing the context that the survey was conducted during a temporary service outage, which could lead to misinterpretation of the overall customer experience.

3. Free Choice

Respecting individuals’ right to choose how their data is used and shared. Upholding privacy and autonomy, allowing individuals to have control over their personal information. For example, a social media platform allows users to control their privacy settings and choose who can access their posts, demonstrating respect for users’ choices regarding the visibility of their content.

4. Intellectual Property

Respecting the rights and ownership of data creators and contributors. Ensuring that individuals or entities are appropriately credited for their intellectual contributions and protecting against unauthorized use. For example, a research institution acknowledging and properly citing the contributions of individual researchers in a collaborative project, respecting their intellectual property rights.

5. Fairness of Algorithms

Ensuring that algorithms and data-driven systems do not produce or perpetuate unfair or biased outcomes. Promoting fairness and preventing discrimination in automated decision-making processes. A financial institution using machine learning algorithms for credit scoring ensures that the model does not discriminate against certain demographic groups, preventing biased outcomes in loan approval decisions.

6. Data Ownership

Clarifying who owns the data and has the right to control its use. Determining responsibility and accountability for data, which is crucial for ethical and legal considerations. For example, a company's data-sharing agreement clearly states which party owns the customer data collected through a joint service, so that responsibility for how the data is used is unambiguous.

7. Data Privacy

Protecting individuals’ personal information from unauthorized access and use. Safeguarding privacy rights and preventing misuse of sensitive data. For example, a technology company implementing robust encryption and access controls to protect user data, preventing unauthorized access and ensuring the privacy of users’ personal information.

8. Data Bias

Identifying and mitigating biases in data that could result in unfair or discriminatory outcomes. Ensuring that data-driven processes do not perpetuate or amplify existing societal biases. For example, a recruitment platform regularly audits its algorithms to identify and mitigate biases that could lead to discriminatory hiring practices, ensuring fair opportunities for all candidates.
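As a hedged illustration of what such an audit might look like in practice, the snippet below compares selection rates across demographic groups; the group labels and decisions are invented toy data, not from any real platform:

```python
from collections import defaultdict

# Toy audit data: (demographic_group, hired_decision) pairs -- purely illustrative.
decisions = [
    ("group_a", 1), ("group_a", 0), ("group_a", 1), ("group_a", 1),
    ("group_b", 1), ("group_b", 0), ("group_b", 0), ("group_b", 0),
]

# Count outcomes per group.
totals = defaultdict(int)
hires = defaultdict(int)
for group, hired in decisions:
    totals[group] += 1
    hires[group] += hired

# Selection rate per group; a large gap between groups flags the model for closer review.
for group in totals:
    rate = hires[group] / totals[group]
    print(f"{group}: selection rate = {rate:.2f}")
```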

Collectively, these data ethics principles aim to guide organizations, individuals, and policymakers in responsible data management, ensuring that data is treated ethically and that its use aligns with societal values and legal standards. The examples above illustrate how each principle plays a crucial role in a different scenario. Applying these principles helps build trust, protect individual rights, foster innovation, and promote fairness in the use of data across various domains.

2.0 Data science concept

Asking for a definition of data science is like asking a group of experts what it really is: you would likely get different answers. The challenge lies in the complex nature of the data science domain. It can be seen as a science, as a research paradigm, as a method, as a discipline, as a workflow, and as a profession, because it wears so many hats. Reaching a single agreed-upon definition has therefore proven difficult, as data science is a vast, consolidated domain of knowledge (Koby Mike, 2023).

Data science is an interdisciplinary field that combines a number of fields under one umbrella. The main disciplines in data science are: computer science, since data science is situated at the intersection of computer science and information technology (IT); mathematics and statistics, since data science utilises powerful computational techniques to extract useful information and insights from large amounts of structured and unstructured data for making wise decisions; and other domains of knowledge, since data science incorporates domain-specific expertise, such as geography, economics and many more, to gain insights from data (Koby Mike, 2023).

Data science should be more about improving how we learn from data, not just dealing with big amounts of it. A data science maxim states that everything in science will be treated like data, and that we will be able to predict how changes affect, and are affected by, the way we analyse and live (Edouard Duchesnay, Feb 03, 2017).

Figure: the data science domain.

Data plays a big part in machine learning. It is important to understand and use the right terminology when talking about data, and there are several sets of terminology for describing it, such as:

I. Standard data terminology used in general when talking about spreadsheets of data.

II. Data terminology used in statistics and the statistical view of machine learning.

III. Data terminology used in the computer science perspective of machine learning.

This will greatly help in understanding machine learning algorithms in general.

2.1 Spreadsheets of data

Ø Column: A column describes data of a single type. For example, you could have a column of weights or heights or prices. All the data in one column will have the same scale and have meaning relative to each other.

Ø Row: A row describes a single entity or observation and the columns describe properties about that entity or observation. The more rows you have, the more examples from the problem domain that you have.

Ø Cell: A cell is a single value in a row and column. It may be a real value (1.5), an integer (2) or a category (red).
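A minimal sketch of these three terms using pandas; the column names and values below are invented purely for illustration:

```python
import pandas as pd

# Each column holds one type of data; each row is one observation.
data = pd.DataFrame({
    "height_cm": [170.0, 165.5, 180.2],    # column of real values
    "weight_kg": [68, 72, 85],              # column of integers
    "shirt_colour": ["red", "blue", "red"]  # column of categories
})

print(data)                       # the whole spreadsheet-like table
print(data.iloc[1])               # one row: a single observation
print(data.loc[1, "weight_kg"])   # one cell: a single value
```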

2.2 Statistical Learning Perspective

The statistical perspective frames data in the context of a hypothetical function (f) that the machine learning algorithm is trying to learn.

That is, given some input variables (Input), what is the predicted output variable (Output)?

Output = f(Input) (2.1)

The columns that are the inputs are referred to as input variables, whereas the column of data that you may not always have, and that you would like to predict for new input data in the future, is called the output variable. It is also called the response variable.

OutputVariable = f(InputVariables) (2.2)

Typically, you have more than one input variable. In this case the group of input variables are referred to as the input vector.

OutputVariable = f(InputVector) (2.3)

If you have done a little statistics in the past, you may know a more traditional terminology. For example, a statistics text may talk about the input variables as independent variables and the output variable as the dependent variable. This is because, in the framing of the prediction problem, the output is dependent on (a function of) the input (also called the independent variables).

DependentVariable = f(IndependentVariables) (2.4)

The data is described using shorthand in equations and descriptions of machine learning algorithms. The standard shorthand used in the statistical perspective is to refer to the input variables as capital X and the output variable as capital Y.

Y = f(X) (2.5)

When you have multiple input variables they may be dereferenced with an integer to indicate their ordering in the input vector, for example X1, X2 and X3 for data in the first three columns.
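A minimal sketch of this framing with scikit-learn; the data below are random numbers used only to show the X-to-Y mapping, and the unknown function f is assumed to be roughly linear for the example:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Input vector with three input variables X1, X2, X3 (100 rows, 3 columns).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Assume the hidden function f is roughly linear: Y = 2*X1 - X2 + 0.5*X3 + noise.
Y = 2 * X[:, 0] - X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=100)

# The learning algorithm estimates f from the (X, Y) pairs.
model = LinearRegression().fit(X, Y)
print(model.coef_)           # estimated effect of X1, X2, X3
print(model.predict(X[:5]))  # predicted Y for the first five rows
```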

2.3 Computer Science Perspective

There is a lot of overlap in the computer science terminology for data with the statistical perspective. We will look at the key differences. A row often describes an entity (like a person) or an observation about an entity. As such, the columns for a row are often referred to as attributes of the observation. When modelling a problem and making predictions, we may refer to input attributes and output attributes.

OutputAttribute = Program(InputAttributes) (2.6)

Another name for columns is features, used for the same reason as attribute, where a feature describes some property of the observation. This is more common when working with data where features must be extracted from the raw data in order to construct an observation. Examples of this include analog data like images, audio and video.

Output = Program(InputFeatures) (2.7)

Another computer science phrasing refers to a row of data or an observation as an instance. This is used because a row may be considered a single example or single instance of data observed or generated by the problem domain.

Prediction = Program(Instance) (2.8)

(Brownlee, 2019)


2.4 Other purposes of manipulating data

Data science is a combination of data analysis, algorithmic development and technology to solve analytical problems. The main goal is the use of data to generate business value.

Data mining is the study of extracting useful information from structured and unstructured data from various sources. This is usually done for:

1. Mining for frequent patterns

2. Mining for associations

3. Mining for correlations

4. Mining for clusters

5. Mining for predictive analysis

Data mining is done for purposes such as market analysis and determining customer purchase patterns.

(https://www.analyticsvidhya.com, n.d.)
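As a small, hedged sketch of the first of these tasks (mining for frequent patterns), one can count which item pairs co-occur often in market-basket data; the baskets below are invented for illustration:

```python
from collections import Counter
from itertools import combinations

# Invented market-basket transactions.
baskets = [
    {"bread", "milk", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
    {"bread", "milk", "butter"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least half of the baskets are treated as frequent patterns.
min_support = len(baskets) / 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(pair, count)
```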

2.5 An example of data for decision-making from NYC restaurants

This is a sample of the raw data, shown without any machine learning applied, for manual observation.
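As a hedged sketch of such manual observation (the file name below is hypothetical; any similar tabular export would do), the raw data could be loaded and eyeballed with pandas before any machine learning is applied:

```python
import pandas as pd

# Hypothetical export of NYC restaurant records.
df = pd.read_csv("nyc_restaurants.csv")

print(df.head())        # inspect the first few rows by eye
print(df.describe())    # quick summary statistics for manual observation
print(df.isna().sum())  # how many values are missing per column
```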

3.0 What is machine learning?

Machine learning is the idea of generic algorithms that can tell you something interesting about a set of data without you having to write any custom code specific to the problem. Instead of writing code, you feed data to the generic algorithm and it builds its own logic based on the data. For example, one kind of algorithm is a classification algorithm: it can put data into different groups. The same classification algorithm used to recognise handwritten numbers could also be used to classify emails into spam and non-spam without changing a line of code; fed different training data, it comes up with different classification logic.
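A minimal sketch of this idea with scikit-learn: the identical classifier code is fed two different training sets, the built-in handwritten digits data and a synthetic "spam-like" binary problem generated for illustration, and builds different logic from each:

```python
from sklearn.datasets import load_digits, make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_and_score(X, y):
    # The same generic code, with no problem-specific logic anywhere.
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model.score(X_test, y_test)

# Handwritten digit recognition.
digits = load_digits()
print("digits accuracy:", train_and_score(digits.data, digits.target))

# A synthetic stand-in for spam vs. non-spam classification.
X_spam, y_spam = make_classification(n_samples=500, n_features=20, random_state=0)
print("spam-like accuracy:", train_and_score(X_spam, y_spam))
```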

Here is a list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

1. Linear Regression

2. Logistic Regression

3. Decision Tree

4. SVM

5. Naive Bayes

6. KNN

7. K-Means

8. Random Forest

9. Dimensionality Reduction Algorithms

10. Gradient Boosting algorithms

I. GBM

II. XGBoost

III. LightGBM

IV. CatBoost

Broadly speaking, which of these algorithms is used depends on how the data and the learning problem are classified.

Machine learning problems fall into three categories, which should be identified before choosing appropriate tools from the list above.

1. Supervised Learning. How it works: this algorithm involves a target/outcome variable (the dependent variable) which is to be predicted from a given set of predictors (independent variables). Using these sets of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of supervised learning: regression, KNN, logistic regression, etc. (see the sketch after this list).

2. Unsupervised Learning. How it works: in this algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into different groups, which is widely applied for segmenting customers into groups for specific interventions. Examples of unsupervised learning: the Apriori algorithm, k-means.

3. Reinforcement Learning. How it works: using this algorithm, the machine is trained to make specific decisions. The machine is exposed to an environment where it trains itself continually using trial and error; it learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of reinforcement learning: the Markov decision process.
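As referenced above, here is a minimal sketch contrasting the first two categories on the same toy data; the points are generated synthetically for illustration, and reinforcement learning is omitted because it needs an interactive environment:

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic data: 2-D points in three groups, with known labels y.
X, y = make_blobs(n_samples=300, centers=3, random_state=0)

# Supervised: the target variable y is given, and a mapping from X to y is learned.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", clf.score(X, y))

# Unsupervised: no target is given; the algorithm groups the points on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == k).sum()) for k in range(3)])
```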

Data affects not only the scientific knowledge domain but all of our lives. In simple terms, this work describes the way we study data and capture trends of change; it is not just about dealing with a lot of data but about improving how we learn from it in all areas of science and social science (Donoho, 2017).

4.0 Artificial intelligence (AI)

Artificial intelligence is the ability of a machine (a computer system or similar) to emulate the human senses (vision, speech, hearing, touch or taste) in order to make decisions. Before a decision can be made, this intelligence has to be converted into data (text, date and time, or numerical data such as vectors, matrices and arrays, to mention but a few) on which machine learning tools can then work. Consider the following diagram.

Natural or artificial intelligence is used to produce data from sensory content. Humans can analyse and identify patterns; converting these into numerical data brings us into the domain of machine learning and deep learning. Humans can listen and speak; converting this into text/audio data gave birth to the domain of machine learning and natural language processing. Humans can also see and recognise things; converting this into image/video data opens up the domain of computer vision and deep learning.

5.0 The relationship between Data science and machine learning

It is hard to buy foodstuff without considering how it will be processed. Likewise, wherever there is data, it has to be followed by a processing medium, preferably machine learning, in order to make wise decisions.

Data produced by artificial or natural intelligence becomes the input of machine learning, and machine learning in turn exhibits the outcome of artificial intelligence. These three domains can never be separated; when you talk about one, the other two are contained within it. That is why a number of published books titled data science are built upon the anatomy of machine learning and AI, and likewise books titled AI/ML are largely collections of data manipulation techniques. A number of programming languages are used in data science; the most widely used is Python, followed by R.

Data ethics is about the degree of integrity of data, while data science is the study of gaining insight from data in order to make decisions; however, data science cannot be separated from machine learning in the context of AI. You now know that you can group machine learning algorithms as supervised, unsupervised and reinforcement learning. This write-up has covered the following: data ethics, the data science concept, what machine learning is, artificial intelligence (AI), and the relationship between data science and machine learning. In the next chapter you will discover the two biggest sources of error when learning from data, namely bias and variance, and the tension between these two concerns. Naturally, life is dynamic, so we cannot understand it unless we study the data of our daily lives.

Muhammadu Hamza Yalwa

from Rumfa College Computer Science Lab, Kano
