Stories by João Renato on Medium

Did AI write this article?

João Renato — Sun, 14 Apr 2024 23:17:06 GMT

Could an artificial intelligence have written this article here on Medium? Or was it an actual human being?

Microsoft Designer — Stunning designs in a flash

Well, it’s been a while since the last time I published here (over a year, to be more accurate, a few months before ChatGPT was launched).

As you might know, English is not my first language. So for my last post I had to research a lot of words and idiomatic expressions just to make sure that what I was writing made sense (I don’t trust Google Translate that much to just input a text in Portuguese and use the output in English).

I basically spent twice as long writing in English as it would have taken me in Portuguese.

In other words, it’s a lot of time to invest in a simple post, so it is no surprise that I haven’t been very productive on this platform (only five posts, two of them in English).

During this time, I’ve always wanted to come back to writing. I’ve had some ideas and even created a few drafts, but they never became anything real. The time such a task demands, combined with my tendency to procrastinate, kept me from getting anything done for almost two years.

But then… in 2023 everything changed.

We all heard about this free tool invented by OpenAI that could answer questions like a real person. And even more, it could give us ideas, insights, review texts and…even write stories!

ChatGPT reached one million users in only five days. It was the fastest-growing platform until Meta launched Threads, which reached one million users in only one hour.

Source: Statista

By mid-2023, practically everyone I knew was using it, whether for writing fancy emails, reviewing texts, coming up with ideas for a marketing campaign, or even debugging code (me included).

While it appeared to be a highly useful tool and a potential game changer in the market, it also generated some controversy.

Students could use it to cheat on essays or online tests, and influencers could use it to create digital content without any effort or creativity. And, why not, writers could write articles or even books in a couple of hours without too much thinking.

People started wondering about the possibilities. There were even cases of hackers using ChatGPT to help break through cybersecurity defenses.

With this new technology came new rules. Teachers started forbidding it, and companies created policies for its use.

Now, one year later, every time I see some post with fancy words on LinkedIn I wonder who actually wrote that.

Photo by Possessed Photography on Unsplash

Turing test

In 1950, the mathematician Alan Mathison Turing wrote an article proposing to consider the question: Can machines think?

The solution to this problem described in Turing’s article is now known as the “Turing test”, although his own name for it was the “Imitation Game” (yes, that’s where the title of the movie came from).

I’m not gonna annoy you with the technical details (honestly, because I didn’t fully understand them), but I can say that the essence of the game is to discover whether the “person” you are talking to is human or actually a computer. Since the machine may be able to imitate a human being, to win you need to ask specific questions whose answers will guide you towards the truth.

In his article, Turing describes how digital computers operate and then theorizes about the questions we could ask and their possible outcomes. Basically, you win the game if you can tell that it is indeed a machine, and the machine wins if it tricks you into thinking that it is human.

Turing then theorizes about the concept of “learning machines” (is this expression familiar to you?). He finishes his article with the following:

We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? Even this is a difficult decision. Many people think that a very abstract activity, like the playing of chess, would be best. It can also be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English. This process could follow the normal teaching of a child. Things would be pointed out and named, etc. Again I do not know what the right answer is, but I think both approaches should be tried. We can only see a short distance ahead, but we can see plenty there that needs to be done.

The Turing Test is often discussed in the context of artificial intelligence and the quest to create machines that can demonstrate human-like intelligence. While it has been a benchmark for AI development, many argue that passing the Turing Test doesn’t necessarily mean a machine truly understands or possesses consciousness. It primarily assesses the ability to mimic human conversation.

Since his article was published, Turing’s idea of machines tricking humans has always been linked to sci-fi movies. It was really hard (at least for me) to think of a scenario where we could not be sure whether we were talking to another human being or to a machine.

That was before ChatGPT…

A game changer…

Photo by Jonathan Kemper on Unsplash

The rise of ChatGPT was a game changer even for a giant like Google, because people became more interested in finding information by asking ChatGPT than by using the search engine.

But…why?

How do Generative Language Models work?

If you ask ChatGPT what ChatGPT is, it answers:

ChatGPT is a conversational AI developed by OpenAI, based on the GPT (Generative Pre-trained Transformer) architecture. It’s designed to understand and generate human-like text based on the input it receives. Essentially, it’s a language model that can engage in dialogue, answer questions, provide information, and assist with various tasks based on the context provided. It’s trained on a vast amount of text data from the internet, allowing it to generate responses that are contextually relevant and coherent.

In other words, ChatGPT is a specific implementation of a generative language model. From an input, it generates responses, completes text, and engages in conversation by predicting the next word or token in a sequence given the preceding context, one token at a time (an approach known as autoregressive generation).
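
A toy sketch can make next-word prediction more concrete. The Python below is not how ChatGPT works internally (it relies on a neural network trained on vastly more text); it only counts which word follows which in a tiny made-up corpus and always picks the most frequent follower. The corpus and the function names train_bigrams, predict_next, and generate are assumptions made for this illustration.

```python
from collections import Counter, defaultdict

# Toy corpus invented for this sketch; real models train on vastly more text.
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."


def train_bigrams(text):
    """Count, for each word, which words follow it and how often."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows


def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]


def generate(follows, start, length=5):
    """Generate text one word at a time, feeding each prediction back in."""
    out = [start]
    for _ in range(length):
        nxt = predict_next(follows, out[-1])
        if nxt is None:
            break
        out.append(nxt)
    return " ".join(out)


model = train_bigrams(corpus)
print(predict_next(model, "sat"))       # "on" follows "sat" both times it appears
print(generate(model, "dog", length=3))
```

The loop in generate is the part that resembles a chatbot: each predicted word becomes part of the context used to predict the next one.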

There are other tools like ChatGPT, but none of them gained so much popularity (so far).

There are a lot of possible explanations for why ChatGPT suddenly became a more popular way to find information than Google. But I personally like to think that the key is the idea of having a “human” conversation at any time about anything you want to know.

Perhaps, if we didn’t know we were talking to an artificial intelligence, we could really think that this was an actual human being.

Being polite to ChatGPT

From Sangwang Rai’s post on Facebook

I once saw a post on LinkedIn from a developer saying that every time he asked ChatGPT something he used the words “please”, “thank you”, and “goodbye”. He argued that, since the model was constantly being trained on new data (in this case, our inputs), this would help the AI treat polite words as statistically significant in a conversation, and they would eventually be added to its own vocabulary.

Well…I’m not gonna say it doesn’t make sense, but…

Like any robot, it is programmed, so it can be programmed to recognize bad words, impoliteness, etc. In the same way, it can be programmed to always be polite. So, from my point of view, whether you ask ChatGPT

Who is the president of the United States?

or

Hello! I hope you are fine. Could you please tell me who is the president of the United States? Thank you in advance!

It’s the same!

But I understand the point. There have been cases of chatbots becoming racist because trolls interacted with them using racist expressions. But I still think that it all depends on how the model is pre-programmed.

AI content detectors

Now, in a scenario where even programmers are treating ChatGPT like a real human, how can we know whether a text we read on a blog, in an Instagram post, or even in a book was written by a real human?

To help with this, OpenAI itself created a tool to help us tell human-written text from AI-written text. The idea is that AI doesn’t create, but actually recreates texts by mixing the texts from its training data. It’s basically copy-pasting, but in a very efficient way.

However, when you go to the website to use this tool:

https://openai.com/blog/new-ai-classifier-for-indicating-ai-written-text

and click on Try the classifier, you are redirected to a page that cannot be found:

https://platform.openai.com/ai-text-classifier

Thankfully, there are other AI classifier tools available like QuillBot and AI text classifier. If you know any other classifiers, please share them in the comments.

So… was this article generated by AI?

Now, with all that said, do you think this was written by a human being or generated by an AI?

Have you ever been tricked by a robot? Do you think that ChatGPT passes the Turing test?

Let me know your thoughts!

Thank you for reading!

Data Science skills: Why learn SQL?

João Renato — Sun, 21 Aug 2022 02:54:28 GMT

If you are a data scientist or if you want to become one, this article will help you understand the importance of learning SQL for this field.

Context
SELECT * FROM
ORDER BY
GROUP BY
JOINS
Conditions if else (CASE WHEN)
Application in Data Science
Conclusion

This article was inspired by 5 Reasons Every Aspiring Data Scientist Must Learn SQL, written by Francis Onyango. Here, I will show you, in a simple, didactic and objective way, the advantages of this programming language and its applications in the data science field.

5 Reasons Every Aspiring Data Scientist Must Learn SQL

1. Context

Structured Query Language, or just SQL, is a query-oriented language for relational databases developed by IBM in the 1970s. It was built on relational algebra, which in turn derives from first-order logic and set theory.

As the years went by, SQL became so popular that it is no longer limited to IBM’s domain. Many popular data analytics tools, such as SAS and R, use it for joining and pivoting data tables.

2. SELECT * FROM

Its popularity can be explained by the fact that it’s an easy and intuitive language.

Suppose we have a data table as the following:

Source: Author

If we want to select only the column (variable) Name from the table Table, we use the following command:

select Name from Table

If we want to select the occupations (variable Occupation) of the people who are between 30 and 40 years old, we use the following command:

select Occupation from Table where Age between 30 and 40

Observe that, to specify the records (table rows) to be filtered, I use the conditional command where.

If we want to select all variables from the table, with no restriction — in other words, the whole table — we have two command options:

Option 1:
select ID, Name, Occupation, Age from Table

Option 2:
select * from Table

The two command lines above return the same result. The main difference is that with option 1 the data analyst is free to list the table’s variables in any order. The asterisk (*) in option 2 is understood, in SQL, as the selection of all columns from the table.
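
To try these queries without installing a database, here is a minimal sketch using Python’s built-in sqlite3 module. The table is called People here because TABLE is a reserved word in SQL, and the four rows are made up for illustration (the article’s original table is only shown as an image).

```python
import sqlite3

# In-memory database with a hypothetical People table (rows invented for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (ID INTEGER, Name TEXT, Occupation TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO People VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "Engineer", 28),
        (2, "Bruno", "Philosopher", 35),
        (3, "Carla", "Teacher", 42),
        (4, "Diego", "Philosopher", 31),
    ],
)

# Select a single column.
names = conn.execute("SELECT Name FROM People").fetchall()

# Filter rows with WHERE; BETWEEN includes both endpoints (30 and 40).
occupations = conn.execute(
    "SELECT Occupation FROM People WHERE Age BETWEEN 30 AND 40"
).fetchall()

# Explicit column list versus the asterisk.
explicit = conn.execute("SELECT ID, Name, Occupation, Age FROM People").fetchall()
star = conn.execute("SELECT * FROM People").fetchall()

print(names)
print(occupations)
print(sorted(explicit) == sorted(star))
```

Each row comes back as a Python tuple, so a one-column query returns a list of one-element tuples.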

Note: There is also a performance difference between the two types of query, but that is beyond the scope of this article. You can check some performance tips in SQL here.

The SELECT command is part of a set of commands called Data Manipulation Language (DML). Among them, there are:

DELETE

UPDATE

INSERT

As you have probably guessed, these commands delete, update, and insert data in a table, respectively.

3. ORDER BY

Querying databases is not limited to selecting the desired columns/rows. Often we need to view the data ordered by some specific feature.

Back to the previous example: if we wish to view the data in ascending order of age, we use the following line of code:

select * from Table order by Age

Or, if we want to view the data in descending order of age, we have:

select * from Table order by Age desc

Note: Since SQL is used by many different analytics tools, certain commands may vary slightly; for instance, desc might come after the variable or in parentheses. It is therefore important that the user (programmer, data analyst, database administrator, etc.) is aware of the peculiarities of the SQL dialect in the chosen tool.

Notice how every query so far boils down to selecting (select) some or all (*) columns from (from) a table, which reinforces how easy this language is.
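
A quick way to check the ordering behaviour is the sketch below, again with Python’s sqlite3 module. The People table and its three rows are hypothetical. In SQLite, as in standard SQL, desc goes after the column name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (Name TEXT, Age INTEGER)")
# Rows invented for this sketch, inserted out of order on purpose.
conn.executemany(
    "INSERT INTO People VALUES (?, ?)",
    [("Ana", 28), ("Carla", 42), ("Bruno", 35)],
)

# Ascending is the default; DESC after the column name reverses the order.
ascending = conn.execute("SELECT Name, Age FROM People ORDER BY Age").fetchall()
descending = conn.execute("SELECT Name, Age FROM People ORDER BY Age DESC").fetchall()

print(ascending)
print(descending)
```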

4. GROUP BY

Now, let’s say the table we want to work on is the following:

Source: Author

If we want to select the average age of the philosophers, we have the following code options:

Option 1:
select
       sum(Age)/count(*)
from Data
where Occupation = 'Philosopher'
group by Occupation

Option 2:
select
       avg(Age)
from Data
where Occupation = 'Philosopher'
group by Occupation

Both commands compute the same average (in engines that truncate integer division, multiply the sum by 1.0 first). Notice that this time, besides group by, we had to use the where condition, because the query is limited to the records in which the occupation is philosopher: where Occupation = 'Philosopher'.

The functions count() and sum() are also very important for algebraic manipulation. The first counts the non-null values of a specific column, count(column), or the rows of the whole table, count(*). The function sum() adds up the values of the chosen column, in this case the column Age: sum(Age). This way, we obtain the average age using the formula sum(Age)/count(*).

The second option uses the function avg(), from average, a more straightforward way to calculate the average.

Note: The function that calculates the average might change from one tool to another; it can be avg() or mean(). Once more, the programmer must pay attention to this detail.
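
The sketch below runs both ways of computing the average in SQLite through Python’s sqlite3 module. The Data table and its rows are invented for illustration. One detail worth knowing: in SQLite (and some other engines) dividing one integer by another truncates the result, so sum(Age)/count(*) and avg(Age) can differ unless the sum is first turned into a decimal number.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (Id INTEGER, Name TEXT, Occupation TEXT, Age INTEGER)")
# Hypothetical rows; the two philosophers are 30 and 41 years old.
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "Engineer", 28),
        (2, "Bruno", "Philosopher", 30),
        (3, "Carla", "Teacher", 42),
        (4, "Diego", "Philosopher", 41),
    ],
)

where = "FROM Data WHERE Occupation = 'Philosopher'"

# Integer division in SQLite: 71 / 2 is truncated to 35.
truncated = conn.execute(f"SELECT sum(Age) / count(*) {where}").fetchone()[0]
# Multiplying by 1.0 forces a decimal division.
exact = conn.execute(f"SELECT sum(Age) * 1.0 / count(*) {where}").fetchone()[0]
# avg() always returns a decimal number.
average = conn.execute(f"SELECT avg(Age) {where}").fetchone()[0]

# Without WHERE, GROUP BY returns one row per occupation.
per_group = conn.execute(
    "SELECT Occupation, avg(Age), count(*) FROM Data "
    "GROUP BY Occupation ORDER BY Occupation"
).fetchall()

print(truncated, exact, average)
print(per_group)
```

Dropping the where clause, as in the last query, is where group by really pays off: one average per occupation in a single query.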

5. JOINS

Suppose we have two tables, following the same style as the table presented on the last topic. Let’s call these tables A and B.

Imagine these tables have records in common. In other words, some information in table A is also in table B. This is known as the intersection, which I explained in my previous article.

Set theory — from pure math to SQL

However, the advantage of “Joins” goes beyond the intersection of elements among tables. This feature is useful to join, discard or even compare elements within data tables.

Source: https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

You can learn more about joins from the following article.

Visual Representation of SQL Joins
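
As a small sketch of what different joins return, the Python below builds two hypothetical tables A and B in SQLite (names, columns, and rows are all invented for illustration) that share the Ids 2 and 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (Id INTEGER, Name TEXT)")
conn.execute("CREATE TABLE B (Id INTEGER, Occupation TEXT)")
# Hypothetical rows: Ids 2 and 3 appear in both tables.
conn.executemany("INSERT INTO A VALUES (?, ?)", [(1, "Ana"), (2, "Bruno"), (3, "Carla")])
conn.executemany(
    "INSERT INTO B VALUES (?, ?)",
    [(2, "Philosopher"), (3, "Teacher"), (4, "Engineer")],
)

# INNER JOIN: only the records present in both tables (the intersection).
inner = conn.execute(
    "SELECT A.Id, A.Name, B.Occupation FROM A "
    "INNER JOIN B ON A.Id = B.Id ORDER BY A.Id"
).fetchall()

# LEFT JOIN: every record from A; columns from B are NULL when there is no match.
left = conn.execute(
    "SELECT A.Id, A.Name, B.Occupation FROM A "
    "LEFT JOIN B ON A.Id = B.Id ORDER BY A.Id"
).fetchall()

# LEFT JOIN plus a NULL filter: records that exist only in A.
only_in_a = conn.execute(
    "SELECT A.Id, A.Name FROM A "
    "LEFT JOIN B ON A.Id = B.Id WHERE B.Id IS NULL"
).fetchall()

print(inner)
print(left)
print(only_in_a)
```

SQL NULL values show up as None in Python, which is how the unmatched row from A appears in the left join.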

6. Conditions If Else (CASE WHEN)

Suppose we want to create a new variable that classifies the ages in the table Data. To create the categories “young”, “adult”, and “senior” according to age, I use CASE WHEN, as shown below:

Option 1

select
  case
    when Age < 30 then 'young'
    when Age between 30 and 40 then 'adult'
    when Age > 40 then 'senior'
  end as Age_range
from Data

Option 2

select
  case
    when Age < 30 then 'young'
    when Age between 30 and 40 then 'adult'
    else 'senior'
  end as Age_range
from Data

Notice that the CASE command has a block structure.

case when then end

In other words, the CASE command requires an END to close the block: in case a condition holds (when), then the matching value is assigned to the new column.

Also, you can use the ELSE clause with the same logic as in an if else.
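
Here is a runnable sketch of the CASE block using Python’s sqlite3 module. The table Data and its rows are hypothetical, chosen so that the boundary ages 30 and 40 are covered. Conditions are tested from top to bottom and the first one that matches wins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (Name TEXT, Age INTEGER)")
# Hypothetical rows, including the boundary values 30 and 40.
conn.executemany(
    "INSERT INTO Data VALUES (?, ?)",
    [("Ana", 25), ("Bruno", 30), ("Carla", 40), ("Diego", 41)],
)

rows = conn.execute(
    """
    SELECT Name,
           CASE
               WHEN Age < 30 THEN 'young'
               WHEN Age BETWEEN 30 AND 40 THEN 'adult'
               ELSE 'senior'
           END AS Age_range
    FROM Data
    ORDER BY Age
    """
).fetchall()

print(rows)
```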

7. Application in Data Science

The advantage (or necessity) of using SQL for data science can be summed up as how easy it is to use and how much better it lets the analyst understand the data sets.

After all, any statistician/data scientist knows that every analysis, no matter how complex, begins with the extraction and exploration of the data, which covers cleaning, structuring, or even joining database tables.

For instance, in R there is a package called “sqldf”, which makes it possible to write SQL to manipulate, join, and/or restructure data frames.

sqldf package - RDocumentation

For Python, the standard library includes the sqlite3 module for connecting to a SQLite database. You can read more about it at the following link.

sqlite3 - DB-API 2.0 interface for SQLite databases

Subqueries

A very useful resource in SQL is the subquery, which is a query nested inside another query. It can be used inside a where clause or a case when.

Example 1:
select
       avg(Age) as avg_age
from Data as a
where (select Age from Data where Id = a.Id) > 30
group by Occupation

Example 2:
select case
         when Age < (select avg(Age) from Data where Occupation = a.Occupation) then 'New'
         else 'In average'
       end as compare_ages
from Data as a

You can see that in both examples the same table was used to compare its data, in a way that there is a query inside the comparing algorithm.

In Example 3, we compare the individual’s age to the average of all ages.

Example 3:
select 
   case 
      when Age < (select avg(Age) from Data) then 'Smaller Age'
      else 'Age equal or bigger'
   end as compare_age_average
from Data
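
The sketch below runs two of these comparisons in SQLite via Python’s sqlite3 module, on a hypothetical Data table (rows invented for illustration). The first query compares each age with the overall average; the second uses a correlated subquery, which is re-evaluated for each row using that row’s occupation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (Id INTEGER, Name TEXT, Occupation TEXT, Age INTEGER)")
# Hypothetical rows: the overall average age is 35.25.
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "Engineer", 28),
        (2, "Bruno", "Philosopher", 30),
        (3, "Carla", "Teacher", 42),
        (4, "Diego", "Philosopher", 41),
    ],
)

# Compare each age with the average of all ages.
vs_overall = conn.execute(
    """
    SELECT Name,
           CASE
               WHEN Age < (SELECT avg(Age) FROM Data) THEN 'Smaller Age'
               ELSE 'Age equal or bigger'
           END AS compare_age_average
    FROM Data
    ORDER BY Id
    """
).fetchall()

# Correlated subquery: compare each age with the average of the same occupation.
vs_group = conn.execute(
    """
    SELECT a.Name,
           CASE
               WHEN a.Age < (SELECT avg(Age) FROM Data WHERE Occupation = a.Occupation)
                   THEN 'New'
               ELSE 'In average'
           END AS compare_ages
    FROM Data AS a
    ORDER BY a.Id
    """
).fetchall()

print(vs_overall)
print(vs_group)
```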

Self Join (Auto Join)

Because of SQL’s relational nature, there’s a limitation when it comes to exploring or comparing data within the same column: a query evaluates one row at a time. Joining a table with itself works around it.

-- store the ordered table as Data2 (the exact syntax varies by tool)
create table Data2 as
select * from Data order by Id;

select
       case
         when a.Age < b.Age then 'younger'
         else 'older or same age'
       end as compare_Age
from Data2 a left join Data2 b
on a.Id < b.Id

In the example above, we compare the individuals’ ages. Notice that the table Data2 is left-joined with itself.

Note: Here, we first created the table Data2 ordered by the column Id so that the self join could be applied on that variable. In most database engines this step is optional, since a join does not depend on the order of the rows.

Another example would be comparing the philosopher’s age with the other individuals from the table.

select
    case
      when a.Age = b.Age then 'Same Age'
      when a.Age < b.Age then 'Smaller Age'
      when a.Age > b.Age then 'Bigger Age'
    end as compare_age_philosopher
from Data a left join Data b on a.Id <> b.Id
where b.Occupation = 'Philosopher'
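
To see what this kind of comparison returns, here is a minimal sketch in Python with sqlite3. The table Data and its three rows are made up for illustration, and the names are selected as well so the pairs are easy to read. Because the where clause filters on b, rows without a match are discarded, so the left join behaves like an inner join here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (Id INTEGER, Name TEXT, Occupation TEXT, Age INTEGER)")
# Hypothetical rows with a single philosopher (Bruno).
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?, ?)",
    [
        (1, "Ana", "Engineer", 28),
        (2, "Bruno", "Philosopher", 35),
        (3, "Carla", "Teacher", 35),
    ],
)

# Alias a is "everyone", alias b is "the philosophers"; a.Id <> b.Id avoids
# comparing a person with themselves.
pairs = conn.execute(
    """
    SELECT a.Name, b.Name,
           CASE
               WHEN a.Age = b.Age THEN 'Same Age'
               WHEN a.Age < b.Age THEN 'Smaller Age'
               ELSE 'Bigger Age'
           END AS compare_age_philosopher
    FROM Data a LEFT JOIN Data b ON a.Id <> b.Id
    WHERE b.Occupation = 'Philosopher'
    ORDER BY a.Id, b.Id
    """
).fetchall()

print(pairs)
```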

A platform to learn/train SQL

As you already know, the best way to find our knowledge gaps is to test ourselves. That said, I’d like to recommend a platform that I find really interesting for testing my SQL skills:

SQLZoo

More tips to Data Scientists

I strongly recommend reading the following article, which has more tips and valuable information about how this language can be applied in data science.

Extra 4 SQL Tricks Every Data Scientist Should Know

CONCLUSION

Here you saw how easy SQL is to learn and use. You also saw how useful it is for exploring data and joining data tables.

It’s important to highlight that the code examples we explored so far are extremely simple and for didactic purposes. A self join or subquery can become very complex, with dozens of lines of code, depending on the application (they can even be nested, with a subquery inside another subquery).

That said, we can conclude that knowing SQL is as important for a data scientist as knowing English is for any professional in the 21st century. Nowadays there are several platforms, like Coursera and Udemy, with accessible courses, besides, of course, good old Stack Overflow.

Thank you for reading!

Stay connected

Connect on LinkedIn.

Data Science skills: Why learn SQL? was originally published in Dev Genius on Medium, where people are continuing the conversation by highlighting and responding to this story.

Set theory — from pure math to SQL

João Renato — Thu, 07 Jul 2022 21:33:56 GMT

From math to SQL

This article intends to explain the fundamental theory applied in SQL, probability and statistics.

Set theory is one way to explain how different elements are distributed among groups, whether they belong to several groups or to none. In doing so, the theory describes the possible ways of grouping those elements. There is a practically infinite number of applications: you might use it for social bubbles, profiling in statistical surveys, classifying books or products, or even propositional logic. It is also, of course, the foundation of the Structured Query Language (SQL) and of the “set” type in Python.

Venn Diagram
Elements belonging to sets
Intersection
Union
When one set contains another (set operations)
Review
References

1. Venn diagram

Venn Diagram design by author

The so-called Venn diagram illustrates the relationship between two or more sets that may or may not have elements in common.

2. Elements belonging to sets

In the illustration above, the elements a1 and a2 belong to set A, while b1 and b2 belong to set B. In math notation:

a1, a2 ∈ A

b1, b2 ∈ B

3. Intersection

Notice that both a2 and b2 belong to sets A and B at the same time: these two elements sit in an area common to both sets. As for a1 and b1, each belongs to only one set. We might say that the area where a2 and b2 are found is a third set. This common area is called the intersection, so:

a2, b2 ∈ A ∩ B

meaning that a2 and b2 belong to the intersection between A and B.

4. Union

When we talk about the union of two sets, we are talking about the elements that belong to at least one of the two sets.

The union of sets A and B is the set with elements {a1, a2, b1, b2}. In other words, just as the intersection of two sets results in a third set, the union of sets also results in a new set. The math notation for this is:

A ∪ B = {a1, a2, b1, b2}

Then, we can define the following sets C and D:

C = (A ∩ B)

D = (A ∪ B)

Set C consists of the elements in the intersection of A and B, and set D consists of the elements that belong to the union of A and B. Formally, we say that C equals A intersection B and D equals A union B.
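
Python’s built-in set type follows the same ideas, so the sets above can be sketched in a few lines. The element names are written as plain strings, and A and B are filled in from the description of the Venn diagram (a2 and b2 belong to both sets).

```python
# Sets from the Venn diagram: a2 and b2 sit in the common area.
A = {"a1", "a2", "b2"}
B = {"b1", "a2", "b2"}

C = A & B  # intersection: elements that belong to both sets
D = A | B  # union: elements that belong to at least one set

print(sorted(C))
print(sorted(D))
print("a2" in C, "a1" in C)  # membership test, the ∈ symbol
```

The methods A.intersection(B) and A.union(B) do the same as the & and | operators.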

5. When one set contains another (set operations)

Let’s take a look at the following image:

In the diagram above we have the sets A and B, with their respective elements a1 and b1.

We say that set B is a subset of A (or A contains B):

B ⊂ A

All elements belonging to B also belong to A, but not every element in A belongs to B.

From this example, we can extract the following properties:

A ∩ B = B

A ∪ B = A

b1 ∈ A, B

Applying the concept of subsets to the Venn diagram from item 1 and the sets C and D defined in item 4, we can conclude that:

C ⊂ A, B

A, B ⊂ D

In other words, set C is contained in both A and B, while set D contains both A and B. As a consequence, the elements of C belong to both A and B, while the elements belonging to A or B belong to D.
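
These subset relations can be checked with Python sets as well. In the sketch below, the first pair of sets mirrors the diagram from this item (B inside A), and A2 and B2 are hypothetical names for the overlapping sets from the first Venn diagram.

```python
# Diagram from this item: B sits entirely inside A.
A = {"a1", "b1"}
B = {"b1"}

print(B < A)       # proper subset, B ⊂ A
print(A & B == B)  # the intersection is the smaller set
print(A | B == A)  # the union is the larger set

# Overlapping sets from the first Venn diagram (A2 and B2 are names chosen here).
A2 = {"a1", "a2", "b2"}
B2 = {"b1", "a2", "b2"}
C = A2 & B2
D = A2 | B2

print(C <= A2 and C <= B2)  # C is contained in both sets
print(A2 <= D and B2 <= D)  # D contains both sets
```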

6. Review

Set theory is an excellent starting point for those who want to understand a little more about probability and SQL programming.

In this article, we saw that the intersection of two or more sets consists of the elements that belong to all of them. We also saw that the union of two sets is a set that contains all of their elements.

Although we used two sets in this article, the same ideas can be extended to 3, 4, …, n different sets.

There’s a caveat here:

The number of elements in the union of two or more sets is not the same as the sum of the sizes of the sets, although intuitively that would make sense. If we add the elements of A and B, the elements that belong to their intersection are counted twice.

We need to remember this concept, especially when we’re talking about probability.
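
The caveat can be written as |A ∪ B| = |A| + |B| − |A ∩ B|. Below is a short sketch in Python that checks it on the sets from the Venn diagram and then on a probability example. The die example (rolling an even number or a number greater than 3) is an illustration added here, not something taken from the diagrams.

```python
from fractions import Fraction

# Sets from the Venn diagram: a2 and b2 belong to both.
A = {"a1", "a2", "b2"}
B = {"b1", "a2", "b2"}

naive_sum = len(A) + len(B)            # counts a2 and b2 twice
corrected = naive_sum - len(A & B)     # subtract the intersection once
print(naive_sum, corrected, len(A | B))

# Same idea with probabilities, for one roll of a fair six-sided die.
outcomes = {1, 2, 3, 4, 5, 6}
even = {2, 4, 6}
greater_than_3 = {4, 5, 6}


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))


p_union = prob(even | greater_than_3)
p_formula = prob(even) + prob(greater_than_3) - prob(even & greater_than_3)
print(p_union, p_formula)
```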

References:

https://plato.stanford.edu/entries/set-theory/

Venn Diagram - Examples, Definition, Formula, Symbols, Types

Why learn SQL

João Renato — Tue, 16 Jun 2020 18:15:24 GMT

Data Science: Why learn SQL

If you are a data scientist or aspire to become one, find out here why learning SQL matters.

Context
SELECT * FROM
Order by
Group by
Joins
Application in Data Science
Conclusion

As explained very well in the article 5 Reasons Every Aspiring Data Scientist Must Learn SQL, this post aims to show, in a simple, didactic, and concise way, the advantages of the language and its applications in the field of data science.

Context

Structured Query Language, better known as SQL, is a language for querying relational databases, developed by IBM in the 1970s.

http://blog.dbaacademy.com.br/t-sql-nao-suportadas-no-sql-azure/

Inspired by relational algebra, SQL spread and became so popular over the years that it is no longer limited to IBM’s domain. Many popular data analysis tools, such as SAS and R, use it to manipulate tables and data frames.

SELECT * FROM

Its popularity can be explained by the fact that it is an easy and intuitive language.

Suppose we have a relational data table like the following:

Source: Author

If we want to select only the column (variable) Nome from the table Dados, we use the following command:

select Nome from Dados

If we want to select the occupations (variable Ocupação) of the individuals aged between 30 and 40, we use the following command:

select Ocupação from Dados where Idade between 30 and 40

Notice that, to limit the records (table rows) to be queried, I use the conditional command where.

If we want to select all the variables in the table with no restriction, that is, the whole table, we have two command options:

Option 1:
select ID, Nome, Ocupação, Idade from Dados

Option 2:
select * from Dados

The two lines of code above return the same result. The basic difference is that, in the first, the user is free to list the table’s variables in any order. In short, the asterisk (*) is understood in SQL as the selection of all the columns of the table.

Note: There is also a performance difference between the two types of query. For didactic purposes, however, that subject is not covered in this article.

The SELECT command is part of the set of commands called Data Manipulation Language (DML). Other DML commands include:

DELETE

UPDATE

INSERT

As you may have guessed, these commands delete, update, and insert data in a table, respectively.

Order by

Querying databases is obviously not limited to selecting the desired columns/rows. We often need to view the data ordered by some category or variable.

Let’s go back to the previous example. If we want to view the data in ascending order of age, we use the following line of code:

select * from Dados order by Idade

If we want to view the data in descending order of age, we have:

select * from Dados order by Idade desc

Note: Because SQL is used by many different tools, certain commands may differ slightly; for example, desc may come after the variable or in parentheses. Users therefore need to pay attention to the peculiarities of the SQL dialect in the environment they are programming in.

Notice how the language’s functionality boils down to selecting (select) some or all (*) columns from the table of interest (from), which reinforces how easy SQL is to use.

Group by

Suppose the table we want to work with is:

Source: Author

If we want to select the average age of the individuals who are philosophers, we have the following code options:

Option 1:
select
       sum(Idade)/count(*)
from Dados
where Ocupação = 'Filósofo'
group by Ocupação

Option 2:
select
       avg(Idade)
from Dados
where Ocupação = 'Filósofo'
group by Ocupação

Both pieces of code compute the same average. Notice that this time I had to use, besides group by, the where condition, because I limited my query to the records in which the individual’s occupation is philosopher: where Ocupação = 'Filósofo'.

The functions count() and sum() are also very important for algebraic manipulation. The first counts the non-null values of a given column, count(coluna), or the rows of the table as a whole, count(*). The function sum() adds up the values of the column of interest, in this case the column Idade: sum(Idade). The average age is then given by the formula sum(Idade)/count(*).

The second option uses the function avg(), from average, a more direct way of calculating the mean of the column of interest.

Note: The function that calculates the average in SQL may vary from tool to tool; it can be avg() or mean(). Once again, the programmer should pay attention to this detail.

Joins

Suppose we have two tables in the same style as the table from the previous topic. Let’s call them A and B.

Now imagine that these tables have records in common, that is, some information in table A is also in table B. Such information is known as the intersection, as already explained in the article on set theory.

Teoria dos Conjuntos (Set Theory)

But the usefulness of joins goes far beyond the intersection of elements from two tables. This feature serves to combine, discard, or even compare elements across tables.

Source: https://www.codeproject.com/Articles/33052/Visual-Representation-of-SQL-Joins

A more complete explanation of joins between relational tables can be found in this excellent article by C.L. Moffatt.

If Else Conditions (CASE WHEN)

Suppose I want to create a new variable that classifies the age range of the individuals in the table Dados. If I want to create the categories “jovem” (young), “adulto” (adult), and “maduro” (senior) according to age, I use CASE WHEN, as in the example below.

select
  case
    when Idade < 30 then 'jovem'
    when Idade between 30 and 40 then 'adulto'
    when Idade > 40 then 'maduro'
  end as faixa_etaria
from Dados

Notice that the CASE function has a block structure.

case when then end

In this way, the block structure of case when builds the conditional logic that lets me create the variable of interest.

Application in Data Science

The advantage (or necessity) of using SQL in data science can be summed up, as already shown, in its ease of use and in the better understanding it gives the analyst of the data sets at hand.

After all, any statistician or data scientist knows that every analysis, however complex, begins with extracting and exploring the data to be analyzed, which involves cleaning, structuring, or even joining database tables.

In the aforementioned R software, there is a package called “sqldf”, which makes it possible to write SQL to manipulate, join, and/or restructure data frames.

Subconsultas (subquery)

Um recurso muito útil dessa linguagem é o de subconsultas. Como o próprio nome sugere, ela consiste em se fazer uma consulta dentro de uma consulta, seja em um where ou case when.

Exemplo 1:
select 
       avg(Idade) as media_idade
from Dados a
where (select Idade from Dados where Id = a.Id)>30
group by 'Ocupação'

Example 2:
select case
       -- the inner query averages the ages of the outer row's own category
       when a.Idade < 
(select avg(b.Idade) from Dados b where b."Ocupação" = a."Ocupação") then 'Novo na categoria'
       else 'Dentro da média na categoria'
       end as compara_idade
from Dados a

Note that both examples use the same table to compare its own data, so there is a query inside the comparison logic.

Example 3 compares each individual's Idade with the average of all ages.

Example 3:
select 
   case 
      when Idade < (select avg(Idade) from Dados) then 'Idade Menor'
      else 'Idade igual ou maior'
   end as compara_idade_media
from Dados
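The per-category comparison from the second example can be tried out with a small sketch in Python, using the standard sqlite3 module. The rows below are hypothetical, and the occupation 'Estatístico' is an assumption added only so that there are two categories to compare.

```python
import sqlite3

# Hypothetical rows (Id, Idade, Ocupação); not the article's real data.
rows = [
    (1, 30, "Filósofo"),
    (2, 50, "Filósofo"),
    (3, 20, "Estatístico"),
    (4, 40, "Estatístico"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table Dados (Id integer primary key, Idade integer, "Ocupação" text)'
)
conn.executemany("insert into Dados values (?, ?, ?)", rows)

# The inner query is re-evaluated for each outer row (a correlated subquery):
# it averages the ages of everyone who shares that row's occupation.
query = """
select a.Id,
  case
    when a.Idade < (select avg(b.Idade) from Dados b
                    where b."Ocupação" = a."Ocupação")
      then 'Novo na categoria'
    else 'Dentro da média na categoria'
  end as compara_idade
from Dados a
order by a.Id
"""
comparacao = conn.execute(query).fetchall()
conn.close()

for row in comparacao:
    print(row)
```

With these rows the category averages are 40 and 30, so Ids 1 and 3 fall below their category average and Ids 2 and 4 do not.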

Self Join

Given the relational nature of the language, it is limited when it comes to exploring/comparing data within the same column. A self join, in which a table is joined with itself, is one way around that.

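-- Dados2: a copy of Dados ordered by Id
create table Dados2 as
select * from Dados order by Id;

select 
       a.Id as id_a,
       b.Id as id_b,
       case
         when  a.Idade < b.Idade then 'mais novo'
         else 'mais velho ou mesma idade'
       end as compara_idade
-- each individual is paired with every individual that has a larger Id
from Dados2 a left join Dados2 b
on a.Id < b.Id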

The example above compares ages between individuals. Note that, to do so, the Dados table is left joined with itself.

Note: In this case, a table Dados2 was first created by ordering the data by the Id column. This was done so that the self join could be made on that variable.

Another example would be to compare the age of the philosophers with that of the other members of the table.

select 
    case
      when a.Idade = b.Idade then 'Mesma Idade'
      when a.Idade < b.Idade then 'Idade Menor'
      when a.Idade > b.Idade  then 'Idade Maior'
    end as compara_idade_filosofos
from Dados a left join Dados b on a.Id <> b.Id
where b."Ocupação" = 'Filósofo'
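As a rough check of how this self join behaves, the sketch below runs it from Python with the standard sqlite3 module. The rows are hypothetical, and the occupations other than 'Filósofo' are assumptions made up for the example.

```python
import sqlite3

# Hypothetical rows (Id, Idade, Ocupação); only 'Filósofo' comes from the article.
rows = [
    (1, 30, "Filósofo"),
    (2, 25, "Estatístico"),
    (3, 30, "Engenheiro"),
    (4, 45, "Estatístico"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    'create table Dados (Id integer primary key, Idade integer, "Ocupação" text)'
)
conn.executemany("insert into Dados values (?, ?, ?)", rows)

# Alias a is the individual being compared; alias b is restricted to philosophers.
query = """
select a.Id, b.Id,
  case
    when a.Idade = b.Idade then 'Mesma Idade'
    when a.Idade < b.Idade then 'Idade Menor'
    when a.Idade > b.Idade then 'Idade Maior'
  end as compara_idade_filosofos
from Dados a left join Dados b on a.Id <> b.Id
where b."Ocupação" = 'Filósofo'
order by a.Id
"""
pares = conn.execute(query).fetchall()
conn.close()

for row in pares:
    print(row)
```

Since the where clause filters on a column of b, rows without a philosopher on the b side are discarded, so here the left join returns the same rows an inner join would.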

A platform for learning/practicing SQL

As everyone knows, the best way to find the gaps in our knowledge is to test ourselves. With that in mind, I would like to recommend a platform that I found particularly interesting for testing my SQL knowledge:

SQLZOO

This platform is mentioned in the article SQLZoo: The Best Way to Practice SQL, which presents, besides the one above, several other interesting platforms for testing your skills in this language.

More tips for Data Scientists

I strongly recommend reading the article below, in addition to the one mentioned at the beginning of this post, for more tips and information on how this language can be applied in this field.

Extra 4 SQL Tricks Every Data Scientist Should Know

Conclusion

We saw how easy the SQL language is to learn and use. We also saw how useful it is for exploring relational data and joining tables.

It is important to stress that the code examples explored so far are extremely simple and purely didactic. A self join or a subquery can become extremely complex, running to dozens of lines of code depending on the application (they can even be nested, that is, a subquery inside a self join and vice versa).

That said, one can conclude that knowing SQL is as important for a Data Scientist as knowing English is for any 21st-century professional. Nowadays there are several platforms such as Coursera and Udemy, among others, with affordable courses, besides of course the possibility of learning in practice with good old stackoverflow.

Was this post helpful? Do you still have any questions? Leave your contribution in the comments in case I forgot to cover some concept, so that I can add it in Part 2.

Thank you!

Set Theory

João Renato — Thu, 11 Jun 2020 17:37:07 GMT

Set Theory: a foundation for the SQL language

The purpose of this article is to explain set theory in a didactic and concise way and thus lay the groundwork for future articles on the SQL language, Probability and Statistics.

Source

Set Theory is a way of explaining how different elements are distributed among groups, whether they share groups or belong to no group at all. In this way, it tries to quantify the possible ways of grouping these elements. Its applicability is practically endless. Set theory can be applied to social bubbles, statistical profile surveys, the cataloguing of books or products in a store, or even propositional logic, and so on. It is also, of course, the basis for the already mentioned Structured Query Language (SQL) and for the concept of “sets” in Python.

Venn Diagram
Elements belonging to sets
Intersection
Union
When one set contains another (operations between sets)
Recap

1. Venn Diagram

The so-called Venn Diagram is a mathematical illustration of the relationship between two or more sets that may or may not have elements in common.

2. Elements belonging to sets

We say that the elements a1 and a2 belong to set A, while b1 and b2 belong to set B. The mathematical notation for this relationship can be written as follows:

a1, a2 ∈ A and b1, b2 ∈ B

3. Intersection

Note that a2 and b2 belong to both sets A and B: these two elements lie in an area that is common to the two sets. Meanwhile, a1 and b1 each belong to one specific set. As a brief mental exercise, we could say that the area where a2 and b2 lie is a third set. We call this common area the intersection, so that:

a2, b2 ∈ A ∩ B

that is, a2 and b2 belong to the intersection of A and B.

4. Union

The concept of union between sets comes from the idea of all their elements belonging together. Observe the following image:

The union of sets A and B results in the set of elements {a1,a2,b1,b2}. That is, just as the intersection of sets results in a third set, the union of sets also results in a new set. The mathematical notation for the union is:

A ∪ B = {a1, a2, b1, b2}

We can then define the following sets C and D:

C = {a2, b2}
D = {a1, a2, b1, b2}

Note that set C is formed by the elements of the intersection of A and B, while set D is formed by the elements of the union of A and B. Formally, we say that C equals A intersection B and D equals A union B.
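Since Python's built-in sets follow the same ideas, the definitions of C and D can be sketched in a few lines. The element names are just strings standing in for the elements of the diagram, with a2 and b2 placed in both sets as described above.

```python
# A and B as described in the text: a2 and b2 are common to both sets.
A = {"a1", "a2", "b2"}
B = {"b1", "a2", "b2"}

C = A & B  # intersection: elements that belong to A and to B
D = A | B  # union: elements that belong to A or to B (or both)

print(sorted(C))
print(sorted(D))
```

The same results can be obtained with the methods A.intersection(B) and A.union(B).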

5. When one set contains another (operations between sets)

Look at the following figure:

The diagram above shows sets A and B, with their respective elements a1 and b1.

We say that set B is contained in A (or that A contains B). We can also say that B is a subset of A.

Note that every element that belongs to B belongs to A, but not every element of set A belongs to set B.

We can draw the following properties from this example:

Based on the concept of operations between sets, and going back to the diagram in item 4, we can conclude that:

C ⊂ A, C ⊂ B, A ⊂ D, B ⊂ D

That is, set C is contained in both A and B at the same time, while set D contains both A and B. As a consequence, the elements of C belong to sets A and B, whereas both the elements of A and the elements of B belong to set D.
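These containment relations can be checked with Python sets, where the <= operator tests whether one set is a subset of another. This is only an illustrative sketch: the names A5 and B5 are made up here to keep the section 5 figure apart from the sets of item 4.

```python
# Section 5 figure: B sits entirely inside A.
A5 = {"a1", "b1"}
B5 = {"b1"}
b_in_a = B5 <= A5  # every element of B5 is also in A5
a_in_b = A5 <= B5  # a1 is not in B5, so this does not hold

# Item 4 diagram: C is the intersection and D the union of A and B.
A = {"a1", "a2", "b2"}
B = {"b1", "a2", "b2"}
C = A & B
D = A | B
checks = [C <= A, C <= B, A <= D, B <= D]

print(b_in_a, a_in_b, checks)
```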

6. Recap

Set theory is an excellent starting point for anyone who wants to understand a bit more about probability and SQL programming.

We saw that the intersection of two or more sets is tied to the notion of their elements belonging to all of them. We also saw that the union of two sets is the set that covers all of their elements.

In this post, only two sets were used, for didactic purposes. However, as can be seen in the diagram that opens this article, the concepts covered here can be extended to 3, 4, …, n different sets.

One caveat is also in order here:

The union of two or more sets is not the same as the sum of their elements, even though that intuitively makes sense. Just note that if we add up the elements of A and B, the elements belonging to their intersection are counted twice.

This concept must be kept in mind, especially when we talk about probability theory.
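The caveat can be verified by counting elements. The sketch below reuses the sets from the union example and shows that the size of the union equals the sum of the sizes minus the size of the intersection, which is the correction for the elements counted twice.

```python
A = {"a1", "a2", "b2"}
B = {"b1", "a2", "b2"}

naive_sum = len(A) + len(B)  # counts a2 and b2 twice
union_size = len(A | B)      # each element counted once
overlap = len(A & B)         # elements counted twice in the naive sum

print(naive_sum, union_size, overlap)
print(union_size == naive_sum - overlap)
```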

Was this post helpful? Do you still have any questions? Leave your contribution in the comments in case I forgot to cover some concept, so that I can add it in Part 2.

Thank you!

Learning Julia for Data Science

João Renato — Sun, 05 Apr 2020 19:10:10 GMT

Learning Julia for Data Science: Part 1

In 2012, a fellow undergraduate mentioned a new programming language that might come to replace the R language.

Right away, a reader already familiar with programming for data analysis might think my esteemed colleague was talking about Python. That was not the case, though. He was talking about the newly created Julia.

In any case, it was a comment I found interesting at first, but I was not motivated to dig deeper into the subject. I was very young, had not yet graduated in Statistics and did not have much programming skill.

Time went by and, out of necessity, I had to learn SAS, SQL, then R, and finally I became interested in Python. The last of these was the one I wanted to dive into in terms of knowledge and applications.

For a long time, I saw Python as the language that would come to replace R. Its code was friendly and intuitive, and every article comparing the two rated Python as faster and more efficient. That made me determined to use only Python as my computational tool.

Then comes 2020

I have been following Medium articles for some time now, mainly in programming and data science, my greatest interests.

Then one fine day I came across an article titled “Why Python is not the programming language of the future”.

In it, the author explains Python's disadvantages compared with somewhat newer languages and demystifies the hype around it.

But one passage of the article in particular caught my attention, in which the author lists languages that could replace Python in the future:

Rust
Go
Julia

When I read the name “Julia”, I immediately remembered my college classmate talking about this new language.

I downloaded the program to my computer, started researching it and soon realized there was very little material in Portuguese on the subject. Here on Medium, some very interesting material comes from Pizza de Dados, such as the article “Precisamos falar sobre a linguagem Julia”. Digging deeper, I noticed that there is a large international community contributing to Julia's progress, that it is widely applicable in scientific computing and that it has many visual resources.

julialang.org

My interest and curiosity grew so much that today I am determined to explore this niche, apparently little explored here in Brazil, of Data Science using Julia.

My goal with this article, which I hope will be the first of many, is therefore to begin my studies on the applicability of Julia as a Data Science tool in a shared way. That way I consolidate my knowledge and gain possible feedback and collaborators, so that we can exchange ideas and even create new ones.

I'm counting on you.