Summary of Kaggle's Official Blog, October

Rick Liu

4 min read · Dec 11, 2017

Well, I forgot to read Kaggle's October blog posts in early November. There were four in total:

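Sentiment analysis tutorial in R
1st place interview from the Amazon rainforest competition
1st place interview from the September Dataset Publishing Awards
Release of Kaggle's 2017 State of Data Science & Machine Learning report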

Sentiment analysis tutorial in R

Data Science 101: Sentiment Analysis in R Tutorial

Welcome back to Data Science 101! Do you have text data? Do you want to figure out whether the opinions expressed in it…

blog.kaggle.com

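This tutorial, published by a Kaggle engineer, runs sentiment analysis on a corpus of US State of the Union addresses from 1989 to 2017. Besides the official blog post, there is a Kernel on Kaggle that you can run directly or fork for practice. It mainly uses an R package called tidytext. I don't write R, so I wrote my own Python version to compare side by side, and found the R syntax a bit magical. The post ends with three exercises, a list of other language-related datasets on Kaggle, and some external data resources.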

My Python version of the kernel.
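For readers who want the general idea without opening either kernel, here is a minimal sketch of lexicon-based sentiment scoring in plain Python. It is not the code from my kernel or from the tutorial. The tiny word lists and the function names `tokenize` and `sentiment_score` are made up for illustration; a real analysis would use a full sentiment lexicon instead.

```python
import re
from collections import Counter

# Toy lexicon for illustration only (hypothetical); a real analysis would
# load a published sentiment lexicon rather than a hand-written list.
POSITIVE = {"good", "great", "strong", "hope", "peace", "prosperity"}
NEGATIVE = {"bad", "weak", "fear", "war", "crisis", "threat"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Return (positive count, negative count, positive minus negative)."""
    counts = Counter(tokenize(text))
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return pos, neg, pos - neg


if __name__ == "__main__":
    speech = "We face a crisis, but our union is strong and our hope is great."
    print(sentiment_score(speech))
```

Scoring each speech this way and then comparing the net score across years or speakers is the basic shape of the analysis.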

1st place interview from the Amazon rainforest competition

Planet: Understanding the Amazon from Space, 1st Place Winner's Interview

In our recent Planet: Understanding the Amazon from Space competition, Planet challenged the Kaggle community to label…

blog.kaggle.com

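I won't introduce the competition itself here. First place went to a Kaggler named bestfitting. bestfitting joined Kaggle less than two years ago and has entered ten competitions; the worst result was 61st place, a top 4% silver medal, and over the past year the worst result was 7th. Quite scary.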

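For the solution, he mainly ensembled 11 CNNs from four families (ResNets, DenseNets, Inception, SimpleNet); the post includes an architecture diagram that is clear at a glance. He did all the basic data preprocessing, plus a haze removal technique that makes the images clearer. It helped on some labels and hurt on others, but because the ensemble picks the models that help, the overall result still improved.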
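To make the "ensemble picks the helpful models" idea concrete, here is a rough sketch of per-label weighted averaging of model outputs. This is not bestfitting's actual pipeline. The function name `average_ensemble` and the weight layout are assumptions for illustration; a weight of 0 simply drops a model for a label where it hurts.

```python
def average_ensemble(model_probs, weights=None):
    """Blend per-label probabilities from several models for one image.

    model_probs: list of lists, one inner list of label probabilities per model.
    weights: optional list of lists with the same shape; a weight of 0
             excludes that model for that label.
    """
    n_models = len(model_probs)
    n_labels = len(model_probs[0])
    if weights is None:
        # Default: every model counts equally for every label.
        weights = [[1.0] * n_labels for _ in range(n_models)]
    blended = []
    for j in range(n_labels):
        total_w = sum(weights[i][j] for i in range(n_models))
        if total_w == 0:
            raise ValueError(f"no model has a non-zero weight for label {j}")
        total = sum(weights[i][j] * model_probs[i][j] for i in range(n_models))
        blended.append(total / total_w)
    return blended


if __name__ == "__main__":
    probs = [[0.75, 0.25], [0.25, 0.5]]   # two models, two labels
    keep = [[1.0, 0.0], [1.0, 1.0]]       # drop model 0 on label 1
    print(average_ensemble(probs))
    print(average_ensemble(probs, keep))
```

In practice the weights would be chosen on validation data, label by label.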

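On evaluation: the competition was scored with the F2 score, and a lower log loss does not guarantee a higher F2 score, so he wrote this part himself. bestfitting also noticed dependencies between some labels, so he added ridge regression after the CNN outputs to capture those relationships and improve overall performance. The hardware was a Titan X.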
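As a reminder of what the F2 score rewards, here is a small sketch that computes F-beta from counts and tunes a decision threshold against it. It only illustrates the metric, not the evaluation code bestfitting wrote. The helper names `fbeta` and `best_threshold` and the candidate thresholds are assumptions.

```python
def fbeta(y_true, y_pred, beta=2.0):
    """F-beta for one binary label vector; beta=2 weights recall over precision."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # With no positives in either vector the score is defined as 0 here.
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(y_true, probs, candidates):
    """Return (best F2, threshold) over the candidate thresholds."""
    scored = []
    for th in candidates:
        preds = [1 if p >= th else 0 for p in probs]
        scored.append((fbeta(y_true, preds), th))
    # max on tuples: highest score wins, ties go to the higher threshold.
    return max(scored)


if __name__ == "__main__":
    truth = [1, 1, 0, 0]
    probs = [0.9, 0.4, 0.6, 0.1]
    print(best_threshold(truth, probs, [0.3, 0.5, 0.7]))
```

Because F2 penalizes a missed label more than a false alarm, the best threshold tends to sit lower than the 0.5 that log loss would suggest, which is one reason optimizing log loss alone is not enough.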

Finally, he recommends taking CS229 and CS231n, reading papers every day, and implementing them.

1st place interview from the September Dataset Publishing Awards

September Kaggle Dataset Publishing Awards Winners' Interview

This interview features the stories and backgrounds of our $10,000 Datasets Publishing Award's September winners…

blog.kaggle.com

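The first-place dataset is text data on the religious ideology of ISIS; it is small, but a rather topical dataset. Second place is about personality analysis with the 16 MBTI types: the author scraped posts by 8,600 people from a related forum and labeled them with personality types. Third place is somewhat larger: UK accident data and traffic flow from 2000 to 2016. It is also a fairly popular topic, with more people engaging with it.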

Release of Kaggle's 2017 State of Data Science & Machine Learning report

Introducing Kaggle's State of Data Science & Machine Learning Report, 2017

In 2017 we conducted our first ever extra-large, industry-wide survey to captured the state of data science and machine…

blog.kaggle.com

This is the survey Kaggle ran on Kagglers this year. Besides the final report, the data and the Kernels that process it have also been published, so just go take a look!

#Kaggle
#MachineLearning
#DataScience
#planet_understanding_the_amazon_from_space
