#02 Data Visualization: Top 10 practical and impactful techniques you need to know

Understanding data through visualization

Akira Takezawa
Coldstart.ml
3 min read · Feb 5, 2019

This article is for beginners who want to know:

  • the must-know practical visualization techniques
  • which visualization library suits them
  • next-level visualization skills

— — — — —

Why you need to read this …

I’m not a professional data scientist, but a professional translator: someone who can simplify complicated ideas and summarize countless techniques. What sets this article apart from others is the following:

I focus not on artistic data visualization techniques, but on practical data visualization methods.

Let’s get started!

— — — — —

Menu

  1. Clean Data: Missing Value Detection
  2. Manual Feature Selection: Correlation Analysis
  3. Statistical Diagnosis: Normalization (Scaler)

— — — — —

1. Clean Data: Missing Value Detection

  • Null Value Handling: Heatmap
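import seaborn as sns
titanic = sns.load_dataset("titanic")
# Boolean mask: True where a value is missing
nan = titanic.isnull()
# Missing cells appear in the darker color
sns.heatmap(nan, cmap="Greens")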
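
The heatmap shows where values are missing; a per-column count turns the same mask into numbers you can act on. The following is a minimal sketch, assuming a small hypothetical DataFrame (`toy`) in place of the Titanic data so that it runs without a download. The helper name `missing_summary` is also an assumption for illustration, not a seaborn or pandas function.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and share of missing values per column, worst first."""
    mask = df.isnull()                 # same boolean mask the heatmap draws
    summary = pd.DataFrame({
        "n_missing": mask.sum(),       # number of missing cells per column
        "share": mask.mean(),          # fraction of missing cells per column
    })
    return summary.sort_values("n_missing", ascending=False)

# Hypothetical toy data standing in for the Titanic dataset
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "deck": [np.nan, "C", np.nan, np.nan],
})
summary = missing_summary(toy)
print(summary)
```

Columns with a high share of missing values are candidates for dropping, while columns with only a few gaps are usually easier to impute.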

2. Manual Feature Selection: Correlation Analysis

  • Correlation: Heatmap
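# titanic = sns.load_dataset("titanic")
# Correlate numeric columns only; pandas 2.0+ raises an error on text columns otherwise
cor = titanic.corr(numeric_only=True)
sns.heatmap(cor, annot=True)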
  • Correlation: Scatter plot
tips = sns.load_dataset("tips")
sns.relplot(x="total_bill", y="tip", hue="sex", data=tips)
  • Distribution: Joint plot
# Using the Boston House Price dataset; df is assumed to be a DataFrame already loaded with these columns
sns.jointplot(x="House Age", y="House Price", data=df, kind="reg")
  • Correlation: Pair plot with regression line
iris = sns.load_dataset("iris")
sns.pairplot(iris, kind="reg")
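
The correlation plots above support manual feature selection by eye; the same idea can be written as a ranking. This is a rough sketch under stated assumptions: the data is synthetic, the helper `rank_by_correlation` is hypothetical, and the 0.5 cutoff is an arbitrary illustration rather than a recommended value.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df, target, threshold=0.5):
    """Rank numeric features by absolute Pearson correlation with `target`."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    ranked = corr.abs().sort_values(ascending=False)
    return ranked[ranked >= threshold]   # keep only features above the cutoff

# Synthetic data: "area" drives "price", "noise" is unrelated
rng = np.random.default_rng(0)
n = 200
area = rng.normal(size=n)
toy = pd.DataFrame({
    "area": area,
    "noise": rng.normal(size=n),
    "price": 3.0 * area + rng.normal(scale=0.5, size=n),
})
selected = rank_by_correlation(toy, target="price")
print(selected)
```

Pearson correlation only captures linear relationships, so a scatter plot or pair plot is still worth checking before discarding a feature.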

3. Statistical Diagnosis: Normalization (Scaler)

  • Distribution: Histogram
import numpy as np
x = np.random.normal(size=100)
sns.histplot(x, stat="density", color="teal")  # replaces the deprecated distplot
sns.kdeplot(x, color="navy")
  • Distribution: Box plot
sns.catplot(x="day", y="tip", data=tips, kind="box")
  • Distribution: Violin plot
sns.catplot(x="day", y="tip", kind="violin", data=tips)
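
The section title mentions normalization with a scaler, while the plots above only diagnose the distribution. As a minimal sketch of the scaling step itself, assuming plain NumPy and two hypothetical helpers (`standardize` and `min_max_scale`), the two most common scalers can be written as follows. Both assume the input is not constant, since a constant array would cause a division by zero.

```python
import numpy as np

def standardize(x):
    """Z-score scaling: zero mean, unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def min_max_scale(x):
    """Min-max scaling: map values linearly onto [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Synthetic sample with an arbitrary location and spread
rng = np.random.default_rng(0)
x = rng.normal(loc=50.0, scale=10.0, size=100)
z = standardize(x)
m = min_max_scale(x)
print(z.mean(), z.std(), m.min(), m.max())
```

Plotting a histogram of `z` or `m` shows the same shape as the original data: scaling changes the axis, not the distribution. Min-max scaling is sensitive to outliers, which is one reason to look at the box plot first.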

— — — — —

Akira Takezawa
Coldstart.ml

Data Scientist, Rakuten / a discipline of statistical causal inference and time-series modeling / using Python and Stan, R / MLOps is my current concern