Word Cloud + Jieba = Generating fancy word cloud and word frequencies report

Zheng Yang
FullStack-Microservices-Developers
Feb 26, 2018

Original source: https://medium.com/@riveryang/word-cloud-jieba-generating-fancy-word-cloud-and-word-frequencies-report-858713109e5d

Are you interested in getting involved in data mining? Have you tried to analyze something based on big data? Every journey starts with a first step, and this article walks you through an important, easy-to-understand, and motivating one: text segmentation, along with two things it enables, word frequency statistics and fancy word clouds.

You can find the complete source code on GitHub at https://github.com/zhengyangca/wordcloud_jieba_statistics

If you would like an explanation of each part of the code, read on:

1. main.py

The main script, in charge of calling each function in order.

loading text file:

from os import path
d = path.dirname(__file__)
# the file name means "2017 Central Government Work Report (full text)"
text = open(path.join(d, 'doc//2017年中央政府工作报告(全文).txt')).read()

segment words via jieba

text=chnSegment.word_segment(text)

generate and plot Word Cloud

plotWordcloud.generate_wordcloud(text)

2. Jieba

A Chinese text segmentation tool that is useful for both Chinese and English text.

Here we call Jieba's API to segment the text and count the word frequencies:

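jieba_word = jieba.cut(text, cut_all=False)  # cut_all is False by default
data = list(jieba_word)  # jieba.cut returns a generator; collect its tokens
dataDict = Counter(data)  # Counter comes from the collections module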
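The frequency report shown at the end of this article is dominated by punctuation tokens such as `,` and `。`, because every token that comes out of segmentation gets counted. Here is a minimal sketch of one way to skip those before counting, using only the standard library. The `tokens` list is hand-written stand-in data for what `jieba.cut` might yield, and `is_word` is a hypothetical helper, not part of jieba.

```python
import unicodedata
from collections import Counter


def is_word(token):
    """Return True if the token contains at least one letter or digit.

    Hypothetical helper: tokens made only of punctuation, symbols,
    or whitespace are rejected.
    """
    return any(unicodedata.category(ch)[0] in ("L", "N") for ch in token)


# Stand-in for list(jieba.cut(text)); hand-written sample, not real output.
tokens = ["政府", "工作", "报告", ",", "政府", "工作", "。", "\n", "2017", "年", "、", "政府"]

# Count only the tokens that look like real words or numbers.
word_counts = Counter(t for t in tokens if is_word(t))

for word, count in word_counts.most_common(3):
    print(word, count)
```

Whether numbers such as `2017` should count as words is a judgment call; dropping the `"N"` category from `is_word` would exclude them too.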

write to file

# output to a txt file (词频统计 means "word frequency statistics")
with open('doc//词频统计.txt', 'w') as fw:
    for k, v in dataDict.items():
        fw.write("%s,%d\n" % (k, v))
    # fw.write("%s" % dataDict)
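One caveat with the `"%s,%d\n"` format: when the token is itself a comma, the line comes out as `,,738`, which is ambiguous to parse later. A hedged alternative sketch uses the standard `csv` module, which quotes such fields, and writes the rows sorted by frequency. The `write_report` helper, the `word_freq.csv` file name, and the sample counts are assumptions made for illustration; they are not part of the original repository.

```python
import csv
import os
import tempfile
from collections import Counter


def write_report(counts, file_path):
    """Write (word, count) rows to a UTF-8 CSV file, most frequent first."""
    # newline="" lets the csv module control line endings itself.
    with open(file_path, "w", encoding="utf-8", newline="") as fw:
        writer = csv.writer(fw)
        for word, count in counts.most_common():
            writer.writerow([word, count])


# Illustrative counts only; real values would come from Counter(data).
sample_counts = Counter({",": 5, "政府": 3, "工作": 2})

# Hypothetical output location inside a temporary directory.
report_path = os.path.join(tempfile.mkdtemp(), "word_freq.csv")
write_report(sample_counts, report_path)

# Read the file back to confirm the comma token survives the round trip.
with open(report_path, encoding="utf-8", newline="") as fr:
    rows = list(csv.reader(fr))
print(rows)
```

Opening the file with an explicit `encoding="utf-8"` also avoids depending on the platform's default encoding, which matters for Chinese text.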

3. WordCloud

A Python tool that provides an API for generating fancy word clouds with customized color schemes and masks.

Calling the API is pretty easy (for more information, see the Word Cloud GitHub repository).

generate Wordcloud

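import numpy as np
from os import path
from PIL import Image
from wordcloud import WordCloud, STOPWORDS

d = path.dirname(__file__)
# load the mask image as an array (only used if mask= is enabled below)
alice_mask = np.array(Image.open(path.join(d, "Images//alice_mask.png")))
# a font that contains Chinese glyphs is needed to render Chinese words
font_path = path.join(d, "font//msyh.ttf")
stopwords = set(STOPWORDS)
wc = WordCloud(background_color="white",
               max_words=2000,
               # mask=alice_mask,
               stopwords=stopwords,
               font_path=font_path,
               )
wc.generate(text)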
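Note that `STOPWORDS` from the wordcloud package is an English word list, so common Chinese function words such as `的` or `和` can still show up in the cloud. Below is a minimal sketch of filtering them out, assuming you maintain your own Chinese stopword set. The `chinese_stopwords` set, the `remove_stopwords` helper, and the sample counts are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword set; a real list would be much longer.
chinese_stopwords = {"的", "和", "是", "在", "了"}


def remove_stopwords(counts, stopwords):
    """Return a new dict of word -> count with the stopwords left out."""
    return {word: n for word, n in counts.items() if word not in stopwords}


# Made-up counts for illustration, not taken from the real report.
sample_counts = Counter({"的": 10, "政府": 3, "工作": 2, "和": 4})

frequencies = remove_stopwords(sample_counts, chinese_stopwords)
print(frequencies)

# With the wordcloud package installed, the filtered dict could be used as:
# wc.generate_from_frequencies(frequencies)
```

An alternative that keeps the `wc.generate(text)` call unchanged would be to add the Chinese words to the `stopwords` set passed to `WordCloud`.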

Open up a new window and plot word cloud

# show window (plt is matplotlib.pyplot)
plt.imshow(wc, interpolation='bilinear')
# interpolation='bilinear' sets the interpolation method to bilinear interpolation
plt.axis("off")  # turn off the image axes
plt.show()

Demo Preview

Output word cloud:

Word Cloud with mask

Output word frequencies report:

,107
2017,3
年,6
中央政府,1
工作,27
报告,2
李克强,1
:,2
各位,8
代表,9
,,738
现在,2
我,2
国务院,8
向,16
大会,1
政府,35
请予,1
审议,2
并,4
请,1
全国政协,1
委员,1
提出,1
意见,4
。,543
一,8
、,350
2016,1
回顾,1
过去,4
