Running MapReduce (Word Count) on Hadoop (Ubuntu Server)

Published in Equinox · 4 min read · Apr 6, 2019

MapReduce is a programming model that divides the work into two parts, a Map part and a Reduce part, where both input and output take the form of <key,value> pairs. In this article we will practice writing a MapReduce job on Ubuntu.

What you will learn from this article:

How to set up Docker for a Hadoop environment
How to run MapReduce on Hadoop

https://www.edupristine.com/blog/hadoop-mapreduce-framework

Before we start, if you do not yet understand how MapReduce works, you can read about it at the link below (I don't know who wrote it, but it is explained in a way that is easy to follow).

MapReduce คืออะไร? (What is MapReduce?)

At first I was going to talk about Apache Spark, but before talking about Spark…

medium.com

All right, once you have the basics, let's go on to understand what we are going to build.

Word Count Concepts

https://www.oreilly.com/library/view/distributed-computing-in/9781787126992/5fef6ce5-20d7-4d7c-93eb-7e669d48c2b4.xhtml

From the figure, we can see that word counting works roughly like this:

Input: the text file that holds the words, which is loaded into the mappers.

Splitting: the data in the file is split into chunks, which are processed on different nodes within the cluster.

Mapping: each chunk is mapped to <key,value> pairs, which for our problem are <word,count> (typically each word with a count of 1). We have to implement this part ourselves to define how the data should be mapped.

Shuffling: part of the reduce process. Before reducing, the pairs are shuffled so that all pairs with the same key are grouped together and reach the same reducer.

Reducing: collects the output that each node produced in the mapping step and adds up the counts for each key. We have to implement this part ourselves as well, to define how the data should be reduced.
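To make the four phases concrete, here is a minimal in-memory sketch in plain Python. It is not Hadoop code: the function names and the sample sentence are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict

def split_input(text, n_splits=2):
    """Splitting: divide the input lines into roughly equal chunks."""
    lines = text.splitlines()
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def map_chunk(lines):
    """Mapping: emit a (word, 1) pair for every word in the chunk."""
    return [(word, 1) for line in lines for word in line.split()]

def shuffle(mapped_chunks):
    """Shuffling: group all values that share the same key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_groups(groups):
    """Reducing: sum the values collected for each key."""
    return dict((key, sum(values)) for key, values in groups.items())

# Hypothetical sample input, one "split" per line.
text = "Deer Bear River\nCar Car River\nDeer Car Bear"
chunks = split_input(text, n_splits=3)
mapped = [map_chunk(chunk) for chunk in chunks]
result = reduce_groups(shuffle(mapped))
print(sorted(result.items()))
# [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```

On a real cluster the framework handles splitting and shuffling; only the map and reduce functions are ours to write.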

Let's get started

This example runs on Ubuntu (inside a VM). You can download the .iso from the link below.

Download Ubuntu Server | Download | Ubuntu

Ubuntu is an open source software operating system that runs from the desktop, to the cloud, to all your internet…

www.ubuntu.com

Step 1 : Install docker & Pull images for Hadoop

install docker: #apt-get install docker.io

docker pull images: #docker pull cloudera/quickstart:latest

docker images: #docker images

We can now see that the image is available on our server.

#docker images

Run the Docker image with the following command:

# docker run -v /root:/mnt --hostname=quickstart.cloudera --privileged=true -t -i -p 9092:9092 -p 2181:2181 -p 11122:11122 cloudera/quickstart /usr/bin/docker-quickstart

After that we land at the [root@quickstart]#_ prompt. Run the following command:

change directory: #cd mnt

Step 2: Implement Mapper

The mapper splits the words in the incoming input on the tab character (\t).
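As a rough idea of what such a mapper can look like, here is a minimal sketch. It is not the author's mapper.py (that file is in the repository linked in this article); the function names are made up, and it only assumes the Hadoop Streaming convention that each output record is a line of the form key, tab, value. It avoids newer syntax so that it should also run on older Python interpreters.

```python
import sys

def map_line(line, sep="\t"):
    """Return a list of (word, 1) pairs for the tokens in one input line."""
    return [(word, 1) for word in line.strip().split(sep) if word]

def run_mapper(lines, out=sys.stdout):
    """Write one 'word<TAB>1' record per token."""
    for line in lines:
        for word, count in map_line(line):
            out.write("%s\t%d\n" % (word, count))

# To use this as a streaming script, the last line of the file
# would be: run_mapper(sys.stdin)
```

Empty tokens (for example from two tabs in a row) are skipped so that they do not show up as an empty key in the result.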

You can download the mapper.py file directly with wget:

#wget https://raw.githubusercontent.com/Kzis/map_reduce/master/mapper.py

After the download finishes, run #ls and you will see the file mapper.py.

Step 3: Implement Reducer

The reducer reduces the incoming data by key: if the key it reads has already been seen, it adds to that key's count, and at the end it writes the results out.
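The logic described above can be sketched as follows. Again, this is an illustration and not the author's reducer.py; the function names are hypothetical, and the input is assumed to be lines of the form key, tab, count as produced by a streaming mapper.

```python
import sys

def count_records(lines, sep="\t"):
    """Accumulate counts per key from 'key<TAB>count' records."""
    counts = {}
    for line in lines:
        key, found, value = line.rstrip("\n").rpartition(sep)
        if not found:
            continue  # no separator: skip the malformed record
        try:
            n = int(value)
        except ValueError:
            continue  # count is not a number: skip the record
        if key in counts:
            counts[key] += n  # key seen before: add to its count
        else:
            counts[key] = n   # first time we see this key
    return counts

def run_reducer(lines, out=sys.stdout):
    """Write the final 'key<TAB>total' records in key order."""
    counts = count_records(lines)
    for key in sorted(counts):
        out.write("%s\t%d\n" % (key, counts[key]))

# To use this as a streaming script, the last line of the file
# would be: run_reducer(sys.stdin)
```

Keeping all counts in a dictionary is simple, but it holds every distinct key in memory until the input is exhausted.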

You can download the reducer.py file directly with wget:

#wget https://raw.githubusercontent.com/Kzis/map_reduce/master/reducer.py

After the download finishes, run #ls and you will see the file reducer.py.

Step 5: Create text file & Put file to hadoop

We now have the Mapper and the Reducer ready. Next we create a file that holds our words.

install nano: #yum install nano (nano is simply a text editor)

create file: #nano text.txt

Then type the words you want into the file, separated by tabs, since the mapper splits on the tab character.

After that, upload this file to Hadoop with the command:

#hadoop fs -put text.txt /user/cloudera/text.txt

Then check the file we just put there with the command:

#hadoop fs -ls /user/cloudera

Step 6: Run MapReduce

At this point we have:

text file: the input

mapper.py: the mapping process as we implemented it

reducer.py: the reduce process as we implemented it
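Before submitting to the cluster, it can help to check the two scripts locally. The sketch below emulates the usual `cat input | mapper | sort | reducer` pipeline from Python. The helper name run_local is made up for this example, it assumes Python 3.5 or newer on the machine where you run the check, and sorting in one process only approximates what Hadoop does between the map and reduce phases.

```python
import subprocess
import sys

def run_local(input_path, mapper_path, reducer_path, python=sys.executable):
    """Run mapper and reducer scripts locally and return the reducer's output."""
    # Map: feed the input file to the mapper script on stdin.
    with open(input_path, "rb") as f:
        mapped = subprocess.run([python, mapper_path], stdin=f,
                                stdout=subprocess.PIPE, check=True).stdout

    # Shuffle/sort: order the records by key (the text before the first tab).
    records = sorted(mapped.splitlines(),
                     key=lambda rec: rec.split(b"\t", 1)[0])
    shuffled = b"".join(rec + b"\n" for rec in records)

    # Reduce: feed the sorted records to the reducer script on stdin.
    reduced = subprocess.run([python, reducer_path], input=shuffled,
                             stdout=subprocess.PIPE, check=True).stdout
    return reduced.decode("utf-8")

# Example call (paths are the files created in the earlier steps):
# print(run_local("text.txt", "mapper.py", "reducer.py"))
```

If the scripts fail here, they will fail on the cluster too, and the error is much quicker to see locally.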

Let's run the MapReduce job:

# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/text.txt \
-output /user/cloudera/wx \
-mapper mapper.py \
-reducer reducer.py \
-file mapper.py -file reducer.py

Then we wait...

When the job succeeds, we can view the result at the path /user/cloudera/wx

with the command #hadoop fs -cat /user/cloudera/wx/part-00000

At this point we are just like the picture below: I know MapReduce.

https://memegenerator.net/instance/63276341/i-know-kung-fu-i-know-map-reduce

Challenge : Word Count 2 Gram

This part is an extra. Let's modify the Mapper and Reducer for more practice. Where the Mapper normally splits at every 1 tab, change it to split at every 2 tabs, while the Reducer reduces by key and counts as before. Roughly, it goes as follows.

mapper2.py

mapper 2 gap
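One possible reading of the 2-gram mapper is sketched here. This is an assumption, not the author's mapper2.py: it emits every pair of adjacent words (overlapping pairs) and joins the two words with a space, so that the tab stays free as the key/value separator. The function names are made up for this example.

```python
import sys

def map_bigrams(line, sep="\t"):
    """Return ('w1 w2', 1) pairs for each pair of adjacent words in a line."""
    words = [w for w in line.strip().split(sep) if w]
    return [("%s %s" % (a, b), 1) for a, b in zip(words, words[1:])]

def run_mapper(lines, out=sys.stdout):
    """Write one 'w1 w2<TAB>1' record per adjacent word pair."""
    for line in lines:
        for key, count in map_bigrams(line):
            out.write("%s\t%d\n" % (key, count))

# To use this as a streaming script, the last line of the file
# would be: run_mapper(sys.stdin)
```

A line with fewer than two words produces no records, and pairs never span two input lines.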

reducer2.py

reducer 2 gap
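Since the reducer still sums counts per key, the word-count logic carries over unchanged. As a variation, the sketch below relies on the fact that Hadoop Streaming hands the reducer its input sorted by key, so equal keys arrive next to each other and can be summed with itertools.groupby without keeping every key in memory. It is an illustration with made-up function names, not the author's reducer2.py.

```python
import sys
from itertools import groupby

def parse(lines, sep="\t"):
    """Yield (key, count) from 'key<TAB>count' records, skipping bad lines."""
    for line in lines:
        key, found, value = line.rstrip("\n").rpartition(sep)
        if found and value.strip().isdigit():
            yield key, int(value)

def reduce_sorted(lines):
    """Sum the counts of consecutive records that share a key.

    The input must already be sorted by key.
    """
    return [(key, sum(count for _, count in group))
            for key, group in groupby(parse(lines), key=lambda kv: kv[0])]

def run_reducer(lines, out=sys.stdout):
    """Write one 'key<TAB>total' record per key."""
    for key, total in reduce_sorted(lines):
        out.write("%s\t%d\n" % (key, total))

# To use this as a streaming script, the last line of the file
# would be: run_reducer(sys.stdin)
```

If the input were not sorted, the same key could appear in several groups, so this version should only be used behind a sort step.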

Both files can be downloaded with the commands below:

# wget https://raw.githubusercontent.com/Kzis/map_reduce/master/mapper2.py
# wget https://raw.githubusercontent.com/Kzis/map_reduce/master/reducer2.py

Then create a new text file (text2.txt) once more.

Then upload the file to Hadoop: #hadoop fs -put text2.txt /user/cloudera/text2.txt

Then run the MapReduce job again:

# hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
-input /user/cloudera/text2.txt \
-output /user/cloudera/wx2 \
-mapper mapper2.py \
-reducer reducer2.py \
-file mapper2.py -file reducer2.py

When the job succeeds, we can view the result at the path /user/cloudera/wx2

with the command #hadoop fs -cat /user/cloudera/wx2/part-00000

That's it for running MapReduce jobs with both 1 gap and 2 gaps. If you get stuck anywhere or have any suggestions, feel free to comment or message me directly :P

Source code

Kzis/map_reduce

Contribute to Kzis/map_reduce development by creating an account on GitHub.

github.com

Reference

Data Analytics with Hadoop

Ready to use statistical and machine-learning techniques across large data sets? This practical guide shows you why the…

shop.oreilly.com