Published 2020. 1. 5. 18:34

[Python] Word Tokenization for Natural Language Processing

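Before doing NLP, large volumes of text need some cleaning (cleansing) and normalization so that they are easier to analyze and experiment with. And before cleaning and normalization, the data has to be tokenized in a way that fits your purpose.

Today's post covers tokenization at the word level.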

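import nltk
# 'punkt' is the sentence tokenizer model that word_tokenize relies on
nltk.download('punkt')
# 'treebank' is a Penn Treebank corpus sample
nltk.download('treebank')

from nltk.tokenize import word_tokenize
from nltk.tokenize import WordPunctTokenizer
from nltk.tokenize import TreebankWordTokenizer

# Create one TreebankWordTokenizer instance and reuse it
tb_tokenizer = TreebankWordTokenizer()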

To try out word tokenization, import the libraries as shown above.

import nltk
nltk.download('punkt')
nltk.download('treebank')

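The nltk.download calls fetch data packages, not the tokenizers themselves. word_tokenize relies on the punkt sentence tokenizer model (newer NLTK releases may ask for 'punkt_tab' instead), while WordPunctTokenizer and TreebankWordTokenizer are rule-based and work without a download; 'treebank' is a corpus sample. If you have already downloaded these packages, you can leave these lines out.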
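For a sense of what a rule-based tokenizer does internally, here is a minimal sketch that imitates WordPunctTokenizer using only Python's re module. NLTK defines that tokenizer with the regular expression \w+|[^\w\s]+; the helper name wordpunct_like is hypothetical and exists only for this illustration, and the sketch is not a replacement for the NLTK class.

```python
import re

# Pattern behind NLTK's WordPunctTokenizer: a run of word characters,
# or a run of characters that are neither word characters nor whitespace.
WORDPUNCT_PATTERN = re.compile(r"\w+|[^\w\s]+")


def wordpunct_like(text):
    """Hypothetical helper: split text the way WordPunctTokenizer would."""
    return WORDPUNCT_PATTERN.findall(text)


sample = "South Korea population is 48,750,000"
tokens = wordpunct_like(sample)
print(tokens)
# ['South', 'Korea', 'population', 'is', '48', ',', '750', ',', '000']
```

Because every punctuation run becomes its own token, numbers with thousands separators and contractions are broken apart, which is worth keeping in mind when picking a tokenizer.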

text1 = "Love looks not with the eyes, but with the mind. And therefore is wing'd Cupid painted blind."
text2 = "South Korea population is 48,750,000"

# 1) word_tokenize
word_tok = word_tokenize(text1)
word_tok2 = word_tokenize(text2)

# 2) WordPunctTokenizer
wordpunct_tok = WordPunctTokenizer().tokenize(text1)
wordpunct_tok2 = WordPunctTokenizer().tokenize(text2)

# 3) TreebankWordTokenizer
tb_tok = tb_tokenizer.tokenize(text1)
tb_tok2 = tb_tokenizer.tokenize(text2)

print("Results with word_tokenize:")
print(word_tok)
print(word_tok2)
print("Results with WordPunctTokenizer:")
print(wordpunct_tok)
print(wordpunct_tok2)
print("Results with TreebankWordTokenizer:")
print(tb_tok)
print(tb_tok2)

Two arbitrary sentences were each tokenized with the three methods.

The results are shown below. WordPunctTokenizer splits on every punctuation character, so wing'd becomes 'wing', "'", 'd' and 48,750,000 is broken into pieces. TreebankWordTokenizer expects input that has already been split into sentences and only separates a period at the very end of the string, which is why 'mind.' stays as a single token. word_tokenize splits the text into sentences first, so 'mind' and '.' come out separately. Compare the three and pick whichever fits your needs.

Results with word_tokenize:
['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.', 'And', 'therefore', 'is', 'wing', "'d", 'Cupid', 'painted', 'blind', '.']
['South', 'Korea', 'population', 'is', '48,750,000']
Results with WordPunctTokenizer:
['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind', '.', 'And', 'therefore', 'is', 'wing', "'", 'd', 'Cupid', 'painted', 'blind', '.']
['South', 'Korea', 'population', 'is', '48', ',', '750', ',', '000']
Results with TreebankWordTokenizer:
['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but', 'with', 'the', 'mind.', 'And', 'therefore', 'is', 'wing', "'d", 'Cupid', 'painted', 'blind', '.']
['South', 'Korea', 'population', 'is', '48,750,000']
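Spotting the differences by eye in long token lists is error-prone, so here is a small sketch that compares them programmatically. The token lists are copied from the first-sentence results printed above so the block runs without NLTK; token_diff is a hypothetical helper name, and a multiset comparison with collections.Counter is just one possible way to do this.

```python
from collections import Counter

# Token lists copied from the first-sentence results printed above.
word_tok = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but',
            'with', 'the', 'mind', '.', 'And', 'therefore', 'is', 'wing',
            "'d", 'Cupid', 'painted', 'blind', '.']
wordpunct_tok = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but',
                 'with', 'the', 'mind', '.', 'And', 'therefore', 'is',
                 'wing', "'", 'd', 'Cupid', 'painted', 'blind', '.']
tb_tok = ['Love', 'looks', 'not', 'with', 'the', 'eyes', ',', 'but',
          'with', 'the', 'mind.', 'And', 'therefore', 'is', 'wing',
          "'d", 'Cupid', 'painted', 'blind', '.']


def token_diff(a, b):
    """Hypothetical helper: tokens only in a and only in b, counting duplicates."""
    count_a, count_b = Counter(a), Counter(b)
    only_a = sorted((count_a - count_b).elements())
    only_b = sorted((count_b - count_a).elements())
    return only_a, only_b


print(token_diff(word_tok, tb_tok))
# (['.', 'mind'], ['mind.'])
print(token_diff(word_tok, wordpunct_tok))
# (["'d"], ["'", 'd'])
```

The comparison ignores token order, which is enough here because the three tokenizers only disagree on how a few pieces are split.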

This post draws on 딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning) by Won Joon Yoo.
The full code is available at https://github.com/Leo-bb/natural-language-processing


License: Attribution-ShareAlike
