Tuesday, August 9, 2016

RVM TEST using python-rake

I tried to implement LDA (Latent Dirichlet Allocation) using python-rake.
(on Python 3.5)

(Reference site)
https://www.airpair.com/nlp/keyword-extraction-tutorial

First I downloaded the tutorial,
 $ git clone https://github.com/zelandiya/RAKE-tutorial 

and installed python-rake as well.
 $ pip3 install python-rake 

Then I ran the following source. (Super simple ;;)

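import rake  # assumes rake.py from the cloned RAKE-tutorial is on the import path

# Arguments: stop-list path, min characters per word, max words per phrase, min keyword frequency
rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 1)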
text = (
    "RIO DE JANEIRO — Japan returned to the top of the men’s gymnastics world, beating Russia and China in the team event at the Olympics. The event was a renewal of the longtime gymnastics rivalry between Japan and China. China won the last two gold medals, in 2008 and 2012. But Japan has Kohei Uchimura, the world’s best gymnast — and perhaps the best ever — anchoring their team. In the final rotation, Japan took on Russia head-to-head in the floor exercise, holding a slim 0.208 lead. Japan went first. After an outstanding score from Kenzo Shirai and a good one from Ryohei Kato, Uchimura locked down the win with a 15.6. Combined, it was the best team floor exercise score of the night. It was the end of a dominant night for Uchimura, who started strong with a 15.100 on the pommel horse, below. Russia could not match those scores and had to settle for second. China won the bronze. Russia had been the early leader after strong horse and rings performances, including a 15.7 by Denis Abliazin on rings. Japan rallied on the vault and parallel bars and took the lead on the high bar. Abliazin, below, and the Russians came up short on the floor routine later in the night, allowing Japan to seal the victory. Russia met with some jeers from the crowd, as it has at other events in the aftermath of reports of state-sponsored doping. China was making early mistakes. You Hao stumbled on his rings landing. More stumbles followed on the vault, normally a chance to pick up high scores. A big parallel bars, including a stellar one by You, thrust China back into medal contention. The Americans got off to a slow start on their first apparatus, floor exercise, when two of their three gymnasts, Alexander Naddour, below, and Sam Mikulak, tumbled out of bounds. Next up was the horse, where Danell Leyva slipped on the dismount, leading to another low score and seriously denting United States medal hopes. "
    "Better performances followed, but Leyva fell on the high bar, and the team could finish no better than fifth."
)
# Extract keywords from the article text and print them
keywords = rake_object.run(text)
print("keywords: ", keywords)


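Oh, and about rake_object = rake.Rake("SmartStoplist.txt", 3, 3, 1):

SmartStoplist.txt is a stop-word list: assorted English words recorded in alphabetical order, from A to Z.
(Korean could probably be added the same way? It's a real pity that everything always gets developed with English in mind ;)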
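
To make the role of the stop list concrete, here is a minimal sketch of how stop words can cut a sentence into candidate phrases. It is only an illustration, not the code inside rake.py: the function names and the tiny six-word stop list are made up for this example, and the real SmartStoplist.txt is much longer.

```python
import re


def build_stop_word_pattern(stop_words):
    """Build one regex that matches any stop word as a whole word."""
    alternatives = [r"\b" + re.escape(word) + r"\b" for word in stop_words]
    return re.compile("|".join(alternatives), re.IGNORECASE)


def candidate_phrases(sentence, stop_words):
    """Split a sentence at stop words and return the non-empty pieces in between."""
    pattern = build_stop_word_pattern(stop_words)
    # Replace every stop word with a marker, then split on that marker.
    pieces = pattern.sub("|", sentence).split("|")
    return [piece.strip().lower() for piece in pieces if piece.strip()]


# Hypothetical mini stop list, only for illustration.
stop_words = ["the", "of", "a", "was", "between", "and"]
sentence = ("The event was a renewal of the longtime gymnastics rivalry "
            "between Japan and China")
print(candidate_phrases(sentence, stop_words))
```

In principle the same function accepts a word list in any language, but this is untested for Korean, where particles attach directly to the preceding word and whole-word matching may miss them.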

The 3, 3, 1 mean the following.
  1. Each word has at least 3 characters
  2. Each phrase has at most 3 words
  3. Each keyword appears in the text at least once
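
The three thresholds can be sketched as simple filters over candidate phrases. This is a hypothetical illustration of the description above, not the library's actual implementation: the names filter_candidates, min_chars, max_words, and min_freq, as well as the sample phrases, are made up for the example.

```python
from collections import Counter


def filter_candidates(candidates, min_chars=3, max_words=3, min_freq=1):
    """Keep phrases that pass all three thresholds (hypothetical helper)."""
    counts = Counter(candidates)  # how often each phrase occurs
    kept = []
    for phrase, count in counts.items():
        words = phrase.split()
        if any(len(word) < min_chars for word in words):
            continue  # 1. every word needs at least min_chars characters
        if len(words) > max_words:
            continue  # 2. a phrase may have at most max_words words
        if count < min_freq:
            continue  # 3. the phrase must occur at least min_freq times
        kept.append(phrase)
    return sorted(kept)


# Made-up candidate phrases, only for illustration.
candidates = [
    "making early mistakes",
    "japan",
    "japan",
    "us",
    "best team floor exercise score",
]
print(filter_candidates(candidates, 3, 3, 1))
print(filter_candidates(candidates, 3, 3, 2))
```

With 3, 3, 1 the two-letter word and the five-word phrase are dropped; raising the last value to 2 keeps only phrases that occur at least twice.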

The results were pretty good in their own way.
Above all, the analysis speed was surprisingly fast.


(Article URL)
http://www.nytimes.com/2016/08/09/sports/olympics/gymnastics-japan-team-results.html?hp&action=click&pgtype=Homepage&clickSource=story-heading&module=photo-spot-region&region=top-news&WT.nav=top-news

(Results)
making early mistakes: 85%
Danell Leyva slipped: 85%
longtime gymnastics rivalry: 85%
big parallel bars: 80%
thrust China back: 76%
(and others)..

(Of course, I ran the test on many more articles. I kept dozing off while testing ;;;;)

It doesn't fit my purpose, so I can't put it to use right away, but it should come in handy for sentence analysis someday.



