[Preprocessing] Correcting Spacing in Korean Documents!

  • Why spacing correction matters for Korean documents

Morphological analysis performance up! => quality of downstream model inputs up! => model performance up!

  • Available packages

Soyspacing – https://github.com/lovit/soyspacing : A spacing model based on heuristic algorithms. It learns spacing rules from the sentences given as input and applies them to new sentences. It is probably too limited to serve as a general-purpose model for all Korean documents, but it should be especially useful when analyzing documents with a homogeneous topic and format (i.e., as a domain-specific model).

PyKoSpacing – https://github.com/haven-jeon/PyKoSpacing : A model that stacks an RNN on top of a CNN, trained on news data. It is reported to work well on test sets such as the Sejong corpus, although the accuracy measure may over-represent its performance. In general, it looks useful for turning documents whose spacing is a real mess into something readable.

TaKos (Alpha) – https://github.com/Taekyoon/takos-alpha : The project introduced in the YouTube video attached earlier. Development toward production use still seems to be in progress!

  • Soyspacing usage example

Model training:
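
A minimal training sketch based on the soyspacing README; the corpus path and model name are placeholders of my own:

from soyspacing.countbase import CountSpace

corpus_fname = 'my_corpus.txt'  # placeholder: training sentences, one per line, UTF-8
model = CountSpace()
model.train(corpus_fname)  # learn space/non-space frequency statistics from the corpus
model.save_model('my_spacing_model', json_format=False)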

Applying the model:
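
Applying the trained model to an unspaced sentence looks roughly like this; the threshold values are assumptions taken from the README examples, to be tuned per corpus rather than treated as recommendations:

from soyspacing.countbase import CountSpace

model = CountSpace()
model.load_model('my_spacing_model', json_format=False)

sent = '이건진짜좋은영화라라랜드진짜좋은영화'  # sample sentence with spaces removed
sent_corrected, tags = model.correct(
    doc=sent,
    force_abs_threshold=0.3,   # assumed values; tune for your corpus
    nonspace_threshold=-0.3,
    space_threshold=0.3,
    min_count=10)
print(sent_corrected)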

Applying a pre-analyzed dictionary!
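
A pre-analyzed dictionary (RuleDict) forces the spacing of known words regardless of the learned statistics. A sketch, assuming a rules file in the format the README describes (the file name is a placeholder):

from soyspacing.countbase import CountSpace, RuleDict

model = CountSpace()
model.load_model('my_spacing_model', json_format=False)

rule_dict = RuleDict('my_rules.txt')  # placeholder: file of word-spacing rules
sent_corrected, tags = model.correct(doc='이건진짜좋은영화라라랜드진짜좋은영화', rules=rule_dict)
print(sent_corrected)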

[Khaiii] Installation on Ubuntu 20.04 LTS

If you follow Kakao's installation guide at https://github.com/kakao/khaiii/wiki/%EB%B9%8C%EB%93%9C-%EB%B0%8F-%EC%84%A4%EC%B9%98 on Ubuntu 20.04 LTS, the build fails with an error at the cmake step.

You follow the installation guide along, as below…

mkdir build
cd build
cmake .. 

Replacing this cmake step with the command below reportedly makes the installation work (in a hacky way); the CXXFLAGS="-w" suppresses compiler warnings, which the newer GCC on 20.04 seems to escalate into build-stopping errors:

cmake -E env CXXFLAGS="-w" cmake ..

The remaining steps are unchanged:

make all
make resource
make install
make package_python 
cd package_python 
pip install .

Done!
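
To verify the installation, a quick smoke test with the Python API installed by pip install . above (the sample sentence is arbitrary):

from khaiii import KhaiiiApi

api = KhaiiiApi()
for word in api.analyze('안녕, 세상.'):  # prints each word with its morpheme analysis
    print(word)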

[Javascript] Entering Letter Grades Automatically at UMEG


Entering letter grades in UMEG by hand takes 300+ clicks and is cumbersome and error-prone, especially for large classes.

An alternative is to use Canvas (ELMS) to transfer grades into UMEG. However, the functions Canvas provides are rather restrictive (and slow), and we don't want to generate a flood of notifications to students.

Instead, we can run the following script on the UMEG page to fill in the grades automatically.

var t = document.getElementsByTagName("table")[5]; // the grade table on the UMEG page
var trs = t.getElementsByTagName("tr");

// Input the grades you want to enter, following the order in UMEG (name – ascending order)
var grades = ["A+", "C", "D"];

// Map a letter grade to [letter index, modifier index]:
// i = 1..4 for A..D; j = 1 for "+", 2 for no modifier, 3 for "-"
function to_num(letter) {
    var i = null;
    var j = 2;
    if (letter[0] == "A") {
        i = 1;
    } else if (letter[0] == "B") {
        i = 2;
    } else if (letter[0] == "C") {
        i = 3;
    } else if (letter[0] == "D") {
        i = 4;
    }
    if (letter.slice(-1) == "+") {
        j = 1;
    } else if (letter.slice(-1) == "-") {
        j = 3;
    }
    return [i, j];
}

// Change num to the number of students in your class
var num = 374;
for (var i = 0; i < num; i++) {
    var str_ = "grd" + i; // name of the radio input for student i
    var letter = to_num(grades[i]);
    // Each student's section spans 9 rows; row 9*i + 1 holds the grade cells
    var grade = trs[9 * i + 1].getElementsByTagName("td")[letter[0] * 4 - 2 + letter[1]].getElementsByTagName("input")[str_];
    grade.checked = false;
    grade.checked = true;
    console.log(i); // progress indicator
}

California Route 1

(Photos: 20200114-_DSC5361-2, 20200114-_DSC5279-2, 20200105-_DSC3193)

The drive from Monterey to Solvang, following Route 1(?) the whole way.

Google Maps kept recommending the inland route,

but I wanted to take the perilously narrow road carved into the mountains, with the ocean right beside me.

The scenery kept pulling my eyes off the road and nearly got me killed, but since I came back alive, I suppose I can call it a good memory.

Somewhat out of the blue, I thought it would be wonderful to become the kind of PhD who travels to scenic places in a camper van by day and does research by night… It would be hard… but well, one can dream… haha

[N.K. provocation] Results

LDA Topic Model output (20 Topics):

(Figure: lda_topic – LDA topic model output)

Topics #13 and #17 can be interpreted as “armed provocation” and “nuclear provocation”, respectively, while #14 and #18 can be interpreted as “South-North dialogue” and “international talks”.

The number of articles belonging to each category is shown in the graph below.

(Figure: noname01 – article counts per category)

Independent variable: public support for unification, inversely coded (1: necessary / 5: unnecessary).

(Figure: reg.png)

Training NER[0] – Training custom Word Vectors from 10-K/Q filings

NER with word vectors.

Word vectors are particularly useful for terms which aren’t well represented in your labelled training data. For instance, if you’re doing named entity recognition, there will always be lots of names that you don’t have examples of. For instance, imagine your training data happens to contain some examples of the term “Microsoft”, but it doesn’t contain any examples of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company. However, it’ll see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.

– https://spacy.io/usage/vectors-similarity

Input data: all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied (excluding lemmatization and N-grams).

Code:

As my RAM was limited to 16 GB, I found the data-streaming technique especially useful in my case. (Helpful introductions to training word vectors with Gensim: https://rare-technologies.com/word2vec-tutorial/ , https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/)

import os
import sys
import re
import logging
from gensim.models import Word2Vec

# Display the progress of training word vectors
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)

dirname = 'D:/10_k'  # directory where 10-K/Q filings are downloaded

class MySentences(object):
    """Stream sentences from disk one line at a time, so the whole corpus never has to fit in RAM."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            for qname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])):
                for fname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname)):
                    if fname.endswith(".txt"):
                        for line in open(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname, fname), 'r', encoding='utf-8'):
                            yield line.split()

sentences_ = MySentences(dirname)
model = Word2Vec(min_count=10, size=200)
# To test multiple parameter settings later, it is much more convenient to build the vocabulary once and save it
model.build_vocab(sentences_)
#model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

sentences_ = MySentences(dirname)  # re-create the generator, since build_vocab exhausted it
model.window = 5
model.workers = 5
model.sg = 0  # CBOW, which allows faster training than skip-gram
model.train(sentences_, total_examples=model.corpus_count, epochs=10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)  # save the keyed vectors from the model

 
Now, finally, build a spaCy model using the keyed vectors constructed above.

Type the following into a CMD window:

python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed_vector_location>
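
To sanity-check that the vectors made it into the model, something like the following should work (your_spacy_model_name as above; the word pair is just an illustration and may be out-of-vocabulary depending on your preprocessing):

import spacy

nlp = spacy.load('your_spacy_model_name')  # the model directory created by init-model
doc = nlp('Microsoft Symantec')
print(doc[0].similarity(doc[1]))  # similarity computed from the custom 10-K/Q vectors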