California Route 1

[Photo: 20200114-_DSC5361-2]

[Photo: 20200114-_DSC5279-2]

[Photo: 20200105-_DSC3193]

Following Route 1(?) all the way from Monterey to Solvang.

Google Maps kept recommending the inland route,

but I wanted to take the perilously narrow road cut into the mountainside, with the ocean right beside me.

My eyes kept drifting to the scenery and I nearly got myself killed, but since I made it back alive, I'll call it a good memory.

A bit out of nowhere, I thought it would be great to become the kind of PhD who travels to scenic spots in a camper van by day and does research by night… it would be tough… but hey, imagining costs nothing… haha

[N.K. provocation] Results

LDA Topic Model output (20 Topics):

[Figure: LDA topic model output (lda_topic)]

Topics #13 and #17 can be interpreted as “armed provocation” and “nuclear provocation”, respectively. Topics #14 and #18 can be interpreted as “South-North dialogue” and “international talks”, respectively.

The number of articles belonging to each category is shown in the graph below.

[Figure: number of articles in each category (noname01)]
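For reference, a minimal sketch of how a 20-topic LDA model and the per-category counts might be produced with gensim; the toy docs list and all variable names here are my placeholders, not the actual pipeline:

from collections import Counter
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ['missile', 'launch', 'provocation'],
    ['nuclear', 'test', 'provocation'],
    ['summit', 'dialogue', 'talks'],
]  # placeholder; in practice, the tokenized news articles

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = LdaModel(corpus, num_topics=20, id2word=dictionary, passes=10)
print(lda.print_topics())  # inspect and label topics, e.g. #13 as "armed provocation"

# assign each article to its dominant topic, then count articles per topic
dominant = [max(lda[bow], key=lambda t: t[1])[0] for bow in corpus]
print(Counter(dominant))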

Independent variable: public support for unification, reverse-coded (1: necessary / 5: unnecessary)

[Figure: regression results (reg.png)]
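A hedged sketch of how such a regression might be run; the post only specifies the independent variable, so the dependent variable (monthly provocation-article counts) and every number below are purely illustrative assumptions:

import pandas as pd
import statsmodels.api as sm

# illustrative placeholder data, not actual results
df = pd.DataFrame({
    'unification_unnecessary': [1.8, 2.1, 2.4, 2.9, 3.2],  # 1: necessary ... 5: unnecessary
    'provocation_articles':    [40, 35, 52, 61, 66],       # assumed dependent variable
})
X = sm.add_constant(df['unification_unnecessary'])
print(sm.OLS(df['provocation_articles'], X).fit().summary())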

Training NER[0] – Training custom Word Vectors from 10-K/Q filings

NER with word vectors.

Word vectors are particularly useful for terms which aren’t well represented in your labelled training data. For instance, if you’re doing named entity recognition, there will always be lots of names that you don’t have examples of. For instance, imagine your training data happens to contain some examples of the term “Microsoft”, but it doesn’t contain any examples of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company. However, it’ll see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.

(from https://spacy.io/usage/vectors-similarity)
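The quoted intuition is easy to check once any vectors-equipped model is available (assuming the stock en_core_web_md here, not the custom model built below):

import spacy

nlp = spacy.load('en_core_web_md')  # any model that ships with word vectors
microsoft = nlp.vocab['Microsoft']
print(microsoft.similarity(nlp.vocab['Symantec']))  # relatively high: used in similar contexts
print(microsoft.similarity(nlp.vocab['banana']))    # much lower: unrelated contexts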

Input data: all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied but no lemmatization or n-grams.
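The post doesn't spell the preprocessing out, but a minimal sketch of what "basic" steps might look like (my assumption: strip markup and non-alphabetic characters, lowercase):

import re

def basic_clean(text):
    text = re.sub(r'<[^>]+>', ' ', text)      # strip residual HTML tags from filings
    text = re.sub(r'[^A-Za-z\s]', ' ', text)  # keep alphabetic tokens only
    return re.sub(r'\s+', ' ', text).strip().lower()

print(basic_clean('<p>Net revenues increased 12% to $3.5 billion.</p>'))
# -> 'net revenues increased to billion'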

 

Code:

As my RAM was limited to 16 GB, I found the data streaming technique especially useful in my case. (Helpful introductions to training word vectors with Gensim: https://rare-technologies.com/word2vec-tutorial/ , https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/)

import os
from gensim.models import Word2Vec

import logging
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO)  # display the progress of training word vectors

dirname = 'D:/10_k'  # directory where 10-K/Q filings are downloaded

class MySentences(object):
    """Stream tokenized lines from disk so the whole corpus never sits in RAM."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            year_dir = os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])
            for qname in os.listdir(year_dir):
                for fname in os.listdir(os.path.join(year_dir, qname)):
                    if fname.endswith(".txt"):
                        for line in open(os.path.join(year_dir, qname, fname), 'r', encoding='utf-8'):
                            yield line.split()

sentences_ = MySentences(dirname)
# set all hyperparameters up front: size cannot be changed after build_vocab()
# has allocated the vectors; sg=0 (CBOW) allows faster training
model = Word2Vec(min_count=10, size=200, window=5, workers=5, sg=0)
model.build_vocab(sentences_)  # to test multiple parameters later, it is much more convenient to build the vocabulary once and save it
#model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

model.train(sentences_, total_examples=model.corpus_count, epochs=10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)  # save the keyed vectors from the model
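Once training finishes, a quick sanity check one might run (the query term is just an example; the neighbours depend on the corpus):

print(model.wv.most_similar('revenue', topn=5))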

 
Now, finally, build a spaCy model using the keyed vectors constructed above.

Type into a CMD window:

python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed_vector_location>
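To confirm the vectors were picked up, one might load the freshly initialized model back in Python ('revenue' again being just an example token):

import spacy

nlp = spacy.load('your_spacy_model_name')
print(nlp.vocab['revenue'].has_vector)  # True if the custom vectors were loaded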