The Big Moon that only I can tell
Driving along Highway 1 from Monterey to Solvang.
Google Maps kept recommending the inland route,
but I wanted to take the perilously narrow road carved into the mountainside, with the ocean right beside me.
The scenery kept stealing my eyes and I nearly got myself killed, but since I made it back alive, I'll call it a good memory.
Somewhat out of the blue, I thought how nice it would be to become the kind of PhD who travels scenic places in a camper van by day and does research by night... It would be hard... but hey, one can dream. Haha.
My South America trip has somehow grown enormously long!
Since even I couldn't keep it all straight, I sketched out the whole itinerary.
Hoping to wrap up this semester(?) well and have a great trip! ><!
Carmel Beach at sunset, seen on January 12, 2020.
A moment when the flow of time suddenly becomes vivid!
LDA Topic Model output (20 Topics):
Topics #13 and #17 can be interpreted as “armed provocation” and “nuclear provocation”, respectively; topics #14 and #18 as “South–North dialogue” and “international talks”.
The number of articles belonging to each category is shown in the graph below.
Independent variable: the inverse of public support for unification (1: necessary / 5: unnecessary)
NER with word vectors.
Word vectors are particularly useful for terms that aren’t well represented in your labelled training data. If you’re doing named entity recognition, for instance, there will always be lots of names you don’t have examples of. Imagine your training data happens to contain some examples of the term “Microsoft”, but none of the term “Symantec”. In your raw text sample there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company, but it will see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.
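The intuition can be sketched with a toy example. The four-dimensional vectors below are made up purely for illustration (real word vectors have hundreds of dimensions and are learned from a corpus), but they show the mechanism: an unlabelled term whose vector sits close to labelled company terms can inherit the label.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Hypothetical 4-d vectors: the two company names cluster together,
# while an unrelated word points in a different direction.
vectors = {
    "Microsoft": [0.9, 0.1, 0.0, 0.2],
    "Symantec":  [0.8, 0.2, 0.1, 0.3],
    "Tuesday":   [0.0, 0.9, 0.8, 0.1],
}

# Even though "Symantec" was never labelled as a company, its vector
# lies near "Microsoft"'s, so the tagger can transfer the label.
print(cosine(vectors["Microsoft"], vectors["Symantec"]))  # high (~0.98)
print(cosine(vectors["Microsoft"], vectors["Tuesday"]))   # low  (~0.10)
```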
Input data: all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied, except for lemmatization and n-grams.
As my RAM was limited to 16 GB, I found the data-streaming technique especially useful in my case. (Helpful introductions to training word vectors with Gensim: https://rare-technologies.com/word2vec-tutorial/ , https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/)
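The core of the streaming trick is an *iterable* class rather than a one-shot generator: `__iter__` reopens the files on every pass, so Word2Vec can scan the corpus repeatedly (once for `build_vocab`, again for each training epoch) without the corpus ever sitting in RAM. A minimal stdlib-only sketch, using a hypothetical flat directory of .txt files instead of the EDGAR layout used below:

```python
import os
import tempfile

class TxtSentences:
    """Iterable that lazily yields one tokenized line at a time.

    Because __iter__ reopens the files on each call, the object can be
    iterated any number of times without holding the corpus in memory.
    """
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for fname in sorted(os.listdir(self.dirname)):
            if fname.endswith(".txt"):
                with open(os.path.join(self.dirname, fname), encoding="utf-8") as f:
                    for line in f:
                        yield line.split()

# Tiny demo corpus in a temporary directory.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.txt"), "w", encoding="utf-8") as f:
    f.write("net income increased\nrevenue decreased\n")

sentences = TxtSentences(demo_dir)
print(list(sentences))  # [['net', 'income', 'increased'], ['revenue', 'decreased']]
print(list(sentences))  # identical: the iterable restarts cleanly on the second pass
```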
import os
import logging
from gensim.models import Word2Vec

# Display the progress of training word vectors
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

dirname = 'D:/10_k'  # directory where 10-K/Q filings are downloaded

class MySentences(object):
    """Stream tokenized lines from the filing directory, one at a time."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            for qname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])):
                for fname in os.listdir(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname)):
                    if fname.endswith(".txt"):
                        for line in open(os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:], qname, fname), 'r', encoding='utf-8'):
                            yield line.split()

sentences_ = MySentences(dirname)

model = Word2Vec(min_count=10, size=200)
# To test multiple parameter settings later, it is much more convenient
# to build the vocabulary once and save it.
model.build_vocab(sentences_)
#model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

sentences_ = MySentences(dirname)
model.window = 5
model.workers = 5
model.sg = 0  # CBOW, which allows faster training than skip-gram
model.train(sentences_, total_examples=model.corpus_count, epochs=10)

# Save the keyed vectors from the model in word2vec text format.
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)
Now, finally, build the spaCy model using the keyed vectors constructed above.
Type into CMD window:
python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed-vector-location>

where <keyed-vector-location> is the path to the keyed-vector file saved above (e.g. D:/mltool/kv_LMLM_dim_200_MIN_10).
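The file passed to --vectors-loc is in the word2vec text format written by save_word2vec_format: a header line "vocab_size dim", then one line per word with exactly dim float components. Before running init-model it can be worth sanity-checking the file, since a mismatched header silently came up for me as the most likely failure mode. A stdlib-only sketch (the toy file and check_word2vec_txt helper are mine, standing in for the real keyed-vector file):

```python
import os
import tempfile

def check_word2vec_txt(path):
    """Sanity-check a word2vec text-format file: header 'vocab dim',
    then one line per word with exactly `dim` float components."""
    with open(path, encoding="utf-8") as f:
        vocab, dim = map(int, f.readline().split())
        n = 0
        for line in f:
            parts = line.rstrip().split(" ")
            assert len(parts) == dim + 1, "wrong vector width on line %d" % (n + 2)
            [float(x) for x in parts[1:]]  # raises ValueError if not numeric
            n += 1
    assert n == vocab, "header vocab count does not match body"
    return vocab, dim

# Toy file standing in for the keyed vectors saved earlier.
kv_path = os.path.join(tempfile.mkdtemp(), "kv.txt")
with open(kv_path, "w", encoding="utf-8") as f:
    f.write("2 3\nrevenue 0.1 0.2 0.3\nincome 0.4 0.5 0.6\n")

print(check_word2vec_txt(kv_path))  # (2, 3)
```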