Training NER[0] – Training custom Word Vectors from 10-K/Q filings

NER with word vectors.

Word vectors are particularly useful for terms which aren’t well represented in your labelled training data. For instance, if you’re doing named entity recognition, there will always be lots of names that you don’t have examples of. For instance, imagine your training data happens to contain some examples of the term “Microsoft”, but it doesn’t contain any examples of the term “Symantec”. In your raw text sample, there are plenty of examples of both terms, and they’re used in similar contexts. The word vectors make that fact available to the entity recognition model. It still won’t see examples of “Symantec” labelled as a company. However, it’ll see that “Symantec” has a word vector that usually corresponds to company terms, so it can make the inference.

(Source: https://spacy.io/usage/vectors-similarity)
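
For intuition, here is a minimal sketch of the effect described above, assuming the en_core_web_md model (which ships with word vectors) is installed:

import spacy

nlp = spacy.load('en_core_web_md') # any model with word vectors will do

# Even if 'Symantec' never appears labelled in the training data, its vector
# lies close to those of other company names, which the model can exploit.
microsoft = nlp('Microsoft')[0]
symantec = nlp('Symantec')[0]
print(microsoft.similarity(symantec)) # relatively high cosine similarity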

INPUT data : all 10-K and 10-Q filings available on SEC EDGAR, with basic preprocessing applied, but without lemmatization or n-gram construction.
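
The exact cleaning steps are not reproduced here; a purely illustrative sketch of what such basic preprocessing could look like (the helper and the token pattern are my assumptions):

import re

def basic_preprocess(text):
    # Hypothetical cleanup: lowercase and keep simple word tokens.
    # No lemmatization and no n-gram construction, per the note above.
    return re.findall(r"[a-z0-9&'-]+", text.lower())

print(basic_preprocess('Revenues increased 4% year-over-year.'))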

 

CODE :

Since my RAM was limited to 16 GB, I found data streaming useful in my case. (Helpful introductions to training word vectors with Gensim: https://rare-technologies.com/word2vec-tutorial/ , https://rare-technologies.com/data-streaming-in-python-generators-iterators-iterables/)

import os
from gensim.models import Word2Vec

import logging
logging.basicConfig(
    format='%(asctime)s : %(levelname)s : %(message)s',
    level=logging.INFO) # display the progress of training word vectors

dirname = 'D:/10_k' # directory where 10-K/Q filings are downloaded

class MySentences(object):
    """Stream tokenized lines one at a time, so the whole corpus
    never has to fit in memory."""
    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        for f_dir in os.listdir(self.dirname):
            year_dir = os.path.join(self.dirname, f_dir, 'EDGAR\\10-X_C', f_dir[-4:])
            for qname in os.listdir(year_dir):
                for fname in os.listdir(os.path.join(year_dir, qname)):
                    if fname.endswith(".txt"):
                        with open(os.path.join(year_dir, qname, fname), 'r', encoding='utf-8') as f:
                            for line in f:
                                yield line.split()

sentences_ = MySentences(dirname)
model = Word2Vec(min_count=10, size=200)
model.build_vocab(sentences_) # to test multiple parameter settings later, it is much more convenient to first build the vocabulary and save it
#model.save('D:/mltool/word2vec/LMLM_word2vec_MIN_10')
#model = Word2Vec.load('D:/mltool/word2vec/LMLM_word2vec_MIN_10')

# MySentences opens the files afresh on each iteration, so the same
# instance can be reused for training.
model.window = 5
model.workers = 5
model.sg = 0 # CBOW, which trains faster than skip-gram
model.train(sentences_, total_examples=model.corpus_count, epochs=10)
model.wv.save_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False) # save the keyed vectors from the model
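
As a quick sanity check on the trained vectors (the query word is only an illustrative example; any frequent term from the filings will do):

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('D:/mltool/kv_LMLM_dim_200_MIN_10', binary=False)
print(kv.most_similar('revenue', topn=5)) # nearest neighbours in vector space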

 
Now, finally, build the spaCy model using the keyed vectors constructed above.

Type into a CMD window:
python -m spacy init-model en your_spacy_model_name --vectors-loc <keyed_vector_location>
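
To confirm the vectors made it into the new model (the model name matches the command above):

import spacy

nlp = spacy.load('your_spacy_model_name') # path to the model built by init-model
print(nlp.vocab.vectors.shape) # (number of vectors, 200)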
  

 

 

[NK Provocation Index][0] Intro

  • North Korea's military provocations and nuclear threats are likely to hamper Korean economic growth
  • Possible channel : increased risk leads investment and saving to decrease
  • X (N.K. provocation) -> X' (investment, saving/consumption rate) -> Y (economic growth)
  • Identification 1 : measure the degree of N.K. provocation by the number of articles belonging to a 'Provocation/Nuclear threats' topic (LDA topic model; see the sketch below)
  • Identification 2 : causality? A VAR may be helpful
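
A minimal sketch of Identification 1 with Gensim's LDA; the documents below and the chosen topic index are hypothetical, and the real provocation topic would be picked by inspecting lda.print_topics():

from gensim import corpora, models

# Hypothetical tokenized news articles.
docs = [['north', 'korea', 'missile', 'launch'],
        ['kospi', 'rises', 'on', 'strong', 'earnings'],
        ['pyongyang', 'nuclear', 'test', 'condemned']]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)

provocation_topic = 0 # assumption: identified by inspecting lda.print_topics()
# Count articles whose dominant topic is the provocation topic.
count = sum(1 for bow in corpus
            if max(lda[bow], key=lambda t: t[1])[0] == provocation_topic)
print(count)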

[NKPR 0] Building Caffe on Window (for anaconda environment)

I recall that installing Caffe on Windows was one of the hardest steps in this project.

 

  • However, some (small) problems arise depending on one's environment. For me, installing VS 2015 raised the error "a setup package is either missing or damaged", and no complete fix for this problem exists on the web. (I spent two days repeatedly wiping and reinstalling the whole VS 2015.)

 

  • In addition, building PyCaffe requires Python 3.5, while I had been using Python 3.6 (Anaconda) for my previous work. Since I did not want to change my working environment, I installed PyCaffe inside an Anaconda environment set to Python 3.5. A few settings have to be modified before installing.

 

  1. Create a new environment for Python 3.5 (e.g. conda create -n py35 python=3.5.0 anaconda)
  2. Before using cmd, activate the Anaconda environment (e.g. conda activate py35)
  3. When modifying caffe\caffe\scripts\build_win.cmd according to the video above, set the CONDA_ROOT variable to the location of the Python 3.5 environment's conda
  4. Now follow the video!
  5. Done! (A quick import check is sketched below.)
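
Once the build finishes, a quick check from the activated py35 environment confirms that PyCaffe is importable:

# Run inside the activated py35 environment after the build.
import caffe
print('PyCaffe imported successfully')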

[USVC] Drawing Supply Chain 2 – US Listed Domestic Firms


<Histogram of firm out nodes, clockwise from left top: 2000, 2005, 2010, 2015>

Numbers

  • The total sample number decreased from 1998 to 2016: {1998: 9062, 1999: 8906, 2000: 8512, 2001: 8167, 2002: 7692, 2003: 7447, 2004: 7498, 2005: 7015, 2006: 6690, 2007: 6676, 2008: 7448, 2009: 7792, 2010: 7427, 2011: 7223, 2012: 6943, 2013: 6783, 2014: 6640, 2015: 6231, 2016: 5850}
  • The number of edges (firm-to-firm links), however, decreased even faster over the same period
  • The average shortest path length over all connected pairs (see the sketch below): {2000: 1.638, 2005: 1.531, 2010: 1.284, 2015: 1.322}
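
A sketch of how that statistic could be computed with NetworkX; G stands for one year's supply-chain DiGraph, and averaging over connected pairs only is my assumption:

import networkx as nx

def avg_shortest_length(G):
    # Average over all connected ordered pairs; disconnected pairs
    # and zero-length self-paths are skipped.
    lengths = [d for _, dists in nx.shortest_path_length(G)
               for d in dists.values() if d > 0]
    return sum(lengths) / len(lengths)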

 

Possible explanations

  • The (trained natural-language) model may be overfitted to the early 2000s
  • Supply-chain linkages among U.S. firms might actually be decreasing due to economic uncertainty

 

10 firms with the most in-nodes (by in-degree centrality; see the sketch after the lists)

(year 2000) : [['Walmart Inc', 0.027717626678215677], ['Lucent Technologies Inc', 0.026634906886097875], ['Hewlett Packard Enterprise Co', 0.023170203551320916], ['AT&T Corp', 0.018189692507579038], ['Ford Inc', 0.0173235166738848], ['Cisco Systems Inc', 0.01602425292334344], ['Siemens AG', 0.013858813339107838], ['Boeing Corp', 0.013642269380684278], ['Intel Corp', 0.012126461671719359], ['Target Inc', 0.01169337375487224]]

(year 2015) : [['Walmart Inc', 0.020942408376963352], ['AT&T Corp', 0.010732984293193719], ['Ford Inc', 0.010209424083769635], ['Shell Oil Co', 0.009947643979057593], ['Target Inc', 0.00968586387434555], ['Home Depot Inc.', 0.009162303664921467], ['Cisco Inc', 0.008638743455497384], ['Microsoft Corp', 0.008638743455497384]]
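
A sketch of how such a ranking could be produced from one year's graph (the helper name is mine; G is the supply-chain DiGraph):

import networkx as nx

def top_in_degree(G, n=10):
    # Rank firms by in-degree centrality, descending.
    cent = nx.in_degree_centrality(G)
    return sorted(cent.items(), key=lambda kv: kv[1], reverse=True)[:n]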

 

 

[USVC] Drawing Supply Chain 1 – Small Sample

/******************************************************************
— Title : [Python; NetworkX] Supply Chain analysis
— Key word : networkx, Node, Edge, Centrality, Supply Chain, Value Chain
*******************************************************************/

Data

  • About 200 major firms listed in Compustat data
  • The data set will soon encompass all firms with a CIK code
  • Customer information extracted from 10-K disclosure data

Graph

  • Drawn with the basic NetworkX drawing tool (nx.draw())
  • Years, in order : 2000, 2005, 2010, 2015
  • Size of node : in_degree_centrality
  • Color of node : out_degree_centrality

<Supply chain graphs, in order: 2000, 2005, 2010, 2015>

Sample Code :

(Reference : https://briandew.wordpress.com/2016/06/15/trade-network-analysis-why-centrality-matters/)

import networkx as nx
import matplotlib.pyplot as plt

def draw_G(G, year):
    # Node colour: out-degree centrality (scaled for visibility).
    oc = nx.out_degree_centrality(G)
    for key in oc.keys():
        oc[key] = oc[key] * 10
    nx.set_node_attributes(G, name='cent', values=oc)
    # Node size: in-degree centrality.
    ic = nx.in_degree_centrality(G)
    nx.set_node_attributes(G, name='in', values=ic)
    node_size = [float(G.nodes[v]['in']) * 20000 + 1 for v in G]
    node_color = [float(G.nodes[v]['cent']) for v in G]
    pos = nx.spring_layout(G, k=30, iterations=8)
    nodes = nx.draw_networkx_nodes(G, pos, node_size=node_size, node_color=node_color, alpha=0.5)
    edges = nx.draw_networkx_edges(G, pos, edge_color='black', arrows=True, width=0.3)
    nx.draw_networkx_labels(G, pos, font_size=5)
    plt.text(0, -1.2, 'Node color is out_degree_centrality', fontsize=7)
    plt.title('Compustat firms Supply Chain (year : ' + str(year) + ')', fontsize=12)
    nodes.set_clim(0, 1) # fix the colour scale across years
    cbar = plt.colorbar(mappable=nodes, fraction=0.015, pad=0.04)
    plt.margins(0, 0)
    plt.axis('off')
    plt.savefig(str(year) + 'Supply Chain.png', dpi=1000)
    plt.show()
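
A hypothetical usage example with a toy supplier-to-customer edge list (firm names and links are made up for illustration):

edges_2000 = [('Intel Corp', 'Dell Inc'), ('Micron Inc', 'Dell Inc'),
              ('Dell Inc', 'Walmart Inc'), ('P&G Co', 'Walmart Inc')]
G = nx.DiGraph(edges_2000) # edge direction: supplier -> customer
draw_G(G, 2000)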

Numbers (Statistics)

  • Longest path :