[USVC 2] CLASSIFYING TEXT

I found that customer information in a 10-K is stated in two forms.

One is sentence form and the other is table form.

Hence, I started by dividing the 10-K text into two categories, text and table, using the HTML tags.
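The split by HTML tags can be sketched roughly as follows. This is only a minimal illustration using Python's standard `html.parser`, assuming that everything inside a `<table>` element counts as "table" and everything else counts as "text". The class name `TextTableSplitter`, the function `split_text_and_tables`, and the sample HTML are hypothetical and not taken from my actual pipeline.

```python
from html.parser import HTMLParser


class TextTableSplitter(HTMLParser):
    """Collect text inside <table> elements separately from all other text."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0    # > 0 while inside one or more <table> tags
        self.text_chunks = []   # text found outside any table
        self.table_chunks = []  # text found inside a table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_depth += 1

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth > 0:
            self.table_depth -= 1

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk:
            return  # skip whitespace-only runs between tags
        if self.table_depth > 0:
            self.table_chunks.append(chunk)
        else:
            self.text_chunks.append(chunk)


def split_text_and_tables(html):
    """Return (text_chunks, table_chunks) for one HTML document."""
    parser = TextTableSplitter()
    parser.feed(html)
    parser.close()
    return parser.text_chunks, parser.table_chunks


# Hypothetical sample, not a real filing.
sample = (
    "<p>Fleetwood was the Company's largest customer in 2004.</p>"
    "<table><tr><td>Customer</td><td>Share of sales</td></tr></table>"
)
text_chunks, table_chunks = split_text_and_tables(sample)
```

Real filings are messier than this sample (nested tables, tables used only for page layout), so the depth counter is just a starting point.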

The methods for dealing with them, however, are similar: TEXT CLASSIFICATION

*GOAL 1 (sentence form):

Classifying each sentence as relevant or not relevant to the customer information.

Example Sentences:

Net sales to the Company’s three major customers, Staples, Inc., Office Max, and United Stationers, Inc., represented approximately 43% in 2004, 46% in 2003 and 46% in 2002.

For fiscal 2003, Fujitsu accounted for approximately 31 percent of our consolidated accounts receivable and approximately 13 percent of our consolidated gross sales.

In 2004, Matyep in Mexico represented 11.0% of our consolidated revenues and Burlington Resources Inc. represented 10.1%.

Fleetwood was the Company’s largest customer in 2004, representing approximately 31% of total sales.

I hoped that there would be some rules or sentence structures that could cover all of the customer information in a 10-K. I tried finding those rules manually and ended up with 24 kinds of sentences. Although they help me find every sentence that contains customer information listed in the Compustat data (used as a reference point throughout my research), some of the sentences matched by those 24 rules have nothing to do with the revenue information.
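To show what such a rule looks like and why it over-matches, here is a small sketch. The rule below is hypothetical and is not one of my 24 rules: it simply keeps a sentence if it contains a customer-related cue word and a percentage. The names `CUE`, `PERCENT`, and `is_candidate` are made up for this example, and the last two sentences are invented.

```python
import re

# Hypothetical rule (NOT one of the 24 rules): a cue word plus a percentage.
CUE = re.compile(
    r"\b(customers?|accounted for|represent(?:ed|ing|s)?)\b", re.IGNORECASE
)
PERCENT = re.compile(r"\d+(?:\.\d+)?\s*(?:%|percent\b)", re.IGNORECASE)


def is_candidate(sentence):
    """Return True if the sentence matches both parts of the rule."""
    return bool(CUE.search(sentence)) and bool(PERCENT.search(sentence))


sentences = [
    "Fleetwood was the Company's largest customer in 2004, "
    "representing approximately 31% of total sales.",
    "For fiscal 2003, Fujitsu accounted for approximately 31 percent of our "
    "consolidated accounts receivable and approximately 13 percent of our "
    "consolidated gross sales.",
    # Invented sentence: matches the rule but says nothing about customers.
    "Our effective tax rate represented 35% of pre-tax income.",
    # Invented sentence: no cue word and no percentage.
    "The Company was incorporated in Delaware in 1987.",
]

candidates = [s for s in sentences if is_candidate(s)]
```

The third sentence passes the rule even though it is about taxes, which is exactly the kind of irrelevant match I describe above.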

To get rid of those irrelevant sentences, I adopted machine learning techniques.
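As a rough idea of what the classification step does, here is a toy sketch. It is not my actual model: I list spaCy and Scikit-Learn as tools, but this example swaps in a tiny multinomial Naive Bayes written with the standard library only, so that it runs without any installation. The four training sentences and their labels (1 = relevant, 0 = irrelevant) are invented for illustration, and `tokenize`, `train`, and `predict` are hypothetical names.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def train(labeled):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = {0: Counter(), 1: Counter()}
    doc_counts = Counter()
    for sentence, label in labeled:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(sentence))
    vocab = set(word_counts[0]) | set(word_counts[1])
    return word_counts, doc_counts, vocab


def predict(model, sentence):
    """Return the label with the higher log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (0, 1):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for token in tokenize(sentence):
            if token not in vocab:
                continue  # ignore words never seen in training
            score += math.log(
                (word_counts[label][token] + 1) / (total_words + len(vocab))
            )
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy annotations: 1 = customer information, 0 = irrelevant.
training = [
    ("Fujitsu accounted for approximately 13 percent of our consolidated gross sales.", 1),
    ("Fleetwood was our largest customer, representing approximately 31% of total sales.", 1),
    ("Our effective tax rate represented 35% of pre-tax income.", 0),
    ("Interest expense accounted for 4 percent of total operating costs.", 0),
]
model = train(training)

relevant = predict(model, "Burlington Resources Inc. represented 10.1% of our consolidated revenues.")
irrelevant = predict(model, "Income tax expense was 35% of pre-tax income.")
```

With only four training sentences this proves nothing about accuracy. The point is the workflow: annotated sentences go in, and a model that separates relevant from irrelevant candidates comes out.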

*Annotation (GOLD-STANDARD):

-Prodigy

*spaCy

 

*Scikit-Learn

 

 

 

 
