As a multinational mass media and information firm, Thomson Reuters delivers quality news and the latest stories to the world, driven by intelligent technologies. With a large volume of news published each day, categorizing it manually is resource-intensive. As a leader in the information technology field, Thomson Reuters (TR) places a strong emphasis on Artificial Intelligence to harness the world’s content.
Thomson Reuters is challenging you today to leverage machine learning and natural language processing to build an algorithm that automatically classifies news into different categories. To earn more points, we encourage you to take on the bonus problem: build a headline summarizer that generates a news headline from the news body, which might require Deep Learning.
Given news headlines, build a model to classify them into one of the three news categories.
For problem 1, please use only train1.csv and test1.csv under the 1-Title_Classification folder for model building and prediction. You are not allowed to use any other training or testing datasets.
(I) Training dataset: 7916 rows, 3 columns
ID - unique identifier for each news item
TITLE - news headline
TOPIC - one of the three topics (0/1/2) (our y label)
(II) Testing dataset: 3392 rows, 3 columns
ID - unique identifier for each news item
TITLE - news headline
TOPIC - your predicted result (None)
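As a starting point for problem 1, the sketch below shows one possible baseline: a multinomial Naive Bayes classifier over headline tokens, written with the Python standard library only. It is not the organizers' reference solution. The function names (`tokenize`, `train_nb`, `predict_nb`), the smoothing parameter `alpha`, and the toy headlines are all assumptions made for illustration; in practice the titles and topics would be read from the TITLE and TOPIC columns of train1.csv.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title):
    """Lowercase a headline and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def train_nb(titles, topics):
    """Collect per-topic document counts and per-topic word counts."""
    doc_counts = Counter(topics)
    word_counts = defaultdict(Counter)
    vocab = set()
    for title, topic in zip(titles, topics):
        tokens = tokenize(title)
        word_counts[topic].update(tokens)
        vocab.update(tokens)
    return {"doc_counts": doc_counts, "word_counts": word_counts, "vocab": vocab}


def predict_nb(model, title, alpha=1.0):
    """Return the topic with the highest add-alpha smoothed log-posterior."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocab"])
    best_topic, best_score = None, -math.inf
    for topic, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][topic]
        denom = sum(counts.values()) + alpha * vocab_size
        score = math.log(n_docs / total_docs)  # log prior
        for token in tokenize(title):
            if token in model["vocab"]:  # skip words never seen in training
                score += math.log((counts[token] + alpha) / denom)
        if score > best_score:
            best_topic, best_score = topic, score
    return best_topic


# Toy data invented for illustration; the real labels come from train1.csv.
toy_titles = [
    "COFFEE PRODUCTION MAY FALL",
    "COFFEE EXPORTS RISE",
    "BANK RAISES INTEREST RATE",
    "BANK CUTS RATE",
]
toy_topics = [2, 2, 0, 0]
toy_model = train_nb(toy_titles, toy_topics)
print(predict_nb(toy_model, "COFFEE PRODUCTION RISE"))  # prints 2
```

A baseline like this mainly gives a reference score to beat; stronger feature extraction or a different model family may well perform better on the real data.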
Given news bodies, build a model to generate their titles.
For problem 2, please use only train2.csv and test2.csv under the 2-Title_Summarization folder for model building and prediction. You are not allowed to use any other training or testing datasets.
(I) Training dataset: 17142 rows, 3 columns
ID - unique identifier for each news item
BODY - news content
TITLE - news headline (our y label)
(II) Testing dataset: 1904 rows, 3 columns
ID - unique identifier for each news item
BODY - news content
TITLE - your predicted summary (None)
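Before investing in a learned model for problem 2, it can help to have a trivial reference point. The sketch below is a "lead" baseline, which is a swapped-in heuristic and not a trained summarizer: it takes the first sentence of the body and truncates it to a fixed number of words. The function name `lead_headline`, the `max_words` default, and the sample sentence are assumptions for illustration only.

```python
import re


def lead_headline(body, max_words=10):
    """Return the first sentence of the body, cut to at most max_words words."""
    text = " ".join(body.split())  # collapse newlines and repeated spaces
    # Naive sentence split on ., ! or ? followed by whitespace.
    first_sentence = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]
    words = first_sentence.split()[:max_words]
    return " ".join(words).rstrip(".,;:")


# Sample text invented for illustration.
print(lead_headline("Coffee output may fall this year. Officials said so."))
# prints: Coffee output may fall this year
```

The sentence split is deliberately naive and will break on abbreviations such as "U.S."; a trained model would be expected to do noticeably better than this heuristic.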
(I) You can use any one of five programming languages (Python, R, Java, C, C++) for this competition.
(II) Zip your code and predicted result files, formatted as shown below, and send the archive to thomsonreuters_GHC@thomsonreuters.com with the title firstname-lastname-challenge2 (for example, 'bill-smith-challenge2').
Problem 1 result file:

ID | TITLE | TOPIC |
---|---|---|
0 | INDONESIAN COFFEE PRODUCTION MAY FALL THIS YEAR | 2 |
1 | INTERNATIONAL BUSINESS MACHINE CORP | 0 |
... | ......... | ... |

Problem 2 result file:

ID | BODY | TITLE |
---|---|---|
0 | Jill Considine, New York State................. | headline generated by my machine |
... | ......... | ................................ |
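
The result layout shown above can be produced with Python's built-in `csv` module. The following is a minimal sketch for the problem 1 file, assuming a comma-separated file with a header row; the helper name `write_submission` and the output path are hypothetical, and the two sample rows are the ones shown in the table.

```python
import csv


def write_submission(path, ids, titles, topics):
    """Write ID, TITLE, TOPIC rows to a CSV file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["ID", "TITLE", "TOPIC"])
        writer.writerows(zip(ids, titles, topics))


if __name__ == "__main__":
    # Hypothetical output path; use your own firstname-lastname prefix.
    write_submission(
        "firstname-lastname-result1.csv",
        [0, 1],
        [
            "INDONESIAN COFFEE PRODUCTION MAY FALL THIS YEAR",
            "INTERNATIONAL BUSINESS MACHINE CORP",
        ],
        [2, 0],
    )
```

Using `csv.writer` instead of manual string joins keeps headlines that contain commas or quotes correctly escaped.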
(I) Problem 1 (100 points)
You will get up to 100 points in total, based on the accuracy rate of the submission CSV (firstname-lastname-result1.csv). The formula is:
Accuracy rate = (number of correctly predicted headlines / total number of test headlines) × 100%
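
The accuracy formula can be computed as in the sketch below. Because the test file carries no TOPIC labels, one option is to apply it to a held-out portion of train1.csv; that usage, the function name `accuracy_rate`, and the sample label lists are assumptions for illustration.

```python
def accuracy_rate(predicted, actual):
    """Return the percentage of positions where predicted equals actual."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Invented labels: three of the four predictions match.
print(accuracy_rate([2, 0, 1, 1], [2, 0, 0, 1]))  # prints 75.0
```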
We will review your code to check for plagiarism. If we find a high similarity between your code and other participants' code, or code published online (for example on GitHub or Kaggle), you will not earn points.
(II) Problem 2 (50 points)
This is an optional bonus problem. You will get up to 50 points based on a code review and a results review. The review committee at Thomson Reuters will determine the score for Problem 2.
If we find a high similarity between your code and other participants' code, or code published online (for example on GitHub or Kaggle), you will not earn points.