Spam and hyperlink analysis

Task:

Develop a prediction model for web spam and hyperlink analysis, designed and trained on the provided data to achieve specific prediction goals.

Solution:

We built a spam/non-spam prediction model for a link analysis company. Because the input data was more than 70 GB in size, we used big data methods to process it. The data contained many text features, which we preprocessed with TF-IDF, Word2Vec, and feature selection methods. Columns in date format were converted to timestamps, and the lifetime of each page was extracted as a feature. As a result, the model outputs a predicted percentage for each class: Non-spam, Page Spam, and Domain Spam.
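
The two preprocessing steps mentioned above, date conversion with page lifetime extraction and TF-IDF weighting, can be illustrated with a minimal sketch. It is not the project code: the function names, the "%Y-%m-%d" date format, and the toy tokens are assumptions made for illustration, and it uses only the Python standard library rather than a big data framework.

```python
import math
from collections import Counter
from datetime import datetime, timezone


def to_timestamp(date_text, fmt="%Y-%m-%d"):
    """Parse a date string and return a UTC Unix timestamp in seconds.

    The date format is an assumption; the real data may use another one.
    """
    parsed = datetime.strptime(date_text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


def page_life_days(first_seen, last_seen, fmt="%Y-%m-%d"):
    """Return the page lifetime in whole days between two date strings.

    'first_seen' and 'last_seen' are hypothetical column names.
    """
    seconds = to_timestamp(last_seen, fmt) - to_timestamp(first_seen, fmt)
    return seconds // 86400


def tf_idf(documents):
    """Compute plain TF-IDF weights for a list of tokenized documents.

    Uses tf = count / document length and idf = log(N / document frequency),
    the textbook form without smoothing.
    """
    n_docs = len(documents)
    # Count in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


if __name__ == "__main__":
    print(page_life_days("2020-01-01", "2020-01-31"))
    print(tf_idf([["buy", "cheap"], ["buy", "news"]]))
```

A term that occurs in every document, such as "buy" in the toy input, receives a weight of zero, which is why TF-IDF tends to suppress words that do not help separate spam from non-spam pages.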