Abstract:
With the increasing availability of images on the web, identifying image-related sentences has become an important problem. The problem also matters to the news publishing community, where it supports automatic captioning of news images and summarization. Although a large body of research has been devoted to image captioning, it remains a challenging problem. Previous work on image captioning mostly focuses on generating new captions for images. This thesis is the first to address the problem of identifying image-related sentences in news articles. Our approach is novel because we do not generate a caption from scratch; instead, we select the most appropriate set of sentences for the image from the news text itself. This preserves the relationship between the news article and the image caption. We used the CNN news dataset, which contains only the text of the news articles, as a basis and augmented it by collecting the images that accompany the articles. We generated a two-class ground truth for the images and sentences of each news article using Tf-Idf and Word2Vec vectors together with cosine and SEMILAR sentence-to-sentence similarity methods. We used HOG and BOVW image descriptors and Word2Vec text features. We implemented Naive Bayes, k-NN and Random Forest classifiers to measure the performance of the proposed system. We also applied PCA dimensionality reduction to the image features to evaluate the effect of giving image and text features equal weight, and we conducted experiments to address the unbalanced distribution of the two classes. The experimental results show that the Naive Bayes classifier with HOG features performs better than the other combinations we evaluated.
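As an illustration of the ground-truth step, the following is a minimal sketch of two-class labeling with Tf-Idf vectors and cosine similarity. It rests on assumptions that the abstract does not state: each sentence is compared against a reference text for the image (for example, its original caption), a smoothed IDF variant is used, and a sentence is marked as related when its similarity reaches a hypothetical threshold of 0.2. The function names `tfidf_vectors`, `cosine` and `label_sentences` are illustrative only and do not come from the thesis.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse Tf-Idf vector (a dict) per document.

    Uses raw term counts and a smoothed IDF, log((1 + n) / (1 + df)) + 1.
    This weighting is an assumption, not the thesis's stated formula.
    """
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: count * idf[t] for t, count in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def label_sentences(reference, sentences, threshold=0.2):
    """Label each sentence 1 (image related) or 0 (not related).

    `reference` is the assumed reference text for the image, and
    `threshold` is a hypothetical cut-off chosen for illustration.
    """
    vectors = tfidf_vectors([reference] + list(sentences))
    ref_vec = vectors[0]
    return [1 if cosine(ref_vec, v) >= threshold else 0 for v in vectors[1:]]
```

Calling `label_sentences(caption, article_sentences)` returns one label per sentence. In practice the threshold would have to be tuned, and the Word2Vec and SEMILAR similarities mentioned above would replace or complement the Tf-Idf score.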