Abstract:
Quantitative Analysis of Literature (QAL) [1] is a rapidly growing field concerned with the numerical analysis of literary texts. Our motivation in this thesis is to classify the literary works of authors based on their choice of words. We examine three questions: whether a given set of literary documents can be classified by author, whether the documents can be classified by genre, and whether, by treating the speech of individual protagonists as texts, we can infer something about the protagonists themselves. These are novel questions in the field of QAL [2]. Specifically, we would like to know whether we can say something about literary texts by looking only at the occurrences of words in them. Methodologically, we use two main approaches: Probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA). Both use the Expectation-Maximization (EM) method, in different settings but for the same purpose. We study plays by Shakespeare drawn from the three genres: History plays, Comedies and Tragedies. Taking a collection of plays from all three genres, we apply LDA to classify the texts and find that the classification by and large follows the genre. We next focus on individual plays. Here we treat the lines uttered by each character of a play as a document and apply the LDA approach to classify the documents, and hence the protagonists. We carry out this classification for four plays (Hamlet, Othello, Macbeth and Richard III). Although the resulting classifications of the characters are not always what one would expect from a reading of the plays, they are not entirely without meaning either.
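As an illustration of the document representation described above, the following minimal sketch merges the lines of each character into one document and reduces it to word counts, which is the kind of input that pLSA or LDA operates on. It is not the code used in the thesis: the toy `LINES` input, the tokenizer, and the helper names `tokenize`, `speaker_documents` and `count_matrix` are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical toy input: (speaker, line) pairs. A real play would first
# have to be parsed into this form; that parsing step is not shown here.
LINES = [
    ("HAMLET", "To be, or not to be, that is the question"),
    ("OPHELIA", "My lord, I have remembrances of yours"),
    ("HAMLET", "Get thee to a nunnery"),
]


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def speaker_documents(lines):
    """Merge every line spoken by a character into one word-count document."""
    docs = defaultdict(Counter)
    for speaker, line in lines:
        docs[speaker].update(tokenize(line))
    return dict(docs)


def count_matrix(docs):
    """Return (speakers, vocab, rows), where rows[i][j] counts word j for speaker i."""
    speakers = sorted(docs)
    vocab = sorted({word for counts in docs.values() for word in counts})
    rows = [[docs[s][w] for w in vocab] for s in speakers]
    return speakers, vocab, rows


docs = speaker_documents(LINES)
speakers, vocab, rows = count_matrix(docs)
```

Each row of `rows` is the word-occurrence vector of one character. Word order is discarded at this step, which matches the stated aim of looking only at the occurrences of words.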