Topic Modelling with Latent Dirichlet Allocation aka LDA

Topic modeling is a technique for extracting hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular topic modeling algorithm with an excellent implementation in Python's Gensim package. The challenge, however, is extracting good-quality topics that are clear, well separated, and meaningful. This depends heavily on the quality of the text preprocessing and on the strategy used to find the optimal number of topics. This tutorial attempts to tackle both of these problems.

Contents

1. Introduction
2. Prerequisites: Download nltk stopwords and spacy model
3. Import Packages
4. What does LDA do?
5. Prepare Stopwords
6. Import Newsgroups Data
7. Remove emails and newline characters
8. Tokenize words and Clean-up text
9. Creating Bigram and Trigram Models
10. Remove Stopwords, Make Bigrams and Lemmatize
11. Create the Dictionary and Corpus needed for Topic Modeling
12. Building the Topic Model
13. View the topics in LDA model
14. Compute