Hi. When you are building a model or application, it’s always important to know what tools you have at hand. I spend a lot of time searching for the right package for each task. As an NLP developer, the special reason I love Python is that almost all of the tremendous work done in the field of NLP is made available in Python.
In this article, I will walk you through the various NLP modules available in Python, most of which I have worked with previously, to help you build your NLP models hassle-free.
Though we sometimes have datasets ready and just need to build models, in most cases you will need to collect your own data. Web scraping involves digging through web pages and extracting the information you need, where the web page allows you to.
While Scrapy and Beautiful Soup help you extract data from given web pages, Selenium is useful when you need to perform interactions such as filling forms and clicking buttons.
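To illustrate the core idea behind extraction, here is a minimal sketch using only Python’s built-in html.parser module on a hypothetical page snippet; Beautiful Soup and Scrapy offer far friendlier APIs for the same job, and Selenium would come in when the page needs interaction first.

```python
from html.parser import HTMLParser

# Toy HTML standing in for a scraped response body.
HTML = """
<html><body>
  <h2 class="title">Python NLP tools</h2>
  <a href="/scrapy">Scrapy</a>
  <a href="/bs4">Beautiful Soup</a>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

parser = LinkExtractor()
parser.feed(HTML)
print(parser.links)  # ['/scrapy', '/bs4']
```

With Beautiful Soup the whole class above collapses to a one-line `soup.find_all("a")`, which is exactly why these libraries are worth reaching for.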
Data Reading, Writing, Analysis and Manipulation
Apart from standard file I/O operations, Python’s standard library provides a built-in csv module to handle CSV files. You can also use openpyxl and python-docx to handle Excel sheets and Word documents respectively. Pandas, however, is much more sophisticated when it comes to data handling: it lets you handle missing data, merge datasets, group rows, and more.
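As a quick taste of the built-in csv module, the sketch below reads a small in-memory CSV (standing in for a file on disk) into dictionaries keyed by the header row; Pandas’ `read_csv` would give you a full DataFrame with the richer operations mentioned above.

```python
import csv
import io

# In-memory CSV standing in for a file on disk.
raw = "text,label\nI loved it,pos\nTerrible plot,neg\n"

# DictReader maps each row to a dict keyed by the header line.
rows = list(csv.DictReader(io.StringIO(raw)))
print(rows[0]["text"])   # I loved it
print(rows[1]["label"])  # neg
```

For a real file you would pass `open("data.csv", newline="")` instead of the `StringIO` wrapper.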
Whether you build statistical or neural models, tokenization, POS tagging and several other preprocessing tasks are necessary. Below are some of the tools I turn to, depending on the task and the language I need to work on.
Since I always wished a table like the one below existed, so that I could choose whichever tool or combination of tools suits me best, I thought I’d put one together here.
NOTE: The number of languages supported by the above packages may depend on the task you are carrying out.
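To make the tokenization step concrete, here is a naive regex-based word tokenizer in plain Python. This is only a sketch of the idea; libraries like NLTK and spaCy ship linguistically informed tokenizers that handle contractions, abbreviations and many languages far better than this.

```python
import re

def tokenize(text):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print(tokenize("Don't panic, it's only NLP!"))
# ['don', "'", 't', 'panic', ',', 'it', "'", 's', 'only', 'nlp', '!']
```

Notice how crudely it splits "Don't"; proper tokenizers keep such cases intact, which is exactly why you would reach for a dedicated package.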
Build your own NLP model in Python
Though pretrained models work well for us most of the time, you might still want to build custom models for various NLP tasks. Below are some of the best-known machine learning frameworks out there.
Of the frameworks above, all except spaCy go beyond natural language. Scikit-Learn is meant for developing machine learning models that don’t involve deep learning, while TensorFlow and PyTorch are deep learning frameworks. Keras provides an API built on TensorFlow 2.0 that makes deep learning code much easier to write. For an in-depth understanding of these frameworks, you can refer to the article Top Machine Learning Frameworks in 2020 here.
spaCy, however, is built exclusively for NLP. It lets you build state-of-the-art NLP models in the language of your choice. Since TensorFlow, PyTorch, Keras and spaCy involve deep learning training, they provide GPU support as well.
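To show the kind of statistical model Scikit-Learn abstracts away, here is a toy bag-of-words Naive Bayes text classifier written from scratch on made-up training data. In practice you would use Scikit-Learn’s `CountVectorizer` plus `MultinomialNB` and get this (and much more) in a few lines.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up sentiment dataset, purely for illustration.
train = [
    ("good great fun", "pos"),
    ("great acting good plot", "pos"),
    ("bad boring plot", "neg"),
    ("boring bad acting", "neg"),
]

word_counts = defaultdict(Counter)  # class -> word frequencies
class_totals = Counter()            # class -> number of documents
vocab = set()
for text, label in train:
    tokens = text.split()
    word_counts[label].update(tokens)
    class_totals[label] += 1
    vocab.update(tokens)

def predict(text):
    """Pick the class with the highest log-probability, with add-one smoothing."""
    best, best_score = None, float("-inf")
    for label in class_totals:
        score = math.log(class_totals[label] / sum(class_totals.values()))
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in text.split():
            score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

print(predict("good fun plot"))  # pos
```

The deep learning frameworks above replace these hand-built counts with learned dense representations, but the train-then-predict workflow stays the same.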
Embeddings are representations of a given text in a vector space. Unlike the count vectorizer and TF-IDF vectorizer, embeddings are not about word frequencies; they are generated using deep learning techniques to capture the semantic relationship between a word and other words or groups of words.
Static Word Embeddings — Gives a static representation of a given word, without taking context into account. FastText has been the most promising in this category so far.
Dynamic (Contextualized) Word Embeddings — Gives a vector representation of a given word, calculated based on its current context.
While all the above four are widely used to get contextualized word embeddings, XLNet has outperformed the other algorithms.
Static Sentence Embeddings — Gives a static representation of a given phrase or sentence, without taking context into account.
Dynamic (Contextualized) Sentence Embeddings — Gives a vector representation of a given phrase or sentence, calculated based on the current context.
Apart from dedicated sentence embedding models like the above, it is also common practice to fine-tune word embedding models to obtain sentence embeddings.
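One classic cheap baseline for turning word embeddings into a sentence embedding is simply averaging the word vectors. The sketch below uses hypothetical 3-dimensional vectors purely for illustration; real models like FastText produce vectors with hundreds of dimensions learned from large corpora.

```python
import math

# Hypothetical toy word vectors, NOT from a real model.
word_vectors = {
    "dog":  [0.9, 0.1, 0.0],
    "cat":  [0.7, 0.2, 0.1],
    "runs": [0.1, 0.9, 0.1],
    "sits": [0.2, 0.8, 0.1],
}

def sentence_embedding(sentence):
    """Average the word vectors of known words: a common static baseline."""
    vecs = [word_vectors[w] for w in sentence.split() if w in word_vectors]
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

s1 = sentence_embedding("dog runs")
s2 = sentence_embedding("cat sits")
print(round(cosine(s1, s2), 3))  # close to 1.0: semantically similar sentences
```

Contextualized models improve on this by letting the same word get different vectors in different sentences, which simple averaging cannot do.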
Knowledge Graph (KG)
A knowledge graph is a collection of entities connected through relations. DBpedia is a well-known knowledge graph built using data from Wikipedia. When your data is heavily interlinked, a traditional database may be awkward to work with, especially when retrieving information. This is when you would want to store your data in a KG.
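At its simplest, a knowledge graph is just a set of (subject, relation, object) triples, and queries are pattern matches over them. The toy in-memory sketch below (with made-up facts) illustrates the idea that dedicated stores like Neo4j and Grakn implement at scale, with indexing, schemas and inference on top.

```python
# A knowledge graph as (subject, relation, object) triples -- toy facts only.
triples = {
    ("DBpedia", "built_from", "Wikipedia"),
    ("Berlin", "capital_of", "Germany"),
    ("Germany", "located_in", "Europe"),
}

def query(subject=None, relation=None, obj=None):
    """Return every triple matching the given (possibly partial) pattern."""
    return [
        (s, r, o) for (s, r, o) in sorted(triples)
        if subject in (None, s) and relation in (None, r) and obj in (None, o)
    ]

print(query(relation="capital_of"))  # [('Berlin', 'capital_of', 'Germany')]
```

In Neo4j the same query would be a Cypher pattern such as `MATCH (a)-[:CAPITAL_OF]->(b) RETURN a, b`, matched against a persistent, indexed graph rather than a Python set.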
Neo4j provides a DB interface to store and query your data, without having to worry about creating a specific schema. It uses Cypher Query language.
Grakn, which also provides a DB interface, is more systematic and offers functionality such as defining custom rules and inheriting from other node classes. Grakn uses Graql for querying, which performs inference by deductive reasoning over patterns and relationships in the data.
AmpliGraph, on the other hand, is not for storing data. It gives you neural network models that learn from your existing knowledge graph to find missing relations, and it also generates stand-alone KG embeddings.
The NLP community has been growing rapidly, helping each other along by providing easy-to-use Python modules. It’s always fun to explore the work done in the field, but it also helps to have a starting point. This article acts as a sort of cheat sheet, especially for NLP beginners, pointing you to the right tools for your tasks. Of course, there are many more tools and modules out there that you can explore anytime. Have fun coding!
* This article is attributed to my colleague Neha.