Is TextBlob an appropriate choice for building a sentiment analysis dataset from scratch?

If we want to create a sentiment analysis dataset from scratch on some topic, how reliable (in terms of label quality) is it to use TextBlob to generate the target variable for each sentence? I am thinking of the tradeoff between the amount of manual work needed to annotate the dataset from scratch and the quality of the classification TextBlob provides. Any help would be appreciated.

This largely depends on the type of data you take the sentences (or text blobs) from when annotating a sentiment for each example.

For instance, a Wikipedia article dump might not be the best source for a sentiment classification dataset: most of the content is formal, so labeling it would mostly yield neutral sentiment (and hence a dataset with class imbalance).

Better sources would be:

  • Some subreddits related to your domain of interest
  • Tweets from accounts in the relevant domain
  • More generally, news articles (or article headings)
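To see whether a given source ends up mostly neutral, one rough check is to bucket polarity scores into labels and look at the class shares. The sketch below assumes the scores lie in [-1, 1], as the polarity returned by `TextBlob(text).sentiment.polarity` does. The function names, the 0.1 threshold and the sample scores are made up for illustration, so tune them on your own data.

```python
from collections import Counter

LABELS = ("negative", "neutral", "positive")

def label_from_polarity(polarity, threshold=0.1):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an arbitrary assumption, not a recommended value.
    """
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

def class_shares(polarities, threshold=0.1):
    """Return the fraction of examples that fall into each label."""
    labels = [label_from_polarity(p, threshold) for p in polarities]
    total = len(labels)
    if total == 0:
        return {label: 0.0 for label in LABELS}
    counts = Counter(labels)
    return {label: counts[label] / total for label in LABELS}

# Hypothetical scores standing in for real polarity values.
scores = [0.0, 0.05, -0.02, 0.0, 0.6, 0.0, -0.4, 0.08]
shares = class_shares(scores)
print(shares)
```

If one class takes up most of the share, as neutral does in this toy input, the source is probably a poor fit or the threshold needs adjusting.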

Not an expert here, just my 2 cents :)
The best way is to try it and find out what works for you.
Others can probably share their thoughts too.


Thanks. So basically, even if I start scraping tweets, annotating the corresponding target variable remains a tiresome task (say, for a dataset of 50K or more tweets), and the labels also depend on the annotator's perception.
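One possible middle ground, offered only as a sketch: label everything automatically, annotate a small random sample by hand, and measure how well the two agree before trusting the automatic labels. The code below computes plain accuracy and Cohen's kappa (agreement corrected for chance). The function name and the six example labels are hypothetical, and the automatic labels could come from any tool, TextBlob included.

```python
from collections import Counter

def agreement(auto_labels, manual_labels):
    """Return (accuracy, Cohen's kappa) for two equally long label lists."""
    if len(auto_labels) != len(manual_labels) or not auto_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(auto_labels)
    # Observed agreement: share of examples where both labels match.
    observed = sum(a == m for a, m in zip(auto_labels, manual_labels)) / n
    # Expected chance agreement, from each side's label frequencies.
    auto_counts = Counter(auto_labels)
    manual_counts = Counter(manual_labels)
    expected = sum(auto_counts[c] * manual_counts[c] for c in auto_counts) / (n * n)
    if expected == 1.0:
        # Both sides used one identical label throughout; kappa is
        # undefined here, so report perfect agreement by convention.
        return observed, 1.0
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical sample: automatic labels vs. hand annotations.
auto = ["pos", "neu", "neg", "neu", "pos", "neu"]
manual = ["pos", "neg", "neg", "neu", "neu", "neu"]
accuracy, kappa = agreement(auto, manual)
print(f"accuracy={accuracy:.3f}, kappa={kappa:.3f}")
```

A few hundred hand-labeled tweets checked this way would give a rough idea of whether the remaining ones can be left to the automatic labeler.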

[In my mind: what is the state of the art in deep unsupervised learning?]