NLP, Gridsearch, Pipelines and Predicting Reddit Topics

This write-up is about Natural Language Processing (NLP), a powerful set of techniques for turning text into numbers that can then be used in modeling. The analysis below uses a Reddit dataset, specifically the text and the title of a Reddit post, to predict what topic, or subreddit, it belongs to. The project is inspired by one of the assignments I had the option to do during the Data Science Immersive course at General Assembly - with some functions straight out of my brilliant teachers’ notes.

The flow is as follows:

  1. Missing values.
  2. Plain NLP with Logistic Regression in a Pipeline
  3. Tweaked NLP with Logistic Regression using Gridsearch
  4. Dimensionality reduction using TruncatedSVD (a PCA variant) and predicting with Random Forest
  5. Looking at Latent Dirichlet Allocation

FURTHER DETAILS:

Missing Values: with 26k rows, I’m dropping any features that have more than 20k rows missing, handpicking ones with balanced values and editing ones with bad values.

NLP techniques used: CountVectorizer takes every unique word and counts it as a feature. (It can be problematic for several reasons: memory, processing power and ambiguity - e.g. LinkedIn seeing 6000+ variations of the title ‘Software Engineer’. There is a technique for dealing with that called ‘stemming’, but it’s fragile.) So what CountVectorizer is doing is counting each word’s occurrences per document and using those raw counts as feature values.
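
As a minimal sketch of what CountVectorizer produces (the three-document corpus here is made up for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer

# a made-up three-document corpus
docs = ['my screen is black', 'black screen after update', 'python list comprehension']

cv = CountVectorizer()
counts = cv.fit_transform(docs)   # sparse matrix: one row per document, one column per unique word
print(cv.get_feature_names())     # e.g. ['after', 'black', 'comprehension', 'is', ...]
print(counts.toarray())           # raw occurrence counts per document
```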

Another approach is to look at the more unique words. Tf-idf highlights what is common in one document but rare across all the others. It gives a weighted value, relative to the other documents, not an absolute count like CountVectorizer does. That’s usually more interesting (and predictive) than words that are common everywhere.
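
And the same toy corpus through Tf-idf - a sketch showing how words shared by several documents get down-weighted relative to words unique to one:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['my screen is black', 'black screen after update', 'python list comprehension']

tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)
# 'screen' and 'black' appear in two of the three documents, so their idf
# (and hence their tf-idf weight) is lower than that of one-off words
print(dict(zip(tfidf.get_feature_names(), tfidf.idf_)))
```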

In addition, I’m using GridSearch to find the best parameters, first in the model and then in the NLP techniques.

Often, the thousands of columns we get from vectorizing each word are not individually informative. Reducing their dimensionality with a PCA-style method can be very helpful. I used TruncatedSVD, a variant of PCA that works on sparse matrices (yet it did not improve the predictive power of my model).
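
A quick sketch of why TruncatedSVD rather than plain PCA here: unlike sklearn’s PCA, it accepts the sparse matrices the vectorizers produce without densifying them first (the matrix below is random stand-in data, not the Reddit features):

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# a random sparse matrix standing in for vectorized text
X_sparse = sparse_random(1000, 5000, density=0.01, random_state=0)

svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X_sparse)       # works directly on sparse input
print(X_reduced.shape)                        # (1000, 50)
print(svd.explained_variance_ratio_.sum())    # how much variance the 50 components keep
```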

Lastly, I looked at Latent Dirichlet Allocation (LDA), an unsupervised technique that groups words that are likely to occur together into topics - but it doesn’t predict labels, and you have to choose the number of topics yourself.

Eventually, bouncing back and forth between optimizing hyperparameters, trying new modelling techniques and working on feature extraction would get me closer to a better predictive model. But this is what I have for now:

import pandas as pd
import numpy as np
df = pd.read_csv('NLP/reddit_posts.csv')
df.shape
(26688, 53)
df.head()
adserver_click_url adserver_imp_pixel archived author author_flair_css_class author_flair_text contest_mode created_utc disable_comments distinguished ... spoiler stickied subreddit subreddit_id third_party_tracking third_party_tracking_2 thumbnail title ups url
0 NaN NaN False johnnyawesome0 NaN NaN False 1480697304 NaN NaN ... False False techsupport t5_2qioo NaN NaN self Help with audio set-up 1.0 https://www.reddit.com/r/techsupport/comments/...
1 NaN NaN False Silverfin113 NaN NaN False 1480697424 NaN NaN ... False False learnprogramming t5_2r7yd NaN NaN self Optimizing code for speed 23.0 https://www.reddit.com/r/learnprogramming/comm...
2 NaN NaN False bookbooksbooks NaN NaN False 1480697613 NaN NaN ... False False gamedev t5_2qi0a NaN NaN self Seeking Tales of Development Woe (and Triumph)... 12.0 https://www.reddit.com/r/gamedev/comments/5g4a...
3 NaN NaN False [deleted] NaN NaN False 1480697634 NaN NaN ... False False learnprogramming t5_2r7yd NaN NaN default [Java] Finding smallest value in an array 0.0 https://www.reddit.com/r/learnprogramming/comm...
4 NaN NaN False caffeine_potent NaN NaN False 1480697748 NaN NaN ... False False learnpython t5_2r8ot NaN NaN self currying functions using functools 6.0 https://www.reddit.com/r/learnpython/comments/...

5 rows × 53 columns

df.isnull().sum()
adserver_click_url        26688
adserver_imp_pixel        26688
archived                      0
author                        0
author_flair_css_class    26253
author_flair_text         26337
contest_mode                  0
created_utc                   0
disable_comments          26688
distinguished             26603
domain                        0
downs                         0
edited                        0
gilded                        0
hide_score                    0
href_url                  26688
id                            0
imp_pixel                 26688
is_self                       0
link_flair_css_class      22396
link_flair_text           22078
locked                        0
media                     26420
media_embed                   0
mobile_ad_url             26688
name                          0
num_comments                  0
original_link             26688
over_18                       0
permalink                     0
post_hint                 23175
preview                   23175
promoted                  26688
promoted_by               26688
promoted_display_name     26688
promoted_url              26688
quarantine                    0
retrieved_on                  0
saved                         0
score                         0
secure_media              26420
secure_media_embed            0
selftext                      0
spoiler                       0
stickied                      0
subreddit                     0
subreddit_id                  0
third_party_tracking      26688
third_party_tracking_2    26688
thumbnail                     0
title                         0
ups                           0
url                           0
dtype: int64
missing_cols = df.isnull().sum()[df.isnull().sum() > 20000].index
missing_cols
Index(['adserver_click_url', 'adserver_imp_pixel', 'author_flair_css_class',
       'author_flair_text', 'disable_comments', 'distinguished', 'href_url',
       'imp_pixel', 'link_flair_css_class', 'link_flair_text', 'media',
       'mobile_ad_url', 'original_link', 'post_hint', 'preview', 'promoted',
       'promoted_by', 'promoted_display_name', 'promoted_url', 'secure_media',
       'third_party_tracking', 'third_party_tracking_2'],
      dtype='object')
df.drop(list(missing_cols), axis=1, inplace=True)
df.head()
archived author contest_mode created_utc domain downs edited gilded hide_score id ... secure_media_embed selftext spoiler stickied subreddit subreddit_id thumbnail title ups url
0 False johnnyawesome0 False 1480697304 self.techsupport 0.0 False 0.0 False 5g49s2 ... {} I have a Sony surround sound system for a blu-... False False techsupport t5_2qioo self Help with audio set-up 1.0 https://www.reddit.com/r/techsupport/comments/...
1 False Silverfin113 False 1480697424 self.learnprogramming 0.0 False 0.0 False 5g4a5p ... {} I've written what seems to be a prohibitively ... False False learnprogramming t5_2r7yd self Optimizing code for speed 23.0 https://www.reddit.com/r/learnprogramming/comm...
2 False bookbooksbooks False 1480697613 self.gamedev 0.0 False 0.0 False 5g4att ... {} I'm writing an article called "Video Games Tha... False False gamedev t5_2qi0a self Seeking Tales of Development Woe (and Triumph)... 12.0 https://www.reddit.com/r/gamedev/comments/5g4a...
3 False [deleted] False 1480697634 self.learnprogramming 0.0 1480698462 0.0 False 5g4awr ... {} [deleted] False False learnprogramming t5_2r7yd default [Java] Finding smallest value in an array 0.0 https://www.reddit.com/r/learnprogramming/comm...
4 False caffeine_potent False 1480697748 self.learnpython 0.0 1480709138 0.0 False 5g4bcr ... {} I have the following representation of argumen... False False learnpython t5_2r8ot self currying functions using functools 6.0 https://www.reddit.com/r/learnpython/comments/...

5 rows × 31 columns

# a lot are booleans and objects
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26688 entries, 0 to 26687
Data columns (total 31 columns):
archived              26688 non-null bool
author                26688 non-null object
contest_mode          26688 non-null bool
created_utc           26688 non-null int64
domain                26688 non-null object
downs                 26688 non-null float64
edited                26688 non-null object
gilded                26688 non-null float64
hide_score            26688 non-null bool
id                    26688 non-null object
is_self               26688 non-null bool
locked                26688 non-null bool
media_embed           26688 non-null object
name                  26688 non-null object
num_comments          26688 non-null float64
over_18               26688 non-null bool
permalink             26688 non-null object
quarantine            26688 non-null bool
retrieved_on          26688 non-null float64
saved                 26688 non-null bool
score                 26688 non-null float64
secure_media_embed    26688 non-null object
selftext              26688 non-null object
spoiler               26688 non-null bool
stickied              26688 non-null bool
subreddit             26688 non-null object
subreddit_id          26688 non-null object
thumbnail             26688 non-null object
title                 26688 non-null object
ups                   26688 non-null float64
url                   26688 non-null object
dtypes: bool(10), float64(6), int64(1), object(14)
memory usage: 4.5+ MB
# let's take a closer look and identify candidates for elimination
for col in df.columns:
    print(df[col].value_counts(), '\n')

Going through these one by one, I’m picking which columns are useful for the analysis.

  1. archived: not - too skewed
  2. author: not descriptive, but maybe
  3. contest_mode: not - too skewed
  4. created_utc: timestamp, not informative
  5. domain: yes - might be used as target, self referring to posts not linking to another website
  6. downs: not - no values?
  7. edited: maybe - not descriptive, but leave for now
  8. gilded: not - too skewed
  9. hide_score: not - too skewed
  10. id: not descriptive
  11. is_self: maybe but skewed
  12. locked: not - too skewed
  13. media_embed: not - no idea
  14. name: not - too unique
  15. num_comments: yes
  16. over_18: not - too skewed
  17. permalink: not
  18. retrieved_on: not
  19. saved: not - too skewed
  20. score: yes
  21. secure_media_embed: not
  22. selftext: yes, the actual content (removed / deleted - ?)
  23. spoiler: not - too skewed
  24. stickied: not - too skewed
  25. subreddit: yes - might be used as target
  26. subreddit_id: not - redundant with subreddit (above)
  27. thumbnail: not
  28. title: yes
  29. ups: not - same as score?
  30. url: maybe

I hear from practitioners that it’s normal to throw away 50-80% of a dataset, as most of it is extra noise we don’t need anyway.

drop_for_sure = [
    'archived', 'contest_mode', 'created_utc', 'downs', 'gilded', 'hide_score', 'id', 'is_self', 'locked', 
    'media_embed', 'name', 'over_18', 'permalink', 'retrieved_on', 
    'saved', 'secure_media_embed', 'spoiler', 'stickied', 
    'subreddit_id', 'thumbnail', 'ups', 'url'
]
df.drop(drop_for_sure, axis=1, inplace=True)
df.head()
author domain edited num_comments quarantine score selftext subreddit title
0 johnnyawesome0 self.techsupport False 1.0 False 1.0 I have a Sony surround sound system for a blu-... techsupport Help with audio set-up
1 Silverfin113 self.learnprogramming False 8.0 False 23.0 I've written what seems to be a prohibitively ... learnprogramming Optimizing code for speed
2 bookbooksbooks self.gamedev False 5.0 False 12.0 I'm writing an article called "Video Games Tha... gamedev Seeking Tales of Development Woe (and Triumph)...
3 [deleted] self.learnprogramming 1480698462 9.0 False 0.0 [deleted] learnprogramming [Java] Finding smallest value in an array
4 caffeine_potent self.learnpython 1480709138 12.0 False 6.0 I have the following representation of argumen... learnpython currying functions using functools

**We can see that there is a ‘[deleted]’ value in a bunch of author and selftext rows. Before dropping those, we’re going to binarize them and take a look at how strongly the two flags are correlated with each other.**

df['author_deleted'] = df['author'].apply(lambda x: 1 if x == '[deleted]' else 0)
df['author_deleted'].value_counts()
0    20741
1     5947
Name: author_deleted, dtype: int64
df['not_useful_selftext'] = df['selftext'].apply(lambda x: 1 if x in ['[deleted]', '[removed]'] else 0)
df['not_useful_selftext'].value_counts()
0    18208
1     8480
Name: not_useful_selftext, dtype: int64

Let’s see if there is an overlap between ‘author_deleted’ and ‘not_useful_selftext’.

df[['author_deleted','not_useful_selftext']].corr()
author_deleted not_useful_selftext
author_deleted 1.000000 0.765296
not_useful_selftext 0.765296 1.000000

There’s quite an overlap, so we’re not going to lose too much data if we drop those.

df = df.loc[(df['author_deleted'] == 0)]
df = df.loc[(df['not_useful_selftext'] == 0)]
df.shape
(18108, 11)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18108 entries, 0 to 26686
Data columns (total 11 columns):
author                 18108 non-null object
domain                 18108 non-null object
edited                 18108 non-null object
num_comments           18108 non-null float64
quarantine             18108 non-null bool
score                  18108 non-null float64
selftext               18108 non-null object
subreddit              18108 non-null object
title                  18108 non-null object
author_deleted         18108 non-null int64
not_useful_selftext    18108 non-null int64
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 1.5+ MB

Nothing’s null anymore!

MODELLING - Quick classification predicting subreddit

But first I will group the subreddits into broader categories that are easier to handle.

df.subreddit.value_counts()
techsupport         9522
learnprogramming    2688
learnpython         1420
gamedev              800
web_design           441
javahelp             419
javascript           401
csshelp              298
Python               274
iOSProgramming       262
linux                225
engineering          213
swift                189
computerscience      124
django               111
PHP                   84
css                   77
java                  75
HTML                  74
ruby                  67
flask                 60
compsci               43
technology            42
cpp                   34
html5                 33
pygame                32
jquery                29
perl                  21
lisp                  14
dailyprogrammer        9
programmer             8
IPython                8
inventwithpython       5
netsec                 4
pystats                2
Name: subreddit, dtype: int64
def subreddit_splitter(subred):
    techsupport = ['techsupport', 'learnprogramming', 'computerscience', 
                  'compsci']
    # note: the data actually contains 'web_design' (not 'web_development'),
    # so those 441 posts end up in 'other' below
    webdev = ['web_development', 'csshelp', 'PHP', 'css', 'HTML', 
              'ruby', 'django', 'flask', 'html5', 'perl']
    python = ['learnpython', 'Python', 'IPython', 'inventwithpython',
             'pystats']
    gamedev = ['gamedev', 'pygame']
    javascript = ['javascript', 'jquery']
    compiled_langs = ['javahelp', 'cpp', 'java', 'lisp']
    if subred in techsupport: 
        return 'techsupport'
    elif subred in webdev:
        return 'webdev'
    elif subred in python:
        return 'python'
    elif subred in gamedev:
        return 'gamedev'
    elif subred in javascript: 
        return 'javascript'
    elif subred in compiled_langs:
        return 'compiled'
    else:
        return 'other'
df['subreds'] = df['subreddit'].apply(subreddit_splitter)
df.subreds.value_counts()
techsupport    12377
python          1709
other           1393
gamedev          832
webdev           825
compiled         542
javascript       430
Name: subreds, dtype: int64
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression
# StratifiedKFold?
# training and holdout set
# with 2 splits, and reliably shuffled data
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=2017)
# as per documentation (the '?' above) 
skf.get_n_splits(df, df['subreds'])
2
# the loop runs once per split; we keep the final split as our train/holdout
for train_idx, test_idx in skf.split(df, df['subreds']):
    X_train, X_test = df.iloc[train_idx, :], df.iloc[test_idx, :]
    y_train, y_test = df['subreds'].iloc[train_idx], df['subreds'].iloc[test_idx]
X_train.shape
(9056, 12)
X_test.shape
(9052, 12)
y_train.shape
(9056,)
y_test.shape
(9052,)

NLP Transformations Pipeline

# this gives us fit and transform methods that
# select a column of a df, ready for use in a pipeline
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[[self.column]]
# this is a stateless transformer that applies a function to whatever it's given
from sklearn.preprocessing import FunctionTransformer
def extract_first_column(x):
    return x.iloc[:, 0]

def to_dense(x):
    return x.todense()

selftext_pipe = make_pipeline(
    # takes a full df and spits out a df w 1 col, selftext 
    FeatureExtractor('selftext'),
    # take that new df shape (9000, 1) and extract the 1st col, while
    # validate=False silences that warning
    FunctionTransformer(extract_first_column, validate=False),
    # tfidf spits out a sparse matrix 
    TfidfVectorizer(stop_words='english'),
    # functionTransformer here takes a sparse matrix and turns it
    # to a dense one
    FunctionTransformer(to_dense, validate=False)
)

title_pipe = make_pipeline(
    FeatureExtractor('title'),
    FunctionTransformer(extract_first_column, validate=False),
    TfidfVectorizer(stop_words='english'),
    FunctionTransformer(to_dense, validate=False)

)

feature_set_extraction = make_union(
    selftext_pipe,
    title_pipe
)

feature_set_extraction.fit(X_train)
FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('featureextractor', FeatureExtractor(column='selftext')), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
          func=<function extract_first_column at 0x1121b66a8>,
          inv_kw_args=None, inverse_func=Non...None,
          inverse_func=None, kw_args=None, pass_y='deprecated',
          validate=False))]))],
       transformer_weights=None)
transformed = feature_set_extraction.transform(X_train)
print(transformed.shape, transformed[0:5])
(9056, 47722) [[ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]
 [ 0.  0.  0. ...,  0.  0.  0.]]

Logistic Regression

pipeline = make_pipeline(
    feature_set_extraction,
    TruncatedSVD(n_components=50),
    LogisticRegression()
)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(pipeline.transform(X_train).shape)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(y_test, predictions))
[[  14    7    1    1   16  231    1]
 [   0  170    0    1    3  242    0]
 [   0    6    2   10   12  182    3]
 [   0    4    0   76    6  596   14]
 [   1   13    0    4  269  561    6]
 [  15   30    3   44  136 5951    9]
 [   0    0    2   16   19  333   42]]
print(classification_report(y_test, predictions))
             precision    recall  f1-score   support

   compiled       0.47      0.05      0.09       271
    gamedev       0.74      0.41      0.53       416
 javascript       0.25      0.01      0.02       215
      other       0.50      0.11      0.18       696
     python       0.58      0.31      0.41       854
techsupport       0.74      0.96      0.83      6188
     webdev       0.56      0.10      0.17       412

avg / total       0.68      0.72      0.66      9052
print(accuracy_score(y_test, predictions))
0.720724701723
print(y_test.value_counts() / y_test.count())
techsupport    0.683606
python         0.094344
other          0.076889
gamedev        0.045957
webdev         0.045515
compiled       0.029938
javascript     0.023752
Name: subreds, dtype: float64

Our model did better than if we had simply predicted the majority class for everything (that would have scored 0.68). Considering there was no hyperparameter optimization, we did well, but there is a lot of room to improve.

Hyperparameter Tuning with Gridsearch:

1/2: Logistic Regression parameters

We’re going to tweak multiple hyperparameters inside the pipeline at once. Let’s first look at all the parameters that exist in our pipeline.

pipeline_params = pipeline.get_params()
LogisticRegression().get_params()
{'C': 1.0,
 'class_weight': None,
 'dual': False,
 'fit_intercept': True,
 'intercept_scaling': 1,
 'max_iter': 100,
 'multi_class': 'ovr',
 'n_jobs': 1,
 'penalty': 'l2',
 'random_state': None,
 'solver': 'liblinear',
 'tol': 0.0001,
 'verbose': 0,
 'warm_start': False}
# every key here is something we can tweak in our model
# some of those are well nested
for key in pipeline_params.keys():
    print(key)
memory
steps
featureunion
truncatedsvd
logisticregression
featureunion__n_jobs
featureunion__transformer_list
featureunion__transformer_weights
featureunion__pipeline-1
featureunion__pipeline-2
featureunion__pipeline-1__memory
featureunion__pipeline-1__steps
featureunion__pipeline-1__featureextractor
featureunion__pipeline-1__functiontransformer-1
featureunion__pipeline-1__tfidfvectorizer
featureunion__pipeline-1__functiontransformer-2
featureunion__pipeline-1__featureextractor__column
featureunion__pipeline-1__functiontransformer-1__accept_sparse
featureunion__pipeline-1__functiontransformer-1__func
featureunion__pipeline-1__functiontransformer-1__inv_kw_args
featureunion__pipeline-1__functiontransformer-1__inverse_func
featureunion__pipeline-1__functiontransformer-1__kw_args
featureunion__pipeline-1__functiontransformer-1__pass_y
featureunion__pipeline-1__functiontransformer-1__validate
featureunion__pipeline-1__tfidfvectorizer__analyzer
featureunion__pipeline-1__tfidfvectorizer__binary
featureunion__pipeline-1__tfidfvectorizer__decode_error
featureunion__pipeline-1__tfidfvectorizer__dtype
featureunion__pipeline-1__tfidfvectorizer__encoding
featureunion__pipeline-1__tfidfvectorizer__input
featureunion__pipeline-1__tfidfvectorizer__lowercase
featureunion__pipeline-1__tfidfvectorizer__max_df
featureunion__pipeline-1__tfidfvectorizer__max_features
featureunion__pipeline-1__tfidfvectorizer__min_df
featureunion__pipeline-1__tfidfvectorizer__ngram_range
featureunion__pipeline-1__tfidfvectorizer__norm
featureunion__pipeline-1__tfidfvectorizer__preprocessor
featureunion__pipeline-1__tfidfvectorizer__smooth_idf
featureunion__pipeline-1__tfidfvectorizer__stop_words
featureunion__pipeline-1__tfidfvectorizer__strip_accents
featureunion__pipeline-1__tfidfvectorizer__sublinear_tf
featureunion__pipeline-1__tfidfvectorizer__token_pattern
featureunion__pipeline-1__tfidfvectorizer__tokenizer
featureunion__pipeline-1__tfidfvectorizer__use_idf
featureunion__pipeline-1__tfidfvectorizer__vocabulary
featureunion__pipeline-1__functiontransformer-2__accept_sparse
featureunion__pipeline-1__functiontransformer-2__func
featureunion__pipeline-1__functiontransformer-2__inv_kw_args
featureunion__pipeline-1__functiontransformer-2__inverse_func
featureunion__pipeline-1__functiontransformer-2__kw_args
featureunion__pipeline-1__functiontransformer-2__pass_y
featureunion__pipeline-1__functiontransformer-2__validate
featureunion__pipeline-2__memory
featureunion__pipeline-2__steps
featureunion__pipeline-2__featureextractor
featureunion__pipeline-2__functiontransformer-1
featureunion__pipeline-2__tfidfvectorizer
featureunion__pipeline-2__functiontransformer-2
featureunion__pipeline-2__featureextractor__column
featureunion__pipeline-2__functiontransformer-1__accept_sparse
featureunion__pipeline-2__functiontransformer-1__func
featureunion__pipeline-2__functiontransformer-1__inv_kw_args
featureunion__pipeline-2__functiontransformer-1__inverse_func
featureunion__pipeline-2__functiontransformer-1__kw_args
featureunion__pipeline-2__functiontransformer-1__pass_y
featureunion__pipeline-2__functiontransformer-1__validate
featureunion__pipeline-2__tfidfvectorizer__analyzer
featureunion__pipeline-2__tfidfvectorizer__binary
featureunion__pipeline-2__tfidfvectorizer__decode_error
featureunion__pipeline-2__tfidfvectorizer__dtype
featureunion__pipeline-2__tfidfvectorizer__encoding
featureunion__pipeline-2__tfidfvectorizer__input
featureunion__pipeline-2__tfidfvectorizer__lowercase
featureunion__pipeline-2__tfidfvectorizer__max_df
featureunion__pipeline-2__tfidfvectorizer__max_features
featureunion__pipeline-2__tfidfvectorizer__min_df
featureunion__pipeline-2__tfidfvectorizer__ngram_range
featureunion__pipeline-2__tfidfvectorizer__norm
featureunion__pipeline-2__tfidfvectorizer__preprocessor
featureunion__pipeline-2__tfidfvectorizer__smooth_idf
featureunion__pipeline-2__tfidfvectorizer__stop_words
featureunion__pipeline-2__tfidfvectorizer__strip_accents
featureunion__pipeline-2__tfidfvectorizer__sublinear_tf
featureunion__pipeline-2__tfidfvectorizer__token_pattern
featureunion__pipeline-2__tfidfvectorizer__tokenizer
featureunion__pipeline-2__tfidfvectorizer__use_idf
featureunion__pipeline-2__tfidfvectorizer__vocabulary
featureunion__pipeline-2__functiontransformer-2__accept_sparse
featureunion__pipeline-2__functiontransformer-2__func
featureunion__pipeline-2__functiontransformer-2__inv_kw_args
featureunion__pipeline-2__functiontransformer-2__inverse_func
featureunion__pipeline-2__functiontransformer-2__kw_args
featureunion__pipeline-2__functiontransformer-2__pass_y
featureunion__pipeline-2__functiontransformer-2__validate
truncatedsvd__algorithm
truncatedsvd__n_components
truncatedsvd__n_iter
truncatedsvd__random_state
truncatedsvd__tol
logisticregression__C
logisticregression__class_weight
logisticregression__dual
logisticregression__fit_intercept
logisticregression__intercept_scaling
logisticregression__max_iter
logisticregression__multi_class
logisticregression__n_jobs
logisticregression__penalty
logisticregression__random_state
logisticregression__solver
logisticregression__tol
logisticregression__verbose
logisticregression__warm_start
# specifically, there are this many parameters we could tweak!
# gridsearch will be a great help here
len(pipeline_params)
113
# using GridsearchCV and Pipeline to do hyperparameter
# optimization on the LogisticRegression
from sklearn.model_selection import GridSearchCV
pipeline
Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('featureextractor', FeatureExtractor(column='selftext')), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
          func=<function extract_first_column at 0x11...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))])

Things we could tweak inside of Logistic Regression:

  - logisticregression__C (regularization strength)
  - logisticregression__class_weight
  - logisticregression__dual
  - logisticregression__fit_intercept
  - logisticregression__intercept_scaling
  - logisticregression__max_iter
  - logisticregression__multi_class
  - logisticregression__n_jobs
  - logisticregression__penalty
  - logisticregression__random_state
  - logisticregression__solver
  - logisticregression__tol
  - logisticregression__verbose
  - logisticregression__warm_start

# this took me 90 mins to run (i had too many tabs open) 

params_grid = {
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__C':[0.01, 1.0, 100.0],
    'logisticregression__fit_intercept': [True, False]
}

gs = GridSearchCV(
    pipeline, 
    params_grid,
    n_jobs=-1,
    verbose=2
)

gs.fit(X_train, y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
...
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed: 89.3min finished





GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('featureunion', FeatureUnion(n_jobs=1,
       transformer_list=[('pipeline-1', Pipeline(memory=None,
     steps=[('featureextractor', FeatureExtractor(column='selftext')), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
          func=<function extract_first_column at 0x11...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False))]),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': [0.01, 1.0, 100.0], 'logisticregression__fit_intercept': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)
gs.best_params_
{'logisticregression__C': 100.0,
 'logisticregression__fit_intercept': False,
 'logisticregression__penalty': 'l1'}
predictions = gs.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, predictions))
[[  25    6    1    2   24  212    1]
 [   0  256    0    5    9  145    1]
 [   0    7   26   23   23  129    7]
 [   1   14    2  180   17  454   28]
 [   1    9    4   15  405  402   18]
 [  25   52   19   91  179 5786   36]
 [   0    1    2   34   21  227  127]]
print(classification_report(y_test, predictions))
             precision    recall  f1-score   support

   compiled       0.48      0.09      0.15       271
    gamedev       0.74      0.62      0.67       416
 javascript       0.48      0.12      0.19       215
      other       0.51      0.26      0.34       696
     python       0.60      0.47      0.53       854
techsupport       0.79      0.94      0.85      6188
     webdev       0.58      0.31      0.40       412

avg / total       0.72      0.75      0.72      9052
print(accuracy_score(y_test, predictions))
0.751767565179

Hyperparameter Tuning with Gridsearch:

2/2: NLP parameters

# adding the best parameters to LR as determined by the gridsearch
pipeline = make_pipeline(
    feature_set_extraction,
    TruncatedSVD(n_components=50),
    LogisticRegression(penalty='l1', C=100.0, fit_intercept=False)
)
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)

pipeline_params

# this took me 40 mins to run
from sklearn.feature_extraction.text import CountVectorizer
params_grid = {
    'truncatedsvd__n_components': [50, 250],
    'featureunion__pipeline-1__tfidfvectorizer': [
        TfidfVectorizer(stop_words='english'),
        CountVectorizer(stop_words='english', max_features=10000)
    ],
    'featureunion__pipeline-2__tfidfvectorizer': [
        TfidfVectorizer(stop_words='english'),
        CountVectorizer(stop_words='english', max_features=10000)
    ],
}

gs = GridSearchCV(
    pipeline, 
    params_grid,
    n_jobs=-1,
    verbose=2
)

gs.fit(X_train, y_train)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
gs.best_params_
{'featureunion__pipeline-1__tfidfvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=10000, min_df=1,
         ngram_range=(1, 1), preprocessor=None, stop_words='english',
         strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=None, vocabulary=None),
 'featureunion__pipeline-2__tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words='english', strip_accents=None, sublinear_tf=False,
         token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
         vocabulary=None),
 'truncatedsvd__n_components': 250}

*CountVectorizer was the best param in the first tfidf slot and TfidfVectorizer was best in the second, with 250 as the best number of components. Let’s now fit these params and see how well they predict:*

# Function to check fit
def check_fit(predictions, y_true):
    print(confusion_matrix(y_true, predictions))
    print(classification_report(y_true, predictions))
    print(accuracy_score(y_true, predictions))

predictions = gs.best_estimator_.predict(X_test)
check_fit(predictions, y_test)
[[  76    3    3    4    8  174    3]
 [   0  248    2   13    9  143    1]
 [   1    9   44   16    2  139    4]
 [   4   10    3  205   18  431   25]
 [   4   15    5   15  447  344   24]
 [  56   27   27   74  130 5840   34]
 [   2    2    9   31   11  240  117]]
             precision    recall  f1-score   support

   compiled       0.53      0.28      0.37       271
    gamedev       0.79      0.60      0.68       416
 javascript       0.47      0.20      0.29       215
      other       0.57      0.29      0.39       696
     python       0.72      0.52      0.60       854
techsupport       0.80      0.94      0.87      6188
     webdev       0.56      0.28      0.38       412

avg / total       0.75      0.77      0.74      9052

0.770768890853

**They predict about 9 percentage points more accurately than the baseline (0.77 vs 0.68).**

DIMENSIONALITY REDUCTION WITH PCA (TruncatedSVD)

# predicting the subreddit group from the post's selftext
X_train, X_test, y_train, y_test = train_test_split(df['selftext'].values, df['subreds'].values,
                                                  test_size=0.33,  random_state=2017)
cv = CountVectorizer()
cv.fit(X_train)
X = cv.transform(X_train)
print(X.shape)
(12132, 47561)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
tsvd = TruncatedSVD(n_components=11)
tsvd.fit(X)
plt.plot(range(11), tsvd.explained_variance_ratio_.cumsum())
[<matplotlib.lines.Line2D at 0x1a2491d8d0>]

(plot: cumulative explained variance ratio of the first 11 SVD components)

X_tsvd = tsvd.transform(X)

Predicting with RandomForestClassifier

rfc = RandomForestClassifier()
rfc.fit(X_tsvd, y_train)
print(rfc.score(X_tsvd, y_train))
print(confusion_matrix(y_train, rfc.predict(X_tsvd)))
print(classification_report(y_train, rfc.predict(X_tsvd)))
0.984009231784
[[ 345    0    0    0    0   18    0]
 [   0  527    0    1    2   16    0]
 [   0    0  290    1    1    7    0]
 [   3    0    1  880    0   31    3]
 [   1    1    1    1 1086   46    2]
 [   2    2    4    3    6 8316    3]
 [   0    0    1    2    1   34  494]]
             precision    recall  f1-score   support

   compiled       0.98      0.95      0.97       363
    gamedev       0.99      0.97      0.98       546
 javascript       0.98      0.97      0.97       299
      other       0.99      0.96      0.97       918
     python       0.99      0.95      0.97      1138
techsupport       0.98      1.00      0.99      8336
     webdev       0.98      0.93      0.96       532

avg / total       0.98      0.98      0.98     12132

This seems like an exceptionally good model with high accuracy - on the training data.

X_test_cv = cv.transform(X_test)
X_test_svd = tsvd.transform(X_test_cv)
print(rfc.score(X_test_svd, y_test))
print(confusion_matrix(y_test, rfc.predict(X_test_svd)))
print(classification_report(y_test, rfc.predict(X_test_svd)))
0.643239625167
[[  21   11    3    8   25  110    1]
 [   8   27    5   19   22  204    1]
 [   3    7    3    9   16   91    2]
 [  12   25    4   58   39  327   10]
 [   9   18    6   29   92  403   14]
 [  30   48   31  119  156 3631   26]
 [   5   10    5   22   31  208   12]]
             precision    recall  f1-score   support

   compiled       0.24      0.12      0.16       179
    gamedev       0.18      0.09      0.12       286
 javascript       0.05      0.02      0.03       131
      other       0.22      0.12      0.16       475
     python       0.24      0.16      0.19       571
techsupport       0.73      0.90      0.81      4041
     webdev       0.18      0.04      0.07       293

avg / total       0.56      0.64      0.59      5976

Except on the test set it did much worse than expected - the forest overfit the training data.

Same in a pipeline

pipeline = make_pipeline(
   CountVectorizer(),
   TruncatedSVD(n_components=10),
   RandomForestClassifier())

pipeline.fit(X_train, y_train)

# score it and run the predictions from the training set.
print(pipeline.score(X_train, y_train))
predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))
0.970820969337
[[ 330    0    0    3    0   30    0]
 [   1  518    0    2    2   22    1]
 [   0    0  280    4    1   12    2]
 [   0    3    1  847    2   61    4]
 [   2    2    2    5 1039   87    1]
 [   1    7    5   10   12 8297    4]
 [   1    3    0    5    5   51  467]]
             precision    recall  f1-score   support

   compiled       0.99      0.91      0.95       363
    gamedev       0.97      0.95      0.96       546
 javascript       0.97      0.94      0.95       299
      other       0.97      0.92      0.94       918
     python       0.98      0.91      0.94      1138
techsupport       0.97      1.00      0.98      8336
     webdev       0.97      0.88      0.92       532

avg / total       0.97      0.97      0.97     12132
# score and predict with our test set
print(pipeline.score(X_test, y_test))
predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
0.645582329317
[[  19    6    3   14   21  111    5]
 [   6   32    6   21   16  201    4]
 [   1    3    9    8   13   91    6]
 [   2   21    6   69   39  323   15]
 [  16   24    6   37   94  392    2]
 [  32   55   34  129  138 3621   32]
 [   5    8    6   17   22  221   14]]
             precision    recall  f1-score   support

   compiled       0.23      0.11      0.15       179
    gamedev       0.21      0.11      0.15       286
 javascript       0.13      0.07      0.09       131
      other       0.23      0.15      0.18       475
     python       0.27      0.16      0.21       571
techsupport       0.73      0.90      0.80      4041
     webdev       0.18      0.05      0.08       293

avg / total       0.57      0.65      0.60      5976
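
Both runs overfit badly (roughly 0.97 on train vs 0.64 on test). A hedged sketch of one way to rein that in - constraining the trees and scoring with cross-validation instead of on the training set (the parameters here are illustrative, not tuned):

```python
from sklearn.model_selection import cross_val_score

# shallower trees and a minimum leaf size limit how much the forest can
# memorize; cross_val_score reports held-out accuracy rather than train accuracy
pipeline_limited = make_pipeline(
    CountVectorizer(stop_words='english'),
    TruncatedSVD(n_components=10),
    RandomForestClassifier(max_depth=10, min_samples_leaf=5, n_estimators=100)
)
scores = cross_val_score(pipeline_limited, X_train, y_train, cv=3)
print(scores.mean())
```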

Latent Dirichlet Allocation

# transform the data using CountVectorizer and removing stop words:
cv = CountVectorizer(stop_words='english')
cv.fit(df['title'].values)
X = cv.transform(df['title'].values)
X
<18108x12407 sparse matrix of type '<class 'numpy.int64'>'
	with 102034 stored elements in Compressed Sparse Row format>
# instantiate an LDA and fit it to our sparse matrix of words
from sklearn.decomposition import LatentDirichletAllocation
feature_names = cv.get_feature_names()
lda = LatentDirichletAllocation(n_components=7) # 7 for the number of topics
lda.fit(X)
/Users/Olga/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:536: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
  DeprecationWarning)

LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7, learning_method=None,
             learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
             mean_change_tol=0.001, n_components=7, n_jobs=1,
             n_topics=None, perp_tol=0.1, random_state=None,
             topic_word_prior=None, total_samples=1000000.0, verbose=0)
print(lda.components_.shape)
(7, 12407)
results = pd.DataFrame(lda.components_,
                     columns=feature_names)
# listing, for each of the 7 topics, the words most likely to be used together
for topic in range(7):
    print('Topic', topic)
    word_list = results.T[topic].sort_values(ascending=False).index
    print(' '.join(word_list[0:25]), '\n')
Topic 0
help need game way java best data audio programming project looking advice list create function app file android software page set open phone input design 

Topic 1
use pc drive does wifi getting problem hard know make post disk power tv image 2016 connect randomly gtx sure external really think add thread 

Topic 2
laptop new boot code trying learn ssd computer program issues having won hdd want possible slow don just loop keyboard bsod files drivers time 100 

Topic 3
screen internet monitor good error card video learning doesn pc desktop graphics display black computer connection work asus google run using javascript chrome driver won 

Topic 4
python question games gpu problems computer cpu running motherboard usage high different playing ve swift line coding fps fan programming ram dual test low monitor 

Topic 5
windows 10 working pc like web new install router update gaming start fix making got old change device vs random access website network development online 

Topic 6
using issue usb text work python amp keeps build mouse laptop code html crashing js server django creating php based port search unable apps iphone 
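
Since LDA doesn’t predict labels directly, one way to connect the topics back to individual posts (a sketch, reusing the lda fitted above) is to look at each document’s topic distribution:

```python
# each row of doc_topics is a probability distribution over the 7 topics
doc_topics = lda.transform(X)
print(doc_topics.shape)         # (18108, 7)
print(doc_topics[0].argmax())   # index of the most likely topic for the first title
```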

Keywords: hyperparameter optimization, model tweaking, gridsearch
