This write-up approaches Natural Language Processing (NLP), a powerful set of techniques for turning text into numbers that can then be used in modeling. The analysis below uses a Reddit dataset, specifically the text and the title of each Reddit post, to predict which topic, or subreddit, the post belongs to. The project is inspired by one of the optional assignments from the Data Science Immersive course at General Assembly, with some functions taken straight from my brilliant teachers’ notes.
The flow is as follows:
FURTHER DETAILS:
Missing Values: with 26k rows, I’m dropping any features that have more than 20k rows missing, handpicking ones with balanced values and editing ones with bad values.
NLP techniques used: CountVectorizer treats every unique word as a feature and counts how often it occurs in each document. (This can be problematic for several reasons: memory, processing power and ambiguity, e.g. LinkedIn seeing 6000+ variations of the title ‘Software Engineer’. There is a technique for dealing with that called ‘stemming’, but it’s fragile.) So what CountVectorizer does is count each word’s occurrences, so that words which occur more often get larger values.
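To make the counting idea concrete, here is a minimal pure-Python sketch of a bag-of-words matrix. It is an illustration only, not scikit-learn's implementation: the three toy titles and the `tokenize` helper are assumptions of mine, and the regular expression mimics the default `token_pattern` that appears in the CountVectorizer output further down.

```python
from collections import Counter
import re

# Toy documents (hypothetical; two are titles from the dataset preview)
docs = [
    "Help with audio set-up",
    "Optimizing code for speed",
    "Help optimizing audio code",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

# The vocabulary is every unique token, sorted so each word gets a stable column
vocabulary = sorted({tok for doc in docs for tok in tokenize(doc)})

# One row per document, one column per vocabulary word, values are raw counts
count_matrix = []
for doc in docs:
    counts = Counter(tokenize(doc))
    count_matrix.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in count_matrix:
    print(row)
```

Even with three short titles the matrix already has nine columns, which hints at why a real corpus ends up with tens of thousands of features.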
Another approach is to look for more distinctive words. Tf-idf highlights what is common in one or two documents and rare in all the others. It gives a value that is weighted relative to the other documents, not absolute like the counts from CountVectorizer. That’s usually more interesting (and predictive) than words that are common everywhere.
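A small hand-rolled sketch of the weighting may help here. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which is what scikit-learn documents for its default `smooth_idf=True`; the toy documents and helper names are hypothetical, and the final row normalization that TfidfVectorizer applies by default is left out to keep the example short.

```python
import math
from collections import Counter

# Toy tokenized documents (hypothetical)
docs = [
    ["python", "loop", "error"],
    ["wifi", "router", "error"],
    ["laptop", "boot", "error"],
]

n_docs = len(docs)
# Document frequency: in how many documents each word appears
doc_freq = Counter(word for doc in docs for word in set(doc))

def smoothed_idf(word):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

def tfidf(doc):
    # Term count in this document times the word's idf (no normalization)
    counts = Counter(doc)
    return {word: count * smoothed_idf(word) for word, count in counts.items()}

weights = tfidf(docs[0])
print(weights)
```

The word "error" occurs in every document, so it gets the minimum weight, while "python" and "loop" occur in only one document and are weighted up.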
In addition, I’m using GridSearch to first find the best parameters in the model and then in the NLP techniques.
Often, the thousands of columns we get from vectorizing each word are not individually informative. Reducing them in dimensionality using PCA can be very helpful. I used a variant of PCA known as TruncatedSVD (yet it did not improve the predictability of my model).
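As a rough illustration of what a truncated SVD does, the NumPy sketch below keeps only the top `k` singular vectors of a tiny count matrix. The matrix and `k` are made up for the example, and this is not how scikit-learn computes it internally. One relevant difference from PCA is that the data is not centered first, which is what lets the technique work on sparse matrices.

```python
import numpy as np

# Toy document-term count matrix: 4 documents, 5 terms (hypothetical)
X = np.array([
    [2, 1, 0, 0, 0],
    [1, 2, 0, 0, 0],
    [0, 0, 1, 2, 1],
    [0, 0, 2, 1, 1],
], dtype=float)

k = 2  # number of components to keep

# Full SVD, then keep only the top-k singular values and vectors
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :k] * s[:k]       # documents in the k-dimensional space
X_approx = X_reduced @ Vt[:k, :]   # rank-k reconstruction of X

# Share of the total squared singular values captured by the top k
share_kept = (s[:k] ** 2).sum() / (s ** 2).sum()
print(X_reduced.shape, share_kept)
```

Note that `share_kept` is computed from singular values, so it is related to, but not the same quantity as, the `explained_variance_ratio_` plotted later in this post.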
Lastly, I looked at Latent Dirichlet Allocation (LDA), an unsupervised technique that finds words that are likely to appear together. It does not predict labels, and it does not work out how many topics there are: that number has to be chosen up front.
Eventually bouncing back and forth between optimizing hyperparameters, trying new modelling techniques and working on feature extraction would have gotten me closer to a better predictive model. But this is what I have for now:
import pandas as pd
import numpy as np
df = pd.read_csv('NLP/reddit_posts.csv')
df.shape
(26688, 53)
df.head()
| | adserver_click_url | adserver_imp_pixel | archived | author | author_flair_css_class | author_flair_text | contest_mode | created_utc | disable_comments | distinguished | ... | spoiler | stickied | subreddit | subreddit_id | third_party_tracking | third_party_tracking_2 | thumbnail | title | ups | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | NaN | False | johnnyawesome0 | NaN | NaN | False | 1480697304 | NaN | NaN | ... | False | False | techsupport | t5_2qioo | NaN | NaN | self | Help with audio set-up | 1.0 | https://www.reddit.com/r/techsupport/comments/... |
1 | NaN | NaN | False | Silverfin113 | NaN | NaN | False | 1480697424 | NaN | NaN | ... | False | False | learnprogramming | t5_2r7yd | NaN | NaN | self | Optimizing code for speed | 23.0 | https://www.reddit.com/r/learnprogramming/comm... |
2 | NaN | NaN | False | bookbooksbooks | NaN | NaN | False | 1480697613 | NaN | NaN | ... | False | False | gamedev | t5_2qi0a | NaN | NaN | self | Seeking Tales of Development Woe (and Triumph)... | 12.0 | https://www.reddit.com/r/gamedev/comments/5g4a... |
3 | NaN | NaN | False | [deleted] | NaN | NaN | False | 1480697634 | NaN | NaN | ... | False | False | learnprogramming | t5_2r7yd | NaN | NaN | default | [Java] Finding smallest value in an array | 0.0 | https://www.reddit.com/r/learnprogramming/comm... |
4 | NaN | NaN | False | caffeine_potent | NaN | NaN | False | 1480697748 | NaN | NaN | ... | False | False | learnpython | t5_2r8ot | NaN | NaN | self | currying functions using functools | 6.0 | https://www.reddit.com/r/learnpython/comments/... |
5 rows × 53 columns
df.isnull().sum()
adserver_click_url 26688
adserver_imp_pixel 26688
archived 0
author 0
author_flair_css_class 26253
author_flair_text 26337
contest_mode 0
created_utc 0
disable_comments 26688
distinguished 26603
domain 0
downs 0
edited 0
gilded 0
hide_score 0
href_url 26688
id 0
imp_pixel 26688
is_self 0
link_flair_css_class 22396
link_flair_text 22078
locked 0
media 26420
media_embed 0
mobile_ad_url 26688
name 0
num_comments 0
original_link 26688
over_18 0
permalink 0
post_hint 23175
preview 23175
promoted 26688
promoted_by 26688
promoted_display_name 26688
promoted_url 26688
quarantine 0
retrieved_on 0
saved 0
score 0
secure_media 26420
secure_media_embed 0
selftext 0
spoiler 0
stickied 0
subreddit 0
subreddit_id 0
third_party_tracking 26688
third_party_tracking_2 26688
thumbnail 0
title 0
ups 0
url 0
dtype: int64
missing_cols = df.isnull().sum()[df.isnull().sum() > 20000].index
missing_cols
Index(['adserver_click_url', 'adserver_imp_pixel', 'author_flair_css_class',
'author_flair_text', 'disable_comments', 'distinguished', 'href_url',
'imp_pixel', 'link_flair_css_class', 'link_flair_text', 'media',
'mobile_ad_url', 'original_link', 'post_hint', 'preview', 'promoted',
'promoted_by', 'promoted_display_name', 'promoted_url', 'secure_media',
'third_party_tracking', 'third_party_tracking_2'],
dtype='object')
df.drop(list(missing_cols), axis=1, inplace=True)
df.head()
| | archived | author | contest_mode | created_utc | domain | downs | edited | gilded | hide_score | id | ... | secure_media_embed | selftext | spoiler | stickied | subreddit | subreddit_id | thumbnail | title | ups | url |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | johnnyawesome0 | False | 1480697304 | self.techsupport | 0.0 | False | 0.0 | False | 5g49s2 | ... | {} | I have a Sony surround sound system for a blu-... | False | False | techsupport | t5_2qioo | self | Help with audio set-up | 1.0 | https://www.reddit.com/r/techsupport/comments/... |
1 | False | Silverfin113 | False | 1480697424 | self.learnprogramming | 0.0 | False | 0.0 | False | 5g4a5p | ... | {} | I've written what seems to be a prohibitively ... | False | False | learnprogramming | t5_2r7yd | self | Optimizing code for speed | 23.0 | https://www.reddit.com/r/learnprogramming/comm... |
2 | False | bookbooksbooks | False | 1480697613 | self.gamedev | 0.0 | False | 0.0 | False | 5g4att | ... | {} | I'm writing an article called "Video Games Tha... | False | False | gamedev | t5_2qi0a | self | Seeking Tales of Development Woe (and Triumph)... | 12.0 | https://www.reddit.com/r/gamedev/comments/5g4a... |
3 | False | [deleted] | False | 1480697634 | self.learnprogramming | 0.0 | 1480698462 | 0.0 | False | 5g4awr | ... | {} | [deleted] | False | False | learnprogramming | t5_2r7yd | default | [Java] Finding smallest value in an array | 0.0 | https://www.reddit.com/r/learnprogramming/comm... |
4 | False | caffeine_potent | False | 1480697748 | self.learnpython | 0.0 | 1480709138 | 0.0 | False | 5g4bcr | ... | {} | I have the following representation of argumen... | False | False | learnpython | t5_2r8ot | self | currying functions using functools | 6.0 | https://www.reddit.com/r/learnpython/comments/... |
5 rows × 31 columns
# a lot are booleans and objects
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26688 entries, 0 to 26687
Data columns (total 31 columns):
archived 26688 non-null bool
author 26688 non-null object
contest_mode 26688 non-null bool
created_utc 26688 non-null int64
domain 26688 non-null object
downs 26688 non-null float64
edited 26688 non-null object
gilded 26688 non-null float64
hide_score 26688 non-null bool
id 26688 non-null object
is_self 26688 non-null bool
locked 26688 non-null bool
media_embed 26688 non-null object
name 26688 non-null object
num_comments 26688 non-null float64
over_18 26688 non-null bool
permalink 26688 non-null object
quarantine 26688 non-null bool
retrieved_on 26688 non-null float64
saved 26688 non-null bool
score 26688 non-null float64
secure_media_embed 26688 non-null object
selftext 26688 non-null object
spoiler 26688 non-null bool
stickied 26688 non-null bool
subreddit 26688 non-null object
subreddit_id 26688 non-null object
thumbnail 26688 non-null object
title 26688 non-null object
ups 26688 non-null float64
url 26688 non-null object
dtypes: bool(10), float64(6), int64(1), object(14)
memory usage: 4.5+ MB
# let's take a closer look and identify candidates for elimination
for col in df.columns:
    print(df[col].value_counts(), '\n')
Going through these one by one, I’m picking which columns are useful to use in analysis.
I hear from practitioners that it’s normal to throw away 50-80% of a dataset, as most of it is extra noise we don’t need anyway.
drop_for_sure = [
'archived', 'contest_mode', 'created_utc', 'downs', 'gilded', 'hide_score', 'id', 'is_self', 'locked',
'media_embed', 'name', 'over_18', 'permalink', 'retrieved_on',
'saved', 'secure_media_embed', 'spoiler', 'stickied',
'subreddit_id', 'thumbnail', 'ups', 'url'
]
df.drop(drop_for_sure, axis=1, inplace=True)
df.head()
| | author | domain | edited | num_comments | quarantine | score | selftext | subreddit | title |
---|---|---|---|---|---|---|---|---|---|
0 | johnnyawesome0 | self.techsupport | False | 1.0 | False | 1.0 | I have a Sony surround sound system for a blu-... | techsupport | Help with audio set-up |
1 | Silverfin113 | self.learnprogramming | False | 8.0 | False | 23.0 | I've written what seems to be a prohibitively ... | learnprogramming | Optimizing code for speed |
2 | bookbooksbooks | self.gamedev | False | 5.0 | False | 12.0 | I'm writing an article called "Video Games Tha... | gamedev | Seeking Tales of Development Woe (and Triumph)... |
3 | [deleted] | self.learnprogramming | 1480698462 | 9.0 | False | 0.0 | [deleted] | learnprogramming | [Java] Finding smallest value in an array |
4 | caffeine_potent | self.learnpython | 1480709138 | 12.0 | False | 6.0 | I have the following representation of argumen... | learnpython | currying functions using functools |
**We can see that there is a ‘[deleted]’ value in a number of author and selftext rows. We want to be aware of those before dropping them, so we’re going to binarize them and look at how the two flags are correlated with each other.**
df['author_deleted'] = df['author'].apply(lambda x: 1 if x == '[deleted]' else 0)
df['author_deleted'].value_counts()
0 20741
1 5947
Name: author_deleted, dtype: int64
df['not_useful_selftext'] = df['selftext'].apply(lambda x: 1 if x in ['[deleted]', '[removed]'] else 0)
df['not_useful_selftext'].value_counts()
0 18208
1 8480
Name: not_useful_selftext, dtype: int64
Let’s see if there is an overlap between ‘author_deleted’ and ‘not_useful_selftext’.
df[['author_deleted','not_useful_selftext']].corr()
| | author_deleted | not_useful_selftext |
---|---|---|
author_deleted | 1.000000 | 0.765296 |
not_useful_selftext | 0.765296 | 1.000000 |
There’s quite an overlap, so we’re not going to lose too much data if we drop those.
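A correlation coefficient summarizes the overlap in one number, but a cross-tabulation shows the actual row counts in each combination. The sketch below uses a made-up five-row frame (not the Reddit data) and also shows vectorized equivalents of the two `apply`/`lambda` flags. It is one possible way to check the overlap, not what was run for this post.

```python
import pandas as pd

# Toy stand-in for the real DataFrame (hypothetical values)
toy = pd.DataFrame({
    "author": ["alice", "[deleted]", "[deleted]", "bob", "[deleted]"],
    "selftext": ["some text", "[deleted]", "[removed]", "[removed]", "real post"],
})

# Vectorized equivalents of the apply/lambda flags
toy["author_deleted"] = toy["author"].eq("[deleted]").astype(int)
toy["not_useful_selftext"] = toy["selftext"].isin(["[deleted]", "[removed]"]).astype(int)

# Rows are author_deleted, columns are not_useful_selftext
overlap = pd.crosstab(toy["author_deleted"], toy["not_useful_selftext"])
print(overlap)
```

The cell at row 1, column 1 counts posts that are flagged on both sides, which is the overlap the drop relies on.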
df = df.loc[(df['author_deleted'] == 0)]
df = df.loc[(df['not_useful_selftext'] == 0)]
df.shape
(18108, 11)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18108 entries, 0 to 26686
Data columns (total 11 columns):
author 18108 non-null object
domain 18108 non-null object
edited 18108 non-null object
num_comments 18108 non-null float64
quarantine 18108 non-null bool
score 18108 non-null float64
selftext 18108 non-null object
subreddit 18108 non-null object
title 18108 non-null object
author_deleted 18108 non-null int64
not_useful_selftext 18108 non-null int64
dtypes: bool(1), float64(2), int64(2), object(6)
memory usage: 1.5+ MB
Nothing’s null anymore!
Before modeling, I will split the subreddits into broader categories that are easier to handle.
df.subreddit.value_counts()
techsupport 9522
learnprogramming 2688
learnpython 1420
gamedev 800
web_design 441
javahelp 419
javascript 401
csshelp 298
Python 274
iOSProgramming 262
linux 225
engineering 213
swift 189
computerscience 124
django 111
PHP 84
css 77
java 75
HTML 74
ruby 67
flask 60
compsci 43
technology 42
cpp 34
html5 33
pygame 32
jquery 29
perl 21
lisp 14
dailyprogrammer 9
programmer 8
IPython 8
inventwithpython 5
netsec 4
pystats 2
Name: subreddit, dtype: int64
def subreddit_splitter(subred):
    # map a subreddit name to one of seven broader categories
    techsupport = ['techsupport', 'learnprogramming', 'computerscience',
                   'compsci']
    # note: the dataset has 'web_design', not 'web_development', so those
    # posts fall through to 'other' in the counts below
    webdev = ['web_development', 'csshelp', 'PHP', 'css', 'HTML',
              'ruby', 'django', 'flask', 'html5', 'perl']
    python = ['learnpython', 'Python', 'IPython', 'inventwithpython',
              'pystats']
    gamedev = ['gamedev', 'pygame']
    javascript = ['javascript', 'jquery']
    compiled_langs = ['javahelp', 'cpp', 'java', 'lisp']
    if subred in techsupport:
        return 'techsupport'
    elif subred in webdev:
        return 'webdev'
    elif subred in python:
        return 'python'
    elif subred in gamedev:
        return 'gamedev'
    elif subred in javascript:
        return 'javascript'
    elif subred in compiled_langs:
        return 'compiled'
    else:
        return 'other'
df['subreds'] = df['subreddit'].apply(subreddit_splitter)
df.subreds.value_counts()
techsupport 12377
python 1709
other 1393
gamedev 832
webdev 825
compiled 542
javascript 430
Name: subreds, dtype: int64
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression
# StratifiedKFold?
# training and holdout set
# with 2 splits, and reliably shuffled data
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=2017)
# as per documentation (the '?' above)
skf.get_n_splits(df, df['subreds'])
2
# each pass overwrites the previous one, so we keep the last of the 2 splits
for train_idx, test_idx in skf.split(df, df['subreds']):
    X_train, X_test = df.iloc[train_idx, :], df.iloc[test_idx, :]
    y_train, y_test = df['subreds'].iloc[train_idx], df['subreds'].iloc[test_idx]
X_train.shape
(9056, 12)
X_test.shape
(9052, 12)
y_train.shape
(9056,)
y_test.shape
(9052,)
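Since only one split is kept, the same stratified 50/50 holdout could also be written with `train_test_split` and its `stratify` argument. The sketch below shows that alternative on toy labels. The label list and variable names are assumptions for illustration, and it will not reproduce the exact rows selected by the StratifiedKFold loop.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy labels with a 2:1 class imbalance (hypothetical)
labels = ["techsupport"] * 8 + ["python"] * 4
rows = list(range(len(labels)))

# stratify=labels keeps the class proportions the same in both halves
rows_train, rows_test, labels_train, labels_test = train_test_split(
    rows, labels, test_size=0.5, stratify=labels, random_state=2017
)

print(Counter(labels_train), Counter(labels_test))
```

Stratifying matters here because techsupport makes up roughly two thirds of the posts, and an unstratified split could leave the small classes unevenly represented.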
# this gives us a transformer with fit and transform methods that
# picks one column of a df and passes it on, ready for a pipeline
from sklearn.base import BaseEstimator, TransformerMixin
class FeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column
    def fit(self, X, y=None):
        return self
    def transform(self, X, y=None):
        return X[[self.column]]
# this is a stateless transformer that takes something and applies a function to it
from sklearn.preprocessing import FunctionTransformer
def extract_first_column(x):
    return x.iloc[:, 0]
def to_dense(x):
    # toarray() returns a plain ndarray; todense() returns an np.matrix,
    # which newer scikit-learn versions reject
    return x.toarray()
selftext_pipe = make_pipeline(
# takes a full df and spits out a df w 1 col, selftext
FeatureExtractor('selftext'),
# take that new df shape (9000, 1) and extract the 1st col, while
# validate=False silences that warning
FunctionTransformer(extract_first_column, validate=False),
# tfidf spits out a sparse matrix
TfidfVectorizer(stop_words='english'),
# functionTransformer here takes a sparse matrix and turns it
# to a dense one
FunctionTransformer(to_dense, validate=False)
)
title_pipe = make_pipeline(
FeatureExtractor('title'),
FunctionTransformer(extract_first_column, validate=False),
TfidfVectorizer(stop_words='english'),
FunctionTransformer(to_dense, validate=False)
)
feature_set_extraction = make_union(
selftext_pipe,
title_pipe
)
feature_set_extraction.fit(X_train)
FeatureUnion(n_jobs=1,
transformer_list=[('pipeline-1', Pipeline(memory=None,
steps=[('featureextractor', FeatureExtractor(column='selftext')), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
func=<function extract_first_column at 0x1121b66a8>,
inv_kw_args=None, inverse_func=Non...None,
inverse_func=None, kw_args=None, pass_y='deprecated',
validate=False))]))],
transformer_weights=None)
transformed = feature_set_extraction.transform(X_train)
print(transformed.shape, transformed[0:5])
(9056, 47722) [[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]
[ 0. 0. 0. ..., 0. 0. 0.]]
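The dense conversion produces a 9056 x 47722 array that is almost entirely zeros. TruncatedSVD accepts sparse input directly, so one way to avoid that step is sketched below. Note the assumptions: it uses `ColumnTransformer` (added in scikit-learn 0.20) and pipeline slicing (0.21), both newer than the version used in this post, the toy rows are made up, and it is an alternative design rather than the pipeline that produced the results here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for the Reddit DataFrame (hypothetical rows)
toy = pd.DataFrame({
    "title": ["python loop help", "wifi keeps dropping",
              "django view error", "laptop will not boot"],
    "selftext": ["my for loop never ends", "router disconnects every hour",
                 "getting a 404 from my view", "black screen after update"],
    "label": ["python", "techsupport", "python", "techsupport"],
})

# A string column selector hands each vectorizer a 1-D series of text
features = ColumnTransformer([
    ("selftext", TfidfVectorizer(), "selftext"),
    ("title", TfidfVectorizer(), "title"),
])

sparse_pipeline = make_pipeline(
    features,
    TruncatedSVD(n_components=2, random_state=0),
    LogisticRegression(),
)
sparse_pipeline.fit(toy, toy["label"])

# Everything except the classifier: the reduced feature matrix
reduced = sparse_pipeline[:-1].transform(toy)
print(reduced.shape)
```

On the real data this would keep memory use far lower between the vectorizers and the SVD, which may also shorten the grid searches that follow.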
pipeline = make_pipeline(
feature_set_extraction,
TruncatedSVD(n_components=50),
LogisticRegression()
)
pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)
print(pipeline.transform(X_train).shape)
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
print(confusion_matrix(y_test, predictions))
[[ 14 7 1 1 16 231 1]
[ 0 170 0 1 3 242 0]
[ 0 6 2 10 12 182 3]
[ 0 4 0 76 6 596 14]
[ 1 13 0 4 269 561 6]
[ 15 30 3 44 136 5951 9]
[ 0 0 2 16 19 333 42]]
print(classification_report(y_test, predictions))
precision recall f1-score support
compiled 0.47 0.05 0.09 271
gamedev 0.74 0.41 0.53 416
javascript 0.25 0.01 0.02 215
other 0.50 0.11 0.18 696
python 0.58 0.31 0.41 854
techsupport 0.74 0.96 0.83 6188
webdev 0.56 0.10 0.17 412
avg / total 0.68 0.72 0.66 9052
print(accuracy_score(y_test, predictions))
0.720724701723
print(y_test.value_counts() / y_test.count())
techsupport 0.683606
python 0.094344
other 0.076889
gamedev 0.045957
webdev 0.045515
compiled 0.029938
javascript 0.023752
Name: subreds, dtype: float64
Our model did better than simply predicting the majority class for every post (that would have scored about 0.68, versus 0.72 here). Considering there was no hyperparameter optimization, we did well, but there is a lot of room to improve.
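The majority-class baseline can be checked by hand from the numbers already printed. This short sketch copies the test-set class counts from the "support" column of the classification report and the accuracy from `accuracy_score`; the variable names are mine.

```python
# Test-set class counts, copied from the "support" column of the report above
support = {
    "compiled": 271, "gamedev": 416, "javascript": 215, "other": 696,
    "python": 854, "techsupport": 6188, "webdev": 412,
}
model_accuracy = 0.720724701723  # accuracy_score printed above

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy obtained by always predicting the most frequent class
baseline_accuracy = support[majority_class] / total

print(majority_class, baseline_accuracy)
print("gain in percentage points:", 100 * (model_accuracy - baseline_accuracy))
```

Any later model should be judged against this baseline rather than against zero, since accuracy alone flatters a classifier on data this imbalanced.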
1/2: Logistic Regression parameters
We’re going to tweak multiple hyperparameters inside the pipeline at once. Let’s first look at all the different cases of all the different parameters that exist in our pipeline.
pipeline_params = pipeline.get_params()
LogisticRegression().get_params()
{'C': 1.0,
'class_weight': None,
'dual': False,
'fit_intercept': True,
'intercept_scaling': 1,
'max_iter': 100,
'multi_class': 'ovr',
'n_jobs': 1,
'penalty': 'l2',
'random_state': None,
'solver': 'liblinear',
'tol': 0.0001,
'verbose': 0,
'warm_start': False}
# every key here is something we can tweak in our model
# some of those are well nested
for key in pipeline_params.keys():
    print(key)
memory
steps
featureunion
truncatedsvd
logisticregression
featureunion__n_jobs
featureunion__transformer_list
featureunion__transformer_weights
featureunion__pipeline-1
featureunion__pipeline-2
featureunion__pipeline-1__memory
featureunion__pipeline-1__steps
featureunion__pipeline-1__featureextractor
featureunion__pipeline-1__functiontransformer-1
featureunion__pipeline-1__tfidfvectorizer
featureunion__pipeline-1__functiontransformer-2
featureunion__pipeline-1__featureextractor__column
featureunion__pipeline-1__functiontransformer-1__accept_sparse
featureunion__pipeline-1__functiontransformer-1__func
featureunion__pipeline-1__functiontransformer-1__inv_kw_args
featureunion__pipeline-1__functiontransformer-1__inverse_func
featureunion__pipeline-1__functiontransformer-1__kw_args
featureunion__pipeline-1__functiontransformer-1__pass_y
featureunion__pipeline-1__functiontransformer-1__validate
featureunion__pipeline-1__tfidfvectorizer__analyzer
featureunion__pipeline-1__tfidfvectorizer__binary
featureunion__pipeline-1__tfidfvectorizer__decode_error
featureunion__pipeline-1__tfidfvectorizer__dtype
featureunion__pipeline-1__tfidfvectorizer__encoding
featureunion__pipeline-1__tfidfvectorizer__input
featureunion__pipeline-1__tfidfvectorizer__lowercase
featureunion__pipeline-1__tfidfvectorizer__max_df
featureunion__pipeline-1__tfidfvectorizer__max_features
featureunion__pipeline-1__tfidfvectorizer__min_df
featureunion__pipeline-1__tfidfvectorizer__ngram_range
featureunion__pipeline-1__tfidfvectorizer__norm
featureunion__pipeline-1__tfidfvectorizer__preprocessor
featureunion__pipeline-1__tfidfvectorizer__smooth_idf
featureunion__pipeline-1__tfidfvectorizer__stop_words
featureunion__pipeline-1__tfidfvectorizer__strip_accents
featureunion__pipeline-1__tfidfvectorizer__sublinear_tf
featureunion__pipeline-1__tfidfvectorizer__token_pattern
featureunion__pipeline-1__tfidfvectorizer__tokenizer
featureunion__pipeline-1__tfidfvectorizer__use_idf
featureunion__pipeline-1__tfidfvectorizer__vocabulary
featureunion__pipeline-1__functiontransformer-2__accept_sparse
featureunion__pipeline-1__functiontransformer-2__func
featureunion__pipeline-1__functiontransformer-2__inv_kw_args
featureunion__pipeline-1__functiontransformer-2__inverse_func
featureunion__pipeline-1__functiontransformer-2__kw_args
featureunion__pipeline-1__functiontransformer-2__pass_y
featureunion__pipeline-1__functiontransformer-2__validate
featureunion__pipeline-2__memory
featureunion__pipeline-2__steps
featureunion__pipeline-2__featureextractor
featureunion__pipeline-2__functiontransformer-1
featureunion__pipeline-2__tfidfvectorizer
featureunion__pipeline-2__functiontransformer-2
featureunion__pipeline-2__featureextractor__column
featureunion__pipeline-2__functiontransformer-1__accept_sparse
featureunion__pipeline-2__functiontransformer-1__func
featureunion__pipeline-2__functiontransformer-1__inv_kw_args
featureunion__pipeline-2__functiontransformer-1__inverse_func
featureunion__pipeline-2__functiontransformer-1__kw_args
featureunion__pipeline-2__functiontransformer-1__pass_y
featureunion__pipeline-2__functiontransformer-1__validate
featureunion__pipeline-2__tfidfvectorizer__analyzer
featureunion__pipeline-2__tfidfvectorizer__binary
featureunion__pipeline-2__tfidfvectorizer__decode_error
featureunion__pipeline-2__tfidfvectorizer__dtype
featureunion__pipeline-2__tfidfvectorizer__encoding
featureunion__pipeline-2__tfidfvectorizer__input
featureunion__pipeline-2__tfidfvectorizer__lowercase
featureunion__pipeline-2__tfidfvectorizer__max_df
featureunion__pipeline-2__tfidfvectorizer__max_features
featureunion__pipeline-2__tfidfvectorizer__min_df
featureunion__pipeline-2__tfidfvectorizer__ngram_range
featureunion__pipeline-2__tfidfvectorizer__norm
featureunion__pipeline-2__tfidfvectorizer__preprocessor
featureunion__pipeline-2__tfidfvectorizer__smooth_idf
featureunion__pipeline-2__tfidfvectorizer__stop_words
featureunion__pipeline-2__tfidfvectorizer__strip_accents
featureunion__pipeline-2__tfidfvectorizer__sublinear_tf
featureunion__pipeline-2__tfidfvectorizer__token_pattern
featureunion__pipeline-2__tfidfvectorizer__tokenizer
featureunion__pipeline-2__tfidfvectorizer__use_idf
featureunion__pipeline-2__tfidfvectorizer__vocabulary
featureunion__pipeline-2__functiontransformer-2__accept_sparse
featureunion__pipeline-2__functiontransformer-2__func
featureunion__pipeline-2__functiontransformer-2__inv_kw_args
featureunion__pipeline-2__functiontransformer-2__inverse_func
featureunion__pipeline-2__functiontransformer-2__kw_args
featureunion__pipeline-2__functiontransformer-2__pass_y
featureunion__pipeline-2__functiontransformer-2__validate
truncatedsvd__algorithm
truncatedsvd__n_components
truncatedsvd__n_iter
truncatedsvd__random_state
truncatedsvd__tol
logisticregression__C
logisticregression__class_weight
logisticregression__dual
logisticregression__fit_intercept
logisticregression__intercept_scaling
logisticregression__max_iter
logisticregression__multi_class
logisticregression__n_jobs
logisticregression__penalty
logisticregression__random_state
logisticregression__solver
logisticregression__tol
logisticregression__verbose
logisticregression__warm_start
# specifically, there are this many parameters we could tweak!
# gridsearch will be great help here
len(pipeline_params)
113
# using GridsearchCV and Pipeline to do hyperparameter
# optimization on the LogisticRegression
from sklearn.model_selection import GridSearchCV
pipeline
Pipeline(memory=None,
steps=[('featureunion', FeatureUnion(n_jobs=1,
transformer_list=[('pipeline-1', Pipeline(memory=None,
steps=[('featureextractor', FeatureExtractor(column='selftext')), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
func=<function extract_first_column at 0x11...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])
Things we could tweak inside of Logistic Regression:
- logisticregression__C (inverse of regularization strength)
- logisticregression__class_weight
- logisticregression__dual
- logisticregression__fit_intercept
- logisticregression__intercept_scaling
- logisticregression__max_iter
- logisticregression__multi_class
- logisticregression__n_jobs
- logisticregression__penalty
- logisticregression__random_state
- logisticregression__solver
- logisticregression__tol
- logisticregression__verbose
- logisticregression__warm_start
# this took me 90 mins to run (i had too many tabs open)
params_grid = {
'logisticregression__penalty': ['l1', 'l2'],
'logisticregression__C':[0.01, 1.0, 100.0],
'logisticregression__fit_intercept': [True, False]
}
gs = GridSearchCV(
pipeline,
params_grid,
n_jobs=-1,
verbose=2
)
gs.fit(X_train, y_train)
Fitting 3 folds for each of 12 candidates, totalling 36 fits
...
[Parallel(n_jobs=-1)]: Done 36 out of 36 | elapsed: 89.3min finished
GridSearchCV(cv=None, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('featureunion', FeatureUnion(n_jobs=1,
transformer_list=[('pipeline-1', Pipeline(memory=None,
steps=[('featureextractor', FeatureExtractor(column='selftext')), ('functiontransformer-1', FunctionTransformer(accept_sparse=False,
func=<function extract_first_column at 0x11...ty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))]),
fit_params=None, iid=True, n_jobs=-1,
param_grid={'logisticregression__penalty': ['l1', 'l2'], 'logisticregression__C': [0.01, 1.0, 100.0], 'logisticregression__fit_intercept': [True, False]},
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
scoring=None, verbose=2)
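The "12 candidates, totalling 36 fits" line in the log follows directly from the grid: every combination of values is one candidate, and each candidate is fitted once per fold. A small standard-library sketch of that arithmetic (the fold count of 3 is taken from the log, and the variable names are mine):

```python
from itertools import product

params_grid = {
    "logisticregression__penalty": ["l1", "l2"],
    "logisticregression__C": [0.01, 1.0, 100.0],
    "logisticregression__fit_intercept": [True, False],
}
n_folds = 3  # the fold count reported in the log above

# Every combination of values is one candidate configuration
names = list(params_grid)
candidates = [dict(zip(names, values)) for values in product(*params_grid.values())]

total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

With `refit=True`, as in the output above, the best candidate is then fitted one more time on the whole training set. Adding one more three-value parameter to the grid would triple the run time, so it pays to count before launching.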
gs.best_params_
{'logisticregression__C': 100.0,
'logisticregression__fit_intercept': False,
'logisticregression__penalty': 'l1'}
predictions = gs.best_estimator_.predict(X_test)
print(confusion_matrix(y_test, predictions))
[[ 25 6 1 2 24 212 1]
[ 0 256 0 5 9 145 1]
[ 0 7 26 23 23 129 7]
[ 1 14 2 180 17 454 28]
[ 1 9 4 15 405 402 18]
[ 25 52 19 91 179 5786 36]
[ 0 1 2 34 21 227 127]]
print(classification_report(y_test, predictions))
precision recall f1-score support
compiled 0.48 0.09 0.15 271
gamedev 0.74 0.62 0.67 416
javascript 0.48 0.12 0.19 215
other 0.51 0.26 0.34 696
python 0.60 0.47 0.53 854
techsupport 0.79 0.94 0.85 6188
webdev 0.58 0.31 0.40 412
avg / total 0.72 0.75 0.72 9052
print(accuracy_score(y_test, predictions))
0.751767565179
2/2: NLP parameters
# adding the best parameters to LR as determined by the gridsearch
pipeline = make_pipeline(
feature_set_extraction,
TruncatedSVD(n_components=50),
LogisticRegression(penalty='l1', C=100.0)  # C must be a float, not the string '100.0'
)
# pipeline.fit(X_train, y_train)
# predictions = pipeline.predict(X_test)
pipeline_params
# this took me 40 mins to run
from sklearn.feature_extraction.text import CountVectorizer
params_grid = {
'truncatedsvd__n_components': [50, 250],
'featureunion__pipeline-1__tfidfvectorizer': [
TfidfVectorizer(stop_words='english'),
CountVectorizer(stop_words='english', max_features=10000)
],
'featureunion__pipeline-2__tfidfvectorizer': [
TfidfVectorizer(stop_words='english'),
CountVectorizer(stop_words='english', max_features=10000)
],
}
gs = GridSearchCV(
pipeline,
params_grid,
n_jobs=-1,
verbose=2
)
gs.fit(X_train, y_train)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
gs.best_params_
{'featureunion__pipeline-1__tfidfvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=10000, min_df=1,
ngram_range=(1, 1), preprocessor=None, stop_words='english',
strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
tokenizer=None, vocabulary=None),
'featureunion__pipeline-2__tfidfvectorizer': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
lowercase=True, max_df=1.0, max_features=None, min_df=1,
ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
stop_words='english', strip_accents=None, sublinear_tf=False,
token_pattern='(?u)\\b\\w\\w+\\b', tokenizer=None, use_idf=True,
vocabulary=None),
'truncatedsvd__n_components': 250}
*CountVectorizer was the best choice for the first vectorizer step (selftext) and TfidfVectorizer was best for the second (title). The best number of components was 250. Let’s now use these parameters and see how well they predict:*
# Function to check fit
def check_fit(predictions, y_true):
    # use the y_true argument rather than the global y_test
    print(confusion_matrix(y_true, predictions))
    print(classification_report(y_true, predictions))
    print(accuracy_score(y_true, predictions))
predictions = gs.best_estimator_.predict(X_test)
check_fit(predictions, y_test)
[[ 76 3 3 4 8 174 3]
[ 0 248 2 13 9 143 1]
[ 1 9 44 16 2 139 4]
[ 4 10 3 205 18 431 25]
[ 4 15 5 15 447 344 24]
[ 56 27 27 74 130 5840 34]
[ 2 2 9 31 11 240 117]]
precision recall f1-score support
compiled 0.53 0.28 0.37 271
gamedev 0.79 0.60 0.68 416
javascript 0.47 0.20 0.29 215
other 0.57 0.29 0.39 696
python 0.72 0.52 0.60 854
techsupport 0.80 0.94 0.87 6188
webdev 0.56 0.28 0.38 412
avg / total 0.75 0.77 0.74 9052
0.770768890853
**The tuned pipeline reaches 0.77 accuracy, about 9 percentage points above the majority-class baseline of 0.68.**
# predicting using subreddit's text
X_train, X_test, y_train, y_test = train_test_split(df['selftext'].values, df['subreds'].values,
test_size=0.33, random_state=2017)
cv = CountVectorizer()
cv.fit(X_train)
X = cv.transform(X_train)
print(X.shape)
(12132, 47561)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
tsvd = TruncatedSVD(n_components=11)
tsvd.fit(X)
plt.plot(range(11), tsvd.explained_variance_ratio_.cumsum())
[<matplotlib.lines.Line2D at 0x1a2491d8d0>]
X_tsvd = tsvd.transform(X)
rfc = RandomForestClassifier()
rfc.fit(X_tsvd, y_train)
print(rfc.score(X_tsvd, y_train))
print(confusion_matrix(y_train, rfc.predict(X_tsvd)))
print(classification_report(y_train, rfc.predict(X_tsvd)))
0.984009231784
[[ 345 0 0 0 0 18 0]
[ 0 527 0 1 2 16 0]
[ 0 0 290 1 1 7 0]
[ 3 0 1 880 0 31 3]
[ 1 1 1 1 1086 46 2]
[ 2 2 4 3 6 8316 3]
[ 0 0 1 2 1 34 494]]
precision recall f1-score support
compiled 0.98 0.95 0.97 363
gamedev 0.99 0.97 0.98 546
javascript 0.98 0.97 0.97 299
other 0.99 0.96 0.97 918
python 0.99 0.95 0.97 1138
techsupport 0.98 1.00 0.99 8336
webdev 0.98 0.93 0.96 532
avg / total 0.98 0.98 0.98 12132
On the training set, this looks like an exceptionally good model with very high accuracy.
X_test_cv = cv.transform(X_test)
X_test_svd = tsvd.transform(X_test_cv)
print(rfc.score(X_test_svd, y_test))
print(confusion_matrix(y_test, rfc.predict(X_test_svd)))
print(classification_report(y_test, rfc.predict(X_test_svd)))
0.643239625167
[[ 21 11 3 8 25 110 1]
[ 8 27 5 19 22 204 1]
[ 3 7 3 9 16 91 2]
[ 12 25 4 58 39 327 10]
[ 9 18 6 29 92 403 14]
[ 30 48 31 119 156 3631 26]
[ 5 10 5 22 31 208 12]]
precision recall f1-score support
compiled 0.24 0.12 0.16 179
gamedev 0.18 0.09 0.12 286
javascript 0.05 0.02 0.03 131
other 0.22 0.12 0.16 475
python 0.24 0.16 0.19 571
techsupport 0.73 0.90 0.81 4041
webdev 0.18 0.04 0.07 293
avg / total 0.56 0.64 0.59 5976
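Except that on the test set it did far worse than expected: accuracy falls from 0.98 to 0.64, which is below the share of techsupport posts in this test split (4041 of 5976, about 0.68). The random forest has overfit the training data.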
pipeline = make_pipeline(
CountVectorizer(),
TruncatedSVD(n_components=10),
RandomForestClassifier())
pipeline.fit(X_train, y_train)
# score it and run the predictions from the training set.
print(pipeline.score(X_train, y_train))
predictions = pipeline.predict(X_train)
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))
0.970820969337
[[ 330 0 0 3 0 30 0]
[ 1 518 0 2 2 22 1]
[ 0 0 280 4 1 12 2]
[ 0 3 1 847 2 61 4]
[ 2 2 2 5 1039 87 1]
[ 1 7 5 10 12 8297 4]
[ 1 3 0 5 5 51 467]]
precision recall f1-score support
compiled 0.99 0.91 0.95 363
gamedev 0.97 0.95 0.96 546
javascript 0.97 0.94 0.95 299
other 0.97 0.92 0.94 918
python 0.98 0.91 0.94 1138
techsupport 0.97 1.00 0.98 8336
webdev 0.97 0.88 0.92 532
avg / total 0.97 0.97 0.97 12132
# score and predict with our test set
print(pipeline.score(X_test, y_test))
predictions = pipeline.predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
0.645582329317
[[ 19 6 3 14 21 111 5]
[ 6 32 6 21 16 201 4]
[ 1 3 9 8 13 91 6]
[ 2 21 6 69 39 323 15]
[ 16 24 6 37 94 392 2]
[ 32 55 34 129 138 3621 32]
[ 5 8 6 17 22 221 14]]
precision recall f1-score support
compiled 0.23 0.11 0.15 179
gamedev 0.21 0.11 0.15 286
javascript 0.13 0.07 0.09 131
other 0.23 0.15 0.18 475
python 0.27 0.16 0.21 571
techsupport 0.73 0.90 0.80 4041
webdev 0.18 0.05 0.08 293
avg / total 0.57 0.65 0.60 5976
# transform the data using CountVectorizer and removing stop words:
cv = CountVectorizer(stop_words='english')
cv.fit(df['title'].values)
X = cv.transform(df['title'].values)
X
<18108x12407 sparse matrix of type '<class 'numpy.int64'>'
with 102034 stored elements in Compressed Sparse Row format>
# instantiate an LDA and fit it to our sparse matrix of words
from sklearn.decomposition import LatentDirichletAllocation
feature_names = cv.get_feature_names()
lda = LatentDirichletAllocation(n_components=7) # 7 for the number of topics
lda.fit(X)
/Users/Olga/anaconda3/lib/python3.6/site-packages/sklearn/decomposition/online_lda.py:536: DeprecationWarning: The default value for 'learning_method' will be changed from 'online' to 'batch' in the release 0.20. This warning was introduced in 0.18.
DeprecationWarning)
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
evaluate_every=-1, learning_decay=0.7, learning_method=None,
learning_offset=10.0, max_doc_update_iter=100, max_iter=10,
mean_change_tol=0.001, n_components=7, n_jobs=1,
n_topics=None, perp_tol=0.1, random_state=None,
topic_word_prior=None, total_samples=1000000.0, verbose=0)
print(lda.components_.shape)
(7, 12407)
results = pd.DataFrame(lda.components_,
columns=feature_names)
# classifying words that are likely to be used together into 7 topics
# if we wanted to find 7 topics
for topic in range(7):
    print('Topic', topic)
    word_list = results.T[topic].sort_values(ascending=False).index
    print(' '.join(word_list[0:25]), '\n')
Topic 0
help need game way java best data audio programming project looking advice list create function app file android software page set open phone input design
Topic 1
use pc drive does wifi getting problem hard know make post disk power tv image 2016 connect randomly gtx sure external really think add thread
Topic 2
laptop new boot code trying learn ssd computer program issues having won hdd want possible slow don just loop keyboard bsod files drivers time 100
Topic 3
screen internet monitor good error card video learning doesn pc desktop graphics display black computer connection work asus google run using javascript chrome driver won
Topic 4
python question games gpu problems computer cpu running motherboard usage high different playing ve swift line coding fps fan programming ram dual test low monitor
Topic 5
windows 10 working pc like web new install router update gaming start fix making got old change device vs random access website network development online
Topic 6
using issue usb text work python amp keeps build mouse laptop code html crashing js server django creating php based port search unable apps iphone
Keywords: hyperparameter optimization, model tweaking, gridsearch