Reddit Post Classification
Problem Statement:
In this project, we want to see if we can tell apart posts from two popular subreddits, datascience and wallstreetbets (or, more generally, any two subreddits or text categories). Can we build a simple tool that accurately says which group a post is more likely to come from? And which words or phrases give us the best clues?
import praw
import pandas as pd
import numpy as np
import nltk
import unicodedata
import re
import spacy
import matplotlib.pyplot as plt
import plotly.express as px
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize.toktok import ToktokTokenizer
from textblob import Word
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.manifold import TSNE
from sklearn.svm import LinearSVC
reddit = praw.Reddit(
    client_id='client_id goes here',
    client_secret='client secret goes here',
    user_agent='Pro3',
    username='-__A__-',
    password=''
)
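Hardcoding credentials in a notebook is risky if it ever gets shared. A minimal sketch of loading them from environment variables instead (the variable names below are just an illustration, not something PRAW requires):
import os

reddit = praw.Reddit(
    client_id=os.environ['REDDIT_CLIENT_ID'],          # assumed env var name
    client_secret=os.environ['REDDIT_CLIENT_SECRET'],  # assumed env var name
    user_agent='Pro3',
    username=os.environ['REDDIT_USERNAME'],
    password=os.environ['REDDIT_PASSWORD']
)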
# Below is JUST an example of how you can use PRAW
# Choose your subreddit
subreddit_DataScience = reddit.subreddit('DataScience')
subreddit_wallstreetbets = reddit.subreddit('wallstreetbets')
# Adjust the limit as needed -- this grabs the most recent posts (Reddit's listings return at most roughly 1,000 posts per call, regardless of the limit)
posts_DS = subreddit_DataScience.new(limit=2525)
posts_wsb = subreddit_wallstreetbets.new(limit=2525)
data = []
for post in posts_DS:
    data.append([post.title, post.selftext, post.subreddit])
# Turn into a dataframe
datascience = pd.DataFrame(data, columns = ['title', 'self_text', 'subreddit'])
datascience.head()
| | title | self_text | subreddit |
|---|---|---|---|
| 0 | What Every Developer Should Know About GPU Com... | | datascience |
| 1 | Do you use CRUD or like apps to bridge the gap... | In my about 5 years of experience working for ... | datascience |
| 2 | Anybody ever been drug tested for handling sen... | I am currently a DA for a company that uses da... | datascience |
| 3 | Any data imputation technique shares? | Hello, \n\nI’ve been reading up some articles ... | datascience |
| 4 | Application of classical time series and deep ... | | datascience |
data_wsb = []
for post in posts_wsb:
    data_wsb.append([post.title, post.selftext, post.subreddit])
# Turn into a dataframe
wsb = pd.DataFrame(data_wsb, columns = ['title', 'self_text', 'subreddit'])
wsb.head()
| | title | self_text | subreddit |
|---|---|---|---|
| 0 | How Can I Bet Against the US Defaulting on Debt? | With the US slipping into 33.5 trillion of deb... | wallstreetbets |
| 1 | Corporate Bankruptcies - next shoe... | | wallstreetbets |
| 2 | Seeking wisdom | Tesla announced their earnings and the stock w... | wallstreetbets |
| 3 | Hedge funds, Pension funds, Banks, CEOs and th... | | wallstreetbets |
| 4 | Bulls: "WE ARE SO OVERSOLD" .. Reality: | | wallstreetbets |
df = pd.concat([datascience, wsb])
df.subreddit.value_counts()
datascience 860
wallstreetbets 717
Name: subreddit, dtype: int64
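Because subreddit.new() returns whatever is most recent at crawl time, the exact counts above will differ between runs. An optional sketch to snapshot the scraped posts so the rest of the analysis is reproducible (the filename is illustrative):
# Save the combined scrape to disk so later cells can be re-run on the same data
df.to_csv('reddit_posts_snapshot.csv', index=False)
# df = pd.read_csv('reddit_posts_snapshot.csv')  # reload in a later session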
df.head()
| | title | self_text | subreddit |
|---|---|---|---|
| 0 | What Every Developer Should Know About GPU Com... | | datascience |
| 1 | Do you use CRUD or like apps to bridge the gap... | In my about 5 years of experience working for ... | datascience |
| 2 | Anybody ever been drug tested for handling sen... | I am currently a DA for a company that uses da... | datascience |
| 3 | Any data imputation technique shares? | Hello, \n\nI’ve been reading up some articles ... | datascience |
| 4 | Application of classical time series and deep ... | | datascience |
df.tail()
| | title | self_text | subreddit |
|---|---|---|---|
| 712 | Giving MCD the big DD | The McRib is back (again). logically speaking ... | wallstreetbets |
| 713 | Banner bank at risk banr | Does anyone know the rumors around which bank ... | wallstreetbets |
| 714 | Meta laying off most of Metaverse teams | “Meta (META.O) is planning to lay off employee... | wallstreetbets |
| 715 | Mortgage rates just hit 8% | Student loan payments start again this month, ... | wallstreetbets |
| 716 | Microsoft Needs So Much Power to Train AI That... | Invest in small nuclear reactor manufacturers | wallstreetbets |
Shuffling the DataFrame
df = df.sample(frac = 1)
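For reproducibility, the shuffle can take a seed; a minimal variant of the line above (not what was run here):
# Same shuffle, but deterministic across runs
df = df.sample(frac=1, random_state=42)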
df[:10]
| | title | self_text | subreddit |
|---|---|---|---|
| 71 | Sap ui5 fiori vs data science | I'm in a part of my life where I hate my job. ... | datascience |
| 783 | PG Certification in Business Data Analytics | Hey!\n\nDo you have any reviews on the above m... | datascience |
| 205 | SHAP Deep Reinforcement Learning | Hi Guys,\n\nIs there a way to integrate SHAP w... | datascience |
| 537 | AAA service trucks are using Rivians now | | wallstreetbets |
| 304 | What do corporate data scientists struggle wit... | As a data scientist, if you could let someone ... | datascience |
| 521 | Idea for a Tool - "Define your data science pr... | Hey guys, I've worked with a lot of clients th... | datascience |
| 226 | Daily Discussion Thread for October 16, 2023 | **Join **[**WSB's community voice chat**](http... | wallstreetbets |
| 232 | How a slick accounting maneuver led to a $29 b... | | wallstreetbets |
| 7 | Data Structures & Algorithms in Data Science | hi ppl. I'm wondering if it is useful to learn... | datascience |
| 363 | Quantifying picture component to a whole | Simple example would be chopping a square into... | datascience |
Feature engineering and pre-processing
Merging title and self_text
df['post'] = df.apply(lambda row: f"title: {row['title']} text: {row['self_text']}", axis=1)
df.head()
| | title | self_text | subreddit | post |
|---|---|---|---|---|
| 71 | Sap ui5 fiori vs data science | I'm in a part of my life where I hate my job. ... | datascience | title: Sap ui5 fiori vs data science text: I'm... |
| 783 | PG Certification in Business Data Analytics | Hey!\n\nDo you have any reviews on the above m... | datascience | title: PG Certification in Business Data Analy... |
| 205 | SHAP Deep Reinforcement Learning | Hi Guys,\n\nIs there a way to integrate SHAP w... | datascience | title: SHAP Deep Reinforcement Learning text: ... |
| 537 | AAA service trucks are using Rivians now | | wallstreetbets | title: AAA service trucks are using Rivians no... |
| 304 | What do corporate data scientists struggle wit... | As a data scientist, if you could let someone ... | datascience | title: What do corporate data scientists strug... |
df.drop(['title','self_text'], axis=1, inplace=True)
df.head()
| | subreddit | post |
|---|---|---|
| 71 | datascience | title: Sap ui5 fiori vs data science text: I'm... |
| 783 | datascience | title: PG Certification in Business Data Analy... |
| 205 | datascience | title: SHAP Deep Reinforcement Learning text: ... |
| 537 | wallstreetbets | title: AAA service trucks are using Rivians no... |
| 304 | datascience | title: What do corporate data scientists strug... |
Preprocessing
I use regex to remove numbers and links from the post and create a new column called cleaned_post with the processed text.
pattern = r'\b\d+\b|http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
df['cleaned_post'] = df['post'].replace(pattern, '', regex=True)
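As a quick sanity check of what the pattern removes, here is the substitution applied to a made-up string (the example text is purely illustrative):
sample = "Check https://example.com/page and note the 1234 count"
print(re.sub(pattern, '', sample))
# URLs and standalone numbers are stripped, leaving some extra whitespace behind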
- Starting off with the highest single-word (unigram) counts.
- Highest bigram counts.
- Highest trigram counts.
We will use CountVectorizer with different ngram_range settings (a small illustrative example follows this list):
- ngram_range = (1,1) -> To do EDA on unigrams
- ngram_range = (2,2) -> To do EDA on bigrams
- ngram_range = (3,3) -> To do EDA on trigrams
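To make the ngram_range settings concrete, here is a tiny illustrative example on a made-up two-sentence corpus:
toy = ["the stock market went up", "the data science job market is hot"]
for n in (1, 2, 3):
    cv = CountVectorizer(ngram_range=(n, n))
    cv.fit(toy)
    print(n, cv.get_feature_names_out())
# n=1 prints single words, n=2 prints word pairs ("stock market", ...), n=3 prints word triples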
# initialize CountVectorizer
vectorizer = CountVectorizer()
# fit the vectorizer and transform the cleaned posts into a document-term matrix
wm = vectorizer.fit_transform(df['cleaned_post'])
wm.toarray()
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
...,
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 0, 0]], dtype=int64)
df_vect_ex = pd.DataFrame(wm.toarray(), columns= vectorizer.get_feature_names_out(), index=df.index)
df_vect_ex
| | 06pm | 0dte | 0dtes | 0s | 0t | 0th | 1000s | 100_000_000 | 100bps | 100k | ... | za | zero | zhang | zhuzh | zig | zillow | zone | zones | zoom | zoomer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
71 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
783 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
205 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
537 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
304 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
336 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
40 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
838 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
576 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
61 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1577 rows × 11947 columns
df_vect_ex['target_subreddit'] = df['subreddit']
Count_w = df_vect_ex.drop('target_subreddit', axis=1).sum().sort_values(ascending = False)
import seaborn as sns
sns.barplot(x=Count_w.index[:10], y = Count_w[:10], color='purple')
plt.show()
def gettopten(df):
    nv = CountVectorizer(stop_words='english', token_pattern=(r'\b(?!http\b|https\b|www\b|ftp\b)(?<!http)(?<!https)(?<!www)(?<!ftp)'
                                                              r'\b[^\d\W]+\b(?!.[a-zA-Z0-9]+\b)'))
    nvv = nv.fit_transform(df['cleaned_post'])
    df_no = pd.DataFrame(nvv.toarray(), columns=nv.get_feature_names_out(), index=df.index)
    new_count = df_no.sum().sort_values(ascending=False)
    return sns.barplot(x=new_count.index[:10], y=new_count[:10], palette='colorblind')
gettopten(df)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Now we will check the top 10 word counts in each subreddit.
df[df['subreddit'] == 'datascience']
| | subreddit | post | cleaned_post |
|---|---|---|---|
| 71 | datascience | title: Sap ui5 fiori vs data science text: I'm... | title: Sap ui5 fiori vs data science text: I'm... |
| 783 | datascience | title: PG Certification in Business Data Analy... | title: PG Certification in Business Data Analy... |
| 205 | datascience | title: SHAP Deep Reinforcement Learning text: ... | title: SHAP Deep Reinforcement Learning text: ... |
| 304 | datascience | title: What do corporate data scientists strug... | title: What do corporate data scientists strug... |
| 521 | datascience | title: Idea for a Tool - "Define your data sci... | title: Idea for a Tool - "Define your data sci... |
| ... | ... | ... | ... |
| 692 | datascience | title: Possibility of getting Data Science (Jr... | title: Possibility of getting Data Science (Jr... |
| 505 | datascience | title: AI Career text: I'm currently in my fir... | title: AI Career text: I'm currently in my fir... |
| 780 | datascience | title: Do people not use sci-kit learn / other... | title: Do people not use sci-kit learn / other... |
| 838 | datascience | title: Computer for Coding text: Hi everyone, ... | title: Computer for Coding text: Hi everyone, ... |
| 61 | datascience | title: Repetitive airflow pipeline problems te... | title: Repetitive airflow pipeline problems te... |
860 rows × 3 columns
gettopten(df[df['subreddit'] == 'datascience'])
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
gettopten(df[df['subreddit'] == 'wallstreetbets'])
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
def gettop10(df, n, stop='english'):
    cvec = CountVectorizer(ngram_range=(n, n), stop_words=stop)
    nvv = cvec.fit_transform(df['cleaned_post'])
    df_no = pd.DataFrame(nvv.toarray(), columns=cvec.get_feature_names_out(), index=df.index)
    new_count = df_no.sum().sort_values(ascending=False)
    plt.figure(figsize=(12, 6))
    plt.tight_layout()
    return sns.barplot(x=new_count[:10], y=new_count.index[:10], palette='colorblind')
Top 10 highest-occurring bigrams in the entire dataset
gettop10(df,2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Discovery:
A token called x200b appears frequently. Upon further investigation, this is the Unicode zero-width space character (U+200B) that Reddit markdown inserts. We will need to remove this character from the posts with the help of regex.
df['cleaned_post'] = df['cleaned_post'].replace(r'x200B|text|title|\n|\'', '', regex=True)
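The literal token x200B shows up because Reddit markdown often carries the zero-width space as the HTML entity &#x200B;. An alternative sketch (not what was run above) that targets the entity as well as the character itself:
# Strip the HTML-entity form and the actual zero-width space character
df['cleaned_post'] = df['cleaned_post'].replace(r'&amp;#x200B;|&#x200B;|\u200b', '', regex=True)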
gettop10(df,2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Top 10 occurring bigrams in the wallstreetbets subreddit
gettop10(df[df['subreddit'] == 'wallstreetbets'],2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Top 10 occurring bigrams in the datascience subreddit
gettop10(df[df['subreddit'] == 'datascience'],2)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Top 10 occurring trigrams in the wallstreetbets subreddit
gettop10(df[df['subreddit'] == 'wallstreetbets'],3)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Top 10 occurring trigrams in the datascience subreddit
gettop10(df[df['subreddit'] == 'datascience'],3)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
Trigrams with stopwords included
gettop10(df[df['subreddit'] == 'datascience'],3,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
gettop10(df[df['subreddit'] == 'wallstreetbets'],3,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
4-grams with stopwords included
gettop10(df[df['subreddit'] == 'datascience'],4,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
gettop10(df[df['subreddit'] == 'wallstreetbets'],4,stop=None)
plt.title('Top 10 Values')
plt.tight_layout()
plt.show()
t-SNE Visualization
df.reset_index(drop=True, inplace=True)
color_palette = [ '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']
#'#E69F00', '#56B4E9','#0072B2'
def tsne_viz(df, n, stop='english'):
    cvec = TfidfVectorizer(ngram_range=(n, n), stop_words=stop)
    vectorized_matrix = cvec.fit_transform(df['cleaned_post'])
    tsne = TSNE(n_components=3, random_state=42)
    tsne_results = tsne.fit_transform(vectorized_matrix.toarray())
    fig = plt.figure(figsize=(12, 8))
    ax = fig.add_subplot(111, projection='3d')
    scatter = ax.scatter(tsne_results[:, 0], tsne_results[:, 1], tsne_results[:, 2],
                         c=pd.factorize(df['subreddit'])[0], cmap="viridis", s=60)
    legend1 = ax.legend(*scatter.legend_elements(), title="Subreddits")
    ax.add_artist(legend1)
    ax.set_title('3D t-SNE Visualization')
    ax.set_xlabel('t-SNE Dimension 1')
    ax.set_ylabel('t-SNE Dimension 2')
    ax.set_zlabel('t-SNE Dimension 3')
    plt.show()
tsne_viz(df, 3, stop='english')
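Calling .toarray() on the TF-IDF matrix densifies it, which gets slow and memory-hungry as the vocabulary grows. A common alternative (not used in this notebook) is to reduce the sparse matrix with TruncatedSVD before handing it to t-SNE; a sketch:
from sklearn.decomposition import TruncatedSVD

def tsne_viz_svd(df, n, stop='english', svd_dim=50):
    # Reduce the sparse TF-IDF matrix to svd_dim dense components first
    tfidf = TfidfVectorizer(ngram_range=(n, n), stop_words=stop).fit_transform(df['cleaned_post'])
    reduced = TruncatedSVD(n_components=svd_dim, random_state=42).fit_transform(tfidf)
    # Then run t-SNE on the much smaller dense matrix
    return TSNE(n_components=3, random_state=42, init='pca', learning_rate='auto').fit_transform(reduced)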
def tsne_vizinter(df, n, stop='english'):
    cvec = TfidfVectorizer(ngram_range=(n-1, n), stop_words=stop)
    vectorized_matrix = cvec.fit_transform(df['cleaned_post'])
    tsne = TSNE(n_components=3, random_state=42)
    tsne_results = tsne.fit_transform(vectorized_matrix.toarray())
    df_tsne = pd.DataFrame(tsne_results, columns=['dim1', 'dim2', 'dim3']).reset_index(drop=True)
    df_tsne['subreddit'] = df['subreddit'].reset_index(drop=True)
    fig = px.scatter_3d(df_tsne, x='dim1', y='dim2', z='dim3', color='subreddit', color_discrete_sequence=color_palette)
    fig.show()
tsne_vizinter(df, 3, stop='english')
We can see that bigram and trigram features give a mixed cluster.
def tsne_viz_index(df, n, stop='english'):
    cvec = TfidfVectorizer(ngram_range=(n-1, n), stop_words=stop)
    vectorized_matrix = cvec.fit_transform(df['cleaned_post'])
    tsne = TSNE(n_components=3, random_state=42)
    tsne_results = tsne.fit_transform(vectorized_matrix.toarray())
    df_tsne = pd.DataFrame(tsne_results, columns=['dim1', 'dim2', 'dim3'])
    df_tsne['subreddit'] = df['subreddit'].reset_index(drop=True)
    df_tsne['index'] = df.index  # add the index as a column for hover info
    fig = px.scatter_3d(df_tsne, x='dim1', y='dim2', z='dim3', color='subreddit', hover_data=['index'], color_discrete_sequence=color_palette)
    fig.show()
tsne_viz_index(df, 2, stop='english')
We see unigram and bigram features give a good cluster.
print(df.iloc[501]['cleaned_post'])  # checking out the outlier post in the above visualization
print(df.iloc[501]['subreddit'])
: My fault guys I bought DTE Calls a minute before the drop :
wallstreetbets
An outlier that only has a title and a : for text.
print(df.iloc[1148]['cleaned_post'])  # checking out the outlier post in the above visualization
print(df.iloc[1148]['subreddit'])
: Isn’t Disney supposed to be under ? : I have been watching this stock for really long time. It does look like company isn’t going to get better anytime soon. Even Ceo said in the interview that Disney is in worse shape than he thought. Why are people still buying this stock? Is it solely because people are betting it’s going to turn around like meta? I’m bullish on Disney but I’m just going to wait until it goes below .
wallstreetbets
This post is correctly flagged as an outlier by the t-SNE visualization, since its content is atypical for the wallstreetbets subreddit.
print(df.iloc[59]['cleaned_post'])  # checking out the outlier post in the above visualization
print(df.iloc[59]['subreddit'])
: Sick to my stomach - Lost 23K : I started with the about 8K investing at the beginning of this year. Had made it to little over 40K by end of September. Today I disregarded all my stop loss rules, and personal limits and paid for it dearly. My mind was so set on chart patterns, Vs, and inverse Vs from all these days of trading, I was too confident that at some point, there would be a drop, and I kept buying puts on top of puts. Let this be a lesson bools and bears. I am really sad, angry and upset today. Will see about Monday when Monday comes along.
wallstreetbets
Another example of an outlier.
tsne_viz_index(df, 3, stop='english')
Trigram and bigram features give a mixed cluster.
print(df.iloc[376]['cleaned_post'])  # checking out the outlier post in the above visualization
print(df.iloc[376]['subreddit'])
: AI’s Data Cannibalism : Im looking to read more on this topic mentioned in the .&#;Feel free to suggest books and articles
datascience
This post is quite ambiguous.
Further cleaning
import re  # Source: ChatGPT

def clean_text(text):
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove emojis
    emoji_pattern = re.compile(
        u"([\U00002600-\U000027BF])|"  # Misc symbols
        u"([\U0001F600-\U0001F64F])|"  # Emoticons
        u"([\U0001F300-\U0001F5FF])|"  # Symbols & pictographs
        u"([\U0001F680-\U0001F6FF])|"  # Transport & map symbols
        u"([\U0001F700-\U0001F77F])|"  # Alchemical symbols
        u"([\U0001F780-\U0001F7FF])|"  # Geometric shapes ext
        u"([\U0001F800-\U0001F8FF])|"  # Supplemental arrows C
        u"([\U0001F900-\U0001F9FF])|"  # Supplemental symbols
        u"([\U0001FA00-\U0001FA6F])|"  # Chess symbols
        u"([\U0001FA70-\U0001FAFF])"   # Symbols and pictographs ext A
        , re.UNICODE)
    text = emoji_pattern.sub(r'', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove placeholder text
    text = re.sub(r'Daily Discussion Thread for [A-Za-z\s]+,', '', text)
    return text
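A quick sanity check of clean_text on a made-up string:
# URLs and emojis are dropped, repeated whitespace collapses to single spaces
print(clean_text("Check https://example.com 🚀🚀   lots   of   space"))
# -> "Check lots of space"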
df2 = df.copy()
df2['cleaned_post'] = df2['cleaned_post'].apply(clean_text)
tsne_viz_index(df2, 3, stop='english')
Even after further cleaning, bigram and trigram features still give a mixed cluster. This implies that bigram and trigram features are not a good choice for modeling.
print(df2.iloc[1162]['cleaned_post'])  # checking out the outlier post in the above visualization
print(df2.iloc[1162]['subreddit'])
: Eye Tracking Data : Hey all,I am a neuroscience Ph.D. student working with some eye-tracking data. The typical approach in my lab has been to try and fit the data to a GLM. Which is fine as a first pass, but I dont want to be limited to just that. I am curious if anyone else here has worked with eye-tracking data and can point me in the right direction. As far as the details are concerned, I am collecting eye-tracking data in few experimental cons. I would go into detail, but I want to stay at least a bit vague for privacy concerns. But to give you some idea of what I am doing, I have one task where participants are looking for a certain stimulus among distractor stimuli. The primary measurable output of this experiment is what stimulus they move their eyes to. But I am sure there is more information captured in the eye-tracking data that we can leverage. Another experiment is looking at overall gaze stability to infer cognitive mechanisms. If anyone is interested, I am willing to go in to more detail via PM. Any help would be appreciated! My first instinct to use some form of logistic regression or SVM and check performance. Let me know if I am on the right track.
datascience
The t-SNE visualization flags this post as an extreme outlier.
print(df2.iloc[193]['cleaned_post'])  # checking out the outlier post in the above visualization
print(df2.iloc[193]['subreddit'])
: Hoping on the AMC bull train :
wallstreetbets
This is the same person.
tsne_viz_index(df2, 4, stop='english')
Trigram and 4-gram features give a mixed cluster.
The t-SNE visualization shows that unigram and bigram features give us the best clusters.
Modeling
X = df2['cleaned_post']
y = df2['subreddit'].reset_index(drop=True)
y = y.apply(lambda x: x.display_name)  # the subreddit column holds PRAW Subreddit objects; convert them to plain name strings
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=42)
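Since the two classes are somewhat imbalanced (860 vs 717 posts), a stratified split is a reasonable variation to keep the class ratio identical in train and test (not what was run above):
# Stratified variant: preserves the datascience/wallstreetbets ratio in both splits
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)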
pgrid = {
'tvec__stop_words': [None, 'english'],
'tvec__min_df': [1, 2, 3],
'tvec__ngram_range': [(1, 1), (1, 2), (2,2)],
'logit__penalty': ['l1','l2'],
'logit__C': [0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5,10,50],
'logit__max_iter': [ 2000],
'logit__solver': ['liblinear']
}
pipe = Pipeline([
('tvec', TfidfVectorizer()),
('logit', LogisticRegression())
])
gs_tvec = GridSearchCV(pipe, pgrid, cv=10, n_jobs=6)
%%time
gs_tvec.fit(X_train, y_train)
Wall time: 6min 40s
GridSearchCV(cv=10,
estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
('logit', LogisticRegression())]),
n_jobs=6,
param_grid={'logit__C': [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05,
0.1, 0.5, 1, 5, 10, 50],
'logit__max_iter': [2000],
'logit__penalty': ['l1', 'l2'],
'logit__solver': ['liblinear'],
'tvec__min_df': [1, 2, 3],
'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
'tvec__stop_words': [None, 'english']})
gs_tvec.score(X_test, y_test)
0.959493670886076
gs_tvec.best_params_
{'logit__C': 5,
'logit__max_iter': 2000,
'logit__penalty': 'l2',
'logit__solver': 'liblinear',
'tvec__min_df': 1,
'tvec__ngram_range': (1, 1),
'tvec__stop_words': 'english'}
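The problem statement also asks which words give the best clues. One way to answer that (a sketch, not part of the original run) is to read the coefficients off the tuned logistic regression pipeline:
best_pipe = gs_tvec.best_estimator_
vocab = best_pipe.named_steps['tvec'].get_feature_names_out()
coefs = pd.Series(best_pipe.named_steps['logit'].coef_[0], index=vocab).sort_values()
print(best_pipe.named_steps['logit'].classes_)  # positive coefficients push toward classes_[1]
print(coefs.head(15))  # strongest signals for classes_[0]
print(coefs.tail(15))  # strongest signals for classes_[1]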
Let's also train an SVM model
pgrid_svm = {
'tvec__stop_words': [None, 'english'],
'tvec__min_df': [1, 2, 3],
'tvec__ngram_range': [(1, 1), (1, 2), (2,2)],
'svm__penalty': ['l2'], # The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
'svm__C': [0.00001,0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.5,1,5,10,50],
'svm__max_iter': [2000]
}
pipe_svm = Pipeline([
('tvec', TfidfVectorizer()),
('svm', LinearSVC())
])
gs_tvec_SVM = GridSearchCV(pipe_svm, pgrid_svm, cv=10, n_jobs=6)
%%time
gs_tvec_SVM.fit(X_train, y_train)
Wall time: 59.1 s
GridSearchCV(cv=10,
estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
('svm', LinearSVC())]),
n_jobs=6,
param_grid={'svm__C': [1e-05, 0.0001, 0.0005, 0.001, 0.005, 0.01,
0.05, 0.1, 0.5, 1, 5, 10, 50],
'svm__max_iter': [2000], 'svm__penalty': ['l2'],
'tvec__min_df': [1, 2, 3],
'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
'tvec__stop_words': [None, 'english']})
gs_tvec_SVM.best_params_
{'svm__C': 0.5,
'svm__max_iter': 2000,
'svm__penalty': 'l2',
'tvec__min_df': 1,
'tvec__ngram_range': (1, 2),
'tvec__stop_words': 'english'}
gs_tvec_SVM.score(X_test, y_test)
0.9670886075949368
We have successfully trained a model that achieves about 97% accuracy on the test set!
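ConfusionMatrixDisplay is imported at the top but never used; a minimal sketch (requires scikit-learn 1.0 or newer, output not shown in the original run) of how the SVM pipeline's test-set errors could be broken down:
ConfusionMatrixDisplay.from_estimator(gs_tvec_SVM.best_estimator_, X_test, y_test)
plt.title('LinearSVC pipeline - test set confusion matrix')
plt.tight_layout()
plt.show()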
preds = gs_tvec.predict(["I made $1000 on the stock market today! let's go baby!",
"How do I remove null objects from my dataset?",
"Guys I need some investment decisions, please help.",
"I trained a Logistic Regression model to classify the subreddits of a given post."])
preds
array(['wallstreetbets', 'datascience', 'wallstreetbets', 'datascience'],
dtype=object)
pd.DataFrame({"input": ["I made $1000 on the stock market today! let's go baby!",
"How do I remove null objects from my dataset?",
"Guys I need some investment decisions, please help.",
"I trained a Logistic Regression model to classify the subreddits of a given post."],
"model prediction": preds})
| | input | model prediction |
|---|---|---|
| 0 | I made $1000 on the stock market today! let's ... | wallstreetbets |
| 1 | How do I remove null objects from my dataset? | datascience |
| 2 | Guys I need some investment decisions, please ... | wallstreetbets |
| 3 | I trained a Logistic Regression model to class... | datascience |
Conclusion:
- We crawled Reddit using PRAW and pulled posts from two subreddits, datascience and wallstreetbets.
- We performed EDA on the word counts occurring in each subreddit and found promising results:
  - datascience subreddit bigram counts:
  - wallstreetbets subreddit bigram counts:
- We also performed t-SNE visualizations and found that unigram and bigram features give the best clusters.

Our best model has an accuracy of over 95%.

The classifier we built correctly classifies posts based on the patterns it learned during training, as can be seen in the tests we did on made-up posts:
input | model prediction |
---|---|
I made $1000 on the stock market today! let’s go baby! | wallstreetbets |
How do I remove null objects from my dataset? | datascience |
Guys I need some investment decisions, please help. | wallstreetbets |
I trained a Logistic Regression model to classify the subreddits of a given post. | datascience |