%%shell
jupyter nbconvert --to html /content/drive/"My Drive"/Colab_Notebooks/yayfinal.ipynb
Our project website can be found at: https://mscb25.github.io/datasci-final-maddieriley/
Migraines are the second most common cause of disability worldwide. Presenting in a number of ways, migraines are highly variable in triggers, symptoms, and severity. The purpose of this project is to explore the relationships between migraines and their comorbidities. Using medical history and genetic data, we aimed to explore how migraine occurrence could be linked to other factors.
We considered two common forms of migraine in this study - migraine with aura and migraine without aura:
Goals and Plans:
Overall, through this assignment, we hoped to better understand the statistical association between migraines, common comorbidities, and genomics; this would allow for an improved understanding of the pathophysiology of migraines.
Some questions we wanted to explore are:
Project Disclaimer:
Through this research experiment, we wanted to explore statistical associations between migraines, the presentation of comorbidities, and gene variants. We are NOT claiming that migraines cause comorbidities or vice versa. We are also NOT claiming that allele variants between populations are a cause of migraines. Any usage of "correlation" or "causation" in this assignment refers to the perceived relationship between variables; it is not indicative of an accurate scientific conclusion.
Collaboration Plan:
To ensure completion of this project, we met at least twice a week. For most of the semester, we found Tuesday and Thursday evenings around 8 pm to be optimal. Since we also had classes together each day, we used time before or after lecture to update one another on nightly progress. We also texted one another and shared planning documents via Google Drive. In addition, we often split tasks to save time, with Maddie organizing a majority of the data and Riley creating visual representations.
For this project, we obtained data from the Personal Genome Project (PGP). This open data source provided the participant records and survey responses used in this assignment.
1) PGP Google Surveys
Located at https://my.pgp-hms.org/google_surveys
Within the PGP data repositories, there are a number of 'participant surveys' filled out by the subject.
We utilized two of the general information surveys to gauge which populations were being represented.
This Google survey includes the 'Participant' (represented by an 8-digit code) followed by personal information including: year of birth, sex/gender, and race/ethnicity
This survey also includes the 'Participant' followed by various phenotypical categories including: height, blood type, and eye color
We also collected data regarding the specific medical conditions participants had. Each file contained the 'Participant' and a column indicating whether they had a medical condition under that umbrella. The syndrome classes we looked at were:
The occurrence of migraines in participants was evaluated with this survey
In total, these sources amounted to data from 13 different Google surveys.
2) Get-Evidence Variant Reports - Genomic Data
Given the medical histories of the participants (collected by the methods above), we wanted to determine whether there was any genetic relation between migraines and/or comorbidities.
To achieve this goal, we scraped data from the PGP whole genome datasets. Since only some participants had their genetic data uploaded, we filtered by 'Whole genome datasets' and accessed the profiles with this component fulfilled.
Since there was no dataset that contained all the information we desired, we took information from participant profiles and created a data source, which can be accessed here.
Methodology:
In summation, the rare variants and uncommon gene mutations were organized in a dataframe
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab_Notebooks
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
As highlighted in Part 0, there are 13 different google surveys containing vital information about the medical history of the participants. Most of the challenges in this section arose from trying to present all the information in a digestible manner.
# These are the 11 surveys containing medical conditions of each patient
# Each csv was read in
nerv_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_NervousSystem-20181010220056 (1).csv")
circ_data = pd.read_csv('PGPTrait&DiseaseSurvey2012_CirculatorySystem-20181010220109.csv')
endo_data = pd.read_csv('PGPTrait&DiseaseSurvey2012_Endocrine,Metabolic,Nutritional,AndImmunity-20181010220044 (1).csv')
blood_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_Blood-20181010220050.csv")
vis_hear_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_VisionAndHearing-20181010220103.csv")
resp_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_RespiratorySystem-20181010220114.csv")
digest_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_DigestiveSystem-20181010214607.csv")
genit_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_GenitourinarySystems-20181010214612.csv")
skin_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_SkinAndSubcutaneousTissue-20181010214618.csv")
musculo_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_MusculoskeletalSystemAndConnectiveTissue-20181010214624.csv")
congen_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_CongenitalTraitsAndAnomalies-20181010214629.csv")
# Phenotypic and general physical traits (the other 2 surveys) were also read in
phenotypes = pd.read_csv("PGPBasicPhenotypesSurvey2015-20181010214636.csv")
gen_survey = pd.read_csv("PGPParticipantSurvey-20181010220019.csv")
nerv_data.head() # All condition data takes on this format.
#putting all the csv's in a list format for easier manipulation
condition_data = [nerv_data,circ_data,endo_data,blood_data,vis_hear_data,resp_data,
digest_data,genit_data,skin_data,musculo_data,congen_data,]
def drop_col(): # dropping unwanted columns from every condition frame, in place
    for frame in condition_data:
        frame.drop(columns=["Do not touch!", "Timestamp", "Other condition not listed here?"], inplace=True)
drop_col()
nerv_data.head() #now, this is how each dataframe looks
# renaming the columns for clarity
nerv_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Nervous System Conditions' },inplace=True)
circ_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Circulatory System Conditions'},inplace=True)
endo_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Endocrine System Conditions'},inplace=True)
blood_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Blood Conditions' },inplace=True)
vis_hear_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Visual and Hearing Conditions' },inplace=True)
resp_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Respiratory System Conditions'},inplace=True)
digest_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Digestive System Conditions'},inplace=True)
genit_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Genitourinary System Conditions'},inplace=True)
skin_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Skin Conditions'},inplace=True)
musculo_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Musculoskeletal System Conditions'},inplace=True)
congen_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Congenital Conditions'},inplace=True)
After each set of data was downloaded and cleaned, we were left with 11 frames of medical information. Since 'Participant' was a shared column in each, we used inner join to merge all the conditions into a singular dataframe.
# merging into one df
conditions = condition_data[0]
for frame in condition_data[1:]: # inner join each remaining survey on the shared 'Participant' column
    conditions = conditions.merge(frame, on="Participant", how="inner")
conditions
After merging, the resulting dataframe needed to be cleaned
# Way too many rows
conditions['Participant'].value_counts()
# duplicates need to go
conditions = conditions.drop_duplicates(subset=['Participant'])
conditions
# duplicates gone but need to reset index
conditions = conditions.reset_index(drop=True)
conditions
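The row count explodes after the merge because several participants appear more than once in each survey, and an inner join pairs every matching row on the left with every matching row on the right. The sketch below uses two tiny hypothetical frames (not the PGP files) to show the effect, and one possible alternative: de-duplicating each frame before merging. Which submission `drop_duplicates` keeps (the first one) is an assumption of this sketch, not something the surveys specify.

```python
import pandas as pd

# Hypothetical survey frames: participant "A" submitted each survey twice.
nerv = pd.DataFrame({
    "Participant": ["A", "A", "B"],
    "Nervous System Conditions": ["Migraine with aura", "Migraine with aura", None],
})
circ = pd.DataFrame({
    "Participant": ["A", "A", "B"],
    "Circulatory System Conditions": ["Hypertension", "Hypertension", None],
})

# Merging first: the 2 x 2 duplicate rows for "A" become 4 rows, plus 1 row for "B".
merged_first = nerv.merge(circ, on="Participant", how="inner")

# De-duplicating first keeps one row per participant through the merge.
dedup_first = (
    nerv.drop_duplicates(subset=["Participant"])
        .merge(circ.drop_duplicates(subset=["Participant"]), on="Participant", how="inner")
)

print(len(merged_first), len(dedup_first))  # prints: 5 2
```

With eleven surveys the multiplication compounds at every join, so de-duplicating early would also keep the intermediate frames small.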
Now, it is evident that there are 1767 unique participants that shared their full medical history.
The next step was to determine how many of these individuals had been diagnosed with migraines.
# number of people with migraines
mig_w_aura = conditions['Nervous System Conditions'].str.contains('Migraine with aura', case=False, na=False)
w_sum = mig_w_aura.sum()
mig_no_aura = conditions['Nervous System Conditions'].str.contains('Migraine without aura', case=False, na=False)
no_sum = mig_no_aura.sum()
w_sum, no_sum
# just Participants with migraines
only_mig_haver = conditions[mig_w_aura | mig_no_aura]
only_mig_haver.Participant.count()
only_mig_haver
Out of the dataset, there are 384 individuals with migraines: 215 participants have migraine with aura and 218 have migraine without aura. Because these two counts sum to more than 384, some participants reported both forms.
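The three counts are tied together by inclusion-exclusion: the number with either form equals the number with aura plus the number without aura, minus those who reported both. The sketch below checks the identity on a small hypothetical answer column (the comma-separated multi-select layout is an assumption) and then applies it to the reported totals.

```python
import pandas as pd

# Hypothetical multi-select answers; the real survey strings may be formatted differently.
answers = pd.Series([
    "Migraine with aura",
    "Migraine without aura",
    "Migraine with aura, Migraine without aura",
    "Epilepsy",
    None,
])

with_aura = answers.str.contains("Migraine with aura", case=False, na=False)
without_aura = answers.str.contains("Migraine without aura", case=False, na=False)

either = int((with_aura | without_aura).sum())
both = int((with_aura & without_aura).sum())

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert either == int(with_aura.sum()) + int(without_aura.sum()) - both

# The same identity applied to the counts reported in this notebook.
both_reported = 215 + 218 - 384
print(both, both_reported)  # prints: 1 49
```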
Then, we needed to process and clean the phenotypic data for the participants
# Now need to clean up phenotype and general data
phenotypes.columns
# columns to drop
unwanted_traits =['1.4 — Comments',
'2.1 — Left Eye (Photograph Number) (full-size image: https://goo.gl/XQ2Voh)',
'2.2 — Right Eye (Photograph Number) (full-size image: https://goo.gl/XQ2Voh)',
'2.3 — Left Eye Color - Text Description',
'2.4 — Right Eye Color - Text Description', '2.5 —Comments',
'3.1 — What is your natural hair color currently, when without artificial color or dye?',
'3.2 — Hair Color - Text Description', '3.3 — Comments',
'4.1 — Any final thoughts?', '1.4 — Handedness','Timestamp', 'Do not touch!']
phenotypes = phenotypes.drop(unwanted_traits,axis=1)
# desired columns
phenotypes.rename(columns={'1.1 — Blood Type': 'Blood Type', '1.2 — Height': 'Height (in)', '1.3 — Weight': 'Weight (lbs)'},inplace=True)
# lots of null values present
phenotypes.isnull().sum()
# ensuring all the NaN values are dropped
phenotypes = phenotypes.dropna()
phenotypes = phenotypes.reset_index(drop=True)
phenotypes
When that was completed, we moved onto cleaning the general traits frame
# cleaning up the traits data
gen_survey.columns
# grabbing the traits we're interested in
traits = gen_survey[['Participant','Sex/Gender','Race/ethnicity']].copy() # .copy() so later column assignments do not trigger SettingWithCopyWarning
# using more descriptive categories for race/ethnicity
traits['Race/ethnicity'] = traits['Race/ethnicity'].str.split(n=3).str[:3].str.join(' ')
traits['Race/ethnicity'].value_counts()
# replacing the truncated category values with clean labels
traits['Race/ethnicity'] = traits['Race/ethnicity'].replace({'American Indian /':'American Indian','Hispanic or Latino,': 'Hispanic or Latino',
'Asian, White': 'Asian', 'or': '','Asian, Black or': 'Asian', 'White, No response': 'White',
'Native Hawaiian or': 'Native Hawaiian'})
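To make the truncation step easier to follow, here is a small sketch of what `.str.split(n=3).str[:3].str.join(' ')` does to a few answers. The input strings are hypothetical stand-ins; the actual survey wording may differ.

```python
import pandas as pd

# Hypothetical raw answers, used only to illustrate the string handling.
raw = pd.Series([
    "White",
    "Hispanic or Latino, White",
    "American Indian / Alaska Native",
])

# split(n=3) splits on whitespace at most three times, [:3] keeps the first
# three tokens, and join(' ') glues those tokens back together.
short = raw.str.split(n=3).str[:3].str.join(" ")
print(short.tolist())  # ['White', 'Hispanic or Latino,', 'American Indian /']

# The leftover punctuation is why a replace() pass is needed afterwards.
cleaned = short.replace({
    "American Indian /": "American Indian",
    "Hispanic or Latino,": "Hispanic or Latino",
})
print(cleaned.tolist())  # ['White', 'Hispanic or Latino', 'American Indian']
```

Note that keeping only the first three tokens assigns multi-select answers to whichever category was listed first, which is a simplification rather than a neutral choice.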
After that, we merged all the conditions, phenotypes, and general data into one singular df
# merging all
all = conditions.merge(phenotypes,how='inner',on=['Participant'])
all = all.merge(traits,how='inner',on=['Participant'])
all
all['Participant'].value_counts()
# need to drop dupes again
all = all.drop_duplicates(subset=['Participant'])
all = all.reset_index(drop=True)
The second major source of information we needed to process was the genetic profiles of the participants.
The dataframe being read in is "vari.xlsx", which is the dataframe we created of the important rare, pathogenic gene variants. The methodology we used to generate this source is detailed in section 0. To quickly summarize, the df holds the following information:
# reading in the dataframe
alleles = pd.read_excel("vari.xlsx")
alleles
# clean up and make binary variables for whether the participant has migraines
alleles = alleles.drop('Unnamed: 7',axis=1)
alleles['Has Migraine'] = alleles['Type of Migraine'] != 'None'
alleles['Has Migraine'] = alleles['Has Migraine'].map({True: 1, False: 0})
alleles
# drop nulls
alleles.isnull().sum()
alleles = alleles.dropna()
alleles = alleles.reset_index(drop=True)
alleles
Finally, we fully combined the conditions, phenotypes, general information, and gene data to create a comprehensive source of information about the participants
# dfs with genes
w_genes = all.merge(alleles,how='inner',on=['Participant'])
w_genes
After all the data was cleaned and processed, we took a look at relationships between a variety of factors
In order to determine the statistical relationship between migraines and comorbidities, binary variables first had to be created to separate whether someone has migraines in general, migraines with aura, or migraines without aura
# Want binary variables for having conditions but first need to isolate migraines and migraine types
conditions['Has Migraines'] = conditions['Nervous System Conditions'].str.contains('Migraine',case=False, na=False)
conditions[['No Migraines','Has Migraines']] = pd.get_dummies(conditions['Has Migraines'])
conditions = conditions.drop('No Migraines',axis=1)
conditions['Has Migraines with Aura'] = conditions['Nervous System Conditions'].str.contains('Migraine with aura',case=False, na=False)
conditions['Has Migraines without Aura'] = conditions['Nervous System Conditions'].str.contains('Migraine without aura',case=False, na=False)
conditions[['No Migraines with Aura','Has Migraines with Aura']] = pd.get_dummies(conditions['Has Migraines with Aura'])
conditions[['No Migraines without Aura','Has Migraines without Aura']] = pd.get_dummies(conditions['Has Migraines without Aura'])
conditions = conditions.drop('No Migraines with Aura',axis=1)
conditions = conditions.drop('No Migraines without Aura',axis=1)
# need to isolate Migraines out of nerv system conditions
conditions['Has Nervous System Conditions'] = (~conditions['Nervous System Conditions'].str.fullmatch('Migraine without aura',case=False,na=False) & (conditions['Nervous System Conditions'].notnull())
& (~conditions['Nervous System Conditions'].str.fullmatch('Migraine with aura',case=False,na=False)))
conditions[['No Nervous conditions','Has Nervous System Conditions']] = pd.get_dummies(conditions['Has Nervous System Conditions'])
conditions = conditions.drop('No Nervous conditions',axis=1)
conditions
From there, we created dummy variables for each class of comorbidities to indicate whether a person has a condition from that subset
# rest of the condition binaries
conditions['Has Blood Conditions'] = conditions['Blood Conditions'].notnull().astype('int')
conditions['Has Circulatory Conditions'] = conditions['Circulatory System Conditions'].notnull().astype('int')
conditions['Has Endocrine Conditions'] = conditions['Endocrine System Conditions'].notnull().astype('int')
conditions['Has Vision and Hearing Conditions'] = conditions['Visual and Hearing Conditions'].notnull().astype('int')
conditions['Has Respiratory Conditions'] = conditions['Respiratory System Conditions'].notnull().astype('int')
conditions['Has Digestive Conditions'] = conditions['Digestive System Conditions'].notnull().astype('int')
conditions['Has Genitourinary Conditions'] = conditions['Genitourinary System Conditions'].notnull().astype('int')
conditions['Has Skin Conditions'] = conditions['Skin Conditions'].notnull().astype('int')
conditions['Has Musculoskeletal Conditions'] = conditions['Musculoskeletal System Conditions'].notnull().astype('int')
conditions['Has Congenital Conditions'] = conditions['Congenital Conditions'].notnull().astype('int')
# EDA time
conditions.columns
Then we got the proportions of participants who 1) had migraines vs didn't have migraines and 2) had no aura vs had aura
def prob_no_mig(cond): # Getting proportions of those with and without Migraines per biological system
    a = ((conditions['Has Migraines'] == 0) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 0).sum()
    return a
def prob_w_mig(cond):
    b = ((conditions['Has Migraines'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 1).sum()
    return b
prob_w_list = [] # iterating through columns list for biological system
for i in conditions.columns[15:]:
    prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in conditions.columns[15:]:
    prob_no_list.append(prob_no_mig(i))
def prob_no_aura(cond): # Getting proportions of those with and without aura per system
    # denominator is the size of the subgroup itself, so the two groups are directly comparable
    a = ((conditions['Has Migraines without Aura'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines without Aura'] == 1).sum()
    return a
def prob_w_aura(cond):
    b = ((conditions['Has Migraines with Aura'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines with Aura'] == 1).sum()
    return b
prob_w_aura_list = [] # iterating through columns for biological system
for i in conditions.columns[15:]:
    prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in conditions.columns[15:]:
    prob_no_aura_list.append(prob_no_aura(i))
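The same conditional proportions can be computed in one step, because the mean of a 0/1 column within a group is the share of that group with the condition. The sketch below uses a hypothetical indicator table in the same layout as `conditions` and selects indicator columns by name instead of by position, which may be less fragile than `columns[15:]` if columns are later added or reordered.

```python
import pandas as pd

# Hypothetical 0/1 indicator table; the values are made up for illustration.
toy = pd.DataFrame({
    "Has Migraines":        [1, 1, 1, 1, 0, 0, 0, 0],
    "Has Blood Conditions": [1, 1, 0, 0, 1, 0, 0, 0],
    "Has Skin Conditions":  [1, 0, 0, 0, 0, 0, 0, 0],
})

# Pick the comorbidity indicators by name rather than by column index.
indicator_cols = [c for c in toy.columns if c.startswith("Has ") and c != "Has Migraines"]

# Group means of 0/1 columns are the within-group proportions.
props = toy.groupby("Has Migraines")[indicator_cols].mean().T
props.columns = ["no mig", "w mig"]  # groupby sorts the keys, so 0 comes before 1
print(props)
```

In the real frame the migraine-type columns also start with "Has ", so they would need to be excluded from `indicator_cols` in the same way.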
We then generated a new dataframe with these proportions to evaluate the makeup of the population
d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} # putting proportions into df
conditions_prop = pd.DataFrame(data=d,index=conditions.columns[15:])
conditions_prop = conditions_prop.reset_index()
# dataframe showing the proportion of each subset who has a specific comorbidity
conditions_prop
conditions_prop.plot.bar(x='index',y=['w mig','no mig'],color=[ 'red', 'blue'], xlabel='Conditions',ylabel='Percent of Individuals', title='Condition Proportions of Migraine Havers vs Control', width=0.8, figsize=(12,6)).grid()
# Bar graph of conditions
In every category, individuals with migraines were more likely to have comorbidities than participants without migraines. Someone with migraines is more than twice as likely to have another nervous system condition than someone without migraines
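A statement such as "more than twice as likely" is a ratio of two proportions, often called a risk ratio. The sketch below shows the arithmetic with hypothetical counts chosen only for illustration; they are not the values from this dataset.

```python
def risk_ratio(group_with, group_total, control_with, control_total):
    """Ratio of two proportions: P(condition | group) / P(condition | control)."""
    return (group_with / group_total) / (control_with / control_total)

# Hypothetical counts: 30 of 100 migraine participants vs 12 of 100 controls.
rr = risk_ratio(group_with=30, group_total=100, control_with=12, control_total=100)
print(round(rr, 2))  # prints: 2.5
```

With the notebook's own table, the analogous quantity would be `conditions_prop['w mig'] / conditions_prop['no mig']`, computed row by row for each condition class.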
conditions_prop.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'green', 'orange'], xlabel='Conditions',ylabel='Percent of Individuals', title='Condition Proportions of Migraine with Aura vs without', width=0.8, figsize=(12,6)).grid()
# Bar graph of conditions w and wout aura
When looking at the prevalence of comorbidities in participants with vs without aura, there does not appear to be a relationship: the frequency is approximately the same for both groupings across all 11 categories
From there, we wanted to more specifically look at whether phenotypes and general traits had a statistical relationship with migraines
# need dummies for Blood Types
phenotypes = pd.get_dummies(phenotypes,columns=['Blood Type'])
phenotypes # dataframe with dummies for all (common) blood types
# need height values to be type float
phenotypes['Height (in)'] = phenotypes['Height (in)'].str.replace("\"","")
phenotypes['Height (in)'] = phenotypes['Height (in)'].str.replace("'"," ")
phenotypes['Height (in)'] = [s.split(" ") for s in phenotypes['Height (in)']]
phenotypes['Height (in)'] = [float(value[0])*12 + float(value[1]) for value in phenotypes['Height (in)']]
phenotypes
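The conversion above assumes every height is written as feet and inches, for example 5'10". A more defensive variant, sketched below, returns `None` for anything that does not match that pattern instead of raising an error. The helper name `height_to_inches` and the accepted formats are assumptions of this sketch, not part of the original notebook.

```python
import re

def height_to_inches(text):
    """Convert strings such as 5'10" to inches; return None if the format is not recognised."""
    match = re.fullmatch(r"\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*\"?\s*", text)
    if match is None:
        return None
    feet = float(match.group(1))
    # The inches part is optional, so 6' is read as 6 feet 0 inches.
    inches = float(match.group(2)) if match.group(2) else 0.0
    return feet * 12 + inches

print(height_to_inches("5'10\""))  # 70.0
print(height_to_inches("6'"))      # 72.0
print(height_to_inches("170 cm"))  # None
```

Applied with `phenotypes['Height (in)'].map(height_to_inches)`, unparseable entries would surface as missing values that can be inspected or dropped explicitly.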
traits = pd.get_dummies(traits,columns=['Sex/Gender','Race/ethnicity']) # Getting dummies for sex and race
# remerging conditions with phenotypes and general traits // considering dummies now
all = conditions.merge(phenotypes,how='inner',on=['Participant'])
all = all.merge(traits,how='inner',on=['Participant'])
all
all['Participant'].value_counts()
# need to drop dupes again
all = all.drop_duplicates(subset=['Participant'])
all = all.reset_index(drop=True)
# EDA for traits
# mig havers by sex
male = [((all['Has Migraines'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines'].sum(),
((all['Has Migraines with Aura'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines with Aura'].sum(),
((all['Has Migraines without Aura'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines without Aura'].sum()]
female = [((all['Has Migraines'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines'].sum(),
((all['Has Migraines with Aura'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines with Aura'].sum(),
((all['Has Migraines without Aura'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines without Aura'].sum()]
d={'male': male, 'female':female} #df for mig havers by sex
sex_props = pd.DataFrame(data=d,index=['w mig','w_aura','no_aura'])
sex_props = sex_props.reset_index()
sex_props #proportions of participant sex considering migraine types
sex_props.plot.bar(x='index',y=['male','female'],color=[ 'royalblue', 'pink'], xlabel='Migraines and Types of Migraines',ylabel='Percent of Individuals', title='Sex Proportions of Having Migraines and Types of Migraines ', width=0.8, figsize=(12,6)).grid()
There appears to be a notable association between biological sex and migraine occurrence, with female participants making up close to 2/3 of the migraine-having population
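Whether a difference like this is statistically significant depends on comparing it against the sex split among participants without migraines. One common check is a Pearson chi-square test on the 2x2 table of counts; the sketch below computes the statistic by hand with hypothetical counts (not taken from this dataset) and compares it with 3.841, the 5% critical value for one degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: rows are migraine / no migraine, columns are female / male.
stat = chi_square_2x2(60, 30, 100, 110)

CRITICAL_5_PERCENT_DF1 = 3.841  # chi-square critical value for 1 degree of freedom
print(round(stat, 2), stat > CRITICAL_5_PERCENT_DF1)
```

A statistic above the critical value would indicate an association in the sample; it would still say nothing about causation, in line with the project disclaimer.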
From there, we wanted to see if blood type had any statistical relationship with migraines
all.columns[28:37]
def prob_no_mig(cond): # Blood type proportions based on Migraine haver or not
    a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 0).sum()
    return a
def prob_w_mig(cond):
    b = ((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
    return b
prob_w_list = [] # iterating through columns for blood types
for i in all.columns[28:37]:
    prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in all.columns[28:37]:
    prob_no_list.append(prob_no_mig(i))
def prob_no_aura(cond): # Blood type proportions based on Aura haver or not
    # fixed: the 'no aura' function previously filtered on the 'with Aura' column (and vice versa)
    a = ((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines without Aura'] == 1).sum()
    return a
def prob_w_aura(cond):
    b = ((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines with Aura'] == 1).sum()
    return b
prob_w_aura_list = [] # iterating through columns for blood types
for i in all.columns[28:37]:
    prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in all.columns[28:37]:
    prob_no_aura_list.append(prob_no_aura(i))
d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} #blood type probs dataframe
blood_props = pd.DataFrame(data=d,index=all.columns[28:37])
blood_props = blood_props.reset_index()
blood_props #blood type vs type of migraine
blood_props.plot.bar(x='index',y=['w mig','no mig'],color=[ 'brown', 'aqua'], xlabel='Blood Types',ylabel='Percent of Individuals', title='Proportions of Blood Types for Migraine Havers and Control', width=0.8, figsize=(12,6)).grid()
There does not appear to be a significant statistical relationship between having migraines and any specific blood type
blood_props.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'gray', 'red'], xlabel='Blood Types',ylabel='Percent of Individuals', title='Proportion of Blood Types for Aura and No Aura', width=0.8, figsize=(12,6)).grid()
There is also no clear relationship between blood type and migraines with aura vs without aura. Although the differences between the two groups are larger here than in the previous plot, this may be attributable to natural variance in a small data set
We were also curious as to whether race/ethnicity had a relationship with migraine occurrence
all.columns[54:]
def prob_no_mig(cond): # proportions for race per those with and without migraines
    a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
    return a
def prob_w_mig(cond):
    b = ((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
    return b
prob_w_list = [] # iterating through columns for races
for i in all.columns[54:]:
    prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in all.columns[54:]:
    prob_no_list.append(prob_no_mig(i))
def prob_no_aura(cond): # aura types per race
    # fixed: the 'no aura' function previously filtered on the 'with Aura' column (and vice versa)
    a = ((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
    return a
def prob_w_aura(cond):
    b = ((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
    return b
prob_w_aura_list = [] # iterating through columns for races
for i in all.columns[54:]:
    prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in all.columns[54:]:
    prob_no_aura_list.append(prob_no_aura(i))
d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} # race props df
race_props = pd.DataFrame(data=d,index=all.columns[54:])
race_props = race_props.reset_index()
race_props = race_props.dropna() # some races had very low values and no migs
race_props #race/ethnicity vs migraines
race_props.drop(race_props.loc[race_props.w_aura < 0.00001].index, inplace=True) # getting rid of rows with 0s (very small groups, too few participants to compare)
race_props
race_props.plot.bar(x='index',y=['w mig','no mig'],color=[ 'orange', 'blue'], xlabel='Race/Ethnicity',ylabel='Percent of Individuals', title='Proportions of Race/Ethnicity for Migraine Havers and Control', width=0.8, figsize=(12,6)).grid()
Based on this information alone, it would appear there is a statistical relationship between race and migraine. However, there is a higher proportion of white participants vs other races, which makes this metric a poor indicator
race_props.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'green', 'red'], xlabel='Race/Ethnicity',ylabel='Percent of Individuals', title='Proportion of Race/Ethnicity for Aura and No Aura', width=0.8, figsize=(12,6)).grid()
The same could be said for this plot: although there appears to be a relationship between ethnicity and migraine with/without aura, the size and makeup of the dataset must be considered first
We also wanted to explore some relationships found in the genetic data. Within this section, we created additional dataframes and evaluated ratios of gene alleles in the population
alleles
alleles.Gene.nunique() #there are 412 different genes in the df
no_mig = alleles['Type of Migraine'].str.contains('None', case=False, na=False)
aura_mig = alleles['Type of Migraine'].str.contains('with aura', case=False, na=False)
no_aura_mig = alleles['Type of Migraine'].str.contains('no aura', case=False, na=False)
both_mig = alleles['Type of Migraine'].str.contains('Both', case=False, na=False)
none_df = alleles[no_mig]
aura_df = alleles[aura_mig]
no_aura_df = alleles[no_aura_mig]
both_df = alleles[both_mig]
none_df.Gene.value_counts()
none_df.Participant.nunique() #40 people
aura_df.Participant.nunique() #49 people
no_aura_df.Participant.nunique() #38 people
both_df.Participant.nunique() #10 people
Out of the genetic information collected, there are alleles available for 40 participants with no migraines, 49 participants with migraine with aura, 38 people with migraine without aura, and 10 participants with both forms
none_df2 = none_df[none_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
aura_df2 = aura_df[aura_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
no_aura_df2 = no_aura_df[no_aura_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
both_df2 = both_df[both_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
# keeping only the desired dataframes
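The filter above relies on `duplicated(..., keep=False)`, which flags every row whose (Gene, zygosity) pair occurs more than once, including the first occurrence. A small sketch with hypothetical rows (gene names serve only as labels here) shows which rows survive.

```python
import pandas as pd

# Hypothetical variant rows, one per participant.
toy = pd.DataFrame({
    "Participant":    ["p1", "p2", "p3", "p4"],
    "Gene":           ["MTHFR", "MTHFR", "MTHFR", "CFTR"],
    "Homo/heterozyg": ["Heterozygous", "Heterozygous", "Homozygous", "Heterozygous"],
})

# keep=False marks all members of a repeated group, not just the later repeats.
shared = toy[toy.duplicated(subset=["Gene", "Homo/heterozyg"], keep=False)]
print(shared["Participant"].tolist())  # ['p1', 'p2']
```

The comparison is case-sensitive, so "Heterozygous" and "heterozygous" would count as different values unless the column is lower-cased first.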
none_df2.Gene.value_counts()
none_df2 = none_df2.drop('Disease capacity', axis = 1)
none_df2 = none_df2.drop('Recessive or dominant', axis = 1)
none_df2 = none_df2.drop('Mutation or variant?', axis = 1)
# dropping columns
#none_df2 = none_df2.drop('Unnamed: 7', axis = 1)
def get_ratio(df, pattern):
    num = df.groupby(df[pattern].str.lower()).size()
    denom = len(df[pattern])
    return num/denom
# function to get the share of each allele type (heterozygous / homozygous) within df
def gene_grab(df, listg):
    count = df.Gene.value_counts()
    for gene, num in count.items(): # Series.iteritems() was removed in pandas 2.0; items() is the replacement
        listg.append((gene, num))
    return listg
# function to collect each gene and its frequency
none_count = none_df2.Gene.value_counts()
none_genes = []
test = gene_grab(none_df2, none_genes)
test
# this creates a list with all the genes and freqs
def get_freq(gene_freq_list, df, ratios_list):
    i = 0
    while i < len(gene_freq_list):
        gene = gene_freq_list[i][0]
        # exact match: str.contains would also pick up genes whose names merely contain this one
        specific_gene_df = df[df['Gene'] == gene]
        ratlist = []
        hold = get_ratio(specific_gene_df, 'Homo/heterozyg')
        for pattern, ratio in hold.items(): # items() replaces the removed iteritems()
            ratlist.append((pattern, ratio))
        ratios_list.append(ratlist)
        i += 1
    return ratios_list
# function that returns the allele types broken down individually
breakdown_test = []
none_freqs = get_freq(test, none_df2, breakdown_test)
breakdown_test
total_none_set = []
for item in range(len(test)):
    gene = test[item][0]
    freq = test[item][1]
    ratios = breakdown_test[item]
    total_none_set.append((gene, freq, ratios))
total_none_set
# creating list with gene, freq, and allele type distrib
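The per-gene frequency and allele-type breakdown built above with lists and loops can also be expressed with pandas grouping primitives. The sketch below is one possible formulation on hypothetical rows, not the notebook's own method: `groupby(...).size()` gives the frequency and `pd.crosstab(..., normalize="index")` gives the within-gene share of each allele type.

```python
import pandas as pd

# Hypothetical rows in the same two-column layout used by get_ratio and gene_grab.
toy = pd.DataFrame({
    "Gene":           ["MTHFR", "MTHFR", "MTHFR", "CFTR"],
    "Homo/heterozyg": ["Heterozygous", "heterozygous", "Homozygous", "Heterozygous"],
})

# Lower-case the labels so inconsistent capitalisation does not split a category.
zyg = toy["Homo/heterozyg"].str.lower()

# Number of rows per gene (the 'Frequency' column in the compiled frames).
freq = toy.groupby("Gene").size()

# Share of each allele type within each gene; every row sums to 1.
ratios = pd.crosstab(toy["Gene"], zyg, normalize="index")

print(freq)
print(ratios)
```

This yields a wide table (one column per allele type) instead of a list of tuples, which may be easier to merge and plot later.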
After collecting this information for people without migraines, the same process was repeated with the other dfs. The same functions were utilized
aura_df2.Gene.value_counts()
aura_df2 = aura_df2.drop('Disease capacity', axis = 1)
aura_df2 = aura_df2.drop('Recessive or dominant', axis = 1)
aura_df2 = aura_df2.drop('Mutation or variant?', axis = 1)
#aura_df2 = aura_df2.drop('Unnamed: 7', axis = 1)
aura_genes = []
ag_list = gene_grab(aura_df2, aura_genes)
ag_list
aurafr = []
aura_freqs = get_freq(ag_list, aura_df2, aurafr)
aurafr
total_aura_set = []
for item in range(len(ag_list)):
    gene = ag_list[item][0]
    freq = ag_list[item][1]
    ratios = aurafr[item]
    total_aura_set.append((gene, freq, ratios))
total_aura_set #full set of information for people with migraine with aura
both_df2.Gene.value_counts()
both_count = both_df2.Gene.value_counts()
both_genes = []
bboth = gene_grab(both_df2, both_genes)
bboth
both_test = []
both_freqs = get_freq(bboth, both_df2, both_test)
both_test
total_both_set = []
for item in range(len(bboth)):
    gene = bboth[item][0]
    freq = bboth[item][1]
    ratios = both_test[item]
    total_both_set.append((gene, freq, ratios))
total_both_set # information for those with both forms of migraine
no_aura_df2.Gene.value_counts()
no_aura_count = no_aura_df2.Gene.value_counts()
no_aura_genes = []
noaur = gene_grab(no_aura_df2, no_aura_genes)
noaur
noaur_test = []
noaur_freqs = get_freq(noaur, no_aura_df2, noaur_test)
noaur_test
total_noaur_set = []
for item in range(len(noaur)):
    gene = noaur[item][0]
    freq = noaur[item][1]
    ratios = noaur_test[item]
    total_noaur_set.append((gene, freq, ratios))
total_noaur_set # total information for people with migraine without aura
After this, a compiled df was made
total_none_set
tot_none_df2 = pd.DataFrame(total_none_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_none_df2
total_aura_set
tot_aura_df2 = pd.DataFrame(total_aura_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_aura_df2
total_both_set
tot_both_df2 = pd.DataFrame(total_both_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_both_df2
total_noaur_set
tot_noaur_df2 = pd.DataFrame(total_noaur_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_noaur_df2
# all the lists being converted to individual dataframes
# merging into one large dataframe
none_and_aura = tot_none_df2.merge(tot_aura_df2, on="Gene", how="outer")
none_and_aura.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura"]
none_and_aura
none_aura_both = none_and_aura.merge(tot_both_df2, on="Gene", how="outer")
none_aura_both.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura", "Freq: Both", "Alleles: Both"]
none_aura_both
all_types_df2 = none_aura_both.merge(tot_noaur_df2, on="Gene", how="outer")
all_types_df2.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura", "Freq: Both", "Alleles: Both", "Freq: No Aura", "Alleles: No aura"]
all_types_df2
#filling in the NaN values
all_types_df2['Freq: Aura'] = all_types_df2['Freq: Aura'].fillna(0.0)
all_types_df2['Freq: None'] = all_types_df2['Freq: None'].fillna(0.0)
all_types_df2['Freq: No Aura'] = all_types_df2['Freq: No Aura'].fillna(0.0)
all_types_df2['Freq: Both'] = all_types_df2['Freq: Both'].fillna(0.0)
all_types_df2
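The chained merges and column renames above could also be expressed as a single fold over the per-group dataframes. The following is a sketch under stated assumptions: the helper name `merge_gene_tables` and the toy frames are hypothetical, and the column labels it generates follow a uniform `Freq: <label>` / `Alleles: <label>` pattern, which differs slightly in capitalization from the names assigned by hand above.

```python
from functools import reduce

import pandas as pd


def merge_gene_tables(tables):
    """Outer-merge {label: DataFrame} on 'Gene', labelling columns per group."""
    renamed = [
        df.rename(columns={"Frequency": f"Freq: {label}",
                           "Alleles": f"Alleles: {label}"})
        for label, df in tables.items()
    ]
    merged = reduce(lambda left, right: left.merge(right, on="Gene", how="outer"),
                    renamed)
    # Genes missing from a group get frequency 0.0, as in the fillna calls above
    freq_cols = [c for c in merged.columns if c.startswith("Freq: ")]
    merged[freq_cols] = merged[freq_cols].fillna(0.0)
    return merged


# Toy frames for illustration only (not taken from the project data)
toy_none = pd.DataFrame({"Gene": ["GENE_A", "GENE_B"],
                         "Frequency": [3, 1], "Alleles": ["x", "y"]})
toy_aura = pd.DataFrame({"Gene": ["GENE_B", "GENE_C"],
                         "Frequency": [2, 5], "Alleles": ["y", "z"]})
toy_merged = merge_gene_tables({"None": toy_none, "Aura": toy_aura})
```

Because the columns are renamed before merging, no `_x`/`_y` suffixes appear and the column list does not need to be reassigned after each merge.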
Then, several calculations were applied to the frequencies. We considered each gene's proportion within each individual subset, as well as its proportion across all populations combined.
all_freq = ['Freq: None', "Freq: Both", "Freq: Aura", "Freq: No Aura"]
all_types_df2['Total'] = all_types_df2[all_freq].sum(axis=1)
all_types_df2
all_types_df2['None Ratio'] = all_types_df2["Freq: None"].apply(lambda x: x / 40)
all_types_df2['Aura Ratio'] = all_types_df2["Freq: Aura"].apply(lambda x: x / 49)
all_types_df2['No Aura Ratio'] = all_types_df2["Freq: No Aura"].apply(lambda x: x / 38)
all_types_df2['Both Ratio'] = all_types_df2["Freq: Both"].apply(lambda x: x / 10)
all_types_df2['All Ratio'] = all_types_df2["Total"].apply(lambda x: x / 137)
all_types_df2['All Mig Ratio'] = ((all_types_df2["Freq: Aura"] + all_types_df2["Freq: No Aura"] + all_types_df2["Freq: Both"]) / 97)
all_types_df2
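The ratio columns above divide each frequency by a hard-coded group size. As a sketch, the same arithmetic can be driven from one dictionary so the combined denominators (137 and 97) are derived rather than typed in. The group sizes below are copied from the denominators used above; the helper name `add_group_ratios` and the toy row are hypothetical.

```python
import pandas as pd

# Group sizes copied from the denominators in the cell above
GROUP_SIZES = {"None": 40, "Aura": 49, "No Aura": 38, "Both": 10}


def add_group_ratios(df, group_sizes):
    """Add per-group, overall, and all-migraine ratio columns to a copy of df."""
    out = df.copy()
    for label, size in group_sizes.items():
        out[f"{label} Ratio"] = out[f"Freq: {label}"] / size
    freq_cols = [f"Freq: {label}" for label in group_sizes]
    out["Total"] = out[freq_cols].sum(axis=1)
    out["All Ratio"] = out["Total"] / sum(group_sizes.values())
    # Every group except the control ("None") counts as a migraine group
    mig_labels = [label for label in group_sizes if label != "None"]
    mig_size = sum(group_sizes[label] for label in mig_labels)
    mig_cols = [f"Freq: {label}" for label in mig_labels]
    out["All Mig Ratio"] = out[mig_cols].sum(axis=1) / mig_size
    return out


# Toy row for illustration only (not taken from the project data)
toy = pd.DataFrame({"Gene": ["GENE_A"], "Freq: None": [4.0], "Freq: Aura": [7.0],
                    "Freq: No Aura": [0.0], "Freq: Both": [3.0]})
toy_ratios = add_group_ratios(toy, GROUP_SIZES)
```

Vectorized division (`column / size`) gives the same result as the `apply(lambda x: x / size)` calls above.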
# manually dropping specific rows (genes) that do not have enough information
moddf = all_types_df2.drop([74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 58, 57, 56, 0, 2, 3, 5, 6,9,1,8, 10, 11, 15, 17, 18, 22, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 ])
moddf2 = moddf.drop([35, 36, 37, 38, 39, 55,20, 14,33, 34,13, 53, 16, 19, 47, 48, 49, 50, 51, 52, 59 ])
moddf2
Note: To create a graphable dataframe, we dropped genes that had overall frequencies of less than 5 and/or low frequencies with a similar distribution in the control and migraine populations.
We dropped these manually, which is why such a wide array of row indices is dropped above. If we were to do this project again, we would likely develop a more systematic methodology.
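One possible more systematic approach is to express the two criteria from the note as boolean filters. This is only a sketch: the helper name `select_informative_genes` is hypothetical, and the `min_ratio_gap` threshold of 0.05 is an assumed stand-in for "similar distribution", which was judged by eye in the original analysis, so it would not necessarily reproduce the same set of genes.

```python
import pandas as pd


def select_informative_genes(df, min_total=5, min_ratio_gap=0.05):
    """Keep genes seen often enough whose control and migraine ratios differ."""
    seen_enough = df["Total"] >= min_total
    ratio_gap = (df["None Ratio"] - df["All Mig Ratio"]).abs()
    differs = ratio_gap >= min_ratio_gap  # assumed threshold, not from the project
    return df[seen_enough & differs].reset_index(drop=True)


# Toy rows for illustration only (not taken from the project data)
toy_genes = pd.DataFrame({
    "Gene": ["GENE_A", "GENE_B", "GENE_C"],
    "Total": [2, 10, 8],
    "None Ratio": [0.05, 0.10, 0.00],
    "All Mig Ratio": [0.00, 0.11, 0.08],
})
kept = select_informative_genes(toy_genes)
```

In the toy data, `GENE_A` is dropped for having too few occurrences, `GENE_B` for having nearly identical ratios in both populations, and `GENE_C` is kept.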
import sklearn
assert sklearn.__version__ >= "0.20"
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
moddf2.plot.bar(x='Gene', y=['None Ratio', 'All Mig Ratio'], color=[ 'royalblue', 'aqua'], ylabel='Proportion of Individuals', title='Gene Proportions of Migraine Havers vs Control', width=0.8, figsize=(12,6)).grid()
#sns.catplot(data=moddf2, x="Gene", y=["None Ratio", "All Mig Ratio"])
Among the genes retained for the chart, a number are present in the migraine populations but not in the control group. Each gene shown has at least 5 occurrences across the board. These genes could possibly relate to migraines, or may be indicative of a scientific process outside the scope of our EDA.
For the project, we decided to create two different models.
all_features = ['Has Migraines', 'Has Migraines with Aura',
'Has Migraines without Aura','Height (in)', 'Weight (lbs)', 'Blood Type_A +', 'Blood Type_A -',
'Blood Type_AB +', 'Blood Type_AB -', 'Blood Type_B +',
'Blood Type_B -', 'Blood Type_Don\'t know', 'Blood Type_O +',
'Blood Type_O -', 'Sex/Gender_Female',
'Sex/Gender_Male','Race/ethnicity_American Indian', 'Race/ethnicity_Asian',
'Race/ethnicity_Black or African', 'Race/ethnicity_Hispanic or Latino',
'Race/ethnicity_White'] # features to test
#knn for system that is seemingly highly correlated with having migraines (nervous) with and without migs
def knn_nerv(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = all[all_features]
    y_train = all['Has Nervous System Conditions']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_nerv_no_mig(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = all[all_features[3:]]  # skip the three migraine columns
    y_train = all['Has Nervous System Conditions']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
nerv = ks.apply(knn_nerv)
nerv_no_mig = ks.apply(knn_nerv_no_mig)
plt.plot(nerv,label='Nervous System Conditions')
plt.plot(nerv_no_mig,label='Nervous System Conditions no mig')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.legend()
# as you can see migraines are correlated to prediction accuracy for nervous system conditions
This model indicates that including migraine status increases accuracy when predicting whether someone has a nervous system condition (other than migraines). This suggests that there is likely a statistical relationship between migraines and comorbid neurological conditions.
#knn for system that is seemingly not highly correlated with having migraines (digestive) with and without migs
def knn_dig(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = all[all_features]
    y_train = all['Has Digestive Conditions']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_dig_no_mig(k):
    model = KNeighborsClassifier(n_neighbors=k)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = all[all_features[3:]]  # skip the three migraine columns
    y_train = all['Has Digestive Conditions']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
dig = ks.apply(knn_dig)
dig_no_mig = ks.apply(knn_dig_no_mig)
plt.plot(dig,label='Digestive System Conditions')
plt.plot(dig_no_mig,label='Digestive System Conditions no mig')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.legend()
# as you can see removing migraines does not have a significant effect on accuracy for predicting digestive system conditions
In this model, knowing whether someone has migraines does not noticeably improve accuracy when predicting whether someone has a digestive system condition. This suggests a lack of relationship between migraines and digestive comorbidities in our data, which may also be the case for other conditions.
We started by re-reading the alleles dataframe to ensure there were no issues with the dataset.
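The KNN helpers for nervous and digestive conditions differ only in their feature list and target column. A minimal sketch of a single parameterized helper is given here; it uses the same scikit-learn pipeline steps as the functions above, but the name `knn_cv_accuracy` and the synthetic demo data are hypothetical and are not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def knn_cv_accuracy(df, features, target, k, cv=5):
    """Mean cross-validated accuracy of a scaled KNN model on the given features."""
    pipeline = Pipeline([
        ("vec", DictVectorizer(sparse=False)),
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=k)),
    ])
    X = df[features].to_dict(orient="records")
    return cross_val_score(pipeline, X, df[target], cv=cv, scoring="accuracy").mean()


# Synthetic demo data for illustration only (not the survey data)
rng = np.random.default_rng(0)
labels = np.tile([0, 1], 30)
demo = pd.DataFrame({
    "feature_a": labels + rng.normal(scale=0.5, size=60),
    "feature_b": rng.normal(size=60),
    "label": labels,
})
with_a = knn_cv_accuracy(demo, ["feature_a", "feature_b"], "label", k=3)
without_a = knn_cv_accuracy(demo, ["feature_b"], "label", k=3)
```

With such a helper, a comparison like the ones above would reduce to two calls per value of `k`, one with the full feature list and one with the migraine columns removed.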
alleles = pd.read_excel('vari.xlsx') # genetic data
alleles = alleles.drop('Unnamed: 7',axis=1) # dropping unnecessary data
alleles['Has Migraine'] = alleles['Type of Migraine'] != 'None' # getting binary migraine columns
alleles['Has Migraine'] = alleles['Has Migraine'].map({True: 1, False: 0})
alleles['Has Migraine with aura'] = alleles['Type of Migraine'] == 'Mig with aura'
alleles['Has Migraine with aura'] = alleles['Has Migraine with aura'].map({True: 1, False: 0})
alleles['Has Migraine without aura'] = alleles['Type of Migraine'] == 'Mig no aura'
alleles['Has Migraine without aura'] = alleles['Has Migraine without aura'].map({True: 1, False: 0})
alleles
feat = ['Gene','Recessive or dominant','Homo/heterozyg','Mutation or variant?','Disease capacity'] # test cols
# testing which features are most relevant to accurately predicting whether someone has migraines
def knn_full(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat]
    y_train = alleles['Has Migraine']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines Per Missing Feature')
plt.legend()
# it seems that removing homo/hetero significantly impacts the models ability to predict whether someone has migraines
# most accurate disregards gene names but keeps other data
This model used genomic information (from the dataframe) to predict whether someone had migraines or not. It appears that removing the gene names themselves (no genes) actually increases accuracy! However, removing whether a gene is homozygous or heterozygous decreases accuracy, suggesting that zygosity has a stronger statistical relationship with the ability to predict migraines.
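The comparison above is a leave-one-feature-out ablation: the model is scored once with every feature and once with each feature removed. A sketch of how that loop could be written generically follows. The function name `ablation_scores`, the column names, and the synthetic data are hypothetical; only the pipeline steps mirror the functions above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def ablation_scores(df, features, target, k, cv=5):
    """CV accuracy with all features and with each single feature left out."""
    def score(cols):
        pipeline = Pipeline([
            ("vec", DictVectorizer(sparse=False)),  # one-hot encodes string values
            ("scaler", StandardScaler()),
            ("model", KNeighborsClassifier(n_neighbors=k)),
        ])
        X = df[cols].to_dict(orient="records")
        return cross_val_score(pipeline, X, df[target], cv=cv, scoring="accuracy").mean()

    results = {"full features": score(list(features))}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        results[f"no {dropped}"] = score(kept)
    return pd.Series(results)


# Synthetic categorical data for illustration only (not the allele data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "zygosity": np.tile(["Hetero", "Homo"], 30),
    "kind": rng.choice(["Mutation", "Variant"], size=60),
    "label": np.tile([0, 1], 30),
})
demo_scores = ablation_scores(demo, ["zygosity", "kind"], "label", k=3)
```

Calling the function once per value of `k` and stacking the resulting Series would give one accuracy curve per removed feature, like the curves plotted above.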
# testing which features are most relevant to accurately predicting whether someone has migraines with aura
def knn_full(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat]
    y_train = alleles['Has Migraine with aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine with aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine with aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines with Aura Per Missing Feature')
plt.legend()
# Genes and recessive/ dominant seem most important to predicting this type of migraine.
When specifically predicting whether someone has migraines with aura, increasing the number of neighbors appears to increase accuracy. In this case, removing genes led to the largest decrease in accuracy. This is consistent with the chart from the EDA section highlighting genes that appear only in migraine havers; however, a number of other variables could affect this conclusion.
# testing which features are most relevant to accurately predicting whether someone has migraines without aura
def knn_full(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat]
    y_train = alleles['Has Migraine without aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine without aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine without aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine without aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine without aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler = StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec", vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine without aura']
    X_train = X_train.to_dict(orient="records")
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines without aura Per Missing Feature')
plt.legend()
# Harder to interpret, seems that homo/hetero has much smaller effect on Migraines without aura
For predicting migraines without aura, removing whether there was a mutation appeared to decrease accuracy the most. The plot also shows a positive relationship between accuracy and the number of neighbors.
Overall, it appears our models were not incredibly informative. However, they showed there could be a statistical relationship between migraines and certain comorbidities and specific genes. Additional data sources and further research would be needed to validate any of these claims.