
Migraine Comorbidity and Genetic Analysis: Data Science Final Tutorial

Maddie Bonanno and Riley Martin

Our project website can be found at: https://mscb25.github.io/datasci-final-maddieriley/

Part 0: Initial Discussion

Purpose

Migraines are the second most common cause of disability worldwide. They present in many ways and vary widely in triggers, symptoms, and severity. The purpose of this project is to explore the relationships between migraines and their comorbidities. Using medical history and genetic data, we aimed to explore how migraine occurrence could be linked to other factors.

We considered two common forms of migraine in this study: migraine with aura and migraine without aura.

Migraine without aura = a neurological disease that often presents as a headache accompanied by nausea and sensitivity to external stimuli
Migraine with aura = the same criteria as above, plus some form of visual, sensory, or motor disturbance that occurs before or during the attack

Goals and Plans:

Overall, through this assignment, we hoped to better understand the statistical association between migraines, common comorbidities, and genomics; this would allow for an improved understanding of the pathophysiology of migraines.

Some questions we wanted to explore are:

Are people with migraines more likely to have comorbidities?
If so, which types of comorbidities are most frequent in participants with migraines?
Is it possible that genetics plays a role in the likelihood of developing migraines?
Is there a particular subset of individuals migraines tend to impact?
Is there any difference in comorbidities between participants that have migraine with aura vs migraine without aura?

Project Disclaimer:

Through this project, we wanted to explore statistical associations between migraines and both the presence of comorbidities and the variance of genes. We are NOT claiming that migraines cause comorbidities or vice versa. We are also NOT claiming that allele variants between populations are a cause of migraines. Any usage of "correlation" or "causation" in this assignment refers to the perceived relationship between variables; it is not indicative of an accurate scientific conclusion.

Collaboration Plan:

To ensure completion of this project, we met at least twice a week. For most of the semester, we found Tuesday and Thursday evenings around 8 pm to be optimal. Since we also had classes together each day, we used time before or after lecture to update one another on nightly progress. We also texted one another and shared planning documents via Google Drive. In addition, we often split tasks to save time, with Maddie organizing the majority of the data and Riley creating the visual representations.

Data Sources¶

For this project, we obtained data from the Personal Genome Project (PGP). This open data source provided access to the survey data and participant profiles used in this assignment.

1) PGP Google Surveys

Located at https://my.pgp-hms.org/google_surveys

Within the PGP data repositories, there are a number of 'participant surveys' filled out by the subject.

We utilized two of the general information surveys to gauge which populations were being represented.

1) PGP Participant Survey

This Google survey includes the 'Participant' ID (an 8-character code) followed by personal information, including year of birth, sex/gender, and race/ethnicity

2) PGP Basic Phenotypes Survey 2015

This survey also includes the 'Participant' followed by various phenotypical categories including: height, blood type, and eye color

We also collected data regarding the specific medical conditions participants had. Each file contained the 'Participant' and a column indicating whether they had a medical condition under that umbrella. The syndrome classes we looked at were:

1) Nervous System

The occurrence of migraines in participants was evaluated with this survey

2) Blood
3) Vision and Hearing
4) Circulatory System
5) Digestive System
6) Endocrine, Metabolic, Nutritional, and Immunity
7) Respiratory System
8) Skin and Subcutaneous Tissue
9) Musculoskeletal System and Connective Tissue
10) Genitourinary Systems
11) Congenital Traits and Anomalies

In total, we drew on 13 different Google surveys: the 2 general surveys and the 11 condition surveys.

2) Get-Evidence Variant Reports - Genomic Data

Given the medical histories of the participants (collected by the methods above), we wanted to determine whether there was any genetic relationship between migraines and/or comorbidities.

To achieve this goal, we gathered data from the PGP whole genome datasets. Since only some participants had their genetic data uploaded, we filtered by 'Whole genome datasets' and accessed the profiles that had this component filled in.

Since there was no dataset that contained all the information we desired, we took information from participant profiles and created a data source, which can be accessed here.

Methodology:

Filtered participants based on whether 'Whole genome datasets' had a value > 0
Clicked on the first participant ID to access the "Participant Profile". Recorded the ID in the Excel file.
Scrolled to the bottom of the page to find the links to this person's surveys. Clicked on the Nervous System conditions survey and checked whether they had been diagnosed with any form of migraine; recorded 'mig with aura', 'mig no aura', 'both', or 'none' in the second column based on the survey response.
Next, clicked on 'View Report', a hyperlink located in the chart, to see the genetic data available for this participant. If the link was corrupt, the participant was discarded from the Excel file.
On the new page named "Variant report for (participant ID)", we recorded all the genes listed under "Show likely pathogenic and rare (<2.5%) pathogenic variants". The information in the "Impact" column was split into three Excel columns: 'recessive or dominant', 'Homo/heterozyg', and 'Disease capacity'. Furthermore, the genes were labeled as "variant" unless specifically denoted as a mutation.
After this, genes under the "Insufficiently evaluated variants" tab were considered. Genes were recorded in the Excel file (following the previous criteria) if 1) the prioritization score was >5, or 2) the gene name includes "Shift" or "*" at the end, which denotes a mutation and was recorded as such.
This process was repeated for each participant that had whole genome data publicly available
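The recording rule for the "Insufficiently evaluated variants" tab can be sketched in code. This is only an illustration of the rule as described in the steps above; the actual collection was done by hand, and the function names and example entries below are hypothetical. It assumes a name counts as a mutation if it contains "Shift" or ends with "*".

```python
def is_mutation(variant_name: str) -> bool:
    """Assumed reading of the rule: 'Shift' in the name or a trailing '*' marks a mutation."""
    return "Shift" in variant_name or variant_name.endswith("*")


def should_record(variant_name: str, prioritization_score: float) -> bool:
    """Record the entry if its score is above 5 or it is marked as a mutation."""
    return prioritization_score > 5 or is_mutation(variant_name)


def label(variant_name: str) -> str:
    """Entries are labeled 'variant' unless specifically marked as a mutation."""
    return "mutation" if is_mutation(variant_name) else "variant"


# Hypothetical entries, made up for illustration only (not project data).
examples = [("GENE1-A123T", 6), ("GENE2-K45Shift", 2), ("GENE3-R67*", 1), ("GENE4-P8L", 3)]
recorded = [(name, label(name)) for name, score in examples if should_record(name, score)]
print(recorded)
```

In this toy list, the last entry is skipped because its score is not above 5 and its name carries no mutation marker.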

In summary, the rare variants and uncommon gene mutations were organized into a dataframe.

Genetic information was obtained from 137 unique participants.

Part 1: Data Acquisition

import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab_Notebooks
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score


Google Survey Collection

As highlighted in Part 0, there are 13 different Google surveys containing vital information about the medical history of the participants. Most of the challenges in this section arose from trying to present all the information in a digestible manner.

# These are the 11 surveys containing medical conditions of each patient
# Each csv was read in

nerv_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_NervousSystem-20181010220056 (1).csv")
circ_data = pd.read_csv('PGPTrait&DiseaseSurvey2012_CirculatorySystem-20181010220109.csv')
endo_data = pd.read_csv('PGPTrait&DiseaseSurvey2012_Endocrine,Metabolic,Nutritional,AndImmunity-20181010220044 (1).csv')
blood_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_Blood-20181010220050.csv")
vis_hear_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_VisionAndHearing-20181010220103.csv")
resp_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_RespiratorySystem-20181010220114.csv")
digest_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_DigestiveSystem-20181010214607.csv")
genit_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_GenitourinarySystems-20181010214612.csv")
skin_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_SkinAndSubcutaneousTissue-20181010214618.csv")
musculo_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_MusculoskeletalSystemAndConnectiveTissue-20181010214624.csv")
congen_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_CongenitalTraitsAndAnomalies-20181010214629.csv")

# Phenotypic and general physical traits (the other 2 surveys) were also read in 

phenotypes = pd.read_csv("PGPBasicPhenotypesSurvey2015-20181010214636.csv")
gen_survey = pd.read_csv("PGPParticipantSurvey-20181010220019.csv")

nerv_data.head() # All condition data takes on this format.

#putting all the csv's in a list format for easier manipulation
condition_data = [nerv_data,circ_data,endo_data,blood_data,vis_hear_data,resp_data,
                  digest_data,genit_data,skin_data,musculo_data,congen_data,]

# dropping unwanted columns from every condition dataframe
unwanted_cols = ["Do not touch!", "Timestamp", "Other condition not listed here?"]
for df in condition_data:
  df.drop(columns=unwanted_cols, inplace=True)

nerv_data.head() #now, this is how each dataframe looks

# renaming the columns for clarity
nerv_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Nervous System Conditions' },inplace=True)
circ_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Circulatory System Conditions'},inplace=True)
endo_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Endocrine System Conditions'},inplace=True)
blood_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Blood Conditions' },inplace=True)
vis_hear_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Visual and Hearing Conditions' },inplace=True)
resp_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Respiratory System Conditions'},inplace=True)
digest_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Digestive System Conditions'},inplace=True)
genit_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Genitourinary System Conditions'},inplace=True)
skin_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Skin Conditions'},inplace=True)
musculo_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Musculoskeletal System Conditions'},inplace=True)
congen_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Congenital Conditions'},inplace=True)

After each set of data was downloaded and cleaned, we were left with 11 dataframes of medical information. Since 'Participant' is a shared column in each, we used inner joins to merge all the conditions into a single dataframe. An inner join keeps only the participants who appear in all 11 surveys.

# merging into one df
conditions = condition_data[0].merge(condition_data[1],on="Participant",how="inner")
conditions = conditions.merge(condition_data[2],on="Participant",how="inner")
conditions = conditions.merge(condition_data[3],on="Participant",how="inner")
conditions = conditions.merge(condition_data[4],on="Participant",how="inner")
conditions = conditions.merge(condition_data[5],on="Participant",how="inner")
conditions = conditions.merge(condition_data[6],on="Participant",how="inner")
conditions = conditions.merge(condition_data[7],on="Participant",how="inner")
conditions = conditions.merge(condition_data[8],on="Participant",how="inner")
conditions = conditions.merge(condition_data[9],on="Participant",how="inner")
conditions = conditions.merge(condition_data[10],on="Participant",how="inner")

conditions
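The chain of merges above can also be written as a fold over the list of frames. The sketch below uses three small made-up frames (the participant IDs and answers are hypothetical) to show the pattern and to make visible that an inner join keeps only the participants present in every frame.

```python
from functools import reduce

import pandas as pd

# Toy stand-ins for the survey frames (hypothetical participants and answers).
nerv = pd.DataFrame({"Participant": ["hu01", "hu02", "hu03"],
                     "Nervous System Conditions": ["Migraine with aura", None, "Epilepsy"]})
circ = pd.DataFrame({"Participant": ["hu01", "hu02"],
                     "Circulatory System Conditions": ["Hypertension", None]})
skin = pd.DataFrame({"Participant": ["hu01", "hu02", "hu04"],
                     "Skin Conditions": [None, "Eczema", "Acne"]})

frames = [nerv, circ, skin]

# Fold the list into one frame, inner-joining on the shared key each time.
merged = reduce(
    lambda left, right: left.merge(right, on="Participant", how="inner"),
    frames,
)
print(merged)
```

Here hu03 and hu04 drop out because each is missing from at least one frame, which is the same reason the merged survey data contains fewer participants than any single survey.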

After merging, the resulting dataframe needed to be cleaned

# Way too many rows
conditions['Participant'].value_counts()
# duplicates need to go
conditions = conditions.drop_duplicates(subset=['Participant'])
conditions
# duplicates gone but need to reset index

conditions = conditions.reset_index(drop=True)
conditions

After dropping duplicates, 1767 unique participants remain who shared their full medical history.

The next step was to determine how many of these individuals had been diagnosed with migraines.

# number of people with migraines
mig_w_aura = conditions['Nervous System Conditions'].str.contains('Migraine with aura', case=False, na=False)
w_sum = mig_w_aura.sum()
mig_no_aura = conditions['Nervous System Conditions'].str.contains('Migraine without aura', case=False, na=False)
no_sum = mig_no_aura.sum()
w_sum, no_sum

(215, 218)

# just Participants with migraines
only_mig_haver = conditions[mig_w_aura | mig_no_aura]
only_mig_haver.Participant.count()

384

only_mig_haver

Out of the dataset, 384 individuals have migraines: 215 have migraine with aura and 218 have migraine without aura. Because 215 + 218 = 433 exceeds 384, 49 participants reported both forms.
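The relationship between these three counts follows from inclusion-exclusion. A small sketch with made-up answers (assuming multi-select responses are stored as comma-separated text) shows how the masks combine:

```python
import pandas as pd

# Hypothetical survey answers; multi-select responses are assumed to be comma-separated.
answers = pd.Series([
    "Migraine with aura",
    "Migraine without aura",
    "Migraine with aura, Migraine without aura",
    "Epilepsy",
    None,
])

with_aura = answers.str.contains("Migraine with aura", case=False, na=False)
without_aura = answers.str.contains("Migraine without aura", case=False, na=False)

either = (with_aura | without_aura).sum()   # participants with any migraine
both = (with_aura & without_aura).sum()     # participants reporting both forms

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
assert either == with_aura.sum() + without_aura.sum() - both
print(with_aura.sum(), without_aura.sum(), either, both)

# The same identity applied to the counts reported above
print(215 + 218 - 384)
```

Note that "Migraine with aura" is not a substring of "Migraine without aura", so the two masks do not accidentally match each other's label.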

Then, we needed to process and clean the phenotypic data for the participants

# Now need to clean up phenotype and general data
phenotypes.columns

Index(['Participant', 'Timestamp', 'Do not touch!', '1.1 — Blood Type',
       '1.2 — Height', '1.3 — Weight', '1.4 — Comments',
       '2.1 — Left Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.2 — Right Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.3 — Left Eye Color - Text Description',
       '2.4 — Right Eye Color - Text Description', '2.5 —Comments',
       '3.1 — What is your natural hair color currently, when without artificial color or dye?',
       '3.2 — Hair Color - Text Description', '3.3 — Comments',
       '4.1 — Any final thoughts?', '1.4 — Handedness'],
      dtype='object')

# columns to drop 
unwanted_traits =['1.4 — Comments',
       '2.1 — Left Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.2 — Right Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.3 — Left Eye Color - Text Description',
       '2.4 — Right Eye Color - Text Description', '2.5 —Comments',
       '3.1 — What is your natural hair color currently, when without artificial color or dye?',
       '3.2 — Hair Color - Text Description', '3.3 — Comments',
       '4.1 — Any final thoughts?', '1.4 — Handedness','Timestamp', 'Do not touch!']

phenotypes = phenotypes.drop(unwanted_traits,axis=1)

# desired columns
phenotypes.rename(columns={'1.1 — Blood Type': 'Blood Type',	'1.2 — Height': 'Height (in)',	'1.3 — Weight': 'Weight (lbs)'},inplace=True)

# lots of null values present
phenotypes.isnull().sum()

Participant      0
Blood Type      42
Height (in)     23
Weight (lbs)    26
dtype: int64

# ensuring all the NaN values are dropped
phenotypes = phenotypes.dropna()
phenotypes = phenotypes.reset_index(drop=True)
phenotypes

When that was completed, we moved on to cleaning the general traits dataframe

# cleaning up the traits data
gen_survey.columns

Index(['Participant', 'Timestamp', 'Do not touch!', 'Year of birth',
       'Which statement best describes you?',
       'Severe disease or rare genetic trait',
       'Do you have a severe genetic disease or rare genetic trait? If so, you can add a description for your public profile.',
       'Disease/trait: Onset', 'Disease/trait: Rarity',
       'Disease/trait: Severity', 'Disease/trait: Relative enrollment',
       'Disease/trait: Diagnosis', 'Disease/trait: Genetic confirmation',
       'Disease/trait: Documentation',
       'Disease/trait: Documentation description', 'Sex/Gender',
       'Race/ethnicity', 'Maternal grandmother: Country of origin',
       'Paternal grandmother: Country of origin',
       'Paternal grandfather: Country of origin',
       'Maternal grandfather: Country of origin', 'Enrollment of relatives',
       'Enrollment of older individuals', 'Enrollment of parents',
       'Enrolled relatives [Monozygotic / Identical twins]',
       'Enrolled relatives [Parents]',
       'Enrolled relatives [Siblings / Fraternal twins]',
       'Enrolled relatives [Children]', 'Enrolled relatives [Grandparents]',
       'Enrolled relatives [Grandchildren]',
       'Enrolled relatives [Aunts/Uncles]',
       'Enrolled relatives [Nephews/Nieces]',
       'Enrolled relatives [Half-siblings]',
       'Enrolled relatives [Cousins or more distant]',
       'Enrolled relatives [Not genetically related (e.g. husband/wife)]',
       'Are all your enrolled relatives linked to your PGP profile?',
       'Have you uploaded genetic data to your PGP participant profile?',
       'Have you used the PGP web interface to record a designated proxy?',
       'Have you uploaded health record data using our Google Health or Microsoft Healthvault interfaces?',
       'Uploaded health records: Update status',
       'Uploaded health records: Extensiveness', 'Blood sample',
       'Saliva sample', 'Microbiome samples', 'Tissue samples from surgery',
       'Tissue samples from autopsy', 'Month of birth',
       'Anatomical sex at birth', 'Maternal grandmother: Race/ethnicity',
       'Maternal grandfather: Race/ethnicity',
       'Paternal grandmother: Race/ethnicity',
       'Paternal grandfather: Race/ethnicity'],
      dtype='object')

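# grabbing the traits we're interested in
# (.copy() gives an independent dataframe, so the assignments below do not trigger a SettingWithCopyWarning)
traits = gen_survey[['Participant','Sex/Gender','Race/ethnicity']].copy()

# shortening the race/ethnicity labels to their first three words
traits['Race/ethnicity'] = traits['Race/ethnicity'].str.split(n=3).str[:3].str.join(' ')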


traits['Race/ethnicity'].value_counts()

White                     3532
American Indian /          155
Asian                      126
Hispanic or Latino,         82
Black or African            64
Hispanic or Latino          61
Asian, White                40
No response                 30
Asian, Native Hawaiian       7
White, No response           5
Native Hawaiian or           3
Asian, Hispanic or           2
Asian, Black or              1
Name: Race/ethnicity, dtype: int64
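The truncated labels in this output (for example "American Indian /" and "Hispanic or Latino,") are a direct result of keeping only the first three whitespace-separated words. The sketch below reproduces the effect on a few example strings; the full label texts are assumptions about what the survey answers look like, not values taken from the data.

```python
import pandas as pd

# Example labels, assumed to resemble the survey's full category names.
labels = pd.Series([
    "White",
    "American Indian / Alaska Native",
    "Black or African American",
    "Hispanic or Latino, White",
])

# Same expression as above: split on whitespace at most 3 times, keep the first 3 pieces.
shortened = labels.str.split(n=3).str[:3].str.join(" ")
print(shortened.tolist())
```

Because the cut is positional, it can stop mid-phrase or leave a trailing comma or slash, which is why a clean-up mapping is needed afterwards.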

# cleaning up the truncated category labels (multi-category answers are collapsed to the first category listed)
# note: Series.replace with a dict matches whole values, so a bare 'or' entry would have no effect and is omitted
traits['Race/ethnicity'] = traits['Race/ethnicity'].replace({'American Indian /':'American Indian','Hispanic or Latino,': 'Hispanic or Latino',
                                                             'Asian, White': 'Asian','Asian, Black or': 'Asian', 'White, No response': 'White',
                                                             'Native Hawaiian or': 'Native Hawaiian'})


After that, we merged the conditions, phenotypes, and general data into a single dataframe

# merging all
all = conditions.merge(phenotypes,how='inner',on=['Participant'])
all = all.merge(traits,how='inner',on=['Participant'])

all

all['Participant'].value_counts()
# need to drop dupes again

hu5880D9    10
hu6D1115    10
huD554DB     8
hu2E4B9F     7
hu8B35DE     7
            ..
huFF2969     1
hu048C92     1
hu6642CE     1
hu00698E     1
hu09787B     1
Name: Participant, Length: 763, dtype: int64

all = all.drop_duplicates(subset=['Participant'])

all = all.reset_index(drop=True)
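The repeated participants seen above arise because a merge pairs every matching row on the left with every matching row on the right. The following sketch, using made-up frames, shows the row multiplication and one way to guard against it: deduplicating each input first and asking pandas to check the key relationship with the `validate` argument of `merge`.

```python
import pandas as pd

# Hypothetical frames in which one participant submitted a survey more than once.
conditions_toy = pd.DataFrame({"Participant": ["hu01", "hu01", "hu02"],
                               "Skin Conditions": ["Eczema", "Eczema", None]})
phenotypes_toy = pd.DataFrame({"Participant": ["hu01", "hu01", "hu01", "hu02"],
                               "Blood Type": ["A +", "A +", "A +", "O -"]})

# 2 rows x 3 rows for hu01 gives 6 merged rows, plus 1 for hu02.
naive = conditions_toy.merge(phenotypes_toy, on="Participant", how="inner")
print(naive["Participant"].value_counts().to_dict())

# Deduplicating each input first keeps one row per participant;
# validate="one_to_one" raises an error if either side still has repeated keys.
left = conditions_toy.drop_duplicates(subset=["Participant"])
right = phenotypes_toy.drop_duplicates(subset=["Participant"])
clean = left.merge(right, on="Participant", how="inner", validate="one_to_one")
print(len(clean))
```

Either order (merge then deduplicate, or deduplicate then merge) leaves one row per participant, but `drop_duplicates` keeps the first row it sees, so which submission survives depends on row order.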

Gene Data Acquisition

The second major source of information we needed to process was the genetic profiles of the participants.

The file being read in is "vari.xlsx", which is the data source we created of the important rare, pathogenic gene variants. The methodology we used to generate this source is detailed in Part 0. To quickly summarize, the dataframe holds the following information:

Participant ID
Type of migraine
Gene
Recessive vs dominant allele
Homozygous vs heterozygous
Whether the allele is a mutation or a variant
The disease capacity of the allele

# reading in the dataframe 
alleles = pd.read_excel("vari.xlsx")
alleles

# clean up and make a binary variable for whether the participant has migraines
alleles = alleles.drop('Unnamed: 7',axis=1)

# compare case-insensitively so that both 'None' and 'none' count as no migraine
alleles['Has Migraine'] = (alleles['Type of Migraine'].str.strip().str.lower() != 'none').astype(int)

alleles

# drop nulls
alleles.isnull().sum()
alleles = alleles.dropna()
alleles = alleles.reset_index(drop=True)

alleles

Finally, we fully combined the conditions, phenotypes, general information, and gene data to create a comprehensive source of information about the participants

# dfs with genes
w_genes = all.merge(alleles,how='inner',on=['Participant'])

w_genes

Part 2: EDA

After all the data was cleaned and processed, we took a look at relationships between a variety of factors

Exploring Comorbidities

In order to determine the statistical relationship between migraines and comorbidities, binary variables first had to be created to separate whether someone has migraines in general, migraines with aura, or migraines without aura

# Want binary variables for having conditions but first need to isolate migraines and migraine types
# (casting the boolean mask to int gives the 0/1 indicator directly)
conditions['Has Migraines'] = conditions['Nervous System Conditions'].str.contains('Migraine',case=False, na=False).astype(int)
conditions['Has Migraines with Aura'] = conditions['Nervous System Conditions'].str.contains('Migraine with aura',case=False, na=False).astype(int)
conditions['Has Migraines without Aura'] = conditions['Nervous System Conditions'].str.contains('Migraine without aura',case=False, na=False).astype(int)

# need to isolate Migraines out of nerv system conditions:
# flag participants whose answer is non-null and is not exactly one of the two migraine labels
nerv = conditions['Nervous System Conditions']
conditions['Has Nervous System Conditions'] = (nerv.notnull()
                                               & ~nerv.str.fullmatch('Migraine without aura',case=False,na=False)
                                               & ~nerv.str.fullmatch('Migraine with aura',case=False,na=False)).astype(int)

conditions
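One caveat with the full-match approach: an answer listing both migraine types and nothing else does not exactly match either label, so it would be flagged as another nervous system condition. The sketch below compares the full-match flag with a hypothetical alternative that splits each answer into items. It assumes answers are comma-separated and that no condition name itself contains a comma; the helper name and the example answers are made up.

```python
import pandas as pd

MIGRAINE_LABELS = {"migraine with aura", "migraine without aura"}


def has_other_nervous_condition(answer) -> int:
    """Return 1 if the answer lists any condition besides the two migraine labels."""
    if pd.isna(answer):
        return 0
    items = [item.strip().lower() for item in str(answer).split(",")]
    return int(any(item and item not in MIGRAINE_LABELS for item in items))


# Hypothetical answers (not project data).
answers = pd.Series([
    "Migraine with aura",
    "Migraine with aura, Migraine without aura",
    "Migraine without aura, Epilepsy",
    "Epilepsy",
    None,
])

# Flag based on exact matches of the whole answer, as in the cell above.
fullmatch_flag = (
    answers.notnull()
    & ~answers.str.fullmatch("Migraine with aura", case=False, na=False)
    & ~answers.str.fullmatch("Migraine without aura", case=False, na=False)
).astype(int)

# Flag based on the individual items in each answer.
item_flag = answers.apply(has_other_nervous_condition)

print(fullmatch_flag.tolist())
print(item_flag.tolist())
```

The two flags differ only for the second answer, which lists both migraine types and no other condition. Whether this matters in practice depends on how many participants answered that way.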

From there, we created dummy variables for each class of comorbidities to indicate whether a person has a condition from that subset

# rest of the condition binaries
conditions['Has Blood Conditions'] = conditions['Blood Conditions'].notnull().astype('int')
conditions['Has Circulatory Conditions'] = conditions['Circulatory System Conditions'].notnull().astype('int')
conditions['Has Endocrine Conditions'] = conditions['Endocrine System Conditions'].notnull().astype('int')
conditions['Has Vision and Hearing Conditions'] = conditions['Visual and Hearing Conditions'].notnull().astype('int')
conditions['Has Respiratory Conditions'] = conditions['Respiratory System Conditions'].notnull().astype('int')
conditions['Has Digestive Conditions'] = conditions['Digestive System Conditions'].notnull().astype('int')
conditions['Has Genitourinary Conditions'] = conditions['Genitourinary System Conditions'].notnull().astype('int')
conditions['Has Skin Conditions'] = conditions['Skin Conditions'].notnull().astype('int')
conditions['Has Musculoskeletal Conditions'] = conditions['Musculoskeletal System Conditions'].notnull().astype('int')
conditions['Has Congenital Conditions'] = conditions['Congenital Conditions'].notnull().astype('int')

# EDA time
conditions.columns

Index(['Participant', 'Nervous System Conditions',
       'Circulatory System Conditions', 'Endocrine System Conditions',
       'Blood Conditions', 'Visual and Hearing Conditions',
       'Respiratory System Conditions', 'Digestive System Conditions',
       'Genitourinary System Conditions', 'Skin Conditions',
       'Musculoskeletal System Conditions', 'Congenital Conditions',
       'Has Migraines', 'Has Migraines with Aura',
       'Has Migraines without Aura', 'Has Nervous System Conditions',
       'Has Blood Conditions', 'Has Circulatory Conditions',
       'Has Endocrine Conditions', 'Has Vision and Hearing Conditions',
       'Has Respiratory Conditions', 'Has Digestive Conditions',
       'Has Genitourinary Conditions', 'Has Skin Conditions',
       'Has Musculoskeletal Conditions', 'Has Congenital Conditions'],
      dtype='object')

Then we computed the proportion of participants with each class of condition, comparing 1) those with vs. without migraines and 2) those with vs. without aura

def prob_no_mig(cond):  # Getting proportions of those with and without Migraines per biological system
  a = ((conditions['Has Migraines'] == 0) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 0).sum()
  return a
def prob_w_mig(cond):
  b =((conditions['Has Migraines'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 1).sum()
  return b

prob_w_list = [] # iterating through columns list for biological system
for i in conditions.columns[15:]:
  prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in conditions.columns[15:]:
  prob_no_list.append(prob_no_mig(i))

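def prob_no_aura(cond): # Proportion of participants with migraine without aura who have the condition
  # the denominator is the size of the without-aura group, so the result is a within-group proportion
  a = ((conditions['Has Migraines without Aura'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines without Aura'] == 1).sum()
  return a
def prob_w_aura(cond): # Proportion of participants with migraine with aura who have the condition
  b =((conditions['Has Migraines with Aura'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines with Aura'] == 1).sum()
  return b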

prob_w_aura_list = [] # iterating through columns for biological system
for i in conditions.columns[15:]:
  prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in conditions.columns[15:]:
  prob_no_aura_list.append(prob_no_aura(i))

We then generated a new dataframe with these proportions to evaluate the makeup of the population

d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} # putting proportions into df

conditions_prop = pd.DataFrame(data=d,index=conditions.columns[15:])

conditions_prop = conditions_prop.reset_index()

# dataframe showing the proportion of each subset who has a specific comorbidity
conditions_prop
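The same kind of table can be produced more compactly, because the mean of a 0/1 indicator column within a group is exactly the proportion of that group that has the condition. A minimal sketch on made-up indicator data (the values are hypothetical):

```python
import pandas as pd

# Hypothetical 0/1 indicator columns (not project data).
toy = pd.DataFrame({
    "Has Migraines":            [1, 1, 1, 1, 0, 0, 0, 0],
    "Has Skin Conditions":      [1, 1, 0, 0, 1, 0, 0, 0],
    "Has Digestive Conditions": [1, 1, 1, 0, 1, 1, 0, 0],
})

# Group by migraine status, average each indicator, then transpose so that
# conditions are rows and the two groups are columns (group 0 first, then 1).
props = toy.groupby("Has Migraines").mean().T
props.columns = ["no mig", "w mig"]
print(props)
```

This avoids positional column slices such as `columns[15:]`, which silently change meaning if a column is added or reordered.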

# Bar graph of conditions (values are proportions between 0 and 1, not percentages)
conditions_prop.plot.bar(x='index',y=['w mig','no mig'],color=[ 'red', 'blue'], xlabel='Conditions',ylabel='Proportion of Individuals', title='Condition Proportions of Migraine Havers vs Control', width=0.8, figsize=(12,6)).grid()

In every category, individuals with migraines were more likely to report a comorbidity than participants without migraines. In this sample, someone with migraines was more than twice as likely to report another nervous system condition as someone without migraines.

# Bar graph of conditions with and without aura (values are proportions between 0 and 1, not percentages)
conditions_prop.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'green', 'orange'], xlabel='Conditions',ylabel='Proportion of Individuals', title='Condition Proportions of Migraine with Aura vs without', width=0.8, figsize=(12,6)).grid()

When looking at the prevalence of comorbidities in participants with vs. without aura, there does not appear to be a relationship: the frequency is approximately the same for both groups across all 11 categories

From there, we wanted to more specifically look at whether phenotypes and general traits had a statistical relationship with migraines

# need dummies for Blood Types
phenotypes = pd.get_dummies(phenotypes,columns=['Blood Type'])

phenotypes # dataframe with dummies for all (common) blood types

# need height values to be type float
phenotypes['Height (in)'] = phenotypes['Height (in)'].str.replace("\"","")
phenotypes['Height (in)'] = phenotypes['Height (in)'].str.replace("'"," ")
phenotypes['Height (in)'] = [s.split(" ") for s in phenotypes['Height (in)']]
phenotypes['Height (in)'] = [float(value[0])*12 + float(value[1]) for value in phenotypes['Height (in)']]

phenotypes
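The conversion above relies on every height being written as feet and inches with both parts present. A more defensive parser could look like the sketch below. The helper name and the accepted formats are assumptions, since the exact strings in the survey are not shown here; anything that does not match is reported rather than silently mis-parsed.

```python
import re


def height_to_inches(text: str) -> float:
    """Convert a feet/inches string such as 5'10" to inches; raise ValueError otherwise."""
    # feet, an apostrophe, optional inches (possibly fractional), optional closing quote
    match = re.fullmatch(r"\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*\"?\s*", text)
    if match is None:
        raise ValueError(f"unrecognized height: {text!r}")
    feet = float(match.group(1))
    inches = float(match.group(2)) if match.group(2) else 0.0
    return feet * 12 + inches


print(height_to_inches("5'10\""))
print(height_to_inches("6'"))
```

Applied to a column, this could be used as `phenotypes['Height (in)'].apply(height_to_inches)`, with a try/except wrapper if unparseable entries should become NaN instead of raising.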

traits = pd.get_dummies(traits,columns=['Sex/Gender','Race/ethnicity']) # Getting dummies for sex and race

# remerging conditions with phenotypes and general traits // considering dummies now
all = conditions.merge(phenotypes,how='inner',on=['Participant'])
all = all.merge(traits,how='inner',on=['Participant'])

all

all['Participant'].value_counts()
# need to drop dupes again

all = all.drop_duplicates(subset=['Participant'])
all = all.reset_index(drop=True)

# EDA for traits
# mig havers by sex
male = [((all['Has Migraines'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines'].sum(),
        ((all['Has Migraines with Aura'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines with Aura'].sum(),
        ((all['Has Migraines without Aura'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines without Aura'].sum()]
female = [((all['Has Migraines'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines'].sum(),
          ((all['Has Migraines with Aura'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines with Aura'].sum(),
          ((all['Has Migraines without Aura'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines without Aura'].sum()]

d={'male': male, 'female':female} #df for mig havers by sex

sex_props = pd.DataFrame(data=d,index=['w mig','w_aura','no_aura'])

sex_props = sex_props.reset_index()

sex_props #proportions of participant sex considering migraine types

sex_props.plot.bar(x='index',y=['male','female'],color=[ 'royalblue', 'pink'], xlabel='Migraines and Types of Migraines',ylabel='Proportion of Individuals', title='Sex Proportions of Having Migraines and Types of Migraines', width=0.8, figsize=(12,6)).grid()

There appears to be an association between sex and migraine occurrence, with female participants making up close to 2/3 of the participants with migraines. This is a descriptive comparison of proportions rather than a formal significance test.
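To move from a visual impression to a statistical statement, one common option is a chi-square test of independence on the 2x2 table of sex by migraine status. The sketch below computes the Pearson statistic by hand. The counts are illustrative only and are NOT the project's data, and the function name is hypothetical.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square statistic (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    observed = [[a, b], [c, d]]
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat


# Illustrative counts only (NOT the project's data):
# rows = female / male, columns = has migraines / no migraines
stat = chi_square_2x2(60, 140, 30, 170)

# Chi-square critical value for 1 degree of freedom at alpha = 0.05
CRITICAL_VALUE_05 = 3.841
print(round(stat, 3), stat > CRITICAL_VALUE_05)
```

A statistic above the critical value would indicate that the two variables are unlikely to be independent at the 5% level. With the real data, the four counts would come from cross-tabulating the sex and migraine indicator columns.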

From there, we wanted to see if blood type had any statistical relationship with migraines

all.columns[28:37]

Index(['Blood Type_A +', 'Blood Type_A -', 'Blood Type_AB +',
       'Blood Type_AB -', 'Blood Type_B +', 'Blood Type_B -',
       'Blood Type_Don't know', 'Blood Type_O +', 'Blood Type_O -'],
      dtype='object')

def prob_no_mig(cond): # Blood type proportions based on Migraine haver or not
  a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 0).sum()
  return a
def prob_w_mig(cond):
  b =((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
  return b

prob_w_list = [] # iterating through columns for blood types
for i in all.columns[28:37]:
  prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in all.columns[28:37]:
  prob_no_list.append(prob_no_mig(i))

def prob_no_aura(cond): # Blood type proportions by aura status (denominator: all migraine havers)
  # fixed: this function previously read the 'with Aura' column
  a = ((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
  return a
def prob_w_aura(cond):
  # fixed: this function previously read the 'without Aura' column
  b =((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
  return b

prob_w_aura_list = [] # iterating through columns for blood types
for i in all.columns[28:37]:
  prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in all.columns[28:37]:
  prob_no_aura_list.append(prob_no_aura(i))

d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} #blood type probs dataframe

blood_props = pd.DataFrame(data=d,index=all.columns[28:37])

blood_props = blood_props.reset_index()

blood_props #blood type vs type of migraine

blood_props.plot.bar(x='index',y=['w mig','no mig'],color=[ 'brown', 'aqua'], xlabel='Blood Types',ylabel='Proportion of Individuals', title='Proportions of Blood Types for Migraine Havers and Control', width=0.8, figsize=(12,6)).grid()

There does not appear to be a meaningful relationship between having migraines and any specific blood type.

blood_props.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'gray', 'red'], xlabel='Blood Types',ylabel='Proportion of Individuals', title='Proportion of Blood Types for Aura and No Aura', width=0.8, figsize=(12,6)).grid()

There is also no clear relationship between blood type and migraine with aura vs. without aura. Although the differences between the two groups are larger than in the previous plot, this can plausibly be attributed to natural variance in a small data set.
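The four near-identical helper functions above all compute the same thing: the share of a group that has a given trait. As a sketch of how this could be written once, the hypothetical helper `prop_within` below takes the group column and the trait column as arguments and returns NaN for an empty group. The toy frame is made up; only the column names come from our data.

```python
import pandas as pd

def prop_within(df, group_col, cond_col, group_value=1):
    """Share of rows with cond_col == 1 among rows where group_col == group_value."""
    in_group = df[group_col] == group_value
    n_group = in_group.sum()
    if n_group == 0:
        return float("nan")  # empty group: there is no proportion to report
    return (in_group & (df[cond_col] == 1)).sum() / n_group

# Hypothetical toy frame (values are invented for illustration)
toy = pd.DataFrame({
    "Has Migraines with Aura":    [1, 1, 0, 0, 0, 0],
    "Has Migraines without Aura": [0, 0, 1, 1, 1, 0],
    "Blood Type_A +":             [1, 0, 1, 1, 0, 1],
    "Blood Type_O +":             [0, 1, 0, 0, 1, 0],
})

blood_cols = ["Blood Type_A +", "Blood Type_O +"]
groups = {"w_aura": "Has Migraines with Aura", "no_aura": "Has Migraines without Aura"}

# One column per group, one row per blood type
props = pd.DataFrame(
    {label: [prop_within(toy, col, b) for b in blood_cols] for label, col in groups.items()},
    index=blood_cols,
)
print(props)
```

Note that this sketch divides by the size of each group itself, which is a design choice; the functions above divide by the number of all migraine havers instead, so the two are not interchangeable without changing the plots.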

We were also curious as to whether race/ethnicity had a relationship with migraine occurrence.

all.columns[54:]

Index(['Race/ethnicity_American Indian', 'Race/ethnicity_Asian',
       'Race/ethnicity_Asian, Hispanic or',
       'Race/ethnicity_Asian, Native Hawaiian',
       'Race/ethnicity_Black or African', 'Race/ethnicity_Hispanic or Latino',
       'Race/ethnicity_Native Hawaiian', 'Race/ethnicity_No response',
       'Race/ethnicity_White'],
      dtype='object')

def prob_no_mig(cond): # within each race/ethnicity group: share of participants without / with migraines
  a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return a
def prob_w_mig(cond): 
  b =((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return b

prob_w_list = [] # iterating through columns for races
for i in all.columns[54:]:
  prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in all.columns[54:]:
  prob_no_list.append(prob_no_mig(i))

<ipython-input-378-545a5e4fc270>:5: RuntimeWarning: invalid value encountered in long_scalars
  b =((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
<ipython-input-378-545a5e4fc270>:2: RuntimeWarning: invalid value encountered in long_scalars
  a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()

def prob_no_aura(cond): # aura types per race
  # fixed: this function previously read the 'with Aura' column
  a = ((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return a
def prob_w_aura(cond):
  # fixed: this function previously read the 'without Aura' column
  b =((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return b

prob_w_aura_list = [] # iterating through columns for races
for i in all.columns[54:]:
  prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in all.columns[54:]:
  prob_no_aura_list.append(prob_no_aura(i))

<ipython-input-380-6f2f39f387d9>:5: RuntimeWarning: invalid value encountered in long_scalars
  b =((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
<ipython-input-380-6f2f39f387d9>:2: RuntimeWarning: invalid value encountered in long_scalars
  a = ((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()

d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} # race props df

race_props = pd.DataFrame(data=d,index=all.columns[54:])

race_props = race_props.reset_index()

race_props = race_props.dropna() # some races had very low values and no migs

race_props #race/ethnicity vs migraines

race_props.drop(race_props.loc[race_props.w_aura < 0.00001].index, inplace=True) # getting rid of rows with 0s (low pops, unimportant)

race_props

race_props.plot.bar(x='index',y=['w mig','no mig'],color=[ 'orange', 'blue'], xlabel='Race/Ethnicity',ylabel='Proportion of Individuals', title='Proportions of Race/Ethnicity for Migraine Havers and Control', width=0.8, figsize=(12,6)).grid()

Based on this plot alone, it would appear that there is a statistical relationship between race/ethnicity and migraine. However, white participants make up a larger share of the data set than the other groups, so the proportions for the smaller groups rest on few participants, which makes this metric a poor indicator.
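To make the group-size concern concrete: the same observed proportion is much less certain when it comes from a handful of participants. One standard way to show this is the Wilson score interval for a proportion. The sketch below implements it by hand with the standard library; the function name `wilson_interval` and the counts are hypothetical and are not taken from our data.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (float("nan"), float("nan"))
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# Hypothetical counts: the observed proportion is 0.5 in both cases,
# but the group sizes differ by a factor of 50.
small = wilson_interval(2, 4)
large = wilson_interval(100, 200)

print("n=4:  ", small)
print("n=200:", large)
```

The interval for the group of 4 is far wider than the one for the group of 200, which is why a bar of the same height should be read with much more caution for the smaller race/ethnicity groups.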

race_props.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'green', 'red'], xlabel='Race/Ethnicity',ylabel='Proportion of Individuals', title='Proportion of Race/Ethnicity for Aura and No Aura', width=0.8, figsize=(12,6)).grid()

The same could be said for this plot: although there appears to be a relationship between ethnicity and migraine with/without aura, the size and makeup of the dataset must be considered first.

Exploring Genes¶

We also wanted to explore some relationships found in the genetic data. Within this section, we created additional dataframes and evaluated ratios of gene alleles in the population.

alleles

alleles.Gene.nunique() #there are 412 different genes in the df

412

no_mig = alleles['Type of Migraine'].str.contains('None', case=False, na=False)
aura_mig = alleles['Type of Migraine'].str.contains('with aura', case=False, na=False)
no_aura_mig = alleles['Type of Migraine'].str.contains('no aura', case=False, na=False)
both_mig = alleles['Type of Migraine'].str.contains('Both', case=False, na=False)

none_df = alleles[no_mig]
aura_df = alleles[aura_mig]
no_aura_df = alleles[no_aura_mig]
both_df = alleles[both_mig]

none_df.Gene.value_counts()
none_df.Participant.nunique() #40 people
aura_df.Participant.nunique() #49 people
no_aura_df.Participant.nunique() #38 people
both_df.Participant.nunique() #10 people

10

Out of the genetic information collected, there are alleles available for 40 participants with no migraines, 49 participants with migraine with aura, 38 participants with migraine without aura, and 10 participants with both forms.

none_df2 = none_df[none_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
aura_df2 = aura_df[aura_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
no_aura_df2 = no_aura_df[no_aura_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
both_df2 = both_df[both_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
# keeping only rows whose (Gene, Homo/heterozyg) combination appears at least twice within the group
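# To see what `duplicated(..., keep=False)` does here, the small sketch below runs the same filter on a made-up frame. The gene names are borrowed from our data, but the rows themselves are hypothetical.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "Participant":    ["p1", "p2", "p3", "p4"],
    "Gene":           ["MTRR-I49M", "MTRR-I49M", "MTRR-I49M", "HFE-C282Y"],
    "Homo/heterozyg": ["Homozygous", "Homozygous", "Heterozygous", "Homozygous"],
})

# keep=False marks every member of a repeated (Gene, Homo/heterozyg) pair,
# not just the second and later occurrences
mask = toy.duplicated(subset=["Gene", "Homo/heterozyg"], keep=False)
kept = toy[mask]

print(kept)
```

Only p1 and p2 survive. The p3 row is dropped even though its gene is common, because its exact gene and zygosity combination occurs once, and the comparison is case-sensitive. Both effects are worth keeping in mind when reading the counts that follow.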

none_df2.Gene.value_counts()

MTRR-I49M         27
NEFL-S472         27
COL4A1-Q1334H     17
C3-R102G          13
rs5186            13
APOE-C130R         8
CBS-I278T          7
MBL2-G54D          6
MBL2-R52C          6
NPC1-W1122         4
SERPINA1-E288V     4
SYNE1-N1915        4
AMPD1-Q12X         4
APOA5-S19W         4
HABP2-G534E        3
PAX2-Y273          3
HFE-C282Y          3
TGM1-E520G         3
TTN-E190           3
KRT5-G138E         3
NOD2-R702W         3
ACAD8-S171C        3
CREBBP-P1878       2
PRPH-D141Y         2
SERPINA1-E366K     2
RYR1-P2002         2
HPS6-A597          2
SNCA-Y39           2
PHKB-M185I         2
LPL-N318S          2
ALG3-F200          2
PKP2-S140F         2
MSR1-R293X         2
SPG11-K1013E       2
CETP-A390P         2
THBD-A43T          2
CD40LG-G219R       2
PEX26-L153V        2
WFS1-R456H         2
SNCA-A69           2
Name: Gene, dtype: int64

# dropping columns that are not needed for the frequency analysis
none_df2 = none_df2.drop(columns=['Disease capacity', 'Recessive or dominant', 'Mutation or variant?'])
#none_df2 = none_df2.drop('Unnamed: 7', axis = 1)

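def get_ratio(df, pattern):
  # share of rows carrying each (lower-cased) label found in the column `pattern`
  num = df.groupby(df[pattern].str.lower()).size()
  denom = len(df[pattern])
  return num/denom

# function to get the proportion of each zygosity label
# note: 'carrier (heterozygous)' and 'heterozygous' are counted as separate labels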
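# As a sketch of an equivalent and slightly shorter route, `value_counts(normalize=True)` gives the same per-label shares as the groupby above. The toy column below is hypothetical. The second step, folding "carrier (heterozygous)" into "heterozygous", is an assumption about how the labels could be treated, not something we did in this analysis.

```python
import pandas as pd

# Hypothetical labels with mixed capitalisation
toy = pd.DataFrame({
    "Homo/heterozyg": ["Heterozygous", "heterozygous", "Homozygous", "Carrier (heterozygous)"]
})

# Lower-case first so that capitalisation does not split a label in two
labels = toy["Homo/heterozyg"].str.lower()
ratios = labels.value_counts(normalize=True)

# Optional (assumption): treat carriers as heterozygous before counting
merged = labels.replace({"carrier (heterozygous)": "heterozygous"}).value_counts(normalize=True)

print(ratios)
print(merged)
```

Whether carriers should be merged depends on what the label means in the source data, so the merged version is shown only to illustrate how much the shares can move.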

def gene_grab(df, listg):
  count = df.Gene.value_counts()
  # Series.items() replaces iteritems(), which was removed in pandas 2.0
  for gene, num in count.items():
    listg.append((gene, num))

  return listg

# function to collect each gene and its frequency (number of rows)

none_count = none_df2.Gene.value_counts()
none_genes = []

test = gene_grab(none_df2, none_genes)
test

# this creates a list with all the genes and freqs

[('MTRR-I49M', 27),
 ('NEFL-S472', 27),
 ('COL4A1-Q1334H', 17),
 ('C3-R102G', 13),
 ('rs5186', 13),
 ('APOE-C130R', 8),
 ('CBS-I278T', 7),
 ('MBL2-G54D', 6),
 ('MBL2-R52C', 6),
 ('NPC1-W1122', 4),
 ('SERPINA1-E288V', 4),
 ('SYNE1-N1915', 4),
 ('AMPD1-Q12X', 4),
 ('APOA5-S19W', 4),
 ('HABP2-G534E', 3),
 ('PAX2-Y273', 3),
 ('HFE-C282Y', 3),
 ('TGM1-E520G', 3),
 ('TTN-E190', 3),
 ('KRT5-G138E', 3),
 ('NOD2-R702W', 3),
 ('ACAD8-S171C', 3),
 ('CREBBP-P1878', 2),
 ('PRPH-D141Y', 2),
 ('SERPINA1-E366K', 2),
 ('RYR1-P2002', 2),
 ('HPS6-A597', 2),
 ('SNCA-Y39', 2),
 ('PHKB-M185I', 2),
 ('LPL-N318S', 2),
 ('ALG3-F200', 2),
 ('PKP2-S140F', 2),
 ('MSR1-R293X', 2),
 ('SPG11-K1013E', 2),
 ('CETP-A390P', 2),
 ('THBD-A43T', 2),
 ('CD40LG-G219R', 2),
 ('PEX26-L153V', 2),
 ('WFS1-R456H', 2),
 ('SNCA-A69', 2)]

def get_freq(gene_freq_list, df, ratios_list):
  for gene, _ in gene_freq_list:
    # exact match: str.contains treats the name as a regex and would also match longer gene names
    specific_gene_df = df[df['Gene'] == gene]
    hold = get_ratio(specific_gene_df, 'Homo/heterozyg')
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    ratios_list.append(list(hold.items()))

  return ratios_list

# function that returns the allele types broken down individually

breakdown_test = []
none_freqs = get_freq(test, none_df2, breakdown_test)

breakdown_test

[[('carrier (heterozygous)', 0.6296296296296297),
  ('homozygous', 0.37037037037037035)],
 [('homozygous', 1.0)],
 [('heterozygous', 0.7058823529411765), ('homozygous', 0.29411764705882354)],
 [('heterozygous', 1.0)],
 [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)]]

total_none_set = []
for item in range(len(test)):
  gene = test[item][0]
  freq = test[item][1]
  ratios = breakdown_test[item]
  total_none_set.append((gene, freq, ratios))

total_none_set

# creating list with gene, freq, and allele type distrib

[('MTRR-I49M',
  27,
  [('carrier (heterozygous)', 0.6296296296296297),
   ('homozygous', 0.37037037037037035)]),
 ('NEFL-S472', 27, [('homozygous', 1.0)]),
 ('COL4A1-Q1334H',
  17,
  [('heterozygous', 0.7058823529411765), ('homozygous', 0.29411764705882354)]),
 ('C3-R102G', 13, [('heterozygous', 1.0)]),
 ('rs5186',
  13,
  [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)]),
 ('APOE-C130R', 8, [('heterozygous', 1.0)]),
 ('CBS-I278T', 7, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-G54D', 6, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-R52C', 6, [('carrier (heterozygous)', 1.0)]),
 ('NPC1-W1122', 4, [('heterozygous', 1.0)]),
 ('SERPINA1-E288V', 4, [('carrier (heterozygous)', 1.0)]),
 ('SYNE1-N1915', 4, [('heterozygous', 1.0)]),
 ('AMPD1-Q12X', 4, [('carrier (heterozygous)', 1.0)]),
 ('APOA5-S19W', 4, [('heterozygous', 1.0)]),
 ('HABP2-G534E', 3, [('heterozygous', 1.0)]),
 ('PAX2-Y273', 3, [('heterozygous', 1.0)]),
 ('HFE-C282Y', 3, [('carrier (heterozygous)', 1.0)]),
 ('TGM1-E520G', 3, [('carrier (heterozygous)', 1.0)]),
 ('TTN-E190', 3, [('heterozygous', 1.0)]),
 ('KRT5-G138E', 3, [('heterozygous', 1.0)]),
 ('NOD2-R702W', 3, [('heterozygous', 1.0)]),
 ('ACAD8-S171C', 3, [('carrier (heterozygous)', 1.0)]),
 ('CREBBP-P1878', 2, [('heterozygous', 1.0)]),
 ('PRPH-D141Y', 2, [('carrier (heterozygous)', 1.0)]),
 ('SERPINA1-E366K', 2, [('carrier (heterozygous)', 1.0)]),
 ('RYR1-P2002', 2, [('heterozygous', 1.0)]),
 ('HPS6-A597', 2, [('heterozygous', 1.0)]),
 ('SNCA-Y39', 2, [('heterozygous', 1.0)]),
 ('PHKB-M185I', 2, [('carrier (heterozygous)', 1.0)]),
 ('LPL-N318S', 2, [('heterozygous', 1.0)]),
 ('ALG3-F200', 2, [('heterozygous', 1.0)]),
 ('PKP2-S140F', 2, [('heterozygous', 1.0)]),
 ('MSR1-R293X', 2, [('heterozygous', 1.0)]),
 ('SPG11-K1013E', 2, [('carrier (heterozygous)', 1.0)]),
 ('CETP-A390P', 2, [('heterozygous', 1.0)]),
 ('THBD-A43T', 2, [('heterozygous', 1.0)]),
 ('CD40LG-G219R', 2, [('carrier (heterozygous)', 1.0)]),
 ('PEX26-L153V', 2, [('carrier (heterozygous)', 1.0)]),
 ('WFS1-R456H', 2, [('heterozygous', 1.0)]),
 ('SNCA-A69', 2, [('heterozygous', 1.0)])]

After collecting this information for people without migraines, we repeated the same process with the other dataframes, using the same functions.
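Since the same three steps (count genes, compute zygosity shares, zip them together) are repeated for every group, they could also be wrapped in one function and run in a loop. The sketch below is one possible way to do that. `summarize_genes` and the two toy frames are hypothetical, and the gene names are placeholders.

```python
import pandas as pd

def summarize_genes(df):
    """One row per gene: number of rows and the share of each zygosity label."""
    rows = []
    for gene, sub in df.groupby("Gene"):
        shares = sub["Homo/heterozyg"].str.lower().value_counts(normalize=True)
        rows.append((gene, len(sub), list(shares.items())))
    out = pd.DataFrame(rows, columns=["Gene", "Frequency", "Alleles"])
    return out.sort_values("Frequency", ascending=False, ignore_index=True)

# Hypothetical stand-ins for none_df2, aura_df2, ...
groups = {
    "none": pd.DataFrame({
        "Gene": ["GENE-A", "GENE-A", "GENE-A", "GENE-B"],
        "Homo/heterozyg": ["Heterozygous", "Heterozygous", "Homozygous", "Homozygous"],
    }),
    "aura": pd.DataFrame({
        "Gene": ["GENE-A", "GENE-C", "GENE-C"],
        "Homo/heterozyg": ["Homozygous", "Heterozygous", "Heterozygous"],
    }),
}

# Same summary for every group, keyed by group name
summaries = {name: summarize_genes(df) for name, df in groups.items()}
print(summaries["none"])
```

The result has the same three columns as the lists built by hand in this section, so it could feed directly into the merge step later on.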

aura_df2.Gene.value_counts()

MTRR-I49M         35
COL4A1-Q1334H     30
rs5186            21
APOE-C130R        14
C3-R102G          13
BEST1-S192        12
BEST1-Y245        12
MBL2-R52C         10
AMPD1-Q12X        10
MBL2-G54D          9
CETP-A390P         7
NEFL-S472          7
KRT5-G138E         5
KRT86-E402Q        5
PIGR-A580V         5
BTD-D444H          5
ACAD8-S171C        4
SERPINA1-E288V     4
VWF-S1506L         3
HABP2-G534E        3
RET-R231H          3
WFS1-R456H         3
KDR-C482R          3
HFE-S65C           3
SLC4A1-E40K        3
MPO-M251T          3
NOD2-G908R         3
KRT14-C18X         2
FCGR2B-I232T       2
RPGRIP1L-A229T     2
CFTR-S1235R        2
HFE-C282Y          2
MEFV-E148Q         2
MFN2-Q276R         2
APOA5-S19W         2
TTN-E190           2
COL9A3-R103W       2
PRF1-A91V          2
NOD2-R702W         2
ABCA4-G863A        2
ABCA4-A1038V       2
Name: Gene, dtype: int64

aura_df2 = aura_df2.drop(columns=['Disease capacity', 'Recessive or dominant', 'Mutation or variant?'])
#aura_df2 = aura_df2.drop('Unnamed: 7', axis = 1)

aura_genes = []

ag_list = gene_grab(aura_df2, aura_genes)
ag_list

[('MTRR-I49M', 35),
 ('COL4A1-Q1334H', 30),
 ('rs5186', 21),
 ('APOE-C130R', 14),
 ('C3-R102G', 13),
 ('BEST1-S192', 12),
 ('BEST1-Y245', 12),
 ('MBL2-R52C', 10),
 ('AMPD1-Q12X', 10),
 ('MBL2-G54D', 9),
 ('CETP-A390P', 7),
 ('NEFL-S472', 7),
 ('KRT5-G138E', 5),
 ('KRT86-E402Q', 5),
 ('PIGR-A580V', 5),
 ('BTD-D444H', 5),
 ('ACAD8-S171C', 4),
 ('SERPINA1-E288V', 4),
 ('VWF-S1506L', 3),
 ('HABP2-G534E', 3),
 ('RET-R231H', 3),
 ('WFS1-R456H', 3),
 ('KDR-C482R', 3),
 ('HFE-S65C', 3),
 ('SLC4A1-E40K', 3),
 ('MPO-M251T', 3),
 ('NOD2-G908R', 3),
 ('KRT14-C18X', 2),
 ('FCGR2B-I232T', 2),
 ('RPGRIP1L-A229T', 2),
 ('CFTR-S1235R', 2),
 ('HFE-C282Y', 2),
 ('MEFV-E148Q', 2),
 ('MFN2-Q276R', 2),
 ('APOA5-S19W', 2),
 ('TTN-E190', 2),
 ('COL9A3-R103W', 2),
 ('PRF1-A91V', 2),
 ('NOD2-R702W', 2),
 ('ABCA4-G863A', 2),
 ('ABCA4-A1038V', 2)]

aurafr = []
aura_freqs = get_freq(ag_list, aura_df2, aurafr)
aurafr

[[('carrier (heterozygous)', 0.7428571428571429),
  ('homozygous', 0.2571428571428571)],
 [('heterozygous', 0.6333333333333333), ('homozygous', 0.36666666666666664)],
 [('heterozygous', 0.7142857142857143), ('homozygous', 0.2857142857142857)],
 [('heterozygous', 0.8571428571428571), ('homozygous', 0.14285714285714285)],
 [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)],
 [('homozygous', 1.0)],
 [('homozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)]]

total_aura_set = []
for item in range(len(ag_list)):
  gene = ag_list[item][0]
  freq = ag_list[item][1]
  ratios = aurafr[item]
  total_aura_set.append((gene, freq, ratios))

total_aura_set #full set of information for people with migraine with aura

[('MTRR-I49M',
  35,
  [('carrier (heterozygous)', 0.7428571428571429),
   ('homozygous', 0.2571428571428571)]),
 ('COL4A1-Q1334H',
  30,
  [('heterozygous', 0.6333333333333333), ('homozygous', 0.36666666666666664)]),
 ('rs5186',
  21,
  [('heterozygous', 0.7142857142857143), ('homozygous', 0.2857142857142857)]),
 ('APOE-C130R',
  14,
  [('heterozygous', 0.8571428571428571), ('homozygous', 0.14285714285714285)]),
 ('C3-R102G',
  13,
  [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)]),
 ('BEST1-S192', 12, [('homozygous', 1.0)]),
 ('BEST1-Y245', 12, [('homozygous', 1.0)]),
 ('MBL2-R52C', 10, [('carrier (heterozygous)', 1.0)]),
 ('AMPD1-Q12X', 10, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-G54D', 9, [('carrier (heterozygous)', 1.0)]),
 ('CETP-A390P', 7, [('heterozygous', 1.0)]),
 ('NEFL-S472', 7, [('homozygous', 1.0)]),
 ('KRT5-G138E', 5, [('heterozygous', 1.0)]),
 ('KRT86-E402Q', 5, [('homozygous', 1.0)]),
 ('PIGR-A580V', 5, [('heterozygous', 1.0)]),
 ('BTD-D444H', 5, [('carrier (heterozygous)', 1.0)]),
 ('ACAD8-S171C', 4, [('carrier (heterozygous)', 1.0)]),
 ('SERPINA1-E288V', 4, [('carrier (heterozygous)', 1.0)]),
 ('VWF-S1506L', 3, [('carrier (heterozygous)', 1.0)]),
 ('HABP2-G534E', 3, [('heterozygous', 1.0)]),
 ('RET-R231H', 3, [('heterozygous', 1.0)]),
 ('WFS1-R456H', 3, [('heterozygous', 1.0)]),
 ('KDR-C482R', 3, [('heterozygous', 1.0)]),
 ('HFE-S65C', 3, [('carrier (heterozygous)', 1.0)]),
 ('SLC4A1-E40K', 3, [('carrier (heterozygous)', 1.0)]),
 ('MPO-M251T', 3, [('carrier (heterozygous)', 1.0)]),
 ('NOD2-G908R', 3, [('heterozygous', 1.0)]),
 ('KRT14-C18X', 2, [('heterozygous', 1.0)]),
 ('FCGR2B-I232T', 2, [('heterozygous', 1.0)]),
 ('RPGRIP1L-A229T', 2, [('heterozygous', 1.0)]),
 ('CFTR-S1235R', 2, [('carrier (heterozygous)', 1.0)]),
 ('HFE-C282Y', 2, [('carrier (heterozygous)', 1.0)]),
 ('MEFV-E148Q', 2, [('carrier (heterozygous)', 1.0)]),
 ('MFN2-Q276R', 2, [('heterozygous', 1.0)]),
 ('APOA5-S19W', 2, [('heterozygous', 1.0)]),
 ('TTN-E190', 2, [('heterozygous', 1.0)]),
 ('COL9A3-R103W', 2, [('heterozygous', 1.0)]),
 ('PRF1-A91V', 2, [('heterozygous', 1.0)]),
 ('NOD2-R702W', 2, [('heterozygous', 1.0)]),
 ('ABCA4-G863A', 2, [('carrier (heterozygous)', 1.0)]),
 ('ABCA4-A1038V', 2, [('heterozygous', 1.0)])]

both_df2.Gene.value_counts()

MTRR-I49M         9
HFE-C282Y         5
KRT86-E402Q       4
COL4A1-Q1334H     4
APOA5-S19W        3
VWF-S1506L        3
MBL2-R52C         3
C3-R102G          3
SERPINA1-E366K    2
KRT5-G138E        2
CYP21A2-Q319X     2
MYH7-Q1334        2
MBL2-G54D         2
SERPINA1-E288V    2
RPGRIP1L-A229T    2
APC-Y486X         2
AMPD1-Q12X        2
PRF1-A91V         2
Name: Gene, dtype: int64

both_count = both_df2.Gene.value_counts()
both_genes = []

bboth = gene_grab(both_df2, both_genes)
bboth

[('MTRR-I49M', 9),
 ('HFE-C282Y', 5),
 ('KRT86-E402Q', 4),
 ('COL4A1-Q1334H', 4),
 ('APOA5-S19W', 3),
 ('VWF-S1506L', 3),
 ('MBL2-R52C', 3),
 ('C3-R102G', 3),
 ('SERPINA1-E366K', 2),
 ('KRT5-G138E', 2),
 ('CYP21A2-Q319X', 2),
 ('MYH7-Q1334', 2),
 ('MBL2-G54D', 2),
 ('SERPINA1-E288V', 2),
 ('RPGRIP1L-A229T', 2),
 ('APC-Y486X', 2),
 ('AMPD1-Q12X', 2),
 ('PRF1-A91V', 2)]

both_test = []
both_freqs = get_freq(bboth, both_df2, both_test)

both_test

[[('carrier (heterozygous)', 0.5555555555555556),
  ('homozygous', 0.4444444444444444)],
 [('carrier (heterozygous)', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)]]

total_both_set = []
for item in range(len(bboth)):
  gene = bboth[item][0]
  freq = bboth[item][1]
  ratios = both_test[item]
  total_both_set.append((gene, freq, ratios))

total_both_set # information for those with both forms of migraine

[('MTRR-I49M',
  9,
  [('carrier (heterozygous)', 0.5555555555555556),
   ('homozygous', 0.4444444444444444)]),
 ('HFE-C282Y', 5, [('carrier (heterozygous)', 1.0)]),
 ('KRT86-E402Q', 4, [('homozygous', 1.0)]),
 ('COL4A1-Q1334H', 4, [('heterozygous', 1.0)]),
 ('APOA5-S19W', 3, [('heterozygous', 1.0)]),
 ('VWF-S1506L', 3, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-R52C', 3, [('carrier (heterozygous)', 1.0)]),
 ('C3-R102G', 3, [('heterozygous', 1.0)]),
 ('SERPINA1-E366K', 2, [('carrier (heterozygous)', 1.0)]),
 ('KRT5-G138E', 2, [('heterozygous', 1.0)]),
 ('CYP21A2-Q319X', 2, [('carrier (heterozygous)', 1.0)]),
 ('MYH7-Q1334', 2, [('heterozygous', 1.0)]),
 ('MBL2-G54D', 2, [('carrier (heterozygous)', 1.0)]),
 ('SERPINA1-E288V', 2, [('carrier (heterozygous)', 1.0)]),
 ('RPGRIP1L-A229T', 2, [('heterozygous', 1.0)]),
 ('APC-Y486X', 2, [('heterozygous', 1.0)]),
 ('AMPD1-Q12X', 2, [('carrier (heterozygous)', 1.0)]),
 ('PRF1-A91V', 2, [('heterozygous', 1.0)])]

no_aura_df2.Gene.value_counts()

MTRR-I49M         29
rs5186            23
COL4A1-Q1334H     19
NEFL-S472         12
C3-R102G          11
AMPD1-Q12X        10
APOE-C130R        10
MBL2-G54D          8
BEST1-S192         8
BEST1-Y245         8
KRT86-E402Q        6
APOA5-S19W         6
MBL2-R52C          4
DMD-E2910V         3
NOD2-R702W         3
CYP21A2-Q319X      3
PRPH-D141Y         3
RP1-T373I          3
MEFV-P369S         3
CFTR-W1204X        2
VWF-S1506L         2
WFS1-R456H         2
ARSA-T274M         2
KRT5-G138E         2
PMP22-T118M        2
HFE-C282Y          2
APC-Y486X          2
CBS-I278T          2
KRT14-C18X         2
PMM2-V129M         2
ABCC6-R1164X       2
BTD-D444H          2
VCL-M1073          2
DOK7-S45L          2
RPGRIP1L-A229T     2
CETP-A390P         2
SPG11-K1013E       2
RET-R231H          2
Name: Gene, dtype: int64

no_aura_count = no_aura_df2.Gene.value_counts()
no_aura_genes = []

noaur = gene_grab(no_aura_df2, no_aura_genes)
noaur

[('MTRR-I49M', 29),
 ('rs5186', 23),
 ('COL4A1-Q1334H', 19),
 ('NEFL-S472', 12),
 ('C3-R102G', 11),
 ('AMPD1-Q12X', 10),
 ('APOE-C130R', 10),
 ('MBL2-G54D', 8),
 ('BEST1-S192', 8),
 ('BEST1-Y245', 8),
 ('KRT86-E402Q', 6),
 ('APOA5-S19W', 6),
 ('MBL2-R52C', 4),
 ('DMD-E2910V', 3),
 ('NOD2-R702W', 3),
 ('CYP21A2-Q319X', 3),
 ('PRPH-D141Y', 3),
 ('RP1-T373I', 3),
 ('MEFV-P369S', 3),
 ('CFTR-W1204X', 2),
 ('VWF-S1506L', 2),
 ('WFS1-R456H', 2),
 ('ARSA-T274M', 2),
 ('KRT5-G138E', 2),
 ('PMP22-T118M', 2),
 ('HFE-C282Y', 2),
 ('APC-Y486X', 2),
 ('CBS-I278T', 2),
 ('KRT14-C18X', 2),
 ('PMM2-V129M', 2),
 ('ABCC6-R1164X', 2),
 ('BTD-D444H', 2),
 ('VCL-M1073', 2),
 ('DOK7-S45L', 2),
 ('RPGRIP1L-A229T', 2),
 ('CETP-A390P', 2),
 ('SPG11-K1013E', 2),
 ('RET-R231H', 2)]

noaur_test = []
noaur_freqs = get_freq(noaur, no_aura_df2, noaur_test)

noaur_test

[[('carrier (heterozygous)', 0.5517241379310345),
  ('homozygous', 0.4482758620689655)],
 [('heterozygous', 0.9130434782608695), ('homozygous', 0.08695652173913043)],
 [('heterozygous', 0.8947368421052632), ('homozygous', 0.10526315789473684)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 0.8), ('homozygous', 0.2)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('homozygous', 1.0)],
 [('homozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('homozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)]]

total_noaur_set = []
for item in range(len(noaur)):
  gene = noaur[item][0]
  freq = noaur[item][1]
  ratios = noaur_test[item]
  total_noaur_set.append((gene, freq, ratios))

total_noaur_set # total information for people with migraine without aura

[('MTRR-I49M',
  29,
  [('carrier (heterozygous)', 0.5517241379310345),
   ('homozygous', 0.4482758620689655)]),
 ('rs5186',
  23,
  [('heterozygous', 0.9130434782608695), ('homozygous', 0.08695652173913043)]),
 ('COL4A1-Q1334H',
  19,
  [('heterozygous', 0.8947368421052632), ('homozygous', 0.10526315789473684)]),
 ('NEFL-S472', 12, [('homozygous', 1.0)]),
 ('C3-R102G', 11, [('heterozygous', 1.0)]),
 ('AMPD1-Q12X', 10, [('carrier (heterozygous)', 0.8), ('homozygous', 0.2)]),
 ('APOE-C130R', 10, [('heterozygous', 1.0)]),
 ('MBL2-G54D', 8, [('carrier (heterozygous)', 1.0)]),
 ('BEST1-S192', 8, [('homozygous', 1.0)]),
 ('BEST1-Y245', 8, [('homozygous', 1.0)]),
 ('KRT86-E402Q', 6, [('homozygous', 1.0)]),
 ('APOA5-S19W', 6, [('heterozygous', 1.0)]),
 ('MBL2-R52C', 4, [('carrier (heterozygous)', 1.0)]),
 ('DMD-E2910V', 3, [('heterozygous', 1.0)]),
 ('NOD2-R702W', 3, [('heterozygous', 1.0)]),
 ('CYP21A2-Q319X', 3, [('carrier (heterozygous)', 1.0)]),
 ('PRPH-D141Y', 3, [('carrier (heterozygous)', 1.0)]),
 ('RP1-T373I', 3, [('carrier (heterozygous)', 1.0)]),
 ('MEFV-P369S', 3, [('carrier (heterozygous)', 1.0)]),
 ('CFTR-W1204X', 2, [('homozygous', 1.0)]),
 ('VWF-S1506L', 2, [('carrier (heterozygous)', 1.0)]),
 ('WFS1-R456H', 2, [('heterozygous', 1.0)]),
 ('ARSA-T274M', 2, [('carrier (heterozygous)', 1.0)]),
 ('KRT5-G138E', 2, [('heterozygous', 1.0)]),
 ('PMP22-T118M', 2, [('heterozygous', 1.0)]),
 ('HFE-C282Y', 2, [('carrier (heterozygous)', 1.0)]),
 ('APC-Y486X', 2, [('heterozygous', 1.0)]),
 ('CBS-I278T', 2, [('carrier (heterozygous)', 1.0)]),
 ('KRT14-C18X', 2, [('heterozygous', 1.0)]),
 ('PMM2-V129M', 2, [('homozygous', 1.0)]),
 ('ABCC6-R1164X', 2, [('heterozygous', 1.0)]),
 ('BTD-D444H', 2, [('carrier (heterozygous)', 1.0)]),
 ('VCL-M1073', 2, [('heterozygous', 1.0)]),
 ('DOK7-S45L', 2, [('carrier (heterozygous)', 1.0)]),
 ('RPGRIP1L-A229T', 2, [('heterozygous', 1.0)]),
 ('CETP-A390P', 2, [('heterozygous', 1.0)]),
 ('SPG11-K1013E', 2, [('carrier (heterozygous)', 1.0)]),
 ('RET-R231H', 2, [('heterozygous', 1.0)])]

After this, the per-group lists were converted to dataframes and merged into one compiled dataframe.
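The compiling step is a chain of outer merges on `Gene`, followed by filling the missing frequencies with 0. As a minimal sketch of the same idea in compact form, the frames can be renamed up front and merged with `functools.reduce`. The group frames and gene names below are hypothetical placeholders, and the `Alleles` column is left out to keep the example short.

```python
from functools import reduce
import pandas as pd

# Hypothetical per-group frequency tables
frames = {
    "None": pd.DataFrame({"Gene": ["GENE-A", "GENE-B"], "Frequency": [27, 5]}),
    "Aura": pd.DataFrame({"Gene": ["GENE-A", "GENE-C"], "Frequency": [35, 12]}),
    "Both": pd.DataFrame({"Gene": ["GENE-A"], "Frequency": [9]}),
}

# Give each frequency column a group-specific name before merging,
# so no column has to be renamed afterwards
renamed = [df.rename(columns={"Frequency": f"Freq: {label}"}) for label, df in frames.items()]

# Outer merge keeps genes that appear in only some of the groups
combined = reduce(lambda left, right: left.merge(right, on="Gene", how="outer"), renamed)

# A gene missing from a group simply has frequency 0 there
freq_cols = [c for c in combined.columns if c.startswith("Freq: ")]
combined[freq_cols] = combined[freq_cols].fillna(0)

print(combined)
```

Renaming before merging avoids the `_x`/`_y` suffixes that pandas adds when two frames share a column name.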

total_none_set
tot_none_df2 = pd.DataFrame(total_none_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_none_df2

total_aura_set
tot_aura_df2 = pd.DataFrame(total_aura_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_aura_df2

total_both_set
tot_both_df2 = pd.DataFrame(total_both_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_both_df2

total_noaur_set
tot_noaur_df2 = pd.DataFrame(total_noaur_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_noaur_df2

# all the lists being converted to individual dataframes

# merging into one large dataframe

none_and_aura = tot_none_df2.merge(tot_aura_df2, on="Gene", how="outer")
none_and_aura.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura"]
none_and_aura

none_aura_both = none_and_aura.merge(tot_both_df2, on="Gene", how="outer")
none_aura_both.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura", "Freq: Both", "Alleles: Both"]
none_aura_both

all_types_df2 = none_aura_both.merge(tot_noaur_df2, on="Gene", how="outer")
all_types_df2.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura", "Freq: Both", "Alleles: Both", "Freq: No Aura", "Alleles: No aura"]
all_types_df2

#filling in the NaN values
all_types_df2['Freq: Aura'] = all_types_df2['Freq: Aura'].fillna(0.0)
all_types_df2['Freq: None'] = all_types_df2['Freq: None'].fillna(0.0)
all_types_df2['Freq: No Aura'] = all_types_df2['Freq: No Aura'].fillna(0.0)
all_types_df2['Freq: Both'] = all_types_df2['Freq: Both'].fillna(0.0)
all_types_df2

Then, several ratios were computed from the frequencies. We considered the ratio of each gene's frequency across all populations as well as its proportion within each individual subset.

all_freq = ['Freq: None', "Freq: Both", "Freq: Aura", "Freq: No Aura"]
all_types_df2['Total'] = all_types_df2[all_freq].sum(axis=1)

all_types_df2

# denominators are the participant counts found above: 40 (none), 49 (aura), 38 (no aura), 10 (both)
all_types_df2['None Ratio'] = all_types_df2["Freq: None"] / 40
all_types_df2['Aura Ratio'] = all_types_df2["Freq: Aura"] / 49
all_types_df2['No Aura Ratio'] = all_types_df2["Freq: No Aura"] / 38
all_types_df2['Both Ratio'] = all_types_df2["Freq: Both"] / 10
all_types_df2['All Ratio'] = all_types_df2["Total"] / 137 # 40 + 49 + 38 + 10
all_types_df2['All Mig Ratio'] = ((all_types_df2["Freq: Aura"] + all_types_df2["Freq: No Aura"] + all_types_df2["Freq: Both"]) / 97) # 49 + 38 + 10

all_types_df2

# hyper specific dropping of rows (by index label) with not enough information
moddf = all_types_df2.drop([74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 58, 57, 56, 0, 2, 3, 5, 6,9,1,8, 10, 11, 15, 17, 18, 22, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 ])
moddf2 = moddf.drop([35, 36, 37, 38, 39, 55,20, 14,33, 34,13,  53,  16, 19, 47, 48, 49, 50, 51, 52, 59 ])
moddf2

Note: To create a graphable dataframe, we dropped genes that had an overall frequency of less than 5 and/or low frequencies with a similar distribution in the control and migraine populations.

We dropped these manually by index label, which is why so many rows are listed in the drop calls above. If we were to do this project again, we would develop a more systematic method.
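As a rough idea of what a more systematic method could look like, the two criteria in the note can be written as a boolean mask. The thresholds `MIN_TOTAL` and `MIN_GAP` below are assumptions (the first comes from the "less than 5" rule, the second is invented to stand in for "similar distribution"), and the toy frame is hypothetical. This sketch would not necessarily reproduce the exact rows we dropped by hand.

```python
import pandas as pd

# Hypothetical stand-in for all_types_df2 (placeholder gene names and values)
toy = pd.DataFrame({
    "Gene":          ["GENE-A", "GENE-B", "GENE-C", "GENE-D"],
    "Total":         [40, 3, 12, 9],
    "None Ratio":    [0.30, 0.05, 0.00, 0.10],
    "All Mig Ratio": [0.31, 0.00, 0.12, 0.11],
})

MIN_TOTAL = 5    # assumed cut-off, from the "frequency of less than 5" rule
MIN_GAP = 0.05   # hypothetical threshold for "similar distribution"

# Absolute difference between the migraine and control ratios
gap = (toy["All Mig Ratio"] - toy["None Ratio"]).abs()

# Keep genes that are frequent enough AND differ between the two populations
keep = (toy["Total"] >= MIN_TOTAL) & (gap >= MIN_GAP)
filtered = toy[keep].reset_index(drop=True)

print(filtered)
```

Writing the rule down this way makes the selection reproducible and lets the thresholds be changed in one place, at the cost of being stricter than a case-by-case manual choice.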

import sklearn
assert sklearn.__version__ >= "0.20"

import numpy as np
import os

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


moddf2.plot.bar(x='Gene', y=['None Ratio', 'All Mig Ratio'], color=[ 'royalblue', 'aqua'], ylabel='Proportion of Individuals', title='Gene Proportions of Migraine Havers vs Control', width=0.8, figsize=(12,6)).grid()
#sns.catplot(data=moddf2, x="Gene", y=["None Ratio", "All Mig Ratio"])

Within this filtered set of alleles, a number of genes are present in the migraine populations but not in the control group. Each gene shown has at least 5 occurrences overall. These could possibly relate to migraines, or may be indicative of a scientific process outside the scope of our EDA.

Part 3: Modeling¶

For the project, we decided to create two different models, each addressing one question:

Can we predict whether someone has a specific comorbidity (i.e. a condition in one of the 11 subsets) based on whether or not they have migraines?
Can we predict whether someone has migraines using genomic data?

Comorbidity Based¶

all_features = ['Has Migraines', 'Has Migraines with Aura',
       'Has Migraines without Aura','Height (in)', 'Weight (lbs)', 'Blood Type_A +', 'Blood Type_A -',
       'Blood Type_AB +', 'Blood Type_AB -', 'Blood Type_B +',
       'Blood Type_B -', 'Blood Type_Don\'t know', 'Blood Type_O +',
       'Blood Type_O -', 'Sex/Gender_Female',
       'Sex/Gender_Male','Race/ethnicity_American Indian', 'Race/ethnicity_Asian',
       'Race/ethnicity_Black or African', 'Race/ethnicity_Hispanic or Latino',
       'Race/ethnicity_White'] # features to test

#knn for system that is seemingly highly correlated with having migraines (nervous) with and without migs
def knn_nerv(k): 
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features]
    y_train = all['Has Nervous System Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_nerv_no_mig(k):
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features[3:]]
    y_train = all['Has Nervous System Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
nerv = ks.apply(knn_nerv)
nerv_no_mig = ks.apply(knn_nerv_no_mig)
plt.plot(nerv,label='Nervous System Conditions')
plt.plot(nerv_no_mig,label='Nervous System Conditions no mig')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.legend()
# as you can see, including the migraine features is associated with higher prediction accuracy for nervous system conditions

This model indicates that knowing whether someone has migraines increases accuracy in predicting whether they have a nervous system condition (other than migraines). This suggests that there is likely a statistical relationship between migraines and comorbid neurological conditions.

#knn for system that is seemingly not highly correlated with having migraines (digestive) with and without migs
def knn_dig(k):
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features]
    y_train = all['Has Digestive Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_dig_no_mig(k):
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features[3:]]
    y_train = all['Has Digestive Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
dig = ks.apply(knn_dig)
dig_no_mig = ks.apply(knn_dig_no_mig)
plt.plot(dig,label='Digestive System Conditions')
plt.plot(dig_no_mig,label='Digestive System Conditions no mig')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.legend()
# as you can see, removing the migraine features does not have a noticeable effect on accuracy for predicting digestive system conditions

In this model, knowing whether someone has migraines does not noticeably improve accuracy in predicting whether they have a digestive system condition. This suggests little or no relationship between migraines and digestive comorbidities in our data. This may also be the case for other conditions.
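Accuracy on its own can be hard to interpret: if most participants do not have a condition, a model that always predicts "no" already scores well. A minimal sketch of a majority-class baseline comparison follows. It runs on synthetic data (the arrays `X` and `y` are made up, not our survey data) and uses scikit-learn's `DummyClassifier`; the choice of `n_neighbors=15` is arbitrary.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data: 200 rows, 4 numeric features, roughly 20% positive labels.
X = rng.normal(size=(200, 4))
y = (rng.random(200) < 0.2).astype(int)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent")
knn = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=15)),
])

baseline_acc = cross_val_score(baseline, X, y, cv=5, scoring="accuracy").mean()
knn_acc = cross_val_score(knn, X, y, cv=5, scoring="accuracy").mean()
print(f"majority-class baseline: {baseline_acc:.3f}, kNN: {knn_acc:.3f}")
```

If the kNN accuracy is not clearly above the baseline, the features are probably adding little, whichever curve looks higher on the plot.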

Gene Based¶

We started by rereading the alleles dataframe to ensure there were no issues with the dataset.

alleles = pd.read_excel('vari.xlsx') # genetic data

alleles = alleles.drop('Unnamed: 7',axis=1) # dropping unnecessary data

# binary indicator columns (1 = yes, 0 = no) derived from 'Type of Migraine'
alleles['Has Migraine'] = (alleles['Type of Migraine'] != 'None').astype(int)

alleles['Has Migraine with aura'] = (alleles['Type of Migraine'] == 'Mig with aura').astype(int)
alleles['Has Migraine without aura'] = (alleles['Type of Migraine'] == 'Mig no aura').astype(int)

alleles

feat = ['Gene','Recessive or dominant','Homo/heterozyg','Mutation or variant?','Disease capacity'] # test cols

# testing which features are most relevant to accurately predicting whether someone has migraines
def knn_full(n):
  model = KNeighborsClassifier(n_neighbors=n)
  scaler=StandardScaler()
  vec = DictVectorizer(sparse=False)
  pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
  X_train = alleles[feat]
  y_train = alleles['Has Migraine']
  X_train=X_train.to_dict(orient="records")
  return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines Per Missing Feature')
plt.legend()
# it seems that removing homo/hetero noticeably reduces the model's ability to predict whether someone has migraines
# the most accurate variant disregards gene names but keeps the other data

This model used genomic information (from the alleles dataframe) to predict whether someone had migraines or not. Removing the gene names themselves (no genes) actually improved accuracy. However, removing whether a gene is homozygous or heterozygous decreased accuracy, suggesting that zygosity has a stronger statistical relationship with the ability to predict migraines.
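The six near-identical functions above differ only in which feature columns they keep. As a hedged sketch, the same leave-one-feature-out comparison can be written with one helper and a loop. The helper name `knn_cv_accuracy` and the synthetic table are assumptions made for illustration; the pipeline steps mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def knn_cv_accuracy(df, features, target, k, cv=5):
    """Mean cross-validated accuracy of a kNN pipeline on the given columns."""
    pipeline = Pipeline([
        ("vec", DictVectorizer(sparse=False)),  # one-hot encodes string columns
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=k)),
    ])
    X = df[features].to_dict(orient="records")
    return cross_val_score(pipeline, X, df[target], cv=cv, scoring="accuracy").mean()


# Synthetic stand-in for the alleles table (all values are made up).
rng = np.random.default_rng(0)
n = 120
toy = pd.DataFrame({
    "Gene": rng.choice(["G1", "G2", "G3"], size=n),
    "Homo/heterozyg": rng.choice(["Heterozygous", "Homozygous"], size=n),
    "Mutation or variant?": rng.choice(["Mutation", "Variant"], size=n),
    "Has Migraine": rng.integers(0, 2, size=n),
})
features = ["Gene", "Homo/heterozyg", "Mutation or variant?"]

# Full model plus one run per left-out feature, for a single value of k.
results = {"full features": knn_cv_accuracy(toy, features, "Has Migraine", k=5)}
for left_out in features:
    kept = [f for f in features if f != left_out]
    results[f"no {left_out}"] = knn_cv_accuracy(toy, kept, "Has Migraine", k=5)
print(results)
```

Passing the target column as an argument also means the same helper could cover the "with aura" and "without aura" targets without redefining any functions.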

# testing which features are most relevant to accurately predicting whether someone has migraines with aura

def knn_full(n):
  model = KNeighborsClassifier(n_neighbors=n)
  scaler=StandardScaler()
  vec = DictVectorizer(sparse=False)
  pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
  X_train = alleles[feat]
  y_train = alleles['Has Migraine with aura']
  X_train=X_train.to_dict(orient="records")
  return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines with Aura Per Missing Feature')
plt.legend()
# Genes and recessive/dominant seem most important for predicting this type of migraine.

When specifically predicting whether someone has migraines with aura, accuracy appears to increase with the number of neighbors. In this case, removing genes led to the largest decrease in accuracy. This is consistent with the chart from the EDA section highlighting gene variants that appear only in participants with migraines; however, a number of other factors could affect this conclusion.
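One such factor is how the folds are split. The alleles table holds several rows per participant (one per gene), and every row for a participant carries the same migraine label, so plain 5-fold splitting can place rows from one participant in both the training and the test fold. A minimal sketch of grouped cross-validation with scikit-learn's `GroupKFold` is shown below. The data is synthetic and the column subset is chosen only for illustration; we did not run this on our dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 30 participants with 6 gene rows each (values are made up).
rng = np.random.default_rng(0)
ids = [f"p{i:02d}" for i in range(30)]
participants = np.repeat(ids, 6)
label_by_participant = {pid: i % 2 for i, pid in enumerate(ids)}

toy = pd.DataFrame({
    "Participant": participants,
    "Gene": rng.choice(["G1", "G2", "G3", "G4"], size=len(participants)),
    "Homo/heterozyg": rng.choice(["Heterozygous", "Homozygous"], size=len(participants)),
})
toy["Has Migraine"] = toy["Participant"].map(label_by_participant)

pipeline = Pipeline([
    ("vec", DictVectorizer(sparse=False)),
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=5)),
])
X = toy[["Gene", "Homo/heterozyg"]].to_dict(orient="records")

# GroupKFold keeps all rows of a participant inside a single fold.
scores = cross_val_score(
    pipeline, X, toy["Has Migraine"],
    groups=toy["Participant"],
    cv=GroupKFold(n_splits=5),
    scoring="accuracy",
)
print(scores.mean())
```

With grouped folds, accuracy reflects how well the model generalizes to unseen participants rather than to unseen rows of participants it has already seen.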

# testing which features are most relevant to accurately predicting whether someone has migraines without aura

def knn_full(n):
  model = KNeighborsClassifier(n_neighbors=n)
  scaler=StandardScaler()
  vec = DictVectorizer(sparse=False)
  pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
  X_train = alleles[feat]
  y_train = alleles['Has Migraine without aura']
  X_train=X_train.to_dict(orient="records")
  return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine without aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine without aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine without aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine without aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine without aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())

ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines without aura Per Missing Feature')
plt.legend()
# Harder to interpret; homo/hetero seems to have a much smaller effect on migraines without aura

For predicting migraines without aura, removing the mutation/variant feature appeared to decrease accuracy the most. Accuracy also tended to increase with the number of neighbors.

Overall, our models were not especially informative. However, they suggest there could be a statistical relationship between migraines and certain comorbidities, and between migraines and specific genes. Additional data sources and further research would be needed to validate any of these claims.

	Participant	Timestamp	Do not touch!	Have you ever been diagnosed with one of the following conditions?	Other condition not listed here?
0	hu3073E3	10/8/2012 21:22:10	4iq7dcisqa3zh75l1gmfxwvct1fs8n0k4g7gdzb2g559dt...	NaN	NaN
1	hu407142	10/9/2012 16:47:19	3dk0y4yds6u6pvp32azrysui4pbhhdn2y854l788d465w0...	NaN	NaN
2	huF974A8	10/9/2012 18:39:56	1pvw4ziy416x9ba0r31q6rhl917rle5g8bjgvzyfz678tr...	NaN	NaN
3	hu620F18	10/9/2012 19:18:30	2cmxvu2ozclqr135m573crz079idiw8m0boj3ie9fz257q...	Migraine without aura	NaN
4	hu3C0611	10/9/2012 19:29:14	43ogbna2kllvhfbjzljxzr6a0i6vf48mppgv0u8lys9lme...	Migraine without aura, Hereditary motor and se...	NaN

	Participant	Nervous System Conditions	Circulatory System Conditions	Endocrine System Conditions	Blood Conditions	Visual and Hearing Conditions	Respiratory System Conditions	Digestive System Conditions	Genitourinary System Conditions	Skin Conditions	Musculoskeletal System Conditions	Congenital Conditions
0	hu3073E3	NaN	NaN	NaN	NaN	Age-related cataract, Myopia (Nearsightedness)...	Deviated septum, Allergic rhinitis	Dental cavities, Canker sores (oral ulcers), I...	Urinary tract infection (UTI)	Eczema, Allergic contact dermatitis, Hair loss...	Chondromalacia patella (CMP)	NaN
1	hu407142	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Astigmatism, Dry eye...	Chronic sinusitis, Allergic rhinitis	Dental cavities	Urinary tract infection (UTI)	Acne	NaN	NaN
2	hu407142	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Astigmatism, Dry eye...	Chronic sinusitis, Allergic rhinitis	Dental cavities	Urinary tract infection (UTI)	Acne	NaN	NaN
3	huF974A8	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Dry eye syndrome, Fl...	NaN	Dental cavities, Canker sores (oral ulcers)	NaN	NaN	Osgood-Schlatter disease	NaN
4	hu620F18	Migraine without aura	NaN	High cholesterol (hypercholesterolemia)	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	NaN	Impacted tooth, Dental cavities, Gingivitis	NaN	Eczema, Allergic contact dermatitis, Skin tags	Sciatica	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
1151773	huF8913E	Recurrent sleep paralysis, Restless legs syndr...	Hypertension, Hemorrhoids	Thyroid nodule(s), High cholesterol (hyperchol...	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	Deviated septum, Chronic sinusitis, Allergic r...	Impacted tooth, Dental cavities, Canker sores ...	Urinary tract infection (UTI), Endometriosis	Dandruff, Acne	Sciatica, Tennis elbow, Bone spurs, Fibromyalg...	Spina bifida
1151774	hu794D40	Recurrent sleep paralysis	Hypertension	Thyroid nodule(s)	NaN	Age-related macular degeneration	Nasal polyps, Chronic sinusitis, Chronic tonsi...	Dental cavities, Temporomandibular joint (TMJ)...	Urinary tract infection (UTI)	Eczema, Allergic contact dermatitis, Rosacea, ...	Postural kyphosis	NaN
1151775	huD8AD3F	Restless legs syndrome, Migraine with aura, Mi...	Angina, Cardiac arrhythmia	NaN	Iron deficiency anemia	Myopia (Nearsightedness), Tinnitus	NaN	Gastroesophageal reflux disease (GERD), Irrita...	Urinary tract infection (UTI), Ovarian cysts	Eczema, Allergic contact dermatitis, Hyperhidr...	Tennis elbow, Fibromyalgia, Scoliosis	NaN
1151776	hu35E970	Essential tremor, Chronic tension headaches (1...	Raynaud's phenomenon	Thyroid nodule(s), Lactose intolerance	Iron deficiency anemia, Hereditary thrombophil...	NaN	Allergic rhinitis, Asthma	Dental cavities, Gingivitis, Canker sores (ora...	Urinary tract infection (UTI)	Dandruff, Allergic contact dermatitis, Rosacea	Frozen shoulder, Fibromyalgia	Developmental dysplasia of the hip
1151777	hu09787B	NaN	Hypertension, Hemorrhoids	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Von Willebrand disease	Myopia (Nearsightedness), Astigmatism, Age-rel...	Chronic sinusitis	Dental cavities, Gallstones	Kidney stones	Dandruff, Hair loss (includes female and male ...	Bone spurs, Osteoporosis, Scoliosis	Congenital clubfoot (equinovarus)

	Participant	Nervous System Conditions	Circulatory System Conditions	Endocrine System Conditions	Blood Conditions	Visual and Hearing Conditions	Respiratory System Conditions	Digestive System Conditions	Genitourinary System Conditions	Skin Conditions	Musculoskeletal System Conditions	Congenital Conditions
0	hu3073E3	NaN	NaN	NaN	NaN	Age-related cataract, Myopia (Nearsightedness)...	Deviated septum, Allergic rhinitis	Dental cavities, Canker sores (oral ulcers), I...	Urinary tract infection (UTI)	Eczema, Allergic contact dermatitis, Hair loss...	Chondromalacia patella (CMP)	NaN
1	hu407142	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Astigmatism, Dry eye...	Chronic sinusitis, Allergic rhinitis	Dental cavities	Urinary tract infection (UTI)	Acne	NaN	NaN
3	huF974A8	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Dry eye syndrome, Fl...	NaN	Dental cavities, Canker sores (oral ulcers)	NaN	NaN	Osgood-Schlatter disease	NaN
4	hu620F18	Migraine without aura	NaN	High cholesterol (hypercholesterolemia)	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	NaN	Impacted tooth, Dental cavities, Gingivitis	NaN	Eczema, Allergic contact dermatitis, Skin tags	Sciatica	NaN
5	hu3C0611	Migraine without aura, Hereditary motor and se...	NaN	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Iron deficiency anemia	Floaters	Chronic tonsillitis, Allergic rhinitis, Asthma	Dental cavities	Kidney stones, Urinary tract infection (UTI)	Eczema, Keloids	Bunions, Plantar fasciitis	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
1151773	huF8913E	Recurrent sleep paralysis, Restless legs syndr...	Hypertension, Hemorrhoids	Thyroid nodule(s), High cholesterol (hyperchol...	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	Deviated septum, Chronic sinusitis, Allergic r...	Impacted tooth, Dental cavities, Canker sores ...	Urinary tract infection (UTI), Endometriosis	Dandruff, Acne	Sciatica, Tennis elbow, Bone spurs, Fibromyalg...	Spina bifida
1151774	hu794D40	Recurrent sleep paralysis	Hypertension	Thyroid nodule(s)	NaN	Age-related macular degeneration	Nasal polyps, Chronic sinusitis, Chronic tonsi...	Dental cavities, Temporomandibular joint (TMJ)...	Urinary tract infection (UTI)	Eczema, Allergic contact dermatitis, Rosacea, ...	Postural kyphosis	NaN
1151775	huD8AD3F	Restless legs syndrome, Migraine with aura, Mi...	Angina, Cardiac arrhythmia	NaN	Iron deficiency anemia	Myopia (Nearsightedness), Tinnitus	NaN	Gastroesophageal reflux disease (GERD), Irrita...	Urinary tract infection (UTI), Ovarian cysts	Eczema, Allergic contact dermatitis, Hyperhidr...	Tennis elbow, Fibromyalgia, Scoliosis	NaN
1151776	hu35E970	Essential tremor, Chronic tension headaches (1...	Raynaud's phenomenon	Thyroid nodule(s), Lactose intolerance	Iron deficiency anemia, Hereditary thrombophil...	NaN	Allergic rhinitis, Asthma	Dental cavities, Gingivitis, Canker sores (ora...	Urinary tract infection (UTI)	Dandruff, Allergic contact dermatitis, Rosacea	Frozen shoulder, Fibromyalgia	Developmental dysplasia of the hip
1151777	hu09787B	NaN	Hypertension, Hemorrhoids	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Von Willebrand disease	Myopia (Nearsightedness), Astigmatism, Age-rel...	Chronic sinusitis	Dental cavities, Gallstones	Kidney stones	Dandruff, Hair loss (includes female and male ...	Bone spurs, Osteoporosis, Scoliosis	Congenital clubfoot (equinovarus)

	Participant	Nervous System Conditions	Circulatory System Conditions	Endocrine System Conditions	Blood Conditions	Visual and Hearing Conditions	Respiratory System Conditions	Digestive System Conditions	Genitourinary System Conditions	Skin Conditions	Musculoskeletal System Conditions	Congenital Conditions
0	hu3073E3	NaN	NaN	NaN	NaN	Age-related cataract, Myopia (Nearsightedness)...	Deviated septum, Allergic rhinitis	Dental cavities, Canker sores (oral ulcers), I...	Urinary tract infection (UTI)	Eczema, Allergic contact dermatitis, Hair loss...	Chondromalacia patella (CMP)	NaN
1	hu407142	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Astigmatism, Dry eye...	Chronic sinusitis, Allergic rhinitis	Dental cavities	Urinary tract infection (UTI)	Acne	NaN	NaN
2	huF974A8	NaN	NaN	NaN	NaN	Myopia (Nearsightedness), Dry eye syndrome, Fl...	NaN	Dental cavities, Canker sores (oral ulcers)	NaN	NaN	Osgood-Schlatter disease	NaN
3	hu620F18	Migraine without aura	NaN	High cholesterol (hypercholesterolemia)	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	NaN	Impacted tooth, Dental cavities, Gingivitis	NaN	Eczema, Allergic contact dermatitis, Skin tags	Sciatica	NaN
4	hu3C0611	Migraine without aura, Hereditary motor and se...	NaN	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Iron deficiency anemia	Floaters	Chronic tonsillitis, Allergic rhinitis, Asthma	Dental cavities	Kidney stones, Urinary tract infection (UTI)	Eczema, Keloids	Bunions, Plantar fasciitis	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
1762	huF8913E	Recurrent sleep paralysis, Restless legs syndr...	Hypertension, Hemorrhoids	Thyroid nodule(s), High cholesterol (hyperchol...	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	Deviated septum, Chronic sinusitis, Allergic r...	Impacted tooth, Dental cavities, Canker sores ...	Urinary tract infection (UTI), Endometriosis	Dandruff, Acne	Sciatica, Tennis elbow, Bone spurs, Fibromyalg...	Spina bifida
1763	hu794D40	Recurrent sleep paralysis	Hypertension	Thyroid nodule(s)	NaN	Age-related macular degeneration	Nasal polyps, Chronic sinusitis, Chronic tonsi...	Dental cavities, Temporomandibular joint (TMJ)...	Urinary tract infection (UTI)	Eczema, Allergic contact dermatitis, Rosacea, ...	Postural kyphosis	NaN
1764	huD8AD3F	Restless legs syndrome, Migraine with aura, Mi...	Angina, Cardiac arrhythmia	NaN	Iron deficiency anemia	Myopia (Nearsightedness), Tinnitus	NaN	Gastroesophageal reflux disease (GERD), Irrita...	Urinary tract infection (UTI), Ovarian cysts	Eczema, Allergic contact dermatitis, Hyperhidr...	Tennis elbow, Fibromyalgia, Scoliosis	NaN
1765	hu35E970	Essential tremor, Chronic tension headaches (1...	Raynaud's phenomenon	Thyroid nodule(s), Lactose intolerance	Iron deficiency anemia, Hereditary thrombophil...	NaN	Allergic rhinitis, Asthma	Dental cavities, Gingivitis, Canker sores (ora...	Urinary tract infection (UTI)	Dandruff, Allergic contact dermatitis, Rosacea	Frozen shoulder, Fibromyalgia	Developmental dysplasia of the hip
1766	hu09787B	NaN	Hypertension, Hemorrhoids	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Von Willebrand disease	Myopia (Nearsightedness), Astigmatism, Age-rel...	Chronic sinusitis	Dental cavities, Gallstones	Kidney stones	Dandruff, Hair loss (includes female and male ...	Bone spurs, Osteoporosis, Scoliosis	Congenital clubfoot (equinovarus)

	Participant	Nervous System Conditions	Circulatory System Conditions	Endocrine System Conditions	Blood Conditions	Visual and Hearing Conditions	Respiratory System Conditions	Digestive System Conditions	Genitourinary System Conditions	Skin Conditions	Musculoskeletal System Conditions	Congenital Conditions
3	hu620F18	Migraine without aura	NaN	High cholesterol (hypercholesterolemia)	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	NaN	Impacted tooth, Dental cavities, Gingivitis	NaN	Eczema, Allergic contact dermatitis, Skin tags	Sciatica	NaN
4	hu3C0611	Migraine without aura, Hereditary motor and se...	NaN	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Iron deficiency anemia	Floaters	Chronic tonsillitis, Allergic rhinitis, Asthma	Dental cavities	Kidney stones, Urinary tract infection (UTI)	Eczema, Keloids	Bunions, Plantar fasciitis	NaN
5	hu384E20	Migraine with aura	Hemorrhoids	NaN	NaN	Floaters	NaN	Dental cavities, Temporomandibular joint (TMJ)...	NaN	Dandruff, Eczema	Scoliosis	NaN
12	hu5FCE15	Migraine with aura	NaN	NaN	NaN	Myopia (Nearsightedness), Astigmatism	NaN	Dental cavities, Geographic tongue, Irritable ...	NaN	Acne	NaN	NaN
17	hu1EE386	Migraine without aura	Hypertension, Raynaud's phenomenon	NaN	NaN	Myopia (Nearsightedness), Astigmatism	NaN	Dental cavities	Urinary tract infection (UTI), Endometriosis, ...	NaN	Fibromyalgia	NaN
...	...	...	...	...	...	...	...	...	...	...	...	...
1752	huCD1A7A	Chronic tension headaches (15+ days per month,...	NaN	NaN	NaN	Myopia (Nearsightedness), Astigmatism, Presbyo...	NaN	Dental cavities, Temporomandibular joint (TMJ)...	Urinary tract infection (UTI), Ovarian cysts	NaN	Osteoarthritis, Frozen shoulder, Tennis elbow,...	NaN
1753	huA9AFFD	Chronic tension headaches (15+ days per month,...	NaN	Hypothyroidism, Lactose intolerance, High chol...	Iron deficiency anemia	Myopia (Nearsightedness), Astigmatism	Deviated septum, Chronic sinusitis, Chronic to...	Impacted tooth, Dental cavities, Gingivitis, T...	Kidney stones	Dandruff, Skin tags, Hair loss (includes femal...	Frozen shoulder, Tennis elbow, Plantar fasciit...	Ehlers-Danlos syndrome
1759	huC8E030	Chronic tension headaches (15+ days per month,...	Hypertension, Cardiac arrhythmia, Varicose veins	Thyroid nodule(s), Hypothyroidism, Hashimoto's...	Iron deficiency anemia	Hyperopia (Farsightedness), Presbyopia, Dry ey...	Deviated septum, Nasal polyps, Chronic sinusit...	Impacted tooth, Dental cavities, Gingivitis, T...	Urinary tract infection (UTI), Endometriosis, ...	Dandruff, Eczema, Allergic contact dermatitis,...	Osteoarthritis, Chondromalacia patella (CMP), ...	NaN
1762	huF8913E	Recurrent sleep paralysis, Restless legs syndr...	Hypertension, Hemorrhoids	Thyroid nodule(s), High cholesterol (hyperchol...	NaN	Myopia (Nearsightedness), Astigmatism, Floaters	Deviated septum, Chronic sinusitis, Allergic r...	Impacted tooth, Dental cavities, Canker sores ...	Urinary tract infection (UTI), Endometriosis	Dandruff, Acne	Sciatica, Tennis elbow, Bone spurs, Fibromyalg...	Spina bifida
1764	huD8AD3F	Restless legs syndrome, Migraine with aura, Mi...	Angina, Cardiac arrhythmia	NaN	Iron deficiency anemia	Myopia (Nearsightedness), Tinnitus	NaN	Gastroesophageal reflux disease (GERD), Irrita...	Urinary tract infection (UTI), Ovarian cysts	Eczema, Allergic contact dermatitis, Hyperhidr...	Tennis elbow, Fibromyalgia, Scoliosis	NaN

	Participant	Blood Type	Height (in)	Weight (lbs)
0	hu826751	AB +	6'2"	188.0
1	huDDCF88	O +	5'10"	159.0
2	hu3DC5EA	A +	5'5"	184.0
3	hu008567	O +	5'1"	138.0
4	hu98FFC6	A +	5'5"	230.0
...	...	...	...	...
1096	huF8913E	O +	5'6"	185.0
1097	hu794D40	A +	5'9"	170.0
1098	huD8AD3F	O +	5'5"	108.0
1099	hu09787B	O +	5'9"	233.0
1100	huF5CD05	A +	5'5"	215.0

	Participant	Type of Migraine	Gene	Recessive or dominant	Homo/heterozyg	Mutation or variant?	Disease capacity	Unnamed: 7
0	hu620F18	Mig no aura	CBS-I278T	Recessive	Carrier (Heterozygous)	Mutation	Likely pathogenic	NaN
1	hu620F18	Mig no aura	C3-R102G	Complex/Other	Heterozygous	Variant	Likely pathogenic	NaN
2	hu620F18	Mig no aura	COL4A1-Q1334H	Dominant	Heterozygous	Variant	Likely pathogenic	NaN
3	hu620F18	Mig no aura	MTRR-I49M	Recessive	Carrier (Heterozygous)	Variant	Likely pathogenic	NaN
4	hu620F18	Mig no aura	rs5186	Unknown	Heterozygous	Variant	Likely pathogenic	NaN
...	...	...	...	...	...	...	...	...
1256	hu05FD49	None	RPE65-N356	/	heterozygous	Mutation - Frameshift	/	NaN
1257	hu05FD49	None	PKD1-R2430	/	heterozygous	Mutation - nonsense	/	NaN
1258	hu05FD49	None	FLG-R3879	/	heterozygous	Mutation - nonsense	/	NaN
1259	hu05FD49	None	SBF2-H1549	/	heterozygous	Mutation - nonsense	/	NaN
1260	hu05FD49	None	NF2-K523	/	heterozygous	Mutation - Frameshift	/	NaN

	index	w mig	no mig	w_aura	no_aura
0	Has Nervous System Conditions	0.528646	0.218366	0.299479	0.356771
1	Has Blood Conditions	0.335938	0.157628	0.182292	0.205729
2	Has Circulatory Conditions	0.596354	0.453362	0.343750	0.335938
3	Has Endocrine Conditions	0.554688	0.417209	0.289062	0.338542
4	Has Vision and Hearing Conditions	0.854167	0.757773	0.497396	0.473958
5	Has Respiratory Conditions	0.635417	0.485900	0.348958	0.372396
6	Has Digestive Conditions	0.966146	0.913955	0.536458	0.552083
7	Has Genitourinary Conditions	0.682292	0.432393	0.382812	0.388021
8	Has Skin Conditions	0.880208	0.797542	0.492188	0.507812
9	Has Musculoskeletal Conditions	0.690104	0.506869	0.408854	0.375000
10	Has Congenital Conditions	0.182292	0.052784	0.111979	0.104167
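
The table above appears to report, for each condition category, the fraction of participants in a group who list at least one condition in that category. A minimal sketch of how such fractions could be computed with pandas is shown here. The toy data, the `has_migraine` column, and the helper name `condition_rates` are assumptions made for illustration and are not taken from the notebook.

```python
import pandas as pd

# Toy data (hypothetical): one row per participant, with 0/1 flags.
toy = pd.DataFrame({
    "has_migraine": [1, 1, 0, 0],
    "Has Blood Conditions": [1, 0, 0, 0],
    "Has Skin Conditions": [1, 1, 1, 0],
})

def condition_rates(df, group_col, flag_cols):
    """Return the fraction of flagged participants per group (rows = flags)."""
    # The mean of a 0/1 column is the proportion of rows equal to 1.
    rates = df.groupby(group_col)[flag_cols].mean().T
    # Relabel the group keys to match the column names used in the table.
    return rates.rename(columns={1: "w mig", 0: "no mig"})

rates = condition_rates(
    toy, "has_migraine", ["Has Blood Conditions", "Has Skin Conditions"]
)
print(rates)
```

Because each flag is 0 or 1, the group mean is the proportion directly, so no separate counting step is needed.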

	Participant	Height (in)	Weight (lbs)	Blood Type_A +	Blood Type_A -	Blood Type_AB +	Blood Type_AB -	Blood Type_B +	Blood Type_B -	Blood Type_Don't know	Blood Type_O +	Blood Type_O -
0	hu826751	74.0	188.0	0	0	1	0	0	0	0	0	0
1	huDDCF88	70.0	159.0	0	0	0	0	0	0	0	1	0
2	hu3DC5EA	65.0	184.0	1	0	0	0	0	0	0	0	0
3	hu008567	61.0	138.0	0	0	0	0	0	0	0	1	0
4	hu98FFC6	65.0	230.0	1	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...
1096	huF8913E	66.0	185.0	0	0	0	0	0	0	0	1	0
1097	hu794D40	69.0	170.0	1	0	0	0	0	0	0	0	0
1098	huD8AD3F	65.0	108.0	0	0	0	0	0	0	0	1	0
1099	hu09787B	69.0	233.0	0	0	0	0	0	0	0	1	0
1100	huF5CD05	65.0	215.0	1	0	0	0	0	0	0	0	0
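
Compared with the raw table, the output above stores height as a number of inches (for example, 6'2" becomes 74.0) and splits blood type into one 0/1 column per category. One way this cleaning step could be written is sketched below, using the first three rows of the raw table. The helper name `height_to_inches` and the regular expression are our own assumptions, not necessarily what the notebook used.

```python
import re
import pandas as pd

# Matches strings such as 5'10" (feet, apostrophe, inches, optional quote).
HEIGHT_RE = re.compile(r"""\s*(\d+)'\s*(\d+)"?\s*""")

def height_to_inches(text):
    """Convert a feet'inches" string to total inches; NaN if it does not parse."""
    match = HEIGHT_RE.fullmatch(str(text))
    if match is None:
        return float("nan")
    feet, inches = match.groups()
    return float(int(feet) * 12 + int(inches))

raw = pd.DataFrame({
    "Participant": ["hu826751", "huDDCF88", "hu3DC5EA"],
    "Blood Type": ["AB +", "O +", "A +"],
    "Height (in)": ["6'2\"", "5'10\"", "5'5\""],
})

clean = raw.copy()
clean["Height (in)"] = clean["Height (in)"].map(height_to_inches)
# dtype=int gives 0/1 indicator columns rather than booleans.
clean = pd.get_dummies(clean, columns=["Blood Type"], dtype=int)
print(clean)
```

`pd.get_dummies` names each new column `<original column>_<category>`, which matches labels such as `Blood Type_AB +` in the output above.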

	index	male	female
0	w mig	0.283505	0.706186
1	w_aura	0.227723	0.752475
2	no_aura	0.298246	0.684211

	index	w mig	no mig	w_aura	no_aura
0	Blood Type_A +	0.268041	0.256591	0.144330	0.154639
1	Blood Type_A -	0.041237	0.070299	0.041237	0.005155
2	Blood Type_AB +	0.061856	0.028120	0.036082	0.036082
3	Blood Type_AB -	0.015464	0.015817	0.005155	0.010309
4	Blood Type_B +	0.077320	0.086116	0.036082	0.051546
5	Blood Type_B -	0.015464	0.015817	0.010309	0.010309
6	Blood Type_Don't know	0.154639	0.184534	0.087629	0.077320
7	Blood Type_O +	0.278351	0.256591	0.170103	0.139175
8	Blood Type_O -	0.087629	0.086116	0.056701	0.036082

	index	w mig	no mig	w_aura	no_aura
0	Race/ethnicity_American Indian	0.419355	0.580645	0.258065	0.290323
1	Race/ethnicity_Asian	0.058824	0.941176	0.058824	0.000000
4	Race/ethnicity_Black or African	0.333333	0.666667	0.333333	0.000000
5	Race/ethnicity_Hispanic or Latino	0.269231	0.730769	0.230769	0.115385
6	Race/ethnicity_Native Hawaiian	0.000000	1.000000	0.000000	0.000000
7	Race/ethnicity_No response	0.500000	0.500000	0.000000	0.500000
8	Race/ethnicity_White	0.250368	0.749632	0.142857	0.129602

	Participant	Type of Migraine	Gene	Recessive or dominant	Homo/heterozyg	Mutation or variant?	Disease capacity	Has Migraine
78	hu4F8813	None	MTRR-I49M	Recessive	Carrier (Heterozygous)	Variant	Likely pathogenic	0
81	hu4F8813	None	PEX26-L153V	/	Carrier (Heterozygous)	/	Probably damaging	0
82	hu4386OC	None	SERPINA1-E366K	Recessive	Carrier (Heterozygous)	Variant	Well-established pathogenic	0
84	hu4386OC	None	C3-R102G	Complex/Other	Heterozygous	Variant	Likely pathogenic	0
88	hu4386OC	None	SERPINA1-E288V	Recessive	Carrier (Heterozygous)	Variant	Well-established pathogenic	0
...	...	...	...	...	...	...	...	...
1242	huA4E2CF	None	NEFL-S472	/	homozygous	Mutation - Frameshift	/	0
1245	huA4E2CF	None	RYR1-P2002	/	heterozygous	Mutation - Frameshift	/	0
1248	hu05FD49	None	COL4A1-Q1334H	Dominant	homozygous	variant	likely pathogenic	0
1249	hu05FD49	None	MTRR-I49M	recessive	homozygous	variant	likely pathogenic	0
1252	hu05FD49	None	NEFL-S472	/	homozygous	Mutation - Frameshift	/	0

	Gene	Frequency	Alleles
0	MTRR-I49M	27	[(carrier (heterozygous), 0.6296296296296297),...
1	NEFL-S472	27	[(homozygous, 1.0)]
2	COL4A1-Q1334H	17	[(heterozygous, 0.7058823529411765), (homozygo...
3	C3-R102G	13	[(heterozygous, 1.0)]
4	rs5186	13	[(heterozygous, 0.8461538461538461), (homozygo...
5	APOE-C130R	8	[(heterozygous, 1.0)]
6	CBS-I278T	7	[(carrier (heterozygous), 1.0)]
7	MBL2-G54D	6	[(carrier (heterozygous), 1.0)]
8	MBL2-R52C	6	[(carrier (heterozygous), 1.0)]
9	NPC1-W1122	4	[(heterozygous, 1.0)]
10	SERPINA1-E288V	4	[(carrier (heterozygous), 1.0)]
11	SYNE1-N1915	4	[(heterozygous, 1.0)]
12	AMPD1-Q12X	4	[(carrier (heterozygous), 1.0)]
13	APOA5-S19W	4	[(heterozygous, 1.0)]
14	HABP2-G534E	3	[(heterozygous, 1.0)]
15	PAX2-Y273	3	[(heterozygous, 1.0)]
16	HFE-C282Y	3	[(carrier (heterozygous), 1.0)]
17	TGM1-E520G	3	[(carrier (heterozygous), 1.0)]
18	TTN-E190	3	[(heterozygous, 1.0)]
19	KRT5-G138E	3	[(heterozygous, 1.0)]
20	NOD2-R702W	3	[(heterozygous, 1.0)]
21	ACAD8-S171C	3	[(carrier (heterozygous), 1.0)]
22	CREBBP-P1878	2	[(heterozygous, 1.0)]
23	PRPH-D141Y	2	[(carrier (heterozygous), 1.0)]
24	SERPINA1-E366K	2	[(carrier (heterozygous), 1.0)]
25	RYR1-P2002	2	[(heterozygous, 1.0)]
26	HPS6-A597	2	[(heterozygous, 1.0)]
27	SNCA-Y39	2	[(heterozygous, 1.0)]
28	PHKB-M185I	2	[(carrier (heterozygous), 1.0)]
29	LPL-N318S	2	[(heterozygous, 1.0)]
30	ALG3-F200	2	[(heterozygous, 1.0)]
31	PKP2-S140F	2	[(heterozygous, 1.0)]
32	MSR1-R293X	2	[(heterozygous, 1.0)]
33	SPG11-K1013E	2	[(carrier (heterozygous), 1.0)]
34	CETP-A390P	2	[(heterozygous, 1.0)]
35	THBD-A43T	2	[(heterozygous, 1.0)]
36	CD40LG-G219R	2	[(carrier (heterozygous), 1.0)]
37	PEX26-L153V	2	[(carrier (heterozygous), 1.0)]
38	WFS1-R456H	2	[(heterozygous, 1.0)]
39	SNCA-A69	2	[(heterozygous, 1.0)]
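
The summary above lists, for each gene variant, how many rows mention it and what share of those rows fall under each zygosity label. A sketch of one way to build this kind of summary is shown here. The helper name `gene_summary` and the toy rows are assumptions for illustration, and lowercasing the labels is our guess at how differently capitalized values such as "Heterozygous" and "heterozygous" were unified.

```python
import pandas as pd

# Toy rows (illustrative only), using column names from the variant table.
variants = pd.DataFrame({
    "Gene": ["MTRR-I49M", "MTRR-I49M", "MTRR-I49M", "C3-R102G"],
    "Homo/heterozyg": ["Carrier (Heterozygous)", "homozygous",
                       "carrier (heterozygous)", "Heterozygous"],
})

def gene_summary(df):
    """Count rows per gene and list (zygosity, proportion) pairs per gene."""
    # Normalize capitalization and whitespace before counting labels.
    zyg = df["Homo/heterozyg"].str.strip().str.lower()
    rows = []
    for gene, labels in zyg.groupby(df["Gene"]):
        shares = labels.value_counts(normalize=True)
        rows.append({
            "Gene": gene,
            "Frequency": len(labels),
            "Alleles": list(shares.items()),
        })
    out = pd.DataFrame(rows)
    return out.sort_values("Frequency", ascending=False, ignore_index=True)

summary = gene_summary(variants)
print(summary)
```

`value_counts(normalize=True)` returns proportions that sum to 1 within each gene, which is why a gene with a single zygosity label shows a proportion of 1.0.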

	Gene	Freq: None	Alleles: none	Freq: Aura	Alleles: Aura
0	MTRR-I49M	27.0	[(carrier (heterozygous), 0.6296296296296297),...	35.0	[(carrier (heterozygous), 0.7428571428571429),...
1	NEFL-S472	27.0	[(homozygous, 1.0)]	7.0	[(homozygous, 1.0)]
2	COL4A1-Q1334H	17.0	[(heterozygous, 0.7058823529411765), (homozygo...	30.0	[(heterozygous, 0.6333333333333333), (homozygo...
3	C3-R102G	13.0	[(heterozygous, 1.0)]	13.0	[(heterozygous, 0.8461538461538461), (homozygo...
4	rs5186	13.0	[(heterozygous, 0.8461538461538461), (homozygo...	21.0	[(heterozygous, 0.7142857142857143), (homozygo...
...	...	...	...	...	...
57	MFN2-Q276R	NaN	NaN	2.0	[(heterozygous, 1.0)]
58	COL9A3-R103W	NaN	NaN	2.0	[(heterozygous, 1.0)]
59	PRF1-A91V	NaN	NaN	2.0	[(heterozygous, 1.0)]
60	ABCA4-G863A	NaN	NaN	2.0	[(carrier (heterozygous), 1.0)]
61	ABCA4-A1038V	NaN	NaN	2.0	[(heterozygous, 1.0)]
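
The combined table above places the per-gene summaries for two groups side by side, with NaN where a gene appears in only one group. A minimal sketch of such a comparison using an outer merge is shown below. The frequencies are copied from three rows of the table, while the variable names are hypothetical.

```python
import pandas as pd

# Per-group gene counts (three genes taken from the table above).
none_tbl = pd.DataFrame({
    "Gene": ["MTRR-I49M", "NEFL-S472"],
    "Frequency": [27, 27],
})
aura_tbl = pd.DataFrame({
    "Gene": ["MTRR-I49M", "NEFL-S472", "MFN2-Q276R"],
    "Frequency": [35, 7, 2],
})

# how="outer" keeps genes found in either table; missing values become NaN.
merged = none_tbl.rename(columns={"Frequency": "Freq: None"}).merge(
    aura_tbl.rename(columns={"Frequency": "Freq: Aura"}),
    on="Gene",
    how="outer",
)
print(merged)
```

Introducing NaN turns an integer column into floats, which is consistent with the counts above displaying as 27.0 rather than 27. The row order of an outer merge can differ between pandas versions, so this sketch does not rely on it.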