In [ ]:
%%shell
jupyter nbconvert --to html /content/drive/"My Drive"/Colab_Notebooks/yayfinal.ipynb

Migraine Comorbidity and Genetic Analysis: Data Science Final Tutorial


Maddie Bonanno and Riley Martin

Our project website can be found at: https://mscb25.github.io/datasci-final-maddieriley/

Part 0: Initial Discussion

Purpose

Migraines are the 2nd most common cause of disability worldwide. They vary widely in triggers, symptoms, and severity. The purpose of this project is to explore the relationships between migraines and their comorbidities. Using medical history and genetic data, we aimed to explore how migraine occurrence could be linked to other factors.

We considered two common forms of migraine in this study - migraine with aura and migraine without aura:

  • Migraine without aura = a neurological disease that often presents as a headache accompanied by nausea and sensitivity to external stimuli
  • Migraine with aura = the same criteria as above, plus some form of visual, sensory, or motor disturbance that occurs before or during the attack

Goals and Plans:

Overall, through this assignment, we hoped to better understand the statistical association between migraines, common comorbidities, and genomics; this would allow for an improved understanding of the pathophysiology of migraines.

Some questions we wanted to explore are:

  1. Are people with migraines more likely to have comorbidities?
  2. If so, which types of comorbidities are most frequent in participants with migraines?
  3. Is it possible that genetics plays a role in the likelihood of developing migraines?
  4. Is there a particular subset of individuals migraines tend to impact?
  5. Is there any difference in comorbidities between participants that have migraine with aura vs migraine without aura?

Project Disclaimer:

Through this research experiment, we wanted to explore statistical associations between migraines and the presentation of comorbidities, as well as the variance of genes. We are NOT claiming that migraines cause comorbidities or vice versa. We are also NOT claiming that allele variants between populations are a cause of migraines. Any usage of "correlation" or "causation" in this assignment refers to the perceived relationship between variables; it is not indicative of an accurate scientific conclusion.

Collaboration Plan:

To ensure completion of this project, we met at least twice a week. For most of the semester, we found Tuesday and Thursday evenings around 8pm to be optimal. Since we also had classes together each day, we used time before or after lecture to update one another on nightly progress. We also texted one another and shared planning documents via Google Drive. In addition, we often split tasks to save time, with Maddie organizing a majority of the data and Riley creating visual representations.

Data Sources

For this project, we obtained data from the Personal Genome Project (PGP). This open data source provided the survey statistics and participant profiles used in this assignment.

1) PGP Google Surveys

Located at https://my.pgp-hms.org/google_surveys

Within the PGP data repositories, there are a number of 'participant surveys' filled out by the subject.

We utilized two of the general information surveys to gauge which populations were being represented.

This Google survey includes the 'Participant' (represented by an 8-character code) followed by personal information including: year of birth, sex/gender, and race/ethnicity

This survey also includes the 'Participant' followed by various phenotypical categories including: height, blood type, and eye color

We also collected data regarding the specific medical conditions participants had. Each file contained the 'Participant' and a column indicating whether they had a medical condition under that umbrella. The syndrome classes we looked at were:

The occurrence of migraines in participants was evaluated with this survey

Together, these sources provided data from 13 different Google surveys.

2) Get-Evidence Variant Reports - Genomic Data

Given the medical histories of the patients (collected by the methods above), we wanted to determine whether there was any genetic relation between migraines and/or comorbidities.

To achieve this goal, we scraped data from the PGP Whole genome datasets. Since only some participants had their genetic data uploaded, we filtered by 'Whole genome datasets' and accessed profiles with this component filled in.

Since there was no dataset that contained all the information we desired, we took information from participant profiles and created a data source, which can be accessed here.

Methodology:

  1. Filtered participants based on whether 'Whole genome datasets' had a value > 0
  2. Clicked on the first participant ID to access the "Participant Profile". Recorded the ID in the Excel file.
  3. Scrolled to the bottom of the page to find the links to this person's surveys. Clicked on the Nervous System conditions survey and checked if they had been diagnosed with any form of migraine; recorded 'mig with aura', 'mig no aura', 'both', or 'none' in the second column based on the survey response.
  4. Next, clicked on 'View Report', a hyperlink located in the chart, to see the genetic data available for this participant. If the link was corrupt, the participant was discarded from the Excel file.
  5. On the new page named "Variant report for (participant ID)", we recorded all the genes listed under "Show likely pathogenic and rare (<2.5%) pathogenic variants". The information in the "Impact" column was split into three Excel columns: 'recessive or dominant', 'Homo/heterozyg', and 'Disease capacity'. Furthermore, the genes were labeled as "variant" unless specifically denoted as a mutation.
  6. After this, genes under the "Insufficiently evaluated variants" tab were considered. Genes were recorded in the Excel file (following the previous criteria) if 1) the prioritization score was >5, or 2) the gene name includes "Shift" or "*" at the end; the latter denotes a mutation and was recorded as such.
  7. This process was repeated for each participant that had whole genome data publicly available

In summation, the rare variants and uncommon gene mutations were organized in a dataframe

  • genetic information was obtained from 137 unique participants
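
The recording rule from step 6 of the methodology was applied by hand, but it can be sketched in code to make the criteria explicit. This is only an illustration under an assumed reading of that step: names containing "Shift" or ending in "*" are treated as mutations, and anything else is kept as a variant only when its prioritization score is above 5. The function name `classify_variant` and the example entries are hypothetical and do not come from the PGP data.

```python
def classify_variant(name, score):
    """Label one 'insufficiently evaluated' entry as 'mutation', 'variant', or None (skip).

    Assumed reading of step 6: a name containing "Shift" or ending in "*"
    marks a mutation; otherwise the entry is kept only if score > 5.
    """
    if "Shift" in name or name.endswith("*"):
        return "mutation"
    if score > 5:
        return "variant"
    return None


# Hypothetical entries (made-up names and scores, for illustration only)
examples = [
    ("GENE1-A123Shift", 2),
    ("GENE2-Q45*", 1),
    ("GENE3-R67H", 6),
    ("GENE4-L89P", 3),
]
labels = [classify_variant(name, score) for name, score in examples]
print(labels)
```

Entries that come back as `None` would simply not be recorded in the spreadsheet.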

Part 1: Data Acquisition

In [291]:
import pandas as pd
import numpy as np
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/Colab_Notebooks
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
from sklearn.feature_extraction import DictVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
/content/drive/My Drive/Colab_Notebooks

Google Survey Collection

As highlighted in Part 0, there are 13 different google surveys containing vital information about the medical history of the participants. Most of the challenges in this section arose from trying to present all the information in a digestible manner.

In [292]:
# These are the 11 surveys containing medical conditions of each patient
# Each csv was read in

nerv_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_NervousSystem-20181010220056 (1).csv")
circ_data = pd.read_csv('PGPTrait&DiseaseSurvey2012_CirculatorySystem-20181010220109.csv')
endo_data = pd.read_csv('PGPTrait&DiseaseSurvey2012_Endocrine,Metabolic,Nutritional,AndImmunity-20181010220044 (1).csv')
blood_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_Blood-20181010220050.csv")
vis_hear_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_VisionAndHearing-20181010220103.csv")
resp_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_RespiratorySystem-20181010220114.csv")
digest_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_DigestiveSystem-20181010214607.csv")
genit_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_GenitourinarySystems-20181010214612.csv")
skin_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_SkinAndSubcutaneousTissue-20181010214618.csv")
musculo_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_MusculoskeletalSystemAndConnectiveTissue-20181010214624.csv")
congen_data = pd.read_csv("PGPTrait&DiseaseSurvey2012_CongenitalTraitsAndAnomalies-20181010214629.csv")

# Phenotypic and general physical traits (the other 2 surveys) were also read in 

phenotypes = pd.read_csv("PGPBasicPhenotypesSurvey2015-20181010214636.csv")
gen_survey = pd.read_csv("PGPParticipantSurvey-20181010220019.csv")
In [293]:
nerv_data.head() # All condition data takes on this format.
Out[293]:
Participant Timestamp Do not touch! Have you ever been diagnosed with one of the following conditions? Other condition not listed here?
0 hu3073E3 10/8/2012 21:22:10 4iq7dcisqa3zh75l1gmfxwvct1fs8n0k4g7gdzb2g559dt... NaN NaN
1 hu407142 10/9/2012 16:47:19 3dk0y4yds6u6pvp32azrysui4pbhhdn2y854l788d465w0... NaN NaN
2 huF974A8 10/9/2012 18:39:56 1pvw4ziy416x9ba0r31q6rhl917rle5g8bjgvzyfz678tr... NaN NaN
3 hu620F18 10/9/2012 19:18:30 2cmxvu2ozclqr135m573crz079idiw8m0boj3ie9fz257q... Migraine without aura NaN
4 hu3C0611 10/9/2012 19:29:14 43ogbna2kllvhfbjzljxzr6a0i6vf48mppgv0u8lys9lme... Migraine without aura, Hereditary motor and se... NaN
In [294]:
#putting all the csv's in a list format for easier manipulation
condition_data = [nerv_data,circ_data,endo_data,blood_data,vis_hear_data,resp_data,
                  digest_data,genit_data,skin_data,musculo_data,congen_data,]
In [295]:
def drop_col(): # dropping unwanted columns
  unwanted = ["Do not touch!", "Timestamp", "Other condition not listed here?"]
  for df in condition_data:
    # one drop call per frame removes all three columns in place
    df.drop(columns=unwanted, inplace=True)
drop_col()
In [296]:
nerv_data.head() #now, this is how each dataframe looks
Out[296]:
Participant Have you ever been diagnosed with one of the following conditions?
0 hu3073E3 NaN
1 hu407142 NaN
2 huF974A8 NaN
3 hu620F18 Migraine without aura
4 hu3C0611 Migraine without aura, Hereditary motor and se...
In [297]:
# renaming the columns for clarity
nerv_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Nervous System Conditions' },inplace=True)
circ_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Circulatory System Conditions'},inplace=True)
endo_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Endocrine System Conditions'},inplace=True)
blood_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Blood Conditions' },inplace=True)
vis_hear_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with one of the following conditions?": 'Visual and Hearing Conditions' },inplace=True)
resp_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Respiratory System Conditions'},inplace=True)
digest_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Digestive System Conditions'},inplace=True)
genit_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Genitourinary System Conditions'},inplace=True)
skin_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Skin Conditions'},inplace=True)
musculo_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Musculoskeletal System Conditions'},inplace=True)
congen_data.rename(columns={"Participant": "Participant","Have you ever been diagnosed with any of the following conditions?": 'Congenital Conditions'},inplace=True)
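
Note that the question header is worded "one of the following" in some surveys and "any of the following" in others, so each rename has to use the exact wording of its file. A possible way to avoid depending on that wording is to match the column by its common prefix. The sketch below uses a hypothetical helper `rename_question_column` and made-up toy frames; it is not part of the original notebook.

```python
import pandas as pd


def rename_question_column(df, new_name):
    """Rename the single 'Have you ever been diagnosed ...' column to new_name."""
    matches = [col for col in df.columns
               if col.startswith("Have you ever been diagnosed")]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one question column, found {len(matches)}")
    return df.rename(columns={matches[0]: new_name})


# Toy frames mimicking the two header wordings (values are made up)
one_of = pd.DataFrame({
    "Participant": ["huA"],
    "Have you ever been diagnosed with one of the following conditions?": ["Migraine with aura"],
})
any_of = pd.DataFrame({
    "Participant": ["huB"],
    "Have you ever been diagnosed with any of the following conditions?": ["Asthma"],
})

renamed_one = rename_question_column(one_of, "Nervous System Conditions")
renamed_any = rename_question_column(any_of, "Respiratory System Conditions")
print(list(renamed_one.columns), list(renamed_any.columns))
```

Unlike the in-place renames above, this helper returns a new frame, so the result has to be assigned back.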

After each set of data was downloaded and cleaned, we were left with 11 frames of medical information. Since 'Participant' was a shared column in each, we used inner join to merge all the conditions into a singular dataframe.

In [298]:
# merging into one df
conditions = condition_data[0]
for df in condition_data[1:]:
    # inner join keeps only participants present in every survey
    conditions = conditions.merge(df, on="Participant", how="inner")
In [299]:
conditions
Out[299]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions Musculoskeletal System Conditions Congenital Conditions
0 hu3073E3 NaN NaN NaN NaN Age-related cataract, Myopia (Nearsightedness)... Deviated septum, Allergic rhinitis Dental cavities, Canker sores (oral ulcers), I... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Hair loss... Chondromalacia patella (CMP) NaN
1 hu407142 NaN NaN NaN NaN Myopia (Nearsightedness), Astigmatism, Dry eye... Chronic sinusitis, Allergic rhinitis Dental cavities Urinary tract infection (UTI) Acne NaN NaN
2 hu407142 NaN NaN NaN NaN Myopia (Nearsightedness), Astigmatism, Dry eye... Chronic sinusitis, Allergic rhinitis Dental cavities Urinary tract infection (UTI) Acne NaN NaN
3 huF974A8 NaN NaN NaN NaN Myopia (Nearsightedness), Dry eye syndrome, Fl... NaN Dental cavities, Canker sores (oral ulcers) NaN NaN Osgood-Schlatter disease NaN
4 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
1151773 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne Sciatica, Tennis elbow, Bone spurs, Fibromyalg... Spina bifida
1151774 hu794D40 Recurrent sleep paralysis Hypertension Thyroid nodule(s) NaN Age-related macular degeneration Nasal polyps, Chronic sinusitis, Chronic tonsi... Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Rosacea, ... Postural kyphosis NaN
1151775 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... Tennis elbow, Fibromyalgia, Scoliosis NaN
1151776 hu35E970 Essential tremor, Chronic tension headaches (1... Raynaud's phenomenon Thyroid nodule(s), Lactose intolerance Iron deficiency anemia, Hereditary thrombophil... NaN Allergic rhinitis, Asthma Dental cavities, Gingivitis, Canker sores (ora... Urinary tract infection (UTI) Dandruff, Allergic contact dermatitis, Rosacea Frozen shoulder, Fibromyalgia Developmental dysplasia of the hip
1151777 hu09787B NaN Hypertension, Hemorrhoids Thyroid nodule(s), Hypothyroidism, Hashimoto's... Von Willebrand disease Myopia (Nearsightedness), Astigmatism, Age-rel... Chronic sinusitis Dental cavities, Gallstones Kidney stones Dandruff, Hair loss (includes female and male ... Bone spurs, Osteoporosis, Scoliosis Congenital clubfoot (equinovarus)

1151778 rows × 12 columns

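After merging, the resulting dataframe needed to be cleaned. It has 1,151,778 rows, far more than the number of participants, because participants who appear more than once in a survey are multiplied at every join.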
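
To see why the merged frame is so large, it helps to look at how an inner join treats repeated keys: every copy of a participant on the left is paired with every copy on the right. The toy frames below are made up for illustration. The sketch also shows one possible alternative, dropping duplicate participants before merging, which is an assumption about what could be done and not what the notebook does.

```python
import pandas as pd

# Toy frames in which participant hu1 submitted each survey twice (made-up data)
left = pd.DataFrame({"Participant": ["hu1", "hu1", "hu2"], "A": ["x", "x", "y"]})
right = pd.DataFrame({"Participant": ["hu1", "hu1", "hu2"], "B": ["p", "p", "q"]})

# Each hu1 row on the left pairs with each hu1 row on the right: 2 * 2 + 1 = 5 rows
exploded = left.merge(right, on="Participant", how="inner")

# De-duplicating first keeps one row per participant; validate raises if a key repeats
deduped = left.drop_duplicates(subset=["Participant"]).merge(
    right.drop_duplicates(subset=["Participant"]),
    on="Participant",
    how="inner",
    validate="one_to_one",
)
print(len(exploded), len(deduped))
```

With 11 surveys chained together, a participant with a few repeated submissions can account for a very large number of rows, which is consistent with the row count above.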

In [300]:
# Way too many rows
conditions['Participant'].value_counts()
# duplicates need to go
conditions = conditions.drop_duplicates(subset=['Participant'])
conditions
# duplicates gone but need to reset index
Out[300]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions Musculoskeletal System Conditions Congenital Conditions
0 hu3073E3 NaN NaN NaN NaN Age-related cataract, Myopia (Nearsightedness)... Deviated septum, Allergic rhinitis Dental cavities, Canker sores (oral ulcers), I... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Hair loss... Chondromalacia patella (CMP) NaN
1 hu407142 NaN NaN NaN NaN Myopia (Nearsightedness), Astigmatism, Dry eye... Chronic sinusitis, Allergic rhinitis Dental cavities Urinary tract infection (UTI) Acne NaN NaN
3 huF974A8 NaN NaN NaN NaN Myopia (Nearsightedness), Dry eye syndrome, Fl... NaN Dental cavities, Canker sores (oral ulcers) NaN NaN Osgood-Schlatter disease NaN
4 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN
5 hu3C0611 Migraine without aura, Hereditary motor and se... NaN Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Floaters Chronic tonsillitis, Allergic rhinitis, Asthma Dental cavities Kidney stones, Urinary tract infection (UTI) Eczema, Keloids Bunions, Plantar fasciitis NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
1151773 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne Sciatica, Tennis elbow, Bone spurs, Fibromyalg... Spina bifida
1151774 hu794D40 Recurrent sleep paralysis Hypertension Thyroid nodule(s) NaN Age-related macular degeneration Nasal polyps, Chronic sinusitis, Chronic tonsi... Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Rosacea, ... Postural kyphosis NaN
1151775 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... Tennis elbow, Fibromyalgia, Scoliosis NaN
1151776 hu35E970 Essential tremor, Chronic tension headaches (1... Raynaud's phenomenon Thyroid nodule(s), Lactose intolerance Iron deficiency anemia, Hereditary thrombophil... NaN Allergic rhinitis, Asthma Dental cavities, Gingivitis, Canker sores (ora... Urinary tract infection (UTI) Dandruff, Allergic contact dermatitis, Rosacea Frozen shoulder, Fibromyalgia Developmental dysplasia of the hip
1151777 hu09787B NaN Hypertension, Hemorrhoids Thyroid nodule(s), Hypothyroidism, Hashimoto's... Von Willebrand disease Myopia (Nearsightedness), Astigmatism, Age-rel... Chronic sinusitis Dental cavities, Gallstones Kidney stones Dandruff, Hair loss (includes female and male ... Bone spurs, Osteoporosis, Scoliosis Congenital clubfoot (equinovarus)

1767 rows × 12 columns

In [301]:
conditions = conditions.reset_index(drop=True)
conditions
Out[301]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions Musculoskeletal System Conditions Congenital Conditions
0 hu3073E3 NaN NaN NaN NaN Age-related cataract, Myopia (Nearsightedness)... Deviated septum, Allergic rhinitis Dental cavities, Canker sores (oral ulcers), I... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Hair loss... Chondromalacia patella (CMP) NaN
1 hu407142 NaN NaN NaN NaN Myopia (Nearsightedness), Astigmatism, Dry eye... Chronic sinusitis, Allergic rhinitis Dental cavities Urinary tract infection (UTI) Acne NaN NaN
2 huF974A8 NaN NaN NaN NaN Myopia (Nearsightedness), Dry eye syndrome, Fl... NaN Dental cavities, Canker sores (oral ulcers) NaN NaN Osgood-Schlatter disease NaN
3 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN
4 hu3C0611 Migraine without aura, Hereditary motor and se... NaN Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Floaters Chronic tonsillitis, Allergic rhinitis, Asthma Dental cavities Kidney stones, Urinary tract infection (UTI) Eczema, Keloids Bunions, Plantar fasciitis NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
1762 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne Sciatica, Tennis elbow, Bone spurs, Fibromyalg... Spina bifida
1763 hu794D40 Recurrent sleep paralysis Hypertension Thyroid nodule(s) NaN Age-related macular degeneration Nasal polyps, Chronic sinusitis, Chronic tonsi... Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Rosacea, ... Postural kyphosis NaN
1764 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... Tennis elbow, Fibromyalgia, Scoliosis NaN
1765 hu35E970 Essential tremor, Chronic tension headaches (1... Raynaud's phenomenon Thyroid nodule(s), Lactose intolerance Iron deficiency anemia, Hereditary thrombophil... NaN Allergic rhinitis, Asthma Dental cavities, Gingivitis, Canker sores (ora... Urinary tract infection (UTI) Dandruff, Allergic contact dermatitis, Rosacea Frozen shoulder, Fibromyalgia Developmental dysplasia of the hip
1766 hu09787B NaN Hypertension, Hemorrhoids Thyroid nodule(s), Hypothyroidism, Hashimoto's... Von Willebrand disease Myopia (Nearsightedness), Astigmatism, Age-rel... Chronic sinusitis Dental cavities, Gallstones Kidney stones Dandruff, Hair loss (includes female and male ... Bone spurs, Osteoporosis, Scoliosis Congenital clubfoot (equinovarus)

1767 rows × 12 columns

Now, it is evident that there are 1767 unique participants that shared their full medical history.

The next step was to determine how many of these individuals had been diagnosed with migraines.

In [302]:
# number of people with migraines
mig_w_aura = conditions['Nervous System Conditions'].str.contains('Migraine with aura', case=False, na=False)
w_sum = mig_w_aura.sum()
mig_no_aura = conditions['Nervous System Conditions'].str.contains('Migraine without aura', case=False, na=False)
no_sum = mig_no_aura.sum()
w_sum, no_sum
Out[302]:
(215, 218)
In [312]:
# just Participants with migraines
only_mig_haver = conditions[mig_w_aura | mig_no_aura]
only_mig_haver.Participant.count()
Out[312]:
384
In [313]:
only_mig_haver
Out[313]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions Musculoskeletal System Conditions Congenital Conditions
3 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN
4 hu3C0611 Migraine without aura, Hereditary motor and se... NaN Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Floaters Chronic tonsillitis, Allergic rhinitis, Asthma Dental cavities Kidney stones, Urinary tract infection (UTI) Eczema, Keloids Bunions, Plantar fasciitis NaN
5 hu384E20 Migraine with aura Hemorrhoids NaN NaN Floaters NaN Dental cavities, Temporomandibular joint (TMJ)... NaN Dandruff, Eczema Scoliosis NaN
12 hu5FCE15 Migraine with aura NaN NaN NaN Myopia (Nearsightedness), Astigmatism NaN Dental cavities, Geographic tongue, Irritable ... NaN Acne NaN NaN
17 hu1EE386 Migraine without aura Hypertension, Raynaud's phenomenon NaN NaN Myopia (Nearsightedness), Astigmatism NaN Dental cavities Urinary tract infection (UTI), Endometriosis, ... NaN Fibromyalgia NaN
... ... ... ... ... ... ... ... ... ... ... ... ...
1752 huCD1A7A Chronic tension headaches (15+ days per month,... NaN NaN NaN Myopia (Nearsightedness), Astigmatism, Presbyo... NaN Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI), Ovarian cysts NaN Osteoarthritis, Frozen shoulder, Tennis elbow,... NaN
1753 huA9AFFD Chronic tension headaches (15+ days per month,... NaN Hypothyroidism, Lactose intolerance, High chol... Iron deficiency anemia Myopia (Nearsightedness), Astigmatism Deviated septum, Chronic sinusitis, Chronic to... Impacted tooth, Dental cavities, Gingivitis, T... Kidney stones Dandruff, Skin tags, Hair loss (includes femal... Frozen shoulder, Tennis elbow, Plantar fasciit... Ehlers-Danlos syndrome
1759 huC8E030 Chronic tension headaches (15+ days per month,... Hypertension, Cardiac arrhythmia, Varicose veins Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Hyperopia (Farsightedness), Presbyopia, Dry ey... Deviated septum, Nasal polyps, Chronic sinusit... Impacted tooth, Dental cavities, Gingivitis, T... Urinary tract infection (UTI), Endometriosis, ... Dandruff, Eczema, Allergic contact dermatitis,... Osteoarthritis, Chondromalacia patella (CMP), ... NaN
1762 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne Sciatica, Tennis elbow, Bone spurs, Fibromyalg... Spina bifida
1764 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... Tennis elbow, Fibromyalgia, Scoliosis NaN

384 rows × 12 columns

Out of the dataset, there are 384 individuals with migraines. 215 participants have migraine with aura, while 218 have migraine without aura. Since 215 + 218 = 433 exceeds 384, 49 participants reported both forms.
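
The two migraine groups are not mutually exclusive, so their counts do not add up to the size of the combined group. The overlap can be counted directly by combining the two boolean masks. The sketch below uses a small made-up Series in place of the 'Nervous System Conditions' column, and the names `n_both` and `n_either` are introduced only for this illustration.

```python
import pandas as pd

# Made-up stand-in for conditions['Nervous System Conditions']
nervous = pd.Series([
    "Migraine with aura",
    "Migraine without aura",
    "Restless legs syndrome, Migraine with aura, Migraine without aura",
    None,
])

with_aura = nervous.str.contains("Migraine with aura", case=False, na=False)
without_aura = nervous.str.contains("Migraine without aura", case=False, na=False)

n_both = int((with_aura & without_aura).sum())    # reported both forms
n_either = int((with_aura | without_aura).sum())  # reported at least one form
print(n_both, n_either)
```

The counts satisfy `with_aura.sum() + without_aura.sum() - n_both == n_either`, the same relationship that links 215, 218, and 384 in the real data.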

Then, we needed to process and clean the phenotypic data for the participants

In [314]:
# Now need to clean up phenotype and general data
phenotypes.columns
Out[314]:
Index(['Participant', 'Timestamp', 'Do not touch!', '1.1 — Blood Type',
       '1.2 — Height', '1.3 — Weight', '1.4 — Comments',
       '2.1 — Left Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.2 — Right Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.3 — Left Eye Color - Text Description',
       '2.4 — Right Eye Color - Text Description', '2.5 —Comments',
       '3.1 — What is your natural hair color currently, when without artificial color or dye?',
       '3.2 — Hair Color - Text Description', '3.3 — Comments',
       '4.1 — Any final thoughts?', '1.4 — Handedness'],
      dtype='object')
In [315]:
# columns to drop 
unwanted_traits =['1.4 — Comments',
       '2.1 — Left Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.2 — Right Eye (Photograph Number)  (full-size image: https://goo.gl/XQ2Voh)',
       '2.3 — Left Eye Color - Text Description',
       '2.4 — Right Eye Color - Text Description', '2.5 —Comments',
       '3.1 — What is your natural hair color currently, when without artificial color or dye?',
       '3.2 — Hair Color - Text Description', '3.3 — Comments',
       '4.1 — Any final thoughts?', '1.4 — Handedness','Timestamp', 'Do not touch!']

phenotypes = phenotypes.drop(unwanted_traits,axis=1)
In [316]:
# desired columns
phenotypes.rename(columns={'1.1 — Blood Type': 'Blood Type',	'1.2 — Height': 'Height (in)',	'1.3 — Weight': 'Weight (lbs)'},inplace=True)
In [317]:
# lots of null values present
phenotypes.isnull().sum()
Out[317]:
Participant      0
Blood Type      42
Height (in)     23
Weight (lbs)    26
dtype: int64
In [318]:
# ensuring all the NaN values are dropped
phenotypes = phenotypes.dropna()
phenotypes = phenotypes.reset_index(drop=True)
phenotypes
Out[318]:
Participant Blood Type Height (in) Weight (lbs)
0 hu826751 AB + 6'2" 188.0
1 huDDCF88 O + 5'10" 159.0
2 hu3DC5EA A + 5'5" 184.0
3 hu008567 O + 5'1" 138.0
4 hu98FFC6 A + 5'5" 230.0
... ... ... ... ...
1096 huF8913E O + 5'6" 185.0
1097 hu794D40 A + 5'9" 170.0
1098 huD8AD3F O + 5'5" 108.0
1099 hu09787B O + 5'9" 233.0
1100 huF5CD05 A + 5'5" 215.0

1101 rows × 4 columns
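
At this point the 'Height (in)' column still holds text such as 5'10" and is not yet a number of inches. If a numeric height is needed later, one possible conversion is sketched below. The helper `height_to_inches` is hypothetical and assumes every entry follows the feet'inches" pattern shown in the table; anything else is returned as `None` instead of being guessed.

```python
import re

# feet, an apostrophe, inches, a double quote (e.g. 5'10")
HEIGHT_PATTERN = re.compile(r"\s*(\d+)'\s*(\d+)\"\s*")


def height_to_inches(text):
    """Convert a string like 5'10" to total inches, or None if it does not match."""
    match = HEIGHT_PATTERN.fullmatch(text)
    if match is None:
        return None
    feet, inches = int(match.group(1)), int(match.group(2))
    return feet * 12 + inches


print(height_to_inches("5'10\""), height_to_inches("6'2\""), height_to_inches("170 cm"))
```

It could be applied with `phenotypes['Height (in)'].map(height_to_inches)`, after which unparsed entries would show up as missing values to review by hand.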

When that was completed, we moved on to cleaning the general traits frame

In [319]:
# cleaning up the traits data
gen_survey.columns
Out[319]:
Index(['Participant', 'Timestamp', 'Do not touch!', 'Year of birth',
       'Which statement best describes you?',
       'Severe disease or rare genetic trait',
       'Do you have a severe genetic disease or rare genetic trait? If so, you can add a description for your public profile.',
       'Disease/trait: Onset', 'Disease/trait: Rarity',
       'Disease/trait: Severity', 'Disease/trait: Relative enrollment',
       'Disease/trait: Diagnosis', 'Disease/trait: Genetic confirmation',
       'Disease/trait: Documentation',
       'Disease/trait: Documentation description', 'Sex/Gender',
       'Race/ethnicity', 'Maternal grandmother: Country of origin',
       'Paternal grandmother: Country of origin',
       'Paternal grandfather: Country of origin',
       'Maternal grandfather: Country of origin', 'Enrollment of relatives',
       'Enrollment of older individuals', 'Enrollment of parents',
       'Enrolled relatives [Monozygotic / Identical twins]',
       'Enrolled relatives [Parents]',
       'Enrolled relatives [Siblings / Fraternal twins]',
       'Enrolled relatives [Children]', 'Enrolled relatives [Grandparents]',
       'Enrolled relatives [Grandchildren]',
       'Enrolled relatives [Aunts/Uncles]',
       'Enrolled relatives [Nephews/Nieces]',
       'Enrolled relatives [Half-siblings]',
       'Enrolled relatives [Cousins or more distant]',
       'Enrolled relatives [Not genetically related (e.g. husband/wife)]',
       'Are all your enrolled relatives linked to your PGP profile?',
       'Have you uploaded genetic data to your PGP participant profile?',
       'Have you used the PGP web interface to record a designated proxy?',
       'Have you uploaded health record data using our Google Health or Microsoft Healthvault interfaces?',
       'Uploaded health records: Update status',
       'Uploaded health records: Extensiveness', 'Blood sample',
       'Saliva sample', 'Microbiome samples', 'Tissue samples from surgery',
       'Tissue samples from autopsy', 'Month of birth',
       'Anatomical sex at birth', 'Maternal grandmother: Race/ethnicity',
       'Maternal grandfather: Race/ethnicity',
       'Paternal grandmother: Race/ethnicity',
       'Paternal grandfather: Race/ethnicity'],
      dtype='object')
In [320]:
# grabbing the traits we're interested in
# .copy() gives an independent frame, so the assignments below do not trigger SettingWithCopyWarning
traits = gen_survey[['Participant','Sex/Gender','Race/ethnicity']].copy()
In [321]:
# shortening each race/ethnicity response to its first three words
traits['Race/ethnicity'] = traits['Race/ethnicity'].str.split(n=3).str[:3].str.join(' ')
In [322]:
traits['Race/ethnicity'].value_counts()
Out[322]:
White                     3532
American Indian /          155
Asian                      126
Hispanic or Latino,         82
Black or African            64
Hispanic or Latino          61
Asian, White                40
No response                 30
Asian, Native Hawaiian       7
White, No response           5
Native Hawaiian or           3
Asian, Hispanic or           2
Asian, Black or              1
Name: Race/ethnicity, dtype: int64
In [323]:
# tidying the truncated category labels and folding a few multi-response labels into a single category
# (replace() with a dict matches whole values only, so a bare 'or' key would never match and is omitted)
traits['Race/ethnicity'] = traits['Race/ethnicity'].replace({'American Indian /':'American Indian','Hispanic or Latino,': 'Hispanic or Latino',
                                                             'Asian, White': 'Asian', 'Asian, Black or': 'Asian', 'White, No response': 'White',
                                                             'Native Hawaiian or': 'Native Hawaiian'})

After that, we merged the conditions, phenotypes, and general traits into a single DataFrame.

In [324]:
# merging all (note: the variable name `all` shadows Python's built-in all() from here on)
all = conditions.merge(phenotypes,how='inner',on=['Participant'])
all = all.merge(traits,how='inner',on=['Participant'])
In [325]:
all
Out[325]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions Musculoskeletal System Conditions Congenital Conditions Blood Type Height (in) Weight (lbs) Sex/Gender Race/ethnicity
0 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN O - 5'4" 150.0 Female American Indian
1 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN O - 5'4" 150.0 Female American Indian
2 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN O - 5'4" 150.0 Female American Indian
3 hu3C0611 Migraine without aura, Hereditary motor and se... NaN Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Floaters Chronic tonsillitis, Allergic rhinitis, Asthma Dental cavities Kidney stones, Urinary tract infection (UTI) Eczema, Keloids Bunions, Plantar fasciitis NaN B + 5'9" 164.0 Female White
4 huE9B698 NaN NaN Hypothyroidism NaN Myopia (Nearsightedness), Floaters NaN Impacted tooth, Dental cavities, Canker sores ... NaN Allergic contact dermatitis, Keloids, Skin tags NaN NaN Don't know 5'6" 150.0 Male White
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1127 huD2D4B8 NaN Varicose veins NaN NaN NaN Asthma NaN NaN Acne NaN NaN O + 5'1" 135.0 Female White
1128 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne Sciatica, Tennis elbow, Bone spurs, Fibromyalg... Spina bifida O + 5'6" 185.0 Female Hispanic or Latino
1129 hu794D40 Recurrent sleep paralysis Hypertension Thyroid nodule(s) NaN Age-related macular degeneration Nasal polyps, Chronic sinusitis, Chronic tonsi... Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Rosacea, ... Postural kyphosis NaN A + 5'9" 170.0 Female White
1130 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... Tennis elbow, Fibromyalgia, Scoliosis NaN O + 5'5" 108.0 Male White
1131 hu09787B NaN Hypertension, Hemorrhoids Thyroid nodule(s), Hypothyroidism, Hashimoto's... Von Willebrand disease Myopia (Nearsightedness), Astigmatism, Age-rel... Chronic sinusitis Dental cavities, Gallstones Kidney stones Dandruff, Hair loss (includes female and male ... Bone spurs, Osteoporosis, Scoliosis Congenital clubfoot (equinovarus) O + 5'9" 233.0 Male White

1132 rows × 17 columns

In [326]:
all['Participant'].value_counts()
# need to drop dupes again
Out[326]:
hu5880D9    10
hu6D1115    10
huD554DB     8
hu2E4B9F     7
hu8B35DE     7
            ..
huFF2969     1
hu048C92     1
hu6642CE     1
hu00698E     1
hu09787B     1
Name: Participant, Length: 763, dtype: int64
In [327]:
all = all.drop_duplicates(subset=['Participant'])
In [328]:
all = all.reset_index(drop=True)
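
The repeated participants come from the merge itself: an inner merge emits one row for every matching pair of rows, so a participant who appears twice in each input appears four times in the result. The sketch below is a minimal illustration on made-up frames (the IDs, values, and the `_demo` names are hypothetical, not PGP data). It shows the effect and one possible alternative, de-duplicating each input before merging.

```python
import pandas as pd

# Hypothetical toy frames; IDs and values are made up for illustration.
conditions_demo = pd.DataFrame({
    "Participant": ["huA", "huA", "huB"],
    "Nervous System Conditions": ["Migraine with aura", "Migraine with aura", None],
})
phenotypes_demo = pd.DataFrame({
    "Participant": ["huA", "huA", "huB"],
    "Blood Type": ["O +", "O +", "A -"],
})

# Merging as-is: huA contributes 2 x 2 = 4 rows, huB contributes 1.
merged_raw = conditions_demo.merge(phenotypes_demo, how="inner", on="Participant")

# De-duplicating each input first keeps one row per participant.
# validate="one_to_one" makes pandas raise if duplicate keys slip through.
merged_clean = (
    conditions_demo.drop_duplicates(subset="Participant")
    .merge(
        phenotypes_demo.drop_duplicates(subset="Participant"),
        how="inner",
        on="Participant",
        validate="one_to_one",
    )
)

print(len(merged_raw), len(merged_clean))
```

Dropping duplicates after the merge, as done above, gives the same set of participants; de-duplicating first mainly keeps the intermediate frame small and makes the one-row-per-participant assumption explicit.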

Gene Data Acquisition

The second major source of information we needed to process was the genetic profiles of the participants.

The file being read in is "vari.xlsx", a spreadsheet we created of the important rare, pathogenic gene variants. The methodology we used to generate this source is detailed in Part 0. To quickly summarize, the resulting df holds the following information:

  • Participant ID
  • Type of migraine
  • Gene
  • Recessive vs dominant allele
  • Homozygous vs heterozygous
  • Whether the allele is a mutation or a variant
  • The disease capacity of the allele
In [329]:
# reading in the dataframe 
alleles = pd.read_excel("vari.xlsx")
alleles
Out[329]:
Participant Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Unnamed: 7
0 hu620F18 Mig no aura CBS-I278T Recessive Carrier (Heterozygous) Mutation Likely pathogenic NaN
1 hu620F18 Mig no aura C3-R102G Complex/Other Heterozygous Variant Likely pathogenic NaN
2 hu620F18 Mig no aura COL4A1-Q1334H Dominant Heterozygous Variant Likely pathogenic NaN
3 hu620F18 Mig no aura MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic NaN
4 hu620F18 Mig no aura rs5186 Unknown Heterozygous Variant Likely pathogenic NaN
... ... ... ... ... ... ... ... ...
1256 hu05FD49 None RPE65-N356 / heterozygous Mutation - Frameshift / NaN
1257 hu05FD49 None PKD1-R2430 / heterozygous Mutation - nonsense / NaN
1258 hu05FD49 None FLG-R3879 / heterozygous Mutation - nonsense / NaN
1259 hu05FD49 None SBF2-H1549 / heterozygous Mutation - nonsense / NaN
1260 hu05FD49 None NF2-K523 / heterozygous Mutation - Frameshift / NaN

1261 rows × 8 columns

In [330]:
# clean up and make binary variables for whether the participant has migraines
alleles = alleles.drop('Unnamed: 7',axis=1)
In [331]:
alleles['Has Migraine'] = alleles['Type of Migraine'] != 'None'
alleles['Has Migraine'] = alleles['Has Migraine'].map({True: 1, False: 0})
In [332]:
alleles
Out[332]:
Participant Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Has Migraine
0 hu620F18 Mig no aura CBS-I278T Recessive Carrier (Heterozygous) Mutation Likely pathogenic 1
1 hu620F18 Mig no aura C3-R102G Complex/Other Heterozygous Variant Likely pathogenic 1
2 hu620F18 Mig no aura COL4A1-Q1334H Dominant Heterozygous Variant Likely pathogenic 1
3 hu620F18 Mig no aura MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic 1
4 hu620F18 Mig no aura rs5186 Unknown Heterozygous Variant Likely pathogenic 1
... ... ... ... ... ... ... ... ...
1256 hu05FD49 None RPE65-N356 / heterozygous Mutation - Frameshift / 0
1257 hu05FD49 None PKD1-R2430 / heterozygous Mutation - nonsense / 0
1258 hu05FD49 None FLG-R3879 / heterozygous Mutation - nonsense / 0
1259 hu05FD49 None SBF2-H1549 / heterozygous Mutation - nonsense / 0
1260 hu05FD49 None NF2-K523 / heterozygous Mutation - Frameshift / 0

1261 rows × 8 columns

In [333]:
# drop nulls
alleles.isnull().sum()
alleles = alleles.dropna()
alleles = alleles.reset_index(drop=True)
In [334]:
alleles
Out[334]:
Participant Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Has Migraine
0 hu620F18 Mig no aura CBS-I278T Recessive Carrier (Heterozygous) Mutation Likely pathogenic 1
1 hu620F18 Mig no aura C3-R102G Complex/Other Heterozygous Variant Likely pathogenic 1
2 hu620F18 Mig no aura COL4A1-Q1334H Dominant Heterozygous Variant Likely pathogenic 1
3 hu620F18 Mig no aura MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic 1
4 hu620F18 Mig no aura rs5186 Unknown Heterozygous Variant Likely pathogenic 1
... ... ... ... ... ... ... ... ...
1253 hu05FD49 None RPE65-N356 / heterozygous Mutation - Frameshift / 0
1254 hu05FD49 None PKD1-R2430 / heterozygous Mutation - nonsense / 0
1255 hu05FD49 None FLG-R3879 / heterozygous Mutation - nonsense / 0
1256 hu05FD49 None SBF2-H1549 / heterozygous Mutation - nonsense / 0
1257 hu05FD49 None NF2-K523 / heterozygous Mutation - Frameshift / 0

1258 rows × 8 columns

Finally, we combined the conditions, phenotypes, general traits, and gene data to create a comprehensive source of information about the participants.

In [335]:
# dfs with genes
w_genes = all.merge(alleles,how='inner',on=['Participant'])
In [336]:
w_genes
Out[336]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions ... Weight (lbs) Sex/Gender Race/ethnicity Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Has Migraine
0 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 150.0 Female American Indian Mig no aura CBS-I278T Recessive Carrier (Heterozygous) Mutation Likely pathogenic 1
1 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 150.0 Female American Indian Mig no aura C3-R102G Complex/Other Heterozygous Variant Likely pathogenic 1
2 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 150.0 Female American Indian Mig no aura COL4A1-Q1334H Dominant Heterozygous Variant Likely pathogenic 1
3 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 150.0 Female American Indian Mig no aura MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic 1
4 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 150.0 Female American Indian Mig no aura rs5186 Unknown Heterozygous Variant Likely pathogenic 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
623 huA9AFFD Chronic tension headaches (15+ days per month,... NaN Hypothyroidism, Lactose intolerance, High chol... Iron deficiency anemia Myopia (Nearsightedness), Astigmatism Deviated septum, Chronic sinusitis, Chronic to... Impacted tooth, Dental cavities, Gingivitis, T... Kidney stones Dandruff, Skin tags, Hair loss (includes femal... ... 240.0 Male White Mig no aura rs5186 Unknown Homozygous Variant Likely pathogenic 1
624 huA9AFFD Chronic tension headaches (15+ days per month,... NaN Hypothyroidism, Lactose intolerance, High chol... Iron deficiency anemia Myopia (Nearsightedness), Astigmatism Deviated septum, Chronic sinusitis, Chronic to... Impacted tooth, Dental cavities, Gingivitis, T... Kidney stones Dandruff, Skin tags, Hair loss (includes femal... ... 240.0 Male White Mig no aura CTFR-W1204X / Homozygous Mutation - nonsense / 1
625 huA9AFFD Chronic tension headaches (15+ days per month,... NaN Hypothyroidism, Lactose intolerance, High chol... Iron deficiency anemia Myopia (Nearsightedness), Astigmatism Deviated septum, Chronic sinusitis, Chronic to... Impacted tooth, Dental cavities, Gingivitis, T... Kidney stones Dandruff, Skin tags, Hair loss (includes femal... ... 240.0 Male White Mig no aura ARSA-T274M / Carrier (Heterozygous) / Probably damaging 1
626 huA9AFFD Chronic tension headaches (15+ days per month,... NaN Hypothyroidism, Lactose intolerance, High chol... Iron deficiency anemia Myopia (Nearsightedness), Astigmatism Deviated septum, Chronic sinusitis, Chronic to... Impacted tooth, Dental cavities, Gingivitis, T... Kidney stones Dandruff, Skin tags, Hair loss (includes femal... ... 240.0 Male White Mig no aura KRT86-E402Q / Homozygous / Probably damaging 1
627 huA9AFFD Chronic tension headaches (15+ days per month,... NaN Hypothyroidism, Lactose intolerance, High chol... Iron deficiency anemia Myopia (Nearsightedness), Astigmatism Deviated septum, Chronic sinusitis, Chronic to... Impacted tooth, Dental cavities, Gingivitis, T... Kidney stones Dandruff, Skin tags, Hair loss (includes femal... ... 240.0 Male White Mig no aura VWF-S1506L / Carrier (Heterozygous) / Probably damaging 1

628 rows × 24 columns

Part 2: EDA

After all the data was cleaned and processed, we took a look at the relationships between a variety of factors.

Exploring Comorbidities

To examine the statistical relationship between migraines and comorbidities, we first created binary variables indicating whether a participant has migraines in general, migraines with aura, or migraines without aura.

In [337]:
# Binary (0/1) indicators for any migraine and for each migraine type
# .astype(int) converts the boolean match directly, so no intermediate dummy columns are needed
nervous = conditions['Nervous System Conditions']
conditions['Has Migraines'] = nervous.str.contains('Migraine', case=False, na=False).astype(int)
conditions['Has Migraines with Aura'] = nervous.str.contains('Migraine with aura', case=False, na=False).astype(int)
conditions['Has Migraines without Aura'] = nervous.str.contains('Migraine without aura', case=False, na=False).astype(int)
In [338]:
# need to isolate Migraines out of nerv system conditions:
# the entry must be non-null and must not consist solely of a single migraine label
nervous = conditions['Nervous System Conditions']
only_migraine = (nervous.str.fullmatch('Migraine without aura', case=False, na=False)
                 | nervous.str.fullmatch('Migraine with aura', case=False, na=False))
conditions['Has Nervous System Conditions'] = (nervous.notnull() & ~only_migraine).astype(int)
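
Because each entry is a comma-separated list, whole-string matching treats an entry that lists both migraine types and nothing else as an "other" nervous system condition. A possible alternative, sketched here on made-up entries, is to split each entry and check whether any item is not a migraine label. The helper name `has_other_nervous_condition` and the `MIGRAINE_LABELS` set are assumptions introduced for illustration, and the sketch assumes commas only separate conditions (a name that itself contains a comma is split into fragments, which still count as non-migraine items).

```python
import pandas as pd

# Labels treated as "migraine"; assumed from the two types used in this tutorial.
MIGRAINE_LABELS = {"migraine with aura", "migraine without aura"}

def has_other_nervous_condition(entry):
    """Return 1 if the comma-separated entry lists any non-migraine condition."""
    if pd.isna(entry):
        return 0
    items = [item.strip().lower() for item in entry.split(",")]
    return int(any(item and item not in MIGRAINE_LABELS for item in items))

# Hypothetical entries for illustration.
demo = pd.Series([
    None,
    "Migraine without aura",
    "Migraine with aura, Migraine without aura",
    "Restless legs syndrome, Migraine with aura",
])
flags = demo.apply(has_other_nervous_condition)
print(flags.tolist())
```

Only the last entry is flagged, since it is the only one that lists a condition other than a migraine.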
In [339]:
conditions
Out[339]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions Musculoskeletal System Conditions Congenital Conditions Has Migraines Has Migraines with Aura Has Migraines without Aura Has Nervous System Conditions
0 hu3073E3 NaN NaN NaN NaN Age-related cataract, Myopia (Nearsightedness)... Deviated septum, Allergic rhinitis Dental cavities, Canker sores (oral ulcers), I... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Hair loss... Chondromalacia patella (CMP) NaN 0 0 0 0
1 hu407142 NaN NaN NaN NaN Myopia (Nearsightedness), Astigmatism, Dry eye... Chronic sinusitis, Allergic rhinitis Dental cavities Urinary tract infection (UTI) Acne NaN NaN 0 0 0 0
2 huF974A8 NaN NaN NaN NaN Myopia (Nearsightedness), Dry eye syndrome, Fl... NaN Dental cavities, Canker sores (oral ulcers) NaN NaN Osgood-Schlatter disease NaN 0 0 0 0
3 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags Sciatica NaN 1 0 1 0
4 hu3C0611 Migraine without aura, Hereditary motor and se... NaN Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Floaters Chronic tonsillitis, Allergic rhinitis, Asthma Dental cavities Kidney stones, Urinary tract infection (UTI) Eczema, Keloids Bunions, Plantar fasciitis NaN 1 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1762 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne Sciatica, Tennis elbow, Bone spurs, Fibromyalg... Spina bifida 1 0 1 1
1763 hu794D40 Recurrent sleep paralysis Hypertension Thyroid nodule(s) NaN Age-related macular degeneration Nasal polyps, Chronic sinusitis, Chronic tonsi... Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Rosacea, ... Postural kyphosis NaN 0 0 0 1
1764 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... Tennis elbow, Fibromyalgia, Scoliosis NaN 1 1 1 1
1765 hu35E970 Essential tremor, Chronic tension headaches (1... Raynaud's phenomenon Thyroid nodule(s), Lactose intolerance Iron deficiency anemia, Hereditary thrombophil... NaN Allergic rhinitis, Asthma Dental cavities, Gingivitis, Canker sores (ora... Urinary tract infection (UTI) Dandruff, Allergic contact dermatitis, Rosacea Frozen shoulder, Fibromyalgia Developmental dysplasia of the hip 0 0 0 1
1766 hu09787B NaN Hypertension, Hemorrhoids Thyroid nodule(s), Hypothyroidism, Hashimoto's... Von Willebrand disease Myopia (Nearsightedness), Astigmatism, Age-rel... Chronic sinusitis Dental cavities, Gallstones Kidney stones Dandruff, Hair loss (includes female and male ... Bone spurs, Osteoporosis, Scoliosis Congenital clubfoot (equinovarus) 0 0 0 0

1767 rows × 16 columns

From there, we created dummy variables for each class of comorbidities to indicate whether a person has at least one condition from that class.

In [340]:
# rest of the condition binaries
conditions['Has Blood Conditions'] = conditions['Blood Conditions'].notnull().astype('int')
conditions['Has Circulatory Conditions'] = conditions['Circulatory System Conditions'].notnull().astype('int')
conditions['Has Endocrine Conditions'] = conditions['Endocrine System Conditions'].notnull().astype('int')
conditions['Has Vision and Hearing Conditions'] = conditions['Visual and Hearing Conditions'].notnull().astype('int')
conditions['Has Respiratory Conditions'] = conditions['Respiratory System Conditions'].notnull().astype('int')
conditions['Has Digestive Conditions'] = conditions['Digestive System Conditions'].notnull().astype('int')
conditions['Has Genitourinary Conditions'] = conditions['Genitourinary System Conditions'].notnull().astype('int')
conditions['Has Skin Conditions'] = conditions['Skin Conditions'].notnull().astype('int')
conditions['Has Musculoskeletal Conditions'] = conditions['Musculoskeletal System Conditions'].notnull().astype('int')
conditions['Has Congenital Conditions'] = conditions['Congenital Conditions'].notnull().astype('int')
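
The ten assignments above all follow the same pattern, so they could also be written as a loop over a mapping from indicator name to source column. This is only a sketch on a made-up two-column frame; the `indicator_sources` dictionary is a hypothetical helper and lists just two of the columns.

```python
import pandas as pd

# Hypothetical toy frame with two of the condition columns.
demo = pd.DataFrame({
    "Blood Conditions": ["Iron deficiency anemia", None],
    "Skin Conditions": [None, "Acne"],
})

# Map each new indicator name to the source column it summarizes.
indicator_sources = {
    "Has Blood Conditions": "Blood Conditions",
    "Has Skin Conditions": "Skin Conditions",
}

# A non-null entry means at least one condition was reported in that class.
for new_col, source_col in indicator_sources.items():
    demo[new_col] = demo[source_col].notnull().astype(int)

print(demo[list(indicator_sources)])
```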
In [341]:
# EDA time
conditions.columns
Out[341]:
Index(['Participant', 'Nervous System Conditions',
       'Circulatory System Conditions', 'Endocrine System Conditions',
       'Blood Conditions', 'Visual and Hearing Conditions',
       'Respiratory System Conditions', 'Digestive System Conditions',
       'Genitourinary System Conditions', 'Skin Conditions',
       'Musculoskeletal System Conditions', 'Congenital Conditions',
       'Has Migraines', 'Has Migraines with Aura',
       'Has Migraines without Aura', 'Has Nervous System Conditions',
       'Has Blood Conditions', 'Has Circulatory Conditions',
       'Has Endocrine Conditions', 'Has Vision and Hearing Conditions',
       'Has Respiratory Conditions', 'Has Digestive Conditions',
       'Has Genitourinary Conditions', 'Has Skin Conditions',
       'Has Musculoskeletal Conditions', 'Has Congenital Conditions'],
      dtype='object')

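Then, for each class of comorbidity, we computed 1) the proportion of participants who had it among those with migraines vs those without migraines and 2) the share of migraine participants who had it together with each migraine type (with aura vs without aura).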

In [342]:
def prob_no_mig(cond):  # Getting proportions of those with and without Migraines per biological system
  a = ((conditions['Has Migraines'] == 0) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 0).sum()
  return a
def prob_w_mig(cond):
  b =((conditions['Has Migraines'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 1).sum()
  return b
In [343]:
prob_w_list = [] # iterating through columns list for biological system
for i in conditions.columns[15:]:
  prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in conditions.columns[15:]:
  prob_no_list.append(prob_no_mig(i))
In [344]:
def prob_no_aura(cond): # Share of all migraine participants who have migraine without aura AND the given condition
  # note: the denominator is every participant with migraines, not only those without aura
  a = ((conditions['Has Migraines without Aura'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 1).sum()
  return a
def prob_w_aura(cond): # Same share for migraine with aura, again out of all migraine participants
  b =((conditions['Has Migraines with Aura'] == 1) & (conditions[cond] == 1)).sum() / (conditions['Has Migraines'] == 1).sum()
  return b
In [345]:
prob_w_aura_list = [] # iterating through columns for biological system
for i in conditions.columns[15:]:
  prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in conditions.columns[15:]:
  prob_no_aura_list.append(prob_no_aura(i))
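
The aura functions above divide by the number of all migraine participants, so their values depend on how large each subtype is. If the goal is to compare how common a condition is within each subtype, one option is to divide by the size of that subtype instead. The sketch below shows this on made-up indicator data; `prop_within` is a hypothetical helper, not part of the notebook, and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical 0/1 indicator data for six participants.
demo = pd.DataFrame({
    "Has Migraines with Aura":    [1, 1, 0, 0, 0, 1],
    "Has Migraines without Aura": [0, 0, 1, 1, 1, 1],
    "Has Skin Conditions":        [1, 0, 1, 1, 0, 1],
})

def prop_within(df, group_col, cond_col):
    """Share of rows with group_col == 1 that also have cond_col == 1."""
    in_group = df[group_col] == 1
    # The mean of a 0/1 column is the fraction of ones.
    return df.loc[in_group, cond_col].mean()

with_aura = prop_within(demo, "Has Migraines with Aura", "Has Skin Conditions")
without_aura = prop_within(demo, "Has Migraines without Aura", "Has Skin Conditions")
print(with_aura, without_aura)
```

With this denominator, the two values can be compared directly even when one subtype has many more participants than the other.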

We then generated a new dataframe with these proportions to evaluate the makeup of the population.

In [469]:
d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} # putting proportions into df
In [347]:
conditions_prop = pd.DataFrame(data=d,index=conditions.columns[15:])
In [348]:
conditions_prop = conditions_prop.reset_index()
In [349]:
# dataframe showing the proportion of each subset who has a specific comorbidity
conditions_prop
Out[349]:
index w mig no mig w_aura no_aura
0 Has Nervous System Conditions 0.528646 0.218366 0.299479 0.356771
1 Has Blood Conditions 0.335938 0.157628 0.182292 0.205729
2 Has Circulatory Conditions 0.596354 0.453362 0.343750 0.335938
3 Has Endocrine Conditions 0.554688 0.417209 0.289062 0.338542
4 Has Vision and Hearing Conditions 0.854167 0.757773 0.497396 0.473958
5 Has Respiratory Conditions 0.635417 0.485900 0.348958 0.372396
6 Has Digestive Conditions 0.966146 0.913955 0.536458 0.552083
7 Has Genitourinary Conditions 0.682292 0.432393 0.382812 0.388021
8 Has Skin Conditions 0.880208 0.797542 0.492188 0.507812
9 Has Musculoskeletal Conditions 0.690104 0.506869 0.408854 0.375000
10 Has Congenital Conditions 0.182292 0.052784 0.111979 0.104167
In [350]:
# Bar graph of conditions (values are proportions between 0 and 1, so the y-axis is labelled accordingly)
conditions_prop.plot.bar(x='index',y=['w mig','no mig'],color=[ 'red', 'blue'], xlabel='Conditions',ylabel='Proportion of Individuals', title='Condition Proportions of Migraine Havers vs Control', width=0.8, figsize=(12,6)).grid()

In every category, participants with migraines were more likely to report comorbidities than participants without migraines. In this sample, someone with migraines was more than twice as likely to have another nervous system condition as someone without migraines (about 53% vs 22%).

In [351]:
# Bar graph of conditions w and wout aura (values are proportions between 0 and 1)
conditions_prop.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'green', 'orange'], xlabel='Conditions',ylabel='Proportion of Individuals', title='Condition Proportions of Migraine with Aura vs without', width=0.8, figsize=(12,6)).grid()

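When looking at the prevalence of comorbidities in participants with vs without aura, there does not appear to be a relationship: the proportions are approximately the same for both groups across all 11 categories. Note that both columns are computed as a share of all participants with migraines, so they also reflect how many participants fall into each migraine type.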

From there, we wanted to look more specifically at whether phenotypes and general traits had a statistical relationship with migraines.

In [352]:
# need dummies for Blood Types
phenotypes = pd.get_dummies(phenotypes,columns=['Blood Type'])
In [353]:
phenotypes # dataframe with dummies for all (common) blood types
Out[353]:
Participant Height (in) Weight (lbs) Blood Type_A + Blood Type_A - Blood Type_AB + Blood Type_AB - Blood Type_B + Blood Type_B - Blood Type_Don't know Blood Type_O + Blood Type_O -
0 hu826751 6'2" 188.0 0 0 1 0 0 0 0 0 0
1 huDDCF88 5'10" 159.0 0 0 0 0 0 0 0 1 0
2 hu3DC5EA 5'5" 184.0 1 0 0 0 0 0 0 0 0
3 hu008567 5'1" 138.0 0 0 0 0 0 0 0 1 0
4 hu98FFC6 5'5" 230.0 1 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
1096 huF8913E 5'6" 185.0 0 0 0 0 0 0 0 1 0
1097 hu794D40 5'9" 170.0 1 0 0 0 0 0 0 0 0
1098 huD8AD3F 5'5" 108.0 0 0 0 0 0 0 0 1 0
1099 hu09787B 5'9" 233.0 0 0 0 0 0 0 0 1 0
1100 huF5CD05 5'5" 215.0 1 0 0 0 0 0 0 0 0

1101 rows × 12 columns

In [354]:
# need height values to be type float
phenotypes['Height (in)'] = phenotypes['Height (in)'].str.replace("\"","")
phenotypes['Height (in)'] = phenotypes['Height (in)'].str.replace("'"," ")
phenotypes['Height (in)'] = [s.split(" ") for s in phenotypes['Height (in)']]
phenotypes['Height (in)'] = [float(value[0])*12 + float(value[1]) for value in phenotypes['Height (in)']]
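
The conversion above assumes every height is a well-formed string such as 5'9". As a hedged alternative, the sketch below parses a feet-and-inches string with a regular expression and returns None for anything it cannot read. `HEIGHT_PATTERN` and `height_to_inches` are hypothetical names introduced here, and the accepted format is an assumption based on the values shown in the table.

```python
import re

# Matches strings like 5'9" or 6'2", with the inches part optional.
HEIGHT_PATTERN = re.compile(r"""^\s*(\d+)\s*'\s*(\d+(?:\.\d+)?)?\s*"?\s*$""")

def height_to_inches(text):
    """Convert a feet'inches" string to total inches, or None if unparseable."""
    if not isinstance(text, str):
        return None  # covers NaN and other non-string values
    match = HEIGHT_PATTERN.match(text)
    if match is None:
        return None
    feet = int(match.group(1))
    inches = float(match.group(2) or 0)
    return feet * 12 + inches

print(height_to_inches("5'9\""))
```

Applied to the column, this would look like `phenotypes['Height (in)'].map(height_to_inches)`, leaving unparseable entries as missing values instead of raising an error.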
In [355]:
phenotypes
Out[355]:
Participant Height (in) Weight (lbs) Blood Type_A + Blood Type_A - Blood Type_AB + Blood Type_AB - Blood Type_B + Blood Type_B - Blood Type_Don't know Blood Type_O + Blood Type_O -
0 hu826751 74.0 188.0 0 0 1 0 0 0 0 0 0
1 huDDCF88 70.0 159.0 0 0 0 0 0 0 0 1 0
2 hu3DC5EA 65.0 184.0 1 0 0 0 0 0 0 0 0
3 hu008567 61.0 138.0 0 0 0 0 0 0 0 1 0
4 hu98FFC6 65.0 230.0 1 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ...
1096 huF8913E 66.0 185.0 0 0 0 0 0 0 0 1 0
1097 hu794D40 69.0 170.0 1 0 0 0 0 0 0 0 0
1098 huD8AD3F 65.0 108.0 0 0 0 0 0 0 0 1 0
1099 hu09787B 69.0 233.0 0 0 0 0 0 0 0 1 0
1100 huF5CD05 65.0 215.0 1 0 0 0 0 0 0 0 0

1101 rows × 12 columns

In [356]:
traits = pd.get_dummies(traits,columns=['Sex/Gender','Race/ethnicity']) # Getting dummies for sex and race
In [357]:
# remerging conditions with phenotypes and general traits // considering dummies now
all = conditions.merge(phenotypes,how='inner',on=['Participant'])
all = all.merge(traits,how='inner',on=['Participant'])
In [358]:
all
Out[358]:
Participant Nervous System Conditions Circulatory System Conditions Endocrine System Conditions Blood Conditions Visual and Hearing Conditions Respiratory System Conditions Digestive System Conditions Genitourinary System Conditions Skin Conditions ... Sex/Gender_sex: female; gender: non-binary Race/ethnicity_American Indian Race/ethnicity_Asian Race/ethnicity_Asian, Hispanic or Race/ethnicity_Asian, Native Hawaiian Race/ethnicity_Black or African Race/ethnicity_Hispanic or Latino Race/ethnicity_Native Hawaiian Race/ethnicity_No response Race/ethnicity_White
0 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 0 1 0 0 0 0 0 0 0 0
1 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 0 1 0 0 0 0 0 0 0 0
2 hu620F18 Migraine without aura NaN High cholesterol (hypercholesterolemia) NaN Myopia (Nearsightedness), Astigmatism, Floaters NaN Impacted tooth, Dental cavities, Gingivitis NaN Eczema, Allergic contact dermatitis, Skin tags ... 0 1 0 0 0 0 0 0 0 0
3 hu3C0611 Migraine without aura, Hereditary motor and se... NaN Thyroid nodule(s), Hypothyroidism, Hashimoto's... Iron deficiency anemia Floaters Chronic tonsillitis, Allergic rhinitis, Asthma Dental cavities Kidney stones, Urinary tract infection (UTI) Eczema, Keloids ... 0 0 0 0 0 0 0 0 0 1
4 huE9B698 NaN NaN Hypothyroidism NaN Myopia (Nearsightedness), Floaters NaN Impacted tooth, Dental cavities, Canker sores ... NaN Allergic contact dermatitis, Keloids, Skin tags ... 0 0 0 0 0 0 0 0 0 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1127 huD2D4B8 NaN Varicose veins NaN NaN NaN Asthma NaN NaN Acne ... 0 0 0 0 0 0 0 0 0 1
1128 huF8913E Recurrent sleep paralysis, Restless legs syndr... Hypertension, Hemorrhoids Thyroid nodule(s), High cholesterol (hyperchol... NaN Myopia (Nearsightedness), Astigmatism, Floaters Deviated septum, Chronic sinusitis, Allergic r... Impacted tooth, Dental cavities, Canker sores ... Urinary tract infection (UTI), Endometriosis Dandruff, Acne ... 0 0 0 0 0 0 1 0 0 0
1129 hu794D40 Recurrent sleep paralysis Hypertension Thyroid nodule(s) NaN Age-related macular degeneration Nasal polyps, Chronic sinusitis, Chronic tonsi... Dental cavities, Temporomandibular joint (TMJ)... Urinary tract infection (UTI) Eczema, Allergic contact dermatitis, Rosacea, ... ... 0 0 0 0 0 0 0 0 0 1
1130 huD8AD3F Restless legs syndrome, Migraine with aura, Mi... Angina, Cardiac arrhythmia NaN Iron deficiency anemia Myopia (Nearsightedness), Tinnitus NaN Gastroesophageal reflux disease (GERD), Irrita... Urinary tract infection (UTI), Ovarian cysts Eczema, Allergic contact dermatitis, Hyperhidr... ... 0 0 0 0 0 0 0 0 0 1
1131 hu09787B NaN Hypertension, Hemorrhoids Thyroid nodule(s), Hypothyroidism, Hashimoto's... Von Willebrand disease Myopia (Nearsightedness), Astigmatism, Age-rel... Chronic sinusitis Dental cavities, Gallstones Kidney stones Dandruff, Hair loss (includes female and male ... ... 0 0 0 0 0 0 0 0 0 1

1132 rows × 63 columns

In [359]:
all['Participant'].value_counts()
# need to drop dupes again

all = all.drop_duplicates(subset=['Participant'])
all = all.reset_index(drop=True)
In [360]:
# EDA for traits
# mig havers by sex
male = [((all['Has Migraines'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines'].sum(),
        ((all['Has Migraines with Aura'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines with Aura'].sum(),
        ((all['Has Migraines without Aura'] == 1) & (all['Sex/Gender_Male']==1)).sum() / all['Has Migraines without Aura'].sum()]
female = [((all['Has Migraines'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines'].sum(),
          ((all['Has Migraines with Aura'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines with Aura'].sum(),
          ((all['Has Migraines without Aura'] == 1) & (all['Sex/Gender_Female']==1)).sum() / all['Has Migraines without Aura'].sum()]
In [361]:
d={'male': male, 'female':female} #df for mig havers by sex
In [362]:
sex_props = pd.DataFrame(data=d,index=['w mig','w_aura','no_aura'])
In [363]:
sex_props = sex_props.reset_index()
In [364]:
sex_props #proportions of participant sex considering migraine types
Out[364]:
index male female
0 w mig 0.283505 0.706186
1 w_aura 0.227723 0.752475
2 no_aura 0.298246 0.684211
In [365]:
sex_props.plot.bar(x='index',y=['male','female'],color=[ 'royalblue', 'pink'], xlabel='Migraines and Types of Migraines',ylabel='Proportion of Individuals', title='Sex Proportions of Having Migraines and Types of Migraines', width=0.8, figsize=(12,6)).grid()

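There appears to be a relationship between biological sex and migraine occurrence in this sample: female participants make up roughly 70% of the participants with migraines (about 75% of those with aura and 68% of those without aura). This is a descriptive observation, since we did not run a formal significance test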
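One way to attach an uncertainty range to the female share is a Wilson score interval. The sketch below is a rough illustration in plain Python. The counts (137 female participants out of 194 participants with migraines) are an assumption back-calculated from the proportions in `sex_props`, not values read directly from the data, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Assumed counts, back-calculated from 0.706186 * 194
low, high = wilson_interval(137, 194)
print(f"female share among migraine havers: {137/194:.3f} (95% CI {low:.3f} to {high:.3f})")
```

If the whole interval sits above 0.5, the female majority among migraine havers is unlikely to be a sampling accident, although it says nothing about the sex balance of the full sample.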

From there, we wanted to see if blood type had any statistical relationship with migraines

In [366]:
all.columns[28:37]
Out[366]:
Index(['Blood Type_A +', 'Blood Type_A -', 'Blood Type_AB +',
       'Blood Type_AB -', 'Blood Type_B +', 'Blood Type_B -',
       'Blood Type_Don't know', 'Blood Type_O +', 'Blood Type_O -'],
      dtype='object')
In [367]:
def prob_no_mig(cond): # Blood type proportions based on Migraine haver or not
  a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 0).sum()
  return a
def prob_w_mig(cond):
  b =((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
  return b
In [368]:
prob_w_list = [] # iterating through columns for blood types
for i in all.columns[28:37]:
  prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in all.columns[28:37]:
  prob_no_list.append(prob_no_mig(i))
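The two loops above can also be collapsed into one grouped mean, because the mean of a 0/1 indicator column is the proportion of ones. The sketch below uses a small made-up frame (`toy`) with two blood-type columns standing in for the nine indicator columns; none of the numbers come from the study data.

```python
import pandas as pd

# Made-up data: 1 = yes, 0 = no (not the study data)
toy = pd.DataFrame({
    'Has Migraines':  [1, 1, 0, 0, 0],
    'Blood Type_A +': [1, 0, 1, 1, 0],
    'Blood Type_O +': [0, 1, 0, 0, 1],
})
blood_cols = ['Blood Type_A +', 'Blood Type_O +']

# Mean of each indicator within each migraine group = share of that group with the blood type
props = toy.groupby('Has Migraines')[blood_cols].mean().T
props.columns = ['no mig', 'w mig']  # groupby sorts the keys, so group 0 comes first
print(props)
```

The result has one row per blood type and one column per group, which is the same layout as the `blood_props` frame built below.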
In [369]:
def prob_w_aura(cond): # Blood type proportions among migraine havers: migraine with aura
  a = ((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
  return a
def prob_no_aura(cond): # migraine without aura
  b =((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all['Has Migraines'] == 1).sum()
  return b
In [370]:
prob_w_aura_list = [] # iterating through columns for blood types
for i in all.columns[28:37]:
  prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in all.columns[28:37]:
  prob_no_aura_list.append(prob_no_aura(i))
In [371]:
d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} #blood type probs dataframe
In [372]:
blood_props = pd.DataFrame(data=d,index=all.columns[28:37])
In [373]:
blood_props = blood_props.reset_index()
In [374]:
blood_props #blood type vs type of migraine
Out[374]:
index w mig no mig w_aura no_aura
0 Blood Type_A + 0.268041 0.256591 0.154639 0.144330
1 Blood Type_A - 0.041237 0.070299 0.005155 0.041237
2 Blood Type_AB + 0.061856 0.028120 0.036082 0.036082
3 Blood Type_AB - 0.015464 0.015817 0.010309 0.005155
4 Blood Type_B + 0.077320 0.086116 0.051546 0.036082
5 Blood Type_B - 0.015464 0.015817 0.010309 0.010309
6 Blood Type_Don't know 0.154639 0.184534 0.077320 0.087629
7 Blood Type_O + 0.278351 0.256591 0.139175 0.170103
8 Blood Type_O - 0.087629 0.086116 0.036082 0.056701
In [375]:
blood_props.plot.bar(x='index',y=['w mig','no mig'],color=[ 'brown', 'aqua'], xlabel='Blood Types',ylabel='Proportion of Individuals', title='Proportions of Blood Types for Migraine Havers and Control', width=0.8, figsize=(12,6)).grid()

The proportions are similar for participants with and without migraines across every blood type, so there is no obvious relationship between having migraines and any specific blood type
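A quick way to check one blood type more formally is a Pearson chi-square statistic on a 2x2 table. The sketch below is written in plain Python and is only an illustration: the counts are assumptions back-calculated from the `blood_props` proportions (194 participants with migraines, 569 without), and `chi_square_statistic` is a hypothetical helper, not part of the notebook.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Assumed counts: rows = migraine / no migraine, columns = O+ / any other answer
observed = [[54, 140],
            [146, 423]]
stat = chi_square_statistic(observed)
CRITICAL_5PCT_DF1 = 3.841  # chi-square critical value for 1 degree of freedom, alpha = 0.05
print(f"chi-square = {stat:.3f}; exceeds critical value: {stat > CRITICAL_5PCT_DF1}")
```

A statistic below the critical value means the O+ difference between the two groups is compatible with chance at the 5% level. Testing all nine blood types this way would call for a multiple-comparison correction.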

In [376]:
blood_props.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'gray', 'red'], xlabel='Blood Types',ylabel='Proportion of Individuals', title='Proportion of Blood Types for Aura and No Aura', width=0.8, figsize=(12,6)).grid()

Blood type also shows no clear relationship with migraine with aura vs migraine without aura. The gaps between the two groups are somewhat larger here than in the previous plot, but the subgroups are small, so differences of this size can be attributed to natural variance

We were also curious as to whether race/ethnicity had a relationship with migraine occurrence

In [377]:
all.columns[54:]
Out[377]:
Index(['Race/ethnicity_American Indian', 'Race/ethnicity_Asian',
       'Race/ethnicity_Asian, Hispanic or',
       'Race/ethnicity_Asian, Native Hawaiian',
       'Race/ethnicity_Black or African', 'Race/ethnicity_Hispanic or Latino',
       'Race/ethnicity_Native Hawaiian', 'Race/ethnicity_No response',
       'Race/ethnicity_White'],
      dtype='object')
In [378]:
def prob_no_mig(cond): # proportions for race per those with and without migraines
  a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return a
def prob_w_mig(cond): 
  b =((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return b
In [379]:
prob_w_list = [] # iterating through columns for races
for i in all.columns[54:]:
  prob_w_list.append(prob_w_mig(i))
prob_no_list = []
for i in all.columns[54:]:
  prob_no_list.append(prob_no_mig(i))
<ipython-input-378-545a5e4fc270>:5: RuntimeWarning: invalid value encountered in long_scalars
  b =((all['Has Migraines'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
<ipython-input-378-545a5e4fc270>:2: RuntimeWarning: invalid value encountered in long_scalars
  a = ((all['Has Migraines'] == 0) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
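These warnings come from dividing zero by zero for categories that have no participants. A small guard such as the hypothetical `safe_proportion` helper below makes that case explicit. It is a sketch in plain Python, not part of the original notebook.

```python
import math

def safe_proportion(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    if denominator == 0:
        return float('nan')
    return numerator / denominator

print(safe_proportion(3, 12))  # a populated category
print(safe_proportion(0, 0))   # an empty category gives nan without raising a warning
```

Returning NaN keeps the later `dropna()` step working as before, because empty categories still end up as missing values.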
In [380]:
def prob_w_aura(cond): # aura types per race: migraine with aura
  a = ((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return a
def prob_no_aura(cond): # migraine without aura
  b =((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
  return b
In [381]:
prob_w_aura_list = [] # iterating through columns for races
for i in all.columns[54:]:
  prob_w_aura_list.append(prob_w_aura(i))
prob_no_aura_list = []
for i in all.columns[54:]:
  prob_no_aura_list.append(prob_no_aura(i))
<ipython-input-380-6f2f39f387d9>:2: RuntimeWarning: invalid value encountered in long_scalars
  a = ((all['Has Migraines with Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
<ipython-input-380-6f2f39f387d9>:5: RuntimeWarning: invalid value encountered in long_scalars
  b =((all['Has Migraines without Aura'] == 1) & (all[cond] == 1)).sum() / (all[cond] == 1).sum()
In [382]:
d={'w mig': prob_w_list, 'no mig':prob_no_list, 'w_aura': prob_w_aura_list,'no_aura': prob_no_aura_list} # race props df
In [383]:
race_props = pd.DataFrame(data=d,index=all.columns[54:])
In [384]:
race_props = race_props.reset_index()
In [385]:
race_props = race_props.dropna() # some races had very low values and no migs
In [386]:
race_props #race/ethnicity vs migraines
Out[386]:
index w mig no mig w_aura no_aura
0 Race/ethnicity_American Indian 0.419355 0.580645 0.290323 0.258065
1 Race/ethnicity_Asian 0.058824 0.941176 0.000000 0.058824
4 Race/ethnicity_Black or African 0.333333 0.666667 0.000000 0.333333
5 Race/ethnicity_Hispanic or Latino 0.269231 0.730769 0.115385 0.230769
6 Race/ethnicity_Native Hawaiian 0.000000 1.000000 0.000000 0.000000
7 Race/ethnicity_No response 0.500000 0.500000 0.500000 0.000000
8 Race/ethnicity_White 0.250368 0.749632 0.129602 0.142857
In [387]:
race_props.drop(race_props.loc[race_props.no_aura < 0.00001].index, inplace=True) # getting rid of rows with 0s (low pops, unimportant)
In [388]:
race_props
Out[388]:
index w mig no mig w_aura no_aura
0 Race/ethnicity_American Indian 0.419355 0.580645 0.290323 0.258065
1 Race/ethnicity_Asian 0.058824 0.941176 0.000000 0.058824
4 Race/ethnicity_Black or African 0.333333 0.666667 0.000000 0.333333
5 Race/ethnicity_Hispanic or Latino 0.269231 0.730769 0.115385 0.230769
8 Race/ethnicity_White 0.250368 0.749632 0.129602 0.142857
In [389]:
race_props.plot.bar(x='index',y=['w mig','no mig'],color=[ 'orange', 'blue'], xlabel='Race/Ethnicity',ylabel='Proportion of Individuals', title='Proportions of Race/Ethnicity for Migraine Havers and Control', width=0.8, figsize=(12,6)).grid()

Taken at face value, the proportions differ noticeably between groups, which would suggest a statistical relationship between race/ethnicity and migraine. However, white participants far outnumber every other group, so the proportions for the smaller groups rest on very few people, which makes this metric a poor indicator
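Reporting each group's size next to its proportion makes this caveat concrete. The sketch below uses made-up indicator columns (`Race/ethnicity_A`, `Race/ethnicity_B`) in the same one-hot layout as the notebook. The data and the `MIN_N` cutoff are assumptions chosen only for illustration.

```python
import pandas as pd

# Made-up indicator data (not the study data)
toy = pd.DataFrame({
    'Has Migraines':    [1, 0, 0, 1, 0, 0, 1, 0],
    'Race/ethnicity_A': [1, 1, 1, 1, 1, 1, 0, 0],
    'Race/ethnicity_B': [0, 0, 0, 0, 0, 0, 1, 1],
})
race_cols = ['Race/ethnicity_A', 'Race/ethnicity_B']

rows = []
for col in race_cols:
    members = toy[toy[col] == 1]  # participants who ticked this category
    rows.append({'group': col,
                 'n': len(members),
                 'w mig': members['Has Migraines'].mean()})
summary = pd.DataFrame(rows).set_index('group')

MIN_N = 5  # arbitrary cutoff chosen for this illustration
print(summary[summary['n'] >= MIN_N])
```

Filtering on group size, rather than on whether a proportion happens to be zero, keeps the decision about which groups to plot independent of the outcome being studied.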

In [390]:
race_props.plot.bar(x='index',y=['w_aura','no_aura'],color=[ 'green', 'red'], xlabel='Race/Ethnicity',ylabel='Proportion of Individuals', title='Proportion of Race/Ethnicity for Aura and No Aura', width=0.8, figsize=(12,6)).grid()

The same caveat applies to this plot: although there appears to be a relationship between race/ethnicity and migraine with/without aura, the size and makeup of the dataset must be considered first

Exploring Genes

We also wanted to explore some relationships found in the genetic data. Within this section, we created additional dataframes and evaluated, for each gene variant, how often it appears in each migraine group and how its carriers split between heterozygous and homozygous

In [416]:
alleles
Out[416]:
Participant Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Has Migraine
0 hu620F18 Mig no aura CBS-I278T Recessive Carrier (Heterozygous) Mutation Likely pathogenic 1
1 hu620F18 Mig no aura C3-R102G Complex/Other Heterozygous Variant Likely pathogenic 1
2 hu620F18 Mig no aura COL4A1-Q1334H Dominant Heterozygous Variant Likely pathogenic 1
3 hu620F18 Mig no aura MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic 1
4 hu620F18 Mig no aura rs5186 Unknown Heterozygous Variant Likely pathogenic 1
... ... ... ... ... ... ... ... ...
1253 hu05FD49 None RPE65-N356 / heterozygous Mutation - Frameshift / 0
1254 hu05FD49 None PKD1-R2430 / heterozygous Mutation - nonsense / 0
1255 hu05FD49 None FLG-R3879 / heterozygous Mutation - nonsense / 0
1256 hu05FD49 None SBF2-H1549 / heterozygous Mutation - nonsense / 0
1257 hu05FD49 None NF2-K523 / heterozygous Mutation - Frameshift / 0

1258 rows × 8 columns

In [417]:
alleles.Gene.nunique() #there are 412 different genes in the df
Out[417]:
412
In [418]:
no_mig = alleles['Type of Migraine'].str.contains('None', case=False, na=False)
aura_mig = alleles['Type of Migraine'].str.contains('with aura', case=False, na=False)
no_aura_mig = alleles['Type of Migraine'].str.contains('no aura', case=False, na=False)
both_mig = alleles['Type of Migraine'].str.contains('Both', case=False, na=False)

none_df = alleles[no_mig]
aura_df = alleles[aura_mig]
no_aura_df = alleles[no_aura_mig]
both_df = alleles[both_mig]
In [419]:
none_df.Gene.value_counts()
none_df.Participant.nunique() #40 people
aura_df.Participant.nunique() #49 people
no_aura_df.Participant.nunique() #38 people
both_df.Participant.nunique() #10 people
Out[419]:
10

Out of the genetic information collected, there are alleles available for 40 participants with no migraines, 49 participants with migraine with aura, 38 participants with migraine without aura, and 10 participants with both forms

In [420]:
# keep only the (gene, zygosity) combinations that appear more than once within each group
none_df2 = none_df[none_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
aura_df2 = aura_df[aura_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
no_aura_df2 = no_aura_df[no_aura_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
both_df2 = both_df[both_df.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
none_df2
Out[420]:
Participant Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Has Migraine
78 hu4F8813 None MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic 0
81 hu4F8813 None PEX26-L153V / Carrier (Heterozygous) / Probably damaging 0
82 hu4386OC None SERPINA1-E366K Recessive Carrier (Heterozygous) Variant Well-established pathogenic 0
84 hu4386OC None C3-R102G Complex/Other Heterozygous Variant Likely pathogenic 0
88 hu4386OC None SERPINA1-E288V Recessive Carrier (Heterozygous) Variant Well-established pathogenic 0
... ... ... ... ... ... ... ... ...
1242 huA4E2CF None NEFL-S472 / homozygous Mutation - Frameshift / 0
1245 huA4E2CF None RYR1-P2002 / heterozygous Mutation - Frameshift / 0
1248 hu05FD49 None COL4A1-Q1334H Dominant homozygous variant likely pathogenic 0
1249 hu05FD49 None MTRR-I49M recessive homozygous variant likely pathogenic 0
1252 hu05FD49 None NEFL-S472 / homozygous Mutation - Frameshift / 0

204 rows × 8 columns
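The filter above relies on `duplicated(..., keep=False)`, which flags every row that shares its gene and zygosity label with at least one other row. A minimal sketch on made-up rows (the participant IDs `p1` to `p4` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    'Participant':    ['p1', 'p2', 'p3', 'p4'],
    'Gene':           ['MTRR-I49M', 'MTRR-I49M', 'MTRR-I49M', 'HFE-C282Y'],
    'Homo/heterozyg': ['Homozygous', 'Homozygous', 'Heterozygous', 'Heterozygous'],
})

# keep=False flags every member of a repeated (Gene, zygosity) pair, not just the later copies
shared = toy[toy.duplicated(subset=['Gene', 'Homo/heterozyg'], keep=False)]
print(shared)
```

Only `p1` and `p2` survive, because `p3` and `p4` each hold a combination nobody else has. The comparison is case-sensitive, so `'homozygous'` and `'Homozygous'` count as different labels unless the column is lowercased first.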

In [421]:
none_df2.Gene.value_counts()
Out[421]:
MTRR-I49M         27
NEFL-S472         27
COL4A1-Q1334H     17
C3-R102G          13
rs5186            13
APOE-C130R         8
CBS-I278T          7
MBL2-G54D          6
MBL2-R52C          6
NPC1-W1122         4
SERPINA1-E288V     4
SYNE1-N1915        4
AMPD1-Q12X         4
APOA5-S19W         4
HABP2-G534E        3
PAX2-Y273          3
HFE-C282Y          3
TGM1-E520G         3
TTN-E190           3
KRT5-G138E         3
NOD2-R702W         3
ACAD8-S171C        3
CREBBP-P1878       2
PRPH-D141Y         2
SERPINA1-E366K     2
RYR1-P2002         2
HPS6-A597          2
SNCA-Y39           2
PHKB-M185I         2
LPL-N318S          2
ALG3-F200          2
PKP2-S140F         2
MSR1-R293X         2
SPG11-K1013E       2
CETP-A390P         2
THBD-A43T          2
CD40LG-G219R       2
PEX26-L153V        2
WFS1-R456H         2
SNCA-A69           2
Name: Gene, dtype: int64
In [422]:
# dropping columns that are not needed for the zygosity breakdown
none_df2 = none_df2.drop(columns=['Disease capacity', 'Recessive or dominant', 'Mutation or variant?'])
In [423]:
def get_ratio(df, pattern):
  num = df.groupby(df[pattern].str.lower()).size()
  denom = len(df[pattern])
  return num/denom

# function to get the share of rows carrying each zygosity label (e.g. heterozygous vs homozygous)
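The same shares can be obtained with `value_counts(normalize=True)`. The sketch below runs it on a made-up column to show the effect of lowercasing the labels first. It is an equivalent formulation for illustration, not a replacement the notebook depends on.

```python
import pandas as pd

toy = pd.DataFrame({'Homo/heterozyg': ['Heterozygous', 'heterozygous', 'Homozygous', 'heterozygous']})

# Lowercase first so differently capitalised labels are counted together
shares = toy['Homo/heterozyg'].str.lower().value_counts(normalize=True)
print(shares)
```

Without the `.str.lower()` step, `'Heterozygous'` and `'heterozygous'` would be reported as two separate categories.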
In [424]:
def gene_grab(df, listg):
  count = df.Gene.value_counts()
  for gene, num in count.items(): # Series.iteritems() was removed in pandas 2.0
    listg.append((gene, num))

  return listg

# function to collect each gene and its count
In [425]:
none_count = none_df2.Gene.value_counts()
none_genes = []

test = gene_grab(none_df2, none_genes)
test

# this creates a list with all the genes and freqs
Out[425]:
[('MTRR-I49M', 27),
 ('NEFL-S472', 27),
 ('COL4A1-Q1334H', 17),
 ('C3-R102G', 13),
 ('rs5186', 13),
 ('APOE-C130R', 8),
 ('CBS-I278T', 7),
 ('MBL2-G54D', 6),
 ('MBL2-R52C', 6),
 ('NPC1-W1122', 4),
 ('SERPINA1-E288V', 4),
 ('SYNE1-N1915', 4),
 ('AMPD1-Q12X', 4),
 ('APOA5-S19W', 4),
 ('HABP2-G534E', 3),
 ('PAX2-Y273', 3),
 ('HFE-C282Y', 3),
 ('TGM1-E520G', 3),
 ('TTN-E190', 3),
 ('KRT5-G138E', 3),
 ('NOD2-R702W', 3),
 ('ACAD8-S171C', 3),
 ('CREBBP-P1878', 2),
 ('PRPH-D141Y', 2),
 ('SERPINA1-E366K', 2),
 ('RYR1-P2002', 2),
 ('HPS6-A597', 2),
 ('SNCA-Y39', 2),
 ('PHKB-M185I', 2),
 ('LPL-N318S', 2),
 ('ALG3-F200', 2),
 ('PKP2-S140F', 2),
 ('MSR1-R293X', 2),
 ('SPG11-K1013E', 2),
 ('CETP-A390P', 2),
 ('THBD-A43T', 2),
 ('CD40LG-G219R', 2),
 ('PEX26-L153V', 2),
 ('WFS1-R456H', 2),
 ('SNCA-A69', 2)]
In [426]:
def get_freq(gene_freq_list, df, ratios_list):
  for gene, _ in gene_freq_list:
    specific_gene_df = df[df['Gene'] == gene] # exact match, so one gene name cannot match inside another
    hold = get_ratio(specific_gene_df, 'Homo/heterozyg')
    ratios_list.append(list(hold.items())) # list of (zygosity label, share) tuples

  return ratios_list

# function that returns, for each gene, the share of each zygosity label
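The per-gene breakdown that `get_freq` builds can also be produced in one step with `pd.crosstab`. The sketch below uses a few made-up rows with the same column names. It is an alternative formulation shown for comparison, not the method used in the rest of this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    'Gene':           ['MTRR-I49M', 'MTRR-I49M', 'MTRR-I49M', 'NEFL-S472', 'NEFL-S472'],
    'Homo/heterozyg': ['Carrier (Heterozygous)', 'homozygous', 'Carrier (Heterozygous)',
                       'homozygous', 'Homozygous'],
})

# Rows are genes, columns are lowercased zygosity labels, and each row sums to 1
zyg_table = pd.crosstab(toy['Gene'], toy['Homo/heterozyg'].str.lower(), normalize='index')
print(zyg_table)
```

A table like this is easier to compare across groups than nested lists of tuples, because genes and labels line up as an index and columns.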
In [427]:
breakdown_test = []
none_freqs = get_freq(test, none_df2, breakdown_test)

breakdown_test
Out[427]:
[[('carrier (heterozygous)', 0.6296296296296297),
  ('homozygous', 0.37037037037037035)],
 [('homozygous', 1.0)],
 [('heterozygous', 0.7058823529411765), ('homozygous', 0.29411764705882354)],
 [('heterozygous', 1.0)],
 [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)]]
In [428]:
total_none_set = []
for item in range(len(test)):
  gene = test[item][0]
  freq = test[item][1]
  ratios = breakdown_test[item]
  total_none_set.append((gene, freq, ratios))

total_none_set

# creating list with gene, freq, and allele type distrib
Out[428]:
[('MTRR-I49M',
  27,
  [('carrier (heterozygous)', 0.6296296296296297),
   ('homozygous', 0.37037037037037035)]),
 ('NEFL-S472', 27, [('homozygous', 1.0)]),
 ('COL4A1-Q1334H',
  17,
  [('heterozygous', 0.7058823529411765), ('homozygous', 0.29411764705882354)]),
 ('C3-R102G', 13, [('heterozygous', 1.0)]),
 ('rs5186',
  13,
  [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)]),
 ('APOE-C130R', 8, [('heterozygous', 1.0)]),
 ('CBS-I278T', 7, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-G54D', 6, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-R52C', 6, [('carrier (heterozygous)', 1.0)]),
 ('NPC1-W1122', 4, [('heterozygous', 1.0)]),
 ('SERPINA1-E288V', 4, [('carrier (heterozygous)', 1.0)]),
 ('SYNE1-N1915', 4, [('heterozygous', 1.0)]),
 ('AMPD1-Q12X', 4, [('carrier (heterozygous)', 1.0)]),
 ('APOA5-S19W', 4, [('heterozygous', 1.0)]),
 ('HABP2-G534E', 3, [('heterozygous', 1.0)]),
 ('PAX2-Y273', 3, [('heterozygous', 1.0)]),
 ('HFE-C282Y', 3, [('carrier (heterozygous)', 1.0)]),
 ('TGM1-E520G', 3, [('carrier (heterozygous)', 1.0)]),
 ('TTN-E190', 3, [('heterozygous', 1.0)]),
 ('KRT5-G138E', 3, [('heterozygous', 1.0)]),
 ('NOD2-R702W', 3, [('heterozygous', 1.0)]),
 ('ACAD8-S171C', 3, [('carrier (heterozygous)', 1.0)]),
 ('CREBBP-P1878', 2, [('heterozygous', 1.0)]),
 ('PRPH-D141Y', 2, [('carrier (heterozygous)', 1.0)]),
 ('SERPINA1-E366K', 2, [('carrier (heterozygous)', 1.0)]),
 ('RYR1-P2002', 2, [('heterozygous', 1.0)]),
 ('HPS6-A597', 2, [('heterozygous', 1.0)]),
 ('SNCA-Y39', 2, [('heterozygous', 1.0)]),
 ('PHKB-M185I', 2, [('carrier (heterozygous)', 1.0)]),
 ('LPL-N318S', 2, [('heterozygous', 1.0)]),
 ('ALG3-F200', 2, [('heterozygous', 1.0)]),
 ('PKP2-S140F', 2, [('heterozygous', 1.0)]),
 ('MSR1-R293X', 2, [('heterozygous', 1.0)]),
 ('SPG11-K1013E', 2, [('carrier (heterozygous)', 1.0)]),
 ('CETP-A390P', 2, [('heterozygous', 1.0)]),
 ('THBD-A43T', 2, [('heterozygous', 1.0)]),
 ('CD40LG-G219R', 2, [('carrier (heterozygous)', 1.0)]),
 ('PEX26-L153V', 2, [('carrier (heterozygous)', 1.0)]),
 ('WFS1-R456H', 2, [('heterozygous', 1.0)]),
 ('SNCA-A69', 2, [('heterozygous', 1.0)])]

After collecting this information for participants without migraines, we repeated the same process, using the same functions, for the other three dataframes
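To compare gene counts across groups side by side, the per-group `value_counts()` results can be combined into a single frame. The sketch below uses two tiny made-up frames as stand-ins for `none_df2` and `aura_df2`; the `groups` dictionary and its contents are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-ins for the per-group dataframes
groups = {
    'none': pd.DataFrame({'Gene': ['MTRR-I49M', 'MTRR-I49M', 'NEFL-S472']}),
    'aura': pd.DataFrame({'Gene': ['MTRR-I49M', 'BEST1-S192', 'BEST1-S192']}),
}

# One column of counts per group; genes missing from a group get 0
counts = pd.DataFrame({name: frame['Gene'].value_counts() for name, frame in groups.items()})
counts = counts.fillna(0).astype(int)
print(counts)
```

Because the groups differ in size, raw counts like these are best read alongside the number of participants in each group.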

In [429]:
aura_df2.Gene.value_counts()
Out[429]:
MTRR-I49M         35
COL4A1-Q1334H     30
rs5186            21
APOE-C130R        14
C3-R102G          13
BEST1-S192        12
BEST1-Y245        12
MBL2-R52C         10
AMPD1-Q12X        10
MBL2-G54D          9
CETP-A390P         7
NEFL-S472          7
KRT5-G138E         5
KRT86-E402Q        5
PIGR-A580V         5
BTD-D444H          5
ACAD8-S171C        4
SERPINA1-E288V     4
VWF-S1506L         3
HABP2-G534E        3
RET-R231H          3
WFS1-R456H         3
KDR-C482R          3
HFE-S65C           3
SLC4A1-E40K        3
MPO-M251T          3
NOD2-G908R         3
KRT14-C18X         2
FCGR2B-I232T       2
RPGRIP1L-A229T     2
CFTR-S1235R        2
HFE-C282Y          2
MEFV-E148Q         2
MFN2-Q276R         2
APOA5-S19W         2
TTN-E190           2
COL9A3-R103W       2
PRF1-A91V          2
NOD2-R702W         2
ABCA4-G863A        2
ABCA4-A1038V       2
Name: Gene, dtype: int64
In [430]:
aura_df2 = aura_df2.drop(columns=['Disease capacity', 'Recessive or dominant', 'Mutation or variant?'])
In [431]:
aura_genes = []

ag_list = gene_grab(aura_df2, aura_genes)
ag_list
Out[431]:
[('MTRR-I49M', 35),
 ('COL4A1-Q1334H', 30),
 ('rs5186', 21),
 ('APOE-C130R', 14),
 ('C3-R102G', 13),
 ('BEST1-S192', 12),
 ('BEST1-Y245', 12),
 ('MBL2-R52C', 10),
 ('AMPD1-Q12X', 10),
 ('MBL2-G54D', 9),
 ('CETP-A390P', 7),
 ('NEFL-S472', 7),
 ('KRT5-G138E', 5),
 ('KRT86-E402Q', 5),
 ('PIGR-A580V', 5),
 ('BTD-D444H', 5),
 ('ACAD8-S171C', 4),
 ('SERPINA1-E288V', 4),
 ('VWF-S1506L', 3),
 ('HABP2-G534E', 3),
 ('RET-R231H', 3),
 ('WFS1-R456H', 3),
 ('KDR-C482R', 3),
 ('HFE-S65C', 3),
 ('SLC4A1-E40K', 3),
 ('MPO-M251T', 3),
 ('NOD2-G908R', 3),
 ('KRT14-C18X', 2),
 ('FCGR2B-I232T', 2),
 ('RPGRIP1L-A229T', 2),
 ('CFTR-S1235R', 2),
 ('HFE-C282Y', 2),
 ('MEFV-E148Q', 2),
 ('MFN2-Q276R', 2),
 ('APOA5-S19W', 2),
 ('TTN-E190', 2),
 ('COL9A3-R103W', 2),
 ('PRF1-A91V', 2),
 ('NOD2-R702W', 2),
 ('ABCA4-G863A', 2),
 ('ABCA4-A1038V', 2)]
In [432]:
aurafr = []
aura_freqs = get_freq(ag_list, aura_df2, aurafr)
aurafr
Out[432]:
[[('carrier (heterozygous)', 0.7428571428571429),
  ('homozygous', 0.2571428571428571)],
 [('heterozygous', 0.6333333333333333), ('homozygous', 0.36666666666666664)],
 [('heterozygous', 0.7142857142857143), ('homozygous', 0.2857142857142857)],
 [('heterozygous', 0.8571428571428571), ('homozygous', 0.14285714285714285)],
 [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)],
 [('homozygous', 1.0)],
 [('homozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)]]
In [433]:
total_aura_set = []
for item in range(len(ag_list)):
  gene = ag_list[item][0]
  freq = ag_list[item][1]
  ratios = aurafr[item]
  total_aura_set.append((gene, freq, ratios))

total_aura_set #full set of information for people with migraine with aura
Out[433]:
[('MTRR-I49M',
  35,
  [('carrier (heterozygous)', 0.7428571428571429),
   ('homozygous', 0.2571428571428571)]),
 ('COL4A1-Q1334H',
  30,
  [('heterozygous', 0.6333333333333333), ('homozygous', 0.36666666666666664)]),
 ('rs5186',
  21,
  [('heterozygous', 0.7142857142857143), ('homozygous', 0.2857142857142857)]),
 ('APOE-C130R',
  14,
  [('heterozygous', 0.8571428571428571), ('homozygous', 0.14285714285714285)]),
 ('C3-R102G',
  13,
  [('heterozygous', 0.8461538461538461), ('homozygous', 0.15384615384615385)]),
 ('BEST1-S192', 12, [('homozygous', 1.0)]),
 ('BEST1-Y245', 12, [('homozygous', 1.0)]),
 ('MBL2-R52C', 10, [('carrier (heterozygous)', 1.0)]),
 ('AMPD1-Q12X', 10, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-G54D', 9, [('carrier (heterozygous)', 1.0)]),
 ('CETP-A390P', 7, [('heterozygous', 1.0)]),
 ('NEFL-S472', 7, [('homozygous', 1.0)]),
 ('KRT5-G138E', 5, [('heterozygous', 1.0)]),
 ('KRT86-E402Q', 5, [('homozygous', 1.0)]),
 ('PIGR-A580V', 5, [('heterozygous', 1.0)]),
 ('BTD-D444H', 5, [('carrier (heterozygous)', 1.0)]),
 ('ACAD8-S171C', 4, [('carrier (heterozygous)', 1.0)]),
 ('SERPINA1-E288V', 4, [('carrier (heterozygous)', 1.0)]),
 ('VWF-S1506L', 3, [('carrier (heterozygous)', 1.0)]),
 ('HABP2-G534E', 3, [('heterozygous', 1.0)]),
 ('RET-R231H', 3, [('heterozygous', 1.0)]),
 ('WFS1-R456H', 3, [('heterozygous', 1.0)]),
 ('KDR-C482R', 3, [('heterozygous', 1.0)]),
 ('HFE-S65C', 3, [('carrier (heterozygous)', 1.0)]),
 ('SLC4A1-E40K', 3, [('carrier (heterozygous)', 1.0)]),
 ('MPO-M251T', 3, [('carrier (heterozygous)', 1.0)]),
 ('NOD2-G908R', 3, [('heterozygous', 1.0)]),
 ('KRT14-C18X', 2, [('heterozygous', 1.0)]),
 ('FCGR2B-I232T', 2, [('heterozygous', 1.0)]),
 ('RPGRIP1L-A229T', 2, [('heterozygous', 1.0)]),
 ('CFTR-S1235R', 2, [('carrier (heterozygous)', 1.0)]),
 ('HFE-C282Y', 2, [('carrier (heterozygous)', 1.0)]),
 ('MEFV-E148Q', 2, [('carrier (heterozygous)', 1.0)]),
 ('MFN2-Q276R', 2, [('heterozygous', 1.0)]),
 ('APOA5-S19W', 2, [('heterozygous', 1.0)]),
 ('TTN-E190', 2, [('heterozygous', 1.0)]),
 ('COL9A3-R103W', 2, [('heterozygous', 1.0)]),
 ('PRF1-A91V', 2, [('heterozygous', 1.0)]),
 ('NOD2-R702W', 2, [('heterozygous', 1.0)]),
 ('ABCA4-G863A', 2, [('carrier (heterozygous)', 1.0)]),
 ('ABCA4-A1038V', 2, [('heterozygous', 1.0)])]
In [434]:
both_df2.Gene.value_counts()
Out[434]:
MTRR-I49M         9
HFE-C282Y         5
KRT86-E402Q       4
COL4A1-Q1334H     4
APOA5-S19W        3
VWF-S1506L        3
MBL2-R52C         3
C3-R102G          3
SERPINA1-E366K    2
KRT5-G138E        2
CYP21A2-Q319X     2
MYH7-Q1334        2
MBL2-G54D         2
SERPINA1-E288V    2
RPGRIP1L-A229T    2
APC-Y486X         2
AMPD1-Q12X        2
PRF1-A91V         2
Name: Gene, dtype: int64
In [435]:
both_count = both_df2.Gene.value_counts()
both_genes = []

bboth = gene_grab(both_df2, both_genes)
bboth
Out[435]:
[('MTRR-I49M', 9),
 ('HFE-C282Y', 5),
 ('KRT86-E402Q', 4),
 ('COL4A1-Q1334H', 4),
 ('APOA5-S19W', 3),
 ('VWF-S1506L', 3),
 ('MBL2-R52C', 3),
 ('C3-R102G', 3),
 ('SERPINA1-E366K', 2),
 ('KRT5-G138E', 2),
 ('CYP21A2-Q319X', 2),
 ('MYH7-Q1334', 2),
 ('MBL2-G54D', 2),
 ('SERPINA1-E288V', 2),
 ('RPGRIP1L-A229T', 2),
 ('APC-Y486X', 2),
 ('AMPD1-Q12X', 2),
 ('PRF1-A91V', 2)]
In [436]:
both_test = []
both_freqs = get_freq(bboth, both_df2, both_test)

both_test
Out[436]:
[[('carrier (heterozygous)', 0.5555555555555556),
  ('homozygous', 0.4444444444444444)],
 [('carrier (heterozygous)', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)]]
In [437]:
total_both_set = []
for item in range(len(bboth)):
  gene = bboth[item][0]
  freq = bboth[item][1]
  ratios = both_test[item]
  total_both_set.append((gene, freq, ratios))

total_both_set # information for those with both forms of migraine
Out[437]:
[('MTRR-I49M',
  9,
  [('carrier (heterozygous)', 0.5555555555555556),
   ('homozygous', 0.4444444444444444)]),
 ('HFE-C282Y', 5, [('carrier (heterozygous)', 1.0)]),
 ('KRT86-E402Q', 4, [('homozygous', 1.0)]),
 ('COL4A1-Q1334H', 4, [('heterozygous', 1.0)]),
 ('APOA5-S19W', 3, [('heterozygous', 1.0)]),
 ('VWF-S1506L', 3, [('carrier (heterozygous)', 1.0)]),
 ('MBL2-R52C', 3, [('carrier (heterozygous)', 1.0)]),
 ('C3-R102G', 3, [('heterozygous', 1.0)]),
 ('SERPINA1-E366K', 2, [('carrier (heterozygous)', 1.0)]),
 ('KRT5-G138E', 2, [('heterozygous', 1.0)]),
 ('CYP21A2-Q319X', 2, [('carrier (heterozygous)', 1.0)]),
 ('MYH7-Q1334', 2, [('heterozygous', 1.0)]),
 ('MBL2-G54D', 2, [('carrier (heterozygous)', 1.0)]),
 ('SERPINA1-E288V', 2, [('carrier (heterozygous)', 1.0)]),
 ('RPGRIP1L-A229T', 2, [('heterozygous', 1.0)]),
 ('APC-Y486X', 2, [('heterozygous', 1.0)]),
 ('AMPD1-Q12X', 2, [('carrier (heterozygous)', 1.0)]),
 ('PRF1-A91V', 2, [('heterozygous', 1.0)])]
In [438]:
no_aura_df2.Gene.value_counts()
Out[438]:
MTRR-I49M         29
rs5186            23
COL4A1-Q1334H     19
NEFL-S472         12
C3-R102G          11
AMPD1-Q12X        10
APOE-C130R        10
MBL2-G54D          8
BEST1-S192         8
BEST1-Y245         8
KRT86-E402Q        6
APOA5-S19W         6
MBL2-R52C          4
DMD-E2910V         3
NOD2-R702W         3
CYP21A2-Q319X      3
PRPH-D141Y         3
RP1-T373I          3
MEFV-P369S         3
CFTR-W1204X        2
VWF-S1506L         2
WFS1-R456H         2
ARSA-T274M         2
KRT5-G138E         2
PMP22-T118M        2
HFE-C282Y          2
APC-Y486X          2
CBS-I278T          2
KRT14-C18X         2
PMM2-V129M         2
ABCC6-R1164X       2
BTD-D444H          2
VCL-M1073          2
DOK7-S45L          2
RPGRIP1L-A229T     2
CETP-A390P         2
SPG11-K1013E       2
RET-R231H          2
Name: Gene, dtype: int64
In [439]:
no_aura_count = no_aura_df2.Gene.value_counts()
no_aura_genes = []

noaur = gene_grab(no_aura_df2, no_aura_genes)
noaur
Out[439]:
[('MTRR-I49M', 29),
 ('rs5186', 23),
 ('COL4A1-Q1334H', 19),
 ('NEFL-S472', 12),
 ('C3-R102G', 11),
 ('AMPD1-Q12X', 10),
 ('APOE-C130R', 10),
 ('MBL2-G54D', 8),
 ('BEST1-S192', 8),
 ('BEST1-Y245', 8),
 ('KRT86-E402Q', 6),
 ('APOA5-S19W', 6),
 ('MBL2-R52C', 4),
 ('DMD-E2910V', 3),
 ('NOD2-R702W', 3),
 ('CYP21A2-Q319X', 3),
 ('PRPH-D141Y', 3),
 ('RP1-T373I', 3),
 ('MEFV-P369S', 3),
 ('CFTR-W1204X', 2),
 ('VWF-S1506L', 2),
 ('WFS1-R456H', 2),
 ('ARSA-T274M', 2),
 ('KRT5-G138E', 2),
 ('PMP22-T118M', 2),
 ('HFE-C282Y', 2),
 ('APC-Y486X', 2),
 ('CBS-I278T', 2),
 ('KRT14-C18X', 2),
 ('PMM2-V129M', 2),
 ('ABCC6-R1164X', 2),
 ('BTD-D444H', 2),
 ('VCL-M1073', 2),
 ('DOK7-S45L', 2),
 ('RPGRIP1L-A229T', 2),
 ('CETP-A390P', 2),
 ('SPG11-K1013E', 2),
 ('RET-R231H', 2)]
In [440]:
noaur_test = []
noaur_freqs = get_freq(noaur, no_aura_df2, noaur_test)

noaur_test
Out[440]:
[[('carrier (heterozygous)', 0.5517241379310345),
  ('homozygous', 0.4482758620689655)],
 [('heterozygous', 0.9130434782608695), ('homozygous', 0.08695652173913043)],
 [('heterozygous', 0.8947368421052632), ('homozygous', 0.10526315789473684)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 0.8), ('homozygous', 0.2)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('homozygous', 1.0)],
 [('homozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('homozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('homozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)],
 [('heterozygous', 1.0)],
 [('carrier (heterozygous)', 1.0)],
 [('heterozygous', 1.0)]]
In [441]:
# pair each (gene, count) tuple with its zygosity ratios, position by position
total_noaur_set = [(gene, freq, ratios) for (gene, freq), ratios in zip(noaur, noaur_test)]

total_noaur_set # total information for people with migraine without aura
Out[441]:
[('MTRR-I49M',
  29,
  [('carrier (heterozygous)', 0.5517241379310345),
   ('homozygous', 0.4482758620689655)]),
 ('rs5186',
  23,
  [('heterozygous', 0.9130434782608695), ('homozygous', 0.08695652173913043)]),
 ('COL4A1-Q1334H',
  19,
  [('heterozygous', 0.8947368421052632), ('homozygous', 0.10526315789473684)]),
 ('NEFL-S472', 12, [('homozygous', 1.0)]),
 ('C3-R102G', 11, [('heterozygous', 1.0)]),
 ('AMPD1-Q12X', 10, [('carrier (heterozygous)', 0.8), ('homozygous', 0.2)]),
 ('APOE-C130R', 10, [('heterozygous', 1.0)]),
 ('MBL2-G54D', 8, [('carrier (heterozygous)', 1.0)]),
 ('BEST1-S192', 8, [('homozygous', 1.0)]),
 ('BEST1-Y245', 8, [('homozygous', 1.0)]),
 ('KRT86-E402Q', 6, [('homozygous', 1.0)]),
 ('APOA5-S19W', 6, [('heterozygous', 1.0)]),
 ('MBL2-R52C', 4, [('carrier (heterozygous)', 1.0)]),
 ('DMD-E2910V', 3, [('heterozygous', 1.0)]),
 ('NOD2-R702W', 3, [('heterozygous', 1.0)]),
 ('CYP21A2-Q319X', 3, [('carrier (heterozygous)', 1.0)]),
 ('PRPH-D141Y', 3, [('carrier (heterozygous)', 1.0)]),
 ('RP1-T373I', 3, [('carrier (heterozygous)', 1.0)]),
 ('MEFV-P369S', 3, [('carrier (heterozygous)', 1.0)]),
 ('CFTR-W1204X', 2, [('homozygous', 1.0)]),
 ('VWF-S1506L', 2, [('carrier (heterozygous)', 1.0)]),
 ('WFS1-R456H', 2, [('heterozygous', 1.0)]),
 ('ARSA-T274M', 2, [('carrier (heterozygous)', 1.0)]),
 ('KRT5-G138E', 2, [('heterozygous', 1.0)]),
 ('PMP22-T118M', 2, [('heterozygous', 1.0)]),
 ('HFE-C282Y', 2, [('carrier (heterozygous)', 1.0)]),
 ('APC-Y486X', 2, [('heterozygous', 1.0)]),
 ('CBS-I278T', 2, [('carrier (heterozygous)', 1.0)]),
 ('KRT14-C18X', 2, [('heterozygous', 1.0)]),
 ('PMM2-V129M', 2, [('homozygous', 1.0)]),
 ('ABCC6-R1164X', 2, [('heterozygous', 1.0)]),
 ('BTD-D444H', 2, [('carrier (heterozygous)', 1.0)]),
 ('VCL-M1073', 2, [('heterozygous', 1.0)]),
 ('DOK7-S45L', 2, [('carrier (heterozygous)', 1.0)]),
 ('RPGRIP1L-A229T', 2, [('heterozygous', 1.0)]),
 ('CETP-A390P', 2, [('heterozygous', 1.0)]),
 ('SPG11-K1013E', 2, [('carrier (heterozygous)', 1.0)]),
 ('RET-R231H', 2, [('heterozygous', 1.0)])]

After this, each of the four lists was converted into its own DataFrame.

In [442]:
total_none_set
tot_none_df2 = pd.DataFrame(total_none_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_none_df2

total_aura_set
tot_aura_df2 = pd.DataFrame(total_aura_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_aura_df2

total_both_set
tot_both_df2 = pd.DataFrame(total_both_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_both_df2

total_noaur_set
tot_noaur_df2 = pd.DataFrame(total_noaur_set, columns = ['Gene', 'Frequency', 'Alleles'])
tot_noaur_df2

# all the lists being converted to individual dataframes
Out[442]:
Gene Frequency Alleles
0 MTRR-I49M 27 [(carrier (heterozygous), 0.6296296296296297),...
1 NEFL-S472 27 [(homozygous, 1.0)]
2 COL4A1-Q1334H 17 [(heterozygous, 0.7058823529411765), (homozygo...
3 C3-R102G 13 [(heterozygous, 1.0)]
4 rs5186 13 [(heterozygous, 0.8461538461538461), (homozygo...
5 APOE-C130R 8 [(heterozygous, 1.0)]
6 CBS-I278T 7 [(carrier (heterozygous), 1.0)]
7 MBL2-G54D 6 [(carrier (heterozygous), 1.0)]
8 MBL2-R52C 6 [(carrier (heterozygous), 1.0)]
9 NPC1-W1122 4 [(heterozygous, 1.0)]
10 SERPINA1-E288V 4 [(carrier (heterozygous), 1.0)]
11 SYNE1-N1915 4 [(heterozygous, 1.0)]
12 AMPD1-Q12X 4 [(carrier (heterozygous), 1.0)]
13 APOA5-S19W 4 [(heterozygous, 1.0)]
14 HABP2-G534E 3 [(heterozygous, 1.0)]
15 PAX2-Y273 3 [(heterozygous, 1.0)]
16 HFE-C282Y 3 [(carrier (heterozygous), 1.0)]
17 TGM1-E520G 3 [(carrier (heterozygous), 1.0)]
18 TTN-E190 3 [(heterozygous, 1.0)]
19 KRT5-G138E 3 [(heterozygous, 1.0)]
20 NOD2-R702W 3 [(heterozygous, 1.0)]
21 ACAD8-S171C 3 [(carrier (heterozygous), 1.0)]
22 CREBBP-P1878 2 [(heterozygous, 1.0)]
23 PRPH-D141Y 2 [(carrier (heterozygous), 1.0)]
24 SERPINA1-E366K 2 [(carrier (heterozygous), 1.0)]
25 RYR1-P2002 2 [(heterozygous, 1.0)]
26 HPS6-A597 2 [(heterozygous, 1.0)]
27 SNCA-Y39 2 [(heterozygous, 1.0)]
28 PHKB-M185I 2 [(carrier (heterozygous), 1.0)]
29 LPL-N318S 2 [(heterozygous, 1.0)]
30 ALG3-F200 2 [(heterozygous, 1.0)]
31 PKP2-S140F 2 [(heterozygous, 1.0)]
32 MSR1-R293X 2 [(heterozygous, 1.0)]
33 SPG11-K1013E 2 [(carrier (heterozygous), 1.0)]
34 CETP-A390P 2 [(heterozygous, 1.0)]
35 THBD-A43T 2 [(heterozygous, 1.0)]
36 CD40LG-G219R 2 [(carrier (heterozygous), 1.0)]
37 PEX26-L153V 2 [(carrier (heterozygous), 1.0)]
38 WFS1-R456H 2 [(heterozygous, 1.0)]
39 SNCA-A69 2 [(heterozygous, 1.0)]
In [446]:
# merging into one large dataframe

none_and_aura = tot_none_df2.merge(tot_aura_df2, on="Gene", how="outer")
none_and_aura.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura"]
none_and_aura

none_aura_both = none_and_aura.merge(tot_both_df2, on="Gene", how="outer")
none_aura_both.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura", "Freq: Both", "Alleles: Both"]
none_aura_both

all_types_df2 = none_aura_both.merge(tot_noaur_df2, on="Gene", how="outer")
all_types_df2.columns = ["Gene","Freq: None", "Alleles: none", "Freq: Aura", "Alleles: Aura", "Freq: Both", "Alleles: Both", "Freq: No Aura", "Alleles: No aura"]
all_types_df2
Out[446]:
Gene Freq: None Alleles: none Freq: Aura Alleles: Aura
0 MTRR-I49M 27.0 [(carrier (heterozygous), 0.6296296296296297),... 35.0 [(carrier (heterozygous), 0.7428571428571429),...
1 NEFL-S472 27.0 [(homozygous, 1.0)] 7.0 [(homozygous, 1.0)]
2 COL4A1-Q1334H 17.0 [(heterozygous, 0.7058823529411765), (homozygo... 30.0 [(heterozygous, 0.6333333333333333), (homozygo...
3 C3-R102G 13.0 [(heterozygous, 1.0)] 13.0 [(heterozygous, 0.8461538461538461), (homozygo...
4 rs5186 13.0 [(heterozygous, 0.8461538461538461), (homozygo... 21.0 [(heterozygous, 0.7142857142857143), (homozygo...
... ... ... ... ... ...
57 MFN2-Q276R NaN NaN 2.0 [(heterozygous, 1.0)]
58 COL9A3-R103W NaN NaN 2.0 [(heterozygous, 1.0)]
59 PRF1-A91V NaN NaN 2.0 [(heterozygous, 1.0)]
60 ABCA4-G863A NaN NaN 2.0 [(carrier (heterozygous), 1.0)]
61 ABCA4-A1038V NaN NaN 2.0 [(heterozygous, 1.0)]

62 rows × 5 columns
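
The three-step merge above renames columns by position after each join. As a hedged alternative sketch (the helper name `label_columns` and the toy frames are ours, not part of the notebook), each frame's columns can be labeled before merging, so the outer join on `Gene` produces the final column names directly:

```python
from functools import reduce

import pandas as pd

def label_columns(df, group):
    """Rename Frequency/Alleles so each group keeps its own columns."""
    return df.rename(columns={"Frequency": f"Freq: {group}",
                              "Alleles": f"Alleles: {group}"})

# Toy stand-ins for tot_none_df2 and tot_aura_df2 (values are illustrative only).
toy_none = pd.DataFrame({"Gene": ["G1", "G2"], "Frequency": [3, 2],
                         "Alleles": [[("homozygous", 1.0)], [("heterozygous", 1.0)]]})
toy_aura = pd.DataFrame({"Gene": ["G2", "G3"], "Frequency": [5, 4],
                         "Alleles": [[("heterozygous", 1.0)], [("homozygous", 1.0)]]})

frames = [label_columns(toy_none, "None"), label_columns(toy_aura, "Aura")]

# An outer join keeps genes seen in only one group; their missing counts become NaN.
merged = reduce(lambda left, right: left.merge(right, on="Gene", how="outer"), frames)

# Fill only the frequency columns, leaving absent allele lists as NaN.
freq_cols = [c for c in merged.columns if c.startswith("Freq: ")]
merged[freq_cols] = merged[freq_cols].fillna(0.0)
print(merged)
```

Labeling before the merge means a change in merge order cannot silently attach the wrong name to a column.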

In [448]:
#filling in the NaN values
all_types_df2['Freq: Aura'] = all_types_df2['Freq: Aura'].fillna(0.0)
all_types_df2['Freq: None'] = all_types_df2['Freq: None'].fillna(0.0)
all_types_df2['Freq: No Aura'] = all_types_df2['Freq: No Aura'].fillna(0.0)
all_types_df2['Freq: Both'] = all_types_df2['Freq: Both'].fillna(0.0)
all_types_df2
Out[448]:
Gene Freq: None Alleles: none Freq: Aura Alleles: Aura Freq: Both Alleles: Both Freq: No Aura Alleles: No aura
0 MTRR-I49M 27.0 [(carrier (heterozygous), 0.6296296296296297),... 35.0 [(carrier (heterozygous), 0.7428571428571429),... 9.0 [(carrier (heterozygous), 0.5555555555555556),... 29.0 [(carrier (heterozygous), 0.5517241379310345),...
1 NEFL-S472 27.0 [(homozygous, 1.0)] 7.0 [(homozygous, 1.0)] 0.0 NaN 12.0 [(homozygous, 1.0)]
2 COL4A1-Q1334H 17.0 [(heterozygous, 0.7058823529411765), (homozygo... 30.0 [(heterozygous, 0.6333333333333333), (homozygo... 4.0 [(heterozygous, 1.0)] 19.0 [(heterozygous, 0.8947368421052632), (homozygo...
3 C3-R102G 13.0 [(heterozygous, 1.0)] 13.0 [(heterozygous, 0.8461538461538461), (homozygo... 3.0 [(heterozygous, 1.0)] 11.0 [(heterozygous, 1.0)]
4 rs5186 13.0 [(heterozygous, 0.8461538461538461), (homozygo... 21.0 [(heterozygous, 0.7142857142857143), (homozygo... 0.0 NaN 23.0 [(heterozygous, 0.9130434782608695), (homozygo...
... ... ... ... ... ... ... ... ... ...
70 PMP22-T118M 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(heterozygous, 1.0)]
71 PMM2-V129M 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(homozygous, 1.0)]
72 ABCC6-R1164X 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(heterozygous, 1.0)]
73 VCL-M1073 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(heterozygous, 1.0)]
74 DOK7-S45L 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(carrier (heterozygous), 1.0)]

75 rows × 9 columns

Then, several calculations were applied to the frequencies. We considered the proportion of each individual subset carrying a gene, as well as the proportion across all populations combined.
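
As a worked example of that arithmetic, the sketch below recomputes the proportions for `rs5186` from its per-group counts in the merged table. The group sizes (40, 49, 38, 10) are the denominators used in the notebook's ratio cell, which we read as the number of participants per group; the dictionary names are ours.

```python
# Denominators from the ratio cell, read here as participants per group.
group_sizes = {"None": 40, "Aura": 49, "No Aura": 38, "Both": 10}

# Counts of rs5186 per group, taken from the merged table.
rs5186_counts = {"None": 13, "Aura": 21, "No Aura": 23, "Both": 0}

# Proportion of each group carrying the variant.
ratios = {g: rs5186_counts[g] / group_sizes[g] for g in group_sizes}

# Pooled proportions: everyone, and the migraine groups only.
migraine_groups = ["Aura", "No Aura", "Both"]
all_ratio = sum(rs5186_counts.values()) / sum(group_sizes.values())
mig_ratio = (sum(rs5186_counts[g] for g in migraine_groups)
             / sum(group_sizes[g] for g in migraine_groups))

print({g: round(r, 6) for g, r in ratios.items()})
print(round(all_ratio, 6), round(mig_ratio, 6))
```

Keeping the group sizes in one dictionary avoids repeating hard-coded denominators and makes the pooled totals (137 and 97) derived values rather than separate constants.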

In [449]:
all_freq = ['Freq: None', "Freq: Both", "Freq: Aura", "Freq: No Aura"]
all_types_df2['Total'] = all_types_df2[all_freq].sum(axis=1)

all_types_df2
Out[449]:
Gene Freq: None Alleles: none Freq: Aura Alleles: Aura Freq: Both Alleles: Both Freq: No Aura Alleles: No aura Total
0 MTRR-I49M 27.0 [(carrier (heterozygous), 0.6296296296296297),... 35.0 [(carrier (heterozygous), 0.7428571428571429),... 9.0 [(carrier (heterozygous), 0.5555555555555556),... 29.0 [(carrier (heterozygous), 0.5517241379310345),... 100.0
1 NEFL-S472 27.0 [(homozygous, 1.0)] 7.0 [(homozygous, 1.0)] 0.0 NaN 12.0 [(homozygous, 1.0)] 46.0
2 COL4A1-Q1334H 17.0 [(heterozygous, 0.7058823529411765), (homozygo... 30.0 [(heterozygous, 0.6333333333333333), (homozygo... 4.0 [(heterozygous, 1.0)] 19.0 [(heterozygous, 0.8947368421052632), (homozygo... 70.0
3 C3-R102G 13.0 [(heterozygous, 1.0)] 13.0 [(heterozygous, 0.8461538461538461), (homozygo... 3.0 [(heterozygous, 1.0)] 11.0 [(heterozygous, 1.0)] 40.0
4 rs5186 13.0 [(heterozygous, 0.8461538461538461), (homozygo... 21.0 [(heterozygous, 0.7142857142857143), (homozygo... 0.0 NaN 23.0 [(heterozygous, 0.9130434782608695), (homozygo... 57.0
... ... ... ... ... ... ... ... ... ... ...
70 PMP22-T118M 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(heterozygous, 1.0)] 2.0
71 PMM2-V129M 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(homozygous, 1.0)] 2.0
72 ABCC6-R1164X 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(heterozygous, 1.0)] 2.0
73 VCL-M1073 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(heterozygous, 1.0)] 2.0
74 DOK7-S45L 0.0 NaN 0.0 NaN 0.0 NaN 2.0 [(carrier (heterozygous), 1.0)] 2.0

75 rows × 10 columns

In [450]:
# denominators are the group sizes: 40 control, 49 with aura, 38 without aura,
# 10 with both, 137 participants overall, and 97 with any type of migraine
all_types_df2['None Ratio'] = all_types_df2["Freq: None"] / 40
all_types_df2['Aura Ratio'] = all_types_df2["Freq: Aura"] / 49
all_types_df2['No Aura Ratio'] = all_types_df2["Freq: No Aura"] / 38
all_types_df2['Both Ratio'] = all_types_df2["Freq: Both"] / 10
all_types_df2['All Ratio'] = all_types_df2["Total"] / 137
all_types_df2['All Mig Ratio'] = (all_types_df2["Freq: Aura"] + all_types_df2["Freq: No Aura"] + all_types_df2["Freq: Both"]) / 97

all_types_df2

# manually dropping rows (by index label) that did not have enough information
moddf = all_types_df2.drop([74, 73, 72, 71, 70, 69, 68, 67, 66, 65, 64, 63, 62, 61, 60, 58, 57, 56, 0, 2, 3, 5, 6,9,1,8, 10, 11, 15, 17, 18, 22, 21, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32 ])
moddf2 = moddf.drop([35, 36, 37, 38, 39, 55,20, 14,33, 34,13,  53,  16, 19, 47, 48, 49, 50, 51, 52, 59 ])
moddf2
Out[450]:
Gene Freq: None Alleles: none Freq: Aura Alleles: Aura Freq: Both Alleles: Both Freq: No Aura Alleles: No aura Total None Ratio Aura Ratio No Aura Ratio Both Ratio All Ratio All Mig Ratio
4 rs5186 13.0 [(heterozygous, 0.8461538461538461), (homozygo... 21.0 [(heterozygous, 0.7142857142857143), (homozygo... 0.0 NaN 23.0 [(heterozygous, 0.9130434782608695), (homozygo... 57.0 0.325 0.428571 0.605263 0.0 0.416058 0.453608
7 MBL2-G54D 6.0 [(carrier (heterozygous), 1.0)] 9.0 [(carrier (heterozygous), 1.0)] 2.0 [(carrier (heterozygous), 1.0)] 8.0 [(carrier (heterozygous), 1.0)] 25.0 0.150 0.183673 0.210526 0.2 0.182482 0.195876
12 AMPD1-Q12X 4.0 [(carrier (heterozygous), 1.0)] 10.0 [(carrier (heterozygous), 1.0)] 2.0 [(carrier (heterozygous), 1.0)] 10.0 [(carrier (heterozygous), 0.8), (homozygous, 0... 26.0 0.100 0.204082 0.263158 0.2 0.189781 0.226804
40 BEST1-S192 0.0 NaN 12.0 [(homozygous, 1.0)] 0.0 NaN 8.0 [(homozygous, 1.0)] 20.0 0.000 0.244898 0.210526 0.0 0.145985 0.206186
41 BEST1-Y245 0.0 NaN 12.0 [(homozygous, 1.0)] 0.0 NaN 8.0 [(homozygous, 1.0)] 20.0 0.000 0.244898 0.210526 0.0 0.145985 0.206186
42 KRT86-E402Q 0.0 NaN 5.0 [(homozygous, 1.0)] 4.0 [(homozygous, 1.0)] 6.0 [(homozygous, 1.0)] 15.0 0.000 0.102041 0.157895 0.4 0.109489 0.154639
43 PIGR-A580V 0.0 NaN 5.0 [(heterozygous, 1.0)] 0.0 NaN 0.0 NaN 5.0 0.000 0.102041 0.000000 0.0 0.036496 0.051546
44 BTD-D444H 0.0 NaN 5.0 [(carrier (heterozygous), 1.0)] 0.0 NaN 2.0 [(carrier (heterozygous), 1.0)] 7.0 0.000 0.102041 0.052632 0.0 0.051095 0.072165
45 VWF-S1506L 0.0 NaN 3.0 [(carrier (heterozygous), 1.0)] 3.0 [(carrier (heterozygous), 1.0)] 2.0 [(carrier (heterozygous), 1.0)] 8.0 0.000 0.061224 0.052632 0.3 0.058394 0.082474
46 RET-R231H 0.0 NaN 3.0 [(heterozygous, 1.0)] 0.0 NaN 2.0 [(heterozygous, 1.0)] 5.0 0.000 0.061224 0.052632 0.0 0.036496 0.051546
54 RPGRIP1L-A229T 0.0 NaN 2.0 [(heterozygous, 1.0)] 2.0 [(heterozygous, 1.0)] 2.0 [(heterozygous, 1.0)] 6.0 0.000 0.040816 0.052632 0.2 0.043796 0.061856

Note: To create a graphable DataFrame, we dropped genes with an overall frequency of less than 5, as well as genes with low frequencies that were distributed similarly in the control and migraine populations.

We dropped these rows manually, which is why so many index labels are listed in the cell above. If we were to do this project again, we would likely develop a more systematic methodology.
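
One hedged sketch of such a methodology is to filter rows with a boolean mask instead of listing index labels. The thresholds (`min_total`, `min_gap`) and the helper name `select_genes` are hypothetical choices of ours, and this rule would not necessarily reproduce the exact manual selection above.

```python
import pandas as pd

def select_genes(df, min_total=5, min_gap=0.03):
    """Keep genes seen at least min_total times whose control and
    migraine proportions differ by at least min_gap."""
    enough_data = df["Total"] >= min_total
    gap = (df["All Mig Ratio"] - df["None Ratio"]).abs()
    return df[enough_data & (gap >= min_gap)]

# Toy rows mimicking the columns of all_types_df2 (values are illustrative).
toy = pd.DataFrame({
    "Gene": ["A", "B", "C"],
    "Total": [57.0, 2.0, 40.0],
    "None Ratio": [0.325, 0.0, 0.325],
    "All Mig Ratio": [0.454, 0.021, 0.330],
})

kept = select_genes(toy)
print(kept["Gene"].tolist())
```

Here gene `B` is dropped for having too few occurrences and gene `C` for having nearly identical proportions in both populations, so only `A` remains.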

In [451]:
import sklearn
assert sklearn.__version__ >= "0.20"

import numpy as np
import os

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)


moddf2.plot.bar(x='Gene', y=['None Ratio', 'All Mig Ratio'], color=[ 'royalblue', 'aqua'], ylabel='Proportion of Individuals', title='Gene Proportions of Migraine Havers vs Control', width=0.8, figsize=(12,6)).grid()
#sns.catplot(data=moddf2, x="Gene", y=["None Ratio", "All Mig Ratio"])

Within this filtered set of alleles, a number of genes are present in the migraine populations but not in the control group. Each gene shown has at least 5 occurrences across all groups combined. These variants could possibly relate to migraines, or may be indicative of a scientific process outside the scope of our EDA.

Part 3: Modeling

For the project, we decided to create two different models, each built around one question:

  1. Can we predict whether someone has a specific comorbidity (i.e. a condition in one of the 11 subsets) based on whether or not they have migraines?
  2. Can we predict whether someone has migraines using genomic data?

Comorbidity Based

In [452]:
all_features = ['Has Migraines', 'Has Migraines with Aura',
       'Has Migraines without Aura','Height (in)', 'Weight (lbs)', 'Blood Type_A +', 'Blood Type_A -',
       'Blood Type_AB +', 'Blood Type_AB -', 'Blood Type_B +',
       'Blood Type_B -', 'Blood Type_Don\'t know', 'Blood Type_O +',
       'Blood Type_O -', 'Sex/Gender_Female',
       'Sex/Gender_Male','Race/ethnicity_American Indian', 'Race/ethnicity_Asian',
       'Race/ethnicity_Black or African', 'Race/ethnicity_Hispanic or Latino',
       'Race/ethnicity_White'] # features to test
In [453]:
#knn for system that is seemingly highly correlated with having migraines (nervous) with and without migs
def knn_nerv(k): 
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features]
    y_train = all['Has Nervous System Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_nerv_no_mig(k):
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features[3:]]
    y_train = all['Has Nervous System Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
In [454]:
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
nerv = ks.apply(knn_nerv)
nerv_no_mig = ks.apply(knn_nerv_no_mig)
plt.plot(nerv,label='Nervous System Conditions')
plt.plot(nerv_no_mig,label='Nervous System Conditions no mig')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.legend()
# as you can see migraines are correlated to prediction accuracy for nervous system conditions
Out[454]:
<matplotlib.legend.Legend at 0x7f9f922499a0>

This model indicates that knowing whether someone has migraines increases the accuracy of predicting whether they have a nervous system condition (other than migraines). This suggests that there is likely a statistical relationship between migraines and comorbid neurological syndromes.
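
The with/without comparison above can also be written as one helper that takes the feature list as an argument. The sketch below uses synthetic data (the column names `flag` and `noise` and the helper `knn_cv_accuracy` are hypothetical), so the printed accuracies say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def knn_cv_accuracy(df, features, target, k):
    """Mean 5-fold accuracy of a scaled k-NN classifier on the given features."""
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=k)),
    ])
    return cross_val_score(pipeline, df[features], df[target],
                           cv=5, scoring="accuracy").mean()

# Synthetic data: `flag` tracks the target closely, `noise` is unrelated to it.
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=200)
flip = rng.random(200) < 0.1           # about 10% of flags disagree with the target
toy = pd.DataFrame({
    "flag": np.where(flip, 1 - target, target),
    "noise": rng.normal(size=200),
    "target": target,
})

with_flag = knn_cv_accuracy(toy, ["flag", "noise"], "target", k=15)
without_flag = knn_cv_accuracy(toy, ["noise"], "target", k=15)
print(f"with flag: {with_flag:.2f}, without flag: {without_flag:.2f}")
```

Because the toy features are already numeric, this sketch skips the `DictVectorizer` step used in the notebook; the comparison logic is otherwise the same.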

In [455]:
#knn for system that is seemingly not highly correlated with having migraines (digestive) with and without migs
def knn_dig(k):
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features]
    y_train = all['Has Digestive Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_dig_no_mig(k):
    model = KNeighborsClassifier(n_neighbors = k)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
 
    X_train = all[all_features[3:]]
    y_train = all['Has Digestive Conditions']
    X_train=X_train.to_dict(orient="records")
    
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
In [456]:
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
dig = ks.apply(knn_dig)
dig_no_mig = ks.apply(knn_dig_no_mig)
plt.plot(dig,label='Digestive System Conditions')
plt.plot(dig_no_mig,label='Digestive System Conditions no mig')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.legend()
# as you can see removing migraines does not have a significant effect on accuracy for predicting digestive system conditions
Out[456]:
<matplotlib.legend.Legend at 0x7f9f92247100>

In this model, knowing whether someone has migraines does not help predict whether they have a digestive system condition: removing the migraine features has no significant effect on accuracy. This suggests a lack of relationship between migraines and digestive comorbidities in this dataset, and the same is likely true for some other conditions.

Gene Based

We started by rereading the alleles dataframe to ensure there were no issues with the dataset

In [457]:
alleles = pd.read_excel('vari.xlsx') # genetic data
In [458]:
alleles = alleles.drop('Unnamed: 7',axis=1) # dropping unnecessary data
In [459]:
alleles['Has Migraine'] = alleles['Type of Migraine'] != 'None' # getting binary migraine columns
alleles['Has Migraine'] = alleles['Has Migraine'].map({True: 1, False: 0})
In [460]:
alleles['Has Migraine with aura'] = alleles['Type of Migraine'] == 'Mig with aura'
alleles['Has Migraine with aura'] = alleles['Has Migraine with aura'].map({True: 1, False: 0})
alleles['Has Migraine without aura'] = alleles['Type of Migraine'] == 'Mig no aura'
alleles['Has Migraine without aura'] = alleles['Has Migraine without aura'].map({True: 1, False: 0})
In [461]:
alleles
Out[461]:
Participant Type of Migraine Gene Recessive or dominant Homo/heterozyg Mutation or variant? Disease capacity Has Migraine Has Migraine with aura Has Migraine without aura
0 hu620F18 Mig no aura CBS-I278T Recessive Carrier (Heterozygous) Mutation Likely pathogenic 1 0 1
1 hu620F18 Mig no aura C3-R102G Complex/Other Heterozygous Variant Likely pathogenic 1 0 1
2 hu620F18 Mig no aura COL4A1-Q1334H Dominant Heterozygous Variant Likely pathogenic 1 0 1
3 hu620F18 Mig no aura MTRR-I49M Recessive Carrier (Heterozygous) Variant Likely pathogenic 1 0 1
4 hu620F18 Mig no aura rs5186 Unknown Heterozygous Variant Likely pathogenic 1 0 1
... ... ... ... ... ... ... ... ... ... ...
1256 hu05FD49 None RPE65-N356 / heterozygous Mutation - Frameshift / 0 0 0
1257 hu05FD49 None PKD1-R2430 / heterozygous Mutation - nonsense / 0 0 0
1258 hu05FD49 None FLG-R3879 / heterozygous Mutation - nonsense / 0 0 0
1259 hu05FD49 None SBF2-H1549 / heterozygous Mutation - nonsense / 0 0 0
1260 hu05FD49 None NF2-K523 / heterozygous Mutation - Frameshift / 0 0 0

1261 rows × 10 columns

In [462]:
feat = ['Gene','Recessive or dominant','Homo/heterozyg','Mutation or variant?','Disease capacity'] # test cols
In [463]:
# testing which features are most relevant to accurately predicting whether someone has migraines
def knn_full(n):
  model = KNeighborsClassifier(n_neighbors=n)
  scaler=StandardScaler()
  vec = DictVectorizer(sparse=False)
  pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
  X_train = alleles[feat]
  y_train = alleles['Has Migraine']
  X_train=X_train.to_dict(orient="records")
  return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
In [464]:
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines Per Missing Feature')
plt.legend()
# it seems that removing homo/hetero noticeably reduces the model's ability to predict whether someone has migraines
# the most accurate version disregards gene names but keeps the other features
Out[464]:
<matplotlib.legend.Legend at 0x7f9f92cecf70>

This model used genomic information (the feature columns of the alleles dataframe) to predict whether someone had migraines or not. It appears that removing the gene names themselves ("no genes") actually improves accuracy. However, removing whether a gene is homozygous or heterozygous decreases accuracy, which suggests that zygosity has a stronger statistical relationship with the ability to predict migraines.
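
The six functions above differ only in which column is left out. A hedged sketch of the same ablation as a single loop follows; `ablation_scores` is a hypothetical helper, and the data frame is synthetic with column names invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def ablation_scores(df, features, target, k):
    """Mean 5-fold accuracy with all features, then with each one removed."""
    def score(cols):
        pipeline = Pipeline([
            ("vec", DictVectorizer(sparse=False)),   # one-hot encodes string values
            ("scaler", StandardScaler()),
            ("model", KNeighborsClassifier(n_neighbors=k)),
        ])
        records = df[cols].to_dict(orient="records")
        return cross_val_score(pipeline, records, df[target],
                               cv=5, scoring="accuracy").mean()

    scores = {"full features": score(features)}
    for dropped in features:
        kept = [f for f in features if f != dropped]
        scores[f"no {dropped}"] = score(kept)
    return scores

# Synthetic categorical data; only `zygosity` carries signal here.
rng = np.random.default_rng(1)
label = rng.integers(0, 2, size=150)
toy = pd.DataFrame({
    "zygosity": np.where(label == 1, "homozygous", "heterozygous"),
    "kind": rng.choice(["Mutation", "Variant"], size=150),
    "label": label,
})

for name, acc in ablation_scores(toy, ["zygosity", "kind"], "label", k=5).items():
    print(f"{name}: {acc:.2f}")
```

Passing the target column as an argument would also let the same helper serve the with-aura and without-aura models, rather than redefining functions under the same names.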

In [465]:
# testing which features are most relevant to accurately predicting whether someone has migraines with aura

def knn_full(n):
  model = KNeighborsClassifier(n_neighbors=n)
  scaler=StandardScaler()
  vec = DictVectorizer(sparse=False)
  pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
  X_train = alleles[feat]
  y_train = alleles['Has Migraine with aura']
  X_train=X_train.to_dict(orient="records")
  return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_gene(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[feat[1:]]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_rec_dom(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_homo_hetero(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Recessive or dominant','Mutation or variant?','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_mutate(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
def knn_no_disease(n):
    model = KNeighborsClassifier(n_neighbors=n)
    scaler=StandardScaler()
    vec = DictVectorizer(sparse=False)
    pipeline = Pipeline([
        ("vec",vec),
        ("scaler", scaler),
        ("model", model)
    ])
    X_train = alleles[['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant']]
    y_train = alleles['Has Migraine with aura']
    X_train=X_train.to_dict(orient="records")
    return (cross_val_score(pipeline, X_train, y_train, cv=5,scoring="accuracy").mean())
In [466]:
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines with Aura Per Missing Feature')
plt.legend()
# Genes and recessive/ dominant seem most important to predicting this type of migraine.
Out[466]:
<matplotlib.legend.Legend at 0x7f9f942b3eb0>

When specifically predicting whether someone has migraines with aura, accuracy appears to increase with the number of neighbors. In this case, removing genes led to the largest decrease in accuracy. This is consistent with the chart from the EDA section highlighting gene variants found only in participants with migraines; however, there are a number of other variables that could affect this conclusion.

In [467]:
# Testing which features are most relevant to accurately predicting whether someone has migraines without aura.
# Each function below leaves out one feature and returns the mean 5-fold cross-validated accuracy
# of a k-nearest-neighbors classifier with n neighbors.

def knn_cv_accuracy(n, features):
    # Shared helper: vectorize the feature dicts (string values are one-hot encoded),
    # standardize the resulting columns, then classify with KNN
    pipeline = Pipeline([
        ("vec", DictVectorizer(sparse=False)),
        ("scaler", StandardScaler()),
        ("model", KNeighborsClassifier(n_neighbors=n))
    ])
    X_train = alleles[features].to_dict(orient="records")
    y_train = alleles['Has Migraine without aura']
    return cross_val_score(pipeline, X_train, y_train, cv=5, scoring="accuracy").mean()

def knn_full(n):
    # All features in feat
    return knn_cv_accuracy(n, feat)

def knn_no_gene(n):
    # feat[1:] drops the first entry of feat (the gene column, going by this function's name)
    return knn_cv_accuracy(n, feat[1:])

def knn_no_rec_dom(n):
    return knn_cv_accuracy(n, ['Gene','Homo/heterozyg','Mutation or variant?','Disease capacity'])

def knn_no_homo_hetero(n):
    return knn_cv_accuracy(n, ['Gene','Recessive or dominant','Mutation or variant?','Disease capacity'])

def knn_no_mutate(n):
    return knn_cv_accuracy(n, ['Gene','Homo/heterozyg','Recessive or dominant','Disease capacity'])

def knn_no_disease(n):
    return knn_cv_accuracy(n, ['Gene','Homo/heterozyg','Mutation or variant?','Recessive or dominant'])
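The ablation functions above differ only in which feature columns they pass to the model. As a minimal sketch, the leave-one-out column lists could also be generated programmatically instead of typed by hand. The list `feat_assumed` and the helper `leave_one_out_subsets` are hypothetical names for illustration; the notebook's actual `feat` is defined in an earlier cell, and its column order is assumed here.

```python
# Assumed column list: the notebook's real `feat` lives in an earlier cell.
# Only its first entry ('Gene') is implied by knn_no_gene's use of feat[1:].
feat_assumed = ['Gene', 'Homo/heterozyg', 'Recessive or dominant',
                'Mutation or variant?', 'Disease capacity']

def leave_one_out_subsets(features):
    """Map each feature name to the feature list with that one feature removed."""
    return {dropped: [f for f in features if f != dropped] for dropped in features}

subsets = leave_one_out_subsets(feat_assumed)
for dropped, kept in subsets.items():
    print(f"without {dropped!r}: {kept}")
```

Each list in `subsets` could then be passed to a single scoring function, so adding a new feature would not require writing another near-identical function.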
In [468]:
ks = pd.Series(range(1, 51))
ks.index = range(1, 51)
full = ks.apply(knn_full)
no_gene = ks.apply(knn_no_gene)
no_rec_dom = ks.apply(knn_no_rec_dom)
no_homo_hetero = ks.apply(knn_no_homo_hetero)
no_mutate = ks.apply(knn_no_mutate)
no_disease = ks.apply(knn_no_disease)
plt.plot(full,label = 'full features')
plt.plot(no_gene,label = 'no genes')
plt.plot(no_rec_dom,label = 'no recessive/dom')
plt.plot(no_homo_hetero,label = 'no homo/het')
plt.plot(no_mutate,label = 'no mutations')
plt.plot(no_disease,label = 'no disease capacity')
plt.xlabel('n_neighbors')
plt.ylabel('accuracy')
plt.title('Accuracy of Predicting Migraines without aura Per Missing Feature')
plt.legend()
# Harder to interpret, seems that homo/hetero has much smaller effect on Migraines without aura
Out[468]:
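[Figure: line plot of accuracy versus n_neighbors, one line per feature set, titled "Accuracy of Predicting Migraines without aura Per Missing Feature"]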

For predicting migraines without aura, removing the mutation/variant feature appeared to decrease accuracy the most. The plot also shows accuracy increasing as the number of neighbors increases.
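When reading accuracy curves like these, it can help to compare them against the accuracy of always predicting the most common class, because KNN predictions lean toward the majority class as the number of neighbors grows. Whether that explains the trend seen here is not established by this notebook. The sketch below computes that baseline; `majority_class_accuracy` and `demo_labels` are hypothetical names, and the labels are made up for illustration rather than taken from our data.

```python
from collections import Counter

def majority_class_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)

# Hypothetical labels for illustration only (NOT the notebook's data):
# 7 participants without the condition, 3 with it
demo_labels = [0] * 7 + [1] * 3
baseline = majority_class_accuracy(demo_labels)
print(f"majority-class baseline accuracy: {baseline:.2f}")
```

In the notebook, the same function could presumably be applied to `alleles['Has Migraine without aura']`; a model whose cross-validated accuracy sits near that value is doing little better than guessing the most common class.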

Overall, our models were not especially informative. However, they suggest there could be a statistical relationship between certain comorbidities and specific genes. Additional data sources and further research would be needed to validate any of these claims.