Introduction (Slide 1)
-> The aim of the project was to build a logistic regression classifier
-> I used the logistic regression module in the Biopython library
   Biopython: a molecular biology library for the Python language
-> The classifier was used to classify gene pairs into two classes
-> What is an operon?

Operon (Slide 2)
-> Each operon has a promoter sequence and an operator sequence
-> The promoter sequence acts as a binding site for RNA polymerase
-> The RNA polymerase then moves along the operon; the operator sequence comes next and determines whether the genes that follow are activated or suppressed
-> The genes are then transcribed into an mRNA molecule, which is translated into the respective proteins. In this example, into these three enzymes.
-> The aim of the classification project is to identify whether a gene pair belongs to the same operon or not

Logistic Regression Model (Slide 3)
-> Two predictor variables are used
-> The first is the pairwise distance between the genes; the second is the gene similarity score of the gene pair
-> Distance is a good identifier because genes in the same operon tend to be close together, while genes in different operons tend to be separated by promoter and operator sequences
-> Genes in the same operon also tend to be more similar than gene pairs from different operons
-> These variables are used to compute the logit score, which has the following form

Training the model (Slide 4)
-> Since different organisms have different operon structures, the classifier focuses on B. subtilis genes
-> The dataset used to train the model was gathered from the Operon DataBase
-> To find the values of the beta coefficients in the previous equation, maximum likelihood estimation is used. This is done iteratively by the logistic regression routines.
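
Sketch for Slides 3 and 4: a minimal Python illustration of the logit score and of an iterative maximum likelihood fit. It assumes the standard two-predictor form logit = beta0 + beta1*distance + beta2*similarity; the actual equation is the one shown on the slide. Plain gradient ascent stands in here for whatever iteration the library routines use, and the toy gene pairs are made-up values for illustration only, not data from the Operon DataBase. All function and variable names are hypothetical.

```python
import math


def logit(beta, distance, similarity):
    """Assumed form: beta0 + beta1 * distance + beta2 * similarity."""
    b0, b1, b2 = beta
    return b0 + b1 * distance + b2 * similarity


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z):
    """log(1 + exp(z)) computed without overflow."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def log_likelihood(beta, pairs, labels):
    """Log-likelihood of the labels (1 = same operon, 0 = different operons)."""
    total = 0.0
    for (d, s), y in zip(pairs, labels):
        z = logit(beta, d, s)
        # log p = -softplus(-z) and log(1 - p) = -softplus(z)
        total += -softplus(-z) if y == 1 else -softplus(z)
    return total


def fit(pairs, labels, rate=0.1, steps=2000):
    """Gradient ascent on the average log-likelihood (a stand-in iteration)."""
    beta = [0.0, 0.0, 0.0]
    n = len(pairs)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for (d, s), y in zip(pairs, labels):
            err = y - sigmoid(logit(beta, d, s))
            grad[0] += err
            grad[1] += err * d
            grad[2] += err * s
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta


def classify(beta, distance, similarity):
    """Return 1 (same operon) if the predicted probability exceeds 0.5."""
    return 1 if sigmoid(logit(beta, distance, similarity)) > 0.5 else 0


# Made-up toy pairs: (distance in hundreds of base pairs, similarity score).
TOY_PAIRS = [
    (0.1, 0.8), (0.2, 0.6), (0.05, 0.9), (0.3, 0.4), (0.8, 0.1),
    (1.5, -0.2), (2.0, 0.1), (1.2, -0.5), (0.9, 0.0), (0.4, 0.2),
]
TOY_LABELS = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

beta_hat = fit(TOY_PAIRS, TOY_LABELS)
```

Calling classify(beta_hat, distance, similarity) then assigns a new gene pair to one of the two classes. The distance is rescaled to hundreds of base pairs in this sketch so that a fixed step size behaves well; that rescaling is a choice made for the example, not something taken from the slides.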

Model Accuracy (Slide 5)
-> Training the model on the gathered data resulted in the following equation, which the model uses
-> Two tests were set up for the model
-> The first is 10-fold cross-validation
-> The second is leave-one-out cross-validation, which is k-fold cross-validation with k equal to the size of the dataset

Results I (Slide 6)
-> Running the 10-fold cross-validation yielded 10 results. After removing the best and worst runs and averaging over the remaining 8, the rates were:
-> 19% Type I error = FP/(TN+FP)
-> Specificity = TN/(TN+FP)
-> Type II error = FN/(TP+FN)
-> Sensitivity = TP/(TP+FN)

Results II (Slide 7)
-> Accuracy = percentage of correct classifications; it shows that around 1 in 10 pairs were classified incorrectly
-> ...

Conclusions (Slide 8)
-> ...
-> ...
-> The gene expression score may be usable universally, but gene distance varies between genomes. The average operon length in B. subtilis is 2.4 genes; it can be different for other organisms.
-> Massive effort; claims to have gathered all published operon data within the database
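
Sketch for Slides 5 and 6: a minimal Python illustration of the evaluation steps, namely splitting the data into k folds (leave-one-out is the case k = dataset size), computing the four rates from confusion counts, and averaging after dropping the best and worst runs. Type II error is taken here as the false negative rate FN/(TP+FN), so that it equals 1 minus sensitivity. The function names and the interleaved fold assignment are assumptions made for the example, not the code used in the project.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; return a list of (train, test) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        splits.append((train, test))
    return splits


def rates(tp, fp, tn, fn):
    """Error rates from true/false positive and negative counts."""
    return {
        "type_I_error": fp / (tn + fp),   # false positive rate
        "specificity": tn / (tn + fp),    # 1 - Type I error
        "type_II_error": fn / (tp + fn),  # false negative rate
        "sensitivity": tp / (tp + fn),    # 1 - Type II error
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


def trimmed_mean(values):
    """Drop the lowest and highest value, then average the rest."""
    kept = sorted(values)[1:-1]
    return sum(kept) / len(kept)
```

For 10-fold cross-validation, each of the 10 (train, test) pairs from k_fold_indices(n, 10) gives one set of rates, and trimmed_mean is applied to the 10 values of each rate. Calling k_fold_indices(n, n) gives the leave-one-out splits.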