Homework

You are evaluating the performance of a deep learning model for binary classification of gene expression, where label 1 indicates that the gene is expressed and label 0 indicates that it is not. The model achieves a high AUROC (area under the ROC curve) of 0.93 but a low AUPR (area under the precision-recall curve) of 0.35. What could explain this observation?
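One way to explore this question empirically is to compute both metrics on simulated scores while varying the fraction of positive labels. The sketch below is only an illustration under stated assumptions: the function names, the Gaussian score model, and the chosen positive fractions are hypothetical and are not part of the assignment. AUPR is approximated here by average precision, and tied scores are not treated specially in that calculation.

```python
import random


def auroc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied scores so they share one average rank.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    true_pos = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


def simulate(n, pos_fraction, seed=0):
    """Hypothetical score model: positives ~ N(1.5, 1), negatives ~ N(0, 1)."""
    rng = random.Random(seed)
    labels, scores = [], []
    for _ in range(n):
        y = 1 if rng.random() < pos_fraction else 0
        labels.append(y)
        scores.append(rng.gauss(1.5 if y else 0.0, 1.0))
    return labels, scores


# Same score distributions, different class balance.
for frac in (0.5, 0.01):
    y, s = simulate(5000, frac)
    print(f"positive fraction {frac}: AUROC={auroc(y, s):.3f}, AP={average_precision(y, s):.3f}")
```

Comparing the two printed rows, and trying other values of `pos_fraction`, may help when reasoning about which of the two metrics depends on the number of negatives in the dataset.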


Homework Assignment #1
Instructor: Ritambhara Singh
HTA: Daniel Ben-Isvy
UTA: Giselle Garcia
Email: cs1850headtas@lists.brown.edu

About the assignment: This assignment has been designed to help you think critically about the topics we covered in the class, encourage you to look at the literature, and get familiar with applying deep learning frameworks on DNA sequences.
For the conceptual questions, we are looking for answers of at most 3 lines per point. So if a question is worth 2 points, try to answer it in at most 6 lines. When writing answers, first state the main idea that addresses the question and then expand on it. You may refer to the papers in the Reference section while answering these questions.
For the programming assignment, we will run your code to check that it gives the correct output. If the code does not run successfully, we will assign partial credit for correct logic in the implementation.
Attempting bonus questions or tasks is encouraged but not required.

The papers we covered (for example [1]) all utilize the first convolutional layer of their prediction networks as motif-finders. What properties of convolutional filters allow them to be used for this task?

Background: Transcription Factors (TFs) often bind to particular sequence patterns, known as motifs. Regulatory regions of DNA often have more than one binding site or motif for a particular TF clustered in a small region.

Dataset: The provided dataset was synthetically generated for the task of "homotypic motif density localization", that is, finding clusters of the same type of sequence motif (think of them as artificially placed TF binding motifs). It consists of DNA sequences classified as either positive (label = 1) or negative (label = 0) depending on where there are clusters of a certain motif. The dataset is pre-divided into training, validation, and test sets to perform cross-validation when training the model.
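To make the input representation concrete, the following sketch one-hot encodes a DNA string and slides a single filter along it with stride 1 and no padding, which is the basic operation a 1D convolutional layer performs before the non-linearity and pooling. Everything here is illustrative: the motif `TGAC`, the example sequence, and the helper names are hypothetical and do not come from the provided dataset, and the filter weights are set by hand rather than learned.

```python
NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one length-4 row per position (order A, C, G, T).

    Any character outside ACGT (for example N) becomes an all-zero row.
    """
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]


def conv1d_scan(encoded, filt):
    """Slide a (width x 4) filter over the sequence and return one score per window."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        score = 0.0
        for offset in range(width):
            for channel in range(4):
                score += encoded[start + offset][channel] * filt[offset][channel]
        scores.append(score)
    return scores


# Hypothetical motif and sequence, chosen only for illustration.
motif = "TGAC"
filt = one_hot(motif)          # hand-set weights; a trained layer would learn these
seq = "AATGACCGTTGACA"

scores = conv1d_scan(one_hot(seq), filt)
hits = [i for i, s in enumerate(scores) if s == len(motif)]  # exact matches
pooled = max(scores)           # global max pooling over positions

print("window scores:", scores)
print("exact-match start positions:", hits)
print("max-pooled score:", pooled)
```

In a trained network the filter weights are real-valued, so each window receives a graded score rather than an exact-match count.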
1. [5 points] Implement a basic 1D Convolutional Neural Network architecture (as discussed in the class), with 1 layer of convolution (with non-linearity and pooling) followed by 1 fully-connected layer to perform binary classification. The model should take as input a one-hot encoded DNA sequence. We recommend keeping the size of the model small for this simple task. Report the %Accuracy = ((Number of correctly predicted samples / Total samples) × 100) on the provided test set.

2. [4 points] Plot the training and validation loss after each epoch of training. Describe any trends in the training and validation loss curves that you notice and provide an explanation for why those trends were observed. Based on this plot, for how many epochs should the model be trained? Why?
3. [4 points] Implement a 2-layer fully connected neural network architecture (with non-linearity) to perform binary classification. The model should take as input a vector of k-mer counts in a DNA sequence ("bag of k-mers" representation covered in class). Report the %Accuracy on the provided test set.

4. [2 points] Compare the performance of the Convolutional Neural Network to the performance of the fully connected neural network, and provide a potential explanation for your results.
You may try different hyperparameters of your choice for these models (e.g., number of filters, size of filters, pooling size, size of linear layers etc.).
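For the "bag of k-mers" input and the %Accuracy metric mentioned in the tasks above, a minimal sketch could look like the following. The function names, the choice of lexicographic k-mer ordering, and the decision to skip windows containing characters other than A, C, G, T are assumptions made for illustration, not requirements of the assignment.

```python
from itertools import product


def kmer_counts(seq, k):
    """Count every overlapping k-mer in seq; returns a vector of length 4**k."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]  # lexicographic order
    index = {kmer: i for i, kmer in enumerate(vocab)}
    counts = [0] * len(vocab)
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        if kmer in index:  # assumption: windows with non-ACGT characters are skipped
            counts[index[kmer]] += 1
    return counts


def percent_accuracy(predictions, labels):
    """(Number of correctly predicted samples / Total samples) x 100."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


# Toy usage with made-up values.
vector = kmer_counts("ACGTAC", k=2)
print("vector length:", len(vector), "total windows counted:", sum(vector))
print("accuracy:", percent_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```

Note that the count vector has a fixed length of 4**k regardless of sequence length, and that it records how often each k-mer occurs but not where.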

Bonus task [6 points]:
1. [4 points] Implement a Convolutional Neural Network (your choice of number/size of layers) for binary classification that uses an embedding layer to encode a DNA sequence. Specifically, the model should take as input a series of integers representing the DNA sequence where each integer corresponds to a specific nucleotide. Then, the model should use an embedding layer to learn a useful representation for the subsequent network layers. Report the %Accuracy on the provided test set.

2. [2 points] Compare the performance of this model to the performance of each of the two previous models, and provide a potential explanation for your results.
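The integer input described in the bonus task can be sketched without any framework. The example below maps each nucleotide to an integer and then looks up a row of a small table for each position, which is what an embedding layer does in its forward pass. The mapping A=0, C=1, G=2, T=3, the embedding dimension, and the helper names are assumptions for illustration; in a framework such as PyTorch, a trainable table of this kind is provided by `torch.nn.Embedding`, whose rows are updated during training rather than fixed as they are here.

```python
import random

# Assumed nucleotide-to-integer mapping; any consistent mapping would work.
NUC_TO_INT = {"A": 0, "C": 1, "G": 2, "T": 3}


def integer_encode(seq):
    """Turn a DNA string into a list of integer tokens."""
    return [NUC_TO_INT[base] for base in seq]


def make_embedding(num_tokens, dim, seed=0):
    """Randomly initialised lookup table of shape (num_tokens, dim)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_tokens)]


def embed(tokens, table):
    """Embedding forward pass: pick the table row for each token."""
    return [table[t] for t in tokens]


tokens = integer_encode("GATTACA")
table = make_embedding(num_tokens=4, dim=8)
embedded = embed(tokens, table)
print("tokens:", tokens)
print("embedded shape:", (len(embedded), len(embedded[0])))

# With a 4x4 identity table, the lookup reproduces one-hot encoding.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print("identity lookup of 'AC':", embed(integer_encode("AC"), identity))
```

The resulting (sequence length x embedding dimension) matrix can then be passed to convolutional layers in the same way as a one-hot matrix.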