|
Areas of interest: Bioinformatics and Systems Biology (computational analysis of DNA
sequences/structures, genomes, protein sequences/structures,
microarrays, biological networks, amyloid diseases).
Other non-biological project areas supervised: Development of software for IT teaching and careers.
Project Suggestions 2009/2010
MSc Intelligent Systems / Machine Learning &
Birkbeck MSc Bioinformatics & Systems Biology
The bioinformatics group at UCL Computer Science works on protein structure prediction, protein function prediction and systems biology (analysis of high-throughput data and biological networks). More details can be found on the UCL Bioinformatics Group web pages. We focus on applying established machine learning methods (neural networks, SVMs, GAs) to prediction problems in bioinformatics - although we are very open to new machine learning approaches if effective. These projects generally involve applied machine learning rather than theoretical development of novel methods. So a substantial part of the projects may involve locating, understanding and integrating biological data before applying machine learning techniques.
Some project topics I would be interested in supervising:
1. Predicting amyloid propensity from protein sequence.
It has been shown that certain protein sequences have an increased propensity to 'aggregate' into amyloid structures that resulting in a wide range of amyloid diseases including Alzheimer's and Parkinson's (Amyloidosis). Some sequences are more prone to amyloid formation than others; often single amino acid mutations increase the amyloid propensity of sequences leading to disease.
A number of prediction servers have already been developed to predict amyloid propensity from sequence based on fairly simple models such as the recent FoldAmyloid server (Bioinformatics 2010 26(3):326-332).
A key challenge of this project would be to apply modern machine learning approaches to this problem, with the ultimate aim of outperforming current methods.
2. Integrating the detection of differentially expressed genes with transcription factor motif detection.
Large databases such as GEO and ArrayExpress archive microarray datasets that provide information about the expression levels of genes within particular samples. A common technique in microarray analysis, using packages such as BioConductor, is to detect genes that have different expression levels between, say, diseased tissue and normal. This can provide information about what the disease mechanism could be, possibly leading to drug targets to treat the disease. Unfortunately, changes in many transcription factors cannot be detected using microarrays since their overall expression level is low (below the level of noise).
However, often the genes that the transcription factor controls have higher expression levels and so changes in these can be detected using microarrays. Hence if transcription factor activator A regulates 10 other genes - often we can infer that it has been down-regulated when we see that the 10 other genes are down-regulated. It is this correlation between expression levels in genes that are regulated that may allow us to detect changes in transcription factors that are below the level of noise.
Unfortunately we would need to know which genes regulate other genes, and this 'gene regulatory network' is often poorly understood for many organisms. However, a lot of organisms now have complete genomes available and hence we can detect small 'motifs' in the DNA regulatory region before the gene to suggest what transcription factors are regulating it. Hence we could use motif detection software, such as MEME/MAST, to detect common motifs in front of genes that have correlated expression, working out the transcription factor responsible in the case of known motifs.
A simple approach would do just that; apply suitable clustering to detect genes that are differentially similar and then apply the MEME/MAST approach to examine if common regulatory motifs are present. A more sophisticated technique would integrate the two methods - hence combining both information from gene expression and motif patterns to detect differentially expressed genes and their regulatory motifs.
This project aims to apply and develop such techniques for disease datasets that are known to have such patterns present (quorum sensing from Streptococcus pneumonia and the NRSF regulatory motif from Huntington's Disease).
3. Unsupervised learning of disease gene expression signatures within the context of tissue-specific variation.
A number of genetic diseases affect diverse types of tissue in the body; some mechanisms will be common across tissue types whereas others are tissue-specific. But different tissues generally have very different transcriptional profiles and thus any traditional unsupervised techniques, such as clustering, will just relate similar tissues together while ignoring the (small and varying) differences between diseased and non-diseased tissue.
The aim of this project would be first to create synthetic data that replicates the situation of having 'disease variation' signatures embedded within 'tissue specific' signatures. (Essentially taking tissue specific expression profiles and disease-related profiles from simple two-way comparisons - and creating data based on these with added noise).
Once synthetic data has been generated where the answer is known - various filtering and unsupervised approaches can be tested. (For example, a simple approach would be to use ANOVA with two categorical factors to cover tissue-specific and disease-specific variation - although this may not be very effective since little prior knowledge about the expected expression differences is included.)
After establishing a suitable approach - it would then be tested on real data. Huntington's disease would make a good candidate since a variety of human gene expression studies have been carried out across different tissues (e.g. brain - GSE3790, peripheral blood - GSE1751, GDS1331, lymphocytes - GSE8762, muscle).
The overall aim of the project would be to develop effective supervised learning approaches for gene expression data that would work in the context of tissue-specific variation; and apply any effective approach to discover biological mechanisms in Huntington's Disease that are common between different tissue types.
4. Predicting protein structural domains from sequence.
Proteins structures are often made up of a number of 'structural domains' that often fold independently. Some domains have particular functions and may be reused between a number of different proteins (comparable with reusing function libraries in code!). For instance, four structural domains can be seen in the following protein:

Sometimes delineation of domain structure is not clear cut. The following structure clearly has 6 'structural units' - but are they domains? Would they fold independently? Probably not - this would generally be classified as one domain.

It is generally very important to be able to predict protein domain structure from sequence, since many other bioinformatics techniques only work effectively when given individual domains. We already have the DomPred server that predicts domain structure from sequence, as described in Curr Protein Pept Sci. 2007 Apr;8(2):181-8.
Domain prediction is a particularly challenging problem and generally a combination of different approaches are employed. A key aim of this project would be to use machine learning to predict 'linker' regions between structural domains and integrate this new method with the current two approaches used in DomPred; with the overall goal of increasing its prediction accuracy.
Undergraduate Project Suggestions
1. Development of a CS careers web application in PHP. (TAKEN - Richard Martell)
2. Development of a CS careers web application in JSF. (TAKEN - Nurzhan Izbassov)
3. Development of a CS careers web application in a third framework technology (Ruby on Rail, Groovy on Grails, etc.) (TAKEN - Bello Mustafa Bello)
Each of these projects should carry out independent requirements analysis with different companies and students. Different conceptual designs will hopefully be formulated to satisfy the requirements gathered. Interaction and interface designs should cater to the strengths of the different technologies employed.
The ultimate stakeholders for this application would be our current CS students and external companies wanting to employ them.
This application would allow students to log in - search for graduate positions and internships - deposit CVs and technical information about themselves securely.
It would also allow companies to log in - advertise graduate positions & internships in a variety of formats - add themselves to a 'list of IT companies' with relevant technical information such as salary ranges - search for students with particular skills - etc.
It would allow alumni of CS to log in - update what companies they are now working for - give a synopsis of how they are feeling working for that company (clearly only viewable by current students rather than the public!)
There is probably a lot of additional functionality that would be appropriate after a proper requirements assessment with stakeholders. For instance, statistics such as average salary for a 2nd year internship could be kept up-to-date based on advertisements and student feedback. This would be useful for small employers wanting to know how much internship students generally are paid - and students wanting to know if they are being offered a fair salary for project work.
Successful systems will be deployed within our department for beta-testing, hopefully early enough for proper user validation. Live deployment of an alpha-version will potentially have a large visible impact for potential employers.
4. Development of a graphical teaching tool for concurrent programming. (TAKEN - Adil Berikuly )
The concept of multiple threads interfering with each other when running a multithreaded application is a difficult one. This project aims to develop a graphical teaching tool that allows simple concurrent programs to be developed with boxes carrying out atomic actions, lines indicating program flow and diamonds carrying out conditional paths. The user would be able to simulate multiple threads running through the 'program' at the same time - these would be visually animated - particularly highlighting interference between the threads. Graphical representations of the working of multiple CPUs with different caches would highlight problems due to cache incoherency and visibility problems. The system would also allow concurrency mechanisms, such as locks, to be graphically added to the 'programs' to highlight the change of behaviour with multiple threads.
Ultimately, the system would be validated by allowing the current 2nd year concurrency students to experiment with it and provide feedback on its pedagogical benefits.
Previous Projects
Collaborative project with the Institute of Cancer (UCL) developing a DAS/2 server for genome annotation and microarray analysis.
Project details need to be discussed with Stephen Henderson at the Institute of Cancer.
Efficient modelling of the protein amyloid folding pathways using a reduced atom force field and Metropolis Monte Carlo approach. (TAKEN - Joan Augsburger, 4D Version - Alex Gillis )
The aggregation of proteins into amyloid fibrils is implicated in a large number of neurodegenerative diseases including Alzheimer’s, Parkinson’s, Huntington’s and different types of spongiform encephalopathies such as new variant Creutzfeldt-Jakob Disease (nvCJD), BSE (“mad cow disease”), scrapie, etc. These diseases involve protein aggregation where normal proteins, mostly consisting of globular structures, are converted into insoluble amyloid fibrils containing beta-sheet structures. Computer modelling is playing a valuable role in understanding the amyloid folding pathways involved.
One particular approach is to run molecular simulations of proteins that conventionally form alpha-helical structures, with the aim of characterizing the conditions under which they form beta-sheet structures, which is the first step towards forming insoluble amyloid fibrils. Unfortunately classical all-atom molecular dynamics simulation tends to be too computationally demanding for this purpose and so specialist molecular simulation software is required.
The overall aim of this project is to write a specialist molecular simulator that is able to efficiently model the key features that are responsible for this alpha-helical to beta-sheet transition in proteins. The simulator will be based on a reduced representation of protein residues, using only 5 to 6 atoms per residue and only 3 residue types (hydrophobic, polar, glycine), instead of the conventional 20 residue types, each containing between 5 and 16 atoms. Also internal degrees of freedom such as bond lengths and angles will be frozen at equilibrium values, leaving only two degrees of freedom per residue (phi and psi angles). Using a reduced protein model will make these types of simulations feasible. The amyloidosis process is likely to be robust to these simplifications in the protein model since there is evidence that a large number of proteins can form amyloids and so the properties required must be present in most proteins and are thus not specific to particular side-chains.
It is proposed that the simulator will be based on a Metropolis Monte Carlo approach, thus permitting sampling of ‘protein folding pathways’, starting proteins in random coil states and folding to either alpha-helical or beta-sheet structures. Alternatively a simulated annealing protocol may be employed within the Metropolis Monte Carlo in an attempt to locate global minimum energy structures under particular conditions. It has been shown that Metropolis Monte Carlo can be inefficient for flexible chains using random Cartesian displacement, but we believe this problem will be minimized since we are using a reduced (phi,psi) internal coordinate system.
What you will learn: You will extend your knowledge about software development and object-oriented design and programming using UML and Java. You will learn about protein structure, atomic force fields and molecular simulation methods, as well as advancing your knowledge of applied mathematics.
Building a relational amyloid database with a web-front end by text mining information the OMIM resource. (TAKEN - Katarzyna Filipek-Rayamajhi)
Amyloidosis are diseases involving the formation of insoluble amyloid aggregates. These types of diseases gained very public recognition, at least in the UK, from Bovine Spongiform Encephalopathy (BSE aka 'Mad Cow Disease') that resulted from cows being fed the carcasses of other cows. The unique infectious agent believed to be responsible for this disease is an extremely stable misfolded form of the prion protein, essentially an amyloid protein. Furthermore, this amyloid protein could jump between species and resulted in an outbreak of new-variant Creutzfeldt-Jakob disease (nvCJD) in people that ate BSE infected meat. Both BSE and nvCJD are neurodegenerative diseases, in a similar manner to other diseases associated with formation of amyloid protein structures.
It is generally believed that intrinsically unstructured proteins play a role in amyloid diseases. These disordered proteins are partially unfolded in solution, dynamically taking on many different conformations. Very infrequently, this allows two individual protein chains to form two beta strands in which the side-chains intercalate together, much like a zipper on a jacket does. It is then much easier for other disordered chains to join alongside this nucleas, forming what is known as a cross beta sheet structure.
But do intrinsically disordered proteins actually play a role in amyloid diseases? Are there other properties of these proteins that make them prone to form amyloids? The overall goal of this project is to answer these types of questions by creating a unified bioinformatics resource containing information about amyloid and non-amyloid diseases, together with information about proteins connected with these diseases and predicted features of their sequences. Features such as differences in overall hydrophobicity of the amino acids could explain why certain proteins are more prone to form amyloid than others. Also features such as intrinsic disorder can be determined from sequence using a number of methods, including our own method called DISOPRED2.
It is proposed that a MySQL database will be created to store information about diseases, protein sequences and sequence features. This database will be populated with information by 'web scraping' the OMIM disease database using the regular expression facilities in Java and JDBC to populate the database. Different algorithms would then be applied to the sequence data to determine sequence features that would be entered into the database. Finally, a Tomcat web application would be written using JSP to produce summary information about amyloid and non-amyloid associated proteins to determine if there are any overall global patterns characterizing the two different classes of proteins.
What you will learn: You will extend your knowledge about software development and object-oriented design and programming using UML and Java. You will learn about text mining from online data sources, further apply your knowledge about relational databases and the development of Tomcat/JSP web applications. You will learn about human diseases and protein structure - in particular amyloid diseases.
Bi-clustering approaches applied to microarray meta-analysis for Huntington's disease. (TAKEN - Alex Lubbock)
What you will learn: You will learn about unsupervised bi-clustering approaches applied to high-throughput biological data and meta-analysis techniques. You will learn about microarray methods and their application to biomarker discovery in human diseases.
Improving the detection of significantly differentially regulated genes by combining microarray data with biological network information. (TAKEN - Maria Papaioannou)
What you will learn: You will learn about supervised learning approaches for the analysis of high-throughput biological data. In particular, you will learn about microarray analysis and representation/analysis/integration of biological network information.
An online cross-species database for putative trinucleotide expansion. (TAKEN - Belle Lin, EXTENDED VERSION - Yannis Petrochilos )
What you will learn: You will extend your knowledge about software development and object-oriented design and programming using UML and Java. You will gain practical experience of creating a relational database with a web front-end using Tomcat/JSP and developing efficient algorithms for repeat pattern analysis. You will learn about the ENSEMBL genomic resource and trinucleotide expansion diseases.
Computational modelling of adult stem cell systems
Conway ’s Game of Life (http://en.wikipedia.org/wiki/Conway's_Game_of_Life) is probably one of the most famous Cellular Automaton (CA) (http://en.wikipedia.org/wiki/Cellular_automaton) and abstractly depicts life and death of cells from overcrowding and isolation. It clearly demonstrates emergence of system complexity due to the interaction of very simple rules.
Stem Cell Biology research is currently revolutionizing regenerative medicine. Stem cells live in particular niches within the human body and automatically regenerate damaged tissue as a response to particular signals. Again we have emergent systems behaviour arising from lower-level cellular behaviour. Although clearly more complex than the Game of Life, a number of researchers have used CA and multi-agent approaches to model stem cell niches (see for instance http://www2.wmin.ac.uk/~dinverm/cell/index.htm and http://dynamo.imise.uni-leipzig.de/).
Traditionally cellular automata, such as the Game of Life, use simple rules such as the number of neighbours in a particular state to determine the state transitions of cells. By necessity, biological cells employ a less direct means to probe their environment involving the diffusion of signalling molecules. A cell in a particular state would exude particular molecular signals which then diffuse throughout the intercellular space. Other cells within the system have specific receptors for these signalling molecules and then respond to them, possibly by changing their current state.
The overall goal of this project would be to develop a framework which could be used to model systems of stem cells. This would require a flexible extension of the simple Cellular Automata to include states that generate and respond to cellular signals. Also actions arising from states would need to be incorporated, for instance cell death (apoptosis) and cell replication. A generic way to ‘program’ the system would be developed that can be used to drive a simulation engine. This ‘stem cell programming language’ would probably involve designing a de facto XML-language tailored for this particular purpose. A final component would then allow the visualization of the resulting simulations.
A number of scientific questions could be tackled by such a generic framework. For instance, what is the simplest ‘cellular program’ that can generate a blastocyst (essentially a hollow ball of cells) derived from a single starting cell? What is the simplest ‘cellular program’ that can maintain a particular sized ball of cells by creating new cells when cells are removed, or inducing an appropriate amount of cell death when additional cells are added?
What you will learn: You will extent and apply your knowledge about cellular automata, object-oriented programming in Java and working with XML. You will also learn about modelling biological cells and, in particular, stem cells.
Biology-inspired extensions to traditional Genetic Algorithms
(TAKEN - MACS student )
John Holland initiated the field of Genetic Algorithms (GAs) with his classic 1975 work entitled 'Adaptation in Natural and Artificial Systems'. A Genetic Algorithm is a generic approach used to search for optimal solutions to a particular problem. It evolves a population of 'chromosomes' (often represented as binary strings), each one representing a possible solution to the problem, with a ‘fitness function’ judging the quality of the solution.
The way Genetic Algorithms work is loosely analogous to biological evolution. An initial population of chromosomes is first created, often at random. A selection mechanism then picks the 'fittest' chromosomes that will create the next generation. A number of 'genetic operators' (typically including mutation and cross-over) then create a new population of novel chromosomes and the process iterates. After a number of generations, the fittest chromosome within the population is determined and this hopefully provides a good solution to the problem.
A number of problems have been identified with standard GAs such as Hitchhiking (Schraudolph & Belew, 1992) in which a useless gene is transferred alongside a useful gene due to strong linkage between these genes (i.e. they are next to each other on the chromosome). The hitchhiking problem was actually recognized by Holland in his original work. He suggested biologically-inspired approaches such as 'genetic inversion' to minimize it (since linkage between genes is constantly disrupted). Others have used other biologically-inspired approaches to tackle the issue of linkage strength between genes, such as co-evolving genetic hot-spots where cross-over between genes is more frequent.
Apart from genetic inversion operators and genetic hotspots, many other biologically inspired extensions to the standard GA can be imagined. For instance, what would be the effect of having pairs of chromosomes in a similar manner to that of humans? (So two copies called ‘alleles’ of each gene would then be present.) How is the GA affected if we constrain cross-over only between two different sexes? (By employing a sex-determining gene.) Why limit the chromosomes to employing only two sexes within a Genetic Algorithm - what is the affect of having three different genders within the population or four?
The goal of this project is to first test GAs against random hill-climbing strategies for classic problems such as the Royal Roads fitness landscape and deceptive fitness landscapes (Mitchell et al., 1991; Mitchell et al., 1993). Then to examine the affect of novel biologically-inspired approaches such as those suggested above.
A number of GA programming frameworks do currently exist, but these tend to be fairly complex and generally would not easily accommodate the fairly radical changes to the basic algorithm discussed above. This project aims to develop a flexible object-oriented GA framework that could incorporate these biologically-inspired extensions.
What you will learn: You will learn about Genetic Algorithms and also search algorithms in general. You will apply your knowledge about object-oriented design in Java and learn about biological mechanisms involved in population genetics.
References
John H. Holland (1975) Adaptation in Natural and Artificial Systems, MIT Press.
Melanie Mitchell, Stephanie Forrest et al. (1991) The Royal Road for Genetic Algorithms: Fitness Landscapes and GA Performance. Towards a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life (http://citeseer.ist.psu.edu/mitchell91royal.html)
Melanie Mitchell, John H. Holland and Stephanie Forrest (1993) When Will a Genetic Algorithm Outperform Hill Climbing? Advances in Neural Information Processing Systems (http://citeseer.ist.psu.edu/mitchell93when.html)
Nicol N. Schraudolph and Richard K. Belew (1992) Dynamic Parameter Encoding for Genetic Algorithms. Machine Learning (http://citeseer.ist.psu.edu/schraudolph92dynamic.html)
Machine learning (SVMs) applied to class-based prediction of residue properties from protein sequence
(TAKEN - Hoe Bae)
Everyone knows that we need a regular intake of proteins in our diet, but the visual beauty of what we are eating isn’t always apparent (particularly at university canteens!).
 
There are thousands of different proteins within the human body, each one constructed with a unique sequence of amino acids, which in turn gives rise to a particular 3-dimensional structure with a precise function.
Within the bioinformatics group at UCL we have developed a number of methods based on machines learning techniques (generally neural networks and SVMs) that can predict structural features of proteins using just their amino acid sequence. The PSIPRED server predict secondary structure elements (arrowed strands and helices shown in the above pictures). The DISOPRED server predicts structural disorder in sequences. The DomPred server predict how sequences are partitioned into globular domains (an example of a 4 domain protein is given below).

These servers all employ a similar approach involving determining the ‘profile’ of the sequence in terms of evolutionary history and then training machine learning approaches to predict the feature of interest from this profile. However, all these servers were developed on an individual basis with completely different code employed behind the similar-looking front-ends.
The goal of this project is to develop a completely generic sequence prediction server that could be trained to predict any class-based residue feature using the SVM method (probably employing LIBSVM using Java). The idea is that individual users could create their own ‘prediction methods’ within the generic server and these would then be available to other users. To create a prediction server, the user would provide a description of the different residue class types being predicted, together with a complete training set in a standard format. The server would then train an SVM to predict the residue features and also benchmark the method using cross-validation to provide ROC analysis of sensitivity/specificity. Ideally, it would automatically try different feature encodings, SVM kernels and optimize parameters to obtain optimal prediction results.
The newly created prediction method would then be automatically advertised on the site with its description and benchmarking results. Clearly users employing any new prediction methods would need to establish how much they trust the users that generated them. Hence user registrations with details of affiliation (confirmed via email) will be important within the web interface.
This project will result in a fairly substantial web application and it is envisaged that Tomcat will be employed using JSP or Servlets, together with a MySQL database to handle the data requirements and LIBSVM to handle to the SVM training.
What you will learn: You will learn about machine learning algorithms (in particular SVMs) and the proper benchmarking of prediction methods. You will also learn about Tomcat/JSP/Servlets and apply your knowledge about databases and object-oriented programming in Java. You will also obtain a understanding of protein sequence bioinformatics.
This page last modified
15 February, 2010
by Kevin Bryson
"We build too many walls and not enough bridges" (Isaac Newton 1643-1727) |