Junhyong Kim
Sloan Foundation Young Investigator Award
Ph.D., SUNY at Stony Brook, 1992
evolution of gene regulation and developmental systems
A key property of living objects is that each object, whether a protein, a cell, or a whole organism, has an associated generating process, that is, a decoding process whereby stored information is converted into a complex functioning biological object. For example, generating a protein involves translation and folding; generating an organism involves a cascade of gene regulatory and cell biological processes. We are interested in such “bio-generative processes” and in understanding their general principles. Questions include how to infer the organizational structure of such generative processes from available data, how the generative processes evolve, and how the generative processes and selection on them affect the final form of the biological object. Currently, we have three related projects in this area.
Whole-genome gene expression regulation and evolution of the transcriptome
Recently, it has become possible to obtain the transcriptional profiles of the entire genome. We are interested in asking how we can deduce the molecular interactions of the genes from such transcriptional profiles and whether there are organizational regularities to the structure of molecular interactions. In particular, we are interested in large-scale properties such as the organization of connectivity (i.e., how many genes interact with a particular gene), the modularity of the interactions, and the dimensionality of gene expression (i.e., the degrees of freedom in coordinated gene expression). We are also interested in the evolution of the transcriptome. In this area, we are collaborating with Kevin White at Yale University to study the macro-evolution and mutational dynamics of the transcriptome in Drosophila species. We have generated comparative expression data for six lineages of Drosophila and for mutation accumulation lines.

We are interested in the tempo and mode of evolution of gene expression, the mutational effects on gene expression, and changes in functional importance of gene expression through a developmental trajectory.
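As a rough illustration of two of the large-scale properties mentioned above, connectivity and dimensionality, the following sketch computes both on a toy expression matrix. It is not the lab's method: the correlation threshold, the 90% variance cutoff, the function names, and the toy data are all assumptions made for the example.

```python
import numpy as np

def connectivity(expr, threshold=0.8):
    """For each gene (row), count the other genes whose expression
    profile has |Pearson correlation| >= threshold with it."""
    corr = np.corrcoef(expr)              # genes x genes correlation matrix
    adj = np.abs(corr) >= threshold       # hypothetical interaction criterion
    np.fill_diagonal(adj, False)          # a gene is not its own neighbor
    return adj.sum(axis=1)

def effective_dimension(expr, variance_fraction=0.9):
    """Smallest number of principal components whose cumulative
    variance reaches variance_fraction of the total."""
    centered = expr - expr.mean(axis=1, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(explained, variance_fraction)) + 1
    return min(k, len(s))

# Toy data (assumed): 6 genes x 20 samples driven by two independent factors.
t = 2 * np.pi * np.arange(20) / 20
factor_a, factor_b = np.sin(t), np.cos(t)
expr = np.vstack([factor_a, 2 * factor_a, -factor_a,
                  factor_b, -2 * factor_b, factor_b])

print("connectivity per gene:", connectivity(expr))
print("effective dimension:", effective_dimension(expr))
```

In this toy case each gene is linked only to the two other genes driven by the same factor, and two components account for all of the variance. Real transcriptome data would need noise handling and a principled choice of threshold.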
Dynamics of whole-genome gene expression
Transcriptional regulation is a dynamic process, and a transcriptional profile has a naturally associated temporal dimension. We have been developing computational tools and laboratory experiments to understand the dynamics of whole-genome gene expression regulation. We recently developed tools for visualizing and analyzing a transcriptome time-series under periodic events such as the cell cycle. With Paul Sniegowski at Penn, we are now generating time-series data from natural strains of yeast. We are pursuing the idea that a comparison of changes in the dynamics of gene expression (so-called heterochrony) will allow us to efficiently infer modular co-regulated gene groups involved in generating the cell cycle.
Protein structure evolution
Proteins are generated by a process of translation and statistical mechanical folding. Proteins also display macroevolutionary diversity in shape and function, raising the question of whether such diversity is due to chance, molecular function (e.g., particular catalytic activity), or constraints in the folding process. Here we are interested in the contribution of the folding process to the final form. In particular, we are statistically characterizing the regularity of form, which we define as the presence of self-similar substructures, and relating this regularity to the prevalence of particular structures in nature.

We have found statistical evidence that common protein structures, so-called superfolds, are more self-similar. We hypothesize that this self-similarity is due to selection for efficient folding. We are currently attempting to address whether more self-similar structures are more efficient and robust in their folding properties.
computational infrastructure for deducing the tree of life
Because biological organization is fundamentally based on a bifurcating descent-with-modification process, the solution to the problem of phylogenetic estimation is extremely important to a wide variety of basic and applied biological problems. Recently, the introduction of molecular techniques has made available a tremendous amount of data for phylogenetic analysis, such that several NSF initiatives have been launched to elucidate the phylogenetic tree of all life. In this area we are concentrating on the following problems:
Investigating the statistical properties of phylogenetic estimators
We are interested in understanding the large-sample and finite-sample behavior of phylogenetic estimators, especially with respect to their scaling properties in both problem size and data size. We have developed several geometrical analysis techniques using the idea of an embedded joint probability space. We are also working on developing new estimation principles. For example, we have pioneered the idea of using computational algebraic geometry techniques for estimating evolutionary trees. Algebraic geometry techniques can be used to extract invariant functions of the joint probability distribution of character states, which can serve as efficient direct estimators of phylogeny. More recently we have been experimenting with penalized likelihood methods to connect several distinct classes of phylogenetic estimators into a single family.

Establishing a simulated data set for validating phylogenetic estimators
One of the fundamental obstacles in developing phylogenetic estimation algorithms is the lack of an agreed standard for performance evaluation. Typical evaluation studies are ad hoc and use procedures that are difficult to replicate. We are part of an NSF-funded project called National Resource for Phyloinformatics and Computational Phylogenomics. As part of this group, we are developing a suite of simulated data using a range of models of evolution, from simple models of mutational change to whole-genome evolution. We are also developing a suite of statistical performance evaluation tools. The development of the simulated data set and performance evaluation methods will lead to a unified assessment of phylogenetic estimation algorithms.
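To make the simplest end of that range concrete, here is a minimal sketch of simulating sequences down a known tree under the Jukes-Cantor (JC69) substitution model, so that an estimator can be scored against the true tree. It is an illustration only, not the project's software: the tree shape, branch lengths, sequence length, and the names `evolve` and `simulate` are assumptions.

```python
import math
import random

BASES = "ACGT"

def evolve(seq, branch_length, rng):
    """Evolve seq along one branch under JC69.
    branch_length is in expected substitutions per site."""
    # JC69 probability that a site ends the branch in a different base.
    p_change = 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))
    out = []
    for base in seq:
        if rng.random() < p_change:
            # Each of the three other bases is equally likely.
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def simulate(node, seq, rng, leaves=None):
    """Recursively evolve seq down the tree and collect leaf sequences.
    A node is a leaf name (str) or a list of (child, branch_length) pairs."""
    if leaves is None:
        leaves = {}
    if isinstance(node, str):
        leaves[node] = seq
        return leaves
    for child, length in node:
        simulate(child, evolve(seq, length, rng), rng, leaves)
    return leaves

# Hypothetical four-taxon tree ((A,B),(C,D)) with assumed branch lengths.
tree = [
    ([("A", 0.1), ("B", 0.1)], 0.05),
    ([("C", 0.1), ("D", 0.1)], 0.05),
]
rng = random.Random(0)  # fixed seed so a run can be replicated
root = "".join(rng.choice(BASES) for _ in range(200))
alignment = simulate(tree, root, rng)
for name in sorted(alignment):
    print(name, alignment[name][:40])
```

Because the generating tree is known, any estimator run on `alignment` can be compared directly with the truth, and the fixed seed addresses the replication problem noted above.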
Mining the existing database for phylogenetic information
Current databases such as GenBank contain a wealth of information for phylogenetic estimation. For example, in 2002, GenBank contained information from ~125,000 organisms. However, these data have been collected in a haphazard manner, which leads to the interesting problem of how we might use them most efficiently. The computational problems here include obtaining maximal subsets of informative data, using fragmented datasets to generate a “super” tree that maximally covers the available organisms, and illuminating bottlenecks in the available data. We are part of an NSF-funded “Assembling the Tree of Life” project that will address these computational problems.
comparative genomics and molecular evolution
We are engaged in various projects involving comparative genomics and molecular evolution. These projects are mostly in collaboration with other labs around the country. The projects include developing high-throughput methods for rapid identification and phylogenetic placement of novel organisms, functional annotation of genomic sequences for molecules involved in synaptic transmission, molecular evolution of the olfactory receptor families, and elucidation of novel components in the osteoclast maturation pathway. Part of this activity is funded by an NIH planning grant for the National Program of Excellence in Biomedical Computing.

Rifkin, S. A., Houle, D., Kim, J. and White, K.P. 2005. A mutation accumulation assay reveals extensive capacity for rapid gene expression evolution. Nature, 438: 220-223.
Ge, F., Wang, L.S., Kim, J. 2005. Cobweb of life revealed by genome-scale estimates of horizontal gene transfer. PLOS Biology, 3(10):e316.
Hadley, D., Murphy T., Valladares, O., Hannenhalli, S., Ungar, L., Kim, J. and Bucan, M. 2006. Patterns of sequence conservation in presynaptic neural genes. Genome Biology 7:R105.
Magwene, P. M., Lizardi, P., and J. Kim. 2003. Reconstructing the temporal order of biological samples using microarray data. Bioinformatics 19(7):842-850.
Rifkin, S. A., J. Kim, and K. P. White. 2003. Evolution of gene-expression during metamorphosis in the Drosophila melanogaster subgroup. Nature Genetics 33(2):138-144.
Rifkin, S.A. and Kim, J. 2002. Geometry of gene expression dynamics. Bioinformatics 18:1176-1183.
Kim, J. Computers are from Mars, Organisms are from Venus: An interrelationship guide to Biology and Computer Science. IEEE Computer, July 2002.
Aspnes, J., M.-Y. Kao, J. Kim. J. Kreychman, G. Shah. 2002. A Combinatorial Toolbox for Protein Landscapes in the Grand Canonical Model. J. Comp. Biol. 9(5):721-742.
Kim, J. 2001. Macroevolution of the hairy enhancer in Drosophila species. J. Exp. Zool. (Mol. Dev. Evol.) 291:175-185.
Kim, J., E. Moriyama, C. G. Warr, P. J. Clyne, and J. R. Carlson. 2000. Identification of multi-transmembrane proteins from genomic databases using quasi-periodic structural properties. Bioinformatics 16:767-775.
Kim, J. 2000. Slicing hyperdimensional oranges: The geometry of phylogenetic estimation. Mol. Phyl. Evol. 17(1): 58-75.
Sanderson, M. J. and J. Kim. 2000. Parametric phylogeny estimation?. Syst. Biol. 49:817-829.
Introduction to Computational Biology (BIOL 536)
Advanced Computational Biology (BIOL 537)
