{"629644":{"#nid":"629644","#data":{"type":"event","title":"PhD Proposal by Prasanth Chatarasi","body":[{"value":"\u003Cp\u003E\u003Cstrong\u003ETitle:\u003C\/strong\u003E Advancing Compiler Optimizations for Parallel Architectures\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EPrasanth Chatarasi\u003C\/p\u003E\r\n\r\n\u003Cp\u003EPh.D. Student in Computer Science\u003C\/p\u003E\r\n\r\n\u003Cp\u003ESchool of Computer Science\u003C\/p\u003E\r\n\r\n\u003Cp\u003ECollege of Computing\u003C\/p\u003E\r\n\r\n\u003Cp\u003EGeorgia Institute of Technology\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDate: Monday, December 9, 2019\u003C\/p\u003E\r\n\r\n\u003Cp\u003ETime: 11:00am - 1:00pm (EST)\u003C\/p\u003E\r\n\r\n\u003Cp\u003ELocation: KACB 3126\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cstrong\u003ECommittee:\u003C\/strong\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E------------\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Vivek Sarkar (Advisor), School of Computer Science, Georgia Institute of Technology\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Jun Shirako (Co-Advisor), School of Computer Science, Georgia Institute of Technology\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. Tushar Krishna, School of Electrical and Computer Engineering, Georgia Institute of Technology\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDr. 
Santosh Pande, School of Computer Science, Georgia Institute of Technology\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Cstrong\u003EAbstract:\u003C\/strong\u003E\u003C\/p\u003E\r\n\r\n\u003Cp\u003E-----------\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EMany parallel architectures are emerging to improve computing capabilities as we approach the end of Moore\u0026rsquo;s law (the doubling of transistor count every two years). Contemporaneously, the demand for higher performance is broadening across multiple application domains. These trends pose several challenges to application development using high-performance libraries, the de-facto approach to achieving higher performance: libraries must be ported and adapted to multiple emerging parallel architectures, they must keep pace with rapidly advancing domains, and their call boundaries inhibit optimizations across library calls. Hence, there is a renewed focus from industry and academia on optimizing compilers to address these trends, but this requires advances in applying them to a wider range of applications and in better exploiting current and future parallel architectures, which is the focus of my thesis.\u003Cbr \/\u003E\r\n\u003Cbr \/\u003E\r\nFirstly, most compiler frameworks perform conservative dependence analysis in the presence of unanalyzable program constructs such as pointer aliasing, unknown function calls, non-affine expressions, recursion, and unstructured control flow, which can limit the applicability of transformations even when they are legal to apply. Our work is motivated by the observation that software with explicit parallelism for multi-core CPUs and GPUs is on the rise. Our approach (PoPP) interprets the explicit parallelism specified by the programmer as logical parallelism, and refines the conservative dependence analysis with the partial execution order implied by that parallelism, enabling a larger set of loop transformations for enhanced parallelization compared to what would have been possible if the input program were sequential.\u003Cbr \/\u003E\r\n\u003Cbr \/\u003E\r\nSecondly, although compiler technologies for automatic vectorization have been under development for over four decades, there are still considerable gaps in the capabilities of modern compilers to perform automatic vectorization for SIMD units. One such gap is the handling of loops with dependence cycles that involve memory-based anti (write-after-read) and output (write-after-write) dependences. A significant limitation of past work is the lack of a unified formulation that synergistically integrates multiple storage transformations to break such cycles, and further unifies that formulation with loop transformations to enable vectorization. To address this limitation, we propose an approach (PolySIMD).\u003Cbr \/\u003E\r\n\u003Cbr \/\u003E\r\nThirdly, finding optimal compiler mappings of an application onto an accelerator can be extremely challenging because of the massive space of possible data-layouts and loop transformations. For example, there are over 10\u003Csup\u003E19\u003C\/sup\u003E valid mappings per convolution layer on average when mapping ResNet50 and MobileNetV2 onto a representative DNN edge accelerator. To address this challenge, we propose a decoupled off-chip\/on-chip approach that decomposes the mapping space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. 
The motivation for this decomposition is to dramatically reduce the size of the search space, and also to prioritize the optimization of off-chip data movement, which is 2-3 orders of magnitude more expensive than on-chip data movement. We introduce \u003Cem\u003EMarvel\u003C\/em\u003E, which implements this approach by leveraging two cost models to explore the two subspaces -- a classical distinct-block (DB) locality cost model for the off-chip subspace, and a state-of-the-art DNN accelerator behavioral cost model, MAESTRO, for the on-chip subspace. Our approach also considers dimension permutation, a form of data-layout transformation, in the mapping space formulation along with the loop transformations.\u003Cbr \/\u003E\r\n\u003Cbr \/\u003E\r\nFinally, with the emergence of a near-memory thread-migratory architecture (EMU) to address the locality wall for weak-locality applications, as part of the proposed work we plan to develop locality- and thread-migration-aware compiler optimizations to enhance the performance of graph analytics on the EMU machine. Our preliminary evaluation of compiler optimizations such as node fusion and edge flipping shows significant benefits over the original programs, which were written without awareness of thread migrations.\u003C\/p\u003E\r\n","summary":null,"format":"limited_html"}],"field_subtitle":"","field_summary":"","field_summary_sentence":[{"value":"Advancing Compiler Optimizations for Parallel Architectures"}],"uid":"27707","created_gmt":"2019-12-03 17:00:06","changed_gmt":"2019-12-03 17:00:06","author":"Tatianna Richardson","boilerplate_text":"","field_publication":"","field_article_url":"","field_event_time":{"event_time_start":"2019-12-09T11:00:00-05:00","event_time_end":"2019-12-09T13:00:00-05:00","event_time_end_last":"2019-12-09T13:00:00-05:00","gmt_time_start":"2019-12-09 
16:00:00","gmt_time_end":"2019-12-09 18:00:00","gmt_time_end_last":"2019-12-09 18:00:00","rrule":null,"timezone":"America\/New_York"},"extras":[],"groups":[{"id":"221981","name":"Graduate Studies"}],"categories":[],"keywords":[{"id":"102851","name":"Phd proposal"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[{"id":"1788","name":"Other\/Miscellaneous"}],"invited_audience":[{"id":"78761","name":"Faculty\/Staff"},{"id":"78771","name":"Public"},{"id":"174045","name":"Graduate students"},{"id":"78751","name":"Undergraduate students"}],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[],"email":[],"slides":[],"orientation":[],"userdata":""}}}