{"625078":{"#nid":"625078","#data":{"type":"news","title":"New Mining Techniques Effortlessly Provide Greater Insight for Unstructured Text Data","body":[{"value":"\u003Cp\u003EThe ability to effectively and efficiently harness unstructured data has been the proverbial race to the moon for the data mining research world.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EUnstructured data is information that is typically text-heavy and either lacks a predefined data model or is not organized in a predefined manner. It can represent the name of a disease, the location of a sale, a type of product sold, and much more. According to some estimates, it also accounts for roughly 85 percent of the data in the world.\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EHowever, acquiring useful insights from massive unstructured text data remains a challenge that existing mining techniques have not readily addressed.\u003C\/p\u003E\r\n\r\n\u003Cp\u003ENow, a recently published\u0026nbsp;\u003Ca href=\u0022https:\/\/www.ideals.illinois.edu\/handle\/2142\/102465\u0022\u003EPh.D. defense\u003C\/a\u003E\u0026nbsp;combining\u0026nbsp;two algorithms\u0026nbsp;specializing in\u0026nbsp;\u003Ca href=\u0022http:\/\/chaozhang.org\/papers\/2018-cikm-westclass.pdf\u0022\u003Eneural text classification\u003C\/a\u003E\u0026nbsp;and\u0026nbsp;\u003Ca href=\u0022http:\/\/chaozhang.org\/papers\/2018-kdd-taxogen.pdf\u0022\u003Etaxonomy construction\u003C\/a\u003E, claims\u0026nbsp;to do just that.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EAccording to the defense, when these algorithms are used together, they create\u0026nbsp;an elegant data mining pipeline that can effortlessly\u0026nbsp;turn text data into useful insights. 
The\u0026nbsp;algorithms\u0026nbsp;allow for multi-dimensional analysis of unstructured text data using an integrated structuring-and-mining framework and discover\u0026nbsp;multi-granular structures from the text.\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u003Ca href=\u0022https:\/\/www.cse.gatech.edu\/\u0022\u003ESchool of Computational Science and Engineering\u003C\/a\u003E\u0026nbsp;Assistant Professor\u0026nbsp;\u003Ca href=\u0022http:\/\/chaozhang.org\/\u0022\u003E\u003Cstrong\u003EChao Zhang\u003C\/strong\u003E\u003C\/a\u003E\u0026nbsp;developed the new techniques while\u0026nbsp;studying under the direction of famed computer scientist\u0026nbsp;\u003Cstrong\u003EJiawei Han\u003C\/strong\u003E. The work has garnered so much recognition that it won the\u0026nbsp;\u003Ca href=\u0022https:\/\/www.kdd.org\/\u0022\u003E2019 SIGKDD Runner Up Dissertation Award\u003C\/a\u003E.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026ldquo;As one of the most important data forms, unstructured text data plays a crucial role in data-driven decision making in domains ranging from social networking and information retrieval to healthcare and scientific research,\u0026rdquo; said Zhang.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026ldquo;In many emerging applications, people\u0026#39;s information needs from text data are becoming multi-dimensional \u0026ndash; they demand useful insights for multiple aspects from the given body of a text.\u0026rdquo;\u003C\/p\u003E\r\n\r\n\u003Cp\u003EZhang\u0026rsquo;s two-part pipeline addresses this multi-dimensional need while also tackling one of the biggest bottlenecks of mining unstructured text data: labeling. 
Labeled data is a set of samples that have been tagged with one or more labels; it is used in a form of machine learning called supervised learning.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EUnfortunately, supervised learning is not always practical: real-world data is often unlabeled, and labeling enough of it to train supervised models is often too expensive.\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cp\u003ETherefore, an elegant data mining algorithm has to be able to\u0026nbsp;discover latent structures from unstructured text data, or\u0026nbsp;to sort unstructured data into multiple categories, without extensive labeling. This is where Zhang\u0026rsquo;s methods, which require little or no labeling effort, truly shine.\u003C\/p\u003E\r\n\r\n\u003Cp\u003E\u0026ldquo;The algorithms for multidimensional text analysis require little supervision, making this framework appealing for many applications where labeled data are expensive to obtain,\u0026rdquo; he said.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EUltimately, in addition to being more effective than other models, the proposed framework has two distinctive advantages: flexibility and label efficiency. 
Both drive costs down dramatically.\u003C\/p\u003E\r\n\r\n\u003Cp\u003EDownload the\u0026nbsp;algorithms on\u0026nbsp;GitHub here:\u0026nbsp;\u003C\/p\u003E\r\n\r\n\u003Cul\u003E\r\n\t\u003Cli\u003E\u003Ca href=\u0022https:\/\/github.com\/yumeng5\/WeSTClass\u0022\u003EWeakly-Supervised Neural Text Classification\u003C\/a\u003E\u003C\/li\u003E\r\n\t\u003Cli\u003E\u003Ca href=\u0022https:\/\/github.com\/franticnerd\/taxogen\u0022\u003ETaxoGen:\u0026nbsp;Unsupervised Topic Taxonomy Construction by Adaptive Term Embedding and Clustering\u003C\/a\u003E\u003C\/li\u003E\r\n\u003C\/ul\u003E\r\n\r\n\u003Cp\u003E\u0026nbsp;\u003C\/p\u003E\r\n","summary":null,"format":"limited_html"}],"field_subtitle":"","field_summary":"","field_summary_sentence":[{"value":"CSE Assistant Professor Chao Zhang wins KDD 2019 Dissertation Award runner up for an elegant data mining pipeline."}],"uid":"34540","created_gmt":"2019-08-23 17:29:47","changed_gmt":"2019-08-23 18:15:11","author":"Kristen Perez","boilerplate_text":"","field_publication":"","field_article_url":"","dateline":{"date":"2019-08-23T00:00:00-04:00","iso_date":"2019-08-23T00:00:00-04:00","tz":"America\/New_York"},"extras":[],"hg_media":{"625076":{"id":"625076","type":"image","title":"Chao Zhang\u0027s Data Mining Dissertation Award Winning Work","body":null,"created":"1566581220","gmt_created":"2019-08-23 17:27:00","changed":"1566581220","gmt_changed":"2019-08-23 17:27:00","alt":"Chao Zhang\u0027s Data Mining Dissertation Award Winning Work showing cube construction","file":{"fid":"237990","name":"fig 1.1 
dissertation.png","image_path":"\/sites\/default\/files\/images\/fig%201.1%20dissertation.png","image_full_path":"http:\/\/hg.gatech.edu\/\/sites\/default\/files\/images\/fig%201.1%20dissertation.png","mime":"image\/png","size":348777,"path_740":"http:\/\/hg.gatech.edu\/sites\/default\/files\/styles\/740xx_scale\/public\/images\/fig%201.1%20dissertation.png?itok=rW3ug0tJ"}}},"media_ids":["625076"],"groups":[{"id":"50877","name":"School of Computational Science and Engineering"},{"id":"47223","name":"College of Computing"},{"id":"431631","name":"OMS"}],"categories":[],"keywords":[{"id":"9168","name":"data mining"},{"id":"182133","name":"Chao Zhang"},{"id":"101","name":"Award"},{"id":"181845","name":"KDD 2019"},{"id":"181315","name":"cse-dse"},{"id":"181220","name":"cse-ml"}],"core_research_areas":[],"news_room_topics":[],"event_categories":[],"invited_audience":[],"affiliations":[],"classification":[],"areas_of_expertise":[],"news_and_recent_appearances":[],"phone":[],"contact":[{"value":"\u003Cp\u003EKristen Perez\u003C\/p\u003E\r\n\r\n\u003Cp\u003ECommunications Officer\u003C\/p\u003E\r\n","format":"limited_html"}],"email":["kristen.perez@cc.gatech.edu"],"slides":[],"orientation":[],"userdata":""}}}