Large-Scale Data Management and Its Interdisciplinary Relevance

I grew up in the north Chicago suburbs. My father, a purchasing agent for an electronics parts company, sometimes brought home sample parts, and I began to wonder what they were and how to put them together to do interesting things. That interest led me to a B.S. in computer engineering and an M.S. in electrical engineering at the University of Illinois-Urbana. After four summer internships during college with IBM – in Burlington, Vt., and Rochester, Minn. – I joined IBM in Rochester, where I was an engineer involved in computer hardware and software projects, as well as an R&D manager. After eight years, IBM sent me to the University of Wisconsin-Madison, where I received my Ph.D. in computer science, specializing in databases and information retrieval. I returned to IBM as lead architect for the industry’s largest Java-based project at that time. Later, I joined Imation in Oakdale, Minn., as the chief technology officer and director of R&D of the Data Storage division. After leaving corporate life to consult in my own company, I became interested in computer security and developed and taught the first course in that area at the University of Minnesota. I later brought it to the University of St. Thomas as an adjunct instructor, and then joined the full-time faculty in the Graduate Programs in Software Department in 2003. I currently teach courses in computer security, software analysis and design, and information retrieval.

Over the last several years, I have seen a growing number of news reports about a technology called MapReduce, which Google first described in a 2004 paper. In 2006, a group of software engineers created an open source version called Hadoop. This technology uses large clusters of computers, numbering into the thousands, to distribute data and processing. Hadoop is used by a number of very high profile companies, including eBay, Facebook, Yahoo! and Walmart. The technology can tolerate faults and restart tasks as needed. It is designed to efficiently handle terabytes and petabytes of data. Google’s original use was to process the giant index it creates after it crawls the Web in search of information, but MapReduce has since spawned applications in many other directions.
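To make the MapReduce idea concrete, here is a minimal sketch of the classic word-count example, run in a single Python process. It is an illustration only and does not use Hadoop's actual API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical. On a real cluster, the map and reduce steps would run in parallel on many machines, and the framework would handle the grouping step and any task restarts.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the web", "the index"]))
```

Because each document is mapped independently and each key is reduced independently, both steps can be spread across as many machines as are available.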

One of our adjunct instructors, Gary Berosik, had experience with Hadoop at his company, Thomson Reuters, and encouraged us to explore it. Last year, I attended a local Java users’ group meeting where a speaker from Cloudera described the technology and the company’s experience consulting with it, and that really opened my eyes to Hadoop’s potential. We had an unused computing cluster in our department, so with the help of a student, Harlan Bloom, we got Hadoop running on it. I then decided to teach this technology in my information retrieval course via individual virtual machines and also make the cluster available for student projects.

In today’s computer systems, most of our computing resources are blindingly fast (CPUs) and hugely abundant (memory and disk capacity). Moore’s Law predicts that these capabilities double every 18 to 24 months; however, disk access time and throughput have not kept pace with this exponential growth, and disk access is typically the performance bottleneck for most applications. We compensate by putting more memory in our computers so that we don’t have to access the disk drive as often. Hadoop, operating on a cluster of computers, takes advantage not only of parallel processing but also of parallel disk access. Moving data on and off disk drives in parallel helps alleviate this historic performance bottleneck, and so enables efficient processing of huge amounts of data stored on these disk drives.
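A rough back-of-the-envelope calculation shows why parallel disk access matters. The sketch below assumes an idealized model in which data is spread evenly across disks and every disk reads at the same sustained rate. The 100 MB/s figure and the function name `scan_time_seconds` are assumptions chosen for illustration, not measurements from our cluster.

```python
def scan_time_seconds(total_bytes, bytes_per_second_per_disk, num_disks):
    """Idealized time to read total_bytes spread evenly over disks read in parallel."""
    if num_disks < 1:
        raise ValueError("num_disks must be at least 1")
    return total_bytes / (bytes_per_second_per_disk * num_disks)


ONE_TERABYTE = 10**12              # bytes
ASSUMED_THROUGHPUT = 100 * 10**6   # 100 MB/s per disk (assumed, for illustration)

single_disk = scan_time_seconds(ONE_TERABYTE, ASSUMED_THROUGHPUT, 1)
cluster = scan_time_seconds(ONE_TERABYTE, ASSUMED_THROUGHPUT, 1000)

print(f"1 disk: {single_disk / 3600:.1f} hours")
print(f"1,000 disks: {cluster:.0f} seconds")
```

Under these assumptions, one disk needs about 2.8 hours to scan a terabyte, while 1,000 disks working in parallel need about 10 seconds. Real clusters fall short of this ideal because of network transfer, uneven data placement and coordination overhead.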

Recently, Moore’s Law has also begun to affect CPU speed in an unexpected way. Instead of racing up the gigahertz ladder, the speed of an individual CPU core is tapering off, so the industry is responding by offering multiple CPU cores to keep pace over time. This shift, together with the disk bottleneck, is causing the software engineering community to rethink architectures and programming languages. At the highest level, I am interested in how traditional applications change under this new paradigm, and what new applications are now enabled by it.

I have two sons. My older son, Justin, is a junior at St. Thomas, majoring in actuarial science, economics and statistics. My younger son, Nathan, is a senior in high school and will be attending St. Thomas to major in neuroscience. To help make his college decision, he asked to attend a neuroscience course, so I found one for him to visit, taught by Dr. Jadin Jackson, a clinical faculty member in the Biology Department. Afterward, while we discussed Nathan’s academic options, Jadin and I found we had some things in common, including degrees in electrical engineering. Over lunch, he described a computing problem that was getting in the way of his neuroscience research, so we teamed up to see if Hadoop could help him out. Meanwhile, one of my graduate students, Ashish Singh, wanted to work on Hadoop with me in an independent study course, so we decided that he would work on this real-world problem.

During Jadin’s post-doctoral work, he acquired a lot of data from electrodes that were implanted in rat brains. These signals represent the activity of individual neurons in a brain region called the hippocampus, and they correspond to the rat’s position in space. When the rat finds itself at a tee in the maze and has to decide whether to move left or right to get its reward, the signals reflect the rat’s thinking about moving down the left path, then thinking about moving down the right path, then deciding which route to take and then physically moving. Amazingly, sometimes this signal pattern is generated while the rat is sleeping, so Jadin can see a rat "dreaming" about moving in the maze!

Jadin needs to process these signals digitally using a mathematical technique called wavelet analysis, which can pull out both frequency and time information from the neuronal signals. The huge volume of data and amount of computation needed, however, overwhelm his individual computer; furthermore, he would like to have all his processed data available online to query and explore. We hope to show that this processing can be performed efficiently using the parallelism available on a Hadoop cluster, and that the results can be accessed efficiently with a data warehouse component called Hive, which runs on top of Hadoop. Jadin and I have identified other pre- and post-processing steps that we can explore in future projects in this interdisciplinary area called computational neuroscience.
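To give a feel for how a wavelet transform captures both frequency and time, here is a minimal sketch using the Haar wavelet, the simplest member of the family, in its unnormalized averaging-and-differencing form. This is a stand-in chosen for brevity and is not necessarily the wavelet or the code used in Jadin's analysis. The function names `haar_step` and `haar_decompose` and the sample signal are hypothetical.

```python
def haar_step(signal):
    """One Haar level: pairwise averages (coarse trend) and half-differences (fine detail)."""
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) / 2 for a, b in pairs]
    detail = [(a - b) / 2 for a, b in pairs]
    return approx, detail


def haar_decompose(signal):
    """Repeat haar_step until one value remains.

    Returns (approx, details), where details[0] holds the finest-scale
    (highest-frequency) coefficients and details[-1] the coarsest.
    """
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []
    while len(approx) > 1:
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details


if __name__ == "__main__":
    # Flat in the first half, rapidly oscillating in the second half.
    approx, details = haar_decompose([2, 2, 2, 2, 5, 1, 5, 1])
    print(details[0])  # finest-scale coefficients, one per pair of samples
```

For this sample signal, the finest-scale coefficients are zero for the flat first half and nonzero for the oscillating second half, so the transform reports both that a fast oscillation is present and when it occurs. Each recording channel can be transformed independently, which is what makes the workload a natural fit for a cluster.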

Both Jadin and I enjoy crossing academic boundaries to engage in interdisciplinary work. For me, it is a chance to learn about a new field from an expert and to see if I can apply the knowledge in my own domain to someone else’s real-world problem. This is far more interesting and challenging than contriving a problem. During this process, I also get to deepen my understanding of the MapReduce model and the Hadoop technology, which allows me to share my experiences more effectively with my students.

As another example of how these projects can spawn other activities, I have teamed with another member of our department, Dr. Saeed Rahimi, to create a new special topics course in big data for fall 2012. The course will include these technologies and several others that have gained traction in the industry to deal with the increasingly massive amounts of data and the desire to efficiently analyze them and turn them into information. These technologies provide alternatives to traditional SQL-based relational databases and are better optimized for the fast-growing amount of unstructured and semi-structured data.

In our department, we recently decided to form a Big Data Center of Excellence to integrate our faculty expertise in database, data warehousing, data mining, operating systems, computer architecture, information retrieval and business intelligence around this new area. Our goals are to spawn further research activity within our department, between St. Thomas departments, with other universities and with industry. This effort will influence the curriculum for our existing courses and future ones, including homework assignments, class projects, independent study opportunities and thesis topics. I think this is a good example of using applied research to further our twin goals of maintaining currency and competency in the classroom. This effort is also a good example of a traditional strength of the Graduate Programs in Software, which is to quickly respond to the ever-present changes in the information technology industry, bringing these technologies into the classroom for our students, which in turn benefits their current or future employers.

Bradley Rubin is an associate professor in the Graduate Programs in Software.

From Exemplars, a publication of the Grants and Research Office.