Lecture Notes in Computer Science 7333

Commenced Publication in 1973
Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen

Editorial Board
David Hutchison (Lancaster University, UK)
Takeo Kanade (Carnegie Mellon University, Pittsburgh, PA, USA)
Josef Kittler (University of Surrey, Guildford, UK)
Jon M. Kleinberg (Cornell University, Ithaca, NY, USA)
Alfred Kobsa (University of California, Irvine, CA, USA)
Friedemann Mattern (ETH Zurich, Switzerland)
John C. Mitchell (Stanford University, CA, USA)
Moni Naor (Weizmann Institute of Science, Rehovot, Israel)
Oscar Nierstrasz (University of Bern, Switzerland)
C. Pandu Rangan (Indian Institute of Technology, Madras, India)
Bernhard Steffen (TU Dortmund University, Germany)
Madhu Sudan (Microsoft Research, Cambridge, MA, USA)
Demetri Terzopoulos (University of California, Los Angeles, CA, USA)
Doug Tygar (University of California, Berkeley, CA, USA)
Gerhard Weikum (Max Planck Institute for Informatics, Saarbruecken, Germany)

Beniamino Murgante, Osvaldo Gervasi, Sanjay Misra, Nadia Nedjah, Ana Maria A.C. Rocha, David Taniar, Bernady O. Apduhan (Eds.)

Computational Science and Its Applications – ICCSA 2012
12th International Conference
Salvador de Bahia, Brazil, June 18-21, 2012
Proceedings, Part I

Volume Editors
Beniamino Murgante, University of Basilicata, Potenza, Italy, E-mail: beniamino.murgante@unibas.it
Osvaldo Gervasi, University of Perugia, Italy, E-mail: osvaldo@unipg.it
Sanjay Misra, Federal University of Technology, Minna, Nigeria, E-mail: smisra@futminna.edu.ng
Nadia Nedjah, State University of Rio de Janeiro, Brazil, E-mail: nadia@eng.uerj.br
Ana Maria A.C. Rocha, University of Minho, Braga, Portugal, E-mail: arocha@dps.uminho.pt
David Taniar, Monash University, Clayton, VIC, Australia, E-mail: david.taniar@infotech.monash.edu.au
Bernady O. Apduhan, Kyushu Sangyo University, Fukuoka, Japan, E-mail: bob@is.kyusan-u.ac.jp

ISSN 0302-9743, e-ISSN 1611-3349
ISBN 978-3-642-31124-6, e-ISBN 978-3-642-31125-3
DOI 10.1007/978-3-642-31125-3
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012939389
CR Subject Classification (1998): C.2.4, C.2, H.4, F.2, H.3, D.2, F.1, H.5, H.2.8, K.6.5, I.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues

© Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)

Preface

This four-part volume (LNCS 7333-7336) contains a collection of research papers from the 12th International Conference on Computational Science and Its Applications (ICCSA 2012), held in Salvador de Bahia, Brazil, during June 18-21, 2012. ICCSA is one of the most successful international conferences in the field of computational sciences, and this year, for the first time in the history of the series, it was held in South America. Previous editions were held in Santander, Spain (2011); Fukuoka, Japan (2010); Suwon, Korea (2009); Perugia, Italy (2008); Kuala Lumpur, Malaysia (2007); Glasgow, UK (2006); Singapore (2005); Assisi, Italy (2004); Montreal, Canada (2003); and, as ICCS, Amsterdam, The Netherlands (2002) and San Francisco, USA (2001).

The computational science community has enthusiastically embraced the successive editions of ICCSA, thus contributing to making ICCSA a focal meeting point for those interested in innovative, cutting-edge research about the latest and most exciting developments in the field. We are grateful to all those who have contributed to the ICCSA conference series.

ICCSA 2012 would not have been possible without the valuable contribution of many people. We would like to thank all session organizers for their diligent work, which further enhanced the conference level, and all reviewers for their expertise and generous effort, which led to a very high quality event with excellent papers and presentations. We especially recognize the contribution of the Program Committee and Local Organizing Committee members for their tremendous support and for making this congress a very successful event.

We would like to sincerely thank our keynote speakers, who willingly accepted our invitation and shared their expertise. We also thank our publisher, Springer, for agreeing to publish the proceedings and for their kind assistance and cooperation during the editing process.

Finally, we thank all authors for their submissions and all conference attendees for making ICCSA 2012 truly an excellent forum on computational science, facilitating the exchange of ideas, fostering new collaborations and shaping the future of this exciting field. Last, but certainly not least, we wish to thank our readers for their interest in this volume. We really hope you find in these pages interesting material and fruitful ideas for your future work.

We cordially invite you to visit the ICCSA website, http://www.iccsa.org, where you can find relevant information about this interesting and exciting event.

June 2012
Osvaldo Gervasi
David Taniar

Organization

ICCSA 2012 was organized by Universidade Federal da Bahia (Brazil), Universidade Federal do Recôncavo da Bahia (Brazil), Universidade Estadual de Feira de Santana (Brazil), University of Perugia (Italy), University of Basilicata (Italy), Monash University (Australia), and Kyushu Sangyo University (Japan).

Honorary General Chairs
Antonio Laganà (University of Perugia, Italy)
Norio Shiratori (Tohoku University, Japan)
Kenneth C.J. Tan (Qontix, UK)

General Chairs
Osvaldo Gervasi (University of Perugia, Italy)
David Taniar (Monash University, Australia)

Program Committee Chairs
Bernady O. Apduhan (Kyushu Sangyo University, Japan)
Beniamino Murgante (University of Basilicata, Italy)

Workshop and Session Organizing Chairs
Beniamino Murgante (University of Basilicata, Italy)

Local Organizing Committee
Frederico V. Prudente (Universidade Federal da Bahia, Brazil), Chair
Mirco Ragni (Universidade Estadual de Feira de Santana, Brazil)
Ana Carla P. Bitencourt (Universidade Federal do Recôncavo da Bahia, Brazil)
Cassio Pigozzo (Universidade Federal da Bahia, Brazil)
Angelo Duarde (Universidade Estadual de Feira de Santana, Brazil)
Marcos E. Barreto (Universidade Federal da Bahia, Brazil)
José Garcia V. Miranda (Universidade Federal da Bahia, Brazil)

International Liaison Chairs
Jemal Abawajy (Deakin University, Australia)
Marina L. Gavrilova (University of Calgary, Canada)
Robert C.H. Hsu (Chung Hua University, Taiwan)
Tai-Hoon Kim (Hannam University, Korea)
Andrés Iglesias (University of Cantabria, Spain)
Takashi Naka (Kyushu Sangyo University, Japan)
Rafael D.C. Santos (National Institute for Space Research, Brazil)

Workshop Organizers

Advances in High-Performance Algorithms and Applications (AHPAA 2012)
Massimo Cafaro (University of Salento, Italy)
Giovanni Aloisio (University of Salento, Italy)

Advances in Web-Based Learning (AWBL 2012)
Mustafa Murat Inceoglu (Ege University, Turkey)

Bio-inspired Computing and Applications (BIOCA 2012)
Nadia Nedjah (State University of Rio de Janeiro, Brazil)
Luiza de Macedo Mourelle (State University of Rio de Janeiro, Brazil)

Computer-Aided Modeling, Simulation, and Analysis (CAMSA 2012)
Jie Shen (University of Michigan, USA)
Yuqing Song (Tianjing University of Technology and Education, China)

Cloud Computing and Its Applications (CCA 2012)
Jemal Abawajy (Deakin University, Australia)
Osvaldo Gervasi (University of Perugia, Italy)

Computational Geometry and Applications (CGA 2012)
Marina L. Gavrilova (University of Calgary, Canada)

Chemistry and Materials Sciences and Technologies (CMST 2012)
Antonio Laganà (University of Perugia, Italy)

Cities, Technologies and Planning (CTP 2012)
Giuseppe Borruso (University of Trieste, Italy)
Beniamino Murgante (University of Basilicata, Italy)

Computational Tools and Techniques for Citizen Science and Scientific Outreach (CTTCS 2012)
Rafael Santos (National Institute for Space Research, Brazil)
Jordan Raddick (Johns Hopkins University, USA)
Ani Thakar (Johns Hopkins University, USA)

Econometrics and Multidimensional Evaluation in the Urban Environment (EMEUE 2012)
Carmelo M. Torre (Polytechnic of Bari, Italy)
Maria Cerreta (Università Federico II of Naples, Italy)
Paola Perchinunno (University of Bari, Italy)

Future Information System Technologies and Applications (FISTA 2012)
Bernady O. Apduhan (Kyushu Sangyo University, Japan)

Geographical Analysis, Urban Modeling, Spatial Statistics (GEOG-AN-MOD 2012)
Stefania Bertazzon (University of Calgary, Canada)
Giuseppe Borruso (University of Trieste, Italy)
Beniamino Murgante (University of Basilicata, Italy)

International Workshop on Biomathematics, Bioinformatics and Biostatistics (IBBB 2012)
Unal Ufuktepe (Izmir University of Economics, Turkey)
Andrés Iglesias (University of Cantabria, Spain)

International Workshop on Collective Evolutionary Systems (IWCES 2012)
Alfredo Milani (University of Perugia, Italy)
Clement Leung (Hong Kong Baptist University, Hong Kong)

Mobile Communications (MC 2012)
Hyunseung Choo (Sungkyunkwan University, Korea)

Mobile Computing, Sensing, and Actuation for Cyber Physical Systems (MSA4CPS 2012)
Moonseong Kim (Korean Intellectual Property Office, Korea)
Saad Qaisar (NUST School of Electrical Engineering and Computer Science, Pakistan)

Optimization Techniques and Applications (OTA 2012)
Ana Maria Rocha (University of Minho, Portugal)

Parallel and Mobile Computing in Future Networks (PMCFUN 2012)
Al-Sakib Khan Pathan (International Islamic University Malaysia, Malaysia)

PULSES - Transitions and Nonlinear Phenomena (PULSES 2012)
Carlo Cattani (University of Salerno, Italy)
Ming Li (East China Normal University, China)
Shengyong Chen (Zhejiang University of Technology, China)

Quantum Mechanics: Computational Strategies and Applications (QMCSA 2012)
Mirco Ragni (Universidade Federal da Bahia, Brazil)
Frederico Vasconcellos Prudente (Universidade Federal da Bahia, Brazil)
Angelo Marconi Maniero (Universidade Federal da Bahia, Brazil)
Ana Carla Peixoto Bitencourt (Universidade Federal do Recôncavo da Bahia, Brazil)

Remote Sensing Data Analysis, Modeling, Interpretation and Applications: From a Global View to a Local Analysis (RS 2012)
Rosa Lasaponara (Institute of Methodologies for Environmental Analysis, National Research Council, Italy)
Nicola Masini (Archaeological and Monumental Heritage Institute, National Research Council, Italy)

Soft Computing and Data Engineering (SCDE 2012)
Mustafa Mat Deris (Universiti Tun Hussein Onn Malaysia, Malaysia)
Tutut Herawan (Universitas Ahmad Dahlan, Indonesia)

Software Engineering Processes and Applications (SEPA 2012)
Sanjay Misra (Federal University of Technology Minna, Nigeria)

Software Quality (SQ 2012)
Sanjay Misra (Federal University of Technology Minna, Nigeria)

Security and Privacy in Computational Sciences (SPCS 2012)
Arijit Ukil (Tata Consultancy Services, India)

Tools and Techniques in Software Development Processes (TTSDP 2012)
Sanjay Misra (Federal University of Technology Minna, Nigeria)

Virtual Reality and Its Applications (VRA 2012)
Osvaldo Gervasi (University of Perugia, Italy)
Andrés Iglesias (University of Cantabria, Spain)

Wireless and Ad-Hoc Networking (WADNet 2012)
Jongchan Lee (Kunsan National University, Korea)
Sangjoon Park (Kunsan National University, Korea)

Program Committee
Jemal Abawajy (Deakin University, Australia)
Kenny Adamson (University of Ulster, UK)
Filipe Alvelos (University of Minho, Portugal)
Paula Amaral (Universidade Nova de Lisboa, Portugal)
Hartmut Asche (University of Potsdam, Germany)
Md. Abul Kalam Azad (University of Minho, Portugal)
Michela Bertolotto (University College Dublin, Ireland)
Sandro Bimonte (CEMAGREF, TSCF, France)
Rod Blais (University of Calgary, Canada)
Ivan Blecic (University of Sassari, Italy)
Giuseppe Borruso (University of Trieste, Italy)
Alfredo Buttari (CNRS-IRIT, France)
Yves Caniou (Lyon University, France)
José A. Cardoso e Cunha (Universidade Nova de Lisboa, Portugal)
Leocadio G. Casado (University of Almeria, Spain)
Carlo Cattani (University of Salerno, Italy)
Mete Celik (Erciyes University, Turkey)
Alexander Chemeris (National Technical University of Ukraine “KPI”, Ukraine)
Min Young Chung (Sungkyunkwan University, Korea)
Gilberto Corso Pereira (Federal University of Bahia, Brazil)
M. Fernanda Costa (University of Minho, Portugal)
Gaspar Cunha (University of Minho, Portugal)
Carla Dal Sasso Freitas (Universidade Federal do Rio Grande do Sul, Brazil)
Pradesh Debba (The Council for Scientific and Industrial Research (CSIR), South Africa)
Frank Devai (London South Bank University, UK)
Rodolphe Devillers (Memorial University of Newfoundland, Canada)
Prabu Dorairaj (NetApp, India/USA)
M. Irene Falcao (University of Minho, Portugal)
Cherry Liu Fang (U.S. DOE Ames Laboratory, USA)
Edite M.G.P. Fernandes (University of Minho, Portugal)
Jose-Jesus Fernandez (National Centre for Biotechnology, CSIS, Spain)
Maria Antonia Forjaz (University of Minho, Portugal)
Maria Celia Furtado Rocha (PRODEB–PósCultura/UFBA, Brazil)
Akemi Galvez (University of Cantabria, Spain)
Paulino Jose Garcia Nieto (University of Oviedo, Spain)
Marina Gavrilova (University of Calgary, Canada)
Jerome Gensel (LSR-IMAG, France)
Maria Giaoutzi (National Technical University, Athens, Greece)
Andrzej M. Goscinski (Deakin University, Australia)
Alex Hagen-Zanker (University of Cambridge, UK)
Malgorzata Hanzl (Technical University of Lodz, Poland)
Shanmugasundaram Hariharan (B.S. Abdur Rahman University, India)
Eligius M.T. Hendrix (University of Malaga/Wageningen University, Spain/The Netherlands)
Hisamoto Hiyoshi (Gunma University, Japan)
Fermin Huarte (University of Barcelona, Spain)
Andres Iglesias (University of Cantabria, Spain)
Mustafa Inceoglu (EGE University, Turkey)
Peter Jimack (University of Leeds, UK)
Qun Jin (Waseda University, Japan)
Farid Karimipour (Vienna University of Technology, Austria)
Baris Kazar (Oracle Corp., USA)
DongSeong Kim (University of Canterbury, New Zealand)
Taihoon Kim (Hannam University, Korea)
Ivana Kolingerova (University of West Bohemia, Czech Republic)
Dieter Kranzlmueller (LMU and LRZ Munich, Germany)
Antonio Laganà (University of Perugia, Italy)
Rosa Lasaponara (National Research Council, Italy)
Maurizio Lazzari (National Research Council, Italy)
Cheng Siong Lee (Monash University, Australia)
Sangyoun Lee (Yonsei University, Korea)
Jongchan Lee (Kunsan National University, Korea)
Clement Leung (Hong Kong Baptist University, Hong Kong)
Chendong Li (University of Connecticut, USA)
Gang Li (Deakin University, Australia)
Ming Li (East China Normal University, China)
Fang Liu (AMES Laboratories, USA)
Xin Liu (University of Calgary, Canada)
Savino Longo (University of Bari, Italy)
Tinghuai Ma (NanJing University of Information Science and Technology, China)
Sergio Maffioletti (University of Zurich, Switzerland)
Ernesto Marcheggiani (Katholieke Universiteit Leuven, Belgium)
Antonino Marvuglia (Research Centre Henri Tudor, Luxembourg)
Nicola Masini (National Research Council, Italy)
Nirvana Meratnia (University of Twente, The Netherlands)
Alfredo Milani (University of Perugia, Italy)
Sanjay Misra (Federal University of Technology Minna, Nigeria)
Giuseppe Modica (University of Reggio Calabria, Italy)
José Luis Montaña (University of Cantabria, Spain)
Beniamino Murgante (University of Basilicata, Italy)
Jiri Nedoma (Academy of Sciences of the Czech Republic, Czech Republic)
Laszlo Neumann (University of Girona, Spain)
Kok-Leong Ong (Deakin University, Australia)
Belen Palop (Universidad de Valladolid, Spain)
Marcin Paprzycki (Polish Academy of Sciences, Poland)
Eric Pardede (La Trobe University, Australia)
Kwangjin Park (Wonkwang University, Korea)
Ana Isabel Pereira (Polytechnic Institute of Braganca, Portugal)
Maurizio Pollino (Italian National Agency for New Technologies, Energy and Sustainable Economic Development, Italy)
Alenka Poplin (University of Hamburg, Germany)
Vidyasagar Potdar (Curtin University of Technology, Australia)
David C. Prosperi (Florida Atlantic University, USA)
Wenny Rahayu (La Trobe University, Australia)
Jerzy Respondek (Silesian University of Technology, Poland)
Ana Maria A.C. Rocha (University of Minho, Portugal)
Humberto Rocha (INESC-Coimbra, Portugal)
Alexey Rodionov (Institute of Computational Mathematics and Mathematical Geophysics, Russia)
Cristina S. Rodrigues (University of Minho, Portugal)
Octavio Roncero (CSIC, Spain)
Maytham Safar (Kuwait University, Kuwait)
Haiduke Sarafian (The Pennsylvania State University, USA)
Qi Shi (Liverpool John Moores University, UK)
Dale Shires (U.S. Army Research Laboratory, USA)
Takuo Suganuma (Tohoku University, Japan)
Ana Paula Teixeira (University of Tras-os-Montes and Alto Douro, Portugal)
Senhorinha Teixeira (University of Minho, Portugal)
Parimala Thulasiraman (University of Manitoba, Canada)
Carmelo Torre (Polytechnic of Bari, Italy)
Javier Martinez Torres (Centro Universitario de la Defensa Zaragoza, Spain)
Giuseppe A. Trunfio (University of Sassari, Italy)
Unal Ufuktepe (Izmir University of Economics, Turkey)
Mario Valle (Swiss National Supercomputing Centre, Switzerland)
Pablo Vanegas (University of Cuenca, Ecuador)
Piero Giorgio Verdini (INFN Pisa and CERN, Italy)
Marco Vizzari (University of Perugia, Italy)
Koichi Wada (University of Tsukuba, Japan)
Krzysztof Walkowiak (Wroclaw University of Technology, Poland)
Robert Weibel (University of Zurich, Switzerland)
Roland Wismüller (Universität Siegen, Germany)
Mudasser Wyne (SOET National University, USA)
Chung-Huang Yang (National Kaohsiung Normal University, Taiwan)
Xin-She Yang (National Physical Laboratory, UK)
Salim Zabir (France Telecom Japan Co., Japan)
Albert Y. Zomaya (University of Sydney, Australia)
Sponsoring Organizations

ICCSA 2012 would not have been possible without the tremendous support of many organizations and institutions, for which all organizers and participants of ICCSA 2012 express their sincere gratitude:

Universidade Federal da Bahia, Brazil (http://www.ufba.br)
Universidade Federal do Recôncavo da Bahia, Brazil (http://www.ufrb.edu.br)
Universidade Estadual de Feira de Santana, Brazil (http://www.uefs.br)
University of Perugia, Italy (http://www.unipg.it)
University of Basilicata, Italy (http://www.unibas.it)
Monash University, Australia (http://monash.edu)
Kyushu Sangyo University, Japan (www.kyusan-u.ac.jp)
Brazilian Computer Society (www.sbc.org.br)
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES), Brazil (http://www.capes.gov.br)
National Council for Scientific and Technological Development (CNPq), Brazil (http://www.cnpq.br)
Fundação de Amparo à Pesquisa do Estado da Bahia (FAPESB), Brazil (http://www.fapesb.ba.gov.br)

Table of Contents – Part I

Workshop on Advances in High Performance Algorithms and Applications (AHPAA 2012)

Processor Allocation for Optimistic Parallelization of Irregular Programs
  Francesco Versaci and Keshav Pingali

Feedback-Based Global Instruction Scheduling for GPGPU Applications
  Constantin Timm, Markus Görlich, Frank Weichert, Peter Marwedel, and Heinrich Müller

Parallel Algorithm for Landform Attributes Representation on Multicore and Multi-GPU Systems
  Murilo Boratto, Pedro Alonso, Carla Ramiro, Marcos Barreto, and Leandro Coelho

The Performance Model of an Enhanced Parallel Algorithm for the SOR Method
  Italo Epicoco and Silvia Mocavero

Performance Driven Cooperation between Kernel and Auto-tuning Multi-threaded Interval B&B Applications
  Juan Francisco Sanjuan-Estrada, Leocadio Gonzalez Casado, Immaculada García, and Eligius M.T. Hendrix

kNN-Borůvka-GPU: A Fast and Scalable MST Construction from kNN Graphs on GPU
  Ahmed Shamsul Arefin, Carlos Riveros, Regina Berretta, and Pablo Moscato

Workshop on Bio-inspired Computing and Applications (BIOCA 2012)

Global Hybrid Ant Bee Colony Algorithm for Training Artificial Neural Networks
  Habib Shah, Rozaida Ghazali, Nazri Mohd Nawi, and Mustafa Mat Deris

The Effect of Intelligent Escape on Distributed SER-Based Search
  Daniel S.F. Alves, Felipe M.G. França, Luiza de Macedo Mourelle, Nadia Nedjah, and Priscila M.V. Lima

ACO-Based Static Routing for Network-on-Chips
  Luneque Silva Jr., Nadia Nedjah, Luiza de Macedo Mourelle, and Fábio Gonçalves Pessanha

A Genetic Algorithm Assisted by a Locally Weighted Regression Surrogate Model
  Leonardo G. Fonseca, Heder S. Bernardino, and Helio J.C. Barbosa

Swarm Robots with Queue Organization Using Infrared Communication
  Rafael Mathias de Mendonça, Nadia Nedjah, and Luiza de Macedo Mourelle

Swarm Grid: A Proposal for High Performance of Parallel Particle Swarm Optimization Using GPGPU
  Rogério M. Calazan, Nadia Nedjah, and Luiza de Macedo Mourelle

An Artificial Immune System Approach to Associative Classification
  Samir A. Mohamed Elsayed, Sanguthevar Rajasekaran, and Reda A. Ammar

Workshop on Computational Geometry and Applications (CGA 2012)

A Review on Delaunay Refinement Techniques
  Sanderson L. Gonzaga de Oliveira

Axis-Parallel Dimension Reduction for Biometric Research
  Kushan Ahmadian and Marina Gavrilova

An Overview of Procedures for Refining Triangulations
  Sanderson L. Gonzaga de Oliveira

DEM Interpolation from Contours Using Medial Axis Transformation
  Joonsoo Choi, Jaewee Heo, Kwang-Soo Hahn, and Junho Kim

Analysis of a High Definition Camera-Projector Video System for Geometry Reconstruction
  José Luiz de Souza Filho, Roger Correia Silva, Dhiego Oliveira Sad, Renan Dembogurski, Marcelo Bernardes Vieira, Sócrates de Oliveira Dantas, and Rodrigo Silva

Video-Based Face Verification with Local Binary Patterns and SVM Using GMM Supervectors
  Tiago F. Pereira, Marcus A. Angeloni, Flávio O. Simões, and José Eduardo C. Silva

GPU-Based Influence Regions Optimization
  Marta Fort and J. Antoni Sellarès

Fast and Simple Approach for Polygon Schematization
  Serafino Cicerone and Matteo Cermignani

On Counting and Analyzing Empty Pseudo-triangles in a Point Set
  Sergey Kopeliovich and Kira Vyatkina

Workshop on Chemistry and Materials Sciences and Technologies (CMST 2012)

Quantum Reactive Scattering Calculations on GPU
  Leonardo Pacifici, Danilo Nalli, and Antonio Laganà

Tuning Heme Functionality: The Cases of Cytochrome c Oxidase and Myoglobin Oxidation
  Vangelis Daskalakis, Stavros C. Farantos, and Constantinos Varotsis

Theoretical and Experimental Study of the Energy and Structure of Fragment Ions Produced by Double Photoionization of Benzene Molecules
  Marzio Rosi, Pietro Candori, Stefano Falcinelli, Maria Suelly Pedrosa Mundim, Fernando Pirani, and Franco Vecchiocattivi

Theoretical Study of Reactions Relevant for Atmospheric Models of Titan: Interaction of Excited Nitrogen Atoms with Small Hydrocarbons
  Marzio Rosi, Stefano Falcinelli, Nadia Balucani, Piergiorgio Casavecchia, Francesca Leonori, and Dimitris Skouteris

Efficient Workload Distribution Bridging HTC and HPC in Scientific Computing
  Carlo Manuali, Alessandro Costantini, Antonio Laganà, Marco Cecchi, Antonia Ghiselli, Michele Carpené, and Elda Rossi

Taxonomy Management in a Federation of Distributed Repositories: A Chemistry Use Case
  Sergio Tasso, Simonetta Pallottelli, Michele Ferroni, Riccardo Bastianini, and Antonio Laganà

Grid Enabled High Level ab initio Electronic Structure Calculations for the N2+N2 Exchange Reaction
  Marco Verdicchio, Leonardo Pacifici, and Antonio Laganà

A Bond-Bond Portable Approach to Intermolecular Interactions: Simulations for N-methylacetamide and Carbon Dioxide Dimers
  Andrea Lombardi, Noelia Faginas Lago, Antonio Laganà, Fernando Pirani, and Stefano Falcinelli

A Grid Execution Model for Computational Chemistry Applications Using the GC3Pie Framework and the AppPot VM Environment
  Alessandro Costantini, Riccardo Murri, Sergio Maffioletti, and Antonio Laganà

The MPI Structure of Chimere
  Antonio Laganà, Stefano Crocchianti, Giorgio Tentella, and Alessandro Costantini

A New Statistical Method for the Determination of Dynamical Features of Molecular Dication Dissociation Processes
  Maria Suely Pedrosa Mundim, Pietro Candori, Stefano Falcinelli, Kleber Carlos Mundim, Fernando Pirani, and Franco Vecchiocattivi

Workshop on Cities, Technologies and Planning (CTP 2012)

SWOT Analysis of Information Technology Industry in Beijing, China Using Patent Data
  Lucheng Huang, Kangkang Wang, Feifei Wu, Yan Lou, Hong Miao, and Yanmei Xu

Using 3D GeoDesign for Planning of New Electricity Networks in Spain
  Francisco-Javier Moreno Marimbaldo, Federico-Vladimir Gutiérrez Corea, and Miguel-Ángel Manso Callejo

Assessment of Online PPGIS Study Cases in Urban Planning
  Geisa Bugs

e-Participation: Social Media and the Public Space
  Gilberto Corso Pereira, Maria Célia Furtado Rocha, and Alenka Poplin

Ubicomp and Environmental Designers: Assembling a Collective Work towards the Development of Sustainable Technologies
  Renato Cesar Ferreira de Souza

Sustainable Micro-business in Environmental Unsustainability and Economic Inefficiency
  José G. Vargas-Hernández

Efficient Visualization of the Geometric Information of CityGML: Application for the Documentation of Built Heritage
  Iñaki Prieto, Jose Luis Izkara, and Francisco Javier Delgado del Hoyo

ICT to Evaluate Participation in Urban Planning: Remarks from a Case Study
  Francesco Rotondo and Francesco Selicato

A Spatial Data Infrastructure Situation-Aware to the 2014 World Cup
  Wellington Moreira de Oliveira, Jugurta Lisboa Filho, and Alcione de Paiva Oliveira

Towards a Two-Way Participatory Process
  António Silva and Jorge Gustavo Rocha

An Automatic Procedure to Select Areas for Transfer Development Rights in the Urban Market
  Carmelo Maria Torre, Pasquale Balena, and Romina Zito

General Track on Computational Methods, Algorithms and Scientific Application

Magnetic Net and a Bouncing Magnetic Ball
  Haiduke Sarafian

Autonomous Leaves Graph Applied to the Simulation of the Boundary Layer around a Non-symmetric NACA Airfoil
  Sanderson Lincohn Gonzaga de Oliveira and Mauricio Kischinhevsky

Sinimbu – Multimodal Queries to Support Biodiversity Studies
  Gabriel de S. Fedel, Claudia Bauzer Medeiros, and Jefersson Alex dos Santos

Comparison between Genetic Algorithms and Differential Evolution for Solving the History Matching Problem
  Elisa P. dos Santos Amorim, Carolina R. Xavier, Ricardo Silva Campos, and Rodrigo W. dos Santos

An Adaptive Mesh Algorithm for the Numerical Solution of Electrical Models of the Heart
  Rafael S. Oliveira, Bernardo M. Rocha, Denise Burgarelli, Wagner Meira Jr., and Rodrigo W. dos Santos

Decision Model to Predict the Implant Success
  Ana Cristina Braga, Paula Vaz, João C. Sampaio-Fernandes, António Felino, and Maria Purificacão Tavares

Multiscale Modeling of Heterogeneous Media Applying AEH to 3D Bodies
  Bárbara de Melo Quintela, Daniel Mendes Caldas, Michèle Cristina Resende Farage, and Marcelo Lobosco

A Three-Dimensional Computational Model of the Innate Immune System
  Pedro Augusto F. Rocha, Micael P. Xavier, Alexandre B. Pigozzo, Barbara de M. Quintela, Gilson C. Macedo, Rodrigo Weber dos Santos, and Marcelo Lobosco

System Dynamics Metamodels Supporting the Development of Computational Models of the Human Innate Immune System
  Igor Knop, Alexandre Pigozzo, Barbara Quintela, Gilson C. Macedo, Ciro Barbosa, Rodrigo Weber dos Santos, and Marcelo Lobosco

Exact and Asymptotic Computations of Elementary Spin Networks: Classification of the Quantum–Classical Boundaries
  Ana Carla P. Bitencourt, Annalisa Marzuoli, Mirco Ragni, Roger W. Anderson, and Vincenzo Aquilanti

Performance of DFT and MP2 Approaches for Geometry of Rhenium Allenylidenes Complexes and the Thermodynamics of Phosphines Addition
  Cecilia Coletti and Nazzareno Re

Author Index

Table of Contents – Part II

Workshop on Econometrics and Multidimensional Evaluation in the Urban Environment (EMEUE 2012)

Knowledge and Innovation in Manufacturing Sector: The Case of Wedding Dresses in Southern Italy
  Annunziata de Felice, Isabella Martucci, and Dario A. Schirone

Marketing Strategies: Support and Enhancement of Core Business
  Dario Antonio Schirone and Germano Torkan

The Rational Quantification of Social Housing: An Operative Research Model
  Gianluigi De Mare, Antonio Nesticò, and Francesco Tajani

Simulation of Users Decision in Transport Mode Choice Using Neuro-Fuzzy Approach
  Mauro Dell'Orco and Michele Ottomanelli

Multidimensional Spatial Decision-Making Process: Local Shared Values in Action
  Maria Cerreta, Simona Panaro, and Daniele Cannatella

A Proposal for a Stepwise Fuzzy Regression: An Application to the Italian University System
  Francesco Campobasso and Annarita Fanizzi

Cluster Analysis for Strategic Management: A Case Study of IKEA
  Paola Perchinunno and Dario Antonio Schirone

Clustering for the Localization of Degraded Urban Areas
  Silvestro Montrone and Paola Perchinunno

A BEP Analysis of Energy Supply for Sustainable Urban Microgrids
  Pasquale Balena, Giovanna Mangialardi, and Carmelo Maria Torre

The Effect of Infrastructural Works on Urban Property Values: The asse attrezzato in Pescara, Italy
  Sebastiano Carbonara

Prospect of Integrate Monitoring: A Multidimensional Approach
  Marco Selicato, Carmelo Maria Torre, and Giovanni La Trofa

The Use of Ahp in a Multiactor Evaluation for Urban Development Programs: A Case Study
  Luigi Fusco Girard and Carmelo Maria Torre

Assessing Urban Transformations: A SDSS for the Master Plan of Castel Capuano, Naples
  Maria Cerreta and Pasquale De Toro

Workshop on Geographical Analysis, Urban Modeling, Spatial Statistics (Geo–An–Mod 2012)

Computational Context to Promote Geographic Information Systems toward Human-Centric Perspectives
  Luis Paulo da Silva Carvalho and Paulo Caetano da Silva

Voronoi-Based Curve Reconstruction: Issues and Solutions
  Mehran Ghandehari and Farid Karimipour

Geovisualization and Geostatistics: A Concept for the Numerical and Visual Analysis of Geographic Mass Data
  Julia Gonschorek and Lucia Tyrallová

Spatio-Explorative Analysis and Its Benefits for a GIS-integrated Automated Feature Identification
  Lucia Tyrallová and Julia Gonschorek

Peer Selection in P2P Service Overlays Using Geographical Location Criteria
  Adriano Fiorese, Paulo Simões, and Fernando Boavida

Models for Spatial Interaction Data: Computation and Interpretation of Accessibility
  Morton E. O'Kelly

Am I Safe in My Home? Fear of Crime Analyzed with Spatial Statistics Methods in a Central European City
  Daniel Lederer

Developing a GIS Based Decision Support System for Resource Allocation in Earthquake Search and Rescue Operation
  Abolfazl Rasekh and Ali Reza Vafaeinezhad

Concepts, Compass and Computation: Models for Directional Part-Whole Relationships
  Gaurav Singh, Rolf A. de By, and Ivana Ivánová

SIGHabitar – Business Intelligence Based Approach for the Development of Land Information Systems: The Multipurpose Technical Cadastre of Ouro Preto, Brazil
  João Tácio C. Silva, José Francisco V. Rezende, Érika Fidêncio, Tarick Melo, Brayan Neves, Joubert C. Lima, and Tiago G.S. Carneiro

Rehabilitation and Reconstruction of Asphalts Pavement Decision Making Based on Rough Set Theory
  Shaaban M. Shaaban and Hossam A. Nabwey

Cartographic Circuits Inside GIS Environment for the Construction of the Landscape Sensitivity Map in the Case of Cremona
  Pier Luigi Paolillo, Umberto Baresi, and Roberto Bisceglie

Cloud Classification in JPEG-compressed Remote Sensing Data (LANDSAT 7/ETM+)
  Erik Borg, Bernd Fichtelmann, and Hartmut Asche

A Probabilistic Rough Set Approach for Water Reservoirs Site Location Decision Making
  Shaaban M. Shaaban and Hossam A. Nabwey

Definition and Analysis of New Agricultural Farm Energetic Indicators Using Spatial OLAP
  Sandro Bimonte, Kamal Boulil, Jean-Pierre Chanet, and Marilys Pradel

Validating a Smartphone-Based Pedestrian Navigation System Prototype: An Informal Eye-Tracking Pilot Test
  Mario Kluge and Hartmut Asche

Open Access to Historical Atlas: Sources of Information and Services for Landscape Analysis in an SDI Framework
  Raffaella Brumana, Daniela Oreni, Branka Cuca, Anna Rampini, and Monica Pepe

From Concept to Implementation: Web-Based Cartographic Visualisation with CartoService
  Hartmut Asche and Rita Engemaier

Multiagent Systems for the Governance of Spatial Environments: Some Modelling Approaches
  Domenico Camarda

A Data Fusion System for Spatial Data Mining, Analysis and Improvement
  Silvija Stankute and Hartmut Asche

Dealing with Multiple Source Spatio-temporal Data in Urban Dynamics Analysis
  João Peixoto and Adriano Moreira

Public Decision Processes: The Interaction Space Supporting Planner's Activity
  Giuseppe B. Las Casas, Lucia Tilio, and Alexis Tsoukiàs

Selection and Scheduling Problem in Continuous Time with Pairwise-Interdependencies
  Ivan Blecic, Arnaldo Cecchini, and Giuseppe A. Trunfio

Parallel Simulation of Urban Dynamics on the GPU
  Ivan Blecic, Arnaldo Cecchini, and Giuseppe A. Trunfio

Geolocalization as Wayfinding and User Experience Support in Cultural Heritage Locations
  Letizia Bollini and Roberto Falcone

Climate Alteration in the Metropolitan Area of Bari: Temperatures and Relationship with Characters of Urban Context
  Pierangela Loconte, Claudia Ceppi, Giorgia Lubisco, Francesco Mancini, Claudia Piscitelli, and Francesco Selicato

Study of Sustainability of Renewable Energy Sources through GIS Analysis Techniques
  Emanuela Caiaffa, Alessandro Marucci, and Maurizio Pollino

The Comparative Analysis of Urban Development in Two Geographic Regions: The State of Rio De Janeiro and the Campania Region
  Massimiliano Bencardino, Ilaria Greco, and Pitter Reis Ladeira

Land-Use Dynamics at the Micro Level: Constructing and Analyzing Historical Datasets for the Portuguese Census Tracts
  António M. Rodrigues, Teresa Santos, Raquel Faria de Deus, and Dulce Pimentel

Using Hydrodynamic Modeling for Estimating Flooding and Water Depths in Grand Bay, Alabama
  Vladimir J. Alarcon and William H. McAnally

Comparison of Two Hydrodynamic Models of Weeks Bay, Alabama
  Vladimir J. Alarcon, William H. McAnally, and Surendra Pathak

Connections between Urban Structure and Urban Heat Island Generation: An Analysis through Remote Sensing and GIS
  Marialuce Stanganelli and Marco Soravia

Taking the Leap: From Disparate Data to a Fully Interactive SEIS for the Maltese Islands
  Saviour Formosa, Elaine Sciberras, and Janice Formosa Pace

Analyzing the Central Business District: The Case of Sassari in the Sardinia Island
  Silvia Battino, Giuseppe Borruso, and Carlo Donato

That's ReDO: Ontologies and Regional Development Planning
  Francesco Scorza, Giuseppe B. Las Casas, and Beniamino Murgante

A Landscape Complex Values Map: Integration among Soft Values and Hard Values in a Spatial Decision Support System
  Maria Cerreta and Roberta Mele

Analyzing Migration Phenomena with Spatial Autocorrelation Techniques
  Beniamino Murgante and Giuseppe Borruso

From Urban Labs in the City to Urban Labs on the Web
  Viviana Lanza, Lucia Tilio, Antonello Azzato, Giuseppe B. Las Casas, and Piergiuseppe Pontrandolfi

General Track on Geometric Modelling, Graphics and Visualization

Bilayer Segmentation Augmented with Future Evidence
  Silvio Ricardo Rodrigues Sanches, Valdinei Freire da Silva, and Romero Tori

A Viewer-dependent Tensor Field Visualization Using Multiresolution and Particle Tracing
  José Luiz Ribeiro de Souza Filho, Marcelo Caniato Renhe, Marcelo Bernardes Vieira, and Gildo de Almeida Leonel

Abnormal Gastric Cell Segmentation Based on Shape Using Morphological Operations
  Noor Elaiza Abdul Khalid, Nurnabilah Samsudin, and Rathiah Hashim

A Bio-inspired System for Boundary Detection in Color Natural Scenes
  Karin S. Komati, Evandro O.T. Salles, and Mario Sarcinelli-Filho

Author Index

Table of Contents – Part III

Workshop on Optimization Techniques and Applications (OTA 2012)

Incorporating Radial Basis Functions in Pattern Search Methods: Application to Beam Angle Optimization in Radiotherapy Treatment Planning
  Humberto Rocha, Joana M. Dias, Brigida C. Ferreira, and Maria do Carmo Lopes

On the Complexity of a Mehrotra-Type Predictor-Corrector Algorithm
  Ana Paula Teixeira and Regina Almeida

Design of Wood Biomass Supply Chains
  Tiago Costa Gomes, Filipe Pereira e Alvelos, and Maria Sameiro Carvalho

On Solving a Stochastic Programming Model for Perishable Inventory Control
  Eligius M.T. Hendrix, Rene Haijema, Roberto Rossi, and Karin G.J. Pauls-Worm

An Artificial Fish Swarm Filter-Based Method for Constrained Global Optimization
  Ana Maria A.C. Rocha, M. Fernanda P. Costa, and Edite M.G.P. Fernandes

Solving Multidimensional 0–1 Knapsack Problem with an Artificial Fish Swarm Algorithm
  Md. Abul Kalam Azad, Ana Maria A.C. Rocha, and Edite M.G.P. Fernandes
Optimization Model of COTS Selection Based on Cohesion and Coupling for Modular Software Systems under Multiple Applications Environment
  Pankaj Gupta, Shilpi Verma, and Mukesh Kumar Mehlawat

A Derivative-Free Filter Driven Multistart Technique for Global Optimization
  Florbela P. Fernandes, M. Fernanda P. Costa, and Edite M.G.P. Fernandes

On Lower Bounds Using Additively Separable Terms in Interval B&B
  José L. Berenguel, Leocadio G. Casado, I. García, Eligius M.T. Hendrix, and F. Messine

A Genetic Algorithm for the Job Shop on an ASRS Warehouse
  José Figueiredo, José A. Oliveira, Luis Dias, and Guilherme A.B. Pereira

On Solving the Profit Maximization of Small Cogeneration Systems
  Ana C.M. Ferreira, Ana Maria A.C. Rocha, Senhorinha F.C.F. Teixeira, Manuel L. Nunes, and Luís B. Martins

Global Optimization Simplex Bisection Revisited Based on Considerations by Reiner Horst
  Eligius M.T. Hendrix, Leocadio G. Casado, and Paula Amaral

Application of Variance Analysis to the Combustion of Residual Oils
  Manuel Ferreira and José Carlos Teixeira

Warehouse Design and Planning: A Mathematical Programming Approach
  Carla A.S. Geraldes, Maria Sameiro Carvalho, and Guilherme A.B. Pereira

Application of CFD Tools to Optimize Natural Building Ventilation Design
  José Carlos Teixeira, Ricardo Lomba, Senhorinha F.C.F. Teixeira, and Pedro Lobarinhas

Workshop on Mobile Communications (MC 2012)

Middleware Integration for Ubiquitous Sensor Networks in Agriculture
  Junghoon Lee, Gyung-Leen Park, Min-Jae Kang, Ho-Young Kwak, Sang Joon Lee, and Jikwang Han

Usage Pattern-Based Prefetching: Quick Application Launch on Mobile Devices
  Hokwon Song, Changwoo Min, Jeehong Kim, and Young Ik Eom

EIMOS: Enhancing Interactivity in Mobile Operating Systems
  Sunwook Bae, Hokwon Song, Changwoo Min, Jeehong Kim, and Young Ik Eom

Development of Mobile Hybrid MedIntegraWeb App for Interoperation between u-RPMS and HIS
  Young-Hyuk Kim, Il-Kown Lim, Jae-Pil Lee, Jae-Gwang Lee, and Jae-Kwang Lee

A Distributed Lifetime-Maximizing Scheme for Connected Target Coverage in WSNs
  Duc Tai Le, Thang Le Duc, and Hyunseung Choo

Reducing Last Level Cache Pollution in NUMA Multicore Systems for Improving Cache Performance
  Deukhyeon An, Jeehong Kim, JungHyun Han, and Young Ik Eom

The Fast Handover Scheme for Mobile Nodes in NEMO-Enabled PMIPv6
  Changyong Park, Junbeom Park, Hao Wang, and Hyunseung Choo

A Reference Model for Virtual Resource Description and Discovery in Virtual Networks
  Yuemei Xu, Yanni Han, Wenjia Niu, Yang Li, Tao Lin, and Song Ci

TV Remote Control Using Human Hand Motion Based on Optical Flow System
  Soonmook Jeong, Taehoun Song, Keyho Kwon, and Jae Wook Jeon

Fast and Reliable Data Forwarding in Low-Duty-Cycle Wireless Sensor Networks
  Junseong Choe, Nguyen Phan Khanh Ha, Junguye Hong, and Hyunseung Choo

Workshop on Mobile-Computing, Sensing, and Actuation for Cyber Physical Systems (MSA4CPS 2012)

Neural Network and Physiological Parameters Based Control of Artificial Pancreas for Improved Patient Safety
  Saad Bin Qaisar, Salman H. Khan, and Sahar Imtiaz

A Genetic Algorithm Assisted Resource Management Scheme for Reliable Multimedia Delivery over Cognitive Networks
  Salman Ali, Ali Munir, Saad Bin Qaisar, and Junaid Qadir

Performance Analysis of WiMAX Best Effort and ertPS Service Classes for Video Transmission
  Hassan Abid, Haroon Raja, Ali Munir, Jaweria Amjad, Aliya Mazhar, and Dong-Young Lee

Jump Oriented Programming on Windows Platform (on the x86)
  Jae-Won Min, Sung-Min Jung, Dong-Young Lee, and Tai-Myoung Chung

Cryptanalysis and Improvement of a Biometrics-Based Multi-server Authentication with Key Agreement Scheme
  Hakhyun Kim, Woongryul Jeon, Kwangwoo Lee, Yunho Lee, and Dongho Won

Rate-Distortion Optimized Transcoder Selection for Multimedia Transmission in Heterogeneous Networks
  Haroon Raja and Saad Bin Qaisar

Formal Probabilistic Analysis of Cyber-Physical Transportation Systems
  Atif Mashkoor and Osman Hasan

Workshop on Remote Sensing (RS 2012)

DEM Reconstruction of Coastal Geomorphology from DINSAR
  Maged Marghany

Three-Dimensional Coastal Front Visualization from RADARSAT-1 SAR Satellite Data
  Maged Marghany

A New Self-Learning Algorithm for Dynamic Classification of Water Bodies
  Bernd Fichtelmann and Erik Borg

DEM Accuracy of High Resolution Satellite Images
  Mustafa Yanalak, Nebiye Musaoglu, Cengizhan Ipbuker, Elif Sertel, and Sinasi Kaya

Low Cost Pre-operative Fire Monitoring from Fire Danger to Severity Estimation Based on Satellite MODIS, Landsat and ASTER Data: The Experience of FIRE-SAT Project in the Basilicata Region (Italy)
  Antonio Lanorte, Fortunato De Santis, Angelo Aromando, and Rosa Lasaponara

Investigating Satellite Landsat TM and ASTER Multitemporal Data Set to Discover Ancient Canals and Acqueduct Systems
  Rosa Lasaponara and Nicola Masini

Using Spatial Autocorrelation Techniques and Multi-temporal Satellite Data for Analyzing Urban Sprawl
  Gabriele Nolè, Maria Danese, Beniamino Murgante, Rosa Lasaponara, and Antonio Lanorte

General Track on Information Systems and Technologies

A Framework for QoS Based Dynamic Web Services Composition
  Jigyasu Nema, Rajdeep Niyogi, and Alfredo Milani

Data Summarization Model for User Action Log Files
  Eleonora Gentili, Alfredo Milani, and Valentina Poggioni

User Modeling for Adaptive E-Learning Systems
  Birol Ciloglugil and Mustafa Murat Inceoglu

An Experimental Study of the Combination of Meta-Learning with Particle Swarm Algorithms for SVM Parameter Selection
  Péricles B.C. de Miranda, Ricardo B.C. Prudêncio, Andre Carlos P.L.F. de Carvalho, and Carlos Soares

An Investigation into Agile Methods in Embedded Systems Development
  Caroline Oliveira Albuquerque, Pablo Oliveira Antonino, and Elisa Yumi Nakagawa

Heap Slicing Using Type Systems
  Mohamed A. El-Zawawy

Using Autonomous Search for Generating Good Enumeration Strategy Blends in Constraint Programming
  Ricardo Soto, Broderick Crawford, Eric Monfroy, and Víctor Bustos

Evaluation of Normalization Techniques in Text Classification for Portuguese
  Merley da Silva Conrado, Víctor Antonio Laguna Gutiérrez, and Solange Oliveira Rezende

Extracting Definitions from Brazilian Legal Texts
  Edilson Ferneda, Hércules Antonio do Prado, Augusto Herrmann Batista, and Marcello Sandi Pinheiro

A Heuristic Diversity Production Approach
  Hamid Parvin, Hosein Alizadeh, Sajad Parvin, and Behzad Maleki

Structuring Taxonomies from Texts: A Case-Study on Defining Soil Classes
  Hércules Antonio do Prado, Edilson Ferneda, Francisco Carlos da Luz Rodrigues, Éder Martins de Souza, Osmar Abílio de Carvalho Jr., and Alfredo José Barreto Luiz

Exploring Fuzzy Ontologies in Mining Generalized Association Rules
  Rodrigo Moura Juvenil Ayres, Marcela Xavier Ribeiro, and Marilde Terezinha Prado Santos

BTA: Architecture for Reusable Business Tier Components with Access Control
  Óscar Mortágua Pereira, Rui L. Aguiar, and Maribel Yasmina Santos

Analysing the PDDL Language for Argumentation-Based Negotiation Planning
  Ariel Monteserin, Luis Berdún, and Analía A. Amandi

Predicting Potential Responders in Twitter: A Query Routing Algorithm
  Cleyton Caetano de Souza, Jonathas José de Magalhães, Evandro Barros de Costa, and Joseana Macêdo Fechine

Towards a Goal Recognition Model for the Organizational Memory
  Marcelo G. Armentano and Analía A. Amandi

SART: A New Association Rule Method for Mining Sequential Patterns in Time Series of Climate Data
  Marcos Daniel Cano, Marilde Terezinha Prado Santos, Ana Maria H. de Avila, Luciana A.S. Romani, Agma J.M. Traina, and Marcela Xavier Ribeiro

Author Index

Table of Contents – Part IV

Workshop on Software Engineering Processes and Applications (SEPA 2012)

Modeling Road Traffic Signals Control Using UML and the MARTE Profile
  Eduardo Augusto Silvestre and Michel dos Santos Soares

Analysis of Techniques for Documenting User Requirements
  Michel dos Santos Soares and Daniel Souza Cioquetta

Predicting Web Service Maintainability via Object-Oriented Metrics: A Statistics-Based Approach
  José Luis Ordiales Coscia, Marco Crasso, Cristian Mateos, Alejandro Zunino, and Sanjay Misra

Early Automated Verification of Tool Chain Design
  Matthias Biehl

Using UML Stereotypes to Support the Requirement Engineering: A Case Study
  Vitor A. Batista, Daniela C.C. Peixoto, Wilson Pádua, and Clarindo Isaías P.S. Pádua

Identifying Business Rules to Legacy Systems Reengineering Based on BPM and SOA
  Gleison S. do Nascimento, Cirano Iochpe, Lucinéia Thom, André C. Kalsing, and Álvaro Moreira

Abstraction Analysis and Certified Flow and Context Sensitive Points-to Relation for Distributed Programs
  Mohamed A. El-Zawawy

An Approach to Measure Understandability of Extended UML Based on Metamodel
  Yan Zhang, Yi Liu, Zhiyi Ma, Xuying Zhao, Xiaokun Zhang, and Tian Zhang

Dealing with Dependencies among Functional and Non-functional Requirements for Impact Analysis in Web Engineering
José Alfonso Aguilar, Irene Garrigós, Jose-Norberto Mazón, and Anibal Zaldı́var jvargas2006@gmail.com 1 16 29 40 51 67 83 100 116 XXXVI Table of Contents – Part IV Assessing Maintainability Metrics in Software Architectures Using COSMIC and UML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Eudisley Gomes dos Anjos, Ruan Delgado Gomes, and Mário Zenha-Rela Plagiarism Detection in Software Using Efficient String Matching . . . . . . Kusum Lata Pandey, Suneeta Agarwal, Sanjay Misra, and Rajesh Prasad Dynamic Software Maintenance Effort Estimation Modeling Using Neural Network, Rule Engine and Multi-regression Approach . . . . . . . . . . Ruchi Shukla, Mukul Shukla, A.K. Misra, T. Marwala, and W.A. Clarke 132 147 157 Workshop on Software Quality (SQ 2012) New Measures for Maintaining the Quality of Databases . . . . . . . . . . . . . . Hendrik Decker 170 A New Way to Determine External Quality of ERP Software . . . . . . . . . . Ali Orhan Aydin 186 Towards a Catalog of Spreadsheet Smells . . . . . . . . . . . . . . . . . . . . . . . . . . . Jácome Cunha, João P. Fernandes, Hugo Ribeiro, and João Saraiva 202 Program and Aspect Metrics for MATLAB . . . . . . . . . . . . . . . . . . . . . . . . . . Pedro Martins, Paulo Lopes, João P. Fernandes, João Saraiva, and João M.P. Cardoso 217 A Suite of Cognitive Complexity Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sanjay Misra, Murat Koyuncu, Marco Crasso, Cristian Mateos, and Alejandro Zunino 234 Complexity Metrics for Cascading Style Sheets . . . . . . . . . . . . . . . . . . . . . . Adewole Adewumi, Sanjay Misra, and Nicholas Ikhu-Omoregbe 248 A Systematic Review on the Impact of CK Metrics on the Functional Correctness of Object-Oriented Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yasser A. Khan, Mahmoud O. Elish, and Mohamed El-Attar 258 Workshop on Security and Privacy in Computational Sciences (SPCS 2012) Pinpointing Malicious Activities through Network and System-Level Malware Execution Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . André Ricardo Abed Grégio, Vitor Monte Afonso, Dario Simões Fernandes Filho, Paulo Lı́cio de Geus, Mario Jino, and Rafael Duarte Coelho dos Santos jvargas2006@gmail.com 274 Table of Contents – Part IV XXXVII A Malware Detection System Inspired on the Human Immune System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Isabela Liane de Oliveira, André Ricardo Abed Grégio, and Adriano Mauro Cansian Interactive, Visual-Aided Tools to Analyze Malware Behavior . . . . . . . . . . André Ricardo Abed Grégio, Alexandre Or Cansian Baruque, Vitor Monte Afonso, Dario Simões Fernandes Filho, Paulo Lı́cio de Geus, Mario Jino, and Rafael Duarte Coelho dos Santos Interactive Analysis of Computer Scenarios through Parallel Coordinates Graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gabriel D. Cavalcante, Sebastien Tricaud, Cleber P. Souza, and Paulo Lı́cio de Geus Methodology for Detection and Restraint of P2P Applications in the Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rodrigo M.P. Silva and Ronaldo M. Salles 286 302 314 326 Workshop on Soft Computing and Data Engineering (SCDE 2012) Text Categorization Based on Fuzzy Soft Set Theory . . . . . . . . . . . . . . . . . 
Bana Handaga and Mustafa Mat Deris 340 Cluster Size Determination Using JPEG Files . . . . . . . . . . . . . . . . . . . . . . . Nurul Azma Abdullah, Rosziati Ibrahim, and Kamaruddin Malik Mohamad 353 Semantic Web Search Engine Using Ontology, Clustering and Personalization Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Noryusliza Abdullah and Rosziati Ibrahim 364 Granules of Words to Represent Text: An Approach Based on Fuzzy Relations and Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Patrı́cia F. Castro and Geraldo B. Xexéo 379 Multivariate Time Series Classification by Combining Trend-Based and Value-Based Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bilal Esmael, Arghad Arnaout, Rudolf K. Fruhwirth, and Gerhard Thonhauser jvargas2006@gmail.com 392 XXXVIII Table of Contents – Part IV General Track on High Performance Computing and Networks Impact of pay-as-you-go Cloud Platforms on Software Pricing and Development: A Review and Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fernando Pires Barbosa and Andrea Schwertner Charão 404 Resilience for Collaborative Applications on Clouds: Fault-Tolerance for Distributed HPC Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Toàn Nguyên and Jean-Antoine Désidéri 418 T-DMB Receiver Model for Emergency Alert Service . . . . . . . . . . . . . . . . . Seong-Geun Kwon, Suk-Hwan Lee, Eung-Joo Lee, and Ki-Ryong Kwon 434 A Framework for Context-Aware Systems in Mobile Devices . . . . . . . . . . . Eduardo Jorge, Matheus Farias, Rafael Carmo, and Weslley Vieira 444 A Simulation Framework for Scheduling Performance Evaluation on CPU-GPU Heterogeneous System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Flavio Vella, Igor Neri, Osvaldo Gervasi, and Sergio Tasso Influence of Topology on Mobility and Transmission Capacity of Human-Based DTNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Danilo A. Moschetto, Douglas O. Freitas, Lourdes P.P. Poma, Ricardo Aparecido Perez de Almeida, and Cesar A.C. Marcondes Towards a Computer Assisted Approach for Migrating Legacy Systems to SOA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gonzalo Salvatierra, Cristian Mateos, Marco Crasso, and Alejandro Zunino 1+1 Protection of Overlay Distributed Computing Systems: Modeling and Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Krzysztof Walkowiak and Jacek Rak 457 470 484 498 Scheduling and Capacity Design in Overlay Computing Systems . . . . . . . Krzysztof Walkowiak, Andrzej Kasprzak, Michal Kosowski, and Marek Miziolek 514 GPU Acceleration of the caffa3d.MB Model . . . . . . . . . . . . . . . . . . . . . . . . . Pablo Igounet, Pablo Alfaro, Gabriel Usera, and Pablo Ezzatti 530 Security-Effective Fast Authentication Mechanism for Network Mobility in Proxy Mobile IPv6 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Illkyun Im, Young-Hwa Cho, Jae-Young Choi, and Jongpil Jeong 543 An Architecture for Service Integration and Unified Communication in Mobile Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ricardo Aparecido Perez de Almeida and Hélio C. 
Guardia 560 jvargas2006@gmail.com Table of Contents – Part IV XXXIX Task Allocation in Mesh Structure: 2Side LeapFrog Algorithm and Q-Learning Based Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iwona Pozniak-Koszalka, Wojciech Proma, Leszek Koszalka, Maciej Pol, and Andrzej Kasprzak Follow-Us: A Distributed Ubiquitous Healthcare System Simulated by MannaSim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Maria Luı́sa Amarante Ghizoni, Adauto Santos, and Linnyer Beatrys Ruiz Adaptive Dynamic Frequency Scaling for Thermal-Aware 3D Multi-core Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong Jun Choi, Young Jin Park, Hsien-Hsin Lee, and Cheol Hong Kim 576 588 602 A Context-Aware Service Model Based on the OSGi Framework for u-Agricultural Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jongsun Choi, Sangjoon Park, Jongchan Lee, and Yongyun Cho 613 A Security Framework for Blocking New Types of Internet Worms in Ubiquitous Computing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Iksu Kim and Yongyun Cho 622 Quality Factors in Development Best Practices for Mobile Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Euler Horta Marinho and Rodolfo Ferreira Resende 632 ShadowNet: An Active Defense Infrastructure for Insider Cyber Attack Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaohui Cui, Wade Gasior, Justin Beaver, and Jim Treadwell 646 Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 655 jvargas2006@gmail.com Processor Allocation for Optimistic Parallelization of Irregular Programs Francesco Versaci1, and Keshav Pingali2 1 University of Padova & Technische Universität Wien versacif@dei.unipd.it 2 University of Texas at Austin pingali@cs.utexas.edu Abstract. Optimistic parallelization is a promising approach for the parallelization of irregular algorithms: potentially interfering tasks are launched dynamically, and the runtime system detects conflicts between concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm. Therefore, the processor allocation problem for irregular algorithms is very difficult. In this paper, we describe the first systematic strategy for addressing this problem. Our approach is based on a construct called the conflict graph, which (i) provides insight into the amount of parallelism that can be extracted from an irregular algorithm, and (ii) can be used to address the processor allocation problem for irregular algorithms. 
We show that this problem is related to a generalization of the unfriendly seating problem and, by extending Turán’s theorem, we obtain a worstcase class of problems for optimistic parallelization, which we use to derive a lower bound on the exploitable parallelism. Finally, using some theoretically derived properties and some experimental facts, we design a quick and stable control strategy for solving the processor allocation problem heuristically. Keywords: Irregular algorithms, Optimistic parallelization, Automatic parallelization, Amorphous data-parallelism, Processor allocation, Unfriendly seating, Turán’s theorem.   Supported by PAT-INFN Project AuroraScience, by MIUR-PRIN Project AlgoDEEP, and by the University of Padova Projects STPD08JA32 and CPDA099949. Corresponding author. B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 1–14, 2012. c Springer-Verlag Berlin Heidelberg 2012  jvargas2006@gmail.com 2 F. Versaci and K. Pingali 1 Introduction The advent of on-chip multiprocessors has made parallel programming a mainstream concern. Unfortunately writing correct and efficient parallel programs is a challenging task for the average programmer. Hence, in recent years, many projects [14,10,3,20] have tried to automate parallel programming for some classes of algorithms. Most of them focus on regular algorithms such as Fourier transforms [9,19] and dense linear algebra routines [4]. Automation is more difficult when the algorithms are irregular and use pointer-based data structures such as graphs and sets. One promising approach is based on the concept of amorphous data parallelism [17]. Algorithms are formulated as iterative computations on work-sets, and each iteration is identified as a quantum of work (task) that can potentially be executed in parallel with other iterations. The Galois project [18] has shown that algorithms formulated in this way can be parallelized automatically using optimistic parallelization): iterations are executed speculatively in parallel and, when an iteration conflicts with concurrently executing iterations, it is rolled-back. Algorithms that have been successfully parallelized in this manner include Survey propagation [5], Boruvka’s algorithm [6], Delauney triangulation and refinement [12], and Agglomerative clustering [21]. In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function of input parameters, and the amount of parallelism can vary dramatically during the execution of the irregular algorithm [16]. Therefore, the processor allocation problem for irregular algorithms is very difficult. Optimistic parallelization complicates this problem even more: if there are too many processors and too little parallel work, not only might some processors be idle but speculative conflicts may actually retard the progress of even those processors that have useful work to do, increasing both program execution time and power consumption. This paper1 presents the first systematic approach to addressing the processor allocation problem for irregular algorithms under optimistic parallelization, and it makes the following contributions. 
– We develop a simple graph-theoretic model for optimistic parallelization and use it to formulate processor allocation as an optimization problem that balances parallelism exploitation with minimizing speculative conflicts (Section 2). – We identify a worst-case class of problems for optimistic parallelization; to this purpose, we develop an extension of Turán’s theorem [2] (Section 3). – Using these ideas, we develop an adaptive controller that dynamically solves the processor allocation problem for amorphous data-parallel programs, providing rapid response to changes in the amount of amorphous data-parallelism (Section 4). 1 A brief announcement of this work has been presented at SPAA’11 [23]. jvargas2006@gmail.com Processor Allocation for Optimistic Parallelization 2 3 Modeling Optimistic Parallelization A typical example of an algorithm that exhibits amorphous data-parallelism is Dalauney mesh refinement, summarized as follows. A triangulation of some planar region is given, containing some “bad” triangles (according to some quality criterion). To remove them, each bad triangle is selected (in any arbitrary order), and this triangle, together with triangles that lie in its cavity, are replaced with new triangles. The retriangulation can produce new bad triangles, but this process can be proved to halt after a finite number of steps. Two bad triangles can be processed in parallel, given that their cavities do not overlap. There are also algorithms, which exhibit amorphous data-parallelism, for which the order of execution of the parallel tasks cannot be arbitrary, but must satisfy some constraints (e.g., in discrete event simulations the events must commit chronologically). We will not treat this class of problems in this work, but we will focus only on unordered algorithms [16]. A different context in which there is no roll-back and tasks do not conflict, but obey some precedence relations, is treated in [1]. Optimistic parallelization deals with amorphous data-parallelism by maintaining a work-set of the tasks to be executed. At each temporal step some tasks are selected and speculatively launched in parallel. If, at runtime, two processes modify the same data a conflict is detected and one of the two has to abort and roll-back its execution. Neglecting the details of the various amorphous dataparallel algorithms, we can model their common behavior at a higher level with a simple graph-theoretic model: we can think a scheduler as working on a dynamic graph Gt = (Vt , Et ), where the nodes represent computations we want to do, but we have no initial knowledge of the edges, which represent conflicts between computations (see Fig. 1). At time step t the system picks uniformly at random mt nodes (the active nodes) and tries to process them concurrently. When it processes a node it figures out if it has some connections with other executed nodes and, if a neighbor node happens to have been processed before it, aborts, otherwise the node is considered processed, is removed from the graph and some operations may be performed in the neighborhood, such as adding new nodes with edges or altering the neighbors. The time taken to process conflicting and non-conflicting nodes is assumed to be the same, as it happens, e.g., for Dalauney mesh refinement. 
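To make the abstract model above concrete, the following small Python sketch simulates one temporal step on an explicit computations/conflicts graph: m active nodes are drawn uniformly at random, committed in a random order, and a node is rolled back whenever a neighbor committed before it. This is only an illustration of the model under our own naming and data layout (an edge set of pairs, a set of remaining nodes); it is not the Galois runtime.

    import random

    def run_round(conflict_edges, nodes, m, rng=random):
        # One temporal step of the optimistic-execution model:
        # pick m random nodes, commit them in a random order, and
        # abort any node adjacent to an already-committed one.
        active = rng.sample(sorted(nodes), min(m, len(nodes)))
        rng.shuffle(active)                      # random commit order pi_m
        committed, aborted = set(), set()
        for v in active:
            # v aborts iff some neighbour committed before it
            if any((v, u) in conflict_edges or (u, v) in conflict_edges
                   for u in committed):
                aborted.add(v)
            else:
                committed.add(v)
        nodes.difference_update(committed)       # committed work leaves the work-set
        return committed, aborted

    # Toy CC graph: a triangle of mutually conflicting tasks plus two isolated tasks.
    nodes = {0, 1, 2, 3, 4}
    edges = {(0, 1), (1, 2), (0, 2)}
    done, rolled_back = run_round(edges, nodes, m=4)
    print(len(done), "committed,", len(rolled_back), "rolled back")

The fraction of rolled-back nodes observed in such a round is exactly the quantity that the next subsection turns into the control objective.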
2.1 Control Optimization Goal

When we run an optimistic parallelization we have two contrasting goals: we want to maximize the work done, achieving high parallelism, but at the same time we want to minimize the conflicts, hence making good use of the processors' time. (Furthermore, for some algorithms the roll-back work can be quite resource-consuming.) These two goals are not compatible: if we naïvely try to minimize the total execution time, the system is forced to always use all the available processors, whereas if we try to minimize the time wasted by aborted processes, the system uses only one processor. Therefore, in the following we choose a trade-off goal and cast it in our graph-theoretic model.

Fig. 1. Optimistic parallelization. (i) Nodes represent possible computations, edges conflicts between them. (ii) m nodes are chosen at random and run concurrently. (iii) At runtime the conflicts are detected, some nodes abort and their execution is rolled back, leaving a maximal independent set in the subgraph induced by the initial choice of nodes.

Let G = (V, E) be a computations/conflicts (CC) graph with n = |V| nodes. When a scheduler chooses, uniformly at random, m nodes to be run, the order \pi_m(\cdot) in which they commit can be modeled as a random permutation: if i < j then \pi_m(i) commits before \pi_m(j) (if there is a conflict between \pi_m(i) and \pi_m(j), then \pi_m(i) commits and \pi_m(j) aborts; if \pi_m(i) aborted due to conflicts with previous processes, \pi_m(j) can commit, provided it does not conflict with other committed processes). Let k_t(\pi_m) be the number of aborted processes due to conflicts and r_t(\pi_m) \in [0, 1) the ratio of conflicting processors observed at time t, i.e. r_t(\pi_m) \triangleq k_t(\pi_m)/m. We define the conflict ratio \bar{r}_t(m) to be the expected r obtained when the system is run with m processors:

    \bar{r}_t(m) \triangleq E_{\pi_m}[ r_t(\pi_m) ] ,    (1)

where the expectation is computed uniformly over the possible prefixes of length m of the permutations of the n nodes. The control problem we want to solve is the following: given r(\tau) and m_\tau for \tau < t, choose m_t = \mu_t such that \bar{r}_t(\mu_t) \approx \rho, where \rho is a suitable parameter.

Remark 1. If we want to dynamically control the number of processors, \rho must be chosen different from zero, otherwise the system converges to using only one processor, thus not being able to identify available parallelism. A value of \rho \in [20%, 30%] is often reasonable, together with the constraint m_t \ge 2.

3 Exploiting Parallelism

In this section we study how much parallelism can be extracted from a given CC graph and how its sparsity affects the conflict ratio. To this purpose we obtain a worst-case class of graphs and use it to analytically derive a lower bound for the exploitable parallelism (i.e., an upper bound for the conflict ratio). We make extensive use of finite differences (i.e., discrete derivatives), which are defined recursively as follows. Let f : Z \to R be a real function defined on the integers; then the i-th (forward) finite difference of f is

    \Delta^i_f(k) = \Delta^{i-1}_f(k+1) - \Delta^{i-1}_f(k) , \quad with \quad \Delta^0_f(k) = f(k) .    (2)

(In the following we omit \Delta's superscript when it equals one, i.e., \Delta \triangleq \Delta^1.) First, we obtain two basic properties of \bar{r}, which are given by the following propositions.

Proposition 1. The conflict ratio function \bar{r}(m) is non-decreasing in m.

To prove Prop. 1 we first need a lemma:

Lemma 1. Let \bar{k}(m) \triangleq E_{\pi_m}[k(\pi_m)].
Then \bar{k} is a non-decreasing convex function, i.e. \Delta_{\bar{k}}(m) \ge 0 and \Delta^2_{\bar{k}}(m) \ge 0.

Proof. Let \tilde{k}(\pi_m, i) be the expected number of conflicting nodes when running r = m + i nodes concurrently, the first m of which are \pi_m and the last i of which are chosen uniformly at random among the remaining ones. By definition, we have

    E_{\pi_m}[\tilde{k}(\pi_m, i)] = \bar{k}(m + i) .    (3)

In particular,

    \tilde{k}(\pi_m, 1) = k(\pi_m) + \Pr[(m+1)\text{-th conflicts}] ,    (4)

which brings

    \bar{k}(m+1) = E_{\pi_m}[\tilde{k}(\pi_m, 1)] = \bar{k}(m) + \eta ,    (5)

with \eta = \bar{k}(m+1) - \bar{k}(m) = \Delta_{\bar{k}}(m) \ge 0, hence proving the monotonicity of \bar{k}. Consider now

    \tilde{k}(\pi_m, 2) = k(\pi_m) + \Pr[(m+1)\text{-th conflicts}] + \Pr[(m+2)\text{-th conflicts}] .    (6)

If the (m+1)-th node does not add any edge, then we have

    \Pr[(m+1)\text{-th conflicts}] = \Pr[(m+2)\text{-th conflicts}] ,    (7)

but since it may add some edges, the probability of conflicting the second time is in general larger and thus \Delta^2_{\bar{k}}(m) \ge 0.

Proof (Prop. 1). Since \bar{r}(m) = \bar{k}(m)/m, its finite difference can be written as

    \Delta_{\bar{r}}(m) = \frac{m\,\Delta_{\bar{k}}(m) - \bar{k}(m)}{m(m+1)} .    (8)

Because of Lemma 1 and since \bar{k}(1) = 0, we have

    \bar{k}(m+1) \le m\,\Delta_{\bar{k}}(m) ,    (9)

which finally brings

    \Delta_{\bar{r}}(m) = \frac{m\,\Delta_{\bar{k}}(m) - \bar{k}(m)}{m(m+1)} \ge \frac{\bar{k}(m+1) - \bar{k}(m)}{m(m+1)} = \frac{\Delta_{\bar{k}}(m)}{m(m+1)} \ge 0 .    (10)

Proposition 2. Let G be a CC graph with n nodes and average degree d; then the initial derivative of \bar{r} depends only on n and d as

    \Delta_{\bar{r}}(1) = \frac{d}{2(n-1)} .    (11)

Proof. Since

    \Delta_{\bar{r}}(1) = \frac{\Delta_{\bar{k}}(1) - \bar{k}(1)}{2} = \frac{\bar{k}(2)}{2} ,    (12)

we just need to obtain \bar{k}(2). Let \tilde{k} be defined as in the proof of Lemma 1 and \pi_1 = v a node chosen uniformly at random. Then

    \bar{k}(2) = E_v[\tilde{k}(v, 1)] = E_v\!\left[\frac{d_v}{n-1}\right] = \frac{E_v[d_v]}{n-1} = \frac{d}{n-1} .    (13)

A measure of the available parallelism for a given CC graph has been identified in [15] by considering, at each temporal step, a maximal independent set of the CC graph. The expected size of a maximal independent set gives a reasonable and computable estimate of the available parallelism. However, this is not enough to predict the actual amount of parallelism that a scheduler can exploit while keeping a low conflict ratio, as shown in the following example.

Example 1. Let G = K_{n^2} \cup D_n, where K_{n^2} is the complete graph of size n^2 and D_n a disconnected graph of size n (i.e. G is made up of a clique of size n^2 and n disconnected nodes). For this graph every maximal independent set is also maximum and has size n+1, but if we choose n+1 nodes uniformly at random and then compute the conflicts, we obtain that, on average, there are only 2 independent nodes.

A more realistic estimate of the performance of a scheduler can be obtained by analyzing the CC graph sparsity. The average degree of the CC graph is linked to the expected size of a maximal independent set of the graph by the following well-known theorem (in the variant shown in [2] or [22]):

Theorem 1 (Turán, strong formulation). Let G = (V, E) be a graph, n = |V|, and let d be the average degree of G. Then the expected size of a maximal independent set, obtained by choosing the nodes greedily from a random permutation, is at least s = n/(d+1).

Remark 2. The previous bound is existentially tight: let K^n_d be the graph made up of s = n/(d+1) cliques of size d+1; then the average degree is d and the size of every maximal (and maximum) independent set is exactly s. Furthermore, every other graph with the same number of nodes and edges has a larger average maximal independent set.

The study of the expected size of a maximal independent set in a given graph is also known as the unfriendly seating problem [7,8] and is particularly relevant in statistical physics, where it is usually studied on mesh-like graphs [11]. The properties of the graph K^n_d suggested the formulation of an extension of Turán's theorem. We prove that the graphs K^n_d provide a worst case (for a given degree d) for the generalization of this problem obtained by focusing on maximal independent sets of induced subgraphs. This allows, given a target conflict ratio \rho, the computation of a lower bound for the parallelism a scheduler can exploit.

Theorem 2. Let G be a graph with the same number of nodes and the same degree as K^n_d, and let EM_m(G) be the expected size of a maximal independent set of the subgraph induced by a uniformly random choice of m nodes in G; then

    EM_m(G) \ge EM_m(K^n_d) .    (14)

To prove it we first need the following lemma.

Lemma 2. The function \eta_j(x) \triangleq \prod_{i=1}^{j} (n - i - x) is convex for x \in [0, n - j].

Proof. We prove by induction on j that, for x \in [0, n - j],

    \eta_j(x) \ge 0 , \quad \eta_j'(x) \le 0 , \quad \eta_j''(x) \ge 0 .    (15)

Base case. Let \eta_0(x) = 1. The properties above are easily verified.

Induction. Since \eta_j(x) = \eta_{j-1}(x)(n - j - x), we obtain

    \eta_j'(x) = -\eta_{j-1}(x) + (n - j - x)\,\eta_{j-1}'(x) ,    (16)

which is non-positive by the inductive hypotheses. Similarly,

    \eta_j''(x) = -2\,\eta_{j-1}'(x) + (n - j - x)\,\eta_{j-1}''(x)    (17)

is non-negative.

Proof (Thm. 2). Consider a random permutation \pi of the nodes of a generic graph G that has the same number of nodes and edges as K^n_d. We assume the prefix of length m of \pi (i.e. \pi(1), ..., \pi(m)) forms the active nodes and focus on the following independent set IS_m in the induced subgraph: a node v is in IS_m(G, \pi) if and only if it is in the first m positions of \pi and it has no neighbors preceding it. Let b_m(G) be the expected size of IS_m(G, \pi) averaged over all possible \pi's (chosen uniformly):

    b_m(G) \triangleq E_\pi[\# IS_m(G, \pi)] .    (18)

Since by construction b_m(G) \le EM_m(G), whereas b_m(K^n_d) = EM_m(K^n_d), we just need to prove that b_m(K^n_d) \le b_m(G). Given a generic node v of degree d_v and a random permutation \pi, its probability to be in IS_m(G, \pi) is

    \Pr[v \in IS_m(G, \pi)] = \frac{1}{n} \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - d_v}{n - i} .    (19)

By the linearity of expectation we can write b as

    b_m(G) = \frac{1}{n} \sum_{v=v_1}^{v_n} \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - d_v}{n - i} = E_v\!\left[ \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - d_v}{n - i} \right] ,    (20)

    b_m(K^n_d) = \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - d}{n - i} = \sum_{j=1}^{m} \prod_{i=1}^{j-1} \frac{n - i - E_v[d_v]}{n - i} .    (21)

To prove that EM_m(G) \ge EM_m(K^n_d) it is thus enough to show that

    E_v\!\left[ \prod_{i=1}^{j} (n - i - d_v) \right] \ge \prod_{i=1}^{j} (n - i - E_v[d_v]) , \quad \forall j ,    (22)

which can be done by applying Jensen's inequality [13], since in Lemma 2 we have proved the convexity of \eta_j(x) \triangleq \prod_{i=1}^{j} (n - i - x).

Corollary 1. The worst case for a scheduler among the graphs with the same number of nodes and edges is obtained for the graph K^n_d (for which we can analytically approximate the performance, as shown in §3.1).

Proof. Since

    \bar{r}(m) = \frac{m - EM_m(G)}{m} = 1 - \frac{1}{m} EM_m(G) ,    (23)

the thesis follows.

3.1 Analysis of the Worst-Case Performance

Theorem 3. Let d be the average degree of G = (V, E) with n = |V| (for simplicity we assume n/(d+1) \in N). The conflict ratio is bounded from above as

    \bar{r}(m) \le 1 - \frac{n}{m(d+1)} \left( 1 - \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} \right) .    (24)

Proof. Let s = n/(d+1) be the number of connected components in K^n_d. Because of Thm. 2 and Cor. 1 it suffices to show that

    EM_m(K^n_d) = s \left( 1 - \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} \right) .    (25)

The probability for a connected component k of K^n_d not to be accessed when m nodes are chosen is given by the following hypergeometric expression:

    \Pr[k \text{ not hit}] = \frac{\binom{d+1}{0}\binom{n-d-1}{m}}{\binom{n}{m}} = \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} .    (26)

Let X_k be a random variable that is 1 when component k is hit and 0 otherwise. We have that E[X_k] = 1 - \prod_{i=1}^{m} \frac{n-d-i}{n+1-i} and, by the linearity of expectation, the average number of components accessed is

    E\!\left[ \sum_{k=1}^{s} X_k \right] = \sum_{k=1}^{s} E[X_k] = s \left( 1 - \prod_{i=1}^{m} \frac{n - d - i}{n + 1 - i} \right) .    (27)

Corollary 2. When n and m increase, the bound is well approximated by

    \bar{r}(m) \le 1 - \frac{n}{m(d+1)} \left( 1 - \left( 1 - \frac{d+1}{n} \right)^m \right) .    (28)

Proof. Apply Stirling's approximation to the binomials and delete the low-order terms in the resulting formula.

Corollary 3. If we set m = \alpha s = \frac{\alpha n}{d+1}, we obtain

    \bar{r}(m) \le 1 - \frac{1}{\alpha} \left( 1 - \left( 1 - \frac{d+1}{n} \right)^{\frac{\alpha n}{d+1}} \right) \le 1 - \frac{1}{\alpha} \left( 1 - e^{-\alpha} \right) .    (29)

4 Controlling Processors Allocation

In this section we design an efficient control heuristic that dynamically chooses the number of processes to be run by a scheduler, in order to obtain high parallelism while keeping the conflict ratio low. In the following we suppose that the properties of G_t vary slowly compared to the convergence of m_t toward \mu_t under the algorithm we will develop (see §4.1), so we can consider G_t = G and \mu_t = \mu, and thus our goal is making m_t converge to \mu. Since the conflict ratio is a non-decreasing function of the number of launched tasks m (Prop. 1), we could find m \approx \mu by bisection, simply noticing that

    \bar{r}(m') \le \rho \le \bar{r}(m'') \;\Rightarrow\; m' \le \mu \le m'' .    (30)

The control we propose is slightly more complex and is based on recurrence relations, i.e., we compute m_{t+1} as a function F of the target conflict ratio \rho and of the parameters which characterize the system at the previous time step:

    m^F_{t+1} = F(\rho, r_t, m_t) .    (31)

The initial value m_0 for a recurrence can be chosen to be 2 but, if we have an estimate of the CC graph average degree d, we can choose a smarter value: in fact, applying Cor. 3, we are sure that using, e.g., m = \frac{n}{2(d+1)} processors we will have at most a conflict ratio of 21.3%.

Algorithm 1. Pseudo-code of the proposed hybrid control algorithm

    // Tunable parameters
    m_max = 1024; m_min = 2; m_0 = 2; T = 4; r_min = 3%; alpha_0 = 25%; alpha_1 = 6%;
    // Variables
    r <- 0; t <- 0; m <- m_0;
    // Main loop
    while nodes to elaborate != 0 do
        t <- t + 1;
        if m > m_max then m <- m_max;
        else if m < m_min then m <- m_min;
        Launch the scheduler with m nodes;
        r <- r + new conflict ratio;
        if (t mod T) = T - 1 then
            r <- r / T;
            alpha <- |1 - r/rho|;
            if alpha > alpha_0 then
                if r < r_min then r <- r_min;
                m <- (rho / r) m;
            else if alpha > alpha_1 then
                m <- (1 - r + rho) m;
            r <- 0;

Our control heuristic (Algorithm 1) is a hybridization of two simple recurrences. The first recurrence is quite natural and adjusts m based on the distance between r and \rho:

    Recurrence A:  m^A_{t+1} = (1 - r_t + \rho)\, m_t .    (32)

The second recurrence exploits some experimental facts. In Fig. 2 we have plotted the conflict ratio functions for three CC graphs with the same size and average degree (note that the initial derivative is the same for all the graphs, in accordance with Prop. 2). We see that conflict ratios which reach a high value (\bar{r}(n) > 1/2) are initially well approximated by a straight line (for m such that \bar{r}(m) \le \rho = 20-30%), whereas functions that deviate from this behavior do not rise much further.
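As an illustration, Algorithm 1 can be transcribed almost line by line into executable form. The sketch below follows our own naming: launch_scheduler(m) stands for running one temporal step with m active nodes and returning the observed conflict ratio r_t, has_work() for the emptiness test on the work-set, and the numeric parameters are the ones listed in Algorithm 1 (the multiplicative update used when the ratio is far from the target is the rule discussed as Recurrence B just below).

    def hybrid_controller(launch_scheduler, has_work, rho=0.20,
                          m_max=1024, m_min=2, m0=2, T=4,
                          r_min=0.03, alpha0=0.25, alpha1=0.06):
        # Sketch of Algorithm 1: every T steps the averaged conflict ratio r
        # is compared against the target rho and m is updated accordingly.
        r_acc, t, m = 0.0, 0, m0
        while has_work():
            t += 1
            m = min(m_max, max(m_min, m))     # clamp m into [m_min, m_max]
            r_acc += launch_scheduler(m)      # observed conflict ratio r_t of this step
            if t % T == T - 1:
                r = r_acc / T
                alpha = abs(1.0 - r / rho)    # relative distance from the target ratio
                if alpha > alpha0:            # far from target: multiplicative rule (Recurrence B)
                    r = max(r, r_min)         # clamp r to avoid a huge jump when r is tiny
                    m = int(round(rho / r * m))
                elif alpha > alpha1:          # moderately off: finer rule (Recurrence A)
                    m = int(round((1.0 - r + rho) * m))
                r_acc = 0.0                   # restart the averaging window
        return m

Keeping m integral and clamping r from below mirrors the corresponding guards of Algorithm 1, which prevent the multiplicative rule from overshooting when very few conflicts are observed.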
This near-linear initial behaviour suggests assuming linearity when controlling m_t, as done by the following recurrence:

    Recurrence B:  m^B_{t+1} = \frac{\rho}{r_t}\, m_t .    (33)

Fig. 2. A plot of \bar{r}(m) for some graphs with n = 2000 and d = 16: (i) the worst-case upper bound of Cor. 2, (ii) a random graph (edges chosen uniformly at random until the desired degree is reached; data obtained by computer simulation), (iii) a graph made up of cliques and disconnected nodes. A common tangent at m = 1 is also drawn.

The two recurrences can be roughly compared as follows (see Fig. 3): Recurrence A has a slower convergence than Recurrence B, but it is less susceptible to noise (the variance that makes the realizations r_t differ from \bar{r}_t). This is why we chose to merge them into a hybrid algorithm: initially, when the difference between r and \rho is large, Recurrence B is used to exploit its quick convergence, and then Recurrence A is adopted for a finer tuning of the control.

4.1 Experimental Evaluation

In the practical implementation of the control algorithm we have made the following optimizations:

– Since r_t can have a large variance, especially when m is small, we apply the changes to m every T steps, using the values averaged over these intervals, to smooth the oscillations.
– To further reduce the oscillations, we apply a change only if the observed r_t is sufficiently different from \rho (e.g., by more than 6%), thus avoiding small variations in the steady state, which interfere with locality exploitation because of data moving from one processor to another.
– Another problem that must be considered is that for small values of m the variance is much larger, so it is better to tune this case separately using different parameters (this optimization is not shown in the pseudo-code).

To validate our controller we have run the following simulation: a random CC graph of fixed average degree d is taken and the controller runs on it, starting with m_0 = 2. We are interested in how many temporal steps it takes to converge to m_t \approx \mu. As can be seen in [15], the parallelism profile of many practical applications can vary quite abruptly; e.g., Delaunay mesh refinement can go from no parallelism to one thousand possible parallel tasks in just 30 temporal steps. Therefore, an algorithm that wants to efficiently control the processor allocation for these problems must adapt very quickly to changes in the available parallelism. Our controller, which uses the very fast Recurrence B in the initial phase, proves to be fast enough: as shown in Fig. 3, in about 15 steps it converges close to the desired value \mu.

Fig. 3. Comparison between two realizations of the hybrid algorithm and one that only uses Recurrence A, for two different random graphs (n = 2000 in both cases, d = 4 and d = 16; m_t is plotted against the time step t). The hybrid version has different parameters for m greater or smaller than 20. \rho was chosen to be 20%. The proposed algorithm proves to be both quick in convergence and stable.

5 Conclusions and Future Work

Automatic parallelization of irregular algorithms is a rich and complex subject and will offer many difficult challenges to researchers in the near future.
In this paper we have focused on the processor allocation problem for unordered dataamorphous algorithms; it would be extremely valuable to obtain similar results for the more general and difficult case of ordered algorithms (e.g., discrete event simulation), in particular it is very hard to obtain good estimates of the available parallelism for such algorithms, given the complex dependencies arising between the concurrent tasks. Another aspect which needs investigation, especially in the ordered context, is whether some statical properties of the behavior of irregular algorithms can be modeled, extracted and exploited to build better controllers, able to dynamically adapt to the different execution phases. As for a real-world implementation, the proposed control heuristic is now being integrated in the Galois system and it will be evaluated on more realistic workloads. jvargas2006@gmail.com Processor Allocation for Optimistic Parallelization 13 Acknowledgments. We express our gratitude to Gianfranco Bilardi for the valuable feedback on recurrence-based controllers and to all the Galois project members for the useful discussions on optimistic parallelization modeling. References 1. Agrawal, K., Leiserson, C.E., He, Y., Hsu, W.J.: Adaptive work-stealing with parallelism feedback. ACM Trans. Comput. Syst. 26(3), 7:1–7:32 (2008), http://doi.acm.org/10.1145/1394441.1394443 2. Alon, N., Spencer, J.: The probabilistic method. Wiley-Interscience (2000) 3. An, P., Jula, A., Rus, S., Saunders, S., Smith, T.G., Tanase, G., Thomas, N., Amato, N.M., Rauchwerger, L.: Stapl: An Adaptive, Generic Parallel C++ Library. In: Dietz, H.G. (ed.) LCPC 2001. LNCS, vol. 2624, pp. 193–208. Springer, Heidelberg (2003) 4. Blackford, L.S., Choi, J., Cleary, A., D’Azevedo, E., Demmel, J., Dhillon, I., Dongarra, J., Hammarling, S., Henry, G., Petitet, A., Stanley, K., Walker, D., Whaley, R.C.: ScaLAPACK Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia (1997) 5. Braunstein, A., Mézard, M., Zecchina, R.: Survey propagation: An algorithm for satisfiability. Random Struct. Algorithms 27(2), 201–226 (2005) 6. Eppstein, D.: Spanning trees and spanners. In: Sack, J., Urrutia, J. (eds.) Handbook of Computational Geometry, pp. 425–461. Elsevier (2000) 7. Freedman, D., Shepp, L.: Problem 62-3, an unfriendly seating arrangement. SIAM Review 4(2), 150 (1962), http://www.jstor.org/stable/2028372 8. Friedman, H.D., Rothman, D., MacKenzie, J.K.: Problem 62-3. SIAM Review 6(2), 180–182 (1964), http://www.jstor.org/stable/2028090 9. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proceedings of the IEEE 93(2), 216–231 (2005); special issue on Program Generation, Optimization, and Platform Adaptation 10. Frigo, M., Leiserson, C.E., Randall, K.H.: The implementation of the Cilk-5 multithreaded language. In: PLDI, pp. 212–223 (1998) 11. Georgiou, K., Kranakis, E., Krizanc, D.: Random maximal independent sets and the unfriendly theater seating arrangement problem. Discrete Mathematics 309(16), 5120–5129 (2009), http://www.sciencedirect.com/science/article/B6V00-4W55T4X-2/2/ 72d38a668c737e68edf497512e606e12 12. Guibas, L.J., Knuth, D.E., Sharir, M.: Randomized incremental construction of delaunay and voronoi diagrams. Algorithmica 7(4), 381–413 (1992) 13. Jensen, J.: Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30(1), 175–193 (1906) 14. Kalé, L.V., Krishnan, S.: Charm++: A portable concurrent object oriented system based on C++. In: OOPSLA, pp. 
91–108 (1993) 15. Kulkarni, M., Burtscher, M., Cascaval, C., Pingali, K.: Lonestar: A suite of parallel irregular programs. In: ISPASS, pp. 65–76. IEEE (2009) 16. Kulkarni, M., Burtscher, M., Inkulu, R., Pingali, K., Cascaval, C.: How much parallelism is there in irregular applications? In: Reed, D.A., Sarkar, V. (eds.) PPOPP, pp. 3–14. ACM (2009) jvargas2006@gmail.com 14 F. Versaci and K. Pingali 17. Méndez-Lojo, M., Nguyen, D., Prountzos, D., Sui, X., Hassaan, M.A., Kulkarni, M., Burtscher, M., Pingali, K.: Structure-driven optimizations for amorphous dataparallel programs. In: Govindarajan, R., Padua, D.A., Hall, M.W. (eds.) PPOPP, pp. 3–14. ACM (2010) 18. Pingali, K., Nguyen, D., Kulkarni, M., Burtscher, M., Hassaan, M.A., Kaleem, R., Lee, T.H., Lenharth, A., Manevich, R., Méndez-Lojo, M., Prountzos, D., Sui, X.: The tao of parallelism in algorithms. In: Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI 2011, pp. 12–25. ACM, New York (2011), http://doi.acm.org/10.1145/1993498.1993501 19. Püschel, M., Moura, J., Johnson, J., Padua, D., Veloso, M., Singer, B., Xiong, J., Franchetti, F., Gacic, A., Voronenko, Y., Chen, K., Johnson, R., Rizzolo, N.: Spiral: Code generation for dsp transforms. Proceedings of the IEEE 93(2), 232–275 (2005) 20. Reinders, J.: Intel threading building blocks. O’Reilly & Associates, Inc., Sebastopol (2007) 21. Tan, P.N., Steinbach, M., Kumar, V.: Introduction to Data Mining. AddisonWesley (2005) 22. Tao, T.: Additive combinatorics. Cambridge University Press (2006) 23. Versaci, F., Pingali, K.: Brief announcement: processor allocation for optimistic parallelization of irregular programs. In: Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures, SPAA 2011, pp. 261–262. ACM, New York (2011), http://doi.acm.org/10.1145/1989493.1989533 jvargas2006@gmail.com Feedback-Based Global Instruction Scheduling for GPGPU Applications Constantin Timm1 , Markus Görlich1 , Frank Weichert2 , Peter Marwedel1 , and Heinrich Müller2 1 Computer Science 12, TU Dortmund, Germany constantin.timm@postamt.cs.uni-dortmund.de 2 Computer Science 7, TU Dortmund, Germany frank.weichert@tu-dortmund.de Abstract. In the face of the memory wall even in high bandwidth systems such as GPUs, an efficient handling of memory accesses and memory-related instructions is mandatory. Up to now, memory performance considerations were only made for GPGPU applications at source code level. This is not enough when optimizing an application towards high performance: The code has to be optimized at assembly level as well. Due to the spreading of GPGPU-capable hardware in smaller and smaller devices, the energy consumption of a program is – besides the performance – an important optimization goal. In this paper, a novel compiler optimization technique, called FALIS (Feedback-based and memory-Aware gLobal Instruction Scheduling), is presented based on global instruction scheduling and multi-objective genetic algorithms. The approach uses a profiling-based feedback in order to take the measured performance and energy consumption values inside a compiler into account. Profiling on the real hardware platform is important in order to consider the characteristics of the underlying hardware. FALIS increases runtime performance of a GPGPU application by up to 13.02% and decreases energy consumption by up to 10.23%. Keywords: Energy-Aware Systems, Compilers, Objective Genetic Algorithm, Profiling. 
1 GPGPU, Multi- Introduction The development of faster single core processors and the availability of higher performance due to higher clock frequency is at an impasse [2]. The shift towards multi-core and many-core systems is tedious because of the increasing complexity on programming and the efficient utilization of parallelism. Furthermore, the memory wall [13] still exists, i.e. the processing speed is much higher than the memory speed. In the face of this memory wall, an efficient utilization of memory accesses and memory-related operations is mandatory. This also applies to multi- and many-core systems such as GPUs which can be utilized – enabled by GPGPU (General Purpose Computing on Graphics Processing Units) – for B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 15–28, 2012. c Springer-Verlag Berlin Heidelberg 2012  jvargas2006@gmail.com 16 C. Timm et al. HPC (High Performance Computing) applications in scientific and industrial contexts [17]. Due to green computing, considering energy consumption in the GPGPU software design process is important. Therefore, energy consumption and runtime performance are the two optimization targets in this paper. One of the most powerful and most interesting compiler optimization techniques for improving the execution order of instructions and therefore for memory and memory-related instructions, is instruction scheduling (IS). IS is used in this work to optimize the memory utilization of GPGPU applications. A novel partitioning method for scheduling on linear code sections will be presented. This method enables the scheduler to place the memory and memory-related instructions in the code of GPGPU applications in a more efficient manner, meaning that the available memory bandwidth is better utilized. Most of the average case optimization techniques at compiler level have the drawback that they lack the capability to optimize the code in an elaborated way since they have no knowledge of the actual hardware platform. Therefore, this paper presents a profiling-based approach which feeds profiled performance indicators back into the compiler backend in order to achieve better solutions. The major contributions of this paper are summarized as follows: 1. A global instruction scheduling method was integrated in the Nvidia CUDA compiler [15]. 2. A multi-objective genetic algorithm was employed around the compilation process to optimize GPGPU applications with an adaptive IS towards the characteristic of the underlying hardware platform. The remainder of the paper is organized as follows: After this introduction, Section 2 discusses related work. The main concepts of this paper is presented in Section 3 where a genetic global instruction scheduling is introduced. Section 4 evaluates the performance of the techniques presented in the preceding section. The paper ends with a conclusion of the work and an overview of possible future work. 2 Related Work In [23], the authors showed that in traditional single core environments, instruction scheduling can have a negative effect on energy consumption. The authors of [9] developed an algorithm called balanced scheduling which performs scheduling of the instructions based on an assumption on instruction level parallelism. However, energy consumption was not part of that work and the runtime environment was not taken into account, as it has to be done for a GPU with its hardware thread scheduler. 
For special purpose system such as DSPs, several approaches with instruction scheduling methods [12,25] exist to produce optimized applications with respect to performance [12] and energy consumption [25]. Both works only targeted single core and single thread code. The first work which used local instruction scheduling for optimizing GPGPU applications was presented in [5]. In that work, a performance degradation was possible, because jvargas2006@gmail.com Feedback-Based Global Instruction Scheduling for GPGPU Applications 17 FALIS Unoptimized CUDA Kernel Choose Point from Pareto Front Extracting Mobile Instructions (Memory and Memory-Related) Optimized Kernel Calculating Mobility of Instructions on Create Linear Code Sections Create Initial Chromosome Population Mutate and Crossover Evaluate Chromosomes Assign Fitness SPEA2 Fig. 1. FALIS Framework Structure the optimizations had no information about the later stages of the compilation process and the runtime performance. A work optimizing GPGPU application towards better energy consumption and performance by applying local instruction scheduling was presented in [21]. The proposed method can be employed as a pre-optimization step for this work as it did not focus on memory instructions. Overall, it can be summarized that not all aspects of instruction scheduling for GPGPU applications have taken into account. In addition to that there were several papers published which have taken the energy consumption of GPGPU applications from a software perspective into account [4,19,20,26]. In these papers it was shown that software can be written in an energy efficient manner, but it was also shown that this is time-consuming, if there is no efficient tool (e.g. compiler) support. The latter is done within the novel FALIS framework. 3 FALIS The FALIS (Feedback-based and memory-Aware gLobal Instruction Scheduling) framework (as depicted in Figure 1) is presented in this section . FALIS applies instruction scheduling in order to optimize the CUDA kernels of a GPGPU application. Therefore, a genetic algorithm interfaces with the code generation module of the NVOpenCC compiler. In this module which works on a low level intermediate representation called Whirl [16], all memory and memory-related instructions are extracted as described in section 3.1. Afterwards, for all these instructions the basic block sequences are created where an instruction can be scheduled to (Section 3.2) and a mobility interval is calculated. The mobility interval comprises the positions in the Whirl intermediate representation of a kernel, where memory and memory-related instructions can be positioned. Finally, the sequences of instructions are optimized towards the objectives energy jvargas2006@gmail.com 18 C. Timm et al. /*01c0*/ /*01c8*/ /*01d0*/ /*01d8*/ /*01e0*/ SYNC 01d0; ST g[0x411], R8; NOP; BAR.SYNC; MOV R10, R124; /*01c0*/ /*01c8*/ /*01d0*/ /*01d8*/ /*01e0*/ Unoptimized Sequence SYNC 01d0; ST g[0x411], R8; MOV R10, R124; BAR.SYNC; ... Optimized Sequence Fig. 2. Exemplary Unfavourable Instruction Sequences consumption and runtime performance by employing a Genetic Algorithm (GA) named SPEA2 [27] (Section 3.3). 3.1 Extracting Mobile Instructions The list of instructions relevant for optimization, called mobile instructions (M ob Ins), can comprise all load and store operations for the different memories (const, global/local, shared) and memory-related instructions such as (barrier) synchronisation statements. 
The reason, why the latter are also considered is described in the following. The atomic runtime entity of a Nvidia GPGPU application is a thread. A thread runs on a single streaming processor (SP) of a graphics card. A set of threads, called a block, is allocated on one streaming multiprocessor which is a group of several SPs. At block level, there is no memory consistency – except in the thread itself. Nevertheless, threads of a block can force a consistent view on the shared memory by using synchronisation statements. They have to be added to the thread at source code level by the programmer. In [21], it was revealed that the performance can be decreased by such statements due to the requirement of adding additional instructions at machine code level, ensuring a proper timing. The evaluation results showed that it is not always mandatory to add these instructions if other existing instructions can substitute them (as depicted in Figure 2 – on the left-hand side). The substitution possibly can save cycles, resulting in performance increase and energy consumption decrease. Thus, the approach presented in this paper can also treat synchronisation statements, in addition to memory instructions. For them, a position in the code should be revealed which lead to a better performance or/and lower energy consumption. In the scope of FALIS, the position of each extracted mobile instruction is variable. The variability is explained in the following section. 3.2 Calculating Mobility of Instructions on Linear Code Sections The purpose of instruction scheduling is to schedule mobile instructions in way that they access the GPU’s memory system in an efficient way. In the following, mobile instruction positioning paradigms (MIPP), which can be beneficial and therefore are supported by FALIS, are presented. For the first MIPP, the jvargas2006@gmail.com Feedback-Based Global Instruction Scheduling for GPGPU Applications BB1 BB1 BB2 BB3 BB2 19 BB1 BB3 BB2 BB3 BB4 BB4 BB4 (a) Local Scheduler (b) Treegion Scheduler (c) Branch Head Partitioning Fig. 3. Reachability of Different Instruction Scheduling Methods memory instructions are close so that their sequence (depending on instruction dependencies) can possibly changed. Within the second MIPP, they are far away from each other so that they cannot interfered with each other. The latter is, in particular, important when the limited bandwidth of the graphics cards main memory should be utilized in an efficient way. This method is based on the observation of the authors of [8]. In that work the authors revealed that many concurrent memory access can decrease the performance. Figures 3 shows different methods for scheduling instructions. In Figure 3(a) one can see that within local instruction scheduling, the first MIPP is applicable due to the capability to schedule instructions near to each other. Local instruction scheduling can only schedule an instruction inside one basic block as denoted by the different layouts of the basic blocks. In Figures 3(a), 3(b) and 3(c), different layouts mean that instructions can not be moved from one basic block to another. While basic blocks with the same layout mean that instructions can possibly be exchange between them. But for the second MIPP local instruction scheduling is not enough. Therefore, methods considering several basic blocks for scheduling of instructions are required. An example for a state-of-the-art technique is TREEGION scheduling [1]. 
It only schedules to neighbouring basic blocks and therefore, this method is not able to handle the second MIPP in an efficient manner. In addition to that, TREEGION scheduling uses compensation code which may adversely affect the performance objective because code which is executed predicated is slower than normally executed code on the GPU [7]. Therefore, a technique called branch-head-partitioning was introduced in [6] which enables a global scheduler to schedule on linear code portions (e.g. linear code sections inside loops) without the use of compensation code but with the possibility to schedule an instruction far away from the original basic block as depicted in Figure 3(c). The Nvidia compiler works on a combined control flow and data dependency graph G = (V, E). V = {i1 , .., in } is the set of all instructions in intermediate representation of one kernel and E ⊆ V × V are the data dependencies or control jvargas2006@gmail.com 20 C. Timm et al. Mobility Interval Mob_Ins1 Mob_Ins1 ASAP ALAP Mobility Interval Mob_Ins2 BB1 Mob_Ins2 ASAP BB2 ALAP BB3 BB4 Fig. 4. Mobility of Mobile Instructions (M ob Ins) flow dependencies between two instructions. Branch-head-partitioning [6] creates new dependency edges in the combined control flow and data dependency graph G. The combined control flow and data dependency graph comprises also the barrier synchronisation statements. In order to take the barrier synchronisation instruction for global instruction scheduling into account, the original combined control flow and data dependency graph G is used as a basis. In order to maintain the semantics of a kernel, for each load/store instruction ix ∈ V preceding a barrier synchronisation instruction iy ∈ V (x < y), a dependency edge is inserted between ix and iy . The same is done for a barrier synchronisation instruction and all succeeding load/store instructions. All control flow edges of the original combined control flow and data dependency graph G were replaced by edges between non-branch instructions. This enables single instructions to be moved over branches, if there are no data dependencies in the branch.. For calculating the mobility of a mobile instruction, firstly, ASAP scheduling (As Soon As Possible) [22] is conducted with the help of the control flow and data dependency graph G. This reveals the lower bound to where the memory instruction can be relocated. In a second step a scheduling with scheduling policy ALAP (As Late As Possible) [11] is performed to determine the upper bound for positions. Both, ASAP and ALAP were originally used at the synthesis of hardware but also work on other graphs like combined control flow and data dependency graph utilized in the paper. The mobility interval for a mobile instruction is, as depicted in Figure 4, the interval between the ASAP and the ALAP position. The shaded area for mobile instructions in Figure 4 marks positions where M ob Ins1 can not be scheduled to, because the instruction can not be placed inside a branch if it is not placed there before. The mobility values are then employed by FALIS in order to optimize a program. As it was revealed in [21], a GA is an appropriate optimization technique in the field of instruction scheduling for GPGPU applications since it can cope with effects of the warp scheduler of Nvidia graphics cards. 3.3 Multi-Objective Genetic Algorithm Specification FALIS utilizes a multi-objective genetic algorithm, which uses real profiling data to evaluate the optimization potential of a solution. 
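Before turning to the genetic algorithm, the mobility computation of Section 3.2 can be pictured with a small sketch. It operates on a flat, already topologically ordered instruction list rather than on the Whirl representation used by FALIS, so positions here are abstract slot indices; the dependency-edge rule around barriers and the ASAP/ALAP positions follow the description above, while the data layout and function names are ours.

    def add_barrier_edges(n_instr, is_mem, is_bar, deps):
        # Dependency edges added around barriers (Sec. 3.2): every load/store
        # before a barrier, and every barrier before a load/store, gets an
        # explicit edge so that reordering preserves the kernel semantics.
        deps = set(deps)
        for b in range(n_instr):
            if not is_bar[b]:
                continue
            for i in range(n_instr):
                if is_mem[i]:
                    if i < b:
                        deps.add((i, b))
                    elif i > b:
                        deps.add((b, i))
        return deps

    def mobility_intervals(n_instr, deps):
        # ASAP/ALAP positions on the dependency DAG; the mobility interval of
        # instruction i is [asap[i], alap[i]]. Edges (u, v) are assumed to go
        # from lower to higher index (instructions in topological order).
        preds = {i: [u for (u, v) in deps if v == i] for i in range(n_instr)}
        succs = {i: [v for (u, v) in deps if u == i] for i in range(n_instr)}
        asap = [0] * n_instr
        for i in range(n_instr):
            asap[i] = max((asap[p] + 1 for p in preds[i]), default=0)
        last = max(asap) if n_instr else 0
        alap = [last] * n_instr
        for i in reversed(range(n_instr)):
            alap[i] = min((alap[s] - 1 for s in succs[i]), default=last)
        return list(zip(asap, alap))

    # Hypothetical straight-line kernel: LD, ADD, BAR, ST, MUL (indices 0..4).
    is_mem = [True, False, False, True, False]
    is_bar = [False, False, True, False, False]
    deps = add_barrier_edges(5, is_mem, is_bar, {(0, 1), (1, 3)})
    print(mobility_intervals(5, deps))   # the independent MUL gets the widest interval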
In the following, terms in the scope of FALIS are presented in more detail:

– Gene g: Basic feature which represents the position of a certain instruction inside the linear basic block sequences for one CUDA kernel. The minimal and maximal values for each gene are given by the interval limited by the positions determined by the ASAP and the ALAP schedule.
– Chromosome c: Gene sequence g_1, ..., g_n. g_i (i \in [1, n]) is the assignment of a position to a mobile instruction. There are n mobile instructions in the code.
– Individual I: An element of the solution space (I \in \mathcal{I}). It comprises a chromosome. The set of all individuals is denoted as \mathcal{I} in the following.
– Population I_x: Set of considered individuals, I_x \subseteq \mathcal{I}.

Energy and Runtime Measurement: The two major goals of the optimization process are energy consumption decrease and performance increase. Both are measured with a performance and energy benchmarking system as depicted in Figure 5. Since the focus is on benchmarking the GPGPU application executed by the graphics card, the power consumption of the system hosting the graphics card is not measured. The power consumption is measured with power clamps at the power supply lines of the PCI Express bus for 12V and 3.3V. For this purpose, the power supply lines of the PCI Express bus have been extended so that the current can be measured with the power clamps. If a graphics card needs additional power from the main system power supply, it is also possible to measure the current through this interface. The power consumption of a graphics card is described as

    P(t) = P^{3.3V}(t) + P^{12V}(t) ,    (1)

where P^{12V} is the power consumption (W) at the 12V power supply lines, P^{3.3V} is the power consumption (W) at the 3.3V power supply lines, and t is a time value in the runtime interval of the kernel. The energy consumption and the runtime of a GPGPU application kernel are evaluated as follows. The runtime interval [T_0, T_{run}] is delimited by trigger signals (TS) in the source code which are initiated by the host and can be measured at the output of the RS232 serial port. The end signal is triggered after a cudaThreadSynchronize() command (also in the host code), which forces CUDA to finish the kernel execution up to this position in the source code. A certain energy E(I) is consumed during the runtime runtime(I) ([T_0, T_{run}]) for individual I (representing a particular scheduling for one kernel):

    E(I) = \sum_{t \in [T_0, T_{run}]} \frac{P(t)}{f_s} ,    (2)

in which f_s is the sampling frequency (s^{-1} or Hertz) of the oscilloscope measurements, P(t) the power consumption in Watt (W), E the energy in Joule (J), and t the time in the runtime interval.

Fig. 5. Energy and Runtime Evaluation Framework (testbed system, oscilloscope, current clamps on the 3.3V and 12V PCI Express power supply lines of the Nvidia graphics card, and the RS232 trigger signal line).

FALIS Workflow: The workflow of the genetic algorithm – in a more abstract form as opposed to the detailed description of SPEA2 in [27] – comprises several steps and starts with the generation of an initial population I_x with x = 0. For the individuals in that population I \in I_x, the fitness values are evaluated. Based on these fitness values, a selection of the most appropriate candidates for the creation of the next generation I_{x+1} is done. The schematic of the workflow of SPEA2 is depicted in Figure 1.
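As a small illustration of Eqs. (1)-(2), the discrete integration of the sampled power into the energy of one kernel run can be written as follows; the sample arrays and the numbers in the example are invented, and the real framework of course reads them from the oscilloscope trace between the two trigger signals.

    def kernel_energy(p_33v, p_12v, f_s):
        # Energy of one kernel run per Eqs. (1)-(2): sum the total power
        # P(t) = P_3.3V(t) + P_12V(t) over all samples in [T0, T_run] and
        # divide by the sampling frequency f_s (samples per second).
        assert len(p_33v) == len(p_12v)
        return sum(a + b for a, b in zip(p_33v, p_12v)) / f_s

    # Hypothetical trace: 2000 samples at 10 kHz, i.e. a 0.2 s kernel run.
    e = kernel_energy([1.1] * 2000, [28.4] * 2000, f_s=10_000.0)
    print(round(e, 3), "J")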
A subset of the elements of the Ix+1-th population take part in evolutionary processes:
– Crossover: Exchange of a part of the genes between two individuals I, I' ∈ Ix.
– Mutation: Randomized mutation of genes of an individual I ∈ Ix.
In FALIS, crossover and mutation change the position of memory and memory-related instructions, which can lead to invalid individuals, i.e., individuals that represent a GPGPU application whose semantics may differ from the original version. Therefore, an individual validator was implemented which checks, with the help of a control flow graph and a data dependency graph, whether the created program is correct. When the population size μ is reached in the selection process for population Ix+1, the complete process restarts with this population. The complete process is repeated until the population converges or a fixed number of evolutions is accomplished.
The fitness function enables the assessment of the quality of an individual I ∈ I with regard to a certain objective. In order to determine better solutions for each kernel, each individual of a population is evaluated with regard to the fitness function by FALIS. In contrast to single-objective genetic algorithms, in multi-objective algorithms several competing objectives must be handled. This can be done, e.g., with a fitness function and a selection process tailored towards multiple objectives. The most important method of SPEA2 is the multi-objective fitness function. SPEA2 assigns the fitness f(I) = R(I) + D(I) to an individual I, according to the following equations.

R(I) = \sum_{J \in I_x, J \succ I} S(J)    (3)

is the raw fitness of an individual I, which takes into account the strength

S(J) = |\{ Z \mid Z \in I_x, J \succ Z \}|    (4)

of the individuals J which dominate I. I ≻ J is the Pareto dominance relation defined in [24] and means that I dominates J if the condition

runtime_ratio(I) ≤ runtime_ratio(J) ∧ E_ratio(I) ≤ E_ratio(J) ∧ (runtime_ratio(I) < runtime_ratio(J) ∨ E_ratio(I) < E_ratio(J))    (5)

is true. The reference individual without scheduling, used for calculating the performance increase and the energy consumption decrease, is defined as Iun, with all mobile instructions at their original positions. The runtime improvement for an individual I is specified as the runtime of the scheduled version runtime(I) with respect to the completely unchanged version of the benchmark, runtime_ratio(I) = runtime(I)/runtime(Iun). The energy consumption reduction E_ratio(I) = E(I)/E(Iun) is, analogously to the runtime improvement, defined as the energy consumption of the scheduled version E(I) in comparison to the energy consumption of the unchanged benchmark code E(Iun). In addition, SPEA2 provides a correction term D(I) to take density information into account. With this approach, it is possible to optimize CUDA kernels towards several objectives. FALIS is designed to optimize the objectives energy consumption and performance. The runtime and energy consumption values are stored for each individual, but only the individuals on the Pareto front are interesting for the application designer, as they minimize at least one objective.
4 Evaluation
In this section, the optimization capability of FALIS is evaluated with respect to the following two objectives: runtime and energy consumption. First, an overview of the benchmarks and the test system is given, and then the results are presented.
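Before turning to the measured results, the dominance test of condition (5) and the strength and raw-fitness bookkeeping of Equations (3) and (4) can be sketched as follows. The C fragment below is a minimal illustration with hypothetical names (individual_t, dominates) and a toy population; it is not the SPEA2/JECO implementation used by FALIS.

#include <stdio.h>

/* Hypothetical per-individual record holding the two normalized
   objectives runtime_ratio(I) and E_ratio(I) from the profiling run. */
typedef struct {
    double runtime_ratio;   /* runtime(I) / runtime(I_un) */
    double e_ratio;         /* E(I) / E(I_un)             */
} individual_t;

/* Pareto dominance following condition (5): a dominates b if it is
   no worse in both objectives and strictly better in at least one. */
static int dominates(const individual_t *a, const individual_t *b)
{
    return a->runtime_ratio <= b->runtime_ratio &&
           a->e_ratio       <= b->e_ratio       &&
          (a->runtime_ratio <  b->runtime_ratio ||
           a->e_ratio       <  b->e_ratio);
}

int main(void) {
    individual_t pop[] = {          /* toy population */
        { 0.95, 0.97 }, { 0.92, 0.99 }, { 1.00, 1.00 }
    };
    int n = 3;

    /* strength S(J): number of individuals J dominates; raw fitness
       R(I): sum of strengths of the individuals dominating I (Eq. 3, 4) */
    int S[3] = {0}, R[3] = {0};
    for (int j = 0; j < n; j++)
        for (int z = 0; z < n; z++)
            if (j != z && dominates(&pop[j], &pop[z])) S[j]++;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            if (j != i && dominates(&pop[j], &pop[i])) R[i] += S[j];

    for (int i = 0; i < n; i++)
        printf("I%d: S=%d R=%d\n", i, S[i], R[i]);
    return 0;
}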
(a) Kernel convolutionRowsKernel [14] (legend: 7 registers/thread, 8 registers/thread, Pareto Front)  (b) Kernel srad_cuda_1 [3] (legend: 10 registers/thread, Pareto Front)
Fig. 6. Proportional Energy Consumption (E_ratio) and Proportional Runtime (runtime_ratio) Reduction for all Evaluated Individuals (axes: Proportional Energy Consumption (%) vs. Proportional Runtime (%))
4.1 Experimental Environment and Theoretical Evaluation
The benchmark suite for this evaluation contains benchmarks from the following sources:
– Nvidia CUDA examples [14]
– VSIPL-GPU-Library [10]
– Rodinia benchmark suite [3]
The benchmark suite comprises a large variety of application domains, such as medical imaging, data mining, image processing, pattern recognition and simulation. The benchmark characteristics cover benchmarks with and without extensive main memory utilization as well as benchmarks which are more computationally expensive. The system used in this study consists of the following components: AMD Phenom X4 9650 (processor) and 4GB DDR3 PC1066 (main memory). The power clamp was a Hameg HZO50 and the oscilloscope was a Digilent Electronics Explorer Board. For the SPEA2 implementation in FALIS the JECO library [18] was utilized. The graphics card used in these tests was an Nvidia 9500 GT (shader clock speed: 1107 MHz and memory clock speed: 400 MHz). The probability values for SPEA2 were 40% for mutation and 20% for crossover. The start population size depends on the number of genes in a chromosome of a certain benchmark to ensure that the solution space is sufficiently explored.
Due to the specialisation to memory and memory-related instructions, not all benchmarks of the three benchmark suites show a performance increase or energy consumption decrease. Only the following benchmarks show significant optimization gain and will be taken into account in the results section:
– Kernel convolutionRowsKernel, benchmark convolutionSeparable [14]
– Kernel convolutionColumnsKernel, benchmark convolutionSeparable [14]
– Kernel srad_cuda_1, benchmark SRAD [3]
– Kernel CUDAkernelQuantizationShort, benchmark dct8x8 [14]
– Kernel d_recursiveGaussian_rgba, benchmark recursiveGaussian [14]
– Kernel cuda_compute_flux, benchmark cfd [3]
Fig. 7. Energy Consumption Analysis of Optimized Benchmarks (y-axis: Relative Energy Consumption Decrease (%); x-axis: Benchmark)
4.2 Results
Figures 7 and 8 depict the improvements achieved with FALIS for benchmarks leading to a performance increase (shown in Figure 8) or an energy consumption decrease (shown in Figure 7), as listed in the former section. The maximal value for reducing the energy consumption is 10.23% and the maximal value for the runtime decrease is 13.02%. Both were measured for kernel convolutionRowsKernel of the benchmark convolutionSeparable [14]. Another kernel which was accelerated significantly was the kernel srad_cuda_1 of benchmark SRAD [3]. Detailed evaluations of the explored solutions are depicted in Figure 6(a) and Figure 6(b), respectively. Single points inside the figures show the runtime and energy consumption decrease for one individual. The Pareto-optimal points are connected with a line called the Pareto front.
A runtime decrease of 11.66% and an energy consumption decrease of 8.59% was achieved for kernel srad_cuda_1 (chromosome length: 64). As one can see from Figure 6(b), there are two clusters; one of them comprises solutions which have an impact on both the runtime and the energy consumption. The other exemplary kernel is depicted in Figure 6(a) (chromosome length: 304). The energy consumption decreases by up to 10.23% and the runtime decreases by up to 13.02%. Within the optimization of kernel convolutionRowsKernel, the change in register utilization – from 7 registers per thread to 8 registers per thread – has a positive effect on the runtime and the energy consumption. The performance of kernel CUDAkernelQuantizationShort of benchmark dct8x8 can be increased by 4.84% and its energy consumption is decreased by 3.7%. In addition, the performance of kernel d_recursiveGaussian_rgba of benchmark recursiveGaussian is increased by 8.13% and its energy consumption is decreased by 7.28%.
Fig. 8. Runtime Analysis of Optimized Benchmarks (y-axis: Relative Runtime Decrease (%); x-axis: Benchmark)
5 Conclusion
GPGPU application optimizations are usually performed manually in a time-consuming trial-and-error process without efficient compiler support. Additionally, the lack of energy-consumption-aware optimizations is unfavorable for green computing and for the utilization of GPGPU-capable chips in mobile systems. Therefore, this paper presents a novel multi-objective optimization process based on global instruction scheduling methods, called FALIS, which optimizes the energy consumption and the runtime simultaneously. For solving this optimization problem, a genetic algorithm was used which feeds profiling data back into the optimization process. Due to the utilization of a state-of-the-art multi-objective genetic algorithm it was possible to optimize a GPGPU application towards two objectives: runtime and energy consumption. FALIS focuses on memory and memory-related instructions, which can have a great impact on GPGPU application performance in the face of the memory wall problem. With FALIS, reductions of up to 10.23% in energy consumption and 13.02% in runtime could be achieved for real-world benchmarks.
In the study presented in this paper, memory and memory-related instructions have been taken as the relevant instructions to be scheduled. However, it could possibly be advantageous to also schedule other instructions, such as the instructions used to calculate the addresses for the memory accesses. Therefore, in future work it will be evaluated whether optimized scheduling of more types of instructions is even more beneficial for the energy consumption and the performance.
Acknowledgement. Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project B2.
References
1. Banerjia, S., Havanki, W.A., Conte, T.M.: Treegion Scheduling for Highly Parallel Processors. In: Lengauer, C., Griebl, M., Gorlatch, S. (eds.) Euro-Par 1997. LNCS, vol. 1300, pp. 1074–1078. Springer, Heidelberg (1997)
2.
De Bosschere, K., Luk, W., Martorell, X., Navarro, N., O’Boyle, M., Pnevmatikatos, D., Ramı́rez, A., Sainrat, P., Seznec, A., Stenström, P., Temam, O.: High-Performance Embedded Architecture and Compilation Roadmap. In: Stenström, P. (ed.) Transactions on High-Performance Embedded Architectures and Compilers I. LNCS, vol. 4050, pp. 5–29. Springer, Heidelberg (2007) 3. Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J.W., Lee, S.H., Skadron, K.: Rodinia: A Benchmark Suite for Heterogeneous Computing. In: Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), pp. 44–54 (2009) 4. Cho, S., Melhem, R.: Corollaries to Amdahl’s Law for Energy. IEEE Computer Architecture Letters, 25–28 (2008) 5. Dominguez, R., Kaeli, D.R.: Improving the open64 backend for GPUs. Poster at Google Summer School (2009) 6. Görlich, M.: Untersuchung und Verbesserung der Speicherzugriffsverteilung in GPGPU-Programmen unter Nutzung von lokalen Schedulingmethoden. Master’s thesis, Embedded System Group, Faculty of Computer Science, TU Dortmund (2011) 7. Han, T.D., Abdelrahman, T.S.: Reducing branch Divergence in GPU Programs. In: Proceedings of the Fourth Workshop on General Purpose Processing on Graphics Processing Units, pp. 1–8 (2011) 8. Hong, S., Kim, H.: An Analytical Model for a GPU Architecture with Memorylevel and Thread-level Parallelism Awareness. In: Proceedings of the 36th Annual International Symposium on Computer Architecture (ISCA), pp. 152–163 (2009) 9. Kerns, D.R., Eggers, S.J.: Balanced Scheduling: Instruction Scheduling When Memory Latency is Uncertain. In: Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pp. 278–289 (1993) 10. Kerr, A., Campbell, D., Richards, M.: GPU VSIPL: High-Performance VSIPL Implementation for GPUs. In: Proceedings of the 12th High Performance Embedded Computing Workshop (HPEC), Lexington, Massachusetts, USA (2008) 11. Kung, S.Y., Kailath, T., Whitehouse, H.J.: VLSI and Modern Signal Processing. Prentice Hall Professional Technical Reference (1984) 12. Leupers, R.: Instruction Scheduling for Clustered VLIW DSPs. In: Proceedings of the International Conference on Parallel Architecture and Compilation Techniques (PACT), pp. 291–300 (2000) 13. Machanick, P.: Approaches to Addressing the Memory Wall. Technical report, School of IT and Electrical Engineering, University of Queensland (2002) 14. NVIDIA Corporation: CUDA Architecture (2009) 15. NVIDIA Corporation: The CUDA Compiler Driver NVCC (2009) 16. Open64 Project at Rice University: Open64 Compiler: Whirl Intermediate Representation (2007), www.mcs.anl.gov/OpenAD/open64A.pdf 17. Owens, J., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A., Purcell, T.: A Survey of General-Purpose Computation on Graphics Hardware. Computer Graphics Forum, 80–113 (2007) jvargas2006@gmail.com 28 C. Timm et al. 18. Risco-Martin, J.: Java Evolutionary COmputation library (JECO) (2012), https://sourceforge.net/projects/jeco 19. Rofouei, M., Stathopoulos, T., Ryffel, S., Kaiser, W., Sarrafzadeh, M.: EnergyAware High Performance Computing with Graphic Processing Units. In: Proceedings of the Workshop on Power Aware Computing and Systems, HotPower (2008) 20. Timm, C., Gelenberg, A., Marwedel, P., Weichert, F.: Energy Considerations within the Integration of General Purpose GPUs in Embedded Systems. In: Proceedigns of the Annual Internation Conference on Advances in Distributed and Parallel Computing, ADPC (2010) 21. 
Timm, C., Weichert, F., Marwedel, P., Müller, H.: Multi-Objective Local Instruction Scheduling for GPGPU Applications. In: Proceedings of the International Conference on Parallel and Distributed Computing Systems, PDCS (2011) 22. Tseng, C.J., Siewiorek, D.: Automated Synthesis of Data Paths in Digital Systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 379–395 (1986) 23. Valluri, M., John, L.: Is Compiling for Performance == Compiling for Power? In: Proceedings oh the Workshop on Interaction between Compilers and Computer Architectures, INTERACT (2001) 24. Voorneveld, M.: Characterization of Pareto Dominance. Operations Research Letters, 7–11 (2003) 25. Wang, Z., Hu, X.S.: Energy-Aware Variable Partitioning and Instruction Scheduling for Multibank Memory Architectures. ACM Transactions on Design Automation of Electronic Systems (TODAES), 369–388 (2005) 26. Woo, D.H., Lee, H.H.: Extending Amdahl’s Law for Energy-Efficient Computing in the Many-Core Era. IEEE Computer, 24–31 (2008) 27. Zitzler, E., Giannakoglou, K., Tsahalis, D., Periaux, J., Papailiou, K., Fogarty, T., Ler, E.Z., Laumanns, M., Thiele, L.: SPEA2: Improving the Strength Pareto Evolutionary Algorithm For Multiobjective Optimization. In: Proceedings of the International Conference on Evolutionary and Deterministic Methods for Design, Optimization and Control with Applications to Industrial and Societal Problems, EUROGEN (2001) jvargas2006@gmail.com Parallel Algorithm for Landform Attributes Representation on Multicore and Multi-GPU Systems Murilo Boratto1 , Pedro Alonso2 , Carla Ramiro2 , Marcos Barreto3 , and Leandro Coelho1 1 Núcleo de Arquitetura de Computadores e Sistemas Operacionais (ACSO), Universidade do Estado da Bahia (UNEB), Salvador, Bahia, Brazil {muriloboratto,leandrocoelho}@uneb.br 2 Departamento de Sistemas Informaticos y Computación (DSIC), Universidad Politécnica de Valencia (UPV), Valencia, España {palonso,cramiro}@dsic.upv.es 3 Laboratório de Sistemas Distribuídos (LaSiD), Universidade Federal da Bahia (UFBA), Salvador, Bahia, Brazil marcoseb@dcc.ufba.br Abstract. Mathematical models are often used to simplify landform representation. Its importance is due to the possibility of describing phenomena by means of mathematical models from a data sample. High processing power is needed to represent large areas with a satisfactory level of details. In order to accelerate the solution of complex problems, it is necessary to combine two basic components in heterogeneous systems formed by a multicore with one or more GPUs. In this paper, we present a methodology to represent landform attributes on multicore and multi-GPU systems using high performance computing techniques for efficient solution of two-dimensional polynomial regression model that allow to address large problem instances. Keywords: Mathematical Modeling, Landform Representation, Parallel Computing, Performance Estimation, Multicore, Multi-GPU. 1 Introduction Some recent events have encouraged the development of applications that represent geophysical resources efficiently. Among these representations, mathematical models for representing landform attributes stand out, based on two-dimensional polynomial equations [8]. Landform attributes representation problem using polynomial equations had already been studied in [1]. However, distributed processing was not used in that work, which implied the usage of a high degree polynomial, thus limiting the area representation. 
It occurred because the greater the represented information, the greater computational power is needed. Furthermore, a high degree polynomial is required to represent a large area correctly, which also demands a great computational power. B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 29–43, 2012. c Springer-Verlag Berlin Heidelberg 2012  jvargas2006@gmail.com 30 M. Boratto et al. Among the reasons for fulfilling landform representation, we focus on measuring agricultural areas, having as preponderant factors: plantation design, water resource optimization, logistics and minimization of erosive effects. Consequently, landform representation process becomes a fundamental and needful tool in efficient agriculture operation, especially in the agricultural region located in São Francisco river valley, which stands out as one of the largest viniculture and fruit export areas in Brazil. In addition, one of the main problems that make efficiency use difficult in agricultural productivity factors in this areas occurred due to soil erosion associated with inappropriate land use. In this context, the work proposed here contributes to the characterization of soil degradation processes. Today it is usual to have computational systems formed by a multicore with one or more Graphics Processing Units (GPUs) [13]. These systems are heterogeneous, due to different types of memory and different speeds of computation between CPU and GPU cores. In order to accelerate the solution of complex problems it is necessary to use the aggregated computational power of the two subsystems. Heterogeneity introduces new challenges to algorithm design and system software. Our main goal is to fully exploit all the CPU cores and all GPUs devices on these systems to support matrix computation [14]. Our approach achieves the objective of maximum efficiency by an appropriate balancing of the workload among all these computational resources. The purpose of this paper is to present a methodology for landform attributes representation of São Francisco river valley region based on two-dimensional polynomial regression method on multicore and multi-GPU systems. Section 2 briefly describes the mathematical model. Section 3 explains the parallel model on multicore and multi-GPU systems and the parallel implementation. Section 4 presents the experimental results. Conclusions and future work section closes the paper. 2 Mathematical Model A mathematical landform model is a computational mathematical representation of a phenomenon that occurs within a region of the earth surface [7]. This model can represent plenty of geographic information from a site such as: geological, geophysical, humidity, altitude, terrain, etc. One available technique to accomplish this representation is the Regular Grid Model [11]. This work makes the surface mapping with a global fitting using the polynomial regression technique. This technique fits a two-dimensional polynomial that best describes the data variation from a sample. The problem is that high computational power demanded to perform the regression in a large data set made the process very limited. Polynomial regression is a mathematical modeling that attempts to describe the relation among observed phenomena. Figure 1 shows an example of a Regular Grid representation generated from a regular sparse sample that represents information of an area altitude. 
According to Rawlings [10], modeling can be understood as the development of a mathematical analytical model that describes the behavior of a random variable of interest. This model is used to describe the behavior of independent variables whose relationship with the dependent variable is best represented by a non-linear equation. The relationship among the variables is described by two-dimensional polynomial functions, where the fluctuation of the dependent variable y is related to the fluctuation of the independent variable.
Fig. 1. Model for landform attributes representation: Regular Grid
Particularly, in the case study developed in this project, the non-linear regression has been used to describe the relationship between two independent variables (latitude and longitude) and a dependent variable (height). The mathematical model we use provides the estimation of the coefficients of two-dimensional polynomial functions, of different degrees in x and y, which represent the terrain altitude variation of any area. When using mathematical regression models, the most widely used estimation method for the parameters is Ordinary Least Squares [5], which consists of the estimation of a function that represents a set of points while minimizing the squared deviations. Given a set of geographic coordinates (x, y, z), and taking the estimated altitude as the estimation function of these points, a polynomial of degree r and s in x and y can be given as Equation 1, with ε_{ij} as the error estimated by Equation 2, where 0 ≤ i ≤ m and 0 ≤ j ≤ n,

z = f(x_i, y_j) = \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l,    (1)

\varepsilon_{ij} = z_{ij} - \hat{z}_{ij}.    (2)

The coefficients a_{kl} (k = 0, 1, ..., r; l = 0, 1, ..., s) that minimize the error of the estimation function f(x, y) can be obtained by solving Equation 3 for c = 0, 1, ..., r and d = 0, 1, ..., s,

\frac{\partial \xi}{\partial a_{cd}} = 0,    (3)

where \xi = \sum_{i=0}^{m} \sum_{j=0}^{n} \varepsilon_{ij}^2 = \sum_{i=0}^{m} \sum_{j=0}^{n} (z_{ij} - \hat{z}_{ij})^2.

From Equations 4 to 10 we get the development of Equation 3:

\varepsilon_{ij}^2 = \Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big)^2,    (4)

z_{ij} = \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l + \varepsilon_{ij},    (5)

\xi = \sum_{i=0}^{m} \sum_{j=0}^{n} \Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big)^2,    (6)

\frac{\partial \xi}{\partial a_{cd}} = -2 \sum_{i=0}^{m} \sum_{j=0}^{n} \Big[\Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big) x_i^c y_j^d\Big],    (7)

\sum_{i=0}^{m} \sum_{j=0}^{n} \Big[\Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big) x_i^c y_j^d\Big] = 0,    (8)

\sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} \sum_{i=0}^{m} \sum_{j=0}^{n} x_i^{k+c} y_j^{l+d} = \sum_{i=0}^{m} \sum_{j=0}^{n} z_{ij} x_i^c y_j^d,    (9)

\sum_{i=0}^{m} \sum_{j=0}^{n} \Big[z_{ij} x_i^c y_j^d - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^{k+c} y_j^{l+d}\Big] = 0.    (10)

The particularized form of the polynomial can be exemplified for the r = s = n case, as in Equation 11:

\hat{z}_{ij}(x_i, y_j) = a_{00} x^0 y^0 + a_{01} x^0 y^1 + \cdots + a_{nn} x^n y^n.    (11)

Developing Equation 10 for the same particular case, we obtain Equations 12 and 13 [9]. The final solution is summarized in the matrix representation form Ax = b, where A is the matrix formed by the x_{lc} terms, the vector x is formed by the a_{kl} terms and the vector b is formed by the b_l terms,

x_{lc} = \sum_{i=0}^{m} \sum_{j=0}^{n} x_i^{\alpha} y_j^{\beta},    (12)

b_l = \sum_{i=0}^{m} \sum_{j=0}^{n} z_{ij} x_i^{\gamma} y_j^{\delta}.    (13)

This formulation is valid for any grid format that is to be represented and for any degree of the two-dimensional polynomial that is to be fitted.
3 Parallel Computational Model
One of the most decisive concepts for successfully programming modern computers that use GPUs and multicore processors is the underlying model of the parallel computer. A GPU card connected to a sequential computer can be considered as an isolated parallel computer fitting a SIMD model, i.e.
a set of up to 512 (depending on model) processors running the same instruction simultaneously, each on its own set of data. On the other hand, CPU can be seen as a set of independent computational resources (core) that can cooperate in the solution of a given problem. jvargas2006@gmail.com Parallel Algorithm 33 Thus, a realistic programming model should consider the host system comprising CPU cores and graphic cards as a whole thus leading to a heterogeneous parallel computer model. We follow a model similar to the one described in [2], where the machine is considered as a set of computing resources with different characteristics connected via independent links to a shared global memory. Such model would be characterized by the number and type of concurrent computational resources with different access time to reach each resource and the different types and levels of memory (Figure 2). Programming such a heterogeneous environment poses challenges at two different levels. At the first one, the programming models for CPUs and GPUs are very different. The performance of each single host subsystem depends on the availability of exploiting the algorithm’s intrinsic parallelism and how it can be tailored to be fitted in the GPU or CPU cores. At a second level, the challenge consists of how the whole problem can be partitioned into pieces (tasks) and how they can be delivered to CPU cores or GPU cards so that the workload is evenly distributed between these subsystems. Fig. 2. Heterogeneous parallel computer model The partition of the problem into tasks and the scheduling of these tasks can be based on performance models obtained from previous executions or on a more sophisticated strategy. This strategy is based on small and fast benchmarks representative of the application that allows to predict, at runtime, the amount of workload that should be dispatched to each resource so that it would minimize the total execution time. Currently, we focus our work on how to leverage the heterogeneous concurrent underlying hardware to minimize the time-to-solution of our application leaving the study of more sophisticated strategies of workload distribution to further research. 3.1 The Parallel Algorithm In particular, the algorithm for the construction of matrix A of order N= (s + 1)2 with a polynomial degree s can be seen in Algorithm 1. Arrays x and y have been previously loaded from a file and stored in such a way that allows to simplify two sums in only one jvargas2006@gmail.com 34 M. Boratto et al. of length n= m2 . Routine matrixComputation receives as arguments the arrays x and y, the length of the sum (n) and the order of the polynomial (s), and returns a pointer to the output matrix A. Algorithm 1. Routine matrixComputation for the construction of matrix A. 1: int N = (s+1)*(s+1); 2: for( int k = 0; k < N; k++ ) { 3: for( int l = 0; l < N; l++ ) { 4: int exp1 = k/s+l/s; 5: int exp2 = k%s+l%s; 6: int ind = k+l*N; 7: for ( int i=0; i<n; i++ ) { 8: A[ind] += pow( x[i], exp1 ) * pow( y[i], exp2 ); 9: } 10: } 11: } The construction of matrices A and b (Equations 12 and 13) is by far the most expensive part of the overall process. However, there exists a great opportunity for parallelism in this computation. It is not hard to see that all the elements of the matrix can be calculated simultaneously. Furthermore, each term of the sum can be calculated independently. This can be performed to partition the sum into chunks of different sizes. 
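A minimal sketch of such a chunked variant is given below. It mirrors the loop structure of Algorithm 1, but restricts the innermost sum to the terms [first, first+len); the name matrixComputationChunk and the requirement that the caller zero-initializes A_part are assumptions made for the illustration, not part of the original code. The full matrix A is then the element-wise sum of the partial matrices produced for all chunks.

#include <math.h>

/* Hypothetical variant of matrixComputation restricted to the terms
   [first, first+len) of the sum, accumulating into a caller-provided
   (zero-initialized) partial matrix A_part of order N = (s+1)^2. */
static void matrixComputationChunk(double *A_part,
                                   const double *x, const double *y,
                                   int first, int len, int s)
{
    int N = (s + 1) * (s + 1);
    for (int k = 0; k < N; k++) {
        for (int l = 0; l < N; l++) {
            int exp1 = k / s + l / s;       /* exponent decoding as in Algorithm 1 */
            int exp2 = k % s + l % s;
            int ind  = k + l * N;
            for (int i = first; i < first + len; i++)
                A_part[ind] += pow(x[i], exp1) * pow(y[i], exp2);
        }
    }
}

Because the chunks do not have to be of equal size, the same routine can serve both a CPU core working on a small slice and a GPU-backed thread working on a much larger one.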
Our parallel program is based on this approach since the usual value for the order of the polynomial s ranges from 2 to 20, yielding matrices of order from 9 to 441 (variable N), whereas the length of the sum ranges from 1, 3 to 25, 4 million terms (variable n), depending on how fine the grid is desired. We partition the sum into chunks, each one with a given size so they do not be necessarily equal. The application firstly spawns a number of nthreads where each one (thread_id) will work on a different sumchunk yielding a matrix A_[thread_id]. We consider here A_[thread_id] as a pointer to a linear array of length N×N. The result, i.e. matrix A, is the sum of all these matrices so A=A_[0]+A_[1]+. . . +A_[nthreads-1]. Function matrixComputation can be easily modified to work on these particular matrices that present the computation over chunks of arrays x and y. Everything discussed for the computation of matrix A can be extrapolated to the computation of vector b including its computation in the same routine. Now, we consider our heterogeneous system consisting of two CPU processors with six cores each and two identical GPUs. The workload consisting of the computation of matrix A and array b has been separated into two main parts in order to deliver its computation to both the multicore CPU system and two-GPUs system. Algorithm 2 shows the scheme used to partition the workload into these two pieces by means of the if branching which starts in line 4. Because it is necessary to have one CPU thread linked to each GPU, we initialize the runtime with a total of nthreads, i.e., as many CPU threads as CUDA devices [3] plus a number of CPU threads that will be linked to a CPU core each. This is carried out through an OpenMP [4] pragma directive (lines 1 and 2). jvargas2006@gmail.com Parallel Algorithm 35 The first two CPU threads is binded to two GPUs devices and the rest is binded to CPU virtual processors. The right number of CPU threads can be fewer than the available number of CPU in some cases. Someteimes, we got better results with a number of threads larger than number of cores since we have the Intel Hyper-Threading [6] capability activated. We have employed a static strategy to dispatch data and tasks to the CPU cores and to the GPUs, i.e., the percentage of workload delivered to each system is an input to the algorithm provided by the user. Once given the desired percentage of computation, the size of data is calculated before calling Algorithm 2 so that variable sizeGPU stores the number of terms of the sum (lines 7–9 of Algorithm 1) that each one of two GPUs will compute, and variable sizeCPU stores the total amount of terms that all the cores of the CPU system will compute. Each system will perform computation if the piece of data assigned is larger than zero. Algorithm 2. Using multiple GPU devices and cores. 1: omp_set_num_threads(nthreads); 2: #pragma omp parallel { 3: int thread_id = omp_get_thread_num(); 4: if( sizeGPU ) { /* GPU Computing */ 5: int gpu_id = 2*thread_id; 6: int first = thread_id * sizeGPU; 7: cudaSetDevice(gpu_id); 8: matrixGPU(A,b,&(x[first]),&(y[first]),&(z[first]),sizeGPU,s); 9: } else { 10: if( sizeCPU ) { /* CPU Computing */ 11: int cpu_thread_id = thread_id-2; 12: int first = 2*sizeGPU+cpu_thread_id*sizeThr; 13: int size = thread_id==(nthreads-1)?sizeLth:sizeThr; 14: matrixCPU(A,b,&(x[first]),&(y[first]),&(z[first]),size,s); 15: } 16: } 17: } Matrix A and vector b are the output data of Algorithm 2. 
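A minimal sketch of the final accumulation of these outputs from the per-thread buffers is shown below; the buffer names A_ and b_ follow the description above, but the function itself (reduce_partials) is illustrative and not taken from the original source.

/* After the OpenMP parallel region joins, the main thread accumulates
   the per-thread partial results A_[t], b_[t] into the final
   normal-equation system A, b.  N is the matrix order and nthreads
   the number of spawned threads. */
static void reduce_partials(double *A, double *b,
                            double **A_, double **b_,
                            int N, int nthreads)
{
    for (int i = 0; i < N * N; i++) A[i] = 0.0;
    for (int i = 0; i < N; i++)     b[i] = 0.0;

    for (int t = 0; t < nthreads; t++) {
        for (int i = 0; i < N * N; i++) A[i] += A_[t][i];
        for (int i = 0; i < N; i++)     b[i] += b_[t][i];
    }
}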
They are the sum of the partial sums computed by each system. We use arrays of matrices A_ and b_ described earlier to store these partial results independently whether they were computed by a CPU core or a GPU device. Once the threads are joined (after line 17) a total sum of these partial results is performed by the main thread to form A and b. Each thread works on a different piece of arrays x, y, and z. Routines matrixCPU and matrixGPU are adaptations of Algorithm 1 that receive pointers to the suitable location in arrays x, y and z (set in variable first in lines 6 and 12, respectively) and the length of the sum, i.e., sizeGPU for the each GPU or size for each CPU core. These routines also include the computation of vector b that was omitted in Algorithm 1. The total amount of work performed by the CPU system (line 13) is divided into equal chunks of size sizeThr (sizeThr=sizeCPU/(nthreads-2)) except for the last core which is jvargas2006@gmail.com 36 M. Boratto et al. sizeLth, i.e., sizeThr plus the remaining terms. The GPU devices in our system are identified with integers 0 and 2 (gpu_id), which explains line 5. 3.2 The CUDA Kernel The computation performed by each GPU is implemented in the matrixGPU function, called in line 8 of Algorithm 2. This function firstly performs the usual operations of allocating memory in GPU and uploading data from the CPU to the GPU kernel. Thus, it is supposed that arrays A, x and y have been previously uploaded into the card’s global memory. For the sake of simplicity we restrict the explanation to the computation of matrix A since the computation of vector b can be easily deduced. The construction of matrix A through a CUDA kernel has a great opportunity of parallelism. In this case, we exploit both, the fact that all the elements of matrix A can be computed concurrently and that each term of the sum is independent of any other one. In order to exploit all this concurrency we used a grid of three-dimensional thread blocks. The thread blocks have dimension BLKSZ_X×BLKSZ_Y×BLKSZ_Z whose values are macro definitions in the first three lines of Algorithm 3. Each thread is located in the block through 3 coordinates which are represented by variables X, Y and Z (lines 9–11). The thread blocks are arranged in a three-dimensional grid. The first dimension is 1, and the other two are N/BLKSZ_Y and N/BLKSZ_Z , respectively, being N the dimension of matrix A and idiv an integer function which returns the length of the last two dimensions. The following code is within matrixGPU function and shows this arrangement and the call to the kernel: 1: dim3 dimGrid( 1, idiv( N, BLKSZ_Y ), idiv( N, BLKSZ_Z ) ); 2: dim3 dimBlock( BLKSZ_X, BLKSZ_Y, BLKSZ_Z ); 3: kernel<<< dimGrid, dimBlock >>>( A, x, y, s, n, N ); The aim is that all the threads within a block calculate concurrently the core computation of line 27. The thread with coordinates (X,Y,Z) is assigned to calculate the terms of the sum X+i, with i = 0 :BLKSZ_X: n. This operation is specified by the loop which starts in line 19. The exponents exp1 and exp2 depend on the row (k) and column (l) indexes of the sough-after matrix A. These indexes are calculated in lines 7 and 8, respectively, based on coordinates Y and Z of the thread. All the threads in the block use data in arrays x and y so, before calculation in line 27, a piece of these arrays must be loaded into the shared memory from the global memory. Shared memory is a rapid access memory space that all the threads within a block can access. 
Each thread in the X dimension with Y=0 and Z=0 performs this load into shared memory copying one element of arrays x and y into arrays sh_x and sh_y, respectively (lines 22–25). These last arrays have been allocated in the shared memory in line 13. Upon completion of the loop in line 29, all the terms of the sum assigned to that thread have been calculated and stored in the register variable a. Now, this value is stored in shared memory (line 30). Therefore, a three-dimensional array sh_A of size BLKSZ_X×BLKSZ_Y×BLKSZ_Z has been allocated in shared memory in line 14. We need to imagine the shared data sh_A as a three-dimensional cube where each position has a partial sum of the total sum. There is one sum for each element r×c jvargas2006@gmail.com Parallel Algorithm 37 Algorithm 3. CUDA Kernel. 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22: 23: 24: 25: 26: 27: 28: 29: 30: 31: 32: 33: 34: 35: 36: 37: 38: 39: 40: #define BLKSZ_X #define BLKSZ_Y #define BLKSZ_Z 128 2 2 __global__ void kernel( double *A, double *x, double *y, int s, int n, int N ) { int k = blockIdx.y * blockDim.y + threadIdx.y; int l = blockIdx.z * blockDim.z + threadIdx.z; int X = threadIdx.x; int Y = threadIdx.y; int Z = threadIdx.z; double a = 0.0; __shared__ double sh_x[BLKSZ_X], sh_y[BLKSZ_X]; __shared__ double sh_A[BLKSZ_X][BLKSZ_Y][BLKSZ_Z]; if( k<N && l<N ) { int exp1 = (k/s)+(l/s); int exp2 = (k%s)+(l%s); for( int K=0; K<n; K+=BLKSZ_X ) { int i = X+K; if( i<n ) { if( Y == 0 && Z == 0 ) { sh_x[X] = x[i]; sh_y[X] = y[i]; } __syncthreads(); a += pow( sh_x[X], exp1 ) * pow( sh_y[X], exp2 ); } } sh_A[X][Y][Z] = a; __syncthreads(); if( X == 0 ) { a = 0.0; for( int i=0; i<BLKSZ_X; i++ ) { a += sh_A[i][Y][Z]; } A[k+N*l] = a; } } } of matrix A. In other words, elements sh_A[i][Y][Z], for all i, contain the partial sums corresponding to a given element r×c, taking into account the parity between the matrix element and the thread coordinates set in lines 7–11. Thus, it is necessary now to add all the partial sums in the X dimension for all the sums. This operation (lines 32– 38) is performed only by threads such that X=0. Once the sum has been computed, the result is saved in, global memory (line 37). jvargas2006@gmail.com 38 M. Boratto et al. We use synchronization points within the code (__syncthreads()) to make sure data in shared memory is saved before read. The use of shared memory is restricted to a small size that must be statically allocated (at compilation time). The size of our target GPU is 48KB. The maximum number of threads per block is limited to 1024. However, the total amount of shared memory is what really determines the size of the threads block. Anyway, the limitation in the number of threads per block is easily overcome by the number of blocks that can be run concurrently. Typical values for the threads block dimensions are the ones defined in lines 1–3. We experimentally checked that dimensions Y and Z should be equal. Somehow there are proportional to N (size of matrix A) and dimension X should be related to n (size of arrays x and y) and so much longer. Values of N range from 9 to 441, while values of n, range between 1, 3 × 106 and 25, 4 × 106 in our experiments. Given this relationship between N and n, it is clear that the opportunity for concurrency spreads in the X dimension of the block. 
Just as a simply note to say that we chose the first dimension as the “largest” one due to the GPU limits, the last dimension of the threads block to 64, allowing up to 1024 the number of threads in the other two dimensions. In addition, it is possible to say that the three-dimensional grid of blocks has really been limited to an effective two-dimensional grid since the first dimension is set to 1. More blocks in coordinate X means that data computed by each block in that dimension and stored in the shared memory should be also shared among the thread blocks. This can only be done through global memory resulting in a performance penalty. 4 Experimental Results 4.1 Characterization of the Execution Environment The computer used in our experiments has two Intel Xeon X5680 at 3.33Ghz and 96GB of GDDR3 main memory. Each one is a hexacore processor with 12MB of cache memory. It contains two GPUs NVIDIA Tesla C2070 with 14 stream multiprocessors (SM) and 32 stream processors (SP) each (448 cores in total), 16 load/store units, four special functions, a 32K-word register file, 64K of configurable RAM, and thread control logic. Each SM has both floating-point and integer execution units. Floating point operations follow the IEEE 754-2008 floating-point standard. Each core can perform one double precision fused multiply-add operation in each clock period and one double-memory. The installed CUDA toolkit is version 4.0. We use library MKL 10.3 to perform BLAS operations in the CPU subsystem. 4.2 Landform Attributes Representation Analysis In order to validate the presented methodology and derived equations, it will be applied computing techiques for efficient solution of two dimensional polynomial regression model to represent the landform attributes of an area of São Francisco river valley region. The data source of the chosen area comes from Digital Terrain Models (DTM) [12], in the form of a regular matrix, with spacing approximately 900 meters in jvargas2006@gmail.com Parallel Algorithm 39 the geographic coordinates. The statistical analyses of the elevations indicate a dispersion from 1 to 2, 863 meters. The DTM with 1, 400 points has only 722 points inside the region, the other points are outside the area. Using all the points representing the landform attributes of the area and, by Equation 12, we estimate the polynomial coefficients for representing terrain altitude variation. The time to estimate such a polynomial in high degree needs great computer power and a long time of processing. However, the higher the degree of the polynomial is more accuracy in the description of landform attributes representation we have, thus achieving a more satisfactory level of details. Using the DTM data source, the solution of the model shows an elevation map generated in 3D vision (Figure 3) and a 2D projection in gray tones (Figure 4). It can also be observed that São Francisco river valley region has a heavily uneven topography so a high degree of the adjusted polynomial is needed to attain an accurate representation of the surface. Figure 3 shows the elevation map generated with the coefficients of polynomials with degrees 2, 4, 6 and 20, respectively. Therefore, by fitting a high degree polynomial to the data a better landform attributes representation and a more accurate extrapolation is obtained. 
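Once the coefficients a_kl have been estimated, producing an elevation map such as those in Figures 3 and 4 reduces to evaluating the fitted polynomial of Equation (1) over a regular lattice of coordinates. The following C sketch shows that evaluation; the row-major coefficient layout a[k*(s+1)+l] is an assumption made for the example and is not taken from the original code.

#include <math.h>

/* Evaluates the fitted two-dimensional polynomial (Equation 1) at one
   geographic coordinate, given the coefficients a_kl for degrees r and s. */
static double eval_surface(const double *a, int r, int s,
                           double x, double y)
{
    double z = 0.0;
    for (int k = 0; k <= r; k++)
        for (int l = 0; l <= s; l++)
            z += a[k * (s + 1) + l] * pow(x, k) * pow(y, l);
    return z;
}

Sweeping eval_surface over a regular grid of (x, y) positions yields the estimated heights that are rendered in the 3D and gray-tone maps.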
Fig. 3. 3D Vision landform attributes representation of São Francisco river valley region for polynomial degrees 2, 4, 6 and 20
4.3 Experiments Using Double Precision Data
We have implemented a parallel algorithm in CUDA for landform attributes representation, using the parallel scheme proposed in Section 3 to build the surface mapping with a global fitting based on the polynomial regression technique.
Fig. 4. Gray tones landform attributes representation of São Francisco river valley region for polynomial degrees 2, 4, 6 and 20
The benchmarks were compiled with nvcc. In the experiments, we first increased the number of CPU threads from 1 to 24 (Hyper-Threading is enabled) to obtain the number of threads that minimizes the execution time. Then we added 1 and 2 GPUs to the number of threads obtained in the former test. The input sizes of the problem (degree of the polynomial) for the experiments were 2, 4, ..., 40.
The algorithm performance is analyzed in Figure 5. The execution with one thread is denoted by "Sequential" in the figure, while "OMP" denotes the use of several CPU threads. The OMP version distributes the matrix calculation among the threads and each thread runs exclusively on a CPU core. The versions denoted by 1GPU and 2GPU represent executions on a single device and on two devices, respectively. The parallel model ("Model") uses all cores available in the heterogeneous system; in this model the threads are executed by all the elements of the machine, i.e., the suitable number of CPU cores and the two GPUs.
The results of the experiments show that the parallel CPU algorithm (OMP) reduces the execution time significantly. As can be seen in Figure 5, the maximum speedup is around 12, matching the number of cores. The second plot shows Gflops, where the difference in performance can be observed more clearly. It must be noted that, for small polynomial degrees, the performance of OMP is larger than the performance with 1 GPU (size <= 10). This is due to the data transfer between CPU and GPU. Similarly, the performance of 1 GPU is larger than the performance with 2 GPUs (size <= 20). In this case, this is due to the setup time needed for the selection of the devices, which is high in our target machine (4.5 sec.) and is not necessary if just one GPU is used. The best result has been obtained with every resource available in the heterogeneous system. The speedup increases with the problem size, reaching the theoretical maximum of 78, a number that has been obtained by comparing the computational power of one GPU with that of the CPU. The use of the GPU as a standalone tool provides benefits but does not reach the potential performance that could be obtained by adding more GPUs and/or the CPU subsystem.
Fig.
5. Graphical representation of the execution time (in seconds), Gflops and Speedup rates by varying the size of the problem (degree of the polynomial) jvargas2006@gmail.com 42 M. Boratto et al. Table 1. Comparative performance analysis of execution time (in seconds) Degree Polynomial 8 12 16 20 24 28 32 36 40 Sequential 84.49 386.17 1, 166.88 2, 842.52 5, 916.06 11, 064.96 24, 397.66 30, 926.82 46, 812.70 OMP 12.32 41.85 114.55 268.32 544.93 1, 011.42 1, 777.25 2, 700.00 4, 252.69 1GPU 12.44 21.36 43.31 90.57 172.71 310.53 521.63 828.50 1, 261.09 2GPU 14.19 19.39 31.48 57.03 101.08 176.88 285.07 450.67 666.77 Model 13.61 18.04 25.53 49.29 88.86 156.72 256.62 404.25 600.90 5 Conclusion and Future Works The experimental results obtained in this work indicate that our approach to the solution of the mathematical model for representing landform attributes is efficient and scalable. The built application exploits the computing power of current GPUs leveraging the intrinsic parallelism contained in the algorithm. Furthermore, our solution is designed so that tasks in which the building matrix is partitioned can be either been dispatched to a GPU or a CPU core. The high computing cost of the application and the way in which we performed the solution in this paper motivate us to extended this algorithm further to other hierarchically higher levels such as clusters of nodes like the one we used in these experiments. To this end, we propose for the future an auto-tuning method to determine the best tile size that will be computed by each subsystem in order to attain load balancing among all possible computational resources available. Acknowledgment. We would like to thank The Generalitat Valenciana for PROMETEO/2009/2013 project. References 1. Bajaj, C., Ihm, I., Warren, J.: Higher-order interpolation and least-squares approximation using implicit algebraic surfaces. ACM Transactions on Graphics 12, 327–347 (1993) 2. Ballard, G., Demmel, J., Gearhart, A.: Communication bounds for heterogeneous architectures. Tech. Rep. 239, LAPACK Working Note (February 2011) 3. Barnat, J., Bauch, P., Brim, L., Ceska, M.: Computing strongly connected components in parallel on CUDA. In: Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2011), pp. 544–555. IEEE Computer Society (2011) 4. Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: portable shared memory parallel programming (scientific and engineering computation). The MIT Press (2007) 5. Golub, G.H., Loan, C.F.V.: Matrix Computations, 2nd edn., Baltimore, MD, USA (1989) 6. Marr, D.T., Binns, F., Hill, D.L., Hinton, G., Koufaty, D.A., Miller, J.A., Upton, M.: Hyperthreading technology architecture and microarchitecture. Intel Technology Journal 6(1), 1–12 (2002) jvargas2006@gmail.com Parallel Algorithm 43 7. Namikawa, L.M., Renschler, C.S.: Uncertainty in digital elevation data used for geophysical flow simulation. In: GeoInfo, pp. 91–108 (2004) 8. Nogueira, L., Abrantes, R.P., Leal, B.: A methodology of distributed processing using a mathematical model for landform attributes representation. In: Proceeding of the IADIS International Conference on Applied Computing (April 2008) 9. Nogueira, L., Abrantes, R.P., Leal, B., Goulart, C.: A model of landform attributes representation for application in distributed systems. In: Proceeding of the IADIS International Conference on Applied Computing (April 2008) 10. Rawlings, J.O., Pantula, S.G., Dickey, D.A.: Applied Regression Analysis: A Research Tool. 
Springer Texts in Statistics. Springer (April 1998) 11. Rufino, I., Galvão, C., Rego, J., Albuquerque, J.: Water resources and urban planning: the case of a coastal area in brazil. Journal of Urban and Environmental Engineering 3, 32–42 (2009) 12. Rutzinger, M., Hofle, B., Vetter, M., Pfeifer, N.: Digital terrain models from airborne laser scanning for the automatic extraction of natural and anthropogenic linear structures. In: Geomorphological Mapping: a Professional Handbook of Techniques and Applications, pp. 475– 488. Elsevier (2011) 13. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In: Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, pp. 97–106. Eurographics Association, Aire-la-Ville (2007) 14. Song, F., Tomov, S., Dongarra, J.: Efficient support for matrix computations on heterogeneous multicore and multi-GPU architectures. Tech. Rep. 250, LAPACK Working Note (June 2011) jvargas2006@gmail.com The Performance Model of an Enhanced Parallel Algorithm for the SOR Method Italo Epicoco1,2 and Silvia Mocavero2 1 2 University of Salento, Lecce, Italy italo.epicoco@unisalento.it Euro-Mediterranean Center for Climate Change (CMCC), Lecce, Italy silvia.mocavero@cmcc.it Abstract. The Successive Over Relaxation (SOR) is a variant of the iterative Gauss-Seidel method for solving a linear system of equations Ax = b. The SOR algorithm is used within the NEMO (Nucleus for European Modelling of the Ocean) ocean model for solving the elliptical equation for the barotropic stream function. The NEMO performance analysis shows that the SOR algorithm introduces a significant communication overhead. Its parallel implementation is based on the Red-Black method and foresees a communication step at each iteration. An enhanced parallel version of the algorithm has been developed by acting on the size of the overlap region to reduce the frequency of communications. The overlap size must be carefully tuned for reducing the communication overhead without increasing the computing time. This work describes an analytical performance model of the SOR algorithm that can be used for establishing the optimal size of the overlap region. Keywords: SOR, NEMO, Performance Model. 1 Introduction The ocean engine of NEMO (Nucleus for European Modelling of the Ocean) [1] is a primitive equation model adapted to regional and global ocean circulation problems. It is a flexible tool for studying the ocean and its interactions with the other components of the earth climate system over a wide range of space and time scales. Prognostic variables are the three-dimensional velocity field, the sea surface height, the temperature and the salinity. In the horizontal direction, the model uses a curvilinear orthogonal grid and in the vertical direction, a full or partial step z-coordinate, or s-coordinate, or a mixture of the two. The model time stepping environment is a three level scheme in which the tendency terms of the equations are evaluated either centered in time, or forward, or backward depending on the nature of the term. The model is spatially discretized on a staggered grid (Arakawa C grid) masking the land points. Vertical discretization depends on both how the bottom topography is represented and whether the free surface is linear or not. Explicit, split-explicit and filtered free surface formulations are implemented for solving the prognostic equations for the active B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 44–56, 2012. 
c Springer-Verlag Berlin Heidelberg 2012  jvargas2006@gmail.com The Performance Model of an Enhanced Parallel Algorithm 45 tracers and the momentum. A number of numerical schemes are available for the momentum advection, for the computation of the pressure gradients, as well as for the advection of the tracers (second or higher order advection schemes, including positive ones). When the filtered sea surface height option is used, a new force that can be interpreted as a diffusion of the vertically integrated volume flux divergence is added in the momentum equation. The equation is solved implicitly and it represents an elliptic equation for which two solvers are available: the SOR and the Preconditioned Conjugate Gradient (PCG) schemes. The SOR has been retained because it is a linear solver very useful when using the adjoint model of NEMO. The NEMO model with the MFS16 [2] configuration has been evaluated on the MareNostrum platform at the Barcelona Supercomputing Center. The routine named dyn spg is the most time consuming one; it computes the surface pressure gradient term using the SOR scheme. The paper is organized as follows: next section introduces the SOR (Successive Over Relaxation) method. Section 3 describes our parallel approach, while the latter sections, 4 and 5, show respectively the analytical performance model of the parallel algorithm and the iso-efficiency analysis. 2 SOR Overview The iterative methods for solving the linear equation systems Ax = b iteratively generates a sequence {pk } of approximate solutions such that the residual vector (rk = b − Apk ) converges to 0. The Gauss-Seidel algorithm [3] is an example of iterative method for solving a linear equation system. The method can be applied only if the matrix A is strictly diagonally dominant. Each equation is solved by the unknown on the diagonal and the approximated values for the other unknowns are plugged in. The process is then iterated until convergence. The Gauss-Seidel method is easily derived by examining separately each of the n equations in the linear system. Let the i-th equation given by: n  aij xj = bi (1) j=1 (k) At the iteration k, it can be solved by (2) for the value of xi assuming the (k−1) approximation of the previous iteration (xj=i ) for the other unkowns xj=i . (k) xi ⎛ ⎞ i−1 n   1 ⎝ (k) (k−1) ⎠ = aij xj − aij xj bi − aii j=1 j=i+1 (2) There are two important characteristics of the Gauss-Seidel method that should be noted. Firstly, the computation appears to be serial: since each component at the new iteration depends on all of the previously computed components, the updates cannot be done simultaneously as in the Jacobi method [4]. Secondly, the new iterate value x(k) depends upon the order in which the equations are jvargas2006@gmail.com 46 I. Epicoco and S. Mocavero examined. If it changes, the values at the new iteration (and not just their order) change accordingly. The definition of the Gauss-Seidel method can be expressed using the following matrix notation: x(k) = (D − L)−1 (U x(k−1) + b) (3) where the matrices D, −L, and −U represent the diagonal, the strictly lower triangular, and the strictly upper triangular parts of A, respectively. The SOR [5] is an iterative method for solving a linear system of equations derived by extrapolating the Gauss-Seidel algorithm. This extrapolation takes the form of a weighted average between the previous iteration and the Gauss-Seidel component computed at the current iteration. 
Given a value for the weight ω the component at iteration k is given by: (k) (k) xi = ωxi (k−1) + (1 − ω)xi (4) where x denotes a Gauss-Seidel approximation. The idea is to choose a value for ω within the interval (0, 2) that will accelerate the rate of convergence to the solution. In general, it is not possible to compute in advance the value of ω that will maximize the rate of convergence of the SOR. Frequently, some heuristic estimate is used, such as ω = 2 − O(h) where h is the mesh spacing of the discretization of the underlying physical domain. In matrix terms, the SOR algorithm can be written as follows: x(k) = (D − ωL)−1 [ωU + (1 − ω)D]x(k−1) + ω(D − ωL)−1 b 3 (5) Parallel Algorithm While the matrix notation for the SOR algorithm is useful for a theoretical analysis, a practical implementation requires an explicit formula to be defined [6]. Let’s consider a general second-order elliptic equation in x and y, finite differenced on a square. Each row of the matrix A is an equation of the form: ai,j ui+1,j + bi,j ui−1,j + ci,j ui,j+1 + di,j ui,j−1 + ei,j ui,j = fi,j (6) The iterative procedure is defined by solving the following equation for ui,j . u∗i,j = fi,j − ai,j ui+1,j − bi,j ui−1,j − ci,j ui,j+1 − di,j ui,j−1 ei,j (7) Then, considering the (4), the unew i,j is a weighted average given by: ∗ old unew i,j = ωui,j + (1 − ω)ui,j (8) If we consider that the residual at any stage of the iteration is given by: ξi,j =ai,j ui+1,j + bi,j ui−1,j + ci,j ui,j+1 + di,j ui,j−1 + ei,j ui,j − fi,j jvargas2006@gmail.com (9) The Performance Model of an Enhanced Parallel Algorithm 47 we can calculate the new value at each iteration given by: old unew i,j = ui,j − ω ξi,j ei,j (10) This formulation is very easy to program, and the norm of the residual vector ξi,j can be used as a criterion for terminating the iteration. The need to reduce the time spent by the SOR algorithm without increasing the number of iterations to reach convergence has been the main goal of several previous works. Different multi-color ordering techniques, such as the Red-Black [7] method for two dimensional problems, have been investigated; they allow the parallelization of operations on the same color. Other techniques, overlapping computation and communication or allowing an optimal scheduling of available processors, have been designed and implemented producing parallel versions of SOR [8]. Parallel SOR algorithms, suitable for use on an asynchronous MIMD computer, are presented since 1984 [9]. In the last years, the BPSOR [10], characterized by a new mesh domain partition and ordering, allows retaining the same convergence rate of the sequential SOR method with an easy parallel implementation on an MIMD parallel computing. This work analyzes a parallel algorithm for the SOR based on the Red-Black method that supposes to divide the mesh into odd and even cells, like in a checkerboard. Equation (10) shows that the odd point values depend only on the even points, and vice versa. Accordingly, we can carry out one half-sweep updating the odd points and then another half-sweep updating the even points with the new odd values. The parallel algorithm uses a 2D domain decomposition based on checkerboard blocks. Let ni and nj be respectively the number of rows and columns of the global domain, and pi and pj respectively the number of processes along i and j directions, then each process will compute a subdomain made of ni /pi x nj /pj elements. 
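A minimal C sketch of one Red-Black sweep following Equations (9) and (10) is given below; the row-major layout, the coefficient array names and the returned maximum residual are illustrative choices made for the example, not the NEMO implementation.

/* One Red-Black SOR iteration over an ni x nj grid: colour 0 updates
   the points with (i+j) even, colour 1 the points with (i+j) odd.
   Row-major indexing IDX(i,j) = i*nj + j. */
#define IDX(i, j) ((i) * nj + (j))

static double sor_sweep(double *u,
                        const double *a, const double *b, const double *c,
                        const double *d, const double *e, const double *f,
                        int ni, int nj, double omega)
{
    double max_res = 0.0;
    for (int colour = 0; colour < 2; colour++) {
        for (int i = 1; i < ni - 1; i++) {
            for (int j = 1; j < nj - 1; j++) {
                if ((i + j) % 2 != colour) continue;
                /* residual xi_ij of Equation (9) */
                double res = a[IDX(i,j)] * u[IDX(i+1,j)]
                           + b[IDX(i,j)] * u[IDX(i-1,j)]
                           + c[IDX(i,j)] * u[IDX(i,j+1)]
                           + d[IDX(i,j)] * u[IDX(i,j-1)]
                           + e[IDX(i,j)] * u[IDX(i,j)]
                           - f[IDX(i,j)];
                /* over-relaxed update, Equation (10) */
                u[IDX(i,j)] -= omega * res / e[IDX(i,j)];
                if (res < 0) res = -res;
                if (res > max_res) max_res = res;
            }
        }
    }
    return max_res;   /* can drive the convergence test */
}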
If we consider only one overlap line between neighbors, each parallel process must exchange the values computed at its border at every iteration of the SOR, and two communication steps must be performed per iteration (one for the odd and one for the even points). At each iteration, the generic process computes the odd points inside its domain, exchanges the odd points with its neighbors and updates the boundary values, computes the inner even points and, finally, updates the even points on the boundaries by exchanging them with the neighbors (see Fig. 1). Each parallel process therefore communicates twice with each neighbor per iteration. In order to reduce the frequency of communication, the size of the overlap region can be increased [11]. In that case the neighboring processes exchange a wider overlap region, but the exchanged values can be reused for further iterations without additional communication. After exchanging the data, each process computes a total number of lines given by N_inner + N_ol - 1, where N_inner and N_ol are, respectively, the number of lines in the inner domain and the number of overlap lines. At each iteration only one line of the overlap expires, so the process does not need to exchange data for N_ol - 1 iterations. The convergence rate of the algorithm does not change, since the ordering and the partition are the same as in the original SOR algorithm. The algorithm is described by the following pseudo-code fragment.

Fig. 1. SOR Red-Black computing algorithm

Require: u              // result matrix with initial values
         a, b, c, d, e  // coefficient matrices
         f              // known term
ol_exp ← ol
while ξ(i, j) is not small enough do
  if ol_exp == 0 then
    call data_exch      // exchange odd points over the overlap
  end if
  for all even points do
    tmp ← (f(i, j) − a(i, j) ∗ u(i, j − 1) − b(i, j) ∗ u(i, j + 1) − c(i, j) ∗ u(i − 1, j) − d(i, j) ∗ u(i + 1, j)) / e(i, j)
    ξ(i, j) ← tmp − u(i, j)
    u(i, j) ← ω ∗ tmp + (1 − ω) ∗ u(i, j)
  end for
  if ol_exp == 0 then
    call data_exch      // exchange even points over the overlap
    ol_exp ← ol
  end if
  for all odd points do
    tmp ← (f(i, j) − a(i, j) ∗ u(i, j − 1) − b(i, j) ∗ u(i, j + 1) − c(i, j) ∗ u(i − 1, j) − d(i, j) ∗ u(i + 1, j)) / e(i, j)
    ξ(i, j) ← tmp − u(i, j)
    u(i, j) ← ω ∗ tmp + (1 − ω) ∗ u(i, j)
  end for
  call convergence_test(ξ)
  ol_exp ← ol_exp − 1
end while

A similar approach is used by the HYCOM ocean model [12], where a maximum number of wide halo lines can be added to reduce the halo communication overhead.

4 The Analytical Performance Model

The SOR algorithm has been implemented in a test program made up of the main sor routine, which (i) calls the data_exch routine for exchanging the data between neighbors and (ii) evaluates the convergence. Both routines are characterized by two kinds of operations: computation and communication. data_exch performs some data buffering operations and the actual send and receive of the data on the boundaries; the size of the overlap region directly determines the frequency of the data_exch invocations. The sor routine computes the result matrix and performs a collective communication during the convergence test. If we increase the size of the overlap, the computation increases, while the time for the collective communication does not change. The total time is the sum of these four components, three of which depend on the size of the overlap. Which overlap size yields the greatest benefit?
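Before deriving the model, it is useful to see where the overlap size enters the algorithm. The following sketch (Python; our illustrative code) reproduces the control flow of the pseudo-code of Section 3, including the ol_exp counter that triggers a halo exchange only when the overlap has expired. The callables sweep, exchange_halo and allreduce_max are hypothetical placeholders standing, respectively, for the one-color half-sweep of Eq. (10), the data_exch routine and the MPI allreduce used by the convergence test.

def wide_overlap_sor(u, sweep, exchange_halo, allreduce_max, ol, tol,
                     warmup=100, check_every=10, max_iter=100000):
    # sweep(u, color)         -> max local residual after updating one color (Eq. 10)
    # exchange_halo(u, color) -> refreshes the ol-line overlap region (role of data_exch)
    # allreduce_max(value)    -> global maximum across all processes (MPI allreduce)
    ol_exp = ol
    for it in range(1, max_iter + 1):
        if ol_exp == 0:
            exchange_halo(u, "odd")             # refresh odd points over the overlap
        resid = sweep(u, "even")                # update even points
        if ol_exp == 0:
            exchange_halo(u, "even")            # refresh even points over the overlap
            ol_exp = ol                         # the overlap is fresh again
        resid = max(resid, sweep(u, "odd"))     # update odd points
        ol_exp -= 1                             # one overlap line expires per iteration
        # Convergence test via a collective, performed every check_every iterations
        # after the first warmup iterations (cf. the model below).
        if it > warmup and it % check_every == 0 and allreduce_max(resid) < tol:
            return it
    return max_iter

With an overlap of l lines, the two exchanges are triggered only once every l iterations, which is exactly why the number of data_exch calls in the model below decreases with l while the per-sweep computation grows.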
The optimal value depends on architectural aspects (i.e., the processor speed and the network bandwidth and latency) and changes with both the number of parallel processes and the domain decomposition. A performance model for estimating the behavior of the SOR algorithm has been defined, as in [13]. It takes into account the four components mentioned above. The total time spent by the solver (T_sor) is given by: (i) the communication time spent in the sor routine for the convergence test (T_c_sor); (ii) the computing time spent in the sor routine for evaluating the result matrix at each iteration (T_u_sor); (iii) the computing time spent in the data_exch routine for managing the data buffers used for the data transmission (T_u_data); and (iv) the communication time spent in data_exch for the data transfer to the neighbors (T_c_data). The number of calls to data_exch depends on both the overlap size (l) and the number of iterations (m) needed to reach convergence. The performance model is summarized as follows:

T_{sor} = T_{c\_sor} + T_{u\_sor} + \left( \frac{2m}{l} + 1 \right) (T_{c\_data} + T_{u\_data})    (11)

The four timing components can be modeled as in (12)-(14). The time spent in the collective communication depends only on the number of parallel processes (p_i p_j). The convergence test is performed after the first 100 iterations and then every 10 iterations through an MPI allreduce collective communication, in which the maximum residual value is exchanged among all of the parallel processes. The amount of data exchanged is constant (it is independent of the subdomain size) and, considering an implementation of the allreduce with the butterfly parallel scheme, the number of communication steps is logarithmic in the total number of processes.

T_{c\_sor} = O\left( \frac{m - 100}{10} \log(p_i p_j) \right)    (12)

The computing time of the sor routine is related to the domain size: d_i and d_j are the dimensions of the biggest subdomain along the i and j directions, respectively, and are given by d_i = n_i/p_i and d_j = n_j/p_j. For each iteration of the SOR, a complete sweep of the subdomain elements plus the overlap region is performed.

T_{u\_sor} = O\left( m (d_i + l)(d_j + l) \right)    (13)

The communication is implemented with four point-to-point sends/receives, hence the communication time is directly proportional to the number of exchanged elements. Here we consider a parallel process with four neighbors; processes on the border of the global domain have fewer.

T_{c\_data} = O(L_i + L_j), \quad T_{u\_data} = O(L_i + L_j)    (14)

L_i and L_j represent the total number of elements exchanged between neighbors, with L_i = (d_i + 2l)l and L_j = (d_j + 2l)l. Considering all of the previous equations, the parallel time of the whole algorithm can be expressed as follows:

T_{sor} = O\left( \frac{n_i n_j}{p_i p_j} + \frac{n_i l}{p_i} + \frac{n_j l}{p_j} + l^2 + \log(p_i p_j) \right)    (15)

If we consider a square global domain (n_i = n_j = n) and impose p_i = p_j = \sqrt{p}, then (15) simplifies to:

T_{sor} = O\left( \frac{n^2}{p} + \frac{n l}{\sqrt{p}} + l^2 + \log p \right)    (16)

The analytical performance model has been evaluated experimentally on an IBM Power6 cluster. It has 30 IBM p575 nodes, each