Lecture Notes in Computer Science
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Alfred Kobsa
University of California, Irvine, CA, USA
Friedemann Mattern
ETH Zurich, Switzerland
John C. Mitchell
Stanford University, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
Oscar Nierstrasz
University of Bern, Switzerland
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Germany
Madhu Sudan
Microsoft Research, Cambridge, MA, USA
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbruecken, Germany
7333
Beniamino Murgante, Osvaldo Gervasi,
Sanjay Misra, Nadia Nedjah,
Ana Maria A.C. Rocha, David Taniar,
Bernady O. Apduhan (Eds.)
Computational Science
and Its Applications –
ICCSA 2012
12th International Conference
Salvador de Bahia, Brazil, June 18-21, 2012
Proceedings, Part I
Volume Editors
Beniamino Murgante
University of Basilicata, Potenza, Italy, E-mail: beniamino.murgante@unibas.it
Osvaldo Gervasi
University of Perugia, Italy, E-mail: osvaldo@unipg.it
Sanjay Misra
Federal University of Technology, Minna, Nigeria, E-mail: smisra@futminna.edu.ng
Nadia Nedjah
State University of Rio de Janeiro, Brazil, E-mail: nadia@eng.uerj.br
Ana Maria A. C. Rocha
University of Minho, Braga, Portugal, E-mail: arocha@dps.uminho.pt
David Taniar
Monash University, Clayton, VIC, Australia, E-mail: david.taniar@infotech.monash.edu.au
Bernady O. Apduhan
Kyushu Sangyo University, Fukuoka, Japan, E-mail: bob@is.kyusan-u.ac.jp
ISSN 0302-9743
e-ISSN 1611-3349
ISBN 978-3-642-31124-6
e-ISBN 978-3-642-31125-3
DOI 10.1007/978-3-642-31125-3
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2012939389
CR Subject Classification (1998): C.2.4, C.2, H.4, F.2, H.3, D.2, F.1, H.5, H.2.8,
K.6.5, I.3
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
© Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is
concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting,
reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication
or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965,
in its current version, and permission for use must always be obtained from Springer. Violations are liable
to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply,
even in the absence of a specific statement, that such names are exempt from the relevant protective laws
and regulations and therefore free for general use.
Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This four-volume set (LNCS 7333–7336) contains a collection of research papers
from the 12th International Conference on Computational Science and Its Applications (ICCSA 2012) held in Salvador de Bahia, Brazil, during June 18–21,
2012. ICCSA is one of the most successful international conferences in the field of computational science, and this year, for the first time in the history of the series, it was held in South America. Previous editions of ICCSA were held in Santander, Spain (2011); Fukuoka, Japan (2010); Suwon, Korea (2009); Perugia, Italy (2008); Kuala Lumpur, Malaysia (2007); Glasgow, UK (2006); Singapore (2005); Assisi, Italy (2004); Montreal, Canada (2003); and, as ICCS, Amsterdam, The Netherlands (2002) and San Francisco, USA (2001).
The computational science community has enthusiastically embraced the successive editions of ICCSA, thus contributing to making ICCSA a focal meeting
point for those interested in innovative, cutting-edge research about the latest
and most exciting developments in the field. We are grateful to all those who
have contributed to the ICCSA conference series.
ICCSA 2012 would not have been possible without the valuable contributions of many people. We would like to thank all session organizers for their diligent work, which further raised the quality of the conference, and all reviewers for their expertise and generous effort, which led to a very high-quality event with excellent papers and presentations. We especially recognize the contribution of the Program Committee and Local Organizing Committee members for their tremendous support and for making this congress a very successful event. We would like to sincerely thank our keynote speakers, who willingly accepted our invitation and shared their expertise.
We also thank our publisher, Springer, for agreeing to publish the proceedings and for their kind assistance and cooperation during the editing process.
Finally, we thank all authors for their submissions and all conference attendees for making ICCSA 2012 truly an excellent forum on computational science,
facilitating the exchange of ideas, fostering new collaborations and shaping the
future of this exciting field. Last, but certainly not least, we wish to thank our
readers for their interest in this volume. We really hope you find in these pages
interesting material and fruitful ideas for your future work.
We cordially invite you to visit the ICCSA website—http://www.iccsa.org—
where you can find relevant information about this interesting and exciting event.
June 2012
Osvaldo Gervasi
David Taniar
Organization
ICCSA 2012 was organized by Universidade Federal da Bahia (Brazil), Universidade Federal do Recôncavo da Bahia (Brazil), Universidade Estadual de Feira de
Santana (Brazil), University of Perugia (Italy), University of Basilicata (Italy),
Monash University (Australia), and Kyushu Sangyo University (Japan).
Honorary General Chairs
Antonio Laganà – University of Perugia, Italy
Norio Shiratori – Tohoku University, Japan
Kenneth C.J. Tan – Qontix, UK
General Chairs
Osvaldo Gervasi – University of Perugia, Italy
David Taniar – Monash University, Australia
Program Committee Chairs
Bernady O. Apduhan – Kyushu Sangyo University, Japan
Beniamino Murgante – University of Basilicata, Italy
Workshop and Session Organizing Chairs
Beniamino Murgante – University of Basilicata, Italy
Local Organizing Committee
Frederico V. Prudente – Universidade Federal da Bahia, Brazil (Chair)
Mirco Ragni – Universidade Estadual de Feira de Santana, Brazil
Ana Carla P. Bitencourt – Universidade Federal do Recôncavo da Bahia, Brazil
Cassio Pigozzo – Universidade Federal da Bahia, Brazil
Angelo Duarde – Universidade Estadual de Feira de Santana, Brazil
Marcos E. Barreto – Universidade Federal da Bahia, Brazil
José Garcia V. Miranda – Universidade Federal da Bahia, Brazil
International Liaison Chairs
Jemal Abawajy – Deakin University, Australia
Marina L. Gavrilova – University of Calgary, Canada
Robert C.H. Hsu – Chung Hua University, Taiwan
Tai-Hoon Kim – Hannam University, Korea
Andrés Iglesias – University of Cantabria, Spain
Takashi Naka – Kyushu Sangyo University, Japan
Rafael D.C. Santos – National Institute for Space Research, Brazil
Workshop Organizers
Advances in High-Performance Algorithms
and Applications (AHPAA 2012)
Massimo Cafaro – University of Salento, Italy
Giovanni Aloisio – University of Salento, Italy
Advances in Web-Based Learning (AWBL 2012)
Mustafa Murat Inceoglu – Ege University, Turkey
Bio-inspired Computing and Applications (BIOCA 2012)
Nadia Nedjah – State University of Rio de Janeiro, Brazil
Luiza de Macedo Mourelle – State University of Rio de Janeiro, Brazil
Computer-Aided Modeling, Simulation, and Analysis
(CAMSA 2012)
Jie Shen – University of Michigan, USA
Yuqing Song – Tianjin University of Technology and Education, China
Cloud Computing and Its Applications (CCA 2012)
Jemal Abawajy – Deakin University, Australia
Osvaldo Gervasi – University of Perugia, Italy
Computational Geometry and Applications (CGA 2012)
Marina L. Gavrilova – University of Calgary, Canada
Chemistry and Materials Sciences and Technologies
(CMST 2012)
Antonio Laganà – University of Perugia, Italy
Cities, Technologies and Planning (CTP 2012)
Giuseppe Borruso – University of Trieste, Italy
Beniamino Murgante – University of Basilicata, Italy
Computational Tools and Techniques for Citizen Science
and Scientific Outreach (CTTCS 2012)
Rafael Santos – National Institute for Space Research, Brazil
Jordan Raddick – Johns Hopkins University, USA
Ani Thakar – Johns Hopkins University, USA
Econometrics and Multidimensional Evaluation in the
Urban Environment (EMEUE 2012)
Carmelo M. Torre – Polytechnic of Bari, Italy
Maria Cerreta – Università Federico II of Naples, Italy
Paola Perchinunno – University of Bari, Italy
Future Information System Technologies and Applications
(FISTA 2012)
Bernady O. Apduhan – Kyushu Sangyo University, Japan
Geographical Analysis, Urban Modeling, Spatial Statistics
(GEOG-AN-MOD 2012)
Stefania Bertazzon – University of Calgary, Canada
Giuseppe Borruso – University of Trieste, Italy
Beniamino Murgante – University of Basilicata, Italy
International Workshop on Biomathematics,
Bioinformatics and Biostatistics (IBBB 2012)
Unal Ufuktepe – Izmir University of Economics, Turkey
Andrés Iglesias – University of Cantabria, Spain
International Workshop on Collective Evolutionary
Systems (IWCES 2012)
Alfredo Milani – University of Perugia, Italy
Clement Leung – Hong Kong Baptist University, Hong Kong
Mobile Communications (MC 2012)
Hyunseung Choo – Sungkyunkwan University, Korea
Mobile Computing, Sensing, and Actuation for Cyber
Physical Systems (MSA4CPS 2012)
Moonseong Kim – Korean Intellectual Property Office, Korea
Saad Qaisar – NUST School of Electrical Engineering and Computer Science, Pakistan
Optimization Techniques and Applications (OTA 2012)
Ana Maria Rocha – University of Minho, Portugal
Parallel and Mobile Computing in Future Networks
(PMCFUN 2012)
Al-Sakib Khan Pathan – International Islamic University Malaysia, Malaysia
PULSES - Transitions and Nonlinear Phenomena
(PULSES 2012)
Carlo Cattani – University of Salerno, Italy
Ming Li – East China Normal University, China
Shengyong Chen – Zhejiang University of Technology, China
Quantum Mechanics: Computational Strategies and
Applications (QMCSA 2012)
Mirco Ragni – Universidade Federal da Bahia, Brazil
Frederico Vasconcellos Prudente – Universidade Federal da Bahia, Brazil
Angelo Marconi Maniero – Universidade Federal da Bahia, Brazil
Ana Carla Peixoto Bitencourt – Universidade Federal do Recôncavo da Bahia, Brazil
Remote Sensing Data Analysis, Modeling, Interpretation
and Applications: From a Global View to a Local Analysis
(RS 2012)
Rosa Lasaponara – Institute of Methodologies for Environmental Analysis, National Research Council, Italy
Nicola Masini – Archaeological and Monumental Heritage Institute, National Research Council, Italy
Soft Computing and Data Engineering (SCDE 2012)
Mustafa Mat Deris – Universiti Tun Hussein Onn Malaysia, Malaysia
Tutut Herawan – Universitas Ahmad Dahlan, Indonesia
Software Engineering Processes and Applications
(SEPA 2012)
Sanjay Misra – Federal University of Technology Minna, Nigeria
Software Quality (SQ 2012)
Sanjay Misra – Federal University of Technology Minna, Nigeria
Security and Privacy in Computational Sciences
(SPCS 2012)
Arijit Ukil – Tata Consultancy Services, India
Tools and Techniques in Software Development Processes
(TTSDP 2012)
Sanjay Misra – Federal University of Technology Minna, Nigeria
Virtual Reality and Its Applications (VRA 2012)
Osvaldo Gervasi – University of Perugia, Italy
Andrés Iglesias – University of Cantabria, Spain
Wireless and Ad-Hoc Networking (WADNet 2012)
Jongchan Lee – Kunsan National University, Korea
Sangjoon Park – Kunsan National University, Korea
Program Committee
Jemal Abawajy – Deakin University, Australia
Kenny Adamson – University of Ulster, UK
Filipe Alvelos – University of Minho, Portugal
Paula Amaral – Universidade Nova de Lisboa, Portugal
Hartmut Asche – University of Potsdam, Germany
Md. Abul Kalam Azad – University of Minho, Portugal
Michela Bertolotto – University College Dublin, Ireland
Sandro Bimonte – CEMAGREF, TSCF, France
Rod Blais – University of Calgary, Canada
Ivan Blecic – University of Sassari, Italy
Giuseppe Borruso – University of Trieste, Italy
Alfredo Buttari – CNRS-IRIT, France
Yves Caniou – Lyon University, France
José A. Cardoso e Cunha – Universidade Nova de Lisboa, Portugal
Leocadio G. Casado – University of Almeria, Spain
Carlo Cattani – University of Salerno, Italy
Mete Celik – Erciyes University, Turkey
Alexander Chemeris – National Technical University of Ukraine "KPI", Ukraine
Min Young Chung – Sungkyunkwan University, Korea
Gilberto Corso Pereira – Federal University of Bahia, Brazil
M. Fernanda Costa – University of Minho, Portugal
Gaspar Cunha – University of Minho, Portugal
Carla Dal Sasso Freitas – Universidade Federal do Rio Grande do Sul, Brazil
Pradesh Debba – The Council for Scientific and Industrial Research (CSIR), South Africa
Frank Devai – London South Bank University, UK
Rodolphe Devillers – Memorial University of Newfoundland, Canada
Prabu Dorairaj – NetApp, India/USA
M. Irene Falcao – University of Minho, Portugal
Cherry Liu Fang – U.S. DOE Ames Laboratory, USA
Edite M.G.P. Fernandes – University of Minho, Portugal
Jose-Jesus Fernandez – National Centre for Biotechnology, CSIS, Spain
Maria Antonia Forjaz – University of Minho, Portugal
Maria Celia Furtado Rocha – PRODEB–PósCultura/UFBA, Brazil
Akemi Galvez – University of Cantabria, Spain
Paulino Jose Garcia Nieto – University of Oviedo, Spain
Marina Gavrilova – University of Calgary, Canada
Jerome Gensel – LSR-IMAG, France
Maria Giaoutzi – National Technical University, Athens, Greece
Andrzej M. Goscinski – Deakin University, Australia
Alex Hagen-Zanker – University of Cambridge, UK
Malgorzata Hanzl – Technical University of Lodz, Poland
Shanmugasundaram Hariharan – B.S. Abdur Rahman University, India
Eligius M.T. Hendrix – University of Malaga/Wageningen University, Spain/The Netherlands
Hisamoto Hiyoshi – Gunma University, Japan
Fermin Huarte – University of Barcelona, Spain
Andres Iglesias – University of Cantabria, Spain
Mustafa Inceoglu – EGE University, Turkey
Peter Jimack – University of Leeds, UK
Qun Jin – Waseda University, Japan
Farid Karimipour – Vienna University of Technology, Austria
Baris Kazar – Oracle Corp., USA
DongSeong Kim – University of Canterbury, New Zealand
Taihoon Kim – Hannam University, Korea
Ivana Kolingerova – University of West Bohemia, Czech Republic
Dieter Kranzlmueller – LMU and LRZ Munich, Germany
Antonio Laganà – University of Perugia, Italy
Rosa Lasaponara – National Research Council, Italy
Maurizio Lazzari – National Research Council, Italy
Cheng Siong Lee – Monash University, Australia
Sangyoun Lee – Yonsei University, Korea
Jongchan Lee – Kunsan National University, Korea
Clement Leung – Hong Kong Baptist University, Hong Kong
Chendong Li – University of Connecticut, USA
Gang Li – Deakin University, Australia
Ming Li – East China Normal University, China
Fang Liu – AMES Laboratories, USA
Xin Liu – University of Calgary, Canada
Savino Longo – University of Bari, Italy
Tinghuai Ma – NanJing University of Information Science and Technology, China
Sergio Maffioletti – University of Zurich, Switzerland
Ernesto Marcheggiani – Katholieke Universiteit Leuven, Belgium
Antonino Marvuglia – Research Centre Henri Tudor, Luxembourg
Nicola Masini – National Research Council, Italy
Nirvana Meratnia – University of Twente, The Netherlands
Alfredo Milani – University of Perugia, Italy
Sanjay Misra – Federal University of Technology Minna, Nigeria
Giuseppe Modica – University of Reggio Calabria, Italy
José Luis Montaña – University of Cantabria, Spain
Beniamino Murgante – University of Basilicata, Italy
Jiri Nedoma – Academy of Sciences of the Czech Republic, Czech Republic
Laszlo Neumann – University of Girona, Spain
Kok-Leong Ong – Deakin University, Australia
Belen Palop – Universidad de Valladolid, Spain
Marcin Paprzycki – Polish Academy of Sciences, Poland
Eric Pardede – La Trobe University, Australia
Kwangjin Park – Wonkwang University, Korea
Ana Isabel Pereira – Polytechnic Institute of Braganca, Portugal
Maurizio Pollino – Italian National Agency for New Technologies, Energy and Sustainable Economic Development, Italy
Alenka Poplin – University of Hamburg, Germany
Vidyasagar Potdar – Curtin University of Technology, Australia
David C. Prosperi – Florida Atlantic University, USA
Wenny Rahayu – La Trobe University, Australia
Jerzy Respondek – Silesian University of Technology, Poland
Ana Maria A.C. Rocha – University of Minho, Portugal
Humberto Rocha – INESC-Coimbra, Portugal
Alexey Rodionov – Institute of Computational Mathematics and Mathematical Geophysics, Russia
Cristina S. Rodrigues – University of Minho, Portugal
Octavio Roncero – CSIC, Spain
Maytham Safar – Kuwait University, Kuwait
Haiduke Sarafian – The Pennsylvania State University, USA
Qi Shi – Liverpool John Moores University, UK
Dale Shires – U.S. Army Research Laboratory, USA
Takuo Suganuma – Tohoku University, Japan
Ana Paula Teixeira – University of Tras-os-Montes and Alto Douro, Portugal
Senhorinha Teixeira – University of Minho, Portugal
Parimala Thulasiraman – University of Manitoba, Canada
Carmelo Torre – Polytechnic of Bari, Italy
Javier Martinez Torres – Centro Universitario de la Defensa Zaragoza, Spain
Giuseppe A. Trunfio – University of Sassari, Italy
Unal Ufuktepe – Izmir University of Economics, Turkey
Mario Valle – Swiss National Supercomputing Centre, Switzerland
Pablo Vanegas – University of Cuenca, Ecuador
Piero Giorgio Verdini – INFN Pisa and CERN, Italy
Marco Vizzari – University of Perugia, Italy
Koichi Wada – University of Tsukuba, Japan
Krzysztof Walkowiak – Wroclaw University of Technology, Poland
Robert Weibel – University of Zurich, Switzerland
Roland Wismüller – Universität Siegen, Germany
Mudasser Wyne – SOET National University, USA
Chung-Huang Yang – National Kaohsiung Normal University, Taiwan
Xin-She Yang – National Physical Laboratory, UK
Salim Zabir – France Telecom Japan Co., Japan
Albert Y. Zomaya – University of Sydney, Australia
Sponsoring Organizations
ICCSA 2012 would not have been possible without the tremendous support of many organizations and institutions, to which all the organizers and participants of ICCSA 2012 express their sincere gratitude:
Universidade Federal da Bahia, Brazil
(http://www.ufba.br)
Universidade Federal do Recôncavo da Bahia,
Brazil
(http://www.ufrb.edu.br)
Universidade Estadual de Feira de Santana,
Brazil
(http://www.uefs.br)
University of Perugia, Italy
(http://www.unipg.it)
University of Basilicata, Italy
(http://www.unibas.it)
Monash University, Australia
(http://monash.edu)
Kyushu Sangyo University, Japan
(www.kyusan-u.ac.jp)
Brazilian Computer Society
(www.sbc.org.br)
Coordenação de Aperfeiçoamento de Pessoal de
Nível Superior (CAPES), Brazil
(http://www.capes.gov.br)
National Council for Scientific and
Technological Development (CNPq), Brazil
(http://www.cnpq.br)
Fundação de Amparo à Pesquisa do Estado
da Bahia (FAPESB), Brazil
(http://www.fapesb.ba.gov.br)
Table of Contents – Part I
Workshop on Advances in High Performance
Algorithms and Applications (AHPAA 2012)
Processor Allocation for Optimistic Parallelization of Irregular Programs . . . . . 1
Francesco Versaci and Keshav Pingali

Feedback-Based Global Instruction Scheduling for GPGPU Applications . . . . . 15
Constantin Timm, Markus Görlich, Frank Weichert, Peter Marwedel, and Heinrich Müller

Parallel Algorithm for Landform Attributes Representation on Multicore and Multi-GPU Systems . . . . . 29
Murilo Boratto, Pedro Alonso, Carla Ramiro, Marcos Barreto, and Leandro Coelho

The Performance Model of an Enhanced Parallel Algorithm for the SOR Method . . . . . 44
Italo Epicoco and Silvia Mocavero

Performance Driven Cooperation between Kernel and Auto-tuning Multi-threaded Interval B&B Applications . . . . . 57
Juan Francisco Sanjuan-Estrada, Leocadio Gonzalez Casado, Immaculada García, and Eligius M.T. Hendrix

kNN-Borůvka-GPU: A Fast and Scalable MST Construction from kNN Graphs on GPU . . . . . 71
Ahmed Shamsul Arefin, Carlos Riveros, Regina Berretta, and Pablo Moscato
Workshop on Bio-inspired Computing and
Applications (BIOCA 2012)
Global Hybrid Ant Bee Colony Algorithm for Training Artificial Neural Networks . . . . . 87
Habib Shah, Rozaida Ghazali, Nazri Mohd Nawi, and Mustafa Mat Deris

The Effect of Intelligent Escape on Distributed SER-Based Search . . . . . 101
Daniel S.F. Alves, Felipe M.G. França, Luiza de Macedo Mourelle, Nadia Nedjah, and Priscila M.V. Lima

ACO-Based Static Routing for Network-on-Chips . . . . . 113
Luneque Silva Jr., Nadia Nedjah, Luiza de Macedo Mourelle, and Fábio Gonçalves Pessanha

A Genetic Algorithm Assisted by a Locally Weighted Regression Surrogate Model . . . . . 125
Leonardo G. Fonseca, Heder S. Bernardino, and Helio J.C. Barbosa

Swarm Robots with Queue Organization Using Infrared Communication . . . . . 136
Rafael Mathias de Mendonça, Nadia Nedjah, and Luiza de Macedo Mourelle

Swarm Grid: A Proposal for High Performance of Parallel Particle Swarm Optimization Using GPGPU . . . . . 148
Rogério M. Calazan, Nadia Nedjah, and Luiza de Macedo Mourelle

An Artificial Immune System Approach to Associative Classification . . . . . 161
Samir A. Mohamed Elsayed, Sanguthevar Rajasekaran, and Reda A. Ammar
Workshop on Computational Geometry and
Applications (CGA 2012)
A Review on Delaunay Refinement Techniques . . . . . 172
Sanderson L. Gonzaga de Oliveira

Axis-Parallel Dimension Reduction for Biometric Research . . . . . 188
Kushan Ahmadian and Marina Gavrilova

An Overview of Procedures for Refining Triangulations . . . . . 198
Sanderson L. Gonzaga de Oliveira

DEM Interpolation from Contours Using Medial Axis Transformation . . . . . 214
Joonsoo Choi, Jaewee Heo, Kwang-Soo Hahn, and Junho Kim

Analysis of a High Definition Camera-Projector Video System for Geometry Reconstruction . . . . . 228
José Luiz de Souza Filho, Roger Correia Silva, Dhiego Oliveira Sad, Renan Dembogurski, Marcelo Bernardes Vieira, Sócrates de Oliveira Dantas, and Rodrigo Silva

Video-Based Face Verification with Local Binary Patterns and SVM Using GMM Supervectors . . . . . 240
Tiago F. Pereira, Marcus A. Angeloni, Flávio O. Simões, and José Eduardo C. Silva

GPU-Based Influence Regions Optimization . . . . . 253
Marta Fort and J. Antoni Sellarès

Fast and Simple Approach for Polygon Schematization . . . . . 267
Serafino Cicerone and Matteo Cermignani

On Counting and Analyzing Empty Pseudo-triangles in a Point Set . . . . . 280
Sergey Kopeliovich and Kira Vyatkina
Workshop on Chemistry and Materials Sciences and
Technologies (CMST 2012)
Quantum Reactive Scattering Calculations on GPU . . . . . 292
Leonardo Pacifici, Danilo Nalli, and Antonio Laganà

Tuning Heme Functionality: The Cases of Cytochrome c Oxidase and Myoglobin Oxidation . . . . . 304
Vangelis Daskalakis, Stavros C. Farantos, and Constantinos Varotsis

Theoretical and Experimental Study of the Energy and Structure of Fragment Ions Produced by Double Photoionization of Benzene Molecules . . . . . 316
Marzio Rosi, Pietro Candori, Stefano Falcinelli, Maria Suely Pedrosa Mundim, Fernando Pirani, and Franco Vecchiocattivi

Theoretical Study of Reactions Relevant for Atmospheric Models of Titan: Interaction of Excited Nitrogen Atoms with Small Hydrocarbons . . . . . 331
Marzio Rosi, Stefano Falcinelli, Nadia Balucani, Piergiorgio Casavecchia, Francesca Leonori, and Dimitris Skouteris

Efficient Workload Distribution Bridging HTC and HPC in Scientific Computing . . . . . 345
Carlo Manuali, Alessandro Costantini, Antonio Laganà, Marco Cecchi, Antonia Ghiselli, Michele Carpené, and Elda Rossi

Taxonomy Management in a Federation of Distributed Repositories: A Chemistry Use Case . . . . . 358
Sergio Tasso, Simonetta Pallottelli, Michele Ferroni, Riccardo Bastianini, and Antonio Laganà

Grid Enabled High Level ab initio Electronic Structure Calculations for the N2+N2 Exchange Reaction . . . . . 371
Marco Verdicchio, Leonardo Pacifici, and Antonio Laganà

A Bond-Bond Portable Approach to Intermolecular Interactions: Simulations for N-methylacetamide and Carbon Dioxide Dimers . . . . . 387
Andrea Lombardi, Noelia Faginas Lago, Antonio Laganà, Fernando Pirani, and Stefano Falcinelli

A Grid Execution Model for Computational Chemistry Applications Using the GC3Pie Framework and the AppPot VM Environment . . . . . 401
Alessandro Costantini, Riccardo Murri, Sergio Maffioletti, and Antonio Laganà

The MPI Structure of Chimere . . . . . 417
Antonio Laganà, Stefano Crocchianti, Giorgio Tentella, and Alessandro Costantini

A New Statistical Method for the Determination of Dynamical Features of Molecular Dication Dissociation Processes . . . . . 432
Maria Suely Pedrosa Mundim, Pietro Candori, Stefano Falcinelli, Kleber Carlos Mundim, Fernando Pirani, and Franco Vecchiocattivi
Workshop on Cities, Technologies and Planning
(CTP 2012)
SWOT Analysis of Information Technology Industry in Beijing, China Using Patent Data . . . . . 447
Lucheng Huang, Kangkang Wang, Feifei Wu, Yan Lou, Hong Miao, and Yanmei Xu

Using 3D GeoDesign for Planning of New Electricity Networks in Spain . . . . . 462
Francisco-Javier Moreno Marimbaldo, Federico-Vladimir Gutiérrez Corea, and Miguel-Ángel Manso Callejo

Assessment of Online PPGIS Study Cases in Urban Planning . . . . . 477
Geisa Bugs

e-Participation: Social Media and the Public Space . . . . . 491
Gilberto Corso Pereira, Maria Célia Furtado Rocha, and Alenka Poplin

Ubicomp and Environmental Designers: Assembling a Collective Work towards the Development of Sustainable Technologies . . . . . 502
Renato Cesar Ferreira de Souza

Sustainable Micro-business in Environmental Unsustainability and Economic Inefficiency . . . . . 518
José G. Vargas-Hernández

Efficient Visualization of the Geometric Information of CityGML: Application for the Documentation of Built Heritage . . . . . 529
Iñaki Prieto, Jose Luis Izkara, and Francisco Javier Delgado del Hoyo

ICT to Evaluate Participation in Urban Planning: Remarks from a Case Study . . . . . 545
Francesco Rotondo and Francesco Selicato

A Spatial Data Infrastructure Situation-Aware to the 2014 World Cup . . . . . 561
Wellington Moreira de Oliveira, Jugurta Lisboa Filho, and Alcione de Paiva Oliveira

Towards a Two-Way Participatory Process . . . . . 571
António Silva and Jorge Gustavo Rocha

An Automatic Procedure to Select Areas for Transfer Development Rights in the Urban Market . . . . . 583
Carmelo Maria Torre, Pasquale Balena, and Romina Zito
General Track on Computational Methods,
Algorithms and Scientific Applications
Magnetic Net and a Bouncing Magnetic Ball . . . . . 599
Haiduke Sarafian

Autonomous Leaves Graph Applied to the Simulation of the Boundary Layer around a Non-symmetric NACA Airfoil . . . . . 610
Sanderson Lincohn Gonzaga de Oliveira and Mauricio Kischinhevsky

Sinimbu – Multimodal Queries to Support Biodiversity Studies . . . . . 620
Gabriel de S. Fedel, Claudia Bauzer Medeiros, and Jefersson Alex dos Santos

Comparison between Genetic Algorithms and Differential Evolution for Solving the History Matching Problem . . . . . 635
Elisa P. dos Santos Amorim, Carolina R. Xavier, Ricardo Silva Campos, and Rodrigo W. dos Santos

An Adaptive Mesh Algorithm for the Numerical Solution of Electrical Models of the Heart . . . . . 649
Rafael S. Oliveira, Bernardo M. Rocha, Denise Burgarelli, Wagner Meira Jr., and Rodrigo W. dos Santos

Decision Model to Predict the Implant Success . . . . . 665
Ana Cristina Braga, Paula Vaz, João C. Sampaio-Fernandes, António Felino, and Maria Purificacão Tavares

Multiscale Modeling of Heterogeneous Media Applying AEH to 3D Bodies . . . . . 675
Bárbara de Melo Quintela, Daniel Mendes Caldas, Michèle Cristina Resende Farage, and Marcelo Lobosco

A Three-Dimensional Computational Model of the Innate Immune System . . . . . 691
Pedro Augusto F. Rocha, Micael P. Xavier, Alexandre B. Pigozzo, Barbara de M. Quintela, Gilson C. Macedo, Rodrigo Weber dos Santos, and Marcelo Lobosco

System Dynamics Metamodels Supporting the Development of Computational Models of the Human Innate Immune System . . . . . 707
Igor Knop, Alexandre Pigozzo, Barbara Quintela, Gilson C. Macedo, Ciro Barbosa, Rodrigo Weber dos Santos, and Marcelo Lobosco

Exact and Asymptotic Computations of Elementary Spin Networks: Classification of the Quantum–Classical Boundaries . . . . . 723
Ana Carla P. Bitencourt, Annalisa Marzuoli, Mirco Ragni, Roger W. Anderson, and Vincenzo Aquilanti

Performance of DFT and MP2 Approaches for Geometry of Rhenium Allenylidenes Complexes and the Thermodynamics of Phosphines Addition . . . . . 738
Cecilia Coletti and Nazzareno Re

Author Index . . . . . 753
Table of Contents – Part II
Workshop on Econometrics and Multidimensional
Evaluation in the Urban Environment
(EMEUE 2012)
Knowledge and Innovation in Manufacturing Sector:
The Case of Wedding Dresses in Southern Italy . . . . . . . . . . . . . . . 1
Annunziata de Felice, Isabella Martucci, and Dario A. Schirone
Marketing Strategies: Support and Enhancement of Core Business . . . . . . . . . . . . . . . 17
Dario Antonio Schirone and Germano Torkan
The Rational Quantification of Social Housing: An Operative Research
Model . . . . . . . . . . . . . . . 27
Gianluigi De Mare, Antonio Nesticò, and Francesco Tajani
Simulation of Users Decision in Transport Mode Choice Using
Neuro-Fuzzy Approach . . . . . . . . . . . . . . . 44
Mauro Dell’Orco and Michele Ottomanelli
Multidimensional Spatial Decision-Making Process: Local Shared
Values in Action . . . . . . . . . . . . . . . 54
Maria Cerreta, Simona Panaro, and Daniele Cannatella
A Proposal for a Stepwise Fuzzy Regression: An Application to the
Italian University System . . . . . . . . . . . . . . . 71
Francesco Campobasso and Annarita Fanizzi
Cluster Analysis for Strategic Management: A Case Study of IKEA . . . . . . . . . . . . . . . 88
Paola Perchinunno and Dario Antonio Schirone
Clustering for the Localization of Degraded Urban Areas . . . . . . . . . . . . . . . 102
Silvestro Montrone and Paola Perchinunno
A BEP Analysis of Energy Supply for Sustainable Urban Microgrids . . . . . . . . . . . . . . . 116
Pasquale Balena, Giovanna Mangialardi, and Carmelo Maria Torre
The Effect of Infrastructural Works on Urban Property Values:
The asse attrezzato in Pescara, Italy . . . . . . . . . . . . . . . 128
Sebastiano Carbonara
Prospect of Integrate Monitoring: A Multidimensional Approach . . . . . . . . . . . . . . . 144
Marco Selicato, Carmelo Maria Torre, and Giovanni La Trofa
The Use of Ahp in a Multiactor Evaluation for Urban Development
Programs: A Case Study . . . . . . . . . . . . . . . 157
Luigi Fusco Girard and Carmelo Maria Torre
Assessing Urban Transformations: A SDSS for the Master Plan of
Castel Capuano, Naples . . . . . . . . . . . . . . . 168
Maria Cerreta and Pasquale De Toro
Workshop on Geographical Analysis, Urban
Modeling, Spatial Statistics (Geo–An–Mod 2012)
Computational Context to Promote Geographic Information Systems
toward Human-Centric Perspectives . . . . . . . . . . . . . . . 181
Luis Paulo da Silva Carvalho and Paulo Caetano da Silva
Voronoi-Based Curve Reconstruction: Issues and Solutions . . . . . . . . . . . . . . . 194
Mehran Ghandehari and Farid Karimipour
Geovisualization and Geostatistics: A Concept for the Numerical and
Visual Analysis of Geographic Mass Data . . . . . . . . . . . . . . . 208
Julia Gonschorek and Lucia Tyrallová
Spatio-Explorative Analysis and Its Benefits for a GIS-integrated
Automated Feature Identification . . . . . . . . . . . . . . . 220
Lucia Tyrallová and Julia Gonschorek
Peer Selection in P2P Service Overlays Using Geographical Location
Criteria . . . . . . . . . . . . . . . 234
Adriano Fiorese, Paulo Simões, and Fernando Boavida
Models for Spatial Interaction Data: Computation and Interpretation
of Accessibility . . . . . . . . . . . . . . . 249
Morton E. O’Kelly
Am I Safe in My Home? Fear of Crime Analyzed with Spatial Statistics
Methods in a Central European City . . . . . . . . . . . . . . . 263
Daniel Lederer
Developing a GIS Based Decision Support System for Resource
Allocation in Earthquake Search and Rescue Operation . . . . . . . . . . . . . . . 275
Abolfazl Rasekh and Ali Reza Vafaeinezhad
Concepts, Compass and Computation: Models for Directional
Part-Whole Relationships . . . . . . . . . . . . . . . 286
Gaurav Singh, Rolf A. de By, and Ivana Ivánová
SIGHabitar – Business Intelligence Based Approach for the
Development of Land Information Systems: The Multipurpose
Technical Cadastre of Ouro Preto, Brazil . . . . . . . . . . . . . . . 302
João Tácio C. Silva, José Francisco V. Rezende, Érika Fidêncio,
Tarick Melo, Brayan Neves, Joubert C. Lima, and Tiago G.S. Carneiro
Rehabilitation and Reconstruction of Asphalts Pavement Decision
Making Based on Rough Set Theory . . . . . . . . . . . . . . . 316
Shaaban M. Shaaban and Hossam A. Nabwey
Cartographic Circuits Inside GIS Environment for the Construction of
the Landscape Sensitivity Map in the Case of Cremona . . . . . . . . . . . . . . . 331
Pier Luigi Paolillo, Umberto Baresi, and Roberto Bisceglie
Cloud Classification in JPEG-compressed Remote Sensing Data
(LANDSAT 7/ETM+) . . . . . . . . . . . . . . . 347
Erik Borg, Bernd Fichtelmann, and Hartmut Asche
A Probabilistic Rough Set Approach for Water Reservoirs Site Location
Decision Making . . . . . . . . . . . . . . . 358
Shaaban M. Shaaban and Hossam A. Nabwey
Definition and Analysis of New Agricultural Farm Energetic Indicators
Using Spatial OLAP . . . . . . . . . . . . . . . 373
Sandro Bimonte, Kamal Boulil, Jean-Pierre Chanet, and
Marilys Pradel
Validating a Smartphone-Based Pedestrian Navigation System
Prototype: An Informal Eye-Tracking Pilot Test . . . . . . . . . . . . . . . 386
Mario Kluge and Hartmut Asche
Open Access to Historical Atlas: Sources of Information and Services
for Landscape Analysis in an SDI Framework . . . . . . . . . . . . . . . 397
Raffaella Brumana, Daniela Oreni, Branka Cuca,
Anna Rampini, and Monica Pepe
From Concept to Implementation: Web-Based Cartographic
Visualisation with CartoService . . . . . . . . . . . . . . . 414
Hartmut Asche and Rita Engemaier
Multiagent Systems for the Governance of Spatial Environments:
Some Modelling Approaches . . . . . . . . . . . . . . . 425
Domenico Camarda
A Data Fusion System for Spatial Data Mining, Analysis and
Improvement . . . . . . . . . . . . . . . 439
Silvija Stankute and Hartmut Asche
Dealing with Multiple Source Spatio-temporal Data in Urban Dynamics
Analysis . . . . . . . . . . . . . . . 450
João Peixoto and Adriano Moreira
Public Decision Processes: The Interaction Space Supporting Planner’s
Activity . . . . . . . . . . . . . . . 466
Giuseppe B. Las Casas, Lucia Tilio, and Alexis Tsoukiàs
Selection and Scheduling Problem in Continuous Time
with Pairwise-Interdependencies . . . . . . . . . . . . . . . 481
Ivan Blecic, Arnaldo Cecchini, and Giuseppe A. Trunfio
Parallel Simulation of Urban Dynamics on the GPU . . . . . . . . . . . . . . . 492
Ivan Blecic, Arnaldo Cecchini, and Giuseppe A. Trunfio
Geolocalization as Wayfinding and User Experience Support in Cultural
Heritage Locations . . . . . . . . . . . . . . . 508
Letizia Bollini and Roberto Falcone
Climate Alteration in the Metropolitan Area of Bari: Temperatures
and Relationship with Characters of Urban Context . . . . . . . . . . . . . . . 517
Pierangela Loconte, Claudia Ceppi, Giorgia Lubisco,
Francesco Mancini, Claudia Piscitelli, and Francesco Selicato
Study of Sustainability of Renewable Energy Sources through GIS
Analysis Techniques . . . . . . . . . . . . . . . 532
Emanuela Caiaffa, Alessandro Marucci, and Maurizio Pollino
The Comparative Analysis of Urban Development in Two Geographic
Regions: The State of Rio De Janeiro and the Campania Region . . . . . . . . . . . . . . . 548
Massimiliano Bencardino, Ilaria Greco, and Pitter Reis Ladeira
Land-Use Dynamics at the Micro Level: Constructing and Analyzing
Historical Datasets for the Portuguese Census Tracts . . . . . . . . . . . . . . . 565
António M. Rodrigues, Teresa Santos, Raquel Faria de Deus, and
Dulce Pimentel
Using Hydrodynamic Modeling for Estimating Flooding and Water
Depths in Grand Bay, Alabama . . . . . . . . . . . . . . . 578
Vladimir J. Alarcon and William H. McAnally
Comparison of Two Hydrodynamic Models of Weeks Bay, Alabama . . . . . . . . . . . . . . . 589
Vladimir J. Alarcon, William H. McAnally, and Surendra Pathak
Connections between Urban Structure and Urban Heat Island
Generation: An Analysis trough Remote Sensing and GIS . . . . . . . . . . . . . . . 599
Marialuce Stanganelli and Marco Soravia
Taking the Leap: From Disparate Data to a Fully Interactive SEIS for
the Maltese Islands . . . . . . . . . . . . . . . 609
Saviour Formosa, Elaine Sciberras, and Janice Formosa Pace
Analyzing the Central Business District: The Case of Sassari in the
Sardinia Island . . . . . . . . . . . . . . . 624
Silvia Battino, Giuseppe Borruso, and Carlo Donato
That’s ReDO: Ontologies and Regional Development Planning . . . . . . . . . . . . . . . 640
Francesco Scorza, Giuseppe B. Las Casas, and Beniamino Murgante
A Landscape Complex Values Map: Integration among Soft Values and
Hard Values in a Spatial Decision Support System . . . . . . . . . . . . . . . 653
Maria Cerreta and Roberta Mele
Analyzing Migration Phenomena with Spatial Autocorrelation
Techniques . . . . . . . . . . . . . . . 670
Beniamino Murgante and Giuseppe Borruso
From Urban Labs in the City to Urban Labs on the Web . . . . . . . . . . . . . . . 686
Viviana Lanza, Lucia Tilio, Antonello Azzato,
Giuseppe B. Las Casas, and Piergiuseppe Pontrandolfi
General Track on Geometric Modelling, Graphics
and Visualization
Bilayer Segmentation Augmented with Future Evidence . . . . . . . . . . . . . . . 699
Silvio Ricardo Rodrigues Sanches, Valdinei Freire da Silva, and
Romero Tori
A Viewer-dependent Tensor Field Visualization Using Multiresolution
and Particle Tracing . . . . . . . . . . . . . . . 712
José Luiz Ribeiro de Souza Filho, Marcelo Caniato Renhe,
Marcelo Bernardes Vieira, and Gildo de Almeida Leonel
Abnormal Gastric Cell Segmentation Based on Shape Using
Morphological Operations . . . . . . . . . . . . . . . 728
Noor Elaiza Abdul Khalid, Nurnabilah Samsudin, and
Rathiah Hashim
A Bio-inspired System for Boundary Detection in Color Natural
Scenes . . . . . . . . . . . . . . . 739
Karin S. Komati, Evandro O.T. Salles, and Mario Sarcinelli-Filho
Author Index . . . . . . . . . . . . . . . 753
Table of Contents – Part III
Workshop on Optimization Techniques
and Applications (OTA 2012)
Incorporating Radial Basis Functions in Pattern Search Methods:
Application to Beam Angle Optimization in Radiotherapy Treatment
Planning . . . . . . . . . . . . . . . 1
Humberto Rocha, Joana M. Dias, Brigida C. Ferreira, and
Maria do Carmo Lopes
On the Complexity of a Mehrotra-Type Predictor-Corrector
Algorithm . . . . . . . . . . . . . . . 17
Ana Paula Teixeira and Regina Almeida
Design of Wood Biomass Supply Chains . . . . . . . . . . . . . . . 30
Tiago Costa Gomes, Filipe Pereira e Alvelos, and
Maria Sameiro Carvalho
On Solving a Stochastic Programming Model for Perishable Inventory
Control . . . . . . . . . . . . . . . 45
Eligius M.T. Hendrix, Rene Haijema, Roberto Rossi, and
Karin G.J. Pauls-Worm
An Artificial Fish Swarm Filter-Based Method for Constrained Global
Optimization . . . . . . . . . . . . . . . 57
Ana Maria A.C. Rocha, M. Fernanda P. Costa, and
Edite M.G.P. Fernandes
Solving Multidimensional 0–1 Knapsack Problem with an Artificial
Fish Swarm Algorithm . . . . . . . . . . . . . . . 72
Md. Abul Kalam Azad, Ana Maria A.C. Rocha, and
Edite M.G.P. Fernandes
Optimization Model of COTS Selection Based on Cohesion and
Coupling for Modular Software Systems under Multiple Applications
Environment . . . . . . . . . . . . . . . 87
Pankaj Gupta, Shilpi Verma, and Mukesh Kumar Mehlawat
A Derivative-Free Filter Driven Multistart Technique for Global
Optimization . . . . . . . . . . . . . . . 103
Florbela P. Fernandes, M. Fernanda P. Costa, and
Edite M.G.P. Fernandes
On Lower Bounds Using Additively Separable Terms
in Interval B&B . . . . . . . . . . . . . . . 119
José L. Berenguel, Leocadio G. Casado, I. García,
Eligius M.T. Hendrix, and F. Messine
A Genetic Algorithm for the Job Shop on an ASRS Warehouse . . . . . . . . . . . . . . . 133
José Figueiredo, José A. Oliveira, Luis Dias, and
Guilherme A.B. Pereira
On Solving the Profit Maximization of Small Cogeneration Systems . . . . . . . . . . . . . . . 147
Ana C.M. Ferreira, Ana Maria A.C. Rocha,
Senhorinha F.C.F. Teixeira, Manuel L. Nunes, and
Luís B. Martins
Global Optimization Simplex Bisection Revisited Based on
Considerations by Reiner Horst . . . . . . . . . . . . . . . 159
Eligius M.T. Hendrix, Leocadio G. Casado, and Paula Amaral
Application of Variance Analysis to the Combustion of Residual Oils . . . . . . . . . . . . . . . 174
Manuel Ferreira and José Carlos Teixeira
Warehouse Design and Planning: A Mathematical Programming
Approach . . . . . . . . . . . . . . . 187
Carla A.S. Geraldes, Maria Sameiro Carvalho, and
Guilherme A.B. Pereira
Application of CFD Tools to Optimize Natural Building Ventilation
Design . . . . . . . . . . . . . . . 202
José Carlos Teixeira, Ricardo Lomba,
Senhorinha F.C.F. Teixeira, and Pedro Lobarinhas
Workshop on Mobile Communications (MC 2012)
Middleware Integration for Ubiquitous Sensor Networks
in Agriculture . . . . . . . . . . . . . . . 217
Junghoon Lee, Gyung-Leen Park, Min-Jae Kang, Ho-Young Kwak,
Sang Joon Lee, and Jikwang Han
Usage Pattern-Based Prefetching: Quick Application Launch on Mobile
Devices . . . . . . . . . . . . . . . 227
Hokwon Song, Changwoo Min, Jeehong Kim, and Young Ik Eom
EIMOS: Enhancing Interactivity in Mobile Operating Systems . . . . . . . . . . . . . . . 238
Sunwook Bae, Hokwon Song, Changwoo Min, Jeehong Kim, and
Young Ik Eom
Development of Mobile Hybrid MedIntegraWeb App for Interoperation
between u-RPMS and HIS . . . . . . . . . . . . . . . 248
Young-Hyuk Kim, Il-Kown Lim, Jae-Pil Lee, Jae-Gwang Lee, and
Jae-Kwang Lee
A Distributed Lifetime-Maximizing Scheme for Connected Target
Coverage in WSNs . . . . . . . . . . . . . . . 259
Duc Tai Le, Thang Le Duc, and Hyunseung Choo
Reducing Last Level Cache Pollution in NUMA Multicore Systems for
Improving Cache Performance . . . . . . . . . . . . . . . 272
Deukhyeon An, Jeehong Kim, JungHyun Han, and Young Ik Eom
The Fast Handover Scheme for Mobile Nodes in NEMO-Enabled
PMIPv6 . . . . . . . . . . . . . . . 283
Changyong Park, Junbeom Park, Hao Wang, and Hyunseung Choo
A Reference Model for Virtual Resource Description and Discovery in
Virtual Networks . . . . . . . . . . . . . . . 297
Yuemei Xu, Yanni Han, Wenjia Niu, Yang Li, Tao Lin, and Song Ci
TV Remote Control Using Human Hand Motion Based on Optical
Flow System . . . . . . . . . . . . . . . 311
Soonmook Jeong, Taehoun Song, Keyho Kwon, and Jae Wook Jeon
Fast and Reliable Data Forwarding in Low-Duty-Cycle Wireless Sensor
Networks . . . . . . . . . . . . . . . 324
Junseong Choe, Nguyen Phan Khanh Ha, Junguye Hong, and
Hyunseung Choo
Workshop on Mobile-Computing, Sensing,
and Actuation for Cyber Physical Systems
(MSA4CPS 2012)
Neural Network and Physiological Parameters Based Control of
Artificial Pancreas for Improved Patient Safety . . . . . . . . . . . . . . . 339
Saad Bin Qaisar, Salman H. Khan, and Sahar Imtiaz
A Genetic Algorithm Assisted Resource Management Scheme for
Reliable Multimedia Delivery over Cognitive Networks . . . . . . . . . . . . . . . 352
Salman Ali, Ali Munir, Saad Bin Qaisar, and Junaid Qadir
Performance Analysis of WiMAX Best Effort and ertPS Service Classes
for Video Transmission . . . . . . . . . . . . . . . 368
Hassan Abid, Haroon Raja, Ali Munir, Jaweria Amjad,
Aliya Mazhar, and Dong-Young Lee
Jump Oriented Programming on Windows Platform (on the x86) . . . . . . . . . . . . . . . 376
Jae-Won Min, Sung-Min Jung, Dong-Young Lee, and
Tai-Myoung Chung
Cryptanalysis and Improvement of a Biometrics-Based Multi-server
Authentication with Key Agreement Scheme . . . . . . . . . . . . . . . 391
Hakhyun Kim, Woongryul Jeon, Kwangwoo Lee, Yunho Lee, and
Dongho Won
Rate-Distortion Optimized Transcoder Selection for Multimedia
Transmission in Heterogeneous Networks . . . . . . . . . . . . . . . 407
Haroon Raja and Saad Bin Qaisar
Formal Probabilistic Analysis of Cyber-Physical Transportation
Systems . . . . . . . . . . . . . . . 419
Atif Mashkoor and Osman Hasan
Workshop on Remote Sensing (RS 2012)
DEM Reconstruction of Coastal Geomorphology from DINSAR . . . . . . . . . . . . . . . 435
Maged Marghany
Three-Dimensional Coastal Front Visualization from RADARSAT-1
SAR Satellite Data . . . . . . . . . . . . . . . 447
Maged Marghany
A New Self-Learning Algorithm for Dynamic Classification of Water
Bodies . . . . . . . . . . . . . . . 457
Bernd Fichtelmann and Erik Borg
DEM Accuracy of High Resolution Satellite Images . . . . . . . . . . . . . . . 471
Mustafa Yanalak, Nebiye Musaoglu, Cengizhan Ipbuker,
Elif Sertel, and Sinasi Kaya
Low Cost Pre-operative Fire Monitoring from Fire Danger to Severity
Estimation Based on Satellite MODIS, Landsat and ASTER Data:
The Experience of FIRE-SAT Project in the Basilicata Region
(Italy) . . . . . . . . . . . . . . . 481
Antonio Lanorte, Fortunato De Santis, Angelo Aromando, and
Rosa Lasaponara
Investigating Satellite Landsat TM and ASTER Multitemporal Data
Set to Discover Ancient Canals and Acqueduct Systems . . . . . . . . . . . . . . . 497
Rosa Lasaponara and Nicola Masini
Using Spatial Autocorrelation Techniques and Multi-temporal Satellite
Data for Analyzing Urban Sprawl . . . . . . . . . . . . . . . 512
Gabriele Nolè, Maria Danese, Beniamino Murgante,
Rosa Lasaponara, and Antonio Lanorte
General Track on Information Systems
and Technologies
A Framework for QoS Based Dynamic Web Services Composition . . . . . . . . . . . . . . . 528
Jigyasu Nema, Rajdeep Niyogi, and Alfredo Milani
Data Summarization Model for User Action Log Files . . . . . . . . . . . . . . . 539
Eleonora Gentili, Alfredo Milani, and Valentina Poggioni
User Modeling for Adaptive E-Learning Systems . . . . . . . . . . . . . . . 550
Birol Ciloglugil and Mustafa Murat Inceoglu
An Experimental Study of the Combination of Meta-Learning with
Particle Swarm Algorithms for SVM Parameter Selection . . . . . . . . . . . . . . . 562
Péricles B.C. de Miranda, Ricardo B.C. Prudêncio,
Andre Carlos P.L.F. de Carvalho, and Carlos Soares
An Investigation into Agile Methods in Embedded Systems
Development . . . . . . . . . . . . . . . 576
Caroline Oliveira Albuquerque, Pablo Oliveira Antonino, and
Elisa Yumi Nakagawa
Heap Slicing Using Type Systems . . . . . . . . . . . . . . . 592
Mohamed A. El-Zawawy
Using Autonomous Search for Generating Good Enumeration Strategy
Blends in Constraint Programming . . . . . . . . . . . . . . . 607
Ricardo Soto, Broderick Crawford, Eric Monfroy, and Víctor Bustos
Evaluation of Normalization Techniques in Text Classification for
Portuguese . . . . . . . . . . . . . . . 618
Merley da Silva Conrado, Víctor Antonio Laguna Gutiérrez, and
Solange Oliveira Rezende
Extracting Definitions from Brazilian Legal Texts . . . . . . . . . . . . . . . 631
Edilson Ferneda, Hércules Antonio do Prado,
Augusto Herrmann Batista, and Marcello Sandi Pinheiro
A Heuristic Diversity Production Approach . . . . . . . . . . . . . . . 647
Hamid Parvin, Hosein Alizadeh, Sajad Parvin, and Behzad Maleki
Structuring Taxonomies from Texts: A Case-Study on Defining Soil
Classes . . . . . . . . . . . . . . . 657
Hércules Antonio do Prado, Edilson Ferneda,
Francisco Carlos da Luz Rodrigues, Éder Martins de Souza,
Osmar Abílio de Carvalho Jr., and Alfredo José Barreto Luiz
Exploring Fuzzy Ontologies in Mining Generalized Association Rules . . . . . . . . . . . . . . . 667
Rodrigo Moura Juvenil Ayres, Marcela Xavier Ribeiro, and
Marilde Terezinha Prado Santos
BTA: Architecture for Reusable Business Tier Components with Access
Control . . . . . . . . . . . . . . . 682
Óscar Mortágua Pereira, Rui L. Aguiar, and Maribel Yasmina Santos
Analysing the PDDL Language for Argumentation-Based Negotiation
Planning . . . . . . . . . . . . . . . 698
Ariel Monteserin, Luis Berdún, and Analía A. Amandi
Predicting Potential Responders in Twitter: A Query Routing
Algorithm . . . . . . . . . . . . . . . 714
Cleyton Caetano de Souza, Jonathas José de Magalhães,
Evandro Barros de Costa, and Joseana Macêdo Fechine
Towards a Goal Recognition Model for the Organizational Memory . . . . . . . . . . . . . . . 730
Marcelo G. Armentano and Analía A. Amandi
SART: A New Association Rule Method for Mining Sequential Patterns
in Time Series of Climate Data . . . . . . . . . . . . . . . 743
Marcos Daniel Cano, Marilde Terezinha Prado Santos,
Ana Maria H. de Avila, Luciana A.S. Romani,
Agma J.M. Traina, and Marcela Xavier Ribeiro
Author Index . . . . . . . . . . . . . . . 759
Table of Contents – Part IV
Workshop on Software Engineering Processes
and Applications (SEPA 2012)
Modeling Road Traffic Signals Control Using UML and the MARTE
Profile . . . . . . . . . . . . . . . 1
Eduardo Augusto Silvestre and Michel dos Santos Soares
Analysis of Techniques for Documenting User Requirements . . . . . . . . . . . . . . . 16
Michel dos Santos Soares and Daniel Souza Cioquetta
Predicting Web Service Maintainability via Object-Oriented Metrics:
A Statistics-Based Approach . . . . . . . . . . . . . . . 29
José Luis Ordiales Coscia, Marco Crasso, Cristian Mateos,
Alejandro Zunino, and Sanjay Misra
Early Automated Verification of Tool Chain Design . . . . . . . . . . . . . . . 40
Matthias Biehl
Using UML Stereotypes to Support the Requirement Engineering:
A Case Study . . . . . . . . . . . . . . . 51
Vitor A. Batista, Daniela C.C. Peixoto, Wilson Pádua, and
Clarindo Isaías P.S. Pádua
Identifying Business Rules to Legacy Systems Reengineering Based on
BPM and SOA . . . . . . . . . . . . . . . 67
Gleison S. do Nascimento, Cirano Iochpe, Lucinéia Thom,
André C. Kalsing, and Álvaro Moreira
Abstraction Analysis and Certified Flow and Context Sensitive
Points-to Relation for Distributed Programs . . . . . . . . . . . . . . . 83
Mohamed A. El-Zawawy
An Approach to Measure Understandability of Extended UML Based
on Metamodel . . . . . . . . . . . . . . . 100
Yan Zhang, Yi Liu, Zhiyi Ma, Xuying Zhao, Xiaokun Zhang, and
Tian Zhang
Dealing with Dependencies among Functional and Non-functional
Requirements for Impact Analysis in Web Engineering . . . . . . . . . . . . . . . 116
José Alfonso Aguilar, Irene Garrigós, Jose-Norberto Mazón, and
Anibal Zaldívar
Assessing Maintainability Metrics in Software Architectures Using
COSMIC and UML . . . . . . . . . . . . . . . 132
Eudisley Gomes dos Anjos, Ruan Delgado Gomes, and
Mário Zenha-Rela
Plagiarism Detection in Software Using Efficient String Matching . . . . . . . . . . . . . . . 147
Kusum Lata Pandey, Suneeta Agarwal, Sanjay Misra, and
Rajesh Prasad
Dynamic Software Maintenance Effort Estimation Modeling Using
Neural Network, Rule Engine and Multi-regression Approach . . . . . . . . . . . . . . . 157
Ruchi Shukla, Mukul Shukla, A.K. Misra, T. Marwala, and
W.A. Clarke
Workshop on Software Quality (SQ 2012)
New Measures for Maintaining the Quality of Databases . . . . . . . . . . . . . . . 170
Hendrik Decker
A New Way to Determine External Quality of ERP Software . . . . . . . . . . . . . . . 186
Ali Orhan Aydin
Towards a Catalog of Spreadsheet Smells . . . . . . . . . . . . . . . 202
Jácome Cunha, João P. Fernandes, Hugo Ribeiro, and João Saraiva
Program and Aspect Metrics for MATLAB . . . . . . . . . . . . . . . 217
Pedro Martins, Paulo Lopes, João P. Fernandes, João Saraiva, and
João M.P. Cardoso
A Suite of Cognitive Complexity Metrics . . . . . . . . . . . . . . . 234
Sanjay Misra, Murat Koyuncu, Marco Crasso, Cristian Mateos, and
Alejandro Zunino
Complexity Metrics for Cascading Style Sheets . . . . . . . . . . . . . . . 248
Adewole Adewumi, Sanjay Misra, and Nicholas Ikhu-Omoregbe
A Systematic Review on the Impact of CK Metrics on the Functional
Correctness of Object-Oriented Classes . . . . . . . . . . . . . . . 258
Yasser A. Khan, Mahmoud O. Elish, and Mohamed El-Attar
Workshop on Security and Privacy in Computational
Sciences (SPCS 2012)
Pinpointing Malicious Activities through Network and System-Level
Malware Execution Behavior . . . . . . . . . . . . . . . 274
André Ricardo Abed Grégio, Vitor Monte Afonso,
Dario Simões Fernandes Filho, Paulo Lício de Geus,
Mario Jino, and Rafael Duarte Coelho dos Santos
A Malware Detection System Inspired on the Human Immune
System . . . . . . . . . . . . . . . 286
Isabela Liane de Oliveira, André Ricardo Abed Grégio, and
Adriano Mauro Cansian
Interactive, Visual-Aided Tools to Analyze Malware Behavior . . . . . . . . . . . . . . . 302
André Ricardo Abed Grégio, Alexandre Or Cansian Baruque,
Vitor Monte Afonso, Dario Simões Fernandes Filho,
Paulo Lício de Geus, Mario Jino, and
Rafael Duarte Coelho dos Santos
Interactive Analysis of Computer Scenarios through Parallel
Coordinates Graphics . . . . . . . . . . . . . . . 314
Gabriel D. Cavalcante, Sebastien Tricaud, Cleber P. Souza, and
Paulo Lício de Geus
Methodology for Detection and Restraint of P2P Applications in the
Network . . . . . . . . . . . . . . . 326
Rodrigo M.P. Silva and Ronaldo M. Salles
Workshop on Soft Computing and Data Engineering
(SCDE 2012)
Text Categorization Based on Fuzzy Soft Set Theory . . . . . . . . . . . . . . . 340
Bana Handaga and Mustafa Mat Deris
Cluster Size Determination Using JPEG Files . . . . . . . . . . . . . . . 353
Nurul Azma Abdullah, Rosziati Ibrahim, and
Kamaruddin Malik Mohamad
Semantic Web Search Engine Using Ontology, Clustering and
Personalization Techniques . . . . . . . . . . . . . . . 364
Noryusliza Abdullah and Rosziati Ibrahim
Granules of Words to Represent Text: An Approach Based on Fuzzy
Relations and Spectral Clustering . . . . . . . . . . . . . . . 379
Patrícia F. Castro and Geraldo B. Xexéo
Multivariate Time Series Classification by Combining Trend-Based and
Value-Based Approximations . . . 392
Bilal Esmael, Arghad Arnaout, Rudolf K. Fruhwirth, and
Gerhard Thonhauser
General Track on High Performance Computing
and Networks
Impact of pay-as-you-go Cloud Platforms on Software Pricing and
Development: A Review and Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Fernando Pires Barbosa and Andrea Schwertner Charão
404
Resilience for Collaborative Applications on Clouds: Fault-Tolerance
for Distributed HPC Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Toàn Nguyên and Jean-Antoine Désidéri
418
T-DMB Receiver Model for Emergency Alert Service . . . . . . . . . . . . . . . . .
Seong-Geun Kwon, Suk-Hwan Lee, Eung-Joo Lee, and
Ki-Ryong Kwon
434
A Framework for Context-Aware Systems in Mobile Devices . . . . . . . . . . .
Eduardo Jorge, Matheus Farias, Rafael Carmo, and Weslley Vieira
444
A Simulation Framework for Scheduling Performance Evaluation on
CPU-GPU Heterogeneous System . . . 457
Flavio Vella, Igor Neri, Osvaldo Gervasi, and Sergio Tasso
Influence of Topology on Mobility and Transmission Capacity
of Human-Based DTNs . . . 470
Danilo A. Moschetto, Douglas O. Freitas, Lourdes P.P. Poma,
Ricardo Aparecido Perez de Almeida, and Cesar A.C. Marcondes
Towards a Computer Assisted Approach for Migrating Legacy Systems
to SOA . . . 484
Gonzalo Salvatierra, Cristian Mateos, Marco Crasso, and
Alejandro Zunino
1+1 Protection of Overlay Distributed Computing Systems:
Modeling and Optimization . . . 498
Krzysztof Walkowiak and Jacek Rak
Scheduling and Capacity Design in Overlay Computing Systems . . . . . . .
Krzysztof Walkowiak, Andrzej Kasprzak, Michal Kosowski, and
Marek Miziolek
514
GPU Acceleration of the caffa3d.MB Model . . . . . . . . . . . . . . . . . . . . . . . . .
Pablo Igounet, Pablo Alfaro, Gabriel Usera, and Pablo Ezzatti
530
Security-Effective Fast Authentication Mechanism for Network Mobility
in Proxy Mobile IPv6 Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Illkyun Im, Young-Hwa Cho, Jae-Young Choi, and Jongpil Jeong
543
An Architecture for Service Integration and Unified Communication
in Mobile Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Ricardo Aparecido Perez de Almeida and Hélio C. Guardia
560
Task Allocation in Mesh Structure: 2Side LeapFrog Algorithm and
Q-Learning Based Algorithm . . . 576
Iwona Pozniak-Koszalka, Wojciech Proma, Leszek Koszalka,
Maciej Pol, and Andrzej Kasprzak
Follow-Us: A Distributed Ubiquitous Healthcare System Simulated by
MannaSim . . . 588
Maria Luı́sa Amarante Ghizoni, Adauto Santos, and
Linnyer Beatrys Ruiz
Adaptive Dynamic Frequency Scaling for Thermal-Aware 3D Multi-core
Processors . . . 602
Hong Jun Choi, Young Jin Park, Hsien-Hsin Lee, and
Cheol Hong Kim
A Context-Aware Service Model Based on the OSGi Framework for
u-Agricultural Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Jongsun Choi, Sangjoon Park, Jongchan Lee, and Yongyun Cho
613
A Security Framework for Blocking New Types of Internet Worms in
Ubiquitous Computing Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Iksu Kim and Yongyun Cho
622
Quality Factors in Development Best Practices for Mobile
Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Euler Horta Marinho and Rodolfo Ferreira Resende
632
ShadowNet: An Active Defense Infrastructure for Insider Cyber Attack
Prevention . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Xiaohui Cui, Wade Gasior, Justin Beaver, and Jim Treadwell
646
Author Index . . . 655
Processor Allocation for Optimistic
Parallelization of Irregular Programs
Francesco Versaci¹ and Keshav Pingali²
¹ University of Padova & Technische Universität Wien
versacif@dei.unipd.it
² University of Texas at Austin
pingali@cs.utexas.edu
Abstract. Optimistic parallelization is a promising approach for the
parallelization of irregular algorithms: potentially interfering tasks are
launched dynamically, and the runtime system detects conflicts between
concurrent activities, aborting and rolling back conflicting tasks. However, parallelism in irregular algorithms is very complex. In a regular algorithm like dense matrix multiplication, the amount of parallelism can
usually be expressed as a function of the problem size, so it is reasonably
straightforward to determine how many processors should be allocated
to execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms
can be a function of input parameters, and the amount of parallelism
can vary dramatically during the execution of the irregular algorithm.
Therefore, the processor allocation problem for irregular algorithms is
very difficult.
In this paper, we describe the first systematic strategy for addressing
this problem. Our approach is based on a construct called the conflict
graph, which (i) provides insight into the amount of parallelism that
can be extracted from an irregular algorithm, and (ii) can be used to
address the processor allocation problem for irregular algorithms. We
show that this problem is related to a generalization of the unfriendly
seating problem and, by extending Turán's theorem, we obtain a worst-case class of problems for optimistic parallelization, which we use to
derive a lower bound on the exploitable parallelism. Finally, using some
theoretically derived properties and some experimental facts, we design
a quick and stable control strategy for solving the processor allocation
problem heuristically.
Keywords: Irregular algorithms, Optimistic parallelization, Automatic
parallelization, Amorphous data-parallelism, Processor allocation, Unfriendly seating, Turán’s theorem.
Supported by PAT-INFN Project AuroraScience, by MIUR-PRIN Project AlgoDEEP, and by the University of Padova Projects STPD08JA32 and CPDA099949.
Corresponding author.
B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 1–14, 2012.
© Springer-Verlag Berlin Heidelberg 2012
1 Introduction
The advent of on-chip multiprocessors has made parallel programming a mainstream concern. Unfortunately, writing correct and efficient parallel programs is
a challenging task for the average programmer. Hence, in recent years, many
projects [14,10,3,20] have tried to automate parallel programming for some
classes of algorithms. Most of them focus on regular algorithms such as Fourier
transforms [9,19] and dense linear algebra routines [4]. Automation is more difficult when the algorithms are irregular and use pointer-based data structures such
as graphs and sets. One promising approach is based on the concept of amorphous data parallelism [17]. Algorithms are formulated as iterative computations
on work-sets, and each iteration is identified as a quantum of work (task) that
can potentially be executed in parallel with other iterations. The Galois project
[18] has shown that algorithms formulated in this way can be parallelized automatically using optimistic parallelization: iterations are executed speculatively
in parallel and, when an iteration conflicts with concurrently executing iterations, it is rolled back. Algorithms that have been successfully parallelized in
this manner include Survey propagation [5], Boruvka's algorithm [6], Delaunay
triangulation and refinement [12], and Agglomerative clustering [21].
In a regular algorithm like dense matrix multiplication, the amount of parallelism can usually be expressed as a function of the problem size, so it is reasonably straightforward to determine how many processors should be allocated to
execute a regular algorithm of a certain size (this is called the processor allocation problem). In contrast, parallelism in irregular algorithms can be a function
of input parameters, and the amount of parallelism can vary dramatically during
the execution of the irregular algorithm [16]. Therefore, the processor allocation
problem for irregular algorithms is very difficult. Optimistic parallelization complicates this problem even more: if there are too many processors and too little
parallel work, not only might some processors be idle but speculative conflicts
may actually retard the progress of even those processors that have useful work
to do, increasing both program execution time and power consumption. This paper¹ presents the first systematic approach to addressing the processor allocation
problem for irregular algorithms under optimistic parallelization, and it makes
the following contributions.
– We develop a simple graph-theoretic model for optimistic parallelization and
use it to formulate processor allocation as an optimization problem that
balances parallelism exploitation with minimizing speculative conflicts (Section 2).
– We identify a worst-case class of problems for optimistic parallelization; to
this purpose, we develop an extension of Turán’s theorem [2] (Section 3).
– Using these ideas, we develop an adaptive controller that dynamically solves
the processor allocation problem for amorphous data-parallel programs, providing rapid response to changes in the amount of amorphous data-parallelism
(Section 4).
¹ A brief announcement of this work has been presented at SPAA'11 [23].
2 Modeling Optimistic Parallelization
A typical example of an algorithm that exhibits amorphous data-parallelism is
Delaunay mesh refinement, summarized as follows. A triangulation of some planar region is given, containing some "bad" triangles (according to some quality
criterion). To remove them, each bad triangle is selected (in any arbitrary order), and this triangle, together with the triangles that lie in its cavity, is replaced
with new triangles. The retriangulation can produce new bad triangles, but this
process can be proved to halt after a finite number of steps. Two bad triangles
can be processed in parallel, given that their cavities do not overlap.
There are also algorithms, which exhibit amorphous data-parallelism, for
which the order of execution of the parallel tasks cannot be arbitrary, but must
satisfy some constraints (e.g., in discrete event simulations the events must commit chronologically). We will not treat this class of problems in this work, but we
will focus only on unordered algorithms [16]. A different context in which there
is no roll-back and tasks do not conflict, but obey some precedence relations, is
treated in [1].
Optimistic parallelization deals with amorphous data-parallelism by maintaining a work-set of the tasks to be executed. At each temporal step some tasks
are selected and speculatively launched in parallel. If, at runtime, two processes
modify the same data a conflict is detected and one of the two has to abort and
roll-back its execution. Neglecting the details of the various amorphous dataparallel algorithms, we can model their common behavior at a higher level with
a simple graph-theoretic model: we can think of a scheduler as working on a dynamic graph Gt = (Vt, Et), where the nodes represent computations we want
to do, but we have no initial knowledge of the edges, which represent conflicts
between computations (see Fig. 1). At time step t the system picks uniformly
at random mt nodes (the active nodes) and tries to process them concurrently.
When it processes a node it figures out whether it has some connections with other
executed nodes; if a neighbor node happens to have been processed before
it, it aborts. Otherwise the node is considered processed and is removed from the graph,
and some operations may be performed in its neighborhood, such as adding new
nodes with edges or altering the neighbors. The time taken to process conflicting and non-conflicting nodes is assumed to be the same, as happens, e.g., for
Delaunay mesh refinement.
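The round semantics just described (pick m active nodes, commit them in a random order, and abort any node that has a previously committed neighbor) can be simulated in a few lines of Python. This is a minimal sketch of the abstract model, not of the Galois runtime; the adjacency-dictionary representation and the toy graph are illustrative assumptions:

```python
import random

def optimistic_round(adj, active):
    """Simulate one speculative step: the active nodes commit in a random
    order; a node aborts iff one of its neighbors committed before it."""
    order = list(active)
    random.shuffle(order)               # random commit permutation pi_m
    committed, aborted = set(), set()
    for v in order:
        if any(u in committed for u in adj[v]):
            aborted.add(v)              # conflict detected: roll back
        else:
            committed.add(v)            # v commits and leaves the graph
    return committed, aborted

# Toy CC graph: a triangle {0, 1, 2} plus an isolated node 3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}, 3: set()}
committed, aborted = optimistic_round(adj, [0, 1, 2, 3])
# Exactly one triangle node commits, and node 3 always commits.
assert len(committed) == 2 and 3 in committed
```

Note that the committed set is always a maximal independent set of the subgraph induced by the active nodes, exactly as in Fig. 1(iii).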
2.1 Control Optimization Goal
When we run an optimistic parallelization we have two contrasting goals: we
want to maximize the work done, achieving high parallelism, but at the
same time we want to minimize the conflicts, hence making good use of
the processors' time. (Furthermore, for some algorithms the roll-back work can
be quite resource-consuming.) These two goals are not compatible: in fact, if
we naïvely try to minimize the total execution time the system is forced to
always use all the available processors, whereas if we try to minimize the time
wasted by aborted processes the system uses only one processor. Therefore in
the following we choose a trade-off goal and cast it in our graph-theoretic model.
Fig. 1. Optimistic parallelization. (i) Nodes represent possible computations, edges
conflicts between them. (ii) m nodes are chosen at random and run concurrently. (iii)
At runtime the conflicts are detected, some nodes abort and their execution is rolled
back, leaving a maximal independent set in the subgraph induced by the initial node
choice.
Let G = (V, E) be a computations/conflicts (CC) graph with n = |V| nodes.
When a scheduler chooses, uniformly at random, m nodes to be run, the ordered
set πm(·) by which they commit can be modeled as a random permutation: if i < j then πm(i) commits before πm(j) (if there is a conflict between πm(i) and πm(j)
then πm(i) commits and πm(j) aborts; if πm(i) aborted due to conflicts with
previous processes, πm(j) can commit, if not conflicting with other committed
processes). Let kt(πm) be the number of aborted processes due to conflicts and
rt(πm) ∈ [0, 1) the ratio of conflicting processors observed at time t (i.e., rt(πm) ≜ kt(πm)/m). We define the conflict ratio r̄t(m) to be the expected r that we obtain
when the system is run with m processors:

r̄t(m) ≜ Eπm[rt(πm)],  (1)

where the expectation is computed uniformly over the possible prefixes of length
m of the permutations of the n nodes. The control problem we want to solve is the
following: given r(τ) and mτ for τ < t, choose mt = μt such that r̄t(μt) ≈ ρ,
where ρ is a suitable parameter.
Remark 1. If we want to dynamically control the number of processors, ρ must
be chosen different from zero, otherwise the system converges to use only one
processor, thus not being able to identify available parallelism. A value of ρ ∈
[20%, 30%] is often reasonable, together with the constraint mt ≥ 2.
3 Exploiting Parallelism
In this section we study how much parallelism can be extracted from a given CC
graph and how its sparsity can affect the conflict ratio. To this purpose we obtain
a worst case class of graphs and use it to analytically derive a lower bound for
the exploitable parallelism (i.e., an upper bound for the conflict ratio). We make
extensive use of finite differences (i.e., discrete derivatives), which are defined
recursively as follows. Let f : Z → R be a real function defined on the integers;
then the i-th (forward) finite difference of f is

Δ^i f(k) = Δ^{i−1} f(k + 1) − Δ^{i−1} f(k),  with Δ^0 f(k) = f(k).  (2)

(In the following we will omit Δ's superscript when it is equal to one, i.e., Δ ≜ Δ^1.)
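As a quick sanity check of definition (2), the recursion can be transcribed directly; the helper name is ours:

```python
def finite_difference(f, k, i=1):
    """i-th forward finite difference of f at k, following definition (2)."""
    if i == 0:
        return f(k)
    return finite_difference(f, k + 1, i - 1) - finite_difference(f, k, i - 1)

# For f(k) = k^2 the familiar identities hold:
# Delta f(k) = 2k + 1 and Delta^2 f(k) = 2.
square = lambda k: k * k
assert finite_difference(square, 3) == 7        # 16 - 9
assert finite_difference(square, 3, i=2) == 2
```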
First, we obtain two basic properties of r̄, which are given by the following
propositions.
Proposition 1. The conflict ratio function r̄(m) is non-decreasing in m.
To prove Prop. 1 we first need a lemma:
Lemma 1. Let k̄(m) ≜ Eπm[k(πm)]. Then k̄ is a non-decreasing convex function, i.e., Δk̄(m) ≥ 0 and Δ²k̄(m) ≥ 0.
Proof. Let k̃(πm, i) be the expected number of conflicting nodes when r = m + i nodes are run concurrently, the first m of which are πm and the last i of which are chosen
uniformly at random among the remaining ones. By definition, we have

Eπm[k̃(πm, i)] = k̄(m + i).  (3)

In particular,

k̃(πm, 1) = k(πm) + Pr[(m + 1)-th conflicts],  (4)

which brings

k̄(m + 1) = Eπm[k̃(πm, 1)] = k̄(m) + η,  (5)

with η = Pr[(m + 1)-th conflicts] ≥ 0; since η = k̄(m + 1) − k̄(m) = Δk̄(m), this proves the monotonicity of k̄.
Consider now

k̃(πm, 2) = k(πm) + Pr[(m + 1)-th conflicts] + Pr[(m + 2)-th conflicts].  (6)

If the (m + 1)-th node does not add any edge, then we have

Pr[(m + 1)-th conflicts] = Pr[(m + 2)-th conflicts],  (7)

but since it may add some edges the probability of conflicting the second time
is in general larger, and thus Δ²k̄(m) ≥ 0.
Proof (Prop. 1). Since r̄(m) = k̄(m)/m, its finite difference can be written as

Δr̄(m) = [mΔk̄(m) − k̄(m)] / [m(m + 1)].  (8)

Because of Lemma 1 and since k̄(1) = 0, we have

k̄(m + 1) ≤ mΔk̄(m),  (9)

which finally brings

Δr̄(m) = [mΔk̄(m) − k̄(m)] / [m(m + 1)] ≥ [k̄(m + 1) − k̄(m)] / [m(m + 1)] = Δk̄(m) / [m(m + 1)] ≥ 0.  (10)
Proposition 2. Let G be a CC graph with n nodes and average degree d; then
the initial derivative of r̄ depends only on n and d as

Δr̄(1) = d / [2(n − 1)].  (11)

Proof. Since

Δr̄(1) = [Δk̄(1) − k̄(1)] / 2 = k̄(2) / 2,  (12)

we just need to obtain k̄(2). Let k̃ be defined as in the proof of Lemma 1 and
π1 = v a node chosen uniformly at random. Then

k̄(2) = Ev[k̃(v, 1)] = Ev[dv / (n − 1)] = Ev[dv] / (n − 1) = d / (n − 1).  (13)
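Proposition 2 can be checked exactly on a small graph by enumerating all ordered pairs of active nodes (the m = 2 case); the code below is an illustrative brute-force verification, with k̄(2) computed by enumeration rather than in closed form:

```python
from itertools import permutations

def check_prop2(adj):
    """Exact Delta_rbar(1) on a small graph: since rbar(1) = 0, it equals
    rbar(2) = kbar(2)/2, where kbar(2) is the probability that two nodes
    chosen uniformly at random conflict (i.e., are adjacent)."""
    n = len(adj)
    pairs = list(permutations(adj, 2))          # all ordered node pairs
    kbar2 = sum(1 for u, v in pairs if v in adj[u]) / len(pairs)
    d = sum(len(adj[v]) for v in adj) / n       # average degree
    return kbar2 / 2, d / (2 * (n - 1))         # exact value vs. bound (11)

# Path graph 0-1-2-3: n = 4, average degree d = 6/4 = 1.5.
lhs, rhs = check_prop2({0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}})
assert abs(lhs - rhs) < 1e-12                   # both equal 1.5/6 = 0.25
```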
A measure of the available parallelism for a given CC graph has been identified
in [15] by considering, at each temporal step, a maximal independent set of the CC
graph. The expected size of a maximal independent set gives a reasonable and
computable estimate of the available parallelism. However, this is not enough
to predict the actual amount of parallelism that a scheduler can exploit while
keeping a low conflict ratio, as shown in the following example.
Example 1. Let G = K_{n²} ∪ D_n, where K_{n²} is the complete graph of size n² and
D_n a disconnected graph of size n (i.e., G is made up of a clique of size n² and n
disconnected nodes). For this graph every maximal independent set is also maximum
and has size n + 1, but if we choose n + 1 nodes uniformly at random and then
compute the conflicts we obtain that, on average, there are only 2 independent
nodes.
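A Monte Carlo check of Example 1 (with the hypothetical parameter n = 10, i.e., 100 clique nodes plus 10 isolated ones): since only n isolated nodes exist, every random choice of n + 1 nodes contains at least one clique node, of which exactly one commits, and the expected number of chosen isolated nodes is 1, so the expectation is exactly 2.

```python
import random

def avg_independent(n, trials=20000, seed=1):
    """Example 1: clique of size n^2 (nodes 0..n^2-1) plus n isolated nodes.
    Choose n+1 nodes at random; the independent ones are the chosen isolated
    nodes plus exactly one clique node (at least one is always chosen)."""
    total = n * n + n
    rng = random.Random(seed)
    acc = 0
    for _ in range(trials):
        chosen = rng.sample(range(total), n + 1)
        isolated = sum(1 for v in chosen if v >= n * n)
        acc += isolated + 1
    return acc / trials

# Every maximal independent set has size n + 1 = 11, yet a random choice
# of n + 1 nodes yields only about 2 independent nodes on average.
avg = avg_independent(10)
assert 1.8 < avg < 2.2
```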
A more realistic estimate of the performance of a scheduler can be obtained by
analyzing the CC graph sparsity. The average degree of the CC graph is linked
to the expected size of a maximal independent set of the graph by the following
well known theorem (in the variant shown in [2] or [22]):
Theorem 1 (Turán, strong formulation). Let G = (V, E) be a graph, n = |V|,
and let d be the average degree of G. Then the expected size of a maximal independent set, obtained by greedily choosing the nodes from a random permutation, is
at least s = n/(d + 1).
Remark 2. The previous bound is existentially tight: let K^n_d be the graph made
up of s = n/(d + 1) cliques of size d + 1; then the average degree is d and the
size of every maximal (and maximum) independent set is exactly s. Furthermore,
every other graph with the same number of nodes and edges has a bigger average
maximal independent set.
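The tightness claim of Remark 2 is easy to verify numerically: on K^n_d the greedy maximal independent set built from any random permutation picks exactly one node per clique. A small sketch (the graph builder, function names, and parameter values are ours):

```python
import random

def union_of_cliques(s, d):
    """K^n_d: s disjoint cliques of size d+1 (n = s*(d+1), average degree d)."""
    adj = {}
    for c in range(s):
        block = range(c * (d + 1), (c + 1) * (d + 1))
        for v in block:
            adj[v] = set(block) - {v}
    return adj

def greedy_mis_size(adj, rng):
    """Maximal independent set built greedily from a random node permutation,
    as in Theorem 1."""
    order = list(adj)
    rng.shuffle(order)
    mis = set()
    for v in order:
        if not any(u in mis for u in adj[v]):
            mis.add(v)
    return len(mis)

rng = random.Random(0)
adj = union_of_cliques(s=5, d=3)     # n = 20, d = 3, Turan bound s = 5
# On K^n_d every maximal independent set has size exactly s = n/(d+1).
assert all(greedy_mis_size(adj, rng) == 5 for _ in range(100))
```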
The study of the expected size of a maximal independent set in a given graph
is also known as the unfriendly seating problem [7,8] and is particularly relevant
in statistical physics, where it is usually studied on mesh-like graphs [11]. The
properties of the graph K^n_d suggested to us an extension of
Turán's theorem. We prove that the graphs K^n_d provide a worst case (for a
given degree d) for the generalization of this problem obtained by focusing on
maximal independent sets of induced subgraphs. This allows, given a target
conflict ratio ρ, the computation of a lower bound for the parallelism a scheduler
can exploit.
Theorem 2. Let G be a graph with the same number of nodes and the same average degree as K^n_d, and
let EMm(G) be the expected size of a maximal independent set of the subgraph
induced by a uniformly random choice of m nodes in G; then

EMm(G) ≥ EMm(K^n_d).  (14)
To prove it we first need the following lemma.

Lemma 2. The function ηj(x) ≜ ∏_{i=1}^{j} (n − i − x) is convex for x ∈ [0, n − j].

Proof. We prove by induction on j that, for x ∈ [0, n − j],

ηj(x) ≥ 0,  ηj′(x) ≤ 0,  ηj″(x) ≥ 0.  (15)

Base case. Let η0(x) = 1. The properties above are easily verified.
Induction. Since ηj(x) = ηj−1(x)(n − j − x), we obtain

ηj′(x) = −ηj−1(x) + (n − j − x)ηj−1′(x),  (16)

which is non-positive by the inductive hypotheses. Similarly,

ηj″(x) = −2ηj−1′(x) + (n − j − x)ηj−1″(x)  (17)

is non-negative.
Proof (Thm. 2). Consider a random permutation π of the nodes of a generic
graph G that has the same number of nodes and edges as K^n_d. We assume the
prefix of length m of π (i.e., π(1), . . . , π(m)) forms the active nodes and focus
on the following independent set ISm in the induced subgraph: a node v is in
ISm(G, π) if and only if it is in the first m positions of π and it has no neighbors
preceding it. Let bm(G) be the expected size of ISm(G, π) averaged over all
possible π's (chosen uniformly):

bm(G) ≜ Eπ[# ISm(G, π)].  (18)
Since by construction bm(G) ≤ EMm(G) whereas bm(K^n_d) = EMm(K^n_d), we just
need to prove that bm(K^n_d) ≤ bm(G). Given a generic node v of degree dv and a
random permutation π, its probability of being in ISm(G, π) is

Pr[v ∈ ISm(G, π)] = (1/n) Σ_{j=1}^{m} ∏_{i=1}^{j−1} (n − i − dv)/(n − i).  (19)

By the linearity of the expectation we can write b as

bm(G) = (1/n) Σ_{v=v1}^{vn} Σ_{j=1}^{m} ∏_{i=1}^{j−1} (n − i − dv)/(n − i) = Ev[ Σ_{j=1}^{m} ∏_{i=1}^{j−1} (n − i − dv)/(n − i) ],  (20)

bm(K^n_d) = Σ_{j=1}^{m} ∏_{i=1}^{j−1} (n − i − d)/(n − i) = Σ_{j=1}^{m} ∏_{i=1}^{j−1} (n − i − Ev[dv])/(n − i).  (21)
To prove that EMm(G) ≥ EMm(K^n_d) it is thus enough to show that

Ev[ ∏_{i=1}^{j} (n − i − dv) ] ≥ ∏_{i=1}^{j} (n − i − Ev[dv])  ∀j,  (22)

which can be done by applying Jensen's inequality [13], since in Lemma 2 we have
proved the convexity of ηj(x) ≜ ∏_{i=1}^{j} (n − i − x).
Corollary 1. The worst case for a scheduler among the graphs with the same
number of nodes and edges is obtained for the graph K^n_d (for which we can
analytically approximate the performance, as shown in §3.1).

Proof. Since

r̄(m) = [m − EMm(G)] / m = 1 − EMm(G)/m,  (23)

the thesis follows.

3.1 Analysis of the Worst-Case Performance
Theorem 3. Let d be the average degree of G = (V, E) with n = |V| (for
simplicity we assume n/(d + 1) ∈ N). The conflict ratio is bounded from above
as

r̄(m) ≤ 1 − [n / (m(d + 1))] (1 − ∏_{i=1}^{m} (n − d − i)/(n + 1 − i)).  (24)

Proof. Let s = n/(d + 1) be the number of connected components in K^n_d. Because
of Thm. 2 and Cor. 1 it suffices to show that

EMm(K^n_d) = s (1 − ∏_{i=1}^{m} (n − d − i)/(n + 1 − i)).  (25)
The probability for a connected component k of K^n_d not to be accessed when m
nodes are chosen is given by the following hypergeometric expression:

Pr[k not hit] = C(d + 1, 0) C(n − d − 1, m) / C(n, m) = ∏_{i=1}^{m} (n − d − i)/(n + 1 − i).  (26)

Let Xk be a random variable that is 1 when component k is hit and 0 otherwise.
We have that E[Xk] = 1 − ∏_{i=1}^{m} (n − d − i)/(n + 1 − i) and, by the linearity of the expectation,
the average number of components accessed is

E[ Σ_{k=1}^{s} Xk ] = Σ_{k=1}^{s} E[Xk] = s (1 − ∏_{i=1}^{m} (n − d − i)/(n + 1 − i)).  (27)
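The two forms of (26) can be cross-checked directly; a small verification sketch, with arbitrary parameter values:

```python
from math import comb, prod

def pr_not_hit(n, d, m):
    """Probability that a fixed (d+1)-clique of K^n_d is missed when m of the
    n nodes are chosen, eq. (26): hypergeometric form vs. product form."""
    hyper = comb(n - d - 1, m) / comb(n, m)
    product = prod((n - d - i) / (n + 1 - i) for i in range(1, m + 1))
    return hyper, product

h, p = pr_not_hit(n=100, d=4, m=10)
assert abs(h - p) < 1e-12
```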
Corollary 2. When n and m increase, the bound is well approximated by

r̄(m) ≤ 1 − [n / (m(d + 1))] (1 − (1 − m/n)^{d+1}).  (28)

Proof. Stirling's approximation for the binomial, followed by deletion of low-order terms in the resulting formula.
Corollary 3. If we set m = αs = αn/(d + 1) we obtain

r̄(m) ≤ 1 − (1/α)(1 − (1 − α/(d + 1))^{d+1}) ≤ 1 − (1/α)(1 − e^{−α}).  (29)

4 Controlling Processors Allocation
In this section we design an efficient control heuristic that dynamically
chooses the number of processes to be run by a scheduler, in order to obtain
high parallelism while keeping the conflict ratio low. In the following we suppose
that the properties of Gt vary slowly compared to the convergence of mt
toward μt under the algorithm we will develop (see §4.1), so we can consider
Gt = G and μt = μ, and thus our goal is to make mt converge to μ.
Since the conflict ratio is a non-decreasing function of the number of launched
tasks m (Prop. 1), we could find m ≈ μ by bisection, simply noticing that

r̄(m′) ≤ ρ ≤ r̄(m″)  ⇒  m′ ≤ μ ≤ m″.  (30)
The control we propose is slightly more complex and is based on recurrence
relations, i.e., we compute mt+1 as a function F of the target conflict ratio ρ
and of the parameters which characterize the system at the previous time step:

m^F_{t+1} = F(ρ, rt, mt).  (31)
The initial value m0 for a recurrence can be chosen to be 2 but, if we have an
estimate of the CC graph's average degree d, we can choose a smarter value: in
fact, applying Cor. 3 we are sure that using, e.g., m = n/(2(d + 1)) processors we will
have at most a conflict ratio of 21.3%.
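The 21.3% figure follows from Cor. 3 with α = 1/2; a two-line numerical check (the function name is ours):

```python
from math import exp

def conflict_ratio_bound(alpha, d):
    """Worst-case conflict ratio of Cor. 3 when m = alpha*n/(d+1) processors
    are used, together with its d-independent relaxation."""
    exact = 1 - (1 - (1 - alpha / (d + 1)) ** (d + 1)) / alpha
    limit = 1 - (1 - exp(-alpha)) / alpha
    return exact, limit

# With alpha = 1/2, i.e., m = n/(2(d+1)), the ratio is at most ~21.3%.
exact, limit = conflict_ratio_bound(0.5, d=16)
assert exact <= limit
assert round(limit, 3) == 0.213
```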
Algorithm 1. Pseudo-code of the proposed hybrid control algorithm

// Tunable parameters
mmax = 1024; mmin = 2; m0 = 2; T = 4;
rmin = 3%; α0 = 25%; α1 = 6%;
// Variables
r ← 0; t ← 0; m ← m0;
// Main loop
while nodes to elaborate ≠ 0 do
    t ← t + 1;
    if m > mmax then m ← mmax;
    else if m < mmin then m ← mmin;
    Launch the scheduler with m nodes;
    r ← r + new conflict ratio;
    if (t mod T) = T − 1 then
        r ← r/T;
        α ← |1 − r/ρ|;
        if α > α0 then
            if r < rmin then r ← rmin;
            m ← (ρ/r) m;
        else if α > α1 then
            m ← (1 − r + ρ) m;
        r ← 0;
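A runnable transcription of Algorithm 1 is sketched below. The averaging window is set to T = 1 for clarity, m is kept as an integer, and the deterministic conflict model r(m) = m/(m + 1000) is purely an assumption for the demo; a real scheduler would report noisy measurements:

```python
def hybrid_controller(conflict_ratio, rho=0.20, steps=100,
                      m0=2, m_min=2, m_max=1024, T=1,
                      r_min=0.03, alpha0=0.25, alpha1=0.06):
    """Sketch of Algorithm 1. `conflict_ratio(m)` plays the role of the
    observed ratio r_t; T = 1 disables the averaging window."""
    m, r_acc = m0, 0.0
    for t in range(steps):
        m = max(m_min, min(m_max, m))
        r_acc += conflict_ratio(m)          # launch the scheduler, observe r
        if t % T == T - 1:
            r = r_acc / T
            alpha = abs(1 - r / rho)        # distance from the target ratio
            if alpha > alpha0:              # far away: Recurrence B (fast)
                m = int((rho / max(r, r_min)) * m)
            elif alpha > alpha1:            # close: Recurrence A (stable)
                m = int((1 - r + rho) * m)
            r_acc = 0.0
    return m

# Deterministic toy conflict model: r(m) = m/(m + 1000), so r = rho at m = 250.
r_of_m = lambda m: m / (m + 1000)
m = hybrid_controller(r_of_m)
assert abs(r_of_m(m) - 0.20) < 0.02
```

With ρ = 20% the controller climbs quickly via Recurrence B, then fine-tunes with Recurrence A until it settles inside the α1 deadband, a few percent below the fixed point at m = 250.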
Our control heuristic (Algorithm 1) is a hybridization of two simple recurrences. The first recurrence is quite natural and adjusts m based on the distance between r and ρ:

Recurrence A:  m^A_{t+1} = (1 − rt + ρ) mt.  (32)
The second recurrence exploits some experimental facts. In Fig. 2 we have plotted
the conflict ratio functions for three CC graphs with the same size and average
degree (note that the initial derivative is the same for all the graphs, in accordance
with Prop. 2). We see that conflict ratio functions which reach a high value (r̄(n) > 1/2)
are initially well approximated by a straight line (for m such that r̄(m) ≤ ρ =
20–30%), whereas functions that deviate from this behavior do not rise too
much. This suggests assuming an initial linearity when controlling mt, as done
by the following recurrence:

Recurrence B:  m^B_{t+1} = (ρ/rt) mt.  (33)
Fig. 2. A plot of r̄(m) for some graphs with n = 2000 and d = 16: (i) the worst-case
upper bound of Cor. 2; (ii) a random graph (edges chosen uniformly at random until
the desired degree is reached; data obtained by computer simulation); (iii) a union
of cliques and disconnected nodes. [Plotted curves: upper bound, random graph,
cliques + disconnected nodes, and their common tangent.]
The two recurrences can be roughly compared as follows (see Fig. 3): Recurrence A has slower convergence than Recurrence B, but it is less susceptible
to noise (the variance that makes the realizations rt differ from r̄t). This is the
reason why we chose to merge them into a hybrid algorithm: initially, when
the difference between r and ρ is big, we use Recurrence B to exploit its quick
convergence, and then Recurrence A is adopted for a finer tuning of the control.
4.1 Experimental Evaluation
In the practical implementation of the control algorithm we have made the following optimizations:
– Since rt can have a big variance, especially when m is small, we decided to
apply the changes to m every T steps, using the values averaged over these
intervals, to smooth the oscillations.
– To further reduce the oscillations we apply a change only if the observed rt
is sufficiently different from ρ (e.g., by more than 6%), thus avoiding small variations in the steady state, which interfere with locality exploitation because
of data moving from one processor to another.
– Another problem that must be considered is that for small values of m the
variance is much bigger, so it is better to tune this case separately using
different parameters (this optimization is not shown in the pseudo-code).
To validate our controller we have run the following simulation: a random CC
graph of fixed average degree d is taken and the controller runs on it, starting
with m0 = 2. We are interested in seeing how many temporal steps it takes
to converge to mt ≈ μ. As can be seen in [15], the parallelism profile of many
practical applications can vary quite abruptly; e.g., Delaunay mesh refinement
Fig. 3. Comparison between two realizations of the hybrid algorithm and one that
only uses Recurrence A, for two different random graphs (n = 2000 in both cases). The
hybrid version has different parameters for m greater or smaller than 20. ρ was chosen
to be 20%. The proposed algorithm proves to be both quick in convergence and stable.
[Plotted curves: d = 4 and d = 16, each with Recurrence A alone and with the hybrid
algorithm.]
can go from no parallelism to one thousand possible parallel tasks in just 30
temporal steps. Therefore, an algorithm that wants to efficiently control the
processors allocations for these problems must adapt very quickly to changes in
the available parallelism. Our controller, that uses the very fast Recurrence B in
the initial phase, proves to do a fast enough job: as shown in Fig. 3 in about 15
steps the controller converges close to the desired μ value.
5 Conclusions and Future Work
Automatic parallelization of irregular algorithms is a rich and complex subject
and will offer many difficult challenges to researchers in the near future. In this
paper we have focused on the processor allocation problem for unordered amorphous data-parallel algorithms; it would be extremely valuable to obtain similar results
for the more general and difficult case of ordered algorithms (e.g., discrete event
simulation); in particular, it is very hard to obtain good estimates of the available
parallelism for such algorithms, given the complex dependencies arising between
concurrent tasks. Another aspect which needs investigation, especially in the
ordered context, is whether some statistical properties of the behavior of irregular
algorithms can be modeled, extracted and exploited to build better controllers,
able to dynamically adapt to the different execution phases.
As for a real-world implementation, the proposed control heuristic is now
being integrated in the Galois system and it will be evaluated on more realistic
workloads.
Acknowledgments. We express our gratitude to Gianfranco Bilardi for the
valuable feedback on recurrence-based controllers and to all the Galois project
members for the useful discussions on optimistic parallelization modeling.
Feedback-Based Global Instruction Scheduling
for GPGPU Applications
Constantin Timm¹, Markus Görlich¹, Frank Weichert²,
Peter Marwedel¹, and Heinrich Müller²
¹ Computer Science 12, TU Dortmund, Germany
constantin.timm@postamt.cs.uni-dortmund.de
² Computer Science 7, TU Dortmund, Germany
frank.weichert@tu-dortmund.de
Abstract. In the face of the memory wall, even in high-bandwidth systems such as GPUs, an efficient handling of memory accesses and memory-related instructions is mandatory. Up to now, memory performance considerations were only made for GPGPU applications at source code level. This is not enough when optimizing an application towards high performance: the code has to be optimized at assembly level as well. Due to the spread of GPGPU-capable hardware into smaller and smaller devices, the energy consumption of a program is – besides the performance – an important optimization goal.
In this paper, a novel compiler optimization technique, called FALIS (Feedback-based and memory-Aware gLobal Instruction Scheduling), is presented, based on global instruction scheduling and multi-objective genetic algorithms. The approach uses profiling-based feedback in order to take the measured performance and energy consumption values into account inside the compiler. Profiling on the real hardware platform is important in order to consider the characteristics of the underlying hardware. FALIS increases the runtime performance of a GPGPU application by up to 13.02% and decreases energy consumption by up to 10.23%.
Keywords: Energy-Aware Systems, Compilers, GPGPU, Multi-Objective Genetic Algorithm, Profiling.

1 Introduction
The development of faster single-core processors and the availability of higher performance due to higher clock frequencies is at an impasse [2]. The shift towards multi-core and many-core systems is tedious because of the increasing complexity of programming and of efficiently utilizing parallelism. Furthermore, the memory wall [13] still exists, i.e., the processing speed is much higher than the memory speed. In the face of this memory wall, an efficient utilization of memory accesses and memory-related operations is mandatory. This also applies to multi- and many-core systems such as GPUs, which – enabled by GPGPU (General Purpose Computing on Graphics Processing Units) – can be utilized for
B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 15–28, 2012.
© Springer-Verlag Berlin Heidelberg 2012
HPC (High Performance Computing) applications in scientific and industrial contexts [17]. In the context of green computing, considering energy consumption in the GPGPU software design process is important. Therefore, energy consumption and runtime performance are the two optimization targets in this paper. One of the most powerful and most interesting compiler optimization techniques for improving the execution order of instructions, and therefore for memory and memory-related instructions, is instruction scheduling (IS). IS is used in this work to optimize the memory utilization of GPGPU applications. A novel partitioning method for scheduling on linear code sections will be presented. This method enables the scheduler to place the memory and memory-related instructions in the code of GPGPU applications in a more efficient manner, meaning that the available memory bandwidth is better utilized.
Most average-case optimization techniques at the compiler level have the drawback that they lack the capability to optimize the code in an elaborate way, since they have no knowledge of the actual hardware platform. Therefore, this paper presents a profiling-based approach which feeds profiled performance indicators back into the compiler backend in order to achieve better solutions.
The major contributions of this paper are summarized as follows:
1. A global instruction scheduling method was integrated in the Nvidia CUDA
compiler [15].
2. A multi-objective genetic algorithm was employed around the compilation process to optimize GPGPU applications with an adaptive IS towards the characteristics of the underlying hardware platform.
The remainder of the paper is organized as follows: after this introduction, Section 2 discusses related work. The main concepts of this paper are presented in Section 3, where genetic global instruction scheduling is introduced. Section 4 evaluates the performance of the techniques presented in the preceding section. The paper ends with a conclusion of the work and an overview of possible future work.
2 Related Work
In [23], the authors showed that in traditional single-core environments, instruction scheduling can have a negative effect on energy consumption. The authors of [9] developed an algorithm called balanced scheduling, which schedules instructions based on an assumption about instruction-level parallelism. However, energy consumption was not part of that work, and the runtime environment was not taken into account, as has to be done for a GPU with its hardware thread scheduler. For special-purpose systems such as DSPs, several approaches with instruction scheduling methods [12,25] exist to produce applications optimized with respect to performance [12] and energy consumption [25]. Both works only targeted single-core and single-thread code. The first work which used local instruction scheduling for optimizing GPGPU applications was presented in [5]. In that work, a performance degradation was possible, because
[Fig. 1 shows the FALIS framework structure: an unoptimized CUDA kernel passes through extraction of mobile (memory and memory-related) instructions and calculation of instruction mobility on linear code sections, then enters an SPEA2 loop (create initial chromosome population, mutate and crossover, evaluate chromosomes, assign fitness); a point chosen from the Pareto front yields the optimized kernel.]

Fig. 1. FALIS Framework Structure
the optimizations had no information about the later stages of the compilation process and the runtime performance. A work optimizing GPGPU applications towards better energy consumption and performance by applying local instruction scheduling was presented in [21]. The proposed method can be employed as a pre-optimization step for this work, as it did not focus on memory instructions. Overall, it can be summarized that not all aspects of instruction scheduling for GPGPU applications have been taken into account. In addition, several papers have been published which consider the energy consumption of GPGPU applications from a software perspective [4,19,20,26]. In these papers it was shown that software can be written in an energy-efficient manner, but it was also shown that this is time-consuming if there is no efficient tool (e.g., compiler) support. The latter is provided by the novel FALIS framework.
3 FALIS
The FALIS (Feedback-based and memory-Aware gLobal Instruction Scheduling) framework (as depicted in Figure 1) is presented in this section. FALIS applies instruction scheduling in order to optimize the CUDA kernels of a GPGPU application. To this end, a genetic algorithm interfaces with the code generation module of the NVOpenCC compiler. In this module, which works on a low-level intermediate representation called Whirl [16], all memory and memory-related instructions are extracted as described in Section 3.1. Afterwards, for each of these instructions, the basic block sequences to which it can be scheduled are created (Section 3.2) and a mobility interval is calculated. The mobility interval comprises the positions in the Whirl intermediate representation of a kernel where memory and memory-related instructions can be placed. Finally, the sequences of instructions are optimized towards the objectives energy
Unoptimized Sequence:
/*01c0*/ SYNC 01d0;
/*01c8*/ ST g[0x411], R8;
/*01d0*/ NOP;
/*01d8*/ BAR.SYNC;
/*01e0*/ MOV R10, R124;

Optimized Sequence:
/*01c0*/ SYNC 01d0;
/*01c8*/ ST g[0x411], R8;
/*01d0*/ MOV R10, R124;
/*01d8*/ BAR.SYNC;
/*01e0*/ ...

Fig. 2. Exemplary Unfavourable Instruction Sequences
consumption and runtime performance by employing a Genetic Algorithm (GA)
named SPEA2 [27] (Section 3.3).
3.1 Extracting Mobile Instructions
The list of instructions relevant for optimization, called mobile instructions (Mob_Ins), can comprise all load and store operations for the different memories (const, global/local, shared) and memory-related instructions such as (barrier) synchronisation statements. The reason why the latter are also considered is described in the following. The atomic runtime entity of a Nvidia GPGPU application is a thread. A thread runs on a single streaming processor (SP) of a graphics card. A set of threads, called a block, is allocated on one streaming multiprocessor, which is a group of several SPs. At block level, there is no memory consistency – except within the thread itself. Nevertheless, threads of a block can force a consistent view on the shared memory by using synchronisation statements. They have to be added to the thread at source code level by the programmer. In [21], it was revealed that performance can be decreased by such statements due to the requirement of adding additional instructions at machine code level to ensure proper timing. The evaluation results showed that it is not always mandatory to add these instructions if other existing instructions can substitute them (as depicted in Figure 2 – on the left-hand side). The substitution can possibly save cycles, resulting in a performance increase and an energy consumption decrease. Thus, the approach presented in this paper can also treat synchronisation statements in addition to memory instructions. For them, a position in the code should be revealed which leads to better performance and/or lower energy consumption. In the scope of FALIS, the position of each extracted mobile instruction is variable. This variability is explained in the following section.
3.2 Calculating Mobility of Instructions on Linear Code Sections
The purpose of instruction scheduling here is to schedule mobile instructions in such a way that they access the GPU's memory system efficiently. In the following, mobile instruction positioning paradigms (MIPP), which can be beneficial and therefore are supported by FALIS, are presented. For the first MIPP, the
[Fig. 3 contrasts which basic blocks (BB1–BB4) each method can reach: (a) Local Scheduler, (b) Treegion Scheduler, (c) Branch Head Partitioning.]

Fig. 3. Reachability of Different Instruction Scheduling Methods
memory instructions are close together so that their sequence (depending on instruction dependencies) can possibly be changed. Within the second MIPP, they are far away from each other so that they cannot interfere with each other. The latter is particularly important when the limited bandwidth of the graphics card's main memory should be utilized in an efficient way. This method is based on an observation of the authors of [8]. In that work the authors revealed that many concurrent memory accesses can decrease performance. Figure 3 shows different methods for scheduling instructions. In Figure 3(a) one can see that with local instruction scheduling, the first MIPP is applicable due to the capability to schedule instructions near to each other. Local instruction scheduling can only schedule an instruction inside one basic block, as denoted by the different layouts of the basic blocks. In Figures 3(a), 3(b) and 3(c), different layouts mean that instructions cannot be moved from one basic block to another, while basic blocks with the same layout mean that instructions can possibly be exchanged between them. But for the second MIPP, local instruction scheduling is not enough. Therefore, methods considering several basic blocks for the scheduling of instructions are required. An example of a state-of-the-art technique is TREEGION scheduling [1]. It only schedules to neighbouring basic blocks, and therefore this method is not able to handle the second MIPP in an efficient manner. In addition, TREEGION scheduling uses compensation code, which may adversely affect the performance objective because predicated code is slower than normally executed code on the GPU [7]. Therefore, a technique called branch-head-partitioning was introduced in [6], which enables a global scheduler to schedule on linear code portions (e.g., linear code sections inside loops) without the use of compensation code, but with the possibility to schedule an instruction far away from the original basic block, as depicted in Figure 3(c).
The Nvidia compiler works on a combined control flow and data dependency graph G = (V, E). V = {i_1, ..., i_n} is the set of all instructions in the intermediate representation of one kernel and E ⊆ V × V are the data dependencies or control
[Fig. 4 illustrates, over basic blocks BB1–BB4, the mobility interval of the mobile instructions Mob_Ins1 and Mob_Ins2, each spanning from its ASAP to its ALAP position.]

Fig. 4. Mobility of Mobile Instructions (Mob_Ins)
flow dependencies between two instructions. Branch-head-partitioning [6] creates new dependency edges in the combined control flow and data dependency graph G. This graph also comprises the barrier synchronisation statements. In order to take the barrier synchronisation instructions into account for global instruction scheduling, the original combined control flow and data dependency graph G is used as a basis. In order to maintain the semantics of a kernel, for each load/store instruction i_x ∈ V preceding a barrier synchronisation instruction i_y ∈ V (x < y), a dependency edge is inserted between i_x and i_y. The same is done for a barrier synchronisation instruction and all succeeding load/store instructions. All control flow edges of the original combined control flow and data dependency graph G are replaced by edges between non-branch instructions. This enables single instructions to be moved over branches if there are no data dependencies in the branch.
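This edge insertion can be sketched as follows; the flat opcode list and the names `add_barrier_edges`, `LD`, `ST` are illustrative stand-ins, not the compiler's actual Whirl data structures:

```python
# Sketch: insert ordering edges so that loads/stores cannot migrate across
# a barrier synchronisation instruction. Opcode names are illustrative.

def add_barrier_edges(instructions, edges):
    """instructions: list of opcode strings indexed by position;
    edges: set of (src, dst) dependency pairs, extended in place."""
    barriers = [i for i, op in enumerate(instructions) if op == "BAR.SYNC"]
    mem_ops = [i for i, op in enumerate(instructions) if op in ("LD", "ST")]
    for b in barriers:
        for m in mem_ops:
            if m < b:
                edges.add((m, b))  # preceding memory op stays before barrier
            else:
                edges.add((b, m))  # succeeding memory op stays after barrier
    return edges
```

The added edges make barrier ordering explicit in G, so a later scheduling pass can move instructions freely without violating the block-level memory consistency described above.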
For calculating the mobility of a mobile instruction, firstly ASAP (As Soon As Possible) scheduling [22] is conducted with the help of the control flow and data dependency graph G. This reveals the lower bound to which the memory instruction can be relocated. In a second step, a scheduling with the ALAP (As Late As Possible) policy [11] is performed to determine the upper bound for positions. Both ASAP and ALAP were originally used in hardware synthesis, but they also work on other graphs, such as the combined control flow and data dependency graph utilized in this paper. The mobility interval for a mobile instruction is, as depicted in Figure 4, the interval between the ASAP and the ALAP position. The shaded area for mobile instructions in Figure 4 marks positions where Mob_Ins1 cannot be scheduled to, because an instruction cannot be placed inside a branch if it was not placed there before. The mobility values are then employed by FALIS in order to optimize a program. As was revealed in [21], a GA is an appropriate optimization technique in the field of instruction scheduling for GPGPU applications, since it can cope with the effects of the warp scheduler of Nvidia graphics cards.
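The interval computation can be sketched on a plain dependency DAG; the integer "positions" and the simple fixed-point iteration are a simplification of the actual scheduler:

```python
# Sketch: ASAP/ALAP positions on a dependency DAG yield per-instruction
# mobility intervals. Nodes are numbered 0..n-1; edges are (pred, succ).

def asap(n, edges):
    """Earliest position of each node (n passes reach a fixed point)."""
    pos = [0] * n
    for _ in range(n):
        for u, v in edges:
            pos[v] = max(pos[v], pos[u] + 1)
    return pos

def alap(n, edges, length):
    """Latest position of each node, bounded by the schedule length."""
    pos = [length] * n
    for _ in range(n):
        for u, v in edges:
            pos[u] = min(pos[u], pos[v] - 1)
    return pos

def mobility(n, edges):
    lo = asap(n, edges)
    hi = alap(n, edges, max(lo))
    return list(zip(lo, hi))  # (ASAP, ALAP) interval per instruction
```

An instruction on the critical path gets a zero-width interval, while an instruction with slack (e.g., one with no successors) can be moved anywhere between its ASAP and ALAP position.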
3.3 Multi-Objective Genetic Algorithm Specification
FALIS utilizes a multi-objective genetic algorithm, which uses real profiling data
to evaluate the optimization potential of a solution. In the following, terms in
the scope of FALIS are presented in more detail:
– Gene g: Basic feature which represents the position of a certain instruction inside the linear basic block sequences for one CUDA kernel. The minimal and maximal values for each gene are given by the interval limited by the positions determined by the ASAP and the ALAP schedule.
– Chromosome c: Gene sequence g_1, ..., g_n. g_i (i ∈ [1, n]) is the assignment of a position to a mobile instruction; there are n mobile instructions in the code.
– Individual I: An element of the solution space (I ∈ I). It comprises a chromosome. The set of all individuals is denoted as I in the following.
– Population I_x: Set of considered individuals, I_x ⊆ I.
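Under these definitions, a chromosome can be sketched as a list of bounded position genes; the helper names are illustrative, not part of FALIS:

```python
# Sketch: one position gene per mobile instruction, each gene constrained
# to the (ASAP, ALAP) mobility interval of its instruction.
import random

def random_chromosome(intervals, seed=0):
    """intervals: list of (asap, alap) bounds, one per mobile instruction."""
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for lo, hi in intervals]

def is_valid(chromosome, intervals):
    """An individual is valid only if every gene respects its interval."""
    return all(lo <= g <= hi for g, (lo, hi) in zip(chromosome, intervals))
```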
Energy and Runtime Measurement: The two major goals of the optimization process are an energy consumption decrease and a performance increase. Both are measured with a performance and energy benchmarking system as depicted in Figure 5. Since the focus is on benchmarking the GPGPU application executed by the graphics card, the power consumption of the system hosting the graphics card is not measured. The power consumption is measured with power clamps at the power supply lines of the PCI Express bus for 12V and 3.3V. For this purpose, the power supply lines of the PCI Express bus have been extended to measure the current with the power clamps. If a graphics card draws additional power from the main system power supply, it is also possible to measure the current through this interface.
The power consumption of a graphics card is described as:

P(t) = P_{3.3V}(t) + P_{12V}(t)    (1)

where P_{12V} is the power consumption (W) at the 12V power supply lines, P_{3.3V} is the power consumption (W) at the 3.3V power supply lines, and t is a time value in the runtime interval of the kernel. The energy consumption and the runtime of a GPGPU application kernel are evaluated as follows: the runtime interval [T_0, T_run] is delimited by trigger signals (TS) in the source code which are initiated by the host and can be measured at the output of the RS232 serial port. The end signal is triggered after a cudaThreadSynchronize() command (also in the host code), which forces CUDA to finish the kernel execution up to this position in the source code. A certain energy E(I) is consumed over the runtime runtime(I) ([T_0, T_run]) for individual I (representing a particular scheduling for one kernel):

E(I) = Σ_{t ∈ [T_0, T_run]} P(t) / f_s    (2)

in which f_s is the sampling frequency (s⁻¹, or Hertz) of the oscilloscope measurements, P(t) is the power consumption in Watt (W), E is the energy in Joule (J), and t is the time in the runtime interval.
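Eqs. (1) and (2) amount to a per-sample summation; a small numeric sketch with made-up sample values:

```python
# Sketch: discretized energy from per-sample power readings, Eqs. (1)-(2).

def energy_joules(p_3v3, p_12v, f_s):
    """p_3v3, p_12v: power samples (W) on the two supply rails over
    [T0, Trun]; f_s: oscilloscope sampling frequency (Hz)."""
    p_total = [a + b for a, b in zip(p_3v3, p_12v)]  # Eq. (1), per sample
    return sum(p_total) / f_s                        # Eq. (2)
```

Dividing the summed power by f_s converts the sample sum into a Riemann-style approximation of the power integral over the runtime interval.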
[Fig. 5 shows the measurement setup: the testbed system hosts the Nvidia graphics card on the PCI Express bus; current clamps on the extended 3.3V and 12V PCI Express power supply lines feed an oscilloscope, and an RS232 trigger signal line from the testbed system delimits the measured interval.]

Fig. 5. Energy and Runtime Evaluation Framework
FALIS Workflow: The workflow of the genetic algorithm – in a more abstract form as opposed to the detailed description of SPEA2 in [27] – comprises several steps and starts with the generation of an initial population I_x with x = 0. For the individuals in that population, I ∈ I_x, the fitness values are evaluated. Based on these fitness values, a selection of the most appropriate candidates for the creation of the next generation I_{x+1} is made. The schematic of the workflow of SPEA2 is depicted in Figure 1. A subset of the elements of the I_{x+1}-th population take part in evolutionary processes:
– Crossover: Exchange of a part of the genes between two individuals I, I′ ∈ I_x.
– Mutation: Randomized mutation of genes of an individual I ∈ I_x.
In FALIS, crossover and mutation change the positions of the memory and memory-related instructions, which can lead to the creation of invalid individuals, i.e., individuals representing a GPGPU application whose semantics may differ from the original version. Therefore, an individual validator was implemented which checks, with the help of a control flow graph and a data dependency graph, whether the created program is correct. When the population size μ is reached in the selection process for population I_{x+1}, the complete process restarts with this population. The complete process is repeated until the population converges or a fixed number of evolutions is accomplished.
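The two operators can be sketched on position genes as follows; clamping mutation to the (ASAP, ALAP) interval is an assumption of this sketch, and the semantic check still falls to the validator described above:

```python
# Sketch: variation operators on position genes. Mutation redraws a gene
# inside its (ASAP, ALAP) interval, so it cannot leave the mobility bounds;
# crossover recombines whole gene tails between two individuals.
import random

def crossover(parent_a, parent_b, point):
    """One-point crossover: exchange gene tails at the given cut point."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, intervals, rate, rng):
    """Redraw each gene within its interval with probability `rate`."""
    return [rng.randint(lo, hi) if rng.random() < rate else g
            for g, (lo, hi) in zip(chromosome, intervals)]
```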
The fitness function enables the assessment of the quality of an individual
I ∈ I with regard to a certain objective. In order to determine better solutions
for each kernel, each individual of a population is evaluated with regard to the
fitness function by FALIS. In contrast to single-objective genetic algorithms, in multi-objective algorithms several competing objectives must be handled. This can be done, e.g., with a fitness function and a selection process tailored towards multiple objectives. The most important method of SPEA2 is the multi-objective fitness function. SPEA2 assigns the fitness f(I) = R(I) + D(I) to an individual I, according to the following equations.

R(I) = Σ_{J ∈ I_x, J ≻ I} S(J)    (3)

is the raw fitness of an individual I, which takes into account the strength

S(J) = |{Z | Z ∈ I_x, J ≻ Z}|    (4)

of the individuals J which dominate I. I ≻ J is the Pareto dominance relation defined in [24] and means that I dominates J if the condition

runtime_ratio(I) ≤ runtime_ratio(J) ∧ E_ratio(I) ≤ E_ratio(J) ∧
(runtime_ratio(I) < runtime_ratio(J) ∨ E_ratio(I) < E_ratio(J))    (5)

is true. The reference individual without scheduling, used for calculating the performance increase and energy consumption decrease, is defined as I_un, with the mobile instructions at their original positions. The runtime improvement for an individual I is specified as the runtime of the scheduled version runtime(I) with respect to the completely unchanged version of the benchmark: runtime_ratio(I) = runtime(I) / runtime(I_un). The energy consumption reduction E_ratio(I) = E(I) / E(I_un) is, analogously to the runtime improvement, defined as the energy consumption of the scheduled version E(I) in comparison to the energy consumption of the unchanged benchmark code E(I_un). In addition, SPEA2 also provides a correction function D(I) to take density information into account. With this approach, it is possible to optimize CUDA kernels towards several objectives. FALIS is designed to optimize the objectives energy consumption and performance. The runtime and energy consumption values are stored for each individual, but only the individuals on the Pareto front are interesting for the application designer, as they are not dominated in either objective.
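Eqs. (3)-(5) can be sketched on (runtime_ratio, E_ratio) pairs as follows; the density term D(I) is omitted:

```python
# Sketch: Pareto dominance (Eq. 5), strength (Eq. 4), raw fitness (Eq. 3).
# Objective tuples are (runtime_ratio, E_ratio); lower is better, and a
# lower raw fitness means a less-dominated individual (0 = on the front).

def dominates(i, j):
    """True if individual i Pareto-dominates j, per Eq. (5)."""
    return (all(a <= b for a, b in zip(i, j))
            and any(a < b for a, b in zip(i, j)))

def raw_fitness(population):
    strength = [sum(dominates(j, z) for z in population)  # Eq. (4): S(J)
                for j in population]
    return [sum(s for j, s in zip(population, strength)   # Eq. (3): R(I)
                if dominates(j, i))
            for i in population]
```

An individual on the Pareto front has no dominators, so its raw fitness is 0; dominated individuals accumulate the strengths of all individuals dominating them.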
4 Evaluation
In this section, the optimization capability of FALIS is evaluated with respect to the following two objectives: runtime and energy consumption. First, an overview of the benchmarks and the test system is given, and then the results are presented.
[Fig. 6 plots, for all evaluated individuals, proportional runtime (%) over proportional energy consumption (%), with the Pareto-optimal points connected as the Pareto front: (a) kernel convolutionRowsKernel [14], where solutions split into 7 and 8 registers/thread; (b) kernel srad_cuda_1 [3] with 10 registers/thread.]

Fig. 6. Proportional Energy Consumption (E_ratio) and Proportional Runtime (runtime_ratio) Reduction for all Evaluated Individuals
4.1 Experimental Environment and Theoretical Evaluation
The benchmark suite for this evaluation contains benchmarks from the following
sources:
– Nvidia CUDA examples [14]
– VSIPL-GPU-Library [10]
– Rodinia benchmark suite [3].
The benchmark suite comprises a large variety of application domains, such as medical imaging, data mining, image processing, pattern recognition, and simulation. The benchmark characteristics cover benchmarks with and without extensive main memory utilization as well as benchmarks which are more computationally expensive. The system used in this study consists of the following components: AMD Phenom X4 9650 (processor), 4GB DDR3 PC1066 (main memory). The power clamp was a Hameg HZO50 and the oscilloscope was a Digilent Electronics Explorer Board. For the SPEA2 implementation in FALIS, the JECO library [18] was utilized. The graphics card used in these tests was a Nvidia 9500 GT (shader clock speed: 1107 MHz, memory clock speed: 400 MHz). The probability values for SPEA2 were 40% for mutation and 20% for crossover. The start population size depends on the number of genes in a chromosome of a certain benchmark, to ensure that the solution space is sufficiently explored. Due to the specialisation to memory and memory-related instructions, not all benchmarks of the three benchmark suites show a performance increase or energy consumption decrease. Only the following benchmarks show significant optimization gain and will be taken into account in the results section:
– Kernel convolutionRowsKernel, benchmark convolutionSeparable [14]
– Kernel convolutionColumnsKernel, benchmark convolutionSeparable [14]
– Kernel srad_cuda_1, benchmark SRAD [3]
– Kernel CUDAkernelQuantizationShort, benchmark dct8x8 [14]
– Kernel d_recursiveGaussian_rgba, benchmark recursiveGaussian [14]
– Kernel cuda_compute_flux, benchmark cfd [3]
[Fig. 7 shows the relative energy consumption decrease (%) for each optimized benchmark kernel.]

Fig. 7. Energy Consumption Analysis of Optimized Benchmarks
4.2 Results
Figures 7 and 8 depict the improvements achieved with FALIS for the benchmarks listed in the former section, leading to a performance increase (shown in Figure 8) or an energy consumption decrease (shown in Figure 7). The maximal value for reducing the energy consumption is 10.23% and the maximal value for the runtime decrease is 13.02%. They were measured for kernel convolutionRowsKernel of the benchmark convolutionSeparable [14]. Another kernel which was accelerated significantly was the kernel srad_cuda_1 of benchmark SRAD [3]. Detailed evaluations of the explored solutions are depicted in Figure 6(a) and Figure 6(b), respectively. Single points inside the figures show the runtime and energy consumption decrease for one individual. The Pareto-optimal points are connected with a line called the Pareto front. A runtime decrease of 11.66% and an energy consumption decrease of 8.59% were achieved for kernel srad_cuda_1 (chromosome length: 64). As one can see from Figure 6(b), there are two clusters; one cluster comprises solutions which have an impact on both the runtime and the energy consumption. The other exemplary kernel is depicted in Figure 6(a) (chromosome length: 304). The energy consumption decreases by up to 10.23% and the runtime decreases by up to 13.02%. Within the optimization of kernel convolutionRowsKernel, the change in register utilization – 7 registers per thread changed to 8 registers per thread – has a positive effect on the runtime and the energy consumption. The performance of kernel CUDAkernelQuantizationShort of benchmark dct8x8 can be increased by 4.84% and its energy consumption decreased by 3.7%. In addition, the performance of kernel d_recursiveGaussian_rgba of benchmark recursiveGaussian is increased by 8.13% and its energy consumption is decreased by 7.28%.
[Fig. 8 shows the relative runtime decrease (%) for each optimized benchmark kernel.]

Fig. 8. Runtime Analysis of Optimized Benchmarks
5 Conclusion
GPGPU application optimizations are usually performed manually in a time-consuming trial-and-error process without efficient compiler support. Additionally, the lack of energy-consumption-aware optimizations is unfavorable for green computing and for the utilization of GPGPU-capable chips in mobile systems. Therefore, this paper presents a novel multi-objective optimization process based on global instruction scheduling methods, called FALIS, which optimizes the energy consumption and the runtime simultaneously. For solving this optimization problem, a genetic algorithm was used which feeds profiling data back to the optimization process. Due to the utilization of a state-of-the-art multi-objective genetic algorithm, it was possible to optimize a GPGPU application towards two objectives: runtime and energy consumption. FALIS focuses on memory and memory-related instructions, which can have a great impact on GPGPU application performance in the face of the memory wall problem. With FALIS, reductions of up to 10.23% in energy consumption and 13.02% in runtime could be achieved for real-world benchmarks.
In the study presented in this paper, memory and memory-related instructions have been taken as the relevant instructions to be scheduled. However, it could possibly be advantageous to also schedule other instructions, such as the instructions used to calculate the addresses for the memory accesses. Therefore, in future work it will be evaluated whether optimized scheduling of more types of instructions is even more beneficial for the energy consumption and the performance.
Acknowledgement. Part of the work on this paper has been supported by Deutsche Forschungsgemeinschaft (DFG) within the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project B2.
Feedback-Based Global Instruction Scheduling for GPGPU Applications
Parallel Algorithm for Landform Attributes
Representation on Multicore and Multi-GPU Systems
Murilo Boratto¹, Pedro Alonso², Carla Ramiro², Marcos Barreto³, and Leandro Coelho¹
¹ Núcleo de Arquitetura de Computadores e Sistemas Operacionais (ACSO), Universidade do Estado da Bahia (UNEB), Salvador, Bahia, Brazil
{muriloboratto,leandrocoelho}@uneb.br
² Departamento de Sistemas Informaticos y Computación (DSIC), Universidad Politécnica de Valencia (UPV), Valencia, Spain
{palonso,cramiro}@dsic.upv.es
³ Laboratório de Sistemas Distribuídos (LaSiD), Universidade Federal da Bahia (UFBA), Salvador, Bahia, Brazil
marcoseb@dcc.ufba.br
Abstract. Mathematical models are often used to simplify landform representation. Their importance is due to the possibility of describing phenomena by means of mathematical models built from a data sample. High processing power is needed to represent large areas with a satisfactory level of detail. In order to accelerate the solution of complex problems, it is necessary to combine the two basic components of heterogeneous systems formed by a multicore with one or more GPUs. In this paper, we present a methodology to represent landform attributes on multicore and multi-GPU systems, using high performance computing techniques for the efficient solution of a two-dimensional polynomial regression model that allows large problem instances to be addressed.
Keywords: Mathematical Modeling, Landform Representation, Parallel Computing, Performance Estimation, Multicore, Multi-GPU.
1 Introduction
Some recent events have encouraged the development of applications that represent geophysical resources efficiently. Among these representations, mathematical models for representing landform attributes stand out, based on two-dimensional polynomial equations [8]. The problem of representing landform attributes using polynomial equations had already been studied in [1]. However, distributed processing was not used in that work, which precluded the usage of a high-degree polynomial, thus limiting the area that could be represented. This happens because the more information is represented, the greater the computational power that is needed. Furthermore, a high-degree polynomial is required to represent a large area correctly, which also demands great computational power.
B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 29–43, 2012.
© Springer-Verlag Berlin Heidelberg 2012
Among the reasons for performing landform representation, we focus on measuring agricultural areas, with the following preponderant factors: plantation design, water resource optimization, logistics, and minimization of erosive effects. Consequently, the landform representation process becomes a fundamental and needful tool for efficient agricultural operation, especially in the agricultural region located in the São Francisco river valley, which stands out as one of the largest viniculture and fruit export areas in Brazil. In addition, one of the main problems hindering the efficient use of agricultural productivity factors in this area is soil erosion associated with inappropriate land use. In this context, the work proposed here contributes to the characterization of soil degradation processes.
Today it is usual to have computational systems formed by a multicore with one or more Graphics Processing Units (GPUs) [13]. These systems are heterogeneous, due to the different types of memory and different speeds of computation of the CPU and GPU cores. In order to accelerate the solution of complex problems it is necessary to use the aggregated computational power of the two subsystems. Heterogeneity introduces new challenges to algorithm design and system software. Our main goal is to fully exploit all the CPU cores and all the GPU devices of these systems to support matrix computation [14]. Our approach achieves the objective of maximum efficiency by an appropriate balancing of the workload among all these computational resources.
The purpose of this paper is to present a methodology for the landform attributes representation of the São Francisco river valley region, based on the two-dimensional polynomial regression method, on multicore and multi-GPU systems. Section 2 briefly describes the mathematical model. Section 3 explains the parallel model on multicore and multi-GPU systems and the parallel implementation. Section 4 presents the experimental results. The conclusions and future work section closes the paper.
2 Mathematical Model
A mathematical landform model is a computational mathematical representation of a phenomenon that occurs within a region of the earth's surface [7]. This model can represent plenty of geographic information from a site, such as geological, geophysical, humidity, altitude, and terrain data. One available technique to accomplish this representation is the Regular Grid Model [11]. This work performs the surface mapping with a global fitting using the polynomial regression technique. This technique fits the two-dimensional polynomial that best describes the data variation in a sample. The problem is that the high computational power demanded to perform the regression on a large data set has made the process very limited. Polynomial regression is a mathematical modeling technique that attempts to describe the relation among observed phenomena.
Figure 1 shows an example of a Regular Grid representation generated from a regular sparse sample that represents altitude information of an area. According to Rawlings [10], modeling can be understood as the development of a mathematical analytical model that describes the behavior of a random variable of interest. This model is used to describe the behavior of independent variables whose relationship with the dependent variable is best represented by a non-linear equation. The relationship among variables is described by two-dimensional polynomial functions, where the fluctuation of the
Fig. 1. Model for landform attributes representation: Regular Grid
dependent variable y is related to the fluctuation of the independent variables. In particular, in the case study developed in this project, non-linear regression has been used to describe the relationship between two independent variables (latitude and longitude) and a dependent variable (height). The mathematical model we use provides the estimation of the coefficients of two-dimensional polynomial functions, of different degrees in x and y, which represent the terrain altitude variation of any area.
When using mathematical regression models, the most widely used parameter estimation method is Ordinary Least Squares [5], which consists of estimating a function that represents a set of points while minimizing the squared deviations. Given a set of geographic coordinates (x, y, z), taking the estimated altitude as the estimation function of these points, a polynomial of degrees r and s in x and y can be given as Equation 1, with εij as the error estimated by Equation 2, where 0 ≤ i ≤ m and 0 ≤ j ≤ n,
\hat{z} = f(x_i, y_j) = \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l, \qquad (1)

\varepsilon_{ij} = z_{ij} - \hat{z}_{ij}. \qquad (2)
The coefficients a_{kl} (k = 0, 1, ..., r; l = 0, 1, ..., s) that minimize the error of the estimation function f(x, y) can be obtained by solving Equation 3 for c = 0, 1, ..., r and d = 0, 1, ..., s,
\frac{\partial \xi}{\partial a_{cd}} = 0, \qquad (3)

where

\xi = \sum_{i=0}^{m} \sum_{j=0}^{n} \varepsilon_{ij}^2 = \sum_{i=0}^{m} \sum_{j=0}^{n} (z_{ij} - \hat{z}_{ij})^2.
From Equations 4 to 10 we get the development of Equation 3:

\varepsilon_{ij}^2 = \Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big)^2, \qquad (4)

z_{ij} = \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l + \varepsilon_{ij}, \qquad (5)

\xi = \sum_{i=0}^{m} \sum_{j=0}^{n} \Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big)^2, \qquad (6)

\frac{\partial \xi}{\partial a_{cd}} = -2 \sum_{i=0}^{m} \sum_{j=0}^{n} \Big[\Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big) x_i^c y_j^d\Big], \qquad (7)

\sum_{i=0}^{m} \sum_{j=0}^{n} \Big[\Big(z_{ij} - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^k y_j^l\Big) x_i^c y_j^d\Big] = 0, \qquad (8)

\sum_{i=0}^{m} \sum_{j=0}^{n} \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^{k+c} y_j^{l+d} = \sum_{i=0}^{m} \sum_{j=0}^{n} z_{ij} x_i^c y_j^d, \qquad (9)

\sum_{i=0}^{m} \sum_{j=0}^{n} \Big[z_{ij} x_i^c y_j^d - \sum_{k=0}^{r} \sum_{l=0}^{s} a_{kl} x_i^{k+c} y_j^{l+d}\Big] = 0. \qquad (10)
The particularized form of the polynomial can be exemplified for the r = s = n case, as in Equation 11:

\hat{z}_{ij}(x_i, y_i) = a_{00} x^0 y^0 + a_{01} x^0 y^1 + \cdots + a_{nn} x^n y^n. \qquad (11)
Developing Equation 10 for the same particular case, we obtain Equations 12 and 13 [9]. The final solution is summarized in the matrix representation Ax = b, where A is the matrix formed by the x_{lc} terms, vector x is formed by the a_{kl} terms, and vector b is formed by the b_l terms:

x_{lc} = \sum_{i=0}^{m} \sum_{j=0}^{n} x_i^{\alpha} y_j^{\beta}, \qquad (12)

b_l = \sum_{i=0}^{m} \sum_{j=0}^{n} Z_{ij} x_i^{\gamma} y_j^{\delta}. \qquad (13)
This holds for any matrix size that one intends to represent and for any degree of the two-dimensional polynomial that is to be fitted.
3 Parallel Computational Model
One of the most decisive concepts for successfully programming modern computers that use GPU and multicore processors is the underlying model of the parallel computer. A GPU card connected to a sequential computer can be considered as an isolated parallel computer fitting a SIMD model, i.e., a set of up to 512 (depending on the model) processors running the same instruction simultaneously, each on its own set of data. On the other hand, the CPU can be seen as a set of independent computational resources (cores) that can cooperate in the solution of a given problem.
Thus, a realistic programming model should consider the host system comprising CPU cores and graphics cards as a whole, thus leading to a heterogeneous parallel computer model. We follow a model similar to the one described in [2], where the machine is considered as a set of computing resources with different characteristics connected via independent links to a shared global memory. Such a model is characterized by the number and type of concurrent computational resources, the different access times to reach each resource, and the different types and levels of memory (Figure 2). Programming such a heterogeneous environment poses challenges at two different levels. At the first level, the programming models for CPUs and GPUs are very different. The performance of each host subsystem depends on the possibility of exploiting the algorithm's intrinsic parallelism and on how it can be tailored to fit the GPU or CPU cores. At the second level, the challenge consists of how the whole problem can be partitioned into pieces (tasks) and how these can be delivered to CPU cores or GPU cards so that the workload is evenly distributed between the subsystems.
Fig. 2. Heterogeneous parallel computer model
The partition of the problem into tasks and the scheduling of these tasks can be based on performance models obtained from previous executions or on a more sophisticated strategy based on small and fast benchmarks representative of the application, which allow predicting, at runtime, the amount of workload that should be dispatched to each resource so as to minimize the total execution time. Currently, we focus our work on how to leverage the heterogeneous concurrent underlying hardware to minimize the time-to-solution of our application, leaving the study of more sophisticated workload distribution strategies to further research.
3.1 The Parallel Algorithm
In particular, the algorithm for the construction of matrix A of order N = (s + 1)² for a polynomial of degree s can be seen in Algorithm 1. Arrays x and y have been previously loaded from a file and stored in such a way that the two sums can be collapsed into a single one of length n = m². Routine matrixComputation receives as arguments the arrays x and y, the length of the sum (n) and the order of the polynomial (s), and returns a pointer to the output matrix A.
Algorithm 1. Routine matrixComputation for the construction of matrix A.

 1: int N = (s+1)*(s+1);
 2: for( int k = 0; k < N; k++ ) {
 3:   for( int l = 0; l < N; l++ ) {
 4:     int exp1 = k/s + l/s;
 5:     int exp2 = k%s + l%s;
 6:     int ind = k + l*N;
 7:     for( int i = 0; i < n; i++ ) {
 8:       A[ind] += pow( x[i], exp1 ) * pow( y[i], exp2 );
 9:     }
10:   }
11: }
The construction of matrices A and b (Equations 12 and 13) is by far the most expensive part of the overall process. However, there is a great opportunity for parallelism in this computation. It is not hard to see that all the elements of the matrix can be calculated simultaneously. Furthermore, each term of the sum can be calculated independently, which makes it possible to partition the sum into chunks of different sizes. Our parallel program is based on this approach, since the usual value for the order of the polynomial s ranges from 2 to 20, yielding matrices of order 9 to 441 (variable N), whereas the length of the sum ranges from 1.3 to 25.4 million terms (variable n), depending on how fine a grid is desired. We partition the sum into chunks, each with a given size, so they need not be equal. The application first spawns nthreads threads, where each one (thread_id) works on a different chunk of the sum, yielding a matrix A_[thread_id]. We consider here A_[thread_id] as a pointer to a linear array of length N×N. The result, i.e., matrix A, is the sum of all these matrices, so A=A_[0]+A_[1]+...+A_[nthreads-1].
Function matrixComputation can be easily modified to work on these partial matrices by performing the computation over chunks of arrays x and y. Everything discussed for the computation of matrix A can be extrapolated to the computation of vector b, including its computation in the same routine.
Now, we consider our heterogeneous system consisting of two CPU processors with six cores each and two identical GPUs. The workload, consisting of the computation of matrix A and array b, has been separated into two main parts in order to deliver its computation to both the multicore CPU system and the two-GPU system. Algorithm 2 shows the scheme used to partition the workload into these two pieces by means of the if branching which starts in line 4. Because it is necessary to have one CPU thread linked to each GPU, we initialize the runtime with a total of nthreads threads, i.e., as many CPU threads as CUDA devices [3] plus a number of CPU threads that will each be linked to a CPU core. This is carried out through an OpenMP [4] pragma directive (lines 1 and 2).
The first two CPU threads are bound to the two GPU devices and the rest are bound to CPU virtual processors. The right number of CPU threads can be fewer than the number of available CPU cores in some cases. Sometimes, we got better results with a number of threads larger than the number of cores, since the Intel Hyper-Threading [6] capability is activated. We have employed a static strategy to dispatch data and tasks to the CPU cores and to the GPUs, i.e., the percentage of workload delivered to each system is an input to the algorithm provided by the user. Once the desired percentage of computation is given, the size of the data is calculated before calling Algorithm 2, so that variable sizeGPU stores the number of terms of the sum (lines 7–9 of Algorithm 1) that each one of the two GPUs will compute, and variable sizeCPU stores the total amount of terms that all the cores of the CPU system will compute. Each system will perform computation only if the piece of data assigned to it is larger than zero.
Algorithm 2. Using multiple GPU devices and cores.

 1: omp_set_num_threads(nthreads);
 2: #pragma omp parallel {
 3:   int thread_id = omp_get_thread_num();
 4:   if( sizeGPU ) {                  /* GPU Computing */
 5:     int gpu_id = 2*thread_id;
 6:     int first = thread_id * sizeGPU;
 7:     cudaSetDevice(gpu_id);
 8:     matrixGPU(A,b,&(x[first]),&(y[first]),&(z[first]),sizeGPU,s);
 9:   } else {
10:     if( sizeCPU ) {                /* CPU Computing */
11:       int cpu_thread_id = thread_id-2;
12:       int first = 2*sizeGPU+cpu_thread_id*sizeThr;
13:       int size = thread_id==(nthreads-1)?sizeLth:sizeThr;
14:       matrixCPU(A,b,&(x[first]),&(y[first]),&(z[first]),size,s);
15:     }
16:   }
17: }
Matrix A and vector b are the output data of Algorithm 2. They are the sum of the partial sums computed by each system. We use the arrays of matrices A_ and b_ described earlier to store these partial results, independently of whether they were computed by a CPU core or a GPU device. Once the threads are joined (after line 17), a total sum of these partial results is performed by the main thread to form A and b. Each thread works on a different piece of arrays x, y, and z. Routines matrixCPU and matrixGPU are adaptations of Algorithm 1 that receive pointers to the suitable location in arrays x, y and z (set in variable first in lines 6 and 12, respectively) and the length of the sum, i.e., sizeGPU for each GPU or size for each CPU core. These routines also include the computation of vector b, which was omitted in Algorithm 1. The total amount of work performed by the CPU system (line 13) is divided into equal chunks of size sizeThr (sizeThr=sizeCPU/(nthreads-2)) except for the last core which is
sizeLth, i.e., sizeThr plus the remaining terms. The GPU devices in our system
are identified with integers 0 and 2 (gpu_id), which explains line 5.
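The static split just described can be captured in a small helper (our sketch; the function name and struct are made up). Given the total number of terms n, the fraction assigned to the two GPUs together, and the number of CPU worker threads, it reproduces sizeGPU, sizeThr and sizeLth so that every term of the sum is assigned exactly once:

```c
typedef struct {
    int sizeGPU;   /* terms per GPU (two GPUs)      */
    int sizeThr;   /* terms per CPU worker thread   */
    int sizeLth;   /* terms for the last CPU thread */
} workload;

/* pct_gpu: fraction of the sum delivered to the two GPUs together;
 * cpu_threads: number of CPU worker threads (nthreads - 2). */
static workload split_workload(int n, double pct_gpu, int cpu_threads) {
    workload w;
    w.sizeGPU = (int)(n * pct_gpu) / 2;            /* each of the 2 GPUs */
    int sizeCPU = n - 2 * w.sizeGPU;               /* rest for the CPUs  */
    w.sizeThr = sizeCPU / cpu_threads;
    w.sizeLth = w.sizeThr + sizeCPU % cpu_threads; /* last thread mops up
                                                      the remainder      */
    return w;
}
```

The invariant worth checking is that 2·sizeGPU + (cpu_threads − 1)·sizeThr + sizeLth equals n for any percentage, so no term is dropped or counted twice.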
3.2 The CUDA Kernel
The computation performed by each GPU is implemented in the matrixGPU function, called in line 8 of Algorithm 2. This function first performs the usual operations of allocating memory in the GPU and uploading data from the CPU to the GPU. Thus, it is assumed that arrays A, x and y have been previously uploaded into the card's global memory. For the sake of simplicity we restrict the explanation to the computation of matrix A, since the computation of vector b can be easily deduced.
The construction of matrix A through a CUDA kernel offers a great opportunity for parallelism. In this case, we exploit both the fact that all the elements of matrix A can be computed concurrently and the fact that each term of the sum is independent of any other one. In order to exploit all this concurrency we used a grid of three-dimensional thread blocks. The thread blocks have dimension BLKSZ_X×BLKSZ_Y×BLKSZ_Z, whose values are macro definitions in the first three lines of Algorithm 3. Each thread is located in the block through 3 coordinates, which are represented by variables X, Y and Z (lines 9–11). The thread blocks are arranged in a three-dimensional grid. The first dimension is 1, and the other two are N/BLKSZ_Y and N/BLKSZ_Z rounded up, respectively, N being the dimension of matrix A and idiv an integer division function that rounds up to obtain the length of these last two dimensions. The following code is within the matrixGPU function and shows this arrangement and the call to the kernel:
1: dim3 dimGrid( 1, idiv( N, BLKSZ_Y ), idiv( N, BLKSZ_Z ) );
2: dim3 dimBlock( BLKSZ_X, BLKSZ_Y, BLKSZ_Z );
3: kernel<<< dimGrid, dimBlock >>>( A, x, y, s, n, N );
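The routine idiv is not listed in the paper; presumably it is the usual rounding-up integer division, needed so that the grid covers all N rows and columns even when N is not a multiple of the block dimension. A minimal version under that assumption:

```c
/* Rounding-up integer division: number of blocks of size blk
 * needed to cover len elements (assumes len >= 0, blk > 0). */
static int idiv(int len, int blk) {
    return (len + blk - 1) / blk;
}
```

With BLKSZ_Y = BLKSZ_Z = 2, for example, idiv(441, 2) = 221 blocks cover the largest matrix order used in the experiments, with the out-of-range threads filtered by the k<N && l<N guard of the kernel.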
The aim is that all the threads within a block concurrently calculate the core computation of line 27. The thread with coordinates (X,Y,Z) is assigned to calculate the terms of the sum X+i, with i = 0:BLKSZ_X:n. This operation is specified by the loop which starts in line 19. The exponents exp1 and exp2 depend on the row (k) and column (l) indexes of the sought-after matrix A. These indexes are calculated in lines 7 and 8, respectively, based on coordinates Y and Z of the thread. All the threads in the block use data in arrays x and y, so, before the calculation in line 27, a piece of these arrays must be loaded into the shared memory from the global memory. Shared memory is a rapid-access memory space that all the threads within a block can access. Each thread in the X dimension with Y=0 and Z=0 performs this load into shared memory, copying one element of arrays x and y into arrays sh_x and sh_y, respectively (lines 22–25). These last arrays have been allocated in the shared memory in line 13. Upon completion of the loop in line 29, all the terms of the sum assigned to that thread have been calculated and stored in the register variable a. This value is then stored in shared memory (line 30). For this purpose, a three-dimensional array sh_A of size BLKSZ_X×BLKSZ_Y×BLKSZ_Z has been allocated in shared memory in line 14.
We need to imagine the shared data sh_A as a three-dimensional cube where each
position has a partial sum of the total sum. There is one sum for each element r×c
Algorithm 3. CUDA Kernel.

 1: #define BLKSZ_X 128
 2: #define BLKSZ_Y 2
 3: #define BLKSZ_Z 2
 4:
 5: __global__ void kernel( double *A, double *x, double *y,
 6:                         int s, int n, int N ) {
 7:   int k = blockIdx.y * blockDim.y + threadIdx.y;
 8:   int l = blockIdx.z * blockDim.z + threadIdx.z;
 9:   int X = threadIdx.x;
10:   int Y = threadIdx.y;
11:   int Z = threadIdx.z;
12:   double a = 0.0;
13:   __shared__ double sh_x[BLKSZ_X], sh_y[BLKSZ_X];
14:   __shared__ double sh_A[BLKSZ_X][BLKSZ_Y][BLKSZ_Z];
15:
16:   if( k<N && l<N ) {
17:     int exp1 = (k/s)+(l/s);
18:     int exp2 = (k%s)+(l%s);
19:     for( int K=0; K<n; K+=BLKSZ_X ) {
20:       int i = X+K;
21:       if( i<n ) {
22:         if( Y == 0 && Z == 0 ) {
23:           sh_x[X] = x[i];
24:           sh_y[X] = y[i];
25:         }
26:         __syncthreads();
27:         a += pow( sh_x[X], exp1 ) * pow( sh_y[X], exp2 );
28:       }
29:     }
30:     sh_A[X][Y][Z] = a;
31:     __syncthreads();
32:     if( X == 0 ) {
33:       a = 0.0;
34:       for( int i=0; i<BLKSZ_X; i++ ) {
35:         a += sh_A[i][Y][Z];
36:       }
37:       A[k+N*l] = a;
38:     }
39:   }
40: }
of matrix A. In other words, elements sh_A[i][Y][Z], for all i, contain the partial sums corresponding to a given element r×c, taking into account the correspondence between the matrix element and the thread coordinates set in lines 7–11. Thus, it is now necessary to add up the partial sums along the X dimension for all the sums. This operation (lines 32–38) is performed only by the threads with X=0. Once the sum has been computed, the result is saved in global memory (line 37).
We use synchronization points within the code (__syncthreads()) to make sure that data in shared memory is saved before it is read. The use of shared memory is restricted to a small size that must be statically allocated (at compilation time); its size on our target GPU is 48 KB. The maximum number of threads per block is limited to 1024; however, it is the total amount of shared memory that really determines the size of the thread block. Anyway, the limitation in the number of threads per block is easily overcome by the number of blocks that can be run concurrently. Typical values for the thread block dimensions are the ones defined in lines 1–3. We experimentally checked that dimensions Y and Z should be equal; they are somehow proportional to N (size of matrix A), whereas dimension X is related to n (size of arrays x and y) and is therefore much larger. Values of N range from 9 to 441, while values of n range between 1.3 × 10⁶ and 25.4 × 10⁶ in our experiments. Given this relationship between N and n, it is clear that the opportunity for concurrency spreads along the X dimension of the block.
As a side note, we chose the first dimension as the "largest" one because the GPU limits the last dimension of the thread block to 64, while allowing up to 1024 threads in the other two dimensions. In addition, the three-dimensional grid of blocks has really been limited to an effective two-dimensional grid, since its first dimension is set to 1. More blocks in coordinate X would mean that data computed by each block in that dimension and stored in shared memory would also have to be shared among the thread blocks. This can only be done through global memory, resulting in a performance penalty.
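With the block dimensions of lines 1–3 of Algorithm 3, the static shared-memory footprint of one block can be checked against the limits mentioned above (a back-of-the-envelope sketch of ours, not NVIDIA's occupancy calculation):

```c
#define BLKSZ_X 128
#define BLKSZ_Y 2
#define BLKSZ_Z 2

/* Static shared memory used by one thread block of Algorithm 3:
 * sh_x and sh_y (BLKSZ_X doubles each, lines 13) plus the
 * three-dimensional reduction array sh_A (line 14). */
static long shared_bytes(void) {
    long sh_xy = 2L * BLKSZ_X * sizeof(double);
    long sh_A  = (long)BLKSZ_X * BLKSZ_Y * BLKSZ_Z * sizeof(double);
    return sh_xy + sh_A;
}
```

With these values the block uses 2 KB + 4 KB = 6 KB of the 48 KB budget and 512 of the 1024 allowed threads, so the chosen dimensions are well within both hardware limits.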
4 Experimental Results
4.1 Characterization of the Execution Environment
The computer used in our experiments has two Intel Xeon X5680 processors at 3.33 GHz and 96 GB of GDDR3 main memory. Each one is a hexacore processor with 12 MB of cache memory. The system contains two NVIDIA Tesla C2070 GPUs, each with 14 streaming multiprocessors (SM) of 32 streaming processors (SP) (448 cores in total), 16 load/store units, four special function units, a 32K-word register file, 64 KB of configurable RAM, and thread control logic. Each SM has both floating-point and integer execution units. Floating-point operations follow the IEEE 754-2008 floating-point standard. Each core can perform one double-precision fused multiply-add operation per clock period. The installed CUDA toolkit is version 4.0. We use the MKL 10.3 library to perform BLAS operations in the CPU subsystem.
4.2 Landform Attributes Representation Analysis
In order to validate the presented methodology and the derived equations, the computing techniques for the efficient solution of the two-dimensional polynomial regression model are applied to represent the landform attributes of an area of the São Francisco river valley region. The data source for the chosen area comes from Digital Terrain Models (DTM) [12], in the form of a regular matrix, with a spacing of approximately 900 meters in the geographic coordinates. The statistical analyses of the elevations indicate a dispersion from 1 to 2,863 meters. The DTM with 1,400 points has only 722 points inside the region; the other points are outside the area. Using all the points representing the landform attributes of the area and Equation 12, we estimate the polynomial coefficients for representing the terrain altitude variation. Estimating such a polynomial of high degree needs great computing power and a long processing time. However, the higher the degree of the polynomial, the more accurate the description of the landform attributes representation, thus achieving a more satisfactory level of detail.
Using the DTM data source, the solution of the model yields an elevation map rendered in 3D (Figure 3) and as a 2D projection in gray tones (Figure 4). It can also
be observed that the São Francisco river valley region has a heavily uneven topography, so
a high degree of the adjusted polynomial is needed to attain an accurate representation
of the surface. Figure 3 shows the elevation maps generated with the coefficients of
polynomials of degrees 2, 4, 6 and 20, respectively. Therefore, by fitting a high-degree
polynomial to the data, a better landform attributes representation and a more accurate
extrapolation are obtained.
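The global fit described above can be sketched as an ordinary least-squares problem. The following Python fragment is illustrative only: it does not reproduce Equation 12 or the paper's CUDA implementation, and the function names are hypothetical; a bivariate polynomial of bounded total degree is fitted with a dense least-squares solve as a serial stand-in.

```python
import numpy as np

def fit_poly_surface(x, y, z, degree):
    """Least-squares fit of a bivariate polynomial z ~ sum c_pq * x^p * y^q
    with total degree <= degree (a serial stand-in for the paper's solver)."""
    terms = [(p, q) for p in range(degree + 1) for q in range(degree + 1 - p)]
    # Design matrix: one column per monomial x^p * y^q
    V = np.column_stack([x**p * y**q for p, q in terms])
    coeffs, *_ = np.linalg.lstsq(V, z, rcond=None)
    return terms, coeffs

def eval_poly_surface(terms, coeffs, x, y):
    """Evaluate the fitted surface at points (x, y)."""
    return sum(c * x**p * y**q for (p, q), c in zip(terms, coeffs))
```

For the DTM data, only the 722 points inside the region would enter x, y and z; raising the degree sharpens the representation at the cost of a larger (and worse-conditioned) design matrix, which is what motivates the GPU solver.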
Fig. 3. 3D Vision landform attributes representation of São Francisco river valley region for polynomial degrees 2, 4, 6 and 20
4.3 Experiments Using Double Precision Data
We have implemented a parallel algorithm in CUDA for the landform attributes representation, using the parallel scheme proposed in Section 3 to build the surface mapping with a global fit based on the polynomial regression technique. The
M. Boratto et al.
Fig. 4. Gray tones landform attributes representation of São Francisco river valley region for
polynomial degrees 2, 4, 6 and 20
benchmarks were compiled with nvcc. In the experiments, we first increased the number of CPU threads from 1 to 24 (with Hyper-Threading enabled) to obtain the number that
minimizes the execution time. Then we added 1 and 2 GPUs to the number of threads obtained in the
former test. The input sizes of the problem (degree of the polynomial) for the experiments
were 2, 4, ..., 40. The algorithm performance is analyzed in Figure 5.
The execution with one thread is denoted by “Sequential” in the figure, while “OMP”
denotes the use of several CPU threads. The OMP version distributes the matrix calculation among the threads, and each thread runs exclusively on a CPU core. The versions
denoted by 1GPU and 2GPU represent executions on one and two devices, respectively. The Parallel Model (“Model”) uses all cores available in the heterogeneous
system; in this model the threads are executed by all the elements of the machine, i.e., the
suitable number of CPU cores and the two GPUs. The results of the experiments show
that the parallel CPU algorithm (OMP) reduces the execution time significantly. As can
be seen in Figure 5, the maximum speedup is around 12, matching the number of cores.
The second figure shows Gflops, where the difference in performance can be
observed more clearly. It must be noted that, for small polynomial degrees (degree <= 10), the performance of OMP is higher than the performance with 1 GPU. This is due to
the data transfer between CPU and GPU. Similarly, the performance of 1 GPU is higher
than the performance with 2 GPUs for degree <= 20. In this case, this is due to the setup
time needed for the selection of the devices, which is high in our target machine (4.5
sec.) and is not necessary if just one GPU is used. The best result has been obtained
with every resource available in the heterogeneous system. The speedup increases with
the problem size, reaching the theoretical maximum of 78, a number that has been obtained by comparing the computational power of one GPU with that of the CPU. The use of the
GPU as a standalone tool provides benefits but does not reach the potential
performance that can be obtained by adding more GPUs and/or the CPU subsystem.
Fig. 5. Graphical representation of the execution time (in seconds), Gflops and Speedup rates by
varying the size of the problem (degree of the polynomial)
Table 1. Comparative performance analysis of execution time (in seconds)
Degree  Sequential       OMP      1GPU     2GPU    Model
 8           84.49      12.32     12.44    14.19    13.61
12          386.17      41.85     21.36    19.39    18.04
16        1,166.88     114.55     43.31    31.48    25.53
20        2,842.52     268.32     90.57    57.03    49.29
24        5,916.06     544.93    172.71   101.08    88.86
28       11,064.96   1,011.42    310.53   176.88   156.72
32       24,397.66   1,777.25    521.63   285.07   256.62
36       30,926.82   2,700.00    828.50   450.67   404.25
40       46,812.70   4,252.69  1,261.09   666.77   600.90
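As a quick consistency check, the speedups quoted in the text can be recomputed directly from the degree-40 row of Table 1:

```python
# Execution times in seconds for polynomial degree 40, taken from Table 1
times_deg40 = {
    "Sequential": 46812.70,
    "OMP": 4252.69,
    "1GPU": 1261.09,
    "2GPU": 666.77,
    "Model": 600.90,
}

# Speedup of each version relative to the sequential run
speedup = {name: times_deg40["Sequential"] / t for name, t in times_deg40.items()}
# The OMP speedup stays near the 12 available CPU cores, while the full
# heterogeneous Model approaches the theoretical maximum of about 78.
```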
5 Conclusion and Future Work
The experimental results obtained in this work indicate that our approach to the solution
of the mathematical model for representing landform attributes is efficient and scalable.
The application exploits the computing power of current GPUs, leveraging the
intrinsic parallelism contained in the algorithm. Furthermore, our solution is designed
so that the tasks into which the matrix-building work is partitioned can be dispatched
either to a GPU or to a CPU core. The high computing cost of the application and the way in
which we performed the solution in this paper motivate us to extend this algorithm
further to other hierarchically higher levels, such as clusters of nodes like the one we
used in these experiments. To this end, we propose for the future an auto-tuning method
to determine the best tile size to be computed by each subsystem in order to attain
load balancing among all the computational resources available.
Acknowledgment. We would like to thank the Generalitat Valenciana for the PROMETEO/2009/2013 project.
References
1. Bajaj, C., Ihm, I., Warren, J.: Higher-order interpolation and least-squares approximation
using implicit algebraic surfaces. ACM Transactions on Graphics 12, 327–347 (1993)
2. Ballard, G., Demmel, J., Gearhart, A.: Communication bounds for heterogeneous architectures. Tech. Rep. 239, LAPACK Working Note (February 2011)
3. Barnat, J., Bauch, P., Brim, L., Ceska, M.: Computing strongly connected components in
parallel on CUDA. In: Proceedings of the 25th IEEE International Parallel & Distributed
Processing Symposium (IPDPS 2011), pp. 544–555. IEEE Computer Society (2011)
4. Chapman, B., Jost, G., van der Pas, R.: Using OpenMP: portable shared memory parallel
programming (scientific and engineering computation). The MIT Press (2007)
5. Golub, G.H., Van Loan, C.F.: Matrix Computations, 2nd edn. Johns Hopkins University Press, Baltimore, MD, USA (1989)
6. Marr, D.T., Binns, F., Hill, D.L., Hinton, G., Koufaty, D.A., Miller, J.A., Upton, M.: Hyperthreading technology architecture and microarchitecture. Intel Technology Journal 6(1), 1–12
(2002)
7. Namikawa, L.M., Renschler, C.S.: Uncertainty in digital elevation data used for geophysical
flow simulation. In: GeoInfo, pp. 91–108 (2004)
8. Nogueira, L., Abrantes, R.P., Leal, B.: A methodology of distributed processing using a mathematical model for landform attributes representation. In: Proceedings of the IADIS International Conference on Applied Computing (April 2008)
9. Nogueira, L., Abrantes, R.P., Leal, B., Goulart, C.: A model of landform attributes representation for application in distributed systems. In: Proceedings of the IADIS International Conference on Applied Computing (April 2008)
10. Rawlings, J.O., Pantula, S.G., Dickey, D.A.: Applied Regression Analysis: A Research Tool.
Springer Texts in Statistics. Springer (April 1998)
11. Rufino, I., Galvão, C., Rego, J., Albuquerque, J.: Water resources and urban planning: the case of a coastal area in Brazil. Journal of Urban and Environmental Engineering 3, 32–42 (2009)
12. Rutzinger, M., Höfle, B., Vetter, M., Pfeifer, N.: Digital terrain models from airborne laser
scanning for the automatic extraction of natural and anthropogenic linear structures. In: Geomorphological Mapping: a Professional Handbook of Techniques and Applications, pp. 475–
488. Elsevier (2011)
13. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In:
Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics
Hardware, pp. 97–106. Eurographics Association, Aire-la-Ville (2007)
14. Song, F., Tomov, S., Dongarra, J.: Efficient support for matrix computations on heterogeneous multicore and multi-GPU architectures. Tech. Rep. 250, LAPACK Working Note (June
2011)
The Performance Model of an Enhanced Parallel
Algorithm for the SOR Method
Italo Epicoco1,2 and Silvia Mocavero2
1 University of Salento, Lecce, Italy
  italo.epicoco@unisalento.it
2 Euro-Mediterranean Center for Climate Change (CMCC), Lecce, Italy
  silvia.mocavero@cmcc.it
Abstract. The Successive Over Relaxation (SOR) is a variant of the
iterative Gauss-Seidel method for solving a linear system of equations
Ax = b. The SOR algorithm is used within the NEMO (Nucleus for
European Modelling of the Ocean) ocean model for solving the elliptic equation for the barotropic stream function. The NEMO performance analysis shows that the SOR algorithm introduces a significant
communication overhead. Its parallel implementation is based on the
Red-Black method and foresees a communication step at each iteration.
An enhanced parallel version of the algorithm has been developed by
acting on the size of the overlap region to reduce the frequency of communications. The overlap size must be carefully tuned for reducing the
communication overhead without increasing the computing time. This
work describes an analytical performance model of the SOR algorithm
that can be used for establishing the optimal size of the overlap region.
Keywords: SOR, NEMO, Performance Model.
1 Introduction
The ocean engine of NEMO (Nucleus for European Modelling of the Ocean) [1]
is a primitive equation model adapted to regional and global ocean circulation
problems. It is a flexible tool for studying the ocean and its interactions with the
other components of the earth climate system over a wide range of space and
time scales. Prognostic variables are the three-dimensional velocity field, the sea
surface height, the temperature and the salinity. In the horizontal direction, the
model uses a curvilinear orthogonal grid and in the vertical direction, a full or
partial step z-coordinate, or s-coordinate, or a mixture of the two. The model
time stepping environment is a three level scheme in which the tendency terms
of the equations are evaluated either centered in time, or forward, or backward
depending on the nature of the term. The model is spatially discretized on a
staggered grid (Arakawa C grid) masking the land points. Vertical discretization depends on both how the bottom topography is represented and whether
the free surface is linear or not. Explicit, split-explicit and filtered free surface
formulations are implemented for solving the prognostic equations for the active
B. Murgante et al. (Eds.): ICCSA 2012, Part I, LNCS 7333, pp. 44–56, 2012.
© Springer-Verlag Berlin Heidelberg 2012
tracers and the momentum. A number of numerical schemes are available for
the momentum advection, for the computation of the pressure gradients, as well
as for the advection of the tracers (second or higher order advection schemes,
including positive ones). When the filtered sea surface height option is used, a
new force that can be interpreted as a diffusion of the vertically integrated volume flux divergence is added in the momentum equation. The equation is solved
implicitly and it represents an elliptic equation for which two solvers are available: the SOR and the Preconditioned Conjugate Gradient (PCG) schemes. The
SOR has been retained because it is a linear solver very useful when using the
adjoint model of NEMO. The NEMO model with the MFS16 [2] configuration
has been evaluated on the MareNostrum platform at the Barcelona Supercomputing Center. The routine named dyn_spg is the most time-consuming one; it
computes the surface pressure gradient term using the SOR scheme.
The paper is organized as follows: the next section introduces the SOR (Successive
Over Relaxation) method. Section 3 describes our parallel approach, while
Sections 4 and 5 present, respectively, the analytical performance model of
the parallel algorithm and the iso-efficiency analysis.
2 SOR Overview
Iterative methods for solving a linear system of equations Ax = b generate a sequence {p_k} of approximate solutions such that the residual vector
(r_k = b − A p_k) converges to 0. The Gauss-Seidel algorithm [3] is an example
of an iterative method for solving a linear system. The method is guaranteed to converge if the matrix A is strictly diagonally dominant. Each equation is
solved for the unknown on the diagonal, and the approximate values of the
other unknowns are plugged in. The process is then iterated until convergence.
The Gauss-Seidel method is easily derived by examining separately each of the
n equations in the linear system. Let the i-th equation be given by:

Σ_{j=1}^{n} a_{ij} x_j = b_i    (1)
At iteration k, equation (2) can be solved for the value of x_i^{(k)}, assuming the
approximations of the previous iteration (x_j^{(k−1)}) for the other unknowns x_{j≠i}:

x_i^{(k)} = (1/a_{ii}) ( b_i − Σ_{j=1}^{i−1} a_{ij} x_j^{(k)} − Σ_{j=i+1}^{n} a_{ij} x_j^{(k−1)} )    (2)
There are two important characteristics of the Gauss-Seidel method that should
be noted. Firstly, the computation appears to be serial: since each component
at the new iteration depends on all of the previously computed components, the
updates cannot be done simultaneously as in the Jacobi method [4]. Secondly,
the new iterate value x(k) depends upon the order in which the equations are
examined. If it changes, the values at the new iteration (and not just their order)
change accordingly.
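A minimal serial sketch of update (2) in Python makes the sequential character of the sweep explicit; this is an illustration for a strictly diagonally dominant A, not the implementation used inside NEMO.

```python
import numpy as np

def gauss_seidel(A, b, tol=1e-10, max_iter=10_000):
    """Solve Ax = b with the Gauss-Seidel iteration of Equation (2).

    Each component x_i is updated in place, so the new values x_j^(k)
    (j < i) are used immediately, while x_j^(k-1) (j > i) come from
    the previous sweep."""
    n = len(b)
    x = np.zeros(n)
    for _ in range(max_iter):
        for i in range(n):
            s1 = A[i, :i] @ x[:i]            # already-updated components
            s2 = A[i, i + 1:] @ x[i + 1:]    # previous-sweep components
            x[i] = (b[i] - s1 - s2) / A[i, i]
        if np.linalg.norm(b - A @ x) < tol:  # residual r_k -> 0
            break
    return x

# Strictly diagonally dominant system, so convergence is guaranteed
A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])
b = np.array([2.0, 4.0, 10.0])
x = gauss_seidel(A, b)
```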
The definition of the Gauss-Seidel method can be expressed using the following
matrix notation:
x^{(k)} = (D − L)^{−1} (U x^{(k−1)} + b)    (3)
where the matrices D, −L, and −U represent the diagonal, the strictly lower
triangular, and the strictly upper triangular parts of A, respectively. The SOR
[5] is an iterative method for solving a linear system of equations derived by
extrapolating the Gauss-Seidel algorithm. This extrapolation takes the form of a
weighted average between the previous iteration and the Gauss-Seidel component
computed at the current iteration. Given a value for the weight ω the component
at iteration k is given by:
x_i^{(k)} = ω x̄_i^{(k)} + (1 − ω) x_i^{(k−1)}    (4)

where x̄ denotes the Gauss-Seidel approximation. The idea is to choose a value for
ω within the interval (0, 2) that will accelerate the rate of convergence to the
solution. In general, it is not possible to compute in advance the value of ω that
will maximize the rate of convergence of the SOR. Frequently, some heuristic
estimate is used, such as ω = 2 − O(h) where h is the mesh spacing of the
discretization of the underlying physical domain.
In matrix terms, the SOR algorithm can be written as follows:
x^{(k)} = (D − ωL)^{−1} [ωU + (1 − ω)D] x^{(k−1)} + ω(D − ωL)^{−1} b    (5)

3 Parallel Algorithm
While the matrix notation for the SOR algorithm is useful for a theoretical
analysis, a practical implementation requires an explicit formula to be defined
[6]. Let’s consider a general second-order elliptic equation in x and y, finite
differenced on a square. Each row of the matrix A is an equation of the form:
a_{i,j} u_{i+1,j} + b_{i,j} u_{i−1,j} + c_{i,j} u_{i,j+1} + d_{i,j} u_{i,j−1} + e_{i,j} u_{i,j} = f_{i,j}    (6)
The iterative procedure is defined by solving the following equation for u_{i,j}:

u*_{i,j} = (f_{i,j} − a_{i,j} u_{i+1,j} − b_{i,j} u_{i−1,j} − c_{i,j} u_{i,j+1} − d_{i,j} u_{i,j−1}) / e_{i,j}    (7)
Then, considering (4), u^{new}_{i,j} is a weighted average given by:

u^{new}_{i,j} = ω u*_{i,j} + (1 − ω) u^{old}_{i,j}    (8)
If we consider that the residual at any stage of the iteration is given by:
ξ_{i,j} = a_{i,j} u_{i+1,j} + b_{i,j} u_{i−1,j} + c_{i,j} u_{i,j+1} + d_{i,j} u_{i,j−1} + e_{i,j} u_{i,j} − f_{i,j}    (9)
we can calculate the new value at each iteration given by:
u^{new}_{i,j} = u^{old}_{i,j} − ω ξ_{i,j} / e_{i,j}    (10)
This formulation is very easy to program, and the norm of the residual vector
ξi,j can be used as a criterion for terminating the iteration. The need to reduce
the time spent by the SOR algorithm without increasing the number of iterations to reach convergence has been the main goal of several previous works.
Different multi-color ordering techniques, such as the Red-Black [7] method for
two dimensional problems, have been investigated; they allow the parallelization
of operations on the same color. Other techniques, overlapping computation and
communication or allowing an optimal scheduling of available processors, have
been designed and implemented, producing parallel versions of SOR [8]. Parallel SOR algorithms suitable for use on asynchronous MIMD computers have been
presented since 1984 [9]. In recent years, the BPSOR [10], characterized by a
new mesh domain partition and ordering, has allowed retaining the convergence
rate of the sequential SOR method with an easy parallel implementation on
MIMD parallel computers.
This work analyzes a parallel algorithm for the SOR based on the Red-Black
method, which divides the mesh into odd and even cells, like in a
checkerboard. Equation (10) shows that the odd point values depend only on
the even points, and vice versa. Accordingly, we can carry out one half-sweep
updating the odd points and then another half-sweep updating the even points
with the new odd values. The parallel algorithm uses a 2D domain decomposition based on checkerboard blocks. Let n_i and n_j be respectively the number of
rows and columns of the global domain, and p_i and p_j respectively the number of
processes along the i and j directions; then each process will compute a subdomain
made of n_i/p_i × n_j/p_j elements. If we consider only one overlap line between
neighbors, each parallel process must exchange the computed values at the border at each iteration of the SOR. Two communication steps must be performed
for each iteration (for the odd and for the even points). At each iteration, the
generic process computes the odd points inside its domain, exchanges the odd
points with its neighbors and updates the boundary values, computes the inner
even points and finally updates even points on the boundaries exchanging with
neighbors (see Fig. 1). At each iteration, the generic parallel process will then
communicate twice for each neighbor. In order to reduce the frequency of communication, the size of the overlap region could be increased [11]. In that case
the neighboring processes would exchange a wider overlap region, but the values
exchanged can be used for further iterations without the need of communication.
Each process, after exchanging the data, computes a total number of lines given
by Ninner + Nol − 1, where Ninner and Nol are respectively the total number
of lines in the inner domain and the number of overlap lines. At each iteration
only one line of the overlap expires so that the process has no need to exchange
for Nol − 1 iterations. The convergence rate of the algorithm does not change,
since the ordering and partition are the same as in the original SOR algorithm. The
algorithm is explained by the following pseudo-code fragment.
Fig. 1. SOR Red-Black computing algorithm
Require: u              // result matrix with initial values
         a, b, c, d, e  // coefficient matrices
         f              // known term
ol_exp ← ol
while ξ(i, j) is not small enough do
  if ol_exp == 0 then
    call data_exch      // exchange odd points over the overlap
  end if
  for all even points do
    tmp ← (f(i, j) − a(i, j)·u(i, j−1) − b(i, j)·u(i, j+1) − c(i, j)·u(i−1, j) − d(i, j)·u(i+1, j)) / e(i, j)
    ξ(i, j) ← tmp − u(i, j)
    u(i, j) ← ω·tmp + (1 − ω)·u(i, j)
  end for
  if ol_exp == 0 then
    call data_exch      // exchange even points over the overlap
    ol_exp ← ol
  end if
  for all odd points do
    tmp ← (f(i, j) − a(i, j)·u(i, j−1) − b(i, j)·u(i, j+1) − c(i, j)·u(i−1, j) − d(i, j)·u(i+1, j)) / e(i, j)
    ξ(i, j) ← tmp − u(i, j)
    u(i, j) ← ω·tmp + (1 − ω)·u(i, j)
  end for
  call convergence_test(ξ)
  ol_exp ← ol_exp − 1
end while
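The red-black sweeps can be sketched serially in Python for the special case a = b = c = d = 1, e = −4 of (6), i.e., the 5-point Laplacian stencil. This is a hedged illustration of the residual (9) and update (10) with zero Dirichlet boundaries and no halo exchange, not the MPI code analyzed in the paper.

```python
import numpy as np

def redblack_sor(f, omega=1.5, tol=1e-8, max_iter=20_000):
    """Serial Red-Black SOR for the 5-point stencil of (6) with
    a = b = c = d = 1 and e = -4; boundary values of u are fixed at 0."""
    n = f.shape[0]
    u = np.zeros_like(f)
    for _ in range(max_iter):
        max_res = 0.0
        for color in (0, 1):              # one half-sweep per color
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    if (i + j) % 2 != color:
                        continue
                    # residual (9) with a = b = c = d = 1, e = -4
                    xi = (u[i + 1, j] + u[i - 1, j] + u[i, j + 1]
                          + u[i, j - 1] - 4.0 * u[i, j] - f[i, j])
                    # update (10): u_new = u_old - omega * xi / e
                    u[i, j] += omega * xi / 4.0
                    max_res = max(max_res, abs(xi))
        if max_res < tol:                 # residual norm as stop criterion
            break
    return u
```

In the parallel version, each process would run these half-sweeps on its own subdomain plus the overlap region and call data_exch only every N_ol iterations.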
A similar approach has been used by the HYCOM ocean model [12] where a
maximum number of wide halo lines can be added to reduce the halo communication overhead.
4 The Analytical Performance Model
The SOR algorithm has been implemented in a test program made up of the main
sor routine, which (i) calls the data_exch routine for exchanging the data between
neighbors and (ii) evaluates the convergence. Both routines are characterized by two kinds of operations: computing and communication. data_exch
performs some data buffering operations and the actual send and receive of the
data on the boundaries. The size of the overlap region directly impacts the
frequency of the data_exch invocations. The sor routine computes the result matrix and performs a collective communication during the convergence test. If we
increase the size of the overlap, the computation increases, while the time for the
collective communication does not change. The total time is the sum of these
four components, three of them depending on the size of the overlap. What is
the overlap value that yields the greatest benefit? The optimal value is related
to architectural aspects (i.e., the processor speed and the network bandwidth and latency) and changes with both the number of parallel processes and
the domain decomposition. A performance model for estimating the behavior
of the SOR algorithm has been defined, as in [13]. It takes into consideration the four above mentioned aspects. The total time spent by the solver
(T_sor) is given by: (i) the communication time spent in the sor routine for the
convergence test (T_c_sor); (ii) the computing time spent in the sor routine for
evaluating the result matrix at each iteration (T_u_sor); (iii) the computing time
spent in the data_exch routine for managing the data buffers used for the data
transmission (T_u_data); and (iv) the communication time in data_exch for the
data transfer to the neighbors (T_c_data). The number of calls to data_exch depends on both the overlap size (l) and the number of iterations (m) needed to
reach convergence. The performance model is summarized as follows:
T_sor = T_c_sor + T_u_sor + (2m/l + 1)(T_c_data + T_u_data)    (11)
The four timing components can be modeled as in (12), (13) and (14).
The time spent in the collective communication depends only on the number
of parallel processes (p_i p_j). The convergence test is performed after the first 100
iterations and then every 10 iterations, through an allreduce MPI collective communication in which the maximum residual value is exchanged among
all of the parallel processes. The amount of data exchanged is constant (it is
independent of the subdomain dimension) and, considering an implementation of the allreduce with the butterfly parallel scheme, the number of
communication steps is logarithmic in the total number of processes.
T_c_sor = O( ((m − 100)/10) log(p_i p_j) )    (12)
The computing time of sor is related to the domain dimension: d_i and d_j are
the dimensions of the biggest subdomain along the i and j directions, respectively,
and are given by d_i = n_i/p_i and d_j = n_j/p_j. For each iteration of the SOR, a
complete sweep of the subdomain elements plus the overlap region is performed.
T_u_sor = O( m (d_i + l)(d_j + l) )    (13)
The communication is implemented with four point-to-point sends/receives, hence
the communication time is directly proportional to the number of exchanged elements. Here we consider a parallel process with four neighbors, although not all of
the processes have four neighbors, such as those on the border of the global domain.
T_c_data = O(L_i + L_j),    T_u_data = O(L_i + L_j)    (14)

L_i and L_j represent the total number of elements exchanged between neighbors,
with L_i = (d_i + 2l)·l and L_j = (d_j + 2l)·l.
Considering all of the previous equations, the parallel time of the whole algorithm can be expressed as follows:
T_sor = O( n_i n_j/(p_i p_j) + n_i l/p_i + n_j l/p_j + l² + log(p_i p_j) )    (15)
If we consider a square global domain, then n_i = n_j = n, and we can also impose
p_i = p_j = √p. Equation (15) can then be simplified to:
T_sor = O( n²/p + n l/√p + l² + log p )    (16)
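The trade-off captured by (11) and (16) can be explored numerically. The sketch below fills in the asymptotic terms with illustrative machine constants; t_flop, t_word, t_lat and t_log are assumptions for the sake of the example, not the values measured on the cluster described below.

```python
import math

def model_time(l, m, n, p, t_flop=1e-9, t_word=1e-8, t_lat=1e-6, t_log=1e-6):
    """Estimate T_sor for overlap size l, m iterations, an n x n square
    domain and p processes (p_i = p_j = sqrt(p)), following (11)-(14).
    The four t_* constants are illustrative, not measured values."""
    d = n / math.sqrt(p)                                   # subdomain edge, d_i = d_j
    t_u_sor = m * t_flop * (d + l) ** 2                    # compute sweep, as in (13)
    t_c_sor = max(m - 100, 0) / 10 * t_log * math.log2(p)  # allreduce, as in (12)
    halo = 2 * (d + 2 * l) * l                             # L_i + L_j elements, as in (14)
    per_exchange = t_lat + halo * t_word                   # one halo exchange (buffering + transfer)
    t_data = (2 * m / l + 1) * per_exchange                # exchange-frequency term of (11)
    return t_u_sor + t_c_sor + t_data

def best_overlap(m, n, p, max_l=32):
    """Overlap size minimizing the modeled total time."""
    return min(range(1, max_l + 1), key=lambda l: model_time(l, m, n, p))
```

Widening the overlap divides the exchange frequency 2m/l but inflates both the (d + l)² compute term and the halo volume, so the minimizer shifts back toward l = 1 as the network gets faster.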
The evaluation of the analytical equations of the performance model has been
carried out experimentally on an IBM Power6 cluster. It has 30 IBM p575 nodes,
each