Suppose we wish to write a procedure that computes the inner product of two vectors. An abstract version of the function has a CPE of 54 for both integer and floating-point data. Consider the following version, inner4:
void inner4(vec_ptr u, vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(u);
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;

    /* Accumulate the element-wise products in a single running sum */
    for (i = 0; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
Our measurements show that this function requires 3.11 cycles per iteration for integer data. The assembly code for the inner loop is as follows:
.L24:
  movl (%esi,%edx,4),%eax     Get udata[i]
  imull (%ebx,%edx,4),%eax    Multiply by vdata[i]
  addl %eax,%ecx              Add to sum
  incl %edx                   i++
  cmpl %edi,%edx              Compare i:length
  jl .L24                     If <, goto loop
Assume that integer multiplication is performed by the general integer functional unit and that this unit is pipelined. This means that one cycle after a multiplication has started, a new integer operation (multiplication or otherwise) can begin. Assume also that the Integer/Branch functional unit can perform simple integer operations.
A) Show the translation of these lines of assembly code into a sequence of operations. The movl instruction translates into a single load operation. Register %eax gets updated twice in the loop. Label the different versions %eax.1a and %eax.1b.
B) Explain how the function can go faster than the number of cycles required for integer multiplication.
C) Explain what factor limits the performance of this code to at best a CPE of 2.5.
D) For floating-point data, we get a CPE of 3.5. Without needing to examine the assembly code, describe a factor that will limit the performance to at best 3 cycles per iteration.
---------------------------------------------------
Write a version of the inner product procedure described in the previous problem that uses four-way loop unrolling.
Our measurement for this procedure gives a CPE of 2.20 for integer data and 3.50 for floating point.
A) Explain why any version of any inner product procedure cannot achieve a CPE better (that is, lower) than 2.
B) Explain why the performance for floating point did not improve with loop unrolling.
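As a starting point for this problem, four-way loop unrolling of inner4 might look like the sketch below. The function name inner_unroll4, the choice of int for data_t, and the minimal vec_rec definition with its two accessors are assumptions made only so the example compiles on its own; the post does not show the original vector type. The sketch keeps a single accumulator, so the additions into sum still happen one after another.

```c
#include <assert.h>

/* Assumption: integer elements; the problem also uses floating-point data. */
typedef int data_t;

/* Assumption: a minimal vector record; the original definition is not shown. */
typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

static int vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Hypothetical name: inner product with four-way loop unrolling. */
void inner_unroll4(vec_ptr u, vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(u);
    int limit = length - 3;          /* last index where i+3 is still valid */
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum = (data_t) 0;

    assert(vec_length(v) == length); /* both vectors must have equal length */

    /* Main loop: four elements per iteration, one accumulator */
    for (i = 0; i < limit; i += 4) {
        sum = sum + udata[i]     * vdata[i]
                  + udata[i + 1] * vdata[i + 1]
                  + udata[i + 2] * vdata[i + 2]
                  + udata[i + 3] * vdata[i + 3];
    }

    /* Cleanup loop: the remaining 0 to 3 elements */
    for (; i < length; i++) {
        sum = sum + udata[i] * vdata[i];
    }
    *dest = sum;
}
```

The cleanup loop matters whenever the length is not a multiple of four; for vectors shorter than four elements the main loop never runs and the cleanup loop does all the work.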
-------------------------------------------------------
Write a version of the inner product procedure described in the first problem that uses four-way loop unrolling and two-way parallelism.
Our measurements for this procedure give a CPE of 2.25 for floating-point data. Describe two factors that limit the performance to a CPE of at best 2.0.
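One possible shape for four-way unrolling combined with two-way parallelism is sketched below. It splits the running sum into two independent accumulators, sum0 and sum1, which are combined only at the end. The function name inner_unroll4x2, the choice of float for data_t, and the minimal vec_rec definition are assumptions for illustration and are not taken from the original post.

```c
#include <assert.h>

/* Assumption: floating-point elements, as in the measurement quoted above. */
typedef float data_t;

/* Assumption: a minimal vector record; the original definition is not shown. */
typedef struct {
    int len;
    data_t *data;
} vec_rec, *vec_ptr;

static int vec_length(vec_ptr v) { return v->len; }
static data_t *get_vec_start(vec_ptr v) { return v->data; }

/* Hypothetical name: four-way unrolling with two accumulators. */
void inner_unroll4x2(vec_ptr u, vec_ptr v, data_t *dest)
{
    int i;
    int length = vec_length(u);
    int limit = length - 3;          /* last index where i+3 is still valid */
    data_t *udata = get_vec_start(u);
    data_t *vdata = get_vec_start(v);
    data_t sum0 = (data_t) 0;        /* accumulates elements i and i+2 */
    data_t sum1 = (data_t) 0;        /* accumulates elements i+1 and i+3 */

    assert(vec_length(v) == length); /* both vectors must have equal length */

    /* Main loop: four elements per iteration, two independent sums */
    for (i = 0; i < limit; i += 4) {
        sum0 = sum0 + udata[i]     * vdata[i];
        sum1 = sum1 + udata[i + 1] * vdata[i + 1];
        sum0 = sum0 + udata[i + 2] * vdata[i + 2];
        sum1 = sum1 + udata[i + 3] * vdata[i + 3];
    }

    /* Cleanup loop: the remaining 0 to 3 elements go into sum0 */
    for (; i < length; i++) {
        sum0 = sum0 + udata[i] * vdata[i];
    }
    *dest = sum0 + sum1;
}
```

Because floating-point addition is not associative, splitting the sum this way can produce a result that differs slightly from the single-accumulator version for general inputs; for integer data the two versions agree as long as no overflow occurs.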