GRADING/ANSWER SHEET ## Fall 2010 CS61C Final | Your Name: | | | ··· | | <br> | ,, <u>-</u> | |------------|--------|---------|-------|---------|------|-------------| | Your TA: | Andrew | Michael | Conor | Charles | | | | Login: | cs61c | | | | | | This exam is worth 186 points, or about 20% of your total course grade. The exam contains 10 questions. This booklet contains 14 numbered pages including the cover page, plus the 2 pages for the green card. Put all answers on these pages, please; don't hand in stray pieces of paper. R)Ch M M | Question | Points (Minut | es) | Score | 7 | |-----------------------------|------------------|----------|-------|----------------| | 1. Justified | 30 points (15 m | inutes)Z | 20,5 | 3 | | 2. Virtuosity | 14 (7) | 14 | 9.8 | Ô | | 3. Mo' Cache, Mo' Problems | 24 (12) | 24 | 18,1 | 0 | | 4. My Secret Cache | 34 (17) | 34 | 20.5 | 4 | | 5. Heaven's Gate | 8 (4) | 8 | 7.4 | ] ( | | 6. Off the Map | 20 (10) | 20 | 1207 | $\bigcirc$ | | 7. Pay It Forward | 12 (6) | 12 | 8,3 - | 1 | | 8. Bigger, Stronger, Faster | 10 (5) | 10 | 6.3 | 0 | | 9. Altered States | 10 (5) | (0) | 5.6 | $]$ $\bigcirc$ | | 10. Three's Company | 24 (12) | 24 | 17.0 | 0 | | Total | 186 points (93 m | inutes) | | 段 | | | TOTAL | 177 | 132 | 42 | | | | IVIAX | AUG | WIK | l # Justified. True or False with Justification Circle whether each statement in questions a) to g) is True or False, and give a one sentence justification. We reserve the right to dock points for long-winded answers. Be concise! If you feel like you can give a clear justification in just one word, please do! 3 pts/question; -2 if explanation partially correct -( if explanation "close" a. Designing a processor with dynamic multiple issue places extra burden on the compiler-author to ensure the correctness of emitted code. TRUE /(FALSE) Justification: The instruction set closert change because of nultiple issue b. As a result of hitting the Power Wall, Moore's Law no longer holds. TRUE / FALSE Justification: Moore's Law is about transistor count, not clock retes, which are what is limited by the Power Well. c. Increasing the depth of a pipeline tends to increase performance primarily by decreasing the amount of time it takes on average to execute a particular instruction. (accepted ether TF due to ambiguity about intakes in average TRUE FALSE Justification: Decreases time between instructions - increased throughput but not decreased latercy d. In a single-cycle MIPS CPU during the execution of a jump instruction, the adder block inside the ALU always computes the sum of its inputs. (TRUE) FALSE Justification: Hardware is 5th powered and has inpute, even though its outpid is not used. e. After enough time, a circuit composed of combinational logic without feedback will always show the same output for a given set of inputs. (TRUE)/ FALSE This 15 the definition of combinational logic. Justification: | f. For pipelined execution, if multiple exceptions happen in the same clock cycle for different instructions, the hardware should set the EPC to address of the earliest issued instruction. | |------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| | Justification: (Assuming "earlest issued" Teens "earlest issued among these wexcepting") (Snowing out of order issue.) Except handlers should be invoked as if instructions are exe on at a time. | | g. To get good performance from future microprocessors in the next 5 to 10 years, SIMD via successors to SSE4 instructions could be as important to use as MIMD via more cores. TRUE/FALSE | | Justification:<br>SIMD widths are growing rearly as fast as | | h. A particular memory access results in a page fault but not a protection error. Which of the following statements about this access is <u>always</u> true? Circle the answer. | | Always true / Not always true There was a TLB miss. Always true / Not always true The cache(s) will miss. Always true / Not always true Space for the required page is allocated somewhere on disk. | | i. Which of the following could be enabled by adding more ALUs to a processor design? (Check all that apply.) More structural hazards More issue slots Wider SIMD width | | Larger CPI j. Multiple Choice. (Pick one.) Virtual machines are used in Cloud Computer by Amazon to | | I. Protect users from each other II. Offer more price points III. Simplify introduction of new and faster hardware A. None of the above B. I only C. II only D. III only E. I and II F. II and III G. I and III | | H., II, and III | ### 2. Virtuosity A computer has a 32-bit virtual address with 8 KB pages and a 16 KB cache with 32 Byte blocks that is 2-way set associative -2 pts: OFF 50 one eccor IV. Show how the 32-bit address is divided into the virtual page number and page offset, showing the number of bits in each field: -2 pts: Wrong ordering of Cille Page offset - 13 VPN - 19 31 0 -) N3 points: Field completely wrong, V. Show how the 32-bit address could be divided into the cache tag, cache index, and block offset, showing the number of bits in each field: Only reduct once For coerror between PATTS . 31 For ort lean VI. Given these layouts, must the lookup in the TLB to translate from virtual address to physical address be done sequentially before access the cache? getting Rolling in right 4 prs. Why or why not? 3 pts - Minor Misstatement 2 pts - mostly correct, some in accorda statements lipt - partly correct. The cache index Field is contained within the page offset Field, so can retreive the relevant blocks from the cache while we look up the translation for the UPN in the TLB. We can verify hit/ass on the blocks by wing the PPN as the tag. ### 3. Mo' Cache, Mo' Problems a) A naive hardware developer designs the cache protocol for a 2 CPU system, each CPU with a 2 block direct mapped cache, 1 byte blocks. Memory addresses are 32 bits (byte addressed). The cache policies are Allocate on write, and Write through. Assume that both caches start out with all blocks invalid. Which of the following access patterns, *executed independently from one another*, will always yield correctly updated results? Assume that each Read/Write finishes completely during its time period. ypts each | | CPU | Time 1 | Time 2 | Time 3 | Correct? | |--------|-----|-----------|-----------|----------|----------| | i. 1 2 | | Write 0x0 | | Read 0x0 | Λ. | | | | | Write 0x0 | | 7 IV | | ii. | 1 | Write 0x0 | | Read 0x1 | V | | | 2 | | Write 0x1 | | 7 7 | | iii. | 1 | Write 0x0 | | Read 0x0 | Y | | | 2 | | Write 0x2 | | 7 ' | | iv. | 1 | Read 0x0 | | Read 0x2 | Q | | | 2 | | Write 0x2 | | 7 _ \ | | v. | 1 | Read 0x0 | | Read 0x0 | 1 1 | | | 2 | | Write 0x0 | | 7 () | that memoral survival The cond 14 b) What feature could you add to this cache protocol to make it work in all cases? Be specific; will your revision should describe a particular behavior. Invalidate on Write- -Pirtial creater possible: return Stale dota. (inconsistant with the rust up to date copy in memory) | RI | UR | DS. | | |----|-----------------------------|-----|--| | 1- | $\mathcal{J}_{\mathcal{U}}$ | | | ### 4. My Secret Cache The average memory access time (AMAT) is: $AMAT = Hit Time_{L1} + Miss Rate_{L1} \times Miss Penalty_{L1}$ 4 (a) Write down the AMAT equation for a three level cache (use HT for Hit Time, MR for Miss Rate, and MP for Miss Penalty): HTLI + MRLIX (HTL2 + MRL2 × (HTL3+ MR× MPL3 Memory access (b) Given the following specification: For every 1000 CPU-to-memory references 40 will miss in L1\$: 20 will miss in L2\$; 10 will miss in L3\$; L1\$ hits in 1 clock cycle; L2\$ hits in 10 clock cycles; L3\$ hits in 100 clock cycles; Main memory access is 400 clock cycles; There are 1.25 memory references per instruction; and The ideal CPI is 1. Answer the following questions: (i) What is the local miss rate in the L2\$? (ii) What is the global miss rate in the L2\$? $20/000 = 2\% \text{ or } 4\% \times 50\% = 2\% \text{ or}$ (iii) What is the local miss rate in the L3\$? (iv) What is the global miss rate in the L3\$? the global miss rate in the L3\$? $$(0/(000 = 170 \text{ of } 3000))$$ (v) What is the AMAT with all three levels of cache? $$= 1 + 490(10 + 508 \times (100 + 507 \times 400))$$ $$= 1 + 64 + 29(100 + 200) = 1.4 + 6 = 7.4$$ (vi) What is the AMAT for a two-level cache without L3\$? $$1+490(10+502\times400)$$ = $1+.4+49.\times200=9.4$ | 1 | vii | What is | the AMAT | for a one-level | cache without | 128 and 1289 | |---|------|-----------|------------|-----------------|---------------|-----------------| | 1 | VIII | vv Hat 18 | ine Anna i | 101 a one-level | cache without | LZ\$ ana L\$\$? | (viii) What is the average memory stalls per reference in the system of question (v)? (ix) What is the average memory stalls per reference in the system of question (vi)? (x) What is the average memory stalls per reference in the system of question (vii)? (xi) What is the average memory stalls per instruction in the system of question (v)? (xii) What is the average memory stalls per instruction in the system of question (vi)? (xiii) What is the average memory stalls per instruction in the system of question (vii)? (xiv) What is the performance of the system of question (vii) versus that of question (v)? (Long-division challenged can give just the ratio.) [2.3] 7.4 170 (xv) What is the performance of the system of question (vi) versus that of question (v)? (Long-division challenged can give just the ratio.) 84/84,0 V # RUDERIC ### 5. Heaven's Gate What is the truth table for the following circuit? (It might be simpler to first write the equation) $(\overline{A \cdot B}) \cdot C =$ $A + \overline{B} + \overline{C}$ SPOINTS | A | В | С | Out | |---|---|---|-------| | 0 | 0 | 0 | J. T. | | 0 | 0 | 1 | 1 | | 0 | 1 | 0 | | | 0 | 1 | 1 | 0 | | 1 | 0 | 0 | | | 1 | 0 | 1 | 1 | | 1 | 1 | 0 | | | 1 | 1 | 1 | Ì | | IPOINT | |-------------| | PER | | CORRECT | | ANSWER, | | INLESS | | (NULTETO) | | ( CHA OID | | ( OUANSU | | ITHON ! | | SET SPOINTS | | OUTOF8 | | 0000 | ### 6. Off the Map ``` Given: typedef struct llnode { struct llnode *next; int contents; } llnode; ``` Implement map in C. map should take a pointer to an llnode and a map\_fun, which is just a pointer to a function taking one int argument and returning int. It should return a pointer to a **new** linked list, where the contents of each of the new nodes is equal to the value returned by the map function called on the contents of the corresponding old node. Don't forget to fill in the return value correctly (pointers to function are tricky so we've done the arguments for you). Assume llnode lists are terminated by a NULL pointer. You should **not** modify any part of the list identified by head. You may assume map\_fun has been properly declared by a typedef somewhere else. Ihode Head siff I head ) ( return NULL; I node Anew = malloc (SiZeof (Ilnde)); New > conferts = f (head > contents); return hew; return hew; ; (Note that to call a function pointer, you just use it like any other function. So, in the code template given above, saying int result = f(3); would correctly call the function pointed to by f with the value 3 and put the return value into result.) Return type - 2 Base Case - 4 Malloc - 6 Seffing conferts - 2 recursive call - 4 Peturn value - 2 iterative OK (2 points max if no mallor ### 7. Pay It Forward Consider the excerpt below of a 5-stage pipelined MIPS datapath. - a. Consider the following sequence of instructions - [A] srl \$zero, \$zero, 0 - [B] addu \$t0, \$t1, \$t2 - [C] addu \$t0, \$t0, \$t2 - [D] lw \$s0, 0(\$t3) - [E] subu \$t3, \$s0, \$t0 During which of these instructions' decode stages in the sequence above should ControlRS be 1 to avoid pipeline stalls? Use the labels [A], [B], ... b. Which fields of which instructions from part a does the control logic need to compute the value of ControlRS? Notmetions (to clet, destination, opcode of all instructions (RII-type) is of instructions -3 pts -no opcode/selecting by instruction type Exptif selection by into type M/s mentionly field's used) 10 ### 8. Bigger, Stronger, Faster: Suppose that you are running an algorithm for various problem sizes, and have obtained the data below. Sketch a weak scaling plot of parallel code performance that shows speedup over the serial implementation. Be sure to label the Y-axis. | Problem<br>Size | Gflop/s<br>(serial) | Threads | Gflop/s<br>(parallel) | | |-----------------|---------------------|---------|-----------------------|--| | 100 | 5 | 1 | 5 | | | 200 | 5 | 2 | 10 | | | 400 | 5 | 4 | 19 | | | 600 | 5 | 6 | 25 | | | 800 | 5 | 8 | 35 | | | 1000 | 5 | 10 | 36 | | | 1200 | 5 | 12 | 37 | | | 1400 | 5 | 14 | 37 | | | 1600 | 5 | 16 | 38 | | Rubric - 5 y-axis nonexistant/ wrong wi - 3 plot is above linear line Speedup Parallel Secial) Weak Scaling of Speedup over Serial Threads 16 | • | | | |---------------|------|--| | Student Name: | SID: | | ALTURUD STATES ## Question State Diagram Given the following timing chart, and assuming 0 set-up, hold, and propagation times, as well as a positive edge-triggered synchronous system. | edge-triggered synchronous system, | | | | | | | | | | | | | |------------------------------------|-----|---|---|---|---|---|---|---|---|---|--|---| | Reset | | | | | | | | | | | | | | In | XX | 1 | * | | 0 | | 0 | | 1 | | | | | State | XXX | 0 | | 0 | 0 | 1 | 1 | 1 | 1 | 0 | | | | Out | XX | 1 | | 1 | | Ø | 0 | 0 | 0 | 1 | | , | | Clk | | | | | | | | | | | | | Fill-in the transitions and outputs of the following State Diagram so it has the identical behavior to the timing diagram above: | tudent Name: | SID: | |------------------|------| | tititent ivanic. | | | | | (Oo THREES COMPANY Question Threesy 24 pts Consider the following datapath with an Arithmetic Logic Unit (ALU) and an eight-register register file organized around a single bus. The ALU is to apply add, subtract, etc. operations to its two input operands to generate an output result. The register file has an asynchronous read and a synchronous write. That is, as soon as the Read Enable (RE) is asserted, the register file selects the indicated 32-bit register and presents its value on the Data Out (DO). On the other hand, the Write Enable (WE) is sampled only on the rising edge of the clock, and only writes the indicated register from the Data In (DI) lines on the same edge that WE is asserted. The ALU and Register File share the Bus via a 32-bit wide 2:1 multiplexer. When SelALU is set to 1, the ALU path is connected to the Bus. Otherwise the Register File path is connected to the Bus. The datapath must support three-address instructions of the form $R_z \leftarrow R_x < op > R_v$ . To make use of a single bus architecture, the ALU can be surrounded by one, two, or three 32-bit temporary registers, labeled A, B, and C, as shown below (the temporary registers are shown as dotted lines – the correct solution requires at least one and possibly all three of the registers): | | CID. | |---------------|----------| | Student Name: | <br>SID: | Using the fewest of the A/B/C registers and possible clock cycles, what is the fewest number of each to implement the register transfer for the instructions of the three-address type (circle one for each): Registers Cycles For your answer, on the previous cross out the registers you don't need, and fill-in the outline of the registers that you do. For each cycle you need, write below the control signals that must be asserted to implement the register transfers for the three-address instructions: SelALLEO, RENA, Reg Rd LI, Reg Rd Slex, LDA, Cycle 2: Sel Alle O, Reg Rd LI, Reg Rd Sele Y, LDB, (during cycle 2, Regfiley > Tup B) (during cycle I, Reg Filex -> Tmp A) Cycle 3: All-1 Reglir 41, Reglir Sel-Z, ALU < OP ) Not shown on diagram during Cycle 3, All -> Reg File 2)