Amdahl's law: It states that the small portion of a program which can't be parallelized will limit the overall speedup available from parallelization. Any large mathematical or engineering problem typically consists of several parallelizable parts and several non-parallelizable (sequential) parts. The maximum speedup is given by Amdahl's law: S = 1/(1-P), where S = speedup and P = the parallel fraction of the program. More generally, with N processors the speedup is S = 1/((1-P) + P/N), which approaches the 1/(1-P) limit as N grows. (A short C sketch evaluating this formula appears further below.)

Stages of execution in a processor: Fetch, Decode, Execute, Memory and Writeback.

What does the following code do?

xor eax,eax    ; clear the result register
mov ebx,data   ; your input data
mov cl,bits    ; number of bits
loop:
ror ebx,1      ; rotate the lowest bit of ebx out into the carry flag
rcl eax,1      ; rotate the carry flag in as the lowest bit of eax
dec cl
jnz loop

A: Reverses the order of bits. (A C translation is given further below.)

Superscalar: A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. The performance improvement available from superscalar techniques is limited by two key areas:
1. The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism, and
2. The complexity and time cost of the dispatcher and associated dependency-checking logic.

* Cache block: the minimum unit of information that can be present in the cache (several contiguous memory positions)
* Cache hit: the requested data can be found in the cache
* Cache miss: the requested data cannot be found in the cache

Placing a block: Placement set = (block address) mod (# sets), where (# sets) = (cache size in blocks)/(# ways). Suppose we need to place block 10 in a cache of 8 blocks:
* Directly mapped (1-way, 8 sets): 10 mod 8 = set 2
* 2-way set associative (4 sets): 10 mod 4 = set 2
* 4-way set associative (2 sets): 10 mod 2 = set 0
* Fully associative (8-way in this case, so a single set): the block can be placed in any frame

How can a block be found? The block address comprises three fields: Tag, Index and Block Offset.
* Index: determines the set (there is no index in fully associative caches)
* Block Offset: determines the offset within the block
* Tag: the block's unique id, its "primary key"
Each place in the cache records a block's tag (as well as its data). Of course, a place in the cache may be unoccupied, so each place usually also maintains a valid bit. So, to find a block in the cache:
1. Use the index of the block address to determine the place (or set of places)
2. For that place (or each place in the set), check that the valid bit is set and compare the tag with that of the block address; this can be done in parallel for all places in a set
(A C sketch of this placement and lookup appears further below.)

Bus contention: Bus contention occurs when more than one memory module attempts to access the bus simultaneously. It can be reduced by using a hierarchical bus architecture.

Pipelining: Pipelining is a process in which instructions are processed stage by stage. Instructions pass through the stages in sequence, each stage performing one step of the operation, so with n stages up to n operations are in progress at the same time, one per stage. Pipelining is done to increase the throughput of the processor: once the pipeline is full, an instruction completes every cycle, even though each individual instruction still passes through all n stages.
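As a worked illustration of pipeline throughput (a standard calculation, assuming one cycle per stage and no stalls): if k instructions flow through an n-stage pipeline, the first result is ready after n cycles and one more completes every cycle after that, so the total time is n + (k-1) cycles instead of the n*k cycles taken without pipelining. Speedup = n*k/(n + k - 1), which approaches n for large k. For example, 1000 instructions through the five-stage Fetch/Decode/Execute/Memory/Writeback pipeline take 5 + 999 = 1004 cycles rather than 5000, a speedup of about 4.98.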
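Returning to Amdahl's law from the start of this section, the following is a minimal C sketch (my own illustration, with an assumed parallel fraction of 0.95) that evaluates the general formula for several processor counts:

#include <stdio.h>

/* Amdahl's law: speedup with parallel fraction p on n processors. */
double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void) {
    double p = 0.95;  /* assume 95% of the program is parallelizable */
    for (int n = 1; n <= 1024; n *= 4)
        printf("n=%4d  speedup=%.2f\n", n, speedup(p, n));
    /* the S = 1/(1-P) bound: 20x here, no matter how many processors */
    printf("limit = %.2f\n", 1.0 / (1.0 - p));
    return 0;
}

Even at n = 1024 the speedup is only about 19.6x: the 5% sequential part caps it at 20x, which is the point of the law.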
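The bit-reversal quiz above also translates naturally into C. This sketch (my translation; the function name is mine) mirrors the assembly loop: each iteration moves the current lowest bit of the input into the bottom of the result, pushing earlier bits upward, which reverses the bit order:

#include <stdio.h>
#include <stdint.h>

/* Reverse the low 'bits' bits of 'data', like the ror/rcl loop above. */
uint32_t reverse_bits(uint32_t data, unsigned bits) {
    uint32_t result = 0;                     /* xor eax,eax */
    while (bits--) {
        result = (result << 1) | (data & 1); /* rcl eax,1 takes the carried-out bit */
        data >>= 1;                          /* ror ebx,1 (a shift suffices here) */
    }
    return result;
}

int main(void) {
    printf("%#x\n", reverse_bits(0xD, 4));   /* 1101 -> 1011, prints 0xb */
    return 0;
}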
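The cache placement and lookup rules above can be made concrete in C. This is a minimal sketch under assumed parameters (an 8-block, 2-way set associative cache with 16-byte blocks; all names are mine), showing how an address is split into tag/index/offset and how a set is searched:

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define BLOCK_SIZE 16                      /* bytes per block */
#define NUM_BLOCKS 8                       /* block frames in the cache */
#define NUM_WAYS   2                       /* 2-way set associative */
#define NUM_SETS   (NUM_BLOCKS / NUM_WAYS) /* (# sets) = (cache size)/(# ways) */

struct place {          /* one place in the cache */
    bool     valid;     /* is this place occupied? */
    uint32_t tag;       /* the block's "primary key" */
};

struct place cache[NUM_SETS][NUM_WAYS];

bool lookup(uint32_t addr) {
    uint32_t block = addr / BLOCK_SIZE;  /* drop the block offset */
    uint32_t set   = block % NUM_SETS;   /* index: selects the set */
    uint32_t tag   = block / NUM_SETS;   /* the remaining high bits */
    /* step 2: real hardware compares all ways of the set in parallel */
    for (int way = 0; way < NUM_WAYS; way++)
        if (cache[set][way].valid && cache[set][way].tag == tag)
            return true;   /* cache hit */
    return false;          /* cache miss */
}

int main(void) {
    /* Place block 10: 10 mod 4 = set 2, as in the 2-way example above. */
    cache[10 % NUM_SETS][0] = (struct place){ true, 10 / NUM_SETS };
    printf("addr 160: %s\n", lookup(160) ? "hit" : "miss"); /* 160/16 = block 10 */
    printf("addr 336: %s\n", lookup(336) ? "hit" : "miss"); /* 336/16 = block 21 */
    return 0;
}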
SMT: Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. The Intel Pentium 4 was the first modern desktop processor to implement simultaneous multithreading, starting with the 2.8 GHz model released in 2002, and SMT has since been introduced into a number of their processors. Intel calls the functionality Hyper-Threading Technology (HTT), and it provides a basic two-thread SMT engine. Intel claims up to a 30% speed improvement compared with an otherwise identical, non-SMT Pentium 4. The performance improvement seen is very application-dependent, and some programs actually slow down slightly when HTT is turned on, due to increased contention for resources such as bandwidth, caches, TLBs and re-order buffer entries.

SMT is one of the two main implementations of multithreading, the other form being temporal multithreading. In temporal multithreading, only one thread of instructions can execute in any given pipeline stage at a time. In simultaneous multithreading, instructions from more than one thread can be executing in any given pipeline stage at a time. This is done without great changes to the basic processor architecture: the main additions needed are the ability to fetch instructions from multiple threads in a cycle, and a larger register file to hold data from multiple threads. The number of concurrent threads is decided by the chip designers, but practical restrictions on chip complexity have limited the number to two for most SMT implementations.
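To make the fetch-stage difference concrete, here is a toy C sketch (entirely my own illustration, not modeled on any real processor) contrasting the two policies for a 4-wide superscalar front end: temporal multithreading fills all of a cycle's fetch slots from one thread, while SMT can fill one cycle's slots from both threads:

#include <stdio.h>

#define WIDTH 4                      /* superscalar fetch slots per cycle */

unsigned pc[2] = { 0x1000, 0x2000 }; /* one program counter per hardware thread */

/* Temporal multithreading: all slots in a cycle come from a single thread. */
void fetch_temporal(int cycle) {
    int t = cycle % 2;               /* alternate threads cycle by cycle */
    printf("cycle %d:", cycle);
    for (int slot = 0; slot < WIDTH; slot++)
        printf(" T%d@%#x", t, pc[t]++);
    printf("\n");
}

/* SMT: a single cycle's slots can mix instructions from both threads. */
void fetch_smt(int cycle) {
    printf("cycle %d:", cycle);
    for (int slot = 0; slot < WIDTH; slot++) {
        int t = slot % 2;            /* round-robin across the slots */
        printf(" T%d@%#x", t, pc[t]++);
    }
    printf("\n");
}

int main(void) {
    printf("temporal multithreading:\n");
    for (int c = 0; c < 2; c++) fetch_temporal(c);
    printf("simultaneous multithreading:\n");
    for (int c = 0; c < 2; c++) fetch_smt(c);
    return 0;
}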