min
/
ucl_project_y3


			
							123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304
							% !TeX root = index.tex
\iffalse
This chapter looks specifically at your results.
* You measured some samples. 
What values did you measure? 
Present them in a table or graph? 
How did you test whether they were good measurements? 
Were you looking to improve something? 
Are your new samples better than the old ones?

* You built a device; 
what tests did you run to make sure that it is running correctly?

* You calculated something or developed a new theory about something. 
How do you know how well it predicts? 
What tests did you run? 
What comparisons with the literature did you make?
* You coded or simulated something. 
What tests did you run to be sure it was working correctly? 

Describe what you want the reader to notice in the results. 
Give the facts, then give your analysis of the facts.
Present your graphs, figures, tables, photos, and equations needed to show what you accomplished.
Label everything clearly, using the recommendations given below in “Things to Look For”
\fi

\subsection{FPGA logic component composition}
This subsection describes the testing and results which finds how much FPGA logic components each processor takes and what is composition of each part.

Testing was performed with Quartus synthesis tool by recording flow summary report data. This report includes synthesised design metrics including total logic elements, registers, memory bits and other FPGA resources. In this testing, only parameters that were recorded are logic elements and registers. Number of resources was found by synthesising full processor, then commenting relevant parts of code, re-synthesising and viewing changes in the report. Such method may not be the most accurate, because during HDL synthesis, circuit is optimised as unused connections removed. This means that more of the logic than commented may be not synthesised. 

There are four parts of each processor that will be tested: 
\begin{enumerate}
	\item \textbf{Common} - processor auxiliary logic that is used by both processors. It includes the communication block with UART, RAM and PLL (Phase-Locked Loop, for master clock generation). 
	\item \textbf{ALU} - as described in section \ref{subsec:alu}, both processors have slightly different implementation of ALU.
	\item \textbf{Memory} - the processors memory management, including stack.
	\item \textbf{Other} - reminding processor logic that was not analysed.
\end{enumerate}

\begin{colfigure}
	\centering
	\includegraphics[width=\linewidth]{../tests/fpga_comp.eps}
	\captionof{figure}{Bar graph of FPGA logic components taken by each processor.}
	\label{fig:fpga_comp}
\end{colfigure}

The test results are shown in figures \ref{fig:fpga_comp} and \ref{fig:fpga_reg_comp}. The common logic uses 293 logic elements and 170 registers. OISC uses 1705 logic elements, while RISC uses 3218. Excluding common logic, OISC takes 48.3\% of RISC's logic elements.


\begin{colfigure}
	\centering
	\includegraphics[width=\linewidth]{../tests/fpga_reg_comp.eps}
	\captionof{figure}{Bar graph of FPGA register resources taken by each processor.}
	\label{fig:fpga_reg_comp}
\end{colfigure}

OISC uses 726 logic elements, while RISC uses only 407. Excluding common logic, OISC uses 78.4\% more registers than RISC.

Looking at the composition, OISC ALU takes 30.2\% more logic gates. Figure \ref{fig:fpga_reg_comp} shows a high number of OISC ALU registers. This concludes that higher resource usage in OISC ALU code must be source and destination logic.

Memory logic element composition of OISC is only 34.4\% of RISC's and 7\% lower for register resources, comparing to RISC. This indicates that by removing memory logic for RISC, synthesis tool may removed also other parts of processor, possibly part of control block because it mostly contains combinational logic.

Other logic includes instruction decoding with ROM, register file, program counter. RISC exclusively has control block. Note that OISC uses only three ROM memory blocks whereas RISC uses four as explained in section \ref{subsec:memory}, however this should make a minimal difference as M9K memory blocks are not included in FPGA logic element or register count. Comparing both processors, OISC has only 37\% of other logic components to RISC, however it has 2.28 times more registers. This shows a logic component - register trade-off. OISC source and destination logic requires more registers, whereas RISC uses combinational logic in the control block in order to control the same data in the datapath. 

The much higher number of logic components in RISC can be also explained more complicated register file, ROM memory logic and program counter. All of these components have some additional logic for timing correction or other extra functionality required by these block integration into a datapath.

\subsection{Power analysis}

Power analysis was performed to analyse power consumption of both processors.
This has been accomplished by connecting FPGA board to a laboratory power supply with 4V to an external power input. A shunt resistor of 1.020$\Omega$ was connected in series to calculate current. Supply voltage and voltage across shut resistor were measured using an oscilloscope with a data sampling feature. Three tests have been performed with different processor configurations. Between each test a period of about 5 minutes was given for FPGA to reach steady state. 


\begin{colfigure}
	\centering
	\includegraphics[width=\linewidth]{../tests/power.eps}
	\captionof{figure}{Measured power of processors when implemented on FPGA, running 16bit multiplication function in loop. None indicates auxiliary-only power.}
	\label{fig:power}
\end{colfigure}

Figure \ref{fig:power} represents power results. First configuration is "None" or auxiliary-only power, which includes the whole FPGA board, voltage regulators, and synthesised logic on FPGA required to support a processor (such as PLL, UART, Input/Output control, RAM). RISC and OISC bars in the graph indicate processor implementations on FPGA, each running a multiplication program in a loop. These values also include auxiliary power plus processor power, which means that the processor itself takes relatively small amount comparing to auxiliary power, about 0.5\%. Result shows that OISC require 0.4\%, which including noise is almost insignificant result.

During this test clock frequency of 1MHz was used. Due to equipment unavailability, any further tests were not carried out to investigate power consumption at different frequencies. Due to constant noise, running at higher frequency may result in significant difference between processors.

\subsubsection{Activity Factor}\label{subsec:activity_factor}
An activity factor could be also found using Equation \ref{eq:activity_factor} where $P$ is power, $C_{total}$ is total gate capacitance and $V_{DD}$ is voltage supplied to the transistors.
\begin{align}\label{eq:activity_factor}
\alpha = \frac{P}{C_{total}\cdot f \cdot V_{DD}^2}
\end{align}
As $C_{total}$ and $V_{DD}$ are constants, measuring power at different frequencies allows finding activity factor. This value could be used to compare how much of a processor circuit is active. Further design improvements could be used to optimise power \autocite{8682289,7363689,1207041,6972455}.


\subsection{Benchmark Programs}
A number of programs have been written to test both processors. These involve simple functions that could be commonly used in a 8bit processors:

\begin{description}
	\item[$\bullet$ Printing:] Sends data to UART. It includes waiting until UART is available for transmission. 
	\item[$\bullet$ Printing unsinged integer:] Uses binary-coded decimal algorithm to convert 8 or 16bit binary value to decimal value and print it. 
	\item[$\bullet$ 16bit multiplication:] Uses simple matrix multiplication. 
	\item[$\bullet$ 16bit division:] Uses Long division algorithm to divide two 16bit numbers, result including a reminder.
	\item[$\bullet$ 16bit modulo:] Uses "Russian Peasant Multiplication" algorithm to perform Modulo operation with two 16bit numbers.
	\item[$\bullet$ Prime number calculator:] Uses Sieve of Atkins algorithm \autocite{morain_1989} to calculate primer number, operates on 16bit numbers and utilise 16bit multiplication and modulo functions. 
\end{description}


\subsubsection{Instruction composition}\label{subsec:instr_comp}

This test is performed to investigate instruction composition of each function to see how similar it is between RISC and OISC processors. 
\begin{description}
	\item[$\bullet$ MOVE] - All instructions that move data around internal processor registers.
	\item[$\bullet$ ALU] - Instructions that are used to perform ALU operation.
	\item[$\bullet$ MEMORY] - Instructions that are required to send/retrieve data from system memory, except stack.
	\item[$\bullet$ STACK] - Instructions that push/pop data from memory stack.
	\item[$\bullet$ COM] - Instruction(s) that send/receive data from communication block.
	\item[$\bullet$ BRANCH] - Instructions that are used to make program branching.
	\item[$\bullet$ OTHER] - Any other instructions.
\end{description}

\begin{blockpage}
	\arrayrulecolor{black}
	\begin{tabular}{| c | p{0.65\linewidth} |} \hline 
		\rowcolor[rgb]{0.82,0.82,0.82}
		Name & Instructions \\\hline
		MOVE & \texttt{MOVE, CPY0, CPY1, CPY2, CPY3, CI0, CI1, CI2} \\\hline
		ALU & \texttt{%
			ADD, ADDI,
			SUB, SUBI,
			AND, ANDI,
			OR, ORI,
			XOR, XORI,
			DIV, MUL,
			ADDC, SUBC,
			INC, DEC,
			SLL, SRL, 
			SRA, GETAH
		} \\\hline
		MEMORY & \texttt{LWLO, LWHI, SWLO, SWHI} \\\hline
		STACK  & \texttt{PUSH, POP} \\\hline
		COM & \texttt{COM} \\\hline
		BRANCH & \texttt{BEQ, BGT, BGE, BZ, JUMP, CALL, RET} \\\hline
		\arrayrulecolor[rgb]{0,0,0}\hline
	\end{tabular}
	\captionof{table}{RISC processor instruction groups used in instruction composition test.}
	\label{tab:instr_groups_risc}
\end{blockpage}

\begin{blockpage}
	\arrayrulecolor{black}
	\begin{tabular}{| c | p{0.65\linewidth}|} \hline 
		\rowcolor[rgb]{0.82,0.82,0.82}
		Name & Destination \\\hline
		\arrayrulecolor[rgb]{0.82,0.82,0.82}
		MOVE & \texttt{REG0, REG1} \\\hline
		ALU & \texttt{ALU0, ALU1} \\\hline
		MEMORY & \texttt{MEM0, MEM1, MEM2, MEMLO, MEMHI} \\\hline
		STACK  & \texttt{STACK}\\\hline
		COM & \texttt{COMA, COMD}\\\hline
		BRANCH & \texttt{BR0, BR1, BRZ}\\\hline
		\arrayrulecolor[rgb]{0,0,0}\hline
	\end{tabular}
	\captionof{table}{OISC processor instruction desination groups used in instruction composition test}
	\label{tab:instr_groups_oisc_dst}
\end{blockpage}

\begin{blockpage}
	\arrayrulecolor{black}
	\begin{tabular}{| c | p{0.65\linewidth} |} \hline 
		\rowcolor[rgb]{0.82,0.82,0.82}
		Name & Instructions \\\hline	
		\arrayrulecolor[rgb]{0.82,0.82,0.82}	
		MOVE & \texttt{ALU0, ALU1, REG0, REG1, PC0, PC1, NULL, IMMEDIATE} \\\hline
		ALU & \texttt{ADD, ADDC, SUB, SUBC,
		AND, OR, XOR, SLL,
		SRL, EQ, GT, GE, NE,
		LT, LE, MULLO, MULHI, DIV, MOD,
		ADC, SBC, ROL, ROR} \\\hline
		MEMORY & \texttt{MEM0, MEM1, MEM2, MEMLO, MEMHI} \\\hline
		STACK  & \texttt{STACK} \\\hline
		COM & \texttt{COMA, COMD} \\\hline
		BRANCH & \texttt{BR0, BR1} \\\hline
		\arrayrulecolor[rgb]{0,0,0}\hline
	\end{tabular}
	\captionof{table}{OISC processor instruction source groups used in instruction composition test}
	\label{tab:instr_groups_oisc_src}
\end{blockpage}

Each function was executed on a simulated processor, program counter and instruction were recorded into file at every cycle. File recording was accomplished with SytemVerilog test bench. Start of a recording was triggered when program counter matched \texttt{.start} location and stopped when it matched \texttt{.done} location. Code shown in Listing \ref{code:asmtest} enabled both locations to be static and not depend on test function that was executed.

\begin{blockpage}
	\begin{lstlisting}[frame=single, caption={Assembly frame for executring tests}, emph={setup,start,done}, label={code:asmtest}]
setup:
  JUMP .start
.done:
  JUMP .done
.start:
  ; Setup values
  ; Call function
  JUMP .done
	\end{lstlisting}
\end{blockpage}

\begin{figure*}[t]
	\centering
	\includegraphics[width=\linewidth]{../tests/instr_comp.eps}
	\caption{Graph of instruction composition for every benchmark program.}
	\label{fig:instr_comp}
\end{figure*}


Each recorded file with function composition was then further analysed and each instruction was grouped. Recorded program counter was used to find effective program space. This has been achieved by calculating unique instances of program counter and summing up instruction size for each of them. In RISC, dynamic instruction size has been taken into account. 

From the results in Figure \ref{fig:instr_comp}, few key differences can be seen. Across every test, OISC has significantly more \textit{BRANCH} destination and \textit{MOVE} source groups. \textit{BRANCH} group can be explained by emulated \texttt{CALL}, \texttt{RET} and \texttt{JUMP} instruction explained in section \ref{subsec:oisc_pc}.
High number of \textit{MOVE} source group instructions may be explained by using the immediate values as a separate source, where RISC uses instructions that can integrate immediate as extra word, such as instruction \texttt{ADDI}. In most cases \textit{ALU} group instructions are also higher than for OISC comparing to RISC. This shows a lower OISC ALU efficiency, mostly due to a need to move data into the separate accumulators.

\subsubsection{Performance}
This subsection investigates time and clock cycles to run benchmark programs. The simulation was performed to find a number of cycles required to execute each function. Note that prime number calculator was not simulated due to too complex dynamic nature of program. 

Print 16bit decimal and modulo operation were executed with different input arguments. This allows to see the worst and the best case scenarios as algorithms length depend on inputs. This is not the case for 16bit multiplication as its implementation has no branching, therefore no execution time dependence on the inputs.

Results are shown in Figure \ref{fig:cycles}. In most of the cases, OISC requires around 55-67\% more instructions, with some exceptions.

\begin{colfigure}
	\centering
	\includegraphics[width=\linewidth]{../tests/cycles.eps}
	\captionof{figure}{Simulated results of cycles that taken to perform function.}
	\label{fig:cycles}
\end{colfigure}

Another set of benchmarks have been performed and on both processors once they been implemented on the FPGA. Time taken to perform each set has been recorded. This has been done via UART connection, a single character was sent to indicate the start and the stop of a benchmark. In order to void a slight timing variation due low baud rate of UART or system kernel scheduler unpredictability to process UART input, each benchmark was performed with many iterations. Figure \ref{fig:timing} represents results.

\begin{colfigure}
	\centering
	\includegraphics[width=\linewidth]{../tests/timing.eps}
	\captionof{figure}{Time taken to perform each benchmark on FPGA at 1MHz clock.}
	\label{fig:timing}
\end{colfigure}

Results indicate that on average OISC takes about 71\% longer to execute same benchmark. This is close to results found with simulation. Prime number calculator have taken 3.26 times longer.

Benchmarks include:
\begin{description}
	\item[$\bullet$ Prime Numbers:] Calculate every prime number between 5 to 65536.  
	\item[$\bullet$ Multipy:] 16bit multiplication iterated 65536 times.
	\item[$\bullet$ Modulo 0010h:] 16bit \textit{0010h} modulo that operated on every number between 0 and 65536.
	\item[$\bullet$ Modulo FFFFh:] 16bit \textit{FFFFh} modulo that operated on every number between 0 and 65536.
	\item[$\bullet$ BDC:] Encoded 16bit binary to ASCII decimal number without printing.
\end{description}


\subsubsection{Program space}

Data collected from previous instruction composition results were also used to find effective program size.
Effective program size only includes instruction that been executed depending on argument, meaning that it does not fully represent complete function. A specific input to a function might cause branching and avoiding some function code, which would not be added to effective program size. In this test, the main objective is to look difference in instruction size required to execute the same function, therefore not representing full program size is irrelevant. 
\begin{colfigure}
	\centering
	\includegraphics[width=\linewidth]{../tests/program_size.eps}
	\captionof{figure}{Bar graph showing effective size in bits each benchmark function is taking in program memeory.}
	\label{fig:program_size}
\end{colfigure}

Figure \ref{fig:program_size} represents an effective program size for each test function. On average, OISC instructions take 41.71\% more space which is to be expected.

\subsection{Maximum clock frequency}
In order to find maximum clock frequency, processors were loaded with basic print string function and 16bit multiplication. Then, frequency was constantly increased until resulting output though UART was not correct. 

In order to change clock frequency, three parameters were changed and HDL code resynthesised: 
\begin{description}
	\item[$\bullet$] \textbf{PLL frequency multiplier and divider:}
	PLL takes 50MHz clock and converts it to master clock $f_{mclk}$. Multiplier and divider values are used to adjust $f_{mclk}$.
	
	\item[$\bullet$] \textbf{UART frequency divider:}
	Division value was calculated as $D = \left \lfloor \frac{f_{mclk}}{4 f_{baud}} \right \rfloor$. UART rate was set to 9600 baud. UART module itself has four times oversample. 
\end{description}
Frequency was changed in 5MHz increments. 

The theoretical maximum frequency was found using Quartus Timing Analysis tool. Slow 1200mV 85$^{\circ}$C model was used. 

\begin{center}
	\begin{tabular}{ l | c | c  }
		     & Theoretical & Actual \\ \hline
		RISC & 114.08MHz & 75-70MHz \\ \hline
		OISC & 64.68MHz & 45-40MHz \\
	\end{tabular}
	\captionof{table}{Theoretical and actual maximum frequencies of both processors.}
	\label{tab:max_freq}
\end{center}

Theoretical and actual results show unexpected results shown in Table \ref{tab:max_freq}, RISC operated at about 40\% higher maximum frequency than OISC.

As explained in Subsection \ref{subsec:oisc_cell_issue}, OISC logic blocks takes approximately half the time for data propagation. Keeping that in mind, and assuming that latch propagation and register setup periods are insignificant to critical path of OISC logic block, maximum OISC frequency could be twice as high, reaching 80-90MHz. This also assumes that there is no other part of processor would have limit. Further timing analysis needs to be carried out to confirm this.

\subsection{Future work}

RISC has more sophisticated logic for various processor components. It is expected to see RISC having better results due to its higher optimisation. OISC should be implemented with multiple data \& instruction buses. This could be performed with minimal corrections on hardware, however would require many changes in assembly programs. \nameref{subsec:instr_comp} results show that OISC takes more instructions to store values in accumulators, which could benefit from multi-bus parallelisation. Adding a single additional bus should halve benchmark times, which would produce more comparable to RISC. In addition, multi-bus OISC can perform truly parallel programs assuming it has enough processor resources to perform operations (for example operate different ALU operations at the same time). This potentially would be dominant feature over RISC in time-sensitive programs, GPIO (General Purpose Input/Output) and interrupt handling. 

Additional buses would not greatly increase processor logic element size, especially when using interconnect optimisation techniques \autocite{1207041,6972455}. Matching processor complexity should also allow more fair and direct comparison specifically between two architectures. 

A number of other improvements and future research are proposed:
\begin{enumerate}
	\item Perform more tests on power analysis with different frequencies. Find the activity factor described in Subsection \ref{subsec:activity_factor}.
	\item Further investigate maximum frequency. Try to resolve OISC timing issue and repeat maximum frequency test. This would allow to prove or disprove theorised higher frequency capabilities for OISC. 
	\item Design a higher level language compiler such as BASIC or C. This would allow performing more complicated programs which would more closely relate to microcontroller operations. However, OISC compiler would need extra optimisation layer to efficiently organise instructions.
	\item Compare proposed processor designs with other commercially available 8-bit processors such as Atmel AVR microcontrollers, Motorola 6800 family and Microchip PIC.
\end{enumerate}