[Weekly Review] 2019/12/09-15

Published: December 15, 2019 by SingularityKChen (Last updated: February 10, 2020 )

Categories:
WeeklyReview 81

2019/12/09-15

2019/12/09-15

Last week, I finished the draft version of the matrix sum with Chisel and became half-familiar with it. Also, I tried to write a functional test with Chisel-tester2. Unfortunately, due to the complex and unstable environment, I can not check all the syntax or run the functional test, but I tried to use the environment in Jupyter to run a simple test and two small submodules had been proved right. To use Chisel and Chipyard's components, I studied one sample project named SHA3. If I have more time, I'd back here to restudy that.

Fortunately, while I was programming the matrix sum project, I realized the feeling of distinguishing the difference of Scala types and hardware types. I mean, I knew why we can utilize the features of Scala to boost our efficiency of design our circuits and when we should create a hardware register to store our mediate data, why the hardware type UInt can not be translated to the Scala type Int. Just because Scala is the hardware generator, it's a software programming language, its types exist when we build the project. But as for hardware types, they only exist when the circuits running. So we can not get an unknown value and translate it to a must-know value.

So this week, I'll try to finish the related C files and run the simulation on FPGA boards. Also, I need to learn some Scala programming skills to enhance my Chisel coding ability.

Git Command

some useful command and the explanation of git. the work flow

the workflow

Workspace
Index/Stage
Repository
Remote

Build a git project

Create one new project

git init [project-name]

Config

show current git config: git config --list

add/remove files

These files will be added or removed to Index.

git rm/add [file1] [file2]

git mv [file-original] [file-renamed]

Commit codes from `Index` to `Repository`

These commands will update all the changes from Index to Repository.

git commit -m 'yourmessage'

git commit -v to show all the differences.

Push codes from `Repository` to `Remote`

git push [remote] [branch] to push one of the local branches.

git pull [remote] [branch] to get the Remote codes into your workspace.

Show Information

git status to check all the changed files

git log to show the current branch's changelog

git diff to show the differences between the Workspace and Index

git diff HEAD to show the difference between the Workspace and Repository

git diff --shortstat "@{0 day ago}" to show your daily coding lines number

git branch: show the current branches

git remote -v show the remote information

Other Git Commands

git pull origin master: download the newest project from remote and merge it into the current branch.

git stash: ignore the current changes and then can switch to another branch.

git checkout [branch]: switch to another branch

git checkout -b [new branch]: switch to a new branch

git merge [branch]: merge the branch to master

git branch -d [branch]: The -d option stands for --delete, which would delete the local branch, only if you have already pushed and merged it with your remote branches

git branch -D [branch]: The -D option stands for --delete --force, which deletes the branch regardless of its push and merge status

Matrix Sum Accelerator

This is the lab project in fall 2013 of UCB.

Chisel3 Syntax

We can check the syntax at its official website. Here lists some high efficient and frequently used syntaxes. And also I printed Chisel3 Cheat Sheet which greatly save my time.

Vec
Seq
Queen
DecoupledIO
PriorityEncoder
SyncReadMem
Flipped

(Linux) C Syntax

`asm volatile`

asm volatile is the inline assembly that used in language Linux C. asm is the keyword of gcc and volatile means gcc don't need to optimize the assemble codes. There is a brief introduction of this. And I found this term while I was trying to test the custom extensive instructions with RoCC interface. This syntax can run our instruction without change the compiler. Plus, it can help our C program to run several commands in order with the help of asm volatile("" ::: "memory").

This syntax also has a feature named Atomic. That means this command only has to states in runtime: either success or failure.

create a C matrix with 2-D Array

Define and initialize the matrix:

double matrix[10][15];
//initialize the static matrix
for (i=0;i<10;i++)
{
    for(j=0;j<15;j++)
    {
        matrix[i][j]=0;
    }
}


// define the struct, need a pointer. m means row and n means column
typedef struct
{
    double **mat;
    int m, n;
}matrix;

//apply for memory space, use malloc()
void initial(matrix &T,int m,int n)
{
    int i = 0;
    T.mat = (double **)malloc(m*sizeof(double *));
    for (i = 0; i < m; m++)
    {
        T.mat[i] = (double *)malloc(n*sizeof(double));
    }
    T.m = m;
    T.n = n;
}

//initialize the matrix with the zero element.
void initValue(matrix &T, int m, int n)
{
    int i, j;
    initial(T,m,n);
    for (i = 0; i < m; i++)
    {
        for (j = 0; j < n; j++)
        {
            T.mat[i][j] = 0;
        }
    }
}

//free the memory space
void destroy(matrix &T)
{
    int i;
    for (i = 0; i < T.m; i++)
    {
        free(T.mat[i]);
    }
    free(T.mat);
}
//finished

Pointer and address access

This blog describes and interprets the usage of * and & to get the address of one data or get the corresponding value of the address.

RISC-V SPEC

Chipyard

`SHA3`

I mimicked SHA3, which also acts as one official sample to use RoCC. To write my accelerator, I had to study the SHA3 project.

If one wants to harness RoCC interface to add extensive instructions, he must design his Chisel hardware project firstly and then write some C tests to prove this project works. Then modify some C files to let the compiler utilize your new instructions.

`SHA3`-Chisel

`sha3.scala`

This is the top module of SHA3 project.

RoCCinSHA3

`ctrl.scala`

This is the most complex submodule of SHA3 project. It contains how to decode the instructions, as well as three finite-state machines. One (rocc_s) shows the states of the communication between the control module and the CPU(or receives instructions from CPU and uses ready-valid signals to show some states of the accelerator); One (mem_s) shows the states of the communication between the control module and the memory(or receives data from memory and send data back to memory and also ready-valid signals); The last one (state) shows the states of the communication between the control module and its submodules.

dpath.scala

This is the data path of the whole module.

dmem.scala

This is the module responsible for transform data between memory and the accelerator. So we can use this module with just a slight modifying.

dcache_req

dcache_resp

`SHA3`-test

`Hwacha`

Hwacha is another more complex sample of custom extensive instructions. Plus, it also used as a component to realize the RISC-V Standard Vector instructions. But it's quite complex and time tiny, so I didn't learn it thoroughly last week.

`Hwacha`-Chisel

`RoCC` and extension instruction

`RoCC` component

RoCC interface includes two main IO, one is for processor core and the other is for cache or memory. Due to the feature of Scala, we can use io.cmd and io.mem to show this. And both IOs own several kinds of signals, such as io.cmd.valid io.cmd.ready and io.cmd.bits, also the io.mem.req and io.mem.resp.

RoCC interface

We can see it in more details from the photo below. NOTE: this graph is a little old as it was published in 2015, and there are several slight changes.

RoCC 2015

extension instruction with `RoCC`

The table below shows the RoCC instruction format.

RoCC instruction format

`RoCC`-Chisel

RoCC IO

Instruction test with C

Tests for example Rocket Custom Coprocessors

`HellaCache`

Chisel-Tester2

main-page

Hammer

main-page

Computer Architecture

Memory Fence and Memory Barrier

Page Table Walk

PTW Introduction

Translation Lookaside Buffer

TLB Introduction

With the help of TLB, the CPU may access the physical address with just one accessment. And it contains both some copies of the page table and the physical address.

TLB passthrough

When I looked through the TLB.scala I found a valuable named passthrough

Cache Hierarchy

I found the following information when I tried to understand the meaning of memory tag. I met this term while I was trying to mimic the control module of SHA3. Because it extends from HellaCache, which require a tag. Although finally, I found I can give out tag casually, I learned a lot from the structure of cache including TLB.

Cache hierarchy of the K8 core in the AMD Athlon 64 CPU

The picture above illustrates the abstract architecture of the cache hierarchy of one AMD CPU, from which we can see TLB and Cache. Both these two components contain two levels and save instructions and data separately.

If we see memory hierarchy excluding TLB, then it looks like the following picture:

Cortex-A53 memory hierarchy

And because of the dramatically booming speeds of high-level caches, they can improve the performance of the CPU. We can regard the cache is a method to improve the R/W speed of main memory, the CPU register is a method to improve that of the cache.

cache speed

So how can the cache controller know whether the data we need is included in current cache? So tag comes.

the usage of tag

We can see from the graph above, there is a tag array and its corresponding data array. Firstly, the controller will use the index to find a cache line and then compares the tag array, then bingo.

Sinaean Dean introduced the two usages of cache tag:

For caches parallelly connected, using tags as a selection signal in a MUX to choose the right data.
Know whether cache hits or misses.

If the cache miss, then it will ask a lower lever cache for the data or instructions.

cooperation between different levels of caches

Memory Read (Synchronous and Asynchronous)

This blog describes the synchronous memory – SSRAM, which is controlled by the clock and can not read or write at the same clock cycle. But if we use asynchronous memory, then we can get the read data within one clock and if we use a combinational/asynchronous-read, sequential/synchronous-write memory, then the read-after-write hazards are not an issue.

Some Trouble

`psutil`

ERROR INFO: No module named psutil.

But I have installed this module. Then via some blogs, I find that might be the version of python that excursive doesn't install the module. So I changed the default version of python from 2.7 to 3.5 and reinstall it with pip3.

Import classes into `IntelliJ IDEA`

I use some modules of chipyard, which means I have to import that project into my project.

In the beginning, I just add the files which contain the Scala files directly, but it didn't work at all. Because the packages and classes are defined at a more high level.

By accidentally, I included the highest file (generator in this case) and it succeeds.

Import reliance into `sbt` project

lazy val matrixsum = conditionalDependsOn(project in file("/home/singularity/chipyard/generators/matrixsum"))
				.dependsOn(rocketchip, chisel_testers, sifive_blocks, sifive_cache, utilities, midasTargetUtils)
				.settings(commonSettings)

But that's works on the top module, the dependson must be a subfile of the current build.sbt directory.

Can not find the lvy local lib

~~Haven't solved yet~~

Solved at Dec 28th by republishing it with the value of version := 1.2-SNAPSHOT in build.sbt.

1 targets failed
adder.compileClasspath 
Resolution failed for 1 modules:
--------------------------------------------
				edu.berkeley.cs:treadle_2.12:1.2-SNAPSHOT 
				not found: /home/singularity/.ivy2/local/edu.berkeley.cs/treadle_2.12/1.2-SNAPSHOT/ivys/ivy.xml
				not found: https://repo1.maven.org/maven2/edu/berkeley/cs/treadle_2.12/1.2-SNAPSHOT/treadle_2.12-1.2-SNAPSHOT.pom
				not found: https://oss.sonatype.org/content/repositories/releases/edu/berkeley/cs/treadle_2.12/1.2-SNAPSHOT/treadle_2.12-1.2-SNAPSHOT.pom