C is the de facto language for serious systems programming. Why? Most kernels expose their API through C. The Linux kernel (Love #ref-Love) and the XNU kernel (Inc. #ref-xnukernel), on which macOS is based, are written in C and have a C API (Application Programming Interface). The Windows kernel uses C++, but systems programming on Windows is much harder than on UNIX for novice systems programmers. C doesn’t have abstractions like classes and Resource Acquisition Is Initialization (RAII) to clean up memory. C also gives you much more of an opportunity to shoot yourself in the foot, but it lets you do things at a much more fine-grained level.
C was developed by Dennis Ritchie and Ken Thompson at Bell Labs back in 1973 (Ritchie #ref-Ritchie:1993:DCL:155360.155580). Back then, we had gems of programming languages like Fortran, ALGOL, and LISP. The goal of C was two-fold. Firstly, it was made to target the most popular computers at the time, such as the PDP-7. Secondly, it tried to remove some of the lower-level constructs (managing registers, and programming assembly for jumps), and create a language that had the power to express programs procedurally (as opposed to mathematically like LISP) with readable code. All this while still having the ability to interface with the operating system. It sounded like a tough feat. At first, it was only used internally at Bell Labs along with the UNIX operating system.
The first “real” standardization was with Brian Kernighan and Dennis Ritchie’s book (Kernighan and Ritchie #ref-kernighan1988c). It is still widely regarded today as the only portable set of C instructions. The K&R book is known as the de facto standard for learning C. There were different standards of C from ANSI to ISO, though ISO largely won out as a language specification. We will mainly focus on the POSIX C library, which extends ISO. Now, to address the elephant in the room: the Linux kernel is not fully POSIX compliant. Mostly, this is because the Linux developers didn’t want to pay the fee for compliance. It is also because they did not want to be fully compliant with a multitude of different standards, since that would mean increased development costs to maintain compliance.
We will aim to use C99, as it is the standard that most computers recognize, but sometimes use some of the newer C11 features. We will also talk about some off-hand features like getline because they are so widely used with the GNU C library. We’ll begin by providing a fairly comprehensive overview of the language with language facilities. Feel free to gloss over if you have already worked with a C based language.
Speed. There is little separating a program and the system.
Simplicity. C and its standard library comprise a simple set of portable functions.
Manual Memory Management. C gives a program the ability to manage its memory. However, this can be a downside if a program has memory errors.
Ubiquity. Through foreign function interfaces (FFI) and language bindings of various types, most other languages can call C functions and vice versa. The standard library is also everywhere. C has stood the test of time as a popular language, and it doesn’t look like it is going anywhere.
The canonical way to start learning C is by starting with the hello world program. The original example that Kernighan and Ritchie proposed way back when hasn’t changed.
#include <stdio.h>
int main(void) {
printf("Hello World\n");
return 0;
}
The #include directive takes the file stdio.h (which stands for standard input and output), located somewhere in your operating system, copies the text, and substitutes it where the #include was.
The int main(void) is a function declaration. The first word, int, tells the compiler the return type of the function. The part before the parentheses (main) is the function name. In C, no two functions can have the same name in a single compiled program, although shared libraries may be able to. Then, the parameter list comes after. When the parameter list of a regular function is given as (void), the compiler should produce an error if the function is called with a non-zero number of arguments. For regular functions, a declaration like void func() means that the function can be called like func(1, 2, 3), because an empty parameter list leaves the parameters unspecified. main is a special function. There are many ways of declaring main, but the standard ones are int main(void), int main(), and int main(int argc, char *argv[]).
printf("Hello World"); is a function call. printf is declared in stdio.h. The function has been compiled and lives somewhere else on our machine, in the C standard library. Just remember to include the header and call the function with the appropriate parameters (a string literal "Hello World"). If the newline isn’t included, the buffer may not be flushed (i.e. the write will not complete immediately).
return 0. main has to return an integer. By convention, return 0 means success and anything else means failure. Here are some exit codes / statuses with special meaning: http://tldp.org/LDP/abs/html/exitcodes.html. In general, assume 0 means success.
$ gcc main.c -o main
$ ./main
Hello World
$
gcc is short for the GNU Compiler Collection, which has a host of compilers ready for use. The compiler infers from the extension that you are trying to compile a .c file.
./main tells your shell to execute the program in the current directory called main. The program then prints out “Hello World”.
If systems programming were as easy as writing hello world, though, our jobs would be much easier.
What is the preprocessor? Preprocessing is a copy-and-paste operation that the compiler performs before actually compiling the program. The following is an example of substitution.
// Before preprocessing
#define MAX_LENGTH 10
char buffer[MAX_LENGTH];
// After preprocessing
char buffer[10];
There are side effects to the preprocessor, though. One problem is that the preprocessor needs to be able to tokenize properly, meaning that trying to redefine the internals of the C language with a preprocessor may be impossible. Another problem is that macros can’t be nested infinitely; there is a bounded depth where they need to stop. Macros are also simple text substitutions, without semantics. For example, look at what can happen if a macro tries to perform an inline modification.
#define min(a,b) a < b ? a : b
int main() {
int x = 4;
if(min(x++, 5)) printf("%d is six", x);
return 0;
}
Macros are simple text substitution, so the above example expands to
x++ < 5 ? x++ : 5
In this case, it is not obvious what gets printed out, but it will be 6. Can you try to figure out why? Also, consider the edge case when operator precedence comes into play.
int x = 99;
int r = 10 + min(99, 100); // r is 100!
// This is what it is expanded to
int r = 10 + 99 < 100 ? 99 : 100;
// Which means
int r = (10 + 99) < 100 ? 99 : 100;
There are also logical problems with the flexibility of certain parameters. One common source of confusion is with static arrays and the sizeof operator.
#define ARRAY_LENGTH(A) (sizeof((A)) / sizeof((A)[0]))
int static_array[10]; // ARRAY_LENGTH(static_array) = 10
int* dynamic_array = malloc(10); // ARRAY_LENGTH(dynamic_array) = 2 or 1 consistently
What is wrong with the macro? Well, it works if a static array is passed in, because sizeof a static array returns the number of bytes that the array takes up, and dividing it by sizeof(an_element) gives the number of entries. But if passed a pointer to a piece of memory, taking the sizeof the pointer and dividing it by the size of the first entry won’t always give us the size of the array.
C has an assortment of keywords. Here are some constructs that you should know briefly as of C99.
switch(1) {
case 1: /* Execution jumps to this case */
puts("1");
break; /* Jumps to the end of the block */
case 2: /* This case is skipped */
puts("2");
break;
} /* Continues here */
In the context of a loop, using break exits the innermost loop. The loop can be a for, while, or do-while construct.
while(1) {
while(2) {
break; /* Breaks out of while(2) */
} /* Jumps here */
break; /* Breaks out of while(1) */
} /* Continues here */
const int i = 0; // Same as "int const i = 0"
char *str = ...; // Mutable pointer to a mutable string
const char *const_str = ...; // Mutable pointer to a constant string
char const *const_str2 = ...; // Same as above
const char *const const_ptr_str = ...;
// Constant pointer to a constant string
But it is important to know that this is a compiler-imposed restriction only. There are ways of getting around it, although the C standard leaves the behavior undefined when a program modifies an object that was defined const, even if the program appears to run fine. In systems programming, the only type of memory that you can’t write to is system write-protected memory.
const int i = 0; // Same as "int const i = 0"
(*((int *)&i)) = 1; // Compiles, and i may read as 1 afterwards, but this is undefined behavior
const char *ptr = "hi";
*((char *)ptr) = '\0'; // Will likely cause a Segmentation Violation
int i = 10;
while(i--) {
if(1) continue; /* This gets triggered */
*((int *)NULL) = 0;
} /* Then reaches the end of the while loop */
int i = 1;
do {
printf("%d\n", i--);
} while (i > 10); /* Only executed once */
enum is used to declare an enumeration. An enumeration is a type that can take on a finite set of values. If you declare an enum without specifying any numbers, the C compiler will generate a unique number for each value (within the context of the current enum) and use that number for comparisons. The syntax to declare an instance of an enum is enum <type> varname. The added benefit of this is that the compiler can type check these expressions to make sure that you are only comparing alike types.
enum day{ monday, tuesday, wednesday,
thursday, friday, saturday, sunday};
void process_day(enum day foo) {
switch(foo) {
case monday:
printf("Go home!\n"); break;
// ...
}
}
It is completely possible to assign enum values to be either different or the same. If you assign numbers yourself, it is not advisable to rely on the compiler for consistent numbering. If you are going to use this abstraction, try not to break it.
enum day{
monday = 0,
tuesday = 0,
wednesday = 0,
thursday = 1,
friday = 10,
saturday = 10,
sunday = 0};
void process_day(enum day foo) {
switch(foo) {
case monday:
printf("Go home!\n"); break;
// ...
}
}
// file1.c
extern int panic;
void foo() {
if (panic) {
printf("NONONONONO");
} else {
printf("This is fine");
}
}
//file2.c
int panic = 1;
for (initialization; check; update) {
//...
}
// Typically
int i;
for (i = 0; i < 10; i++) {
//...
}
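As of the C89 standard, one cannot declare variables inside the for loop initialization block. This is because there was a disagreement in the standard for how the scoping rules of a variable defined in the loop would work. It has since been resolved with more recent standards, so people can use the for loop that they know and love today.
for(int i = 0; i < 10; ++i) {
//...
}
The order of evaluation for a for loop is as follows:
Perform the initialization statement.
Check the condition. If it is false, terminate the loop and continue with the next statement after it.
Perform the body of the loop.
Perform the update statement.
Jump back to checking the condition.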
goto is a keyword that allows you to jump to a labeled statement. Do not use goto in your programs. The reason is that it makes your code much harder to understand when strung together in multiple chains, which is called spaghetti code. It is acceptable to use in some contexts though, for example, error checking code in the Linux kernel. The keyword is usually used in kernel contexts when adding another stack frame for cleanup isn’t a good idea. The canonical example of kernel cleanup is below.
void setup(void) {
Doe *deer;
Ray *drop;
Mi *myself;
if (!setupdoe(deer)) {
goto finish;
}
if (!setupray(drop)) {
goto cleanupdoe;
}
if (!setupmi(myself)) {
goto cleanupray;
}
perform_action(deer, drop, myself);
cleanupray:
cleanup(drop);
cleanupdoe:
cleanup(deer);
finish:
return;
}
// (1)
if (connect(...))
return -1;
// (2)
if (connect(...)) {
exit(-1);
} else {
printf("Connected!");
}
// (3)
if (connect(...)) {
exit(-1);
} else if (bind(..)) {
exit(-2);
}
// (4)
if (connect(...)) {
exit(-1);
} else if (bind(..)) {
exit(-2);
} else {
printf("Successfully bound!");
}
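inline hints to the compiler that a call to the function may be replaced with the function body. Compilers are usually smart enough to know when to inline a function for you.
static inline int max(int a, int b) { // static gives this file its own definition
return a > b ? a : b;
}
int main() {
int a = 1, b = 2;
printf("Max %d", max(a, b));
// printf("Max %d", a > b ? a : b);
}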
void *memcpy(void * restrict dest, const void* restrict src, size_t bytes);
void add_array(int *a, int * restrict c) {
*a += *c;
}
int *a = malloc(3*sizeof(*a));
a[0] = 1; a[1] = 2; a[2] = 3;
add_array(a + 1, a); // Well defined
add_array(a, a); // Undefined
return is a control flow operator that exits the current function. If the function is void, then it simply exits the function. Otherwise, another parameter follows as the return value.
int process() {
if (connect(...)) {
return -1;
} else if (bind(...)) {
return -2;
}
return 0;
}
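signed is a modifier that is rarely needed, because integer types such as int are signed by default and need the unsigned modifier to make them unsigned. It may still be useful in cases where you want the compiler to default to a signed type, such as below.
int count_bits_and_sign(signed representation) {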
//...
}
char a = 0;
printf("%zu", sizeof(a++)); // sizeof does not evaluate a++, so a stays 0
// The compiler effectively rewrites the above as
char a = 0;
printf("%zu", 1);
The compiler is then allowed to operate on the result further. The compiler must have a complete definition of the type at compile-time, not link time, or else you may get an odd error. Consider the following:
// file.c
struct person;
printf("%zu", sizeof(struct person));
// file2.c
struct person {
// Declarations
};
This code will not compile, because the compiler cannot evaluate sizeof in file.c without knowing the full declaration of the person struct. That is typically why programmers either put the full declaration in a header file or abstract the creation and the interaction away so that users cannot access the internals of the struct. Additionally, if the compiler knows the full length of an array object, it will use that in the expression instead of having it decay into a pointer.
char str1[] = "will be 11";
char* str2 = "will be 8";
sizeof(str1) // 11 because it is an array
sizeof(str2) // 8 because it is a pointer (on a typical 64-bit machine)
Be careful when using sizeof for the length of a string!
// visible to this file only
static int i = 0;
static int _perform_calculation(void) {
// ...
}
char *print_time(void) {
static char buffer[200]; // Shared every time a function is called
// ...
}
struct hostname {
const char *port;
const char *name;
const char *resource;
}; // You need the semicolon at the end
// Assign each individually
struct hostname facebook;
facebook.port = "80";
facebook.name = "www.facebook.com";
facebook.resource = "/";
// You can use static initialization in later versions of C
struct hostname google = {"80", "www.google.com", "/"};
switch(/* char or int */) {
case INT1: puts("1");
case INT2: puts("2");
case INT3: puts("3");
}
If we give a value of 2, then:
switch(2) {
case 1: puts("1"); /* Doesn’t run this */
case 2: puts("2"); /* Runs this */
case 3: puts("3"); /* Also runs this */
}
One of the more famous examples of this is Duff’s device which allows for loop unrolling. You don’t need to understand this code for the purposes of this class, but it is fun to look at [2].
send(to, from, count)
register short *to, *from;
register count;
{
register n=(count+7)/8;
switch(count%8){
case 0: do{ *to = *from++;
case 7: *to = *from++;
case 6: *to = *from++;
case 5: *to = *from++;
case 4: *to = *from++;
case 3: *to = *from++;
case 2: *to = *from++;
case 1: *to = *from++;
}while(--n>0);
}
}
This piece of code highlights that switch statements are goto statements, and you can put any code on the other end of a switch case. Most of the time it doesn’t make sense, some of the time it just makes too much sense.
typedef float real;
real gravity = 10;
// Also typedef gives us an abstraction over the underlying type used.
// In the future, we only need to change this typedef if we
// wanted our physics library to use doubles instead of floats.
typedef struct link link_t;
// With structs, include the keyword 'struct' as part of the original types
In this class, we regularly typedef functions. A typedef for a function can look like this, for example:
typedef int (*comparator)(void*,void*);
int greater_than(void* a, void* b){
return a > b;
}
comparator gt = greater_than;
This declares a function pointer type comparator that accepts two void* params and returns an integer.
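A union is one piece of memory that multiple variables occupy. It is used to maintain consistency while having the flexibility to switch between types without maintaining functions to keep track of the bits. Consider an example where we have different pixel values.
union pixel {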
struct values {
char red;
char blue;
char green;
char alpha;
} values;
uint32_t encoded;
}; // Ending semicolon needed
union pixel a;
// When modifying or reading
a.values.red;
a.values.blue = 0x0;
// When writing to a file
fprintf(picture, "%d", a.encoded);
unsigned is a type modifier that forces unsigned behavior in the variables it modifies. unsigned can only be used with primitive integer types (like int and long). There is a lot of behavior associated with unsigned arithmetic. For the most part, unless your code involves bit shifting, it isn’t essential to know the difference in behavior between unsigned and signed arithmetic.
void is a keyword with two meanings. When used in a function or parameter definition, it means that the function explicitly returns no value or accepts no parameters, respectively. The following declares a function that accepts no parameters and returns nothing.
void foo(void);
The other use of void is when you are declaring a pointer. A void * pointer is just a memory address. It points to an incomplete type, meaning that you cannot dereference it, but it can be converted at any time to any other object pointer type. Pointer arithmetic on a void * is not defined by the C standard.
int *array = void_ptr; // No cast needed
int flag = 1;
pass_flag(&flag);
while(flag) {
// Do things unrelated to flag
}
Since the internals of the while loop have nothing to do with the flag, the compiler may optimize it to the following, even though a function may alter the data.
while(1) {
// Do things unrelated to flag
}
If you use the volatile keyword, the compiler is forced to keep the variable in and perform that check. This is useful for cases where you are writing multi-process or multi-threaded programs, so that we can affect the running of one sequence of execution with another.
There are many data types in C. As you may realize, all of them are either integers or floating point numbers and other types are variations of these.
char represents exactly one byte of data. The number of bits in a byte might vary. unsigned char and signed char are always the same size. This must be aligned on a boundary (meaning you cannot use bits in between two addresses). The rest of the types will assume 8 bits in a byte.
short (short int) must be at least two bytes. This is aligned on a two byte boundary, meaning that the address must be divisible by two.
int must be at least two bytes. Again aligned to a two byte boundary [5, P. 34]. On most machines this will be 4 bytes.
long (long int) must be at least four bytes, which are aligned to a four byte boundary. On some machines this can be 8 bytes.
long long must be at least eight bytes, aligned to an eight byte boundary.
float represents an IEEE-754 single precision floating point number tightly specified by IEEE [1]. This will be four bytes aligned to a four byte boundary on most machines.
double represents an IEEE-754 double precision floating point number specified by the same standard, which is aligned to the nearest eight byte boundary.
If you want a fixed width integer type, for more portable code, you may use the types defined in stdint.h, which are of the form [u]intwidth_t, where u (which is optional) represents the signedness, and width is any of 8, 16, 32, and 64.