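Preface

This post focuses on how the Linux system loads programs, and touches on a few other parts of the operating system along the way.

Android releases and the Linux kernel versions they shipped with: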

Android Version	API Level	Linux Kernel in AOSP
1.5 Cupcake	3	2.6.27
1.6 Donut	4	2.6.29
2.0/1 Eclair	5-7	2.6.29
2.2.x Froyo	8	2.6.32
2.3.x Gingerbread	9, 10	2.6.35
3.x.x Honeycomb	11-13	2.6.36
4.0.x Ice Cream Sandwich	14, 15	3.0.1
4.1.x Jelly Bean	16	3.0.31
4.2.x Jelly Bean	17	3.4.0
4.3 Jelly Bean	18	3.4.39
4.4 Kit Kat	19, 20	3.10
5.x Lollipop	21, 22	3.16.1
6.0 Marshmallow	23	3.18.10
7.0 Nougat	24	4.4.1
7.1 Nougat	25	4.4.1
8.0 Oreo	26	4.10
8.1 Oreo	27	4.10
9.0 Pie	28	4.4, 4.9 and 4.14

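Source: https://en.wikipedia.org/wiki/Android_version_history

The system I am analyzing is Android 8.0, but the Android source tree I have at hand does not include the kernel, so the analysis below uses the 4.4 kernel source available online: https://www.androidos.net.cn/kernel/4.4/xref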

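Background

Syscalls sit between user space and kernel space. This arrangement:

Improves kernel security: kernel operations are fully isolated, and the kernel decides which interfaces to expose and who may call them.
Makes porting easier.

Because system calls differ across operating systems, OS versions, and architectures, a runtime library layer is added to make code portable at the source level. Source portability rests on each language's runtime library: C's read, for example, is implemented differently in each system's runtime library, but calling code only needs to include its declaration and use it; at link time the linker finds that platform's library and binds to its implementation. This process only involves the runtime library files and the compiler/linker, which is why cross-compilation works.
The idea is to implement the common subset of system calls as library files on every platform, so that upper-level code only uses the language's library and can be built into executables for different platforms.
There is actually one more layer between the runtime library and the trap: wrapper functions, C functions that each operating system provides to make (assembly-level) system calls convenient. Source code that uses the wrappers is portable across architectures on the same OS; the wrappers hide architecture differences.
Runtime library (hides OS and architecture) > wrapper function (hides architecture) > system call (architecture-specific assembly following the OS's conventions)
Bypassing the C runtime library and the wrappers and issuing system calls by hand is a very effective way to confuse IDA's analysis.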
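To make the layering concrete, here is a minimal sketch, assuming a Linux system whose C library provides the generic `syscall()` wrapper (glibc and bionic both do). It obtains the process ID twice: once through the ordinary library wrapper, and once by passing the raw system call number. The function names `pid_via_wrapper` and `pid_via_syscall_number` are made up for this illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>   /* SYS_getpid: the per-architecture syscall number */
#include <sys/types.h>
#include <unistd.h>        /* getpid(), syscall() */

/* Path 1: the C library wrapper hides the number and the trap instruction. */
long pid_via_wrapper(void)
{
    return (long)getpid();
}

/* Path 2: we pass the syscall number ourselves; syscall() loads the
 * registers and executes the architecture's trap instruction for us. */
long pid_via_syscall_number(void)
{
    return syscall(SYS_getpid);
}
```

Both functions should return the same value in the same process. Going one layer lower still (inline assembly with `svc` or `syscall`) is what the hand-rolled approach mentioned above amounts to.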

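On 32-bit systems, kernel space is the top 1 GB of the address space and the rest is user space. Applications enter kernel space through system calls. On 64-bit systems user and kernel space each get a far larger range (128 TiB each on x86-64 with 48-bit virtual addresses; the exact size depends on architecture and configuration), with an unused hole in between.
User space belongs to the process, so it changes on every process switch; kernel space is mapped by the kernel, does not follow the process, and stays fixed. Kernel addresses have their own page-table entries, while each user process has its own page tables.
Because user and kernel code need to exchange data, giving every process's virtual address space both parts makes that exchange more convenient.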

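Figure: x86 segmented-paged addressing
Compared with x86, ARM has no segmentation, so its addressing looks flat, more like real mode; protection between tasks comes from the page tables that the OS sets up.
I am still fuzzy on how address translation and execution modes differ between ARM and x86, and on where the line falls between what the CPU provides and what the OS implements, so for now I treat ARM as being the same as x86.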

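The kernel's virtual memory is fixed: the kernel part of every process's virtual address space maps to the same physical memory. Through this part the kernel can reach all physical memory; 32-bit systems need the high-memory mechanism for that, while on 64-bit the virtual address space is large enough that it is unnecessary.

Let us ignore Linux's support for segmented memory mapping. In protected mode, whether the CPU is running in user mode or kernel mode, every address the program accesses is a virtual address; only the page tables differ.
So why can't a user-mode process simply switch itself into kernel mode? Because the switch has to go through the trap mechanism, which makes entering kernel mode a restricted request: only a system call number present in the table (a function the system has chosen to expose) gets to run. Switching to kernel mode is thus tied to performing a specific operation; without an operation for the kernel to carry out as a black box, there is no way into kernel mode.

Next, we trace the whole path in detail: from the runtime library, to the interrupt table, to execution in the kernel.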

map

execv

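Android's runtime library is bionic, Google's own C runtime library, which invokes system calls directly. (I suddenly wonder how graphics rendering is implemented; I'll look into it when I have time.)
The C library's system and popen both end up calling the exec wrapper. In the exec family only execve is a true system call; the others are library functions layered on top of it.

Function declaration:
int execve(const char *filename, char *const argv[], char *const envp[]);
On success the function does not return; on failure it returns -1 with the reason stored in errno. It runs a new program in the current process.
man: http://man7.org/linux/man-pages/man2/execve.2.html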

 execve() executes the program pointed to by filename.  This causes
       the program that is currently being run by the calling process to be
       replaced with a new program, with newly initialized stack, heap, and
       (initialized and uninitialized) data segments.
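The current program is replaced, with a new stack, heap, and data segments. Note that no new process is created: the PID stays the same, and the process's memory image is replaced with the new program's.
If the executable is a dynamically linked ELF file, the interpreter named in the PT_INTERP segment is used to load the required shared objects.
Note that when execve runs an ELF file that has PT_INTERP, the dynamic linker is loaded first, and the dynamic linker then loads and relocates everything else. Control goes first to the linker's entry point, and the linker then hands it to the ELF's entry point.
See the analysis below for details.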
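Since execve replaces the calling program instead of creating a process, it is normally paired with fork. The following is a minimal sketch, assuming a POSIX system with `/bin/sh` present; the helper name `run_and_wait` and the fixed environment are choices made for this illustration only.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `path` with `argv` in a child process and return its exit status,
 * or -1 if fork/waitpid fails or the child was killed by a signal. */
int run_and_wait(const char *path, char *const argv[])
{
    char *const envp[] = { "PATH=/usr/bin:/bin", NULL };
    pid_t pid;
    int status;

    assert(path != NULL && argv != NULL);

    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Child: on success execve() never returns, the image is replaced. */
        execve(path, argv, envp);
        /* Only reached on failure; errno holds the reason. */
        _exit(127);
    }
    /* Parent: exec did not create a new process, so the child's PID is
     * unchanged and waitpid() on it still works. */
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Calling it with `{"sh", "-c", "exit 7", NULL}` on `/bin/sh` yields 7; a nonexistent path makes execve fail in the child, which then exits with 127.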

Wrapper function

Declaration:
bionic\libc\include\unistd.h

int execve(const char* __file, char* const* __argv, char* const* __envp);

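Usually a wrapper has a C implementation under \bionic\libc\bionic that then calls a symbol in C form, and that symbol is implemented in each platform's .S file. This one is special and uses the .S file directly.

Implementation (the wrapper differs per architecture):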
bionic\libc\arch-arm64\syscalls\execve.S

#include <private/bionic_asm.h>
ENTRY(execve)
    mov     x8, __NR_execve
    svc     #0
    cmn     x0, #(MAX_ERRNO + 1)
    cneg    x0, x0, hi
    b.hi    __set_errno_internal
    ret
END(execve)

bionic\libc\arch-x86_64\syscalls\execve.S

#include <private/bionic_asm.h>
ENTRY(execve)
    movl    $__NR_execve, %eax
    syscall
    cmpq    $-MAX_ERRNO, %rax
    jb      1f
    negl    %eax
    movl    %eax, %edi
    call    __set_errno_internal
1:
    ret
END(execve)

Some macro definitions:

bionic\libc\kernel\uapi\asm-generic\unistd.h
#define __NR_execve 221
bionic\libc\kernel\uapi\asm-x86\asm\unistd_32.h
#define __NR_execve 11
bionic\libc\kernel\uapi\asm-x86\asm\unistd_64.h
#define __NR_execve 59
bionic\libc\private\bionic_asm.h
#define ENTRY_NO_DWARF(f) \
    .text; \
    .globl f; \
    .balign __bionic_asm_align; \
    .type f, __bionic_asm_function_type; \
    f: \
    __bionic_asm_custom_entry(f); \
	
#define END_NO_DWARF(f) \
    .size f, .-f; \
    __bionic_asm_custom_end(f) \
	
#define ENTRY(f) \
    ENTRY_NO_DWARF(f) \
    .cfi_startproc \
	
#define END(f) \
    .cfi_endproc; \
    END_NO_DWARF(f) \

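The implementation on every platform is essentially the same: load the system call number, trigger the trap, and check whether the call succeeded. These are .S assembly files of the kind seen in a microcomputer principles course: roughly, sections are selected by hand (.xxxx) and symbol names are defined as labels (f:).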
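The success check at the end of both stubs follows one convention: the kernel returns a negated errno value in the range [-MAX_ERRNO, -1] on failure, and the stub converts that into a -1 return with errno set. Below is a small C model of that tail, written for illustration; `finish_syscall` is a hypothetical name, and the real work is done in assembly and `__set_errno_internal`.

```c
#include <assert.h>
#include <errno.h>

#define MAX_ERRNO 4095   /* same value the kernel and bionic use */

/* Model of the tail of a syscall stub: `raw` is what the kernel left in the
 * return register. Values in [-MAX_ERRNO, -1] are negated error codes. */
long finish_syscall(long raw)
{
    /* Unsigned comparison, like "cmpq $-MAX_ERRNO, %rax; jb" on x86_64. */
    if ((unsigned long)raw >= (unsigned long)-MAX_ERRNO) {
        errno = (int)-raw;   /* the job of __set_errno_internal */
        return -1;
    }
    return raw;
}
```

For example, a raw return of `-ENOENT` becomes -1 with errno set to ENOENT, while 42 or even -4096 is passed through untouched because it lies outside the error range.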

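Interrupt handling

The interrupt vector table is skipped here. In short: the vector table is consulted, control jumps to the handler, and system calls are dispatched by number.
The switch to kernel mode and the stack switch are also skipped; I'll analyze them when needed.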
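Dispatching by number can be pictured as indexing an array of function pointers. The toy below is a user-space sketch of that idea only; the table contents, the `toy_` names, and the three-argument signature are invented for illustration and are not the kernel's actual definitions.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*syscall_fn)(long a0, long a1, long a2);

/* Two stand-in "system calls"; real handlers live in the kernel. */
static long toy_getanswer(long a0, long a1, long a2)
{
    (void)a0; (void)a1; (void)a2;
    return 42;
}

static long toy_add(long a0, long a1, long a2)
{
    (void)a2;
    return a0 + a1;
}

/* The table: index = syscall number, value = handler. Holes stay NULL. */
static const syscall_fn toy_table[] = {
    [1] = toy_getanswer,
    [3] = toy_add,
};
#define TOY_NR_SYSCALLS (sizeof(toy_table) / sizeof(toy_table[0]))

/* What the trap handler conceptually does with the number from user space. */
long toy_dispatch(unsigned long nr, long a0, long a1, long a2)
{
    if (nr >= TOY_NR_SYSCALLS || toy_table[nr] == NULL)
        return -ENOSYS;   /* number not exposed by the "kernel" */
    return toy_table[nr](a0, a1, a2);
}
```

Only numbers present in the table can run, which is the restriction described earlier: user code chooses a number, never an address.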

System call

sys-execve

kernel/arch/arm/kernel/calls.S

/* 10 */	CALL(sys_unlink)
		CALL(sys_execve)
		CALL(sys_chdir)
		CALL(OBSOLETE(sys_time))	/* used by libc4 */
		CALL(sys_mknod)

This is the system call table; entry 11 refers to the symbol sys_execve.

Symbol declaration:
/kernel/include/linux/syscalls.h

asmlinkage long sys_execve(const char __user *filename,
		const char __user *const __user *argv,
		const char __user *const __user *envp);

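asmlinkage is a function attribute macro. On 32-bit x86 it tells gcc that the function takes all of its arguments from the stack rather than from registers; on most other architectures it adds little beyond C linkage.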

/kernel/include/uapi/asm-generic/unistd.h

/* arch/example/kernel/sys_example.c */
#define __NR_clone 220
__SYSCALL(__NR_clone, sys_clone)
#define __NR_execve 221
__SC_COMP(__NR_execve, sys_execve, compat_sys_execve)

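Number 221 does correspond to sys_execve.
The location named in the comment may not actually contain the implementation, which suggests the implementation is shared across architectures. For example, following that path leads to:
/arch/arm/kernel/sys_arm.c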

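#include <linux/fs.h>

Adding a system call only requires adding the CALL entry and the implementation.
Another way to find the implementation is to copy the kernel source locally and search globally for:

SYSCALL_DEFINEn(xxxxx

System calls are defined with macros.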
fs/exec.c

SYSCALL_DEFINE3(execve,
		const char __user *, filename,
		const char __user *const __user *, argv,
		const char __user *const __user *, envp)
{
	return do_execve(getname(filename), argv, envp);
}

This macro is fairly involved, so I won't expand it here.

do-execve

fs/exec.c

int do_execve(struct filename *filename,
	const char __user *const __user *__argv,
	const char __user *const __user *__envp)
{
	struct user_arg_ptr argv = { .ptr.native = __argv };
	struct user_arg_ptr envp = { .ptr.native = __envp };
	return do_execveat_common(AT_FDCWD, filename, argv, envp, 0);
}

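The pointers to the argv and envp arrays, and every pointer inside those arrays, live in the user-space part of the virtual address space. The kernel therefore has to be careful when accessing user-space memory, and the __user annotation lets automated tools check that all such accesses are handled properly.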

do-execveat-common

fs/exec.c

/*
 * sys_execve() executes a new program.
 */
 //user_arg_ptr wraps a user-space pointer
static int do_execveat_common(int fd, struct filename *filename,
			      struct user_arg_ptr argv,
			      struct user_arg_ptr envp,
			      int flags)
{
	char *pathbuf = NULL;
	struct linux_binprm *bprm;//holds information about the executable
	struct file *file;
	struct files_struct *displaced;
	int retval;//return value
	//check that the filename is valid
	if (IS_ERR(filename))
		return PTR_ERR(filename);
	/*
	 * We move the actual failure in case of RLIMIT_NPROC excess from
	 * set*uid() to execve() because too many poorly written programs
	 * don't check setuid() return code.  Here we additionally recheck
	 * whether NPROC limit is still exceeded.
	 */
	 //check whether the user's process count exceeds the limit
	if ((current->flags & PF_NPROC_EXCEEDED) &&
	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
		retval = -EAGAIN;
		goto out_ret;
	}
	/* We're below the limit (still or again), so we don't want to make
	 * further execve() calls fail. */
	current->flags &= ~PF_NPROC_EXCEEDED;
	//give the process a private copy of its file table if it was shared; displaced receives the old files_struct so it can be restored on error or released on success
	retval = unshare_files(&displaced);
	if (retval)
		goto out_ret;
	retval = -ENOMEM;
	//allocate and zero the linux_binprm
	bprm = kzalloc(sizeof(*bprm), GFP_KERNEL);
	if (!bprm)
		goto out_files;
	retval = prepare_bprm_creds(bprm);
	if (retval)
		goto out_free;
	check_unsafe_exec(bprm);
	current->in_execve = 1;
	//open the file (via do_filp_open); returns a file structure
	file = do_open_execat(fd, filename, flags);
	retval = PTR_ERR(file);
	if (IS_ERR(file))
		goto out_unmark;
	//pick the least-loaded CPU to run on
	sched_exec();
	bprm->file = file;
	if (fd == AT_FDCWD || filename->name[0] == '/') {
		bprm->filename = filename->name;
	} else {
		if (filename->name[0] == '\0')
			pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d", fd);
		else
			pathbuf = kasprintf(GFP_TEMPORARY, "/dev/fd/%d/%s",
					    fd, filename->name);
		if (!pathbuf) {
			retval = -ENOMEM;
			goto out_unmark;
		}
		/*
		 * Record that a name derived from an O_CLOEXEC fd will be
		 * inaccessible after exec. Relies on having exclusive access to
		 * current->files (due to unshare_files above).
		 */
		if (close_on_exec(fd, rcu_dereference_raw(current->files->fdt)))
			bprm->interp_flags |= BINPRM_FLAGS_PATH_INACCESSIBLE;
		bprm->filename = pathbuf;
	}
	bprm->interp = bprm->filename;
	//bprm_mm_init allocates the mm_struct for the new program's address space
	retval = bprm_mm_init(bprm);
	if (retval)
		goto out_unmark;
	//number of arguments and environment variables
	bprm->argc = count(argv, MAX_ARG_STRINGS);
	if ((retval = bprm->argc) < 0)
		goto out;
	bprm->envc = count(envp, MAX_ARG_STRINGS);
	if ((retval = bprm->envc) < 0)
		goto out;
	//prepare_binprm sets up the credentials and reads the first 128 bytes of the executable into bprm->buf
	retval = prepare_binprm(bprm);
	if (retval < 0)
		goto out;
	retval = copy_strings_kernel(1, &bprm->filename, bprm);
	if (retval < 0)
		goto out;
	//copy the strings from user space into the kernel
	bprm->exec = bprm->p;
	retval = copy_strings(bprm->envc, envp, bprm);
	if (retval < 0)
		goto out;
	retval = copy_strings(bprm->argc, argv, bprm);
	if (retval < 0)
		goto out;
	//run the new program
	retval = exec_binprm(bprm);
	if (retval < 0)
		goto out;
	/* execve succeeded */
	current->fs->in_exec = 0;
	current->in_execve = 0;
	acct_update_integrals(current);
	task_numa_free(current);
	free_bprm(bprm);
	kfree(pathbuf);
	putname(filename);
	if (displaced)
		put_files_struct(displaced);
	return retval;
out:
	if (bprm->mm) {
		acct_arg_size(bprm, 0);
		mmput(bprm->mm);
	}
out_unmark:
	current->fs->in_exec = 0;
	current->in_execve = 0;
out_free:
	free_bprm(bprm);
	kfree(pathbuf);
out_files:
	if (displaced)
		reset_files_struct(displaced);
out_ret:
	putname(filename);
	return retval;
}

The comments above should suffice; the main job is filling in the bprm structure.
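The `count()` calls above determine argc and envc by walking a NULL-terminated pointer array and refusing arrays that are too long. The following is a simplified user-space model of that loop, not the kernel function itself: it reads ordinary pointers instead of user-space ones, so the kernel's fault handling and signal checks are left out, and the name `count_strings` is made up.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Count entries of a NULL-terminated pointer array, refusing arrays with
 * more than `max` entries. Returns the count or -E2BIG. */
int count_strings(const char *const *vec, int max)
{
    int i = 0;

    if (vec == NULL)
        return 0;
    while (vec[i] != NULL) {
        if (i >= max)
            return -E2BIG;   /* argument list too long */
        i++;
    }
    return i;
}
```

A negative result is propagated the same way `do_execveat_common` propagates `bprm->argc < 0`: the exec is abandoned before any part of the old image has been torn down.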

binprm

include/linux/binfmts.h

/*
 * This structure is used to hold the arguments that are used when loading binaries.
 */
struct linux_binprm {
	char buf[BINPRM_BUF_SIZE];// first 128 bytes of the executable
#ifdef CONFIG_MMU
	struct vm_area_struct *vma;
	unsigned long vma_pages;
#else
# define MAX_ARG_PAGES	32
	struct page *page[MAX_ARG_PAGES];
#endif
	struct mm_struct *mm;
	unsigned long p; /* current top of mem */
	unsigned int
		cred_prepared:1,/* true if creds already prepared (multiple
				 * preps happen for interpreters) */
		cap_effective:1;/* true if has elevated effective capabilities,
				 * false if not; except for init which inherits
				 * its parent's caps anyway */
#ifdef __alpha__
	unsigned int taso:1;
#endif
	unsigned int recursion_depth; /* only for search_binary_handler() */
	struct file * file; //the file to execute
	struct cred *cred;	/* new credentials */
	int unsafe;		/* how unsafe this exec is (mask of LSM_UNSAFE_*) */
	unsigned int per_clear;	/* bits to clear in current->personality */
	int argc, envc;//number of arguments and environment variables
	const char * filename;	/* Name of binary as seen by procps */
	const char * interp;	/* Name of the binary really executed. Most
				   of the time same as filename, but could be
				   different for binfmt_{misc,script} */
	unsigned interp_flags;
	unsigned interp_data;
	unsigned long loader, exec;
};

This structure holds the information about the executable in a uniform way, regardless of its format.

prepare-binprm

int prepare_binprm(struct linux_binprm *bprm)
{
    int retval;
    bprm_fill_uid(bprm);
    retval = security_bprm_set_creds(bprm);
    if (retval)
        return retval;
    bprm->cred_prepared = 1;
    memset(bprm->buf, 0, BINPRM_BUF_SIZE);
    return kernel_read(bprm->file, 0, bprm->buf, BINPRM_BUF_SIZE);
}

It sets the uid and gid the program will run with. What is read into buf is not the whole file: BINPRM_BUF_SIZE is only 128 bytes.

exec-binprm

static int exec_binprm(struct linux_binprm *bprm)
{
	pid_t old_pid, old_vpid;
	int ret;
	/* Need to fetch pid before load_binary changes it */
	old_pid = current->pid;
	rcu_read_lock();
	old_vpid = task_pid_nr_ns(current, task_active_pid_ns(current->parent));
	rcu_read_unlock();
	ret = search_binary_handler(bprm);
	if (ret >= 0) {
		audit_bprm(bprm);
		trace_sched_process_exec(current, old_pid, bprm);
		ptrace_event(PTRACE_EVENT_EXEC, old_vpid);
		proc_exec_connector(current);
	}
	return ret;
}

It calls search_binary_handler(), which scans the formats list of linux_binfmt entries and tries each load_binary function in turn; once one of them loads the file successfully, the scan stops.

search-binary-handler

#define printable(c) (((c)=='\t') || ((c)=='\n') || (0x20<=(c) && (c)<=0x7e))
/*
 * cycle the list of binary formats handler, until one recognizes the image
 */
int search_binary_handler(struct linux_binprm *bprm)
{
	bool need_retry = IS_ENABLED(CONFIG_MODULES);
	struct linux_binfmt *fmt;//linux_binfmt describes one executable format
	//the kernel has one struct linux_binfmt for each executable format it supports
	int retval;
	/* This allows 4 levels of binfmt rewrites before failing hard. */
	if (bprm->recursion_depth > 5)
		return -ELOOP;
	retval = security_bprm_check(bprm);
	if (retval)
		return retval;
	retval = -ENOENT;
 retry:
	read_lock(&binfmt_lock);
	list_for_each_entry(fmt, &formats, lh) {
		if (!try_module_get(fmt->module))
			continue;
		read_unlock(&binfmt_lock);
		bprm->recursion_depth++;
		retval = fmt->load_binary(bprm);
		read_lock(&binfmt_lock);
		put_binfmt(fmt);
		bprm->recursion_depth--;
		if (retval < 0 && !bprm->mm) {
			/* we got to flush_old_exec() and failed after it */
			read_unlock(&binfmt_lock);
			force_sigsegv(SIGSEGV, current);
			return retval;
		}
		if (retval != -ENOEXEC || !bprm->file) {
			read_unlock(&binfmt_lock);
			return retval;
		}
	}
	read_unlock(&binfmt_lock);
	if (need_retry) {
		if (printable(bprm->buf[0]) && printable(bprm->buf[1]) &&
		    printable(bprm->buf[2]) && printable(bprm->buf[3]))
			return retval;
		if (request_module("binfmt-%04x", *(ushort *)(bprm->buf + 2)) < 0)
			return retval;
		need_retry = false;
		goto retry;
	}
	return retval;
}

search_binary_handler walks the formats list. For an ELF executable on Linux it reaches elf_format and calls its load_binary pointer, which is in fact load_elf_binary.

Note that the kernel developers are very fond of macros:
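The "try each handler until one does not say -ENOEXEC" pattern can be shown with a toy in user space. Everything here is illustrative: the two handlers only look at magic bytes and load nothing, and the `toy_` names and the array (instead of the kernel's linked list) are assumptions of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define HDR_SIZE 128   /* mirrors BINPRM_BUF_SIZE in the 4.4 kernel */

struct toy_binfmt {
    const char *name;
    int (*load_binary)(const unsigned char *hdr);  /* 0 or -ENOEXEC */
};

static int toy_load_elf(const unsigned char *hdr)
{
    return memcmp(hdr, "\177ELF", 4) == 0 ? 0 : -ENOEXEC;
}

static int toy_load_script(const unsigned char *hdr)
{
    return (hdr[0] == '#' && hdr[1] == '!') ? 0 : -ENOEXEC;
}

static const struct toy_binfmt toy_formats[] = {
    { "elf", toy_load_elf },
    { "script", toy_load_script },
};

/* Try each handler in order; stop at the first one that does not answer
 * -ENOEXEC. Returns the handler's name, or NULL if nobody recognized hdr. */
const char *toy_search_handler(const unsigned char hdr[HDR_SIZE])
{
    size_t i;

    for (i = 0; i < sizeof(toy_formats) / sizeof(toy_formats[0]); i++) {
        if (toy_formats[i].load_binary(hdr) != -ENOEXEC)
            return toy_formats[i].name;
    }
    return NULL;
}
```

The real function differs in important ways (locking, module loading, recursion for scripts), but the control flow around `-ENOEXEC` is the same idea.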

/**
 * list_for_each_entry	-	iterate over list of given type
 * @pos:	the type * to use as a loop cursor.
 * @head:	the head for your list.
 * @member:	the name of the list_struct within the struct.
 */
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, typeof(*pos), member);	\
	     &pos->member != (head); 	\
	     pos = list_entry(pos->member.next, typeof(*pos), member))
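The macro works because the list node is embedded inside the payload structure, and the payload is recovered from the node by subtracting the member's offset. Here is a minimal user-space sketch of that technique; it avoids the GNU `typeof` extension by spelling the type out, and `toy_fmt`, `toy_container_of`, and the helper names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>   /* offsetof */

struct list_head { struct list_head *next, *prev; };

/* Recover the enclosing struct from a pointer to its embedded member. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_fmt {
    int id;
    struct list_head lh;   /* embedded node, like linux_binfmt.lh */
};

/* An empty circular list is a head that points at itself. */
void toy_list_init(struct list_head *head)
{
    head->next = head->prev = head;
}

void toy_list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the circular list until we are back at the head, summing ids. */
int toy_sum_ids(struct list_head *head)
{
    int sum = 0;
    struct list_head *p;

    for (p = head->next; p != head; p = p->next)
        sum += toy_container_of(p, struct toy_fmt, lh)->id;
    return sum;
}
```

The loop condition `p != head` plays the role of `&pos->member != (head)` in the kernel macro: the head itself carries no payload and only marks where the circle closes.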

linux-binfmt

include/linux/binfmts.h

struct linux_binfmt {
	struct list_head lh;
	struct module *module;
	int (*load_binary)(struct linux_binprm *);//sets up a new execution environment for the current process from the information stored in the executable
	int (*load_shlib)(struct file *);
	int (*core_dump)(struct coredump_params *cprm);
	unsigned long min_coredump;	/* minimal dump size */
};
//the ELF format definition; designated initializers assign members by name with a leading dot
static struct linux_binfmt elf_format = {
    .module      = THIS_MODULE,
    .load_binary = load_elf_binary,
    .load_shlib      = load_elf_library,
    .core_dump       = elf_core_dump,
    .min_coredump    = ELF_EXEC_PAGESIZE,
    .hasvdso     = 1
};

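Each instance of this structure represents one executable file format; Linux keeps them in the formats list.
They differ in the functions their load pointers refer to.

load-elf-binary

This is the key part: loading an ELF file.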

load_elf_binary
fs/binfmt_elf.c

static int load_elf_binary(struct linux_binprm *bprm)
{
	struct file *interpreter = NULL; /* to shut gcc up */
 	unsigned long load_addr = 0, load_bias = 0;
	int load_addr_set = 0;
	char * elf_interpreter = NULL;
	unsigned long error;
	struct elf_phdr *elf_ppnt, *elf_phdata, *interp_elf_phdata = NULL;
	unsigned long elf_bss, elf_brk;
	int retval, i;
	unsigned long elf_entry;
	unsigned long interp_load_addr = 0;
	unsigned long start_code, end_code, start_data, end_data;
	unsigned long reloc_func_desc __maybe_unused = 0;
	int executable_stack = EXSTACK_DEFAULT;
	struct pt_regs *regs = current_pt_regs();
	struct {
		struct elfhdr elf_ex;
		struct elfhdr interp_elf_ex;
	} *loc;//holds the ELF headers of the binary and of its interpreter
	struct arch_elf_state arch_state = INIT_ARCH_ELF_STATE;
	loc = kmalloc(sizeof(*loc), GFP_KERNEL);
	if (!loc) {
		retval = -ENOMEM;
		goto out_ret;
	}
	
	/* Get the exec-header */
	//use the ELF header cached in bprm->buf
	loc->elf_ex = *((struct elfhdr *)bprm->buf);
	retval = -ENOEXEC;
	/* First of all, some simple consistency checks */
	//check the magic number (first 4 bytes)
	if (memcmp(loc->elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
		goto out;
	//neither an executable nor a shared object
	if (loc->elf_ex.e_type != ET_EXEC && loc->elf_ex.e_type != ET_DYN)
		goto out;
	//not built for this architecture
	if (!elf_check_arch(&loc->elf_ex))
		goto out;
	//the file must support mmap
	if (!bprm->file->f_op->mmap)
		goto out;
	//read the program headers using the file and the cached header; memory is allocated here
	elf_phdata = load_elf_phdrs(&loc->elf_ex, bprm->file);
	if (!elf_phdata)
		goto out;
	elf_ppnt = elf_phdata;
	elf_bss = 0;
	elf_brk = 0;
	start_code = ~0UL;
	end_code = 0;
	start_data = 0;
	end_data = 0;
	for (i = 0; i < loc->elf_ex.e_phnum; i++) {
		//found the interpreter entry; its p_offset points straight at the path string
		if (elf_ppnt->p_type == PT_INTERP) {
			/* This is the program interpreter used for
			 * shared libraries - for now assume that this
			 * is an a.out format binary
			 */
			retval = -ENOEXEC;
			if (elf_ppnt->p_filesz > PATH_MAX || 
			    elf_ppnt->p_filesz < 2)
				goto out_free_ph;
			retval = -ENOMEM;
			//allocate the string buffer
			elf_interpreter = kmalloc(elf_ppnt->p_filesz,
						  GFP_KERNEL);
			if (!elf_interpreter)
				goto out_free_ph;
			//read from the file; until now only the first 128 bytes had been read in
			retval = kernel_read(bprm->file, elf_ppnt->p_offset,
					     elf_interpreter,
					     elf_ppnt->p_filesz);
			if (retval != elf_ppnt->p_filesz) {
				if (retval >= 0)
					retval = -EIO;
				goto out_free_interp;
			}
			/* make sure path is NULL terminated */
			retval = -ENOEXEC;
			//the path must end with NUL
			if (elf_interpreter[elf_ppnt->p_filesz - 1] != '\0')
				goto out_free_interp;
			//open the interpreter; returns a file structure
			interpreter = open_exec(elf_interpreter);
			retval = PTR_ERR(interpreter);
			if (IS_ERR(interpreter))
				goto out_free_interp;
			/*
			 * If the binary is not readable then enforce
			 * mm->dumpable = 0 regardless of the interpreter's
			 * permissions.
			 */
			would_dump(bprm, interpreter);
			/* Get the exec headers */
			//read the interpreter's ELF header
			retval = kernel_read(interpreter, 0,
					     (void *)&loc->interp_elf_ex,
					     sizeof(loc->interp_elf_ex));
			if (retval != sizeof(loc->interp_elf_ex)) {
				if (retval >= 0)
					retval = -EIO;
				goto out_free_dentry;
			}
			break;
		}
		elf_ppnt++;
	}
	elf_ppnt = elf_phdata;
	//walk the program headers
	for (i = 0; i < loc->elf_ex.e_phnum; i++, elf_ppnt++)
		switch (elf_ppnt->p_type) {
			//PT_GNU_STACK entry
		case PT_GNU_STACK:
			//decides whether the stack is executable
			if (elf_ppnt->p_flags & PF_X)
				executable_stack = EXSTACK_ENABLE_X;
			else
				executable_stack = EXSTACK_DISABLE_X;
			break;
		//GCC case-range extension: low ... high, with spaces around the dots
		case PT_LOPROC ... PT_HIPROC:
			retval = arch_elf_pt_proc(&loc->elf_ex, elf_ppnt,
						  bprm->file, false,
						  &arch_state);
			if (retval)
				goto out_free_dentry;
			break;
		}
	/* Some simple consistency checks for the interpreter */
	//when there is a dynamic linker / interpreter
	if (elf_interpreter) {
		retval = -ELIBBAD;
		/* Not an ELF interpreter */
		//is it an ELF file?
		if (memcmp(loc->interp_elf_ex.e_ident, ELFMAG, SELFMAG) != 0)
			goto out_free_dentry;
		/* Verify the interpreter has a valid arch */
		//check the architecture
		if (!elf_check_arch(&loc->interp_elf_ex))
			goto out_free_dentry;
		//load the program header table
		/* Load the interpreter program headers */
		interp_elf_phdata = load_elf_phdrs(&loc->interp_elf_ex,
						   interpreter);
		if (!interp_elf_phdata)
			goto out_free_dentry;
		/* Pass PT_LOPROC..PT_HIPROC headers to arch code */
		elf_ppnt = interp_elf_phdata;
		for (i = 0; i < loc->interp_elf_ex.e_phnum; i++, elf_ppnt++)
			switch (elf_ppnt->p_type) {
			case PT_LOPROC ... PT_HIPROC:
				retval = arch_elf_pt_proc(&loc->interp_elf_ex,
							  elf_ppnt, interpreter,
							  true, &arch_state);
				if (retval)
					goto out_free_dentry;
				break;
			}
	}
	/*
	 * Allow arch code to reject the ELF at this point, whilst it's
	 * still possible to return an error to the code that invoked
	 * the exec syscall.
	 */
	 //!! normalizes the value to 0 or 1
	retval = arch_check_elf(&loc->elf_ex, !!interpreter, &arch_state);
	if (retval)
		goto out_free_dentry;
	/* Flush all traces of the currently running executable */
	//flush_old_exec switches to the new address space and kills the other threads in the thread group
	retval = flush_old_exec(bprm);
	if (retval)
		goto out_free_dentry;
	/* Do this immediately, since STACK_TOP as used in setup_arg_pages
	   may depend on the personality.  */
	SET_PERSONALITY2(loc->elf_ex, &arch_state);
	if (elf_read_implies_exec(loc->elf_ex, executable_stack))
		current->personality |= READ_IMPLIES_EXEC;
	if (!(current->personality & ADDR_NO_RANDOMIZE) && randomize_va_space)
		current->flags |= PF_RANDOMIZE;
	//setup_new_exec does basic initialization of the address space just switched in
	setup_new_exec(bprm);
	/* Do this so that we can load the interpreter, if need be.  We will
	   change some of these later */
	retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
				 executable_stack);
	if (retval < 0)
		goto out_free_dentry;
	
	current->mm->start_stack = bprm->p;
	/* Now we do a little grungy work by mmapping the ELF image into
	   the correct location in memory. */
	//find the PT_LOAD segments of the ELF file and map them into memory
	for(i = 0, elf_ppnt = elf_phdata;
	    i < loc->elf_ex.e_phnum; i++, elf_ppnt++) {
		int elf_prot = 0, elf_flags;
		unsigned long k, vaddr;
		unsigned long total_size = 0;
		//skip everything that is not PT_LOAD
		if (elf_ppnt->p_type != PT_LOAD)
			continue;
		//if brk > bss at this point, allocate that range
		if (unlikely (elf_brk > elf_bss)) {
			unsigned long nbyte;
	            
			/* There was a PT_LOAD segment with p_memsz > p_filesz
			   before this one. Map anonymous pages, if needed,
			   and clear the area.  */
			retval = set_brk(elf_bss + load_bias,
					 elf_brk + load_bias);
			if (retval)
				goto out_free_dentry;
			nbyte = ELF_PAGEOFFSET(elf_bss);
			if (nbyte) {
				nbyte = ELF_MIN_ALIGN - nbyte;
				if (nbyte > elf_brk - elf_bss)
					nbyte = elf_brk - elf_bss;
				if (clear_user((void __user *)elf_bss +
							load_bias, nbyte)) {
					/*
					 * This bss-zeroing can fail if the ELF
					 * file specifies odd protections. So
					 * we don't check the return value
					 */
				}
			}
		}
		//protection flags for this segment
		if (elf_ppnt->p_flags & PF_R)
			elf_prot |= PROT_READ;
		if (elf_ppnt->p_flags & PF_W)
			elf_prot |= PROT_WRITE;
		if (elf_ppnt->p_flags & PF_X)
			elf_prot |= PROT_EXEC;
		elf_flags = MAP_PRIVATE | MAP_DENYWRITE | MAP_EXECUTABLE;
		//virtual address of this segment
		vaddr = elf_ppnt->p_vaddr;
		//ET_EXEC, or not the first mapping (load_addr_set): the base is already fixed
		if (loc->elf_ex.e_type == ET_EXEC || load_addr_set) {
			elf_flags |= MAP_FIXED;
		//shared object / position-independent executable
		} else if (loc->elf_ex.e_type == ET_DYN) {
			/* Try and get dynamic programs out of the way of the
			 * default mmap base, as well as whatever program they
			 * might try to exec.  This is because the brk will
			 * follow the loader, and is not movable.  */
			 //load_bias is the offset added to every p_vaddr
			load_bias = ELF_ET_DYN_BASE - vaddr;
			if (current->flags & PF_RANDOMIZE)
				load_bias += arch_mmap_rnd();//ASLR random offset
			load_bias = ELF_PAGESTART(load_bias);
			//total_size is the total size of memory to map
			total_size = total_mapping_size(elf_phdata,
							loc->elf_ex.e_phnum);
			if (!total_size) {
				retval = -EINVAL;
				goto out_free_dentry;
			}
		}
		//map the segment at load_bias + p_vaddr
		//args: file, virtual address, program header, protection, flags, total image size (nonzero only the first time)
		error = elf_map(bprm->file, load_bias + vaddr, elf_ppnt,
				elf_prot, elf_flags, total_size);
		if (BAD_ADDR(error)) {
			retval = IS_ERR((void *)error) ?
				PTR_ERR((void*)error) : -EINVAL;
			goto out_free_dentry;
		}
		//first mapping
		if (!load_addr_set) {
			load_addr_set = 1;
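			//record the base virtual address of the image; for ET_DYN it is shifted by load_bias
			//the load base is where the start of the file (the ELF header) lands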
			load_addr = (elf_ppnt->p_vaddr - elf_ppnt->p_offset);
			if (loc->elf_ex.e_type == ET_DYN) {
				load_bias += error -
				             ELF_PAGESTART(load_bias + vaddr);
				load_addr += load_bias;
				reloc_func_desc = load_bias;
			}
		}
		//virtual address of this segment
		k = elf_ppnt->p_vaddr;
		//start_code: lowest segment virtual address
		if (k < start_code)
			start_code = k;
		//start_data: highest segment virtual address
		if (start_data < k)
			start_data = k;
		/*
		 * Check to see if the section's size will overflow the
		 * allowed task size. Note that p_filesz must always be
		 * <= p_memsz so it is only necessary to check p_memsz.
		 */
		 //check that the segment fits in the task's (user) address space
		if (BAD_ADDR(k) || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
		    elf_ppnt->p_memsz > TASK_SIZE ||
		    TASK_SIZE - elf_ppnt->p_memsz < k) {
			/* set_brk can never work. Avoid overflows. */
			retval = -EINVAL;
			goto out_free_dentry;
		}
		//end address of the file-backed part of this segment
		k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;
		//elf_bss: highest file-backed end address
		if (k > elf_bss)
			elf_bss = k;
		//end_code: highest end address among executable segments
		if ((elf_ppnt->p_flags & PF_X) && end_code < k)
			end_code = k;
		//end_data: highest file-backed end address
		if (end_data < k)
			end_data = k;
		//end address of this segment in memory
		k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;
		//elf_brk: highest in-memory end address
		if (k > elf_brk)
			elf_brk = k;
	}
	
	
	
	
	
	//after the loop, add the bias to each virtual address
	loc->elf_ex.e_entry += load_bias;
	elf_bss += load_bias;
	elf_brk += load_bias;
	start_code += load_bias;
	end_code += load_bias;
	start_data += load_bias;
	end_data += load_bias;
	/* Calling set_brk effectively mmaps the pages that we need
	 * for the bss and break sections.  We must do this before
	 * mapping in the interpreter, to make sure it doesn't wind
	 * up getting placed where the bss needs to go.
	 */
	 //allocate memory between elf_bss and elf_brk
	retval = set_brk(elf_bss, elf_brk);
	if (retval)
		goto out_free_dentry;
	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {
		retval = -EFAULT; /* Nobody gets to see this, but.. */
		goto out_free_dentry;
	}
	//map the interpreter / dynamic linker
	if (elf_interpreter) {
		unsigned long interp_map_addr = 0;
		//map the interpreter into memory; the return value is its load address (relocation adjustment), e_entry is added below
		elf_entry = load_elf_interp(&loc->interp_elf_ex,
					    interpreter,
					    &interp_map_addr,
					    load_bias, interp_elf_phdata);
		if (!IS_ERR((void *)elf_entry)) {
			/*
			 * load_elf_interp() returns relocation
			 * adjustment
			 */
			interp_load_addr = elf_entry;
			elf_entry += loc->interp_elf_ex.e_entry;
		}
		if (BAD_ADDR(elf_entry)) {
			retval = IS_ERR((void *)elf_entry) ?
					(int)elf_entry : -EINVAL;
			goto out_free_dentry;
		}
		reloc_func_desc = interp_load_addr;
		allow_write_access(interpreter);
		fput(interpreter);
		kfree(elf_interpreter);
	} else {
		//no interpreter: use the ELF file's own (already biased) entry point
		elf_entry = loc->elf_ex.e_entry;
		if (BAD_ADDR(elf_entry)) {
			retval = -EINVAL;
			goto out_free_dentry;
		}
	}
	kfree(interp_elf_phdata);
	kfree(elf_phdata);
	//set_binfmt records elf_format in mm_struct's binfmt field
	set_binfmt(&elf_format);
#ifdef ARCH_HAS_SETUP_ADDITIONAL_PAGES
	retval = arch_setup_additional_pages(bprm, !!elf_interpreter);
	if (retval < 0)
		goto out;
#endif /* ARCH_HAS_SETUP_ADDITIONAL_PAGES */
	//install_exec_creds installs the new credentials (cred) for the process
	install_exec_creds(bprm);
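	//before starting the interpreter or program, create_elf_tables puts extra information on the user stack: argc, the argv/envp pointer arrays, and the auxiliary vector (program header address, entry point, and so on)
	//the argument and environment strings themselves were already copied there by copy_strings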
	retval = create_elf_tables(bprm, &loc->elf_ex,
			  load_addr, interp_load_addr);
	if (retval < 0)
		goto out;
	/* N.B. passed_fileno might not be initialized? */
	//store the code, data, bss and heap positions computed above in the new mm_struct
	current->mm->end_code = end_code;
	current->mm->start_code = start_code;
	current->mm->start_data = start_data;
	current->mm->end_data = end_data;
	current->mm->start_stack = bprm->p;
	if ((current->flags & PF_RANDOMIZE) && (randomize_va_space > 1)) {
		current->mm->brk = current->mm->start_brk =
			arch_randomize_brk(current->mm);
#ifdef compat_brk_randomized
		current->brk_randomized = 1;
#endif
	}
	if (current->personality & MMAP_PAGE_ZERO) {
		/* Why this, you ask???  Well SVr4 maps page 0 as read-only,
		   and some applications "depend" upon this behavior.
		   Since we do not have the power to recompile these, we
		   emulate the SVr4 behavior. Sigh. */
		error = vm_mmap(NULL, 0, PAGE_SIZE, PROT_READ | PROT_EXEC,
				MAP_FIXED | MAP_PRIVATE, 0);
	}
#ifdef ELF_PLAT_INIT
	/*
	 * The ABI may specify that certain registers be set up in special
	 * ways (on i386 %edx is the address of a DT_FINI function, for
	 * example.  In addition, it may also specify (eg, PowerPC64 ELF)
	 * that the e_entry field is the address of the function descriptor
	 * for the startup routine, rather than the address of the startup
	 * routine itself.  This macro performs whatever initialization to
	 * the regs structure is required as well as any relocations to the
	 * function descriptor entries when executing dynamically links apps.
	 */
	ELF_PLAT_INIT(regs, reloc_func_desc);
#endif
	//start_thread hands control to the interpreter or the program
	//args: registers, entry point, stack pointer
	start_thread(regs, elf_entry, bprm->p);
	retval = 0;
out:
	kfree(loc);
out_ret:
	return retval;
	/* error cleanup */
out_free_dentry:
	kfree(interp_elf_phdata);
	allow_write_access(interpreter);
	if (interpreter)
		fput(interpreter);
out_free_interp:
	kfree(elf_interpreter);
out_free_ph:
	kfree(elf_phdata);
	goto out;
}

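These pieces of code show the kernel designers' intent: resources such as memory are not allocated until they are needed, i.e. lazy loading. This allows efficient reclamation and avoids waste.
In summary, load_elf_binary does the following:

Checks the ELF file and reads its header
Reads the program header table
Looks for the interpreter/dynamic linker, opens that file, and reads its header
Reads the interpreter's program headers
flush_old_exec switches to the new address space, setup_new_exec does basic initialization of it, and stack executability is set
Walks the program headers and maps the PT_LOAD segments; load_addr is the base and load_bias the offset, each determined only once
Maps the interpreter into virtual memory
Puts environment variables, arguments, and related data on the stack and fills in current's mm_struct
Finally calls start_thread to hand control to the interpreter or the program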
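The first few steps (check the magic, read the program headers, find PT_INTERP, read the path) can be reproduced from user space with `<elf.h>`. The sketch below is an approximation under stated assumptions: it handles only 64-bit ELF files in the host's byte order, uses stdio instead of `kernel_read`, and `read_interp` is a name invented here.

```c
#include <assert.h>
#include <elf.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 and fills `out` if `path` is a 64-bit ELF with a PT_INTERP
 * segment, 0 if it is a 64-bit ELF without one, -1 on any error. */
int read_interp(const char *path, char *out, size_t outlen)
{
    Elf64_Ehdr eh;
    Elf64_Phdr ph;
    int i, result = 0;
    FILE *f;

    assert(path != NULL && out != NULL);
    f = fopen(path, "rb");
    if (!f)
        return -1;
    /* Header checks: magic bytes, class, and program header entry size. */
    if (fread(&eh, sizeof eh, 1, f) != 1 ||
        memcmp(eh.e_ident, ELFMAG, SELFMAG) != 0 ||
        eh.e_ident[EI_CLASS] != ELFCLASS64 ||
        eh.e_phentsize != sizeof ph) {
        fclose(f);
        return -1;
    }
    for (i = 0; i < eh.e_phnum; i++) {
        if (fseek(f, (long)(eh.e_phoff + (Elf64_Off)i * sizeof ph), SEEK_SET) != 0 ||
            fread(&ph, sizeof ph, 1, f) != 1) {
            result = -1;
            break;
        }
        if (ph.p_type != PT_INTERP)
            continue;
        /* Sanity checks in the spirit of the kernel's: non-trivial size,
         * fits the buffer, and NUL-terminated. */
        if (ph.p_filesz < 2 || ph.p_filesz > outlen ||
            fseek(f, (long)ph.p_offset, SEEK_SET) != 0 ||
            fread(out, 1, ph.p_filesz, f) != ph.p_filesz ||
            out[ph.p_filesz - 1] != '\0')
            result = -1;
        else
            result = 1;
        break;
    }
    fclose(f);
    return result;
}
```

On a typical dynamically linked binary this reports the path of the dynamic linker, which is the file the kernel opens with `open_exec` before mapping anything.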

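start_code: lowest segment va
end_code: highest va + filesz among executable segments
start_data: highest segment va
end_data: highest segment va + filesz
elf_bss: highest segment va + filesz
elf_brk: highest segment va + memsz
So the memory layout is: code, initialized data (mapped directly from the file), uninitialized data (bss), with brk marking the end of the memory occupied by the image, then the heap, and the stack.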
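The six values can be reproduced with a few lines of arithmetic. This is a stand-alone sketch that mirrors the bookkeeping in the PT_LOAD loop before load_bias is added; the `toy_` structures carry only the fields that bookkeeping reads and are not kernel types.

```c
#include <assert.h>
#include <stddef.h>

struct toy_load_seg {       /* the phdr fields the bookkeeping needs */
    unsigned long vaddr, filesz, memsz;
    int executable;         /* nonzero if PF_X is set */
};

struct toy_layout {
    unsigned long start_code, end_code, start_data, end_data, elf_bss, elf_brk;
};

struct toy_layout toy_compute_layout(const struct toy_load_seg *seg, size_t n)
{
    struct toy_layout l = { ~0UL, 0, 0, 0, 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        unsigned long k = seg[i].vaddr;
        if (k < l.start_code) l.start_code = k;
        if (l.start_data < k) l.start_data = k;

        k = seg[i].vaddr + seg[i].filesz;       /* end of file-backed bytes */
        if (k > l.elf_bss) l.elf_bss = k;
        if (seg[i].executable && l.end_code < k) l.end_code = k;
        if (l.end_data < k) l.end_data = k;

        k = seg[i].vaddr + seg[i].memsz;        /* end of the in-memory image */
        if (k > l.elf_brk) l.elf_brk = k;
    }
    return l;
}
```

With a text segment at 0x400000 (filesz = memsz = 0x1000) and a data segment at 0x601000 (filesz 0x200, memsz 0x800), elf_bss comes out as 0x601200 and elf_brk as 0x601800; the 0x600 bytes in between are the bss that has no backing in the file.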

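load-elf-phdrs

The function that reads the program headers (abridged here; the error checks of the real function are omitted):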

static struct elf_phdr *load_elf_phdrs(struct elfhdr *elf_ex,
                       struct file *elf_file)
{
    struct elf_phdr *elf_phdata = NULL;
    int size;
    size = sizeof(struct elf_phdr) * elf_ex->e_phnum;
    elf_phdata = kmalloc(size, GFP_KERNEL);
    kernel_read(elf_file, elf_ex->e_phoff,
                 (char *)elf_phdata, size);
    return elf_phdata;
}

flush-old-exec

fs/exec.c

int flush_old_exec(struct linux_binprm * bprm)
{
	int retval;
	/*
	 * Make sure we have a private signal table and that
	 * we are unassociated from the previous thread group.
	 */
	 //kill the other threads in the thread group
	retval = de_thread(current);
	if (retval)
		goto out;
	/*
	 * Must be called _before_ exec_mmap() as bprm->mm is
	 * not visibile until then. This also enables the update
	 * to be lockless.
	 */
	set_mm_exe_file(bprm->mm, bprm->file);
	/*
	 * Release all of the old mmap stuff
	 */
	acct_arg_size(bprm, 0);
	//switch the process to the address space that was created and set up in bprm
	retval = exec_mmap(bprm->mm);
	if (retval)
		goto out;
	bprm->mm = NULL;		/* We're using it now */
	set_fs(USER_DS);
	current->flags &= ~(PF_RANDOMIZE | PF_FORKNOEXEC | PF_KTHREAD |
					PF_NOFREEZE | PF_NO_SETAFFINITY);
	
	//flush_thread is architecture-specific; it mainly resets per-thread state in thread_struct such as the TLS data
	flush_thread();
	current->personality &= ~bprm->per_clear;
	return 0;
out:
	return retval;
}
EXPORT_SYMBOL(flush_old_exec);

setup-new-exec

fs/exec.c

void setup_new_exec(struct linux_binprm * bprm)
{
	arch_pick_mmap_layout(current->mm);
	/* This is the point of no return */
	current->sas_ss_sp = current->sas_ss_size = 0;
	if (uid_eq(current_euid(), current_uid()) && gid_eq(current_egid(), current_gid()))
		set_dumpable(current->mm, SUID_DUMP_USER);
	else
		set_dumpable(current->mm, suid_dumpable);
	perf_event_exec();
	__set_task_comm(current, kbasename(bprm->filename), true);
	/* Set the new mm task size. We have to do that late because it may
	 * depend on TIF_32BIT which is only updated in flush_thread() on
	 * some architectures like powerpc
	 */
	current->mm->task_size = TASK_SIZE;
	/* install the new credentials */
	if (!uid_eq(bprm->cred->uid, current_euid()) ||
	    !gid_eq(bprm->cred->gid, current_egid())) {
		current->pdeath_signal = 0;
	} else {
		would_dump(bprm, bprm->file);
		if (bprm->interp_flags & BINPRM_FLAGS_ENFORCE_NONDUMP)
			set_dumpable(current->mm, suid_dumpable);
	}
	/* An exec changes our domain. We are no longer part of the thread
	   group */
	current->self_exec_id++;
	flush_signal_handlers(current, 0);
	do_close_on_exec(current->files);
}
EXPORT_SYMBOL(setup_new_exec);

elf-map

fs/binfmt_elf.c

static unsigned long elf_map(struct file *filep, unsigned long addr,
		struct elf_phdr *eppnt, int prot, int type,
		unsigned long total_size)
{
	unsigned long map_addr;
	unsigned long size = eppnt->p_filesz + ELF_PAGEOFFSET(eppnt->p_vaddr);
	unsigned long off = eppnt->p_offset - ELF_PAGEOFFSET(eppnt->p_vaddr);
	addr = ELF_PAGESTART(addr);
	size = ELF_PAGEALIGN(size);
	/* mmap() will return -EINVAL if given a zero size, but a
	 * segment with zero filesize is perfectly valid */
	if (!size)
		return addr;
	/*
	* total_size is the size of the ELF (interpreter) image.
	* The _first_ mmap needs to know the full size, otherwise
	* randomization might put this image into an overlapping
	* position with the ELF binary image. (since size < total_size)
	* So we first map the 'big' image - and unmap the remainder at
	* the end. (which unmap is needed for ELF images with holes.)
	*/
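	//total_size is the size of the whole ELF image and is nonzero only on the first call. To keep randomization from causing overlaps, the whole image range is mapped first and the part beyond this segment is then unmapped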
	if (total_size) {
		total_size = ELF_PAGEALIGN(total_size);
		map_addr = vm_mmap(filep, addr, total_size, prot, type, off);
		if (!BAD_ADDR(map_addr))
			vm_munmap(map_addr+size, total_size-size);
	} else
		map_addr = vm_mmap(filep, addr, size, prot, type, off);
	return(map_addr);
}

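The arguments are: filep, the file being mapped; addr, the virtual address at which the segment should be mapped; eppnt, the segment's program header; prot and type, the protection bits and mapping flags; and total_size, the size of the whole image (non-zero only on the first call). Inside the function, size is the page-aligned length of the file-backed part of the segment, and off is the page-aligned offset of the segment within the file. elf_map mainly uses vm_mmap to reserve virtual address space for the file and set up the mapping, then returns the start address of that mapping, map_addr.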

load-elf-interp

fs/binfmt_elf.c

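// Arguments: ELF header, file handle, out-parameter that receives the address of the first mapping, the bias value no_base, program header table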
static unsigned long load_elf_interp(struct elfhdr *interp_elf_ex,
		struct file *interpreter, unsigned long *interp_map_addr,
		unsigned long no_base, struct elf_phdr *interp_elf_phdata)
{
	struct elf_phdr *eppnt;
	unsigned long load_addr = 0;
	int load_addr_set = 0;
	unsigned long last_bss = 0, elf_bss = 0;
	unsigned long error = ~0UL;
	unsigned long total_size;
	int i;
	/* First of all, some simple consistency checks */
	//Check the file type and the architecture
	if (interp_elf_ex->e_type != ET_EXEC &&
	    interp_elf_ex->e_type != ET_DYN)
		goto out;
	if (!elf_check_arch(interp_elf_ex))
		goto out;
	if (!interpreter->f_op->mmap)
		goto out;
	//Compute the total size of the image in memory
	total_size = total_mapping_size(interp_elf_phdata,
					interp_elf_ex->e_phnum);
	if (!total_size) {
		error = -EINVAL;
		goto out;
	}
	eppnt = interp_elf_phdata;
	//Walk the program headers and map every PT_LOAD segment
	for (i = 0; i < interp_elf_ex->e_phnum; i++, eppnt++) {
		if (eppnt->p_type == PT_LOAD) {
			int elf_type = MAP_PRIVATE | MAP_DENYWRITE;
			int elf_prot = 0;
			unsigned long vaddr = 0;
			unsigned long k, map_addr;
			if (eppnt->p_flags & PF_R)
				elf_prot = PROT_READ;
			if (eppnt->p_flags & PF_W)
				elf_prot |= PROT_WRITE;
			if (eppnt->p_flags & PF_X)
				elf_prot |= PROT_EXEC;
			vaddr = eppnt->p_vaddr;
			if (interp_elf_ex->e_type == ET_EXEC || load_addr_set)
				elf_type |= MAP_FIXED;
			else if (no_base && interp_elf_ex->e_type == ET_DYN)
				load_addr = -vaddr;
			//Map the segment. load_addr is the base address, initially 0; vaddr is the load address given in the program header
			map_addr = elf_map(interpreter, load_addr + vaddr,
					eppnt, elf_prot, elf_type, total_size);
			total_size = 0;
			if (!*interp_map_addr)
				*interp_map_addr = map_addr;
			error = map_addr;
			if (BAD_ADDR(map_addr))
				goto out;
			if (!load_addr_set &&
			    interp_elf_ex->e_type == ET_DYN) {
				load_addr = map_addr - ELF_PAGESTART(vaddr);
				load_addr_set = 1;
			}
			/*
			 * Check to see if the section's size will overflow the
			 * allowed task size. Note that p_filesz must always be
			 * <= p_memsize so it's only necessary to check p_memsz.
			 */
			 //Check whether the segment would exceed the task's address-space size
			k = load_addr + eppnt->p_vaddr;
			if (BAD_ADDR(k) ||
			    eppnt->p_filesz > eppnt->p_memsz ||
			    eppnt->p_memsz > TASK_SIZE ||
			    TASK_SIZE - eppnt->p_memsz < k) {
				error = -ENOMEM;
				goto out;
			}
			/*
			 * Find the end of the file mapping for this phdr, and
			 * keep track of the largest address we see for this.
			 */
			 //Find the highest address covered by the file-backed mappings
			k = load_addr + eppnt->p_vaddr + eppnt->p_filesz;
			if (k > elf_bss)
				elf_bss = k;
			/*
			 * Do the same thing for the memory mapping - between
			 * elf_bss and last_bss is the bss section.
			 */
			 //The highest address occupied in memory
			k = load_addr + eppnt->p_memsz + eppnt->p_vaddr;
			if (k > last_bss)
				last_bss = k;
		}
	}
	//Zero-fill the bss
	if (last_bss > elf_bss) {
		/*
		 * Now fill out the bss section.  First pad the last page up
		 * to the page boundary, and then perform a mmap to make sure
		 * that there are zero-mapped pages up to and including the
		 * last bss page.
		 */
		if (padzero(elf_bss)) {
			error = -EFAULT;
			goto out;
		}
		/* What we have mapped so far */
		elf_bss = ELF_PAGESTART(elf_bss + ELF_MIN_ALIGN - 1);
		/* Map the last of the bss segment */
		error = vm_brk(elf_bss, last_bss - elf_bss);
		if (BAD_ADDR(error))
			goto out;
	}
	error = load_addr;
out:
	return error;
}

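Unlike loading the main binary, no random offset is added explicitly here. When the interpreter is ET_DYN, the first segment is mapped without MAP_FIXED, so vm_mmap picks the address, and load_addr is then derived from it; every later segment is placed relative to that base with MAP_FIXED. The segments are still mapped one at a time.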

start-thread

I could not find the ARM version, so let's look at the more familiar x86 one first.
arch/x86/kernel/process_64.c

static void
start_thread_common(struct pt_regs *regs, unsigned long new_ip,
		    unsigned long new_sp,
		    unsigned int _cs, unsigned int _ss, unsigned int _ds)
{
	loadsegment(fs, 0);
	loadsegment(es, _ds);
	loadsegment(ds, _ds);
	load_gs_index(0);
	regs->ip		= new_ip;
	regs->sp		= new_sp;
	regs->cs		= _cs;
	regs->ss		= _ss;
	regs->flags		= X86_EFLAGS_IF;
	force_iret();
}
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
	start_thread_common(regs, new_ip, new_sp,
			    __USER_CS, __USER_DS, 0);
}

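This sets up the registers. force_iret makes the return to user mode go through the iret path so that the rewritten registers are restored, and execution then continues at the address new_ip points to. The two values that really matter are ip and sp. Note that regs is the saved user-mode register frame that the system call return path restores from; by overwriting the saved ip and sp, the kernel makes the system call "return" straight into the new program.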

References

https://vvl.me/2017/03/20/how-does-the-Linux-kernel-run-a-program/