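1. SMP hardware architecture:
In the simplest terms, an SMP system contains several identical CPUs. All CPUs share the bus, and each has its own registers. Because the bus is shared, memory and peripheral devices are shared as well. Under Linux, all CPUs map the kernel (system) address space identically and are fully symmetric peers.
With several CPUs in the system, a question arises: when a peripheral raises an interrupt, which CPU handles it?
To answer this, Intel introduced the IO APIC and LOCAL APIC architecture.
The IO APIC connects to the peripheral devices and can be programmed with a delivery mode. According to that mode, an interrupt signal is delivered to the LOCAL APIC of the corresponding CPU.
The LOCAL APIC handles interrupts for its own CPU. It accepts interrupts from the IO APIC and also handles interrupts generated locally on its own CPU. The LOCAL APIC also provides a timer.
How is the boot CPU determined?
According to Intel's documentation, after power-on the hardware selects one CPU as the BSP (bootstrap processor) by following the MP Initialization Protocol. Only the BSP runs the BIOS code. Every other CPU becomes an AP (application processor) and waits; it starts running only after the BSP sends it an IPI. For the details of the MP Initialization Protocol, see Chapter 8 of the Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1.
How does the boot CPU start the other CPUs?
The BSP uses IPI messages to make an AP start executing at a given address. The LOCAL APIC integrated in each CPU provides this capability: writing to the relevant LOCAL APIC registers sends an IPI to the target CPU.
How is the CPU hardware information obtained?
After system initialization, the firmware places information about the CPUs, buses, IO APICs and so on at a defined location in memory: the SMP MP table. During Linux initialization the kernel reads that location to obtain the hardware information.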
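To make the MP table lookup concrete, here is a minimal user-space sketch of how such a scan could work. It assumes the layout given in the MultiProcessor Specification for the 16-byte MP floating pointer structure (signature "_MP_", a length byte at offset 8 counted in 16-byte units, and a checksum that makes all bytes sum to zero). The names mpf_is_valid and mpf_scan are hypothetical and are not the kernel's own functions; the kernel scans specific physical regions, which this sketch replaces with an ordinary byte buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout of the MP floating pointer structure (MP spec, ch. 4). */
enum { MPF_LENGTH_OFF = 8, MPF_PARAGRAPH = 16 };

/* Returns 1 if the bytes at p form a valid structure: signature "_MP_",
 * a non-zero length that fits in the buffer, and a byte sum of zero. */
static int mpf_is_valid(const uint8_t *p, size_t avail)
{
    if (avail < MPF_PARAGRAPH || memcmp(p, "_MP_", 4) != 0)
        return 0;
    size_t len = (size_t)p[MPF_LENGTH_OFF] * MPF_PARAGRAPH;
    if (len == 0 || len > avail)
        return 0;
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + p[i]);
    return sum == 0;
}

/* Scans a memory image on 16-byte boundaries. Returns the offset of the
 * first valid structure, or -1 if none is found. */
long mpf_scan(const uint8_t *mem, size_t len)
{
    assert(mem != NULL);
    for (size_t off = 0; off + MPF_PARAGRAPH <= len; off += MPF_PARAGRAPH)
        if (mpf_is_valid(mem + off, len - off))
            return (long)off;
    return -1;
}
```

A caller would pass a buffer holding a copy of the region to search; a result of -1 corresponds to smp_found_config staying clear.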
2. Overview of the Linux SMP boot flow
setup_arch()
    setup_memory();
        reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
        find_smp_config(); // locate the SMP MP table
    smp_alloc_memory();
        trampoline_base = (void *) alloc_bootmem_low_pages(PAGE_SIZE); // allocate the trampoline page that holds the AP boot code
    get_smp_config(); // read the hardware information from the SMP MP table
trap_init()
    init_apic_mappings();
mem_init()
    zap_low_mappings(); // if SMP is not configured, clear the user-space address mappings
rest_init();
    kernel_thread(init, NULL, CLONE_FS | CLONE_SIGHAND);
        init();
            set_cpus_allowed(current, CPU_MASK_ALL);
            smp_prepare_cpus(max_cpus);
                smp_boot_cpus(max_cpus);
                    connect_bsp_APIC();
                    setup_local_APIC(); // initialize the BSP's LOCAL APIC
                    map_cpu_to_logical_apicid();
                    call do_boot_cpu(apicid, cpu) for each CPU
            smp_init(); // every CPU starts scheduling
trampoline.S: AP boot code; 16-bit real-mode code that switches to protected mode
head.S: sets up paging for the AP
initialize_secondary: jumps to start_secondary, using the information set up when the idle task was created by fork
start_secondary: checks whether the BSP has given the start signal; once it has, the AP begins scheduling tasks
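3. Code study summary
find_smp_config(): locates the MP table in memory. For the table format, see Chapter 4 of the MultiProcessor Specification.
The table describes the system's CPUs, buses, IO APICs and other hardware.
Two related global variables: smp_found_config records whether an SMP MP table was found, and mpf_found holds the linear address of the SMP MP table (its floating pointer structure).
smp_alloc_memory(): allocates memory for the code that boots the APs. Related global variable: trampoline_base, the linear address of the allocated area.
get_smp_config(): reads the hardware information from the contents of the MP table.
init_apic_mappings(): sets up the mapped addresses of the IO APIC and LOCAL APIC.
zap_low_mappings(): if SMP is not configured, clears the user-space address mappings by zeroing the corresponding entries in swapper_pg_dir.
setup_local_APIC(): initializes the BSP's LOCAL APIC.
do_boot_cpu(apicid, cpu)
idle = alloc_idle_task(cpu);
task = copy_process(CLONE_VM, 0, idle_regs(&regs), 0, NULL, NULL, 0);
init_idle(task, cpu);
Copies the init process with copy_process and calls init_idle to bind the copy to the CPU it is allowed to run on.
idle->thread.eip = (unsigned long) start_secondary;
Sets thread.eip in the task_struct so that the AP runs start_secondary once its initialization is complete.
start_eip = setup_trampoline();
setup_trampoline() copies the code between trampoline_data and trampoline_end to trampoline_base, which is the memory allocated earlier in setup_arch. The return value start_eip is the physical address corresponding to trampoline_base.
smpboot_setup_warm_reset_vector(start_eip): stores start_eip at memory location 40:67h (the warm-reset vector) as the start address.
wakeup_secondary_cpu(apicid, start_eip): by programming the APIC_ICR register, the BSP sends IPI messages to the target AP, which make the AP start executing at start_eip in real mode.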
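As an illustration of the last step, the sketch below composes the two 32-bit halves of the Interrupt Command Register for the INIT and STARTUP IPIs. The bit positions follow the local APIC description in the Intel manual. The function and macro names are hypothetical, chosen for this example, and nothing here touches real hardware: it only computes the values that would be written.

```c
#include <assert.h>
#include <stdint.h>

/* Bit fields of the local APIC Interrupt Command Register (ICR). */
#define ICR_DM_INIT      0x00000500u  /* delivery mode 101b: INIT    */
#define ICR_DM_STARTUP   0x00000600u  /* delivery mode 110b: STARTUP */
#define ICR_LEVEL_ASSERT 0x00004000u
#define ICR_TRIG_LEVEL   0x00008000u

/* High dword: physical destination APIC ID in bits 24..31. */
uint32_t icr_high(uint8_t apicid)
{
    return (uint32_t)apicid << 24;
}

/* Low dword for the INIT IPI (level-triggered, asserted). */
uint32_t icr_low_init(void)
{
    return ICR_TRIG_LEVEL | ICR_LEVEL_ASSERT | ICR_DM_INIT;
}

/* A STARTUP IPI can only name a 4 KiB-aligned address below 1 MiB. */
int startup_addr_ok(uint32_t start_eip)
{
    return (start_eip & 0xFFFu) == 0 && start_eip < 0x100000u;
}

/* Low dword for the STARTUP IPI: the vector field carries the page
 * number of start_eip, so the AP begins at (vector << 12). */
uint32_t icr_low_startup(uint32_t start_eip)
{
    assert(startup_addr_ok(start_eip));
    return ICR_DM_STARTUP | (start_eip >> 12);
}
```

The alignment check explains why the trampoline is allocated as a whole low-memory page: the STARTUP vector has room for a page number only.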
trampoline.S
ENTRY(trampoline_data)
r_base = .
wbinvd # Needed for NUMA-Q should be harmless for others
mov %cs, %ax # Code and data in the same place
mov %ax, %ds
cli # We should be safe anyway
movl $0xA5A5A5A5, trampoline_data - r_base
This writes a marker so that the BSP knows the AP has reached this point.
lidtl boot_idt - r_base # load idt with 0, 0
lgdtl boot_gdt - r_base # load gdt with whatever is appropriate
Load the IDT and the GDT.
xor %ax, %ax
inc %ax # protected mode (PE) bit
lmsw %ax # into protected mode
# flush prefetch and jump to startup_32_smp in arch/i386/kernel/head.S
ljmpl $__BOOT_CS, $(startup_32_smp-__PAGE_OFFSET)
Switch to protected mode and jump to startup_32_smp.
# These need to be in the same 64K segment as the above;
# hence we don't use the boot_gdt_descr defined in head.S
boot_gdt:
.word __BOOT_DS + 7 # gdt limit
.long boot_gdt_table-__PAGE_OFFSET # gdt base
boot_idt:
.word 0 # idt limit = 0
.long 0 # idt base = 0L
.globl trampoline_end
trampoline_end:
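This code writes a marker so that the BSP knows the AP has reached this point, then loads the base addresses of the GDT and IDT.
It then switches to protected mode and jumps to startup_32_smp.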
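The two mode switches in this boot path come down to two CR0 bits: PE (bit 0), set by lmsw in the trampoline, and PG (bit 31), set later with orl $0x80000000 in head.S. The following sketch models those updates on a plain integer. The function names are hypothetical, and the lmsw model assumes the behavior documented for the instruction: only the low four bits are loaded, and PE cannot be cleared by it.

```c
#include <assert.h>
#include <stdint.h>

#define CR0_PE 0x00000001u  /* protection enable */
#define CR0_PG 0x80000000u  /* paging enable     */

/* Model of "lmsw src": bits 1..3 of CR0 are replaced by src, bit 0 (PE)
 * can be set but not cleared, and all higher bits are left unchanged. */
uint32_t cr0_after_lmsw(uint32_t cr0, uint16_t src)
{
    return (cr0 & ~0xEu) | (src & 0xFu);
}

/* Model of "orl $0x80000000,%eax; movl %eax,%cr0". Paging may only be
 * enabled once protected mode is already on. */
uint32_t cr0_enable_paging(uint32_t cr0)
{
    assert(cr0 & CR0_PE);
    return cr0 | CR0_PG;
}
```

With a starting value of 0x60000010, the trampoline's lmsw with AX = 1 yields 0x60000011, and enabling paging afterwards yields 0xE0000011.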
Excerpt from head.S:
ENTRY(startup_32_smp)
cld
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
xorl %ebx,%ebx
incl %ebx
For an AP, bx is set to 1.
movl $swapper_pg_dir-__PAGE_OFFSET,%eax
movl %eax,%cr3
movl %cr0,%eax
orl $0x80000000,%eax
movl %eax,%cr0
ljmp $__BOOT_CS,$1f
Enable paging.
lss stack_start,%esp
Point esp at the kernel stack of the process created by fork, in preparation for the later jump to start_secondary.
#ifdef CONFIG_SMP
movb ready, %cl
movb $1, ready
cmpb $0,%cl
je 1f # the first CPU calls start_kernel
# all other CPUs call initialize_secondary
call initialize_secondary
jmp L6
1:
#endif
call start_kernel
When it is an AP that is starting, initialize_secondary is called.
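The branch on the ready byte can be pictured as a test-and-set: the first CPU to pass finds 0 and goes to start_kernel, and every later CPU finds 1 and goes to initialize_secondary. The sketch below expresses that decision in C. It is an illustration only: head.S uses two plain movb instructions, whereas this version uses a C11 atomic exchange, which is a substitution made here for clarity. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

/* 0 until the first CPU has passed through, 1 afterwards. */
static atomic_uchar ready;

typedef enum {
    PATH_START_KERNEL,          /* taken by the first CPU (the BSP) */
    PATH_INITIALIZE_SECONDARY   /* taken by every later CPU (APs)   */
} boot_path_t;

/* Reads the old value of ready and sets it to 1 in one step, then
 * picks the boot path from the old value. */
boot_path_t choose_boot_path(void)
{
    unsigned char was_ready = atomic_exchange(&ready, 1);
    assert(was_ready <= 1);
    return was_ready ? PATH_INITIALIZE_SECONDARY : PATH_START_KERNEL;
}
```

Because the BSP starts the APs itself, and only after it has passed this point, the plain load and store in head.S are sufficient there.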
void __devinit initialize_secondary(void)
{
asm volatile(
"movl %0,%%esp\n\t"
"jmp *%1"
:
:"r" (current->thread.esp),"r" (current->thread.eip));
}
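This sets the stack pointer to the stack saved when the task was created by fork and the instruction pointer to the one saved at that time, so execution jumps to start_secondary.
start_secondary then does the following: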
while (!cpu_isset(smp_processor_id(), smp_commenced_mask))
rep_nop();
It polls smp_commenced_mask to see whether this AP has been released to run. smp_commenced_mask is set in smp_init().
cpu_idle();
Once released, it calls cpu_idle and begins scheduling tasks.
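The handshake above is a bit mask that the BSP writes and each AP polls. A minimal user-space sketch of the idea follows, using C11 atomics in place of the kernel's cpumask helpers. All names here are hypothetical, and the spin is bounded so that the function always returns; the kernel loop spins without a bound.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Bit n set means CPU n may start scheduling. */
static atomic_ulong commenced_mask;

/* BSP side: release one CPU. */
void commence_cpu(unsigned cpu)
{
    assert(cpu < sizeof(unsigned long) * CHAR_BIT);
    atomic_fetch_or(&commenced_mask, 1UL << cpu);
}

/* AP side: has this CPU been released yet? */
bool cpu_commenced(unsigned cpu)
{
    assert(cpu < sizeof(unsigned long) * CHAR_BIT);
    return (atomic_load(&commenced_mask) >> cpu) & 1UL;
}

/* AP side: poll until released or until max_spins polls have failed. */
bool wait_for_commence(unsigned cpu, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++)
        if (cpu_commenced(cpu))
            return true;
    return false;
}
```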
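SMP (symmetric multiprocessing) boot flow (reposted)
-------------------------------------------
This article is original to this site; reposting is welcome.
When reposting, please credit the source: http://sjj0412.cublog.cn
-------------------------------------------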
startup_32:
cld
cli
movl $(KERNEL_DS),%eax
mov %ax,%ds
mov %ax,%es
mov %ax,%fs
mov %ax,%gs
#ifdef __SMP__
orw %bx,%bx # What state are we in BX=1 for SMP
# 0 for boot
jz 2f # Initial boot
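// bx tells whether this is the boot CPU (bx=0) or a secondary CPU (bx=1),
// and the two take different execution paths.
// Execution path for the secondary CPUs: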
mov %ax,%ss
xorl %eax,%eax # Back to 0
mov %cx,%ax # SP low 16 bits
movl %eax,%esp
pushl $0 # Clear NT
popfl
ljmp $(KERNEL_CS), $0x100000 # Into C and sanity
2: // execution path for the boot CPU
#endif
lss SYMBOL_NAME(stack_start),%esp
xorl %eax,%eax
1: incl %eax # check that A20 really IS enabled
movl %eax,0x000000 # loop forever if it isn't
cmpl %eax,0x100000
je 1b
pushl $0
popfl
xorl %eax,%eax
movl $ SYMBOL_NAME(_edata),%edi
movl $ SYMBOL_NAME(_end),%ecx
subl %edi,%ecx
cld
rep
stosb
subl $16,%esp # place for structure on the stack
pushl %esp # address of structure as first arg
call SYMBOL_NAME(decompress_kernel)
orl %eax,%eax
jnz 3f
xorl %ebx,%ebx
ljmp $(KERNEL_CS), $0x100000
ljmp $(KERNEL_CS), $0x100000
This jump enters the kernel at 0x100000, which in effect leads to the start_kernel function.
asmlinkage void start_kernel(void)
{
char * command_line;
#ifdef __SMP__
static int first_cpu=1;
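// first_cpu is a static variable, not an ordinary local. It is initialized to 1, so the boot CPU, which always runs this function first, sees 1 and then clears it; every other CPU sees 0.
if(!first_cpu)
start_secondary();
// for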
first_cpu=0;
#endif
setup_arch(&command_line, &memory_start, &memory_end);
memory_start = paging_init(memory_start,memory_end);
trap_init();
init_IRQ();
sched_init();
time_init();
parse_options(command_line);
#ifdef CONFIG_MODULES
init_modules();
#endif
#ifdef CONFIG_PROFILE
if (!prof_shift)
#ifdef CONFIG_PROFILE_SHIFT
prof_shift = CONFIG_PROFILE_SHIFT;
#else
prof_shift = 2;
#endif
#endif
if (prof_shift) {
prof_buffer = (unsigned int *) memory_start;
prof_len = (unsigned long) &_etext - (unsigned long) &_stext;
prof_len >>= prof_shift;
memory_start += prof_len * sizeof(unsigned int);
}
memory_start = console_init(memory_start,memory_end);
#ifdef CONFIG_PCI
memory_start = pci_init(memory_start,memory_end);
#endif
memory_start = kmalloc_init(memory_start,memory_end);
sti();
calibrate_delay();
memory_start = inode_init(memory_start,memory_end);
memory_start = file_table_init(memory_start,memory_end);
memory_start = name_cache_init(memory_start,memory_end);
#ifdef CONFIG_BLK_DEV_INITRD
if (initrd_start && initrd_start < memory_start) {
printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
"disabling it.\n",initrd_start,memory_start);
initrd_start = 0;
}
#endif
mem_init(memory_start,memory_end);
buffer_init();
sock_init();
#if defined(CONFIG_SYSVIPC) || defined(CONFIG_KERNELD)
ipc_init();
#endif
dquot_init();
arch_syms_export();
sti();
check_bugs();
printk(linux_banner);
#ifdef __SMP__
smp_init();
#endif
sysctl_init();
kernel_thread(init, NULL, 0);
cpu_idle(NULL);
}
asmlinkage void start_secondary(void)
{
trap_init();
init_IRQ();
// initialize this CPU's own IRQs
smp_callin();
// wait here until the boot CPU gives everyone the start signal
cpu_idle(NULL);
// this is the idle process
}
void smp_callin(void)
{
extern void calibrate_delay(void);
int cpuid=GET_APIC_ID(apic_read(APIC_ID));
unsigned long l;
SMP_PRINTK(("CALLIN %d\n",smp_processor_id()));
l=apic_read(APIC_SPIV);
l|=(1<<8);
apic_write(APIC_SPIV,l);
sti();
calibrate_delay();
smp_store_cpu_info(cpuid);
set_bit(cpuid, (unsigned long *)&cpu_callin_map[0]);
load_ldt(0);
local_flush_tlb();
while(!smp_commenced);
// this can be viewed as a spin wait: it loops until the boot CPU sets smp_commenced, the start signal
if (cpu_number_map[cpuid] == -1)
while(1);
local_flush_tlb();
SMP_PRINTK(("Commenced..\n"));
load_TR(cpu_number_map[cpuid]);
}
int cpu_idle(void *unused)
{
for(;;)
idle();
}
The boot CPU gives the secondary CPUs the start signal by calling smp_begin from the init function:
static void smp_begin(){
smp_threads_ready=1;
smp_commence();
// this notifies the secondary CPUs: it sets smp_commenced, which releases the CPUs spinning in smp_callin()
}
Each CPU has its own current pointer.
Initially the boot CPU sets it to init_task;
the assignment happens when the boot CPU calls sched_init.
void sched_init(void)
{
int cpu=smp_processor_id(); // this is 0, because only the boot CPU calls this function
#ifndef __SMP__
current_set[cpu]=&init_task;
#else
init_task.processor=cpu;
// mark init_task as running on the boot CPU
for(cpu = 0; cpu < NR_CPUS; cpu++)
current_set[cpu] = &init_task;
#endif
init_bh(TIMER_BH, timer_bh);
init_bh(TQUEUE_BH, tqueue_bh);
init_bh(IMMEDIATE_BH, immediate_bh);
}
These entries are filled in further by smp_init.
static void smp_init(void)
{
int i, j;
smp_boot_cpus();
for (i=1; i<smp_num_cpus; i++)
{
struct task_struct *n, *p;
j = cpu_logical_map[i];
kernel_thread(cpu_idle, NULL, CLONE_PID);
// this creates a thread that ends up in task[i], because the task_struct for the new thread is taken from task[i]
current_set[j]=task[i];
current_set[j]->processor=j;
cli();
n = task[i]->next_run;
p = task[i]->prev_run;
nr_running--;
n->prev_run = p;
p->next_run = n;
task[i]->next_run = task[i]->prev_run = task[i];
sti();
}
}
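Once the code above has run, every CPU has been given an idle task.
After that, kernel_thread(init, NULL, 0) creates the init task.
// On a timer interrupt, any CPU may call this common function.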
asmlinkage void schedule(void)
{
int c;
struct task_struct * p;
struct task_struct * prev, * next;
unsigned long timeout = 0;
int this_cpu=smp_processor_id();
// get the id of this CPU
if (intr_count)
goto scheduling_in_interrupt;
if (bh_active & bh_mask) {
intr_count = 1;
do_bottom_half();
intr_count = 0;
}
run_task_queue(&tq_scheduler);
need_resched = 0;
prev = current;
cli();
if (!prev->counter && prev->policy == SCHED_RR) {
prev->counter = prev->priority;
move_last_runqueue(prev);
}
switch (prev->state) {
case TASK_INTERRUPTIBLE:
if (prev->signal & ~prev->blocked)
goto makerunnable;
timeout = prev->timeout;
if (timeout && (timeout <= jiffies)) {
prev->timeout = 0;
timeout = 0;
makerunnable:
prev->state = TASK_RUNNING;
break;
}
default:
del_from_runqueue(prev);
case TASK_RUNNING:
}
p = init_task.next_run;
// get a node of the doubly linked list of runnable tasks
sti();
#ifdef __SMP__
prev->processor = NO_PROC_ID;
#define idle_task (task[cpu_number_map[this_cpu]])
#else
#define idle_task (&init_task)
#endif
c = -1000;
next = idle_task;
while (p != &init_task) {
// p starts at init_task.next_run
// reaching init_task again means every task on the list has been examined
int weight = goodness(p, prev, this_cpu);
if (weight > c)
c = weight, next = p;
p = p->next_run;
}
// the loop above scans all runnable tasks and picks the most suitable one to run
if (!c) {
for_each_task(p)
p->counter = (p->counter >> 1) + p->priority;
}
#ifdef __SMP__
next->processor = this_cpu;
// mark the task that is about to run as running on this CPU
next->last_processor = this_cpu;
#endif
#ifdef __SMP_PROF__
if (0==next->pid)
set_bit(this_cpu,&smp_idle_map);
else
clear_bit(this_cpu,&smp_idle_map);
#endif
if (prev != next) {
struct timer_list timer;
kstat.context_swtch++;
if (timeout) {
init_timer(&timer);
timer.expires = timeout;
timer.data = (unsigned long) prev;
timer.function = process_timeout;
add_timer(&timer);
}
get_mmu_context(next);
switch_to(prev,next);
if (timeout)
del_timer(&timer);
}
return;
scheduling_in_interrupt:
printk("Aiee: scheduling in interrupt %p\n",
__builtin_return_address(0));
}
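Note the current variable above: on a uniprocessor it is a single variable, while on a multiprocessor each CPU has its own current.
It is defined as follows:
#define current (0+current_set[smp_processor_id()])
Under SMP, current is one element of the current_set array: the task currently running on one particular CPU.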
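A small user-space model may help show how one macro name can refer to a different task on each CPU. In this sketch, this_cpu is a hypothetical stand-in for whatever the hardware reports through smp_processor_id(), and struct task is reduced to a single field; neither is the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

#define NR_CPUS 4

struct task { int pid; };

/* One slot per CPU: the task that CPU is currently running. */
static struct task *current_set[NR_CPUS];

/* Stand-in for the CPU id that real hardware would report. */
static int this_cpu;

static int smp_processor_id(void)
{
    assert(this_cpu >= 0 && this_cpu < NR_CPUS);
    return this_cpu;
}

/* Same shape as the kernel macro: index the array by the calling CPU. */
#define current (current_set[smp_processor_id()])
```

Reading current with this_cpu set to 0 and then to 1 returns two different tasks, although the source text of the expression is identical in both cases.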
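As shown above, a CPU picks a task to run from the global set of tasks, and each CPU has an idle_task whose slot number is fixed.
Every task can be reached from init_task, because a newly created process (or kernel thread) is linked into the list.
init_task itself is statically linked into it.
The task_struct, for reference: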
struct task_struct {
volatile long state;
long counter;
long priority;
unsigned long signal;
unsigned long blocked;
unsigned long flags;
int errno;
long debugreg[8];
struct exec_domain *exec_domain;
struct linux_binfmt *binfmt;
struct task_struct *next_task, *prev_task;
struct task_struct *next_run, *prev_run;
unsigned long saved_kernel_stack;
unsigned long kernel_stack_page;
int exit_code, exit_signal;
unsigned long personality;
int dumpable:1;
int did_exec:1;
int pid;
int pgrp;
int tty_old_pgrp;
int session;
int leader;
int groups[NGROUPS];
struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr, *p_osptr;
struct wait_queue *wait_chldexit;
unsigned short uid,euid,suid,fsuid;
unsigned short gid,egid,sgid,fsgid;
unsigned long timeout, policy, rt_priority;
unsigned long it_real_value, it_prof_value, it_virt_value;
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
long utime, stime, cutime, cstime, start_time;
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
int swappable:1;
unsigned long swap_address;
unsigned long old_maj_flt;
unsigned long dec_flt;
unsigned long swap_cnt;
struct rlimit rlim[RLIM_NLIMITS];
unsigned short used_math;
char comm[16];
int link_count;
struct tty_struct *tty;
struct sem_undo *semundo;
struct sem_queue *semsleeping;
struct desc_struct *ldt;
struct thread_struct tss;
struct fs_struct *fs;
struct files_struct *files;
struct mm_struct *mm;
struct signal_struct *sig;
#ifdef __SMP__
int processor;
int last_processor;
int lock_depth;
#endif
};
So, starting from p = init_task.next_run;
p can reach every task that is in the ready state.
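The traversal described here relies on a circular doubly linked list with init_task as its fixed head. The sketch below reproduces that shape in user space, with a stored weight field standing in for the value goodness() would compute. The struct and helper names mirror the kernel's for readability, but this is a simplified model under those assumptions, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct task {
    int weight;                         /* stand-in for goodness() */
    struct task *next_run, *prev_run;   /* circular run-queue links */
};

/* An empty queue is a head that points at itself. */
void runqueue_init(struct task *head)
{
    head->next_run = head->prev_run = head;
}

/* Link t in right after the head. */
void add_to_runqueue(struct task *head, struct task *t)
{
    assert(head != NULL && t != NULL);
    t->next_run = head->next_run;
    t->prev_run = head;
    head->next_run->prev_run = t;
    head->next_run = t;
}

/* Unlink t and leave it pointing at itself. */
void del_from_runqueue(struct task *t)
{
    t->prev_run->next_run = t->next_run;
    t->next_run->prev_run = t->prev_run;
    t->next_run = t->prev_run = t;
}

/* Walk from head->next_run until the head is reached again and return
 * the task with the highest weight, or idle if the queue is empty. */
struct task *pick_next(struct task *head, struct task *idle)
{
    int c = -1000;
    struct task *next = idle;
    for (struct task *p = head->next_run; p != head; p = p->next_run)
        if (p->weight > c) {
            c = p->weight;
            next = p;
        }
    return next;
}
```

The head never takes part in the comparison, which matches the loop condition while (p != &init_task) in schedule().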
There are a few SMP-related macros, such as CONFIG_SMP, CONFIG_X86_LOCAL_APIC, CONFIG_X86_IO_APIC, CONFIG_MULTIQUAD and CONFIG_VISWS. I will ignore code that requires CONFIG_MULTIQUAD or CONFIG_VISWS, which most people don't care about (unless they use an IBM high-end multiprocessor server or an SGI Visual Workstation).
The BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger the APs. Check the MultiProcessor Specification and the IA-32 Manual Vol. 3 (Ch. 7, Multiple-Processor Management, and Ch. 8, Advanced Programmable Interrupt Controller) for technical details.
8.1. Before smp_init()
Before calling smp_init(), start_kernel() does some work to set up the SMP environment:
start_kernel()
|-- setup_arch()
|   |-- parse_cmdline_early(); // SMP looks for "noht" and "acpismp=force"
|   |   `-- if (!memcmp(from, "noht", 4)) {
|   |           disable_x86_ht = 1;
|   |           set_bit(X86_FEATURE_HT, disabled_x86_caps);
|   |       }
|   |       else if (!memcmp(from, "acpismp=force", 13))
|   |           enable_acpi_smp_table = 1;
|   |-- setup_memory(); // reserve memory for MP configuration table
|   |   |-- reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
|   |   `-- find_smp_config();
|   |       `-- find_intel_smp();
|   |           `-- smp_scan_config();
|   |               |-- set flag smp_found_config
|   |               |-- set MP floating pointer mpf_found
|   |               `-- reserve_bootmem(mpf_found, PAGE_SIZE);
|   |-- if (disable_x86_ht) { // if HyperThreading feature disabled
|   |       clear_bit(X86_FEATURE_HT, &boot_cpu_data.x86_capability[0]);
|   |       set_bit(X86_FEATURE_HT, disabled_x86_caps);
|   |       enable_acpi_smp_table = 0;
|   |   }
|   |-- if (test_bit(X86_FEATURE_HT, &boot_cpu_data.x86_capability[0]))
|   |       enable_acpi_smp_table = 1;
|   |-- smp_alloc_memory();
|   |   `-- trampoline_base = (void *) alloc_bootmem_low_pages(PAGE_SIZE);
|   `-- get_smp_config();
|       |-- config_acpi_tables();
|       |   |-- memset(&acpi_boot_ops, 0, sizeof(acpi_boot_ops));
|       |   |-- acpi_boot_ops[ACPI_APIC] = acpi_parse_madt;
|       |   `-- if (enable_acpi_smp_table && !acpi_tables_init())
|       |           have_acpi_tables = 1;
|       |-- set pic_mode
|       |-- save local APIC address in mp_lapic_addr
|       `-- scan for MP configuration table entries, like
|           MP_PROCESSOR, MP_BUS, MP_IOAPIC, MP_INTSRC and MP_LINTSRC.
|-- trap_init();
|   `-- init_apic_mappings(); // setup PTE for APIC
|       |-- if (!smp_found_config && detect_init_APIC()) {
|       |       apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE);
|       |       apic_phys = __pa(apic_phys);
|       |   } else
|       |       apic_phys = mp_lapic_addr;
|       |-- set_fixmap_nocache(FIX_APIC_BASE, apic_phys);
|       |-- if (boot_cpu_physical_apicid == -1U)
|       |       boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
|       `-- // map IOAPIC address to uncacheable linear address
|           set_fixmap_nocache(idx, ioapic_phys);
|           // Now we can use linear address to access APIC space.
|-- init_IRQ();
|   |-- init_ISA_irqs();
|   |   |-- init_bsp_APIC();
|   |   `-- init_8259A(auto_eoi=0);
|   `-- setup SMP/APIC interrupt handlers, esp. IPI.
`-- mem_init();
    `--
IPI (InterProcessor Interrupt), a CPU-to-CPU interrupt sent through the local APIC, is the mechanism used by the BSP to trigger the APs.
Be aware that "one local APIC per CPU is required" in an MP-compliant system. Processors do not share APIC local units address space (physical address 0xFEE00000 - 0xFEEFFFFF), but will share APIC I/O units (0xFEC00000 - 0xFECFFFFF). Both address spaces are uncacheable.
8.2. smp_init()
BSP calls start_kernel() -> smp_init() -> smp_boot_cpus() to set up data structures for each CPU and activate the remaining APs.
///////////////////////////////////////////////////////////////////////////////
static void __init smp_init(void)
{
  smp_boot_cpus();
  wait_init_idle = cpu_online_map;
  clear_bit(current->processor, &wait_init_idle);
  smp_threads_ready=1;
  smp_commence() {
    Dprintk("Setting commenced=1, go go go\n");
    wmb();
    atomic_set(&smp_commenced,1);
  }
  printk("Waiting on wait_init_idle (map = 0x%lx)\n", wait_init_idle);
  while (wait_init_idle) {
    cpu_relax(); // i.e. "rep;nop"
    barrier();
  }
  printk("All processors have done init_idle\n");
}
///////////////////////////////////////////////////////////////////////////////
void __init smp_boot_cpus(void)
{
  // ... something not very interesting :-)
  prof_counter[0..NR_CPUS-1] = 0;
  prof_old_multiplier[0..NR_CPUS-1] = 0;
  prof_multiplier[0..NR_CPUS-1] = 0;
  init_cpu_to_apicid() {
    physical_apicid_2_cpu[0..MAX_APICID-1] = -1;
    logical_apicid_2_cpu[0..MAX_APICID-1] = -1;
    cpu_2_physical_apicid[0..NR_CPUS-1] = 0;
    cpu_2_logical_apicid[0..NR_CPUS-1] = 0;
  }
  smp_store_cpu_info(0);
  printk("CPU%d: ", 0);
  print_cpu_info(&cpu_data[0]);
  set_bit(0, &cpu_online_map);
  boot_cpu_logical_apicid = logical_smp_processor_id() {
    GET_APIC_LOGICAL_ID(*(unsigned long *)(APIC_BASE+APIC_LDR));
  }
  map_cpu_to_boot_apicid(0, boot_cpu_apicid) {
    physical_apicid_2_cpu[boot_cpu_apicid] = 0;
    cpu_2_physical_apicid[0] = boot_cpu_apicid;
  }
  global_irq_holder = 0;
  current->processor = 0;
  init_idle(); // will clear corresponding bit in wait_init_idle
  smp_tune_scheduling();
  // ... some conditions checked
  connect_bsp_APIC(); // enable APIC mode if used to be PIC mode
  setup_local_APIC();
  if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_physical_apicid)
    BUG();
  Dprintk("CPU present map: %lx\n", phys_cpu_present_map);
  for (bit = 0; bit < NR_CPUS; bit++) {
    apicid = cpu_present_to_apicid(bit);
    if (apicid == boot_cpu_apicid)
      continue;
    if (!(phys_cpu_present_map & (1 << bit)))
      continue;
    if ((max_cpus >= 0) && (max_cpus <= cpucount+1))
      continue;
    do_boot_cpu(apicid);
    if ((boot_apicid_to_cpu(apicid) == -1) &&
        (phys_cpu_present_map & (1 << bit)))
      printk("CPU #%d not responding - cannot use it.\n", apicid);
  }
  // ... SMP BogoMIPS
  // ... B stepping processor warning
  // ... HyperThreading handling
  setup_APIC_clocks();
  if (cpu_has_tsc && cpucount)
    synchronize_tsc_bp();
smp_done:
  zap_low_mappings();
}
///////////////////////////////////////////////////////////////////////////////
static void __init do_boot_cpu (int apicid)
{
  cpu = ++cpucount;
  // 1. prepare "idle process" task struct for next AP
  if (fork_by_hand() < 0)
    panic("failed fork for CPU %d", cpu);
  idle = init_task.prev_task;
  if (!idle)
    panic("No idle process for CPU %d", cpu);
  idle->processor = cpu;
  idle->cpus_runnable = 1 << cpu; // only on this AP!
  map_cpu_to_boot_apicid(cpu, apicid) {
    physical_apicid_2_cpu[apicid] = cpu;
    cpu_2_physical_apicid[cpu] = apicid;
  }
  idle->thread.eip = (unsigned long) start_secondary;
  del_from_runqueue(idle);
  unhash_process(idle);
  init_tasks[cpu] = idle;
  // 2. prepare stack and code (CS:IP) for next AP
  start_eip = setup_trampoline() {
    memcpy(trampoline_base, trampoline_data,
           trampoline_end - trampoline_data);
    return virt_to_phys(trampoline_base);
  }
  printk("Booting processor %d/%d eip %lx\n", cpu, apicid, start_eip);
  stack_start.esp = (void *) (1024 + PAGE_SIZE + (char *)idle);
  atomic_set(&init_deasserted, 0);
  Dprintk("Setting warm reset code and vector.\n");
  CMOS_WRITE(0xa, 0xf);
  local_flush_tlb();
  Dprintk("1.\n");
  *((volatile unsigned short *) TRAMPOLINE_HIGH) = start_eip >> 4;
  Dprintk("2.\n");
  *((volatile unsigned short *) TRAMPOLINE_LOW) = start_eip & 0xf;
  Dprintk("3.\n");
  // we have setup 0:467 to start_eip (trampoline_base)
  // 3. kick AP to run (AP gets CS:IP from 0:467)
  // Starting actual IPI sequence...
  boot_error = wakeup_secondary_via_INIT(apicid, start_eip);
  if (!boot_error) { // looks OK
    set_bit(cpu, &cpu_callout_map);
    // bit cpu in cpu_callin_map is set by AP in smp_callin()
    if (test_bit(cpu, &cpu_callin_map)) {
      print_cpu_info(&cpu_data[cpu]);
    } else {
      boot_error= 1;
      // marker 0xA5 set by AP in trampoline_data()
      if (*((volatile unsigned char *)phys_to_virt(8192)) == 0xA5)
        printk("Stuck ??\n");
      else
        printk("Not responding.\n");
    }
  }
  if (boot_error) {
    unmap_cpu_to_boot_apicid(cpu, apicid);
    clear_bit(cpu, &cpu_callout_map);
    clear_bit(cpu, &cpu_initialized);
    clear_bit(cpu, &cpu_online_map);
    cpucount--;
  }
  *((volatile unsigned long *)phys_to_virt(8192)) = 0;
}
Don't confuse start_secondary() with trampoline_data(). The former is the EIP value in the task struct of the AP's "idle" process, and the latter is the real-mode code that the AP runs after the BSP kicks it (using wakeup_secondary_via_INIT()).
8.3. linux/arch/i386/kernel/trampoline.S
This file contains the 16-bit real-mode AP startup code. The BSP reserved the memory space trampoline_base in start_kernel() -> setup_arch() -> smp_alloc_memory().
Before the BSP triggers an AP, it copies the trampoline code, between trampoline_data and trampoline_end, to trampoline_base (in do_boot_cpu() -> setup_trampoline()). The BSP sets up 0:467 to point to trampoline_base, so that the AP will run from here.
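The warm-reset vector at 0:467 holds a real-mode far pointer, so the physical address start_eip has to be split into a segment and an offset, as the two stores to TRAMPOLINE_HIGH and TRAMPOLINE_LOW do. The sketch below shows that split and its inverse. The struct and function names are hypothetical, and the code assumes start_eip lies below 1 MiB, which holds for a page obtained from low memory.

```c
#include <assert.h>
#include <stdint.h>

/* A real-mode far pointer as stored in memory: offset first, then segment. */
struct real_mode_vector {
    uint16_t offset;    /* what is stored at TRAMPOLINE_LOW  */
    uint16_t segment;   /* what is stored at TRAMPOLINE_HIGH */
};

/* Split a physical address the same way do_boot_cpu() does. */
struct real_mode_vector warm_reset_vector(uint32_t start_eip)
{
    assert(start_eip < 0x100000u);
    struct real_mode_vector v;
    v.segment = (uint16_t)(start_eip >> 4);
    v.offset  = (uint16_t)(start_eip & 0xFu);
    return v;
}

/* Real-mode address translation: segment * 16 + offset. */
uint32_t real_mode_to_phys(struct real_mode_vector v)
{
    return ((uint32_t)v.segment << 4) + v.offset;
}
```

The same translation shows that the notations 40:67h and 0:467 name one location: both resolve to physical address 0x467.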
///////////////////////////////////////////////////////////////////////////////
trampoline_data()
{
r_base:
  wbinvd; // Needed for NUMA-Q should be harmless for other
  DS = CS;
  BX = 1; // Flag an SMP trampoline
  cli;
  // write marker for master knows we're running
  trampoline_base = 0xA5A5A5A5;
  lidt idt_48;
  lgdt gdt_48;
  AX = 1;
  lmsw AX; // protected mode!
  goto flush_instr;
flush_instr:
  goto CS:100000; // see linux/arch/i386/kernel/head.S:startup_32()
}
idt_48:
  .word 0 # idt limit = 0
  .word 0, 0 # idt base = 0L
gdt_48:
  .word 0x0800 # gdt limit = 2048, 256 GDT entries
  .long gdt_table-__PAGE_OFFSET # gdt base = gdt (first SMP CPU)
.globl SYMBOL_NAME(trampoline_end)
SYMBOL_NAME_LABEL(trampoline_end)
Note that BX=1 when the AP jumps to linux/arch/i386/kernel/head.S:startup_32(), which is different from that of the BSP (BX=0). See Section 6.
8.4. initialize_secondary()
Unlike the BSP, at the end of linux/arch/i386/kernel/head.S:startup_32() in Section 6.4, the AP will call initialize_secondary() instead of start_kernel().
void __init initialize_secondary(void)
{
  asm volatile(
    "movl %0,%%esp\n\t"
    "jmp *%1"
    :
    :"r" (current->thread.esp),"r" (current->thread.eip));
}
As the BSP called do_boot_cpu() to set thread.eip to start_secondary(), control of the AP is passed to this function. The AP uses a new stack frame, which was set up by the BSP in do_boot_cpu() -> fork_by_hand() -> do_fork().
8.5. start_secondary()
All APs wait for the signal smp_commenced from the BSP, triggered in Section 8.2 smp_init() -> smp_commence(). After getting this signal, they will run "idle" processes.
///////////////////////////////////////////////////////////////////////////////
int __init start_secondary(void *unused)
{
  cpu_init();
  smp_callin();
  while (!atomic_read(&smp_commenced))
    rep_nop();
  local_flush_tlb();
  return cpu_idle(); // never return, see Section 7.3
}
cpu_idle() -> init_idle() will clear the corresponding bit in wait_init_idle, and finally let the BSP finish smp_init() and continue with the following function in start_kernel() (i.e. rest_init()).
8.6. Reference
MultiProcessor Specification
IA-32 Intel Architecture Software Developer's Manual
Linux Kernel 2.4 Internals: Ch.1.7. SMP Bootup on x86
Linux SMP HOWTO
ACPI spec
An Implementation Of Multiprocessor Linux: linux/Documentation/smp.tex