Task Basics

Task: A task is a unit of execution that the CPU can dispatch, execute and suspend. Different operating systems call it a Process, Thread, Fork or other names. All program execution happens within a task.

A task is made up of the following parts:

  1. Task Execution Space: This consists of a code segment, a stack segment and one or more data segments. If the operating system uses processor privilege-level protection, the task execution space provides a separate stack for each privilege level (SS0, SS1 and SS2).
  2. Task State Segment (TSS): This provides space for storing the task state information. It is mainly used in multitasking systems.
  3. TSS Descriptor: Like other segments, the TSS is defined by a segment descriptor. A TSS descriptor may be placed only in the GDT (Global Descriptor Table); it cannot be placed in the IDT (Interrupt Descriptor Table) or an LDT (Local Descriptor Table). Refer to the following figure.

The Busy flag (B) in the descriptor indicates whether a task is busy. The operating system must maintain this flag (setting/clearing it) while running multiple tasks. If the flag is 0 the task is not busy; otherwise it is busy.

When the G flag in a TSS descriptor is 0, the limit field must have a value equal to or greater than 67H; if it is even one byte less than 67H, a task switch will generate an Invalid-TSS exception. The size can be larger if the operating system adds extra fields or includes an I/O permission bitmap in the TSS. By default the processor checks only the minimum limit of 67H, not the maximum; the maximum limit is checked only during I/O operations.

Each CPU must have at least one task. All TSS descriptors are stored only in the GDT, and the GDT size is 2^16 = 65536 bytes. Each descriptor is 8 bytes, so the GDT can hold at most 65536/8 = 8192 descriptors. Therefore, using the hardware method we can create at most 8192 tasks, which is a real limitation. In practice we cannot even create and maintain that many hardware tasks, for various reasons: it is non-portable, it may save more of the current task's state than we need, it is difficult to debug, and so on.

If paging is enabled for a task (most operating systems enable it), the base address of the page directory used by the task is loaded into the CR3 register (Control Register 3). In WinNT, when switching to a new thread that belongs to a different process than the currently executing one, the CR3 register is updated so the new thread accesses its own process address space. This is how one process is protected from accessing another process's address space. Updating CR3 is costly because it flushes the CPU's internal translation cache (TLB).

How many hardware tasks to create is left to the operating system, but it must create at least one. Most modern operating systems switch between tasks with stack-based switching, reusing the same hardware TSS and manually updating the required fields. This software-based task switching gives the operating system greater control in dealing with many situations.

Before running a task, the selector of its TSS must be loaded into the task register using the LTR (Load Task Register) instruction; STR (Store Task Register) reads it back. The task register has a visible field accessible to software and an invisible field accessible only to the processor.

The TSS of the currently executing task holds the following fields (check the following figure; a rough C sketch of this layout is shown after the list):

  1. Segment registers CS, DS, SS, ES, FS and GS
  2. State of the general-purpose registers
  3. State of the EFLAGS register
  4. State of EIP (instruction pointer)
  5. State of control register CR3
  6. State of the task register
  7. State of the LDTR register
  8. I/O map base address and I/O map
  9. Stack pointers for all privilege levels (ESP0/SS0, ESP1/SS1, ESP2/SS2)
  10. Link to the previously executed task
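As a rough C sketch of this layout (the structure and field names below are our own illustration, not taken from any SDK header; consult the Intel manuals for the authoritative offsets), the 32-bit TSS can be pictured like this. Note that sizeof(TSS32) comes to 104 (68H) bytes, which is why the minimum descriptor limit mentioned above is 67H:

typedef struct _TSS32
{
    unsigned short PreviousTaskLink, Reserved0;             /* link to the previously executed task */
    unsigned int   Esp0; unsigned short Ss0, Reserved1;     /* stack pointer/segment for privilege level 0 */
    unsigned int   Esp1; unsigned short Ss1, Reserved2;     /* stack pointer/segment for privilege level 1 */
    unsigned int   Esp2; unsigned short Ss2, Reserved3;     /* stack pointer/segment for privilege level 2 */
    unsigned int   Cr3;                                     /* page-directory base (CR3) */
    unsigned int   Eip, Eflags;                             /* instruction pointer and EFLAGS */
    unsigned int   Eax, Ecx, Edx, Ebx, Esp, Ebp, Esi, Edi;  /* general-purpose registers */
    unsigned short Es, Reserved4, Cs, Reserved5, Ss, Reserved6;
    unsigned short Ds, Reserved7, Fs, Reserved8, Gs, Reserved9; /* segment registers */
    unsigned short LdtSelector, Reserved10;                 /* LDTR selector */
    unsigned short Trap, IoMapBase;                         /* debug trap flag and I/O map base address */
} TSS32;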

A task switch happens through a JMP or CALL instruction, and the return to the previous task is initiated with the IRET instruction. A task switch can also be triggered through the INT instructions.

Using the multitasking method explained above, we can create many tasks that appear to execute concurrently, even though they cannot truly run concurrently on a single physical CPU. With stack-based task switching, the operating system looks as if it is executing more than one task at a time, using preemptive multitasking. This is done by assigning a time slice to each thread; after a thread has executed for that much time, the operating system suspends the currently executing task and dispatches a new task for execution.

When we have multiple tasks sharing the same global data structure, we need to synchronize access to the data; otherwise the tasks might corrupt it by updating it from more than one task at the same time.

Take example of the following structure:

typedef struct _DUMBSTRUCT
{
    int iIndex;
    char cValue;
} DUMBSTRUCT;

DUMBSTRUCT strData; // Global variable

void TASK1()
{
    strData.iIndex = 10;
    strData.cValue = 'A';
}

void TASK2()
{
    strData.iIndex = 20;
    strData.cValue = 'B';
}

Here there are two tasks that the OS can switch between. While TASK1 is executing and has just updated the first field of the structure, its time slice may run out, so the operating system suspends the task and schedules TASK2.

The current value of the struct is:

strData.iIndex = 10 ;
strData.cValue = 0 ;

TASK2 updates the same variable, and it updates both fields. Now the current value of 'strData' is as follows:

strData.iIndex = 20; // overwrites the value written by TASK1
strData.cValue = 'B';

At this point the value TASK1 wrote to iIndex has been lost, and when TASK1 resumes it will overwrite cValue as well, leaving 'strData' with a mix of values from both tasks. To avoid this, we need to use a synchronization method that prevents such data corruption.

Modified code:

void TASK1()
{
    WaitForLock();
    strData.iIndex = 10;
    strData.cValue = 'A';
    ReleaseLock();
}

void TASK2()
{
    WaitForLock();
    strData.iIndex = 20;
    strData.cValue = 'B';
    ReleaseLock();
}

This way, until TASK1 finishes updating the global variable, TASK2 cannot acquire the lock and therefore cannot update the variable; it waits until it gets the lock before touching the global structure.
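WaitForLock() and ReleaseLock() above are only placeholders. As a minimal sketch of one way to realize them on WinNT (assuming a single process-wide lock shared by both tasks), a Win32 critical section can be used; InitLock() must be called once before either task runs:

#include <windows.h>

static CRITICAL_SECTION g_Lock;                                /* protects strData */

void InitLock(void)    { InitializeCriticalSection(&g_Lock); } /* call once at startup */
void WaitForLock(void) { EnterCriticalSection(&g_Lock); }      /* blocks until the lock is free */
void ReleaseLock(void) { LeaveCriticalSection(&g_Lock); }      /* lets the other task proceed */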

Now, do we need to synchronize if we use a global variable of a basic data type, like the following?

int Index = 0 ;

void TASK1()
{
    Index++;
}

void TASK2()
{
    Index++;
}

Let's say the compiler generates code for the following statement:

Index++;

[The compiler expands this into roughly 3 steps; these are illustrative, not actual instructions]

STEP-1: move Index value to register
STEP-2: add 1 to register value
STEP-3: move register value back to Index

While TASK1 is executing, the operating system suspends it right after STEP-1 in order to execute another task.

The value of Index fetched into the register is 0.

When TASK2 starts executing, it performs the same 3 steps:

STEP-1: Register <- Index (Value is 0)
STEP-2: Register = Register + 1 (Value of Register is 1)
STEP-3: Index <- Register (stores value of 1 into Index)

Current value of Index is 1.

After this, the operating system switches back to TASK1 and continues executing from STEP-2 (earlier it was suspended after executing STEP-1). The value in the register is still 0, and execution continues:

STEP-2: Register = Register + 1 (Value of Register is 1)
STEP-3: Index <- Register (stores value of 1 into Index)

Here you can clearly see that the final value of the variable 'Index' should have been 2, but because it was accessed incorrectly from multiple tasks, the value got corrupted. Most modern compilers will not generate three separate instructions for incrementing a variable; they can do it with a single instruction. However, with multiple physical CPUs there is still a chance of the same variable being accessed by multiple tasks at the same time: with 2 physical processors, 2 tasks can execute in parallel. To synchronize in this situation the hardware itself provides the LOCK prefix, which guarantees that only one processor at a time can modify the variable.

In WinNT, there are Win32 functions for this purpose, such as:


InterlockedCompareExchange(), InterlockedCompareExchangePointer(), InterlockedDecrement(), InterlockedExchange(), InterlockedExchangeAdd(), InterlockedExchangePointer() and InterlockedIncrement().
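As a minimal sketch, the Index++ race described above can be fixed with one of these functions (assuming Index is declared as a LONG, which is what the Interlocked APIs expect):

#include <windows.h>

volatile LONG Index = 0;            /* shared counter */

void TASK1(void) { InterlockedIncrement(&Index); }  /* atomic read-modify-write; uses a LOCKed instruction internally */
void TASK2(void) { InterlockedIncrement(&Index); }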

WinNT Process

A process is a data structure that holds the information needed to control a running program. It has the following fields:

  1. Info about physical memory used by this process.
  2. Info about code, data and stack segment used by this process.
  3. Info about each thread created by this process.
  4. Info about memory allocations done by different threads.
  5. Info about file handles from different threads.
  6. Info about various objects created or opened by this process.
  7. A lot of other information is stored in this data structure to keep track of what is happening in each process.

WinNT Threads

A thread is a data structure that controls an execution unit. This data structure is used to store all of the task's information when the task is suspended. It has fields such as:

  1. Info about the context registers (similar to a TSS)
  2. Two stack pointers, one for user mode and one for kernel mode
  3. Private storage for use by subsystems, run-time libraries and DLLs
  4. Security context info related to the thread. A lot of other information is also stored to control the execution unit.

Task scheduling: A preemptive multitasking system generally has multiple queues, each with an assigned priority. Whenever a user creates a task, they also specify a priority for it. Based on this, the operating system selects a task and dispatches it for execution. If a high-priority task arrives while a low-priority task is running, the operating system suspends the current task and runs the high-priority one. The operating system also makes sure that tasks at every priority level get a fair chance to use CPU time.

This preemptive multitasking is implemented using a timer: the timer ISR (Interrupt Service Routine) gets control every millisecond or so, and there the operating system gets a chance to check the time slice of the currently running thread; if needed, it switches to a new thread and continues executing that thread.

Note: Hyper-threading provides 2 logical CPUs within the same physical CPU, which can run 2 tasks at a time. Synchronization should be applied just as for a multi-processor system.

Stack Internals

A stack is an abstract data type and data structure that works on the Last In, First Out (LIFO) principle. A stack is a container of nodes and has two basic operations, PUSH and POP: PUSH adds a node to the top of the stack and POP removes the node from the top of the stack.
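As a minimal array-based sketch of these two operations (fixed capacity with simple overflow/underflow checks; a real implementation might grow dynamically):

#define STACK_SIZE 64

static int g_Stack[STACK_SIZE];   /* storage for the nodes */
static int g_Top = 0;             /* index of the next free slot; 0 means the stack is empty */

int Push(int Value)               /* returns 1 on success, 0 on overflow */
{
    if (g_Top >= STACK_SIZE)
        return 0;
    g_Stack[g_Top++] = Value;     /* add the node on top */
    return 1;
}

int Pop(int *pValue)              /* returns 1 on success, 0 on underflow */
{
    if (g_Top == 0)
        return 0;
    *pValue = g_Stack[--g_Top];   /* remove the node from the top */
    return 1;
}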

System Stack

The system stack provides a mechanism to allocate memory dynamically for the various data associated with a function (or procedure). The system stack stores the following types of data:

  1. Local variables: All variables created inside a function, except static variables, are temporary; they are destroyed after the procedure finishes executing or when they go out of scope.
  2. Return address: Before the CPU transfers execution control to a new procedure, it automatically stores the pointer to the next instruction on the stack.
  3. Parameters: Before a procedure is called, each parameter is either stored on the system stack or placed in a register; this depends entirely on the function's calling convention and the availability of CPU registers.
  4. Registers: Typically, after entering a function, some CPU registers are saved on the system stack so that the function can use those registers until it finishes executing.

Win32 calling convention

All arguments are widened to 32 bits when they are passed. Return values are returned in EAX and are 32 bits; an 8-byte structure is returned in EDX:EAX, and a larger structure is returned through a hidden structure pointer, with EAX holding that pointer on return.

Parameters are passed to the function, and stack cleanup happens according to the calling convention. The compiler generates prolog and epilog code to save and restore the ESI, EDI, EBX and EBP registers if they are used inside a function. A compiler can also create new calling conventions based on need.

Calling conventions available in the Visual C/C++ compiler

  1. cdecl: Parameters are pushed from right to left and the caller cleans the parameters off the stack. This is mainly useful when a function takes a variable number of arguments, since in that case only the caller knows how much stack to clean up. Most C functions use this convention, and it is the default in the Visual C/C++ compiler.

    In this example the parameters are pushed from right to left, and after cdeclFunc() completes, the caller cleans up the stack. The part marked with a square box shows the cleanup code; ADD ESP, 10h means 16 bytes are removed from the stack.

  2. stdcall: Parameters are pushed from right to left and the callee cleans the parameters off the stack. The Win32 API primarily uses this convention.

    The caller does not clean up the stack in the above code.

    The part marked with a square box shows the cleanup code; RET 10h means 16 bytes are removed from the stack.

  3. fastcall: The first two DWORD parameters are passed in registers (ECX and EDX) and the rest are pushed onto the stack from right to left. If any parameters were pushed onto the stack, the callee cleans them up.

    For the first two DWORD parameters the compiler uses the ECX and EDX registers; the rest of the parameters are pushed from right to left. The caller does not clean up the stack in the above code.

    The ECX and EDX registers hold the first two parameters, and the rest are read from the stack wherever required. In our example we pushed two DWORD parameters onto the stack; RET 8 means 8 bytes of data are removed from the stack.

  4. thiscall: This convention is mainly used for C++ member functions; parameters are pushed from right to left. The this pointer is passed in the ECX register instead of being pushed onto the stack. The callee cleans up the stack.

    The this pointer is passed in the ECX register.

    The callee cleans up the stack; RET 10h means 16 bytes are removed from the stack.
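In Visual C/C++ the convention can be selected explicitly per function with a keyword. A small sketch follows (the function names are only illustrative; the byte counts assume four int parameters, matching the examples above):

int __cdecl    cdeclFunc(int a, int b, int c, int d);  /* caller cleans up 16 bytes: ADD ESP, 10h */
int __stdcall  stdFunc(int a, int b, int c, int d);    /* callee cleans up: RET 10h */
int __fastcall fastFunc(int a, int b, int c, int d);   /* a and b travel in ECX/EDX; callee removes the remaining 8 bytes: RET 8 */
/* thiscall has no keyword in C; it is applied automatically to C++ non-static member functions */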

Executing a high-level function results in the following operations internally:

  1. Push the parameters onto the stack
  2. Call the function; this stores the return address on the stack
  3. Save and update the EBP register
  4. Allocate local variables on the stack
  5. Save whichever CPU registers are used inside the function (PUSH those registers)
  6. Execute the body of the function, which serves its purpose
  7. Release the local-variable storage from the stack
  8. Restore the saved registers (POP all pushed registers)
  9. Restore the old base pointer (EBP)
  10. Pop the return address and transfer control back to the caller
  11. Clean up the pushed parameters (based on the calling convention)

Let's say you call a function that uses the cdecl convention: TestFunc(10, 20, 30, 40);


Parameters are pushed from right to left: 40 (28h), 30 (1Eh), 20 (14h), 10 (0Ah). After the CALL TestFunc instruction executes, the stack looks like the following:

The memory window shows the stack dump. The current stack pointer after entering TestFunc() is 0x0012F21C (check the ESP register in the Register window).

In the stack dump you will find the following data:

  • 0x0012F21C holds the value 0x004151d0. This is the return address the function goes back to after it finishes executing; this address is visible at the call site in the preceding code.
  • 0x0012F21C+4 holds 0x0000000A (10). This is the first argument passed to TestFunc().
  • 0x0012F21C+8 holds 0x00000014 (20). This is the second argument passed to TestFunc().
  • 0x0012F21C+12 holds 0x0000001E (30). This is the third argument passed to TestFunc().
  • 0x0012F21C+16 holds 0x00000028 (40). This is the fourth argument passed to TestFunc().

After entering the function, it stores the EBP register on the stack for use inside the function; this register is mainly used to access the stack and its parameters. If the function uses any other registers during its execution, it pushes those registers as well. In our example it pushes EBX, ESI and EDI.

Once it has retrieved the parameters, the function executes its body, releases any local-variable memory from the stack, restores the registers that were pushed earlier (EDI, ESI and EBX), restores the EBP register that was pushed earlier, and returns from the function. The RET instruction pops the return address 0x004151d0 and transfers control to the caller.



Recursion Vs Iteration

Both recursion and iteration execute a piece of code repeatedly for different inputs. In general, to solve any large problem we first divide it into smaller pieces and then solve it either iteratively or recursively.

Recursion sample:

unsigned long FactorialR(int Num)
{
    if (Num == 0)
        return(1);
    return(Num * FactorialR(Num - 1));
}

This function takes 'Num' as a parameter and keeps calling itself until it reaches the exit condition, i.e. 'Num == 0'.

If we pass 5 as the parameter, the function executes as follows:

  • Num[5]*FactorialR(Num-1)[?] -> 5*? it doesn't return immediately
  • Num[4]*FactorialR(Num-1)[?] -> 4*? it doesn't return immediately
  • Num[3]*FactorialR(Num-1)[?] -> 3*? it doesn't return immediately
  • Num[2]*FactorialR(Num-1)[?] -> 2*? it doesn't return immediately
  • Num[1]*FactorialR(Num-1)[?] -> 1*? it doesn't return immediately

Once Num == 0, the calls start returning, and the partial results are combined in reverse order as follows:

  • Num[1]*FactorialR(Num-1)[ 1] -> 1* 1 (1 is the value returned by the previous call)
  • Num[2]*FactorialR(Num-1)[ 1] -> 2* 1 (1 is the value returned by the previous call)
  • Num[3]*FactorialR(Num-1)[ 2] -> 3* 2 (2 is the value returned by the previous call)
  • Num[4]*FactorialR(Num-1)[ 6] -> 4* 6 (6 is the value returned by the previous call)
  • Num[5]*FactorialR(Num-1)[24] -> 5*24 (24 is the value returned by the previous call)

Finally, the function returns the result 120 to the caller.

Iteration sample:

unsigned long FactorialI(int Num)
{
    int Result = 1;
    for ( ; Num > 0; Num-- )
        Result = Result * Num;
    return(Result);
}

This function takes 'Num' as a parameter and executes the code inside the loop for as long as the loop condition 'Num > 0' holds.

If we pass 5 as the parameter, the function executes as follows:

  • Result * Num -> 1*5 = 5 (5 is stored in ‘Result’ variable)
  • Result * Num -> 5*4 = 20 (20 is stored in ‘Result’ variable)
  • Result * Num -> 20*3 = 60 (60 is stored in ‘Result’ variable)
  • Result * Num -> 60*2 = 120 (120 is stored in ‘Result’ variable)
  • Result * Num -> 120*1 = 120 (120 is stored in ‘Result’ variable)

Finally, the function returns the result 120 to the caller.

Code generation by compiler for a function

For any function you write, the compiler generates a default prolog and epilog. The prolog reserves stack space for local variables when the function is entered, and the epilog restores the old register values when the function is left. By default the layout looks like the following:

[Prolog]

[Function Body]

[Epilog]

For example:

[Prolog]

push ebp             ; Save ebp
mov ebp, esp         ; Set stack frame pointer
sub esp, localbytes  ; Allocate space for locals
push <registers>     ; Save registers; <registers> represents the list of registers to be saved

[Function Body]

[Epilog]

pop <registers>      ; Restore registers; <registers> represents the list of registers whose old values are restored
mov esp, ebp         ; Restore stack pointer
pop ebp              ; Restore ebp
ret                  ; Return from function

The prolog and epilog code generation varies from compiler to compiler. The Microsoft compiler provides an option to suppress the default prolog and epilog generation for a given function using the 'naked' keyword; this lets you write your own custom prolog and epilog code.
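As a sketch of how 'naked' can be used (32-bit x86 builds only; the function name and body are just an illustration), the compiler emits no prolog or epilog for a naked function, so you supply them yourself in inline assembly:

__declspec(naked) int NakedAdd(int a, int b)
{
    __asm
    {
        push ebp              ; custom prolog
        mov  ebp, esp
        mov  eax, [ebp+8]     ; first parameter
        add  eax, [ebp+0Ch]   ; add second parameter, result returned in EAX
        mov  esp, ebp         ; custom epilog
        pop  ebp
        ret                   ; cdecl: the caller cleans up the parameters
    }
}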

Calling function

When you call a function, the parameters are first pushed from right to left, and then the return address is pushed before the jump to the function. After the function finishes executing, control jumps back to the return address stored earlier on the stack. Normally this uses the 'ret' or 'ret n' instruction, which transfers control to the address found on the stack.

For example, TestFunction(10, 30). If you call this function, the compiler generates code like the following:

04000000 push 1Eh          // 30 == 1Eh
04000002 push 0Ah          // 10 == 0Ah
04000004 call TestFunction
04000009 ...               // next instruction; CALL pushes this address (04000009h) as the return address

After executing "call TestFunction", the stack looks like the following (each push stores a 32-bit value; the bytes within each value are stored in little-endian order, which is implementation specific):

05000008 0000001Eh // 30, pushed first <----- Bottom of stack (highest address)
05000004 0000000Ah // 10
05000000 04000009h // Return address pushed by CALL <----- Top of stack (ESP points here)

After the function finishes executing, it transfers control back to the caller. Depending on the calling convention, either the caller or the callee clears the parameters that were pushed onto the stack.

Iterative Vs Recursion method

Take any kind of problem: you can use either the iterative or the recursive method to solve it. If a problem can be solved iteratively, the same problem can be solved recursively, and vice versa. But while solving a problem you will decide between them based on the following factors:

  1. Does this system have enough stack space?
  2. Which method takes less code i.e. Iterative or Recursive?
  3. Is it possible to solve this problem using Iterative method easily?
  4. Is it possible to solve this problem using Recursion method easily?
  5. Which method is faster i.e. Iterative or Recursive?

If you use recursion, the compiler allocates space for the variables on entering the function and automatically de-allocates that stack space when leaving the function. This is achieved by the prolog and epilog code generated by the compiler. A recursive function takes some values as parameters, processes them and returns the result (check the FactorialR() function above, and also the prolog and epilog). All of the intermediate results are held in stack variables until the final result is returned. In the recursive method the prolog and epilog play important roles in producing the final output; all three of the following contribute to the final result:

[Prolog] /* This executes once per call */

[Function Body] /* This executes once per call */

[Epilog] /* This executes once per call */

If you use iteration, the compiler does not generate any automatic code like a prolog or epilog for the loop. You store the intermediate results of the iterative process (the piece of code executing inside the loop) in stack variables yourself. Once the loop exits, those variables hold the expected result. The iterative method lives entirely in the "[Function Body]" (check the compiler code generation for a function above). In the iterative method the prolog and epilog play no real role in producing the result; they may just set up storage used during the loop. Only the function body contributes to the final result:

[Function Body] /* The code inside the loop executes more than once */

Recursion to Iteration

When converting in this direction, we need to create our own stack; this stack stores (depending on the problem) the variables that the prolog code would have created in the recursive version. After we are done using those stack entries, we remove them from the stack, which corresponds to the epilog de-allocating stack variables. A sketch of this conversion for the factorial example follows.
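Here is a minimal sketch of FactorialR() rewritten with our own stack (a plain array plays the role the system stack played in the recursive version; the name FactorialS is just for illustration):

unsigned long FactorialS(int Num)
{
    int Stack[32];                   /* our own stack, replacing the system stack */
    int Top = 0;
    unsigned long Result = 1;

    while (Num > 0)                  /* "prolog" phase: push what each recursive call would have saved */
        Stack[Top++] = Num--;

    while (Top > 0)                  /* "epilog" phase: pop and combine, as the returns would have done */
        Result *= Stack[--Top];

    return Result;
}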

Iteration to Recursion

When converting in this direction, we need to analyze what needs to be stored on the stack (the prolog), what the processing code should contain (the function body, generally the code that was executed inside the loop), and how the stack cleanup happens (the epilog).

For both conversions, check the FactorialI() and FactorialR() samples above. Remember, while solving a problem you need to consider the factors mentioned above to identify which method suits the problem best; otherwise you may still solve the problem, but in an ugly way.

The FactorialI() and FactorialR() samples above are simple. For complex problems, some are easier to think about recursively and some are easier to think about iteratively. To convert between the two methods, you need to visualize how the compiler generates code for both recursion and iteration, which makes the conversion easier.

For example, suppose you know how to solve a problem recursively, but on a system with little stack space you need to maintain your own stack for the implementation. If you can visualize how the compiler generates code for a function, and trace it manually, you can convert it to a manual stack implementation easily.


Memory byte Reading-Writing

Segments: Segmentation divides the processor's (CPU's) addressable memory space into smaller protected address spaces called segments.

Segment Selector: A segment selector is a 16-bit identifier for a segment. It does not point directly to the segment; instead it points to the segment descriptor that defines the segment. The segment descriptor holds all the details about the actual segment along with other properties.

We often refer to three types of address:

  1. Logical address: A logical address is the combination of a segment selector and an offset. Its size is 16 bits (selector) : 32 bits (offset).
  2. Linear address: A linear address is the result of adding the segment base, obtained from the GDT (Global Descriptor Table) or LDT (Local Descriptor Table), to the 32-bit offset field of the logical address. A linear address is 32 bits.
  3. Physical address: If paging is not enabled, the linear address directly becomes the physical address. If paging is enabled, the result of the CPU's page-table translation becomes the physical address. The physical address is 32 bits, supporting 4 GB of physical RAM; if PAE (Physical Address Extension) is enabled, the physical address is 36 bits, supporting 64 GB of physical RAM.

WinNT uses a flat memory model and does not rely on segmentation. However, on x86 there is no way to disable segmentation, so NT still has to use at least one segment.

When you allocate memory like the following:

char *pBuffer = malloc( 100 );

In user mode, we can allocate up to 2 GB of memory in the Workstation version of WinNT and up to 3 GB in the Server version. So the memory you get from malloc() lies within that 2 GB or 3 GB of virtual address space.

'pBuffer' will hold a logical address, which is a combination of a segment selector and an offset.

If the application tries to read from or write to that memory, the CPU performs the following operations:

  1. Using the segment selector, the CPU identifies which descriptor table to use, the GDT or the LDT; the segment selector acts as an index into that table. Each table entry is a 64-bit segment descriptor, which contains fields such as the segment's privilege level, segment limit, segment base, segment type, and so on. Check the following figure for the other fields of the segment descriptor.
  2. After getting the segment base address from the LDT or GDT, the CPU adds the 32-bit offset of the logical address to it to get the linear address.
  3. If paging is disabled, the linear address becomes the physical address and the CPU reads/writes memory directly. If paging is enabled, the linear address is interpreted as a combination of: Page Directory index (10 bits) + Page Table index (10 bits) + Offset (12 bits) = 32 bits total.
  4. The first 10 bits are used as an index into the Page Directory to get a Page Directory Entry. A Page Directory is an array of 32-bit Page Directory Entries (PDEs) contained in a 4 KB page, so it holds at most 1024 entries. The base address of the Page Directory is stored in the CR3 register; this changes frequently, whenever the operating system switches between tasks of different processes.
  5. The Page Directory Entry (PDE) holds the base address of a Page Table. A Page Table also holds 1024 entries of 32 bits each, i.e. at most 1024 Page Table Entries (PTEs). The next 10 bits of the linear address are used as an index into the Page Table.
  6. The Page Table Entry holds the base address of a page in physical address space, and the 12-bit offset of the linear address is added to that base to get the actual physical address. The CPU can then read or write the memory byte.


All of the above steps are represented in the following figure:

At most two levels of translation take place on the way to the physical address:

  1. Logical Address -> Linear Address -> Physical Address.
  2. The Logical Address -> Linear Address translation always happens, since segmentation cannot be disabled.
  3. The Linear Address -> Physical Address translation happens only if the operating system enables paging. Without paging we cannot get the benefit of virtual memory. All modern operating systems use paging: if you have 256 MB of physical RAM and want to allocate more than that, the operating system uses the hard disk as backing store and brings pages into RAM on demand from the application.
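As a small sketch of how a 32-bit linear address is split in this two-level scheme (standard 4 KB pages, no PAE; the address used is just an example, and the code only shows the bit arithmetic, it does not walk real page tables):

#include <stdio.h>

int main(void)
{
    unsigned long Linear = 0x0012F21C;               /* an example 32-bit linear address */

    unsigned long PdIndex = (Linear >> 22) & 0x3FF;  /* top 10 bits  -> Page Directory index */
    unsigned long PtIndex = (Linear >> 12) & 0x3FF;  /* next 10 bits -> Page Table index */
    unsigned long Offset  =  Linear        & 0xFFF;  /* low 12 bits  -> offset within the 4 KB page */

    printf("PDE index = %lu, PTE index = %lu, offset = 0x%03lX\n", PdIndex, PtIndex, Offset);
    return 0;
}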