January 2007 - Posts
This is the last of the series, since Security wasn't covered in full.
A storage (disk) has a Master Boot Record (MBR), A Partition Table and 1 or more partitions. Each partition can have a different file system, but no one's assuring you your OS can use it.
Most of a partition is the Blocks, which are like the pages for the data, only for the disk. Blocks that are too large will cause internal fragmentation and blocks that are too small will cause slower sequential access to files (since disk operations are I/O and are slow).
The Unix File System (UFS) - i-nodes
- Everything is a file is the guiding principle in Unix-based systems.
- A part of the partition is the i-nodes, which are the data structure used to describe files and their content's location/s.
An i-node is made up of four parts: - File Attributes, like owner id, last access time, etc..
- n Direct Blocks (where usually, n = 10), which are pointers to the blocks containing the first (n * BlockSize) bytes of the file's data.
- Indirect Block Pointer, which is a pointer to a block that contains (BlockSize / PointerSize) pointers, each to a block of data.
- Double Indirect Block Pointer, which is a pointer to a block that contains (BlockSize / PointerSize) indirect block pointers.
- Triple Indirect Block Pointer, which is a pointer to a block that contains (BlockSize / PointerSize) double indirect block pointers.
Overall, this means that a file's maximum size is: BlockSize * (n + BlockSize / PointerSize + BlockSize2 / PointerSize + BlockSize3 / PointerSize). If we have 1KB blocks and 4 byte pointers, we can't create a file larger than ~16GB.
This structure means that smaller files have the advantage of having to use less disk-reads than the larger files.
Example: Block size is 1K and pointer size is 4 bytes. Byte 355,000 is in the double indirect's 80th block (355,000 = (10 + (210 / 4) + 80) * 210 + 696).
- Hard Links are a way to create more than one file that will reference the same i-node. This i-node will keep a Reference Counter, since deletion of a hard link means freeing of the i-node and blocks only if there are no references left to that file other than the one we have delete now. Hard links to directories are allowed (and will therefore turn the directory tree into a graph), but only for administrators, since it could create a cycle in the graph, when it must remain acyclical (DAG).
- Symbolic Links (or Soft Links) are a file that's contents if the path to another file. This file is simply a shortcut, but is treated as if it was the original file when executed, read, etc.
- Directories are files (as we've stated before), whose contents are simply records of the <i-node number, file name> pair. The root directory is the first i-node.
- lockf is a system call that allows processes to avoid working with the same locations in files, by locking ranges of bytes in a file. Code can decide to not play nice and simply not call lockf and there's nothing you can do about it, so all code accessing that file needs to know it's going to use lockf or it's simply useless.
File Allocation Table (FAT)
- A part of the partition is the File Allocation Table itself, which has an entry for each block and a pointer to the file's next block. That means that it's a fixed length linked list, where the only thing that changes is the pointers (nodes' order).
- Directories are implemented as files with records of the <file name+extension (8.3), file attributes, first data block number> tuples. Since file attributes are kept inside a fixed length record, hard links can't be added to the file system, since there's no room for the reference counting. The root directory is the first file in the table.
- Long File Names - in later implementations of FAT (starting with Windows NT 3.5), long filenames were allowed by using the filename part of the record as a pointer to point to an offset in a heap in the directory's blocks that contained the file names.
- Since storage (disk) is non-volatile, FAT has an Optional Duplicate FAT, which is a backup of the FAT, but as the name indicates, is optional.
- The maximum size of FAT-12 is 16MB, that of FAT-16 is 2GB and of FAT-32 is 2TB. This is due to limitations of pointer size (the n in FAT-n) and block size (at most 4KB in FAT-12 and 32KB in the others).
New Technology File System (NTFS)
- A file in the partition is the Master File Table (MFT), which contains modular records. Since the MFT is a file, the only limit to the number of files is how much space you have on the disk.
- Except for the record's header, each module in the record is called an Attribute and is made up of a <header, data> pair and the following are mandatory:
- Standard Information - these are the same old file attributes we've seen in UFS (i-nodes) and FAT.
- File Name - Since this is an independent attribute and due to non-inlining, there's no limit to the length of the filename (but there's one set anyway).
- Data - The data portion of this attribute is divided into ordered Runs which indicate the location of fragments of the file in <starting block, count> pairs. The first run is <0, total number of runs>.
- An attribute's data may either be Inline (that is, inside the MFT record itself) or not (which means there's a pointer to a block containing the attribute's data). An attribute's header will always be in the MFT record. Inlining is a consideration of the file system itself and needs not be controlled by the OS.
- If a record is too long, we can add an attribute of Extension Records and create more records to keep it.
- Directories are implemented as files with an index attribute, the data portion of which hold records of <MFT record#, file name length, file name, some flags, ...> tuples. Yes, it means there's a duplication of the file name. The root directory is record 5 in the MFT.
- One of the records in the MFT points to itself for consistency. Another points to a backup duplicate (like in FAT, only not optional).
- Transparent File Compression is an ability of NTFS which allows some of the file's runs to be compacted if their compression rate is high enough (if it wasn't high enough - it wouldn't be worth the effort to compress/decompress it). Each run that is compressed is followed by a run like this: <0, number of blocks we saved by compression>.
The Buffer Cache
- Both Unix and Windows have a cache that's held by their kernels which allows for faster disk access. Blocks both read and written to disk are cached and high-level algorithms instruct which blocks to pre-cache or delay-write.
- Unix implements this cache as a hashtable of size n, linking to a triply linked list (doubly linked list with another pointer in each node to simulate a linked list for each hashtable entry that ends in the free list's header). The doubly linked list is sorted from the most recently used block (list head) to the least recently used block (list tail).
When a block is looked for, it is looked for in the list pointed at by the relevant hashtable entry. If we reached the free list, the block isn't cached. Otherwise, the block is moved to the front of the list and is loaded from the cache.
When a new block needs to be added, the doubly linked list's tail is removed and the new block is moved to the head of the linked list (as well as connected to the hashtable entry's list).
Log-Structured File Systems (LFS)
- These file systems are based on the assumption that there is a big buffer cache, which means that most disk operations will be writes.
- All write operations are kept in the cache and once in a while, they are sequentially committed to the disk in one fell swoop (to a 'log').
- Re-organization can now be a separate daemon (service).
Reliability
- The file system must be consistent. UFS (i-nodes) has a program named fsck (File System Check) that checks two things:
- It checks the disk's blocks by using the number of times they appear on the free list and the number of times they're used in i-nodes in the following manner:
- If there is no reference to the block in either count, it is added to the free list.
- If it appears more than once on the free list, delete the excess nodes so that only one is kept.
- If it appears on the free list and is also referenced by an i-node, it's removed from the free list.
- If n i-nodes references the block (where n > 1), it is cloned n times and each i-node is corrected to reference a different one.
- It checks the reference counts for hard links. The number of found files always trumps the number currently in the reference count and replaces it. If either number is 0, the file is erased!
- Dumps are used to keep a backup of a disk. Full dumps are long and wasteful and so we use Incremental Dumps which only copy the changed files over to the dump whenever it is called.
Incremental dumps use bit-maps to decide which files and directories are to be dumped in the following manner: - Create a bit-map in the length of the number of files in the system.
- Turn the bit of all changed files and all directories on. The rest should stay off.
- Turn off the bits of directories not in the path of changed files.
- Write all directories to the dump.
- Write all files to the dump.
An option to optimize the file system's speed is to use the physical makeup of the disk and the knowledge about the operations that take it a lot of time. For instance, placing the i-nodes/FAT/MFT in the beginning of the disk, organizing it in groups on clusters, etc.
Physical Memory (RAM) is small, fast and volatile (pull the plug and nothing's left). Remember that. :)
A process needs memory to run (storing code, stacks, information, data, etc.), so first we need to allocate some for it. Here are some allocation methods:
- Growing Segments - Give the process a predetermined amount of space for code (that will never change, of course) and a preliminary amount of space for data. Leave a big block of unused space after that for data to be added later, if needed. This still means that a process can get a limited amount of space and that there's a finite amount of processes that can live in the system.
- Tracking Allocated Memory - Give the process the space for code and an initial space for data. Then give it more as it needs and track where the used/free spaces are. You can do that with Bitmaps (the structure, not the image type) or linked lists. Now we need an Allocation Strategy, that is - where do we allocate from the space we have left?
- First Fit - No need to search a lot, but highly ineffective.
- Next Fit - The first one we can find from the last place we allocated. Not a lot of improvement there.
- Best Fit - It needs 500 bytes? We'll find the closest we can get to it (without going under 500) and give it that. The problem we get here is external fragmentation (later).
- Worst Fit - Now you're just guessing.
- Quick Fit - Use Best Fit, but do it more quickly by keeping not one linked list, but several for size ranges (want 500 bytes? no prob - we'll go look for it in the 250-750 bytes queue).
- The Buddy System - (Knuth 1973) We need X bytes, so we divide the first piece of memory into twos until we get a piece small enough so that xfloor(lgx) <= x <= xceil(lgx) and then allocate that part. For example, if we have 1024KB and we need 200KB, we divide 1024 in half and then the first half in half as well (we now have chunks sized 256, 256 and 512) and then allocate the first 256KB chunk. Once a chunk is freed (or re-merged), it will be re-merged with it's buddy (that is, the other chunk in the original division). For example, freeing our 256KB chunk would cause it to re-merge with it's 256KB buddy and the 'new' 512KB chunk will then re-merge with it's 512KB buddy to re-form the original 1024KB chunk.
This algorithm's memory utilization isn't that good and there's a lot of internal fragmentation (think what would happen if we needed a chunk of size 2x + 1).
Fragmentation is something we don't like, because it means we have free memory we can't use. There are two types of fragmentation:
- Internal Fragmentation is what happens when a chunk of memory is allocated by the allocation strategy, but only a small percentage of it is ever used. This means that the memory is logically free to be used by someone else, but can't be, since it's allocated.
- External Fragmentation happens when we have a lot of little holes of free memory between regions of allocated memory. We now can't allocate them, since they're not big enough for any of the requests for memory we get. A way of solving this problem is Memory Compaction, where regions of memory are moved to seal the holes and all of the holes are merged.
We've mentioned that memory is small, but we need to host a lot of processes, so we'll just Swap some of it to a backing store (disk). We could swap an entire process if it's blocked, too large, has too many holes in its memory, etc.. But we can't move daemons (services)... Or shared libraries (dlls), so what do we decide?
You know what? Let's put everything on the disk and simply use the physical memory to store it. The contents of the disk are called the Virtual Memory. If you have 4GB of virtual memory and 1GB of physical memory, you have 4GB of memory, since the physical memory is only a sort of buffer for quick and easy access. Remember that. :)
A process will never know where its memory is, since the abstraction supplied by the OS means that it only sees its Logical Address Space, no matter where it is at the moment. It should never need to stoop to that level. We'll discuss how this is implemented later.
There are two ways of managing the backing store's (disk) Swap File (that is - where the virtual memory is): It can either be pre-allocated and used directly (most systems) or allocated when needed and a translation table sits in memory. There is no real reason to go with the second way.
Needing to work fast and easily means we should probably divide our memory into same sized blocks. We'll call those Pages and use their numbers as the first part in a logical address. The second part would be the offset in that page. The Memory Management Unit (MMU) is a (fast) part of the processor that takes care of translating the logical address's page part into the page's location in the physical memory, where it's now called a Frame (it uses a Page Table, that is a table where the index is the logical page number, and the values are the frame's location in the physical memory and some control bits). The important control bits are (in order of checking by the MMU):
- Valid Bit - 0 if the page wasn't even allocated yet for the process (this is because the table includes the maximum amount of entries for a process, even though they may have not yet been allocated). When the MMU sees a 0 value in this bit, it throws a Segmentation Fault (Unix) or BSOD (Windows).
- In Memory Bit (aka Present/Absent Bit) - 1 if the page is actually in the physical memory. 0 means that none of the rest of the entry is valid and when the MMU sees it, it throws a Page Fault, telling the OS that the page isn't in the physical memory and should be first brought from the virtual memory before being asked for.
- Dirty Bit - 1 if the page has been changed in the physical memory and must be committed to disk when paged out.
- Referenced Bit - Changed to 1 whenever the page is read from or written to. Used in the algorithms for selection of pages to be paged out.
The page table in the MMU can be really big, so we can simply split it into a Multilevel Page Table (mostly 2) and only have to keep the topmost level in the physical memory all the time. Associative memory (aka Translation Lookaside Buffer or TLB) helps by keeping recently used logical addresses so that we have less of a hit on performance (think about the amount of page faults bringing the xth level page to physical memory, where x > 1, since the first is always there).
We could also use, for big enough memories, an Inverted Page Table, which means that a table is kept only for the available space in the physical memory (instead of one for the maximum size of a logical address space). The problem is that we now don't have a good lookup method to find logical address pages (it would take O(n)) and we can solve that by creating a hashtable where the key is the <process id, logical page number> pair (costing us only O(1)).
You can read more about these techniques on Wikipedia.
Page Replacement Algorithms come into play when we need to copy a page into physical memory, but can't find an open spot. The following algorithms return a frame number that will be thrown:
- Optimal - If you think it's weird that we start with the optimal algorithm - don't - because it simply does not exist. It's only an index to measure an algorithm against. The optimal algorithm knows what pages are going to be requested next (the future Reference String) and outputs the number of the frame that contains the page that is the farthest away.
- FIFO - We had the best case, now we have the worst case. This algorithm simply outputs the frame number of the page that was paged in the earliest.
- Least Recently Used (LRU) - We can take the reference string of the past and see which page was least recently used. This is like FIFO, only that it doesn't take a frame by the time its page entered the physical memory, but rather by the last reference to that page. This algorithm is good, but implementing it takes too much time and space.
- FIFO Second Chance - This is an elaboration on FIFO and LRU. This algorithm finds the page that was paged in the earliest: if its reference bit is off, its frame number is given, otherwise it is turned off, the frame number goes back to the head of the queue (as if it was paged out and back in) and the algorithm runs again. This is 'fair' towards pages that are referenced often, since the original FIFO would have thrown them away as well.
- Clock - The clock's hand points to the ith frame. If the frame's reference bit is set, it is turned off, the clock's hand moves to frame (i + 1) mod n (where n is the total amount of frames) and the algorithm is run again. Otherwise, the algorithm outputs i. This is the base for WS-Clock, so that's the only reason it's important to us.
- Not Recently Used (NRU) - This algorithm divides pages into four states:
- Both the reference bit and the dirty bit are off.
- The reference bit if off, but the dirty bit is on.
- The reference bit if on, but the dirty bit is off.
- Both the reference bit and the dirty bit are on.
The frame of the page with the lowest number class is returned. Note that state 2 is created by the fact that every few seconds, the algorithm intentionally clears all of the reference flags for more precise results. - Not Frequently Used (NFU) - This algorithm keeps a counter for every frame. At every clock tick it shifts them all by one bit to the right, adds their reference bit in the most significant bit and clears it. The lowest counter value is the counter for the frame that will be returned, since it was the one referenced the farthest backwards and also the least. This allows for simulated aging of references to pages.
Remember: before removing a page, we need to copy it back to the disk if the dirty bit is 1 and also set its in memory bit to 0 in the MMU's page table.
Belady's Anomaly states that increasing the number of frames in memory does not necessarily decrease the number of page faults that will occur. This does not, however, apply to Stack Algorithms (such as Optimal and LRU). So we need to find a middle ground, where not too many (more page faults) or too few (hurts multiprogramming) frames are allocated. This is done dynamically in the OS, which sets a maximum and minimum number of frames usable by a process in the physical memory. These bars are also dynamic themselves, adjusting according to the number of page faults, etc.
When a process has too few frames, it starts Thrashing - the high level scheduler is too busy paging in and out (lots of I/O = bad). This could either be caused by the above paragraph's minimum bar being too low or the fact that there are just too many processes alive and working in the system right now.
To solve thrashing, we use the Working Set Model (WSM), which is based on the principle of locality and uses a window to look at only the frames used by the process in the past n instructions and tries to preserve these pages the most, since it is more likely the process will need them again soon. If the window is too long, it will include some frames not in the last locality that we usually don't need and will limit our choices of frame freeing; too short and it won't include the whole current locality and thrashing will ensue.
Since implementing pure WSM is costly, we use the Working Set Clock Page Replacement Algorithm (aka WS-Clock). Each process has a virtual time that measures the amount of ticks since it was created. For each function, its reference virtual time is kept. When the clock hand reaches a process with its reference bit on, it will turn it off and update its reference virtual time to the owning process's virtual time (similar to FIFO-SC). If the reference bit is off, it will check if it has been referenced in the last τ (Greek letter tau) ticks by comparing τ and the difference between the owning process's virtual time and the page's reference virtual time. If it wasn't, this frame is selected for freeing. Note that if τ is too small, it will cause thrashing and if it's too large, the level of multiprogramming will decline because WS-Clock will keep getting called instead of running a process. τ is dynamically changed by trial and error.
Unix has a page daemon that uses a two-handed clock algorithm - both act as the hands of a WS-Clock with a fixed angle between them. The smaller the angle, the busier a page needs to be to survive.
Windows 2000 keeps a dynamic minimum and maximum of frames per process. Requests for more frames than the maximum are not given. Once in a while, the balance-set manager runs and clears frames until it reaches the minimum frames for the process.
Both systems have a threshold that once it is passed (meaning too few free frames are available), the algorithms kick in.
Segments are logical groups of pages (you might know them as Data Segment, Stack Segment, Code Segment, etc.), which are used to distinguish between different parts of a process's memory. Segments are also the smallest unit of memory shareable between processes (since it is the smallest unit of logical significance).
Preface: I have absolutely no pretence to teach anyone about synchronization with this part, since this is over a year's worth of material I had previously learned at grade 13 (Technician's Degree), so if you want to learn from this, you'll have to read a lot on Wikipedia while reading this.
A Critical Section is a section of code that must never allow more than one process inside it at all times, for fear of Race Conditions. To protect a critical section from this, a protection mechanism must abide by three rules:
- Mutual Exclusion - No two processes can co-exist in a critical section at once. In order to prove that a mechanism abides by this rule, you must assume that two processes are currently in the critical section and by backtracking their steps, find that this is an impossible state.
- Progress - A process that is not currently in a critical section must never be allowed to block another from entering. In order to prove that a mechanism abides by this rule, you must show that the condition xi (xi = true means the process i can enter the critical section) can not be changed indefinitely to false by a process j.
- Bounded Wait (aka No Starvation) - A process's waiting time to enter the critical section is bounded by a finite time, that is that a process will never be locked out of a critical section indefinitely. In order to prove that a mechanism abides by this rule, you must assume a process is barred from entering the critical section and prove that it will eventually enter it.
Proven solutions for this problem using simple constructs include Peterson's Algorithm, Lamport's Bakery Algorithm (which is simply Peterson's for n processes) and Dekker's Algorithm.
A big problem that an algorithm must consider is that the scheduler can pre-emptively stop one process and allow execution to another. This causes a problem, since many instructions in 3rd generation languages, like C, are not atomic and can stop in the middle of execution. A TSL Instruction (Test-and-Set Lock) is one solution to this, but it's pretty complicated, so I won't be getting into it here.
Semaphores are small atomic structures. A semaphore has a value and two operations - Up (increase value by 1) and Down (wait until the value is greater than zero, then decrease it by 1). Semaphores operations are, by definition, atomic. A Mutex is simply a semaphore with a value of either 1 or 0 (continue or wait when downed). Unless you know what you're doing, always use down and up on semaphores in a symmetrical (LIFO) manner (that is - the first downed semaphore will be the last upped one), otherwise you risk deadlocks. (Note: a semaphore might have negative values, but this is irrelevant for us).
Semaphores can achieve their atomicity in one of two ways: TSL Instructions (which are fair, but create a Busy-Waiting mechanism which takes resources) and Disabling Interrupts (which means that nothing will disturb the semaphore's internal work; still unacceptable outside internal system implementation, but since this is an internal implementation, we forgive it).
Semaphores are usable in such problems as the Bounded Buffer and Producer-Consumer, but also in many of the next problems, where they are simply not the main player.
Message Passing can simulate semaphores, since the operation receive blocks until a message is received (that is, someone used send) in the same way that down blocks until someone has used up. The advantage in using this is that you can easily change the critical section's protection to work in a distributed environment.
Monitors are a higher level construct, used in .NET (C#'s lock) and Java (synchronized), that allow operation declarations inside of them. These operations are all mutually exclusive (they are all the same critical section).
The two functions monitors offer to the operation declarations are wait(condition) and signal(condition) which act as a mutex's down and up operations, both of which are only accessible from inside the monitor. Both of those internal operations can act on any of the conditions defined by the monitor (see examples below). Since signaling on a condition means that a waiting operation might wake up and start running, signaling is only allowed at the end of an operation or it will endanger the mutual exclusion. (Note: In the Java implementation of the producer-consumer problem with monitors, you must use notifyall and not notify, since notify will wake up one process, but it's not necessarily the process we want to wake (there is only one condition); also, using notifyall, we have a competition over which process will be allowed to run, which would prevent starvation).
Monitors are usable in such problems as the Bounded Buffer and Producer-Consumer, but in many others as well.
Barriers are a synchronization mechanism that means that all processes must wait when reaching it. When the last one reaches it, they are all simultaneously released. No idea what they're good for.
More problems I'm not going to talk about but are worth mentioning are the Sleeping Barber, Readers-Writers and One-Way Tunnel (only one car can be in the tunnel).
Deadlocks
A simple deadlock occurs when a process A holds a resource R1 and waits for R2, while a process B holds R2 and waits for R1. Systems that have the following characteristics are prone to deadlocks:
- Mutual Exclusion of resources - Each resource can only be used by one process.
- Hold and Wait - A process can hold a resource while waiting for another.
- No Pre-emption or resources - A resource can only be released by its holding process.
- Circular Wait - Two or more processes can wait for resources being held by other waiting processes.
If you model your processes (usually as circles) and resources (usually as squares) as vertices of a directed graph and add requests for and holding of resources as edges, a deadlock might happen if the graph has a circle in it. If the resources involved all have only one instance of them available for holding, a deadlock will happen, but if they have more than one instance, it's not a sure thing.
How do we handle deadlocks then? There are several schools of thought:
- Prevent them from happening by removing one of the aforementioned conditions for deadlock-prone systems:
- Do not allow mutual exclusion of the devices themselves or use spooling (buffering), where only the spooler accesses the resource and many processes can access the spooler. Not very good for many devices and the spools may grow too big.
- Do not allow holding and waiting by either forcing a process to declare all of its required resources up-front and then arranging them in a way that won't cause a deadlock (not really realistic in most cases) or by forcing a process to release every resource it holds before requesting a new one (now that's just crazy talk... :P).
- Allow resource pre-emption, though it might cause incorrect execution (imagine your printer printing half a page from one application, half from another and then continuing with the two halves it has left).
- Prevent circular waiting by:
- Allow processes to only hold one resource at a time (not a good idea).
- Number resources and only allow allocating in ascending order or resources with higher numbers than those currently held (will work, but why do that?).
All in all, no silver bullet. - Avoid deadlocks by granting resources only if it's safe to do so. You can use algorithms like the Banker's Algorithm to make sure that resources are allocated for a process only if it can finish execution.
Again, we need to know which resources each process wants in advance, which is not realistic most of the time. Avoidance is simply not practical. - Detect deadlocks by running a circle detection algorithm on the graph model every k minutes and also whenever CPU utilization drops hard and there are at least two processes in hold and wait mode. Recover from them by creating synthetic pre-emption on the resource (rarely a good thing), rolling back (a checkpoint mechanism must be implemented first - slows everything down and takes space; implemented in DBMS) or simply killing one of the processes involved. Detection and Recovery is very hard and not really cost-effective.
- Ignore them (bury your head in the sand) and let the user handle the problem (reboot). It's the simplest solution and most cost-effective, since deadlocks are really rare and the previous three methods of tackling them are either unrealistic or very CPU and I/O-intensive.
A classic deadlock is shown in the Dining Philosophers' Problem.
Processes need to be managed, so that they will know when to run and when not to run, so here enter the Scheduling Algorithms. A scheduling algorithm needs to be fair and efficient, but how do you measure that?
- Throughput - This measures the number of processes completed per time unit, which is an important thing for batch systems, where you want as many jobs to finish as fast as possible.
- Turnaround Time - The average time it takes a process to complete from the moment it is introduced to the system.
- Waiting Time - The average amount of time the process is ready to run, but isn't run (notice that this does not include blocked processes). This indicates how much CPU time we 'waste' (when of course, we do not waste it - it simply goes towards running other processes).
- Response Time - The time between submitting a command and the beginning of output. This is a measurement mostly used in interactive systems, where response to the user is very important.
- CPU Utilization - How much of the time was the CPU really utilized? A low ranking (that is - not close to 100%) in a batch system means something is wrong. On the other hand, in interactive systems, a computer might be idle and therefore would have no need for the CPU.
Some of these measurements conflict with one another, so they don't all need to be at their best - it all depends on what kind of operating system we're using and what we're using it for.
Let's review some scheduling algorithms:
- First Come, First Served (FCFS) - Better known as FIFO, this is a non-preemptive scheduling algorithm. Whenever a process enters the Ready state, it is assigned a timestamp. When a process exits the Running state (either ends or is blocked), the process with the earliest (smallest) timestamp is allowed to run. This is a very basic approach and is highly ineffective for interactive systems (waiting time is very high).
- Shortest Job First (SJF) - This enhancement of FCFS adds the estimated amount of future running time by the process to the mix and lets the processes that are shortest to run first. This algorithm can be implemented as either pre-emptive or not. The main point of the algorithm is that we're willing to hurt fairness if it gets us better performance.
This produces optimal turnaround time, but can't be implemented without statistics, since processes aren't in the habit of declaring the amount of CPU time they'd need in advance.
In order to guess the amount of CPU time a process needs, we can use a simple equation: G(n+1) = a*T(n) + (1 - a)*G(n), where G is the guess for a CPU-burst instance n, T is the actual time spent in CPU in instance n and a is their balancer, which can easily be altered based on past experience. - Round Robin (RR) - Time in the CPU is divided into quanta and the processes that want to run are given one quantum of time each in a circular pattern (process 1 runs 1 quantum, process 2 runs 1 quantum, ... process n runs 1 quantum, process 1 runs 1 quantum, ...). Note that if we take a quantum large enough, we get FCFS.
Since context switching takes up time, this might not be a great algorithm when the difference between the time to switch between contexts and the time in one quantum is high enough (of course in favor of the quanta). Changing the size of each quantum does not necessarily mean it will have a good impact on the turnaround time. - Shortest Remaining Time First (SRTF) - This is a pre-emptive upgrade of SJF, in which a process stays in Running until such a time as it either finishes running (blocked or exits) or a process with less time to run enters Ready, in which case the running process is pre-emptively returned to Ready and the shorter process enters Running.
- Guaranteed Scheduling (GS) - Also known as Fair Share Scheduling, this algorithm is fair towards every single process in the system (as opposed to RR, which is fair only towards those in the Ready and Running states).
To do this, it holds two counters for each process: - Time Ran - The amount of time quanta the process has stayed in the Running state.
- Time Deserved - This is a sum of the relative amount of time the process was entitled to run: If there are n processes in the system, for each quantum, each of their Time Deserved counters will be increased by 1/n.
If a new process enters, its Time Deserved is 0. A Blocked and/or Suspended process still deserves time (which is the difference from RR).
When the ratio of Time Ran/Time Deserved is greater than 1, this means that a process has taken up more than the amount of time it deserved and now another process must be placed in the Running state.
Imagine two processes (A and B) in the system, which both want to run. A will run first (why not), so that at the end of the first quantum, it would have deserved 1/2 and run 1, while B would deserve 1/2 and run 0. Since A has run more than it deserved (TR/TD ratio = 2 > 1), B will now run. At the end of the second quantum, both A and B have a TR/TD ratio of 1 and A can run again. - Highest Response Ratio Next (HRRN) - In order to avoid starvation, we can use a ratio of the amount of time a process has spent waiting and the amount of time it needs to run. Since this is a non-preemptive algorithm, whenever a process exits Running, each process i has its ratio P(i) computed, where P(i) = (time waiting + expected run time) / expected run time and the process with the highest ratio wins and goes to Running.
- Feedback Scheduling (FS) - Each process is assigned the highest priority group. Each group has its own scheduling algorithm (Round Robin), when the process that would run at any given moment is the process decided upon by the scheduler of the highest priority group that has runnable processes. Once a process has run one time-quantum, it is demoted to the next priority level. This means that there is also fairness towards process that have yet to run, since they will enter the system with the highest priority.
The algorithm penalizes jobs that have been running longer (means that we keep the overall idea of SJF) without the need to know or guess the time a process might run in advance. - Dynamic Priority Groups (CTSS) - FS presents us with the problem that longer running processes will finish very late and their turnaround time will be horrid. Since we can assume that a process waiting for more time in the lowest priority will likely wait again, we could decide that the lower the priority of the process, it will be given more time quanta. A good example is giving each process in a priority group n (where 0 is the highest priority group) the amount of 2n time-quanta per run. This would mean that instead of taking 100 context switches to finish a 100-time-quanta job, it would only take 7, since the process would run in the following time frames (in time-quanta): 1, 2, 4, 8, 16, 32, 37 (out of 64 given to it). This means that longer running processes are less penalized and that waiting time is smaller.
But how is scheduling really done in Unix? Each process is given a priority and entered into a priority queue. The queue with the smallest number runs first, just like in FS. Kernel-mode processes (daemons and such) get negative priorities (high) while User-mode processes get only positive priorities.
The Low-Level Scheduler maintains these queues (while the High-Level Scheduler makes sure that Suspended processes get their time share too by moving them to and from the disk) and recalculates the processes' priorities for the point in time n+1 by using Pn+1 = base priority + nice + Cn where Cn = cpu usagen + Cn-1/2. The fact that the more the process uses the cpu, the lower its priority, means that no process will ever feel starvation.
In Windows, there are 32 priority level, where the higher 16 are for system processes. Each process will consume its allocated time and then call the scheduler to call the next process, instead of the scheduler having to stop the process itself.
One problem we could have with scheduling is called Priority Inversion, in which a high-priority process waits for a lower-priority process to act, yet the lower-priority process doesn't get to run yet. This means that a higher-priority process will not run and this is contrary to the principle of priorities. Windows solves this by giving the blocking process priority 16 (lowest system priority) for two time quanta at a time until the blocking process stops blocking the higher-priority process.
As some of you might know, I'm currently studying for my B.Sc. in Computer Science. One of the courses I've taken this semester is Operating Systems, which talks about the principles OS's are built on. Since this is a theoretical class and the material is actually kinda cool (and the fact that I have a test this Thursday), I thought I'd share (that is - brain-dump) and in the process study for the upcoming test. It would be cool to get questions, too, so let me know if you want me to expand on something I haven't expanded on. Of course, it goes without saying that corrections will be more than welcome.
Quick note: The main OS's we learn about are the Unix-based systems, but most notes are not specific only to those systems.
Processes
- The initial goal of Operating Systems was, since they were used on shared computers back in the day, the ability to run multiple processes 'at once'. This means that when a process left to do some I/O, another process got some time on the CPU, since time on the CPU was expensive. This concept is called Multiprogramming. Today, you can't really imagine an OS without multiple processes.
- A process can be in three main logical states: Ready (process ready for execution), Running (process is currently running) and Blocked (process has called an I/O operation and can not continue until that operation completes). There are four ways a process can move between these states:
- Ready to Running - The scheduler has decided that the process should begin running.
- Running to Blocked - During the execution of the process, it waits for an I/O operation to complete (no matter how long that operation is, since I/O operations are several orders of magnitude longer than simple execution on the CPU).
- Blocked to Ready - The I/O operation has completed and the process wants to continue running.
- Running to Ready - This vector only exists in systems where the principle of Pre-emption exists. This means that the scheduler has decided that the process has had time to run and it will now let another process run. We'll get to preemption later.
- These three logical states can be expanded to include the states New, when the process has been created but is not ready to run (admitting a new process means moving it to Ready), and Exit, when the process no longer needs to run (releasing a process means that the process has moved from Running to Exit).
- Since physical memory is limited, some systems allow a process to be suspended (with two new logical states, Ready-Suspended and Blocked-Suspended), which essentially means the process will be moved to disk. This is usually done when the process is not expected to run for a while (think about a long I/O operation - it would be a waste of memory to keep the process that called it in memory for so long, so it is suspended).
- Processes are managed by the surprisingly named Process Manager, which is in charge of the creation, termination, scheduling, dispatching and switching of processes and also inter-process synchronization and communication (pipes, etc.). It uses the control structure named Process Control Block (or PCB) which contains the process's id, the owner's id, the parent process's id (if we're talking Unix-based systems) and also has two subcategories:
- Process Control Information - This is information for the manager, such as priority, scheduling information, signals (next paragraph), memory granted, open handles, etc.
- Process State Information - This is information used for running the process, such as the contents of its registers and the location of its stacks.
- Important system calls: fork (duplicates current process and continues running from the same point in the code in both processes), wait* (waits for children to exit), execvp (changes the code running in the process to the code of a different executable), getpid (gets current process id), getppid (gets parent process's id).
- If a parent process doesn't wait for its child process to end (using the wait/waitpid/etc. system calls) before exiting, the child process might turn into a Zombie, which means that the process table will still contain them, but they will be an 'empty shell'. This might mean, eventually, that the process table will 'explode' and no new processes could be created.
Zombie processes relinquish their child processes to the OS which is a way to prevent any more zombies (the OS waits for those child processes).
Signals (Unix and derivatives only)
- Signals are notifications to a program about events that occurred (more or less like interrupts), such as Ctrl+C (SIGINT), Segmentation fault (e.g. access to unallocated memory - SIGSEGV), A child process termination (SIGCHLD), etc. All signals are called asynchronously into your process, but suspend current execution until they are handled.
- Some signals have default implementation (SIGINT exits, SIGSEGV does a core-dump) and some are simply ignored (SIGCHLD).
- To override the defaults, you can use the signal system call and either set a function of your own to handle the signal or ignore it. For instance, to ignore Ctrl+C, call signal(SIGINT, SIG_IGN);.
- If you register a new handler for a signal, you need to re-register it in the end of the function, since some old systems revert the signal to the default once the function returns.
- Calling signals is done with the system call kill(pid, signo).
- If you want a quick way to prevent zombies in your process, ignore the SIGCHILD signal. This means that the parent process doesn't get notified about anything that happens to its child and the child does not wait for answers from the parent. This means, however, that if you wait for children, you will keep waiting until none are running and get an error that no children are running.
- Forking a new process means that all of the signals to which we've set a handler are reverted to the default behavior in the new process. Ignored signals keep getting ignored. To revert a signal manually, call signal with SIG_DFL).
Threads
- A thread is the basic unit of CPU utilization, which means that it's not a process that's doing the running - it's its threads. Therefore, every process has at least one thread (the main thread).
- This means that now we have to divide the metadata between a process and a thread:
- A Process has its address space, global variables (that is, non-thread specific), handles (like open files), child processes (when applicable), signals and signal handlers and some accounting information, but who cares about that :)
- A Thread, on the other hand, is an executing unit, and as such has its stack, registers, variables and current execution state. This is a part of the Thread Control Block.
- Some benefits of threads over processes: More lightweight (takes less time to create and destroy them), context switching is faster, they share memory and files without calling the kernel's system calls.
- There are two main types of threads:
- Kernel Threads (Threads in Windows) are threads supplied by the kernel, which means that the kernel controls scheduling (heavier, but this is the only way a process could run on two different processors, for instance).
- Application Threads or User-Level Thread (Fibers in Windows) are threads that the application manages and lack lots of advantages, like the fact that once one of them blocks, they are all blocked. On the other hand, the kernel doesn't manage them, so we're free to do with them as we like. Application Threads usually use Coroutines to manage their time.
- The combined approach is a concept called Lightweight Processes (LWPs), where Kernel Threads, exposed as LWPs, are allocated to processes, which in turn allocate each LWP they have to one or more Application Threads.
Next topic is Scheduling, but that's for tomorrow. My Emacs-Pinky is bothering me again. :)
Since I've started working with CodePlex, Team System failed to even remember my username whenever I started Visual Studio. After digging a bit, I found the answer.
- Start the Stored User Names and Passwords control panel: rundll32 keymgr.dll,KRShowKeyMgr
- Add a new server (for CodePlex, it's tfs01.codeplex.com), then type your username and password.
Now the username will appear in the list.
I'm still looking for a way to save the password too.
After a few hours of working with Office 2007's Outlook, PowerPoint and Word (I have yet to find a reason to open Excel), I have come to a very disturbing realization about the new Ribbon interface.
Let's start from the beginning: The new interface shows a host of buttons for common actions, according to categories and subcategories. These buttons are big and spiffy and are supposed to show you common actions, according to studies conducted by Microsoft. As a GUI aficionado, I was excited about this leap in usability and was all too eager to try it out. (On a side note - I was also sure the GUI would be adaptive, but I guess that was simply wishful thinking)
Before I start giving some reasons for and against the new interface, I want to point out the most glaring error in judgement the new UI presents. The Ribbon is not a replacement to menus, it's a replacement to the toolbar. You only get buttons that are bigger, nicer-looking and, glaringly, less. However, the Ribbon has replaced the menus too. What you're left with is less functionality.
Admittedly, you can access the functionality that has 'disappeared', but for the inexperienced user, it's just not there. This will cause the 90/10 principle (90% of the users use 10% of the code) to turn into around 99.9/10, since it's so unintuitive to try and get to the other 90% of the application that most users will simply not bother.
A few more points I wanted to make:
- Con: Categories? The Ribbon's categories offer zero improvement over menus. The Edit menu has now simply been renamed to the Home category, the File menu has left the application entirely and moved to the context menu (double clicking the context menu still closes the application -- phew), etc..
- Con: Where Is It?! I still can't find out how to change the color of the borders for an entire table in Powerpoint without resorting to manually using the Draw Table tool. It took me an age and a half (about 10 seconds) to find the Properties dialog (it's under the Context Menu->Prepare(?!)->Properties). It's things like these that make you miss menus most.
- Con: No Turning Back? There's no way to return to the menu-driven GUI, that has been around since Windows 1.0. This means that you have two choices: Either get with the program or revert.
- Pro/Con: Outlook Left Behind? It seems Outlook's main window has been left untouched by the Ribbon and seems to be as it ever was - functionality-wise. I'm happy about this, since I find the classic GUI familiar.
- Pro: Fast! What I love most about the new UI is that it takes me less time to do the most common things, because things are categorized into smaller groups of buttons.
This is not the way to introduce breaking changes to an interface - this type of change is best introduced gradually and without forcing the user. It would have been best to use the Ribbon as a default GUI, but to allow a way to return to the 'classic' mode for users that don't like the new paradigm. Eventually, these users will start warming up to the new GUI, allowing the next version of Office to retire the menus altogether.
It seems that for me, since Office XP was introduced, there has been little to no reason to upgrade.
It is said that a framework is not a framework until a developer uses it.
It seems my CodeDom Patterns library now has its first users, Justin Dunlap and Marc Clifton, as featured in their article Object Mapping Part II - Schema Code Generator (part 1) over at The Code Project. Go check it out.
I just popped in from my unannounced hiatus to share a link to ProgrammingBooks.org which is a list of the best development books as ranked by programmers. Know a good book? Share it.
More Posts