The File is a collection of records. Using the primary key, we can access the records. The type and frequency of access can be determined by the type of file organization which was used for a given set of records.
File organization is a logical relationship among various records. This method defines how file records are mapped onto disk blocks.
File organization is used to describe the way in which the records are stored in terms of blocks, and the blocks are placed on the storage medium.
Objective of File Organization
- It contains an optimal selection of records, i.e., records can be selected as fast as possible.
- To perform insert, delete or update transaction on the records should be quick and easy.
- The duplicate records cannot be induced as a result of insert, update or delete.
- For the minimal cost of storage, records should be stored efficiently.
Types of file organization
File organization contains various methods. These particular methods have pros and cons on the basis of access or selection. In the file organization, the programmer decides the best-suited file organization method according to his requirement.
Types of file organization are as follows :
- Sequential File Organization
- Heap File Organization
- Hash File Organization
- B+ Tree File Organization
- Indexed Sequential Access Method (ISAM)
- Cluster File Organization
Sequential File Organization – This method is the easiest method for file organization. In this method, files are stored sequentially. This method can be implemented in two ways :
- Pile File MethodIt is a quite simple method. In this method, we store the record in a sequence, i.e., one after another. Here, the record will be inserted in the order in which they are inserted into tables.In case of updating or deleting of any record, the record will be searched in the memory blocks. When it is found, then it will be marked for deleting, and the new record is inserted.Starting of the File→R1R3———-R8R5←End of the FileSuppose we want to insert a new record R2 in the sequence given above, then it will be placed at the end of the file.Starting of the File→R1R3———-R8R5←R2Starting of the File→R1R3———-R8R5R2←End of the File
- Sorted File MethodIn this method, the new record is always inserted at the file’s end, and then it will sort the sequence in ascending or descending order. Sorting of records is based on any primary key or any other key.In the case of modification of any record, it will update the record and then sort the file, and lastly, the updated record is placed in the right place.Starting of the File→R1R3———-R8R9←End of the FileSuppose a new record R2 has to be inserted in the sequence given above, then it will be inserted at the end of the file, and then it will sort the sequence.Starting of the File→R1R3———-R8R9←R2Starting of the File→R1R2R3———-R8R9←End of the File
Advantages of Sequential File Organization
- It contains a fast and efficient method for the huge amount of data.
- In this method, files can be easily stored in cheaper storage mechanism like magnetic tapes.
- It is simple in design. It requires no much effort to store the data.
- This method is used when most of the records have to be accessed like grade calculation of a student, generating the salary slip, etc.
- This method is used for report generation or statistical calculations.
Disadvantages of Sequential File Organization
- It will waste time as we cannot jump on a particular record that is required but we have to move sequentially which takes our time.
- Sorted file method takes more time and space for sorting the records.
Heap File Organization – Heap File Organization works with data blocks. In this method records are inserted at the end of the file, into the data blocks. No Sorting or Ordering is required in this method. If a data block is full, the new record is stored in some other block, Here the other data block need not be the very next data block, but it can be any block in the memory. It is the responsibility of DBMS to store and manage the new records.
Suppose we have five records R1, R4, R5, R3 and R8 in a heap and suppose we want to insert a new record R2 in a heap. If the last data block that is data block 3 is full then it will be inserted in any of the data block selected by the DBMS, let’s say data block 1.
If we want to search, update or delete the data in heap file organization, then we need to traverse the data from staring of the file till we get the requested record.
If the database is very large then searching, updating or deleting of record will be time-consuming because there is no sorting or ordering of records. In the heap file organization, we need to check all the data until we get the requested record.
Advantages of Heap File Organization
- Fetching and retrieving records is faster than sequential record but only in case of small databases.
- When there is a huge number of data needs to be loaded into the database at a time, then this method of file Organization is best suited.
Disadvantages of Heap File Organization
- This method is inefficient for the large database because it takes time to search or modify the record.
Hash File Organization – Hash File Organization uses the computation of hash function on some fields of the records. The hash function’s output determines the location of disk block where the records are to be placed.
For example, let us consider the following table Student;
Student_id | Name | Age | Phone |
---|---|---|---|
1001 | Amit Das | 40 | 9748516231 |
1005 | Sohini Das | 32 | 9073919231 |
1008 | Debabrata Panchadhyay | 29 | 8420392064 |
A hash function is a function which maps the large set of values into smaller set of files/locations/values. Let us organize the above table using the phone attribute value as input for the hash function.
Consider that the Hash Function for the Student table is → h(s_id mod 5)
In the above hash function, s_id is the Student_id attribute’s value of each record. 5 is the number of buckets/pages where we want to store our table. [5 buckets means bucket0, bucket1, . . . , bucket4].
For our example,
For 1st record, h(1001 mod 5) = 1 that is, the first record has to be stored in 1st bucket.
For 2nd record, h(1005 mod 5) = 0 that is, the second record has to be stored in 0th bucket.
For 3rd record, h(1008 mod 5) = 3 that is, the third record has to be stored in 3rd bucket.
Important points for consideration
- If bucket(s) is/are full, then overflow buckets can be used to store more records.
- Hash function has to be chosen with extra care to avoid uneven distribution. That is, a bad hash function may assign more records to few buckets and less to others.
- The attribute(s) that is frequently used for data manipulation can be chosen as the input for the hash function.
- Same hash function that was used to store the records has to be used for deletion, modification or selection of records.
Insert Record – When a new record has to be inserted, then the address is generated using the hash key and record is directly inserted.
Let us assume that the following query is executed.
INSERT INTO Student VALUES(1007, ‘Payel Chatterjee’, 28, 9073456582);
We use the same hash function to insert the record.
h(1007 mod 5) = 2.
Searching Record – When a record needs to be searched, The same hash function is used to retrieve the bucket address for the record.
Let us assume that the following query is executed.
SELECT * FROM Student WHERE Student_id = 1008;
For searching the record, we have to use the same hash function that we used for storing the records. Hence, h(1008 mod 5) = 3. And the result points to the 3rd bucket. It actually gives us the quick access to the required record.
Advantages of Hash File Organization
- Quick access to records in terms of selection. [If queried on the attribute that was used for hashing]
- Easy to insert, delete, or update a record.
Disadvantages of Hash File Organization
- Records are randomly stored in scattered locations. May waste a lot of space in case of small files.
- For queries that involve ranges, hash file organization is not efficient. [Example – SELECT * FROM Employee WHERE Salary BETWEEN 10000 AND 25000;]
- If querying attribute is not the hashed attribute, you may need to scan the entire table for retrieval.
- Frequent update to the hashed column results in movement of data between buckets which actually affects the system performance.
B+ Tree File Organization – B+ tree file organization is the advanced method of an indexed sequential access method. It uses a tree-like structure to store records in File. It uses the same concept of key-index where the primary key is used to sort the records. For each primary key, the value of the index is generated and mapped with the record.
The B+ tree is similar to a binary search tree (BST), but it can have more than two children. In this method, all the records are stored only at the leaf node. Intermediate nodes act as a pointer to the leaf nodes. They do not contain any records.
Advantages of Hash File Organization
- In this method, searching becomes very easy as all the records are stored only in the leaf nodes and sorted the sequential linked list.
- Traversing through the tree structure is easier and faster.
- The size of the B+ tree has no restrictions, so the number of records can increase or decrease and the B+ tree structure can also grow or shrink.
Disadvantages of Hash File Organization
- This method is inefficient for the static tables
Indexed Sequential Access Method (ISAM) – ISAM method is an advanced sequential file organization. In this method, records are stored in the file using the primary key. An index value is generated for each primary key and mapped with the record. This index contains the address of the record in the file.
If any record has to be retrieved based on its index value, then the address of the data block is fetched and the record is retrieved from the memory.
Advantages of Indexed Sequential Access Method
- In this method, each record has the address of its data block, searching a record in a huge database is quick and easy.
- This method supports range retrieval and partial retrieval of records. Since the index is based on the primary key values, we can retrieve the data for the given range of value. In the same way, the partial value can also be easily searched, i.e., the student name starting with ‘JA’ can be easily searched.
Disadvantages of Indexed Sequential Access Method
- This method requires extra space in the disk to store the index value.
- When the new records are inserted, then these files have to be reconstructed to maintain the sequence.
- When the record is deleted, then the space used by it needs to be released. Otherwise, the performance of the database will slow down.
Cluster File Organization – When the two or more records are stored in the same file, it is known as clusters. These files will have two or more tables in the same data block, and key attributes which are used to map these tables together are stored only once. This method reduces the cost of searching for various records in different files.
The cluster file organization is used when there is a frequent need for joining the tables with the same condition. These joins will give only a few records from both tables. In the given example, we are retrieving the record for only particular departments. This method can’t be used to retrieve the record for the entire department.
Table Name : employee
emp_id | emp_name | emp_age | dept_id |
---|---|---|---|
1001 | Amit Das | 40 | 102 |
1005 | Sohini Das | 32 | 105 |
1008 | Debabrata Panchadhyay | 29 | 102 |
1010 | Manirash Das | 28 | 103 |
1012 | Payel Chatterjee | 28 | 105 |
Table Name : department
dept_id | dept_name |
---|---|
102 | JAVA |
103 | HTML & CSS |
105 | PHP |
Cluster Key : dept_id
dept_id | dept_name | emp_id | emp_name | emp_age |
---|---|---|---|---|
102 | JAVA | 1001 | Amit Das | 40 |
1008 | Debabrata Panchadhyay | 29 | ||
103 | HTML & CSS | 1010 | Manirash Das | 28 |
105 | PHP | 1005 | Sohini Das | 32 |
1012 | Payel Chatterjee | 28 |
In this method, we can directly insert, update or delete any record. Data is sorted based on the key with which searching is done. Cluster key is a type of key with which joining of the table is performed.
Types of Cluster File Organization
- Indexed Clusters – In indexed cluster, records are grouped based on the cluster key and stored together. The above EMPLOYEE and DEPARTMENT relationship is an example of an indexed cluster. Here, all the records are grouped based on the cluster key – dept_id and all the records are grouped.
- Hash Clusters – It is similar to the indexed cluster. In hash cluster, instead of storing the records based on the cluster key, we generate the value of the hash key for the cluster key and store the records with the same hash key value.
Advantages of Cluster File Organization
- The cluster file organization is used when there is a frequent request for joining the tables with same joining condition.
- It provides the efficient result when there is a 1:M mapping between the tables.
- The size of the B+ tree has no restrictions, so the number of records can increase or decrease and the B+ tree structure can also grow or shrink.
Disadvantages of Cluster File Organization
- This method has the low performance for the very large database.
- If there is any change in joining condition, then this method cannot use. If we change the condition of joining then traversing the file takes a lot of time.
- This method is not suitable for a table with a 1:1 condition.