# CPP09

Last project of 42 school's C++ Piscine

## How to do the Ford-Johnson merge insertion sort

I have suffered a lot with this project, but I like to think my understanding became quite solid by the end. I decided to make this short tutorial so that people who are new to it have some guidance on how to implement the merge insertion sort correctly.

DISCLAIMER: if you are from 42 & currently working on this project - do not take my code. The implementation itself is quite shitty anyway (I am mostly writing C in C++ and my coding style can be all over the place). But I am pretty sure with this tutorial you will be able to implement the algorithm on your own. You will not find parts telling you how to write your code or what containers/functions to use, but I do try to cover the theory and logic behind the exercise as much as I can (based on my current knowledge).

I also recommend this tutorial to students who are already done with the project, but either feel they have not done the correct implementation and are not sure where it went wrong, or have done it correctly but found some parts of the process unclear. (For example, the purpose of the Jacobsthal sequence is not explained well in most resources, but it is very important for the sorting to work well, and it is also super interesting.) If you belong to this group, you can use my table of contents to choose the parts you are interested in.

[UPDATE] (14.12.2024) Ismayil has contributed a really cool tester, which you can find now in the resources folder. I have put the instructions for use under a new section in this document.

## Small table of contents

## What you will need

Firstly, if you have some colored/numbered cards around (like Uno or Ligretto), they will be super useful for visualizing the process of sorting (separating and insertion). During/after reading this tutorial, but before starting to code, I recommend playing around with them to better visualize the process (especially because recursion can be hard to imagine). They are super helpful for debugging too. In case you do not have access to any cards though, I have documented some of the concepts I'm talking about - you can find this in the Pictures section.

You will definitely need the book - by now you should be able to find the link to this in the subject. Some people have told me there are periods when the link for the book is not accessible, therefore I have uploaded an excerpt of the parts that we need to this repo (you can find it in the resources folder). I hope I will get no letters to take it down in the near future. If you can access the whole book and you have some time, it is in general a very interesting read.

It is very important to use this book to understand the logic of the sort, and not other sources or articles from the internet. They might seem similar, but they are not exactly right, and this can (and usually will) result in your algorithm doing many more comparisons than necessary in the end.

## Why does the number of comparisons matter?

This is one question that gets asked very frequently during evaluations. When sorting a container of int, merge insertion might not be the fastest way to do it. This method is optimized to sort with the least amount of comparisons, and you will see that this oftentimes results in a lot of calculations to avoid even a single extra comparison between two ints - which can slow things down considerably. So why would any sort be optimized this way?

The point of optimization is always to save resources. Let's say we are not sorting a container of int, but some arbitrary data type T, from which any two elements are very expensive to compare. For example, such T can be a one-gigabyte string (strcmp iterating through 1,073,741,824 characters, anyone?), or a huge object/struct loaded with all kinds of data. If comparisons are expensive and eating up resources (like time), you want to make as few of them as possible. It indeed does not make much sense to choose merge insertion sort when T is a simple int, but the point of the exercise is not creating a practical use case, but understanding the algorithm.

Another thing that follows from this is that when we say we try to reduce the amount of comparisons in our code, this only means comparisons between any two T. Comparing the indexes of two elements is fine. Comparing the position of two iterators is fine. Comparing anything but T is fine, but we want to resort to comparing as few T with each other as possible.

## The main idea: merge insertion in 3 steps

Take the book and search for "merge insertion" (you will find the one you need on page 184). On page 185, you will find the main description of the algorithm in 3 steps.

<br> <div align="center"> <img src="/docs/imgs/steps.png" width="600">

Merge insertion in 3 steps

</div> <br>

### What does this mean?

Let's say we have a container of 9 numbers:<br> 2, 3, 8, 11, 3, 1, 99, 8, 0

Step 1 is simple: take all of the numbers you have to sort, group them into pairs (two neighbors form one pair), and for each pair decide which member is the bigger one (if they are equal, pick either).

This means we now have the pairs:<br> (2,3) (8,11) (3,1) (99, 8) (0)

Step 2 says: from all of these pairs, you need to sort the bigger elements only - therefore you need to separate the big and small members of the pairs first. How you approach this depends on your individual implementation. Some people assign them to std::pairs. You can also have another container and put all the "small" elements there, which is what I will imitate now because it looks much more readable.

So we have the big elements:<br> 3, 11, 3, 99<br> And the small ones:<br> 2, 8, 1, 8, 0

Now comes the tricky part. The book says: the way you have to sort the larger elements is by merge insertion. Therefore, whatever implementation you make will likely be recursive (check my note below). Since merge insertion itself is steps 1-3, we now take the larger elements and apply the merge insertion sort to them.

Let's grab only the big numbers and go back to step 1. Make pairs. Separate. Then go back to step 1 again (and again, and again, if needed).

When will the recursion stop? When you only have 1 pair left (in the above example this will be (11, 99)), you separate that and end up with only 1 element in your "bigger" container. That 1 element is inherently sorted. Since the "larger numbers", at least for this depth of the recursion, are sorted, you can now proceed to step 3.

NOTE: the recursion here is not so complex that it could not be transformed into an iterative form. This was always on my mind while writing my own solution, which, to be fully honest, while technically recursive, is very close to iterative in spirit. This means recursion is not a strict requirement for the algorithm to work, but I do feel a recursive implementation is closer to the spirit of the original description.

When you arrive at step 3 (at any applicable depth of the recursion), you have 2 containers: one contains the bigger members of the pairs and is sorted, the other contains the smaller members of the pairs and is unsorted. You then have to insert the latter into the former with binary insertion. Look at the last line in the picture, and check the indexes of the elements to insert (labeled b).<br> You can see a pattern like 3->2->5->4->11->10->(9->8->7->6->21->20...)

This means: you choose the element with the 3rd index from the sorted container, select its pair from the unsorted one, and insert it into the former. Then do the same with the element with the 2nd index, the 5th, and so on. (The first index - the 0th in code - is not listed here because, as you will see later, its pair can be inserted with 0 comparisons, so it can technically be counted as an element of the sorted container.) You can see that you always jump up to a "boundary" number (5, 11, 21, etc.), then count backwards until you reach an index you have already covered.

These "boundary" numbers follow a pattern, which is called the Jacobsthal sequence. Looking at fig. 13 in the book, you can see that any element of this sequence can be calculated with the formula

<div align="center">

(2<sup>k+1</sup> + (-1)<sup>k</sup>) / 3

</div>

where k is the index of the element. So the first element will be (2<sup>1+1</sup> + (-1)<sup>1</sup>) / 3 = 3 / 3 = 1. The fifth element is (2<sup>5+1</sup> + (-1)<sup>5</sup>) / 3 = 63 / 3 = 21, and so on.

If you already know the Jacobsthal number J<sub>k</sub> with index k (for example, you know that index 3 gives 5), the next Jacobsthal number J<sub>k+1</sub> can also be calculated with a simpler formula:

<div align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="/docs/imgs/next_jacob.png" width="175"> <source media="(prefers-color-scheme: light)" srcset="/docs/imgs/next_jacob_white.png" width="175"> <img alt="Fallback image" src="/docs/imgs/next_jacob_white.png"> </picture> </div>

## Is this even Jacobsthal, or are we all scammed?

Some of you who have done some research might have noticed that on the [Wikipedia page](https://en.wikipedia.org/wiki/Ja
