XZ Software Optimization with Autovectorization

skhan4059
Apr 22, 2022

This is part 2 of my xz software optimization series. The previous post is here:

https://skhan4059.wixsite.com/spoblogging/post/testing-auto-vectorization-for-upcoming-software

In my last post I picked a piece of software that I thought could benefit from the upcoming autovectorization support in gcc for Armv9. I chose xz because it handles a lot of data and could potentially benefit from the autovectorizer. I started by downloading it with "wget https://tukaani.org/xz/xz-5.2.5.tar.gz" and extracting it with "tar -zxf xz-5.2.5.tar.gz". After reading through the README and INSTALL files, I ran the configure script to generate the Makefiles and then built the software with "make -j 24". The -j 24 option lets make run up to 24 jobs in parallel, which makes the build faster. I wanted to find the executable itself to make sure the build was functional, so I ran "find . -type f -executable" from the top-level directory to list every executable file.
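The steps above can be collected into one small script. This is only a sketch: the `run` helper and the `DRY_RUN` variable are my own additions for illustration and are not part of xz or its build system. By default the script just prints each command, so it is safe to read through before running anything for real.

```shell
#!/bin/sh
# Sketch of the download-and-build steps described above.
# `run` and DRY_RUN are illustrative helpers (assumptions, not part of xz).
# With DRY_RUN=1 (the default here) each command is only printed;
# set DRY_RUN=0 to actually execute the commands.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@" || exit 1
    fi
}

run wget https://tukaani.org/xz/xz-5.2.5.tar.gz   # download the release tarball
run tar -zxf xz-5.2.5.tar.gz                      # extract it
run cd xz-5.2.5                                   # enter the source tree
run ./configure                                   # generate the Makefiles
run make -j 24                                    # build with up to 24 parallel jobs
run find . -type f -executable                    # list every executable file
```

The 24 in `make -j 24` matches the machine I was building on. On a machine with a different core count, something like `make -j "$(nproc)"` is a common alternative.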

The search turns up src/xz/xz, which looks like the main program, so I ran it and it completed without any errors. Normally I would run a test suite at this point. I could not find one mentioned in any README file, but I did find an executable in the ./tests directory called test_scripts.sh. This seemed like what I was looking for, so I ran it. It finished successfully and generated some files in my directory.
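Besides the bundled test script, a quick round trip is an easy extra sanity check. The sketch below is my own addition, not something from the xz sources, and the function name `roundtrip_check` is made up for this post. It compresses a small file with whichever xz binary you pass in, decompresses it again, and compares the result with the original.

```shell
# roundtrip_check XZ_BINARY
# Hypothetical helper: compresses a small temporary file with the given xz
# binary, decompresses it, and compares the result with the original.
# Returns 0 if the round trip reproduces the input exactly.
roundtrip_check() {
    xz_bin=$1
    tmp=$(mktemp -d) || return 1
    printf 'hello xz %s\n' 1 2 3 4 5 > "$tmp/input.txt"
    # -c writes to stdout (input file is kept); -dc decompresses to stdout
    "$xz_bin" -c "$tmp/input.txt" > "$tmp/input.txt.xz" &&
        "$xz_bin" -dc "$tmp/input.txt.xz" > "$tmp/output.txt" &&
        cmp -s "$tmp/input.txt" "$tmp/output.txt"
    status=$?
    rm -rf "$tmp"
    return $status
}

# Example, from the top of the build tree:
#   roundtrip_check src/xz/xz && echo "round trip OK"
```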

Now that everything is tested and seems stable, I will rebuild with the auto-vectorizer enabled and see if it breaks anything. First we look at the Makefile itself to see which flags are set.
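Instead of scrolling through the generated Makefile, the flag assignments can be pulled out with grep. This is a small sketch, and `show_build_flags` is a name I made up for it. For autoconf-based projects built with gcc, the default CFLAGS is typically `-g -O2`.

```shell
# show_build_flags [MAKEFILE]
# Hypothetical helper: prints the compiler and flag assignments from a
# generated Makefile (defaults to ./Makefile).
show_build_flags() {
    grep -E '^(CC|CFLAGS|CPPFLAGS|LDFLAGS)[[:space:]]*=' "${1:-Makefile}"
}

# Example, from the top of the xz source tree after running ./configure:
#   show_build_flags Makefile
```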

These look like the default options, so all we need to do to add SVE2 to the build is pass new flags to the configure script. We use './configure CFLAGS="-g -O3 -march=armv8-a+sve2"' to regenerate all the Makefiles with these CFLAGS. The configure script ran without anything looking wrong, so we move on to checking whether the build works. We run the aforementioned src/xz/xz and see that it doesn't work. This would normally be a problem, but since we built with CFLAGS that target a different architecture, it makes sense. To run software built for that aarch64 target we use the qemu-aarch64 emulator. Again it doesn't work. This is because src/xz/xz is not the actual program but a wrapper script. To run the software the way we want, I edited the script, found where the real xz program is called, and changed that command to go through the qemu-aarch64 emulator.
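A quick way to tell a wrapper script from a real binary is to look at the first four bytes of the file, since every ELF executable starts with 0x7f followed by "ELF". The sketch below does that. The `is_elf` name is mine, and the src/xz/.libs/xz path is an assumption about where libtool places the real binary, so check your own build tree.

```shell
# is_elf FILE
# Hypothetical helper: succeeds if FILE starts with the ELF magic bytes
# 0x7f 'E' 'L' 'F'.
is_elf() {
    [ "$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')" = "7f454c46" ]
}

# The .libs path below is an assumption about the libtool build layout.
for f in src/xz/xz src/xz/.libs/xz; do
    [ -f "$f" ] || continue          # skip paths that do not exist here
    if is_elf "$f"; then
        echo "$f: ELF binary"
    else
        echo "$f: not ELF (probably a wrapper script)"
    fi
done
```

This only identifies which file is which. The wrapper also sets up the environment so the freshly built liblzma is the one that gets loaded, so running the real binary directly may pick up a different copy of the library. That is a good reason to keep going through the edited wrapper.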

We then run the test scripts again and everything works, albeit a bit slower than before. Next we check how many whilelo instructions ended up in the built software to see whether the optimization worked.
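Counting an instruction comes down to disassembling the binary and counting the lines that mention it. Here is a minimal sketch. The helper name `count_mnemonic` is made up, and the .libs paths in the usage comment are assumptions about the libtool build layout.

```shell
# count_mnemonic MNEMONIC
# Hypothetical helper: reads disassembly on stdin and prints how many lines
# contain MNEMONIC as a whole word.
count_mnemonic() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status
    grep -cw -- "$1" || true
}

# Example (paths are assumptions about where libtool put the binaries):
#   objdump -d src/xz/.libs/xz | count_mnemonic whilelo
#   objdump -d src/liblzma/.libs/liblzma.so | count_mnemonic whilelo
```

whilelo is a useful marker because it is an SVE predicate instruction that typically shows up in loops the compiler has vectorized for SVE.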

Most of the whilelo instructions ended up in the lzma library, which makes sense since that is where most of the heavy lifting is done. There were 47 whilelo instructions in total. When we checked how many loops had been vectorized, we found 24 vectorized loops and 997 missed vectorizations.
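One way to get counts like these is to have gcc report its vectorization decisions with -fopt-info-vec-all and then tally the messages. The sketch below assumes the build output was saved to a log file. The `vec_summary` name is made up, and the exact numbers depend on which messages you choose to count, so treat it as one possible approach.

```shell
# vec_summary
# Hypothetical helper: reads gcc -fopt-info-vec-all output on stdin and prints
# how many loops were reported as vectorized and how many as missed.
vec_summary() {
    awk '
        /optimized: loop vectorized/      { vectorized++ }
        # "." stands in for the apostrophe in "couldn'"'"'t"
        /missed: couldn.t vectorize loop/ { missed++ }
        END { printf "vectorized=%d missed=%d\n", vectorized, missed }
    '
}

# Example (gcc writes these reports to stderr, so capture both streams):
#   ./configure CFLAGS="-g -O3 -march=armv8-a+sve2 -fopt-info-vec-all"
#   make -j 24 2>&1 | tee build.log
#   vec_summary < build.log
```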

In the next post I will look at what was changed to see how well optimized the software is, and run some tests.

Shayaan Khan

SPO600 Learning Blog
