I developed an algorithm for a fairly hard problem in mathematics which is likely to need several months to finish. As I only have limited resources, I started it on my Ubuntu 12.04 (x86) laptop. Now I want to install some updates and actually restart the laptop (the “please reboot” message is just annoying).
Is there a way to save an entire process including its allocated memory for continuation beyond a reboot?
Here is some information about the process you might need. Please feel free to ask for further information if needed.
- I called the process in a terminal with the command “./binary > ./somefile &” or “time ./binary > ./somefile &”, I cannot really remember.
- It’s printing some debug information to std::cerr (not very often).
- It’s currently using roughly 600.0 kiB and even though this will increase, it’s unlikely to increase rapidly.
- the process runs with normal priority
- the kernel is 3.2.0-26-generic-pae, the cpu is an AMD, the operating system is Ubuntu 12.04 x86.
- it has been running for 9 days and 14 hours (so too long to cancel it 😉 )
Answers:
Method 1
The best/simplest solution is to change your program to save its state to a file and reuse that file to restore the process.
Based on the Wikipedia page about application snapshots, there are multiple alternatives:
- There is also cryopid but it seems to be unmaintained.
- Linux checkpoint/restart seems to be a good choice but your kernel needs to have CONFIG_CHECKPOINT_RESTORE enabled.
- criu is probably the most up-to-date project and probably your best shot, but it also depends on some specific kernel options which your distribution probably hasn’t set.
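To find out whether a kernel option such as CONFIG_CHECKPOINT_RESTORE is set, you can look at the config file of the running kernel. Below is a minimal sketch, assuming the config lives at /boot/config-<release> (true on Ubuntu, not guaranteed elsewhere). The function names `option_enabled` and `running_kernel_has` are made up for this example.

```cpp
#include <cassert>
#include <fstream>
#include <sstream>
#include <string>
#include <sys/utsname.h>

// Returns true if the kernel config text contains the exact line "<option>=y".
// A disabled option usually appears as "# <option> is not set" instead.
bool option_enabled(const std::string& config_text, const std::string& option) {
    assert(!option.empty());
    const std::string wanted = option + "=y";
    std::istringstream lines(config_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line == wanted) return true;
    }
    return false;
}

// Checks the config of the running kernel.
// Assumption: the config is stored as /boot/config-<release>.
bool running_kernel_has(const std::string& option) {
    utsname info{};
    if (uname(&info) != 0) return false;
    std::ifstream file(std::string("/boot/config-") + info.release);
    if (!file.is_open()) return false;
    std::ostringstream text;
    text << file.rdbuf();  // read the whole file into memory
    return option_enabled(text.str(), option);
}
```

Calling `running_kernel_has("CONFIG_CHECKPOINT_RESTORE")` returns false both when the option is not set and when the config file cannot be found, so a false result only means "not confirmed".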
It is already too late for this, but another, more hands-on approach is to start your process in a dedicated VM and just suspend and restore the whole virtual machine. Depending on your hypervisor you can also move the machine between different hosts.
For the future, think about where you run your long-running processes, how to parallelize them and how to handle problems, e.g. full disks, the process getting killed etc.
Method 2
A fairly “cheap” way to do this would be to do the processing in a VM (e.g., with VirtualBox). Before you shut down, suspend the VM and save its state. After booting, restore the VM and its state.
This does have the disadvantage of requiring you to kill and restart the job. But if it’s actually going to be running for several months, then a nine-day difference becomes trivial (about a 5% increase over 6 months).
Edit: I just realized that Ulrich already mentioned this in unnumbered item 4 on his list.
I would still encourage you to consider this as an option, especially since none of the alternatives seem like a robust solution. Each has a reason why it may not work.
I suppose the best thing to do would be to try one of those and if it doesn’t work restart the job in a VM.
Method 3
Take a peek at the tool CryoPID.
From the home page:
“CryoPID allows you to capture the state of a running process in Linux and save it to a file. This file can then be used to resume the process later on, either after a reboot or even on another machine.”
Method 4
If you end up needing to restart your program, I would encourage you to spend some time adding some features to your code that might save you time in the future.
If the process is going to be run for a long time, being able to save the entire process state when you restart the machine is perhaps not hugely helpful if your process crashes while it is running.
I would encourage you to have your program write “checkpoint” data to a file. This data should be sufficient for your program to resume from the state it was in when the checkpoint file was saved. You need not save the entire process, just a snapshot of the relevant variables being used in your calculation, sufficient for your calculation to resume where it left off. Your code would also need to include some way of reading in the data from this file to obtain its starting state.
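A minimal sketch of what such checkpointing could look like in C++. The `State` struct and its two fields are purely hypothetical placeholders, since I don't know what your algorithm needs to remember, and the plain-text file format is just one possible choice.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>

// Hypothetical state of the calculation; replace the fields with whatever
// your algorithm actually needs in order to resume.
struct State {
    std::uint64_t iteration = 0;
    std::uint64_t best_candidate = 0;
};

// Writes the state to a temporary file and renames it afterwards. On POSIX
// systems rename() replaces the target atomically, so a crash while writing
// never destroys the previous checkpoint.
bool save_checkpoint(const State& s, const std::string& path) {
    assert(!path.empty());
    const std::string tmp = path + ".tmp";
    {
        std::ofstream out(tmp, std::ios::trunc);
        if (!out) return false;
        out << s.iteration << ' ' << s.best_candidate << '\n';
        out.flush();
        if (!out) return false;
    }  // the file is closed here, before the rename
    return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Fills `s` from the checkpoint file. Returns false (and leaves `s`
// untouched) if the file is missing or cannot be parsed.
bool load_checkpoint(const std::string& path, State& s) {
    std::ifstream in(path);
    State loaded;
    if (!(in >> loaded.iteration >> loaded.best_candidate)) return false;
    s = loaded;
    return true;
}
```

At startup the program would call `load_checkpoint` and begin from scratch if it returns false. During the run it would call `save_checkpoint` every so often, for example every few thousand iterations or every few minutes.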
You could set up your code so when you send it a signal, it saves one of these checkpoint files, so you can save the “state” of your calculation at any point.
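The signal idea could be sketched roughly like this. The choice of SIGUSR1 and the names `run` and `save` are my assumptions, not something from your program. The handler only sets a flag, and the actual writing happens in the main loop, because very little is safe to do inside a signal handler.

```cpp
#include <cassert>
#include <csignal>
#include <cstdint>
#include <functional>

// Set by the signal handler, polled by the main loop.
volatile std::sig_atomic_t checkpoint_requested = 0;

// Keep the handler trivial: only record that a checkpoint was requested.
extern "C" void on_checkpoint_signal(int) { checkpoint_requested = 1; }

void install_checkpoint_handler() {
    std::signal(SIGUSR1, on_checkpoint_signal);
}

// Hypothetical stand-in for the real computation. `save` is called with the
// current iteration whenever a checkpoint has been requested.
std::uint64_t run(std::uint64_t steps,
                  const std::function<void(std::uint64_t)>& save) {
    assert(static_cast<bool>(save));
    std::uint64_t iteration = 0;
    while (iteration < steps) {
        ++iteration;  // one unit of real work would go here
        if (checkpoint_requested) {
            checkpoint_requested = 0;
            save(iteration);  // write the checkpoint outside the handler
        }
    }
    return iteration;
}
```

With something like this in place, `kill -USR1 <pid>` from another terminal would make the program write a checkpoint at its next loop iteration without stopping it.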
Additionally, being able to see how the data changes as the calculation progresses might be interesting in itself!
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.