I want to decode URL encoding. Is there a built-in tool for doing this, or could anyone provide me with a sed command that will do it?
I searched a bit through unix.stackexchange.com and on the internet, but I couldn’t find any command line tool for decoding URL encoding.
What I want to do is simply edit a text file in place so that:
%21 becomes !
%23 becomes #
%24 becomes $
%26 becomes &
%27 becomes '
%28 becomes (
%29 becomes )
And so on.
Answers:
Method 1
Found these Python one liners that do what you want:
Python2
$ alias urldecode='python -c "import sys, urllib as ul; print ul.unquote_plus(sys.argv[1])"'
$ alias urlencode='python -c "import sys, urllib as ul; print ul.quote_plus(sys.argv[1])"'
Python3
$ alias urldecode='python3 -c "import sys, urllib.parse as ul; print(ul.unquote_plus(sys.argv[1]))"'
$ alias urlencode='python3 -c "import sys, urllib.parse as ul; print(ul.quote_plus(sys.argv[1]))"'
Example
$ urldecode 'q+werty%3D%2F%3B'
q werty=/;
$ urlencode 'q werty=/;'
q+werty%3D%2F%3B
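The aliases above rely on unquote_plus, which also turns + into a space. That suits form-encoded query strings, but a literal + in a URL path would be changed too. A minimal Python 3 sketch of the difference, using only urllib.parse from the standard library (the sample string is the one from the example above):

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

encoded = "q+werty%3D%2F%3B"

# unquote only decodes %XX sequences; a literal "+" is left alone.
plain = unquote(encoded)
# unquote_plus additionally turns "+" into a space (form encoding).
form = unquote_plus(encoded)

print(plain)  # q+werty=/;
print(form)   # q werty=/;

# The encoders mirror this: quote_plus writes a space as "+",
# quote writes it as "%20" and leaves "/" alone by default.
print(quote_plus("q werty=/;"))  # q+werty%3D%2F%3B
print(quote("q werty=/;"))       # q%20werty%3D/%3B
```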
Method 2
sed
Try the following command line:
$ sed 's@+@ @g;s@%@\\x@g' file | xargs -0 printf "%b"
or the following alternative using echo -e:
$ sed -e's/%\([0-9A-F][0-9A-F]\)/\\\\\x\1/g' file | xargs echo -e
Note: The above syntax may not convert + to spaces, and can eat all the newlines.
You may define it as an alias and add it to your shell rc files:
$ alias urldecode='sed "s@+@ @g;s@%@\\\\x@g" | xargs -0 printf "%b"'
Then, whenever you need it, simply run:
$ echo "http%3A%2F%2Fwww" | urldecode
http://www
Bash
When scripting, you can use the following syntax:
input="http%3A%2F%2Fwww"
decoded=$(printf '%b' "${input//%/\\x}")
However, the above syntax won’t handle pluses (+) correctly, so you have to replace them with spaces via sed or, as suggested by @isaac, use the following syntax:
decoded=$(input=${input//+/ }; printf "${input//%/\\x}")
You can also use the following urlencode() and urldecode() functions:
urlencode() {
    # urlencode <string>
    local length="${#1}"
    for (( i = 0; i < length; i++ )); do
        local c="${1:i:1}"
        case $c in
            [a-zA-Z0-9.~_-]) printf "$c" ;;
            *) printf '%%%02X' "'$c" ;;
        esac
    done
}
urldecode() {
    # urldecode <string>
    local url_encoded="${1//+/ }"
    printf '%b' "${url_encoded//%/\\x}"
}
Note that the urldecode() function above assumes the data contains no backslash.
Here is Joel’s similar version, found at: https://github.com/sixarm/urldecode.sh
bash + xxd
Bash function with xxd tool:
urlencode() {
    local length="${#1}"
    for (( i = 0; i < length; i++ )); do
        local c="${1:i:1}"
        case $c in
            [a-zA-Z0-9.~_-]) printf "$c" ;;
            *) printf "$c" | xxd -p -c1 | while read x;do printf "%%%s" "$x";done
        esac
    done
}
Found in cdown’s gist file, also at Stack Overflow.
PHP
Using PHP you can try the following command:
$ echo oil+and+gas | php -r 'echo urldecode(fgets(STDIN));' # Or: php://stdin
oil and gas
or just:
php -r 'echo urldecode("oil+and+gas");'
Use -R for multiple line input.
Perl
In Perl you can use URI::Escape.
decoded_url=$(perl -MURI::Escape -e 'print uri_unescape($ARGV[0])' "$encoded_url")
Or to process a file:
perl -pli -MURI::Escape -e '$_ = uri_unescape($_)' file
awk
Try anon’s solution:
awk -niord '{printf RT?$0chr("0x"substr(RT,2)):$0}' RS=%..
Note: Parameter -n is specific to GNU awk.
Try Stéphane Chazelas’s urlencode solution:
awk -v RS='&#[0-9]+;' -v ORS= '1;RT{printf("%%%02X", substr(RT,3))}'
See: Using awk printf to urldecode text.
decoding file names
If you need to remove URL encoding from file names, use the deurlname tool from renameutils (e.g. deurlname *.*).
See also:
- Can wget decode uri file names when downloading in batch?
- How to remove URI encoding from file names?
Related:
- How to decode URL-encoded string in shell? at SO
- How can I encode and decode percent-encoded strings on the command line? at Ask Ubuntu
Method 3
There is a built-in function for that in the Python standard library. In Python 2, it’s urllib.unquote.
decoded_url=$(python2 -c 'import sys, urllib; print urllib.unquote(sys.argv[1])' "$encoded_url")
Or to process a file:
python2 -c 'import sys, urllib; print urllib.unquote(sys.stdin.read())' <file >file.new && mv -f file.new file
In Python 3, it’s urllib.parse.unquote.
decoded_url=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$encoded_url")
Or to process a file:
python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' <file >file.new && mv -f file.new file
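Since the question asks for an in-place edit, the same idea can be wrapped in a small function. This is only a sketch: the name urldecode_file is hypothetical, the file is assumed to be UTF-8 text, and the temporary-file-then-rename step mirrors the ">file.new && mv" pattern above. With unquote's default settings, byte sequences that are not valid UTF-8 decode to the replacement character.

```python
import os
import tempfile
from urllib.parse import unquote


def urldecode_file(path):
    """Percent-decode the file at `path` line by line, then replace it."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory so that the final
    # rename stays on one filesystem.
    with open(path, encoding="utf-8", newline="") as src, \
            tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="",
                                        dir=directory, delete=False) as dst:
        for line in src:
            dst.write(unquote(line))
    # Rename the decoded copy over the original.
    os.replace(dst.name, path)
```

Calling urldecode_file("file") would then rewrite file with %21 turned into !, %23 into #, and so on.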
In Perl you can use URI::Escape.
decoded_url=$(perl -MURI::Escape -e 'print uri_unescape($ARGV[0])' "$encoded_url")
Or to process a file:
perl -pli -MURI::Escape -e '$_ = uri_unescape($_)' file
If you want to stick to POSIX portable tools, it’s awkward, because the only serious candidate is awk, which doesn’t parse hexadecimal numbers. See Using awk printf to urldecode text for examples with common awk implementations, including BusyBox.
Method 4
Perl one liner:
$ perl -pe 's/%(\w\w)/chr hex $1/ge'
Example:
$ echo '%21%22' | perl -pe 's/%(\w\w)/chr hex $1/ge'
!"
or, if you want to ignore non-hex sequences like %zz (which the above mangles):
$ perl -pe 's/%([[:xdigit:]]{2})/chr hex $1/ge'
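For comparison, the stricter substitution can be sketched in Python with a regular expression over bytes. The function name percent_decode is made up for this illustration; like the second Perl one-liner, it leaves anything that is not % followed by two hex digits untouched and does not treat + specially.

```python
import re

# "%" followed by exactly two hexadecimal digits.
_PERCENT_HEX = re.compile(rb"%([0-9A-Fa-f]{2})")


def percent_decode(data: bytes) -> bytes:
    """Replace each %XX with the single byte whose hex value is XX."""
    return _PERCENT_HEX.sub(lambda m: bytes([int(m.group(1), 16)]), data)


print(percent_decode(b"%21%22 %zz").decode("utf-8"))  # !" %zz
```

Working on bytes rather than text means multi-byte UTF-8 sequences such as %e3%81%82 are reassembled before any decoding to text happens.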
Method 5
If you want to use a simple-minded sed command, then use the following:
sed -e 's/%21/!/g' -e 's/%23/#/g' -e 's/%24/$/g' -e 's/%26/\&/g' -e "s/%27/'/g" -e 's/%28/(/g' -e 's/%29/)/g'
But it is more convenient to create a script like (say sedscript):
s/%21/!/g
s/%23/#/g
s/%24/$/g
s/%26/\&/g
s/%27/'/g
s/%28/(/g
s/%29/)/g
Then run sed -f sedscript < old > new, which will produce the output you want.
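Writing such a script by hand gets tedious beyond a few characters, so it can be generated. The following is a rough sketch, not a complete decoder: the function name sed_script is hypothetical, it only covers the space and ASCII punctuation, and it only matches upper-case hex digits. It escapes the characters that are special in a sed replacement (\, / and &) and emits the %25 rule last, so that an input such as %2521 ends up as %21 rather than being decoded twice.

```python
import string


def sed_script():
    """Return sed commands decoding %XX for the space and ASCII punctuation."""
    chars = [c for c in " " + string.punctuation if c != "%"]
    chars.append("%")  # decode %25 last to avoid double decoding
    lines = []
    for c in chars:
        # Backslash, slash and ampersand are special in a sed replacement.
        replacement = "\\" + c if c in "\\/&" else c
        lines.append("s/%{:02X}/{}/g".format(ord(c), replacement))
    return "\n".join(lines) + "\n"


# Print the generated commands; redirect this output into sedscript.
print(sed_script(), end="")
```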
For convenience, a urlencode command is also available in the gridsite-clients package, which can be installed with sudo apt-get install gridsite-clients on Ubuntu/Debian systems.
NAME
urlencode - convert strings to or from URL-encoded form
SYNOPSIS
urlencode [-m|-d] string [string ...]
DESCRIPTION
urlencode encodes strings according to RFC 1738.
That is, characters A-Z a-z 0-9 . _ and - are
passed through unmodified, but all other characters are represented as %HH,
where HH is their two-digit upper-case hexadecimal ASCII representation.
For example, the URL http://www.gridpp.ac.uk/ becomes http%3A%2F%2Fwww.gridpp.ac.uk%2F
urlencode converts each character in all the strings
given on the command line. If multiple strings are given,
they are concatenated with separating spaces before conversion.
OPTIONS
-m Instead of full conversion, do GridSite “mild URL encoding”
in which A-Z a-z 0-9 . = - _ @ and / are passed through unmodified.
This results in slightly more human-readable strings
but the application must be prepared to create or simulate
the directories implied by any slashes.
-d Do URL-decoding rather than encoding, according to RFC 1738.
%HH and %hh strings are converted and other characters are passed through
unmodified, with the exception that + is converted to space.
Example of decoding URL:
$ urlencode -d "http%3a%2f%2funix.stackexchange.com%2f"
http://unix.stackexchange.com/
$ urlencode -d "Example: %21, %22, . . . , %29 etc"
Example: !, ", . . . , ) etc
Method 6
I can’t comment on best answer in this thread, so here is mine.
Personally, I use these aliases for URL encoding and decoding:
alias urlencode='python -c "import urllib, sys; print urllib.quote( sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"'
alias urldecode='python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"'
Both commands allow you to convert data, passed as a command line argument or read it from standard input, because both one-liners check whether there are command line arguments (even empty ones) and process them or just read standard input otherwise.
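The aliases above use Python 2 syntax. A rough Python 3 sketch of the same argument-or-stdin logic follows; the function name decode_arg_or_stdin is hypothetical, and io.StringIO stands in for standard input so the example can run on its own. Unlike the [0:-1] slice in the aliases, it strips the last character only when that character is a newline.

```python
import io
from urllib.parse import unquote


def decode_arg_or_stdin(argv, stdin):
    """Decode argv[1] if present (even if empty), otherwise decode stdin."""
    if len(argv) > 1:
        data = argv[1]
    else:
        data = stdin.read()
        # Drop the single trailing newline that echo or a pipe usually adds.
        if data.endswith("\n"):
            data = data[:-1]
    return unquote(data)


print(decode_arg_or_stdin(["urldecode", "%21%22"], io.StringIO()))  # !"
print(decode_arg_or_stdin(["urldecode"], io.StringIO("a%20b\n")))   # a b
```

In a real script the two arguments would be sys.argv and sys.stdin.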
update 2017-05-23 (slash encoding)
In response to @Bevor’s comment.
If you also need to encode the slash, just add an empty second argument to the quote function, then the slash will also be encoded.
So, finally urlencode alias in bash looks like this:
alias urlencode='python -c "import urllib, sys; print urllib.quote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1], \"\")"'
Example
$ urlencode "Проба пера/Pen test"
%D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test
$ echo "Проба пера/Pen test" | urlencode
%D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test
$ urldecode %D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test
Проба пера/Pen test
$ echo "%D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test" | urldecode
Проба пера/Pen test
$ urlencode "Проба пера/Pen test" | urldecode
Проба пера/Pen test
$ echo "Проба пера/Pen test" | urlencode | urldecode
Проба пера/Pen test
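In Python 3 the same switch is the safe parameter of urllib.parse.quote, which defaults to "/". A short sketch of the slash behaviour described above, using the same sample string:

```python
from urllib.parse import quote

text = "Проба пера/Pen test"

# Default: "/" is considered safe and is passed through unchanged.
keep_slash = quote(text)
# Empty safe set: "/" is percent-encoded as well.
encode_slash = quote(text, safe="")

print(keep_slash)
print(encode_slash)
```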
Method 7
GNU Awk
#!/usr/bin/awk -fn
@include "ord"
BEGIN {
    RS = "%.."
}
{
    printf "%s", $0
    if (RT != "") {
        printf "%s", chr("0x" substr(RT, 2))
    }
}
Method 8
And another Perl approach:
#!/usr/bin/env perl
use URI::Encode;
my $uri = URI::Encode->new( { encode_reserved => 0 } );
while (<>) {
    print $uri->decode($_)
}
You will need to install the URI::Encode module. On my Debian, I could simply run
sudo apt-get install liburi-encode-perl
Then, I ran the script above on a test file containing:
http://foo%21asd%23asd%24%26asd%27asd%28asd%29
The result was (I had saved the script as foo.pl):
$ ./foo.pl
http://foo!asd#asd$&asd'asd(asd)
Method 9
Another solution using ruby (accepted python answer wasn’t working for me)
alias urldecode='ruby -e "require \"cgi\"; puts CGI.unescape(ARGV[0])"'
alias urlencode='ruby -e "require \"cgi\"; puts CGI.escape(ARGV[0])"'
Example
$ urldecode 'q+werty%3D%2F%3B'
q werty=/;
$ urlencode 'q werty=/;'
q+werty%3D%2F%3B
Method 10
An answer in (mostly POSIX) shell:
$ input='%21%22'
$ printf "`printf "%s\n" "$input" | sed -e 's/+/ /g' -e 's/%\(..\)/\\\\x\1/g'`"
!"
Explanation:
- -e 's/+/ /g' transforms each + into a space (as described in the URL-encoding standard).
- -e 's/%\(..\)/\\\\x\1/g' transforms each %XX into \\xXX. Notice that one of the backslashes will be removed by quoting rules.
- The inner printf is just there to pass input to sed. We may replace it with any other mechanism.
- The outer printf interprets \xXX sequences and displays the result.
Edit:
Since % should always be interpreted in URLs, it is possible to simplify this answer. In addition, I think it is cleaner to use xargs instead of backquotes (thanks to @josch).
$ input='%21%22+%25'
$ printf "%s\n" "$input" | sed -e 's/+/ /g; s/%/\\x/g' | xargs -0 printf
!" %
Unfortunately, (as @josch noticed) none of these solutions is POSIX compliant, since the \x escape sequence is not defined in POSIX.
Method 11
Here is a BASH function to do exactly that:
function urldecode() {
    echo -ne $(echo -n "$1" | sed -E "s/%/\\\\x/g")
}
Method 12
Shell-only:
$ x='a%20%25%e3%81%82';printf "${x//%/\\x}"
a %あ
Add -- or %b to prevent arguments that start with a dash from being treated as options.
In zsh ${x//%/a} adds a to the end but ${x//\%/a} replaces % with a.
Method 13
Here are the relevant bits from another script (that I just shamelessly stole from my youtube.com download script from another answer) I’ve written before. It uses sed and the shell to build up a working urldecode.
set ! \" \# \$ \% \& \' \( \) \* \+ \, \/ \: \; \= \? \@ \[ \]
for c do set "$@" "'$c" "$c"; shift; done
curl -s "$url" | sed 's/\\u0026/\&/g;'"$(
    printf 's/%%%X/\\%s/g;' "$@"
)"
I won’t swear it’s comprehensive – and in fact I doubt it – but it handled youtube surely enough.
Method 14
The simple solution for short strings (shell is slowwww):
$ str='q+werty%3D%2F%3B'
$ a=${str//+/ };printf "$(echo "${a//%/\\x}")\n"
q werty=/;
Method 15
From my layman’s research of the topic, it appears that implementations of percent-encoding are susceptible to ambiguity in edge cases: the character encoding may differ from what is expected, some characters may not be escaped, the query part may be encoded differently, binary and non-ASCII characters may be present, etc. So, some analysis of and assumptions about the input data are necessary.
The closest thing to a dedicated tool is the respective functions in programming languages, such as the functions in Python’s urllib module, which make some sane assumptions about the URL data, as evidenced by the comments in cpython’s code. That’s why I consider the current top answer a good one.
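The character-encoding ambiguity mentioned above can be seen with a short sketch. The sample string caf%E9 is an assumption chosen for illustration: it is "café" percent-encoded from Latin-1 bytes rather than UTF-8. The encoding parameter and unquote_to_bytes are part of urllib.parse in Python 3.

```python
from urllib.parse import unquote, unquote_to_bytes

raw = "caf%E9"  # "café" encoded as Latin-1, not UTF-8

# Default is UTF-8 with errors="replace": the lone 0xE9 byte is invalid
# UTF-8 and becomes the replacement character U+FFFD.
as_utf8 = unquote(raw)
# Naming the encoding the data really uses recovers the text.
as_latin1 = unquote(raw, encoding="latin-1")
# Or skip text decoding and keep the raw bytes.
as_bytes = unquote_to_bytes(raw)

print(as_utf8)
print(as_latin1)  # café
print(as_bytes)   # b'caf\xe9'
```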
As an exercise, I implemented a similar alias with GNU Guile, since it is in the path by default on a GNU Guix system, where Python is not necessarily present in the path. I cannot comment on its reliability in comparison to Python, Perl, or other solutions. The documentation suggests that one should preferably split the URL on ?, &, and =, and process the query separately from the path, as well as split the path into segments with a dedicated function, and still be ready for errors. However, I am satisfied with the results on full URL strings copied from a browser.
alias urldecode='guile -c "(use-modules (web uri))
(display (uri-decode (cadr (command-line))))
(newline)"'
The (web uri) module provides the uri-decode function for decoding URIs. command-line returns the arguments. cadr picks the second item in the list (the URL, which is the first argument after the executable name itself, i.e. guile).
$ urldecode "http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/%d0%9c%d0%b0%d0%ba%d0%b5%d1%82%20%d0%9d%d0%b0%d1%80%d0%be%d0%b4%d0%bd%d0%b8%20%d0%bd%d0%b0%d0%b7%d0%b2%d0%b8.pdf?sequence=2&isAllowed=y"
http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/Макет Народни назви.pdf?sequence=2&isAllowed=y
A one-liner when not having an alias:
$ guile -c "(use-modules (web uri)) (display (uri-decode (cadr (command-line)))) (newline)" "http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/%d0%9c%d0%b0%d0%ba%d0%b5%d1%82%20%d0%9d%d0%b0%d1%80%d0%be%d0%b4%d0%bd%d0%b8%20%d0%bd%d0%b0%d0%b7%d0%b2%d0%b8.pdf?sequence=2&isAllowed=y"
http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/Макет Народни назви.pdf?sequence=2&isAllowed=y
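The advice about splitting the URL before decoding can be illustrated with Python’s urllib.parse. This is only a sketch; the example.com URL is made up, chosen so that the path contains an encoded slash and the query contains an encoded ampersand.

```python
from urllib.parse import parse_qsl, unquote, urlsplit

url = "http://example.com/a%2Fb/c%20d?k=x%26y&name=oil+and+gas"

parts = urlsplit(url)
# Split the path on "/" first, then decode each segment on its own.
segments = [unquote(s) for s in parts.path.split("/")]
# parse_qsl splits on "&" and "=" before decoding, and maps "+" to a space.
params = parse_qsl(parts.query)

print(segments)  # ['', 'a/b', 'c d']
print(params)    # [('k', 'x&y'), ('name', 'oil and gas')]

# Decoding the whole string at once loses the structure: the decoded "/"
# and "&" can no longer be told apart from the real separators.
print(unquote(url))  # http://example.com/a/b/c d?k=x&y&name=oil+and+gas
```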
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.