Decoding URL encoding (percent encoding)

I want to decode URL encoding, is there any built-in tool for doing this or could anyone provide me with a sed code that will do this?

I did search a bit through unix.stackexchange.com and on the internet but I couldn’t find any command line tool for decoding url encoding.

What I want to do is simply in place edit a txt file so that:

  • %21 becomes !
  • %23 becomes #
  • %24 becomes $
  • %26 becomes &
  • %27 becomes '
  • %28 becomes (
  • %29 becomes )

And so on.

Answers:


Method 1

Found these Python one liners that do what you want:

Python2

$ alias urldecode='python -c "import sys, urllib as ul; \
    print ul.unquote_plus(sys.argv[1])"'

$ alias urlencode='python -c "import sys, urllib as ul; \
    print ul.quote_plus(sys.argv[1])"'

Python3

$ alias urldecode='python3 -c "import sys, urllib.parse as ul; \
    print(ul.unquote_plus(sys.argv[1]))"'

$ alias urlencode='python3 -c "import sys, urllib.parse as ul; \
    print(ul.quote_plus(sys.argv[1]))"'

Example

$ urldecode 'q+werty%3D%2F%3B'
q werty=/;

$ urlencode 'q werty=/;'
q+werty%3D%2F%3B
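Note that the aliases above use the `_plus` variants, which treat `+` as an encoded space (the form/query-string convention). A minimal Python 3 sketch of the difference, using only `urllib.parse` from the standard library and the sample string from the example above:

```python
from urllib.parse import quote_plus, unquote, unquote_plus

encoded = "q+werty%3D%2F%3B"

# unquote_plus decodes %XX sequences and also turns "+" into a space.
with_plus = unquote_plus(encoded)

# unquote decodes %XX sequences only and leaves "+" untouched.
without_plus = unquote(encoded)

# quote_plus is the inverse of unquote_plus for this input.
round_trip = quote_plus(with_plus)

print(with_plus)     # q werty=/;
print(without_plus)  # q+werty=/;
print(round_trip)    # q+werty%3D%2F%3B
```

If the input is a path rather than a query string, a literal `+` is usually meant as a plus sign, so `unquote` is probably the safer choice there.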


Method 2

sed

Try the following command line:

$ sed 's@+@ @g;s@%@\\x@g' file | xargs -0 printf "%b"

or the following alternative using echo -e:

$ sed -e's/%\([0-9A-F][0-9A-F]\)/\\\\x\1/g' file | xargs echo -e

Note: The above syntax may not convert + to spaces, and can eat all the newlines.


You may define it as alias and add it to your shell rc files:

$ alias urldecode='sed "s@+@ @g;s@%@\\\\x@g" | xargs -0 printf "%b"'

Then every time when you need it, simply go with:

$ echo "http%3A%2F%2Fwww" | urldecode
http://www

Bash

When scripting, you can use the following syntax:

input="http%3A%2F%2Fwww"
decoded=$(printf '%b' "${input//%/\\x}")

However, the above syntax won't handle pluses (+) correctly, so you have to replace them with spaces via sed or, as suggested by @isaac, use the following syntax:

decoded=$(input=${input//+/ }; printf "${input//%/\\x}")

You can also use the following urlencode() and urldecode() functions:

urlencode() {
    # urlencode <string>
    local length="${#1}"
    for (( i = 0; i < length; i++ )); do
        local c="${1:i:1}"
        case $c in
            [a-zA-Z0-9.~_-]) printf "$c" ;;
            *) printf '%%%02X' "'$c" ;;
        esac
    done
}
 
urldecode() {
    # urldecode <string>
 
    local url_encoded="${1//+/ }"
    printf '%b' "${url_encoded//%/\\x}"
}

Note that above urldecode() assumes the data contains no backslash.

Here is Joel's similar version, found at: https://github.com/sixarm/urldecode.sh


bash + xxd

Bash function with xxd tool:

urlencode() {
  local length="${#1}"
  for (( i = 0; i < length; i++ )); do
    local c="${1:i:1}"
    case $c in
      [a-zA-Z0-9.~_-]) printf "$c" ;;
      *) printf "$c" | xxd -p -c1 | while read x; do printf "%%%s" "$x"; done ;;
    esac
  done
}

Found in cdown’s gist file, also at stackoverflow.


PHP

Using PHP you can try the following command:

$ echo oil+and+gas | php -r 'echo urldecode(fgets(STDIN));' # Or read from php://stdin
oil and gas

or just:

php -r 'echo urldecode("oil+and+gas");'

Use -R for multiple line input.


Perl

In Perl you can use URI::Escape.

decoded_url=$(perl -MURI::Escape -e 'print uri_unescape($ARGV[0])' "$encoded_url")

Or to process a file:

perl -pli -MURI::Escape -e '$_ = uri_unescape($_)' file

awk

Try anon's solution:

awk -niord '{printf RT?$0chr("0x"substr(RT,2)):$0}' RS=%..

Note: Parameter -n is specific to GNU awk.

Try Stéphane Chazelas's urlencode solution:

awk -v RS='&#[0-9]+;' -v ORS= '1;RT{printf("%%%02X", substr(RT,3))}'

See: Using awk printf to urldecode text.

decoding file names

If you need to remove url encoding from the file names, use deurlname tool from renameutils (e.g. deurlname *.*).


Method 3

There is a built-in function for that in the Python standard library. In Python 2, it’s urllib.unquote.

decoded_url=$(python2 -c 'import sys, urllib; print urllib.unquote(sys.argv[1])' "$encoded_url")

Or to process a file:

python2 -c 'import sys, urllib; print urllib.unquote(sys.stdin.read())' <file >file.new &&
mv -f file.new file

In Python 3, it’s urllib.parse.unquote.

decoded_url=$(python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.argv[1]))' "$encoded_url")

Or to process a file:

python3 -c 'import sys, urllib.parse; print(urllib.parse.unquote(sys.stdin.read()))' <file >file.new &&
mv -f file.new file
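For the in-place edit the question asks about, the same idea can be written as a small script instead of a one-liner plus `mv`. This is only a sketch under a few assumptions: the file is UTF-8 text, it fits in memory, and the function name `decode_file_in_place` is made up for illustration.

```python
import os
import tempfile
from urllib.parse import unquote


def decode_file_in_place(path: str) -> None:
    """Percent-decode a UTF-8 text file and replace it in one rename."""
    # newline="" keeps the original line endings untouched.
    with open(path, encoding="utf-8", newline="") as src:
        text = src.read()

    # Create the temporary file next to the target so os.replace() is a
    # rename on the same filesystem. Note: mkstemp creates it with mode 0600,
    # so the original file permissions are not preserved.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as dst:
            dst.write(unquote(text))
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        os.unlink(tmp_path)
        raise
```

Unlike the `print` one-liners, this does not append an extra newline to the output. Use `unquote_plus` instead of `unquote` if `+` should become a space.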

In Perl you can use URI::Escape.

decoded_url=$(perl -MURI::Escape -e 'print uri_unescape($ARGV[0])' "$encoded_url")

Or to process a file:

perl -pli -MURI::Escape -e '$_ = uri_unescape($_)' file

If you want to stick to POSIX portable tools, it’s awkward, because the only serious candidate is awk, which doesn’t parse hexadecimal numbers. See Using awk printf to urldecode text for examples with common awk implementations, including BusyBox.

Method 4

Perl one liner:

$ perl -pe 's/%(\w\w)/chr hex $1/ge'

Example:

$ echo '%21%22' |  perl -pe 's/%(\w\w)/chr hex $1/ge'
!"

or if you want to ignore non-hex sequences like %zz (which the above mangles)

$ perl -pe 's/%([[:xdigit:]]{2})/chr hex $1/ge'
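The same "only touch valid %XX sequences" rule can be sketched in Python. This is an illustration rather than a port of the Perl one-liner: the name `percent_decode` is hypothetical, and it assumes the decoded bytes are meant to be UTF-8 (undecodable bytes are replaced rather than raising).

```python
import re

# Exactly two hex digits after "%"; anything else (e.g. "%zz") is left alone.
_PERCENT_BYTE = re.compile(rb"%([0-9A-Fa-f]{2})")


def percent_decode(text: str) -> str:
    """Decode valid %XX sequences and pass everything else through."""
    raw = text.encode("utf-8")
    decoded = _PERCENT_BYTE.sub(
        lambda m: bytes.fromhex(m.group(1).decode("ascii")), raw
    )
    # Working on bytes lets multi-byte sequences such as %e3%81%82 combine
    # into one character before the final decode.
    return decoded.decode("utf-8", errors="replace")


print(percent_decode("%21%22"))   # !"
print(percent_decode("100%zz"))   # 100%zz
```

Because nothing here interprets backslash escapes, input that already contains backslashes passes through unchanged.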

Method 5

If you want to use a simple-minded sed command, then use the following:

sed -e 's/%21/!/g' -e 's/%23/#/g' -e 's/%24/$/g' -e 's/%26/\&/g' -e "s/%27/'/g" -e 's/%28/(/g' -e 's/%29/)/g'

But it is more convenient to create a script like (say sedscript):

s/%21/!/g
s/%23/#/g
s/%24/$/g
s/%26/\&/g
s/%27/'/g
s/%28/(/g
s/%29/)/g

Then run sed -f sedscript < old > new, which will produce the desired output.


For convenience, the urlencode command is also available in the gridsite-clients package (on Ubuntu/Debian, install it with sudo apt-get install gridsite-clients).

NAME

    urlencode – convert strings to or from URL-encoded form

SYNOPSIS

    urlencode [-m|-d] string [string ...]

DESCRIPTION

    urlencode encodes strings according to RFC 1738.
    That is, characters A-Z a-z 0-9 . _ and - are
    passed through unmodified, but all other characters are represented as %HH,
    where HH is their two-digit upper-case hexadecimal ASCII representation.
    For example, the URL http://www.gridpp.ac.uk/ becomes http%3A%2F%2Fwww.gridpp.ac.uk%2F

    urlencode converts each character in all the strings
    given on the command line.  If multiple strings are given,
    they are concatenated with separating spaces before conversion.

OPTIONS

    -m
      Instead of full conversion, do GridSite “mild URL encoding”
      in which A-Z a-z 0-9 . = - _ @ and / are passed through unmodified.
      This results in slightly more human-readable strings
      but the application must be prepared to create or simulate
      the directories implied by any slashes.

    -d

      Do URL-decoding rather than encoding, according to RFC 1738. 
      %HH and %hh strings are converted and other characters are passed through
      unmodified, with the exception that + is converted to space.

Example of decoding URL:

$ urlencode -d "http%3a%2f%2funix.stackexchange.com%2f"
http://unix.stackexchange.com/

$ urlencode -d "Example: %21, %22, . . . , %29 etc"
Example: !, ", . . . , ) etc
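The encoding rule from the man page excerpt above is simple enough to sketch directly. The following Python is an illustration of that rule only, not the gridsite implementation: the function name and the `mild` parameter are hypothetical, and encoding non-ASCII text as UTF-8 bytes is an assumption (the man page only speaks of ASCII).

```python
import string

# Characters passed through unmodified, per the description above.
STRICT_SAFE = frozenset(string.ascii_letters + string.digits + "._-")
# "Mild" encoding additionally keeps = @ and / readable.
MILD_SAFE = STRICT_SAFE | frozenset("=@/")


def urlencode_like_manpage(text: str, mild: bool = False) -> str:
    """Encode every byte outside the safe set as %HH (upper-case hex)."""
    safe = MILD_SAFE if mild else STRICT_SAFE
    out = []
    for byte in text.encode("utf-8"):  # assumption: UTF-8 for non-ASCII input
        ch = chr(byte)
        out.append(ch if ch in safe else f"%{byte:02X}")
    return "".join(out)


print(urlencode_like_manpage("http://www.gridpp.ac.uk/"))
# http%3A%2F%2Fwww.gridpp.ac.uk%2F
print(urlencode_like_manpage("http://www.gridpp.ac.uk/", mild=True))
# http%3A//www.gridpp.ac.uk/
```

The first output matches the example given in the DESCRIPTION section.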

Method 6

I can’t comment on best answer in this thread, so here is mine.

Personally, I use these aliases for URL encoding and decoding:

alias urlencode='python -c "import urllib, sys; print urllib.quote(  sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"'

alias urldecode='python -c "import urllib, sys; print urllib.unquote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1])"'

Both commands allow you to convert data, passed as a command line argument or read it from standard input, because both one-liners check whether there are command line arguments (even empty ones) and process them or just read standard input otherwise.


update 2017-05-23 (slash encoding)

In response to @Bevor's comment.

If you also need to encode the slash, just add an empty second argument to the quote function, then the slash will also be encoded.

So, finally urlencode alias in bash looks like this:

alias urlencode='python -c "import urllib, sys; print urllib.quote(sys.argv[1] if len(sys.argv) > 1 else sys.stdin.read()[0:-1], \"\")"'

Example

$ urlencode "Проба пера/Pen test"
%D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test

$ echo "Проба пера/Pen test" | urlencode
%D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test

$ urldecode %D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test
Проба пера/Pen test

$ echo "%D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test" | urldecode
Проба пера/Pen test

$ urlencode "Проба пера/Pen test" | urldecode
Проба пера/Pen test

$ echo "Проба пера/Pen test" | urlencode | urldecode
Проба пера/Pen test
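The aliases above rely on Python 2's `urllib.quote`. On Python 3 the equivalent functions live in `urllib.parse`, and the empty second argument corresponds to the `safe` parameter. A short sketch using the sample string from the examples:

```python
from urllib.parse import quote, unquote

text = "Проба пера/Pen test"

# By default "/" is considered safe and is left as-is.
keep_slash = quote(text)

# Passing safe="" makes quote() encode the slash as %2F as well.
encode_slash = quote(text, safe="")

print(keep_slash)
# %D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0/Pen%20test
print(encode_slash)
# %D0%9F%D1%80%D0%BE%D0%B1%D0%B0%20%D0%BF%D0%B5%D1%80%D0%B0%2FPen%20test
print(unquote(encode_slash))
# Проба пера/Pen test
```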

Method 7

GNU Awk

#!/usr/bin/awk -fn
@include "ord"
BEGIN {
   RS = "%.."
}
{
   printf "%s", $0
   if (RT != "") {
      printf "%s", chr("0x" substr(RT, 2)) 
   }
}

Method 8

And another Perl approach:

#!/usr/bin/env perl
use URI::Encode;
my $uri     = URI::Encode->new( { encode_reserved => 0 } );
while (<>) {

    print $uri->decode($_)
}

You will need to install the URI::Encode module. On my Debian, I could simply run

sudo apt-get install liburi-encode-perl

Then, I ran the script above on a test file containing:

http://foo%21asd%23asd%24%26asd%27asd%28asd%29

The result was (I had saved the script as foo.pl):

$ ./foo.pl
http://foo!asd#asd$&asd'asd(asd)

Method 9

Another solution using ruby (accepted python answer wasn’t working for me)

alias urldecode='ruby -e "require \"cgi\"; puts CGI.unescape(ARGV[0])"'
alias urlencode='ruby -e "require \"cgi\"; puts CGI.escape(ARGV[0])"'

Example

$ urldecode 'q+werty%3D%2F%3B'
q werty=/;

$ urlencode 'q werty=/;'
q+werty%3D%2F%3B

Method 10

An answer in (mostly Posix) shell:

$ input='%21%22'
$ printf "`printf "%s\n" "$input" | sed -e 's/+/ /g' -e 's/%\(..\)/\\\\x\1/g'`"
!"

Explanation:

  • -e 's/+/ /g' transforms each + into a space (as described in the URL-encoding standard)
  • -e 's/%\(..\)/\\\\x\1/g' transforms each %XX into \xXX. Note that one level of backslashes is removed by the shell's quoting rules, because the command runs inside backquotes.
  • The inner printf is just there to pass input to sed. We may replace it by any other mechanism
  • The outer printf interpret \xXX sequences and display result.

Edit:

Since % should always be interpreted in URLs, this answer can be simplified. In addition, I think it is cleaner to use xargs instead of backquotes (thanks to @josch).

$ input='%21%22+%25'
$ printf "%s\n" "$input" | sed -e 's/+/ /g; s/%/\\x/g' | xargs -0 printf
!" %

Unfortunately (as @josch noticed), none of these solutions is POSIX compliant, since the \x escape sequence is not defined in POSIX.

Method 11

Here is a BASH function to do exactly that:

function urldecode() {
        echo -ne $(echo -n "$1" | sed -E "s/%/\\\\x/g")
}

Method 12

Shell-only:

$ x='a%20%25%e3%81%82';printf "${x//%/\\x}"
a %あ

Add -- or %b to prevent arguments that start with a dash from being treated as options.

In zsh ${x//%/a} adds a to the end but ${x//\%/a} replaces % with a.

Method 13

Here are the relevant bits from another script (that I just shamelessly stole from my youtube.com download script from another answer) I’ve written before. It uses sed and the shell to build up a working urldecode.

set ! \" \# \$ \% \& \' \( \) \* \+ \, \/ \: \; \= \? \@ \[ \]
for c do set "$@" "'$c" "$c"; shift; done
curl -s "$url" | sed 's/\\u0026/\&/g;'"$(
    printf 's/%%%X/\\%s/g;' "$@"
)"

I won't swear it's comprehensive (in fact, I doubt it), but it handled YouTube well enough.

Method 14

The simple solution for short strings (shell is slowwww):

$ str='q+werty%3D%2F%3B'

$ a=${str//+/ };printf "$(echo "${a//%/\\x}")\n"

q werty=/;

Method 15

From my layman's research of the topic, it appears that implementations of percent-encoding are susceptible to ambiguity in edge cases: the character encoding may differ from what is expected, some characters may not be escaped, the query part may be encoded differently, binary and non-ASCII characters may be present, and so on. So some analysis of, and assumptions about, the input data are necessary.

The closest thing to a dedicated tool is the corresponding functions in programming languages, such as those in Python's urllib module, which make some sane assumptions about the URL data, as evidenced by the comments in CPython's code. That's why I find the current top answer to be good.
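To make the character-encoding ambiguity concrete, here is a small Python 3 sketch. The sample string is made up for illustration: it contains "é" encoded as a single Latin-1 byte rather than as UTF-8.

```python
from urllib.parse import unquote, unquote_to_bytes

sample = "caf%E9"  # hypothetical input: Latin-1 byte 0xE9, not valid UTF-8

# unquote() assumes UTF-8 and, by default, replaces undecodable bytes
# with U+FFFD instead of raising an error.
as_utf8 = unquote(sample)

# Telling unquote() the real encoding recovers the intended character.
as_latin1 = unquote(sample, encoding="latin-1")

# unquote_to_bytes() sidesteps the question and returns the raw bytes.
as_bytes = unquote_to_bytes(sample)

# Malformed sequences such as %zz are passed through unchanged.
malformed = unquote("100%zz")

print(repr(as_utf8))    # 'caf\ufffd'
print(repr(as_latin1))  # 'café'
print(as_bytes)         # b'caf\xe9'
print(malformed)        # 100%zz
```

None of these results is wrong as such; which one is right depends on what produced the input, which is why some knowledge of the data is needed.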

As a matter of exercise, I implemented a similar alias with GNU Guile, since it is in path by default on a GNU Guix system with Python not necessarily being present in path. I cannot comment on reliability in comparison to Python, Perl, or other solutions. The documentation suggests that one should preferably split the URL on ?, &, and =, and process the query separately from the path, as well as split the path into segments with a dedicated function, and still be ready for errors. However, I am satisfied with the results on full URL strings copied from a browser.

alias urldecode='guile -c "(use-modules (web uri))
                           (display (uri-decode (cadr (command-line))))
                           (newline)"'

(web uri) module provides uri-decode function for decoding URIs. command-line passes the arguments. cadr picks the second item in the list (which is the URL being the first argument after the executable name itself, i.e. guile).

$ urldecode "http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/%d0%9c%d0%b0%d0%ba%d0%b5%d1%82%20%d0%9d%d0%b0%d1%80%d0%be%d0%b4%d0%bd%d0%b8%20%d0%bd%d0%b0%d0%b7%d0%b2%d0%b8.pdf?sequence=2&isAllowed=y"
http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/Макет Народни назви.pdf?sequence=2&isAllowed=y

A one-liner when not having an alias:

$ guile -c "(use-modules (web uri)) (display (uri-decode (cadr (command-line)))) (newline)" "http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/%d0%9c%d0%b0%d0%ba%d0%b5%d1%82%20%d0%9d%d0%b0%d1%80%d0%be%d0%b4%d0%bd%d0%b8%20%d0%bd%d0%b0%d0%b7%d0%b2%d0%b8.pdf?sequence=2&isAllowed=y"
http://ephsheir.uhsp.edu.ua/bitstream/handle/8989898989/2850/Макет Народни назви.pdf?sequence=2&isAllowed=y
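For comparison, the "split first, then decode" approach mentioned above can be sketched with Python's standard library. The function name `decode_url_parts` and the example.com URL are hypothetical; the point is only that the path segments and the query are decoded by different rules.

```python
from urllib.parse import parse_qsl, unquote, urlsplit


def decode_url_parts(url: str):
    """Return (decoded path segments, decoded query pairs) for a URL."""
    parts = urlsplit(url)
    # Split the path before decoding so an encoded slash (%2F) stays
    # inside its segment instead of creating a new one.
    segments = [unquote(segment) for segment in parts.path.split("/")]
    # parse_qsl splits on & and =, then decodes each piece, treating
    # "+" as a space (the query-string convention).
    params = parse_qsl(parts.query, keep_blank_values=True)
    return segments, params


segments, params = decode_url_parts(
    "http://example.com/a%2Fb/c%20d?q=x+y%26z&k=1"
)
print(segments)  # ['', 'a/b', 'c d']
print(params)    # [('q', 'x y&z'), ('k', '1')]
```

Decoding the whole URL string in one pass would have turned `a%2Fb` into two path segments and `y%26z` into a parameter separator.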


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
