How to detect if a file is PDF or TIFF?

Please bear with me as I’ve been thrown into the middle of this project without knowing all the background. If you’ve got WTF questions, trust me, I have them too.

Here is the scenario: I’ve got a bunch of files residing on an IIS server. They have no file extension on them. Just naked files with names like “asda-2342-sd3rs-asd24-ut57” and so on. Nothing intuitive.

The problem is I need to serve up files on an ASP.NET (2.0) page and display the tiff files as tiff and the PDF files as PDF. Unfortunately I don’t know which is which and I need to be able to display them appropriately in their respective formats.

For example, let’s say that there are 2 files I need to display, one is tiff and one is PDF. The page should show up with a tiff image, and perhaps a link that would open up the PDF in a new tab/window.

The problem:

As these files are all extension-less I had to force IIS to just serve everything up as TIFF. But if I do this, the PDF files won’t display. I could change IIS to force the MIME type to be PDF for unknown file extensions but I’d have the reverse problem.

http://support.microsoft.com/kb/326965

Is this problem easier than I think or is it as nasty as I am expecting?

Answers:

Method 1

OK, enough people are getting this wrong that I’m going to post some code I have to identify TIFFs:

private const int kTiffTagLength = 12;
private const int kHeaderSize = 2;
private const int kMinimumTiffSize = 8;
private const byte kIntelMark = 0x49;
private const byte kMotorolaMark = 0x4d;
private const ushort kTiffMagicNumber = 42;


private bool IsTiff(Stream stm)
{
    // Stream.Seek takes an offset and an origin.
    stm.Seek(0, SeekOrigin.Begin);
    if (stm.Length < kMinimumTiffSize)
        return false;
    byte[] header = new byte[kHeaderSize];

    stm.Read(header, 0, header.Length);

    // The two byte-order bytes must match: "II" (Intel) or "MM" (Motorola).
    if (header[0] != header[1] || (header[0] != kIntelMark && header[0] != kMotorolaMark))
        return false;
    bool isIntel = header[0] == kIntelMark;

    // The next two bytes hold 42, stored in the byte order just detected.
    ushort magicNumber = ReadShort(stm, isIntel);
    if (magicNumber != kTiffMagicNumber)
        return false;
    return true;
}

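private ushort ReadShort(Stream stm, bool isIntel)
{
    // Read two bytes from the stream passed in and combine them
    // according to the requested byte order.
    byte[] b = new byte[2];
    stm.Read(b, 0, b.Length);
    return ToShort(isIntel, b[0], b[1]);
}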

private static ushort ToShort(bool isIntel, byte b0, byte b1)
{
    if (isIntel)
    {
        return (ushort)(((int)b1 << 8) | (int)b0);
    }
    else
    {
        return (ushort)(((int)b0 << 8) | (int)b1);
    }
}

I hacked apart some much more general code to get this.

For PDF, I have code that looks like this:

public bool IsPdf(Stream stm)
{
    stm.Seek(0, SeekOrigin.Begin);
    PdfToken token;
    while ((token = GetToken(stm)) != null) 
    {
        if (token.TokenType == MLPdfTokenType.Comment) 
        {
            if (token.Text.StartsWith("%PDF-1.")) 
                return true;
        }
        if (stm.Position > 1024)
            break;
    }
    return false;
}

Now, GetToken() is a call into a scanner that tokenizes a Stream into PDF tokens. This is non-trivial, so I’m not going to paste it here. I’m using the tokenizer instead of searching for a substring to avoid a problem like this:
% the following is a PostScript file, NOT a PDF file
% you'll note that in our previous version, it started with %PDF-1.3,
% incorrectly marking it as a PDF
%
clippath stroke showpage

This PostScript file is marked as NOT a PDF by the code snippet above, whereas a more simplistic check will incorrectly mark it as a PDF.

I should also point out that the current ISO spec is devoid of the implementation notes that were in the previous Adobe-owned specification. Most importantly from the PDF Reference, version 1.6:

Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
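
For readers without a PDF tokenizer, here is a rough sketch of a middle ground. It is not the tokenizer-based code above. It assumes it is acceptable to look for the bytes "%PDF-" only at the start of the data or at the start of a line within the first 1024 bytes. The class and method names (PdfSniffer, LooksLikePdf) are made up for illustration.

```csharp
using System;
using System.IO;

public static class PdfSniffer
{
    // Acrobat's documented tolerance: the header may appear anywhere in the first 1024 bytes.
    private const int MaxHeaderSearch = 1024;

    // Reads up to 1024 bytes from the start of the stream and checks them.
    public static bool LooksLikePdf(Stream stm)
    {
        stm.Seek(0, SeekOrigin.Begin);
        byte[] buf = new byte[MaxHeaderSearch];
        int total = 0;
        int n;
        // Stream.Read may return fewer bytes than requested, so loop until full or EOF.
        while (total < buf.Length &&
               (n = stm.Read(buf, total, buf.Length - total)) > 0)
        {
            total += n;
        }
        return LooksLikePdf(buf, total);
    }

    // Returns true if "%PDF-" begins the data or begins a line within the first count bytes.
    public static bool LooksLikePdf(byte[] buf, int count)
    {
        byte[] marker = { 0x25, 0x50, 0x44, 0x46, 0x2D }; // "%PDF-"
        for (int i = 0; i + marker.Length <= count; i++)
        {
            // Skip positions that are not at the start of the data or of a line.
            if (i > 0 && buf[i - 1] != (byte)'\n' && buf[i - 1] != (byte)'\r')
                continue;

            bool match = true;
            for (int j = 0; j < marker.Length; j++)
            {
                if (buf[i + j] != marker[j])
                {
                    match = false;
                    break;
                }
            }
            if (match)
                return true;
        }
        return false;
    }
}
```

This rejects the PostScript example above, because there "%PDF-1.3" sits in the middle of a comment line. It would still be fooled by a non-PDF file that has a line beginning with "%PDF-", which is the kind of case a real tokenizer handles better.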

Method 2

TIFF can be detected by peeking at the first bytes: http://local.wasp.uwa.edu.au/~pbourke/dataformats/tiff/

The first 8 bytes forms the header.
The first two bytes of which is either
“II” for little endian byte ordering
or “MM” for big endian byte ordering.

About PDF: http://www.adobe.com/devnet/livecycle/articles/lc_pdf_overview_format.pdf

The header contains just one line that
identifies the version of PDF.
Example: %PDF-1.6

Method 3

Reading the specification for each file format will tell you how to identify files of that format.

TIFF files – Check bytes 0-1 for 0x4D4D or 0x4949 and bytes 2-3 for the value 42.

Page 13 of the spec reads:

A TIFF file begins with an 8-byte image file header, containing the following information:

Bytes 0-1: The byte order used within the file. Legal values are: “II” (4949.H), “MM” (4D4D.H). In the “II” format, byte order is always from the least significant byte to the most significant byte, for both 16-bit and 32-bit integers. This is called little-endian byte order. In the “MM” format, byte order is always from most significant to least significant, for both 16-bit and 32-bit integers. This is called big-endian byte order.

Bytes 2-3: An arbitrary but carefully chosen number (42) that further identifies the file as a TIFF file. The byte order depends on the value of Bytes 0-1.

PDF files start with the PDF version followed by several binary bytes. (I think you now have to purchase the ISO spec for the current version.)

Section 7.5.2

The first line of a PDF file shall be
a header consisting of the 5
characters %PDF- followed by a version
number of the form 1.N, where N is a
digit between 0 and 7. A conforming
reader shall accept files with any of
the following headers: %PDF-1.0,
%PDF-1.1, %PDF-1.2, %PDF-1.3, %PDF-1.4,
%PDF-1.5, %PDF-1.6, %PDF-1.7. Beginning
with PDF 1.4, the Version entry in the
document’s catalog dictionary (located
via the Root entry in the file’s
trailer, as described in 7.5.5, “File
Trailer”), if present, shall be used
instead of the version specified in
the Header.

If a PDF file contains binary data, as
most do (see 7.2, “Lexical
Conventions”), the header line shall
be immediately followed by a comment
line containing at least four binary
characters—that is, characters whose
codes are 128 or greater. This ensures
proper behaviour of file transfer
applications that inspect data near
the beginning of a file to determine
whether to treat the file’s contents
as text or as binary.

Of course you could do a “deeper” check on each file by checking more file specific items.
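
As one possible example of such a deeper check for TIFF, the sketch below also looks at bytes 4-7 of the header. In the TIFF 6.0 spec these hold the offset of the first image file directory (IFD), and an IFD begins with a 2-byte entry count. The class and method names are made up for illustration, and this is only a plausibility test, not a full validation of the file.

```csharp
using System;
using System.IO;

public static class TiffDeepCheck
{
    // Returns true if the 8-byte header is valid and the first-IFD offset
    // points at a location that actually exists inside the stream.
    public static bool HasPlausibleTiffHeader(Stream stm)
    {
        if (stm.Length < 8)
            return false;

        stm.Seek(0, SeekOrigin.Begin);
        byte[] h = new byte[8];
        int total = 0;
        int n;
        // Stream.Read may return fewer bytes than requested, so loop.
        while (total < h.Length && (n = stm.Read(h, total, h.Length - total)) > 0)
        {
            total += n;
        }
        if (total < h.Length)
            return false;

        bool little = h[0] == 0x49 && h[1] == 0x49; // "II"
        bool big = h[0] == 0x4D && h[1] == 0x4D;    // "MM"
        if (!little && !big)
            return false;

        // Bytes 2-3: the magic number 42 in the file's byte order.
        int magic = little ? (h[2] | (h[3] << 8)) : ((h[2] << 8) | h[3]);
        if (magic != 42)
            return false;

        // Bytes 4-7: offset of the first IFD, in the file's byte order.
        uint offset = little
            ? ((uint)h[4] | ((uint)h[5] << 8) | ((uint)h[6] << 16) | ((uint)h[7] << 24))
            : (((uint)h[4] << 24) | ((uint)h[5] << 16) | ((uint)h[6] << 8) | (uint)h[7]);

        // The IFD must start after the header and leave room for its 2-byte entry count.
        return offset >= 8 && (long)offset + 2 <= stm.Length;
    }
}
```

A truncated or corrupt file can still pass this, so treat a true result as "probably a TIFF" rather than proof.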

Method 4

A very useful list of file signatures (aka “magic numbers”) by Gary Kessler is available at http://www.garykessler.net/library/file_sigs.html

Method 5

Internally, the file header information should help. If you do a low-level file open, such as StreamReader() or FOPEN(), look at the first two characters in the file. Almost every file type has its own signature.

PDF starts with "%P" (more specifically, with "%PDF")
TIFF starts with "II" (or with "MM" for big-endian files)
Bitmap files with "BM"
Executable files with "MZ"

I’ve had to deal with this in the past too… also to help prevent unwanted files from being uploaded to a given site and immediately aborting it once checked.

EDIT — Posted sample code to read and test file header types

String fn = "Example.pdf";

// Read the first five characters of the file.
StreamReader sr = new StreamReader( fn );
char[] buf = new char[5];
sr.Read( buf, 0, buf.Length );
sr.Close();
// Build the header string directly from the buffer.
String Hdr = new String( buf );

String WhatType;
if (Hdr.StartsWith("%PDF"))
   WhatType = "PDF";
else if (Hdr.StartsWith("MZ"))
   WhatType = "EXE or DLL";
else if (Hdr.StartsWith("BM"))
   WhatType = "BMP";
else if (Hdr.StartsWith("?_"))
   WhatType = "HLP (help file)";
else if (Hdr.StartsWith("1"))
   WhatType = "Icon (.ico)";
else if (Hdr.StartsWith("2"))
   WhatType = "Cursor (.cur)";
else
   WhatType = "Unknown";

Method 6

If you go here, you will see that the TIFF usually starts with “magic numbers” 0x49 0x49 0x2A 0x00 (some other definitions are also given), which is the first 4 bytes of the file.

So just use these first 4 bytes to determine whether the file is a TIFF or not.

EDIT: It is probably better to do it the other way around and detect PDF first. The magic numbers for PDF are more standardized: as Plinth kindly pointed out, they start with “%PDF” (0x25 0x50 0x44 0x46) somewhere in the first 1024 bytes.

Method 7

You are going to have to write an ashx handler to serve the requested file.

Your handler should then read the first few bytes (or so) to determine what the file type really is. PDFs and TIFFs have “magic numbers” at the beginning of the file that you can use to determine this. Then set your response headers accordingly.
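
A rough sketch of what such a handler could look like follows. The class name, the "name" query-string parameter, and the ~/App_Data/files/ folder are all assumptions made for illustration, not anything from the question. It needs a reference to System.Web and is meant as a starting point, not a hardened implementation.

```csharp
using System;
using System.IO;
using System.Text;
using System.Web;

public class SniffingFileHandler : IHttpHandler
{
    // Hypothetical folder that holds the extension-less files.
    private const string kFileRoot = "~/App_Data/files/";

    public bool IsReusable
    {
        get { return true; }
    }

    public void ProcessRequest(HttpContext context)
    {
        // Hypothetical parameter; GetFileName drops any directory part of the value.
        string name = Path.GetFileName(context.Request.QueryString["name"]);
        string path = string.IsNullOrEmpty(name)
            ? null
            : context.Server.MapPath(kFileRoot + name);
        if (path == null || !File.Exists(path))
        {
            context.Response.StatusCode = 404;
            return;
        }

        // Read up to the first 1024 bytes for sniffing.
        byte[] head = new byte[1024];
        int count;
        using (FileStream fs = File.OpenRead(path))
        {
            count = fs.Read(head, 0, head.Length);
        }

        // Fall back to a generic binary type when the file is neither TIFF nor PDF.
        string contentType = GuessContentType(head, count);
        context.Response.ContentType = contentType ?? "application/octet-stream";
        context.Response.TransmitFile(path);
    }

    // Returns "image/tiff", "application/pdf", or null if neither signature is found.
    public static string GuessContentType(byte[] head, int count)
    {
        if (count >= 4)
        {
            bool intel = head[0] == 0x49 && head[1] == 0x49 && head[2] == 0x2A && head[3] == 0x00;
            bool motorola = head[0] == 0x4D && head[1] == 0x4D && head[2] == 0x00 && head[3] == 0x2A;
            if (intel || motorola)
                return "image/tiff";
        }

        // Simplistic substring search for the PDF header within the sniffed bytes.
        string text = Encoding.ASCII.GetString(head, 0, count);
        if (text.IndexOf("%PDF-", StringComparison.Ordinal) >= 0)
            return "application/pdf";

        return null;
    }
}
```

The page could then point an image tag or a link at the handler, passing the file name in the query string. Keeping GuessContentType separate from ProcessRequest makes the sniffing logic easy to exercise without a web request.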

Method 8

You can use Myrmec to identify the file type. This library inspects the leading bytes of the file. It is available on NuGet as “Myrmec”, and its repo is on GitHub. Myrmec also supports MIME types. The code looks like this:

// Create a sniffer instance.
Sniffer sniffer = new Sniffer();

// Populate it with metadata.
sniffer.Populate(FileTypes.CommonFileTypes);

// Get the file's leading bytes; 20 bytes may be enough.
byte[] fileHead = ReadFileHead();

// Start matching.
List<string> results = sniffer.Match(fileHead);

To get the MIME type:
List<string> result = sniffer.Match(head);

string mimeType = MimeTypes.GetMimeType(result.First());

Note that it only supports two TIFF signatures, “49 49 2A 00” and “4D 4D 00 2A”. If you need more, you can add them yourself; see the readme in the Myrmec GitHub repo for help.


All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
