Tuesday, June 7, 2011

Mechanize reads special Microsoft Excel characters as question marks

So you are trying to read data from a Microsoft Excel spreadsheet (or any Office product) and it keeps reading special characters as question marks. Well, I've been there.

I had this problem the other day. The data was in English and Spanish, so it contained a variety of special characters. I was using Mechanize, a Ruby library that automates web browsing and scraping, to download the spreadsheet and import the data, but I kept seeing names like Andr? and Ant?nio. Here is how I solved it:

Short Answer:

require 'iconv'

# ...a bunch of code to get the Excel response...
corruptbody = agent.page.body.strip
# Iconv.conv takes the target charset first, then the source charset.
body = Iconv.conv('UTF-8', 'cp1252', corruptbody)

Quick and painless. Ruby's Iconv class came through for me. The hard part was figuring out which charset Microsoft was using for the Excel spreadsheet.

CP1252 (also known as Windows-1252) is Microsoft's charset for Western European languages. You can find a list of all encodings here.
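
Iconv was removed from Ruby's standard library in Ruby 2.0, so on a newer Ruby the same conversion would have to be written differently. Here is a minimal sketch using String#force_encoding and String#encode instead. The byte string is a made-up sample standing in for agent.page.body, not data from the original spreadsheet.

```ruby
# Hypothetical sample: "André" as raw Windows-1252 bytes (0xE9 is é).
# In the real script this would be agent.page.body.strip.
corruptbody = "Andr\xE9".b

# Label the bytes as Windows-1252, then transcode them to UTF-8.
body = corruptbody.force_encoding('Windows-1252').encode('UTF-8')

puts body            # prints André
puts body.encoding   # prints UTF-8
```

force_encoding only changes the label on the bytes; encode is the step that actually rewrites them as UTF-8.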

Long Answer:

First I tried using ruby's String instance methods to convert the content to UTF-8.
corruptbody = agent.page.body.strip.force_encoding('UTF-8')

This didn't work, as I was using Ruby 1.8.7, which doesn't include force_encoding (it was added in Ruby 1.9). So I tried finding out what encoding it was in to begin with.
puts corruptbody.encoding
==>UTF-8
This confused me because I thought I needed to convert it to unicode UTF-8 to get rid of the question marks. I found a post here where someone had the same problem. I tried what they suggested, but nothing changed.

After some digging around I found out that Ruby, by default, treats everything as UTF-8. So the problem was with what came back from Mechanize, and not with Ruby: the Excel data arrived in some other encoding, and nothing was converting it.

Naturally, I figured the Excel data was using some Spanish charset that I needed to change (even though most Latin charsets include values for both English and Spanish). I read around some more and found out that ISO-8859-1 is commonly used for Latin alphabets. However, when I tried to convert from it to UTF-8, nothing changed.
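
One reason ISO-8859-1 and Windows-1252 are easy to mix up: they agree on most accented letters and differ only in bytes 0x80 to 0x9F, where Windows-1252 puts printable characters such as curly quotes. The sketch below illustrates that difference with a made-up byte string. It uses String#encode from Ruby 1.9+ rather than Iconv, and it is not meant to reproduce what I saw on 1.8.7.

```ruby
# Hypothetical sample bytes: a left curly quote (0x93), "Ant", ó (0xF3),
# "nio", and a right curly quote (0x94), the kind of text Office produces.
raw = "\x93Ant\xF3nio\x94".b

# Decode the same bytes under each charset and transcode to UTF-8.
as_cp1252 = raw.dup.force_encoding('Windows-1252').encode('UTF-8')
as_latin1 = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

puts as_cp1252               # prints “António”
p as_latin1                  # the quotes come out as control characters
puts as_cp1252 == as_latin1  # prints false
```

The ó survives either way; only the quote bytes are decoded differently.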

Finally I found the answer on a list of character encodings, which you can see here. This led me to the Iconv solution.

corruptbody = agent.page.body.strip
body = Iconv.conv('UTF-8', 'cp1252', corruptbody)

Resources:
List of all character encodings: http://en.wikipedia.org/wiki/Character_encoding
String encoding in ruby 1.9: http://blog.grayproductions.net/articles/ruby_19s_string
Ruby Iconverter: http://www.ruby-doc.org/stdlib/libdoc/iconv/rdoc/classes/Iconv.html
Mechanize documentation: http://mechanize.rubyforge.org/mechanize/
Railscast on learning to use mechanize: http://railscasts.com/episodes/191-mechanize

Ruby: Class vs Instance methods

Today I was trying to call a method in the Rails console, but it kept giving me a "NoMethodError: undefined method" error. I was confused because I could see the method right there in the model, but the console wasn't finding it for some reason.

Eventually I discovered that the method was defined in the model as a class method (self.method) and I was trying to call it using an instance of the class. Here's what I learned:

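A class method is one that is called on the class itself, not on an instance. (In the examples below, method is just a placeholder name. In real code pick a different name, because every Ruby object already has a built-in method called method.)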

MyClass.method
==>true


Suppose you had an instance of MyClass:

mc = MyClass.new
mc.method

==>NoMethodError: undefined method


An instance of a class cannot call a class method. An instance can call an instance method.
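
Here is a small sketch that shows both directions failing. Greeter, build, and greet are made-up names for illustration; they are not from my model.

```ruby
# Greeter, build, and greet are hypothetical names for illustration.
class Greeter
  # Class method: called on Greeter itself.
  def self.build
    new
  end

  # Instance method: called on an object created from Greeter.
  def greet
    "hello"
  end
end

g = Greeter.build
puts g.greet                    # prints hello

# Calling either method on the wrong receiver raises NoMethodError.
begin
  g.build
rescue NoMethodError => e
  puts "instance: #{e.class}"   # prints instance: NoMethodError
end

begin
  Greeter.greet
rescue NoMethodError => e
  puts "class: #{e.class}"      # prints class: NoMethodError
end
```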

Examples:

Ways to define an instance method:

1)
class MyClass
  def method
    puts "doStuff"
  end
end

mc = MyClass.new
mc.method
==>doStuff

2)
class MyClass
  attr_accessor :method
end

mc = MyClass.new
mc.method = "doStuff"
puts mc.method
==>doStuff

If attr_accessor throws you off, learn about it here.


3)
class MyClass; end

mc = MyClass.new
def mc.method
  puts "doStuff"
end

mc.method
==>doStuff

Ways to define a class method:

1)
class MyClass
  def self.method
    puts "doStuff"
  end
end

MyClass.method
==>doStuff

Notice that you are calling the method using the class (MyClass.method), not an instance.

2)
class MyClass
  class << self
    # anything in this block is a class method
    def method
      puts "doStuff"
    end
  end
end

MyClass.method
==>doStuff

mc = MyClass.new # this is an instance of MyClass, so it cannot call the class method
mc.method
==>NoMethodError: undefined method

3)
class MyClass; end
def MyClass.method
  puts "doStuff"
end

MyClass.method
==>doStuff

In this last example, notice that you are defining the method for MyClass, making it a class method.
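
If you hit the same NoMethodError in the console, one way to check which kind of method you are dealing with is to ask the class directly. This is a rough sketch assuming Ruby 1.9 or later (where these calls return symbols); Report, latest, and title are made-up names.

```ruby
# Report, latest, and title are hypothetical names for illustration.
class Report
  # Class method
  def self.latest
    "class-level"
  end

  # Instance method
  def title
    "instance-level"
  end
end

p Report.singleton_methods          # prints [:latest]
p Report.instance_methods(false)    # prints [:title]
p Report.respond_to?(:latest)       # prints true
p Report.new.respond_to?(:latest)   # prints false
```

Passing false to instance_methods leaves out inherited methods, so only the ones defined in the class itself are listed.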

For more examples/explanation:
http://railstips.org/blog/archives/2009/05/11/class-and-instance-methods-in-ruby/