Ruby on Rails: January 2011

Sunday, January 30, 2011

convert string to Time rails

How can I convert a string such as '2007-01-31 12:22:26' to a Time
object?

irb(main):001:0> require 'time'
=> true
irb(main):002:0> Time.parse('2007-01-31 12:22:26')
=> Wed Jan 31 12:22:26 EST 2007
Chronic can do that and a whole lot more (in case you need to)!

require 'chronic'

Chronic.parse('2007-01-31 12:22:26')
=> Wed Jan 31 12:22:26 -0800 2007

Chronic.parse('today at 12:22.26')
=> Wed Jan 31 12:22:26 -0800 2007
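
When the format of the input is known in advance, Time.strptime from the same 'time' library is a stricter alternative to Time.parse: it raises ArgumentError for input that does not match the given format. A minimal sketch, assuming Ruby 1.9 or later; parse_timestamp is a made-up helper name, not something from the posts above:

```ruby
require 'time'

# Hypothetical helper: parse a 'YYYY-MM-DD HH:MM:SS' string in the local
# time zone, returning nil instead of raising when the format does not match.
def parse_timestamp(string)
  Time.strptime(string, '%Y-%m-%d %H:%M:%S')
rescue ArgumentError
  nil
end

t = parse_timestamp('2007-01-31 12:22:26')
puts t.year                           # 2007
puts t.hour                           # 12
p parse_timestamp('not a timestamp')  # nil
```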

When working with UTF-8-encoded text from an untrusted source like a web form, it’s a good idea to fix any invalid byte sequences at the first stage, to avoid breaking later processing steps that depend on valid input.

For a long while, the Ruby idiom that I’ve been using and recommending to others is this:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string)

IGNORE is supposed to tell the processor to silently discard bytes that it can’t convert. The output thus contains only valid byte sequences from the input—exactly what we want.

Today, quite by accident, I discovered a problem with it. Iconv in all its forms (library and command-line, on Linux and on Mac OS X) will ignore invalid byte sequences unless they occur right at the end of the string; compare this:

ic.iconv("foo\303bar") # => "foobar"

and this:

ic.iconv("foo\303") # Iconv::InvalidCharacter: "\303"

What’s more, it’s only a certain range of bytes that break the conversion:

(128..255).inject([]){ |acc, b|
  begin
    ic.iconv("foo%c" % b)
    acc
  rescue
    acc << b
  end
}

The ‘dangerous’ bytes are those in the range 194-253. To put it another way, that’s all bytes of the binary pattern /^1{2,6}0/—the leading bytes from a UTF-8 byte sequence. (Incidentally, it’s interesting to see that, at least on OS X, it recognises the never-used and since-withdrawn five- and six-byte sequences from the original UTF-8 specification).
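
The bit pattern itself can be checked in plain Ruby. This sketch only illustrates the pattern and does not exercise Iconv. Note that the pattern also matches bytes 192 and 193, which can only begin overlong encodings and are never valid in UTF-8; that may be why they do not show up in the observed range.

```ruby
# Bytes whose 8-bit binary form starts with two to six 1s followed by a 0,
# i.e. the lead-byte pattern of the original UTF-8 design.
lead_bytes = (128..255).select { |b| b.to_s(2) =~ /^1{2,6}0/ }

puts lead_bytes.first  # 192
puts lead_bytes.last   # 253
```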

All of this is useful in explaining why it happens, but not how to fix it. The fix, however, is simple:

ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid_string = ic.iconv(untrusted_string + ' ')[0..-2]

Add a valid byte before converting, and remove it afterwards, and voilà—there’s never an invalid sequence at the end of the buffer. (It’s possible to improve the efficiency of this implementation if you don’t care about preserving the original string: use << instead of + to add the space.)

As to why //IGNORE doesn’t ignore this situation, I don’t know. As far as I can tell, the POSIX specification doesn’t specifically address the //IGNORE flag, so it’s hard to say what it should be doing.

http://po-ru.com/diary/fixing-invalid-utf-8-in-ruby-revisited/
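
On newer Rubies, Iconv is no longer part of the standard library (it was removed in Ruby 2.0), and Ruby 2.1 added String#scrub, which covers the same use case. A minimal sketch, assuming Ruby 2.1 or later; sanitize_utf8 is a made-up helper name:

```ruby
# Hypothetical helper: drop every invalid byte sequence from a string that
# is supposed to be UTF-8. It works on a copy, so the argument is untouched.
def sanitize_utf8(untrusted_string)
  untrusted_string.dup.force_encoding(Encoding::UTF_8).scrub('')
end

p sanitize_utf8("foo\xC3bar")  # "foobar"
p sanitize_utf8("foo\xC3")     # "foo"
```

Here "\xC3" is the same byte as the octal "\303" used above, and the trailing-byte case needs no padding trick.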

Ruby: convert string to date

Date.strptime("{ 2009, 4, 15 }", "{ %Y, %m, %d }")

http://stackoverflow.com/questions/2720907/ruby-convert-string-to-date

Convert to/from DateTime and Time in Ruby

require 'time'
require 'date'

t = Time.now
d = DateTime.now

dd = DateTime.parse(t.to_s)
tt = Time.parse(d.to_s)

http://stackoverflow.com/questions/279769/convert-to-from-datetime-and-time-in-ruby
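
Going through to_s and parse works, but the string form carries whole seconds only. With 'date' loaded, Ruby 1.9 and later also offer direct conversions, Time#to_datetime and DateTime#to_time. A short sketch, using a fixed timestamp chosen only for illustration:

```ruby
require 'date'
require 'time'

t = Time.at(1296388800, 123456)  # arbitrary instant with 123456 microseconds

d  = t.to_datetime  # Time -> DateTime, fractional seconds kept
tt = d.to_time      # DateTime -> Time

puts tt.usec                  # 123456
puts Time.parse(d.to_s).usec  # 0, because to_s prints whole seconds only
```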

Monday, January 24, 2011

generate a random string in rails

1:
o = [('a'..'z'), ('A'..'Z')].map { |i| i.to_a }.flatten
string = (0...50).map { o[rand(o.length)] }.join  # 50 random letters

2:
Why not use SecureRandom, provided by ActiveSupport?
require 'active_support/secure_random'
random_string = ActiveSupport::SecureRandom.hex(16)
# outputs: 5b5cd0da3121fc53b4bc84d0c8af2e81
SecureRandom also has methods for:
base64
hex
random_bytes
random_number
see: http://api.rubyonrails.org/classes/ActiveSupport/SecureRandom.html
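
Ruby itself also ships a SecureRandom module in its standard library, so the same kind of call works without ActiveSupport. A minimal sketch; the variable names are only for illustration:

```ruby
require 'securerandom'

hex_token = SecureRandom.hex(16)             # 16 random bytes as 32 hex characters
url_token = SecureRandom.urlsafe_base64(16)  # uses A-Z, a-z, 0-9, '-' and '_'

puts hex_token.length  # 32
puts url_token
```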

3:
This solution generates a string of easily readable characters for activation codes; I didn't want people confusing 8 with B, 1 with I, 0 with O, etc.

# Generates a random string from a set of easily readable characters
def generate_activation_code(size = 6)
  charset = %w{ 2 3 4 6 7 9 A C D E F G H J K L M N P Q R T V W X Y Z}
  (0...size).map{ charset[rand(charset.size)] }.join
end

4:
Can't remember where I found this, but seemed the best to me and least process intense:

def random_string(length=10)
  chars = 'abcdefghjkmnpqrstuvwxyzABCDEFGHJKLMNPQRSTUVWXYZ23456789'
  password = ''
  length.times { password << chars[rand(chars.size)] }
  password
end

5:
`pwgen 8 1`.chomp

6:
I use this for generating random url friendly strings.

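rand(32**length).to_s(32)
It generates random strings of the digits 0-9 and the lowercase letters a-v (base 32 stops at v; base 36 would cover a-z). The result can be shorter than length when the random number is small. It's not very customizable but it's short and clean.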
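
Because leading zeros are not printed, rand(...).to_s(base) can return fewer characters than requested. A sketch that pads the result to a fixed length; url_token is a hypothetical helper name, and base 36 is an assumption made here so that the output draws on all of 0-9 and a-z:

```ruby
# Hypothetical helper: random string of exactly `length` characters,
# drawn from 0-9 and a-z, left-padded with '0' when the number is small.
def url_token(length)
  rand(36**length).to_s(36).rjust(length, '0')
end

puts url_token(8)
puts url_token(8).length  # 8
```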

7:
Another method I like to use

rand(2**256).to_s(36)[0..7]
Add ljust if you are really paranoid about the correct string length:

rand(2**256).to_s(36).ljust(8,'a')[0..7]

8:
To make your first into one statement:

(0...8).map { (65 + rand(26)).chr }.join

9:
With this method you can pass in an arbitrary length. It defaults to 6.

def generate_random_string(length=6)
  string = ""
  chars = ("A".."Z").to_a
  length.times do
    # rand(n) already returns 0...n; subtracting 1 would never pick "Z"
    string << chars[rand(chars.length)]
  end
  string
end

10:
ALPHABET = ('a'..'z').to_a
10.times.map{ ALPHABET.sample }.join
10.times.inject(''){|s| s << ALPHABET.sample }

RFC-822 date-time format

Here are examples of valid RFC822 date-times:

Wed, 02 Oct 2002 08:00:00 EST

Wed, 02 Oct 2002 13:00:00 GMT

Wed, 02 Oct 2002 15:00:00 +0200
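
Ruby's 'time' library can read and write this format through Time.rfc822 and Time#rfc822. The three examples above are three notations for the same instant, as this short sketch shows (the variable names are only for illustration):

```ruby
require 'time'

est   = Time.rfc822('Wed, 02 Oct 2002 08:00:00 EST')
gmt   = Time.rfc822('Wed, 02 Oct 2002 13:00:00 GMT')
plus2 = Time.rfc822('Wed, 02 Oct 2002 15:00:00 +0200')

puts est == gmt && gmt == plus2  # true

# Formatting goes the other way, here for a fixed +02:00 offset.
puts gmt.getlocal('+02:00').rfc822  # Wed, 02 Oct 2002 15:00:00 +0200
```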

Tuesday, January 18, 2011

Reading data from Flash file

Adobe Flash can make data difficult to extract. This tutorial will teach you how to find and examine raw data files that are sent to your web browser, without worrying how the data is visually displayed.

For example, the data displayed on this Recovery.gov Flash map is drawn from this text file, which is downloaded to your browser upon accessing the web page.

Inspecting your web browser traffic is a basic technique that you should use when first examining a database-backed website.

Background

In September 2008, drug company Cephalon pleaded guilty to a misdemeanor charge and settled a civil lawsuit involving allegations of fraudulent marketing of its drugs. It is required to post its payments to doctors on its website.

Cephalon's report is not downloadable and the site disables the mouse’s right-click function, which typically brings up a pop-up menu with the option to save the webpage or inspect its source code. The report is inside a Flash application and disables copying text with Ctrl-C.

We asked the company why it chose this format. Company spokeswoman Sheryl Williams wrote in an e-mail: "We can appreciate the lack of ease in aggregating data or searching based on other parameters, but this posting was not required to do these things. We believe the [Office of the Inspector General]’s requirement was intended for the use of patients, who can easily look up their [health care provider] in our system."

Software to Get

Firefox
The Firebug plugin, to monitor your browser’s web traffic
Ruby, the scripting language
Nokogiri, an XML parsing library for Ruby

Instead of using Firebug, you can also use Safari's built-in Activity window, or Chrome's Developer Tools, for the inspection part. To parse the result, we use Ruby and Nokogiri, which is an essential library for any kind of web scraping with Ruby.

A Series of Tubes...and Files

While the site makes the data difficult to download, it’s not impossible. In fact, it’s fairly easy with some understanding of web browser interaction. The content of a web page doesn’t consist of a single file. For instance, images are downloaded separately from the webpage’s HTML.

Flash applications are also discrete files, and sometimes they act as shells for data that come in separate text files, all of which is downloaded by the browser when visiting Cephalon’s page. So, while Cephalon designed a Flash application to format and display its payments list, we can just view the list as raw text.

Viewing Cephalon's page. The Firebug panel is circled.

Firebug can tell you what files your browser is receiving. In Firefox, open up Firebug by clicking on the bug icon on the status bar, then click on the Net panel. This panel shows every file that was received by your web browser when it accessed Cephalon's page.

Close-up of the Firebug panel. The Net tab is circled in yellow, the relevant .swf file is circled in green.

We know we’re looking for the Flash file, so let's look for that first. Flash applets use the suffix swf. The only one listed is spend_data.swf. In Firebug, right-click on the listing, copy the URL, and paste it into a new browser window:

http://www.cephalon.com/our-responsibility/fees-for-services-2009/spend_data.swf

You'll get a larger-screen view of the list, though that doesn’t really help our data analysis. As you may have noticed in the Firebug Net panel, spend_data.swf is less than 45 kilobytes, which doesn't seem large enough to contain the entire list of doctors and payments. So where is the actual data stored?

Sniffing Out the Data

Here’s how to find it: First, clear your cache in Firefox by going to Tools->Clear Recent History and selecting Cache. With Firebug still open, refresh the browser window that has spend_data.swf open.

Relevant XML file is circled here.

Firebug's window tells us that besides receiving spend_data.swf, our browser downloaded two XML files. One of these is more than 100 kilobytes, which is about what we would expect for an XML-formatted list of a few hundred doctors.

Now right-click on the file in Firebug and select Open in New Tab, and then View Page Source by right-clicking in the new tab. You should see a text file full of entries like the following:


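<row>
  <value>100001</value>
  <value>$ 100,001 - $ 110,000</value>
  <value>14057447</value>
  <value>Rizzieri, David A</value>
  <value>Rizzieri</value>

  <value>David</value>
  <value>MD</value>
  <value>Durham, NC</value>
  <value>102400</value>
  <value>Honoraria</value>
</row>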

That's what we were looking for: a well-structured list of the doctors and what they got paid. Now it's a simple matter of using an xml parser, like Ruby's Nokogiri, to iterate through each "row" node and pick up the essential values.

Parsing with Nokogiri

The following is a brief example of Nokogiri's most basic methods. It assumes you have Ruby and Nokogiri installed, and a little familiarity with basic programming.

The two Nokogiri methods we're most interested in are:

css – this lets us select tags inside XML and HTML documents. In this example, we want the value and row tags.
text – with each element returned by css, text will give us the actual characters enclosed by the element's tags.

Each row represents a record, and each value represents a data field, like name and location. So, we simply want to read each row and select the values we're interested in.

require 'rubygems'
require 'nokogiri'
# you should set the following filename variable to whatever the name of the xml file is, either online, 
# or if you've downloaded it on to your hard drive
filename = 'cephalon-data.xml' 
file = Nokogiri::XML(open(filename))
# use 'css' to select each 'row' and iterate through each one
file.css('row').each do |row|
        
        # select each value in the row with 'css'
        values = row.css('value')
        
        # The 4th, 8th, and 9th values contain the doctor's name, city, and payment amount, respectively
        # (remember that Ruby arrays start their count at zero)
        
        # put the three value elements' text in an array, join them with a tab-character, and print the line
        # to the screen
        puts [values[3].text, values[7].text, values[8].text].join("\t")
end

Here's a compact variation of the above code that writes the result into a file:

require 'rubygems'
require 'nokogiri'
filename = 'cephalon-data.xml'
File.open('cephalon-output.txt', 'w'){ |output_file| 
        Nokogiri::XML(open(filename)).css('row').map{|row| row.css('value').map{|v| v.text}}.each do |values|
                output_file.puts( [values[3], values[7], values[8]].join("\t") )
        end
}
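
Once the rows are extracted, the payment amounts are still strings. The following is a small sketch of a possible follow-up step that turns tab-separated lines of the kind written above into numbers and sorts them. top_payments is a hypothetical helper; the second input row is the sample entry from the XML shown earlier, and the first is a made-up placeholder used only to have something to sort.

```ruby
# Hypothetical helper: parse "name<TAB>city<TAB>amount" lines and return
# [name, city, amount] triples, largest payment first.
def top_payments(lines)
  lines.map { |line|
    name, city, amount = line.chomp.split("\t")
    [name, city, amount.to_i]
  }.sort_by { |_, _, amount| -amount }
end

lines = [
  "Doe, Jane\tAnytown, XX\t500",           # made-up placeholder row
  "Rizzieri, David A\tDurham, NC\t102400"  # sample row from the XML listing
]

top_payments(lines).each do |name, city, amount|
  puts "#{amount}\t#{name} (#{city})"
end
```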

So, what first appeared to be the most difficult report to parse ends up being the easiest. Whether you’re dealing with a Flash application or an HTML database-backed website, your first step should be to see what text files your browser receives when accessing the page.