<code> Monkey </code>: nlp

Showing posts with label nlp. Show all posts

Monday, September 15, 2008

Powerset use GPU to do linguistic processing :)

"I just recently saw a talk on using CUDA to do linguistic processing for Powerset’s engine. One of the major pain points was pushing data to the card."

Extracted from http://kirindave.tumblr.com/post/50255490/tim-sweeny-on-the-twilight-of-the-gpu

Note that KirinDave is a Ruby guru at Powerset.

Wednesday, September 10, 2008

Amazon Search does not have spelling suggestion function

Quite surprise !!
Spelling suggestion is like a standard feature for search engines (Yahoo, Google, MSN, Ask.com, PowerSet, AltaVista, Vivisimo, Clusty, Lycos, SearchMe.com, and xalo.vn (Vietnamese) )

I found that Cuil and Baidu (Chinese) also don't have English spelling suggestion.

Sunday, August 24, 2008

Interview with Google

I had an interview with Google R&D in Beijing, China on 08/08/2008. Nice day ha :)

Four week ago, I received an email from a professor in Vietnam said that Google need people to do Vietnamese Natural Language Processing in China. They need people who is good at programming, has experience in NLP and Vietnamese is a must. I applied at got an telephone interview invitation. They created a share document on Google Docs so I and the interviewer can share notes. I had a week to prepare.

I prepared by reading:

Tuyển tập các câu hỏi phỏng vấn by Ngo Quang Hung
Get the job at Google by Steve Yegge (a Googler)
Introduction to Algorithms
Elements of Programming

To improve important skills:

Algorithms
Problem Solving
Programming

At the interview day, the guy who interview me called late 30 minutes and was sorry because he had been busy. The telephone connection was not good. The guy could hear me clearly but I had difficult to hear him. I asked him to call again but still the same. We started the interview.

They didn't ask about programming skill and experience.

They didn't read my blog too :))

I was surprised while knowing that none people from China spent more than zero seconds on my blog. It means that my interview (in China) never read my blog although I put links to my blog on cover letter, emails and resume. I believe that any web company will ask "Do you have a blog?" to most of their interviewees. The assumption is that: If you love your work, you should write about it; and blog is the most effective way to let people know that. Don't know why Google China don't care about enthusiasm. The strongest force that make people great. All of web startups (include Google) was started with a small team with a lot of enthusiasm.

Come back to my interview. I was asked two questions. One about C programming language, one about algorithms.

Q1: Tell the different between following C codes?

char *a = "A";

char a = "A";

char *a = 'A';

char a = 'A';

Q2: Given a group of people. There is only one or none star in the group. Everybody know the star. The star don't know other people. Given a function know(a, b) that return true if a know b, return false if a don't know b. Design an algorithm to answer that is there a star in the group and who is the star.

I answered Q1 smoothly. It's not a tricky question. Anyone who know about C pointer and C string can answer Q1.

For Q2, I showed the interviewer a naive solution with two for loop that take O(n^2) time.

The interviewer asked me to improve the algorithm. He was emphasis that could I design an algo that take O(n) time only? I thought for a while and still don't have an answer. I remembered that in "Get a job at Google" post, Steve Yegge said that we should think graph first. I draw a graph and saw a solution there. I told the interview I got a solution when describe people and their relationship as a graph. The interviewer reminded me that we don't have a graph already, and it also take O(n^2) time to build that graph :)) Steve Yegge's advice did not work in my case. Ha ha. I ask for more time to think carefully. Finally I found a solution, it's an easy solution that I should came up with earlier. I was nervous. When we treat people as a set and try to eliminate non relevant elements from the set. We can solve Q2 in O(n).

The interviewer approved my answer and he asked me if I have any question for Google. I smiled because I had practiced with my friend before, Dinh Hai, he acted as the interviewer and asked me as much questions as possible. He told me that I should ask Google some fun questions, he suggested this question: "is it true that at Google, the food never be placed away from programmers more than 8 meters?". I asked the same question to my interviewer and he laughed. The trick worked, thanks Dinh Hai :)) I also told him that I have motivation to do Vietnamese Language Processing to help Vietnamese people understand English better and help Google search Vietnamese documents more accurately (It's a win-win situation) and asked about Google plan for Vietnamese. He said that he is not a member of language processing group and could not answer my question. It was my last question and we finished.

Two weeks later, I received a "thank you" letter from Google said that currently Google did not have any position relevant to my skill :) I guest Google need a person who really do Natural Language Processing and good at C/C++ or Java. My job is web development using Ruby and JavaScript. It is not a good match. Before the interview, I realized my weakness and betted on my enthusiasm and motivation in Vietnamese language processing. It did not work :)

What I learn from my interview?

First, big company care about computer science. They want people who master algorithms and have good problem solving skill. The better computer science you are, the bigger chance you have.

Second, interview over telephone is not a good idea. Bad telephone line can reduce the quality of the interview. I had to asked my interviewer to repeat the questions, and needed to confirm with him questions' content.

How about your interview experience?

Saturday, August 2, 2008

Simplest Hunspell Ruby binding

Hunspell is one of the most powerful open source spelling checker. Recently I need to use spelling checking our startup project. Hunspell is the first choice.

There are several Hunspell bindings for Ruby:

http://rubyforge.org/projects/hunspell/
http://rubyforge.org/projects/ruby-hunspell/

They are gems with long and traditional bingding code.

Initially, I use Hunspell gem but after digging in to source code, I found a memory lick problem: the author forgot to free suggestion list after call Hunspell_suggest(pHunspell, &slst, str) function.

I decided to write my own binding with less code using RubyInline. Here is the result:
A simplest Hunspell Ruby Binding within 60 line of codes :)

Check it out at:
http://github.com/tiendung/rhunspell/

Update: Now the binding become a Rubygem, you can install it easily with following command:
"sudo gem install rhunspell"

Tuesday, May 20, 2008

Giới thiệu về xử lý ngôn ngữ tự nhiên

(lược dịch từ Getting Started on Natural Language Processing with Python)

Những khái niệm cơ bản

Token: trước khi xử lý, dữ liệu text đầu vào cần được phân đoạn thành các đơn vị ngôn ngữ như là từ, dấu chấm câu, số và chữ số. Các đơn vị ngôn ngữ đó được gọi chung là tokens.

Sentence: Một chuỗi có thứ tự các tokens

Tokenization: quá trình phân chia một sent thành các tokens. Với segmented language như tiếng Anh, có thể tokenize dễ dàng với dấu cách (whitespace). Tuy nhiên, với những ngôn ngữ như tiếng Việt, tiếng Trung hay tiếng Ả Rập, tokenization khó hơn nhiều vì không có biên giới rõ rành giữa các từ như tiếng Anh. Hơn thế nữa mọi ký tự trong ngôn ngữ tượng hình như tiếng Trung, tiếng Ả Rập đều có thể tồn tại như một từ đơn ký tự hoặc cũng có thể kết hợp với nhau để tạo thành từ đa ký tự.

Corpus: a body of text, thường chứa rất nhiều sent.

Part-of-speech (POS) tag: một từ có thể được phân loại vào một hay nhiều tập từ vựng (lexical) hoặc lớp từ loại như là danh từ, động từ, tính từ, và giới từ .. Một POS tag là một biểu tượng tượng trưng cho một lớp từ vựng - NN(Noun), VB(Verb), JJ(Adjective), AT(Article). Một trong những tập tags lâu đời và phổ biến nhất là Brow Corpus tag set.

Parse Tree: với mỗi một sent có một tree tương ứng biểu diễn cấu trúc cú pháp của sent đó. Cấu trúc cú pháp được định nghĩa bởi một ngôn ngữ hình thức (formal language).

Những bài toán phổ thông

POS Tagging: cho một câu và một tập POS tags, hãy gán nhãn cho từng từ trong câu. Ví dụ, cho câu “The ball is red”, POS tagger sẽ cho output là “The/AT ball/NN is/VB red/JJ”. POS taggers hiện thời có thể đạt tới độ chính xác 96% (Stanford POS Tagger đạt tới độ chính các trên 97% trong thử nghiệm ở đây). Gán nhãn từ rất hữu dụng cho những bài toán NLP phức tạp hơn như là parsing và machine translation.

Computational Morphology (hình thái học): Ngôn ngữ tự nhiên bao gồm rất nhiều word được xây dựng từ morphemes hay stems (thân từ) - đơn vị ngôn ngữ nhỏ nhất mà còn mang ý nghĩa. Computational Morphology sử dụng máy tính để phát hiện và phân tích cấu trúc bên trong của các words.

Parsing: trong bài toán phân tích câu, một parser sẽ tạo ra một parse tree với đầu vào là một sent. Một vài parsers giả sử có sự tồn tại của một tập luật cú pháp (grammar rules) để dựa trên đó mà phân tích. Các parsers gần đây đã đủ thông minh để suy diễn (deduce) parse trees trực tiếp từ các dữ liệu sử dụng các mô hình thống kê phức tạp. Hầu hết parses đều hoạt động có giám sát (operate in supervised setting) và yêu cầu sent phải được gán nhãn từ loại trước khi phân tích ngữ pháp (parse).

Machine Translation (MT): mục đích của MT là dùng máy tính để dịch một đoạn text từ một ngôn ngữ tự nhiên này sang một ngôn ngữ tự nhiên khác mà không cần sự tham gia của con người. Đây là một trong những bài toán khó nhất của NLP và đã được giải theo nhiều hướng khác nhau trong nhiều năm liền. Hầu hết các cách tiếp cận đều sử dụng POS tagging và parsing như là các bước cơ bản.

Viết tắt:
doc => document
sent => sentence
pos => part-of-speech (từ loại như là danh từ, động từ, tính từ .. )