cld2 – Google’s Compact Language Detector 2 – standalone command line on CentOS

The cld2 project does not seem to document how one would actually go about using it (or at least that is how it looks to me). Its language detection is one of the better ones around, so I decided to make use of it.

I came across a blog post on how to install cld2 on Ubuntu, but it stops short of using cld2 directly from the command line; it only covers building a Python binding.

Luckily, I also came across another blog with a Slackware script that builds a command line tool, which is exactly what I was looking for, except that I had CentOS, not Slackware.

So, after a little digging through the various compile scripts on cld2’s SVN trunk, I got a rough idea of how to combine the approaches from these two blogs and gave it a try. I succeeded! Here’s what I did:

  1. Install g++, which is required to build cld2 on your CentOS machine.
    $ /usr/bin/sudo /usr/bin/yum install gcc-c++
    ...
    $ which g++
    /usr/bin/g++
    
  2. Check out the cld2 source through SVN onto your local CentOS machine. In my case I used the /tmp folder.
    $ pwd
    /tmp
    $ svn checkout http://cld2.googlecode.com/svn/trunk/ cld2
  3. Next, make a copy of one of the existing compile scripts, specifically compile_libs.sh, so you can make a few changes to it. This step is already described in how to install cld2 on Ubuntu. I am on 32-bit, so I follow the same step and remove the -m64 flag.
    $ pwd
    /tmp/cld2/internal
    $ sed 's/ -m64 / /g' compile_libs.sh > compile_libs_32bit.sh
    
  4. To make a standalone cld2 executable, I again followed the steps from the Slackware script example and made the following changes to my copied compile script. Here’s a diff of the changes from compile_libs.sh to my custom compile_libs_32bit.sh script:
    https://gist.github.com/visitsb/8affec514ef5829c6bd0/revisions
  5. That’s it! Now compile_libs_32bit.sh is ready to build a standalone cld2 executable on your machine. All that is left is to run your custom compile_libs_32bit.sh script:
    $ chmod u+x compile_libs_32bit.sh
    $ ./compile_libs_32bit.sh
    
  6. It takes a few minutes to build, and voilà, you have a standalone cld2 executable built and installed on your machine.
    $ which cld2
    /usr/local/bin/cld2
    $ echo "Hello World こんにちは γει? σου" | cld2
    ExtLanguage Japanese(35% 3904p), GREEK(33% 1024p), ENGLISH(27% 1194p), 45/43 bytes of non-tag letters, Summary: Japanese*
      SummaryLanguage Japanese(un-reliable) at 8391021 of 43 562us (0 MB/sec), (null)
    
  7. For the record, here is what gets installed:
    $ which cld2
    /usr/local/bin/cld2
    $ ls -l /usr/include/cld2/*
    /usr/include/cld2/internal:
    total 52
    -rw-r--r--. 1 root root 28159 Jun 20 17:49 generated_language.h
    -rw-r--r--. 1 root root  5839 Jun 20 17:49 generated_ulscript.h
    -rw-r--r--. 1 root root   945 Jun 20 17:49 integral_types.h
    -rw-r--r--. 1 root root  8326 Jun 20 17:49 lang_script.h
    
    /usr/include/cld2/public:
    total 24
    -rw-r--r--. 1 root root 14850 Jun 20 17:49 compact_lang_det.h
    -rw-r--r--. 1 root root  7056 Jun 20 17:49 encodings.h
    $ 
    $ ls -l /usr/lib/libcld2*
    -rwxr-xr-x. 1 root root 6457627 Jun 20 17:49 /usr/lib/libcld2_full.so
    -rwxr-xr-x. 1 root root 1742462 Jun 20 17:49 /usr/lib/libcld2.so
    $ 
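
Both the -m64 edit from step 3 and the library directory depend on whether the machine is 32-bit or 64-bit. Here is a minimal sketch of how that choice could be scripted. The function names and the sample compile line are hypothetical (they are not taken from compile_libs.sh), and the /usr/lib64 mapping for x86_64 is an assumption based on the comment at the end of this post.

```shell
#!/bin/sh
# Hypothetical helper: drop the -m64 flag from compile commands read on stdin.
# Replacing it with a single space keeps the neighbouring flags separated.
strip_m64() {
    sed 's/ -m64 / /g'
}

# Hypothetical helper: pick the library directory for a given architecture.
# Assumption: x86_64 uses /usr/lib64, everything else uses /usr/lib.
lib_dir_for() {
    case "$1" in
        x86_64) echo /usr/lib64 ;;
        *)      echo /usr/lib ;;
    esac
}

# Sample line for illustration only; the flags in the real script may differ.
sample='g++ -O2 -m64 -c example.cc'
stripped=$(printf '%s\n' "$sample" | strip_m64)
echo "$stripped"

# Report where the libraries would go on the current machine.
arch=$(uname -m)
echo "libraries for $arch would go in $(lib_dir_for "$arch")"
```

On a 32-bit machine you would pipe the whole compile script through strip_m64, while on x86_64 you would presumably keep -m64 and only change the install paths.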
    
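The output in step 6 is fairly verbose, so when using cld2 from other scripts it may help to pull out just the summary language. The sketch below assumes the output keeps the "Summary: <language>" shape shown above, which may not hold for other builds. The summary_language function is a hypothetical helper, and the sample line is copied from step 6 so that the snippet runs without cld2 installed.

```shell
#!/bin/sh
# Hypothetical helper: print the language named after "Summary: " in cld2 output.
# Lines without "Summary: " (such as the SummaryLanguage line) are skipped.
summary_language() {
    sed -n 's/.*Summary: \([A-Za-z_]*\).*/\1/p' | head -n 1
}

# Sample output line copied from step 6 of this post.
sample='ExtLanguage Japanese(35% 3904p), GREEK(33% 1024p), ENGLISH(27% 1194p), 45/43 bytes of non-tag letters, Summary: Japanese*'
detected=$(printf '%s\n' "$sample" | summary_language)
echo "$detected"
```

With cld2 installed, the same helper could be used as `echo "some text" | cld2 | summary_language`.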

Hope this helps someone, and kudos to cld2 for being awesome!

Comments

One response to “cld2 – Google’s Compact Language Detector 2 – standalone command line on CentOS”

  1. CentOS x64 requires ‘/usr/lib64/’ instead of ‘/usr/lib/’
