Merge tag 'v2.9.5' of https://github.com/github/linguist

Linguist 2.9.5

Commit 76c076f8 authored 11 years ago by Dmitriy Zaporozhets
--- a/Gemfile
+++ b/Gemfile
-source :rubygems
+source 'https://rubygems.org'
 gemspec
--- a/LICENSE
+++ b/LICENSE
-Copyright (c) 2011 GitHub, Inc.
+Copyright (c) 2011-2013 GitHub, Inc.
  
 Permission is hereby granted, free of charge, to any person
 obtaining a copy of this software and associated documentation

--- a/README.md
+++ b/README.md
@@ -8,44 +8,38 @@ We use this library at GitHub to detect blob languages, highlight code, ignore b
  
 Linguist defines the list of all languages known to GitHub in a [yaml file](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml). In order for a file to be highlighted, a language and lexer must be defined there.
  
-Most languages are detected by their file extension. This is the fastest and most common situation. For script files, which are usually extensionless, we do "deep content inspection"™ and check the shebang of the file. Checking the file's contents may also be used for disambiguating languages. C, C++ and Obj-C all use `.h` files. Looking for common keywords, we are usually able to guess the correct language.
+Most languages are detected by their file extension. This is the fastest and most common situation.
+
+For disambiguating between files with common extensions, we use a [Bayesian classifier](https://github.com/github/linguist/blob/master/lib/linguist/classifier.rb). For example, this helps us tell the difference between `.h` files, which could be C, C++, or Obj-C.
  
 In the actual GitHub app we deal with `Grit::Blob` objects. For testing, there is a simple `FileBlob` API.
  
-    Linguist::FileBlob.new("lib/linguist.rb").language.name #=> "Ruby"
+```ruby
+
+Linguist::FileBlob.new("lib/linguist.rb").language.name #=> "Ruby"
  
-    Linguist::FileBlob.new("bin/linguist").language.name #=> "Ruby"
+Linguist::FileBlob.new("bin/linguist").language.name #=> "Ruby"
+```
  
 See [lib/linguist/language.rb](https://github.com/github/linguist/blob/master/lib/linguist/language.rb) and [lib/linguist/languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml).
  
 ### Syntax Highlighting
  
-The actual syntax highlighting is handled by our Pygments wrapper, [Albino](https://github.com/github/albino). Linguist provides a [Lexer abstraction](https://github.com/github/linguist/blob/master/lib/linguist/lexer.rb) that determines which highlighter should be used on a file.
-
-We typically run on a prerelease version of Pygments to get early access to new lexers. The [lexers.yml](https://github.com/github/linguist/blob/master/lib/linguist/lexers.yml) file is a dump of the lexers we have available on our server. If there is a new lexer in pygments-main not on the list, [open an issue](https://github.com/github/linguist/issues) and we'll try to upgrade it soon.
-
-### MIME type detection
-
-Most of the MIME types handling is done by the Ruby [mime-types gem](https://github.com/halostatue/mime-types/blob/master/lib/mime/types.rb.data). But we have our own list of additions and overrides. To add or modify this list, see [lib/linguist/mimes.yml](https://github.com/github/linguist/blob/master/lib/linguist/mimes.yml).
-
-MIME types are used to set the Content-Type of raw binary blobs which are served from a special `raw.github.com` domain. However, all text blobs are served as `text/plain` regardless of their type to ensure they open in the browser rather than downloading.
-
-The MIME type also determines whether a blob is binary or plain text. So if you're seeing a blob that says "View Raw" and it is actually plain text, the mime type and encoding probably needs to be explicitly stated.
+The actual syntax highlighting is handled by our Pygments wrapper, [pygments.rb](https://github.com/tmm1/pygments.rb). It also provides a [Lexer abstraction](https://github.com/tmm1/pygments.rb/blob/master/lib/pygments/lexer.rb) that determines which highlighter should be used on a file.
  
-    Linguist::FileBlob.new("linguist.zip").binary? #=> true
-
-See [lib/linguist/mimes.yml](https://github.com/github/linguist/blob/master/lib/linguist/mimes.yml).
+We typically run on a pre-release version of Pygments, [pygments.rb](https://github.com/tmm1/pygments.rb), to get early access to new lexers. The [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) file is a dump of the lexers we have available on our server.
  
 ### Stats
  
-The [Language Graph](https://github.com/github/linguist/graphs/languages) is built by aggregating the languages of all repo's blobs. The top language in the graph determines the project's primary language. Collectively, these stats make up the [Top Languages](https://github.com/languages) page.
+The Language Graph you see on every repository is built by aggregating the languages of all repo's blobs. The top language in the graph determines the project's primary language. Collectively, these stats make up the [Top Languages](https://github.com/languages) page.
  
 The repository stats API can be used on a directory:
  
-    project = Linguist::Repository.from_directory(".")
-    project.language.name  #=> "Ruby"
-    project.languages      #=> { "Ruby" => 0.98,
-                                 "Shell" => 0.02 }
+```ruby
+project = Linguist::Repository.from_directory(".")
+project.language.name  #=> "Ruby"
+project.languages      #=> { "Ruby" => 0.98, "Shell" => 0.02 }
+```
  
 These stats are also printed out by the binary. Try running `linguist` on itself:
  
@@ -56,21 +50,27 @@ These stats are also printed out by the binary. Try running `linguist` on itself
  
 Checking other code into your git repo is a common practice. But this often inflates your project's language stats and may even cause your project to be labeled as another language. We are able to identify some of these files and directories and exclude them.
  
-    Linguist::FileBlob.new("vendor/plugins/foo.rb").vendored? # => true
+```ruby
+Linguist::FileBlob.new("vendor/plugins/foo.rb").vendored? # => true
+```
  
 See [Linguist::BlobHelper#vendored?](https://github.com/github/linguist/blob/master/lib/linguist/blob_helper.rb) and [lib/linguist/vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml).
  
 #### Generated file detection
  
-Not all plain text files are true source files. Generated files like minified js and compiled CoffeeScript can be detected and excluded from language stats. As an extra bonus, these files are suppressed in Diffs.
+Not all plain text files are true source files. Generated files like minified js and compiled CoffeeScript can be detected and excluded from language stats. As an extra bonus, these files are suppressed in diffs.
  
-    Linguist::FileBlob.new("underscore.min.js").generated? # => true
+```ruby
+Linguist::FileBlob.new("underscore.min.js").generated? # => true
+```
  
-See [Linguist::BlobHelper#generated?](https://github.com/github/linguist/blob/master/lib/linguist/blob_helper.rb).
+See [Linguist::Generated#generated?](https://github.com/github/linguist/blob/master/lib/linguist/generated.rb).
  
 ## Installation
  
-To get it, clone the repo and run [Bundler](http://gembundler.com/) to install its dependencies.
+github.com is usually running the latest version of the `github-linguist` gem that is released on [RubyGems.org](http://rubygems.org/gems/github-linguist).
+
+But for development you are going to want to check out the source. To get it, clone the repo and run [Bundler](http://gembundler.com/) to install its dependencies.
  
    git clone https://github.com/github/linguist.git
    cd linguist/
@@ -80,17 +80,16 @@ To run the tests:
  
    bundle exec rake test
  
-*Since this code is specific to GitHub, is not published as a official rubygem.*
+## Contributing
  
-If you are seeing errors like `StandardError: could not find any magic files!`, it means the CharlockHolmes gem didn’t install correctly. See the [installing section](https://github.com/brianmario/charlock_holmes/blob/master/README.md) of the CharlockHolmes README for more information.
+The majority of patches won't need to touch any Ruby code at all. The [master language list](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml) is just a configuration file.
  
-## Contributing
+We try to only add languages once they have some usage on GitHub, so please note in-the-wild usage examples in your pull request.
+
+Almost all bug fixes or new language additions should come with some additional code samples. Just drop them under [`samples/`](https://github.com/github/linguist/tree/master/samples) in the correct subdirectory and our test suite will automatically test them. In most cases you shouldn't need to add any new assertions.
+
+### Testing
+
+Sometimes getting the tests running can be too much work, especially if you don't have much Ruby experience. It's okay, be lazy and let our build bot [Travis](http://travis-ci.org/#!/github/linguist) run the tests for you. Just open a pull request and the bot will start cranking away.
  
-1. Fork it.
-2. Create a branch (`git checkout -b detect-foo-language`)
-3. Make your changes
-4. Run the tests (`bundle install` then `bundle exec rake`)
-5. Commit your changes (`git commit -am "Added detection for the new Foo language"`)
-6. Push to the branch (`git push origin detect-foo-language`)
-7. Create a [Pull Request](http://help.github.com/pull-requests/) from your branch.
-8. Promote it. Get others to drop in and +1 it.
+Here's our current build status, which is hopefully green: [![Build Status](https://secure.travis-ci.org/github/linguist.png?branch=master)](http://travis-ci.org/github/linguist)
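
The Stats section of the README diff above says the language graph is built by aggregating the languages of a repo's blobs, and that the top language becomes the primary language. A minimal sketch of that aggregation follows. It assumes each blob is reduced to a `[language, size_in_bytes]` pair and that shares are weighted by byte size; the README does not state the weighting, and `language_breakdown` and `primary_language` are hypothetical helpers, not part of Linguist.

```ruby
# Hypothetical helper, not part of Linguist: aggregate per-blob languages
# into a proportional breakdown, weighting each blob by its byte size
# (the weighting is an assumption).
def language_breakdown(blobs)
  totals = Hash.new(0)
  blobs.each do |language, size|
    next if language.nil? # skip blobs with no detected language
    totals[language] += size
  end

  grand_total = totals.values.inject(0) { |sum, n| sum + n }
  return {} if grand_total.zero?

  breakdown = {}
  totals.each { |language, size| breakdown[language] = size.to_f / grand_total }
  breakdown
end

# The primary language is the one with the largest share (nil if none).
def primary_language(breakdown)
  pair = breakdown.max_by { |_language, share| share }
  pair && pair.first
end

blobs = [['Ruby', 900], ['Shell', 100], [nil, 500], ['Ruby', 1000]]
breakdown = language_breakdown(blobs)
puts breakdown.inspect
puts primary_language(breakdown)
```

Blobs with no detected language are skipped here so they do not dilute the shares; whether Linguist does the same is not shown in this diff.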
--- a/Rakefile
+++ b/Rakefile
+require 'rake/clean'
 require 'rake/testtask'
  
 task :default => :test
  
-Rake::TestTask.new do |t|
-  t.warning = true
+Rake::TestTask.new
+
+task :samples do
+  require 'linguist/samples'
+  require 'yajl'
+  data = Linguist::Samples.data
+  json = Yajl::Encoder.encode(data, :pretty => true)
+  File.open('lib/linguist/samples.json', 'w') { |io| io.write json }
+end
+
+namespace :classifier do
+  LIMIT = 1_000
+
+  desc "Run classifier against #{LIMIT} public gists"
+  task :test do
+    require 'linguist/classifier'
+    require 'linguist/samples'
+
+    total, correct, incorrect = 0, 0, 0
+    $stdout.sync = true
+
+    each_public_gist do |gist_url, file_url, file_language|
+      next if file_language.nil? || file_language == 'Text'
+      begin
+        data = open(file_url).read
+        guessed_language, score = Linguist::Classifier.classify(Linguist::Samples::DATA, data).first
+
+        total += 1
+        guessed_language == file_language ? correct += 1 : incorrect += 1
+
+        print "\r\e[0K%d:%d  %g%%" % [correct, incorrect, (correct.to_f/total.to_f)*100]
+        $stdout.flush
+      rescue URI::InvalidURIError
+      else
+        break if total >= LIMIT
+      end
+    end
+    puts ""
+  end
+
+  def each_public_gist
+    require 'open-uri'
+    require 'json'
+
+    url = "https://api.github.com/gists/public"
+
+    loop do
+      resp = open(url)
+      url = resp.meta['link'][/<([^>]+)>; rel="next"/, 1]
+      gists = JSON.parse(resp.read)
+
+      for gist in gists
+        for filename, attrs in gist['files']
+          yield gist['url'], attrs['raw_url'], attrs['language']
+        end
+      end
+    end
+  end
 end
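
The `each_public_gist` helper in the Rakefile diff above pages through the API by pulling the `rel="next"` URL out of the `Link` response header with a regex. The sketch below isolates that step. `next_page_url` is a hypothetical name, the sample header is made up for illustration, and the `nil` guard is an addition: as written in the diff, the loop would raise if the header were absent.

```ruby
# Extract the URL tagged rel="next" from an HTTP Link header value.
# Returns nil when the header is missing or has no "next" relation.
def next_page_url(link_header)
  return nil if link_header.nil?
  # String#[] with a regex and a capture index returns that capture group.
  link_header[/<([^>]+)>; rel="next"/, 1]
end

# Illustrative header value only; not captured from a real response.
header = '<https://api.github.com/gists/public?page=2>; rel="next", ' \
         '<https://api.github.com/gists/public?page=100>; rel="last"'

puts next_page_url(header)
```

A caller looping over pages could stop once this returns `nil` instead of passing `nil` to the next request.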
--- a/bin/linguist
+++ b/bin/linguist
 #!/usr/bin/env ruby
  
+# linguist — detect language type for a file, or, given a directory, determine language breakdown
+#
+# usage: linguist <path>
+
 require 'linguist/file_blob'
 require 'linguist/repository'
  
@@ -23,12 +27,11 @@ elsif File.file?(path)
  
  puts "#{blob.name}: #{blob.loc} lines (#{blob.sloc} sloc)"
  puts "  type:      #{type}"
-  puts "  extension: #{blob.pathname.extname}"
  puts "  mime type: #{blob.mime_type}"
  puts "  language:  #{blob.language}"
  
  if blob.large?
-    puts "  blob is to large to be shown"
+    puts "  blob is too large to be shown"
  end
  
  if blob.generated?

--- a/linguist.gemspec
+++ b/linguist.gemspec
 Gem::Specification.new do |s|
-  s.name    = 'linguist'
-  s.version = '1.0.0'
+  s.name    = 'github-linguist'
+  s.version = '2.9.5'
  s.summary = "GitHub Language detection"
  
-  s.authors = "GitHub"
+  s.authors  = "GitHub"
+  s.homepage = "https://github.com/github/linguist"
  
  s.files = Dir['lib/**/*']
  s.executables << 'linguist'
  
  s.add_dependency 'charlock_holmes', '~> 0.6.6'
-  s.add_dependency 'escape_utils',    '~> 0.2.3'
-  s.add_dependency 'mime-types',      '~> 1.18'
-  s.add_dependency 'pygments.rb',     '~> 0.2.11'
+  s.add_dependency 'escape_utils',    '~> 0.3.1'
+  s.add_dependency 'mime-types',      '~> 1.19'
+  s.add_dependency 'pygments.rb',     '~> 0.5.2'
+  s.add_development_dependency 'mocha'
+  s.add_development_dependency 'json'
  s.add_development_dependency 'rake'
+  s.add_development_dependency 'yajl-ruby'
 end
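
The gemspec diff above bumps several `~>` (pessimistic) version constraints. As a quick illustration of what such a constraint admits, the sketch below checks a few version strings against `'~> 0.5.2'` using RubyGems' own `Gem::Requirement`. The version numbers tested are arbitrary examples, not a list of real pygments.rb releases.

```ruby
require 'rubygems'

# '~> 0.5.2' allows >= 0.5.2 and < 0.6.0: only the last segment may grow.
requirement = Gem::Requirement.new('~> 0.5.2')

%w[0.5.1 0.5.2 0.5.9 0.6.0].each do |version|
  ok = requirement.satisfied_by?(Gem::Version.new(version))
  puts "#{version}: #{ok}"
end
```

By the same rule, a two-segment constraint such as `'~> 1.19'` accepts any 1.x release from 1.19 upward but not 2.0.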
--- a/lib/linguist.rb
+++ b/lib/linguist.rb
 require 'linguist/blob_helper'
+require 'linguist/generated'
 require 'linguist/language'
-require 'linguist/mime'
-require 'linguist/pathname'
 require 'linguist/repository'
+require 'linguist/samples'
--- a/lib/linguist/blob_helper.rb
+++ b/lib/linguist/blob_helper.rb
+require 'linguist/generated'
 require 'linguist/language'
-require 'linguist/mime'
-require 'linguist/pathname'
  
 require 'charlock_holmes'
 require 'escape_utils'
+require 'mime/types'
 require 'pygments'
 require 'yaml'
  
 module Linguist
+  # DEPRECATED Avoid mixing into Blob classes. Prefer functional interfaces
+  # like `Language.detect` over `Blob#language`. Functions are much easier to
+  # cache and compose.
+  #
+  # Avoid adding additional bloat to this module.
+  #
  # BlobHelper is a mixin for Blobish classes that respond to "name",
  # "data" and "size" such as Grit::Blob.
  module BlobHelper
-    # Internal: Get a Pathname wrapper for Blob#name
-    #
-    # Returns a Pathname.
-    def pathname
-      Pathname.new(name || "")
-    end
-
    # Public: Get the extname of the path
    #
    # Examples
@@ -27,7 +26,23 @@ module Linguist
    #
    # Returns a String
    def extname
-      pathname.extname
+      File.extname(name.to_s)
+    end
+
+    # Internal: Lookup mime type for extension.
+    #
+    # Returns a MIME::Type
+    def _mime_type
+      if defined? @_mime_type
+        @_mime_type
+      else
+        guesses = ::MIME::Types.type_for(extname.to_s)
+
+        # Prefer text mime types over binary
+        @_mime_type = guesses.detect { |type| type.ascii? } ||
+          # Otherwise use the first guess
+          guesses.first
+      end
    end
  
    # Public: Get the actual blob mime type
@@ -39,7 +54,23 @@ module Linguist
    #
    # Returns a mime type String.
    def mime_type
-      @mime_type ||= pathname.mime_type
+      _mime_type ? _mime_type.to_s : 'text/plain'
+    end
+
+    # Internal: Is the blob binary according to its mime type
+    #
+    # Return true or false
+    def binary_mime_type?
+      _mime_type ? _mime_type.binary? : false
+    end
+
+    # Internal: Is the blob binary according to its mime type,
+    # overriding it if we have better data from the languages.yml
+    # database.
+    #
+    # Return true or false
+    def likely_binary?
+      binary_mime_type? && !Language.find_by_filename(name)
    end
  
    # Public: Get the Content-Type header value
@@ -71,7 +102,7 @@ module Linguist
      elsif name.nil?
        "attachment"
      else
-        "attachment; filename=#{EscapeUtils.escape_url(pathname.basename)}"
+        "attachment; filename=#{EscapeUtils.escape_url(File.basename(name))}"
      end
    end
  
@@ -90,15 +121,6 @@ module Linguist
      @detect_encoding ||= CharlockHolmes::EncodingDetector.new.detect(data) if data
    end
  
-    # Public: Is the blob binary according to its mime type
-    #
-    # Return true or false
-    def binary_mime_type?
-      if mime_type = Mime.lookup_mime_type_for(pathname.extname)
-        mime_type.binary?
-      end
-    end
-
    # Public: Is the blob binary?
    #
    # Return true or false
@@ -132,23 +154,28 @@ module Linguist
    #
    # Return true or false
    def image?
-      ['.png', '.jpg', '.jpeg', '.gif'].include?(extname)
+      ['.png', '.jpg', '.jpeg', '.gif'].include?(extname.downcase)
+    end
+
+    # Public: Is the blob a supported 3D model format?
+    #
+    # Return true or false
+    def solid?
+      extname.downcase == '.stl'
    end
  
-    # Public: Is the blob a possible drupal php file?
+    # Public: Is this blob a CSV file?
    #
    # Return true or false
-    def drupal_extname?
-      ['.module', '.install', '.test', '.inc'].include?(extname)
+    def csv?
+      text? && extname.downcase == '.csv'
    end
  
-    # Public: Is the blob likely to have a shebang?
+    # Public: Is the blob a PDF?
    #
    # Return true or false
-    def shebang_extname?
-      extname.empty? &&
-        mode &&
-        (mode.to_i(8) & 05) == 05
+    def pdf?
+      extname.downcase == '.pdf'
    end
  
    MEGABYTE = 1024 * 1024
@@ -160,6 +187,28 @@ module Linguist
      size.to_i > MEGABYTE
    end
  
+    # Public: Is the blob safe to colorize?
+    #
+    # We use Pygments for syntax highlighting blobs. Pygments
+    # can be too slow for very large blobs or for certain 
+    # corner-case blobs.
+    # 
+    # Return true or false
+    def safe_to_colorize?
+      !large? && text? && !high_ratio_of_long_lines?
+    end
+
+    # Internal: Does the blob have a high ratio of long lines?
+    #
+    # These types of files are usually going to make Pygments.rb
+    # angry if we try to colorize them.
+    #
+    # Return true or false
+    def high_ratio_of_long_lines?
+      return false if loc == 0
+      size / loc > 5000
+    end
+
    # Public: Is the blob viewable?
    #
    # Non-viewable blobs will just show a "View Raw" link
@@ -190,7 +239,12 @@ module Linguist
    #
    # Returns an Array of lines
    def lines
-      @lines ||= (viewable? && data) ? data.split("\n", -1) : []
+      @lines ||=
+        if viewable? && data
+          data.split(/\r\n|\r|\n/, -1)
+        else
+          []
+        end
    end
  
    # Public: Get number of lines of code
@@ -211,153 +265,16 @@ module Linguist
      lines.grep(/\S/).size
    end
  
-    # Internal: Compute average line length.
-    #
-    # Returns Integer.
-    def average_line_length
-      if lines.any?
-        lines.inject(0) { |n, l| n += l.length } / lines.length
-      else
-        0
-      end
-    end
-
    # Public: Is the blob a generated file?
    #
-    # Generated source code is supressed in diffs and is ignored by
+    # Generated source code is suppressed in diffs and is ignored by
    # language statistics.
    #
-    # Requires Blob#data
-    #
-    # Includes:
-    # - XCode project XML files
-    # - Minified JavaScript
-    #
-    # Please add additional test coverage to
-    # `test/test_blob.rb#test_generated` if you make any changes.
+    # May load Blob#data
    #
    # Return true or false
    def generated?
-      if xcode_project_file? || generated_net_docfile?
-        true
-      elsif generated_coffeescript? || minified_javascript?
-        true
-      elsif name == 'Gemfile.lock'
-        true
-      else
-        false
-      end
-    end
-
-    # Internal: Is the blob an XCode project file?
-    #
-    # Generated if the file extension is an XCode project
-    # file extension.
-    #
-    # Returns true of false.
-    def xcode_project_file?
-      ['.xib', '.nib', '.pbxproj', '.xcworkspacedata', '.xcuserstate'].include?(extname)
-    end
-
-    # Internal: Is the blob minified JS?
-    #
-    # Consider JS minified if the average line length is
-    # greater then 100c.
-    #
-    # Returns true or false.
-    def minified_javascript?
-      return unless extname == '.js'
-      average_line_length > 100
-    end
-
-    # Internal: Is the blob JS generated by CoffeeScript?
-    #
-    # Requires Blob#data
-    #
-    # CoffeScript is meant to output JS that would be difficult to
-    # tell if it was generated or not. Look for a number of patterns
-    # outputed by the CS compiler.
-    #
-    # Return true or false
-    def generated_coffeescript?
-      return unless extname == '.js'
-
-      # CoffeeScript generated by > 1.2 include a comment on the first line
-      if lines[0] =~ /^\/\/ Generated by /
-        return true
-      end
-
-      if lines[0] == '(function() {' &&     # First line is module closure opening
-          lines[-2] == '}).call(this);' &&  # Second to last line closes module closure
-          lines[-1] == ''                   # Last line is blank
-
-        score = 0
-
-        lines.each do |line|
-          if line =~ /var /
-            # Underscored temp vars are likely to be Coffee
-            score += 1 * line.gsub(/(_fn|_i|_len|_ref|_results)/).count
-
-            # bind and extend functions are very Coffee specific
-            score += 3 * line.gsub(/(__bind|__extends|__hasProp|__indexOf|__slice)/).count
-          end
-        end
-
-        # Require a score of 3. This is fairly arbitrary. Consider
-        # tweaking later.
-        score >= 3
-      else
-        false
-      end
-    end
-
-    # Internal: Is this a generated documentation file for a .NET assembly?
-    #
-    # Requires Blob#data
-    #
-    # .NET developers often check in the XML Intellisense file along with an
-    # assembly - however, these don't have a special extension, so we have to
-    # dig into the contents to determine if it's a docfile. Luckily, these files
-    # are extremely structured, so recognizing them is easy.
-    #
-    # Returns true or false
-    def generated_net_docfile?
-      return false unless extname.downcase == ".xml"
-      return false unless lines.count > 3
-
-      # .NET Docfiles always open with <doc> and their first tag is an
-      # <assembly> tag
-      return lines[1].include?("<doc>") &&
-        lines[2].include?("<assembly>") &&
-        lines[-2].include?("</doc>")
-    end
-
-    # Public: Should the blob be indexed for searching?
-    #
-    # Excluded:
-    # - Files over 0.1MB
-    # - Non-text files
-    # - Langauges marked as not searchable
-    # - Generated source files
-    #
-    # Please add additional test coverage to
-    # `test/test_blob.rb#test_indexable` if you make any changes.
-    #
-    # Return true or false
-    def indexable?
-      if binary?
-        false
-      elsif language.nil?
-        false
-      elsif !language.searchable?
-        false
-      elsif generated?
-        false
-      elsif size > 100 * 1024
-        false
-      else
-        true
-      end
+      @_generated ||= Generated.generated?(name, lambda { data })
    end
  
    # Public: Detects the Language of the blob.
@@ -366,33 +283,15 @@ module Linguist
    #
    # Returns a Language or nil if none is detected
    def language
-      if defined? @language
-        @language
+      return @language if defined? @language
+
+      if defined?(@data) && @data.is_a?(String)
+        data = @data
      else
-        @language = guess_language
+        data = lambda { (binary_mime_type? || binary?) ? "" : self.data }
      end
-    end
-
-    # Internal: Guess language
-    #
-    # Please add additional test coverage to
-    # `test/test_blob.rb#test_language` if you make any changes.
-    #
-    # Returns a Language or nil
-    def guess_language
-      return if binary_mime_type?
-
-      # Disambiguate between multiple language extensions
-      disambiguate_extension_language ||
-
-        # See if there is a Language for the extension
-        pathname.language ||
-
-        # Look for idioms in first line
-        first_line_language ||
  
-        # Try to detect Language from shebang line
-        shebang_language
+      @language = Language.detect(name.to_s, data, mode)
    end
  
    # Internal: Get the lexer of the blob.
@@ -402,269 +301,16 @@ module Linguist
      language ? language.lexer : Pygments::Lexer.find_by_name('Text only')
    end
  
-    # Internal: Disambiguates between multiple language extensions.
-    #
-    # Delegates to "guess_EXTENSION_language".
-    #
-    # Please add additional test coverage to
-    # `test/test_blob.rb#test_language` if you add another method.
-    #
-    # Returns a Language or nil.
-    def disambiguate_extension_language
-      if Language.ambiguous?(extname)
-        name = "guess_#{extname.sub(/^\./, '')}_language"
-        send(name) if respond_to?(name)
-      end
-    end
-
-    # Internal: Guess language of .cls files
-    #
-    # Returns a Language.
-    def guess_cls_language
-      if lines.grep(/^(%|\\)/).any?
-        Language['TeX']
-      elsif lines.grep(/^\s*(CLASS|METHOD|INTERFACE).*:\s*/i).any? || lines.grep(/^\s*(USING|DEFINE)/i).any?
-        Language['OpenEdge ABL']
-      elsif lines.grep(/\{$/).any? || lines.grep(/\}$/).any?
-        Language['Apex']
-      elsif lines.grep(/^(\'\*|Attribute|Option|Sub|Private|Protected|Public|Friend)/i).any?
-        Language['Visual Basic']
-      else
-        # The most common language should be the fallback
-        Language['TeX']
-      end
-    end
-
-    # Internal: Guess language of header files (.h).
-    #
-    # Returns a Language.
-    def guess_h_language
-      if lines.grep(/^@(interface|property|private|public|end)/).any?
-        Language['Objective-C']
-      elsif lines.grep(/^class |^\s+(public|protected|private):/).any?
-        Language['C++']
-      else
-        Language['C']
-      end
-    end
-
-    # Internal: Guess language of .m files.
-    #
-    # Objective-C heuristics:
-    # * Keywords
-    #
-    # Matlab heuristics:
-    # * Leading function keyword
-    # * "%" comments
-    #
-    # Returns a Language.
-    def guess_m_language
-      # Objective-C keywords
-      if lines.grep(/^#import|@(interface|implementation|property|synthesize|end)/).any?
-        Language['Objective-C']
-
-      # File function
-      elsif lines.first.to_s =~ /^function /
-        Language['Matlab']
-
-      # Matlab comment
-      elsif lines.grep(/^%/).any?
-        Language['Matlab']
-
-      # Fallback to Objective-C, don't want any Matlab false positives
-      else
-        Language['Objective-C']
-      end
-    end
-
-    # Internal: Guess language of .pl files
-    #
-    # The rules for disambiguation are:
-    #
-    # 1. Many perl files begin with a shebang
-    # 2. Most Prolog source files have a rule somewhere (marked by the :- operator)
-    # 3. Default to Perl, because it is more popular
-    #
-    # Returns a Language.
-    def guess_pl_language
-      if shebang_script == 'perl'
-        Language['Perl']
-      elsif lines.grep(/:-/).any?
-        Language['Prolog']
-      else
-        Language['Perl']
-      end
-    end
-
-    # Internal: Guess language of .r files.
-    #
-    # Returns a Language.
-    def guess_r_language
-      if lines.grep(/(rebol|(:\s+func|make\s+object!|^\s*context)\s*\[)/i).any?
-        Language['Rebol']
-      else
-        Language['R']
-      end
-    end
-
-    # Internal: Guess language of .t files.
-    #
-    # Returns a Language.
-    def guess_t_language
-      score = 0
-      score += 1 if lines.grep(/^% /).any?
-      score += data.gsub(/ := /).count
-      score += data.gsub(/proc |procedure |fcn |function /).count
-      score += data.gsub(/var \w+: \w+/).count
-
-      # Tell-tale signs its gotta be Perl
-      if lines.grep(/^(my )?(sub |\$|@|%)\w+/).any?
-        score = 0
-      end
-
-      if score >= 3
-        Language['Turing']
-      else
-        Language['Perl']
-      end
-    end
-
-    # Internal: Guess language of .v files.
-    #
-    # Returns a Language
-    def guess_v_language
-      if lines.grep(/^(\/\*|\/\/|module|parameter|input|output|wire|reg|always|initial|begin|\`)/).any?
-        Language['Verilog']
-      else
-        Language['Coq']
-      end
-    end
-
-    # Internal: Guess language of .gsp files.
-    #
-    # Returns a Language.
-    def guess_gsp_language
-      if lines.grep(/<%|<%@|\$\{|<%|<g:|<meta name="layout"|<r:/).any?
-        Language['Groovy Server Pages']
-      else
-        Language['Gosu']
-      end
-    end
-
-    # Internal: Guess language from the first line.
-    #
-    # Look for leading "<?php" in Drupal files
-    #
-    # Returns a Language.
-    def first_line_language
-      # Only check files with drupal php extensions
-      return unless drupal_extname?
-
-      # Fail fast if blob isn't viewable?
-      return unless viewable?
-
-      if lines.first.to_s =~ /^<\?php/
-        Language['PHP']
-      end
-    end
-
-    # Internal: Extract the script name from the shebang line
-    #
-    # Requires Blob#data
-    #
-    # Examples
-    #
-    #   '#!/usr/bin/ruby'
-    #   # => 'ruby'
-    #
-    #   '#!/usr/bin/env ruby'
-    #   # => 'ruby'
-    #
-    #   '#!/usr/bash/python2.4'
-    #   # => 'python'
-    #
-    # Please add additional test coverage to
-    # `test/test_blob.rb#test_shebang_script` if you make any changes.
-    #
-    # Returns a script name String or nil
-    def shebang_script
-      # Fail fast if blob isn't viewable?
-      return unless viewable?
-
-      if lines.any? && (match = lines[0].match(/(.+)\n?/)) && (bang = match[0]) =~ /^#!/
-        bang.sub!(/^#! /, '#!')
-        tokens = bang.split(' ')
-        pieces = tokens.first.split('/')
-        if pieces.size > 1
-          script = pieces.last
-        else
-          script = pieces.first.sub('#!', '')
-        end
-
-        script = script == 'env' ? tokens[1] : script
-
-        # python2.4 => python
-        if script =~ /((?:\d+\.?)+)/
-          script.sub! $1, ''
-        end
-
-        # Check for multiline shebang hacks that exec themselves
-        #
-        #   #!/bin/sh
-        #   exec foo "$0" "$@"
-        #
-        if script == 'sh' &&
-            lines[0...5].any? { |l| l.match(/exec (\w+).+\$0.+\$@/) }
-          script = $1
-        end
-
-        script
-      end
-    end
-
-    # Internal: Get Language for shebang script
-    #
-    # Returns the Language or nil
-    def shebang_language
-      # Skip file extensions unlikely to have shebangs
-      return unless shebang_extname?
-
-      if script = shebang_script
-        Language[script]
-      end
-    end
-
    # Public: Highlight syntax of blob
    #
    # options - A Hash of options (defaults to {})
    #
    # Returns html String
    def colorize(options = {})
-      return if !text? || large? || generated?
+      return unless safe_to_colorize?
      options[:options] ||= {}
      options[:options][:encoding] ||= encoding
      lexer.highlight(data, options)
    end
-
-    # Public: Highlight syntax of blob without the outer highlight div
-    # wrapper.
-    #
-    # options - A Hash of options (defaults to {})
-    #
-    # Returns html String
-    def colorize_without_wrapper(options = {})
-      if text = colorize(options)
-        text[%r{<div class="highlight"><pre>(.*?)</pre>\s*</div>}m, 1]
-      else
-        ''
-      end
-    end
-
-    Language.overridden_extensions.each do |extension|
-      name = "guess_#{extension.sub(/^\./, '')}_language".to_sym
-      unless instance_methods.map(&:to_sym).include?(name)
-        warn "Language##{name} was not defined"
-      end
-    end
  end
 end
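
Two small behaviours introduced in the `blob_helper.rb` diff above are easy to check in isolation: `lines` now splits on `\r\n`, `\r`, or `\n`, and `high_ratio_of_long_lines?` compares average bytes per line with 5000. The sketch below restates both as standalone functions. They take a plain String instead of a blob, and using `bytesize` in place of the blob's `size` is an assumption made for illustration.

```ruby
# Split on any of the three common line endings. The -1 limit keeps
# trailing empty strings, so a final newline yields a last "" element.
def split_lines(data)
  data.split(/\r\n|\r|\n/, -1)
end

# Standalone restatement of the long-line check: an average of more than
# `threshold` bytes per line suggests minified or machine-written text.
def high_ratio_of_long_lines?(data, threshold = 5000)
  loc = split_lines(data).size
  return false if loc == 0 # avoid dividing by zero for empty data
  data.bytesize / loc > threshold
end

puts split_lines("a\r\nb\rc\n").inspect
puts high_ratio_of_long_lines?("short\nlines\n")
puts high_ratio_of_long_lines?(('x' * 20_000) + "\n")
```

Because `\r\n` comes first in the alternation, a Windows line ending counts as one separator instead of two.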
--- a/lib/linguist/classifier.rb
+++ b/lib/linguist/classifier.rb
+require 'linguist/tokenizer'
+
+module Linguist
+  # Language Bayesian classifier.
+  class Classifier
+    # Public: Train classifier that data is a certain language.
+    #
+    # db       - Hash classifier database object
+    # language - String language of data
+    # data     - String contents of file
+    #
+    # Examples
+    #
+    #   Classifier.train!(db, 'Ruby', "def hello; end")
+    #
+    # Returns nothing.
+    #
+    # Set LINGUIST_DEBUG=1 or =2 to see probabilities per-token,
+    # per-language.  See also dump_all_tokens, below.
+    def self.train!(db, language, data)
+      tokens = Tokenizer.tokenize(data)
+
+      db['tokens_total'] ||= 0
+      db['languages_total'] ||= 0
+      db['tokens'] ||= {}
+      db['language_tokens'] ||= {}
+      db['languages'] ||= {}
+
+      tokens.each do |token|
+        db['tokens'][language] ||= {}
+        db['tokens'][language][token] ||= 0
+        db['tokens'][language][token] += 1
+        db['language_tokens'][language] ||= 0
+        db['language_tokens'][language] += 1
+        db['tokens_total'] += 1
+      end
+      db['languages'][language] ||= 0
+      db['languages'][language] += 1
+      db['languages_total'] += 1
+
+      nil
+    end
+
+    # Public: Guess language of data.
+    #
+    # db        - Hash of classifier tokens database.
+    # tokens    - Array of tokens or String data to analyze.
+    # languages - Array of language name Strings to restrict to.
+    #
+    # Examples
+    #
+    #   Classifier.classify(db, "def hello; end")
+    #   # => [['Ruby', 0.90], ['Python', 0.2], ... ]
+    #
+    # Returns sorted Array of result pairs. Each pair contains the
+    # String language name and a Float score.
+    def self.classify(db, tokens, languages = nil)
+      languages ||= db['languages'].keys
+      new(db).classify(tokens, languages)
+    end
+
+    # Internal: Initialize a Classifier.
+    def initialize(db = {})
+      @tokens_total    = db['tokens_total']
+      @languages_total = db['languages_total']
+      @tokens          = db['tokens']
+      @language_tokens = db['language_tokens']
+      @languages       = db['languages']
+    end
+
+    # Internal: Guess language of data
+    #
+    # tokens    - Array of tokens or String data to analyze.
+    # languages - Array of language name Strings to restrict to.
+    #
+    # Returns sorted Array of result pairs. Each pair contains the
+    # String language name and a Float score.
+    def classify(tokens, languages)
+      return [] if tokens.nil?
+      tokens = Tokenizer.tokenize(tokens) if tokens.is_a?(String)
+
+      scores = {}
+      if verbosity >= 2
+        dump_all_tokens(tokens, languages)
+      end
+      languages.each do |language|
+        scores[language] = tokens_probability(tokens, language) +
+                                   language_probability(language)
+        if verbosity >= 1
+          printf "%10s = %10.3f + %7.3f = %10.3f\n",
+            language, tokens_probability(tokens, language), language_probability(language), scores[language]
+        end
+      end
+
+      scores.sort { |a, b| b[1] <=> a[1] }.map { |score| [score[0], score[1]] }
+    end
+
+    # Internal: Log probability of a set of tokens occurring in a language - log P(D | C)
+    #
+    # tokens   - Array of String tokens.
+    # language - Language to check.
+    #
+    # Returns the Float sum of per-token log probabilities (always <= 0.0).
+    def tokens_probability(tokens, language)
+      tokens.inject(0.0) do |sum, token|
+        sum += Math.log(token_probability(token, language))
+      end
+    end
+
+    # Internal: Probability of a token occurring in a language - P(F | C)
+    #
+    # token    - String token.
+    # language - Language to check.
+    #
+    # Returns Float between 0.0 and 1.0.
+    def token_probability(token, language)
+      if @tokens[language][token].to_f == 0.0
+        1 / @tokens_total.to_f
+      else
+        @tokens[language][token].to_f / @language_tokens[language].to_f
+      end
+    end
+
+    # Internal: Log probability of a language occurring - log P(C)
+    #
+    # language - Language to check.
+    #
+    # Returns the Float log probability (always <= 0.0).
+    def language_probability(language)
+      Math.log(@languages[language].to_f / @languages_total.to_f)
+    end
+
+    private
+      def verbosity
+        @verbosity ||= (ENV['LINGUIST_DEBUG'] || 0).to_i
+      end
+
+      # Internal: show a table of probabilities for each <token,language> pair.
+      #
+      # The number in each table entry is the number of "points" that each
+      # token contributes toward the belief that the file under test is a
+      # particular language.  Points are additive.
+      #
+      # Points are the number of times a token appears in the file, times
+      # how much more likely (log of probability ratio) that token is to
+      # appear in one language vs. the least-likely language.  Dashes
+      # indicate the least-likely language (and zero points) for each token.
+      def dump_all_tokens(tokens, languages)
+        maxlen = tokens.map { |tok| tok.size }.max
+        
+        printf "%#{maxlen}s", ""
+        puts "    #" + languages.map { |lang| sprintf("%10s", lang) }.join
+        
+        tokmap = Hash.new(0)
+        tokens.each { |tok| tokmap[tok] += 1 }
+        
+        tokmap.sort.each { |tok, count|
+          arr = languages.map { |lang| [lang, token_probability(tok, lang)] }
+          min = arr.map { |a,b| b }.min
+          minlog = Math.log(min)
+          if !arr.inject(true) { |result, n| result && n[1] == arr[0][1] }
+            printf "%#{maxlen}s%5d", tok, count
+            
+            puts arr.map { |ent|
+              ent[1] == min ? "         -" : sprintf("%10.3f", count * (Math.log(ent[1]) - minlog))
+            }.join
+          end
+        }
+      end
+  end
+end
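The scoring in `classifier.rb` above can be hard to follow from the diff alone, so here is a minimal standalone sketch of the same log-space naive Bayes idea. It is not the Linguist implementation: the whitespace `tokenize` helper is a stand-in for `Linguist::Tokenizer`, and the top-level `train!`, `score`, and `classify` methods are hypothetical names used only for illustration.

```ruby
# Assumption: a plain whitespace split replaces Linguist::Tokenizer.
def tokenize(data)
  data.split(/\s+/).reject(&:empty?)
end

# Record token and language counts in a plain Hash, as train! does above.
def train!(db, language, data)
  db['tokens_total']    ||= 0
  db['languages_total'] ||= 0
  db['tokens']          ||= {}
  db['language_tokens'] ||= {}
  db['languages']       ||= {}

  tokenize(data).each do |token|
    db['tokens'][language] ||= {}
    db['tokens'][language][token] ||= 0
    db['tokens'][language][token] += 1
    db['language_tokens'][language] ||= 0
    db['language_tokens'][language] += 1
    db['tokens_total'] += 1
  end
  db['languages'][language] ||= 0
  db['languages'][language] += 1
  db['languages_total'] += 1
  nil
end

# log P(C) plus the sum of log P(token | C). Unseen tokens fall back to
# 1 / tokens_total, mirroring token_probability in the diff.
def score(db, tokens, language)
  prior = Math.log(db['languages'][language].to_f / db['languages_total'])
  tokens.inject(prior) do |sum, token|
    count = db['tokens'][language][token].to_f
    prob  = count.zero? ? 1.0 / db['tokens_total'] : count / db['language_tokens'][language]
    sum + Math.log(prob)
  end
end

# Rank candidate languages from most to least likely.
def classify(db, data, languages = db['languages'].keys)
  tokens = tokenize(data)
  languages.map { |lang| [lang, score(db, tokens, lang)] }.sort_by { |_, s| -s }
end

db = {}
train!(db, 'Ruby', 'def hello ; puts "hi" ; end')
train!(db, 'Python', 'def hello ( ) : print ( "hi" )')
puts classify(db, 'puts "hi" ; end').first[0] # prints "Ruby"
```

Working in log space keeps the product of many small probabilities from underflowing, which is why every score is zero or negative and only the ordering of the scores is meaningful.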
--- a/lib/linguist/generated.rb
+++ b/lib/linguist/generated.rb
+module Linguist
+  class Generated
+    # Public: Is the blob a generated file?
+    #
+    # name - String filename
+    # data - String blob data. A block may also be passed in for lazy
+    #        loading. This behavior is deprecated and you should always
+    #        pass in a String.
+    #
+    # Return true or false
+    def self.generated?(name, data)
+      new(name, data).generated?
+    end
+
+    # Internal: Initialize Generated instance
+    #
+    # name - String filename
+    # data - String blob data
+    def initialize(name, data)
+      @name = name
+      @extname = File.extname(name)
+      @_data = data
+    end
+
+    attr_reader :name, :extname
+
+    # Lazy load blob data if block was passed in.
+    #
+    # Awful, awful stuff happening here.
+    #
+    # Returns String data.
+    def data
+      @data ||= @_data.respond_to?(:call) ? @_data.call() : @_data
+    end
+
+    # Public: Get each line of data
+    #
+    # Returns an Array of lines
+    def lines
+      # TODO: data should be required to be a String, no nils
+      @lines ||= data ? data.split("\n", -1) : []
+    end
+
+    # Internal: Is the blob a generated file?
+    #
+    # Generated source code is suppressed in diffs and is ignored by
+    # language statistics.
+    #
+    # Please add additional test coverage to
+    # `test/test_blob.rb#test_generated` if you make any changes.
+    #
+    # Return true or false
+    def generated?
+      name == 'Gemfile.lock' ||
+        minified_files? ||
+        compiled_coffeescript? ||
+        xcode_project_file? ||
+        generated_parser? ||
+        generated_net_docfile? ||
+        generated_net_designer_file? ||
+        generated_protocol_buffer?
+    end
+
+    # Internal: Is the blob an Xcode project file?
+    #
+    # Generated if the file extension is an Xcode project
+    # file extension.
+    #
+    # Returns true or false.
+    def xcode_project_file?
+      ['.xib', '.nib', '.storyboard', '.pbxproj', '.xcworkspacedata', '.xcuserstate'].include?(extname)
+    end
+
+    # Internal: Is the blob a minified file?
+    #
+    # Consider a file minified if less than 5% of its characters are
+    # whitespace. Currently, only JS and CSS files longer than 200
+    # characters are detected by this method.
+    #
+    # Returns true or false.
+    def minified_files?
+      return unless ['.js', '.css'].include? extname
+      if data && data.length > 200
+        (data.each_char.count{ |c| c <= ' ' } / data.length.to_f) < 0.05
+      else
+        false
+      end
+    end
+
+    # Internal: Is the blob of JS generated by CoffeeScript?
+    #
+    # CoffeeScript is meant to output JS that is difficult to tell
+    # apart from hand-written code. Look for a number of patterns
+    # output by the CS compiler.
+    #
+    # Return true or false
+    def compiled_coffeescript?
+      return false unless extname == '.js'
+
+      # JS generated by CoffeeScript > 1.2 includes a comment on the first line
+      if lines[0] =~ /^\/\/ Generated by /
+        return true
+      end
+
+      if lines[0] == '(function() {' &&     # First line is module closure opening
+          lines[-2] == '}).call(this);' &&  # Second to last line closes module closure
+          lines[-1] == ''                   # Last line is blank
+
+        score = 0
+
+        lines.each do |line|
+          if line =~ /var /
+            # Underscored temp vars are likely to be Coffee
+            score += 1 * line.gsub(/(_fn|_i|_len|_ref|_results)/).count
+
+            # bind and extend functions are very Coffee specific
+            score += 3 * line.gsub(/(__bind|__extends|__hasProp|__indexOf|__slice)/).count
+          end
+        end
+
+        # Require a score of 3. This is fairly arbitrary. Consider
+        # tweaking later.
+        score >= 3
+      else
+        false
+      end
+    end
+
+    # Internal: Is this a generated documentation file for a .NET assembly?
+    #
+    # .NET developers often check in the XML Intellisense file along with an
+    # assembly - however, these don't have a special extension, so we have to
+    # dig into the contents to determine if it's a docfile. Luckily, these files
+    # are extremely structured, so recognizing them is easy.
+    #
+    # Returns true or false
+    def generated_net_docfile?
+      return false unless extname.downcase == ".xml"
+      return false unless lines.count > 3
+
+      # .NET Docfiles always open with <doc> and their first tag is an
+      # <assembly> tag
+      return lines[1].include?("<doc>") &&
+        lines[2].include?("<assembly>") &&
+        lines[-2].include?("</doc>")
+    end
+
+    # Internal: Is this a codegen file for a .NET project?
+    #
+    # Visual Studio often uses code generation to generate partial classes, and
+    # these files can be quite unwieldy. Let's hide them.
+    #
+    # Returns true or false
+    def generated_net_designer_file?
+      name.downcase =~ /\.designer\.cs$/
+    end
+
+    # Internal: Is the blob of JS a parser generated by PEG.js?
+    #
+    # PEG.js-generated parsers are not meant to be consumed by humans.
+    #
+    # Return true or false
+    def generated_parser?
+      return false unless extname == '.js'
+
+      # PEG.js-generated parsers include a comment near the top of the file
+      # that marks them as such.
+      if lines[0..4].join('') =~ /^(?:[^\/]|\/[^\*])*\/\*(?:[^\*]|\*[^\/])*Generated by PEG.js/
+        return true
+      end
+
+      false
+    end
+
+    # Internal: Is the blob a C++, Java or Python source file generated by the
+    # Protocol Buffer compiler?
+    #
+    # Returns true or false.
+    def generated_protocol_buffer?
+      return false unless ['.py', '.java', '.h', '.cc', '.cpp'].include?(extname)
+      return false unless lines.count > 1
+
+      return lines[0].include?("Generated by the protocol buffer compiler.  DO NOT EDIT!")
+    end
+  end
+end
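As an illustration of the whitespace-ratio check in `minified_files?` above, the following sketch reimplements the heuristic as a standalone method. The method name `minified?` and the sample strings are assumptions made for this example. The 200-character floor and the 5% threshold are copied from the diff.

```ruby
# Hypothetical standalone version of the heuristic: a .js or .css blob
# longer than 200 characters counts as minified when fewer than 5% of
# its characters are whitespace or control characters (c <= ' ').
def minified?(extname, data)
  return false unless ['.js', '.css'].include?(extname)
  return false unless data && data.length > 200

  whitespace = data.each_char.count { |c| c <= ' ' }
  (whitespace / data.length.to_f) < 0.05
end

packed   = 'a=1;' * 100         # 400 characters, no whitespace
readable = "var a = 1;\n" * 40  # 440 characters, 4 of every 11 are whitespace

puts minified?('.js', packed)   # prints "true"
puts minified?('.js', readable) # prints "false"
```

Unlike the method in the diff, which returns `nil` for other extensions, this sketch always returns `true` or `false`.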
--- a/lib/linguist/language.rb
+++ b/lib/linguist/language.rb
@@ -2,6 +2,9 @@ require 'escape_utils'
 require 'pygments'
 require 'yaml'
  
+require 'linguist/classifier'
+require 'linguist/samples'
+
 module Linguist
  # Language names that are recognizable by GitHub. Defined languages
  # can be highlighted, searched and listed under the Top Languages page.
@@ -9,28 +12,22 @@ module Linguist
  # Languages are defined in `lib/linguist/languages.yml`.
  class Language
    @languages       = []
-    @overrides       = {}
    @index           = {}
    @name_index      = {}
    @alias_index     = {}
-    @extension_index = {}
-    @filename_index  = {}
+
+    @extension_index          = Hash.new { |h,k| h[k] = [] }
+    @filename_index           = Hash.new { |h,k| h[k] = [] }
+    @primary_extension_index  = {}
  
    # Valid Languages types
    TYPES = [:data, :markup, :programming]
  
-    # Internal: Test if extension maps to multiple Languages.
+    # Names of non-programming languages that we will still detect
    #
-    # Returns true or false.
-    def self.ambiguous?(extension)
-      @overrides.include?(extension)
-    end
-
-    # Include?: Return overridden extensions.
-    #
-    # Returns extensions Array.
-    def self.overridden_extensions
-      @overrides.keys
+    # Returns an array
+    def self.detectable_markup
+      ["AsciiDoc", "CSS", "Creole", "Less", "Markdown", "MediaWiki", "Org", "RDoc", "Sass", "Textile", "reStructuredText"]
    end
  
    # Internal: Create a new Language object
@@ -43,18 +40,18 @@ module Linguist
  
      @languages << language
  
-      # All Language names should be unique. Warn if there is a duplicate.
+      # All Language names should be unique. Raise if there is a duplicate.
      if @name_index.key?(language.name)
-        warn "Duplicate language name: #{language.name}"
+        raise ArgumentError, "Duplicate language name: #{language.name}"
      end
  
      # Language name index
      @index[language.name] = @name_index[language.name] = language
  
      language.aliases.each do |name|
-        # All Language aliases should be unique. Warn if there is a duplicate.
+        # All Language aliases should be unique. Raise if there is a duplicate.
        if @alias_index.key?(name)
-          warn "Duplicate alias: #{name}"
+          raise ArgumentError, "Duplicate alias: #{name}"
        end
  
        @index[name] = @alias_index[name] = language
@@ -62,33 +59,56 @@ module Linguist
  
      language.extensions.each do |extension|
        if extension !~ /^\./
-          warn "Extension is missing a '.': #{extension.inspect}"
+          raise ArgumentError, "Extension is missing a '.': #{extension.inspect}"
        end
  
-        unless ambiguous?(extension)
-          # Index the extension with a leading ".": ".rb"
-          @extension_index[extension] = language
-
-          # Index the extension without a leading ".": "rb"
-          @extension_index[extension.sub(/^\./, '')] = language
-        end
+        @extension_index[extension] << language
      end
  
-      language.overrides.each do |extension|
-        if extension !~ /^\./
-          warn "Extension is missing a '.': #{extension.inspect}"
-        end
-
-        @overrides[extension] = language
+      if @primary_extension_index.key?(language.primary_extension)
+        raise ArgumentError, "Duplicate primary extension: #{language.primary_extension}"
      end
  
+      @primary_extension_index[language.primary_extension] = language
+
      language.filenames.each do |filename|
-        @filename_index[filename] = language
+        @filename_index[filename] << language
      end
  
      language
    end
  
+    # Public: Detects the Language of the blob.
+    #
+    # name - String filename
+    # data - String blob data. A block may also be passed in for lazy
+    #        loading. This behavior is deprecated and you should always
+    #        pass in a String.
+    # mode - Optional String mode (defaults to nil)
+    #
+    # Returns Language or nil.
+    def self.detect(name, data, mode = nil)
+      # A bit of an elegant hack. If the file is executable but extensionless,
+      # append a "magic" extension so it can be classified with other
+      # languages that have shebang scripts.
+      if File.extname(name).empty? && mode && (mode.to_i(8) & 05) == 05
+        name += ".script!"
+      end
+
+      possible_languages = find_by_filename(name)
+
+      if possible_languages.length > 1
+        data = data.call() if data.respond_to?(:call)
+        if data.nil? || data == ""
+          nil
+        elsif result = Classifier.classify(Samples::DATA, data, possible_languages.map(&:name)).first
+          Language[result[0]]
+        end
+      else
+        possible_languages.first
+      end
+    end
+
    # Public: Get all Languages
    #
    # Returns an Array of Languages
@@ -124,33 +144,22 @@ module Linguist
      @alias_index[name]
    end
  
-    # Public: Look up Language by extension.
-    #
-    # extension - The extension String. May include leading "."
-    #
-    # Examples
-    #
-    #   Language.find_by_extension('.rb')
-    #   # => #<Language name="Ruby">
-    #
-    # Returns the Language or nil if none was found.
-    def self.find_by_extension(extension)
-      @extension_index[extension]
-    end
-
-    # Public: Look up Language by filename.
+    # Public: Look up Languages by filename.
    #
    # filename - The path String.
    #
    # Examples
    #
    #   Language.find_by_filename('foo.rb')
-    #   # => #<Language name="Ruby">
+    #   # => [#<Language name="Ruby">]
    #
-    # Returns the Language or nil if none was found.
+    # Returns all matching Languages or [] if none were found.
    def self.find_by_filename(filename)
      basename, extname = File.basename(filename), File.extname(filename)
-      @filename_index[basename] || @extension_index[extname]
+      langs = [@primary_extension_index[extname]] +
+              @filename_index[basename] +
+              @extension_index[extname]
+      langs.compact.uniq
    end
  
    # Public: Look up Language by its name or lexer.
@@ -231,16 +240,18 @@ module Linguist
        raise(ArgumentError, "#{@name} is missing lexer")
  
      @ace_mode = attributes[:ace_mode]
+      @wrap = attributes[:wrap] || false
  
      # Set legacy search term
      @search_term = attributes[:search_term] || default_alias_name
  
      # Set extensions or default to [].
      @extensions = attributes[:extensions] || []
-      @overrides  = attributes[:overrides]  || []
      @filenames  = attributes[:filenames]  || []
  
-      @primary_extension = attributes[:primary_extension] || default_primary_extension || extensions.first
+      unless @primary_extension = attributes[:primary_extension]
+        raise ArgumentError, "#{@name} is missing primary extension"
+      end
  
       # Prepend primary extension unless it's already included
      if primary_extension && !extensions.include?(primary_extension)
@@ -320,6 +331,11 @@ module Linguist
    # Returns a String name or nil
    attr_reader :ace_mode
  
+    # Public: Should language lines be wrapped
+    #
+    # Returns true or false
+    attr_reader :wrap
+
    # Public: Get extensions
    #
    # Examples
@@ -331,7 +347,7 @@ module Linguist
  
    # Deprecated: Get primary extension
    #
-    # Defaults to the first extension but can be overriden
+    # Defaults to the first extension but can be overridden
    # in the languages.yml.
    #
    # The primary extension can not be nil. Tests should verify this.
@@ -343,11 +359,6 @@ module Linguist
    # Returns the extension String.
    attr_reader :primary_extension
  
-    # Internal: Get overridden extensions.
-    #
-    # Returns the extensions Array.
-    attr_reader :overrides
-
    # Public: Get filenames
    #
    # Examples
@@ -377,13 +388,6 @@ module Linguist
      name.downcase.gsub(/\s/, '-')
    end
  
-    # Internal: Get default primary extension.
-    #
-    # Returns the extension String.
-    def default_primary_extension
-      extensions.first
-    end
-
    # Public: Get Language group
    #
    # Returns a Language
@@ -441,11 +445,36 @@ module Linguist
    def hash
      name.hash
    end
+
+    def inspect
+      "#<#{self.class} name=#{name}>"
+    end
  end
  
+  extensions = Samples::DATA['extnames']
+  filenames = Samples::DATA['filenames']
  popular = YAML.load_file(File.expand_path("../popular.yml", __FILE__))
  
  YAML.load_file(File.expand_path("../languages.yml", __FILE__)).each do |name, options|
+    options['extensions'] ||= []
+    options['filenames'] ||= []
+
+    if extnames = extensions[name]
+      extnames.each do |extname|
+        if !options['extensions'].include?(extname)
+          options['extensions'] << extname
+        end
+      end
+    end
+
+    if fns = filenames[name]
+      fns.each do |filename|
+        if !options['filenames'].include?(filename)
+          options['filenames'] << filename
+        end
+      end
+    end
+
    Language.create(
      :name              => name,
      :color             => options['color'],
@@ -453,12 +482,12 @@ module Linguist
      :aliases           => options['aliases'],
      :lexer             => options['lexer'],
      :ace_mode          => options['ace_mode'],
+      :wrap              => options['wrap'],
      :group_name        => options['group'],
      :searchable        => options.key?('searchable') ? options['searchable'] : true,
      :search_term       => options['search_term'],
-      :extensions        => options['extensions'],
+      :extensions        => options['extensions'].sort,
      :primary_extension => options['primary_extension'],
-      :overrides         => options['overrides'],
      :filenames         => options['filenames'],
      :popular           => popular.include?(name)
    )
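The mode test at the top of `Language.detect` is compact, so here is a small sketch of just that step. It assumes `mode` is an octal String such as `"100755"`, and `script_name` is a hypothetical helper name, not part of Linguist.

```ruby
# Append the "magic" .script! extension to extensionless executables.
# The mask 05 checks the read (4) and execute (1) bits in the last
# octal digit of the mode, i.e. the permissions for "other" users.
def script_name(name, mode = nil)
  if File.extname(name).empty? && mode && (mode.to_i(8) & 05) == 05
    name + '.script!'
  else
    name
  end
end

puts script_name('bin/rake', '100755') # prints "bin/rake.script!"
puts script_name('README', '100644')   # prints "README"
puts script_name('rake.rb', '100755')  # prints "rake.rb"
```

The renamed file then goes through the same filename lookup as any other blob, so shebang scripts can be classified among the languages that list `.script!` samples.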

--- a/lib/linguist/languages.yml
+++ b/lib/linguist/languages.yml
 # Defines all Languages known to GitHub.
 #
 # All languages have an associated lexer for syntax highlighting. It
-# defaults to name.downcase, which covers most cases. Make sure the
-# lexer exists in lexers.yml. This is a list of available in our
-# version of pygments.
+# defaults to name.downcase, which covers most cases.
 #
 # type              - Either data, programming, markup, or nil
-# lexer             - An explicit lexer String (defaults to name.downcase)
+# lexer             - An explicit lexer String (defaults to name)
 # aliases           - An Array of additional aliases (implicitly
 #                     includes name.downcase)
 # ace_mode          - A String name of Ace Mode (if available)
+# wrap              - Boolean wrap to enable line wrapping (default: false)
 # extensions        - An Array of associated extensions
 # primary_extension - A String for the main extension associated with
-#                     the langauge. (defaults to extensions.first)
-# overrides         - An Array of extensions that takes precedence over conflicts
+#                     the language. Must be unique. Used when a Language is picked
+#                     from a dropdown and we need to automatically choose an
+#                     extension.
 # searchable        - Boolean flag to enable searching (defaults to true)
 # search_term       - Deprecated: Some languages may be indexed under a
 #                     different alias. Avoid defining new exceptions.
@@ -22,7 +22,12 @@
 # Any additions or modifications (even trivial) should have corresponding
 # test change in `test/test_blob.rb`.
 #
-# Please keep this list alphabetized.
+# Please keep this list alphabetized. Capitalization comes before lower case.
+
+ABAP:
+  type: programming
+  lexer: ABAP
+  primary_extension: .abap
  
 ASP:
  type: programming
@@ -38,7 +43,6 @@ ASP:
  - .ascx
  - .ashx
  - .asmx
-  - .asp
  - .aspx
  - .axd
  
@@ -49,43 +53,53 @@ ActionScript:
  search_term: as3
  aliases:
  - as3
-  extensions:
-  - .as
+  primary_extension: .as
  
 Ada:
  type: programming
  color: "#02f88c"
+  primary_extension: .adb
  extensions:
-  - .adb
  - .ads
  
+ApacheConf:
+  type: markup
+  aliases:
+  - apache
+  primary_extension: .apacheconf
+
 Apex:
  type: programming
  lexer: Text only
-  extensions:
-  - .cls
+  primary_extension: .cls
  
 AppleScript:
+  type: programming
  aliases:
  - osascript
-  primary_extension: .scpt
-  extensions:
-  - .applescript
-  - .scpt
+  primary_extension: .applescript
  
 Arc:
  type: programming
  color: "#ca2afe"
  lexer: Text only
-  extensions:
-  - .arc
+  primary_extension: .arc
  
 Arduino:
  type: programming
  color: "#bd79d1"
  lexer: C++
+  primary_extension: .ino
+
+AsciiDoc:
+  type: markup
+  lexer: Text only
+  ace_mode: asciidoc
+  wrap: true
+  primary_extension: .asciidoc
  extensions:
-  - .ino
+  - .adoc
+  - .asc
  
 Assembly:
  type: programming
@@ -94,13 +108,11 @@ Assembly:
  search_term: nasm
  aliases:
  - nasm
-  extensions:
-  - .asm
+  primary_extension: .asm
  
 Augeas:
  type: programming
-  extensions:
-  - .aug
+  primary_extension: .aug
  
 AutoHotkey:
  type: programming
@@ -108,8 +120,16 @@ AutoHotkey:
  color: "#6594b9"
  aliases:
  - ahk
+  primary_extension: .ahk
+
+Awk:
+  type: programming
+  lexer: Awk
+  primary_extension: .awk
  extensions:
-  - .ahk
+  - .gawk
+  - .mawk
+  - .nawk
  
 Batchfile:
  type: programming
@@ -119,42 +139,33 @@ Batchfile:
  - bat
  primary_extension: .bat
  extensions:
-  - .bat
  - .cmd
  
 Befunge:
-  extensions:
-  - .befunge
+  primary_extension: .befunge
  
 BlitzMax:
-  extensions:
-  - .bmx
+  primary_extension: .bmx
  
 Boo:
  type: programming
  color: "#d4bec1"
-  extensions:
-  - .boo
+  primary_extension: .boo
  
 Brainfuck:
+  primary_extension: .b
  extensions:
-  - .b
  - .bf
  
 Bro:
  type: programming
-  extensions:
-  - .bro
+  primary_extension: .bro
  
 C:
  type: programming
  color: "#555"
-  overrides:
-  - .h
  primary_extension: .c
  extensions:
-  - .c
-  - .h
  - .w
  
 C#:
@@ -164,8 +175,9 @@ C#:
  color: "#5a25a2"
  aliases:
  - csharp
+  primary_extension: .cs
  extensions:
-  - .cs
+  - .csx
  
 C++:
  type: programming
@@ -176,23 +188,19 @@ C++:
  - cpp
  primary_extension: .cpp
  extensions:
+  - .C
  - .c++
-  - .cc
-  - .cpp
-  - .cu
  - .cxx
-  - .h
+  - .H
  - .h++
  - .hh
-  - .hpp
  - .hxx
  - .tcc
  
 C-ObjDump:
  type: data
  lexer: c-objdump
-  extensions:
-  - .c-objdump
+  primary_extension: .c-objdump
  
 C2hs Haskell:
  type: programming
@@ -200,25 +208,42 @@ C2hs Haskell:
  group: Haskell
  aliases:
  - c2hs
-  extensions:
-  - .chs
+  primary_extension: .chs
+
+CLIPS:
+  type: programming
+  lexer: Text only
+  primary_extension: .clp
  
 CMake:
+  primary_extension: .cmake
  extensions:
-  - .cmake
  - .cmake.in
  filenames:
  - CMakeLists.txt
  
+COBOL:
+  type: programming
+  primary_extension: .cob
+  extensions:
+  - .cbl
+  - .ccp
+  - .cobol
+  - .cpy
+
 CSS:
  ace_mode: css
-  extensions:
-  - .css
+  color: "#1f085e"
+  primary_extension: .css
+
+Ceylon:
+  type: programming
+  lexer: Ceylon
+  primary_extension: .ceylon
  
 ChucK:
  lexer: Java
-  extensions:
-  - .ck
+  primary_extension: .ck
  
 Clojure:
  type: programming
@@ -226,8 +251,10 @@ Clojure:
  color: "#db5855"
  primary_extension: .clj
  extensions:
-  - .clj
  - .cljs
+  - .cljx
+  filenames:
+  - riemann.config
  
 CoffeeScript:
  type: programming
@@ -235,8 +262,12 @@ CoffeeScript:
  color: "#244776"
  aliases:
  - coffee
+  - coffee-script
+  primary_extension: .coffee
  extensions:
-  - .coffee
+  - ._coffee
+  - .cson
+  - .iced
  filenames:
  - Cakefile
  
@@ -251,7 +282,6 @@ ColdFusion:
  primary_extension: .cfm
  extensions:
  - .cfc
-  - .cfm
  
 Common Lisp:
  type: programming
@@ -260,27 +290,32 @@ Common Lisp:
  - lisp
  primary_extension: .lisp
  extensions:
-  - .lisp
+  - .asd
  - .lsp
  - .ny
+  - .podsl
  
 Coq:
  type: programming
-  extensions:
-  - .v
+  primary_extension: .coq
  
 Cpp-ObjDump:
  type: data
  lexer: cpp-objdump
+  primary_extension: .cppobjdump
  extensions:
-  - .cppobjdump
  - .c++objdump
  - .cxx-objdump
  
+Creole:
+  type: markup
+  lexer: Text only
+  wrap: true
+  primary_extension: .creole
+
 Cucumber:
  lexer: Gherkin
-  extensions:
-  - .feature
+  primary_extension: .feature
  
 Cython:
  type: programming
@@ -289,42 +324,37 @@ Cython:
  extensions:
  - .pxd
  - .pxi
-  - .pyx
  
 D:
  type: programming
  color: "#fcd46d"
+  primary_extension: .d
  extensions:
-  - .d
  - .di
  
 D-ObjDump:
  type: data
  lexer: d-objdump
+  primary_extension: .d-objdump
+
+DOT:
+  type: programming
+  lexer: Text only
+  primary_extension: .dot
  extensions:
-  - .d-objdump
+  - .gv
  
 Darcs Patch:
  search_term: dpatch
  aliases:
  - dpatch
+  primary_extension: .darcspatch
  extensions:
-  - .darcspatch
  - .dpatch
  
 Dart:
  type: programming
-  extensions:
-  - .dart
-
-Delphi:
-  type: programming
-  color: "#b0ce4e"
-  primary_extension: .pas
-  extensions:
-  - .dpr
-  - .lpr
-  - .pas
+  primary_extension: .dart
  
 DCPU-16 ASM:
  type: programming
@@ -332,43 +362,50 @@ DCPU-16 ASM:
  primary_extension: .dasm16
  extensions:
  - .dasm
-  - .dasm16
  aliases:
  - dasm16
  
 Diff:
-  extensions:
-  - .diff
-  - .patch
+  primary_extension: .diff
  
 Dylan:
  type: programming
  color: "#3ebc27"
-  extensions:
-  - .dylan
+  primary_extension: .dylan
  
 Ecere Projects:
  type: data
  group: JavaScript
  lexer: JSON
+  primary_extension: .epj
+
+Ecl:
+  type: programming
+  color: "#8a1267"
+  primary_extension: .ecl
+  lexer: ECL
  extensions:
-  - .epj
+  - .eclxml
  
 Eiffel:
  type: programming
  lexer: Text only
  color: "#946d57"
-  extensions:
-  - .e
+  primary_extension: .e
  
 Elixir:
  type: programming
  color: "#6e4a7e"
  primary_extension: .ex
  extensions:
-  - .ex
  - .exs
  
+Elm:
+  type: programming
+  lexer: Haskell
+  group: Haskell
+  primary_extension: .elm
+
 Emacs Lisp:
  type: programming
  lexer: Scheme
@@ -378,24 +415,24 @@ Emacs Lisp:
  - emacs
  primary_extension: .el
  extensions:
-  - .el
  - .emacs
  
 Erlang:
  type: programming
-  color: "#949e0e"
+  color: "#0faf8d"
  primary_extension: .erl
  extensions:
-  - .erl
  - .hrl
  
 F#:
  type: programming
  lexer: FSharp
  color: "#b845fc"
-  search_term: ocaml
+  search_term: fsharp
+  aliases:
+  - fsharp
+  primary_extension: .fs
  extensions:
-  - .fs
  - .fsi
  - .fsx
  
@@ -417,7 +454,6 @@ FORTRAN:
  - .f03
  - .f08
  - .f77
-  - .f90
  - .f95
  - .for
  - .fpp
@@ -425,8 +461,10 @@ FORTRAN:
 Factor:
  type: programming
  color: "#636746"
-  extensions:
-  - .factor
+  primary_extension: .factor
+  filenames:
+    - .factor-rc
+    - .factor-boot-rc
  
 Fancy:
  type: programming
@@ -434,13 +472,21 @@ Fancy:
  primary_extension: .fy
  extensions:
  - .fancypack
-  - .fy
+  filenames:
+  - Fakefile
  
 Fantom:
  type: programming
  color: "#dbded5"
+  primary_extension: .fan
+
+Forth:
+  type: programming
+  primary_extension: .fth
+  color: "#341708"
+  lexer: Text only
  extensions:
-  - .fan
+  - .4th
  
 GAS:
  type: programming
@@ -448,49 +494,50 @@ GAS:
  primary_extension: .s
  extensions:
  - .S
-  - .s
  
-Genshi:
+GLSL:
+  group: C
+  type: programming
+  primary_extension: .glsl
  extensions:
-  - .kid
+  - .fp
+  - .frag
+  - .geom
+  - .glslv
+  - .shader
+  - .vert
+
+Genshi:
+  primary_extension: .kid
  
 Gentoo Ebuild:
  group: Shell
  lexer: Bash
-  extensions:
-  - .ebuild
+  primary_extension: .ebuild
  
 Gentoo Eclass:
  group: Shell
  lexer: Bash
-  extensions:
-  - .eclass
+  primary_extension: .eclass
  
 Gettext Catalog:
  search_term: pot
  searchable: false
  aliases:
  - pot
+  primary_extension: .po
  extensions:
-  - .po
  - .pot
  
 Go:
  type: programming
-  color: "#8d04eb"
-  extensions:
-  - .go
+  color: "#a89b4d"
+  primary_extension: .go
  
 Gosu:
  type: programming
  color: "#82937f"
  primary_extension: .gs
-  extensions:
-  - .gs
-  - .gsp
-  - .gst
-  - .gsx
-  - .vark
  
 Groff:
  primary_extension: .man
@@ -502,127 +549,133 @@ Groff:
  - '.5'
  - '.6'
  - '.7'
-  - .man
  
 Groovy:
  type: programming
  ace_mode: groovy
  color: "#e69f56"
  primary_extension: .groovy
-  extensions:
-  - .gradle
-  - .groovy
  
 Groovy Server Pages:
  group: Groovy
  lexer: Java Server Page
-  overrides:
-  - .gsp
  aliases:
  - gsp
-  extensions:
-  - .gsp
+  primary_extension: .gsp
  
 HTML:
  type: markup
  ace_mode: html
+  aliases:
+  - xhtml
  primary_extension: .html
  extensions:
  - .htm
-  - .html
  - .xhtml
-  - .xslt
  
 HTML+Django:
  type: markup
  group: HTML
  lexer: HTML+Django/Jinja
+  primary_extension: .mustache # TODO: This is incorrect
  extensions:
+  - .jinja
  - .mustache
  
 HTML+ERB:
  type: markup
  group: HTML
  lexer: RHTML
+  aliases:
+  - erb
  primary_extension: .erb
  extensions:
-  - .erb
+  - .erb.deface
  - .html.erb
+  - .html.erb.deface
  
 HTML+PHP:
  type: markup
  group: HTML
-  extensions:
-  - .phtml
+  primary_extension: .phtml
  
-HaXe:
-  type: programming
-  lexer: haXe
-  ace_mode: haxe
-  color: "#346d51"
-  extensions:
-  - .hx
-  - .hxml
-  - .mtt
+HTTP:
+  type: data
+  primary_extension: .http
  
 Haml:
  group: HTML
  type: markup
+  primary_extension: .haml
  extensions:
-  - .haml
+  - .haml.deface
+  - .html.haml.deface
+
+Handlebars:
+  type: markup
+  lexer: Text only
+  primary_extension: .handlebars
  
 Haskell:
  type: programming
  color: "#29b544"
+  primary_extension: .hs
  extensions:
-  - .hs
  - .hsc
  
+Haxe:
+  type: programming
+  lexer: haXe
+  ace_mode: haxe
+  color: "#346d51"
+  primary_extension: .hx
+  extensions:
+  - .hxsl
+
 INI:
  type: data
  extensions:
-  - .cfg
  - .ini
  - .prefs
  - .properties
-  filenames:
-  - .gitconfig
+  primary_extension: .ini
  
 IRC log:
  lexer: IRC logs
  search_term: irc
  aliases:
  - irc
+  primary_extension: .irclog
  extensions:
  - .weechatlog
  
 Io:
  type: programming
  color: "#a9188d"
-  extensions:
-  - .io
+  primary_extension: .io
  
 Ioke:
  type: programming
  color: "#078193"
-  extensions:
-  - .ik
+  primary_extension: .ik
+
+J:
+  type: programming
+  lexer: Text only
+  primary_extension: .ijs
  
 JSON:
  type: data
  group: JavaScript
  ace_mode: json
  searchable: false
-  extensions:
-  - .json
+  primary_extension: .json
  
 Java:
  type: programming
  ace_mode: java
  color: "#b07219"
-  extensions:
-  - .java
-  - .pde
+  primary_extension: .java
  
 Java Server Pages:
  group: Java
@@ -630,8 +683,7 @@ Java Server Pages:
  search_term: jsp
  aliases:
  - jsp
-  extensions:
-  - .jsp
+  primary_extension: .jsp
  
 JavaScript:
  type: programming
@@ -642,9 +694,9 @@ JavaScript:
  - node
  primary_extension: .js
  extensions:
+  - ._js
  - .bones
  - .jake
-  - .js
  - .jsfl
  - .jsm
  - .jss
@@ -657,26 +709,55 @@ JavaScript:
  
 Julia:
  type: programming
-  extensions:
-  - .jl
+  primary_extension: .jl
  
 Kotlin:
  type: programming
+  primary_extension: .kt
  extensions:
-  - .kt
  - .ktm
  - .kts
  
+LFE:
+  type: programming
+  primary_extension: .lfe
+  color: "#004200"
+  lexer: Common Lisp
+  group: Erlang
+
 LLVM:
-  extensions:
-  - .ll
+  primary_extension: .ll
+
+Lasso:
+  type: programming
+  lexer: Lasso
+  ace_mode: lasso
+  color: "#2584c3"
+  primary_extension: .lasso
+
+Less:
+  type: markup
+  group: CSS
+  lexer: CSS
+  ace_mode: less
+  primary_extension: .less
  
 LilyPond:
  lexer: Text only
  primary_extension: .ly
  extensions:
  - .ily
-  - .ly
+
+Literate CoffeeScript:
+  type: programming
+  group: CoffeeScript
+  lexer: Text only
+  ace_mode: markdown
+  wrap: true
+  search_term: litcoffee
+  aliases:
+  - litcoffee
+  primary_extension: .litcoffee
  
 Literate Haskell:
  type: programming
@@ -684,44 +765,77 @@ Literate Haskell:
  search_term: lhs
  aliases:
  - lhs
+  primary_extension: .lhs
+
+LiveScript:
+  type: programming
+  ace_mode: ls
+  color: "#499886"
+  aliases:
+  - ls
+  primary_extension: .ls
+  extensions:
+  - ._ls
+  filenames:
+  - Slakefile
+
+Logos:
+  type: programming
+  primary_extension: .xm
  extensions:
-  - .lhs
+  - .x
+  - .xi
+  - .xmi
  
 Logtalk:
  type: programming
+  primary_extension: .lgt
  extensions:
-  - .lgt
+  - .logtalk
  
 Lua:
  type: programming
  ace_mode: lua
  color: "#fa1fa1"
+  primary_extension: .lua
  extensions:
-  - .lua
  - .nse
+  - .rbxs
+
+M:
+  type: programming
+  lexer: Common Lisp
+  aliases:
+  - mumps
+  primary_extension: .mumps
+  extensions:
+  - .m
  
 Makefile:
+  aliases:
+  - make
  extensions:
  - .mak
  - .mk
+  primary_extension: .mak
  filenames:
  - makefile
  - Makefile
  - GNUmakefile
  
 Mako:
+  primary_extension: .mako
  extensions:
-  - .mako
  - .mao
  
 Markdown:
  type: markup
  lexer: Text only
  ace_mode: markdown
+  wrap: true
  primary_extension: .md
  extensions:
  - .markdown
-  - .md
  - .mkd
  - .mkdown
  - .ron
@@ -730,16 +844,25 @@ Matlab:
  type: programming
  color: "#bb92ac"
  primary_extension: .matlab
-  extensions:
-  - .m
-  - .matlab
  
-Max/MSP:
+Max:
  type: programming
  color: "#ce279c"
  lexer: Text only
+  aliases:
+  - max/msp
+  - maxmsp
+  search_term: max/msp
+  primary_extension: .mxt
  extensions:
-  - .mxt
+  - .maxhelp
+  - .maxpat
+
+MediaWiki:
+  type: markup
+  lexer: Text only
+  wrap: true
+  primary_extension: .mediawiki
  
 MiniD: # Legacy
  searchable: false
@@ -750,31 +873,46 @@ Mirah:
  lexer: Ruby
  search_term: ruby
  color: "#c7a938"
+  primary_extension: .druby
  extensions:
  - .duby
  - .mir
  - .mirah
  
+Monkey:
+  type: programming
+  lexer: Monkey
+  primary_extension: .monkey
+
 Moocode:
  lexer: MOOCode
-  extensions:
-  - .moo
+  primary_extension: .moo
+
+MoonScript:
+  type: programming
+  primary_extension: .moon
  
 Myghty:
-  extensions:
-  - .myt
+  primary_extension: .myt
+
+NSIS:
+  primary_extension: .nsi
  
 Nemerle:
  type: programming
  color: "#0d3c6e"
-  extensions:
-  - .n
+  primary_extension: .n
+
+Nginx:
+  type: markup
+  lexer: Nginx configuration file
+  primary_extension: .nginxconf
  
 Nimrod:
  type: programming
  color: "#37775b"
+  primary_extension: .nim
  extensions:
-  - .nim
  - .nimrod
  
 Nu:
@@ -783,8 +921,7 @@ Nu:
  color: "#c9df40"
  aliases:
  - nush
-  extensions:
-  - .nu
+  primary_extension: .nu
  filenames:
  - Nukefile
  
@@ -792,7 +929,6 @@ NumPy:
  group: Python
  primary_extension: .numpy
  extensions:
-  - .numpy
  - .numpyw
  - .numsc
  
@@ -802,7 +938,7 @@ OCaml:
  color: "#3be133"
  primary_extension: .ml
  extensions:
-  - .ml
+  - .eliomi
  - .mli
  - .mll
  - .mly
@@ -810,38 +946,44 @@ OCaml:
 ObjDump:
  type: data
  lexer: objdump
-  extensions:
-  - .objdump
+  primary_extension: .objdump
  
 Objective-C:
  type: programming
  color: "#438eff"
-  overrides:
-  - .m
+  aliases:
+  - obj-c
+  - objc
  primary_extension: .m
  extensions:
-  - .h
-  - .m
  - .mm
  
 Objective-J:
  type: programming
  color: "#ff0c5a"
+  aliases:
+  - obj-j
+  primary_extension: .j
  extensions:
-  - .j
  - .sj
  
+Omgrofl:
+  type: programming
+  primary_extension: .omgrofl
+  color: "#cabbff"
+  lexer: Text only
+
 Opa:
  type: programming
-  extensions:
-  - .opa
+  primary_extension: .opa
  
 OpenCL:
  type: programming
  group: C
  lexer: C
+  primary_extension: .cl
  extensions:
-  - .cl
+  - .opencl
  
 OpenEdge ABL:
  type: programming
@@ -850,18 +992,21 @@ OpenEdge ABL:
  - openedge
  - abl
  primary_extension: .p
-  extensions:
-  - .cls
-  - .p
+
+Org:
+  type: markup
+  lexer: Text only
+  wrap: true
+  primary_extension: .org
  
 PHP:
  type: programming
  ace_mode: php
  color: "#6e03c1"
+  primary_extension: .php
  extensions:
  - .aw
  - .ctp
-  - .php
  - .php3
  - .php4
  - .php5
@@ -881,8 +1026,7 @@ Parrot Internal Representation:
  lexer: Text only
  aliases:
  - pir
-  extensions:
-  - .pir
+  primary_extension: .pir
  
 Parrot Assembly:
  group: Parrot
@@ -890,48 +1034,70 @@ Parrot Assembly:
  lexer: Text only
  aliases:
  - pasm
+  primary_extension: .pasm
+
+Pascal:
+  type: programming
+  lexer: Delphi
+  color: "#b0ce4e"
+  primary_extension: .pas
  extensions:
-  - .pasm
+  - .dfm
+  - .lpr
  
 Perl:
  type: programming
  ace_mode: perl
  color: "#0298c3"
-  overrides:
-  - .pl
-  - .t
  primary_extension: .pl
  extensions:
  - .PL
+  - .nqp
  - .perl
  - .ph
-  - .pl
  - .plx
-  - .pm
+  - .pm6
  - .pod
  - .psgi
-  - .t
+
+Pike:
+  type: programming
+  color: "#066ab2"
+  lexer: C
+  primary_extension: .pike
+  extensions:
+  - .pmod
+
+PogoScript:
+  type: programming
+  color: "#d80074"
+  lexer: Text only
+  primary_extension: .pogo
  
 PowerShell:
  type: programming
  ace_mode: powershell
  aliases:
  - posh
-  extensions:
-  - .ps1
-  - .psm1
+  primary_extension: .ps1
+
+Processing:
+  type: programming
+  lexer: Java
+  color: "#2779ab"
+  primary_extension: .pde
  
 Prolog:
  type: programming
  color: "#74283c"
+  primary_extension: .prolog
  extensions:
-  - .pl
  - .pro
-  - .prolog
  
 Puppet:
  type: programming
  color: "#cc5555"
+  primary_extension: .pp
  extensions:
  - .pp
  filenames:
@@ -941,8 +1107,7 @@ Pure Data:
  type: programming
  color: "#91de79"
  lexer: Text only
-  extensions:
-  - .pd
+  primary_extension: .pd
  
 Python:
  type: programming
@@ -950,67 +1115,80 @@ Python:
  color: "#3581ba"
  primary_extension: .py
  extensions:
-  - .py
+  - .gyp
+  - .pyt
  - .pyw
  - .wsgi
  - .xpy
+  filenames:
+  - wscript
  
 Python traceback:
  type: data
  group: Python
  lexer: Python Traceback
  searchable: false
-  extensions:
-  - .pytb
+  primary_extension: .pytb
  
 R:
  type: programming
  color: "#198ce7"
  lexer: S
-  overrides:
-  - .r
  primary_extension: .r
-  extensions:
-  - .R
-  - .r
+  filenames:
+  - .Rprofile
+
+RDoc:
+  type: markup
+  lexer: Text only
+  ace_mode: rdoc
+  wrap: true
+  primary_extension: .rdoc
  
 RHTML:
  type: markup
  group: HTML
-  extensions:
-  - .rhtml
+  primary_extension: .rhtml
  
 Racket:
  type: programming
-  lexer: Scheme
+  lexer: Racket
  color: "#ae17ff"
  primary_extension: .rkt
  extensions:
-  - .rkt
  - .rktd
  - .rktl
-  - .scrbl
+
+Ragel in Ruby Host:
+  type: programming
+  lexer: Ragel in Ruby Host
+  color: "#ff9c2e"
+  primary_extension: .rl
  
 Raw token data:
  search_term: raw
  aliases:
  - raw
-  extensions:
-  - .raw
+  primary_extension: .raw
  
 Rebol:
  type: programming
  lexer: REBOL
  color: "#358a5b"
+  primary_extension: .rebol
  extensions:
-  - .r
  - .r2
  - .r3
-  - .rebol
  
 Redcode:
-  extensions:
-  - .cw
+  primary_extension: .cw
+
+Rouge:
+  type: programming
+  lexer: Clojure
+  ace_mode: clojure
+  color: "#cc0088"
+  primary_extension: .rg
  
 Ruby:
  type: programming
@@ -1029,8 +1207,6 @@ Ruby:
  - .god
  - .irbrc
  - .podspec
-  - .rake
-  - .rb
  - .rbuild
  - .rbw
  - .rbx
@@ -1038,80 +1214,64 @@ Ruby:
  - .thor
  - .watchr
  filenames:
-  - Capfile
+  - Berksfile
  - Gemfile
  - Guardfile
  - Podfile
-  - Rakefile
  - Thorfile
  - Vagrantfile
  
 Rust:
  type: programming
  color: "#dea584"
-  lexer: Text only
-  extensions:
-  - .rs
+  primary_extension: .rs
  
 SCSS:
  type: markup
  group: CSS
  ace_mode: scss
-  extensions:
-  - .scss
+  primary_extension: .scss
  
 SQL:
  type: data
  ace_mode: sql
  searchable: false
-  extensions:
-  - .sql
+  primary_extension: .sql
  
 Sage:
  type: programming
  lexer: Python
  group: Python
-  extensions:
-  - .sage
+  primary_extension: .sage
  
 Sass:
  type: markup
  group: CSS
-  extensions:
-  - .sass
+  primary_extension: .sass
  
 Scala:
  type: programming
  ace_mode: scala
  color: "#7dd3b0"
  primary_extension: .scala
-  extensions:
-  - .sbt
-  - .scala
  
 Scheme:
  type: programming
  color: "#1e4aec"
  primary_extension: .scm
  extensions:
-  - .scm
  - .sls
-  - .sps
  - .ss
  
 Scilab:
  type: programming
  primary_extension: .sci
-  extensions:
-  - .sce
-  - .tst
  
 Self:
  type: programming
  color: "#0579aa"
  lexer: Text only
-  extensions:
-  - .self
+  primary_extension: .self
  
 Shell:
  type: programming
@@ -1124,28 +1284,27 @@ Shell:
  - zsh
  primary_extension: .sh
  extensions:
-  - .bash
-  - .sh
-  - .zsh
+  - .tmux
  filenames:
-  - .bash_profile
-  - .bashrc
-  - .profile
-  - .zlogin
-  - .zsh
-  - .zshrc
-  - bashrc
-  - zshrc
+  - Dockerfile
+
+Slash:
+  type: programming
+  color: "#007eff"
+  primary_extension: .sl
  
 Smalltalk:
  type: programming
  color: "#596706"
-  extensions:
-  - .st
+  primary_extension: .st
  
 Smarty:
-  extensions:
-  - .tpl
+  primary_extension: .tpl
+
+Squirrel:
+  type: programming
+  lexer: C++
+  primary_extension: .nut
  
 Standard ML:
  type: programming
@@ -1153,22 +1312,26 @@ Standard ML:
  aliases:
  - sml
  primary_extension: .sml
-  extensions:
-  - .sig
-  - .sml
  
 SuperCollider:
  type: programming
  color: "#46390b"
  lexer: Text only
-  extensions:
-  - .sc
+  primary_extension: .sc
+
+TOML:
+  type: data
+  primary_extension: .toml
+
+TXL:
+  type: programming
+  lexer: Text only
+  primary_extension: .txl
  
 Tcl:
  type: programming
  color: "#e4cc98"
-  extensions:
-  - .tcl
+  primary_extension: .tcl
  
 Tcsh:
  type: programming
@@ -1176,81 +1339,83 @@ Tcsh:
  primary_extension: .tcsh
  extensions:
  - .csh
-  - .tcsh
  
 TeX:
  type: markup
  ace_mode: latex
+  aliases:
+  - latex
  primary_extension: .tex
-  overrides:
-  - .cls
  extensions:
  - .aux
-  - .cls
  - .dtx
  - .ins
  - .ltx
  - .sty
-  - .tex
  - .toc
  
 Tea:
  type: markup
-  extensions:
-  - .tea
-
-Text:
-  type: data
-  lexer: Text only
-  ace_mode: text
-  extensions:
-  - .txt
+  primary_extension: .tea
  
 Textile:
  type: markup
  lexer: Text only
  ace_mode: textile
-  extensions:
-  - .textile
+  wrap: true
+  primary_extension: .textile
  
 Turing:
  type: programming
  color: "#45f715"
  lexer: Text only
+  primary_extension: .t
  extensions:
-  - .t
  - .tu
  
 Twig:
  type: markup
  group: PHP
  lexer: HTML+Django/Jinja
-  extensions:
-  - .twig
+  primary_extension: .twig
+
+TypeScript:
+  type: programming
+  color: "#31859c"
+  aliases:
+  - ts
+  primary_extension: .ts
+
+Unified Parallel C:
+  type: programming
+  group: C
+  lexer: C
+  ace_mode: c_cpp
+  color: "#755223"
+  primary_extension: .upc
  
 VHDL:
  type: programming
  lexer: vhdl
  color: "#543978"
-  extensions:
-  - .vhd
-  - .vhdl
+  primary_extension: .vhdl
  
 Vala:
  type: programming
  color: "#ee7d06"
+  primary_extension: .vala
  extensions:
-  - .vala
  - .vapi
  
 Verilog:
  type: programming
  lexer: verilog
  color: "#848bf3"
-  overrides:
-  - .v
+  primary_extension: .v
  extensions:
-  - .v
+  - .sv
+  - .svh
+  - .vh
  
 VimL:
  type: programming
@@ -1258,11 +1423,8 @@ VimL:
  search_term: vim
  aliases:
  - vim
-  extensions:
-  - .vim
+  primary_extension: .vim
  filenames:
-  - .gvimrc
-  - .vimrc
  - vimrc
  - gvimrc
  
@@ -1274,85 +1436,151 @@ Visual Basic:
  extensions:
  - .bas
  - .frx
-  - .vb
  - .vba
  - .vbs
  
+Volt:
+    type: programming
+    lexer: D
+    color: "#0098db"
+    primary_extension: .volt
+
+XC:
+  type: programming
+  lexer: C
+  primary_extension: .xc
+
 XML:
  type: markup
  ace_mode: xml
+  aliases:
+  - rss
+  - xsd
+  - wsdl
  primary_extension: .xml
  extensions:
+  - .axml
+  - .ccxml
+  - .dita
+  - .ditamap
+  - .ditaval
  - .glade
+  - .grxml
  - .kml
  - .mxml
  - .plist
+  - .pt
  - .rdf
  - .rss
+  - .scxml
  - .svg
+  - .tmCommand
+  - .tmLanguage
+  - .tmPreferences
+  - .tmSnippet
+  - .tmTheme
+  - .tml
+  - .ui
+  - .vxml
  - .wsdl
  - .wxi
  - .wxl
  - .wxs
+  - .x3d
  - .xaml
  - .xlf
  - .xliff
-  - .xml
+  - .xmi
  - .xsd
-  - .xsl
  - .xul
+  - .zcml
  filenames:
  - .classpath
  - .project
  
+XProc:
+  type: programming
+  lexer: XML
+  primary_extension: .xpl
+  extensions:
+  - .xproc
+
 XQuery:
  type: programming
  color: "#2700e2"
+  primary_extension: .xquery
  extensions:
  - .xq
-  - .xqm
-  - .xquery
  - .xqy
  
 XS:
  lexer: C
+  primary_extension: .xs
+
+XSLT:
+  type: programming
+  aliases:
+  - xsl
+  primary_extension: .xslt
  extensions:
-  - .xs
+    - .xsl
+
+Xtend:
+  type: programming
+  primary_extension: .xtend
  
 YAML:
-  type: markup
+  type: data
+  aliases:
+  - yml
  primary_extension: .yml
  extensions:
+  - .reek
  - .yaml
-  - .yml
-  filenames:
-  - .gemrc
  
 eC:
  type: programming
  search_term: ec
  primary_extension: .ec
  extensions:
-  - .ec
  - .eh
  
+edn:
+  type: data
+  lexer: Clojure
+  ace_mode: clojure
+  color: "#db5855"
+  primary_extension: .edn
+
+fish:
+  type: programming
+  group: Shell
+  lexer: Text only
+  primary_extension: .fish
+
 mupad:
  lexer: MuPAD
-  extensions:
-  - .mu
+  primary_extension: .mu
  
 ooc:
  type: programming
  lexer: Ooc
  color: "#b0b77e"
-  extensions:
-  - .ooc
+  primary_extension: .ooc
  
 reStructuredText:
  type: markup
+  wrap: true
  search_term: rst
  aliases:
  - rst
+  primary_extension: .rst
  extensions:
-  - .rst
  - .rest
+
+wisp:
+  type: programming
+  lexer: Clojure
+  ace_mode: clojure
+  color: "#7582D1"
+  primary_extension: .wisp
--- a/lib/linguist/md5.rb
+++ b/lib/linguist/md5.rb
+require 'digest/md5'
+
+module Linguist
+  module MD5
+    # Public: Create deep nested digest of value object.
+    #
+    # Useful for object comparison.
+    #
+    # obj - Object to digest.
+    #
+    # Returns String hex digest
+    def self.hexdigest(obj)
+      digest = Digest::MD5.new
+
+      case obj
+      when String, Symbol, Integer
+        digest.update "#{obj.class}"
+        digest.update "#{obj}"
+      when TrueClass, FalseClass, NilClass
+        digest.update "#{obj.class}"
+      when Array
+        digest.update "#{obj.class}"
+        for e in obj
+          digest.update(hexdigest(e))
+        end
+      when Hash
+        digest.update "#{obj.class}"
+        for e in obj.map { |(k, v)| hexdigest([k, v]) }.sort
+          digest.update(e)
+        end
+      else
+        raise TypeError, "can't convert #{obj.inspect} into String"
+      end
+
+      digest.hexdigest
+    end
+  end
+end
--- a/lib/linguist/mime.rb
+++ b/lib/linguist/mime.rb
-require 'mime/types'
-require 'yaml'
-
-class MIME::Type
-  attr_accessor :override
-end
-
-# Register additional mime type extensions
-#
-# Follows same format as mime-types data file
-#   https://github.com/halostatue/mime-types/blob/master/lib/mime/types.rb.data
-File.read(File.expand_path("../mimes.yml", __FILE__)).lines.each do |line|
-  # Regexp was cargo culted from mime-types lib
-  next unless line =~ %r{^
-    #{MIME::Type::MEDIA_TYPE_RE}
-    (?:\s@([^\s]+))?
-    (?:\s:(#{MIME::Type::ENCODING_RE}))?
-  }x
-
-  mediatype  = $1
-  subtype    = $2
-  extensions = $3
-  encoding   = $4
-
-  # Lookup existing mime type
-  mime_type = MIME::Types["#{mediatype}/#{subtype}"].first ||
-    # Or create a new instance
-    MIME::Type.new("#{mediatype}/#{subtype}")
-
-  if extensions
-    extensions.split(/,/).each do |extension|
-      mime_type.extensions << extension
-    end
-  end
-
-  if encoding
-    mime_type.encoding = encoding
-  end
-
-  mime_type.override = true
-
-  # Kind of hacky, but we need to reindex the mime type after making changes
-  MIME::Types.add_type_variant(mime_type)
-  MIME::Types.index_extensions(mime_type)
-end
-
-module Linguist
-  module Mime
-    # Internal: Look up mime type for extension.
-    #
-    # ext - The extension String. May include leading "."
-    #
-    # Examples
-    #
-    #   Mime.mime_for('.html')
-    #   # => 'text/html'
-    #
-    #   Mime.mime_for('txt')
-    #   # => 'text/plain'
-    #
-    # Return mime type String otherwise falls back to 'text/plain'.
-    def self.mime_for(ext)
-      mime_type = lookup_mime_type_for(ext)
-      mime_type ? mime_type.to_s : 'text/plain'
-    end
-
-    # Internal: Lookup mime type for extension or mime type
-    #
-    # ext_or_mime_type - A file extension ".txt" or mime type "text/plain".
-    #
-    # Returns a MIME::Type
-    def self.lookup_mime_type_for(ext_or_mime_type)
-      ext_or_mime_type ||= ''
-
-      if ext_or_mime_type =~ /\w+\/\w+/
-        guesses = ::MIME::Types[ext_or_mime_type]
-      else
-        guesses = ::MIME::Types.type_for(ext_or_mime_type)
-      end
-
-      # Use custom override first
-      guesses.detect { |type| type.override } ||
-
-        # Prefer text mime types over binary
-        guesses.detect { |type| type.ascii? } ||
-
-        # Otherwise use the first guess
-        guesses.first
-    end
-  end
-end
--- a/lib/linguist/mimes.yml
+++ b/lib/linguist/mimes.yml
-# Additional types to add to MIME::Types
-#
-# MIME types are used to set the Content-Type of raw binary blobs. All text
-# blobs are served as text/plain regardless of their type to ensure they
-# open in the browser rather than downloading.
-#
-# The encoding helps determine whether a file should be treated as plain
-# text or binary. By default, a mime type's encoding is base64 (binary).
-# These types will show a "View Raw" link. To force a type to render as
-# plain text, set it to 8bit for UTF-8. text/* types will be treated as
-# text by default.
-#
-#   <type> @<extensions> :<encoding>
-#
-# type       - mediatype/subtype
-# extensions - comma seperated extension list
-# encoding   - base64 (binary), 7bit (ASCII), 8bit (UTF-8), or
-#              quoted-printable (Printable ASCII).
-#
-# Follows same format as mime-types data file
-#   https://github.com/halostatue/mime-types/blob/master/lib/mime/types.rb.data
-#
-# Any additions or modifications (even trivial) should have corresponding
-# test change in `test/test_mime.rb`.
-
-# TODO: Lookup actual types
-application/octet-stream @a,blend,gem,graffle,ipa,lib,mcz,nib,o,ogv,otf,pfx,pigx,plgx,psd,sib,spl,sqlite3,swc,ucode,xpi
-
-# Please keep this list alphabetized
-application/java-archive @ear,war
-application/netcdf :8bit
-application/ogg @ogg
-application/postscript :base64
-application/vnd.adobe.air-application-installer-package+zip @air
-application/vnd.mozilla.xul+xml :8bit
-application/vnd.oasis.opendocument.presentation @odp
-application/vnd.oasis.opendocument.spreadsheet @ods
-application/vnd.oasis.opendocument.text @odt
-application/vnd.openofficeorg.extension @oxt
-application/vnd.openxmlformats-officedocument.presentationml.presentation @pptx
-application/x-chrome-extension @crx
-application/x-iwork-keynote-sffkey @key
-application/x-iwork-numbers-sffnumbers @numbers
-application/x-iwork-pages-sffpages @pages
-application/x-ms-xbap @xbap :8bit
-application/x-parrot-bytecode @pbc
-application/x-shockwave-flash @swf
-application/x-silverlight-app @xap
-application/x-supercollider @sc :8bit
-application/x-troff-ms :8bit
-application/x-wais-source :8bit
-application/xaml+xml @xaml :8bit
-image/x-icns @icns
-text/cache-manifest @manifest
-text/plain @cu,cxx
-text/x-logtalk @lgt
-text/x-nemerle @n
-text/x-nimrod @nim
-text/x-ocaml @ml,mli,mll,mly,sig,sml
-text/x-rust @rs,rc
-text/x-scheme @rkt,scm,sls,sps,ss
--- a/lib/linguist/pathname.rb
+++ b/lib/linguist/pathname.rb
-require 'linguist/language'
-require 'linguist/mime'
-require 'pygments'
-
-module Linguist
-  # Similar to ::Pathname, Linguist::Pathname wraps a path string and
-  # provides helpful query methods. Its useful when you only have a
-  # filename but not a blob and need to figure out the language of the file.
-  class Pathname
-    # Public: Initialize a Pathname
-    #
-    # path - A filename String. The file may or maybe actually exist.
-    #
-    # Returns a Pathname.
-    def initialize(path)
-      @path = path
-    end
-
-    # Public: Get the basename of the path
-    #
-    # Examples
-    #
-    #   Pathname.new('sub/dir/file.rb').basename
-    #   # => 'file.rb'
-    #
-    # Returns a String.
-    def basename
-      File.basename(@path)
-    end
-
-    # Public: Get the extname of the path
-    #
-    # Examples
-    #
-    #   Pathname.new('.rb').extname
-    #   # => '.rb'
-    #
-    #   Pathname.new('file.rb').extname
-    #   # => '.rb'
-    #
-    # Returns a String.
-    def extname
-      File.extname(@path)
-    end
-
-    # Public: Get the language of the path
-    #
-    # The path extension name is the only heuristic used to detect the
-    # language name.
-    #
-    # Examples
-    #
-    #   Pathname.new('file.rb').language
-    #   # => Language['Ruby']
-    #
-    # Returns a Language or nil if none was found.
-    def language
-      @language ||= Language.find_by_filename(@path)
-    end
-
-    # Internal: Get the lexer of the path
-    #
-    # Returns a Lexer.
-    def lexer
-      language ? language.lexer : Pygments::Lexer.find_by_name('Text only')
-    end
-
-    # Public: Get the mime type
-    #
-    # Examples
-    #
-    #   Pathname.new('index.html').mime_type
-    #   # => 'text/html'
-    #
-    # Returns a mime type String.
-    def mime_type
-      @mime_type ||= Mime.mime_for(extname)
-    end
-
-    # Public: Return self as String
-    #
-    # Returns a String
-    def to_s
-      @path.dup
-    end
-
-    def eql?(other)
-      other.is_a?(self.class) && @path == other.to_s
-    end
-    alias_method :==, :eql?
-  end
-end
--- a/lib/linguist/popular.yml
+++ b/lib/linguist/popular.yml
@@ -8,6 +8,8 @@
 - C#
 - C++
 - CSS
+- Clojure
+- CoffeeScript
 - Common Lisp
 - Diff
 - Emacs Lisp
@@ -25,5 +27,3 @@
 - SQL
 - Scala
 - Scheme
-- TeX
-- XML
--- a/lib/linguist/repository.rb
+++ b/lib/linguist/repository.rb
@@ -67,20 +67,20 @@ module Linguist
      return if @computed_stats
  
      @enum.each do |blob|
-        # Skip binary file extensions
-        next if blob.binary_mime_type?
+        # Skip files that are likely binary
+        next if blob.likely_binary?
  
        # Skip vendored or generated blobs
        next if blob.vendored? || blob.generated? || blob.language.nil?
  
-        # Only include programming languages
-        if blob.language.type == :programming
+        # Only include programming languages and acceptable markup languages
+        if blob.language.type == :programming || Language.detectable_markup.include?(blob.language.name)
          @sizes[blob.language.group] += blob.size
        end
      end
  
      # Compute total size
-      @size = @sizes.inject(0) { |s,(k,v)| s + v }
+      @size = @sizes.inject(0) { |s,(_,v)| s + v }
  
      # Get primary language
      if primary = @sizes.max_by { |(_, size)| size }

--- a/lib/linguist/samples.json
+++ b/lib/linguist/samples.json
--- a/lib/linguist/samples.rb
+++ b/lib/linguist/samples.rb
+require 'yaml'
+
+require 'linguist/md5'
+require 'linguist/classifier'
+
+module Linguist
+  # Model for accessing classifier training data.
+  module Samples
+    # Path to samples root directory
+    ROOT = File.expand_path("../../../samples", __FILE__)
+
+    # Path for serialized samples db
+    PATH = File.expand_path('../samples.json', __FILE__)
+
+    # Hash of serialized samples object
+    if File.exist?(PATH)
+      DATA = YAML.load_file(PATH)
+    end
+
+    # Public: Iterate over each sample.
+    #
+    # &block - Yields Sample to block
+    #
+    # Returns nothing.
+    def self.each(&block)
+      Dir.entries(ROOT).each do |category|
+        next if category == '.' || category == '..'
+
+        # Skip text and binary for now
+        # Possibly reconsider this later
+        next if category == 'Text' || category == 'Binary'
+
+        dirname = File.join(ROOT, category)
+        Dir.entries(dirname).each do |filename|
+          next if filename == '.' || filename == '..'
+
+          if filename == 'filenames'
+            Dir.entries(File.join(dirname, filename)).each do |subfilename|
+              next if subfilename == '.' || subfilename == '..'
+
+              yield({
+                :path    => File.join(dirname, filename, subfilename),
+                :language => category,
+                :filename => subfilename
+              })
+            end
+          else
+            if File.extname(filename) == ""
+              raise "#{File.join(dirname, filename)} is missing an extension, maybe it belongs in filenames/ subdir"
+            end
+
+            yield({
+              :path     => File.join(dirname, filename),
+              :language => category,
+              :extname  => File.extname(filename)
+            })
+          end
+        end
+      end
+
+      nil
+    end
+
+    # Public: Build Classifier from all samples.
+    #
+    # Returns trained Classifier.
+    def self.data
+      db = {}
+      db['extnames'] = {}
+      db['filenames'] = {}
+
+      each do |sample|
+        language_name = sample[:language]
+
+        if sample[:extname]
+          db['extnames'][language_name] ||= []
+          if !db['extnames'][language_name].include?(sample[:extname])
+            db['extnames'][language_name] << sample[:extname]
+            db['extnames'][language_name].sort!
+          end
+        end
+
+        if sample[:filename]
+          db['filenames'][language_name] ||= []
+          db['filenames'][language_name] << sample[:filename]
+          db['filenames'][language_name].sort!
+        end
+
+        data = File.read(sample[:path])
+        Classifier.train!(db, language_name, data)
+      end
+
+      db['md5'] = Linguist::MD5.hexdigest(db)
+
+      db
+    end
+  end
+end