Stripper


Stripper is an Elixir package for normalizing input from unpredictable sources (such as web scraping). It is useful as a pre-processing step in ETL pipelines for machine learning or data analysis. Because it is parser-based rather than regular-expression-based, it does all its work in a single pass over the input, which should keep it fast.

Why the name? Because it describes the purpose and it's memorable -- get over it ;)

Examples

Normalizing whitespace:

iex> Stripper.Whitespace.normalize!("   random\tstuff\fI   scraped\t\t\tfrom\nthe web\n\n")
"random stuff I scraped from the web"

This reduces all Unicode whitespace and separator characters to the humble space. Runs of multiple spaces are collapsed into one and, as the example shows, leading and trailing whitespace is removed.
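
To give a feel for what "one pass" means here, the following is a minimal sketch, not Stripper's actual implementation. The module name WhitespaceSketch, the function collapse/1, and the short list of code points are all assumptions made for illustration; the real package covers far more of Unicode. It walks the string once, remembers whether it is inside a run of whitespace, and only emits a single space when more text follows:

```elixir
defmodule WhitespaceSketch do
  # Hypothetical helper for illustration only; not part of Stripper.

  # A small, incomplete sample of whitespace/separator code points.
  @whitespace [?\s, ?\t, ?\n, ?\r, ?\f, ?\v, 0x00A0, 0x2028, 0x2029]

  # Collapses whitespace runs to one space and trims both ends in a single walk.
  # Raises FunctionClauseError if the input is not valid UTF-8.
  def collapse(input) when is_binary(input), do: walk(input, :start, [])

  # End of input: a pending gap is dropped, which trims trailing whitespace.
  defp walk(<<>>, _state, acc), do: acc |> Enum.reverse() |> IO.iodata_to_binary()

  # Whitespace: emit nothing yet, just remember that a gap was seen.
  defp walk(<<char::utf8, rest::binary>>, state, acc) when char in @whitespace do
    next = if state == :start, do: :start, else: :gap
    walk(rest, next, acc)
  end

  # First non-whitespace character after a gap: emit one space, then the character.
  defp walk(<<char::utf8, rest::binary>>, :gap, acc),
    do: walk(rest, :text, [<<char::utf8>>, " " | acc])

  # Any other character is copied through unchanged.
  defp walk(<<char::utf8, rest::binary>>, _state, acc),
    do: walk(rest, :text, [<<char::utf8>> | acc])
end
```

Calling WhitespaceSketch.collapse/1 on the input from the example above gives the same result as shown there. A regular-expression approach would typically need a replace followed by a trim, whereas the state carried through the walk handles both at once.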

Simplifying quotes:

iex> Stripper.Quotes.normalize!(~S|‘make’ «it» „stop“|)
"'make' \"it\" \"stop\""

See the online documentation for more information.
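
Quote simplification can be pictured the same way. The sketch below is an assumption-laden illustration, not the package's code: QuotesSketch, simplify/1, and the two short lists of code points are hypothetical and cover only a handful of quote characters. It maps each "fancy" quote to its ASCII counterpart while copying everything else through, again in one pass:

```elixir
defmodule QuotesSketch do
  # Hypothetical helper for illustration only; not part of Stripper.

  # A few single-quote-like code points: ‘ ’ ‚ ‹ ›
  @single [0x2018, 0x2019, 0x201A, 0x2039, 0x203A]
  # A few double-quote-like code points: “ ” „ « »
  @double [0x201C, 0x201D, 0x201E, 0x00AB, 0x00BB]

  # Replaces the listed quote characters with ' or " and leaves other text alone.
  # Assumes valid UTF-8; the comprehension stops at the first invalid byte.
  def simplify(input) when is_binary(input) do
    for <<char::utf8 <- input>>, into: "" do
      cond do
        char in @single -> "'"
        char in @double -> "\""
        true -> <<char::utf8>>
      end
    end
  end
end
```

For the input used in the quotes example, QuotesSketch.simplify/1 returns the same string as Stripper.Quotes.normalize!/1 does there. For real data, prefer the package's own functions, since they handle many more characters than this sample.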

Installation

The package can be installed from Hex by adding stripper to your list of dependencies in mix.exs:

def deps do
  [
    {:stripper, "~> 1.4.0"}
  ]
end

Contributing

See the Contributing Guidelines for more information.

Image Attribution

The logo image is "wire strippers" by Designs by MB from the Noun Project.