Apache Crunch is an open source Java library that provides a framework for writing, testing, and running MapReduce pipelines.
Based on Google's FlumeJava library, it aims to make pipelines composed of many user-defined functions simple to write, easy to test, and efficient to run.
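To give a rough feel for what "a pipeline composed of user-defined functions" means, here is a minimal plain-Java sketch of a word count built from small functions and run in memory. It uses only the Java standard library and is not the Crunch API: the class `PipelineSketch` and the helpers `parallelDo`, `count`, `splitWords`, and `wordCount` are hypothetical names chosen for illustration, loosely modeled on the FlumeJava style of chaining stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiConsumer;
import java.util.function.Consumer;

// Hypothetical illustration only; none of these names belong to Apache Crunch.
class PipelineSketch {

    // Applies a user-defined function to every input element.
    // The function receives an emitter and may emit zero or more outputs per input.
    static <S, T> List<T> parallelDo(List<S> input, BiConsumer<S, Consumer<T>> fn) {
        List<T> out = new ArrayList<>();
        for (S item : input) {
            fn.accept(item, out::add);
        }
        return out;
    }

    // Counts how often each distinct element occurs (sorted by key for stable output).
    static <T extends Comparable<T>> Map<T, Long> count(List<T> input) {
        Map<T, Long> counts = new TreeMap<>();
        for (T item : input) {
            counts.merge(item, 1L, Long::sum);
        }
        return counts;
    }

    // A user-defined function: splits a line into lowercase words.
    static void splitWords(String line, Consumer<String> emitter) {
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                emitter.accept(word);
            }
        }
    }

    // The "pipeline": two stages chained together.
    static Map<String, Long> wordCount(List<String> lines) {
        List<String> words = parallelDo(lines, PipelineSketch::splitWords);
        return count(words);
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(List.of("to be or", "not to be")));
    }
}
```

Because each stage is an ordinary function over in-memory collections, it can be unit tested without a cluster. Crunch is described as offering the same kind of convenience for real MapReduce jobs.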
What's New in This Release:
Bug:
· Add MRPipelineExecution to expose some MR-specific APIs
· MemPipeline.write() is inconsistent with MemPipeline.read()
· Remove superfluous directory from source distribution
· Jobs with large numbers of directory inputs will fail with odd inputsplit exceptions
· Update cogroup functions to use mapValues functions
· Crunch PipelineResult objects do not capture failures
· Crunch not working with S3
· Crunch should ignore hidden files
· FileTargetImpl cuts off extensions of output files
· Better error handling for incompatible PTypes and Targets
· DoFn initialize method gets called twice whereas cleanup gets called only once when a join is performed on two PTables
· Need to Handle InterruptedException in CrunchJobHooks
· PCollectionGetSizeIT test failure with hadoop-2 profile
· Counters don't work on the post-process function in OneToManyJoin
· Improper job dependencies for certain types of long pipelines
· Add numReducers parameter to the SecondarySort APIs
· Make DefaultJoinStrategy.join(PTable, PTable, JoinFn)...