Pig: All content tagged as Pig in NoSQL databases and polyglot persistence
Edd Dumbill enumerates the various components of the Hadoop ecosystem:
Original title and link: The components and their functions in the Hadoop ecosystem (©myNoSQL)
From the “Temporal Analytics on Big Data for Web Advertising” paper:
TiMR is a framework that transparently combines a map-reduce (M-R) system with a temporal DSMS. Users express time-oriented analytics using a temporal (DSMS) query language such as StreamSQL or LINQ. Streaming queries are declarative and easy to write/debug, real-time-ready, and often several orders of magnitude smaller than equivalent custom code for time-oriented applications. TiMR allows the temporal queries to transparently scale on offline temporal data in a cluster by leveraging existing M-R infrastructure.
Broadly speaking, TiMR’s architecture of compiling higher level queries into M-R stages is similar to that of Pig/SCOPE. However, TiMR specializes in time-oriented queries and data, with several new features such as: (1) the use of an unmodified DSMS as part of compilation, parallelization, and execution; and (2) the exploitation of new temporal parallelization opportunities unique to our setting. In addition, we leverage the temporal algebra underlying the DSMS in order to guarantee repeatability across runs in TiMR within M-R (when handling failures), as well as over live data.
According to the paper, DSMSs work well for real-time data but are not massively scalable. Map-Reduce, on the other hand, is extremely scalable, but it computes over offline data. TiMR proposes a solution that moves a step closer to real-time map-reduce.
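To make the idea concrete, here is a minimal sketch of a time-oriented query (a tumbling-window click count per ad) evaluated two ways: once in timestamp order, as a stream engine would, and once as map and reduce phases over arbitrarily split offline data. This is not TiMR's API or its compilation scheme; per the paper, TiMR runs an unmodified DSMS inside the M-R stages. The event layout, the 60-second window, and every function name below are assumptions made up for illustration.

```python
from collections import defaultdict
from itertools import groupby

WINDOW = 60  # seconds; hypothetical tumbling-window size


def window_start(ts, size=WINDOW):
    """Return the start of the tumbling window that contains timestamp ts."""
    return ts - ts % size


def streaming_counts(events):
    """Reference evaluation: consume events one at a time in timestamp order."""
    counts = defaultdict(int)
    for ts, ad_id in sorted(events):
        counts[(ad_id, window_start(ts))] += 1
    return dict(counts)


def map_phase(partition):
    """Map: emit ((ad_id, window_start), 1) for every event in one partition."""
    for ts, ad_id in partition:
        yield (ad_id, window_start(ts)), 1


def reduce_phase(pairs):
    """Shuffle (sort by key), then reduce: sum the counts for each key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=lambda kv: kv[0])
    }


def mapreduce_counts(partitions):
    """Run the map phase over every partition, then a single reduce phase."""
    mapped = [pair for part in partitions for pair in map_phase(part)]
    return reduce_phase(mapped)


# Hypothetical click log: (timestamp in seconds, ad id).
events = [(5, "ad1"), (30, "ad2"), (61, "ad1"), (65, "ad1"), (119, "ad2"), (120, "ad1")]

# Offline data split into partitions in no particular order.
partitions = [events[3:], events[:3]]

result = mapreduce_counts(partitions)
```

Because the window is derived from each event's own timestamp rather than from arrival time, the batch result matches the in-order evaluation no matter how the data is partitioned. That is a toy version of the repeatability property the paper attributes to the temporal algebra underlying the DSMS.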
After sifting through the PR-ish announcements about Informatica HParser, here is what I’ve figured out so far:
- it is the T (transform) in ETL
- it is a visual tool for creating parsing definitions for formats like web logs, XML, JSON, FIX, SWIFT, HL7, CDR, Word, PDF, XLS, etc.
- transformations can be accessed from Hadoop MapReduce, Hive, or Pig
- the main benefit of HParser is that the same parsing definitions/transformations can be shared across the Hadoop distributed environment
- HParser tries to provide an optimal transformation solution when streaming, splitting, and processing large files
- HParser is available under two licenses: community and commercial
Original title and link: What Is Informatica HParser for Hadoop? (©myNoSQL)