Re: [RFR] templates://hadoop/{hadoop-namenoded.templates}
Christian PERRIER wrote:
> Your review should be sent as an answer to this mail.
Sorry, I'm running late; here are comments, but no actual patch attached.
> Template: hadoop-namenoded/format
[...]
> +_Description: Should the namenode's file system be formatted?
> The namenode manages the Hadoop Distributed FileSystem (HDFS). Like a
> + normal file system, it needs to be formatted prior to first use. If the
> + HDFS file system is not formatted, the namenode daemon will fail to
> start.
s/FileSystem/File System/. We could save some verbiage here:
The namenode manages the Hadoop Distributed File System (HDFS). Like a
normal file system, it needs to be formatted before use; otherwise
the namenode daemon will not start.
> .
> + This operation does not affect other file systems on this
> + computer. You can safely choose to format the file system if you're
> + using HDFS for the first time and don't have data from previous
> + installations on this computer.
> .
> + If you choose not to format the file system right now, you can do it
> + later by executing "hadoop namenode -format" with the hadoop user
> + privileges.
I want to change that last phrase, but I'm not sure what to change it to.
Maybe:
later by executing "hadoop namenode -format" as the user "hadoop".
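Incidentally, it might be worth spelling out the exact invocation in the
template, since "with the hadoop user privileges" is what caused the
confusion. Assuming the package creates a "hadoop" user (as
hadoop-daemons-common apparently does), either of these should work --
just a sketch, I haven't tested it against the package:

```shell
# Format the HDFS namenode as the "hadoop" user.
# NB: destructive on an existing HDFS; first-time setup only.
sudo -u hadoop hadoop namenode -format

# or, without sudo:
su hadoop -c "hadoop namenode -format"
```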
> --- hadoop.old/debian/control 2010-03-22 09:56:11.717948376 +0100
> +++ hadoop/debian/control 2010-03-26 18:30:25.615052315 +0100
> @@ -44,14 +44,54 @@
> libslf4j-java,
> libxmlenc-java
> Suggests: libhsqldb-java
> -Description: software platform for processing vast amounts of data
> - This package contains the core java libraries.
> +Description: platform for processing vast amounts of data - Java libraries
This doesn't strike me as conveying what Hadoop is; after all, you can
process "vast amounts of data" on any machine as long as you're allowed to
take vast amounts of time. Hadoop's "suite description" should have the
words "cluster" or "distributed" or "parallel" in it somewhere.
Unfortunately there isn't much room, but how about:
Description: data-intensive clustering framework - Java libraries
> + Hadoop is a software platform that lets one easily write and
> + run applications that process vast amounts of data.
The pronoun "one" is just that bit too formal.
Hadoop is a software platform for writing and running applications
that process vast amounts of data.
And it might make sense to insert: on a distributed file system.
> + .
> + Here's what makes Hadoop especially useful:
> + * Scalable: Hadoop can reliably store and process petabytes.
> + * Economical: It distributes the data and processing across clusters
> + of commonly available computers. These clusters can number
> + into the thousands of nodes.
> + * Efficient: By distributing the data, Hadoop can process it in parallel
> + on the nodes where the data is located. This makes it
> + extremely rapid.
> + * Reliable: Hadoop automatically maintains multiple copies of data and
> + automatically redeploys computing tasks based on failures.
I'm not sure I like this layout, but it's all good material.
> + .
> + Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
> + MapReduce divides applications into many small blocks of work. HDFS creates
> + multiple replicas of data blocks for reliability, placing them on compute
> + nodes around the cluster. MapReduce can then process the data where it is
> + located.
> + .
> + This package contains the core Java libraries.
I'm not sure they should all carry all three boilerplate paragraphs; maybe
since hadoop-bin is a common dependency it makes sense for it to carry the
"long version".
> Package: libhadoop-index-java
> Architecture: all
> Depends: ${misc:Depends}, libhadoop-java (= ${binary:Version}),
> liblucene2-java
> -Description: Hadoop contrib to create lucene indexes
> +Description: platform for processing vast amounts of data - create Lucene indexes
>
> The original synopsis was quite odd (verb sentence). Keep the "create
> <foo>" style, but I'd actually maybe prefer "Lucene index creation".
I think it was claiming to be (a) contrib, meaning a third-party Java
library, but we want to keep that misleading keyword out of the way;
"create Lucene indexes" *is* a verb-based, um, phrasal constituent of some
sort. I'd suggest:
Description: data-intensive clustering framework - Lucene index support
> + Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS).
> + MapReduce divides applications into many small blocks of work. HDFS creates
> + multiple replicas of data blocks for reliability, placing them on compute
> + nodes around the cluster. MapReduce can then process the data where it is
> + located.
> + .
> This contrib package provides a utility to build or update an index
> using Map/Reduce.
This replaces what was originally a package-specific discussion of
MapReduce (and Lucene and "shards") with something generic.
> Package: hadoop-bin
[...]
> +Description: platform for processing vast amounts of data - binaries
Don't the daemon packages also contain binaries?
Description: data-intensive clustering framework - tools
[...]
> This package contains the hadoop shell interface. See the packages hadoop-.*d
> for the hadoop daemons.
You can't say "the packages hadoop-.*d" (or even "the packages Hadoop");
other way round in English. And capitalise where it means the generic
software framework rather than a command/file/packagename:
This package contains the Hadoop shell interface. See the hadoop-.*d
packages for the Hadoop daemons.
> Package: hadoop-daemons-common
[...]
> +Description: platform for processing vast amounts of data - common files
> + Hadoop is a software platform that lets one easily write and
> + run applications that process vast amounts of data.
> + .
> + Here's what makes Hadoop especially useful:
> + * Scalable: Hadoop can reliably store and process petabytes.
> + * Economical: It distributes the data and processing across clusters
> + of commonly available computers. These clusters can number
> + into the thousands of nodes.
> + * Efficient: By distributing the data, Hadoop can process it in parallel
> + on the nodes where the data is located. This makes it
> + extremely rapid.
> + * Reliable: Hadoop automatically maintains multiple copies of data and
> + automatically redeploys computing tasks based on failures.
> + .
> + This package prepares some common things for all hadoop daemon packages:
> * creates the user hadoop
> * creates data and log directories owned by the hadoop user
> * manages the update-alternatives mechanism for hadoop configuration
More s/hadoop/Hadoop/, though not for the user. I gather it's another
case of "I can't think of a name, maybe this three-year-old can help".
> Package: hadoop-jobtrackerd
[...]
> + The Job Tracker is a central service which is responsible for managing
> + the Task Tracker services running on all nodes in an Hadoop Cluster.
> + The Job Tracker allocates work to the tasktracker nearest to the data
> with an available work slot.
the Task Tracker
> Package: hadoop-namenoded
[...]
> The Hadoop Distributed Filesystem (HDFS) requires one unique server, the
> - namenode, which manages the block locations of files on the filesystem.
> + name node, which manages the block locations of files on the file system.
Should this be Name Node?
> Package: hadoop-secondarynamenoded
[...]
> + The secondary name node is responsible for checkpointing file system images.
> + It is _not_ a failover pair for the name node, and may safely be run on the
> same machine.
Likewise, Name Node(?); is "pair" here a false-friend for "partner"?
> Package: hadoop-datanoded
[...]
> + The data nodes in the Hadoop Cluster are responsible for serving up
> blocks of data over the network to Hadoop Distributed Filesystem
> (HDFS) clients.
Likewise for Data Nodes. Oh, you missed a File^System.
--
JBR with qualifications in linguistics, experience as a Debian
sysadmin, and probably no clue about this particular package