apache spark 2

Continuing from yesterday.

I saw an article claiming the Web UI is usable as soon as Spark starts, so I checked it out:
http://www.slideshare.net/quintia/apache-spark-47630950

Yesterday's startup spewed WARN messages, so it probably isn't working right.
On top of that, I haven't done any of the usual external-access setup
(firewall, SELinux), so the UI shouldn't be reachable from outside anyway.
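For later reference, opening the UI port through CentOS 7's firewalld would look something like this (a sketch only; I haven't run it, and SELinux may need attention separately):

# sketch: allow the Spark UI port (4040 by default) through firewalld
firewall-cmd --permanent --add-port=4040/tcp
firewall-cmd --reload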

Still, the http://... URL ought to at least be visible in the console log, so I decided to tackle the WARNs first.

[root@spark spark-1.6.1]# ll conf
total 40
-rw-r--r-- 1 500 500  987 Feb 27 14:01 docker.properties.template
-rw-r--r-- 1 500 500 1105 Feb 27 14:01 fairscheduler.xml.template
-rw-r--r-- 1 500 500 1734 Feb 27 14:01 log4j.properties.template
-rw-r--r-- 1 500 500 6671 Feb 27 14:01 metrics.properties.template
-rw-r--r-- 1 500 500  865 Feb 27 14:01 slaves.template
-rw-r--r-- 1 500 500 1292 Feb 27 14:01 spark-defaults.conf.template
-rwxr-xr-x 1 500 500 4209 Feb 27 14:01 spark-env.sh.template

There are .template properties files under conf/, so for now I copied only the log4j one:

cd conf
cp -p log4j.properties.template log4j.properties
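For reference, verbosity is controlled by the first setting in the copied file; the stock 1.6 template leaves it at INFO, which is why the INFO lines below appear. Dropping it to WARN is the usual way to quiet the console, if desired:

# conf/log4j.properties (excerpt from the stock template)
log4j.rootCategory=INFO, console
# change INFO to WARN here to silence the INFO flood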

Launched spark-shell again. The messages have changed quite a bit.

[root@spark spark-1.6.1]# ./bin/spark-shell
16/03/14 13:38:48 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/03/14 13:38:49 INFO SecurityManager: Changing view acls to: root
16/03/14 13:38:49 INFO SecurityManager: Changing modify acls to: root
16/03/14 13:38:49 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/03/14 13:38:49 INFO HttpServer: Starting HTTP Server
16/03/14 13:38:49 INFO Utils: Successfully started service 'HTTP class server' on port 55595.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
16/03/14 13:38:57 INFO SparkContext: Running Spark version 1.6.1
16/03/14 13:38:57 INFO SecurityManager: Changing view acls to: root
16/03/14 13:38:57 INFO SecurityManager: Changing modify acls to: root
16/03/14 13:38:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
16/03/14 13:38:57 INFO Utils: Successfully started service 'sparkDriver' on port 47250.
16/03/14 13:38:58 INFO Slf4jLogger: Slf4jLogger started
16/03/14 13:38:58 INFO Remoting: Starting remoting
16/03/14 13:38:58 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 49986.
16/03/14 13:38:59 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriverActorSystem@153.120.171.150:49986]
16/03/14 13:38:59 INFO SparkEnv: Registering MapOutputTracker
16/03/14 13:38:59 INFO SparkEnv: Registering BlockManagerMaster
16/03/14 13:38:59 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-6e77a1be-01ab-48fd-9965-8736095697f3
16/03/14 13:38:59 INFO MemoryStore: MemoryStore started with capacity 517.4 MB
16/03/14 13:38:59 INFO SparkEnv: Registering OutputCommitCoordinator
16/03/14 13:38:59 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/03/14 13:38:59 INFO SparkUI: Started SparkUI at http://153.120.171.150:4040
16/03/14 13:39:00 INFO Executor: Starting executor ID driver on host localhost
16/03/14 13:39:00 INFO Executor: Using REPL class URI: http://153.120.171.150:55595
16/03/14 13:39:00 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 55319.
16/03/14 13:39:00 INFO NettyBlockTransferService: Server created on 55319
16/03/14 13:39:00 INFO BlockManagerMaster: Trying to register BlockManager
16/03/14 13:39:00 INFO BlockManagerMasterEndpoint: Registering block manager localhost:55319 with 517.4 MB RAM, BlockManagerId(driver, localhost, 55319)
16/03/14 13:39:00 INFO BlockManagerMaster: Registered BlockManager
16/03/14 13:39:00 INFO SparkILoop: Created spark context..
Spark context available as sc.
16/03/14 13:39:01 INFO HiveContext: Initializing execution hive, version 1.2.1
16/03/14 13:39:02 INFO ClientWrapper: Inspected Hadoop version: 2.0.0-cdh4.2.0
16/03/14 13:39:02 INFO ClientWrapper: Loading Hadoop shims org.apache.hadoop.hive.shims.Hadoop20SShims
16/03/14 13:39:02 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop20SShims for Hadoop version 2.0.0-cdh4.2.0
16/03/14 13:39:03 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/03/14 13:39:03 INFO ObjectStore: ObjectStore, initialize called
16/03/14 13:39:03 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/14 13:39:03 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/14 13:39:03 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 13:39:04 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 13:39:08 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/03/14 13:39:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:10 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:14 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:14 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:15 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/03/14 13:39:15 INFO ObjectStore: Initialized ObjectStore
16/03/14 13:39:15 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/03/14 13:39:15 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/03/14 13:39:16 INFO HiveMetaStore: Added admin role in metastore
16/03/14 13:39:16 INFO HiveMetaStore: Added public role in metastore
16/03/14 13:39:16 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/03/14 13:39:16 INFO HiveMetaStore: 0: get_all_databases
16/03/14 13:39:16 INFO audit: ugi=root  ip=unknown-ip-addr      cmd=get_all_databases
16/03/14 13:39:16 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/03/14 13:39:16 INFO audit: ugi=root  ip=unknown-ip-addr      cmd=get_functions: db=default pat=*
16/03/14 13:39:16 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:17 INFO SessionState: Created local directory: /tmp/d2929571-b6ac-4952-a5ca-1fdfb9e16963_resources
16/03/14 13:39:17 INFO SessionState: Created HDFS directory: /tmp/hive/root/d2929571-b6ac-4952-a5ca-1fdfb9e16963
16/03/14 13:39:17 INFO SessionState: Created local directory: /tmp/root/d2929571-b6ac-4952-a5ca-1fdfb9e16963
16/03/14 13:39:17 INFO SessionState: Created HDFS directory: /tmp/hive/root/d2929571-b6ac-4952-a5ca-1fdfb9e16963/_tmp_space.db
16/03/14 13:39:17 INFO HiveContext: default warehouse location is /user/hive/warehouse
16/03/14 13:39:17 INFO HiveContext: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
16/03/14 13:39:17 INFO ClientWrapper: Inspected Hadoop version: 2.0.0-cdh4.2.0
16/03/14 13:39:17 INFO ClientWrapper: Loading Hadoop shims org.apache.hadoop.hive.shims.Hadoop20SShims
16/03/14 13:39:17 INFO ClientWrapper: Loaded org.apache.hadoop.hive.shims.Hadoop20SShims for Hadoop version 2.0.0-cdh4.2.0
16/03/14 13:39:17 INFO HiveMetaStore: 0: Opening raw store with implemenation class:org.apache.hadoop.hive.metastore.ObjectStore
16/03/14 13:39:17 INFO ObjectStore: ObjectStore, initialize called
16/03/14 13:39:18 INFO Persistence: Property hive.metastore.integral.jdo.pushdown unknown - will be ignored
16/03/14 13:39:18 INFO Persistence: Property datanucleus.cache.level2 unknown - will be ignored
16/03/14 13:39:18 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 13:39:18 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/14 13:39:20 INFO ObjectStore: Setting MetaStore object pin classes with hive.metastore.cache.pinobjtypes="Table,StorageDescriptor,SerDeInfo,Partition,Database,Type,FieldSchema,Order"
16/03/14 13:39:23 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:23 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:23 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MFieldSchema" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:23 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MOrder" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:24 INFO Query: Reading in results for query "org.datanucleus.store.rdbms.query.SQLQuery@0" since the connection used is closing
16/03/14 13:39:24 INFO MetaStoreDirectSql: Using direct SQL, underlying DB is DERBY
16/03/14 13:39:24 INFO ObjectStore: Initialized ObjectStore
16/03/14 13:39:24 INFO HiveMetaStore: Added admin role in metastore
16/03/14 13:39:24 INFO HiveMetaStore: Added public role in metastore
16/03/14 13:39:24 INFO HiveMetaStore: No user is added in admin role, since config is empty
16/03/14 13:39:24 INFO HiveMetaStore: 0: get_all_databases
16/03/14 13:39:25 INFO audit: ugi=root  ip=unknown-ip-addr      cmd=get_all_databases
16/03/14 13:39:25 INFO HiveMetaStore: 0: get_functions: db=default pat=*
16/03/14 13:39:25 INFO audit: ugi=root  ip=unknown-ip-addr      cmd=get_functions: db=default pat=*
16/03/14 13:39:25 INFO Datastore: The class "org.apache.hadoop.hive.metastore.model.MResourceUri" is tagged as "embedded-only" so does not have its own datastore table.
16/03/14 13:39:25 INFO SessionState: Created local directory: /tmp/f6fb3a97-5dc2-460f-bf46-a4977378af3c_resources
16/03/14 13:39:25 INFO SessionState: Created HDFS directory: /tmp/hive/root/f6fb3a97-5dc2-460f-bf46-a4977378af3c
16/03/14 13:39:25 INFO SessionState: Created local directory: /tmp/root/f6fb3a97-5dc2-460f-bf46-a4977378af3c
16/03/14 13:39:25 INFO SessionState: Created HDFS directory: /tmp/hive/root/f6fb3a97-5dc2-460f-bf46-a4977378af3c/_tmp_space.db
16/03/14 13:39:25 INFO SparkILoop: Created sql context (with Hive support)..
SQL context available as sqlContext.
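With logging configured, a quick sanity check in the REPL confirms the context is alive (a sketch; sc is predefined by spark-shell):

sc.version                       // String = 1.6.1
sc.master                        // local[*] when no --master is given
sc.parallelize(1 to 100).sum()   // Double = 5050.0, exercises the executor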

apache spark

I've been busy and haven't posted lately...
As usual, this is just a memo of my own work.

I initially tried this on Windows, but the final target is Linux anyway
and the fiddly setup didn't seem worth it, so I spun up a fresh cloud
server and worked there instead.

■OS: the cloud provider's default
CentOS 7.2
Apache Spark 1.6.1 (site: http://www.apache.org/dyn/closer.lua/spark/spark-1.6.1/spark-1.6.1-bin-cdh4.tgz)
Apache Hadoop 2.7.2 (site: http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz)

I wanted the "Hadoop 2.6 and later" flavor, but that looked like it would require a build, so I went with the cdh4 package, which didn't seem to need one.

■Install Java

yum install java
yum install java-1.8.0-openjdk-devel
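A quick check that the expected Java landed (exact output varies by build; the logs below show Java 1.8.0_71):

# verify the install
java -version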

■wget

yum install wget

■spark

cd ~
wget http://ftp.jaist.ac.jp/pub/apache/spark/spark-1.6.1/spark-1.6.1-bin-cdh4.tgz
cd /opt
tar zxvf ~/spark-1.6.1-bin-cdh4.tgz
mv spark-1.6.1-bin-cdh4/ spark-1.6.1
export SPARK_HOME=/opt/spark-1.6.1
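The export only lives in the current shell; persisting it is optional but saves retyping after a re-login (one common option, not in the original steps):

# optional: persist SPARK_HOME across logins
echo 'export SPARK_HOME=/opt/spark-1.6.1' >> ~/.bash_profile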

Apparently the version was bumped just recently.
As it stands, /opt/spark-1.6.1/bin/spark-shell doesn't even run,
so I install Hadoop next.

■Scala

cd ~
wget http://downloads.lightbend.com/scala/2.11.8/scala-2.11.8.tgz
cd /opt
tar zxvf ~/scala-2.11.8.tgz
export SCALA_HOME=/opt/scala-2.11.8/
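Note that the spark-shell banner reports Scala 2.10.5: the 1.6.1 binaries bundle their own Scala, so this standalone 2.11.8 install only matters when running scala directly, which also needs the bin directory on PATH (a sketch):

# sketch: make the standalone scala runnable from anywhere
export PATH=$PATH:$SCALA_HOME/bin
scala -version   # Scala code runner version 2.11.8 ...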

■hadoop

cd ~
wget http://ftp.jaist.ac.jp/pub/apache/hadoop/common/hadoop-2.7.2/hadoop-2.7.2.tar.gz
cd /opt
tar zxvf ~/hadoop-2.7.2.tar.gz
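The memo stops at extracting the archive; nothing below points Spark at this Hadoop (the cdh4-prebuilt Spark bundles its own 2.0.0-cdh4.2.0 client classes, as the ClientWrapper lines in the previous entry's log show). If wiring them together were the goal, the usual variables would be (an assumption, not something done here):

# sketch: point tools at the standalone Hadoop (not done in this memo)
export HADOOP_HOME=/opt/hadoop-2.7.2
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop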

■Run

cd /opt/spark-1.6.1
./bin/spark-shell
log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's repl log4j profile: org/apache/spark/log4j-defaults-repl.properties
To adjust logging level use sc.setLogLevel("INFO")
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.1
      /_/

Using Scala version 2.10.5 (OpenJDK 64-Bit Server VM, Java 1.8.0_71)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
16/03/13 20:58:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/13 20:58:53 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/13 20:59:05 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/03/13 20:59:05 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
16/03/13 20:59:07 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/13 20:59:08 WARN Connection: BoneCP specified but not present in CLASSPATH (or one of dependencies)
16/03/13 20:59:17 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 1.2.0
16/03/13 20:59:18 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
SQL context available as sqlContext.
scala> val textFile = sc.textFile("README.md")
textFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[1] at textFile at <console>:27

scala> textFile.count()
res0: Long = 95

scala> textFile.first()
res1: String = # Apache Spark
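For comparison, the official Quick Start continues with a filter over the same RDD; this should work as-is in the same session (the count depends on the README bundled with 1.6.1):

val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark.count()   // lines mentioning "Spark"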

Warnings are still pouring out, and the results differ slightly from the
official Quick Start, but it does seem to be working.